Unit IV
To use deep learning effectively, it's not enough to just know the algorithms and how they work. A good
machine learning practitioner must also know how to choose the right algorithm for a task and how to
learn from experiment results to improve the system. In daily work, they must decide things like whether
to collect more data, change the model size, adjust regularization, improve training methods, or fix bugs.
Since trying each option takes time, it's important to choose wisely instead of guessing.
Till now we focused on various machine learning models, training algorithms, and objective functions.
While it might seem that expertise in a wide range of techniques and mathematical skills is crucial,
practical success often comes from correctly applying a well-known algorithm rather than poorly
implementing a complex one. Proper application of an algorithm relies on mastering some basic
methods.
1. Determine Your Goals: Decide on the error metric to use and set a target value for this metric.
These goals should align with the problem you are trying to solve.
2. Establish a Pipeline: Create a working end-to-end pipeline as soon as possible, including estimating
the appropriate performance metrics.
3. Monitor and Diagnose: Instrument your system well to identify performance bottlenecks.
Determine which components are underperforming and whether it's due to issues like overfitting,
underfitting, or problems with the data or software.
4. Iterate and Improve: Make incremental changes, such as gathering new data, adjusting
hyperparameters, or switching algorithms, based on specific insights from your monitoring and
diagnostics.
Performance Metrics:
1. Choosing the Right Metric: Before applying machine learning, set clear goals and select an
appropriate error metric to measure performance. This helps determine how well the model is
performing.
2. Why Zero Error Is Impossible:
The Bayes error represents the minimum possible error rate for any predictive model, even with
infinite training data and perfect knowledge of the probability distribution.
This is due to incomplete information in input features or inherent randomness in the system.
Additionally, having a finite amount of training data imposes further limitations.
3. Data Limitations & Benchmarks: Training data can be limited for many reasons.
In real-world applications, collecting more data helps but comes with costs (time, effort, money).
In research, when testing algorithms on a fixed benchmark, you're usually not allowed to add
more data.
4. Types of Performance Metrics:
Basic metrics: accuracy and error rate measure general performance.
Advanced metrics account for different error costs. In some cases, certain types of errors are more
costly than others. For instance, consider an email spam detection system: blocking a legitimate message is generally more problematic than letting a questionable message pass through.
Therefore, instead of just measuring the error rate, it might be more useful to consider the total
cost, where the cost of blocking legitimate messages is higher than that of allowing spam
messages.
5. Precision & Recall for Rare Events: For rare event detection (e.g., medical tests), accuracy alone
can be misleading.
Precision: the fraction of the model's positive detections that are actually correct.
Recall: the fraction of actual positive events that the model detects.
F-score: balances precision and recall: F = (2 × precision × recall) / (precision + recall). (A small computation sketch follows this list.)
6. Confidence-Based Decisions: Some systems avoid making a prediction when unsure. Example: A
system transcribing street addresses should only output results when confident.
Coverage: Measures how many cases the system handles while maintaining accuracy.
Trade-off: 100% accuracy but 0% coverage is useless. Real goals focus on balancing both.
7. Application-Specific Metrics: Click-through rates, user satisfaction, and cost-benefit analysis help
refine machine learning success.
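As a concrete illustration of items 5 and 6, the following sketch (with hypothetical labels, not taken from the notes) computes precision, recall, the F-score, and coverage for a detector that is allowed to abstain when it is unsure.

def detection_metrics(y_true, y_pred):
    """y_pred entries: 1 = positive, 0 = negative, None = system abstained."""
    covered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    tp = sum(1 for t, p in covered if p == 1 and t == 1)
    fp = sum(1 for t, p in covered if p == 1 and t == 0)
    fn = sum(1 for t, p in covered if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    coverage = len(covered) / len(y_true)          # fraction of cases handled
    return precision, recall, f_score, coverage

y_true = [1, 0, 0, 1, 0, 1]                        # hypothetical rare-event labels
y_pred = [1, 0, None, 1, 1, None]                  # None = system declined to answer
print(detection_metrics(y_true, y_pred))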
Default Baseline Models: After choosing your performance metric, build a basic working
model quickly. Deep learning evolves fast, so initial choices may need updating.
Determining Whether to Gather More Data: Once an initial end-to-end machine learning system
is set up, the next step is to evaluate its performance and identify areas for improvement. Here's a
structured approach to deciding whether to gather more data or refine the algorithm:
1. Evaluate Training Performance: First, check if the performance on the training set is
acceptable. Poor performance on the training set indicates that the algorithm isn't effectively utilizing the
existing data. In this case, gathering more data isn't the solution. Consider increasing the model size by
adding more layers or hidden units, or improving the learning algorithm by tuning hyperparameters like
the learning rate.
2. Check Data Quality: If large models and well-tuned algorithms still perform poorly, the issue
might be with the quality of the training data. The data could be too noisy or lack the necessary features
to predict the desired outputs. In such cases, it may be necessary to collect cleaner data or a richer set of
features.
3. Evaluate Test Performance: If the training set performance is acceptable, evaluate the
performance on the test set. If test performance is also acceptable, no further action is needed. If test
performance is significantly worse than training performance, gathering more data is often an effective
solution.
4. Considerations for Gathering More Data: Assess the cost and feasibility of gathering more
data versus other methods of reducing test error. Determine the amount of additional data needed to
significantly improve test performance. In large-scale applications, gathering more data might be more
feasible and cost-effective.
5. Alternatives to Gathering More Data: If gathering more data is not feasible, consider reducing the
model size or improving regularization by adjusting hyperparameters or adding strategies like dropout.
6. Predicting Data Needs: Plot and analyze the relationship between training set size and
generalization error to predict how much additional data is needed to achieve desired performance levels.
Experiment with training set sizes on a logarithmic scale to observe noticeable impacts on generalization error (a plotting sketch is given after this list).
7. Improving the Learning Algorithm: If gathering more data is not an option, focus on improving
the learning algorithm itself, which often involves research and development beyond standard practices.
By following these steps, you can systematically decide whether to gather more data or refine your
algorithm to improve the performance of your machine learning system.
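As a sketch of step 6 above (assuming scikit-learn and matplotlib, with a synthetic dataset standing in for real data), the code below trains a simple model on log-spaced training-set sizes and plots test error against training set size.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

sizes = np.logspace(2, np.log10(len(X_train)), num=6, dtype=int)   # log-spaced sizes
errors = []
for m in sizes:
    model = LogisticRegression(max_iter=1000).fit(X_train[:m], y_train[:m])
    errors.append(1.0 - model.score(X_test, y_test))                # test error

plt.semilogx(sizes, errors, marker="o")
plt.xlabel("training set size (log scale)")
plt.ylabel("test error")
plt.show()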
Selecting Hyperparameters: Most deep learning algorithms have numerous hyperparameters that
influence various aspects of the algorithm's behavior. These hyperparameters can be categorized based
on their impact:
Computational Cost: Some hyperparameters affect the time and memory required to run the
algorithm.
Model Quality: Others influence the quality of the model produced during training and its ability to
make accurate predictions when used with new data.
Adjusting these hyperparameters is crucial for optimizing both the efficiency and effectiveness of deep
learning models.
There are two basic approaches to choosing these hyperparameters: choosing them manually and
choosing them automatically.
Choosing the hyperparameters manually requires understanding what the hyperparameters do and
how machine learning models achieve good generalization.
Automatic hyperparameter selection algorithms greatly reduce the need to understand these ideas, but
they are often much more computationally costly.
Manual Hyperparameter Tuning: To manually tune hyperparameters, it's essential to understand how
they affect training error, generalization error, and computational resources like memory and runtime.
The main objective is to minimize generalization error while staying within a specified runtime and
memory budget.
The primary goal of manually tuning hyperparameters is to align the model's capacity with the task's
complexity. This capacity is influenced by three key factors:
The model's ability to represent data.
The learning algorithm's effectiveness in minimizing the training cost function.
The extent to which the cost function and training process regulate the model.
A model with more layers and hidden units has a greater capacity to represent complex functions.
However, it may not always learn these functions effectively if the training process fails to identify those
that minimize the training cost, or if regularization constraints, like weight decay, prevent it from doing
so.
The generalization error typically follows a U-shaped curve when plotted as a function of one of the hyperparameters. At one extreme, the hyperparameter value corresponds to low capacity, and generalization error is high because training error is high. This is the underfitting regime.
At the other extreme, the hyperparameter value corresponds to high capacity, and the generalization error
is high because the gap between training and test error is high. Somewhere in the middle lies the optimal
model capacity, which achieves the lowest possible generalization error by combining a moderate generalization gap with a moderate amount of training error.
For certain hyperparameters, overfitting happens when their values are too large. For example, increasing
the number of hidden units in a layer boosts the model's capacity, leading to overfitting.
Conversely, for other hyperparameters, overfitting occurs when their values are too small. For instance, a
weight decay coefficient of zero allows the highest model capacity, which can also result in overfitting.
Not every hyperparameter will be able to explore the entire U-shaped curve.
The learning rate is perhaps the most important hyperparameter. If you have time to tune only one
hyperparameter, tune the learning rate. It controls the effective capacity of the model. The effective
capacity of the model is highest when the learning rate is correct for the optimization problem, not when
the learning rate is especially large or especially small.
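A minimal, self-contained sketch of tuning the learning rate over a log-spaced grid: a tiny logistic-regression model is trained by gradient descent on synthetic data, and the learning rate with the lowest validation error is kept. The data, model, and candidate values are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = (X @ w_true + 0.5 * rng.normal(size=2000) > 0).astype(float)
X_tr, y_tr, X_val, y_val = X[:1500], y[:1500], X[1500:], y[1500:]

def train(lr, steps=300):
    w = np.zeros(20)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X_tr @ w)))          # sigmoid predictions
        w -= lr * X_tr.T @ (p - y_tr) / len(y_tr)      # gradient descent step
    return w

errors = {}
for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:               # log-spaced candidates
    w = train(lr)
    p_val = 1.0 / (1.0 + np.exp(-(X_val @ w)))
    errors[lr] = float(np.mean((p_val > 0.5) != y_val))  # validation error rate
best_lr = min(errors, key=errors.get)
print("best learning rate:", best_lr, "validation error:", errors[best_lr])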
When tuning hyperparameters other than the learning rate, it's essential to monitor both training and test
errors to diagnose and address overfitting or underfitting:
High Training Error:
o If the training error is higher than desired, you need to increase the model's capacity.
o This can be done by adding more layers or hidden units, assuming regularization is not being
used and the optimization algorithm is functioning correctly.
o Be aware that increasing capacity raises computational costs.
High Test Error:
o If the test error is higher than desired, you have two potential courses of action.
o Test error consists of training error and the gap between training and test error. The goal is to
balance these components.
o Typically, neural networks perform best with low training error (high capacity) and minimal
gap between training and test errors.
o To reduce this gap, adjust regularization hyperparameters to decrease effective model
capacity, such as by implementing dropout or weight decay.
o Optimal performance often comes from a large model that is well-regularized.
Most hyperparameters can be set by reasoning about whether they increase or decrease model capacity.
Some examples are included in Table 11.1.
When tuning hyperparameters, focus on your main goal: good test set performance.
Regularization is one way to improve this.
If training error is low, you can reduce generalization error by adding more training data.
A simple strategy is to increase both model size and training data until performance is good.
However, this increases computational cost, so you need enough resources.
Optimization issues are possible but usually not a big problem if the model is chosen well.
Debugging Strategies: Debugging neural networks is difficult for two main reasons:
I. When a machine learning system performs poorly, it's hard to tell if the problem is with the algorithm
itself or a bug in the code. Debugging is challenging because we usually don't know in advance what the
"correct" behavior should look like.
For example, if a neural network gets 5% test error, we can't easily say if that's good or if something
went wrong. That's because the goal of machine learning is to discover patterns we can’t define
ourselves.
II. Debugging machine learning systems can be particularly challenging due to the adaptive nature of
their components. If one part of the model is flawed, other parts can often adapt and compensate, leading
to seemingly acceptable performance despite the underlying issue.
For example, consider a neural network with multiple layers, each defined by weights and biases. If
there's a mistake in the bias update rule during gradient descent, such as an incorrect update that doesn't
use the gradient (e.g., b ← b − α), the biases will continuously decrease. This is clearly incorrect, yet the
error might not be immediately obvious from the model's output alone. Depending on the input data
distribution, the weights might adapt to counteract the negative biases, masking the bug's presence.
Most debugging strategies for neural nets are designed to get around one or both of these two difficulties.
Either we design a case that is so simple that the correct behavior actually can be predicted, or we design
a test that exercises one part of the neural net implementation in isolation.
I. Visualizing the model in action is crucial for understanding its performance beyond just numerical
metrics. Here's how you can do it:
Object Detection Models: When training a model to detect objects in images, overlay the detected
objects on the images to visually inspect the model's performance.
Generative Speech Models: For models generating speech, listen to the speech samples produced to
assess the quality and naturalness of the output.
While it might seem straightforward, it's easy to rely solely on quantitative metrics like accuracy or log-
likelihood. Directly observing the model's outputs helps ensure that these metrics truly reflect reasonable
performance. This practice can also help identify evaluation bugs, which can be particularly misleading
as they might suggest the system is performing well when it is not.
II. Visualize the worst mistakes : To improve a model, it's helpful to visualize and analyze its worst
mistakes. Most models provide a confidence measure for their predictions. For example, classifiers using
a softmax layer assign a probability to each class, indicating the model's confidence in its decision.
Although these probabilities are often overestimated, they can still help identify examples that the model
finds challenging.
By examining the examples that the model struggles with the most, you can often uncover issues with
data preprocessing or labeling. For instance, an address number detection system initially had a problem
where it cropped images too tightly, omitting some digits. The transcription network then assigned low
probabilities to the correct answers for these images.
By identifying and reviewing the most confident errors, it became clear that there was a systematic issue
with the cropping. Adjusting the detection system to crop images more widely significantly improved
overall performance, even though it introduced more variability in the position and scale of the address
numbers for the transcription network to handle.
III. Reasoning about software using train and test error: Using training and testing errors can provide
insights into whether the software is correctly implemented:
Low Training Error, High Test Error: If the training error is low but the test error is high, it
suggests that the training process is likely functioning correctly, but the model may be overfitting.
Another possibility is that there might be an issue with how the test error is measured, such as
problems with saving and reloading the model or differences in how the test data was prepared
compared to the training data.
High Training and Test Errors: If both training and test errors are high, it's challenging to
determine whether there's a software defect or if the model is underfitting due to fundamental
algorithmic reasons. This situation requires further investigation.
IV. Fit a tiny dataset: To determine if high training error is due to genuine underfitting or a
software defect, try fitting a very small dataset:
Even small models should be able to fit a sufficiently small dataset. For instance, a classification
dataset with just one example can be fit by correctly setting the biases of the output layer.
If a model cannot be trained to correctly label a single example, accurately reproduce a single
example (in the case of an autoencoder), or consistently generate samples resembling a single
example (in the case of a generative model), it suggests there is a software defect preventing
successful optimization on the training set.
This test can be extended to a small dataset with a few examples to further diagnose the issue.
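A minimal sketch of this test, assuming PyTorch: a small network is trained on a handful of examples, and if it cannot reach (near) zero training error on such a tiny set, a software defect is the likely culprit.

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(5, 10)                        # 5 tiny examples, 10 features
y = torch.tensor([0, 1, 2, 1, 0])             # arbitrary labels

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

train_error = (model(x).argmax(dim=1) != y).float().mean().item()
print("training error on tiny dataset:", train_error)    # expect 0.0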
V. Compare back-propagated derivatives to numerical derivatives: If the gradient computation is suspect, compare each analytically computed derivative with a finite-difference approximation, f′(x) ≈ (f(x + ε) − f(x)) / ε. We can improve the accuracy of the approximation by using the centered difference: f′(x) ≈ (f(x + ε) − f(x − ε)) / (2ε).
By performing these tests, you can ensure that your implementation of gradient computations is correct,
which is crucial for the effective training and performance of machine learning models.
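The sketch below (plain NumPy, with a made-up objective function) compares a hand-derived gradient against the centered-difference approximation; a large discrepancy points to a bug in the gradient code.

import numpy as np

def f(w):                        # example objective: quadratic plus a sine term
    return np.sum(w ** 2) + np.sin(w[0])

def analytic_grad(w):            # hand-derived gradient to be verified
    g = 2 * w
    g[0] += np.cos(w[0])
    return g

def numerical_grad(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)     # centered difference
    return g

w = np.random.default_rng(0).normal(size=4)
print(np.max(np.abs(analytic_grad(w) - numerical_grad(f, w))))   # should be tiny (~1e-9)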
Case study: Multi-Digit Number Recognition: The Street View transcription system project,
focused on multi-digit number recognition, demonstrates a practical application of a deep learning design
methodology. Here's a simplified overview of the process:
1. Data Collection and Preparation:
o Data was collected using Street View cars, and human operators labeled the data.
o Significant dataset curation was performed, including using machine learning techniques to detect house numbers before transcription.
2. Setting Goals:
o The project aimed for high accuracy, specifically 98%, aligning with business goals.
o Coverage became the main performance metric, with accuracy maintained at 98%.
3. Baseline System:
o A convolutional network with rectified linear units was established as the baseline.
o Initially, the model used multiple softmax units to predict sequences of characters, treating each prediction independently.
4. Iterative Refinement:
o The system was iteratively refined. Improvements were made by developing a specialized output layer and cost function to compute a principled log-likelihood, enhancing the effectiveness of the example rejection mechanism.
5. Debugging and Adjustments:
o With coverage below 90% and similar training and test errors, the issue was identified as underfitting or data-related.
o Visualizing the model's worst errors revealed that images were often cropped too tightly, omitting parts of the address numbers.
o Instead of refining the address number detection system, the crop region was widened, significantly improving coverage.
6. Hyperparameter Tuning: Final performance improvements came from adjusting
hyperparameters, primarily by increasing the model size while managing computational costs.
7. Outcome: The project successfully transcribed hundreds of millions of addresses more
efficiently than human effort, showcasing the effectiveness of the applied design principles.
Large-Scale Deep Learning Applications: Deep learning is rooted in the idea of connectionism,
where intelligence emerges not from individual neurons or features but from large populations of them
working together.
The size of the neural network is crucial. A significant factor in the improvement of neural networks'
accuracy and their ability to solve complex tasks over the decades is the substantial increase in network
size.
Over the past thirty years, the size of neural networks has grown exponentially. Despite this growth,
artificial neural networks are still only as large as the nervous systems of insects.
Due to the importance of network size, deep learning demands high-performance hardware and software
infrastructure to support the computational requirements of large-scale models.
Fast CPU implementations: Traditionally, neural networks were trained using the CPU of a
single machine. Today, this approach is generally considered insufficient. We now mostly use GPU
computing or the CPUs of many machines networked together.
GPU Implementations: Modern neural network implementations are based on GPUs. GPUs are specialized hardware components originally developed for graphics applications, driven by the consumer market for video game rendering. They provide a high degree of parallelism and high memory bandwidth, and these performance characteristics, designed for video gaming systems, turn out to be beneficial for neural networks as well.
Large-Scale Distributed Implementations: Sometimes, one machine isn't enough for training or
running deep learning models. So, we spread the work across multiple machines.
Data Parallelism: Inference can be easily distributed using data parallelism, where different input
examples are processed by separate machines.
During training, data parallelism involves increasing the minibatch size for stochastic gradient descent
(SGD) steps, although this often results in less than linear improvements in optimization performance.
Model Parallelism: Model parallelism involves multiple machines collaborating on a single data point,
with each machine handling a different part of the model. This approach is applicable to both inference
and training.
Large-scale distributed asynchronous gradient descent is primarily used by major industry groups for
training large deep networks.
Academic researchers typically work with more limited resources but have explored building distributed
networks with cost-effective hardware.
Overall, distributed implementations are crucial for scaling deep learning tasks beyond the capabilities of
single machines.
Model Compression: In many commercial applications, it is much more important that the time and memory cost of running inference (prediction) in a machine learning model be low than that the time and memory cost of training be low.
For applications that do not require personalization, it is possible to train a model once, then deploy it to
be used by billions of users. In many cases, the end user is more resource-constrained than the developer.
For example, one might train a speech recognition network with a powerful computer cluster, then
deploy it on mobile phones.
A key strategy for reducing the cost of inference is model compression. The basic idea of model
compression is to replace the original, expensive model with a smaller model that requires less memory
and runtime to store and evaluate.
Model compression is applicable when the size of the original model is driven primarily by a need to
prevent overfitting. In most cases, the model with the lowest generalization error is an ensemble of
several independently trained models.
The original large model learns a function f(x) using more parameters than necessary, primarily due to
limited training data.
Once f(x) is learned, a new, larger training set can be generated by applying f to randomly sampled
points x.
A smaller model is then trained on this new dataset to replicate f(x). To maximize efficiency, the
new x points should resemble actual test inputs, which can be achieved by corrupting training examples
or using a generative model.
Alternative Approach: The smaller model can also be trained on the original training data but is
designed to mimic additional features of the original model, such as its posterior distribution over
incorrect classes.
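A hedged sketch of this idea, assuming PyTorch: a small student network is trained to match the soft output distribution of a larger model on the training inputs. The large model is left untrained here purely for illustration; in practice it would be the trained ensemble or large network being compressed.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(256, 20)                                    # stand-in training inputs

large_model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 10))
small_model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 10))

with torch.no_grad():
    soft_targets = F.softmax(large_model(x), dim=1)         # large model's output distribution

opt = torch.optim.Adam(small_model.parameters(), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    log_probs = F.log_softmax(small_model(x), dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")   # match distributions
    loss.backward()
    opt.step()
print("final distillation loss:", loss.item())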
Dynamic Structure: One effective strategy to speed up data processing systems is to build
dynamic structures in the graph describing the computations needed to process the given input. The data
processing systems can adaptively determine which parts of a neural network or which specific features
(hidden units) are necessary to process a given input. This approach, known as conditional computation,
enhances efficiency by computing only relevant features for each input.
A well-known method for accelerating inference in classifiers is using a cascade of classifiers. This is
particularly useful for detecting rare objects or events. The cascade starts with low-capacity classifiers
that quickly filter out non-relevant inputs, followed by more sophisticated classifiers for detailed
analysis. This setup ensures high confidence in detection while minimizing computational costs.
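A minimal sketch of a two-stage cascade using scikit-learn (the models, data, and threshold are illustrative stand-ins): a cheap first-stage classifier discards clear negatives, and only the surviving inputs are passed to a more expensive second stage.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=30, weights=[0.95, 0.05],
                           random_state=0)                   # rare positive class
cheap = LogisticRegression(max_iter=1000).fit(X, y)
expensive = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def cascade_predict(X_new, threshold=0.05):
    p1 = cheap.predict_proba(X_new)[:, 1]
    keep = p1 >= threshold                   # cheap stage filters out clear negatives
    out = np.zeros(len(X_new), dtype=int)
    if keep.any():
        out[keep] = expensive.predict(X_new[keep])   # expensive stage on survivors only
    return out

print(cascade_predict(X[:100]).sum(), "positives flagged in the first 100 inputs")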
Decision trees are another example of dynamic structures, where each node determines which subtree to
evaluate next. Combining decision trees with neural networks, where neural networks make splitting
decisions, is a straightforward way to integrate dynamic structures with deep learning.
Another approach involves using a neural network, called a gater, to select which expert network from a
pool should process a given input. This concept, known as a mixture of experts, can significantly speed
up processing when a single expert is chosen for each input, known as a hard mixture of experts.
Dynamic routing, where hidden units receive inputs from different units based on context, can be seen as
an attention mechanism. However, implementing hard switches in large-scale applications has been
challenging. Current methods often use weighted averages over possible inputs, which do not fully
exploit the computational benefits of dynamic structures.
A major challenge with dynamically structured systems is the reduced parallelism, which complicates
efficient implementation on both CPUs and GPUs. This issue can sometimes be mitigated by grouping
similar examples and processing them together, though this can lead to load-balancing problems in real-
time settings.
Computer Vision: Computer vision is one of the most active areas for deep learning research, since vision is a task that is effortless for humans but difficult for computers. Standard benchmarks for deep learning algorithms include object recognition and optical character recognition (OCR).
Most deep learning applications in computer vision focus on tasks like identifying objects in images,
drawing bounding boxes around them, transcribing text from images, or labeling each pixel in an image.
Additionally, there's significant work on using deep learning for generating images. Although creating
images from scratch isn't traditionally seen as a computer vision task, models that can generate images
are often useful for restoring images, such as fixing defects or removing unwanted objects.
Preprocessing: Many areas of deep learning applications require complex preprocessing because
the raw input data isn't in a suitable format for deep learning models. However, computer vision typically
needs minimal preprocessing. Here are the key points:
Standardization: Images should be standardized so that pixel values are within a consistent range,
such as [0, 1] or [-1, 1]. Mixing images with different ranges, like [0, 1] and [0, 255], can cause
issues.
Resizing: Many computer vision models require images of a standard size, so images may need to be
cropped or resized. However, some convolutional models can handle variably-sized inputs by
adjusting pooling regions or producing variable-sized outputs.
Dataset Augmentation: This technique, applied during training, involves creating varied versions of
training images (e.g., slightly different crops) to reduce generalization error. A similar idea at test
time involves using multiple versions of the same input and aggregating results, which can also
reduce generalization error.
Canonical Form Preprocessing: Both training and test sets can be preprocessed to reduce variability
and simplify the task for the model. This preprocessing removes irrelevant variations, potentially
reducing generalization error and the model size needed.
However, with large datasets and models, extensive preprocessing is often unnecessary, as the model
can learn to handle variability on its own. For example, the AlexNet system for ImageNet
classification only subtracts the mean pixel value across training examples.
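A small NumPy sketch of the standardization and dataset-augmentation points above, using a synthetic image: pixel values are rescaled to a consistent range and several random crops are generated as augmented training examples.

import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(40, 40, 3)).astype(np.float32)

standardized = image / 255.0                       # map [0, 255] to [0, 1]

def random_crop(img, out_h=32, out_w=32):
    h, w = img.shape[:2]
    top = rng.integers(0, h - out_h + 1)
    left = rng.integers(0, w - out_w + 1)
    return img[top:top + out_h, left:left + out_w]

augmented_batch = np.stack([random_crop(standardized) for _ in range(8)])
print(augmented_batch.shape)                       # (8, 32, 32, 3)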
Contrast Normalization: One important aspect of many image-related tasks is the contrast in
the image. Contrast refers to the difference in brightness between the light and dark pixels in an image.
In the context of deep learning, contrast is often measured by the standard deviation of pixel intensities in
an image or a specific region of an image.
Global Contrast Normalization (GCN) aims to ensure that all images have a consistent level of contrast.
This is done by adjusting each image so that its pixel intensities have a standard deviation equal to a
predefined constant s.
First, the mean pixel intensity is subtracted from each image to center the data around zero. Then, the
image is rescaled so that the standard deviation of its pixel intensities matches the constant s.
Images with zero contrast (where all pixels have the same intensity) cannot be adjusted using any scaling
factor. Images with very low contrast often contain little useful information. Simply dividing by the
standard deviation in these cases can amplify noise or compression artifacts.
To address these issues, a small positive regularization parameter λ is introduced. This parameter helps to
bias the estimate of the standard deviation, preventing division by very small numbers. Alternatively, the
denominator used in the rescaling process can be constrained to be at least ϵ, ensuring that the rescaling
does not excessively amplify noise.
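A minimal NumPy sketch of global contrast normalization as described above: subtract the mean, then rescale so the standard deviation of the pixel intensities matches s, with λ and ε guarding against images of very low contrast.

import numpy as np

def global_contrast_normalize(image, s=1.0, lam=10.0, eps=1e-8):
    x = image.astype(np.float64)
    x = x - x.mean()                                   # center pixel intensities
    contrast = np.sqrt(lam + np.mean(x ** 2))          # regularized standard deviation
    return s * x / max(contrast, eps)                  # avoid dividing by tiny values

img = np.random.default_rng(0).integers(0, 256, size=(32, 32)).astype(float)
gcn = global_contrast_normalize(img)
print(gcn.mean(), gcn.std())                           # mean ~0, standard deviation close to s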
The corresponding figure demonstrates how GCN can be used to map data onto a sphere, with different values of
λ affecting how strictly the data conforms to the spherical shape. Regularization with λ > 0 allows for
some retention of the original data's norm variation.
Raw Input: This plot shows the raw input data points, which are scattered and do not lie on a sphere.
The data points have varying norms, meaning their distances from the origin differ.
GCN with λ = 0: This plot demonstrates the effect of applying a GCN with λ set to 0. Here, all non-
zero data points are mapped perfectly onto a sphere. The sphere is not a unit sphere because the GCN
normalizes the standard deviation rather than the L2 norm. The parameters used are s = 1 and ε = 10⁻⁸.
Regularized GCN with λ = 10⁻²: This plot shows the result of applying a regularized GCN with λ set to 10⁻². The data points are drawn toward the sphere but do not completely discard the variation in their
norms. This means that while the points are closer to the sphere compared to the raw input, they retain
some of their original norm variations.
Global contrast normalization is a technique used to adjust the contrast of an entire image, but it often
fails to enhance specific features like edges and corners effectively. This limitation is particularly
noticeable in scenes with large variations in brightness, such as an image that is half in shadow and half
brightly lit. In such cases, while global contrast normalization can create a significant difference between
the bright and dark areas, it does not necessarily improve the visibility of finer details within those areas.
To address this issue, local contrast normalization is used. Unlike global contrast normalization, which
applies adjustments to the entire image, local contrast normalization works on smaller sections or
windows of the image. This approach ensures that contrast is normalized within each small window,
thereby enhancing local features and details more effectively across different regions of the image.
Global Contrast Normalization: This technique subtly adjusts images to a similar scale, aiding learning
algorithms by reducing the need to manage varying scales. However, its effects are less pronounced
visually.
Local Contrast Normalization: This method significantly alters images by eliminating areas of uniform
intensity, thereby emphasizing edges. It helps models concentrate on these edges, although it may result
in the loss of some detailed textures, such as those seen in the houses in the second row, due to the
normalization kernel's bandwidth being excessively high.
Local contrast normalization can be efficiently implemented using separable convolution to calculate
local means and standard deviations, followed by element-wise operations. This technique is
differentiable and can be applied both as a preprocessing step and within the hidden layers of a network.
Regularization is crucial to prevent division by zero, especially since smaller windows are more prone to
having uniform values, leading to zero standard deviation.
Speech Recognition: The task of speech recognition is to map an acoustic signal containing a
spoken natural language utterance into the corresponding sequence of words intended by the speaker. Let
X = (x^(1), x^(2), . . . , x^(T)) denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20 ms frames).
Most speech recognition systems preprocess the input using specialized hand-designed features, but
some deep learning systems learn features from raw input. Let y = (y_1, y_2, . . . , y_N) denote the target
output sequence (usually a sequence of words or characters).
The automatic speech recognition (ASR) task consists of creating a function f*_ASR that computes the most probable linguistic sequence y given the acoustic sequence X:
f*_ASR(X) = argmax_y P*(y | X = X),
where P* is the true conditional distribution relating the inputs X to the targets y.
From the 1980s until around 2009-2012, speech recognition systems primarily used Hidden Markov
models (HMMs) and Gaussian mixture models (GMMs). GMMs linked acoustic features to phonemes,
while HMMs modeled phoneme sequences. Despite early applications of neural networks in speech
recognition achieving comparable performance to GMM-HMM systems, the industry largely continued
with GMM-HMMs due to the extensive investment in these systems.
The shift towards neural networks began around 2009, with deep learning techniques based on Restricted
Boltzmann machines (RBMs) significantly improving recognition accuracy. This transition reduced the
phoneme error rate on the TIMIT corpus from about 26% to 20.7%. Further advancements included the
use of rectified linear units and dropout, leading to collaborations between industry and academia that
integrated deep learning into commercial products like mobile phones.
As research progressed, it became evident that unsupervised pretraining was not necessary for achieving
significant improvements. The speech recognition community rapidly adopted deep learning, resulting in
a 30% improvement in word error rates. This shift led to the incorporation of deep neural networks in
most industrial speech recognition products within a few years.
Innovations continued with the application of convolutional networks, which treated input spectrograms
as images, improving upon earlier models. Additionally, there was a push towards end-to-end deep
learning speech recognition systems that eliminated the need for HMMs. Notable breakthroughs included
the use of deep LSTM RNNs, which achieved a record low phoneme error rate of 17.7% on the TIMIT
corpus. Contemporary efforts also focused on enabling systems to learn the alignment of acoustic and
phonetic information autonomously.
Natural Language Processing: Natural Language Processing (NLP) involves the use of human languages (English, French, Telugu) by computers, unlike typical computer programs (C, C++, Python) that use specialized, unambiguous languages. NLP encompasses applications like machine
translation, where the goal is to convert sentences from one language to another. It often relies on
language models that define probability distributions over sequences of words, characters, or bytes.
Why Use Probability in NLP? Natural language is inherently ambiguous and uncertain, making
rigid, rule-based systems ineffective. Probability helps manage this ambiguity by:
1. Handling Ambiguity in Language
Contextual Dependence: The same word or phrase can have multiple meanings (e.g., "bank" =
financial institution or river edge). Probabilistic models estimate which interpretation is more likely
given the context.
Implicit Information: Humans omit obvious details when speaking. Probability helps infer missing
information (e.g., predicting that "Can you pass the salt?" is a request, not a yes/no question).
Natural language is inherently sequential. It is typically processed as a sequence of words (rather than characters or bytes) because words carry more meaningful information. However, modeling language at the word level introduces challenges due to high dimensionality and sparsity:
Large Vocabulary Size: The total number of unique words (vocabulary) is very large (e.g., hundreds
of thousands or more).
Sparse Discrete Space: Each word is a discrete unit, leading to a high-dimensional,
sparse representation (e.g., one-hot encoding). This makes statistical modeling difficult because most
word sequences are rare or unseen in training data.
N-grams: N-gram models are a type of language model that define a probability distribution over
sequences of tokens, such as words, characters, or bytes, in natural language. These models focus on
sequences of a fixed length, known as n-grams, where an n-gram is essentially a sequence of n tokens.
(e.g., "I love" is a 2-gram or bigram).
Models based on n-grams define the conditional probability of the nth token given the preceding n − 1
tokens. The model uses products of these conditional distributions to define the probability distribution
over longer sequences:
P(x1, …, xτ) = P(x1, …, xn−1) × ∏ (from t = n to τ) P(xt | xt−n+1, …, xt−1)

P(x1, …, xτ): This represents the joint probability of observing the entire sequence of tokens x1, …, xτ.
P(x1, …, xn−1): This is the probability of the first n−1 tokens in the sequence. It serves as the starting point for the n-gram model.
∏ (from t = n to τ): This is the product notation, indicating that we multiply the conditional probabilities for each token xt from t = n to t = τ.
P(xt | xt−n+1, …, xt−1): This term represents the conditional probability of the token xt given the previous n − 1 tokens xt−n+1, …, xt−1. It captures the idea that the probability of a token occurring depends on the n−1 tokens that immediately precede it.
This approach leverages the context provided by the preceding tokens to predict the likelihood of the
next token in the sequence.
This decomposition comes from the chain rule of probability: P(x1, …, xn) = P(xn | x1, …, xn−1) P(x1, …, xn−1).
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core building block of natural language processing for many decades.
For small values of n, models have particular names: unigram for n=1, bigram for n=2, and trigram for
n=3.
Usually we train both an n-gram model and an (n−1)-gram model simultaneously. The conditional probability P(xt | xt−n+1, …, xt−1) can then be computed simply by looking up two stored probabilities. To ensure this method accurately reflects the inference process in Pn, we need to omit the final character from each sequence when training Pn−1.
To understand how a trigram model calculates the probability of the sentence "THE DOG RAN AWAY," let's break down the process step by step:
Initial Words Handling:
o The first two words of the sentence, "THE DOG," cannot be handled using the standard
conditional probability formula because there is no preceding context at the beginning of the
sentence.
o Instead, we use the marginal probability for the initial sequence of words. For a trigram
model, this involves evaluating the probability of the first three words as a whole, denoted
as P3(THE DOG RAN).
Subsequent Words Handling:
o For the last word in the sentence, "AWAY," we use the typical conditional probability
approach. This involves calculating the probability of "AWAY" given the two preceding
words, "DOG RAN," denoted as P(AWAY∣DOG RAN).
Combining Probabilities:
o To find the overall probability of the sentence, we combine these probabilities using the
following equation derived from the trigram model:
P(THE DOG RAN AWAY) = P3(THE DOG RAN) × P3(DOG RAN AWAY) / P2(DOG RAN)
Here, P3(THE DOG RAN) is the joint probability of the first three words.
P3(DOG RAN AWAY) is the joint probability of the last three words.
P2(DOG RAN) is the joint probability of the second and third words, which acts as a normalizing
factor to ensure the probabilities are correctly scaled.
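The following sketch reproduces this calculation with maximum-likelihood counts taken from a tiny made-up corpus (the corpus and the resulting counts are purely illustrative).

from collections import Counter

corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p3(w1, w2, w3):                          # joint probability of a trigram
    return trigrams[(w1, w2, w3)] / sum(trigrams.values())

def p2(w1, w2):                              # joint probability of a bigram
    return bigrams[(w1, w2)] / sum(bigrams.values())

# P(THE DOG RAN AWAY) = P3(THE DOG RAN) * P3(DOG RAN AWAY) / P2(DOG RAN)
prob = p3("THE", "DOG", "RAN") * p3("DOG", "RAN", "AWAY") / p2("DOG", "RAN")
print(prob)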
Limitation of Maximum Likelihood for n-gram models: A fundamental limitation of maximum likelihood for n-gram models is that Pn as estimated from training set counts is very likely to be zero in many cases, even though the tuple (xt−n+1, . . . , xt) may appear in the test set.
To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing. Smoothing
Techniques shift probability mass from the observed tuples to unobserved ones that are similar.
Smoothing techniques:
1. Adding non-zero probability mass to all of the possible next symbol values. This method can be
justified as Bayesian inference with a uniform or Dirichlet prior over the count parameters.
2. Another very popular idea is to form a mixture model containing higher-order and lower-order n-gram
models, with the higher-order models providing more capacity and the lower-order models being more
likely to avoid counts of zero.
3. Back-off methods look up the lower-order n-grams if the frequency of the context xt−1, . . . , xt−n+1 is too small to use the higher-order model. More formally, they estimate the distribution over xt by using contexts xt−n+k, . . . , xt−1, for increasing k, until a sufficiently reliable estimate is found.
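A short sketch of smoothing technique 1 (additive smoothing): a small pseudo-count δ is added for every possible next token so that unseen n-grams receive non-zero probability. The toy vocabulary and corpus are made up.

from collections import Counter

vocab = ["THE", "DOG", "CAT", "RAN", "AWAY", "HOME", "."]
corpus = "THE DOG RAN AWAY . THE DOG RAN HOME .".split()
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_smoothed(w3, w1, w2, delta=0.1):
    """P(w3 | w1 w2) with additive (add-delta) smoothing over the vocabulary."""
    return ((trigram_counts[(w1, w2, w3)] + delta) /
            (bigram_counts[(w1, w2)] + delta * len(vocab)))

print(p_smoothed("AWAY", "THE", "CAT"))   # non-zero even though never observed in training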
There are |V|^n possible n-grams and |V| is often very large, which makes n-gram models vulnerable to the curse of dimensionality. As |V| (the vocabulary size) and n increase, the number of possible n-grams grows exponentially, a phenomenon known as a combinatorial explosion. For example, if |V| = 10,000 (a modest vocabulary size) and n = 5 (5-grams), there are 10,000^5 = 10^20 possible 5-grams.
Classical n-gram models can be seen as performing nearest-neighbor lookups, acting like local non-
parametric predictors similar to k-nearest neighbors. However, these models face significant statistical
challenges, particularly in language modeling where words in one-hot vector space are equidistant,
making it difficult to leverage information from similar contexts unless they are identical.
To address these issues and improve statistical efficiency, class-based language models introduce word
categories. These models use clustering algorithms to group words into classes based on their co-
occurrence frequencies. By using word class IDs instead of individual word IDs, these models can share
statistical strength among similar words, enhancing generalization.
Composite models that combine word-based and class-based approaches through techniques like mixing
or back-off are also feasible, although some information is inevitably lost in this categorization process.
Neural language models (NLMs) are designed to tackle the challenge of the curse of dimensionality in
modeling natural language sequences by utilizing distributed representations of words. Unlike traditional
class-based n-gram models, NLMs can recognize similarities between words while still treating each
word as distinct. This is achieved through shared statistical strength among similar words and contexts,
facilitated by distributed representations that capture common features among words.
For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice versa.
These distributed representations are often referred to as word embeddings. In this framework, words
are initially represented as points in a high-dimensional space equal to the vocabulary size, using one-hot
vectors.
Word embeddings then map these points into a lower-dimensional feature space where words with
similar meanings or contexts are positioned close to each other. This transformation allows words that
frequently appear in similar contexts to be neighbors in the embedding space, providing a more
meaningful and efficient representation of language.
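A small sketch contrasting one-hot vectors with low-dimensional embeddings; the three-dimensional embedding values below are invented for illustration, not learned.

import numpy as np

vocab = {"cat": 0, "dog": 1, "car": 2}
one_hot = np.eye(len(vocab))                       # every distinct pair is equally distant

embeddings = np.array([[0.8, 0.1, 0.6],            # "cat"  (hypothetical values)
                       [0.7, 0.2, 0.6],            # "dog"  close to "cat"
                       [0.0, 0.9, 0.1]])           # "car"  far from both

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot[0], one_hot[1]))                               # 0.0: no similarity signal
print(cosine(embeddings[vocab["cat"]], embeddings[vocab["dog"]]))   # high similarity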
The concept of embeddings is not unique to NLP; for example, convolutional networks in image
processing also use embeddings. However, in NLP, the shift to embeddings is particularly impactful
because it transforms discrete linguistic symbols into continuous vector spaces, dramatically changing
how language data is represented and processed.
Other Applications
Recommender Systems : One of the major families of applications of machine learning in the
information technology sector is the ability to make recommendations of items to potential users or
customers.
Two major types of applications can be distinguished: online advertising and item recommendations.
Both rely on predicting the association between a user and an item, either to predict the probability of
some action (the user buying the product, or some proxy for this action) or the expected gain (which may
depend on the value of the product) if an ad is shown or a recommendation is made regarding that
product to that user. Some examples of recommender systems are:
1. Media: Recommending news, video on demand, music
2. eCommerce: Product recommendation
3. Jobs Boards: Newsletters with job offers
4. Travel and Real Estate: Events, places, chatbot
5. Education: Personalizing education, recommend materials
The internet is currently financed in great part by various forms of online advertising. There are major
parts of the economy that rely on online shopping. Companies including Amazon and eBay use machine
learning, including deep learning, for their product recommendations.
Sometimes, the items are not products that are actually for sale. Examples include
selecting posts to display on social network news feeds,
recommending movies to watch,
recommending jokes,
advice from experts,
matching players for video games, or matching people in dating services.
Surprisingly, similar machine learning algorithms are used for recommending news and videos in media, products in e-commerce, and personalization in travel and retail.
Given some information about the item and about the user, recommender systems predict a proxy of interest, for example:
user clicks on ad,
user enters a rating,
user clicks on a “like” button,
user buys product,
user spends some amount of money on the product,
user spends time visiting a page for the product, and so on.
When training a machine learning algorithm, the training error and testing error are key metrics used to
evaluate the performance of the model. Here's what each error represents and what you should aim for:
1. Training Error:
Definition: This is the error rate of the model on the training dataset. It measures how
well the model is fitting the data it has seen during training.
Goal: Ideally, you want the training error to be as low as possible. A low training error
indicates that the model has learned the patterns in the training data well.
2. Testing Error:
Definition: This is the error rate of the model on a separate testing dataset that the model
has not seen during training. It measures how well the model generalizes to new, unseen
data.
Goal: The testing error should also be as low as possible. A low testing error indicates
good generalization performance, meaning the model performs well on data it wasn't
trained on.
Relationship and Considerations:
Balance: You generally want both errors to be low, but it's crucial to have a low testing error as it
reflects the model's ability to generalize. If the training error is low but the testing error is high, it
indicates overfitting. Conversely, if both errors are high, it suggests underfitting.
Generalization Gap: The difference between the training error and testing error is known as the
generalization gap. A small gap indicates that the model generalizes well, while a large gap
suggests overfitting.
Model Complexity: Adjust the complexity of the model to balance these errors. If the model is
too simple, both errors will be high (underfitting). If the model is too complex, the training error
will be low, but the testing error will be high (overfitting).
Regularization: Use techniques like dropout, weight decay, or early stopping to reduce
overfitting and help close the generalization gap.
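The reasoning above can be summarized as a simple rule of thumb; the sketch below uses arbitrary, task-dependent thresholds for the target error and the acceptable generalization gap.

def diagnose(train_error, test_error, target_error=0.05, max_gap=0.02):
    gap = test_error - train_error                  # generalization gap
    if train_error > target_error:
        return "underfitting: increase capacity or improve optimization"
    if gap > max_gap:
        return "overfitting: regularize, shrink the model, or gather more data"
    return "acceptable: both errors are low and the generalization gap is small"

print(diagnose(train_error=0.01, test_error=0.09))   # overfitting
print(diagnose(train_error=0.12, test_error=0.13))   # underfitting
print(diagnose(train_error=0.02, test_error=0.03))   # acceptable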
In summary, while training a machine learning algorithm, aim for low training and testing errors, with a
particular emphasis on minimizing the testing error to ensure good generalization to new data.
These notes are prepared using material from "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press, 2016, and various
resources from the World Wide Web. Please email your valuable suggestions to [email protected].