
Deep Learning

ImageNet is a large-scale image database with over 14 million annotated images across 20,000 categories, serving as a crucial resource for training computer vision algorithms. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition that benchmarks object classification and detection, significantly influencing deep learning advancements. The success of deep neural networks, particularly AlexNet in 2012, showcased the effectiveness of deep learning and spurred further developments in model architectures and applications across various domains.


Module 1: Introduction to Deep Learning & NLP

(CO1)
1.
Explain the architecture and functioning of the Neocognitron. (10 marks)
The Neocognitron is a pioneering artificial neural network architecture proposed by Kunihiko
Fukushima in the late 1970s, designed for robust pattern recognition tasks like handwritten
character identification. [1]

Architecture of the Neocognitron


The Neocognitron consists of a hierarchical, multi-layered structure inspired directly by
biological visual processing, particularly Hubel and Wiesel's model of the visual cortex. In this
architecture:
Layers are organized sequentially.
The first layer contains “simple cells” that react maximally to basic patterns such as edges
with specific orientations.
The next layer contains “complex cells,” responsible for combining various basic features
detected by simple cells and representing more abstract formations like corners and curves.
This hierarchy continues across several layers, where each deeper layer processes more
complex and abstract information derived from the preceding layer. [1]

Functioning of the Neocognitron


Input data (such as images) is passed through the model beginning at the lowest layer,
where local features like edges or lines are detected.
Each subsequent layer performs increasingly sophisticated combination and abstraction,
allowing the network to gradually create comprehensive representations of complex shapes.
Like the visual cortex, this layered representation enables the network to recognize patterns
regardless of their position or minor distortions, resulting in shift-invariant recognition.
Summary Table
| Feature | Description | Reference |
|---|---|---|
| Biological Inspiration | Models visual cortex hierarchy (simple & complex cells) | [1] |
| Architecture | Sequential, hierarchical layers for feature abstraction | [1] |
| Functioning | Local feature extraction, followed by abstraction | [1] |
| Shift Invariance | Recognizes patterns despite positional changes | [1] |
| Application | Used for handwriting and pattern recognition | [1] |

Importance and Legacy


The Neocognitron directly inspired later deep learning architectures like convolutional neural
networks (CNNs), serving as the foundational concept for automatic hierarchical feature
extraction.
Modern image recognition systems leverage this approach to build deep, multi-layered
networks that learn increasingly complex representations as data passes through each
layer. [1]
In conclusion, the Neocognitron's architecture reflects biological principles by building
successive layers of abstraction, enabling robust pattern recognition and forming the conceptual
bridge to modern deep learning networks. [1]

AlexNet Architecture Overview


AlexNet is a landmark deep convolutional neural network introduced by Alex Krizhevsky, Ilya
Sutskever, and Geoffrey Hinton in 2012. It achieved breakthrough performance in the ImageNet
Large Scale Visual Recognition Challenge, dramatically outperforming previous models and
popularizing deep learning for computer vision. [2] [3] [4]

Layer-by-Layer Structure
Input Layer: Accepts RGB images of size 224×224×3 (the original paper used 224×224, but 227×227 is common in implementations). [3] [4]
Convolutional Layers:
1. Conv1: 96 filters of size 11×11, stride 4, followed by ReLU and max pooling.
2. Conv2: 256 filters of size 5×5, stride 1, followed by ReLU and max pooling.
3. Conv3: 384 filters of size 3×3, stride 1, followed by ReLU.
4. Conv4: 384 filters of size 3×3, stride 1, followed by ReLU.
5. Conv5: 256 filters of size 3×3, stride 1, followed by ReLU and max pooling. [5] [4] [3]
Fully Connected Layers:
6. FC1: 4096 neurons, ReLU, dropout.
7. FC2: 4096 neurons, ReLU, dropout.
8. FC3: 1000 neurons (for ImageNet), softmax output. [4] [2] [3]
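For reference, the layer stack above can be sketched in Keras roughly as follows. This is an illustrative, untrained sketch rather than the original implementation: the 227×227×3 input is the common implementation choice, padding choices are approximate, and LRN layers are omitted.

from tensorflow.keras import layers, models

def build_alexnet(num_classes=1000):
    # Layer sizes follow the list above (Conv1-Conv5, FC1-FC3)
    model = models.Sequential([
        layers.Input(shape=(227, 227, 3)),
        layers.Conv2D(96, 11, strides=4, activation='relu'),        # Conv1
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding='same', activation='relu'),   # Conv2
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding='same', activation='relu'),   # Conv3
        layers.Conv2D(384, 3, padding='same', activation='relu'),   # Conv4
        layers.Conv2D(256, 3, padding='same', activation='relu'),   # Conv5
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation='relu'),                      # FC1
        layers.Dropout(0.5),
        layers.Dense(4096, activation='relu'),                      # FC2
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation='softmax'),            # FC3
    ])
    return model

model = build_alexnet()
model.summary()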

Key Innovations Over Traditional ML


Deep Hierarchical Feature Learning: Unlike traditional ML, which relies on hand-crafted
features, AlexNet learns features directly from raw pixel data through multiple layers.
ReLU Activation: Uses the non-saturating ReLU function $ f(x) = \max(0, x) $, enabling
faster training and better gradient flow compared to sigmoid/tanh. [6] [2] [3]
Dropout Regularization: Randomly disables neurons during training to prevent overfitting,
improving generalization. [3] [4]
Overlapping Max Pooling: Reduces spatial dimensions while preserving more information
than non-overlapping pooling. [6] [3]
GPU Training: Utilized parallel GPU computation, making deep networks feasible for large
datasets. [2] [6]
Local Response Normalization (LRN): Helps generalization and improves performance by
normalizing neuron outputs locally. [2] [3]

AlexNet vs. Traditional ML


| Feature | AlexNet (Deep Learning) | Traditional ML (e.g., SVM, kNN) |
|---|---|---|
| Feature Extraction | Learned, hierarchical | Hand-crafted |
| Depth | 8 layers (deep) | Shallow (often 1-2 layers) |
| Activation Function | ReLU | Linear, sigmoid, or none |
| Regularization | Dropout, LRN | L2/L1, early stopping |
| Training Data | Millions of images | Usually much smaller |
| Hardware Utilization | GPU parallelism | CPU-based |
| Performance | State-of-the-art (ImageNet) | Lower accuracy on complex tasks |

Summary
AlexNet's architecture—deep convolutional layers, ReLU activations, dropout, and GPU training
—enabled it to learn complex features directly from data, outperforming traditional machine
learning approaches that relied on manual feature engineering. This model set the stage for
modern deep learning in computer vision.

Comparison: LeNet-5 vs. AlexNet
Let's break down the design and application differences between LeNet-5 and AlexNet, two
landmark convolutional neural network (CNN) architectures.

1. Design Differences
| Feature | LeNet-5 | AlexNet |
|---|---|---|
| Year Introduced | 1998 | 2012 |
| Input Type | Grayscale images (32×32) | RGB images (224×224×3) |
| Depth | 7 layers (2 conv, 2 pool, 2 FC, output) | 8 layers (5 conv, 3 FC) |
| Activation Function | Sigmoid/Tanh | ReLU |
| Pooling | Average pooling | Max pooling |
| Regularization | None | Dropout, Local Response Normalization |
| Parameters | ~60,000 | ~60 million |
| Hardware Used | CPU | GPU (parallel computation) |

LeNet-5 is a shallow network designed for simple tasks like digit recognition (MNIST
dataset). It uses basic activations and average pooling.
AlexNet is much deeper and wider, designed for large-scale image classification
(ImageNet). It introduced ReLU activations, max pooling, dropout, and GPU training for
scalability and speed. [9] [10] [11] [12] [13] [14] [15]

2. Application Differences
| Application Area | LeNet-5 | AlexNet |
|---|---|---|
| Typical Tasks | Handwritten digit recognition, document analysis | Large-scale object recognition, image classification |
| Dataset Size | Small (MNIST: 60,000 images) | Large (ImageNet: 1.2 million images) |
| Accuracy | High for simple tasks | State-of-the-art for complex tasks |
| Training Time | Fast, efficient | Slow, resource-intensive |
| Inference Time | Very fast | Slower due to complexity |

LeNet-5 is ideal for simple, small-scale problems where speed and efficiency are important.
AlexNet excels in complex, large-scale problems, handling color images and learning rich
hierarchical features, but requires much more computational power and time.
3. Key Innovations
AlexNet improved upon LeNet-5 by:
Using ReLU for faster training.
Employing max pooling for better feature selection.
Adding dropout and normalization to reduce overfitting.
Leveraging GPUs for large-scale data.

4. Summary
LeNet-5: Simple, efficient, best for small grayscale image tasks.
AlexNet: Deep, powerful, best for large, complex color image tasks, and set the stage for
modern deep learning in computer vision.
Would you like to see a diagram or code example for either architecture?

Representing Vocabulary: One-Hot Encoding vs. Word2Vec


Let's see how one-hot encoding and Word2Vec represent words in a vocabulary, with clear
examples.

1. One-Hot Encoding
Definition: Each word is represented by a binary vector. The vector's length equals the
vocabulary size. Only one position (corresponding to the word) is set to 1; all others are 0.
[16] [17]

Example:
Suppose our vocabulary is:
['cat', 'dog', 'apple', 'car']

| Word | One-Hot Vector |
|---|---|
| cat | [1, 0, 0, 0] |
| dog | [0, 1, 0, 0] |
| apple | [0, 0, 1, 0] |
| car | [0, 0, 0, 1] |

Usage: Simple, easy to implement, but does not capture meaning or relationships between
words.
2. Word2Vec
Definition: Each word is mapped to a dense, low-dimensional vector (embedding) learned
from context in large text corpora. Words with similar meanings have similar vectors.
Example:
Suppose Word2Vec produces 3-dimensional vectors for our vocabulary:

| Word | Word2Vec Vector |
|---|---|
| cat | [0.21, 0.55, -0.13] |
| dog | [0.20, 0.57, -0.11] |
| apple | [-0.33, 0.12, 0.88] |
| car | [0.77, -0.44, 0.09] |

Usage: Captures semantic relationships (e.g., 'cat' and 'dog' vectors are close), enabling
models to understand word meaning and context.
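As a small illustration, the following NumPy sketch builds the one-hot vectors for this four-word vocabulary and contrasts them with hypothetical dense embeddings like those above (the embedding values are made up for illustration, not learned):

import numpy as np

vocab = ['cat', 'dog', 'apple', 'car']

# One-hot vectors: rows of the identity matrix, one per word
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Hypothetical dense embeddings (illustrative values, not trained)
embeddings = {
    'cat':   np.array([0.21, 0.55, -0.13]),
    'dog':   np.array([0.20, 0.57, -0.11]),
    'apple': np.array([-0.33, 0.12, 0.88]),
    'car':   np.array([0.77, -0.44, 0.09]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot['cat'], one_hot['dog']))        # 0.0: one-hot vectors are orthogonal
print(cosine(embeddings['cat'], embeddings['dog']))  # near 1.0: similar words, similar vectors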

Summary Table
| Method | Representation | Captures Meaning? | Example Vector for 'cat' |
|---|---|---|---|
| One-Hot | Sparse binary vector [16] | No | [1, 0, 0, 0] |
| Word2Vec | Dense learned embedding | Yes | [0.21, 0.55, -0.13] |

Key Points
One-hot encoding is simple but limited: no information about word similarity or meaning.
Word2Vec provides rich, learned representations that reflect word relationships and
context.
Would you like to try creating one-hot or Word2Vec vectors for a custom vocabulary?

Localist vs. Distributed Representations in NLP


Let's clarify the difference between localist and distributed representations in natural language
processing (NLP), using concrete examples.

Localist Representation
Definition: Each concept (e.g., a word) is represented by a single unit (neuron or vector
position) that is uniquely associated with that concept. The meaning of each unit is
interpretable on its own.
Example: One-hot encoding is a classic localist representation. For a vocabulary of four
words—['cat', 'dog', 'apple', 'car']—the word 'dog' is represented as [0, 1, 0, 0].
Only the second position is '1', directly indicating 'dog'. Each word has its own unique
position; no overlap or shared meaning between units.

Distributed Representation
Definition: Each concept is represented by a pattern of activity across multiple units. Each
unit participates in representing many concepts, and the meaning of a unit depends on the
activity of others. The representation is typically dense and lower-dimensional than the
vocabulary size.
Example: Word2Vec embeddings are distributed representations. For the same vocabulary,
'dog' might be represented as [0.20, 0.57, -0.11].

Here, each value contributes to the meaning, and similar words (like 'cat') will have similar
vectors, reflecting semantic relationships. No single position uniquely identifies 'dog';
instead, the pattern as a whole does.

Key Differences Illustrated


| Representation Type | Example for 'dog' | Interpretable by Single Unit? | Captures Word Similarity? |
|---|---|---|---|
| Localist | [0, 1, 0, 0] | Yes | No |
| Distributed | [0.20, 0.57, -0.11] | No | Yes |

Localist: Each unit has a clear, standalone meaning. No overlap between words.
Distributed: Meaning is spread across units. Similar words have similar patterns, enabling
models to generalize and capture relationships.

Summary
Localist representations (like one-hot encoding) are simple and interpretable, but do not
capture relationships between words.
Distributed representations (like Word2Vec) are dense, share units among concepts, and
reflect semantic similarity, making them powerful for modern NLP tasks.
Would you like to see how these representations affect downstream NLP models or try building
them for a custom vocabulary?

Motivation Behind Distributed Representations in Deep Learning


Distributed representations are a core idea in deep learning, especially for tasks like natural
language processing and computer vision. Let's break down why they're so important:
1. Efficient Encoding of Information
In distributed representations, each concept (like a word or image) is represented by a
pattern of activity across many units (dimensions), not just a single unit. This allows a model
to encode a huge number of possible concepts using relatively few parameters. [30] [31] [32]
For example, with 10 binary features, you can represent $ 2^{10} = 1024 $ different patterns,
while a localist (one-hot) approach would only allow 10 concepts.

2. Capturing Semantic Relationships


Distributed representations can capture similarities and relationships between concepts.
Words with similar meanings (like 'cat' and 'dog') end up with similar vectors, making it
easier for models to generalize and understand context. [31] [30]
This is crucial for tasks like translation, sentiment analysis, and image recognition, where
understanding relationships is key.

3. Generalization and Transfer Learning


Because distributed representations share features across many concepts, models can
generalize better to unseen data. If a model learns that certain features are important for
'cat', it can apply that knowledge to 'dog' even if it hasn't seen many examples. [33] [32]
This enables transfer learning, where knowledge from one task helps with another.

4. Learning Complex Functions with Fewer Examples


Distributed representations allow neural networks to learn complex, high-dimensional
functions efficiently. They can represent an exponential number of regions in input space
with a linear number of parameters, making learning more data-efficient. [32]

5. Hierarchical Abstraction
Deep networks use distributed representations to build up multiple levels of abstraction.
Early layers might capture simple features (edges, colors), while deeper layers combine
these into complex concepts (objects, meanings). [31] [32]

Summary Table
Motivation Benefit

Efficient encoding Exponential concepts with few parameters

Semantic relationships Similar concepts have similar representations

Generalization Better performance on unseen data

Data efficiency Learn complex functions with less data

Hierarchical abstraction Build up from simple to complex features


In Practice
Distributed representations are learned automatically by deep neural networks during training.
This is why deep learning models are so powerful for tasks involving language, images, and
other complex data—they can discover and use rich, shared features that make sense of the
world in a flexible, scalable way. [30] [32] [31]
Would you like to see a concrete example of how distributed representations work in a neural
network?

7.
What is ImageNet and ILSVRC? Describe their role in deep learning evolution. (10 marks)
ImageNet is a large-scale, structured image database containing over 14 million annotated
images across more than 20,000 categories. It was created to provide researchers with a
comprehensive resource for training and evaluating computer vision algorithms, especially for
object recognition and classification tasks. [39] [40] [41] [42]

What is ILSVRC?
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition that
began in 2010, using a subset of ImageNet with 1,000 object categories. Researchers compete
to develop algorithms that can classify and detect objects in images with the highest accuracy.
ILSVRC quickly became the gold standard for benchmarking computer vision models. [43] [41] [42]

Role in Deep Learning Evolution


Catalyst for Deep Learning Breakthroughs: The vast size and diversity of ImageNet
enabled the training of deep neural networks, which require large amounts of labeled data
to learn effectively. [41] [42] [43]
AlexNet Revolution: In 2012, AlexNet—a deep convolutional neural network—won the
ILSVRC by a large margin, demonstrating the power of deep learning and sparking
widespread interest and adoption in the field. [44] [42] [41]
Benchmarking and Progress: ILSVRC provided a common benchmark, driving rapid
improvements in model architectures (e.g., VGGNet, GoogLeNet, ResNet) and setting new
standards for accuracy in image classification. [42] [43] [41]
Transfer to Other Domains: Successes on ImageNet and ILSVRC led to the application of
deep learning in other areas, such as natural language processing, medical imaging, and
autonomous vehicles. [41] [42]
Summary Table
| Dataset/Challenge | Description | Role in Deep Learning |
|---|---|---|
| ImageNet | 14M+ images, 20K+ categories | Enabled large-scale training, benchmarking, and model development [41] [42] |
| ILSVRC | Annual challenge, 1K categories | Drove innovation, set benchmarks, popularized deep learning [41] [43] [42] |

ImageNet and ILSVRC have been instrumental in the evolution of deep learning, providing the
data and competitive environment needed to push the boundaries of computer vision and
artificial intelligence. [43] [42] [41]

How Deep Learning Networks Learn Language Representations Automatically


Deep learning networks, such as recurrent neural networks (RNNs), convolutional neural
networks (CNNs), and transformers, automatically learn language representations by
discovering patterns and relationships in raw text data—without manual feature engineering. [49]
[50] [51] [52]

1. Hierarchical Feature Learning


Multiple Layers: Deep networks process text through several layers. Early layers learn basic
features (e.g., word identity or simple patterns), while deeper layers capture complex
relationships (e.g., syntax, semantics, context).
Example: In a transformer model, the encoder converts raw text into embeddings that
reflect word meaning and context. The decoder uses these embeddings to predict or
generate text. [51]

2. Automatic Feature Extraction


No Manual Engineering: Unlike traditional methods, deep learning models do not require
hand-crafted features. They learn useful representations directly from data by optimizing
for the task (e.g., translation, sentiment analysis). [50] [52] [49]
Word Embeddings: Models like Word2Vec, GloVe, and BERT learn dense vector
representations for words, where similar words have similar vectors. These embeddings
capture semantic relationships automatically.

3. Contextual Understanding
Sequence Models: RNNs and transformers process sequences of words, learning how word
meaning changes with context. For example, the word "bank" in "river bank" vs. "money
bank" gets different representations depending on surrounding words. [52] [51]
Self-Attention: Transformers use self-attention to weigh the importance of each word in a
sentence, enabling nuanced understanding of meaning and relationships.
4. Training Process
Supervised Learning: Models are trained on labeled data (e.g., sentiment labels, translation
pairs), learning representations that help solve specific tasks.
Unsupervised Learning: Models can also learn from unlabeled data by predicting missing
words (e.g., BERT's masked language modeling), discovering general language patterns.
[49] [51]

5. Transfer and Adaptation


Transfer Learning: Once a model learns general language representations, it can be fine-
tuned for new tasks with less data, making deep learning highly adaptable. [51] [52] [49]

Example: Word2Vec Representation


Suppose a model is trained on sentences like "The cat sat on the mat" and "The dog lay on the
rug." Word2Vec will learn that "cat" and "dog" are similar, as are "mat" and "rug," by analyzing
their contexts. The resulting vectors reflect these relationships automatically.
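As a rough sketch of this (assuming the gensim library is available; the corpus and hyperparameters below are illustrative, not from the textbook), training Word2Vec on tokenized sentences like these looks as follows:

from gensim.models import Word2Vec

# Tiny illustrative corpus; real training uses millions of sentences
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'lay', 'on', 'the', 'rug'],
]

# vector_size, window, min_count, epochs are illustrative hyperparameters
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv['cat'])                    # dense vector learned from context
print(model.wv.similarity('cat', 'dog'))  # with enough data, similar contexts give similar vectors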

Summary Table
Step What Happens

Data Input Raw text (sentences, documents)

Layered Processing Early: basic features; Deep: complex meaning

Representation Learning Dense vectors capture word/context relationships

Automatic Feature Extraction No manual features; model discovers patterns

Contextualization Meaning adapts to sentence context

Transfer Learning Representations reused for new tasks

In essence, deep learning networks learn language representations by automatically


discovering and encoding the structure, meaning, and relationships in text data, enabling
powerful and flexible NLP applications.
Would you like to see a step-by-step example of how a transformer learns word representations?

Role of L1 and L2 Regularization in Deep Learning


Regularization is a set of techniques used in deep learning to improve a model's ability to
generalize to new, unseen data by preventing overfitting. Two of the most common
regularization methods are L1 and L2 regularization, which add penalty terms to the loss
function based on the size of the model's weights.
L2 Regularization (Weight Decay)
Definition: Adds a penalty proportional to the sum of the squares of the weights:
$ \Omega(w) = \frac{1}{2} \|w\|_2^2 = \frac{1}{2} \sum_i w_i^2 $.
Effect: Encourages the model to keep weights small but not necessarily zero. This helps
smooth the model and reduces its complexity, making it less likely to fit noise in the training
data.
Mathematical Impact: The gradient update rule is modified so that weights are
multiplicatively shrunk towards zero at each step, in addition to the usual gradient descent
update.
Behavior: L2 regularization tends to spread out the weights, keeping all features but
reducing their magnitude. It does not induce sparsity (weights are rarely exactly zero).

L1 Regularization
Definition: Adds a penalty proportional to the sum of the absolute values of the weights:
$ \Omega(w) = \|w\|_1 = \sum_i |w_i| $.
Effect: Encourages sparsity in the model by driving some weights exactly to zero. This can
act as a form of automatic feature selection, as irrelevant features are effectively removed
from the model.
Mathematical Impact: The gradient update rule is modified by adding a constant (the sign
of each weight), which can push weights to zero if they are not strongly supported by the
data.
Behavior: L1 regularization leads to sparse solutions, where many weights are exactly zero,
simplifying the model and potentially improving interpretability.

Comparison Table
| Regularization | Penalty Term | Typical Effect | Use Case |
|---|---|---|---|
| L2 | $ \sum_i w_i^2 $ | Small, nonzero weights | General smoothing |
| L1 | $ \sum_i \lvert w_i \rvert $ | Many weights zero | Feature selection, sparsity |

Why Use L1 and L2 Regularization?


Prevent Overfitting: Both methods help control model complexity, reducing the risk of
fitting noise in the training data.
Improve Generalization: By penalizing large weights, the model is forced to find simpler
solutions that work better on new data.
Feature Selection (L1): L1 regularization can automatically select relevant features by
setting others to zero.
Smoothness (L2): L2 regularization encourages the model to use all features but with
smaller weights, leading to smoother predictions.

Summary
L2 regularization (weight decay) shrinks weights smoothly, reducing model complexity and
variance.
L1 regularization induces sparsity, setting many weights to zero and performing feature
selection.
Both are essential tools for building deep learning models that generalize well to new data.
If you'd like, we can work through a concrete example of how L1 and L2 regularization affect a
simple neural network's weights during training.
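For instance, here is a minimal PyTorch-style sketch (model, data, and penalty strengths are hypothetical) showing how L1 and L2 penalty terms can be added to the training loss:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # tiny model for illustration
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)    # dummy data
l1_lambda, l2_lambda = 1e-4, 1e-3                 # illustrative penalty strengths

for step in range(100):
    optimizer.zero_grad()
    data_loss = criterion(model(x), y)

    # L1 penalty: sum of absolute weights -> pushes some weights to exactly zero
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    # L2 penalty: sum of squared weights -> shrinks all weights smoothly
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())

    loss = data_loss + l1_lambda * l1_penalty + l2_lambda * l2_penalty
    loss.backward()
    optimizer.step()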

Comparing L1 & L2 Parameter Norm Penalties with Dropout Regularization


Let's compare L1 and L2 regularization (parameter norm penalties) with Dropout—three widely
used techniques to prevent overfitting in deep learning.

L1 & L2 Regularization (Parameter Norm Penalties)


L1 Regularization (Lasso): Adds a penalty equal to the sum of the absolute values of the
weights ($ \sum_i |w_i| $). This encourages sparsity—many weights become exactly zero,
which can lead to automatic feature selection and simpler models [57] [58] [59] .
L2 Regularization (Ridge/Weight Decay): Adds a penalty equal to the sum of the squares
of the weights ($ \sum_i w_i^2 $). This discourages large weights, spreading the influence
across all features, and helps with stability and generalization, especially when features are
correlated. [57] [60] [59]
How They Work: Both methods directly modify the loss function, penalizing large weights
during training. They do not change the network structure or introduce randomness during
training.

Dropout Regularization
Dropout: Randomly "drops out" (sets to zero) a fraction of neurons during each training
iteration. This means the network learns to be robust to missing information and cannot rely
on any single neuron or path. [61] [60] [58]
How It Works: Dropout is implemented as a layer that randomly disables a set percentage
of neurons in each forward pass. During inference, all neurons are used, but their outputs
are scaled to account for the training-time dropout rate.
Effect: Dropout acts like training an ensemble of many smaller networks and averaging their
predictions, which improves generalization and reduces overfitting. [58] [61]
Pros and Cons Table

| Technique | Pros | Cons |
|---|---|---|
| L1 Regularization | Produces sparse models (feature selection); useful for high-dimensional data | Can underfit if penalty is too strong; may ignore useful features |
| L2 Regularization | Spreads weights, improves stability; handles multicollinearity well | Does not produce sparse models; all features retained, even unimportant ones |
| Dropout | Prevents co-adaptation of neurons; acts as model ensemble, strong generalization; simple to implement in deep networks | Increases training time; adds randomness, can slow convergence; may not be as effective for small/simple models |

Summary
L1 and L2 directly penalize large weights, controlling model complexity and improving
generalization. L1 is best for feature selection; L2 is best for stability and when all features
are useful.
Dropout introduces randomness by dropping neurons, forcing the network to learn
redundant, robust representations. It's especially effective in deep, complex networks.
In practice, Dropout is often combined with L2 for best results. [62] [60] [58]
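A minimal Keras sketch of that combination follows (layer sizes, dropout rate, and the L2 factor are illustrative assumptions):

from tensorflow.keras import layers, models, regularizers

# Dense layers carry an L2 (weight decay) penalty; Dropout disables 50% of units per training step
model = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])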
Would you like to see a code example or a visualization of how these techniques affect a neural
network during training?

Data Augmentation Techniques in Computer Vision


Data augmentation is the process of artificially increasing the size and diversity of a training
dataset by applying various transformations to existing images. This helps deep learning models
generalize better, reduces overfitting, and improves robustness to real-world variations. [63] [64]
[65]

Common Data Augmentation Techniques


1. Flipping
Horizontal flip: Mirrors the image left-to-right. Useful for object detection and
recognition when orientation is not fixed.
Vertical flip: Mirrors the image top-to-bottom. Less common, but useful in specific
domains like medical imaging or astronomy. [64] [65]
Example: Flipping a cat image horizontally so the cat faces the opposite direction.
2. Rotation
Rotates the image by a random angle (e.g., 90°, 180°, or any degree).
Example: Rotating a handwritten digit image to simulate different writing angles.
3. Cropping
Randomly crops a region from the image and resizes it to the original size.
Example: Cropping a section of a street scene to focus on a pedestrian.
4. Scaling
Resizes the image, making objects appear larger or smaller.
Example: Zooming in on a flower in a garden image.
5. Translation
Shifts the image along the x or y axis.
Example: Moving a car image slightly to the left or right.
6. Brightness and Contrast Adjustment
Changes the brightness or contrast to simulate different lighting conditions.
Example: Making an outdoor photo appear as if taken at dusk or in bright sunlight. [66]
[67] [64]

7. Adding Noise
Injects random noise (e.g., salt-and-pepper, Gaussian) to make the model robust to
noisy inputs.
Example: Adding white and black dots to a scanned document image.
8. Color Jittering and Saturation
Randomly alters color properties like hue, saturation, and intensity.
Example: Changing the color tone of a fruit image to simulate different ripeness levels.
9. Perspective and Affine Transformations
Alters the viewpoint or geometry of the image.
Example: Tilting a building image to mimic different camera angles. [65]
10. Blurring and Sharpening
Applies filters to make images less or more detailed.
Example: Blurring a face image to simulate motion or sharpening a landscape photo.

Why Use Data Augmentation?


Improves generalization: Models learn to recognize objects under varied conditions.
Reduces overfitting: Prevents the model from memorizing training data.
Handles data scarcity: Generates more examples when collecting new data is difficult.
Example Workflow
Suppose you have a dataset of 1,000 cat images. By applying random flips, rotations, crops,
and color changes, you can generate thousands of new, diverse images for training, helping your
model recognize cats in many real-world scenarios.
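As one concrete way to do this in Keras (parameter values below are illustrative choices, not prescriptions), an ImageDataGenerator can apply several of the transformations above on the fly during training:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each argument corresponds to a technique above: flips, rotation, shifts, zoom, brightness
datagen = ImageDataGenerator(
    horizontal_flip=True,
    rotation_range=20,            # degrees
    width_shift_range=0.1,        # translation along x
    height_shift_range=0.1,       # translation along y
    zoom_range=0.2,               # scaling
    brightness_range=(0.7, 1.3),  # lighting variation
)

# Assuming X_train of shape (N, H, W, 3) and labels y_train are already loaded:
# model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=20)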
In summary: Data augmentation is essential in computer vision for building robust, accurate
models, especially when original data is limited or lacks diversity. [63] [64] [65]

Early Stopping: Concept, Implementation, and Overfitting Prevention


Early stopping is a regularization technique used in deep learning to prevent overfitting by
halting training when a model's performance on a validation set stops improving. Instead of
training for a fixed number of epochs, early stopping monitors a chosen metric (usually validation
loss or accuracy) and stops training when further improvement ceases. [71] [72] [73]

How Early Stopping Works


1. Monitor Validation Performance: During training, the model is evaluated on both the
training and validation sets after each epoch.
2. Track a Metric: The most common metric is validation loss, but accuracy or other metrics
can be used.
3. Set Patience: A 'patience' parameter defines how many epochs to wait for improvement
before stopping. For example, if patience is 5, training will stop if validation loss doesn't
improve for 5 consecutive epochs.
4. Restore Best Weights: After stopping, the model reverts to the weights from the epoch with
the best validation performance, ensuring optimal generalization. [72] [73]

Why Early Stopping Prevents Overfitting


Overfitting occurs when a model learns the training data too well, including noise, and
performs poorly on new data. Typically, training loss keeps decreasing, but validation loss
starts increasing after a certain point.
Early stopping halts training at the point where validation loss is lowest, before the model
starts to overfit. This ensures the model generalizes better to unseen data. [73] [71] [72]

Implementation Example (Pseudocode)

from tensorflow.keras.callbacks import EarlyStopping

# Define early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Fit model with early stopping
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100, callbacks=[early_stop])

monitor: Metric to track (e.g., 'val_loss').


patience: Number of epochs to wait for improvement.
restore_best_weights: Ensures the model uses the best weights found during training.

Key Benefits
Prevents overfitting by stopping at the optimal point.
Reduces training time and computational cost.
Simple to implement with most deep learning frameworks. [74] [71] [72]

Summary Table
Step Description

Monitor metric Track validation loss/accuracy

Set patience Wait for improvement before stopping

Stop training When no improvement for 'patience' epochs

Restore best model Use weights from epoch with best validation loss

In summary: Early stopping is a practical, effective way to prevent overfitting and improve
model generalization by halting training at the right moment based on validation performance.

Adaptive Optimization Algorithms: RMSProp & Adam


Let's explore RMSProp and Adam, two widely used adaptive optimization algorithms in deep
learning, and see how they work.

1. RMSProp (Root Mean Square Propagation)


Concept: RMSProp adapts the learning rate for each parameter by maintaining an
exponentially decaying average of past squared gradients. This helps the optimizer handle
non-stationary objectives and avoid oscillations, especially in ravine-like loss surfaces.
How it works:
1. For each parameter, compute the moving average of the squared gradients:
$ v_t = \beta v_{t-1} + (1 - \beta) (g_t)^2 $
where $ g_t $ is the gradient at time $ t $, and $ \beta $ is typically set close to 0.9.
2. Update the parameter using:
$ w_{t+1} = w_t - \frac{\alpha}{\sqrt{v_t + \epsilon}} g_t $
where $ \alpha $ is the learning rate and $ \epsilon $ is a small constant to prevent
division by zero.
Benefits:
Adapts learning rates for each parameter.
Handles noisy and sparse gradients well.
Prevents learning rate from decaying too quickly (unlike AdaGrad).

2. Adam (Adaptive Moment Estimation)


Concept: Adam combines the ideas of RMSProp and Momentum. It keeps track of both the
exponentially decaying average of past gradients (first moment) and squared gradients
(second moment), with bias correction for both.
How it works:
1. Compute the moving average of the gradients:
$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t $
2. Compute the moving average of the squared gradients:
$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (g_t)^2 $
3. Apply bias correction:
$ \hat{m}_t = \frac{m_t}{1 - \beta_1^t} $
$ \hat{v}_t = \frac{v_t}{1 - \beta_2^t} $
4. Update the parameter:
$ w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t $
where typical values are $ \beta_1 = 0.9 $, $ \beta_2 = 0.999 $, and $ \epsilon = 10^{-8} $.
Benefits:
Combines advantages of RMSProp and Momentum.
Adapts learning rates for each parameter and direction.
Robust to noisy gradients and works well for large datasets and deep networks.
Minimal hyperparameter tuning required.

Summary Table
| Optimizer | Key Idea | How It Adapts | Typical Use |
|---|---|---|---|
| RMSProp | Moving average of squared gradients | Per-parameter learning rate | RNNs, noisy data |
| Adam | Moving average of gradients & squared gradients (with bias correction) | Per-parameter learning rate + momentum | Most deep learning tasks |

In practice: Both RMSProp and Adam are preferred over vanilla SGD for training deep neural
networks, as they speed up convergence and handle complex, high-dimensional loss surfaces
more effectively.
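To make the update rules concrete, here is a small NumPy sketch of the Adam parameter update following the equations above (the gradient is a made-up placeholder; in a real network it comes from backpropagation):

import numpy as np

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

w = np.array([0.5, -0.3])   # parameters
m = np.zeros_like(w)        # first moment (mean of gradients)
v = np.zeros_like(w)        # second moment (mean of squared gradients)

for t in range(1, 101):
    g = np.array([0.1, -0.2])             # placeholder gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(w)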

Impact of Xavier and He Initialization on Deep Model Training
Weight initialization is a crucial step in deep learning, directly affecting how well and how
quickly a neural network trains. Poor initialization can lead to vanishing or exploding gradients,
making deep models hard to optimize. Two widely used strategies—Xavier (Glorot)
initialization and He (Kaiming) initialization—were developed to address these issues for
different activation functions.

Xavier Initialization
Purpose: Designed for layers with sigmoid or tanh activations.
Method: Weights are initialized so that the variance of activations remains constant across
layers, preventing gradients from vanishing or exploding.
Formula: For a layer with $ n_{in} $ inputs and $ n_{out} $ outputs, weights are drawn from:
Uniform: $ w \sim U\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right) $
Normal: $ w \sim N\left(0, \sqrt{\frac{2}{n_{in} + n_{out}}}\right) $
Effect: Maintains signal flow, allowing gradients to propagate effectively in deep networks
with tanh/sigmoid activations. [84] [85] [86] [87]

He Initialization
Purpose: Tailored for layers with ReLU or its variants.
Method: Weights are initialized with a higher variance to compensate for ReLU's tendency
to zero out half the inputs.
Formula: For a layer with $ n_{in} $ inputs:
Normal: $ w \sim N\left(0, \frac{2}{n_{in}}\right) $
Effect: Prevents "dying ReLU" and keeps gradients robust, enabling stable and fast training
in deep networks with ReLU activations. [88] [89] [86]

Why Initialization Matters


Prevents Vanishing/Exploding Gradients: Proper initialization keeps the scale of
activations and gradients consistent, so learning is stable across many layers.
Speeds Up Convergence: Models reach good solutions faster, reducing training time.
Enables Deep Architectures: Without these strategies, very deep networks would be nearly
impossible to train from scratch.
Summary Table
| Initialization | Best For | Formula (Variance) | Prevents |
|---|---|---|---|
| Xavier | Sigmoid/Tanh | $ \frac{1}{n_{in}} $ or $ \frac{1}{n_{in} + n_{out}} $ | Vanishing gradients |
| He | ReLU/Variants | $ \frac{2}{n_{in}} $ | Dying ReLU, vanishing gradients |

In practice: Match your initialization to your activation function—Xavier for tanh/sigmoid, He for
ReLU. This simple choice can dramatically improve training stability and final model accuracy.
[86] [87] [88] [84]

Would you like to see code examples for applying these initializations in PyTorch or TensorFlow?
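For reference, a minimal Keras sketch (layer sizes are illustrative) showing how Xavier and He initializers can be selected explicitly:

from tensorflow.keras import layers, models, initializers

model = models.Sequential([
    layers.Input(shape=(100,)),
    # Xavier (Glorot) initialization pairs well with tanh/sigmoid activations
    layers.Dense(64, activation='tanh', kernel_initializer=initializers.GlorotUniform()),
    # He initialization pairs well with ReLU activations
    layers.Dense(64, activation='relu', kernel_initializer=initializers.HeNormal()),
    layers.Dense(10, activation='softmax'),
])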

Short Notes

a) Semi-supervised Learning
Semi-supervised learning is a machine learning approach that combines a small amount of
labeled data with a large amount of unlabeled data during training. The main motivation is that
labeled data can be expensive or time-consuming to obtain, while unlabeled data is often
abundant and easy to collect. Semi-supervised learning leverages the structure and patterns in
the unlabeled data to improve learning accuracy, especially when labeled examples are scarce.
How it works: The model is first trained on the labeled data, then uses the unlabeled data to
refine its understanding of the data distribution. Techniques include self-training, co-
training, and graph-based methods.
Benefits: Improves generalization, reduces the need for large labeled datasets, and can
achieve better performance than purely supervised or unsupervised methods in many real-
world scenarios.
Example: In image classification, a model might use a few labeled images of cats and dogs,
along with thousands of unlabeled images, to learn to distinguish between the two more
effectively.

b) Multi-task Learning
Multi-task learning (MTL) is a machine learning paradigm where a single model is trained to
perform multiple related tasks simultaneously, rather than training separate models for each
task. [91] [92] [93] [94] [95] [96]
How it works: MTL models typically share some layers (feature extractors) across all tasks,
while having task-specific output layers ("heads"). The shared layers learn representations
useful for all tasks, while the heads specialize for each task's output.
Benefits:
Improved generalization: By learning from multiple tasks, the model captures more
robust and general features, reducing overfitting.
Data efficiency: Shared representations allow the model to make better use of limited
labeled data for each task.
Reduced model complexity: One model handles several tasks, saving computational
resources.
Enhanced stability: Knowledge transfer between tasks can help compensate for
challenges in individual tasks.
Example: In natural language processing, a single neural network might be trained to
perform both part-of-speech tagging and named entity recognition on the same text,
sharing most of the network's parameters but having separate output layers for each task.
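A minimal Keras sketch of this shared-trunk, two-head pattern (input size, layer sizes, and task names are illustrative assumptions; a real tagger would operate on token sequences):

from tensorflow.keras import layers, models

# Shared feature extractor (trunk), applied to a per-token feature vector
inputs = layers.Input(shape=(128,))
shared = layers.Dense(256, activation='relu')(inputs)
shared = layers.Dense(128, activation='relu')(shared)

# Task-specific heads
pos_head = layers.Dense(17, activation='softmax', name='pos_tags')(shared)  # POS tagging head
ner_head = layers.Dense(9, activation='softmax', name='ner_tags')(shared)   # NER head

model = models.Model(inputs=inputs, outputs=[pos_head, ner_head])
model.compile(optimizer='adam',
              loss={'pos_tags': 'sparse_categorical_crossentropy',
                    'ner_tags': 'sparse_categorical_crossentropy'})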
Summary:
Semi-supervised learning leverages both labeled and unlabeled data to improve learning
when labels are scarce.
Multi-task learning trains a single model on multiple related tasks, sharing knowledge to
boost generalization, efficiency, and stability.

Optimization vs. Learning in Deep Networks


Optimization and learning are closely related but distinct concepts in the context of deep
neural networks. Understanding their differences is crucial for designing and training effective
models.

Optimization
Definition: Optimization refers to the mathematical process of finding the best set of model
parameters (weights and biases) that minimize a loss function on the training data. [100] [101]
Goal: The primary goal is to reduce the training error by adjusting parameters using
algorithms like gradient descent, RMSProp, or Adam. [101]
Process: Optimization algorithms iteratively update parameters to find a minimum of the
loss function, often facing challenges like local minima, saddle points, and vanishing
gradients. [100]
Scope: Focuses on the training dataset and the objective function, without direct concern
for how well the model generalizes to new data. [100]

Learning
Definition: Learning in deep networks is the broader process of inferring a suitable model
from data, aiming to capture underlying patterns and generalize to unseen examples. [100]
Goal: The ultimate goal is to minimize generalization error, ensuring the model performs well
on new, unseen data—not just the training set. [100]
Process: Learning involves not only optimization but also regularization, model selection,
and validation to avoid overfitting and underfitting. [100]
Scope: Encompasses the entire modeling pipeline, including data preprocessing,
architecture design, regularization, and evaluation.

Key Differences
| Aspect | Optimization | Learning |
|---|---|---|
| Focus | Minimizing training loss | Minimizing generalization error |
| Scope | Parameter adjustment | Model inference and generalization |
| Algorithms | Gradient descent, Adam, RMSProp | Includes optimization, regularization, etc. |
| Challenges | Local minima, saddle points, vanishing gradients | Overfitting, underfitting, data bias |
| Outcome | Best parameters for training data | Best model for unseen data |

Summary
Optimization is a subset of learning, concerned with finding parameters that minimize the
loss on training data.
Learning is the overall process of building a model that generalizes well, combining
optimization with strategies to prevent overfitting and ensure robustness. [101] [100]
In deep learning, successful training requires both effective optimization and thoughtful learning
strategies to achieve high performance on real-world tasks. [101] [100]

Module 3 (Half): CNN Basics (CO3)


17.
Apply the convolution operation on a 5×5 input with a 3×3 filter and stride 1 (show steps). (10 marks)
Let's walk through the convolution operation step-by-step using a 5×5 input matrix, a 3×3
filter (kernel), and stride 1. This is a foundational operation in CNNs, as described in your
textbook (Module 3, Chapter 9).

Step 1: Define the Input and Filter


Suppose your input matrix and filter are the ones given in the textbook example:
Input: a 5×5 matrix
Filter: a 3×3 kernel

Step 2: Calculate Output Size


For stride 1 and no padding (valid convolution):
$ \text{output size} = \frac{n - f}{s} + 1 = \frac{5 - 3}{1} + 1 = 3 $
So, the output will be a 3×3 matrix.

Step 3: Perform the Convolution


For each position, place the filter over the input, multiply corresponding elements, and sum them.

Example: Top-left position

Overlay the filter on the top-left 3×3 region of the input, multiply the overlapping elements, and sum the nine products.

In this example, the sum works out to 7, so the top-left value of the output feature map is 7.

Repeat for all positions:

Move the filter one step to the right (stride 1) and repeat the multiply-and-sum to obtain the next output value; continue across the row, then move down one row and repeat until all nine positions are covered. [107]
Step 4: Output Matrix
After applying the filter to all nine valid positions, you obtain the full 3×3 output feature map.

Summary Table

| Step | Action |
|---|---|
| Define input | 5×5 matrix |
| Define filter | 3×3 kernel |
| Slide filter | Stride 1, no padding |
| Compute | Multiply, sum for each position |
| Output | 3×3 result matrix |

Would you like to try a different filter or see how padding changes the output size?
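Since the specific matrices from the worked example are not reproduced here, the NumPy sketch below uses a made-up 5×5 input and 3×3 filter to show the same stride-1 valid convolution procedure:

import numpy as np

# Made-up example matrices (not the ones from the textbook exercise)
X = np.array([[1, 2, 0, 1, 3],
              [0, 1, 2, 3, 1],
              [1, 0, 1, 2, 2],
              [2, 1, 0, 1, 0],
              [0, 2, 1, 0, 1]])
K = np.array([[1, 0, -1],
              [1, 0, -1],
              [1, 0, -1]])

n, f, stride = X.shape[0], K.shape[0], 1
out_size = (n - f) // stride + 1          # (5 - 3) / 1 + 1 = 3
out = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        region = X[i:i + f, j:j + f]      # 3x3 patch under the filter
        out[i, j] = np.sum(region * K)    # element-wise multiply, then sum

print(out)   # 3x3 feature map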

Convolution and Pooling in CNNs


Let's break down the concepts of convolution and pooling in Convolutional Neural Networks
(CNNs), using clear step-by-step explanations and diagram descriptions.

1. Convolution Operation
Convolution is the process where a small matrix (called a filter or kernel) slides over the input
image and computes a dot product at each position. This operation extracts features like edges,
textures, or patterns from the image. [108] [109] [110]
Diagram Description:
Imagine a 6×6 grayscale image as a grid of numbers.
Place a 3×3 filter (kernel) over the top-left corner of the image.
Multiply each filter value by the corresponding image pixel, sum the results, and write this
value in the output feature map at the corresponding location.
Slide the filter one pixel to the right (stride 1), repeat the process, and continue until the filter
has covered the entire image.
The output is a smaller grid (feature map) that highlights where the filter detected its
pattern.
Key Terms:
Kernel/Filter: The small matrix used for feature extraction.
Stride: How many pixels the filter moves at each step.
Padding: Adding extra pixels around the image to control output size.
Receptive Field: The region of the input image the filter covers at each step.

2. Pooling Operation
Pooling is used to reduce the spatial size of feature maps, making the network more efficient
and robust to small translations in the input. The most common type is max pooling. [109] [108]
Diagram Description:
Take a feature map (e.g., 4×4 grid).
Place a 2×2 window over the top-left corner.
Find the maximum value in that window and write it in the output.
Slide the window two pixels to the right (stride 2), repeat, and continue for the whole feature
map.
The output is a smaller grid (e.g., 2×2), where each value is the maximum from its window.
Types of Pooling:
Max Pooling: Takes the maximum value in each window.
Average Pooling: Takes the average value in each window.

Summary Table
| Operation | Purpose | How It Works (Diagram) |
|---|---|---|
| Convolution | Feature extraction | Filter slides over image, dot product |
| Pooling | Downsampling, robustness | Window slides, max/avg value chosen |

In practice:
Convolution layers learn filters to detect features.
Pooling layers reduce feature map size, making the network faster and less sensitive to
small changes in the input.
If you want, I can walk you through a specific example with numbers or help you visualize these
steps for your own data.
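For example, here is a short NumPy sketch of 2×2 max pooling with stride 2 on a made-up 4×4 feature map (values chosen only for illustration):

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 1],
                 [3, 4, 0, 9]])   # made-up 4x4 feature map

pool, stride = 2, 2
out = np.zeros((fmap.shape[0] // stride, fmap.shape[1] // stride))

for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        window = fmap[i * stride:i * stride + pool, j * stride:j * stride + pool]
        out[i, j] = window.max()          # max pooling: keep the largest value per window

print(out)   # [[6. 4.], [7. 9.]]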

Types of Pooling Layers in CNNs and Their Effects
Pooling layers are essential components in Convolutional Neural Networks (CNNs), used to
reduce the spatial dimensions of feature maps while retaining important information. This
process helps make models faster, more efficient, and more robust to variations in input images.
Let's explore the main types of pooling layers and their effects: [116] [117] [118]

1. Max Pooling
How it works: Divides the input feature map into non-overlapping regions (e.g., 2×2) and
selects the maximum value from each region as the output.
Effect: Retains the most prominent features, discards less important details, and provides
translation invariance (the output remains stable even if the input shifts slightly). [117] [118]
Use case: Most common pooling method in image recognition and object detection.

2. Average Pooling
How it works: Computes the average value of each region in the input feature map.
Effect: Produces smoother, more generalized feature maps by reducing the impact of
outliers and noise. [118] [119]
Use case: Useful when input features are noisy or when a more generalized representation
is needed.

3. Global Pooling
How it works: Applies max or average pooling over the entire spatial dimension of the
feature map, reducing each feature map to a single value. [120] [121] [117]
Effect: Drastically reduces dimensionality, often used before fully connected layers for
classification tasks.
Use case: Common in architectures like Global Average Pooling (GAP) before the output
layer.

4. Stochastic Pooling
How it works: Randomly selects a value from each pooling region based on a probability
distribution derived from the region's values. [122] [117]
Effect: Adds randomness, which can help regularize the model and improve generalization.
Use case: Less common, but can be useful for certain regularization needs.

5. Lp Pooling (including L2 Pooling)


How it works: Computes the Lp norm (e.g., L2 norm is the square root of the sum of
squares) for each region. [117] [118] [122]
Effect: Offers a flexible way to aggregate information, balancing between max and average
pooling depending on the value of p.
Use case: Used in specialized applications where different forms of regularization are
needed.

Effects of Pooling on Feature Maps


Dimensionality Reduction: Pooling decreases the spatial size of feature maps, reducing the
number of parameters and computations required. [116] [118]
Translation Invariance: Pooling makes the network less sensitive to small shifts or
distortions in the input image. [116] [117]
Prevents Overfitting: By reducing the amount of information, pooling acts as a form of
regularization. [118] [116]
Feature Hierarchy: Helps build hierarchical representations, where lower layers capture fine
details and higher layers capture more abstract features. [116]

Summary Table

| Pooling Type | Operation | Effect on Feature Map |
|---|---|---|
| Max Pooling | Max value per region | Highlights strongest features |
| Average Pooling | Mean value per region | Smooths, generalizes features |
| Global Pooling | Max/mean over all values | Reduces to single value per channel |
| Stochastic Pooling | Random value per region | Adds regularization, randomness |
| Lp Pooling | Lp norm per region | Flexible aggregation, regularization |

Pooling layers are chosen based on the task and desired properties. Max pooling is most
common, but other types can be more suitable for specific needs or architectures.

Structured Outputs in CNNs: Concept and Examples


Structured outputs in Convolutional Neural Networks (CNNs) refer to predictions that are more
complex than a single label or number. Instead of just classifying an image (e.g., 'cat' or 'dog'),
the network produces outputs with internal structure—such as sequences, grids, or sets of
labels—reflecting relationships or spatial organization in the data. [123] [124]

What Are Structured Outputs?


Definition: Structured outputs are organized, multi-dimensional predictions where the
output elements are interrelated. This contrasts with simple outputs like a single class label
or scalar value.
Why Important: Many real-world tasks require understanding not just what is present, but
where and how things are arranged in the input.
Examples in Computer Vision

1. Image Segmentation
Task: Assign a class label to every pixel in an image.
Output: A 2D grid (same size as the input image) where each cell contains a class label
(e.g., background, car, person).
Structure: The output is a structured map, not a single value. Neighboring pixels often have
related labels, capturing spatial structure.
Application: Medical imaging (tumor segmentation), autonomous driving (road/lane
detection).

2. Object Detection
Task: Identify and locate multiple objects in an image.
Output: A set of bounding boxes, each with a class label and coordinates (x, y, width,
height).
Structure: The output is a list of structured records, each describing an object and its
position.
Application: Surveillance, robotics, self-driving cars.

3. Other Examples
Pose Estimation: Predicts coordinates of keypoints (e.g., joints in a human body),
outputting a structured set of points.
Instance Segmentation: Combines detection and segmentation, outputting a mask for each
detected object.

Why Use Structured Outputs?


Captures Relationships: Structured outputs allow CNNs to model spatial, sequential, or
hierarchical relationships in data.
Enables Complex Tasks: Many advanced vision tasks (like segmentation and detection)
require outputs that reflect the structure of the input, not just its content.

Summary Table
| Task | Output Type | Example Output Structure |
|---|---|---|
| Classification | Single label | 'cat' |
| Segmentation | 2D label grid | [[0,0,1,...],[1,1,0,...],...] |
| Detection | List of boxes | [{x, y, w, h, label}, ...] |
| Pose Estimation | List of keypoints | [{x1, y1}, {x2, y2}, ...] |


In summary: Structured outputs in CNNs enable the network to solve tasks that require
detailed, organized predictions—such as labeling every pixel (segmentation) or finding and
describing multiple objects (detection)—by producing outputs that mirror the complexity and
structure of the input data. [124] [123]

1. BCS714A-module-1-textbook.pdf
2. https://en.wikipedia.org/wiki/AlexNet
3. https://viso.ai/deep-learning/alexnet/
4. https://www.digitalocean.com/community/tutorials/popular-deep-learning-architectures-alexnet-vgg-googlenet
5. http://d2l.ai/chapter_convolutional-modern/alexnet.html
6. https://www.geeksforgeeks.org/deep-learning/difference-between-alexnet-and-googlenet/
7. https://www.geeksforgeeks.org/machine-learning/ml-getting-started-with-alexnet/
8. https://www.kaggle.com/code/blurredmachine/alexnet-architecture-a-complete-guide
9. https://massedcompute.com/faq-answers/?question=What+are+the+key+differences+between+LeNet+and+AlexNet+in+terms+of+architecture+and+applications%3F
10. https://www.geeksforgeeks.org/machine-learning/convolutional-neural-network-cnn-architectures/
11. https://pabloinsente.github.io/the-convolutional-network
12. https://www.iieta.org/download/file/fid/182194
13. https://www.cse.iitm.ac.in/~miteshk/CS7015/Slides/Teaching/pdf/Lecture11.pdf
14. https://www.youtube.com/watch?v=QJVKIHyQzWU
15. https://www.kaggle.com/code/samuelcortinhas/a-piece-of-history-lenet-5-alexnet-from-scratch
16. https://www.geeksforgeeks.org/nlp/one-hot-encoding-in-nlp/
17. https://www.educative.io/answers/one-hot-encoding-of-text-data-in-natural-language-processing
18. https://eavelardev.github.io/gcp_courses/nlp_on_gcp/text_representation/one_hot_encoding_and_bag_of_words.html
19. https://www.geeksforgeeks.org/machine-learning/ml-one-hot-encoding/
20. https://www.cloudskillsboost.google/course_templates/40/video/534085
21. https://www.youtube.com/watch?v=2d8iP2_cS-U
22. https://www.youtube.com/watch?v=4l_ybHoKK_4
23. https://ntanmayee.github.io/articles/2017/09/15/distributed-vs-distributional.html
24. https://med.libretexts.org/Bookshelves/Pharmacology_and_Neuroscience/Computational_Cognitive_Neu
roscience_3e_(O'Reilly_and_Munakata)/03:_Networks/3.03:_Categorization_and_Distributed_Representa
tions
25. https://www.cs.toronto.edu/~lczhang/321/notes/notes07.pdf
26. https://pmc.ncbi.nlm.nih.gov/articles/PMC3576056/
27. https://www.biorxiv.org/content/10.1101/2023.02.01.526470v2.full-text
28. https://www.sciencedirect.com/science/article/abs/pii/S0925231218307902
29. https://www.tandfonline.com/doi/full/10.1080/23273798.2016.1267782
30. https://deepai.org/machine-learning-glossary-and-terms/distributed-representation
31. https://www.oreilly.com/content/how-neural-networks-learn-distributed-representations/
32. https://rinuboney.github.io/2015/10/18/theoretical-motivations-deep-learning.html
33. https://stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf
34. https://arxiv.org/abs/2312.17285
35. https://www.sciencedirect.com/science/article/abs/pii/S0950705124002703
36. https://www.dbs.ifi.lmu.de/Lehre/DLAI/WS18-19/script/07_representation.pdf
37. https://www.sciencedirect.com/science/article/abs/pii/S0020025522000585
38. http://www.cs.toronto.edu/~bonner/courses/2014s/csc321/lectures/lec5.pdf
39. https://www.image-net.org/about.php
40. https://www.historyofdatascience.com/imagenet-a-pioneering-vision-for-computers/
41. https://deepai.org/machine-learning-glossary-and-terms/imagenet
42. BCS714A-module-1-textbook.pdf
43. https://viso.ai/deep-learning/imagenet/
44. https://www.pinecone.io/learn/series/image-search/imagenet/
45. https://en.wikipedia.org/wiki/ImageNet
46. https://www.image-net.org
47. https://www.kaggle.com/getting-started/149448
48. https://journals.sagepub.com/doi/full/10.1177/20539517211035955
49. https://viso.ai/deep-learning/representation-learning/
50. https://en.wikipedia.org/wiki/Deep_learning
51. https://www.ibm.com/think/topics/deep-learning
52. https://deepgram.com/ai-glossary/representation-learning
53. https://www.geeksforgeeks.org/deep-learning/introduction-deep-learning/
54. https://www.sciencedirect.com/science/article/pii/S1532046420302653
55. https://onlinecourses.nptel.ac.in/noc25_cs22/preview
56. BCS714A-module-2-textbook.pdf
57. https://www.e2enetworks.com/blog/regularization-in-deep-learning-l1-l2-dropout
58. https://www.linkedin.com/pulse/understanding-regularization-techniques-l1-l2-dropout-joshua-cox-aig
uc
59. https://towardsdatascience.com/l1-vs-l2-regularization-in-machine-learning-differences-advantages-a
nd-how-to-apply-them-in-72eb12f102b5/
60. https://www.skillcamper.com/blog/the-role-of-regularization-in-deep-learning-models
61. https://www.geeksforgeeks.org/deep-learning/dropout-regularization-in-deep-learning/
62. https://massedcompute.com/faq-answers/?question=What+are+the+differences+between+dropout+an
d+L1+and+L2+regularization%3F
63. https://encord.com/blog/data-augmentation-guide/
64. https://research.aimultiple.com/data-augmentation-techniques/
65. https://www.ultralytics.com/blog/the-ultimate-guide-to-data-augmentation-in-2025
66. https://blog.roboflow.com/data-augmentation/
67. https://www.ccslearningacademy.com/what-is-data-augmentation/
68. https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/
69. https://aws.amazon.com/what-is/data-augmentation/
70. https://www.sciencedirect.com/science/article/pii/S2590005622000911
71. https://www.geeksforgeeks.org/deep-learning/using-early-stopping-to-reduce-overfitting-in-neural-n
etworks/
72. https://www.geeksforgeeks.org/machine-learning/regularization-by-early-stopping/
73. https://milvus.io/ai-quick-reference/what-is-early-stopping
74. https://www.linkedin.com/pulse/real-world-ml-early-stopping-deep-learning-guide-olamendy-turruella
s-pip9c
75. https://github.com/phuongpho/early-stopping
76. https://www.machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-tim
e-using-early-stopping/
77. https://studyglance.in/dl/display.php?tno=12&topic=Early-stopping
78. https://codesignal.com/learn/courses/modeling-the-iris-dataset-with-tensorflow/lessons/implementing-
early-stopping-in-tensorflow-to-prevent-overfitting
79. https://towardsdatascience.com/understanding-deep-learning-optimizers-momentum-adagrad-rmspro
p-adam-e311e377e9c2/
80. https://community.deeplearning.ai/t/difference-between-rmsprop-and-adam/310187
81. https://joiv.org/index.php/joiv/article/view/1818
82. https://www.digitalocean.com/community/tutorials/intro-to-optimization-momentum-rmsprop-adam
83. https://www.kaggle.com/code/harpdeci/intuitive-explanation-of-sgd-adam-and-rmsprop
84. https://cs230.stanford.edu/section/4/
85. https://www.geeksforgeeks.org/deep-learning/xavier-initialization/
86. https://businessanalytics.substack.com/p/weight-initialization-in-neural-networks
87. https://365datascience.com/tutorials/machine-learning-tutorials/what-is-xavier-initialization/
88. https://stackoverflow.com/questions/48641192/xavier-and-he-normal-initialization-difference
89. https://www.deeplearning.ai/ai-notes/initialization/
90. https://en.wikipedia.org/wiki/Weight_initialization
91. https://www.geeksforgeeks.org/deep-learning/introduction-to-multi-task-learningmtl-for-deep-learnin
g/
92. https://www.infosysbpm.com/glossary/multi-task-learning.html
93. https://www.v7labs.com/blog/multi-task-learning-guide
94. https://studyglance.in/dl/display.php?tno=11&topic=Multi-Task-Learning
95. https://milvus.io/ai-quick-reference/how-does-multitask-learning-work-in-deep-learning
96. https://codefinity.com/blog/What-is-Multi-task-Learning
97. https://www.jmlr.org/papers/volume17/15-242/15-242.pdf
98. https://www.sciencedirect.com/science/article/abs/pii/S0010482522012045
99. https://arxiv.org/abs/2404.18961
100. http://d2l.ai/chapter_optimization/optimization-intro.html
101. https://www.geeksforgeeks.org/deep-learning/optimization-rule-in-deep-neural-networks/
102. https://aws.amazon.com/compare/the-difference-between-deep-learning-and-neural-networks/
103. https://arxiv.org/abs/2007.14166
104. https://www.worldscientific.com/doi/10.1142/S0218001420520138
105. https://www.ibm.com/think/topics/ai-vs-machine-learning-vs-deep-learning-vs-neural-networks
106. https://www.reddit.com/r/deeplearning/comments/1dgkut0/why_are_neural_networks_optimized_instea
d_of_just/
107. BCS714A-module-3-textbook.pdf
108. https://learnopencv.com/understanding-convolutional-neural-networks-cnn/
109. https://viso.ai/deep-learning/convolution-operations/
110. https://en.wikipedia.org/wiki/Convolutional_neural_network
111. https://www.geeksforgeeks.org/machine-learning/introduction-convolution-neural-network/
112. https://poloclub.github.io/cnn-explainer/
113. https://towardsdatascience.com/convolutional-neural-network-cnn-architecture-explained-in-plain-eng
lish-using-simple-diagrams-e5de17eacc8f/
114. https://www.sciencedirect.com/topics/computer-science/convolution-operation
115. https://developer.nvidia.com/discover/convolutional-neural-network
116. https://www.geeksforgeeks.org/deep-learning/cnn-introduction-to-pooling-layer/
117. https://www.deepchecks.com/glossary/pooling-layers-in-cnn/
118. https://www.linkedin.com/pulse/pooling-cnn-types-its-use-priyanka-yadav-5innc
119. https://www.nature.com/articles/s41598-024-51258-6
120. https://www.baeldung.com/cs/neural-networks-pooling-layers
121. https://www.digitalocean.com/community/tutorials/pooling-in-convolutional-neural-networks
122. https://en.wikipedia.org/wiki/Pooling_layer
123. https://www.janbasktraining.com/tutorials/deep-learning-structured-outputs/
124. https://www.scribd.com/document/851025794/4-Structured-outputs-Data-types
125. https://cookbook.openai.com/examples/structured_outputs_intro
126. https://www.geeksforgeeks.org/deep-learning/convolutional-neural-network-cnn-in-machine-learning/
127. https://huggingface.co/docs/inference-providers/en/guides/structured-output
128. https://python.langchain.com/docs/concepts/structured_outputs/
129. https://www.upgrad.com/blog/basic-cnn-architecture/
130. https://towardsdatascience.com/structured-outputs-and-how-to-use-them-40bd86881d39/
131. https://community.openai.com/t/structured-outputs-deep-dive/930169
