Module 2 Notes

Evolution of CNN Models

Over time, researchers have built many CNN (Convolutional Neural Network) models. Each new
model tried to fix the limitations of older ones and improve performance on tasks like image
classification. Let’s go step by step.

1. LeNet-5 (1998)

 Developed by: Yann LeCun.
 Task: Recognize handwritten digits (like postal codes, bank checks).
 Structure:
o Input → Convolution layers → Subsampling (pooling) → Fully connected layers
→ Output.
o Used sigmoid/tanh activations (ReLU was not yet popular).
o Very small compared to today’s models (~60,000 parameters).

Figure (LeNet architecture):

 A small diagram showing an image going through alternating convolution + pooling layers, then into fully connected layers, and finally giving classification output.

👉 Key point: LeNet was the first successful CNN, but computers at that time were too slow, so
CNNs didn’t immediately become popular.
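
To make the structure concrete, here is a minimal Keras sketch of a LeNet-style network (a sketch, not the exact 1998 model: the layer sizes follow the paper, and the 32×32 grayscale input is the classic setup):

# A minimal LeNet-style model in Keras (average pooling and tanh match the
# original design; roughly 60k parameters, as noted above).
from tensorflow.keras import layers, models

lenet = models.Sequential([
    layers.Input(shape=(32, 32, 1)),           # 32x32 grayscale digit image
    layers.Conv2D(6, 5, activation="tanh"),    # C1: 6 feature maps, 5x5 filters
    layers.AveragePooling2D(2),                # S2: subsampling (pooling)
    layers.Conv2D(16, 5, activation="tanh"),   # C3: 16 feature maps
    layers.AveragePooling2D(2),                # S4
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),      # F5
    layers.Dense(84, activation="tanh"),       # F6
    layers.Dense(10, activation="softmax"),    # 10 digit classes
])
lenet.summary()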

2. AlexNet (2012)

 Developed by: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton.
 Task: ImageNet Challenge (classify 1.2 million images into 1000 categories).
 Breakthrough: Reduced error rate drastically (from ~26% to ~15%). This shocked the
AI community.
 Improvements over LeNet:
o Used ReLU activation (faster training than sigmoid/tanh).
o Dropout to reduce overfitting.
o Data augmentation (flipping, cropping images).
o Trained on GPUs for speed.
 Structure:
o 5 convolution layers + 3 fully connected layers.
o Used overlapping max-pooling.

Figure (AlexNet architecture):

 Shows a bigger CNN compared to LeNet with multiple conv + pooling layers, followed
by dense layers, and finally a 1000-class softmax.

👉 Key point: AlexNet kick-started the modern “deep learning boom.”
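
The improvements listed above map directly onto standard layer calls. A hedged fragment showing only the first stage plus the regularization tricks, not the full 8-layer network (the 96 filters, 11x11 size, and stride 4 follow the paper; the middle layers are abbreviated):

# AlexNet-era tricks in Keras: ReLU activations, overlapping max-pooling
# (3x3 window with stride 2), and dropout before the classifier.
from tensorflow.keras import layers, models

alexnet_sketch = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),  # ReLU, not sigmoid/tanh
    layers.MaxPooling2D(pool_size=3, strides=2),          # overlapping pooling
    # ...four more conv layers in the real network...
    layers.Flatten(),
    layers.Dropout(0.5),                                  # reduces overfitting
    layers.Dense(1000, activation="softmax"),             # 1000 ImageNet classes
])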

3. ZFNet (2013)

 Developed by: Matthew Zeiler & Rob Fergus.
 An improvement of AlexNet.
 Used deconvolution visualization to understand what CNN layers were learning.
 Made small adjustments like:
o Smaller filter size in the first layer (7x7 instead of 11x11).
o Smaller stride (2 instead of 4).
 Result: Better accuracy than AlexNet on ImageNet.

Figure (ZFNet visualization):

 Shows feature maps at different layers, helping understand what the network is detecting
(edges, textures, objects).

👉 Key point: First attempt to “open the black box” of CNNs.

4. VGGNet (2014)

 Developed by: Oxford Visual Geometry Group.
 Contribution: Showed that using very small filters (3x3), stacked multiple times, works well.
 Tested 16-layer (VGG16) and 19-layer (VGG19) versions.
 Simpler architecture:
o Just 3x3 convolutions and 2x2 pooling, repeated many times.
 Downside: Extremely large (138 million parameters). Requires lots of memory and
computation.
Figure (VGG architecture):

 Shows a deep stack of 3x3 conv layers followed by fully connected layers.

👉 Key point: Popular because of simplicity and uniform design, still used today as a baseline.
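
The uniform design is easy to express in code. A sketch of one VGG-style stage (the filter width and the number of repeated 3x3 convs per stage are what differ across VGG16/VGG19):

# One VGG-style stage: a stack of 3x3 same-padding convolutions, then 2x2
# max-pooling. The full networks repeat such stages with growing filter counts.
from tensorflow.keras import layers, models

def vgg_stage(filters, num_convs):
    stage = models.Sequential()
    for _ in range(num_convs):
        stage.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
    stage.add(layers.MaxPooling2D(2))
    return stage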

5. GoogLeNet / Inception (2014)

 Developed by: Google.
 New idea: The Inception module.
o Instead of picking a single filter size (1x1, 3x3, or 5x5), it uses all of them in parallel and concatenates the outputs.
o This lets the network learn both fine and coarse features at the same time.
 GoogLeNet (Inception v1): 22 layers deep.
 Used 1x1 convolutions for dimensionality reduction, reducing computation.

Figure (Inception module):

 Shows parallel paths with 1x1, 3x3, and 5x5 conv filters + pooling, then combining
outputs.

👉 Key point: Very efficient, achieved high accuracy with fewer parameters than VGG.
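
A minimal sketch of an Inception-style module with the Keras functional API (branch widths are illustrative; the real module also puts 1x1 convs in front of the 3x3 and 5x5 branches to cut computation, as noted above):

# Inception-style module: parallel 1x1, 3x3, 5x5 convolutions plus pooling,
# concatenated along the channel axis so both fine and coarse features survive.
from tensorflow.keras import layers

def inception_module(x, f1, f3, f5, fpool):
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(fpool, 1, padding="same", activation="relu")(bp)
    # All branches keep the same spatial size, so channels can be concatenated.
    return layers.Concatenate()([b1, b3, b5, bp])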

6. ResNet (2015)

 Developed by: Microsoft Research.
 Big breakthrough: Introduced Residual Connections (skip connections).
 Problem solved: As networks got deeper, training became harder (vanishing gradient
problem).
 Residual block idea: Instead of learning a full mapping, the network learns the
“difference” (residual).

y = F(x) + x

 Allowed training of very deep networks (50, 101, even 152 layers).
 Won the ImageNet 2015 challenge by a large margin.

Figure (ResNet block):

 A diagram showing input going through conv layers, then being added back to the original input (skip connection).

👉 Key point: ResNet changed deep learning forever — now almost all modern CNNs use skip
connections.
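
The residual formula y = F(x) + x takes only a few lines in the Keras functional API. A sketch of a basic block, assuming the identity-shortcut case where the input already has the right number of channels:

# Basic residual block: two 3x3 convolutions form the residual F(x),
# then the unchanged input x is added back (the skip connection).
from tensorflow.keras import layers

def residual_block(x, filters):
    fx = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    fx = layers.Conv2D(filters, 3, padding="same")(fx)   # no activation yet
    y = layers.Add()([fx, x])                            # y = F(x) + x
    return layers.Activation("relu")(y)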

7. Xception (2017)

 Developed by: Google.
 Based on: Inception, but replaced standard convolutions with Depthwise Separable Convolutions.
 This reduces computation while keeping accuracy high.
 Depthwise separable convolution:
o First apply a depthwise conv (one filter per channel).
o Then a pointwise conv (1x1) to combine them.
 More efficient than Inception modules.

Figure (Xception module):

 Shows the depthwise + pointwise conv sequence compared to a normal convolution.

👉 Key point: Efficient and accurate, often used in mobile/edge devices.
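
The two-step factorization maps onto two Keras layers (Keras also ships a fused layers.SeparableConv2D). A sketch with an illustrative input shape:

# Depthwise separable convolution written as its two steps:
# a depthwise conv (one 3x3 filter per channel), then a pointwise 1x1 conv.
from tensorflow.keras import layers, models

separable = models.Sequential([
    layers.Input(shape=(64, 64, 32)),           # example feature map
    layers.DepthwiseConv2D(3, padding="same"),  # step 1: per-channel filtering
    layers.Conv2D(64, 1, activation="relu"),    # step 2: combine across channels
])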

What is Convolution?

 Convolution is a mathematical operation where we combine two functions (or two sets
of data) to produce a third one.
 In image processing, one function is the image (input), and the other is the filter/kernel
(a small matrix).
 The result of convolution is a feature map that highlights important patterns from the
image.

1. General 1D Convolution

(f∗g)(n) = Σ_m f(m) g(n−m)

🔹 Meaning:

 You have two functions (or sequences) f and g.
 To compute their convolution at position n:
o Flip one function (g),
o Shift it by n,
o Multiply element-by-element with f,
o Sum the results.

👉 This is the basic definition of convolution in math.
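
A small numerical check of this definition in Python, comparing the manual flip-shift-multiply-sum against numpy's built-in np.convolve (input values chosen arbitrarily):

# 1D convolution by the definition: for each n, sum f(m) * g(n - m).
import numpy as np

f = np.array([1, 2, 3])
g = np.array([0, 1, 0.5])

manual = [float(sum(f[m] * g[n - m]
                    for m in range(len(f)) if 0 <= n - m < len(g)))
          for n in range(len(f) + len(g) - 1)]

print(manual)             # [0.0, 1.0, 2.5, 4.0, 1.5]
print(np.convolve(f, g))  # [0.  1.  2.5 4.  1.5]  -- same values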

2. Alternative Form (Commutativity)

(f∗g)(n) = Σ_m f(n−m) g(m)

🔹 Explanation:

 Convolution is commutative → meaning f∗g = g∗f.
 That’s why you can swap f and g in the formula.
 The result is the same whether you slide g over f or f over g.

3. 2D Convolution (used in images)

F(i,j) = (A∗K)(i,j) = Σ_m Σ_n A(m,n) K(i−m, j−n)

🔹 Meaning:

 A = the input image (2D grid).
 K = the kernel/filter (small 2D matrix).
 F = the output feature map.
 At each location (i,j):
o Take the overlapping region of the image and the kernel.
o Multiply element by element.
o Add them all → this gives one number in the feature map at (i,j).

👉 This is the same sliding process used in the worked example below.

4. Commutative Property in 2D
F(i,j) = (K∗A)(i,j) = Σ_m Σ_n A(i−m, j−n) K(m,n)

🔹 Explanation:

Same as above, but here we swapped the positions of A and K.

 Shows again that convolution is commutative.

5. Cross-Correlation (when kernel is not flipped)

F(i,j) = Σ_m Σ_n A(i+m, j+n) K(m,n)

🔹 Explanation:

 In true convolution, the kernel is flipped before sliding.
 In cross-correlation, we don’t flip the kernel.
 Most deep learning libraries (TensorFlow, PyTorch, Keras) actually implement cross-correlation but still call it convolution; since the filter weights are learned, the missing flip makes no practical difference and is easier to implement.

👉 In practice, when you hear "convolution layer" in CNNs, it’s usually cross-correlation.
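
You can see the flip/no-flip distinction directly with scipy: correlate2d slides the kernel as-is, while convolve2d flips it first, so the two agree once the kernel is flipped manually (values chosen arbitrarily):

# True convolution = cross-correlation with a pre-flipped kernel.
import numpy as np
from scipy.signal import convolve2d, correlate2d

A = np.arange(16).reshape(4, 4)
K = np.array([[1, 2],
              [3, 4]])

conv = convolve2d(A, K, mode="valid")            # flips K, then slides
corr = correlate2d(A, np.flip(K), mode="valid")  # flip manually, then slide
print(np.array_equal(conv, corr))                # True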

🔹 How it works on Images

1. Input (Image)
o Think of the image as a big grid of numbers (pixel values).
2. Filter (Kernel)
o A small grid (like 3×3 or 5×5) with numbers in it.
o Each filter is designed to detect a specific pattern, such as edges, corners, or
textures.
3. Convolution Operation
o Place the filter on top of the input image.
o Multiply each filter value with the overlapping image pixel values (element-wise
multiplication).
o Add all the results together to get one number.
o This one number goes into the feature map at the corresponding position.
o Then, slide the filter across the whole image (left to right, top to bottom) and repeat (see the code sketch after this list).

 Flipping:
In strict math convolution, the filter is flipped before sliding. But in practice (CNNs),
most libraries skip the flip — this is called cross-correlation.
 Feature Map:
The output after sliding the filter across the image. It shows where the filter detected its
pattern strongly.

 Commutative Property:
Convolution can be done as either image ∗ filter or filter ∗ image, giving the same result.
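
A direct, unoptimized Python implementation of the sliding procedure from steps 1-3 above (the no-flip, cross-correlation version that CNN libraries use):

# Sliding-window "convolution" as used in CNNs: place the kernel, multiply
# element-wise, sum to one number, write it into the feature map, slide on.
import numpy as np

def slide(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]       # overlapping region
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map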

🔹 Example

Imagine the input image as a big piece of paper with numbers.

 The filter is like a small stamp with numbers on it.
 You place the stamp on the paper, multiply overlapping numbers, add them, and write the result on a new sheet (feature map).
 Then you slide the stamp around and repeat — eventually, you get a new picture (feature
map) that highlights patterns.
Example:

We take a 5×5 input image and a 3×3 filter.

🔹 Input Image (A)

1 2 3 0 1
0 1 2 3 1
1 0 1 2 2
2 1 0 1 0
0 1 2 1 1

🔹 Filter / Kernel (K)

1 0 1
0 1 0
1 0 1
Step 1: Place the filter at the top-left of the input

Take the first 3×3 block of the input:

1 2 3
0 1 2
1 0 1

Step 2: Multiply element-wise with filter


(1*1) + (2*0) + (3*1) +
(0*0) + (1*1) + (2*0) +
(1*1) + (0*0) + (1*1)

=1+0+3+0+1+0+1+0+1
=7

Step 3: Write result in the feature map

The top-left cell of the feature map becomes 7.

Step 4: Slide the filter

Now slide the filter one step to the right and repeat.
Do this for the whole input.

Since the input is 5×5 and filter is 3×3, the output (feature map) will be 3×3.

Final Feature Map (F)

After sliding over the whole input:

7 6 10
4 7 5
5 4 7
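
You can reproduce this map with scipy (correlate2d, since no flip is involved; this kernel is symmetric, so true convolution would give the same answer):

# Check the worked example: 5x5 input, 3x3 kernel, valid sliding -> 3x3 map.
import numpy as np
from scipy.signal import correlate2d

A = np.array([[1, 2, 3, 0, 1],
              [0, 1, 2, 3, 1],
              [1, 0, 1, 2, 2],
              [2, 1, 0, 1, 0],
              [0, 1, 2, 1, 1]])
K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])

print(correlate2d(A, K, mode="valid"))
# [[ 7  6 10]
#  [ 4  7  5]
#  [ 5  4  7]]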
