RNS INSTITUTE OF TECHNOLOGY
Affiliated to VTU, Recognized by GOK, Approved by AICTE, New Delhi
(NAAC ‘A+ Grade’ Accredited, NBA Accredited (UG - CSE, ECE, ISE, EIE and EEE))
Channasandra, Dr. Vishnuvardhan Road, Bengaluru - 560 098
Ph:(080)28611880,28611881 URL: www.rnsit.ac.in
MODULE-2:
Basics of Supervised Deep Learning
Pooja R Rao
Assistant Professor
Department of CSE-DS
RNSIT
Introduction
•Supervised and unsupervised deep learning models are rapidly growing due to
their success in solving complex problems.
•Growth is supported by:
•High-performance computing resources.
•Availability of large amounts of labeled and unlabeled data.
•Access to advanced open-source libraries.
•This makes deep learning increasingly feasible for many applications.
Convolutional Neural Network (ConvNet/CNN)
•Deep learning models (supervised and unsupervised) are rapidly advancing
due to their effectiveness in solving complex problems.
•The growth is driven by powerful computing resources, large datasets, and
advanced open-source libraries.
•These factors make deep learning feasible for a wide range of applications.
•Convolutional Neural Networks (ConvNets or CNNs) are deep learning
models with multiple layers.
•CNNs are inspired by the human visual cortex.
•They are highly effective in tasks like image classification, object detection, speech
recognition, natural language processing, and medical image analysis.
•CNNs play a crucial role in computer vision applications such as self-driving cars,
robotics, and helping the visually impaired.
•They work by extracting simple local features in the early layers (closest to the input)
and combining them into more complex features at deeper layers.
•CNNs are computationally intensive due to their deep architecture.
•Training CNNs on large datasets often requires several days and the use of GPUs.
•CNNs outperform most traditional techniques in visual recognition tasks.
Evolution of Convolutional Neural Network Models
LeNet
The first practical convolutional neural network (CNN), designed to
classify handwritten digits (MNIST).
Used backpropagation for training and was adopted for reading
handwritten checks.
Did not scale well to larger problems due to:
o Small labeled datasets
o Slow computers
o Use of unsuitable activation functions (like sigmoid/tanh) leading to
vanishing gradients, which make training deep networks difficult.
AlexNet
• Achieved the first major breakthrough in 2012 by winning the ImageNet Large-
Scale Visual Recognition Challenge (ILSVRC).
• Reduced the top-5 classification error rate from 26% to 15%.
• Improvements over LeNet included:
• Large labeled dataset (ImageNet: ~15 million images in 22,000+ categories).
• Training on high-speed GPUs (GTX 580) for several days.
• Use of ReLU activation function (f(x) = max(x, 0)), which is faster to compute and
mitigates the vanishing gradient problem.
• Architecture: 5 convolutional layers, 3 pooling layers, 3 fully connected layers, and
a 1000-way softmax classifier.
ZFNet (2013):
o Improved on the AlexNet architecture by reducing the first-layer
filter size from 11×11 to 7×7 and the stride from 4 to 2.
o This led to better feature extraction and fewer dead features.
o ZFNet won the ILSVRC 2013 competition.
VGGNet (2014):
o The depth of the network was increased to 19 layers by adding more
convolutional layers with 3 × 3 filters (stride 1 and padding 1),
interleaved with 2 × 2 max-pooling layers.
o The deeper, simpler architecture improved accuracy significantly.
o VGGNet achieved 7.32% error rate and was the runner-up in
ILSVRC 2014.
GoogLeNet (2015):
• Google developed a ConvNet model called GoogLeNet in 2015. It
uses an inception module, which helps reduce the number of
parameters in the network.
• The inception module applies convolutions (3 × 3 and 5 × 5) and
pooling sub-layers at different scales, and concatenates their output
filter banks into a single output vector that forms the input to the
succeeding stage.
• These sub-layers are not stacked sequentially but are connected in
parallel, as shown in Fig. 2.1.
Increasing network layers can improve accuracy by learning more
features, but has limits:
1. Vanishing gradients: In very deep networks the gradient can shrink
toward zero as it propagates back through the layers, so the early
layers learn very slowly.
2. Optimization difficulty: Too many parameters make training harder.
To address this, network depth should be increased carefully.
GoogLeNet won ILSVRC 2014 with a 6.7% error rate; the paper describing it was published in 2015.
Later versions include Inception V3 (2016) and Inception-ResNet
(2017).
ResNet:
• Microsoft Research Asia proposed ResNet in 2015, a CNN architecture that is
152 layers deep. ResNet introduced residual connections, in which the output
of a conv-relu-conv series is added to the original input and then passed
through a Rectified Linear Unit (ReLU), as shown in Fig. 2.2.
• In this way, information is carried from the previous layer to the next
layer, and during backpropagation the gradient flows easily because the
addition operations distribute the gradient. ResNet showed that a
complex architecture like Inception is not required to achieve the best results.
• ResNet performed exceptionally well, winning ILSVRC 2015 with a
3.6% error rate.
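The residual connection described above can be sketched in a few lines. This is a minimal illustration, not the ResNet implementation: the two convolutions are replaced by small matrix multiplications as stand-ins, and the names `matvec`, `residual_block`, `w1` and `w2` are hypothetical.

```python
def relu(values):
    """Element-wise ReLU: max(x, 0)."""
    return [max(v, 0.0) for v in values]


def matvec(weights, values):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """Compute ReLU(F(x) + x), where F is a transform-ReLU-transform series.

    Assumption: matrix multiplications stand in for the two convolutions.
    """
    hidden = relu(matvec(w1, x))           # first transform + ReLU
    transformed = matvec(w2, hidden)       # second transform, F(x)
    summed = [f + xi for f, xi in zip(transformed, x)]  # skip connection
    return relu(summed)


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights F(x) = 0, so the block simply passes ReLU(x) through.
passthrough = residual_block(x, zeros, zeros)
doubled = residual_block(x, identity, identity)
print(passthrough, doubled)
```

Because the input is added back unchanged, the block can fall back to an identity-like mapping when the learned transform contributes nothing, which is one intuition for why very deep stacks of such blocks remain trainable.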
Inception-ResNet (2017):
Combined the Inception module with residual connections to
form a hybrid model.
This design significantly increased training speed.
It slightly outperformed ResNet in terms of accuracy.
Xception:
Xception is a convolutional neural network architecture based on depthwise separable
convolution layers. The architecture is inspired by the Inception model, hence the name
Xception (Extreme Inception). It is a stack of depthwise separable convolution layers with
residual connections: 36 convolutional layers organized into 14 modules, all of which have
linear residual connections around them except for the first and last modules. Xception is
reported to perform slightly better than Inception V3 on ImageNet. Table 2.1 and Fig. 2.3
show the classification performance of VGG-16, ResNet-152, Inception V3 and Xception on ImageNet.
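To see why depthwise separable convolutions are attractive, it helps to count weights. The sketch below compares a standard convolution with a depthwise convolution followed by a 1 × 1 pointwise convolution. Biases are ignored, and the layer sizes (3 × 3 filters, 128 input channels, 256 output channels) are illustrative assumptions, not values taken from Xception.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution: every filter spans all input channels."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


standard = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
print(standard, separable, round(standard / separable, 2))
```

For these assumed sizes the separable version needs roughly one ninth of the weights.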
SqueezeNet:
Researchers developed SqueezeNet to reduce the size and complexity of convolutional
neural networks without sacrificing accuracy. The approach included pruning small-weight
parameters to create sparse models and retraining them. Additionally, SqueezeNet adopted
three main strategies to minimize parameters and computation:
(a) Replacing 3 × 3 filters with 1 × 1 filters.
(b) Reducing the number of input channels to 3 × 3 filters.
(c) Delaying subsampling to later layers to preserve larger activation maps. (Subsampling means
reducing the size of the feature maps. Instead of shrinking the image too early, SqueezeNet keeps
larger feature maps for more layers.)
With these methods, SqueezeNet achieved AlexNet-level accuracy on ImageNet using 50
times fewer parameters.
ShuffleNet:
Another ConvNet architecture, called ShuffleNet, was introduced in 2017
for devices with limited computational power, such as mobile devices,
without compromising on accuracy. ShuffleNet uses two ideas to
considerably decrease the computational cost while maintaining accuracy:
pointwise group convolution (the channels are split into groups and
connections are made only within a group, which greatly reduces
computation) and channel shuffle.
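The two ideas can be sketched as follows. The text does not spell out how channel shuffle works, so the version here follows the usual formulation (view the channels as a groups × channels-per-group grid, transpose it, and flatten it again) and should be read as an assumption. The function names and the channel counts are hypothetical.

```python
def pointwise_group_params(c_in, c_out, groups):
    """Weights in a 1 x 1 convolution whose channels are split into groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each group connects only its own slice of inputs to its own slice of outputs.
    return (c_in // groups) * (c_out // groups) * groups


def channel_shuffle(channels, groups):
    """Interleave channels so the next group convolution sees every group."""
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # Equivalent to reshape (groups, per_group) -> transpose -> flatten.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


dense_cost = pointwise_group_params(240, 240, groups=1)
grouped_cost = pointwise_group_params(240, 240, groups=3)
shuffled = channel_shuffle(["a0", "a1", "a2", "b0", "b1", "b2"], groups=2)
print(dense_cost, grouped_cost, shuffled)
```

Without the shuffle, each group would only ever see outputs from the same group in the previous layer; interleaving lets information cross group boundaries at no extra arithmetic cost.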
Convolution Operation
Architecture of CNN
Traditional Neural Network Limitations
Fully connected layers connect every neuron in one layer to every neuron in the previous layer.
This dense connectivity does not scale well to large images.
Need for CNN
CNNs are better for large images and data with grid-like structure (e.g., 1D time-series, 2D images, 3D
volumes, 4D videos).
Designed to process structured data efficiently.
Key Features of CNNs
(i) Local Receptive Field:
o Each neuron connects only to a small region of the input.
o Helps extract local features like edges, corners.
(ii) Weight Sharing:
o Same filter (set of weights) is applied across all positions in the input.
o Reduces number of parameters and enables feature detection anywhere in the input.
(iii) Subsampling (Pooling):
o Reduces spatial size and network parameters.
o Most common method is max-pooling.
A typical convolutional neural network consists of the following layers:
• Convolutional layer
• Activation function layer (ReLU)
• Pooling layer
• Fully connected layer
• Dropout layer
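The effect of local receptive fields and weight sharing on the number of parameters can be made concrete with a small count. The sketch assumes a 128 × 128 single-channel input; the 100-neuron fully connected layer is a hypothetical size chosen only for comparison.

```python
def dense_params(n_inputs, n_outputs):
    """Fully connected layer: one weight per input-output pair, plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


def conv_params(k, in_channels, n_filters):
    """Convolution layer: each filter has k * k * in_channels shared weights plus one bias."""
    return (k * k * in_channels + 1) * n_filters


dense = dense_params(128 * 128, 100)                 # every pixel feeds every neuron
conv = conv_params(k=5, in_channels=1, n_filters=5)  # five shared 5 x 5 filters
print(dense, conv)
```

The convolution layer's parameter count depends only on the filter size and the number of filters, not on the image size, which is why it scales to large inputs.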
Convolution Layer
The convolution layer is the main building block of a convolutional neural network
(CNN).
It uses the convolution operation (denoted by *) instead of general matrix multiplication.
It has a set of learnable filters or kernels as its parameters.
Its main task is to detect features in local regions of the input image that are common
across the dataset.
A feature map is created for each filter by convolving it over subregions of the image.
The process includes performing the convolution, adding a bias term, and applying an
activation function.
The local receptive field is the region of the input the filter is applied to, and its size
matches the filter size.
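The process of producing one feature map (convolve, add a bias, apply an activation) can be sketched in plain Python. This is a minimal illustration for a single-channel image with stride 1 and no padding. As in most deep learning libraries, the filter is not flipped, so strictly speaking this computes a cross-correlation. The function name and the example filter are hypothetical.

```python
def conv2d_single(image, kernel, bias=0.0):
    """Slide one k x k filter over a 2-D image, add the bias, and apply ReLU."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The local receptive field is the k x k patch whose corner is (i, j).
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(k)
                        for dj in range(k))
            row.append(max(total + bias, 0.0))  # bias, then ReLU
        feature_map.append(row)
    return feature_map


# A 4 x 4 image that is dark on the left and bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
fmap = conv2d_single(image, vertical_edge)
for row in fmap:
    print(row)
```

The feature map is non-zero only in the column where the edge lies, which shows how a single filter, applied at every position, detects the same feature wherever it occurs.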
Filters/Kernels
The weights in each convolutional layer define the convolution filters (kernels).
There can be multiple filters in a single convolutional layer.
Each filter learns to capture specific features like edges or corners.
During the forward pass, each filter slides over the input’s width and height to
produce its feature map.
Hyperparameters
Convolutional neural networks have hyperparameters that control model
behavior, output size, runtime, and memory.
Four important hyperparameters in the convolution layer are:
Filter Size: Typically between 3×3 and 11×11. Size is independent of input size.
Number of Filters: Can vary. For example, AlexNet used 96 filters of size 11×11 in its
first convolutional layer, ZFNet reduced the first-layer filter size to 7×7, and VGGNet
used filters of size 3×3.
Stride: Number of pixels the filter moves at each step. Small stride = more overlap
and larger output size; large stride = less overlap and smaller output size.
Zero Padding: Number of pixels added as zeros around the input to control the
output’s spatial size.
• Each filter in the convolution layer produces a feature map of size
(A − K + 2P)/S + 1, where A is the input volume size, K is the size of
the filter, P is the amount of zero padding applied and S is the stride.
• Suppose the input image has size 128 × 128, and 5 filters of size 5 × 5
are applied, with a stride of 1 and no padding, i.e., A = 128, K = 5,
P = 0 and S = 1.
• The number of feature maps produced will be equal to the number of
filters applied, i.e., 5, and the size of each feature map will be
(128 − 5 + 0)/1 + 1 = 124. Therefore, the output volume will be 124 × 124 × 5.
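The output-size calculation above is easy to check in code. The helper below is a small sketch; the function name is hypothetical, and rejecting settings where the stride does not divide evenly is a design choice made here for clarity (many libraries round down instead).

```python
def conv_output_size(a, k, p=0, s=1):
    """Feature-map size for input size a, filter size k, padding p and stride s."""
    span = a - k + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not fit the padded input evenly")
    return span // s + 1


size = conv_output_size(a=128, k=5, p=0, s=1)
num_filters = 5
output_volume = (size, size, num_filters)
print(output_volume)  # (124, 124, 5)
```

The same helper shows the effect of stride: with an 11 × 11 filter, stride 4 and no padding, a 227 × 227 input shrinks to 55 × 55.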
Activation Function (ReLU)
The output of each convolutional layer is passed through an activation function layer.
The activation function transforms the feature map into an activation map.
It determines the output signal of a neuron for a given input.
Activation functions typically squash inputs to a specific range (e.g., 0–1 or −1 to 1).
They perform a mathematical operation on the input to produce the neuron's activation
level.
A good activation function is usually continuous and differentiable everywhere.
Differentiability is important for gradient-based training methods used in ConvNets.
If non-gradient-based methods are used, differentiability is not required.
Many activation functions are used in ANNs and some of the commonly used activation
functions are as follows:
(non-monotonic: slopes up, dips slightly below zero, then rises)
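As an illustration, the three activation functions named elsewhere in this module (sigmoid, tanh and ReLU) can be written as follows. This is a sketch using only the standard library. The branch inside `sigmoid` is an added safeguard against overflow for large negative inputs and is not part of the mathematical definition.

```python
import math


def sigmoid(x):
    """Squashes any input into the range (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow when x is a large negative number
    return e / (1.0 + e)


def tanh(x):
    """Squashes any input into the range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Passes positive inputs unchanged and maps negative inputs to 0."""
    return max(x, 0.0)


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```

Sigmoid and tanh flatten out for large positive or negative inputs, so their gradients become very small there, whereas ReLU keeps a constant slope for positive inputs.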
Pooling Layer
• Pooling layers follow the convolution and activation layers in ConvNets to
reduce the spatial size of feature maps.
This reduction lowers the number of parameters and computational cost in
the network.
A pooling layer down-samples the input feature maps by summarizing
regions of neurons to select representative values.
Max-pooling is the most common technique, dividing the input into small
regions (e.g., 2 × 2) and selecting the maximum value from each region.
For a 2 × 2 region, max-pooling outputs the single highest value among the
four values.
• Other pooling types include average pooling (computes the mean of the region) and L2-norm
pooling (calculates the square root of the sum of squares of the values).
Pooling layers discard less important details while preserving essential features in a smaller,
more manageable form.
The idea behind pooling is that detecting a feature is more important than knowing its exact
location.
This strategy works well for simple tasks but can have limitations for more complex problems.
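The pooling operations described in this section can be sketched with one small function. It assumes non-overlapping regions (the stride equals the region size); the function and argument names are hypothetical.

```python
def pool2d(feature_map, size=2, summarize=max):
    """Down-sample a 2-D feature map by summarizing each size x size region."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            region = [feature_map[i + di][j + dj]
                      for di in range(size)
                      for dj in range(size)]
            row.append(summarize(region))
        pooled.append(row)
    return pooled


def mean(values):
    """Average of a region, used for average pooling."""
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 5, 6],
]
max_pooled = pool2d(feature_map)                   # keeps the largest value per region
avg_pooled = pool2d(feature_map, summarize=mean)   # keeps the mean per region
print(max_pooled)  # [[6, 4], [7, 9]]
print(avg_pooled)  # [[3.75, 2.25], [4.0, 5.25]]
```

A 4 × 4 map becomes 2 × 2, so three quarters of the values are discarded while the strongest response in each region survives.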
Fully Connected Layer
•Convolutional Neural Networks (CNNs) consist of two main stages: feature extraction and classification.
•The feature extraction stage includes convolution and pooling layers that detect features from input data.
•Once enough features are extracted, the classification stage begins.
•The classification stage consists of one or more fully connected layers followed by a classifier.
•Fully connected layers take input from all neurons of the previous layer, enabling every value to contribute
to the prediction.
•These layers transform the spatial feature data into class scores or probabilities.
•Multiple fully connected layers can be used to learn complex feature relationships.
•The output from the last fully connected layer is sent to a classifier.
•Common classifiers used are Softmax and Support Vector Machines (SVMs).
•The Softmax classifier outputs class probabilities that sum to 1.
•The SVM classifier outputs class scores, and the class with the highest score is selected.
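The difference between the two classifier outputs can be illustrated with a short sketch. The scores are made-up values standing in for the output of the last fully connected layer; subtracting the maximum score before exponentiating is a common numerical safeguard and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # for numerical stability only
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(scores)

# A score-based classifier simply picks the class with the highest score.
predicted_class = max(range(len(scores)), key=lambda i: scores[i])
print(probs, predicted_class)
```

Both approaches select the same class here; Softmax additionally expresses how confident the network is in each class.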
Dropout
Deep neural networks have multiple hidden layers that help learn complex features.
These are followed by fully connected layers used for decision-making.
Fully connected layers are prone to overfitting due to their dense connections.
Overfitting occurs when the model performs well on training data but poorly on new,
unseen data.
To address overfitting, a dropout layer is used during training.
Dropout randomly removes some neurons and their connections from the network
during each training iteration.
The remaining reduced network is trained on the data at that stage.
Dropped-out neurons are reinserted later with their original weights.
This technique reduces overfitting and enhances the model's ability to generalize.
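A dropout layer can be sketched as a random mask applied to the activations during training. The version below uses "inverted" dropout, which scales the surviving activations by 1 / (1 − drop probability) so that nothing needs to change at test time. That scaling is a common convention assumed here, not something stated in the text above, and the function name is hypothetical.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations during training; pass them through otherwise."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)  # the full network is used at test time
    keep_prob = 1.0 - drop_prob
    output = []
    for a in activations:
        if rng.random() < keep_prob:
            output.append(a / keep_prob)  # kept and rescaled (inverted dropout)
        else:
            output.append(0.0)            # dropped for this iteration only
    return output


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, drop_prob=0.5, rng=rng)
test_out = dropout(activations, drop_prob=0.5, rng=rng, training=False)
print(train_out, test_out)
```

A fresh mask is drawn on every call, so a neuron dropped in one training iteration takes part again in the next with its weights intact.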
Challenges and Future Research Directions
Strong Performance: Convolutional Neural Networks (ConvNets) have shown excellent results in tasks like object
classification and detection, sometimes matching human-level accuracy.
Vulnerabilities Exist: Despite their success, ConvNets are vulnerable to small, imperceptible changes in input images, which
can lead to incorrect classifications.
Cause of Vulnerability: One key reason for this vulnerability is the pooling operation, which reduces the feature space but
also discards important spatial information.
Loss of Spatial Relationships: ConvNets detect if a feature is present in a region but fail to capture the exact spatial
relationships between features, making it harder to recognize complex objects.
Reliability Concern: These limitations raise concerns about the generalization and reliability of ConvNets in real-world
applications.
Capsule Networks as a Solution: Capsule Networks have been proposed to overcome some of these issues. They use
capsules (groups of neurons) to represent objects and their parts more precisely.
Dynamic Routing: Instead of max pooling, Capsule Networks use dynamic routing to preserve spatial relationships between
features across layers.
Ongoing Research: Capsule Networks are still in the early stages of research, and their effectiveness across various visual
tasks remains under investigation.