Lecture 3: Unsupervised Learning & Neural Networks - MCQ Study Guide
Key Concepts Explained Simply
Unsupervised Learning
What is Unsupervised Learning? Unsupervised learning finds patterns in
data without labeled responses. It’s like sorting objects without being told what
categories to use.
Clustering Algorithms
K-means Clustering
• What it is: Groups similar data points into K clusters
• How it works:
1. Choose K (number of clusters)
2. Randomly place K centroids
3. Assign each point to the nearest centroid
4. Recalculate centroids as the average of all points in the cluster
5. Repeat steps 3-4 until convergence
• Objective function: Minimize the sum of squared distances from points
to their centroids
• Determining optimal K:
– Elbow method: Plot error vs. K and look for the “elbow”
– Silhouette score: Measures how similar points are to their own cluster
vs. other clusters
• Limitations:
– Sensitive to initial centroid positions
– Assumes spherical clusters
– Sensitive to outliers
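The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation (it assumes no cluster ever ends up empty and uses a fixed seed for the random initial centroids):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```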
Hierarchical Clustering
• What it is: Builds a tree of clusters (dendrogram)
• Types:
– Agglomerative (bottom-up): Start with each point as a cluster and
merge
– Divisive (top-down): Start with one cluster and divide
• Linkage methods:
– Single linkage: Minimum distance between points in clusters
– Complete linkage: Maximum distance between points in clusters
– Average linkage: Average distance between all pairs of points
– Ward’s method: Minimizes variance within clusters
• Advantages:
– No need to specify number of clusters beforehand
– Creates a hierarchy that can be cut at different levels
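As an illustration, agglomerative clustering is available in SciPy (assuming `scipy` is installed): `linkage` builds the dendrogram and `fcluster` cuts it at a chosen level.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)                        # toy 2-D data
Z = linkage(X, method="ward")                    # bottom-up merges using Ward's method
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```

Changing `method` to "single", "complete", or "average" switches the linkage criterion listed above.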
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• What it is: Groups dense regions of points, marking sparse regions as
noise
• Parameters:
– ε (epsilon): Maximum distance between two points to be considered neighbors
– MinPts: Minimum number of points required to form a dense region
• Point types:
– Core points: Have at least MinPts points within distance ε
– Border points: Within distance ε of a core point but have fewer than MinPts neighbors
– Noise points: Neither core nor border points
• Advantages:
– Can find arbitrarily shaped clusters
– Robust to outliers
– Doesn’t require specifying number of clusters
• Disadvantages:
– Sensitive to parameter selection
– Struggles with varying density clusters
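A quick usage sketch with scikit-learn (assumed available); the two parameters map directly to ε and MinPts, and points labelled -1 are noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)                               # toy 2-D data
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)   # eps = ε, min_samples = MinPts
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters;", (labels == -1).sum(), "noise points")
```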
Dimensionality Reduction
Principal Component Analysis (PCA)
• What it is: Reduces dimensions while preserving as much variance as
possible
• How it works:
1. Standardize the data
2. Compute covariance matrix
3. Calculate eigenvectors and eigenvalues
4. Sort eigenvectors by eigenvalues (highest to lowest)
5. Select top k eigenvectors to form new feature space
6. Transform original data to new space
• Applications:
– Data compression
– Visualization
– Noise reduction
– Feature extraction
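The six steps map almost one-to-one onto NumPy operations. A minimal sketch (assumes every feature has non-zero standard deviation):

```python
import numpy as np

def pca(X, k):
    # 1. standardize the data (zero mean, unit variance per feature)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardized features
    cov = np.cov(Xs, rowvar=False)
    # 3. eigenvalues / eigenvectors (eigh, since the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. sort eigenvectors by eigenvalue, highest to lowest
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. keep the top-k eigenvectors as the new feature space
    W = eigvecs[:, :k]
    # 6. project the standardized data onto that space
    return Xs @ W, eigvals
```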
t-SNE (t-Distributed Stochastic Neighbor Embedding)
• What it is: Nonlinear dimensionality reduction technique for visualization
• Key idea: Convert similarities between data points to joint probabilities
and minimize the KL divergence
• Advantages:
– Preserves local structure
– Effective for visualization
• Disadvantages:
– Computationally intensive
– Non-deterministic
– Not suitable for dimensionality reduction for modeling
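For reference, a typical scikit-learn call (assuming scikit-learn is installed; the perplexity value is just an illustrative default):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)        # toy high-dimensional data
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)                  # (200, 2), ready for a scatter plot
```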
Association Rule Learning
• What it is: Discovers interesting relations between variables in large
databases
• Key metrics:
– Support: Frequency of an itemset = (transactions containing itemset)
/ (total transactions)
– Confidence: Likelihood of Y given X = support(X ∪ Y) / support(X)
– Lift: Ratio of observed support to expected support if X and Y were
independent
• Apriori Algorithm:
1. Find all frequent itemsets with support ≥ minimum support
2. Generate rules with confidence ≥ minimum confidence
• Applications:
– Market basket analysis
– Product recommendation
– Cross-selling
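A small sketch of the three metrics computed from a list of transactions in plain Python (toy data, no library assumed):

```python
def support(itemset, transactions):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    return confidence(X, Y, transactions) / support(Y, transactions)

transactions = [{"bread", "butter"}, {"bread"}, {"milk", "butter"}, {"bread", "butter"}]
print(support({"bread", "butter"}, transactions))      # 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # 0.666...
print(lift({"bread"}, {"butter"}, transactions))        # ~0.89
```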
Neural Networks
Basic Structure
• Input layer: Receives the features
• Hidden layer(s): Processes the information
• Output layer: Produces the prediction
• Neuron (Perceptron): Basic unit that:
– Receives inputs
– Applies weights
– Adds bias
– Applies activation function
– Produces output
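In code, those steps for a single neuron collapse to output = activation(w·x + b). A minimal sketch using a sigmoid activation:

```python
import numpy as np

def neuron(x, w, b, activation=lambda z: 1 / (1 + np.exp(-z))):
    # weighted sum of inputs plus bias, passed through an activation function
    return activation(np.dot(w, x) + b)

print(neuron(x=np.array([0.5, -1.0]), w=np.array([0.8, 0.2]), b=0.1))
```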
Activation Functions
• Sigmoid: f(x) = 1/(1+e^(-x))
– Range: (0, 1)
– Used in binary classification output layers
• Tanh: f(x) = (e^x - e^(-x))/(e^x + e^(-x))
– Range: (-1, 1)
– Zero-centered
• ReLU (Rectified Linear Unit): f(x) = max(0, x)
– Range: [0, ∞)
– Computationally efficient
– Helps mitigate vanishing gradient problem
• Leaky ReLU: f(x) = max(αx, x) where α is a small constant
– Addresses “dying ReLU” problem
• Softmax: f(x_i) = e^(x_i)/Σ(e^(x_j))
– Used for multi-class classification output layers
– Outputs sum to 1 (probability distribution)
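The same functions written out in NumPy (the softmax subtracts the maximum before exponentiating for numerical stability, which does not change the result):

```python
import numpy as np

sigmoid    = lambda x: 1 / (1 + np.exp(-x))
tanh       = lambda x: np.tanh(x)
relu       = lambda x: np.maximum(0, x)
leaky_relu = lambda x, alpha=0.01: np.maximum(alpha * x, x)

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # outputs sum to 1
```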
Training Neural Networks
• Forward Propagation: Compute outputs given inputs
• Loss Function:
– Mean Squared Error (regression)
– Binary Cross-Entropy (binary classification)
– Categorical Cross-Entropy (multi-class classification)
• Backpropagation: Calculate gradients of the loss with respect to weights
• Optimization Algorithms:
– Stochastic Gradient Descent (SGD)
– Adam
– RMSprop
– Adagrad
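A minimal sketch of the training cycle for a single linear neuron with an MSE loss and plain gradient descent updates; it is only meant to show the forward pass / gradient / update loop, not a full backpropagation implementation:

```python
import numpy as np

# toy data: one input feature, one target (y = 2x)
x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
w, b, lr = 0.0, 0.0, 0.05                  # weight, bias, learning rate

for _ in range(2000):
    y_hat = w * x + b                      # forward propagation
    loss = np.mean((y_hat - y) ** 2)       # mean squared error
    grad_w = np.mean(2 * (y_hat - y) * x)  # dLoss/dw via the chain rule
    grad_b = np.mean(2 * (y_hat - y))      # dLoss/db
    w -= lr * grad_w                       # gradient descent update
    b -= lr * grad_b

print(round(w, 2), round(b, 2))            # approaches w ≈ 2, b ≈ 0
```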
Regularization Techniques
• Dropout: Randomly deactivate neurons during training
• L1/L2 Regularization: Add penalty terms to the loss function
• Batch Normalization: Normalize layer inputs for each mini-batch
• Early Stopping: Stop training when validation error starts increasing
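For orientation, this is roughly how those four techniques appear in a Keras model definition (assuming TensorFlow/Keras is available; layer sizes and rates are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),  # L2 penalty
    tf.keras.layers.Dropout(0.5),          # randomly deactivate 50% of units during training
    tf.keras.layers.BatchNormalization(),  # normalize activations per mini-batch
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])
```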
Convolutional Neural Networks (CNNs)
Key Components
• Convolutional layers: Apply filters to detect features
• Filters/Kernels: Small matrices that slide over the input
• Feature Maps: Output of applying filters to the input
• Pooling layers: Reduce spatial dimensions
– Max pooling: Take maximum value in each window
– Average pooling: Take average value in each window
• Fully connected layers: Final classification
CNN Architecture
• Input layer: Holds the raw pixel values
• Convolutional layer: Applies convolution operation
• Activation layer: Applies non-linearity (usually ReLU)
• Pooling layer: Reduces dimensions
• Fully connected layer: Connects to all neurons in previous layer
CNN Applications
• Image classification
• Object detection
• Face recognition
• Medical image analysis
CNN Calculations
• Output size after convolution: ((W-F+2P)/S)+1
– W: input size
– F: filter size
– P: padding
– S: stride
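A tiny helper for that formula (a hypothetical convenience function, handy for checking the MCQ arithmetic below):

```python
def conv_output_size(W, F, P=0, S=1):
    # ((input size - filter size + 2 * padding) / stride) + 1
    return (W - F + 2 * P) // S + 1

print(conv_output_size(28, 5))   # 24  (see Problem 1)
print(conv_output_size(6, 3))    # 4   (see Question 6)
```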
Recurrent Neural Networks (RNNs)
What are RNNs?
• Neural networks with loops to maintain information over time
• Designed for sequential data (time series, text, speech)
Types of RNNs
• Simple RNN: Basic recurrent structure
• LSTM (Long Short-Term Memory): Solves vanishing gradient problem with gates
• GRU (Gated Recurrent Unit): Simplified version of LSTM
Applications
• Natural language processing
• Speech recognition
• Time series prediction
• Machine translation
MCQ Practice Questions
Question 1
In DBSCAN clustering, what are points called that have at least
MinPts points within distance ε? - A) Border points - B) Core points - C)
Noise points - D) Centroid points
Answer: B) Core points
Explanation: In DBSCAN, core points are defined as points that have at least
MinPts points within distance ε, making them central to forming clusters.
Question 2
What is the primary purpose of Principal Component Analysis
(PCA)? - A) Classification - B) Clustering - C) Dimensionality reduction - D)
Association rule learning
Answer: C) Dimensionality reduction
Explanation: PCA is a technique used to reduce the dimensionality of a
dataset while preserving as much variance as possible.
Question 3
In a neural network with an input layer of 10 nodes, a hidden layer
of 20 nodes, and an output layer of 5 nodes, how many weights are
there in total (excluding biases)? - A) 35 - B) 300 - C) 200 - D) 250
Answer: B) 300
Explanation: The number of weights is calculated as: (input nodes × hidden
nodes) + (hidden nodes × output nodes) = (10 × 20) + (20 × 5) = 200 + 100
= 300.
Question 4
Which activation function is commonly used in the output layer for
multi-class classification problems? - A) ReLU - B) Sigmoid - C) Tanh -
D) Softmax
Answer: D) Softmax
Explanation: Softmax converts a vector of values into a probability distribution, making it ideal for multi-class classification where outputs need to sum to 1.
Question 5
In CNN architecture, what is the purpose of a pooling layer? - A) To
apply filters to the input - B) To reduce spatial dimensions - C) To fully connect
all neurons - D) To normalize the input
Answer: B) To reduce spatial dimensions
Explanation: Pooling layers reduce the spatial dimensions (width and height)
of the input volume, which helps reduce computation and control overfitting.
Question 6
For deep learning with CNN, what is the size of filter (Kernel) required to produce a 4x4 feature map from an image of 6x6 pixels, assuming the filter is applied with a sliding window of 1 pixel? - A) 3x3 - B) 4x4 - C) 2x2 - D) 6x6
Answer: A) 3x3
Explanation: Using the formula Output size = ((Input size - Filter size) / Stride) + 1:
4 = ((6 - Filter size) / 1) + 1, so 3 = 6 - Filter size, giving Filter size = 3.
Therefore, a 3x3 filter is needed.
Question 7
In association rule mining, if the marketing team specified the minimum support of 0.2, what is the maximum support that can be specified? - A) 0.2 - B) 1 - C) 0 - D) 0.5
Answer: B) 1
Explanation: Support is a probability measure ranging from 0 to 1. A support
of 1 means the itemset appears in 100% of transactions, which is the maximum
possible value.
Question 8
Which of the following is NOT a type of point in DBSCAN clustering?
- A) Core point - B) Border point - C) Noise point - D) Centroid point
Answer: D) Centroid point
Explanation: DBSCAN defines three types of points: core points, border
points, and noise points. Centroid points are a concept from K-means clustering,
not DBSCAN.
Question 9
What does the “vanishing gradient problem” refer to in neural networks? - A) When gradients become too large during backpropagation - B) When gradients become very small during backpropagation - C) When the learning rate is too high - D) When there are too many hidden layers
Answer: B) When gradients become very small during backpropagation
Explanation: The vanishing gradient problem occurs when gradients become
extremely small as they propagate backward through the network, making it
difficult for early layers to learn.
Calculation Problems
Problem 1: CNN Output Size
If you have an input image of size 28x28 and apply a 5x5 convolutional
filter with stride 1 and no padding, what will be the size of the output
feature map?
Solution: Using the formula Output size = ((Input size - Filter size) / Stride) + 1:
Output size = ((28 - 5) / 1) + 1 = 23 + 1 = 24.
Therefore, the output feature map will be 24x24.
Problem 2: Neural Network Weights
A neural network has 3 layers: an input layer with 8 nodes, a hidden
layer with 12 nodes, and an output layer with 4 nodes. How many
weights and biases does this network have in total?
Solution:
Weights: between input and hidden: 8 × 12 = 96; between hidden and output: 12 × 4 = 48; total weights: 96 + 48 = 144.
Biases: hidden layer: 12; output layer: 4; total biases: 16.
Total parameters: 144 + 16 = 160.
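The same arithmetic as a quick check in code (a hypothetical helper that takes the layer sizes as a list):

```python
def count_params(layer_sizes):
    # weights: fan_in × fan_out for each consecutive pair of layers
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # biases: one per node in every layer after the input layer
    biases = sum(layer_sizes[1:])
    return weights, biases

w, b = count_params([8, 12, 4])
print(w, b, w + b)   # 144 16 160
```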
Problem 3: Association Rule Metrics
In a market basket analysis of 200 transactions, itemset {bread, butter} appears in 40 transactions, itemset {bread} appears in 100 transactions, and itemset {butter} appears in 80 transactions. Calculate the support, confidence, and lift for the rule “bread → butter”.
Solution:
Support({bread, butter}) = 40/200 = 0.2 or 20%
Support({bread}) = 100/200 = 0.5 or 50%
Support({butter}) = 80/200 = 0.4 or 40%
Confidence(bread → butter) = Support({bread, butter}) / Support({bread}) = 0.2 / 0.5 = 0.4 or 40%
Lift(bread → butter) = Confidence(bread → butter) / Support({butter}) = 0.4 / 0.4 = 1
Problem 4: PCA Variance Explained
After performing PCA on a dataset with 10 features, you find that
the first 3 principal components have eigenvalues of 4.2, 2.8, and
1.5, while the remaining components have eigenvalues summing to
1.5. What percentage of variance is explained by the first 3 principal
components?
Solution:
Total variance = 4.2 + 2.8 + 1.5 + 1.5 = 10
Variance explained by first 3 components = 4.2 + 2.8 + 1.5 = 8.5
Percentage of variance explained = (8.5 / 10) × 100 = 85%
Key Formulas to Remember
1. CNN Output Size: ((W-F+2P)/S)+1
• W: input size
• F: filter size
• P: padding
• S: stride
2. Support: (Transactions containing itemset) / (Total transactions)
3. Confidence: Support(X ∪ Y) / Support(X)
4. Lift: Confidence(X→Y) / Support(Y)
5. Sigmoid Function: f(x) = 1/(1+e^(-x))
6. ReLU Function: f(x) = max(0, x)
7. Softmax Function: f(x_i) = e^(x_i)/Σ(e^(x_j))
8. Binary Cross-Entropy: -[y·log(p) + (1-y)·log(1-p)]
9. Categorical Cross-Entropy: -Σ[y_i·log(p_i)]
Tips for MCQ Questions
1. Understand the algorithms: Know how each clustering and dimensionality reduction algorithm works.
2. Memorize CNN formulas: Be able to calculate output sizes after convolution and pooling.
3. Know activation functions: Understand which activation functions are
used for different purposes.
4. Practice calculations: Be comfortable with calculating support, confidence, and lift for association rules.
5. Understand neural network architecture: Know how to calculate
the number of parameters in a network.