Deep Learning Notes - January 2025
Neural Networks & Deep Learning
From Perceptrons to Deep Architectures - My Learning Journey
Big Picture: Neural Networks are inspired by the human brain, using
layers of interconnected "neurons" to learn complex patterns. Deep
Learning = Neural Networks with many layers!
1. The Biological Inspiration
Just like our brain has ~86 billion neurons connected by synapses, artificial
neural networks have nodes (neurons) connected by weights. The magic
happens when these simple units work together!
Key Insight: Each neuron does something simple, but together they
can approximate any continuous function to arbitrary accuracy, given
enough neurons (Universal Approximation Theorem)
2. The Perceptron - Where It All Began
Single Perceptron Model:
Output = activation(Σ(wi × xi) + bias)
where: wi = weights, xi = inputs, bias = threshold adjustment
Input Layer       Perceptron       Output

x1 ----w1----\
              \
x2 ----w2----- [Σ → f()] → y
              /
x3 ----w3----/
      +
     bias
Limitations of Single Perceptron:
Can only solve linearly separable problems
XOR problem exposed this limitation!
Solution? Stack multiple layers → Multi-Layer Perceptron (MLP)
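To make the perceptron formula concrete, here's a minimal NumPy sketch with a step activation. The AND-gate weights are my own hand-picked illustration, not from any library:

import numpy as np

def perceptron(x, w, bias):
    # weighted sum of inputs plus bias, then step activation
    return 1 if np.dot(w, x) + bias > 0 else 0

# Hand-picked weights implementing an AND gate (linearly separable)
w, bias = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x), w, bias))

# No single (w, bias) reproduces XOR -- hence the need for hidden layers.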
3. Anatomy of a Neural Network
Essential Components:
Input Layer - Raw features (pixels, words, numbers)
Hidden Layers - Where the learning happens
Output Layer - Final predictions
Weights & Biases - The parameters we learn
Activation Functions - Add non-linearity
Simple Neural Network Architecture (fully connected):

[784 inputs] → [128 neurons] → [64 neurons] → [10 classes]
   Input       Hidden Layer 1   Hidden Layer 2    Output
4. Activation Functions - Adding Non-linearity
Without activation functions, even deep networks would just be linear
transformations!
Common Activation Functions:
1. ReLU (Rectified Linear Unit) - My go-to for hidden layers!
f(x) = max(0, x)
Pros: simple, fast, gradient doesn't vanish for positive inputs
Cons: "dead neurons" problem (units stuck outputting zero)
2. Sigmoid - Classic, outputs between 0 and 1
f(x) = 1 / (1 + e^(-x))
Use case: binary classification output layer
Issue: vanishing gradients in deep networks
3. Tanh - Centered around zero
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Better than sigmoid for hidden layers because its outputs are zero-centered
4. Softmax - For multi-class output
f(xi) = e^xi / Σ(e^xj)
Outputs probability distribution
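All four in NumPy as a quick sketch (subtracting the max before exponentiating in softmax is a standard numerical-stability trick I'm adding, not part of the formula above):

import numpy as np

def relu(x):    return np.maximum(0, x)
def sigmoid(x): return 1 / (1 + np.exp(-x))
def tanh(x):    return np.tanh(x)

def softmax(x):
    # subtract max for numerical stability; result sums to 1
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z), tanh(z), softmax(z))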
5. Forward Propagation
The journey of data through the network:
1. Input data enters
2. Multiply by weights, add bias
3. Apply activation function
4. Pass to next layer
5. Repeat until output
Layer output: a[l] = activation(W[l] × a[l-1] + b[l])
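As a sketch of that layer equation, here's a forward pass through the 784 → 128 → 64 → 10 network from section 3, using random weights (the shapes are the point, the values are made up):

import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
# W[l] has shape (out, in); b[l] has shape (out,)
Ws = [rng.standard_normal((o, i)) * 0.01 for i, o in zip(sizes, sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]

a = rng.standard_normal(784)          # a[0] = input
for l, (W, b) in enumerate(zip(Ws, bs)):
    z = W @ a + b                     # linear step: W[l] × a[l-1] + b[l]
    a = np.maximum(0, z) if l < len(Ws) - 1 else z   # ReLU on hidden layers only
print(a.shape)                        # (10,) -- the output logits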
6. Backpropagation - The Learning Magic
The Chain Rule is Everything!
Backprop = Computing gradients using chain rule + Gradient descent
Steps in Backpropagation:
1. Forward Pass: Compute predictions
2. Calculate Loss: How wrong were we?
3. Backward Pass: Compute gradients using chain rule
4. Update Weights: W = W - α × ∂L/∂W
Weight Update Rule:
W[l] = W[l] - α × ∂L/∂W[l]
b[l] = b[l] - α × ∂L/∂b[l]
where α = learning rate
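A minimal sketch of one full cycle (forward, loss, chain-rule gradients, update) for a single linear neuron with squared-error loss; all the numbers are made up for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0])      # inputs
y_true = 2.0                       # target
W = np.array([0.1, 0.1, 0.1])      # weights
b, alpha = 0.0, 0.01               # bias, learning rate

# 1-2. Forward pass and loss
y_pred = W @ x + b
loss = (y_true - y_pred) ** 2

# 3. Backward pass (chain rule): dL/dW = 2(y_pred - y_true) × x
dL_dy = 2 * (y_pred - y_true)
dL_dW = dL_dy * x
dL_db = dL_dy

# 4. Update rule from above: W = W - α × ∂L/∂W
W -= alpha * dL_dW
b -= alpha * dL_db
print(loss, W, b)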
7. Loss Functions
For Regression:
MSE : L = (1/n) × Σ(y_true - y_pred)²
MAE : L = (1/n) × Σ|y_true - y_pred|
For Classification:
Binary Cross-Entropy : -Σ(y×log(p) + (1-y)×log(1-p))
Categorical Cross-Entropy : -Σ(y×log(p))
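NumPy versions of these losses (I'm averaging the cross-entropies over samples for consistency with MSE/MAE, and clipping probabilities to avoid log(0)):

import numpy as np

def mse(y_true, y_pred): return np.mean((y_true - y_pred) ** 2)
def mae(y_true, y_pred): return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)       # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y, p):       # y, p: (n_samples, n_classes)
    p = np.clip(p, 1e-12, 1.0)
    return -np.mean(np.sum(y * np.log(p), axis=1))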
8. Optimization Algorithms
Gradient Descent is great, but we can do better!
Evolution of Optimizers:
SGD (Stochastic Gradient Descent) - The classic
Momentum - Adds velocity to updates
RMSprop - Adaptive learning rates
Adam - Combines momentum + RMSprop (my favorite!)
Adam Update:
m = β1×m + (1-β1)×gradient
v = β2×v + (1-β2)×gradient²
m̂ = m/(1-β1^t), v̂ = v/(1-β2^t)   ← bias correction at step t
W = W - α × m̂/(√v̂ + ε)
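A from-scratch sketch of that update with the standard defaults (β1=0.9, β2=0.999, ε=1e-8), including the bias-correction terms:

import numpy as np

def adam_step(W, grad, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # momentum (1st moment)
    v = b2 * v + (1 - b2) * grad ** 2      # RMSprop-style (2nd moment)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    W = W - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v

W = np.array([1.0, -2.0])
m = v = np.zeros_like(W)
for t in range(1, 5001):                   # minimize L = ||W||², so grad = 2W
    W, m, v = adam_step(W, 2 * W, m, v, t)
print(W)                                   # converges toward [0, 0]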
9. Regularization Techniques
Fighting Overfitting: Great training accuracy but poor test accuracy?
Time to regularize!
Key Techniques:
L1/L2 Regularization - Add penalty to loss function
Dropout - Randomly "turn off" neurons during training
Early Stopping - Stop when validation loss increases
Batch Normalization - Normalize inputs to each layer
Data Augmentation - Create more training data
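Two of these take one line each in PyTorch, as a sketch: dropout is a layer, and L2 regularization is the weight_decay argument on the optimizer (the layer sizes here are just for illustration):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Dropout(p=0.2),          # randomly zero 20% of activations during training
    nn.Linear(128, 10),
)
# weight_decay adds an L2 penalty on the weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

model.train()   # dropout active
model.eval()    # dropout disabled for validation/inference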
10. Deep Learning Architectures
Convolutional Neural Networks (CNNs)
Specialized for images - use convolution operations
CNN Architecture:
Input (image) → Conv (feature maps) → Pool (reduced) → Conv → Pool
→ Flatten → Dense → Output (classification)
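A small PyTorch sketch of that Conv → Pool → Conv → Pool → Flatten → Dense pattern (the channel counts and the 28×28 single-channel input are my own illustrative choices):

import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # (1, 28, 28) -> (16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (16, 14, 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> (32, 14, 14)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (32, 7, 7)
    nn.Flatten(),                                 # -> 32*7*7 = 1568 features
    nn.Linear(32 * 7 * 7, 10),                    # -> 10 class logits
)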
Recurrent Neural Networks (RNNs)
For sequential data - have memory!
Vanilla RNN - Simple but suffers from vanishing gradients
LSTM - Long Short-Term Memory (gates mitigate the vanishing-gradient problem)
GRU - Gated Recurrent Unit (simpler than LSTM)
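A quick sketch of running an LSTM in PyTorch; the sizes are arbitrary illustrations:

import torch
import torch.nn as nn

# batch_first=True means input shape is (batch, seq_len, features)
lstm = nn.LSTM(input_size=10, hidden_size=32, num_layers=1, batch_first=True)
x = torch.randn(4, 20, 10)            # batch of 4 sequences, 20 steps each
output, (h_n, c_n) = lstm(x)
print(output.shape)                   # (4, 20, 32): hidden state at every step
print(h_n.shape)                      # (1, 4, 32): final hidden state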
Transformer Architecture
The revolution in NLP - "Attention is all you need"
Self-attention mechanism allows model to focus on relevant parts of
input
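A from-scratch sketch of single-head scaled dot-product self-attention, the paper's Attention(Q, K, V) = softmax(QKᵀ/√d_k)V; the dimensions are illustrative:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the input into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # how much each token attends to each other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                           # weighted sum of values

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                  # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 8)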
11. Training Tips & Tricks
Personal Best Practices:
Start with a small network, gradually increase complexity
Always monitor training AND validation loss
Learning rate is crucial - try 0.001 as starting point
Batch size affects convergence - powers of 2 work well
Save checkpoints regularly!
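For the checkpointing tip, the standard PyTorch pattern looks like this (assuming model, optimizer, and epoch come from your training loop; saving state_dicts is the recommended format rather than pickling the whole model object):

import torch

# Save
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": epoch,
}, "checkpoint.pt")

# Load and resume
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])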
12. Common Problems & Solutions
Vanishing Gradients:
Use ReLU instead of sigmoid/tanh
Proper weight initialization (Xavier/He)
Batch normalization
Exploding Gradients:
Gradient clipping
Proper weight initialization
Lower learning rate
Overfitting:
More data!
Dropout layers
L1/L2 regularization
Reduce model complexity
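Two of these fixes in PyTorch, as a self-contained sketch: He (Kaiming) initialization for the vanishing-gradient side, and gradient-norm clipping (called between backward() and step()) for the exploding side. The tiny model is just a stand-in:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

# Vanishing gradients: He initialization suits ReLU networks
for layer in model.modules():
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
        nn.init.zeros_(layer.bias)

# Exploding gradients: clip the global gradient norm before the update
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()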
13. PyTorch Implementation Snippet
Simple Neural Network in PyTorch:
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)   # input -> hidden layer 1
        self.fc2 = nn.Linear(128, 64)    # hidden layer 1 -> hidden layer 2
        self.fc3 = nn.Linear(64, 10)     # hidden layer 2 -> output (10 classes)
        self.relu = nn.ReLU()            # non-linearity
        self.dropout = nn.Dropout(0.2)   # drop 20% of activations while training

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)                  # raw logits (CrossEntropyLoss applies softmax)
        return x
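And a minimal training-step sketch using that network; the data tensors here are random stand-ins for real flattened-image batches:

import torch

model = SimpleNN()
criterion = nn.CrossEntropyLoss()              # expects raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

x = torch.randn(32, 784)                       # fake batch of 32 flattened images
y = torch.randint(0, 10, (32,))                # fake labels

optimizer.zero_grad()                          # clear old gradients
loss = criterion(model(x), y)                  # forward pass + loss
loss.backward()                                # backprop
optimizer.step()                               # gradient descent update
print(loss.item())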
14. Hyperparameter Tuning
Hyperparameters to Tune (in order of importance):
1. Learning rate
2. Number of layers & neurons
3. Batch size
4. Dropout rate
5. Activation functions
6. Optimizer choice
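A sketch of random search over the two knobs at the top of this list; random search usually beats grid search for the same budget. train_and_evaluate is a hypothetical placeholder standing in for a real training run:

import random

def train_and_evaluate(lr, batch_size):
    # Hypothetical stand-in: replace with a real training run
    # that returns validation accuracy
    return random.random()

best_config, best_acc = None, 0.0
for trial in range(20):
    lr = 10 ** random.uniform(-4, -2)          # log-uniform sample in [1e-4, 1e-2]
    batch_size = random.choice([32, 64, 128, 256])
    acc = train_and_evaluate(lr, batch_size)
    if acc > best_acc:
        best_config, best_acc = (lr, batch_size), acc
print(best_config, best_acc)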
15. My Learning Resources
Deep Learning by Ian Goodfellow (the bible!)
fast.ai courses - practical approach
3Blue1Brown neural network series - visual intuition
Papers With Code - latest research
PyTorch tutorials - hands-on practice
Final Thoughts:
Neural networks seemed like magic at first, but they're just clever math!
The key is understanding the fundamentals - forward prop, backprop,
and gradient descent. Everything else builds on these concepts.
"Deep learning is not a black box - it's a very complex but understandable
system of simple operations"