1. Deep Feedforward Neural Networks
• Definition: Deep Feedforward Neural Networks are artificial neural networks in which
connections between the nodes do not form a cycle, so information flows in one direction
from the input to the output. They are the simplest form of deep neural network.
• Architecture: Consists of an input layer, several hidden layers, and an output layer.
• Activation Functions: Commonly used activation functions include Sigmoid, Tanh, and
ReLU.
• Forward Propagation: Computes the output of each neuron layer by layer, from the input
layer through to the output layer (see the sketch after this list).
• Use Cases: Image and speech recognition, language translation, and other applications
requiring pattern recognition.
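A minimal forward-propagation sketch in NumPy; the layer sizes, ReLU hidden activation, and random weights are illustrative assumptions rather than values from these notes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, params):
    """Propagate input x through each (W, b) layer pair in turn."""
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)           # hidden layers use ReLU
    W_out, b_out = params[-1]
    return W_out @ a + b_out          # linear output layer

# Toy network: 4 inputs -> 8 hidden units -> 3 outputs, random weights.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(3, 8)), np.zeros(3))]
print(forward(rng.normal(size=4), params))
```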
2. Gradient Descent (GD)
• Definition: An optimization algorithm used to minimize the cost function by iteratively
adjusting the model parameters in the opposite direction of the gradient.
• Types of Gradient Descent:
o Batch Gradient Descent: Uses the entire dataset for each update.
o Stochastic Gradient Descent: Uses one training example at each update.
o Mini-batch Gradient Descent: Uses a small random subset of the dataset at each
update.
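A sketch of mini-batch gradient descent on an assumed squared-error (linear-regression) objective, used here only for illustration; setting batch_size to the dataset size recovers batch GD, and batch_size=1 recovers stochastic GD.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Mini-batch GD for linear regression with squared-error loss."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))                 # reshuffle each epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)  # gradient of the MSE
            theta -= lr * grad                        # step opposite to the gradient
    return theta
```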
3. Momentum-Based GD
• Definition: Enhances gradient descent by adding a momentum term to accelerate
convergence and prevent oscillations.
• Formula: v(t) = γ·v(t−1) + η·∇J(θ), with the parameter update θ = θ − v(t) (sketched below)
o v(t): velocity (momentum term)
o γ: momentum hyperparameter
o η: learning rate
o ∇J(θ): gradient of the cost function
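A one-step sketch of the momentum update above; grad_J stands for any (hypothetical) function returning the gradient of the cost at the given parameters.

```python
def momentum_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    """One momentum-based GD step on NumPy arrays (or scalars)."""
    v = gamma * v + lr * grad_J(theta)   # v(t) = gamma*v(t-1) + eta*grad J(theta)
    theta = theta - v                    # parameters move along the velocity
    return theta, v
```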
4. Nesterov Accelerated GD
• Definition: An improved version of momentum-based GD that looks ahead to the
estimated future position.
• Formula: v(t) = γ·v(t−1) + η·∇J(θ − γ·v(t−1)), with the same parameter update θ = θ − v(t) (sketched below)
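The same sketch adapted to Nesterov's look-ahead; the only change is that the gradient is evaluated at theta − gamma·v rather than at theta (grad_J is again a hypothetical gradient callback).

```python
def nesterov_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    """One Nesterov accelerated GD step."""
    lookahead = theta - gamma * v          # estimated future position
    v = gamma * v + lr * grad_J(lookahead)
    theta = theta - v
    return theta, v
```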
5. Stochastic Gradient Descent (SGD)
• Definition: An iterative method for optimizing an objective function using one training
example at a time.
• Advantages: Much cheaper per update than batch GD, so it makes faster progress on large datasets.
• Disadvantages: Updates are noisy, and the learning rate must be tuned carefully (often with a decay schedule).
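A sketch of plain SGD with per-epoch shuffling; grad_example is a hypothetical function returning the gradient on a single training example.

```python
import numpy as np

def sgd(X, y, grad_example, lr=0.01, epochs=10, seed=0):
    """SGD: one training example per (noisy) update, reshuffled every epoch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            theta -= lr * grad_example(theta, X[i], y[i])
    return theta
```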
6. AdaGrad
• Definition: An adaptive gradient algorithm that adjusts the learning rate for each
parameter based on historical gradient information.
• Formula: θ_{t+1} = θ_t − η/√(G_t + ϵ) · ∇J(θ_t)
o G_t: sum of the squares of the past gradients
o ϵ: small constant to avoid division by zero
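A one-step AdaGrad sketch matching the formula above; the learning rate and ϵ are typical illustrative values.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """One AdaGrad step; G accumulates the squared gradients per parameter."""
    G = G + grad ** 2
    theta = theta - lr / np.sqrt(G + eps) * grad   # per-parameter learning rate
    return theta, G
```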
7. Adam
• Definition: Combines the advantages of AdaGrad and RMSProp, using adaptive learning
rates and momentum.
• Parameters: β1 (decay rate for the first moment), β2 (decay rate for the second moment), ϵ (small constant).
• Formula: m_t = β1·m_{t−1} + (1 − β1)·∇J(θ_t) and v_t = β2·v_{t−1} + (1 − β2)·(∇J(θ_t))²
• Update: θ_{t+1} = θ_t − η·m̂_t/(√v̂_t + ϵ), where m̂_t and v̂_t are the bias-corrected moment estimates (see the sketch below).
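A one-step Adam sketch with the usual bias correction; the default hyperparameters shown are common illustrative choices.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; t counts update steps starting from 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```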
8. RMSProp
• Definition: An optimization algorithm that adjusts the learning rate by dividing the
gradient by a running average of its recent magnitude.
• Formula: θ_{t+1} = θ_t − η/√(E[g²]_t + ϵ) · ∇J(θ_t), where E[g²]_t = ρ·E[g²]_{t−1} + (1 − ρ)·(∇J(θ_t))² is the running average of the squared gradients.
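A one-step RMSProp sketch matching the formula above; rho is the decay rate of the running average (0.9 is a common illustrative value).

```python
import numpy as np

def rmsprop_step(theta, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp step; avg_sq is the running average E[g^2]."""
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    theta = theta - lr / np.sqrt(avg_sq + eps) * grad
    return theta, avg_sq
```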
9. Auto-encoder
• Definition: An unsupervised learning model used to encode input data into a
compressed representation and then decode it back to reconstruct the input.
• Architecture: Consists of an encoder (compresses the input) and a decoder (reconstructs
the input).
• Applications: Dimensionality reduction, feature learning, and anomaly detection.
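A minimal auto-encoder sketch in PyTorch; the 784-dimensional input, 32-dimensional code, and layer widths are illustrative assumptions.

```python
import torch
from torch import nn

class AutoEncoder(nn.Module):
    """Encoder compresses the input to a small code; decoder reconstructs it."""
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_in, 128), nn.ReLU(),
                                     nn.Linear(128, n_code))
        self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(),
                                     nn.Linear(128, n_in))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                      # a fake batch of flattened images
loss = nn.functional.mse_loss(model(x), x)   # reconstruction loss
loss.backward()
```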
10. Regularization in Auto-encoders
• Purpose: Prevent overfitting and improve generalization by adding constraints to the
model.
• Techniques:
o L1 Regularization: Adds the sum of the absolute values of the weights to the loss function.
o L2 Regularization: Adds the sum of the squared weights to the loss function (weight decay).
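A sketch of adding L1 and L2 weight penalties to the reconstruction loss of an auto-encoder such as the one in item 9; the penalty coefficients are illustrative.

```python
from torch import nn

def regularized_loss(model, x, l1_coef=1e-5, l2_coef=1e-4):
    """Reconstruction loss plus L1/L2 penalties on all weights."""
    recon = nn.functional.mse_loss(model(x), x)
    l1 = sum(p.abs().sum() for p in model.parameters())    # sum of |w|
    l2 = sum((p ** 2).sum() for p in model.parameters())   # sum of w^2
    return recon + l1_coef * l1 + l2_coef * l2
```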
11. Denoising Auto-encoders
• Definition: Train on a noisy version of the input data and aim to reconstruct the clean
input.
• Objective: Improve the model's robustness and ability to capture relevant structures in
the data.
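A sketch of the denoising objective for the same kind of model: corrupt the input (here with Gaussian noise, an illustrative choice) but score the reconstruction against the clean input.

```python
import torch
from torch import nn

def denoising_loss(model, x, noise_std=0.1):
    """Reconstruct the clean x from a noisy copy of it."""
    noisy = x + noise_std * torch.randn_like(x)   # Gaussian corruption
    return nn.functional.mse_loss(model(noisy), x)
```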
12. Sparse Auto-encoders
• Definition: Apply sparsity constraints on the hidden layer activations to encourage
learning a compact and efficient representation.
• Technique: Use an additional sparsity penalty term in the loss function.
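A sketch of one common sparsity penalty, an L1 term on the hidden activations, for an encoder/decoder pair like the one in item 9; the coefficient is illustrative.

```python
from torch import nn

def sparse_loss(encoder, decoder, x, sparsity_coef=1e-3):
    """Reconstruction loss plus an L1 penalty on the code activations."""
    code = encoder(x)
    recon = nn.functional.mse_loss(decoder(code), x)
    return recon + sparsity_coef * code.abs().mean()
```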
13. Contractive Auto-encoders
• Definition: Penalize the gradient of the encoder's activations with respect to the input to
make the learned representation robust to small variations.
• Formula: Add a term to the loss function proportional to the squared Frobenius norm of the
Jacobian of the hidden representation with respect to the input.
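A sketch of the contractive penalty, using autograd to form the Jacobian of the code with respect to a single input example (batching and efficiency are ignored for brevity; the coefficient is illustrative).

```python
import torch
from torch import nn

def contractive_loss(encoder, decoder, x_single, lam=1e-3):
    """Reconstruction loss plus the squared Frobenius norm of d(code)/d(input)."""
    jac = torch.autograd.functional.jacobian(encoder, x_single, create_graph=True)
    recon = nn.functional.mse_loss(decoder(encoder(x_single)), x_single)
    return recon + lam * (jac ** 2).sum()
```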
14. Variational Auto-encoder
• Definition: A generative model that learns a probabilistic distribution over the latent
space, allowing for the generation of new data samples.
• Objective: Maximize the Evidence Lower Bound (ELBO) to ensure the latent variables
follow a desired distribution.
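A minimal VAE sketch in PyTorch showing the reparameterization trick and the negative ELBO; the layer sizes and the squared-error reconstruction term (a Gaussian likelihood assumption) are illustrative.

```python
import torch
from torch import nn

class VAE(nn.Module):
    """Encoder outputs the mean and log-variance of q(z|x); decoder maps z back to x."""
    def __init__(self, n_in=784, n_z=16):
        super().__init__()
        self.enc = nn.Linear(n_in, 128)
        self.mu = nn.Linear(128, n_z)
        self.logvar = nn.Linear(128, n_z)
        self.dec = nn.Sequential(nn.Linear(n_z, 128), nn.ReLU(), nn.Linear(128, n_in))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(recon, x, mu, logvar):
    """Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I))."""
    rec = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu ** 2 - logvar.exp())
    return rec + kl

vae = VAE()
x = torch.rand(16, 784)
recon, mu, logvar = vae(x)
loss = negative_elbo(recon, x, mu, logvar)
```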
15. Auto-encoders' Relationship with PCA and SVD
• PCA: Auto-encoders can be seen as a non-linear generalization of PCA, which performs linear
dimensionality reduction.
• SVD: A linear auto-encoder with a k-unit bottleneck and squared-error loss learns the same
subspace as the top-k right singular vectors of the centered data matrix, which is exactly the
subspace PCA obtains via the SVD.
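A short NumPy sketch of PCA via SVD on centered data; this is the projection a linear (no-activation) auto-encoder with a k-unit bottleneck and squared-error loss recovers, up to an invertible linear transformation of the code. The random data are purely illustrative.

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(200, 10))   # fake dataset
Xc = X - X.mean(axis=0)                                # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3
codes = Xc @ Vt[:k].T                     # "encode": project onto top-k components
recon = codes @ Vt[:k] + X.mean(axis=0)   # "decode": reconstruct from the projection
```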
16. Dataset Augmentation
• Definition: Techniques to artificially increase the size and diversity of a dataset by
applying transformations like rotation, scaling, flipping, and adding noise.
• Purpose: Improve the model's generalization by providing more varied training
examples.
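A simple augmentation sketch in NumPy covering flips, 90-degree rotations, and additive noise; the specific transformations and noise level are illustrative (real pipelines typically also scale, crop, and adjust color).

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of a 2-D grayscale image array."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                             # horizontal flip
    image = np.rot90(image, k=rng.integers(0, 4))            # random 90-degree rotation
    return image + rng.normal(scale=0.01, size=image.shape)  # small Gaussian noise

rng = np.random.default_rng(0)
augmented = [augment(np.ones((28, 28)), rng) for _ in range(5)]  # 5 variants of one image
```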