Module 1: Foundations - Introduction to AI & ML
• Evolution: AI → Machine Learning → Deep Learning
1. Artificial Intelligence (AI) – 1950s onwards
• Goal: Build machines that can “think” like humans.
• Focus: Symbolic reasoning, rule-based systems, expert systems.
• Examples:
o Early chess programs
o Logical problem-solving
• Limitation: Couldn’t handle large amounts of real-world, messy data.
2. Machine Learning (ML) – 1980s–2000s
• Goal: Make machines learn patterns from data instead of only following rules.
• Techniques:
o Regression, Decision Trees, SVMs, k-Means, Random Forests
• Key Shift: Instead of explicitly programming rules → let the algorithm “learn” from examples.
• Examples:
o Spam email filters
o Product recommendations
• Limitation: ML struggled with unstructured data (images, speech, videos) and feature
engineering was manual.
3. Deep Learning (DL) – 2010s onwards
• Goal: Use neural networks with many layers to automatically learn complex features.
• Breakthrough: Thanks to big data + GPUs + better algorithms, DL became practical.
• Techniques:
o Convolutional Neural Networks (CNNs) → Image recognition
o Recurrent Neural Networks (RNNs, LSTM, GRU) → Sequential data
o Transformers → NLP (BERT, GPT)
• Examples:
o Self-driving cars
o Voice assistants (Siri, Alexa)
o ChatGPT
• Strength: Handles huge amounts of unstructured data and learns features automatically.
Summary Flow
• AI (1950s): The dream of making machines intelligent → rules & logic
• ML (1980s): Data-driven learning → statistical models learn from data
• DL (2010s): Neural networks with many layers → powerful learning from big data
Think of it like this:
• AI = The goal (human-like intelligence)
• ML = The approach (learning from data)
• DL = The modern breakthrough (using deep neural networks to handle complex data)
• Applications of Deep Learning
1. Computer Vision
• Image classification (e.g., detecting cats vs dogs)
• Object detection (e.g., self-driving cars detecting pedestrians, traffic signs)
• Image segmentation (e.g., medical scans like MRI/CT analysis)
• Facial recognition (e.g., security systems, photo tagging on social media)
2. Natural Language Processing (NLP)
• Sentiment analysis (e.g., customer feedback, social media monitoring)
• Machine translation (Google Translate, DeepL)
• Chatbots & virtual assistants (ChatGPT, Siri, Alexa)
• Text summarization & question answering
3. Speech & Audio Processing
• Speech recognition (e.g., voice assistants, transcription services)
• Voice synthesis (Text-to-Speech, e.g., Google TTS, ElevenLabs)
• Speaker identification & verification
• Music generation and enhancement
4. Healthcare & Medicine
• Disease detection (e.g., cancer detection from medical images)
• Drug discovery & genomics
• Predictive healthcare analytics
• Personalized treatment recommendations
5. Autonomous Systems
• Self-driving cars (Tesla, Waymo)
• Drones & robotics (navigation, object avoidance)
• Industrial automation
6. Finance & Business
• Fraud detection in transactions
• Stock market prediction
• Credit risk assessment
• Customer recommendation systems (Amazon, Netflix, Spotify)
7. Generative AI
• Deepfake creation (faces, voices, videos)
• Art & image generation (DALL·E, MidJourney, Stable Diffusion)
• Music & video synthesis
• Content creation (AI writing assistants like ChatGPT)
8. Other Applications
• Agriculture: Crop disease detection, yield prediction
• Education: Personalized learning platforms
• Cybersecurity: Detecting network intrusions
• Smart cities: Traffic management, surveillance
• Difference between ML & DL
Machine Learning (ML) vs Deep Learning (DL)
• Definition: ML is a subset of AI that uses algorithms to learn patterns from data and make predictions; DL is a subset of ML that uses multi-layered artificial neural networks to learn directly from data.
• Data Requirement: ML works well with small to medium datasets; DL requires large amounts of data to perform well.
• Feature Engineering: ML needs manual feature extraction (humans decide which features are important); DL does automatic feature extraction (the network learns features itself).
• Computation Power: ML can run on CPUs and is less resource-intensive; DL requires GPUs/TPUs due to heavy computation.
• Interpretability: ML results are easier to interpret (e.g., decision trees, regression); DL is often considered a “black box” and harder to interpret.
• Training Time: ML trains faster on small data; DL trains slowly (especially with big data and deep networks).
• Example Algorithms: ML uses Regression, Decision Trees, Random Forests, SVM, k-Means; DL uses CNNs, RNNs, LSTMs, Transformers, GANs.
• Use Cases: ML for predicting house prices, spam detection, basic recommendation systems; DL for self-driving cars, facial recognition, voice assistants, large-scale NLP (like ChatGPT).
Quick Analogy:
• Machine Learning = Like teaching a child to recognize animals by pointing out features (e.g.,
“a cat has whiskers, pointy ears, small size”).
• Deep Learning = Like showing the child millions of animal pictures until they automatically
learn to recognize cats vs dogs.
Feature Engineering
Feature Engineering is the process of selecting, transforming, or creating input variables
(features) from raw data to improve the performance of a machine learning model.
In simple words:
👉 It’s like preparing ingredients before cooking — the better the ingredients, the tastier the dish
(better features → better model).
🔹 Why Feature Engineering?
• Raw data is messy (missing values, irrelevant columns, noise).
• Machine learning models don’t understand raw text, dates, or images directly.
• Good features capture important patterns and relationships, making models more accurate.
🔹 Key Steps in Feature Engineering
1. Feature Selection
o Choosing the most relevant features.
o Example: In predicting house prices, "location" and "size" matter more than "color of
walls".
2. Feature Transformation
o Modifying features to make them usable.
o Examples:
▪ Normalization/Scaling (putting values in similar ranges)
▪ Encoding categorical data (e.g., male/female → 0/1)
▪ Log transformation to reduce skewness
3. Feature Creation (Feature Extraction)
o Creating new features from existing ones.
o Example:
▪ From a “date of birth” → extract “age”.
▪ From “address” → extract “city” or “zip code”.
4. Handling Missing Data
o Filling missing values with mean/median/mode.
o Dropping irrelevant or incomplete features.
🔹 Examples
• Predicting loan approval: Create a new feature “debt-to-income ratio” from income and debt.
• Predicting churn in telecom: Extract features like “number of calls per month”, “average call
duration” from call logs.
• Image recognition: Use edges, colors, textures as features (or let Deep Learning
automatically learn them).
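A minimal pandas sketch of these steps on hypothetical loan data (all column names and values are made up):
import pandas as pd

# Hypothetical raw loan data
df = pd.DataFrame({
    "income": [50000, 64000, None, 42000],
    "debt": [12000, 30000, 8000, 15000],
    "gender": ["male", "female", "female", "male"],
    "date_of_birth": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14", "1978-07-30"]),
})

# Handling missing data: fill missing income with the median
df["income"] = df["income"].fillna(df["income"].median())

# Feature transformation: encode a categorical column as 0/1
df["gender_code"] = (df["gender"] == "female").astype(int)

# Feature creation: derive age from date of birth
df["age"] = (pd.Timestamp("2024-01-01") - df["date_of_birth"]).dt.days // 365

# Feature creation: debt-to-income ratio, as in the loan example above
df["debt_to_income"] = df["debt"] / df["income"]

# Feature transformation: min-max scaling to a (0, 1) range
df["income_scaled"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

print(df[["age", "gender_code", "debt_to_income", "income_scaled"]])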
🔹 ML vs DL in Feature Engineering
• Machine Learning (ML): Requires manual feature engineering (done by humans).
• Deep Learning (DL): Automatically learns features from raw data (like images, text, audio).
✅ In short:
Feature Engineering = Turning raw data into useful signals for your model.
Deep Learning Frameworks
1. TensorFlow
• Developer: Google Brain (2015).
• Description:
o Open-source framework for building and training deep learning models.
o Provides both low-level operations (custom math, tensors) and high-level APIs.
o Works well for research + production deployment.
• Features:
o Scalable (works on CPU, GPU, TPU).
o TensorBoard for visualization.
o Strong ecosystem (TensorFlow Lite for mobile, TensorFlow.js for browser).
• Use Case Examples:
o Google Translate, Medical Image Analysis, Recommendation Systems.
2. PyTorch
• Developer: Facebook AI Research (2016).
• Description:
o Flexible, Pythonic deep learning library.
o Very popular in research and academia (easy to experiment).
• Features:
o Dynamic computation graph (easier debugging).
o Supports distributed training.
o TorchServe for deployment.
• Use Case Examples:
o NLP models (BERT, GPT, Transformers).
o Computer Vision (object detection, segmentation).
o Research prototypes that later move to production.
3. Keras
• Developer: François Chollet (2015), now part of TensorFlow.
• Description:
o High-level deep learning API (runs on top of TensorFlow, previously also on Theano,
CNTK).
o Focused on simplicity and fast prototyping.
• Features:
o Beginner-friendly syntax.
o Quick model building with just a few lines of code.
o Supports sequential and functional model APIs.
• Use Case Examples:
o Beginners learning neural networks.
o Quick prototyping before moving to large-scale TensorFlow or PyTorch.
Quick Comparison
• Ease of Use: TensorFlow is moderate; PyTorch is easy (Pythonic); Keras is very easy.
• Best For: TensorFlow for production & scale; PyTorch for research & flexibility; Keras for beginners & prototyping.
• Ecosystem: TensorFlow is huge (Lite, JS, TPU); PyTorch is growing fast; Keras runs on TensorFlow.
• Community: TensorFlow is large; PyTorch is very large (especially in research); Keras is large (via TensorFlow).
Recommendation for learning:
• Start with Keras (inside TensorFlow) → easy to build models.
• Then move to PyTorch → more flexibility and research control.
• Use TensorFlow for scaling & deployment in real-world systems.
TensorFlow
Definition
TensorFlow is an open-source deep learning and machine learning library developed by Google
Brain (2015).
It helps developers and researchers build, train, and deploy AI models efficiently.
Why the name TensorFlow?
• Tensor → A multi-dimensional array (like a matrix but can go beyond 2D).
• Flow → Data flows through a graph of operations (computational graph).
So, TensorFlow = A system where tensors flow through mathematical operations.
Key Features
1. Works on multiple devices → CPU, GPU, TPU.
2. Ecosystem support:
o TensorBoard → Visualization of training (loss, accuracy, graphs).
o TensorFlow Lite → Run models on mobile/IoT devices.
o TensorFlow.js → Run models in browsers.
o TFX → Production pipelines.
3. Flexibility → Low-level operations or high-level APIs (Keras).
4. Scalable → From a laptop experiment to distributed training on clusters.
What Can You Do with TensorFlow?
• Image recognition (cats vs dogs, face detection).
• Natural Language Processing (NLP) (translation, chatbots, text summarization).
• Speech recognition (voice assistants).
• Recommendation systems (YouTube, Netflix, Amazon).
• Healthcare (disease prediction from medical scans).
• Self-driving cars (object detection, path planning).
Mini Example in Python
import tensorflow as tf
# Define a constant tensor
hello = tf.constant("Hello, TensorFlow!")
# Run
tf.print(hello)
Output:
Hello, TensorFlow!
In short: TensorFlow is a powerful library that allows machines to learn patterns from data
and make predictions — powering many modern AI applications.
Parts of TensorFlow
1. Core TensorFlow
• The foundation of the library.
• Provides tools to build, train, and run ML/DL models.
• Includes Tensors, Computation Graphs, and Automatic Differentiation.
• Example: Building neural networks using Keras or low-level TensorFlow APIs.
2. TensorFlow Keras (High-level API)
• Integrated into TensorFlow as its official high-level API.
• Simplifies building and training deep learning models.
• Example:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
• Easy to use for beginners and rapid prototyping.
3. TensorFlow Extended (TFX)
• An end-to-end ML pipeline for production.
• Handles data preprocessing, model training, evaluation, and deployment.
• Used in large-scale industry applications (e.g., Google, YouTube).
4. TensorFlow Lite
• A lightweight version for mobile and IoT devices.
• Optimizes models to run on smartphones, embedded systems, and edge devices.
• Example: Running ML models on Android/iOS apps.
5. TensorFlow.js
• Lets you run ML models in web browsers or Node.js.
• Supports training and inference directly in JavaScript.
• Example: Real-time face filters in web apps.
6. TensorBoard
• A visualization tool for monitoring training.
• Tracks loss, accuracy, learning rate, graphs, and weights.
• Helps debug and improve models.
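A minimal sketch of wiring TensorBoard into training via a Keras callback (the "logs" directory name is arbitrary):
import tensorflow as tf

# Write loss/accuracy logs that the TensorBoard UI can read
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

# Pass the callback when fitting any Keras model, e.g.:
# model.fit(X, y, epochs=10, callbacks=[tb_callback])
# Then launch the dashboard with: tensorboard --logdir logs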
7. TensorFlow Hub
• A repository of pre-trained models.
• You can download and reuse models for NLP, vision, speech, etc.
• Example: Use a pre-trained BERT model for text classification.
8. TensorFlow Datasets (TFDS)
• A collection of ready-to-use datasets for ML research.
• Example: MNIST, CIFAR-10, IMDB reviews.
• Easy integration with training pipelines.
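A minimal sketch of loading one of these datasets (assumes the tensorflow-datasets package is installed):
import tensorflow_datasets as tfds

# Load the MNIST training split as (image, label) pairs
ds = tfds.load("mnist", split="train", as_supervised=True)

# Inspect a single example
for image, label in ds.take(1):
    print(image.shape, int(label))  # (28, 28, 1) and a digit 0-9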
Summary Diagram (Hierarchy)
TensorFlow Ecosystem
• Core TensorFlow
o Tensors, Graphs, Training, APIs
• Keras API
• TensorFlow Extended (TFX)
• TensorFlow Lite (Mobile/IoT)
• TensorFlow.js (Web)
• TensorBoard (Visualization)
• TensorFlow Hub (Pretrained Models)
• TensorFlow Datasets (Ready-made Data)
In short: TensorFlow = A complete ecosystem for building, training, deploying, and monitoring
deep learning models — from research to production.
TensorFlow Workflow
1. Import & Prepare Data → Load dataset using TensorFlow Datasets (TFDS) or Pandas.
2. Build Model → Define neural network using Keras Sequential/Functional API.
3. Compile Model → Choose optimizer (Adam, SGD), loss function, and metrics.
4. Train Model → Fit data into the model.
5. Evaluate Model → Check accuracy, loss on test data.
6. Deploy Model → Save, export, or deploy via TensorFlow Serving / Lite / JS.
Example: Simple Neural Network in TensorFlow
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Sample dataset: XOR problem
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([0, 1, 1, 0], dtype="float32")

# Build model: one hidden layer of 4 ReLU units, sigmoid output
model = Sequential([
    Input(shape=(2,)),
    Dense(4, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train model (XOR is tiny; more epochs may be needed for full convergence)
model.fit(X, y, epochs=100, verbose=0)

# Evaluate
loss, acc = model.evaluate(X, y)
print(f"Accuracy: {acc:.2f}")
WEIGHTS IN NEURAL NETWORK:
What are Weights in a Neural Network?
In a neural network, weights are the trainable parameters that determine the strength of the
connection between two neurons (nodes) in adjacent layers.
They are like "knobs" that the learning algorithm adjusts during training to make the network’s
predictions more accurate.
🔑 How Weights Work
1. Each input feature xi is multiplied by a corresponding weight wi.
2. The neuron computes a weighted sum:
z = (w1⋅x1) + (w2⋅x2) + ⋯ + (wn⋅xn) + b
where b is the bias term (extra offset).
3. The result z is passed through an activation function to decide the neuron’s output.
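A minimal numeric sketch of steps 1–3 (all weights, inputs, and the bias are made-up values):
import numpy as np

x = np.array([2.0, 3.0, 1.0])   # input features x1..x3 (illustrative)
w = np.array([0.5, -0.2, 0.8])  # weights w1..w3 (illustrative)
b = 0.1                         # bias

# Weighted sum: z = w1*x1 + w2*x2 + ... + wn*xn + b
z = np.dot(w, x) + b            # 1.3

# Activation function (sigmoid here) decides the neuron's output
a = 1 / (1 + np.exp(-z))        # ≈ 0.786
print(z, a)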
🔹 Intuition
• If an input feature is important, its weight will be large (positive or negative).
• If it’s less important, its weight will be small or close to zero.
Think of weights as the “importance level” of each input in making a decision.
🔹 Example
Suppose you’re building a fraud detection model with inputs:
• x1= Transaction amount
• x2 = Time of transaction
If the network learns:
• w1 = 0.9 (amount is very important)
• w2 = 0.1 (time is less important)
Then the model will rely more on amount when predicting fraud.
🔹 Training and Weights
• At the start, weights are initialized randomly (small numbers).
• During training, the network makes predictions → calculates error → uses backpropagation +
gradient descent to adjust weights.
• Goal: Find the set of weights that minimize prediction error (loss function).
✅ In short:
Weights in a neural network are learnable parameters that control how much influence each input
has on the output. Training is all about tuning these weights until the network makes good
predictions.
🔹 Who Decides the Weights?
1. Initialization (Start of Training)
o At the beginning, weights are not set by humans.
o They are usually initialized with small random numbers (sometimes using special
methods like Xavier Initialization or He Initialization) to avoid symmetry problems.
o Example:
o w1 = 0.02, w2 = -0.01, w3 = 0.03 ...
2. Training Process (Backpropagation + Gradient Descent)
o The network makes predictions using current weights.
o It compares predictions with true labels using a loss function (error measure).
o Then, through backpropagation, the network calculates how much each weight
contributed to the error.
o Using gradient descent (or its variants like Adam, RMSProp, SGD), the weights are updated:
w_new = w_old − η ⋅ (∂L/∂w)
where:
o L = loss function
o ∂L/∂w = gradient of the loss with respect to the weight
o η = learning rate (step size)
3. Learning Rate & Optimization Algorithm
o The learning rate (η) decides how big the updates are.
▪ Too small → training is slow.
▪ Too big → training may not converge.
o Optimizers like Adam, SGD, RMSProp decide the exact way weights are adjusted.
4. Data & Loss Function
o The dataset (features + labels) drives how weights shift.
o If fraud transactions have certain patterns, the weights connected to those features will
grow in importance.
o The loss function (e.g., cross-entropy, mean squared error) defines the goal of
optimization.
✅ In short:
• We don’t set the weights manually.
• They start randomly and get decided/adjusted automatically during training by:
o Backpropagation
o Gradient descent (or other optimizers)
o Guided by the data and loss function
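In Keras, the optimizer, learning rate, and loss function from the points above are specified at compile time; a minimal sketch (the model architecture and 0.01 learning rate are illustrative):
import tensorflow as tf

# A tiny model; the architecture here is only illustrative
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# The optimizer (Adam here) and learning rate control how weights are updated;
# the loss function defines the error the updates try to minimize
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss='binary_crossentropy',
)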
BIAS IN NEURAL NETWORK:
🔹 What is Bias in a Neural Network?
In a neural network, bias is an extra parameter added to a neuron, just like weights, but it is not tied
to any input feature.
It helps the model shift the activation function left or right, which allows the network to fit the data
better.
🔑 Neuron Equation
Without bias:
z = w1x1 + w2x2 + ⋯ + wnxn
With bias:
z = w1x1 + w2x2 + ⋯ + wnxn + b
Then the output is:
a=f(z)
where f is the activation function.
🔹 Why Do We Need Bias?
1. Flexibility in learning
o If bias didn’t exist, every neuron’s output would always go through the origin (0,0).
o Bias allows the neuron to shift the decision boundary away from the origin.
2. Better fitting
o It works like the intercept (c) in a linear equation:
y = mx + c
Here, c is the bias → it allows the line to move up/down instead of always passing through (0,0).
3. Improves accuracy
o Bias ensures that the neural network can learn more complex patterns.
🔹 Intuition with Example
Suppose you want to predict whether a transaction is fraud:
• Inputs: x1= Amount, x2 = Time
• Weights: w1=0.5, w2=0.7
• Bias: b=−2
Then:
z = (0.5⋅x1) + (0.7⋅x2) − 2
That “−2” bias shifts the output down, so even if both inputs are small, the neuron won’t always
activate.
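Plugging in small illustrative inputs (x1 = x2 = 1) shows the effect:
import math

w1, w2, b = 0.5, 0.7, -2.0   # weights and bias from the example above
x1, x2 = 1.0, 1.0            # small inputs (illustrative)

z = w1 * x1 + w2 * x2 + b    # 0.5 + 0.7 - 2 = -0.8
a = 1 / (1 + math.exp(-z))   # sigmoid(-0.8) ≈ 0.31
print(z, a)                  # the negative bias keeps the neuron from firing strongly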
Visualizing the effect of bias:
• Blue line (without bias) → Always passes through the origin (0,0).
• Red line (with bias = +3) → Gets shifted upward; it no longer has to pass through the origin.
👉 This shows why bias is important: it allows flexibility in shifting the decision boundary instead of
being forced to cross the origin.
🔹 Analogy
Think of weights as how strongly you press the accelerator pedal in a car, and bias as the default
speed the car starts with (even without pressing).
✅ In short:
Bias in a neural network is a trainable constant added to the weighted sum of inputs, giving the model
flexibility to better fit data by shifting the activation function.
ACTIVATION FUNCTION:
An activation function is a mathematical function in a neural network that decides whether a
neuron should be activated (fired) or not.
In simple words:
• It takes the weighted sum of inputs (linear combination)
• Applies a non-linear transformation
• Passes the result to the next layer
Why Do We Need Activation Functions?
1. Introduce non-linearity
o Without them, a neural network would just be a linear model (like simple regression), no
matter how many layers we add.
o Non-linearity allows the model to learn complex patterns (images, speech, fraud
detection, etc.).
2. Control signal flow
o Activation functions decide how much information should pass forward.
3. Normalize outputs
o Some functions (Sigmoid, Tanh, Softmax) squash outputs into specific ranges (like
probabilities).
How it works in a neuron:
1. Input features → x1, x2, ..., xn
2. Each has a weight → w1, w2, ..., wn
3. Neuron computes the weighted sum:
z = w1x1 + w2x2 + ⋯ + wnxn + b
4. Apply activation function:
a=f(z)
where f = activation function.
5. Output a goes to the next layer.
Example:
• If you use ReLU:
o Negative values become 0, positives pass forward.
• If you use Sigmoid:
o Output is between 0 and 1, so it can represent a probability.
SOME ACTIVATION FUNCTIONS
1. Sigmoid Function
f(x) = 1 / (1 + e^(−x))
• Range: (0, 1)
• Shape: "S"-shaped curve.
• Use case: Binary classification (output layer).
• Pros: Smooth, maps values to probability-like range.
• Cons:
o Vanishing gradient problem (derivatives shrink for large |x|).
o Outputs not centered around 0.
2. ReLU (Rectified Linear Unit)
f(x) = max(0, x)
• Range: [0, ∞)
• Shape: 0 for negative values, linear for positive.
• Use case: Most common in hidden layers.
• Pros:
o Efficient, simple, faster training.
o Reduces vanishing gradient issue.
• Cons:
o “Dead ReLU” problem → neurons stuck at 0 if weights push them negative.
3. Tanh (Hyperbolic Tangent)
f(x) = (e^x − e^(−x)) / (e^x + e^(−x))
• Range: (-1, 1)
• Shape: "S"-shaped, but centered at 0.
• Use case: Often used in hidden layers (better than Sigmoid).
• Pros: Zero-centered output, stronger gradients than sigmoid.
• Cons: Still suffers from vanishing gradients for large |x|.
4. Softmax Function
f(xi) = e^(xi) / Σj e^(xj)
• Range: (0, 1), and all outputs sum to 1.
• Shape: Turns raw scores into probabilities.
• Use case: Multi-class classification (output layer).
• Pros: Gives a probability distribution across classes.
• Cons: Can be computationally expensive for large outputs.
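A minimal numpy sketch of these four functions:
import numpy as np

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):
    # zero for negatives, identity for positives
    return np.maximum(0, x)

def tanh(x):
    # S-shaped like sigmoid, but zero-centered with range (-1, 1)
    return np.tanh(x)

def softmax(x):
    # turns raw scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), tanh(z), softmax(z), sep="\n")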
Summary Table
• Sigmoid: range (0, 1); used in binary classification output layers; good for probabilities, suffers from vanishing gradients.
• ReLU: range [0, ∞); used in hidden layers; most common and fast, may cause “dead neurons”.
• Tanh: range (-1, 1); used in hidden layers; zero-centered, still suffers from vanishing gradients.
• Softmax: range (0, 1), outputs sum to 1; used in multi-class classification output layers; converts scores into class probabilities.
Gradient Descent
Gradient Descent is an optimization algorithm used to train machine learning and deep learning
models.
It helps find the best parameters (weights & biases) that minimize the loss function.
🔑 Intuition
• Imagine you’re standing on a hill (loss function curve) in the dark 🌑.
• Your goal = reach the lowest valley (minimum loss).
• You take small steps downhill in the steepest direction (the negative gradient).
• Step size = learning rate (η).
📌 Formula
For each parameter w:
w_new = w_old − η ⋅ (∂L/∂w)
where:
• w: weight (or parameter)
• η: learning rate (controls step size)
• ∂L/∂w: gradient (slope of loss with respect to weight)
⚙️ Types of Gradient Descent
1. Batch Gradient Descent
o Uses the entire dataset for one update.
o Stable but slow for large data.
2. Stochastic Gradient Descent (SGD)
o Uses one sample at a time.
o Faster, more noise (jumps around), but often better at escaping local minima.
3. Mini-Batch Gradient Descent
o Uses a small subset (batch) of data.
o Most widely used in deep learning (balances speed + stability).
📉 Example (Simple Quadratic Function)
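A minimal sketch, assuming the toy loss L(w) = (w − 3)², whose minimum is at w = 3:
# Gradient descent on L(w) = (w - 3)^2; dL/dw = 2(w - 3)
w = 0.0      # starting point
eta = 0.1    # learning rate

for step in range(50):
    grad = 2 * (w - 3)   # slope of the loss at the current w
    w = w - eta * grad   # step in the direction of steepest descent

print(round(w, 4))  # ≈ 3.0, the minimum of the loss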
👉 In short:
Gradient Descent = an iterative way to adjust weights to minimize loss by moving in the direction
of steepest descent.
Backpropagation
Backpropagation (short for Backward Propagation of Errors) is the learning algorithm used in training neural networks.
It’s how the network updates its weights to reduce the loss function.
Step-by-Step Intuition
1. Forward Propagation
o Input data flows through the network (layer by layer).
o The network makes predictions (y_hat).
o Loss function calculates error (L) between prediction and actual (y).
2. Backward Propagation (Backprop)
o We calculate how much each weight contributed to the error.
o This is done using calculus (chain rule of derivatives).
o The gradient tells us the slope:
▪ If positive → decrease the weight
▪ If negative → increase the weight
3. Weight Update (Gradient Descent)
o Adjust each weight to reduce loss:
w_new = w_old − η ⋅ (∂L/∂w)
where η (eta) = learning rate.
Example (Very Simple)
Suppose a network with a single trainable weight w produces a prediction ŷ, and the target output is y = 4. Backpropagation will compute the gradient of the loss with respect to w and adjust w to make ŷ closer to 4.
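A minimal sketch of one such update, assuming ŷ = w ⋅ x with an illustrative input x = 2 and squared-error loss L = (ŷ − y)²:
# One backpropagation update for y_hat = w * x (numbers are illustrative)
w, x, y = 0.5, 2.0, 4.0
eta = 0.1

# Forward pass
y_hat = w * x                 # prediction: 1.0
loss = (y_hat - y) ** 2       # error: 9.0

# Backward pass (chain rule): dL/dw = dL/dy_hat * dy_hat/dw
grad = 2 * (y_hat - y) * x    # 2 * (1 - 4) * 2 = -12.0

# Weight update (gradient descent)
w = w - eta * grad            # 0.5 - 0.1 * (-12) = 1.7
print(w)  # w moves toward 2, where y_hat = 4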
Summary
• Forward pass → Compute predictions & loss
• Backward pass → Compute gradients (error signal flows backward)
• Update → Adjust weights using gradients