What is Machine Learning?
Machine Learning (ML) is a branch of Artificial Intelligence (AI) where computers
learn patterns from data and make decisions or predictions without being explicitly
programmed.
In simple words: Instead of writing rules for every task, we give the machine data +
examples, and it learns by itself.
Definition of ML
Machine Learning is the field of study that gives computers the ability to learn from data and
improve their performance on a task without being explicitly programmed.
– Arthur Samuel (who coined the term "Machine Learning" in 1959)
Real-Life Applications of ML
1. Email Spam Detection
o Gmail automatically detects spam/junk emails.
o Uses ML models trained on examples of spam and non-spam emails.
2. Movie & Music Recommendations
o Netflix, YouTube, Spotify suggest content based on what you previously
watched or listened to.
o ML models analyze user behavior and predict preferences.
3. Medical Diagnosis
o ML helps doctors detect diseases (like cancer from X-rays or diabetes from
reports).
o Trained on large sets of medical data.
4. Self-Driving Cars
o Cars like Tesla use ML to recognize objects (pedestrians, traffic signals, other
vehicles) and make driving decisions.
5. Voice Assistants
o Siri, Alexa, Google Assistant use ML for speech recognition and natural
language understanding.
6. Fraud Detection
o Banks use ML to detect unusual transactions (credit card fraud).
Descriptive vs Predictive Data Tasks

| S. No | Comparison | Descriptive Data Task | Predictive Data Task |
|---|---|---|---|
| 1 | Basic | Determines what happened in the past by analyzing stored data. | Determines what can happen in the future using past data analysis. |
| 2 | Preciseness | Provides accurate data. | Produces results but does not ensure accuracy. |
| 3 | Practical Analysis Methods | Uses standard reporting, query/drill-down, and ad-hoc reporting. | Uses predictive modeling, forecasting, simulation, and alerts. |
| 4 | Data Type | Works mostly with unlabeled data. | Works with labeled data (input + output). |
| 5 | Type of Approach | Follows a reactive approach. | Follows a proactive approach. |
| 6 | Examples | Customer segmentation, Market basket analysis, Social network analysis. | Weather forecasting, Spam email detection, Stock price prediction. |
Supervised vs Unsupervised vs Semi-Supervised vs Reinforcement Learning

| Aspect | Supervised Learning | Unsupervised Learning | Semi-Supervised Learning | Reinforcement Learning |
|---|---|---|---|---|
| Data Used | Labeled data (input + correct output). | Unlabeled data (only input, no output). | Small labeled + large unlabeled data. | No fixed dataset; learns from the environment. |
| Goal | Learn mapping between input and output. | Find hidden patterns or groups in data. | Use both labeled and unlabeled data for better accuracy. | Learn the best strategy by trial and error. |
| Output | Predictions or classifications. | Clusters, patterns, structures. | Improved prediction with less labeled data. | Sequence of actions (policy) to maximize reward. |
| Feedback | Direct supervision with correct answers. | No supervision, only data exploration. | Partial supervision. | Reward/Penalty after each action. |
| Examples | Spam detection, house price prediction, medical diagnosis. | Customer segmentation, market basket analysis. | Medical imaging, speech recognition, web classification. | Self-driving cars, game playing, robotics. |
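To make the first two columns concrete, here is a minimal sketch (assuming scikit-learn is installed; the tiny dataset is made up for illustration) contrasting supervised learning, which is given labels, with unsupervised learning, which must find groups on its own:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: each input comes with a known output label.
X_labeled = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 1, 1, 1]                      # correct answers (labels)
clf = LogisticRegression().fit(X_labeled, y)
print(clf.predict([[2.5], [11.5]]))         # -> [0 1]

# Unsupervised: only inputs are given; the model groups them by itself.
X_unlabeled = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_unlabeled)
print(km.labels_)                           # e.g. [0 0 0 1 1 1] (cluster ids, not true labels)
```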
What is a Feature?
In Machine Learning, a feature is an individual measurable property, attribute, or
characteristic of the data that is used as input to the model.
Features are basically the independent variables (X) that help in predicting the
output (Y).
👉 In simple words: A feature is a column in your dataset.
Examples of Features
1. In a house price prediction dataset:
o Features → Size of house, Number of rooms, Location, Age of house.
o Target → Price of house.
2. In a student performance dataset:
o Features → Hours studied, Attendance, Past grades.
o Target → Final exam marks.
3. In a spam detection dataset:
o Features → Number of links in email, Presence of suspicious words, Length of
email.
o Target → Spam / Not Spam.
Types of Features
1. Numerical Features
o Represent quantities.
o Example: Height, Weight, Salary.
2. Categorical Features
o Represent categories or labels.
o Example: Gender (Male/Female), Blood Group (A, B, O).
3. Boolean Features
o Represent True/False values.
o Example: “Does email contain the word FREE?” (Yes = 1, No = 0).
4. Derived Features (Engineered Features)
o New features created from existing ones.
o Example: BMI (Body Mass Index) derived from Weight and Height.
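As a small illustrative sketch (assuming pandas is installed; the column names and values are hypothetical), all four feature types can sit side by side as columns of one dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 160, 180],      # numerical feature
    "weight_kg": [70, 55, 90],         # numerical feature
    "blood_group": ["A", "B", "O"],    # categorical feature
    "is_smoker": [1, 0, 0],            # boolean feature (Yes = 1, No = 0)
})

# Derived (engineered) feature: BMI = weight / height^2, with height in metres.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
print(df)
```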
Feature Construction in Machine Learning
What is Feature Construction?
Feature Construction is the process of creating new features from the existing raw
data to make the dataset more useful for machine learning models.
It is a part of Feature Engineering.
Goal: Improve model performance by providing more meaningful and informative
inputs.
👉 In simple words: We take the existing data and construct new features that better
represent the problem.
Why Feature Construction is Important?
Raw data often does not contain features in the exact form needed by ML models.
Constructing new features can:
o Improve accuracy of predictions.
o Make hidden patterns more visible.
o Reduce noise and irrelevant information.
o Allow simpler models to perform better.
Examples of Feature Construction
1. From Date/Time Data
o Raw feature: “2025-08-20 18:30”
o Constructed features: Day, Month, Year, Hour, Day of week,
Weekend/Weekday.
o Useful in: Sales forecasting, traffic prediction (see the code sketch after this list).
2. From Text Data
o Raw feature: Customer reviews (text).
o Constructed features: Word counts, Sentiment score (positive/negative),
Presence of keywords.
o Useful in: Sentiment analysis, spam detection.
3. From Numerical Data
o Raw features: Height (cm), Weight (kg).
o Constructed feature: Body Mass Index (BMI = weight / height²).
o Useful in: Health/medical predictions.
4. From Transaction Data
o Raw feature: Purchase history of customer.
o Constructed features: Total spending, Average spending, Frequency of
purchase.
o Useful in: Customer segmentation, fraud detection.
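A minimal pandas sketch of examples 1 and 3 above (the column names and values are made up for illustration) shows how new columns can be constructed from a raw timestamp and from height/weight:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": ["2025-08-20 18:30", "2025-08-23 09:15"],
    "height_cm": [172, 158],
    "weight_kg": [68, 54],
})

# From date/time data: split one raw timestamp into several informative features.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df["day"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek          # 0 = Monday
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

# From numerical data: construct BMI from height and weight.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
print(df)
```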
Steps in Feature Construction
1. Understand the problem & dataset – Know what the model needs.
2. Analyze raw data – Identify which attributes are useful.
3. Create new features – Using domain knowledge (like BMI from height & weight).
4. Test features – Check if new features improve model accuracy.
Feature Selection in Machine Learning
What is Feature Selection?
Feature Selection is the process of choosing only the most relevant features from
the dataset and removing irrelevant or redundant ones.
It helps in reducing the size of data while keeping only the important information.
Goal: Improve model performance (accuracy, speed, interpretability).
👉 In simple words: If your dataset has many columns (features), feature selection picks the
best ones for training the model.
Why Feature Selection is Needed?
1. Reduces Overfitting → Less noise, fewer irrelevant features.
2. Improves Accuracy → Focus on important variables only.
3. Reduces Training Time → Smaller dataset, faster training.
4. Better Interpretability → Easier to understand the model.
Examples of Feature Selection
1. House Price Prediction
o Raw features: Size, Location, Rooms, Flooring type, Owner’s name, House
color, Date of construction.
o Selected features: Size, Location, Rooms → these directly affect price.
o Ignored: Owner’s name, House color (not useful).
2. Spam Email Detection
o Raw features: Email length, Number of links, Words like “FREE”, Sender’s font
style.
o Selected features: Number of links, Suspicious words → useful for
classification.
3. Medical Diagnosis
o Raw features: Blood test results, Age, Patient ID, Room number.
o Selected features: Blood test results, Age.
o Ignored: Patient ID, Room number.
Methods of Feature Selection
1. Filter Methods – Use statistical tests (correlation, chi-square, mutual information) to
select features.
2. Wrapper Methods – Use machine learning models to test different feature subsets
(Forward selection, Backward elimination).
3. Embedded Methods – Feature selection happens during model training (like Lasso
Regression, Decision Trees).
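A hedged sketch of a filter method and an embedded method (assuming scikit-learn; the synthetic dataset and the choice of k = 5 are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 4 of which are actually informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter method: score each feature with a statistical test and keep the best k.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Filter method kept features:", selector.get_support(indices=True))

# Embedded method: Lasso shrinks the coefficients of irrelevant features towards zero.
lasso = Lasso(alpha=0.05).fit(X, y)
print("Lasso non-zero coefficients:", (lasso.coef_ != 0).sum())
```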
Training Dataset vs Testing Dataset
| S. No | Aspect | Training Dataset | Testing Dataset |
|---|---|---|---|
| 1 | Purpose | Used to train the machine learning model by allowing it to learn patterns and rules. | Used to evaluate the performance and accuracy of the trained model. |
| 2 | Data Usage | Model learns from this dataset (adjusts weights, parameters). | Model does not learn; it only predicts outcomes to check performance. |
| 3 | Presence in Training | Always used during the training phase. | Never used during training; only used in the evaluation phase. |
| 4 | Size | Usually larger to allow better learning (e.g., 70–80% of the total data). | Usually smaller, used to validate results (e.g., 20–30% of the total data). |
| 5 | Role in Overfitting | Helps in reducing underfitting when used correctly. | Helps detect overfitting if the model performs well on training data but poorly on testing data. |
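A minimal sketch of the usual 80/20 split (assuming scikit-learn; the regression dataset is synthetic and only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=1000, n_features=4, noise=10, random_state=0)

# 80% of the rows go to training, 20% are held back for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)    # the model learns only from training data
print("Train R^2:", model.score(X_train, y_train))  # performance on data it has seen
print("Test  R^2:", model.score(X_test, y_test))    # performance on unseen data
```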
K-Fold Cross-Validation
K-Fold Cross-Validation is a technique used to evaluate a machine learning model's
performance more reliably.
It reduces the risk of overfitting or underfitting during model selection.
Steps of K-Fold Cross-Validation
1. Split the dataset into k equal-sized subsets (folds).
2. For each iteration (k times):
o Use one fold as the testing set.
o Use the remaining (k-1) folds as the training set.
3. Train the model on the training set and evaluate it on the testing set.
4. Repeat this process k times (each fold becomes the test set once).
5. Calculate the average performance score (e.g., accuracy, precision) of all k runs.
Example (k=5)
Suppose you have 100 data points and choose k = 5:
Each fold will have 20 samples.
Iteration 1: Train on folds 2–5 → Test on fold 1
Iteration 2: Train on folds 1, 3–5 → Test on fold 2
Iteration 3: Train on folds 1–2, 4–5 → Test on fold 3
Iteration 4: Train on folds 1–3, 5 → Test on fold 4
Iteration 5: Train on folds 1–4 → Test on fold 5
Then, you average the results.
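The same 5-fold procedure on 100 synthetic data points can be written as a short loop (a sketch assuming scikit-learn; the choice of classifier is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on the 4 remaining folds (80 samples), test on the held-out fold (20 samples).
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("Fold accuracies:", np.round(scores, 2))
print("Average accuracy:", np.mean(scores))
```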
Advantages
More reliable than a single train-test split.
Uses all data for both training and testing.
Reduces bias in evaluation.
Tabular Representation (k=5)
| Fold | Training Data (Folds Used) | Testing Data (Fold Used) |
|---|---|---|
| 1 | 2, 3, 4, 5 | 1 |
| 2 | 1, 3, 4, 5 | 2 |
| 3 | 1, 2, 4, 5 | 3 |
| 4 | 1, 2, 3, 5 | 4 |
| 5 | 1, 2, 3, 4 | 5 |
Scenario
You are developing a machine learning model to predict house prices based on features like
location, area, number of bedrooms, and amenities.
Real-Life Problem
You have a dataset of 1,000 houses, but you want your model to perform well on new
houses that it hasn’t seen before.
Applying K-Fold Cross Validation (let's say k = 5)
1. Split the dataset into 5 equal parts (folds) → each fold has 200 houses.
2. Iteration 1: Train on folds 2–5 (800 houses), test on fold 1 (200 houses).
3. Iteration 2: Train on folds 1, 3, 4, 5, test on fold 2.
4. Iteration 3: Train on folds 1, 2, 4, 5, test on fold 3.
5. Iteration 4: Train on folds 1, 2, 3, 5, test on fold 4.
6. Iteration 5: Train on folds 1–4, test on fold 5.
After all 5 iterations, average the 5 test scores (e.g., error or R²) to get the final model performance.
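In code, the whole scenario collapses to one call (a sketch assuming scikit-learn, with a synthetic regression dataset standing in for the 1,000 houses):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the house data: 1,000 rows, 4 features (e.g., area, bedrooms, ...).
X, y = make_regression(n_samples=1000, n_features=4, noise=15, random_state=0)

# cross_val_score runs the 5 train/test iterations and returns one score per fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores.round(3))
print("Average R^2:", scores.mean().round(3))
```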
Why use it in real life?
Ensures the model performs well on unseen data (generalization).
Makes full use of your dataset (every house is used for both training and testing).
Reduces the risk of overfitting or underfitting caused by random train-test splits.
Leave-One-Out Cross-Validation (LOOCV)
Definition:
LOOCV is a special case of k-fold cross-validation where the number of folds k = number of
samples (n).
Each time, only one sample is used as the test set, and the remaining n−1
samples are used for training.
This process is repeated for each sample, and the performance is averaged.
How It Works
1. Suppose you have 5 samples: A, B, C, D, E
2. Iteration 1 → Train on B, C, D, E → Test on A
3. Iteration 2 → Train on A, C, D, E → Test on B
4. … and so on, until each sample has been tested once.
5. Average the errors/accuracy from all iterations to get the final result.
Real-Life Example
You are creating a model to predict student exam performance from their study hours:
You have 50 students’ data.
Each time, train on 49 students and test on the remaining 1.
Repeat 50 times, then calculate the average prediction error.
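A minimal sketch of LOOCV (assuming scikit-learn; the study-hours data below is invented and much smaller than 50 students, just to keep the example short):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied
y = np.array([35, 42, 50, 55, 62, 70, 74, 82])           # exam marks

# One iteration per sample: train on n-1 students, test on the remaining one.
loo = LeaveOneOut()
errors = -cross_val_score(LinearRegression(), X, y, cv=loo,
                          scoring="neg_mean_absolute_error")
print("Number of iterations:", len(errors))               # equals n (here 8)
print("Average absolute error:", errors.mean().round(2))
```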
Advantages
Uses maximum data for training in each iteration (n−1 samples).
Reduces bias in performance estimation.
Best for very small datasets.
Disadvantages
Computationally expensive for large datasets (training occurs n times).
Variance in the evaluation may still be high.