NOTES - UNIT 2 - Machine Learning

Linear Regression is a fundamental machine learning algorithm used to predict numerical values by establishing a linear relationship between independent and dependent variables. It is characterized by the formula Y = mX + c, where m is the slope and c is the intercept, and it has advantages such as simplicity and interpretability, but also limitations like sensitivity to outliers and assumptions of linearity. Ordinary Least Squares (OLS) is a method used to estimate the coefficients in linear regression, minimizing the difference between actual and predicted values.

What is Linear Regression?

Linear Regression is one of the simplest and most commonly used machine learning algorithms for
predicting numerical values. It is used to find a relationship between independent (input) and
dependent (output) variables.

Key Concept
Linear Regression assumes that there is a straight-line (linear) relationship between the input variables
(features) and the output variable (target). It tries to fit the best possible line that represents the
relationship between them.
Formula of Linear Regression:

Y=mX+c

Where:
 Y = Predicted output (dependent variable)
 X = Input feature (independent variable)
 m = Slope of the line (shows how much Y changes with X)
 c = Intercept (where the line crosses the Y-axis)

Linear Regression in Machine Learning


⬛ Advantages:

 Simple & Easy to Implement – It is straightforward and easy to understand.
 Interpretable – Clearly shows the relationship between input and output variables.
 Efficient with Small Data – Works well with small to medium-sized datasets.
 Less Computational Power Required – Faster compared to complex models.
 Can Handle Multivariate Cases – Can be extended to multiple input variables (Multiple Linear
Regression).
+ Disadvantages:
 Assumes Linearity – Works only if there is a linear relationship between variables.
 Sensitive to Outliers – Outliers can significantly impact the model.
 Not Suitable for Complex Patterns – Cannot capture non-linear relationships.
 Multicollinearity Issue – If independent variables are highly correlated, the model becomes
unreliable.
 Overfitting Risk – If too many variables are added, it may fit the noise instead of the actual
pattern.
Key Features:
 Uses a Straight Line (y = mx + c) – Predicts output based on a straight-line equation.
 Minimizes Error (Least Squares Method) – Finds the best fit line by reducing the difference
between actual and predicted values.
 Dependent & Independent Variables – Works with one dependent (output) and one/multiple
independent (input) variables.
 Regression Coefficients – Determines how much an independent variable affects the dependent
variable.
 Used in Various Fields – Applied in finance, healthcare, marketing, and other industries for
prediction and trend analysis.

Example of Linear Regression


Scenario:
Let's say you are a shopkeeper and want to predict the sales based on the amount spent on
advertising.
Advertising Cost (X) Sales (Y)
1,000 10,000
2,000 20,000
3,000 30,000
4,000 40,000

Here, you can see that sales increase as advertising cost increases. So, we can fit a straight-line
equation to predict future sales.

Using the Linear Regression Formula


If we find that the best-fit equation is:

Y=10X+0
This means:
 If you spend ₹1,000 on advertising, sales will be ₹10,000.
 If you spend ₹5,000, predicted sales will be ₹50,000 (since Y = 10(5000) + 0 = 50,000).
How Linear Regression Works?
1. Collect Data – Gather input (X) and output (Y) values.
2. Plot Data – Visualize the relationship between X and Y.
3. Find the Best Fit Line – The algorithm calculates m (slope) and c (intercept).
4. Make Predictions – Use the formula to predict Y for any given X.
5. Evaluate Model – Check accuracy using metrics like Mean Squared Error (MSE).

Conclusion
Linear Regression is a simple yet powerful technique used to predict numerical values based on past
data. It tries to find a straight-line relationship between the input (X) and the output (Y). It is used in
sales forecasting, stock price prediction, and many other real-world applications. If the data follows a
linear pattern, Linear Regression can be an accurate and effective model.

REFERENCE YOUTUBE VIDEO - Linear Regression - AK19

Estimating the Coefficients in Machine Learning (Linear Regression)

In Linear Regression, we find the best straight-line equation that predicts the output based on input
values. The equation is:
y=mx+c
How are the Coefficients Estimated?
The goal is to find m and c so that the predicted values are as close as possible to the actual values in the
dataset. This is done using the Least Squares Method (also called Ordinary Least Squares - OLS).
- Steps to Estimate Coefficients:
1. Collect Data – Gather historical data with input (x) and output (y).
2. Find the Best Fit Line – The line that minimizes the error (difference between actual and
predicted values).
3. Use Mathematical Formulas to calculate m and c. For simple linear regression, the
least-squares estimates are:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
c = ȳ − m·x̄

where x̄ and ȳ are the means of x and y.
Features of Estimating Coefficients
✔ Uses Mathematical Formulas – In Linear Regression, we use the Least Squares Method to
find the best-fit line.
✔ Helps in Prediction – By knowing coefficients, we can predict unknown values.
✔ Indicates Strength of Relationship – Larger coefficient values mean a stronger effect of x on y.
✔ Works with Numerical Data – Used for problems where output is a continuous number.
✔ Used in Many Algorithms – Found in Linear Regression, Logistic Regression, and Neural
Networks.
✔ Optimized Using Gradient Descent – In large datasets, coefficients are estimated using
optimization techniques.

✓ Advantages of Estimating Coefficients



✔ Simple & Interpretable – Easy to understand how input affects output.
✔ Identifies Important Variables – Helps in feature selection by checking coefficient impact.
✔ Optimizes Model Performance – Adjusting coefficients improves prediction accuracy.
✔ Works in Various ML Models – Used in Regression, Neural Networks, and other models.
✔ Enables Trend Analysis – Useful for forecasting future trends.

+ Disadvantages of Estimating Coefficients


✖ Assumes Linear Relationship – In some cases, the relationship may not be a straight line.
✖ Sensitive to Outliers – Extreme values can distort coefficient values.
✖ Multicollinearity Issues – If input variables are highly related, coefficient estimation becomes
unstable.
✖ Computationally Expensive – In large datasets, optimization takes more time and resources.
✖ May Not Capture Complex Patterns – Simple coefficient estimation may not work well for
complex datasets.

Suppose we have data on a student's study hours and exam scores:


Study Hours (x) Exam Score (y)
1 50
2 55
3 65
4 70
5 80
Now we need to find m (slope) and c (intercept) so that we can predict what score someone will
get if they study for 6 hours.
Simple Conclusion
 For every extra hour of study, the score increases by about 7.5 points.
 If someone does not study at all (x = 0), the expected score is 41.5.
 We can now predict the future score for any number of study hours!

Summary
 We estimate coefficients (m and c) using the Least Squares Method.
 Slope (m) tells us the relationship strength between x and y.
 Intercept (c) tells us the starting value when x=0.
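As a quick check, here is a minimal Python sketch (assuming numpy is available) that applies the
least-squares formulas to the study-hours table above:

Input:
import numpy as np

# Study hours (x) and exam scores (y) from the table above
x = np.array([1, 2, 3, 4, 5])
y = np.array([50, 55, 65, 70, 80])

# Least Squares: m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², c = ȳ - m·x̄
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

print(m, c)        # 7.5 41.5
print(m * 6 + c)   # predicted score for 6 hours of study: 86.5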
Ordinary Least Squares (OLS) in Machine Learning
What is OLS?
Ordinary Least Squares (OLS) is a method used in linear regression to find the best-fitting line through
a set of data points. It minimizes the difference between the actual values and the predicted values by
reducing the sum of squared errors.
How Does OLS Work?
1. Linear Equation
The relationship between input (X) and output (Y) is represented as:

Y=mX+c

where:
 Y = dependent variable (output)
 X = independent variable (input)
 m = slope (coefficient)
 c = intercept
2. Calculating the Best Line
OLS finds the best values for m and c by minimizing the sum of squared differences between actual and
predicted values.

Why Use OLS?


 Simple and effective for linear relationships
 Minimizes errors to improve predictions
 Widely used in statistics and machine learning
Limitations of OLS
 Assumes a linear relationship between variables
 Sensitive to outliers
 Requires independent variables to not be highly correlated (multicollinearity issue)

Advantages of OLS
 ⬛ Simple and Easy to Understand – OLS is easy to implement and interpret.

 ✓ Efficient for Small Datasets – Works well when the dataset is small and clean.

 ⬛ Minimizes Errors Effectively – Reduces the sum of squared errors, leading to accurate

predictions in linear relationships.
 ✓ Mathematically Proven – Based on strong statistical foundations, making it a reliable

method for linear regression.
 ✓ Used in Many Applications – Common in finance, economics, and machine learning for

predictive modeling.

Disadvantages of OLS
 + Assumes a Linear Relationship – Does not work well if the relationship between variables
is not linear.
 + Sensitive to Outliers – Large errors from outliers can affect the accuracy of predictions.
 + Multicollinearity Issue – If independent variables are highly correlated, OLS may not work
properly.
 + Not Suitable for High-Dimensional Data – Struggles when there are too many variables or
complex relationships.
 + Assumes Constant Variance – Assumes that errors are evenly distributed, which may not
always be true.

Conclusion
OLS is a fundamental technique in machine learning for linear regression. It helps in understanding and
predicting relationships between variables by minimizing errors.

OLS is a simple and powerful technique used for linear regression. It is quite effective for small, clean
datasets, but if the data is nonlinear or has many outliers, its performance suffers. OLS also struggles
with multicollinearity and high-dimensional data. Therefore, OLS should be used when the data is linear
and its assumptions are satisfied. If the data is complex or large, advanced techniques such as Ridge,
Lasso, or Non-Linear Regression may work better.

Assessing the Accuracy of Coefficient Estimates


When we use linear regression or any regression model, we get coefficients (slope values) that
tell us how much the independent variables affect the output. However, it is also important to
check the accuracy of these coefficients to ensure the model is reliable.

Ways to Check the Accuracy of Coefficient Estimates


1. Standard Error (SE)
 Each coefficient has a Standard Error (SE), which tells us how accurate that estimate is.
 A smaller SE means the coefficient's estimate is more accurate.

SE ≈ σ / √n

where,
 σ = residual standard deviation
 n = sample size

2. t-Statistic and p-Value


 The t-Statistic tells us whether a coefficient is significant or not.
 If the p-Value is below 0.05, the coefficient is important and the model is reasonably
accurate.
 If the p-value is high (above 0.05), that coefficient may not be necessary.
3. Confidence Interval (CI)
 A 95% Confidence Interval (CI) gives a range within which the actual coefficient falls with
high probability.
 If the CI is narrow, the coefficient is accurate.
 If the CI is wide, the estimates are uncertain.
4. R-squared
 This tells us how well the model fits the data.
 If R² is high (0.8 or above), the model is strong.
 If R² is low (below 0.5), the model may be weak.
5. Variance Inflation Factor (VIF)
 If the independent variables are highly correlated (multicollinearity), the coefficients can
become unreliable.
 If VIF is above 10, the model may have a multicollinearity issue.

Conclusion

 In machine learning, checking the accuracy of coefficients is important so that we can make
reliable predictions. Metrics such as Standard Error, p-Value, Confidence Interval,
R-squared, and VIF tell us how accurate a model's coefficients are. If accuracy is low, we
should use feature selection, data cleaning, or advanced regression techniques.
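All of these diagnostics are reported together by the statsmodels library; a minimal sketch (assuming
statsmodels is installed, reusing the illustrative study-hours data):

Input:
import numpy as np
import statsmodels.api as sm

# Illustrative data: study hours (x) vs. exam score (y)
x = np.array([1, 2, 3, 4, 5])
y = np.array([50, 55, 65, 70, 80])

X = sm.add_constant(x)        # adds the intercept term c
model = sm.OLS(y, X).fit()    # Ordinary Least Squares fit

# The summary shows each coefficient with its Standard Error, t-statistic,
# p-value, and 95% confidence interval, plus the model's R-squared
print(model.summary())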
How to Check a Model's Accuracy in Machine Learning?

When we train a machine learning model, it's important to check its accuracy to determine whether the
model is predicting correctly or not. Accuracy refers to how many correct predictions the model is
making.

Ways to Check a Model's Accuracy


1. For Classification Models (e.g., Logistic Regression, Decision Tree, etc.)
✓ Accuracy Score

 Tells us how many of the total predictions are correct.

✓ Confusion Matrix

 Shows TP, TN, FP, FN (True Positive, True Negative, False Positive, False Negative), which
helps in checking a classification model's performance.
 Example:
| Actual ↓ | Predicted 0 | Predicted 1 |
| Class 0  | TN          | FP          |
| Class 1  | FN          | TP          |

✓ Precision, Recall and F1-Score

 Precision → How many of the positive predictions are correct.
 Recall → How many of the actual positives the model caught.
 F1-Score → The balance between Precision and Recall.

✓ ROC Curve and AUC Score

 The ROC Curve shows how good the model's decision-making is.
 An AUC Score near 1 means the model is excellent. A score near 0.5 means the model is
predicting randomly.
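A minimal scikit-learn sketch of these metrics (using made-up true labels and predictions for
illustration):

Input:
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Illustrative true labels and model predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

print(accuracy_score(y_true, y_pred))            # fraction of correct predictions
print(confusion_matrix(y_true, y_pred))          # rows: actual, columns: predicted
print(precision_score(y_true, y_pred))           # correct positives / predicted positives
print(recall_score(y_true, y_pred))              # correct positives / actual positives
print(f1_score(y_true, y_pred))                  # balance of precision and recall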

ROC Curve and AUC Score – In Detail


What is the ROC Curve?
The ROC (Receiver Operating Characteristic) Curve is a graphical representation that shows a machine
learning model's classification performance. It plots the relationship between the True Positive Rate
(TPR) and the False Positive Rate (FPR).
What does the ROC Curve do?
 It shows how well the model performs at different threshold values.
 If the model is working on a binary classification problem (such as spam vs. non-spam email),
the ROC Curve shows the quality of the model's predictions.
 Changing the threshold changes the balance of True Positives and False Positives, and the
ROC Curve represents these changes.
What is the AUC (Area Under the Curve) Score?
AUC (Area Under the Curve) is a numerical value that tells us how much area lies under the ROC
Curve. It is a single metric for the model's overall performance.
What the AUC Score Means

- AUC = 1: The model is perfect; all predictions are correct.
- AUC = 0.9: The model is performing very well.
- AUC = 0.7 - 0.8: The model is good, but it can be improved.
- AUC = 0.5: The model is guessing randomly (not good).
- AUC < 0.5: The model is predicting the wrong way (in the wrong direction).

You use the ROC curve and AUC in the following cases:
✓ When the model is solving a binary classification problem (like spam detection, fraud
detection).
✓ When the dataset is imbalanced, because AUC reduces the effect of imbalance.
✓ When you need to tune the threshold, as you can understand the effects of different threshold
values using the ROC curve.
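A short scikit-learn sketch (with made-up labels and scores) of how TPR, FPR, and AUC are computed:

Input:
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative true labels and predicted probabilities for class 1
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]

# One (FPR, TPR) point per threshold; plotting these gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(thresholds)
print(fpr)
print(tpr)
print(roc_auc_score(y_true, y_scores))   # closer to 1 is better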

What is Binary Classification?


Binary Classification is a machine learning problem in which the model predicts only two
categories (0 or 1, True or False, Yes or No).
Example:
 Email Spam Detection → Spam (1) or Not Spam (0)
 Fraud Detection in Banking → Fraud (1) or Not Fraud (0)
 Medical Diagnosis → Disease Present (1) or Not Present (0)

Main Components of Binary Classification


✓ Input Features (X)

 These are the variables (features) used to make the prediction.
 Example: In Email Spam Detection, the subject line, email body, and sender information can
be features.
✓ Target Variable (Y)

 This is the binary output (0 or 1).
 Example: Spam Email = 1, Not Spam = 0
✓ Classification Algorithm

 The machine learning model that decides whether the given input is spam or not.
 Example: Logistic Regression, Decision Tree, Random Forest, SVM, Neural Networks

Evaluation Metrics in Binary Classification


✓ Accuracy Score → How accurately the model is predicting

✓ Confusion Matrix → True Positives, False Positives, etc.

✓ Precision & Recall → Handle False Positives and False Negatives

✓ ROC-AUC Score → Shows the model's decision-making power

Binary Classification is the most common ML problem; it classifies data into two categories.
Algorithms such as Logistic Regression, Decision Tree, Random Forest, and SVM are used for it.
Model performance is checked with accuracy, precision, recall, and the ROC Curve. It solves
real-world problems like spam detection, fraud detection, and medical diagnosis.


Balanced vs. Imbalanced Binary Classification


Balanced Classification

When both classes (0 and 1) have an almost equal number of instances.
◆ Example:
 Spam Detection → 50% spam emails, 50% normal emails
 Medical Diagnosis → 50% of patients have the disease, 50% do not
◆ Model Performance → The accuracy metric is quite useful because the classes are in equal
proportion.

Imbalanced Classification
When one class is much more frequent than the other (lots of data on one side, little on the other).
◆ Example:
 Fraud Detection → 98% of transactions normal, only 2% fraud
 Disease Prediction → 95% of people healthy, only 5% with the disease
◆ Problem:
 Accuracy can be misleading (the model can show high accuracy just by predicting the
majority class).
 Example: 95% accuracy may mean the model is only predicting "healthy" and ignoring the
disease cases (see the sketch below).
◆ Solution:
 Use metrics such as Precision, Recall, F1-Score, and ROC-AUC Score.
 Resampling Techniques: Oversampling (SMOTE) or Undersampling
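A tiny sketch (with made-up labels) of why accuracy misleads on imbalanced data:

Input:
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced data: 95 healthy (0), only 5 with the disease (1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100            # a lazy model that always predicts "healthy"

print(accuracy_score(y_true, y_pred))   # 0.95 – looks great
print(recall_score(y_true, y_pred))     # 0.0 – catches zero disease cases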

Advantages of the Accuracy Check


✓ Performance Measure → Tells us how correctly the model is predicting.

✓ Comparison of Models → Helps in comparing different models.

✓ Quick Evaluation → A simple and fast metric to check correctness.

✓ Useful for Balanced Data → When the data is balanced (both classes roughly equal), accuracy is
useful.
Disadvantages of the Accuracy Check
+ Misleading for Imbalanced Data → If one class dominates the data, accuracy can give a false
impression.
+ Ignores False Positives & False Negatives → It only looks at overall correct predictions and
ignores the impact of wrong predictions.
+ Not Useful for Probabilistic Models → For soft classification (probability-based predictions),
accuracy is not very helpful.
+ Doesn't Show Model Confidence → Accuracy does not show how confident the model is in its
predictions.
Better Alternatives
 Precision, Recall, F1-Score → Better for imbalanced data.
 ROC Curve & AUC Score → Check the model's decision-making power.

Multiple Linear Regression is a popular method in machine learning used to predict a number (output)
based on multiple factors (inputs). It is easy to understand but needs proper data handling, like choosing
the right features, removing unusual values (outliers), and checking if input factors are too similar
(multicollinearity). If the data has a clear pattern, MLR can give accurate and reliable predictions.

What is a Threshold?


A threshold is a cutoff value that decides which class the model's output will fall into. It is mainly used in
probability-based models, where the model gives a score (between 0 and 1), and we have to decide at
what value to assign a specific class.

The default threshold is 0.5, but we can adjust it according to the problem.

Threshold Example: Spam Email Detection


Suppose a machine learning model computes the probability that an email is spam.
 If probability > 0.5 → Email is Spam
 If probability ≤ 0.5 → Email is Not Spam
Email     Model Output (Probability)   Threshold (0.5)   Final Prediction
Email 1   0.7                          above 0.5         Spam
Email 2   0.3                          below 0.5         Not Spam
Email 3   0.9                          above 0.5         Spam
Email 4   0.4                          below 0.5         Not Spam
● Note: To make the spam filter stricter, we can raise the threshold (e.g., to 0.8) so that only
high-confidence emails are classified as spam (see the sketch below).
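A minimal sketch (with the hypothetical probabilities from the table above) of applying a custom
threshold to predicted probabilities:

Input:
import numpy as np

# Hypothetical spam probabilities from a model (e.g., model.predict_proba(X)[:, 1])
probs = np.array([0.7, 0.3, 0.9, 0.4])

default = (probs > 0.5).astype(int)   # default threshold 0.5
strict = (probs > 0.8).astype(int)    # stricter filter: fewer false positives

print(default)   # [1 0 1 0] – Emails 1 and 3 are Spam
print(strict)    # [0 0 1 0] – only Email 3 is Spam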

What Happens When We Change the Threshold?


◆ Lowering the threshold (e.g., to 0.3) →
 More spam emails will be caught
 Some non-spam emails may also be wrongly marked as spam (False Positives can increase)
◆ Raising the threshold (e.g., to 0.7) →
 Only sure-shot spam emails will be caught
 Some real spam emails will not be detected (False Negatives can increase)
The best threshold depends on the nature of the problem!

Threshold in Medical Diagnosis (Example: Cancer Detection)


An AI model predicts the probability that a patient has cancer.
 Default threshold 0.5 → If probability > 0.5, then Cancer Positive; otherwise Negative
 But in cancer detection, false negatives are dangerous, so the threshold is lowered to 0.3 or
0.4 so that more patients get scanned.

A threshold is a cut-off value that decides which category the model's output falls into.
 High Threshold → Fewer false positives, but more false negatives.
 Low Threshold → Fewer false negatives, but more false positives.
The best threshold is problem-specific and is optimized using the ROC Curve/F1-Score!

Threshold: Advantages & Disadvantages


✓ Advantages

 Control over model output → Helps decide when a prediction is positive or negative.
 Customizable → Can be adjusted based on the problem (e.g., high for fraud detection, low for
medical diagnosis).
 Handles imbalanced data → Helps improve model performance on uneven datasets.
 Improves Precision or Recall → Adjusting it can reduce false positives or false negatives.

+ Disadvantages
 No universal threshold → Different problems need different threshold values.
 Trial & error needed → Finding the best threshold requires multiple tests (e.g., using ROC
Curve).
 Impacts errors → A wrong threshold can increase false positives or false negatives.
 Not useful for all models → Some models (like Decision Trees) do not rely on thresholds.

Qualitative Predictors (Categorical Variables)


In machine learning and statistics, predictors (features) are the inputs used to predict an outcome. These
predictors can be of two types:
1. Quantitative Predictors → Features that have numerical values (e.g., age, height, salary).
2. Qualitative Predictors → Features that have categories or labels, not numerical values (e.g.,
gender, color, city).
Qualitative Predictors are also called Categorical Variables because they represent categories
instead of numbers.

Types of Qualitative Predictors


1. Nominal Variables → Categories with no order or ranking
 Example: Color (Red, Blue, Green)
 Example: Gender (Male, Female, Other)
 No category is greater or smaller than another.
2. Ordinal Variables → Categories with a meaningful order or ranking
 Example: Education Level (High School, Bachelor’s, Master’s, PhD)
 Example: Movie Ratings (Bad, Average, Good, Excellent)
 Order matters, but difference between levels is not measurable.
Examples of Qualitative Predictors in Real Life
Example 1: Predicting Car Price
When predicting the price of a car, these could be the features:
Feature Type
Brand (Toyota, BMW, Ford) Qualitative (Nominal)
Car Type (SUV, Sedan, Hatchback) Qualitative (Nominal)
Color (Red, Blue, Black) Qualitative (Nominal)
Fuel Type (Petrol, Diesel, Electric) Qualitative (Nominal)
Customer Rating (1 Star, 2 Stars, 3 Stars, 4 Stars, 5 Stars) Qualitative (Ordinal)
All these are qualitative predictors because they represent categories!

Example 2: Predicting Student Performance


A school wants to predict whether a student will pass or fail. The following features are used:
Feature Type
Study Hours Per Day (Numeric) Quantitative
Class Section (A, B, C) Qualitative (Nominal)
Attendance Level (Low, Medium, High) Qualitative (Ordinal)
Preferred Study Mode (Online, Offline) Qualitative (Nominal)
Attendance Level is an ordinal predictor because "High" is better than "Low."
Handling Qualitative Predictors in Machine Learning

Since machine learning models work with numbers, we need to convert qualitative predictors into
numerical form before using them in models. Some common techniques:
1. Label Encoding → Assigning numbers to categories
 Example: Color (Red = 1, Blue = 2, Green = 3)
2. One-Hot Encoding → Creating separate columns for each category with binary values (0 or 1)
 Example:
o Color_Red → 1 if Red, else 0
o Color_Blue → 1 if Blue, else 0
o Color_Green → 1 if Green, else 0
One-Hot Encoding is preferred when categories have no order (Nominal Variables).
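A short pandas/scikit-learn sketch of both encodings (using illustrative colors):

Input:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label Encoding: one integer per category (implies an order, so use with care)
df["Color_Label"] = LabelEncoder().fit_transform(df["Color"])

# One-Hot Encoding: one binary column per category
one_hot = pd.get_dummies(df["Color"], prefix="Color")

print(df)
print(one_hot)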

Advantages & Disadvantages of Qualitative Predictors


✓ Advantages:
 Captures real-world information like gender, color, type of product.
 Works well in classification problems like spam detection, disease diagnosis.
 Easy to understand and interpret.

+ Disadvantages:
+ Needs preprocessing (encoding) before using in machine learning models.
+ Too many categories (high cardinality) can make models complex and slow.
+ Difficult to compare ordinal categories (e.g., how much "Excellent" is better than "Good" in
ratings?).

Classification:

What is Classification?
Classification is a supervised machine learning technique used to categorize data into predefined
labels (classes). The model learns from labeled data and then predicts which category new data belongs
to.
Example:
 Email Spam Detection: Classify emails as "Spam" or "Not Spam".
 Disease Prediction: Classify patients as "Diabetic" or "Non-Diabetic".

Logistic Regression in Machine Learning – Simple Explanation


What is Logistic Regression?
- Logistic Regression is a supervised learning algorithm used for classification tasks. It
predicts probabilities and classifies data into two or more categories.
Key Idea:
Instead of predicting a continuous value (like Linear Regression), Logistic Regression predicts
the probability that a given input belongs to a certain category.
Example:
 Email Spam Detection: Classify emails as Spam (1) or Not Spam (0).
 Disease Diagnosis: Predict whether a patient has Diabetes (Yes = 1, No = 0).

How Does Logistic Regression Work?


Logistic Regression is a classification algorithm that predicts probabilities and classifies data into
categories. The algorithm uses the Sigmoid Function, which converts any input value into a number
between 0 and 1.

1. What is the Sigmoid Function (Logistic Function)?


The sigmoid function is a mathematical function that converts any value into a number between 0 and 1.
Formula of the Sigmoid Function:

σ(z) = 1 / (1 + e^(−z))

Sigmoid Output
If z is positive, the sigmoid output is close to 1.
If z is negative, the sigmoid output is close to 0.
If z = 0, the output is 0.5.
◆ Example:
Suppose we want to predict whether an email is spam or not.
2. Decision Boundary (How Is the Prediction Made?)


The sigmoid function's output is a probability between 0 and 1, so we set a threshold value (0.5)
that defines the decision boundary.
 If Probability ≥ 0.5 → assign Class 1 (Positive).
 If Probability < 0.5 → assign Class 0 (Negative).
◆ Example:
Suppose we want to predict whether an email is spam or not:
 If probability = 0.8 → classify the email as Spam (1).
 If probability = 0.3 → classify the email as Not Spam (0).
Real-Life Example:
If a bank's fraud detection system uses a logistic regression model:
 If probability = 0.9 (90%) → the transaction is fraud.
 If probability = 0.2 (20%) → the transaction is not fraud.

Types of Logistic Regression


Logistic Regression is divided into 3 types based on the output classes:

1. Binary Logistic Regression


When the model predicts only 2 categories.
✓ Example:

 Exam Result → Pass (1) or Fail (0)
 Email Detection → Spam (1) or Not Spam (0)
Decision Rule:
 Probability ≥ 0.5 → Class 1 (Positive)
 Probability < 0.5 → Class 0 (Negative)

2. Multiclass Logistic Regression


When the model predicts 3 or more categories, but only one class is selected at a time.
✓ Example:

 Weather Prediction → Sunny (0), Rainy (1), Cloudy (2)
 Animal Classification → Dog (0), Cat (1), Elephant (2)
This uses the One vs Rest (OvR) technique, in which each class is solved as a separate binary
problem.

3. Multinomial Logistic Regression


When the model predicts among multiple labels and there is no order among the labels.
✓ Example:

 Fruit Classification → Apple, Banana, Mango
 Job Type → Doctor, Engineer, Teacher
Multinomial Logistic Regression uses the Softmax Function, which normalizes the probabilities so that
the total across all classes is 1 (see the sketch below).
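A minimal numpy sketch of softmax (with illustrative scores for three classes):

Input:
import numpy as np

def softmax(z):
    # Turns raw class scores into probabilities that sum to 1
    exp_z = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])  # e.g., scores for Apple, Banana, Mango
probs = softmax(scores)
print(probs, probs.sum())           # three probabilities summing to 1.0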

Summary
Type Classes Example Function
Binary Logistic Regression 2 Classes Spam/Not Spam Sigmoid
Multiclass Logistic Regression 3+ Classes Weather Prediction One vs Rest
Multinomial Logistic Regression Multiple Labels Fruit Classification Softmax

Conclusion
Logistic Regression is a powerful technique that solves Binary, Multiclass, and Multinomial problems.
Which type to use depends on the nature of the problem and the type of data.

Advantages & Disadvantages of Logistic Regression


✓ Advantages:
 Simple and easy to interpret.
 Works well for binary classification problems.
 Computationally efficient (faster than complex models).

+ Disadvantages:
+ Assumes a linear relationship between input variables and output probability.
+ Not ideal for large datasets with complex patterns (Deep Learning performs better).
+ Sensitive to imbalanced data (If 95% of emails are non-spam, it may always predict non-spam).

Applications of Logistic Regression
 Spam Email Detection
 Medical Diagnosis (Disease Prediction)
 Customer Churn Prediction
 Fraud Detection in Banking
 Sentiment Analysis (Positive/Negative Reviews)

Estimating Regression Coefficients in Machine Learning


Regression is a machine learning technique used to predict a continuous value (such as a house price or
temperature). Regression coefficients are the numbers that tell us how much the input variables
(features) affect the prediction.

Steps to Estimate Regression Coefficients


1. Collect the Data
 We need a dataset with input variables (features) and an output variable (target).
 Example: To predict a house's price, we can take area, number of rooms, and location as
features.
2. Choose a Regression Model
 Common regression models:
o Linear Regression (Simple and Multiple)
o Polynomial Regression
o Logistic Regression (for classification)
3. Define the Regression Equation
 Simple Linear Regression (when there is only one feature):

y = b0 + b1·x

4. Estimate the Coefficients


 With the Least Squares Method
o This method finds the best values b0, b1, ..., bn that minimize the prediction
error.
 With Gradient Descent (when the dataset is large)
o The model updates the coefficients step by step so that the error keeps decreasing
(see the sketch after this list).
5. Evaluate the Model
To check whether the coefficients are good, we use these metrics:
R-squared (R²) – How well the model fits the data.
Mean Squared Error (MSE) – Checks the average prediction error.
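A minimal gradient-descent sketch for simple linear regression (illustrative data and learning rate; not
the only way to tune these):

Input:
import numpy as np

# Illustrative data: roughly y = 2x + 1 plus a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

b0, b1 = 0.0, 0.0   # start with zero coefficients
lr = 0.01           # learning rate

for _ in range(5000):
    error = (b0 + b1 * x) - y
    # Gradients of Mean Squared Error with respect to b0 and b1
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()

# Approaches the least-squares answer for this data: b0 ≈ 1.15, b1 ≈ 1.95
print(b0, b1)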

Conclusion
Regression coefficients play an important role in machine learning models, especially in
regression-based predictions. They tell us how much each input variable (feature) affects the output.
Using methods like Least Squares or Gradient Descent, we can estimate them accurately.
A good regression model is one with low error and high accuracy. However, selecting the right features
and handling the data properly are also essential for the model to give correct results.

Advantages of Regression Coefficients


✓ Interpretability – Coefficients clearly show how each feature affects the output.

✓ Efficiency – Linear regression is computationally fast, even for large datasets.

⬛ Useful for Feature Selection – Helps identify important variables for prediction.

⬛ Works Well with Structured Data – Suitable for numerical and well-organized datasets.

Disadvantages of Regression Coefficients


+ Assumption of Linearity – Works well only if the relationship between variables is linear.
+ Sensitive to Outliers – Extreme values can distort the coefficients.
+ Overfitting in Complex Models – With too many features, the model might fit the training data too
well but fail on new data.
+ Feature Dependency – Highly correlated features can cause incorrect coefficient estimations
(Multicollinearity issue).

Making Predictions in Machine Learning

What is Prediction in Machine Learning?


Prediction in Machine Learning means using a trained model to guess the output for new or unknown
data. The model learns from past data, finds patterns, and then makes predictions.
For example, think about Netflix recommendations. Netflix looks at your watch history and predicts
which movies or shows you might like. It learns from your past choices and suggests content accordingly.

How Machine Learning Makes Predictions?

1. Collecting Data
To make predictions, the first step is to collect relevant data.
 The data should have input (features) and output (labels).
 Example: Suppose we want to predict a student's exam score based on their study hours. Our
data might look like this:
Study Hours (X) Exam Score (Y)
2 50
4 70
6 80
8 90
10 95

Here, Study Hours is the input (X), and Exam Score is the output (Y).

2. Training the Model


 We choose a machine learning algorithm that can learn from the data.
 The model finds a pattern between input and output.
 In our example, it will find a relationship between Study Hours and Exam Scores.
If we use Linear Regression, it will try to fit a straight line to the data:
Score = m × (Study Hours) + c
 Where m is the slope, and c is the intercept.

3. Testing the Model


Once trained, we test the model on new data to check how accurate it is.
 Suppose a new student studies for 7 hours, we use the model to predict the exam score.
 The model calculates and gives an output, say 85 marks.

4. Making Predictions
After training, we use the model to predict new results.

Input :
from sklearn.linear_model import LinearRegression
import numpy as np

# Training data
X_train = np.array([2, 4, 6, 8, 10]).reshape(-1, 1)
y_train = np.array([50, 70, 80, 90, 95])

# Creating model
model = LinearRegression()
model.fit(X_train, y_train) # Train the model

# Predicting for 7 hours of study


X_test = np.array([7]).reshape(-1, 1)
prediction = model.predict(X_test)

print(f"Predicted Score for 7 hours of study: {prediction[0]}")

Output:
Predicted Score for 7 hours of study: 82.5
How Predictions Work in Real Life?

1. Weather Forecasting
 Weather apps use past temperature, humidity, and wind speed data to predict future weather.
2. Stock Market Predictions
 ML models analyze past stock prices and trends to predict future stock values.
3. Disease Detection in Healthcare
 Hospitals use ML to predict if a patient has a disease based on medical history.
4. Spam Email Detection
 Email services analyze past spam emails and predict which new emails are spam.
5. Online Shopping Recommendations
 Amazon suggests products based on your past purchases and search history.

Advantages of Making Predictions with Machine Learning


✓ Fast & Automated – Machines can make predictions quickly.

✓ Accurate – More data improves prediction quality.

✓ Works on Complex Data – Finds patterns that humans might miss.

✓ Reduces Human Effort – Automates decision-making.

Disadvantages of Making Predictions with Machine Learning
+ Needs Good Data – Poor-quality data gives bad predictions.
+ Can be Expensive – Training ML models requires time and computing power.
+ Not Always 100% Accurate – Predictions are estimates, not guarantees.

Machine learning models learn from past data and make future predictions. The more data a model
has, the better its predictions will be. From Netflix recommendations to self-driving cars, ML
predictions are shaping our world.

Multiple Logistic Regression

What is Multiple Logistic Regression?

Multiple Logistic Regression is a machine learning algorithm used for classification problems when
there are multiple independent variables (inputs) but a binary output (Yes/No, 0/1, True/False).

It helps in predicting the probability of an outcome based on several input factors.

Real-Life Example

Predicting Whether a Student Will Pass or Fail an Exam

We want to predict whether a student will pass (1) or fail (0) based on:

1. Study Hours

2. Previous Exam Scores

3. Number of Assignments Completed

Here, we have three independent variables (inputs) and one dependent variable (output: Pass/Fail).
Formula of Logistic Regression

P(Y=1) = 1 / (1 + e^−(b0 + b1·x1 + b2·x2 + ... + bn·xn))

where b0 is the intercept and b1, ..., bn are the coefficients (weights) of the input features.

How Multiple Logistic Regression Works?

Step 1: Collect Data

 Gather data with multiple input features and a binary output.

 Example dataset:

Study Hours   Previous Score   Assignments Done   Pass (1) / Fail (0)
2             50               1                  0
4             60               2                  0
6             75               3                  1
8             85               4                  1
10            95               5                  1

Step 2: Train the Model


 Use logistic regression to find the best-fit curve for the data.

 The model calculates weights (b values) that influence the prediction.

Step 3: Test the Model

 Provide new input values to check if the model correctly predicts pass or fail.

Step 4: Make Predictions (Python Example)

We can implement Multiple Logistic Regression in Python using Scikit-Learn:

INPUT -

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

# Sample dataset

data = {
    'Study_Hours': [2, 4, 6, 8, 10],
    'Previous_Score': [50, 60, 75, 85, 95],
    'Assignments_Done': [1, 2, 3, 4, 5],
    'Pass_Fail': [0, 0, 1, 1, 1]  # 1 = Pass, 0 = Fail
}

# Convert to DataFrame

df = pd.DataFrame(data)

# Define X (input variables) and Y (output variable)

X = df[['Study_Hours', 'Previous_Score', 'Assignments_Done']]

Y = df['Pass_Fail']
# Split data into training and testing sets

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Create and train logistic regression model

model = LogisticRegression()

model.fit(X_train, Y_train)

# Make predictions

Y_pred = model.predict(X_test)

# Check accuracy

accuracy = accuracy_score(Y_test, Y_pred)

print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Predict for a new student

new_student = pd.DataFrame([[7, 80, 3]], columns=X.columns)  # Study Hours = 7, Previous Score = 80, Assignments Done = 3

prediction = model.predict(new_student)

print("Prediction (1 = Pass, 0 = Fail):", prediction[0])

Advantages of Multiple Logistic Regression

✓ Simple & Efficient – Easy to implement and interpret.



✓ Works for Binary Classification – Used for Yes/No predictions.

✓ Handles Multiple Variables – Can analyze multiple factors together.

✓ Probability-Based – Gives a probability score for predictions.

Disadvantages of Multiple Logistic Regression

+ Only Works for Binary Outputs – Cannot predict multiple categories.


+ Assumes Linear Relationship – Requires proper feature selection.
+ Sensitive to Outliers – Large variations in data can impact results.

✓ What is Linear Discriminant Analysis (LDA)?



Linear Discriminant Analysis (LDA) is a supervised machine learning technique used for:

 Classification (main goal)

 Dimensionality reduction (secondary goal)

It finds a new feature space that best separates classes in your data.

Term                               Meaning
Supervised                         We use labeled data (with class labels).
Classes                            Categories or groups (e.g., spam vs. non-spam).
Dimensionality Reduction           Reducing the number of input features (like converting 3D data into 2D).
Discriminant                       A line or plane that separates the classes well.
Maximize Between-Class Variance    Classes should be far apart.
Minimize Within-Class Variance     Data points in the same class should be close to each other.

Why Use LDA?

 To improve classification performance.

 To visualize high-dimensional data.

 To remove irrelevant or noisy features.

 Works well when data is normally distributed and classes have equal covariance.

How Does LDA Work? (Steps in Easy Words)

1. Find the mean (average) of each class
2. See how scattered the points are within each class (within-class scatter)
3. See how far apart the classes are from each other (between-class scatter)
4. Now find a line/axis where:
o The distance between classes is maximum
o Points of the same class stay close to each other
5. Project the data onto that line (dimensionality reduction)
6. Then use a classifier (like KNN, Logistic Regression) to make predictions

Graphical Intuition (Simple Example):

Imagine two classes of points in 2D:


 Red dots on the left

 Blue dots on the right

LDA tries to draw a line (or axis) so that when we project all points on that line:

 Red and Blue are as separate as possible.

 Points from the same color/class stay close.

Applications of LDA:

 Face recognition

 Medical diagnosis (e.g., cancer classification)

 Document classification

 Speech recognition

 Customer segmentation

A Simple Example:

Suppose you have data on students:

 How many hours they studied

 How many hours they slept

 Internal marks

And you have a label:

 Pass or Fail

Now, if you want to visualize this data in a simple graph, LDA helps transform the data so that "Pass"
and "Fail" appear clearly separated.

CONCLUSION

Linear Discriminant Analysis (LDA) is a supervised learning technique used for classification and
dimensionality reduction. LDA finds features that create the maximum difference between the
different classes. This method reduces within-class scatter and increases between-class scatter, so
that the classes separate easily.
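A minimal scikit-learn LDA sketch (using the built-in Iris dataset purely for illustration):

Input:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 4 features, 3 classes

# Classify and reduce the 4D data to 2D in one step
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)      # supervised: uses the class labels y

print(X_2d.shape)        # (150, 2) – reduced feature space
print(lda.score(X, y))   # training accuracy: how well the classes separate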
Bayes' Theorem (Simple Explanation)

What is it?

Bayes' Theorem is a math rule that helps us find the probability of something happening, based on
some known information.

In Machine Learning, we use it to predict which category or class something belongs to.
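For reference, the standard form of the rule, written for classification, is:

P(Class | Data) = P(Data | Class) × P(Class) / P(Data)

That is, the probability of a class given the observed data equals the likelihood of the data under that
class, times the prior probability of the class, divided by the overall probability of the data.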

" In Machine Learning (Example):


ç
;
'

"
'

Imagine we want to predict if a message is spam or not spam.

We use Bayes' Theorem to look at:

 The words in the message (data)

 How often those words appear in spam and not-spam messages

 The overall number of spam vs not-spam messages

Then we calculate and choose the class (spam or not spam) with the highest probability.

Why is it called "Naive" Bayes?

Because it assumes that all features (like words in a message) are independent — which is often not
true, but the method still works well!

✓ Key Points to Remember:


 It's fast and simple.

 Works well even if the assumption (features are independent) isn’t fully true.

 Used for text classification, spam detection, sentiment analysis, etc.


Conclusion:

Bayes' Theorem is a simple and powerful formula that tells us the chance of something happening
when we already know some other information.

In Machine Learning, it is the basis of the Naive Bayes Classifier, which solves classification problems
(such as spam vs. not spam, positive vs. negative review).

This method is fast, simple, and quite accurate, even when its assumptions are not 100% true.

In short:
Bayes' Theorem = Using what we already know to understand new data and predict its class.

If an email says "Win money now!", we check whether these words appear more often in Spam emails
or in Not Spam emails.

Using Bayes' Theorem:

 If these words have appeared more often in spam emails → classify the email as Spam.

 If these words have appeared more often in normal emails → classify the email as Not Spam.

Naive Bayes does this quickly and simply, which is why it is used in spam detection, sentiment
analysis, and medical prediction.

Logic: Guessing new things from already available data.
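A minimal scikit-learn Naive Bayes sketch for spam-style text (made-up messages and labels):

Input:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages and labels (1 = spam, 0 = not spam)
messages = ["win money now", "meeting at noon", "win a free prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]

# Turn words into counts, then apply Bayes' Theorem per class
vec = CountVectorizer()
X = vec.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

test = vec.transform(["win free money"])
print(model.predict(test)[0])        # 1 = spam
print(model.predict_proba(test)[0])  # probability of each class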

What is LDA in Machine Learning?

Linear Discriminant Analysis (LDA) is a dimensionality reduction technique used in classification
problems.

LDA's main goal is:

"To separate the different classes as much as possible while reducing the dimension of the data."

LDA is a supervised algorithm that uses the class labels along with the input data, and creates a new
axis where:

 Points of the same class stay together (within-class scatter = low)

 Points of different classes stay far apart (between-class scatter = high)

Case 1: LDA for p = 1 (1D LDA)

Meaning:

 Input features are 1-dimensional (only one feature).

 LDA simply tries to find a threshold to separate classes on that line.


✓ Simple Steps:

1. Calculate class means (m1, m2)

2. Find threshold = (m1 + m2) / 2

3. Classify new data using this threshold

Example:

If there is only one feature, such as "exam marks", LDA will decide who is "pass" and who is "fail"
based on a cut-off.

Case 2: LDA for p > 1 (Multidimensional LDA)

Meaning:

 Input features are more than one (like p = 2, 3, ...).

 LDA projects data from high-dimensional space to a lower dimension (1D or 2D) while
maximizing class separability.

✓ Simple Steps:

1. Calculate mean vectors for each class

2. Compute within-class and between-class scatter matrices

3. Find the best direction (projection line) using eigenvalues and eigenvectors

4. Project the original data onto this new line or plane

Example:

If you have features such as age, income, education, etc., LDA combines all of them into a new
feature that best separates the "eligible" vs. "not eligible" categories.

Advantages of LDA:

1. ✓ Fast and simple to use

2. ✓ Useful for classification + dimensionality reduction

3. ✓ Works well when classes are linearly separable

4. ✓ Performs better when data is normally distributed

5. ✓ Reduces overfitting by reducing dimensions

Disadvantages of LDA:

1. + Assumes data is normally distributed (which is not always true in the real world)

2. + Works best when classes have the same variance (homoscedasticity)

3. + Not good if classes are not linearly separable

4. + Sensitive to outliers

Summary in One Line:

LDA is a method that compresses the feature space, creating new axes on which the different classes
can be separated easily.

What is QDA (Quadratic Discriminant Analysis)?

QDA is a classification algorithm in machine learning, just like LDA.


But there is one major difference between LDA and QDA:

LDA assumes: All classes have the same covariance (same spread/shape).
QDA assumes: Each class can have a different covariance (different shape).

Definition (Simple):

QDA tries to model the boundary between classes based on the assumption that:

 Data from each class follows a Gaussian distribution.

 But each class has its own covariance matrix (i.e., they can spread differently).

That is why QDA's decision boundary is curved or quadratic in shape — not a straight line like in
LDA.

Feature             LDA                      QDA
Covariance matrix   Same for all classes     Different for each class
Boundary            Linear (straight line)   Quadratic (curved)
Complexity          Less                     More
Flexibility         Less flexible            More flexible
Overfitting risk    Lower                    Higher (if data is small)


Advantages of QDA:

1. ✓ Works better when classes are non-linear or have different shapes

2. ✓ More flexible and accurate than LDA in many real-world problems

3. ✓ Can model complex boundaries

Disadvantages of QDA:

1. + Needs more data to estimate separate covariance matrices

2. + Can overfit when data is limited

3. + Slower than LDA

4. + Not suitable when data is not Gaussian (normal distribution)

Use Cases of QDA:

 Medical diagnosis (e.g., disease prediction)

 Handwritten digit classification (when digits overlap)

 Image recognition (when shapes vary across classes)

Conclusion

QDA is a smart classifier that assumes each class can have a different shape, so it draws curved
boundaries and makes flexible decisions.
But if the data is limited, it can overfit.
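A minimal scikit-learn sketch comparing LDA and QDA (Iris data used purely for illustration):

Input:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance, linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance, curved boundary

print(lda.score(X, y))   # LDA training accuracy
print(qda.score(X, y))   # QDA training accuracy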
