Regression
Regression in machine learning is a technique used to find the relationship between
independent and dependent variables, with the main purpose of predicting a continuous
outcome. A model is trained on known data to learn the patterns that relate the inputs
to the output. Once these patterns are identified, the model can make accurate
predictions for new data points or input values.
Types of Regression
1. Linear Regression
2. Logistic Regression
Linear Regression
Linear regression is a type of supervised machine-learning algorithm that learns from a
labelled dataset and fits the most optimal linear function to the data points, which can
then be used for prediction on new data. It assumes that there is a linear relationship
between the input and output, meaning the output changes at a constant rate as the
input changes. This relationship is represented by a straight line.
For example, suppose we want to predict a student's exam score based on how many
hours they studied. We observe that as students study more hours, their scores go up.
In this example:
Independent variable (input): Hours studied, because it is the factor we control or
observe.
Dependent variable (output): Exam score, because it depends on how many hours
were studied.
Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the best-fit line is
represented by the equation
y=mx+b
Where:
y is the predicted value (dependent variable)
x is the input (independent variable)
m is the slope of the line (how much y changes when x changes)
b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b (intercept)
so that the predicted y values are as close as possible to the actual data points.
Worked example (X = study hours, Y = test score), with Mean(X) = 4 and Mean(Y) = 50:

Study hours (X) | Test score (Y) | Deviation (X) | Deviation (Y) | Product of deviations | Square of deviation for X
2               | 40             | -2            | -10           | 20                    | 4
4               | 50             |  0            |   0           |  0                    | 0
6               | 60             |  2            |  10           | 20                    | 4
Calculate m = Sum of products of deviations / Sum of squares of deviations for X
Calculate b = Mean(Y) − (m × Mean(X))
Calculations
Sum of Product of Deviations = 20 + 0 + 20 = 40
Sum of Square of Deviations for X = 4 + 0 + 4 = 8
m = Sum of Product of Deviations / Sum of Square of Deviations for X
m = 40/8 = 5
b = Mean(Y) − (m × Mean(X)) = 50 − (5 × 4) = 30
Final Regression Equation
Y = 5X + 30
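As a quick check, the same slope and intercept can be recovered with NumPy (a minimal sketch using the three observations from the table above):

import numpy as np

# The three observations from the worked example
x = np.array([2, 4, 6])     # study hours
y = np.array([40, 50, 60])  # test scores

# Deviation formulas used in the hand calculation above
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(m, b)  # 5.0 30.0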
Study_hours.py
import pandas as pd
from sklearn.linear_model import LinearRegression
# Dataset
data = {
'StudyHours': [2, 3, 4, 5, 6, 7, 8],
'Marks': [40, 50, 55, 65, 70, 80, 85]
}
df = pd.DataFrame(data)
# Train model
X = df[['StudyHours']]
y = df['Marks']
regr = LinearRegression()
regr.fit(X, y)
# User input
study_hours = float(input("Enter study hours: "))
# Wrap input in DataFrame with the same column name
input_data = pd.DataFrame({'StudyHours': [study_hours]})
predicted_marks = regr.predict(input_data)
print(f"Study Hours: {study_hours}")
print(f"Predicted Marks: {predicted_marks[0]:.2f}")
Output
Enter study hours: 6
Study Hours: 6.0
Predicted Marks: 71.07
Q1: Fit a linear regression model for data set (x, y): (1, 1.5), (2, 3.0), (3, 4.5), (4, 6.0) and predict y for x = 5
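As a sketch of one way to answer Q1, reusing the pattern from Study_hours.py:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Q1 dataset
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [1.5, 3.0, 4.5, 6.0]})

model = LinearRegression()
model.fit(df[['x']], df['y'])

# The points lie exactly on y = 1.5x, so the prediction for x = 5 is 7.5
print(model.predict(pd.DataFrame({'x': [5]}))[0])  # 7.5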
Non-Linear Regression
Non-linear regression is a type of regression in machine learning where the relationship between the input X
and the output Y is not a straight line. Instead, the data follows a curved pattern.
In such cases, a straight line (linear regression) does not fit well, so we use equations such as polynomial,
exponential, logarithmic, or other non-linear functions.
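A minimal sketch of one common approach, polynomial regression, using made-up curved data (the numbers below are illustrative, not from the notes):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical curved data: y grows roughly with the square of x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 5, 10, 17, 26])  # follows x^2 + 1

# Transform x into [x, x^2] so a linear model can fit the curve
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression()
model.fit(X_poly, y)

print(model.predict(poly.transform([[6]])))  # close to 37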
Multiple Linear Regression
Simple linear regression is a statistical method used for predictive analysis. It models
the relationship between a dependent variable and a single independent variable by
fitting a linear equation to the data. Multiple linear regression extends this concept by
modelling the relationship between a dependent variable and two or more independent
variables. This technique allows us to understand how multiple features collectively
affect the outcome.
Steps for Multiple Linear Regression
The steps to perform multiple linear regression are similar to those of simple linear
regression, but the difference comes in the evaluation process. We can use it to find out
which factor has the highest influence on the predicted output and how the different
variables are related to each other. The equation for multiple linear regression is:
y = β0 + β1X1 + β2X2 + ⋯ + βnXn
Where:
y is the dependent variable
X1, X2, ⋯, Xn are the independent variables
β0 is the intercept
β1, β2, ⋯, βn are the slopes
The goal of the algorithm is to find the best fit line equation that can predict the values
based on the independent variables. A regression model learns from the dataset with
known X and y values and uses it to predict y values for unknown X.
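Before the hand calculation in Q1 below, here is a minimal sketch of multiple linear regression in scikit-learn, using made-up data with two features (the column names and numbers are illustrative):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: marks predicted from two features
df = pd.DataFrame({
    'StudyHours': [2, 3, 5, 7, 8],
    'SleepHours': [8, 7, 6, 5, 6],
    'Marks':      [45, 52, 65, 78, 82]
})

model = LinearRegression()
model.fit(df[['StudyHours', 'SleepHours']], df['Marks'])

# Intercept (beta0) and one slope per feature (beta1, beta2)
print(model.intercept_, model.coef_)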
Q1. Find β0, β1, β2 using the given data.

Product 1 Sales (X1) | Product 2 Sales (X2) | Weekly Sales (Y)
1                    | 4                    | 1
2                    | 5                    | 6
3                    | 8                    | 8
4                    | 2                    | 12

Adding a column of 1s for the intercept gives the design matrix X and target vector Y:

    [1 1 4]        [ 1]
X = [1 2 5]    Y = [ 6]
    [1 3 8]        [ 8]
    [1 4 2]        [12]
Step 1. Transpose of X

     [1 1 1 1]
X' = [1 2 3 4]
     [4 5 8 2]
Step 2. Multiply X' by X

       [ 4  10  19]
X'·X = [10  30  46]
       [19  46 109]
Step 3. Multiply X' by Y

       [ 27]
X'·Y = [ 85]
       [122]
Step 4. Inverse of X'·X

            [ 3.153  −0.590  −0.300]
(X'·X)^-1 = [−0.590   0.204   0.016]
            [−0.300   0.016   0.054]
Step 5. Substitute into the equation β = (X'·X)^-1 · X'·Y

    [ 3.153  −0.590  −0.300]   [ 27]   [−1.699]
β = [−0.590   0.204   0.016] · [ 85] = [ 3.483]
    [−0.300   0.016   0.054]   [122]   [−0.054]

β0 = −1.699, β1 = 3.483, β2 = −0.054
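The whole hand calculation can be checked in a few lines with NumPy's linear algebra routines (a sketch using the matrices above):

import numpy as np

# Design matrix (intercept column of 1s, then X1 and X2) and targets
X = np.array([[1, 1, 4],
              [1, 2, 5],
              [1, 3, 8],
              [1, 4, 2]], dtype=float)
Y = np.array([1, 6, 8, 12], dtype=float)

# Normal equation: beta = (X'X)^-1 X'Y
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
print(beta)  # approximately [-1.70  3.48 -0.05]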
Logistic Regression
Logistic regression is a type of supervised machine-learning algorithm that also learns from labelled datasets but is
mainly used for classification problems instead of predicting continuous values. It assumes that the output is categorical,
such as Yes/No or 0/1, and maps the data points using a logistic function (sigmoid curve) to estimate probabilities
between 0 and 1. This probability is then used to decide the class of new data points. For example, we may want to
predict whether a student will pass or fail based on how many hours they studied. We observe that as study hours
increase, the probability of passing also increases, which is captured by the S-shaped logistic curve.
Sigmoid Function
Y = 1 / (1 + e^−(a0 + a1·X))
Where:
a0 → Intercept (similar to b in linear regression).
a1 → Coefficient/weight of the feature X.
X → Input (independent variable).
Output → A probability between 0 and 1.
Example :
Study Hours (X) Output(Y) = Pass/Fail
2 0
3 0
4 0
5 1
6 1
7 1
8 1
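This table can also be fitted directly with scikit-learn (a sketch; the learned coefficients will generally differ from the a0 and a1 given below):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Pass/Fail data from the table above (1 = Pass, 0 = Fail)
df = pd.DataFrame({
    'StudyHours': [2, 3, 4, 5, 6, 7, 8],
    'Pass':       [0, 0, 0, 1, 1, 1, 1]
})

clf = LogisticRegression()
clf.fit(df[['StudyHours']], df['Pass'])

# Probability of passing with 5 study hours, and the predicted class
new_point = pd.DataFrame({'StudyHours': [5]})
print(clf.predict_proba(new_point)[0][1])
print(clf.predict(new_point)[0])  # 1 means Pass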
Given :
a0 = -1.5
a1 = 0.6
Input (X) = 5
y = 1 / (1 + e^−(a0 + a1·X))
y = 1 / (1 + e^−(−1.5 + 0.6×5))
y = 1 / (1 + e^−1.5)
y = 1 / (1 + 0.2231)
y = 1 / 1.2231
y = 0.8175
Note: Since the value of Y is greater than 0.5, the student is predicted to Pass.
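The same sigmoid calculation in Python (a sketch; the small difference from 0.8175 comes from rounding e^−1.5 to 0.2231 in the hand calculation):

import math

# Coefficients and input from the worked example
a0, a1 = -1.5, 0.6
X = 5

# Sigmoid function: probability that the student passes
y = 1 / (1 + math.exp(-(a0 + a1 * X)))
print(round(y, 4))                    # 0.8176 (hand calculation: 0.8175)
print("Pass" if y > 0.5 else "Fail")  # Pass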
What is Bayes theorem?
Bayes' theorem is a fundamental concept in probability theory that plays a crucial role in
various machine learning algorithms, especially in the fields of Bayesian statistics and
probabilistic modelling. It provides a way to update probabilities based on new evidence
or information. In the context of machine learning, Bayes' theorem is often used in
Bayesian inference and probabilistic models.
The theorem can be mathematically expressed as:
P(A|B) = P(B|A) · P(A) / P(B)
Where:
P(A∣B) → Posterior Probability
The probability that event A is true after seeing evidence B.
P(B∣A) → Likelihood
The probability of seeing evidence B if hypothesis A is true.
P(A) → Prior Probability
The probability we assign to A before seeing any evidence.
P(B) → Marginal Probability (Evidence Probability)
The total probability of observing evidence B, across all possible hypotheses.
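As a small numeric illustration with made-up numbers (the classic disease-test setup; none of these figures are from the notes):

# Hypothetical numbers: 1% of people have a disease (prior),
# the test detects it 90% of the time (likelihood),
# and 5% of healthy people also test positive.
p_A = 0.01               # P(A): prior
p_B_given_A = 0.90       # P(B|A): likelihood
p_B_given_not_A = 0.05   # P(B|not A)

# P(B): total probability of a positive test, across both hypotheses
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # about 0.154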