REGRESSION
Regression is a statistical approach used to analyze the relationship
between a dependent variable (target variable) and one or more
independent variables (predictor variables). The objective is to determine
the most suitable function that characterizes the connection between
these variables.
It seeks to find the best-fitting model, which can be utilized to make
predictions or draw conclusions.
Regression in Machine Learning
Regression is a supervised machine learning technique used to predict the value of
the dependent variable for new, unseen data. It models the relationship
between the input features and the target variable, allowing for the
estimation or prediction of numerical values.
A regression problem arises when the output variable is a real or
continuous value, such as “salary” or “weight”. Many different models can
be used; the simplest is linear regression, which tries to fit the data with the
best hyperplane that passes through the points.
Regression Algorithms
There are many different types of regression algorithms, but some of the most
common include:
● Linear Regression
○ Linear regression is one of the simplest and most widely used statistical
models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the
dependent variable is proportional to the change in the independent variables.
● Polynomial Regression
○ Polynomial regression is used to model nonlinear relationships between
the dependent variable and the independent variables. It adds polynomial
terms to the linear regression model to capture more complex relationships.
● Support Vector Regression (SVR)
○ Support vector regression (SVR) is a type of regression algorithm that is
based on the support vector machine (SVM) algorithm. SVM is primarily used
for classification tasks, but it can also be adapted for regression. SVR works
by finding a function that fits the data within a specified error margin (ε),
penalizing only predictions that fall outside that margin.
● Decision Tree Regression
○ Decision tree regression is a type of regression algorithm that builds a
decision tree to predict the target value. A decision tree is a tree-like structure
that consists of nodes and branches. Each node represents a decision, and
each branch represents the outcome of that decision. The goal of decision tree
regression is to build a tree that can accurately predict the target value for
new data points.
● Random Forest Regression
○ Random forest regression is an ensemble method that combines multiple
decision trees to predict the target value. Ensemble methods are a type of
machine learning algorithm that combines multiple models to improve the
performance of the overall model. Random forest regression works by
building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging
the predictions of all of the trees.
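The sketch below is a minimal illustration, assuming scikit-learn and NumPy are installed; the toy data and hyperparameters are made up. It shows how each of the regression algorithms listed above can be instantiated, fitted, and used for prediction.

```python
# A toy comparison of the regression algorithms listed above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))            # one input feature (made up)
y = 3.0 * X.ravel() + rng.normal(0, 1, 100)      # noisy linear target

models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "svr": SVR(kernel="rbf", C=10.0),
    "decision_tree": DecisionTreeRegressor(max_depth=4),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)                               # learn from the labelled data
    print(name, model.predict([[5.0]]))           # predict for a new data point
```

In practice, the best choice among these models depends on how much data is available and how nonlinear the underlying relationship is.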
Applications of Regression
● Predicting prices: For example, a regression model could be used
to predict the price of a house based on its size, location, and
other features.
● Forecasting trends: For example, a regression model could be
used to forecast the sales of a product based on historical sales
data and economic indicators.
● Identifying risk factors: For example, a regression model could
be used to identify risk factors for heart disease based on patient
data.
● Making decisions: For example, a regression model could be
used to recommend which investment to buy based on market
data.
Machine Learning is a branch of Artificial intelligence that focuses on the
development of algorithms and statistical models that can learn from and
make predictions on data. Linear regression is also a machine-learning
algorithm, more specifically a supervised machine-learning algorithm, that
learns from labelled datasets and maps the data points to the most optimized
linear function, which can then be used for prediction on new datasets.
First, we should know what a supervised machine learning algorithm is.
It is a type of machine learning where the algorithm learns from labelled
data, meaning a dataset whose target values are already known.
Supervised learning has two types:
● Classification: predicts the class of a data point based on the
independent input variables, where the class is a categorical or discrete
value, for example whether an image of an animal shows a cat or a dog
(a sketch contrasting the two appears after this list).
● Regression: predicts continuous output variables based on the
independent input variables, for example predicting house prices from
parameters such as house age, distance from the main road, location, and area.
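A minimal sketch, assuming scikit-learn is available (the toy data, feature values, and prices are made up), contrasting the two: the classifier returns a discrete label, while the regressor returns a continuous number.

```python
# Classification: discrete target (0 = cat, 1 = dog) from two made-up features.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X_cls = [[4.0, 30.0], [4.5, 32.0], [25.0, 60.0], [30.0, 65.0]]
y_cls = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[5.0, 31.0]]))      # a class label (0 or 1)

# Regression: continuous target (house price) from age, distance, and area.
X_reg = [[5, 2.0, 120], [20, 5.0, 90], [1, 1.0, 150], [35, 8.0, 70]]
y_reg = [300_000, 180_000, 420_000, 120_000]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[10, 3.0, 110]]))   # a continuous price estimate
```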
Types of Regression Techniques
Along with the development of the machine learning domain, regression
analysis techniques have gained popularity and have developed well
beyond the simple y = mx + c. There are several types of regression techniques,
each suited to different types of data and different types of relationships.
The main types of regression techniques are:
1. Linear Regression
2. Polynomial Regression
3. Stepwise Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Support Vector Regression
7. Ridge Regression
8. Lasso Regression
9. ElasticNet Regression
10. Bayesian Linear Regression
Linear Regression
Linear regression is used for predictive analysis. It is a linear approach for
modeling the relationship between the criterion or scalar response and one
or more predictors or explanatory variables. Linear regression focuses on the
conditional probability distribution of the response given the values of the
predictors. With many predictors, there is a danger of overfitting. The formulas
for simple and multiple linear regression are given below.
This is the most basic form of regression analysis and is used to model a
linear relationship between a single dependent variable and one or more
independent variables.
Below, a linear regression model is instantiated to fit a linear relationship
between input features (X) and target values (y), as a simple demonstration
of the approach.
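A minimal sketch of that demonstration, assuming scikit-learn is installed; the small data set is made up.

```python
# A linear regression model fitted to a tiny made-up data set.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # input feature
y = np.array([2.1, 4.2, 6.1, 8.0, 9.9])             # target values (roughly 2x)

model = LinearRegression()
model.fit(X, y)                         # estimate the intercept and the slope
print(model.intercept_, model.coef_)    # fitted parameters (β0 and β1)
print(model.predict([[6.0]]))           # prediction for a new data point
```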
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple
linear regression is:
y = β₀ + β₁X
where:
● y is the dependent variable
● X is the independent variable
● β₀ is the intercept
● β₁ is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:
y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
where:
● y is the dependent variable
● X₁, X₂, …, Xₙ are the independent variables
● β₀ is the intercept
● β₁, β₂, …, βₙ are the slopes
The goal of the algorithm is to find the best-fit line equation that can predict the
values based on the independent variables.
In regression, a set of records is present with X and Y values, and these
values are used to learn a function; if you then want to predict Y for a new,
unseen X, this learned function can be used. Because the target is continuous,
the function must predict a continuous value of Y given X as the independent
features, as illustrated in the sketch below.
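The sketch below (NumPy only, data made up for illustration) shows one way such a function can be learned for multiple linear regression: the coefficients are obtained by ordinary least squares and then used to predict Y for an unseen record.

```python
# Multiple linear regression by ordinary least squares.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])              # two independent variables X1, X2
y = np.array([6.0, 5.5, 9.0, 13.0])     # dependent variable Y

X_design = np.column_stack([np.ones(len(X)), X])     # prepend a column for β0
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # least-squares solution
print(beta)                                          # [β0, β1, β2]

x_new = np.array([1.0, 2.5, 1.0])       # [1, X1, X2] for an unseen record
print(x_new @ beta)                      # predicted Y for that record
```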
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of
any linear regression model. These assessment metrics often give an
indication of how well the model is producing the observed outputs.
The most common measurements are:
Mean Squared Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the
average of the squared differences between the actual and predicted
values for all the data points. The difference is squared to ensure that
negative and positive differences don’t cancel each other out.
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Here,
● n is the number of data points.
● yᵢ is the actual or observed value for the ith data point.
● ŷᵢ is the predicted value for the ith data point.
MSE is a way to quantify the accuracy of a model’s predictions. MSE is
sensitive to outliers, as large errors contribute significantly to the overall
score.
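A minimal sketch (NumPy only, values made up for illustration) computing MSE exactly as the formula above describes.

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (made up)
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])   # predicted values (made up)

mse = np.mean((y_actual - y_pred) ** 2)      # average of squared differences
print(mse)                                   # 0.375
```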
Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy
of a regression model. MAE measures the average absolute difference
between the predicted values and actual values.
Mathematically, MAE is expressed as:
MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|
Here,
● n is the number of observations
● Yᵢ represents the actual values.
● Ŷᵢ represents the predicted values.
A lower MAE value indicates better model performance. MAE is less sensitive
to outliers than MSE because it uses absolute rather than squared differences.
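The same made-up values, this time used to compute MAE; because the errors enter as absolute values, they cannot cancel each other out.

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_actual - y_pred))   # absolute errors cannot cancel out
print(mae)                                 # 0.5
```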
Root Mean Squared Error (RMSE)
The square root of the residuals’ variance is the Root Mean Squared Error.
It describes how well the observed data points match the expected
values, or the model’s absolute fit to the data.
In mathematical notation, it can be expressed as:
RMSE = √(RSS / n) = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n )
Rather than dividing by the total number of data points, the sum of the squared
residuals can be divided by the number of degrees of freedom to obtain an
unbiased estimate; this figure is then referred to as the Residual Standard
Error (RSE).
In mathematical notation, it can be expressed as:
RSE = √( RSS / (n − 2) ) = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) )
RMSE is not as good a metric as R-squared: its value depends on the units of
the variables (it is not a normalized measure), so it can fluctuate when the
units of the variables vary.
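A sketch contrasting RMSE and RSE on the same made-up values; the only difference is the divisor (n versus the degrees of freedom, n − 2).

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

rss  = np.sum((y_actual - y_pred) ** 2)   # residual sum of squares
n    = len(y_actual)
rmse = np.sqrt(rss / n)                   # divide by the number of points
rse  = np.sqrt(rss / (n - 2))             # divide by the degrees of freedom
print(rmse, rse)
```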
Coefficient of Determination (R-squared)
R-Squared is a statistic that indicates how much variation the developed
model can explain or capture. It is always in the range of 0 to 1. In general,
the better the model matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:
R² = 1 − (RSS / TSS)
● Residual Sum of Squares (RSS): The sum of the squares of the residuals
for each data point in the plot or data is known as the residual sum of
squares, or RSS. It is a measurement of the difference between the observed
output and the predicted output.
RSS = Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ)²
● Total Sum of Squares (TSS): The sum of the squared deviations of the data
points from the mean of the response variable is known as the total sum of
squares, or TSS.
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
The R-squared metric is a measure of the proportion of variance in the
dependent variable that is explained by the independent variables in the
model.
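A sketch computing R² from RSS and TSS with NumPy, reusing the made-up values from the earlier metric sketches.

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

rss = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
r2  = 1 - rss / tss
print(r2)                                         # close to 1 means a good fit
```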
Adjusted R-Squared Error
Adjusted R² measures the proportion of variance in the dependent
variable that is explained by the independent variables in a regression model.
Adjusted R-square accounts for the number of predictors in the model and
penalizes it for including irrelevant predictors that do not contribute
significantly to explaining the variance in the dependent variable.
Mathematically, adjusted R² is expressed as:
Adjusted R² = 1 − [ (1 − R²)(n − 1) / (n − k − 1) ]
Here,
● n is the number of observations
● k is the number of predictors in the model
● R² is the coefficient of determination
Adjusted R-square helps to prevent overfitting. It penalizes the model for
additional predictors that do not contribute significantly to explaining
the variance in the dependent variable.
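A small sketch of the adjusted R² calculation, using an illustrative R² value and made-up values for n and k.

```python
n, k = 50, 3          # observations and predictors (illustrative values)
r2 = 0.90             # an illustrative R² from some fitted model

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adjusted_r2)    # slightly below R², since extra predictors are penalized
```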
FINDING THE LINE:
A linear regression lets you use one variable to predict another variable’s value.
Regression line formula
The regression line formula used in statistics is the same as the one used in algebra:
y = mx + b
Where: x = horizontal axis
y = vertical axis
m = the slope of the line (how steep it is)
b = the y-intercept (where the line crosses the Y axis)
For any data set, the slope m and the y-intercept b are chosen so that the line fits
the observed points as closely as possible, typically by the method of least squares,
as in the sketch below.
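A minimal sketch (NumPy only, points made up for illustration) of fitting y = mx + b to a small data set by least squares.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data set
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

m, b = np.polyfit(x, y, deg=1)   # slope and y-intercept of the best-fit line
print(m, b)                      # roughly 2 and 0 for these points
print(m * 6.0 + b)               # use the line to predict y at a new x
```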
CORRELATION COEFFICIENT
Correlation is a statistical measure that describes the extent to which two
variables are related to each other. It quantifies the direction and strength
of the linear relationship between variables. Generally, a correlation
between any two variables is of three types that include:
● Positive Correlation
● Zero Correlation
● Negative Correlation
The Pearson Correlation Coefficient, denoted as r, is a statistical
measure that calculates the strength and direction of the linear
relationship between two variables on a scatterplot. The value of r
ranges between -1 and 1, where:
● 1 indicates a perfect positive linear relationship,
● -1 indicates a perfect negative linear relationship, and
● 0 indicates no linear relationship between the variables.
Pearson’s Correlation Coefficient Formula
Karl Pearson’s correlation coefficient formula is the most commonly used
and the most popular formula to get the statistical correlation coefficient.
It is denoted with the lowercase “r”. The formula for Pearson’s
correlation coefficient is shown below:
r = [n(∑xy) − (∑x)(∑y)] / √( [n∑x² − (∑x)²] [n∑y² − (∑y)²] )
The full name for Pearson’s correlation coefficient formula is Pearson’s
Product Moment correlation (PPMC). It helps in displaying the Linear
relationship between the two sets of the data.
Pearson’s correlation measures both the strength of a linear relationship
between two variables (given by the coefficient r, a value between -1 and +1)
and its statistical existence (given by the p-value); if the outcome is
significant, we conclude that the correlation exists.
Cohen (1988) says that an absolute value of r of 0.5 is classified as large,
an absolute value of 0.3 is classified as medium and an absolute value of
0.1 is classified as small.
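The sketch below (made-up data, assuming NumPy and SciPy are installed) computes r directly from the formula above and then with scipy.stats.pearsonr, which also returns the p-value used to judge significance.

```python
import numpy as np
from scipy import stats

x = np.array([43.0, 21.0, 25.0, 42.0, 57.0, 59.0])   # made-up data
y = np.array([99.0, 65.0, 79.0, 75.0, 87.0, 81.0])

n = len(x)
r_formula = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2)
)

r_scipy, p_value = stats.pearsonr(x, y)   # also returns the p-value
print(r_formula, r_scipy, p_value)        # the two r values agree
```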
The interpretation of the Pearson’s correlation coefficient is as follows:
● A correlation coefficient of 1 means that for every positive increase in
one variable, there is a positive increase of a fixed proportion in the other.
For example, shoe size goes up in perfect correlation with foot length.
● If the correlation coefficient is 0, it indicates that there is no linear
relationship between the variables.
● A correlation coefficient of -1 means that for every positive increase in
one variable, there is a decrease of a fixed proportion in the other. For
example, the amount of water in a tank decreases in perfect correlation
with the time the tap has been left running.
The Pearson correlation coefficient essentially captures how closely the
data points tend to follow a straight line when plotted together. It’s
important to remember that correlation doesn’t imply causation – just
because two variables are related, it doesn’t mean one causes the
change in the other.
Pearson Correlation Coefficient Table
● 0 < r ≤ 1 (Positive correlation): an increase in one variable associates with
an increase in the other. Illustrative example: Study Time vs. Test Scores; more
hours spent studying tends to lead to higher test scores.
● r = 0 (No correlation): no discernible relationship between the changes in the
two variables. Illustrative example: Shoe Size vs. Reading Skill; a person’s shoe
size doesn’t predict their ability to read.
● -1 ≤ r < 0 (Negative correlation): an increase in one variable associates with
a decrease in the other. Illustrative example: Outdoor Temperature vs. Home
Heating Cost; as the outdoor temperature decreases, heating costs in the home
increase.
Pearson Correlation Coefficient Interpretation
Interpreting the Pearson correlation coefficient (r) involves assessing the
correlation strength, direction, and correlation significance of the relationship
between two variables. Here’s a guide to interpreting r:
1. Strength of Relationship:
● Close to +1: Indicates a strong positive linear relationship. As one
variable increases, the other tends to increase proportionally.
● Close to -1: Suggests a strong negative linear relationship. As one
variable increases, the other tends to decrease proportionally.
● Close to 0: Implies a weak or no linear relationship. Changes in one
variable do not consistently predict changes in the other.
2. Direction of Relationship:
● Positive r: Both variables tend to increase or decrease together.
● Negative r: One variable tends to increase as the other decreases, and
vice versa.
3. Significance:
● Statistical significance indicates whether the observed correlation
coefficient is likely to occur due to chance.
● Significance is typically assessed using a hypothesis test, such as the
t-test for correlation coefficient, with the null hypothesis stating that the true
correlation coefficient in the population is zero.
● If the p-value is less than the chosen significance level (e.g., 0.05), the
correlation is considered statistically significant.
4. Scatterplot Examination:
● Visual inspection of a scatterplot can provide additional insights into
the relationship between variables.
● A scatterplot allows you to assess the linearity, directionality, and
presence of outliers, complementing the numerical interpretation of r.
5. Caution:
● Correlation does not imply causation. Even if a strong correlation is
observed between two variables, it does not necessarily mean that changes in
one variable cause changes in the other.
● Other factors, such as confounding variables or omitted variables,
may influence the observed correlation.
6. Sample Size:
● Larger sample sizes tend to provide more reliable estimates of
correlation coefficients, reducing the likelihood of obtaining spurious
correlations.
7. Context Dependence:
● The interpretation of r should consider the specific context and
subject matter of the study. What is considered a strong or weak correlation may
vary depending on the field of research and the variables under investigation.
Example: Calculate the correlation coefficient for the following table with the
help of Pearson’s correlation coefficient formula: