REGRESSION
Regression is a statistical approach used to analyze the relationship
between a dependent variable (target variable) and one or more
independent variables (predictor variables). The objective is to determine
the most suitable function that characterizes the connection between
these variables.
It seeks to find the best-fitting model, which can be utilized to make
predictions or draw conclusions.
Regression in Machine Learning
Regression is a supervised machine learning technique used to predict the value of
the dependent variable for new, unseen data. It models the relationship
between the input features and the target variable, allowing for the
estimation or prediction of numerical values.
A regression problem arises when the output variable is a real or
continuous value, such as “salary” or “weight”. Many different models can
be used; the simplest is linear regression, which tries to fit the data with the
best hyperplane that passes through the points.
Regression Algorithms
There are many different types of regression algorithms, but some of the most
common include:
● Linear Regression
○ Linear regression is one of the simplest and most widely used statistical
models. This assumes that there is a linear relationship between the
independent and dependent variables. This means that the change in the
dependent variable is proportional to the change in the independent variables.
● Polynomial Regression
○ Polynomial regression is used to model nonlinear relationships between
the dependent variable and the independent variables. It adds polynomial
terms to the linear regression model to capture more complex relationships.
● Support Vector Regression (SVR)
○ Support vector regression (SVR) is a type of regression algorithm that is
based on the support vector machine (SVM) algorithm. SVM is primarily used
for classification tasks, but it can also be adapted for regression. SVR works
by finding a function that fits the data within a specified error margin (ε),
penalizing only predictions that fall outside that margin.
● Decision Tree Regression
○ Decision tree regression is a type of regression algorithm that builds a
decision tree to predict the target value. A decision tree is a tree-like structure
that consists of nodes and branches. Each node represents a decision, and
each branch represents the outcome of that decision. The goal of decision tree
regression is to build a tree that can accurately predict the target value for
new data points.
● Random Forest Regression
○ Random forest regression is an ensemble method that combines multiple
decision trees to predict the target value. Ensemble methods are a type of
machine learning algorithm that combines multiple models to improve the
performance of the overall model. Random forest regression works by
building a large number of decision trees, each of which is trained on a
different subset of the training data. The final prediction is made by averaging
the predictions of all of the trees.
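The sketch below is a minimal illustration, assuming scikit-learn and NumPy are installed; the toy data and hyperparameters are made up. It shows how each of the regression algorithms listed above can be instantiated, fitted, and used for prediction.

```python
# A toy comparison of the regression algorithms listed above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))            # one input feature (made up)
y = 3.0 * X.ravel() + rng.normal(0, 1, 100)      # noisy linear target

models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "svr": SVR(kernel="rbf", C=10.0),
    "decision_tree": DecisionTreeRegressor(max_depth=4),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)                               # learn from the labelled data
    print(name, model.predict([[5.0]]))           # predict for a new data point
```

In practice, the best choice among these models depends on how much data is available and how nonlinear the underlying relationship is.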
Applications of Regression
● Predicting prices: For example, a regression model could be used
to predict the price of a house based on its size, location, and
other features.
● Forecasting trends: For example, a regression model could be
used to forecast the sales of a product based on historical sales
data and economic indicators.
● Identifying risk factors: For example, a regression model could
be used to identify risk factors for heart disease based on patient
data.
● Making decisions: For example, a regression model could be
used to recommend which investment to buy based on market
data.
Machine Learning is a branch of Artificial intelligence that focuses on the
development of algorithms and statistical models that can learn from and
make predictions on data. Linear regression is also a machine-learning
algorithm, more specifically a supervised machine-learning algorithm, that
learns from labelled datasets and maps the data points to the most optimized
linear function, which can then be used for prediction on new datasets.
First, we should know what a supervised machine learning algorithm is.
It is a type of machine learning where the algorithm learns from labelled
data, meaning a dataset whose target values are already known.
Supervised learning has two types:
● Classification: predicts the class of a data point based on the
independent input variables, where the class is a categorical or discrete
value, for example whether an image of an animal shows a cat or a dog
(a sketch contrasting the two appears after this list).
● Regression: predicts continuous output variables based on the
independent input variables, for example predicting house prices from
parameters such as house age, distance from the main road, location, and area.
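A minimal sketch, assuming scikit-learn is available (the toy data, feature values, and prices are made up), contrasting the two: the classifier returns a discrete label, while the regressor returns a continuous number.

```python
# Classification: discrete target (0 = cat, 1 = dog) from two made-up features.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X_cls = [[4.0, 30.0], [4.5, 32.0], [25.0, 60.0], [30.0, 65.0]]
y_cls = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[5.0, 31.0]]))      # a class label (0 or 1)

# Regression: continuous target (house price) from age, distance, and area.
X_reg = [[5, 2.0, 120], [20, 5.0, 90], [1, 1.0, 150], [35, 8.0, 70]]
y_reg = [300_000, 180_000, 420_000, 120_000]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[10, 3.0, 110]]))   # a continuous price estimate
```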
Types of Regression Techniques
Along with the development of the machine learning domain, regression
analysis techniques have gained popularity and have developed well
beyond the simple y = mx + c. There are several types of regression techniques,
each suited to different types of data and different types of relationships.
The main types of regression techniques are:
1. Linear Regression
2. Polynomial Regression
3. Stepwise Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Support Vector Regression
7. Ridge Regression
8. Lasso Regression
9. ElasticNet Regression
10. Bayesian Linear Regression
Linear Regression
Linear regression is used for predictive analysis. It is a linear approach for
modeling the relationship between the criterion or scalar response and one
or more predictors or explanatory variables. Linear regression focuses on the
conditional probability distribution of the response given the values of the
predictors. With many predictors, there is a danger of overfitting. The formulas
for simple and multiple linear regression are given below.
This is the most basic form of regression analysis and is used to model a
linear relationship between a single dependent variable and one or more
independent variables.
Below, a linear regression model is instantiated to fit a linear relationship
between input features (X) and target values (y), as a simple demonstration
of the approach.
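A minimal sketch of that demonstration, assuming scikit-learn is installed; the small data set is made up.

```python
# A linear regression model fitted to a tiny made-up data set.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # input feature
y = np.array([2.1, 4.2, 6.1, 8.0, 9.9])             # target values (roughly 2x)

model = LinearRegression()
model.fit(X, y)                         # estimate the intercept and the slope
print(model.intercept_, model.coef_)    # fitted parameters (β0 and β1)
print(model.predict([[6.0]]))           # prediction for a new data point
```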
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one
independent variable and one dependent variable. The equation for simple
linear regression is:
y = β₀ + β₁X
where:
● y is the dependent variable
● X is the independent variable
● β₀ is the intercept
● β₁ is the slope
Multiple Linear Regression
This involves more than one independent variable and one dependent
variable. The equation for multiple linear regression is:
y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
where:
● y is the dependent variable
● X₁, X₂, …, Xₙ are the independent variables
● β₀ is the intercept
● β₁, β₂, …, βₙ are the slopes
The goal of the algorithm is to find the best-fit line equation that can predict the
values based on the independent variables.
In regression, a set of records is present with X and Y values, and these
values are used to learn a function; if you then want to predict Y for a new,
unseen X, this learned function can be used. Because the target is continuous,
the function must predict a continuous value of Y given X as the independent
features, as illustrated in the sketch below.
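The sketch below (NumPy only, data made up for illustration) shows one way such a function can be learned for multiple linear regression: the coefficients are obtained by ordinary least squares and then used to predict Y for an unseen record.

```python
# Multiple linear regression by ordinary least squares.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0]])              # two independent variables X1, X2
y = np.array([6.0, 5.5, 9.0, 13.0])     # dependent variable Y

X_design = np.column_stack([np.ones(len(X)), X])     # prepend a column for β0
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # least-squares solution
print(beta)                                          # [β0, β1, β2]

x_new = np.array([1.0, 2.5, 1.0])       # [1, X1, X2] for an unseen record
print(x_new @ beta)                      # predicted Y for that record
```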
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the strength of
any linear regression model. These assessment metrics often give an
indication of how well the model is producing the observed outputs.
The most common measurements are:
Mean Squared Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that calculates the
average of the squared differences between the actual and predicted
values for all the data points. The difference is squared to ensure that
negative and positive differences don’t cancel each other out.
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²
Here,
● n is the number of data points.
● yᵢ is the actual or observed value for the ith data point.
● ŷᵢ is the predicted value for the ith data point.
MSE is a way to quantify the accuracy of a model’s predictions. MSE is
sensitive to outliers, as large errors contribute significantly to the overall
score.
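A minimal sketch (NumPy only, values made up for illustration) computing MSE exactly as the formula above describes.

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])   # observed values (made up)
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])   # predicted values (made up)

mse = np.mean((y_actual - y_pred) ** 2)      # average of squared differences
print(mse)                                   # 0.375
```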
Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the accuracy
of a regression model. MAE measures the average absolute difference
between the predicted values and actual values.
Mathematically, MAE is expressed as:
MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|
Here,
● n is the number of observations
● Yᵢ represents the actual values.
● Ŷᵢ represents the predicted values.
A lower MAE value indicates better model performance. MAE is less sensitive
to outliers than MSE because it uses absolute rather than squared differences.
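The same made-up values, this time used to compute MAE; because the errors enter as absolute values, they cannot cancel each other out.

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_actual - y_pred))   # absolute errors cannot cancel out
print(mae)                                 # 0.5
```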
Root Mean Squared Error (RMSE)
The square root of the residuals’ variance is the Root Mean Squared Error.
It describes how well the observed data points match the expected
values, or the model’s absolute fit to the data.
In mathematical notation, it can be expressed as:
RMSE = √(RSS / n) = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / n )
Rather than dividing by the total number of data points, the sum of the squared
residuals can be divided by the number of degrees of freedom to obtain an
unbiased estimate; this figure is then referred to as the Residual Standard
Error (RSE).
In mathematical notation, it can be expressed as:
RSE = √( RSS / (n − 2) ) = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − 2) )
RMSE is not as good a metric as R-squared: its value depends on the units of
the variables (it is not a normalized measure), so it can fluctuate when the
units of the variables vary.
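A sketch contrasting RMSE and RSE on the same made-up values; the only difference is the divisor (n versus the degrees of freedom, n − 2).

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

rss  = np.sum((y_actual - y_pred) ** 2)   # residual sum of squares
n    = len(y_actual)
rmse = np.sqrt(rss / n)                   # divide by the number of points
rse  = np.sqrt(rss / (n - 2))             # divide by the degrees of freedom
print(rmse, rse)
```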
Coefficient of Determination (R-squared)
R-Squared is a statistic that indicates how much variation the developed
model can explain or capture. It is always in the range of 0 to 1. In general,
the better the model matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:
R² = 1 − (RSS / TSS)
● Residual Sum of Squares (RSS): The sum of the squares of the residuals
for each data point in the plot or data is known as the residual sum of
squares, or RSS. It is a measurement of the difference between the observed
output and the predicted output.
RSS = Σᵢ₌₁ⁿ (yᵢ − b₀ − b₁xᵢ)²
● Total Sum of Squares (TSS): The sum of the squared deviations of the data
points from the mean of the response variable is known as the total sum of
squares, or TSS.
TSS = Σᵢ₌₁ⁿ (yᵢ − ȳ)²
The R-squared metric is a measure of the proportion of variance in the
dependent variable that is explained by the independent variables in the
model.
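A sketch computing R² from RSS and TSS with NumPy, reusing the made-up values from the earlier metric sketches.

```python
import numpy as np

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

rss = np.sum((y_actual - y_pred) ** 2)            # residual sum of squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares
r2  = 1 - rss / tss
print(r2)                                         # close to 1 means a good fit
```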
Adjusted R-Squared Error
Adjusted R² measures the proportion of variance in the dependent
variable that is explained by the independent variables in a regression model.
Adjusted R-square accounts for the number of predictors in the model and
penalizes it for including irrelevant predictors that do not contribute
significantly to explaining the variance in the dependent variable.
Mathematically, adjusted R² is expressed as:
Adjusted R² = 1 − [ (1 − R²)(n − 1) / (n − k − 1) ]
Here,
● n is the number of observations
● k is the number of predictors in the model
● R² is the coefficient of determination
Adjusted R-square helps to prevent overfitting. It penalizes the model for
additional predictors that do not contribute significantly to explaining
the variance in the dependent variable.
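A small sketch of the adjusted R² calculation, using an illustrative R² value and made-up values for n and k.

```python
n, k = 50, 3          # observations and predictors (illustrative values)
r2 = 0.90             # an illustrative R² from some fitted model

adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adjusted_r2)    # slightly below R², since extra predictors are penalized
```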
FINDING THE LINE:
A linear regression lets you use one variable to predict another variable’s value.
Regression line formula
The regression line formula used in statistics is the same as the one used in algebra:
y = mx + b
Where: x = horizontal axis
y = vertical axis
m = the slope of the line (how steep it is)
b = the y-intercept (where the line crosses the Y axis)
For any data set, the slope m and the y-intercept b are chosen so that the line fits
the observed points as closely as possible, typically by the method of least squares,
as in the sketch below.
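A minimal sketch (NumPy only, points made up for illustration) of fitting y = mx + b to a small data set by least squares.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up data set
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

m, b = np.polyfit(x, y, deg=1)   # slope and y-intercept of the best-fit line
print(m, b)                      # roughly 2 and 0 for these points
print(m * 6.0 + b)               # use the line to predict y at a new x
```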
CORRELATION COEFFICIENT
Correlation is a statistical measure that describes the extent to which two
variables are related to each other. It quantifies the direction and strength
of the linear relationship between variables. Generally, a correlation
between any two variables is of three types that include:
● Positive Correlation
● Zero Correlation
● Negative Correlation
The Pearson Correlation Coefficient, denoted as r, is a statistical
measure that calculates the strength and direction of the linear
relationship between two variables on a scatterplot. The value of r
ranges between -1 and 1, where:
● 1 indicates a perfect positive linear relationship,
● -1 indicates a perfect negative linear relationship, and
● 0 indicates no linear relationship between the variables.
Pearson’s Correlation Coefficient Formula
Karl Pearson’s correlation coefficient formula is the most commonly used
and the most popular formula to get the statistical correlation coefficient.
It is denoted with the lowercase “r”. The formula for Pearson’s
correlation coefficient is shown below:
r = [n(∑xy) − (∑x)(∑y)] / √( [n∑x² − (∑x)²] [n∑y² − (∑y)²] )
The full name for Pearson’s correlation coefficient formula is Pearson’s
Product Moment correlation (PPMC). It helps in displaying the Linear
relationship between the two sets of the data.
Pearson’s correlation measures both the strength of a linear relationship
between two variables (given by the coefficient r, a value between -1 and +1)
and its statistical existence (given by the p-value); if the outcome is
significant, we conclude that the correlation exists.
Cohen (1988) says that an absolute value of r of 0.5 is classified as large,
an absolute value of 0.3 is classified as medium and an absolute value of
0.1 is classified as small.
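The sketch below (made-up data, assuming NumPy and SciPy are installed) computes r directly from the formula above and then with scipy.stats.pearsonr, which also returns the p-value used to judge significance.

```python
import numpy as np
from scipy import stats

x = np.array([43.0, 21.0, 25.0, 42.0, 57.0, 59.0])   # made-up data
y = np.array([99.0, 65.0, 79.0, 75.0, 87.0, 81.0])

n = len(x)
r_formula = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x) ** 2) * (n * np.sum(y**2) - np.sum(y) ** 2)
)

r_scipy, p_value = stats.pearsonr(x, y)   # also returns the p-value
print(r_formula, r_scipy, p_value)        # the two r values agree
```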
The interpretation of the Pearson’s correlation coefficient is as follows:
● A correlation coefficient of 1 means that for every positive increase in
one variable, there is a positive increase of a fixed proportion in the other.
For example, shoe size goes up in perfect correlation with foot length.
● If the correlation coefficient is 0, it indicates that there is no linear
relationship between the variables.
● A correlation coefficient of -1 means that for every positive increase in
one variable, there is a decrease of a fixed proportion in the other. For
example, the amount of water in a tank decreases in perfect correlation
with the time the tap has been left running.
The Pearson correlation coefficient essentially captures how closely the
data points tend to follow a straight line when plotted together. It’s
important to remember that correlation doesn’t imply causation – just
because two variables are related, it doesn’t mean one causes the
change in the other.
Pearson Correlation Coefficient Table
● 0 < r ≤ 1 (Positive correlation): an increase in one variable associates with
an increase in the other. Illustrative example: Study Time vs. Test Scores; more
hours spent studying tends to lead to higher test scores.
● r = 0 (No correlation): no discernible relationship between the changes in the
two variables. Illustrative example: Shoe Size vs. Reading Skill; a person’s shoe
size doesn’t predict their ability to read.
● -1 ≤ r < 0 (Negative correlation): an increase in one variable associates with
a decrease in the other. Illustrative example: Outdoor Temperature vs. Home
Heating Cost; as the outdoor temperature decreases, heating costs in the home
increase.
Pearson Correlation Coefficient Interpretation
Interpreting the Pearson correlation coefficient (r) involves assessing the
correlation strength, direction, and correlation significance of the relationship
between two variables. Here’s a guide to interpreting r:
1. Strength of Relationship:
● Close to +1: Indicates a strong positive linear relationship. As one
variable increases, the other tends to increase proportionally.
● Close to -1: Suggests a strong negative linear relationship. As one
variable increases, the other tends to decrease proportionally.
● Close to 0: Implies a weak or no linear relationship. Changes in one
variable do not consistently predict changes in the other.
2. Direction of Relationship:
● Positive r: Both variables tend to increase or decrease together.
● Negative r: One variable tends to increase as the other decreases, and
vice versa.
3. Significance:
● Statistical significance indicates whether the observed correlation
coefficient is likely to occur due to chance.
● Significance is typically assessed using a hypothesis test, such as the
t-test for correlation coefficient, with the null hypothesis stating that the true
correlation coefficient in the population is zero.
● If the p-value is less than the chosen significance level (e.g., 0.05), the
correlation is considered statistically significant.
4. Scatterplot Examination:
● Visual inspection of a scatterplot can provide additional insights into
the relationship between variables.
● A scatterplot allows you to assess the linearity, directionality, and
presence of outliers, complementing the numerical interpretation of r.
5. Caution:
● Correlation does not imply causation. Even if a strong correlation is
observed between two variables, it does not necessarily mean that changes in
one variable cause changes in the other.
● Other factors, such as confounding variables or omitted variables,
may influence the observed correlation.
6. Sample Size:
● Larger sample sizes tend to provide more reliable estimates of
correlation coefficients, reducing the likelihood of obtaining spurious
correlations.
7. Context Dependence:
● The interpretation of r should consider the specific context and
subject matter of the study. What is considered a strong or weak correlation may
vary depending on the field of research and the variables under investigation.
Example: Calculate the correlation coefficient for the following table with the
help of Pearson’s correlation coefficient formula: