Dr. K. Dhivakar
Assistant Professor of Statistics
School of Management, City Campus,
CMR University, Kalyan Nagar,
Bangalore - 560 043.
Linear Regression Analysis
Linear Regression Analysis is a statistical method used to model the relationship
between two variables by fitting a linear equation to observed data.
One variable is considered an independent variable (predictor), and the other is the
dependent variable (response).
Key Concepts in Linear Regression:
1. Independent Variable (X): The predictor or explanatory variable.
2. Dependent Variable (Y): The outcome or response variable.
3. Regression Line (Best Fit Line): The line that minimizes the sum of squared
differences between observed and predicted values.
4. Slope (b1): Measures the change in the dependent variable (Y) for a one-unit change in the independent variable (X).
5. Intercept (b0): The value of Y when X is 0.
The linear regression equation is of the form:
Y = b0 + b1X
Where:
Y is the predicted value of the dependent variable.
X is the independent variable.
b0 is the intercept.
b1 is the slope of the regression line.
Steps for Linear Regression Analysis:
1. Collect Data: Obtain paired data points (X1,Y1), (X2,Y2), …, (Xn,Yn) for the variables.
2. Fit the Model: Use the least squares method to estimate the slope b1 and intercept b0.
3. Interpret the Results: Analyze the regression equation, R2, and residuals to determine the model’s fit and make predictions.
4. Prediction: Use the regression equation to predict new values of Y for given X
values.
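The least squares estimates in step 2 have closed forms: b1 = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)², and b0 = Ȳ − b1·X̄. The steps above can be sketched in a few lines of Python (the data values here are invented purely for illustration):

```python
# Least squares estimates for simple linear regression:
#   b1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b0 = mean_y - b1 * mean_x
def fit_line(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1

# Hypothetical data lying exactly on y = 2 + 3x:
x = [1, 2, 3, 4]
y = [5, 8, 11, 14]
b0, b1 = fit_line(x, y)
print(b0, b1)  # 2.0 3.0
```

Prediction (step 4) is then just b0 + b1 * x_new for any new X value.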
Linear Regression Analysis using SPSS Statistics
Introduction
Linear regression is the next step up after correlation. It is used when we want to
predict the value of a variable based on the value of another variable. The variable
we want to predict is called the dependent variable (or sometimes, the outcome
variable).
The variable we are using to predict the other variable's value is called the
independent variable (or sometimes, the predictor variable).
For example, you could use linear regression to understand whether exam
performance can be predicted based on revision time; whether cigarette
consumption can be predicted based on smoking duration; and so forth.
If you have two or more independent variables, rather than just one, you need to
use multiple regression.
Assumptions
o Assumption #1: Your dependent variable should be measured at the continuous level
(i.e., it is either an interval or ratio variable). Examples of continuous
variables include revision time (measured in hours), intelligence (measured using IQ
score), exam performance (measured from 0 to 100), weight (measured in kg).
o Assumption #2: Your independent variable should also be measured at
the continuous level (i.e., it is either an interval or ratio variable).
o Assumption #3: There needs to be a linear relationship between the two variables.
Whilst there are a number of ways to check whether a linear relationship exists between
your two variables, we suggest creating a scatterplot using SPSS Statistics where you
can plot the dependent variable against your independent variable and then visually
inspect the scatterplot to check for linearity. Your scatterplot may look something like
one of the following:
If the relationship displayed in your scatterplot is not linear, you will have to either
run a non-linear regression analysis, perform a polynomial regression or
"transform" your data, which you can do using SPSS Statistics.
o Assumption #4: There should be no significant outliers. An outlier is an observed data point with a dependent variable value that is very different from the value predicted by the regression equation. As such, an outlier will be a point on a scatterplot that is (vertically) far away from the regression line, indicating that it has a large residual, as highlighted below:
The problem with outliers is that they can have a negative effect on the regression
analysis (e.g., reduce the fit of the regression equation) that is used to predict the
value of the dependent (outcome) variable based on the independent (predictor)
variable.
This will change the output that SPSS Statistics produces and reduce the predictive
accuracy of your results.
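One simple way to spot such outliers numerically rather than visually is to flag observations whose residuals are large relative to the residual standard deviation. This rule-of-thumb check (the threshold of 2 standard deviations and the data below are my own illustrative choices, not part of the SPSS procedure) might look like:

```python
# Flag points whose residual is more than `k` residual standard
# deviations away from the fitted line (rule-of-thumb outlier check).
def flag_outliers(x, y, b0, b1, k=2.0):
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    n = len(residuals)
    # Residual standard deviation, with n - 2 degrees of freedom:
    sd = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
    return [i for i, e in enumerate(residuals) if abs(e) > k * sd]

# Hypothetical data: the last point sits far above the line y = x
x = list(range(1, 11))
y = [1.0, 2.1, 2.9, 4.0, 5.1, 5.9, 7.0, 8.1, 8.9, 30.0]
print(flag_outliers(x, y, b0=0.0, b1=1.0))  # [9]
```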
o Assumption #5: You should have independence of observations, which you can easily check using the Durbin-Watson statistic, a simple test to run using SPSS Statistics.
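The Durbin-Watson statistic itself is straightforward to compute from the residuals: DW = Σ(e_t − e_{t−1})² / Σe_t², with values near 2 suggesting uncorrelated errors. A minimal sketch (the residuals below are invented for illustration):

```python
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals, divided by the sum of squared residuals. Values near 2
# suggest independent errors; values near 0 or 4 suggest positive or
# negative first-order autocorrelation, respectively.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals are strongly negatively autocorrelated:
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
```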
o Assumption #6: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. Whilst we explain more about what this means and how to assess the homoscedasticity of your data in our enhanced linear regression guide, take a look at the three scatterplots below, which provide three simple examples: two of data that fail the assumption (called heteroscedasticity) and one of data that meets this assumption (called homoscedasticity):
o Assumption #7: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed (we explain these terms in our enhanced linear regression guide).
Example
A salesperson for a large car brand wants to determine whether there is a
relationship between an individual's income and the price they pay for a car. As
such, the individual's "income" is the independent variable and the "price" they pay
for a car is the dependent variable.
The salesperson wants to use this information to determine which cars to offer
potential customers in new areas where average income is known.
Setup in SPSS Statistics
In SPSS Statistics, we created two variables so that we could enter our data: Income (the independent variable) and Price (the dependent variable). It can also be useful to create a third variable, caseno, to act as a chronological case number.
Test Procedure in SPSS Statistics
The steps below show you how to analyse your data using linear regression in SPSS Statistics. Click Analyze > Regression > Linear... on the top menu, as shown below:
You will be presented with the Linear Regression dialogue box:
1. Transfer the independent variable, Income, into the Independent(s): box and the dependent variable, Price, into the Dependent: box. You can do this by either drag-and-dropping the variables or by using the appropriate buttons. You will end up with the following screen:
2. You now need to check four of the assumptions discussed in the Assumptions section above: no significant outliers (assumption #4); independence of observations (assumption #5); homoscedasticity (assumption #6); and normal distribution of errors/residuals (assumption #7). You can do this by using the Statistics and Plots features, and then selecting the appropriate options within these two dialogue boxes. Click on the OK button. This will generate the results.
Output of Linear Regression Analysis
SPSS Statistics will generate quite a few tables of output for a linear regression.
In this section, we show you only the three main tables required to understand
your results from the linear regression procedure, assuming that no
assumptions have been violated.
The first table of interest is the Model Summary table, as shown below:
This table provides the R and R2 values. The R value represents the simple correlation and is 0.873 (the "R" column), which indicates a high degree of correlation.
The R2 value (the "R Square" column) indicates how much of the total variation in the dependent variable, Price, can be explained by the independent variable, Income. In this case, 76.2% can be explained, which is very large.
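R2 here is simply the squared correlation (0.873² ≈ 0.762); equivalently, it can be computed from the residuals as 1 − SSres/SStot. A small illustration, using made-up observed and predicted values rather than the tutorial's data:

```python
# R^2 = 1 - SS_res / SS_tot: the proportion of variation in y
# explained by the fitted values.
def r_squared(y, y_pred):
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical observed vs. predicted values:
y      = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.5, 3.5, 6.5, 7.5]
print(r_squared(y, y_pred))  # 0.95
```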
The next table is the ANOVA table, which reports how well the regression equation
fits the data (i.e., predicts the dependent variable) and is shown below:
This table indicates that the regression model predicts the dependent variable
significantly well. How do we know this? Look at the "Regression" row and go to
the "Sig." column. This indicates the statistical significance of the regression model
that was run.
Here, p < 0.0005, which is less than 0.05, and indicates that, overall, the regression
model statistically significantly predicts the outcome variable (i.e., it is a good fit
for the data).
The Coefficients table provides us with the necessary information to predict price
from income, as well as determine whether income contributes statistically
significantly to the model (by looking at the "Sig." column).
Furthermore, we can use the values in the "B" column under the "Unstandardized
Coefficients" column, as shown below:
to present the regression equation as:
Price = 8287 + 0.564(Income)
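Applying this equation is a single substitution. Using the coefficients from the output above (the income figure below is an arbitrary example, not from the tutorial's data):

```python
# Predicted car price from the fitted equation:
#   Price = 8287 + 0.564 * Income
def predict_price(income):
    return 8287 + 0.564 * income

print(predict_price(50000))  # 36487.0
```

This is how the salesperson would estimate a typical car price for a new area from its known average income.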
Thank you