Task 3 Multiple Linear Regression

This document describes multiple linear regression, which analyzes the relationship between a dependent variable and multiple independent variables through equations. It explains that multiple linear regression is a statistical technique for testing hypotheses and causal relationships between variables. It also describes the conditions that must be met to apply this method and the methodological process that includes parameter estimation, hypothesis testing, and model fit evaluation.

MULTIPLE LINEAR REGRESSION

What is multiple linear regression?

It is possible to analyze the relationship between two or more variables through an equation; this is what is called multiple regression or multiple linear regression.

Multiple linear regression is the statistical technique par excellence for testing hypotheses and causal relationships.

Conditions that must be met to apply multiple linear regression

1. The dependent variable (outcome) must be ordinal or scale, that is, the categories of the variable must have an internal order or hierarchy, for example: income level, weight, number of children, etc.
2. The independent variables (causes) must also be ordinal or scale.
3. There are other conditions, such as: the independent variables must not be highly correlated with each other, the relationships between the causes and the outcome must be linear, all variables must follow the normal distribution, and they must have equal variances. These conditions are not so strict, and there are ways to handle the data if any of them is violated.

Other criteria that must be met are the following:

• The variables must make numerical sense.
• There should be no repeated or redundant variables.
• The variables introduced in the model must have some theoretical justification.
• The ratio of explanatory variables to cases must be at least 1 to 10, that is, at least ten cases per explanatory variable.
• The relationship of the explanatory variables with the dependent variable must be linear, that is, proportional.

Multiple Linear Regression Analysis

This analysis allows us to establish the relationship between a dependent variable Y and a set of independent variables (X1, X2, ..., Xk).

Multiple linear regression analysis, unlike simple regression, comes closer to real analysis situations, since social phenomena, facts, and processes are by definition complex and, consequently, must be explained, as far as possible, by the set of variables that directly and indirectly take part in producing them.

With multiple linear regression analysis we can:

• Identify which independent variables (causes) explain a dependent variable (result).
• Compare and verify causal models.
• Predict values of a variable, that is, approximately predict a behavior or state from certain characteristics.

The multiple linear regression model

The multiple linear regression model is identical to the simple linear regression model, the only difference being that more explanatory variables appear:

Simple regression model: Y = β0 + β1X + ε

Multiple regression model: Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

Another way of writing it, in matrix form: Y = Xβ + ε
Meaning of the parameters:

β0 = Mean value of the response variable when X1 = ... = Xk = 0.

Very often, the parameter β0 does not have an intuitive interpretation of interest.

βj = Measures the average change in the response variable when Xj increases by one unit, holding the other explanatory variables fixed (j = 1, ..., k).

The intuitive interpretation of βj (j = 1, ..., k), by contrast, is always of interest.
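This interpretation can be made explicit by writing the conditional mean of the response (a standard formulation, added here only as an illustration):

E[Y | X1 = x1, ..., Xk = xk] = β0 + β1x1 + ... + βkxk

so that βj is the change in E[Y] per unit increase in Xj, with the remaining regressors held fixed.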

Hypothesis
In order to obtain and use statistical tools that allow us to make objective and well-founded decisions, we need the model to satisfy certain hypotheses. These initial hypotheses of the model are the following:

Normality: The observations Yi follow a Normal distribution.

Linearity: The mean values of the response variable depend linearly on the values of X1, ..., Xk: E[Yi] = β0 + β1x1i + ... + βjxji + ... + βkxki.

Homogeneity or equality of variances (homoscedasticity): V(Yi) = σ².


Independence: The observations Yi are independent.

All these hypotheses can be briefly expressed in the following way:

Yi ∼ N(β0 + β1x1i + ... + βjxji + ... + βkxki, σ²), independent.

Absence of multicollinearity: There are no linear relationships between the explanatory variables X1, ..., Xk.

The absence of multicollinearity is a completely new hypothesis, and its meaning is as follows:

On the one hand, if any of the explanatory variables were a linear combination of the others, the model could obviously be simplified. But that is not the most important point.

The practical importance of requiring the absence of multicollinearity comes from the fact that, if any of the explanatory variables is strongly correlated with the others, distortions can appear in the results.

It is important that these initial hypotheses of the model are (approximately) satisfied so that the conclusions we obtain are not nonsense.
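As an illustration only, these hypotheses are usually checked in R on a fitted model; the sketch below assumes a model object mod produced by lm() on a data set mydata (both names are hypothetical) and the optional car package for the multicollinearity and independence checks:

> # mod <- lm(y ~ x1 + x2, data = mydata)   # assumed, previously fitted model
> plot(mod)                      # residual plots: linearity and homoscedasticity
> shapiro.test(residuals(mod))   # normality of the residuals
> library(car)                   # assumes the car package is installed
> vif(mod)                       # variance inflation factors: multicollinearity
> durbinWatsonTest(mod)          # autocorrelation check for independence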

At this point we can ask whether we have enough data (sufficient sample information) to tackle the statistical analysis of this model. The basic rule for answering this is very easy to remember (and to understand): in general, we need at least as many data points as parameters to be estimated in the model. In this model we have:

Number of data points = n
Number of parameters = k + 2 (the coefficients β0, β1, ..., βk plus the variance σ²)

Therefore, we need at least n = k + 2 data points. For example, with k = 2 explanatory variables we estimate 4 parameters, so at least 4 observations are required.


Methodology:
The methodology or work plan that we will follow in the statistical analysis of a multiple regression model is as follows:

(1) Diagnosis of the initial hypotheses of the model.

(2) Point estimation of the model parameters.

(3) Confidence intervals for the model parameters.

(4) Hypothesis tests.

(5) Analysis of variance.

(6) Evaluation of the fit provided by the adjusted regression model.

Another type of methodology: selection of multiple linear regression models based on multi-objective methods

The proposal for selecting multiple linear regression models (MLRM).

Algorithm.

The MERLIND algorithm (Non-Dominated Linear Regression Models) is based on the operating principles of compromise programming, using the L1 and L∞ metrics because they allow the efficient set to be narrowed down by providing, respectively, a range of non-balanced and balanced solutions; it also uses the lexicographic goal programming approach to ensure that the solutions do not deteriorate below certain achievement levels. The steps of the algorithm are the following.

Step 0. Initialization.
a. Start from a dependent variable Y and a set of independent variables Xk suggested by at least one theory that explains the phenomenon.
b. Ask the analyst:
b1. What significance level is to be used for the statistical tests: 1% or 5%?
b2. What are the expected signs in the multiple regression for the k coefficients of the independent variables?
b3. Is there any theoretical restriction to be satisfied among the coefficients? If the answer is affirmative, state the restriction(s).

Go to step 1.

Step 1. Model generation.

a. Optionally, generate a set of transformed variables (logarithms, differences, lags, or combinations of these) based on the original data.

b. Generate the 2^k − 1 possible combinations of variables, including the transformed variables if applicable.

c. Estimate the corresponding regression models, with and without the intercept.

d. For each generated model, record the observed results for the following selection criteria, according to the following indicators (a minimal R sketch of computing several of them follows the list):

• Observed signs of the coefficients: positive or negative.
• Hypothesis tests on individual coefficients: p-value of the t statistic.
• Global significance test: p-value of the F statistic.
• Hypothesis test for restricted models: global F and critical F.
• Adjusted coefficient of determination: adjusted R² value.
• Durbin-Watson test: DW value and p-value.
• AC and PAC bars outside or inside the confidence bands.
• White heteroscedasticity test with error term: n·R² value and p-value.
• Jarque-Bera skewness and kurtosis test: JB statistic value and p-value.
• Normality test (Kolmogorov-Smirnov): K-S statistic value and p-value.
• Multicollinearity test: variance inflation factor value.
• Multicollinearity test: condition index.
• Mallows' Cp criterion: Cp value.
• Schwarz's SIC criterion: SIC value.
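A minimal R sketch of how several of these indicators can be obtained for one fitted model; the model mod, the data mydata and the packages lmtest, tseries and car are assumptions used for illustration, not part of the original text:

> library(lmtest); library(tseries); library(car)   # assumed to be installed
> mod <- lm(y ~ x1 + x2, data = mydata)      # hypothetical model
> summary(mod)                               # t and F p-values, adjusted R-squared
> dwtest(mod)                                # Durbin-Watson test
> bptest(mod)                                # Breusch-Pagan test (close to White's test)
> jarque.bera.test(residuals(mod))           # Jarque-Bera skewness/kurtosis test
> ks.test(scale(residuals(mod)), "pnorm")    # Kolmogorov-Smirnov normality test
> vif(mod)                                   # variance inflation factors
> BIC(mod)                                   # Schwarz criterion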

Go to step 2.

Step 2. Decision matrix.

a. For each model, record the score associated with its degree of compliance with each criterion; for this, use the point conversion matrix shown in Table 3.
b. Organize the model scores in a matrix Zij, as shown in Table 2.

Step 3. Normalized distances.

a. Set the ideal value at 3 points and the anti-ideal at 1 point.

b. Calculate the normalized distances (see the sketch after this step).

c. Add up the normalized distances for the seven blocks of criteria:

c.1. Theoretical coherence: for i = 1, 2, ..., m and j = 1.

c.2. Statistical coherence: for i = 1, 2, ..., m and j = 2, 3, ..., 5.

c.3. Autocorrelation: for i = 1, 2, ..., m and j = 6, 7.

c.4. Heteroscedasticity: for i = 1, 2, ..., m and j = 8.

c.5. Normality: for i = 1, 2, ..., m and j = 9, 10.

c.6. Multicollinearity: for i = 1, 2, ..., m and j = 11, 12.

c.7. Other criteria.

Go to step 4.
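The distance formulas themselves were lost in this copy. As an assumption, compromise programming with an ideal of 3 points and an anti-ideal of 1 point would typically use something of the form

dij = (3 − Zij) / (3 − 1),  i = 1, 2, ..., m,

with each block distance obtained by summing dij over the j indices listed above (for example, statistical coherence: di2 + di3 + di4 + di5).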

Step 4. Compromise set.

a. For the L1 set, the following lexicographic achievement function is defined: these indices yield the multiple regression equations that satisfy theoretical coherence, have a minimal global distance and, at the same time, provide a balanced solution for the rest of the criteria.

Examples:

1. We want to estimate the food expenditure of a family from the information provided by the predictor variables X1, "monthly income", and X2, "number of family members". To do this, a simple random sample of 15 families is collected, whose results appear in the attached table (expenditure and income are given in hundreds of thousands of pesetas).

Expenditure  Income  Size        Expenditure  Income  Size
0.43         2.1     3           1.29         8.9     3
0.31         1.1     4           0.35         2.4     2
0.32         0.9     5           0.35         1.2     4
0.46         1.6     4           0.78         4.7     3
1.25         6.2     4           0.43         3.5     2
0.44         2.3     3           0.47         2.9     3
0.52         1.8     6           0.38         1.4     4
0.29         1.0     5
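As an illustration (not part of the original document), the example can be reproduced in R; the data follow the decimal reconstruction in the table above:

> expenditure <- c(0.43, 0.31, 0.32, 0.46, 1.25, 0.44, 0.52, 0.29,
+                  1.29, 0.35, 0.35, 0.78, 0.43, 0.47, 0.38)
> income <- c(2.1, 1.1, 0.9, 1.6, 6.2, 2.3, 1.8, 1.0,
+             8.9, 2.4, 1.2, 4.7, 3.5, 2.9, 1.4)
> size <- c(3, 4, 5, 4, 4, 3, 6, 5, 3, 2, 4, 3, 2, 3, 4)
> fit <- lm(expenditure ~ income + size)
> summary(fit)   # coefficient estimates and individual t tests
> anova(fit)     # sums of squares used in the ANOVA table below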

From the data in matrix form, the matrices X'X and X'y are obtained and, from them, the estimates β̂ = (X'X)⁻¹X'y.

The linear regression model obtained is:

ŷ = −0.160 + 0.149·x1 + 0.077·x2

From this equation, the predictions ŷi and the associated residuals ei = yi − ŷi are obtained for the sample observations. The fitted value and residual are computed for the first observation and, reasoning in the same way for all the sample points, the full set of predictions and residuals is obtained.

Calculation of scR

The scR can also be calculated in the following way:

scR = Σ yi² − β̂0 Σ yi − β̂1 Σ yi x1i − β̂2 Σ yi x2i = 5.7733 + 0.160 · 8.070 − 0.149 · 32.063 − 0.077 · 28.960 = 0.0721

(small differences appear if the rounded coefficient values are used instead of the full-precision estimates).
The 90% confidence intervals for the model parameters are calculated next.

For the variance σ², using scR/σ² = 0.0721/σ² ~ χ²12 and the quantiles χ²12;0.95 ≈ 21.03 and χ²12;0.05 ≈ 5.23:

0.0034 < σ² < 0.0138

The variances of the model estimators are 0.00816, 0.000099 and 0.00040, from which the standard errors are deduced:

σ̂(β̂0) = √0.00816 = 0.0903
σ̂(β̂1) = √0.000099 = 0.0099
σ̂(β̂2) = √0.00040 = 0.0201

Confidence interval for β0:

−t12;0.05 · 0.0903 < −0.160 − β0 < t12;0.05 · 0.0903

−0.321 < β0 < 0.001


Confidence interval for β1 (income):

−t12;0.05 · 0.0099 < 0.149 − β1 < t12;0.05 · 0.0099

0.1314 < β1 < 0.1666

Test of H0: β1 = 0, "the income variable has no influence" (individual t test): H0 is rejected, since the interval does not contain zero.

Confidence interval for β2 (size):

−t12;0.05 · 0.0201 < 0.077 − β2 < t12;0.05 · 0.0201

0.0412 < β2 < 0.1128

Test of H0: β2 = 0, "the size variable has no influence" (individual t test): H0 is rejected, since the interval does not contain zero.

The ANOVA table is:

Source of variation   Sum of squares   Degrees of freedom   Variances
scE (model)           1.3595           2                    se² = 0.6797
scR (residual)        0.0721           12                   sR² = 0.0060
scG (global)          1.4316           14                   sy² = 0.1023
With these data, the joint F test for the model is obtained.
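As a worked illustration (the value itself is not shown in this copy), the F statistic follows directly from the ANOVA table above:

F = se² / sR² = 0.6797 / 0.0060 ≈ 113.3,

far larger than the critical value F2,12;0.05 ≈ 3.89, so the model is clearly significant.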

The joint F test clearly indicates the influence of the model on the response variable. Therefore, from the individual tests and the joint test we deduce the influence of each of the two regressor variables and the joint influence of the model.
Now the individual F test for the variable x2, size, is calculated; this test is equivalent to the individual t test. To do this, the regression of the expenditure variable on the income variable alone is fitted.

The ANOVA table of this model is:

Source of variation   Sum of squares   Degrees of freedom   Variances
scE (income)          1.2716           1                    se² = 1.2716
scR (residual)        0.1600           13                   sR² = 0.0123
scG (global)          1.4316           14                   sy² = 0.1022

The incremental variability due to the size variable is scE(size | income) = 1.3595 − 1.2716 = 0.0879; this value indicates how much the variability explained by the model increases when the size variable is introduced.

To test whether or not this variable has influence, the incremental F statistic is used, which gives the same p-value as the individual t test (there are small differences due to rounding).
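As a worked check (these figures are computed from the two ANOVA tables above rather than quoted from the original):

F = (1.3595 − 1.2716) / 0.0060 ≈ 14.7,

which essentially coincides with the square of the individual t statistic, t = 0.077 / 0.0201 ≈ 3.83, t² ≈ 14.7.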

Calculation of correlation coefficients:

The coefficient of determination: R² = scE/scG = 1.3595/1.4316 = 0.950.

The multiple correlation coefficient: R = √0.950 = 0.975.

The coefficient of determination adjusted for the degrees of freedom: adjusted R² = 1 − sR²/sy² = 1 − 0.0060/0.1023 = 0.941.

The simple correlation coefficient between the expenditure and income variables is a measure of the linear relationship between expenditure and income. It can also be calculated from the coefficient of determination of the regression of expenditure on income alone, r = √(1.2716/1.4316) = 0.942.

The ANOVA table of this model is:

Source of variation   Sum of squares   Degrees of freedom   Variances
scE (income)          1.2716           1                    se² = 1.2716
scR (residual)        0.1600           13                   sR² = 0.0123
scG (global)          1.4316           14                   sy² = 0.1022
Similarly, the simple correlation coefficient between the expenditure and size variables is obtained.

Partial correlation coefficient between the expenditure and income variables.

Another, more laborious way to calculate this coefficient is the following: the regressions below are fitted and their residuals are saved,

expenditure = 0.6713 − 0.0363 · size + e(expenditure·size)

income = 5.5923 − 0.7615 · size + e(income·size)

The partial correlation coefficient between expenditure and income is then obtained as the simple correlation coefficient between the residuals e(expenditure·size) and e(income·size).

This coefficient measures the relationship between the expenditure and income variables free of the influence of the size variable. Similarly, the partial correlation between expenditure and size is obtained.

Estimation of the conditional mean.

Estimate the average food expenditure of families with an income of x1 = 3.0 and a size of x2 = 4, that is, the conditional mean m_h = E[Y | x1 = 3.0, x2 = 4].

Applying the regression model: m̂_h = −0.160 + 0.149 · 3.0 + 0.077 · 4 ≈ 0.59.

The leverage (influence) value associated with this point is h = 0.07649, so the equivalent sample size is n_h = 1/h = 13.073.

The variance of the estimator is sR² · h = 0.0060 · 0.07649 ≈ 0.00046.

And a 90% confidence interval for m_h is m̂_h ± t12;0.05 · √(sR² · h).

Prediction of an observation.
The Pérez family has an income of x1 = 3.0 and a size of x2 = 4. What will its food expenditure be?
Applying the estimated regression model gives the same point value as before, ŷ_h ≈ 0.59.

The variance of the prediction is

sR² · (1 + h) = 0.0060 · (1 + 0.07649) ≈ 0.0065,

so its standard deviation is √0.0065 ≈ 0.0803.

And a 90% prediction interval is ŷ_h ± t12;0.05 · 0.0803.
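Putting in numbers, with t12;0.05 = 1.782 and the point prediction reconstructed above (these figures are illustrative, since the original values were lost in this copy): 0.59 ± 1.782 · 0.0803 ≈ 0.59 ± 0.14, that is, roughly (0.45, 0.73) hundreds of thousands of pesetas.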

Some charts that help in analyzing the problem (not reproduced in this copy) are the component (partial) plots and the residual plots.
2. Measurements on 12 individuals provide data on their weight, height, waist circumference (in cm), and age.

We are going to fit a new linear regression model (multiple, in this case) that incorporates the information from these new variables. First, we create two numerical vectors, one for each new variable (the weight and height vectors are assumed to have been created in the earlier, simple regression example):

> waist <- c(62, 75, 60, 71, 66, 62, 79, 74, 70, 66, 71, 69)
> age <- c(25, 31, 29, 64, 44, 41, 37, 35, 34, 29, 19, 50)

And we group the information on the 4 variables into a data frame that we will call 'datos':

> datos <- data.frame(weight, height, waist, age)

Let us check that the data frame we have created does indeed contain the information on the 4 variables:

> head(datos)
  weight height waist age
1     74    168    62  25
2     92    196    75  31
3     63    170    60  29
4     72    175    71  64
5     58    162    66  44
6     78    169    62  41

Next, we fit the multiple linear regression model:

> reg_lin_mul <- lm(weight ~ height + waist + age)


> summary(reg_lin_mul)

Call:
lm(formula = weight ~ height + waist + age)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5822 -2.8758 -0.6746  2.6828  9.9842

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -78.03017   35.37744  -2.206   0.0585 .
height        0.93629    0.34941   2.680   0.0279 *
waist        -0.13261    0.60578  -0.219   0.8322
age          -0.09672    0.15806  -0.612   0.5576
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.024 on 8 degrees of freedom
Multiple R-squared: 0.7464, Adjusted R-squared: 0.6513
F-statistic: 7.85 on 3 and 8 DF, p-value: 0.009081

The model could therefore be written as:

weight = −78.03 + 0.936 · height − 0.133 · waist − 0.097 · age
Both the interpretation and the checking of the significance of the parameters are carried out in a way similar to the case with a single independent variable. Likewise, the validation is carried out in the same way as for simple linear regression.
As for graphical representations, scatterplots of the dependent variable against each of the independent variables can be produced with the plot command, as shown earlier.
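For instance, a minimal sketch of such plots (the exact figures shown in the original are not reproduced here):

> pairs(datos)                      # scatterplot matrix of all four variables
> plot(datos$height, datos$weight)  # weight against one of the predictors
> plot(reg_lin_mul)                 # standard residual diagnostics for the fitted model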
