A Tool for Linear Prediction
What is regression?
Regression refers to the method of predicting the value of one variable on the basis of the known value(s) of one or more other variables. This is done by establishing a simplified statement, or model, of the relationship between one variable and another variable or set of variables.
Example of a Statement
Y = α + β1X1 + β2X2 + β3X3 + ε
This statement is called a regression equation. It implies that the value of Y, except for a random error ε, is determined by the values of the variables X1, X2 and X3. When the observed variations in the values of Y can largely be accounted for by the values of X1, X2 and X3, the statement is acceptable.
What is a regression equation?
A regression equation is the equation of the line defined by the path of the means of Y for fixed values of X; this is the regression equation of Y on X.
Assumptions:
1. The equation is linear
2. Y is normally distributed for each value of X
3. Var(Y) is the same for each value of X
A Graphical Presentation
[Figure: plot of Y (Income) against X (Education), with the fixed X values X1 to X7 marked along the horizontal axis]
Fig. 1. General form of regression of Y on X, or the path of the means of Y values for fixed values of X
Method for finding the regression line
Least squares procedure: a method for finding the straight line that minimizes the sum of squared deviations of the points around it
Assumptions:
1. Random sampling
2. E(εi) = 0
3. Xi and εi are statistically independent
A Graphical Presentation
[Figure: scatter of points in the X-Y plane with a fitted line; the vertical distances between the line and the points are marked]
Fig. 1. Least-squares equation minimizing the sum of squares of vertical distances and estimating the regression of Y on X
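The least squares procedure for a single explanatory variable can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the function name fit_line and the education/income figures are invented for the example, and the closed-form solution shown applies only to a model with one X and an intercept.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope that minimizes the squared vertical distances
    a = mean_y - b * mean_x   # the fitted line passes through (mean_x, mean_y)
    return a, b


# Hypothetical data, invented for illustration only
education = [8, 10, 12, 14, 16]   # X: years of schooling
income = [20, 24, 31, 33, 42]     # Y: income in arbitrary units

a, b = fit_line(education, income)
# Vertical distances between the observed points and the fitted line
residuals = [y - (a + b * x) for x, y in zip(education, income)]
print(f"Y = {a:.2f} + {b:.2f} X")
```

With an intercept in the model, the residuals from a least-squares fit sum to zero, which is a quick check that the line was computed correctly.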
Components of deviations
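SST = sum of squared deviations between the observed values and their mean (d.f. = n - 1)
SSR = sum of squared deviations between the fitted values and the mean of the observed values (d.f. = p - 1, the number of explanatory variables)
SSE = unexplained deviations: sum of squared deviations between the observed values and the corresponding fitted values; SSE = SST - SSR, with d.f. = n - p
Here n is the number of observations and p is the number of estimated parameters (the intercept plus one coefficient per explanatory variable).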
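The three components can be computed directly from the observed and fitted values. The sketch below uses invented data; the fitted values are assumed to come from a least-squares line with an intercept (here Y = -1.8 + 2.65 X at X = 8, 10, 12, 14, 16), since the identity SST = SSR + SSE holds only for such a fit.

```python
def sums_of_squares(observed, fitted):
    """Return (SST, SSR, SSE) for observed values and their fitted values."""
    mean_y = sum(observed) / len(observed)
    sst = sum((y - mean_y) ** 2 for y in observed)             # total
    ssr = sum((f - mean_y) ** 2 for f in fitted)               # explained by the line
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained
    return sst, ssr, sse


# Hypothetical observed values and their least-squares fitted values
observed = [20, 24, 31, 33, 42]
fitted = [19.4, 24.7, 30.0, 35.3, 40.6]

sst, ssr, sse = sums_of_squares(observed, fitted)
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
```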
Statistics for Evaluating Regression Equation
F = MSR / MSE
where MSR = SSR/(p - 1) and MSE = SSE/(n - p)
Thus, F measures the size of the variation accounted for by the fitted regression line relative to the remaining unexplained variation.
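As a small worked example, the F ratio can be computed from the sums of squares and the degrees of freedom. The function name f_statistic and the input values are assumptions made for illustration (one explanatory variable, five observations).

```python
def f_statistic(ssr, sse, n, p):
    """F = MSR / MSE, where p counts the intercept plus the explanatory variables."""
    msr = ssr / (p - 1)   # mean square due to regression
    mse = sse / (n - p)   # mean square error
    return msr / mse


# Hypothetical sums of squares from a fit with n = 5 observations and p = 2
f_value = f_statistic(ssr=280.9, sse=9.1, n=5, p=2)
print(f"F = {f_value:.2f}")
```

The computed value is then compared with the critical value of the F distribution with (p - 1, n - p) degrees of freedom.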
Statistics for Evaluating Regression Equation
Coefficient of determination, R²
R² = SSR/SST
R² = 1.0 implies that the regression line is a perfect fit
R² = 0.0 implies that the regression line fails to explain the deviations in the observed values; i.e., all deviations are accounted for by SSE
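The two extreme cases can be checked with a short sketch. The data are invented, and r_squared is a hypothetical helper name; the first call uses fitted values equal to the observed values (a perfect fit), the second uses the mean of Y for every fitted value (a line that explains nothing).

```python
def r_squared(observed, fitted):
    """R^2 = SSR / SST, using deviations from the mean of the observed values."""
    mean_y = sum(observed) / len(observed)
    sst = sum((y - mean_y) ** 2 for y in observed)
    ssr = sum((f - mean_y) ** 2 for f in fitted)
    return ssr / sst


observed = [20, 24, 31, 33, 42]           # hypothetical values, mean = 30

perfect = r_squared(observed, observed)   # fitted values reproduce the data
flat = r_squared(observed, [30.0] * 5)    # fitted values are all the mean of Y
print(perfect, flat)
```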
Statistics for Evaluating Regression Equation
Regression coefficient: a quantitative measure of the change in the value of the dependent variable (Y) that is effected by a unit change in the value of an explanatory variable (X), with the other explanatory variables held constant.
Y = a + b1X1 + b2X2 + b3X3 + b4X4
where b1, b2, b3 and b4 are the regression coefficients
Statistics for Evaluating Regression Equation
Partial Fi - measure of the significance of the contribution of the regression coefficient bi , given that the other variables are included in the model.
Y = a + b1X1 + b2X2 + b3X3 + b4X4
Partial F-values: F1 for b1, F2 for b2, F3 for b3, F4 for b4
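One common way to obtain a partial F is to compare the SSE of the full model with the SSE of the model that omits the variable in question. The sketch below follows that approach under stated assumptions: the helper names ols_sse and partial_f are invented, the data are made up so that Y depends strongly on X1 and hardly at all on X2, and the small normal-equations solver is for illustration only (a numerical library would normally be used instead).

```python
def ols_sse(columns, y):
    """Fit y on an intercept plus the given predictor columns; return the SSE."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda idx: abs(aug[idx][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(coefs, row)) for row in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def partial_f(columns, y, i):
    """Partial F for predictor i, given that the other predictors are in the model."""
    n, p = len(y), len(columns) + 1
    sse_full = ols_sse(columns, y)
    sse_reduced = ols_sse(columns[:i] + columns[i + 1:], y)
    # Extra variation explained by predictor i (1 d.f.) relative to the full-model MSE
    return (sse_reduced - sse_full) / (sse_full / (n - p))


# Hypothetical data: Y is roughly 2 + 3*X1, and X2 is unrelated noise
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, -1, -1, 1, 1, -1]
y = [5.1, 7.8, 11.1, 14.2, 16.9, 19.9]

f1 = partial_f([x1, x2], y, 0)
f2 = partial_f([x1, x2], y, 1)
print(f"F1 = {f1:.1f}, F2 = {f2:.2f}")
```

In this made-up example F1 is very large and F2 is small, which matches how the data were constructed.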
Stepwise Regression
Used when there is no definite model to be tested, or when from a set of variables a subset is to be identified as the explanatory variables
Variables are entered into (or step in) or removed from (or step out) the model one at a time
Stepping Method Criteria: significance of the associated F-value
The explanatory variable with the highest F-value is entered first
A variable is not entered if its contribution to the overall F-value is not significant
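The entry rule can be sketched as a forward-selection loop. This is a simplified illustration under several assumptions: only the step-in part is shown (no removal step), significance is approximated by a fixed cutoff f_to_enter = 4.0 chosen for the example rather than taken from an F table, and the names ols_sse and forward_select as well as the data are invented.

```python
def ols_sse(columns, y):
    """Fit y on an intercept plus the given predictor columns; return the SSE."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda idx: abs(aug[idx][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(k):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    coefs = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(coefs, row)) for row in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_select(candidates, y, f_to_enter=4.0):
    """Enter variables one at a time, highest F first; stop when none qualifies."""
    selected = []
    n = len(y)
    while len(selected) < len(candidates):
        current_sse = ols_sse([candidates[name] for name in selected], y)
        best_name, best_f = None, 0.0
        for name in candidates:
            if name in selected:
                continue
            trial = selected + [name]
            sse = ols_sse([candidates[t] for t in trial], y)
            p = len(trial) + 1
            f = (current_sse - sse) / (sse / (n - p))   # F for adding this variable
            if f > best_f:
                best_name, best_f = name, f
        if best_name is None or best_f < f_to_enter:
            break   # no remaining variable makes a large enough contribution
        selected.append(best_name)
    return selected


# Hypothetical data: Y is roughly 2 + 3*X1, and X2 is unrelated noise
candidates = {"x1": [1, 2, 3, 4, 5, 6], "x2": [1, -1, -1, 1, 1, -1]}
y = [5.1, 7.8, 11.1, 14.2, 16.9, 19.9]

selected = forward_select(candidates, y)
print(selected)
```

The sketch assumes the fit is never perfect (SSE greater than zero) and that there are more observations than parameters; a fuller version would guard against both cases.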
Regression with Dummy Variables
Regression procedure used when at least one of the explanatory variables is categorical (either nominal or ordinal)
A categorical variable with k levels is converted into (k - 1) indicator variables
Example:
Levels of the original variable, X4: X41, X42, X43, X44
Dummy variables: DX41, DX42, DX43
Regression with Dummy Variables
The dummies are defined as follows:
DX41 = 1 if the unit belongs to level X41 of variable X4; 0 otherwise
DX42 = 1 if the unit belongs to level X42 of variable X4; 0 otherwise
DX43 = 1 if the unit belongs to level X43 of variable X4; 0 otherwise
Note that when the three dummies are all zero, the unit belongs to the omitted level of variable X4 (X44). Thus, there is no need for another dummy variable.
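The coding rule can be illustrated with a short sketch that turns a categorical variable with k levels into k - 1 indicator columns. The function name make_dummies and the five sample units are invented for the example; the last level in the list is treated as the omitted level.

```python
def make_dummies(values, levels):
    """Return k-1 indicator columns; the last entry of `levels` is the omitted level."""
    kept = levels[:-1]
    return {f"D{level}": [1 if v == level else 0 for v in values] for level in kept}


# Hypothetical units and their level of X4
x4 = ["X41", "X43", "X44", "X42", "X41"]
dummies = make_dummies(x4, ["X41", "X42", "X43", "X44"])

for name, column in dummies.items():
    print(name, column)
```

The third unit belongs to X44, so all three of its dummies are zero.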
Regression with Dummy Variables
The Model:
Y = a + b1X1 + b2X2 + b3X3 + b4X4
If X4 is a categorical variable, the adjusted model is:
Y = a + b1X1 + b2X2 + b3X3 + b41DX41 + b42DX42 + b43DX43
Regression with Dummy Variables
Interpretation of the Coefficients of Dummy Variables
Value of X4    Mean value of Y
X41            Y = a + b1X1 + b2X2 + b3X3 + b41
X42            Y = a + b1X1 + b2X2 + b3X3 + b42
X43            Y = a + b1X1 + b2X2 + b3X3 + b43
X44            Y = a + b1X1 + b2X2 + b3X3
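The interpretation can be checked in the simplest possible case: a model with a single dummy variable and no other explanatory variables. This is a deliberately reduced sketch with invented data and a hypothetical fit_line helper; in it, the intercept equals the mean of Y in the omitted level and the dummy's coefficient equals the difference between the two level means.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical data: dummy = 0 for the omitted level, 1 for the other level
dummy = [0, 0, 0, 1, 1]
y = [10, 12, 14, 20, 22]

a, b = fit_line(dummy, y)

mean_omitted = sum(v for d, v in zip(dummy, y) if d == 0) / dummy.count(0)
mean_other = sum(v for d, v in zip(dummy, y) if d == 1) / dummy.count(1)
print(a, b, mean_omitted, mean_other - mean_omitted)
```

With further explanatory variables in the model, the same reading applies for fixed values of those variables.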