CHAPTER 14 Regression Analysis
LEARNING OBJECTIVES
14.1 INTRODUCTION
Is there any functional (or algebraic) relationship between two variables? If yes, can it be
used to estimate the most likely value of one variable, given the value of the other variable?
The statistical technique that expresses the relationship between two or more
variables in the form of an equation, in order to estimate the value of one variable based on the
given value of another, is called regression analysis. The variable whose
value is estimated using the algebraic equation is
called the dependent (or response) variable, and the variable whose value is used to
make this estimate is called the independent (regressor or predictor) variable. The
linear algebraic equation used for expressing a dependent variable in terms of an
independent variable is called the linear regression equation.
The term regression was used in 1877 by Sir Francis Galton while studying the
relationship between the heights of fathers and sons. He found that though 'tall fathers
have tall sons', if the average height of tall fathers is x above the general height,
the average height of their sons is only 2x/3 above the general height. Such a fall in the average
height was described by Galton as 'regression to mediocrity'. However, Galton's theory
is not universally applicable, and the term regression is now applied to other types
of variables in business and economics. In the literary sense, the term regression
means 'moving backward'.
The basic differences between correlation and regression analysis are summarized as
follows:
1. Developing an algebraic equation between two variables from sample data and predicting
the value of one variable, given the value of the other variable, is referred to as regression
analysis, while measuring the strength (or degree) of the relationship between two
variables is referred to as correlation analysis. The sign of the correlation coefficient indicates
the nature (direct or inverse) of the relationship between the two variables, while its absolute
value indicates the extent of the relationship.
2. Correlation analysis determines an association between two variables x and y, but not whether
they have a cause-and-effect relationship. Regression analysis, in contrast to correlation,
determines the cause-and-effect relationship between x and y, that is, a change in the
value of the independent variable x causes a corresponding change (effect) in the value of
the dependent variable y, provided all other factors that affect y remain unchanged.
3. In linear regression analysis one variable is considered as the dependent variable and the other
as the independent variable, while in correlation analysis both variables are considered to be
independent.
4. The coefficient of determination r² indicates the proportion of the total variance in the
dependent variable that is explained or accounted for by the variation in the
independent variable. Since the value of r² is determined from a sample, it is subject to
sampling error. Even if the value of r² is high, the assumption of a linear regression may
be incorrect, because it may represent only a portion of a relationship that is actually in the
form of a curve.
The particular form of the regression model depends upon the nature of the problem
under study and the type of data available. However, each type of association or
relationship can be described by an equation relating a dependent variable to one or
more independent variables.
The simple linear regression model based on population data takes the form

E(y|x) = β0 + β1x        (14-1)

where β0 = y-intercept that represents the mean (expected) value of y when x = 0, and
β1 = slope of the regression line that represents the expected change in
the value of y (either positive or negative) for a unit change in the
value of x.
The intercept β0 and the slope β1 are unknown regression coefficients. Equation
(14-1) requires the values of β0 and β1 to be computed in order to predict the average value of y for a
given value of x. Fig. 14.1 presents a scatter diagram in which each pair of
values (xi, yi) represents a point in a two-dimensional coordinate system. Although
the mean value of y is a linear function of x, not all values of y fall
exactly on the straight line; rather, they fall around it.
Since some points do not fall on the regression line, the values of y are not
exactly equal to the values yielded by the equation E(y|x) = β0 + β1x, also called the line
of means. The deviations of observed y values from this line give rise to
random error (also called residual variation or residual error) in
the prediction of y values for given values of x. In such a situation, it is likely that the
variable x does not explain all the variability of the variable y. For instance, sales
volume is related to advertising, but if other factors related to sales are ignored, then
a regression equation to predict the sales volume (y) by using the annual advertising
budget (x) as a predictor will probably involve some error. Thus for a fixed value
of x, the actual value of y is determined by the mean value function plus a random
error term as follows:

y = E(y|x) + e = β0 + β1x + e        (14-2)
The error component e allows each individual value of y to deviate from the line of
means by a small amount. The random errors corresponding to different
observations (xi, yi) for i=1, 2,…, n are assumed to follow a normal distribution with
mean zero and (unknown) constant standard deviation.
The term e in the expression (14-2) is called the random error because its value,
associated with each value of variable y, is assumed to vary unpredictably. The
extent of this error for a given value of x is measured by the error variance σe². The lower
the value of σe², the better the fit of the linear regression model to the sample data.
If the line passing through the pair of values of variables x and y is curvilinear, then
the relationship is called nonlinear. A nonlinear relationship implies a varying
absolute change in the dependent variable with respect to changes in the value of the
independent variable. A nonlinear relationship is not very useful for predictions.
Each observed value yi can be expressed as yi = ŷi + ei, where ŷi (called y hat) is the value of y lying on the fitted regression line for a
given value of x, and ei = yi − ŷi is called the residual, which describes the error in fitting
the regression line to the observation yi. The fitted value is also called the predicted
value of y because, if the actual value of y is not known, it would be predicted for a
given value of x using the estimated regression line.
Remark: The sum of the residuals is zero for any least-squares regression line.
Since Σyi = Σŷi, it follows that Σei = Σ(yi − ŷi) = 0.
Assumptions
The simple linear regression model rests on the following assumptions:
1. The relationship between the dependent variable y and the independent variable x is linear.
2. For each fixed value of x, the values of y are normally distributed, with the mean lying on the regression line.
3. The values of the independent variable x are fixed, while the dependent variable y is continuous.
4. The error terms are independent of one another and normally distributed with mean zero and constant variance.
5. Estimation is valid only within the range of the sample data; values of y should not be estimated for values of x outside this range.
The device used for estimating the value of one variable from the value of the other
consists of a line through the points, drawn in such a manner as to represent the
average relationship between the two variables. Such a line is called a line of regression.
For two variables there are two such lines: the regression equation of y on x, y = a + bx, is
used for estimating the value of y for given values of x, while the regression equation of x on y,
x = c + dy
is used for estimating the value of x for given values of y.
To estimate the values of the population parameters β0 and β1, under certain assumptions, the
fitted or estimated regression equation representing the straight-line regression
model is written as:

ŷ = a + bx

where ŷ = estimated average (mean) value of the dependent variable y for a given value of x
a = y-intercept that represents the value of ŷ when x = 0
b = slope of the regression line that represents the expected change in the
value of y for a unit change in the value of x

To determine the value of ŷ for a given value of x, this equation requires the
determination of two unknown constants a (intercept) and b (also called the regression
coefficient). Once these constants are calculated, the regression line can be used to
compute an estimated value of the dependent variable y for a given value of the
independent variable x.
Remarks
1. The correlation coefficient is the geometric mean of the two regression coefficients, that
is, r = ±√(byx · bxy), taking the sign common to byx and bxy.
2. If one regression coefficient is greater than one, then the other regression coefficient must be
less than one, because the value of the correlation coefficient r cannot exceed one. However,
both regression coefficients may be less than one.
3. Both regression coefficients must have the same sign (either positive or negative). This
property rules out the case of opposite signs of the two regression coefficients.
4. The correlation coefficient will have the same sign (either positive or negative) as that of
the two regression coefficients. For example, if byx = −0.664 and bxy = −0.234,
then r = −√(0.664 × 0.234) = −0.394.
5. The magnitude of the arithmetic mean of the regression coefficients byx and bxy is greater than
or equal to the magnitude of the correlation coefficient r, that is, |(byx + bxy)/2| ≥ |r|. For example, if byx = −0.664 and bxy =
−0.234, then the arithmetic mean of these two values is (−0.664 − 0.234)/2 = −0.449,
and its magnitude 0.449 exceeds |r| = 0.394.
6. Regression coefficients are independent of origin but not of scale.
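To make properties 1, 4, and 5 concrete, the short Python sketch below checks them numerically with the coefficient values byx = −0.664 and bxy = −0.234 quoted above; it is an illustrative aid, not part of the original text.

import math

b_yx, b_xy = -0.664, -0.234

# Properties 1 and 4: r is the geometric mean of the two regression
# coefficients and carries their common (here negative) sign.
r = -math.sqrt(b_yx * b_xy)
print(round(r, 3))                        # -0.394, as stated in the text

# Property 5: the magnitude of the arithmetic mean of the coefficients
# is at least the magnitude of r.
am = (b_yx + b_xy) / 2
print(round(am, 3), abs(am) >= abs(r))    # -0.449 True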
Let ŷ = a + bx be the least squares line of y on x, where ŷ is the estimated average
value of the dependent variable y. The line that minimizes the sum of squares of the
deviations of the observed values of y from those predicted is the best-fitting line.
Thus the sum of squared residuals for any least-squares line is a minimum, where

Σei² = Σ(yi − ŷi)² = Σ(yi − a − bxi)²

Setting the partial derivatives of this sum with respect to a and b equal to zero yields the pair of normal equations

Σy = na + bΣx
Σxy = aΣx + bΣx²

These are the normal equations (14-3); solving these two equations gives the values of the constants a and b.
Similarly, if we have a least squares line x̂ = c + dy of x on y, where x̂ is the estimated
mean value of the dependent variable x, then the normal equations will be

Σx = nc + dΣy
Σxy = cΣy + dΣy²

These equations are solved in the same manner as described above for the
constants c and d. The values of these constants are then substituted into the regression equation x̂ = c + dy.
Instead of using the algebraic method to calculate the values of a and b, we may directly
use the results of the solutions of these normal equations.
Since each regression line passes through the point (x̄, ȳ), the mean values
of x and y, the regression equations can be used to find the values of the
constants a and c as follows:

a = ȳ − b x̄  and  c = x̄ − d ȳ
Example 14.1: Use the least squares regression line to estimate the increase in sales
revenue expected from an increase of 7.5 per cent in advertising expenditure.

Firm   Annual increase in advertising expenditure (x)   Annual increase in sales revenue (y)
A                    1                                              1
B                    3                                              2
C                    4                                              2
D                    6                                              4
E                    8                                              6
F                    9                                              8
G                   11                                              8
H                   14                                              9
Substituting the values of a = 0.072 and b = 0.704 in the regression equation, we get
ŷ = 0.072 + 0.704x
For x = 0.075, we have ŷ = 0.072 + 0.704 (0.075) = 0.1248 or 12.48%.
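As a quick cross-check of Example 14.1, the following Python sketch computes a and b directly from the tabulated x and y values; it assumes only the data shown above. It yields b ≈ 0.7045 and a ≈ 0.068; the chapter's a = 0.072 results from rounding b to 0.704 before computing a.

import numpy as np

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)  # advertising column
y = np.array([1, 2, 2, 4, 6, 8, 8, 9], dtype=float)    # sales column

n = len(x)
# Slope and intercept from the least squares normal equations
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = y.mean() - b * x.mean()
print(a, b)   # approximately 0.068 and 0.7045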
Example 14.2: The owner of a small garment shop is hopeful that his sales are
rising significantly week by week. Treating the sales for the previous six weeks as a
typical example of this rising trend, he recorded them in Rs 1000’s and analysed the
results
Fit a linear regression equation to suggest to him the weekly rate at which his sales
are rising and use this equation to estimate expected sales for the 7th week.
Solution: Assume that sales (y) depend on weeks (x). Then the normal equations
for the regression equation ŷ = a + bx are:

Σy = na + bΣx;  Σxy = aΣx + bΣx²
Calculations for sales during various weeks are shown in Table 14.2.
Substituting the values a = 2.64 and b = 0.025 in the regression equation, we have
ŷ = 2.64 + 0.025x. For the 7th week (x = 7), ŷ = 2.64 + 0.025(7) = 2.815.
Hence the expected sales during the 7th week are likely to be Rs 2,815 (2.815 in Rs 1000's).
Calculations for the least squares normal equations become lengthy and tedious when the
values of x and y are large. The following methods may be used to reduce
the computational time.
(a) Deviations Taken from Actual Mean Values of x and y The regression equation of y on x is

ŷ − ȳ = byx (x − x̄)

where byx = regression coefficient of y on x. The value of byx can be calculated using the formula

byx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
Regression equation of x on y:

x̂ − x̄ = bxy (y − ȳ)

where bxy = regression coefficient of x on y. The value of bxy can be calculated using the formula

bxy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)²
(b) Deviations Taken from Assumed Mean Values for x and y If the mean value
of either x or y or both is fractional, then it is preferable to take deviations of the
actual values of the variables x and y from suitably chosen assumed means. The regression coefficients are then

byx = [nΣdxdy − (Σdx)(Σdy)] / [nΣdx² − (Σdx)²]
bxy = [nΣdxdy − (Σdx)(Σdy)] / [nΣdy² − (Σdy)²]

where n = number of observations
dx = x − A; A is the assumed mean of x
dy = y − B; B is the assumed mean of y
(c) Regression Coefficients in Terms of Correlation Coefficient If
deviations are taken from actual mean values, then the values of the regression
coefficients can alternatively be calculated as

byx = r (σy/σx)  and  bxy = r (σx/σy)
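Method (c) is easy to verify numerically. The sketch below uses the illustrative summary values r = 0.8, σx = 3, and σy = 12 (the same values that appear in a later example) and confirms that the product of the two coefficients equals r²; the variable names are illustrative, not the book's.

r, sigma_x, sigma_y = 0.8, 3.0, 12.0

b_yx = r * sigma_y / sigma_x     # slope of y on x: 3.2
b_xy = r * sigma_x / sigma_y     # slope of x on y: 0.2
print(b_yx, b_xy)
print(abs(b_yx * b_xy - r ** 2) < 1e-12)   # True: b_yx * b_xy = r^2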
Example 14.3:
(a) Obtain the regression equation of sales on intelligence test scores of the
salesmen.
(b) If the intelligence test score of a salesman is 65, what would be his expected
weekly sales? [HP Univ., MCom, 1996]
Solution: Assume weekly sales (y) as dependent variable and test scores (x) as
independent variable. Calculations for the following regression equation are shown
in Table 14.3.
Hence we conclude that the weekly sales are expected to be Rs 53.75 (in Rs 1000's) for
a test score of 65.
Example 14.4: A company is introducing a job evaluation scheme in which all jobs
are graded by points for skill, responsibility, and so on. Monthly pay scales (Rs in
1000's) are then drawn up according to the number of points allocated and other
factors such as experience and local conditions. To date the company has applied
this scheme to 9 jobs:
1. Find the least squares regression line for linking pay scales to points.
2. Estimate the monthly pay for a job graded by 20 points.
Solution: Assume monthly pay (y) as the dependent variable and job grade points
(x) as the independent variable. Calculations for the following regression equation
are shown in Table 14.4.
(a) Since the mean values x̄ and ȳ are not integers, deviations are taken from
assumed means, as shown in Table 14.4.
(b) For job grade point x = 20, the estimated average pay scale is given by
Hence, the likely monthly pay for a job with 20 grade points is Rs 5,986.
Example 14.5: The following data give the ages and blood pressure of 10 women.
We may conclude that there is a high degree of positive correlation between age and
blood pressure.
(c) For a woman whose age is 45, the estimated average blood pressure will be
In view of the above, how much expenditure on advertisement would you suggest
the General Sales Manager of the enterprise to incur to meet his target of sales?
[Kurukshetra Univ., MBA, 1998]
Given r = 0.8, σx = 40, σy = 25, x̄ = 45,000, ȳ = 30,000. Substituting these values in the
above equation, we have
When a sales target is fixed at x = 80,000, the estimated amount likely to be spent
on advertisement would be
                        Advertisement expenditure (x)   Sales (y)
Arithmetic mean                      10                     90
Standard deviation, σ                 3                     12

Given x̄ = 10, ȳ = 90, σx = 3, σy = 12, and r = 0.8. Substituting these values in the above
regression equation, we have
Thus the likely sales for advertisement budget of Rs 15 lakh is Rs 106 lakh.
Hence, the likely advertisement budget of Rs 16 lakh should be sufficient to attain
the sales target of Rs 120 lakh.
Variance of x = 9
Multiplying the first equation by 5 and subtracting from the second, we have
Rewriting the given regression equations in such a way that the coefficient of the
dependent variable is less than one in at least one equation:
Example 14.9: There are two series of index numbers, P for price index and S for
stock of a commodity. The mean and standard deviation of P are 100 and 8 and of S
are 103 and 4 respectively. The correlation coefficient between the two series is 0.4.
With these data, work out a linear equation to read off values of P for various values
of S. Can the same equation be used to read off values of S for various values of P?
Given P̄ = 100, S̄ = 103, σP = 8, σS = 4, and r = 0.4. Substituting these values in the above
equation, we have
This equation cannot be used to read off values of S for various values of P. Thus to
read off values of S for various values of P we use another regression equation of the
form:
What is the correlation coefficient and what is its probable error? Show that the ratio
of the coefficient of variability of x to that of y is 5/24. What is the ratio of variances
of x and y?
That is, bxy = 6/5
Hence
Similarly, the regression equation for estimating the advertising budget (y) for a sales
turnover (x) of Rs 200 lakh is written as:
For y = 150, we have x = 116.65 – 0.414 × 150 = Rs 54.55 lakh
2. Regression equation of advertising budget (y) on sales turnover (x) is:
For x = 200, we have y = 76.457 – 0.04 (200) = Rs 68.457 thousand.
Mean   Standard deviation
Write down the regression equation and estimate the expenditure on food and
entertainment if the expenditure on accommodation is Rs 200.
[Bangalore Univ., BCom, 1998]
14.3 The following data give the experience of machine operators and their
performance ratings given by the number of good parts turned out per 100 pieces:
14.4 A study of prices of a certain commodity at Delhi and Mumbai yields the
following data:
Delhi Mumbai
14.6 A company wants to assess the impact of R&D expenditure (Rs in 1000's) on its
annual profit (Rs in 1000's). The following table presents the information for the
last eight years:

Year   R&D expenditure   Annual profit
1991          9                45
1992          7                42
1993          5                41
1994         10                60
1995          4                30
1996          5                34
1997          3                25
1998          2                20
Estimate the regression equation and predict the annual profit for the year 2002
for an allocated sum of Rs 1,00,000 as R&D expenditure.
[Jodhpur Univ., MBA, 1998]
14.7 Obtain the two regression equations from the following bivariate frequency
distribution:
Estimate (a) the sales corresponding to advertising expenditure of Rs 50,000,
(b) the advertising expenditure for a sales revenue of Rs 300 lakh, (c) the
coefficient of correlation.
[Delhi Univ., MBA, 2002]
Fit a linear least squares regression equation of production rating on test score.
[Delhi Univ., MBA, 200]
14.10 Suppose that you are interested in using past expenditure on R&D by a firm to
predict current expenditures on R&D. You got the following data by taking a random
sample of firms, where x is the amount spent on R&D (in lakh of rupees) 5 years ago
and y is the amount spent on R&D (in lakh of rupees) in the current year:
x : 30 50 20 180 10 20 20 40
y : 50 80 30 110 20 20 40 50
1. Obtain the regression equation of sales on intelligence test scores of the salesmen.
2. If the intelligence test score of a salesman is 65, what would be his expected weekly sales?
14.13 For a given set of bivariate data, the following results were obtained:
x̄ = 53.2, ȳ = 27.9,
1996 12 5.0
1997 15 5.6
1998 17 5.8
1999 23 7.0
2000 24 7.2
2001 38 8.8
2002 42 9.2
2003 48 9.5
[Bharathidasan Univ., MBA, 2003]
x − x̄ = bxy (y − ȳ)
x – 48.33 = –1.102 (y – 30.83)
or x = 82.304 – 1.102y
(b)
(c)
14.8 Let test score and production rating be denoted by x and y respectively.
(b) Regression equation of production (x) on capacity utilization (y)
Hence the estimated production is 2,42,647 units when the capacity utilization is
70 per cent.
14.10 x̄ = Σx/n = 270/8 = 33.75; ȳ = Σy/n = 400/8 = 50
Residual: ei = yi − ŷi
The residual values ei are plotted on a diagram with respect to the least squares
regression line ŷ = a + bx. These residual values represent the error of estimation for
individual values of the dependent variable and are used to estimate the variance σe² of
the error term. In other words, residuals are used to estimate the amount of
variation in the dependent variable with respect to the least squares regression line.
Here it should be noted that these variations are not the deviations of the
observations from the mean value of the sample data set; rather, they are
the vertical distances of every observation (dot point) from the least squares line, as
shown in Fig. 14.3.
Since the sum of the residuals is zero, it is not possible to determine the total
amount of error by summing the residuals. This zero-sum characteristic of residuals
can be avoided by squaring the residuals and then summing them. That is,

SSE = Σei² = Σ(yi − ŷi)²

The estimate of the variance of the error term is then obtained by dividing SSE by its degrees of freedom, n − 2:

S²y.x = SSE/(n − 2) = Σ(yi − ŷi)²/(n − 2)

Its positive square root, Sy.x, is called the standard error of estimate.
Figure 14.3 Residuals
The variance S²y.x measures how well the least squares line fits the sample y-values.
A large variance, and hence a large standard error of estimate, indicates a large amount of scatter or
dispersion of the dot points around the line. The smaller the value of Sy.x, the closer the dot
points (y-values) fall around the regression line, the better the line fits the data, and the
better it describes the average relationship between the two variables. When all dot
points fall on the line, the value of Sy.x is zero, and the relationship between the two
variables is perfect.
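A minimal Python sketch of the standard error of estimate follows, assuming paired sample data (the Example 14.1 figures are reused purely for illustration); the divisor n − 2 reflects the two estimated constants a and b.

import numpy as np

def standard_error_of_estimate(x, y):
    n = len(x)
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    a = np.mean(y) - b * np.mean(x)
    e = y - (a + b * x)                      # residuals about the fitted line
    return np.sqrt(np.sum(e ** 2) / (n - 2))

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)
y = np.array([1, 2, 2, 4, 6, 8, 8, 9], dtype=float)
print(standard_error_of_estimate(x, y))      # about 0.86 for these data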
A smaller variance about the regression line is considered useful in predicting the
value of a dependent variable y. In actual practice, some variability is always left
over about the regression line. It is important to measure such variability due to the
following reasons:
1. This value provides a way to determine the usefulness of the regression line in predicting
the value of the dependent variable.
2. This value can be used to construct interval estimates of the dependent variable.
3. Statistical inferences can be made about other components of the problem.
1. Find the equation of the least squares line fitting the data.
2. Estimate the value of sales corresponding to advertising expenditure of Rs 30 lakh.
3. Calculate the standard error of estimate of sales on advertising expenditure.
(a) The calculations for the least squares line are shown in Table 14.7
Table 14.8 gives the fitted values and the residuals for the data in Table 14.7. The
fitted values are obtained by substituting the value of x into the regression equation
(equation for the least squares line). For example, 8.608 + 0.712(10) = 15.728.
The residual is equal to the actual value minus fitted value. The residuals indicate
how well the least squares line fits the actual data values.
(b) The least squares equation obtained in part (a) may be used to estimate the sales
turnover corresponding to the advertising expenditure of Rs 30 lakh as:
(c) Calculations for standard error of estimate Sy.x of sales (y) on advertising
expenditure (x) are shown in Table 14.9.
The objective of regression analysis is to develop a regression model that best fits the
sample data, so that the residual variance S²y.x is as small as possible. But the value of
S²y.x depends on the scale in which the sample y-values are measured. This
drawback restricts the interpretation of S²y.x unless we consider
the units in which the y-values are measured. Thus, we need another measure of fit,
called the coefficient of determination, that is not affected by the scale of the
sample y-values. It is the proportion of the variability of the dependent
variable y accounted for, or explained by, the independent variable x; that is, it
measures how well (how strongly) the regression line fits the data. The coefficient of
determination is denoted by r² and its value ranges from 0 to 1. A particular r² value
should be interpreted as high or low depending upon the use and context in which
the regression model was developed. The coefficient of determination is given by

r² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST

where SST = Σ(y − ȳ)² is the total sum of squares, SSE = Σ(y − ŷ)² is the error (residual)
sum of squares, and SSR = SST − SSE is the regression sum of squares.
The three variations associated with the regression analysis of a data set are shown
in Fig. 14.5. Thus SST = SSR + SSE.
Since this formula for r² is not always convenient to use, an easier computational formula for the
sample coefficient of determination is

r² = (aΣy + bΣxy − nȳ²) / (Σy² − nȳ²)
The value r² = 0.9352 indicates that 93.52 per cent of the variance in sales revenue is
accounted for, or statistically explained, by advertising expenditure.
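The decomposition r² = 1 − SSE/SST is straightforward to compute; the sketch below reuses the Example 14.1 data for illustration and checks the result against the squared correlation coefficient.

import numpy as np

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)
y = np.array([1, 2, 2, 4, 6, 8, 8, 9], dtype=float)

n = len(x)
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # error sum of squares
r2 = 1 - sse / sst
print(r2, np.corrcoef(x, y)[0, 1] ** 2)   # the two values agree (about 0.936 here)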
Correlation vs. Regression
• Nature of variables: in both cases, the variables are continuous and linearly related.
4. Why should a residual analysis always be done as part of the development of a regression
model?
5. What are the assumptions of simple linear regression analysis and how can they be
evaluated?
6. What is the meaning of the standard error of estimate?
7. What is the interpretation of y-intercept and the slope in a regression model?
8. What are regression lines? With the help of an example illustrate how they help in
business decision-making.
[Delhi Univ., MBA, 1998]
9. Point out the role of regression analysis in business decision-making. What are the
important properties of regression coefficients?
[Osmania Univ., MBA; Delhi Univ., MBA, 1999]
10.
a. Distinguish between correlation and regression analysis.
[Dipl in Mgt., AIMA, Osmania Univ., MBA, 1998]
Formulae Used
1. Simple linear regression model
   y = β0 + β1x + e
2. Simple linear regression equation based on sample data
   ŷ = a + bx
3. Regression coefficient in sample regression equation
   b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²);  a = ȳ − b x̄
4. Residual representing the difference between an observed
   value of dependent variable y and its fitted value
   e = y − ŷ
5. Standard error of estimate based on sample data
   Deviations formula: Sy.x = √[Σ(y − ŷ)² / (n − 2)]
   Computational formula: Sy.x = √[(Σy² − aΣy − bΣxy) / (n − 2)]
6. Coefficient of determination based on sample data
   Sums of squares formula: r² = 1 − SSE/SST
   Computational formula: r² = (aΣy + bΣxy − nȳ²) / (Σy² − nȳ²)
7. Regression sum of squares: SSR = SST − SSE
8. Interval estimate based on sample data: ŷ ± tdf Sy.x
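Formula 8 gives an approximate interval estimate. A hedged Python sketch of it follows, using scipy.stats.t.ppf for the t value with n − 2 degrees of freedom and reusing the Example 14.1 data; the prediction point x0 is an arbitrary illustrative choice.

import numpy as np
from scipy import stats

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)
y = np.array([1, 2, 2, 4, 6, 8, 8, 9], dtype=float)

n = len(x)
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
a = y.mean() - b * x.mean()
s_yx = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

x0 = 10.0                              # value of x at which to predict
t = stats.t.ppf(0.975, df=n - 2)       # two-sided 95% t value, n - 2 df
y0 = a + b * x0
print(y0 - t * s_yx, y0 + t * s_yx)    # approximate 95% interval for y at x0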
True or False
1. A statistical relationship between two variables does not indicate a perfect relationship.
(T/F)
3. The residual value is required to estimate the amount of variation in the dependent
variable with respect to the fitted regression line.
(T/F)
5. Standard error of estimate is a measure of scatter of the observations about the regression
line.
(T/F)
6. If one of the regression coefficients is greater than one the other must also be greater than
one.
(T/F)
9. If the sign of two regression coefficients is negative, then sign of the correlation
coefficient is positive.
(T/F)
11. The point of intersection of two regression lines represents average value of two variables.
(T/F)
12. The two regression lines are at right angle when the correlation coefficient is zero.
(T/F)
13. When value of correlation coefficient is one, the two regression lines coincide.
(T/F)
15. The regression coefficients are independent of the change of origin but not of scale.
(T/F)
Multiple Choice
16. The ‘line of best fit’ to measure the variation of observed values of the dependent
variable in the sample data is
1. regression line
2. correlation coefficient
3. standard error
4. none of these
1. r = 0
2. r = 1/3
3. r = – 1/2
4. r = ± 1
1. bxy
2. byx
3. r
4. none of these
1. r = 0
2. r = 1/3
3. r = – 1/2
4. r = ± 1
1. negative
2. positive
3. zero
4. none of these
1.
2.
3.
4.
23. If the two regression lines are y = a + bx and x = c + dy, then the ratio of the standard
deviations of x and y is
1.
2.
3.
4.
1. b/d
2.
3.
4.
25. If two coefficients of regression are 0.8 and 0.2, then the value of coefficient of
correlation is
1. 0.16
2. – 0.16
3. 0.40
4. –0.40
26. If two regression lines are: y = 4 + kx and x = 5 + 4y, then the range of k is
1. k ≤ 0
2. k ≥ 0
3. 0 ≤ k≤ 1
4. 0 ≤ 4k ≤1
27. If the two regression lines are x + 3y − 7 = 0 and 2x + 5y = 12, then x̄ and ȳ are
respectively
1. 2,1
2. 1,2
3. 2,3
4. 2,4
1. minimized
2. increased
3. maximized
4. decreased
1. closeness
2. variability
3. linearity
4. none of these
1.
2.
3.
4.
Average 30 50
Standard deviation 5 10
Coefficient of correlation r = 0.8.
[Bharthidarsan Univ., MCom, 1996]
Mean 20 120
Standard deviation 5 25
14.21 The following table gives the age of cars of a certain make and their annual
maintenance costs. Obtain the regression equation for costs related to age.
14.22 An analyst in a certain company was studying the relationship between travel
expenses in rupees (y) for 102 sales trips and the duration in days (x) of these trips.
He has found that the relationship between y and x is linear. A summary of the data
is given below:
Σx = 510; Σy = 7140; Σx2 = 4150; Σxy= 54,900, and Σy2 = 7,40,200
14.24 With ten observations on price (x) and supply (y), the following data were
obtained (in appropriate units): Σx = 130, Σy = 220, Σx2 = 2288, Σy2 = 5506, Σxy =
3467. Obtain the line of regression of y on x and estimate the supply when the price
is 16 units. Also find out the standard error of the estimate.
14.25 Data on the annual sales of a company in lakhs of rupees over the past 11
years is shown below. Determine a suitable straight line regression model y = β0 +
β1x + ∊ for the data. Also calculate the standard error of regression of y for values
of x.
From the regression line of y on x, predict the values of annual sales for the year
1989.
14.26 Find the equation of the least squares line fitting the following data:
14.27 The following data give the number of weeks of experience of 12 employees in a job and the number of motors rejected by each:

Employee   Weeks of experience   Number of motors rejected
1                   2                       26
2                   9                       20
3                   6                       28
4                  14                       16
5                   8                       23
6                  12                       18
7                  10                       24
8                   4                       26
9                   2                       38
10                 11                       22
11                  1                       32
12                  8                       25
1. Determine the linear regression equation for estimating the number of components
rejected given the number of weeks of experience. Comment on the relationship between
the two variables as indicated by the regression equation.
2. Use the regression equation to estimate the number of motors rejected for an employee
with 3 weeks of experience in the job.
3. Determine the 95 per cent approximate prediction interval for estimating the number of
motors rejected for an employee with 3 weeks of experience in the job, using only the
standard error of estimate.
14.28 A financial analyst has gathered the following data about the relationship
between income and investment in securities in respect of 8 randomly selected
families:
Solve for residuals and graph a residual plot. Do these data seem to violate any of
the assumptions of regression?
14.31 Graph the following residuals and indicate which of the assumptions
underlying regression appear to be in jeopardy on the basis of the graph:
Regression equation:
Regression equation:
When the age of the wife is y = 16, x = 10.92 + 0.64 (16) = 21.16, i.e., about 21
(husband's age)
2. Left as an exercise
14.18
Regression equation of y on x
2. When the advertisement expenditure is Rs 25 crore, the likely sales are
3. For y = 150, x = 0.8 + 0.16y = 0.8 + 0.16(150) = 24.8
14.19 Let x = marks in Statistics and y = marks in Accountancy,
2. When x = 13 years, the average income would be
14.21 Let x = age of cars and y = maintenance costs.
The regression equation of y on x
Regression lines:
1. Regression coefficients:
Regression lines:
2. For x = 124,
(b) When x = 16,
(c)
14.25 Take years as x = – 5, –4, –3, –2, –1, 0, 1, 2, 3, 4, 5,
where 1983 = 0. The regression equation is
14.26 x̄ = Σx/n = 15/5 = 3, ȳ = Σy/n = 20/5 = 4
The regression equation is:
14.27
Case Studies
The phrase ‘made in China’ has become an issue of concern in the last few years, as
Indian companies try to protect their products from overseas competition. In these
years a major trade imbalance in India has been caused by a flood of imported goods
that enter the country and are sold at lower prices than comparable Indian-made
goods. One prime concern is electronic goods, whose total imports have
steadily increased from the 1990s to 2004. Indian companies, worried about
complaints regarding product quality, worker layoffs, and high prices,
have spent millions in advertising to produce electronic goods that will satisfy
consumer demands. Have these companies been successful in stopping the flood of
imported goods purchased by Indian consumers? The given data represent the
volume of imported goods sold in India for the years 1989–2004. To simplify the
analysis, the year has been coded using the variable x = Year − 1989.
Year   x   Volume of imported goods sold
1989   0   1.1
1990 1 1.3
1991 2 1.6
1992 3 1.6
1993 4 1.8
1994 5 1.4
1995 6 1.6
1996 7 1.5
1997 8 2.1
1998 9 2.0
1999 10 2.3
2000 11 2.4
2001 12 2.3
2002 13 2.2
2003 14 2.4
2004 15 2.4
Regression analysis aims to model the relationship between dependent and independent variables, often positing causal links, while least squares estimation is a method used within regression to find the line that best fits the data by minimizing the sum of squared residuals. While regression encompasses the theoretical framework and its interpretation, least squares provides a computational approach to obtaining the parameter estimates.
In least-squares regression analysis, the fitted regression line minimizes the sum of squared residuals. The least-squares estimators are computed in such a way that the sum of the residuals (errors) is zero. This is a consequence of the formulation of least-squares estimation, in which Σyi = Σŷi, ensuring that there is no systematic deviation of the observed values from the regression line.
Correlation analysis identifies the association between two variables and indicates the direction and strength of this relationship using the correlation coefficient, but it does not imply causation. Regression analysis, on the other hand, posits a cause-and-effect relationship, determining how the dependent variable changes with the independent variable while assuming other factors remain constant. In correlation, both variables are treated as independent, whereas in regression, one is dependent and the other is independent.
The coefficient of determination, r², indicates the proportion of the total variance in the dependent variable that can be explained or accounted for by variation in the independent variable. A high r² value suggests a strong relationship in which the regression model explains a significant portion of the dependent variable's variance. However, r² is subject to sampling error and does not by itself confirm a linear relationship, as a high value may arise from a portion of a relationship that is actually curvilinear.
The assumption that errors are independent and normally distributed with mean zero and constant variance is crucial for making valid statistical inferences in regression analysis. It ensures that the parameter estimates are unbiased and efficient and that confidence intervals built from them are valid. Violations of these assumptions can lead to incorrect conclusions about the significance and impact of independent variables on the dependent variable.
The standard error of the estimate quantifies the average distance of the observed values from the regression line. A smaller standard error indicates that the data points are closely clustered around the line, suggesting a better fit and more accurate prediction. It also helps in constructing prediction intervals and in assessing how reliably the line captures the relationship between the variables.
Regression analysis allows for the development of a regression equation that can be used to estimate the value of a dependent variable from known values of one or more independent variables. It provides the standard error of estimate to gauge the variability of the dependent variable about the regression line, which sharpens the assessment of prediction accuracy. Additionally, regression analysis facilitates interval estimation, which becomes more reliable as the sample grows.
Estimating the dependent variable for values of the independent variable outside the sample range is problematic because the regression model is only supported within the range of the observed data. Extrapolation is risky because the relationship may not hold outside this range, leading to inaccurate and unreliable predictions if the underlying pattern changes in ways the model has not captured.
Residuals, defined as the differences between observed and predicted values, gauge the accuracy of a regression line. Analyzing residuals helps in diagnosing the fit of the regression line; systematic patterns indicate model inadequacies. The sum of the residuals is zero by design in least squares, and their distribution helps in assessing assumptions such as homoscedasticity (constant variance) and independence, both crucial for model validity.
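As a companion to this point, here is a minimal residual-plot sketch in Python with matplotlib, again reusing the Example 14.1 data for illustration; a patternless horizontal band around zero is consistent with the constant-variance and independence assumptions.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 3, 4, 6, 8, 9, 11, 14], dtype=float)
y = np.array([1, 2, 2, 4, 6, 8, 8, 9], dtype=float)

b, a = np.polyfit(x, y, 1)     # least squares slope and intercept
e = y - (a + b * x)            # residuals

plt.scatter(x, e)
plt.axhline(0.0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual e = y - y_hat")
plt.title("Residual plot")
plt.show()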
In the simple linear regression model, the assumptions are: a linear relationship between the dependent and independent variables; normally distributed values of the dependent variable for every fixed value of the independent variable, with the mean lying on the regression line; a fixed independent variable and a continuous dependent variable; errors that are independent and normally distributed with constant variance; and no estimation outside the range of the sample data.