CH-1 Regression Data Analysis

CHAPTER ONE

1. INTRODUCTION
Correlation, the chi-square test of association (independence), and regression form an area of
inferential statistics concerned with determining whether a relationship exists between variables.
For example, an educator may want to know whether the number of hours a student studies is
related to the student's score on a particular exam. Medical researchers are interested in
questions such as: Is caffeine related to heart damage? Is there a relationship between a
person's age and his or her blood pressure? A zoologist may want to know whether the birth
weight of a certain animal is related to its life span. These are only a few of the many
questions that can be answered using the techniques of correlation, the chi-square test of
association, and regression analysis.
But what are correlation, the chi-square test of association, and regression analysis?
What are the similarities and differences between these techniques?
Correlation is used to determine whether there is a statistically significant linear relationship
between two quantitative variables. We investigate correlation using a scatter plot (graph)
and the linear correlation coefficient (a measure of the direction and magnitude of a linear
relationship between two quantitative variables).

The chi-square test of association is used to determine whether there is a significant relationship
between two categorical (nominal) variables. In this case, hypothesis testing based on the
chi-square distribution is applied to determine whether the variables are related.

Regression analysis has the same objective as correlation analysis and the chi-square test of
association: determining the relationship between two or more variables. Unlike correlation
and the chi-square test of association, however, regression is a more advanced technique that
is used to develop an equation relating two or more variables. The equation can then be used to
predict the values of one variable given the values of the other variables. The variable
whose values are predicted is called the dependent (response) variable, while the variables whose
values are used for prediction are called independent variables, predictors, or explanatory
variables.

The purpose of this module is to answer the following questions statistically:


 Are two or more variables linearly related?
 If so, what is the strength of the relationship?
 What type of relationship exists?

 What kind of predictions can be made from the relationship?

How do we measure association? Correlation and Chi-Square

It is useful to explore the concepts of association and correlation at this stage, as doing so will
stand us in good stead when we tackle regression in greater detail. Correlation refers
to statistically exploring whether the values of one variable increase or decrease
systematically with the values of another. For example, there is an association between IQ
test score and exam grade if individuals with high IQ scores also get good exam grades and
individuals with low IQ scores do worse in their exams.
This is very useful, but association cannot always be ascertained using correlation. What if
there are only a few values or categories that a variable can take? For example, can gender be
correlated with school type? There are only a few categories in each of these variables (e.g.,
male, female). Variables that are sorted into discrete categories such as these are known as
nominal variables. When researchers want to see whether two nominal variables are associated with
each other, they cannot use correlation, but they can use the chi-square test of association. In
the chi-square test of association we test the null hypothesis that there is no association
between the two nominal variables against the alternative that there is a statistically
significant association between them.

1.1 Correlation analysis


Visually identifying association - Scatter plots

Scatter plots are the best way to visualize correlation between two continuous (scale)
variables, so let us start by looking at a few basic examples. The graphs below illustrate a
variety of different bivariate relationships, with the horizontal axis (x-axis) representing one
variable and the vertical axis (y-axis) the other. Each point represents one individual and is
dictated by their score on each variable. Note that these examples are fabricated for the
purpose of our explanation - real life is rarely this neat! Let us look at each in turn:
Figure 1 displays three scatter plots overlaid on one another and represents maximum, or
'perfect', negative and positive correlations (blue and green respectively). The points
displayed in red provide us with a cautionary tale! It is clear there is a relationship between
the two variables, as the points display a clear arching pattern. However, they are not linearly
correlated, because the relationship is not consistent (not linear): for values of X up to about
75 the relationship with Y is positive, but for values of 100 or more the relationship is negative!

Figure 1: Examples of 'perfect' relationships

Figure 2: These two scatter plots look a little more like they may represent real-life data.
The green points show a strong positive correlation, as there is a clear relationship between
the variables: if a participant's score on one variable is high, their score on the other is also
high. The red points show a strong negative relationship: a participant with a high score on
one variable has a low score on the other.
Figure 2: Examples of strong correlations

Figure 3: This final scatter plot demonstrates a situation where there is no apparent
correlation. There is no discernable relationship between an individual's score on one
variable and their score on another.

Figure 3: Example of no clear relationship

Several points are evident from these scatter plots. If one variable increases when the other
increases, the correlation is positive; if one variable increases when the other decreases,
the correlation is negative. The strongest correlations occur when the data points fall exactly on
a straight line (Figure 1). The correlation becomes weaker as the data points become more
scattered. If the data points fall in a random pattern, there is no correlation (Figure 3).
Let us practice the above concepts with a real-life example. Suppose a researcher wishes
to see whether there is a linear relationship between the number of hours of study and test scores
on an exam. In simple correlation, the researcher collects data on the two numerical or
quantitative variables to see whether a relationship exists between them. Thus, she
must select a random sample of students, determine the hours each studied, and obtain their
scores on the exam. A table can be made for the data, as shown here.
Student    Hours of study (x)    Test score (y)

A 6 82
B 2 63
C 1 57
D 5 88
E 2 68
F 3 75

To determine the nature of the relationship between the two variables, either a scatter plot
or the correlation coefficient can be used.
A scatter plot is a graph of ordered pairs (x, y), where x is a value of one variable and y the
corresponding value of the other. After the plot is drawn, it should be
analyzed to determine which type of relationship, if any, exists.

Figure: Scatter plot of study hours and test score

The above plot suggests a positive relationship, since as the hours of study increase, students'
scores tend to increase as well.
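Such a scatter plot can be reproduced with any plotting tool. Below is a minimal sketch using matplotlib (the library choice and output file name are illustrative assumptions, not part of the text); the six (x, y) pairs come from the table above.

```python
# Sketch: scatter plot of study hours vs. test score (data from the table above).
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

hours = [6, 2, 1, 5, 2, 3]    # hours of study (x)
scores = [82, 63, 57, 88, 68, 75]  # test score (y)

fig, ax = plt.subplots()
ax.scatter(hours, scores)
ax.set_xlabel("Hours of study (x)")
ax.set_ylabel("Test score (y)")
ax.set_title("Scatter plot of study hours and test score")
fig.savefig("scatter_study_hours.png")  # hypothetical output file
```

The upward drift of the points is the visual signature of the positive relationship described above.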
Example
Construct a scatter plot for the data obtained in a study on the number of absences and the final
score of seven randomly selected students from a statistics class.
The data are shown here.
Student    Number of absences (x)    Final score (y)
A    6     82
B    2     86
C    15    43
D    9     74
E    12    58
F    5     90
G    8     78

Figure: Scatter plot of the number of absences and the final scores of the students

The plot of the data shown in the above figure suggests a negative relationship, since as the
number of absences increases, the final score decreases.
In general
 When higher values of one variable are associated with higher values of the other variable
and lower values of one variable are associated with lower values of the other variable, then
the correlation is said to be positive or direct.
Examples:
 Income and expenditure
 Number of hours spent in studying and the score obtained
 Distance covered and fuel consumed by car.
 When higher values of one variable are associated with lower values of the other variable and
lower values of one variable are associated with higher values of the other variable, then the
correlation is said to be negative or inverse.

Examples:
 Cigarette consumption and price of cigarettes
The correlation coefficient
The correlation coefficient computed from the sample data measures the strength and direction
of a linear relationship between two quantitative variables. The symbol for the sample correlation
coefficient is r; the symbol for the population correlation coefficient is ρ (the Greek letter rho).
 The range of the correlation coefficient is from −1 to 1. If there is a strong positive linear
relationship between the variables, the value of r will be close to 1.
 If there is a strong negative linear relationship between the variables, the value of r will
be close to −1.
 When there is no linear relationship between the variables, or only a weak one, the value of
r will be close to 0.
The correlation coefficient between any two variables X and Y, denoted by r, is given by

r = Σ(Xi − X̄)(Yi − Ȳ) / √[ Σ(Xi − X̄)² · Σ(Yi − Ȳ)² ]

Shortcut formula:

r = (ΣXiYi − n X̄ Ȳ) / √[ (ΣXi² − n X̄²)(ΣYi² − n Ȳ²) ]

Example
Calculate the correlation coefficient between mid-semester and final exam scores of 10 students
(both out of 50)

Student 1 2 3 4 5 6 7 8 9 10
Mid Exam(X) 31 23 41 32 29 33 28 31 31 33
Final Exam(Y) 31 29 34 35 25 35 33 42 31 34
Solution: n = 10, X̄ = 31.2, Ȳ = 32.9, X̄² = 973.44, Ȳ² = 1082.41,

ΣXiYi = 10331, ΣXi² = 9920, ΣYi² = 11003

r = (10331 − 10(31.2)(32.9)) / √[(9920 − 10(973.44))(11003 − 10(1082.41))]
  = 66.2 / 182.2 = 0.363

This means that the sample correlation coefficient between mid-semester score and final score is
0.363.
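The hand computation above can be checked in a few lines of NumPy (a sketch; the library choice is an assumption of this note, not part of the text):

```python
# Verifying r for the mid-semester/final exam data using the definitional formula.
import numpy as np

x = np.array([31, 23, 41, 32, 29, 33, 28, 31, 31, 33])  # mid exam (X)
y = np.array([31, 29, 34, 35, 25, 35, 33, 42, 31, 34])  # final exam (Y)

# r = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² Σ(Yi − Ȳ)²]
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
print(round(r, 3))  # 0.363
```

The same value is returned by `np.corrcoef(x, y)[0, 1]`, which implements the identical formula.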

1.2 Inference about the population correlation coefficient


As stated before, the range of the correlation coefficient is between −1 and 1. When the value of r
is near −1 or 1, there is a strong linear relationship. When the value of r is near 0, the linear
relationship is weak or nonexistent. The value of r is computed from data obtained from samples
only. If we obtained a different sample, we would obtain a different correlation coefficient and,
therefore, potentially different conclusions. Thus there are two possibilities when r is not
equal to zero: either the value of r is large enough to conclude that there is a significant linear
relationship between the variables (the population correlation coefficient is different from zero),
or the value of r is due to chance (the population correlation coefficient is not different from
zero). We therefore have to draw conclusions about populations, not just samples. The population
correlation coefficient is computed from all possible (x, y) pairs and is designated by the
Greek letter ρ (rho). The sample correlation coefficient can then be used as an estimator of ρ if
the following assumptions are valid.
 The variables x and y are linearly related.
 The variables are random variables.
 The two variables have a bivariate normal distribution.
A bivariate normal distribution means that for the pairs of (x, y) data values, the corresponding y
values have normal distribution for any given x value, and the x values for any given y value
have a normal distribution. In general, to test a hypothesis about the population correlation
coefficient based on the evidence obtained from the sample data, we go through the following
steps.
Hypothesis testing-steps
1. State the hypotheses.
Ho: ρ = 0 (This null hypothesis means that there is no statistically significant correlation
between the x and y variables in the population.)
Ha: ρ ≠ 0 (This alternative hypothesis means that there is a statistically significant correlation
between the variables in the population.)
2. Find the critical values.
Fix the level of significance (alpha value)

3. Compute the test value.

The test statistic is given by t = r √[(n − 2) / (1 − r²)]. Under the null hypothesis (Ho), this
test statistic will be approximately distributed as a t distribution with n − 2 degrees of freedom.


4. Make the decision.
We reject the null hypothesis (Ho) at level α if the absolute value of the test statistic, |t|,
is greater than the critical value from the t-table with n − 2 degrees of freedom.
5. Summarize the results.
When the null hypothesis is rejected at a specific level, it means that there is a significant
difference between the value of r and 0. When the null hypothesis is not rejected, it means that
the value of r is not significantly different from 0 (zero) and is probably due to chance.
Example: test whether the population correlation coefficient between the mid-semester exam result
and the final exam result is statistically different from zero at the 5% level of significance.
Solution
1. State the hypotheses.
Ho: ρ = 0 (There is no statistically significant correlation between the mid-semester result and
the final score in the population.)
Ha: ρ ≠ 0 (There is a statistically significant correlation between the mid-semester result and
the final score in the population.)
2. Find the critical values.
The alpha value is given as 0.05.
3. Compute the test value.

The test statistic is t = r √[(n − 2) / (1 − r²)] with n − 2 degrees of freedom, where r = 0.363
and n = 10.

Thus t_cal = 0.363 √[(10 − 2) / (1 − (0.363)²)] = 1.102

4. Locate the rejection region and make the decision.

t_tab = t(α/2, n − 2) = t(0.025, 8) = 2.306

The rejection region consists of |t| > 2.306 (that is, t < −2.306 or t > 2.306); the non-rejection
region is −2.306 ≤ t ≤ 2.306.
Decision: Ho is not rejected, because t_cal does not fall in the rejection region.
5. Summarize the results.
Conclusion: There is no strong evidence that the population correlation coefficient is
different from zero; that is, the data do not provide enough evidence to conclude that there is a
statistically significant correlation between the mid-semester exam result and the final exam result.

When the null hypothesis has been rejected for a specific α value, any of the following five
possibilities can exist.
 There is a direct cause-and-effect relationship between the variables. That is, x causes y.
For example, water causes plants to grow, poison causes death, and heat causes ice to
melt.
 There is a reverse cause-and-effect relationship between the variables. That is, y causes x.
For example, suppose a researcher found that people who smoke have higher levels of
stress, but failed to consider that the reverse may hold: higher
mental stress can actually influence a person to smoke.
 The relationship between the variables may be caused by a third variable, called a lurking
variable. For example, if a statistician correlated having kids and level of maturity, he or
she would probably find a significant relationship. However, having kids is not
necessarily a cause of attaining higher maturity levels, since both variables are related to
age: higher age leads both to having kids and to higher maturity levels.
 There may be a complexity of interrelationships among many variables. For example, a
researcher may find a significant relationship between students’ high school grades and
college grades. But there probably are many other variables involved, such as IQ, hours
of study, influence of parents, motivation, age, and instructors.
 The relationship may be coincidental. For example, a researcher may be able to find a
significant relationship between the increase in the number of people who are exercising

and the increase in the number of people who are committing crimes. But common sense
dictates that any relationship between these two values must be due to coincidence.
Exercise: The following data were collected from a certain household on monthly income (X) and
consumption (Y) for the past 10 months. Find the type of relationship that exists between the two
variables, using both a scatter plot and the simple correlation coefficient. Then, at the 5% level
of significance, test whether the population correlation coefficient between income and
consumption is different from zero.
X: 650 654 720 456 536 853 735 650 536 666
Y: 450 523 235 398 500 632 500 635 450 360
Thus, when the null hypothesis is rejected, the researcher must consider all possibilities and
select the appropriate one as determined by the study. Remember, neither regression nor
correlation analyses can be interpreted as establishing cause-and-effect relationships. They can
indicate only how or to what extent variables are associated with each other. The correlation
coefficient measures only the degree of linear association between two variables. Any
conclusions about a cause-and-effect relationship must be based on the judgment of the analyst.

1.3 Introduction to Regression analysis


The historical origin of regression
The term regression was introduced by Francis Galton, the renowned British scientist, while he was
engaged in the study of heredity. In a famous paper, Galton found that although there
was a tendency for tall parents to have tall children and for short parents to have short children,
the average height of children born to parents of a given height tended to move, or regress, toward
the average height of the population as a whole. In other words, the heights of children of
unusually tall or unusually short parents tend to move toward the average height of the population.
Galton's law of universal regression was later confirmed by his friend Karl Pearson, who collected
more than a thousand records of the heights of members of family groups. He found that the average
height of sons of a group of tall fathers was less than their fathers' height, and the average
height of sons of a group of short fathers was greater than their fathers' height, thus regressing
tall and short sons alike toward the average height of all men. This "regression toward mediocrity"
gave these statistical methods their name.
Regression analysis is a conceptually simple method for investigating functional relationships
among variables. A real estate appraiser may wish to relate the sale price of a home to
selected physical characteristics of the building (e.g., the number of rooms). We may wish to examine
whether cigarette consumption is related to various socioeconomic and demographic variables
such as age, education, income, and price of cigarettes. The relationship is expressed in the form
of an equation or a model connecting the response or dependent variable and one or more
explanatory or predictor variables. In the cigarette consumption example, the response variable
is cigarette consumption (measured by the number of packs of cigarette used by a person per
day) and the explanatory or predictor variables are the various socioeconomic and demographic
variables.
An essential first step in regression analysis is to determine whether the dependent variable is
significantly correlated with the explanatory variables, by using a scatter diagram or a hypothesis
test about the population correlation coefficient, because a variable should not be put into a
regression if it is not significantly correlated with the response.
Even though correlation and regression are related, in the sense that both deal with relationships
among variables, the two statistical techniques differ in several ways:
 Although correlation is a useful quantity for measuring the direction and strength of a
linear relationship, it cannot be used for prediction; that is, we cannot use
correlation to predict the value of one variable given the value of the other.
 In correlation both variables are assumed to be random, but in regression the independent
variable is assumed to be fixed.
 Furthermore, correlation measures only pairwise relationships. Regression analysis,
however, can be used to relate one or more response variables to one or more predictor
variables.
Thus regression analysis is an attractive extension to correlation analysis, because it postulates
a model that can be used not only to measure the direction and strength of a relationship
between the response and predictor variables, but also to describe that relationship numerically.
Regression takes the information obtained from correlation analysis and the chi-square test of
association and tries to develop an equation that describes the relationship between the
variables. A regression equation is used to predict the values of one variable from the values of
one or more other variables. The variable whose values are to be estimated is
called the dependent (outcome) variable, while the variables whose values are used for
estimation are called independent (explanatory) variables. Thus regression analysis
involves identifying the relationship between a dependent variable and one or more independent
variables. A model of the relationship is hypothesized, and estimates of the parameter values are
used to develop an estimated regression equation. Various tests are then employed to determine
if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation
can be used to predict the value of the dependent variable given values for the independent
variables.
Different names for dependent and independent variables
In different fields of study and in different literatures, the terms dependent and independent
variable go by different names. These are:
Dependent variable independent variable
Explained variable explanatory variable
Predictand predictor
Regressand regressor
Response stimulus
Endogenous exogenous
Outcome covariate
Controlled Factor

What is a lurking variable?

A lurking variable is a variable that is not included as an explanatory or response variable in the
analysis but can affect the interpretation of relationships between variables. A lurking variable
can falsely identify a strong relationship between variables or it can hide the true relationship.
For example, a student is conducting a research study on differences in body weight between
engineering and health science students at Debre Berhan University. His data show that
engineering students weigh more than health science students on average. Then his advisor
points out that the College of Engineering is 81.1% male while the health science college is only
7.7% male. The student conducts a second study that includes biological sex and finds that there
is no difference between engineering and health science students after controlling for it.
In the first study, sex was a lurking variable because the student had not taken it into
account.

Steps in regression analysis

Regression analysis study includes the following steps:


 Identifying the statistical problem(statement of the problem)
 Selection of potentially relevant variables
 Data collection
 Formulation of the model
 Model validation and criticism
 Using the chosen model(s) for prediction.
Statement of the problem
Regression analysis usually starts with a formulation of the problem. This includes the
determination of the question(s) to be addressed by the analysis. The problem statement is the
first and perhaps the most important step in regression analysis. It is important because an ill-
defined problem or a misformulated question can lead to wasted effort: it can lead to the
selection of an irrelevant set of variables, to a wrong choice of the statistical method of
analysis, or, ultimately, to the wrong choice of model.
Selection of potentially relevant variables
The next step after the statement of the problem is to select a set of variables that are thought,
by experts in the area of study, to explain or predict the response variable. The response variable
is denoted by Y and the explanatory or predictor variables by X1, X2, …, Xp, where p
denotes the number of predictor variables. An example of a response variable is the price of a
single-family house in a given geographical area. A possible set of relevant predictor variables in
this case is: area of the lot, area of the house, age of the house, number of bedrooms, number of
bathrooms, type of neighborhood, style of the house, amount of real estate taxes, etc.
Data collection
The next step after the selection of potentially relevant variables is to collect the data from the
environment under study to be used in the analysis. Sometimes the data are collected in a
controlled setting so that factors that are not of primary interest can be held constant. More often
the data are collected under nonexperimental (observational) conditions, where very little can be
controlled by the investigator. In either case, the collected data consist of observations on n
subjects. Each of these n observations consists of measurements for each of the potentially
relevant variables.

Model formulation
The form of the model that is thought to relate the response variable to the set of predictor
variables can be specified initially by experts in the area of study, based on their knowledge or
their objective and/or subjective judgment. Scatter plots, the correlation coefficient, and the
chi-square test of association can also be used to select the form of the model. After the model has been
defined and the data have been collected, the next task is to estimate the parameters of the model
based on the collected data. This is also referred to as parameter estimation or model fitting. The
most commonly used method of estimation is called the least squares method.
Model validation and criticism
The validity of a statistical method, such as regression analysis, depends on certain assumptions.
Assumptions are usually made about the data and the model. The accuracy of the analysis and
the conclusions derived from an analysis depends crucially on the validity of these assumptions.
The adequacy of the linear model can also be checked by using summary statistics such as
coefficient of determination and tests such as ANOVA.
Using the model for different purposes
The explicit determination of the regression equation is the most important product of the
analysis. It is a summary of the relationship between the response variable and the set of
predictor variables. The equation may be used for several purposes. It may be used to evaluate
the importance of individual predictors, to analyze the effects of policy that involves changing
values of the predictor variables, or to forecast values of the response variable for a given set of
predictors.
Types of Regression models
Depending on the nature of the dependent and independent variables and the type of relationship
between them, we can define different types of regression models.
 A simple regression equation is an equation that contains only one predictor variable.
 Multiple regression is used when there is more than one independent variable in the
model.
 Logistic regression is used when we have a categorical response variable. There are different
types of logistic regression:
 Binary logistic regression is used when the dependent variable is
dichotomous (with only two levels or categories).
 Multinomial logistic regression is used when the dependent variable is
categorical at the nominal level.
 Ordinal logistic regression is used when the dependent variable is
categorical at the ordinal level.
 A linear model is a model in which all parameters enter the equation linearly,
possibly after transformation of the data.
 A function is said to be linear in a parameter, say β1, if β1 appears with a power of 1 only
and is not multiplied or divided by any other parameter (for example, β0β1 or β0/β1).

The adjective "linear" has a dual role here. It may describe the fact that the relationship
between Y and X is linear. More generally, it refers to the fact that the regression
parameters, such as β0 and β1, enter the equation in a linear fashion. Thus, for example,
Y = β0 + β1X + β2X² + ε is also a linear model, even though the relationship between Y and X is
quadratic.
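This point can be checked numerically: because Y = β0 + β1X + β2X² is linear in the parameters, ordinary least squares applies directly. Below is a minimal sketch using NumPy on simulated data (the true coefficients 2.0, 1.5, −0.3 and the noise level are illustrative assumptions, not from the text):

```python
# Fitting a quadratic-in-X but linear-in-parameters model by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
# simulated data with true parameters beta0=2.0, beta1=1.5, beta2=-0.3 (assumed)
y = 2.0 + 1.5 * x - 0.3 * x ** 2 + rng.normal(0, 0.5, x.size)

# design matrix with columns 1, x, x^2 -- the model is linear in (b0, b1, b2)
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 1))
```

The estimates land close to the true parameter values, even though the curve relating Y to X is a parabola; linearity in the parameters, not in X, is what makes least squares applicable.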
 The other type of regression is nonlinear regression, which assumes that the relationship
between the dependent variable and the independent variables is not linear in the regression
parameters. An example of a nonlinear regression model (a growth model) may be written as

Y = α / (1 + e^(βt)) + ε

where Y is the growth of a particular organism as a function of time t, α and β are model
parameters, and ε is the random error.
Nonlinear regression models are more complicated than linear regression models in terms of
parameter estimation, model selection, model diagnosis, variable selection, outlier detection,
and influential-observation identification.
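As a sketch of how such a nonlinear growth model might be fitted in practice, the following uses scipy.optimize.curve_fit on simulated data (the true values α = 5, β = −0.8, the noise level, and the starting guess are illustrative assumptions):

```python
# Fitting the growth model Y = alpha / (1 + e^(beta*t)) + error by nonlinear least squares.
import numpy as np
from scipy.optimize import curve_fit

def growth(t, alpha, beta):
    """Logistic-type growth curve; nonlinear in alpha and beta."""
    return alpha / (1.0 + np.exp(beta * t))

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 40)
# simulated observations with assumed true parameters alpha=5.0, beta=-0.8
y = growth(t, 5.0, -0.8) + rng.normal(0, 0.05, t.size)

# nonlinear fitting needs a starting guess p0, unlike linear least squares
params, _ = curve_fit(growth, t, y, p0=[4.0, -1.0])
print(np.round(params, 2))
```

Note the extra machinery a nonlinear model demands (a starting guess, iterative optimization), which illustrates why parameter estimation is harder here than in the linear case.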
Objective of the regression analysis
In general, the goal of a regression analysis is to predict or explain differences in values of the
dependent variable with information about values of the explanatory variables. We are primarily
interested in the following issues:
 The form of the relationship between the dependent variable and the independent variables,
i.e., what the equation that represents the relationship looks like.
 The direction and strength of the relationships. As we shall learn, these are based on the
sign and size of the slope coefficients.

 Which explanatory variables are important and which are not. This is assessed by
comparing the sizes of the slope coefficients (by constructing confidence intervals and
testing hypotheses concerning the model parameters).
 Predicting a value or set of values of the dependent variable for a given set of values of
the explanatory variables.
Uses of regression

 Causal analysis (describing the relationship between variables)

Regression might be used to identify the strength of the effect that the independent variable(s)
have on a dependent variable. Typical questions are: What is the strength of the relationship
between dose and effect? Between sales and marketing spend? Between age and income?
 Forecasting an effect
Regression can be used to forecast the effects or impacts of changes; that is, regression analysis
helps us understand how much the dependent variable will change when we change one or more
independent variables. A typical question is: How much additional Y do I get for one additional
unit of X?
 Trend forecasting
Regression analysis predicts trends and future values, and can be used to obtain point estimates.
Typical questions are: What will the price of gold be 6 months from now? What is the total effort
for task X?
Exercise
1. In each of the following sets of variables, identify which variable can be regarded as the
response variable and which can be used as an independent variable, and state a possible lurking
variable.
a) Income and expenditure of households
b) Number of hours spent in studying and the score obtained
c) Height and weight of students
d) Distance covered and fuel consumed by car
e) Demand and supply
f) The time to run the race, and the temperature at the time of running.
g) The weight of a person, whether or not the person is a smoker, and whether or not the
person has a lung cancer.
h) The height and weight of a child, his/her parents’ height and weight, and the sex and age
of the child.
2. State the similarities and differences between correlation and regression analysis.
