FACULTY OF ECONOMICS AND BUSINESS
CAMPUS BRUSSELS
Statistical Modelling
Tests for relation between 2 variables
Context
                        Parametric              Non-parametric              Non-parametric
                                                (no normality or ordinal)   (nominal)
1 sample                t-test                  Sign test,                  Binomial test
                                                Wilcoxon signed-rank test
2 paired samples        t-test on differences   Sign test,                  /
                                                Wilcoxon signed-rank test
                                                (on differences)
2 independent samples   t-test                  Mann-Whitney-Wilcoxon test  Chi-square test
More than 2             ANOVA                   Kruskal-Wallis test         Chi-square test
independent samples
Relation between 2      Pearson correlation     Spearman correlation        Chi-square test
variables               (linear relation        (relation between           (relation between two
                        between two             ordinal variables)          qualitative variables)
                        quantitative variables)
Chi-square test of independence
Goal: Evaluate whether there is a statistical relation between two
qualitative variables.
o $H_0$: the two variables are independent
o $H_A$: the two variables are dependent
Method: The chi-square test statistic is based on the counts in the cross-table
of the two variables. It measures the distance between
o the observed counts $O_{ij}$
o the expected counts $E_{ij}$ if the two variables are statistically independent
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
with $R$ the number of rows in the cross-table and
$C$ the number of columns in the cross-table.
If $H_0$ is true, the chi-square statistic has a $\chi^2$ distribution with
$(R-1)(C-1)$ degrees of freedom.
Assumptions: (1) all $E_{ij} \geq 1$, (2) not more than 20% of the cells with $E_{ij} < 5$.
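As an illustration of the method, the statistic can be computed by hand from a cross-table. The sketch below uses only the Python standard library; the function names and the 2 x 3 table are hypothetical and are not the lecture data. It assumes the usual formula for the expected count of a cell under independence (row total times column total divided by the sample size).

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a cross-table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


def assumptions_ok(expected):
    """Rule of thumb: all expected counts >= 1 and at most 20% of cells below 5."""
    cells = [e for row in expected for e in row]
    share_small = sum(e < 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_small <= 0.20


# Hypothetical 2 x 3 cross-table (made-up counts)
table = [[30, 20, 10],
         [20, 30, 40]]
stat, df, expected = chi_square_independence(table)
print(round(stat, 2), df, assumptions_ok(expected))
```

The statistic is then compared with the chi-square distribution with `df` degrees of freedom, for instance through a table of critical values or statistical software.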
Chi-square test: approach
Example: Are the categorical variables education level and income
category related?
Chi-square test: approach
As we are at the boundary of violating the assumptions, we join the
categories college degree and post-undergraduate degree.
Pearson correlation test
Goal: Evaluate whether two quantitative variables have a linear relation.
We also aim to assess the direction and strength of the linear relation.
We distinguish
o The population correlation coefficient $\rho$
o The sample correlation coefficient $r$
A correlation coefficient takes values between -1 and 1, i.e. $-1 \leq \rho \leq 1$:
o $\rho = 0$ means the variables are not linearly related
o $\rho$ close to 0 means the variables have a weak linear relation
o $\rho = 1$ means the variables have a perfect positive linear relation
o $\rho = -1$ means the variables have a perfect negative linear relation
Pearson sample correlation
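Suppose we have a simple random sample (SRS) $(x_1, y_1), \ldots, (x_n, y_n)$ of the variables $X$ and $Y$.
The sample correlation between the quantitative variables $X$ and $Y$ is defined
as:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
A positive linear relation between $X$ and $Y$ (see
top panel) means that observations with an $x$-value
above average usually also have a $y$-value above
average.
A negative linear relation between $X$ and $Y$ (see
bottom panel) means that observations with an $x$-value
above average usually also have a $y$-value
below average.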
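The definition can be checked numerically. The following is a minimal sketch of the standard Pearson sample correlation, written in plain Python; the function name `pearson_r` and the two small data sets are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson sample correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]    # above-average x tends to go with above-average y
y_down = [5, 4, 4, 2, 1]  # above-average x tends to go with below-average y

print(round(pearson_r(x, y_up), 3))    # positive value
print(round(pearson_r(x, y_down), 3))  # negative value
```

Each product of deviations is positive when both values lie on the same side of their averages, which is why the sign of the sum gives the direction of the linear relation.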
Pearson sample correlation
$r$ measures the size and direction of the linear relation between two
variables.
Pearson sample correlation
$r$ measures the size and direction of the linear relation between $X$ and $Y$.
In this example $r$ is close to 0, but there is a strong non-linear (i.e., quadratic)
relation between $X$ and $Y$.
Pearson sample correlation
Outliers can have a very big effect on the sample correlation coefficient.
In this example a single outlier increases the sample correlation substantially.
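Both caveats, a non-linear relation and an outlier, can be reproduced with made-up numbers. The data below are hypothetical and are not the data behind the slides; `pearson_r` is a small helper implementing the standard sample correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson sample correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Caveat 1: perfect quadratic relation, symmetric around x = 0
x_quad = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [v ** 2 for v in x_quad]
r_quadratic = pearson_r(x_quad, y_quad)

# Caveat 2: ten points without a linear relation, then one extreme point added
x_base = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
y_base = [1, 2, 3, 2, 1, 3, 2, 1, 2, 3]
r_without_outlier = pearson_r(x_base, y_base)
r_with_outlier = pearson_r(x_base + [20], y_base + [20])

print(r_quadratic, r_without_outlier, round(r_with_outlier, 2))
```

In the first data set $y$ is fully determined by $x$, yet the correlation is zero because the relation is not linear. In the second, a single added point moves the correlation from zero to a value close to 1, so a scatterplot should always accompany the coefficient.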
Sample Pearson correlation in SPSS
We compute correlations between monthly wage, weekly working hours,
age for a sample of observations.
In SPSS: Analyze > Correlate > Bivariate
Correlation between age and monthly wage: $r = .302$
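Test of $H_0: \rho = 0$
$H_0$: there is no linear relation between $X$ and $Y$: $\rho = 0$.
$H_A$: there is a linear relation between $X$ and $Y$: $\rho \neq 0$.
If $H_0$ is true, and if $(X, Y)$ has a bivariate Normal distribution (or if
the sample size $n$ is large), then the test statistic is $t$-distributed with $n - 2$ degrees of
freedom:
$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$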
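For reference, the usual form of this test statistic is $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n - 2$ degrees of freedom. A minimal Python sketch follows; the function name and the values $r = 0.5$, $n = 27$ are hypothetical and chosen only to illustrate the arithmetic.

```python
from math import sqrt


def correlation_t_statistic(r, n):
    """Return (t, df) for testing H0: rho = 0 from a sample correlation r."""
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined for |r| = 1")
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return t, n - 2


# Hypothetical values: r = 0.5 observed in a sample of n = 27
t, df = correlation_t_statistic(r=0.5, n=27)
print(round(t, 3), df)
```

The resulting value is compared with the $t$-distribution with `df` degrees of freedom. For a fixed $r$, the statistic grows with $\sqrt{n-2}$, so even a modest correlation becomes significant in a large sample.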
Test of $H_0: \rho = 0$
If $X$ and $Y$ have a bivariate normal distribution, the scatterplot has the
shape of an ellipse.
[Figure: two scatterplots, one of a bivariate normal distribution and one of a distribution that is not bivariate normal]
Remark: if the sample size $n$ is large, the Pearson correlation test is valid, even if $X$ and $Y$ do
not have a bivariate Normal distribution.
Test of $H_0: \rho = 0$ in SPSS
Spearman correlation test
The non-parametric Spearman correlation test can be used
o to measure and test the relation between two ordinal qualitative
variables
o to measure and test the relation between two quantitative variables if
the assumptions of the Pearson correlation test are violated (i.e., the
sample is small and $X$ and $Y$ do not have a bivariate Normal distribution).
The Spearman correlation (available in SPSS) is equivalent to the Pearson
correlation computed on the ranks of the observations.
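The equivalence can be sketched in a few lines of Python: rank each variable, then apply the Pearson formula to the ranks. The helper names and the data are hypothetical; tied values are given the average of their ranks, which is the common convention and is assumed here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson sample correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y increases with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(round(pearson_r(x, y), 3), round(spearman_r(x, y), 3))
```

Because only the ordering of the observations matters, a monotone but non-linear relation gives a Spearman correlation of 1 while the Pearson correlation stays below 1.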
We do not further discuss the test in this course.
Overview testing the relation between variables
(Parametric // non-parametric) test
2 quantitative variables:
Pearson correlation // Spearman correlation
2 qualitative variables:
chi-square test
1 quantitative variable and 1 qualitative variable
o qualitative variable with 2 categories:
independent samples t-test // MWW-test
o qualitative variable with more than 2 categories:
ANOVA // Kruskal-Wallis test
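The overview can be read as a small decision rule. The sketch below encodes it as a Python function; the function name, the argument names and the string labels are hypothetical. The chi-square test is returned in the non-parametric slot, as in the context table at the start of these slides.

```python
def suggest_tests(kind_x, kind_y, n_categories=None):
    """Return (parametric, non_parametric) test names following the overview.

    kind_x, kind_y: "quantitative" or "qualitative".
    n_categories: number of categories of the qualitative variable
    (only needed when exactly one of the two variables is qualitative).
    """
    kinds = sorted([kind_x, kind_y])
    if kinds == ["quantitative", "quantitative"]:
        return "Pearson correlation", "Spearman correlation"
    if kinds == ["qualitative", "qualitative"]:
        return None, "chi-square test"
    if kinds == ["qualitative", "quantitative"]:
        if n_categories == 2:
            return "independent samples t-test", "MWW-test"
        if n_categories is not None and n_categories > 2:
            return "ANOVA", "Kruskal-Wallis test"
        raise ValueError("n_categories must be 2 or more")
    raise ValueError("kinds must be 'quantitative' or 'qualitative'")


# One quantitative variable and a qualitative variable with 3 categories
print(suggest_tests("qualitative", "quantitative", n_categories=3))
```

Whether the parametric or the non-parametric option applies still depends on the assumptions of each test, which the function does not check.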
Exercise 1
Suppose we have a sample of 4000 observations for the following
variables:
o Trust of a respondent in the government measured on a scale from 0 to 100.
o Country with categories 1=Belgium, 2=France, 3= the Netherlands
o Age measured in years
o Gender: nominal variable with categories 0=male, 1=female
Which test can you use to test whether there is a relation between
o Country and trust
o Gender and trust
o Country and gender
o Trust and age
Formulate the null and alternative hypotheses for each test. Discuss
whether/when the proposed test is valid in the present context.
Exercise 2
Consider the cross-table between two qualitative variables education level and
type of company for a sample of $n = 1476$ observations. The table contains
observed counts and expected counts if the variables are assumed to be
independent.
Compute the expected counts for the first row of the table, compute the
chi-square test statistic and test (using $\alpha = 0.05$) the null hypothesis that education
level and type of company are statistically independent. Formulate a conclusion
about the result of the test.
Exercise 3
We compute the Pearson sample correlation between household income
and years with current employer in a SRS of $n = 850$ employees.
Correlations
                                                  Household income   Years with current employer
Household income             Pearson correlation  1                  .625
                             N                    850                850
Years with current employer  Pearson correlation  .625               1
                             N                    850                850
Test whether the population correlation is positive (using $\alpha = 0.05$) and
draw a conclusion. Indicate whether the assumptions of the test are
satisfied.
Solution Exercise 1
Relation country and trust
$H_0: \mu_1 = \mu_2 = \mu_3$ (the population mean of trust is the same in the three countries)
$H_A$: the null hypothesis is wrong
To test $H_0$, we can use a one-way ANOVA with trust as dependent variable
and country as factor.
If the assumptions of the ANOVA are violated (residuals not normally
distributed, population variances not equal), the non-parametric
Kruskal-Wallis test could be used.
Relation gender and trust
$H_0: \mu_{male} = \mu_{female}$ versus $H_A: \mu_{male} \neq \mu_{female}$
To test $H_0$, we can use an independent samples t-test with trust as dependent
variable and gender as factor.
The sample is very large ($n = 4000$) and hence the t-statistic has an
approximate t-distribution if the population variances for males and females
are equal. If the null hypothesis of equal population variances for
males and females is rejected, a Welch correction to the t-statistic can be used.
Country and gender
$H_0$: country and gender are statistically independent
$H_A$: country and gender are statistically dependent
To test $H_0$ we can use a Pearson chi-square test on the cross-table country
x gender. The assumptions are (1) that all expected counts are at least 1 and
(2) that not more than 20% of the cells in the cross-table have an expected count smaller than 5.
Stated otherwise, the chi-square test tests the null hypothesis that the
proportion of males is the same in the three countries:
$H_0: p_1 = p_2 = p_3$ versus $H_A$: $H_0$ is wrong
Trust and age
To test $H_0: \rho = 0$, i.e., that there is no linear relation between age and trust in the
population, we can use a Pearson correlation test. As the sample size is
large ($n = 4000$), the test statistic has an approximate t-distribution and
hence the test is valid.
Solution Exercise 2
$H_0$: the variables company size and diploma are statistically independent
$H_A$: the variables company size and diploma are statistically dependent
Expected counts
o Small company and low education level: (794)(734)/1476=394.8
o Small company and average education level: (794)(505)/1476=271.7
o Small company and high education level: (794)(237)/1476=127.5
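The three expected counts can be reproduced with a few lines of Python. The sketch reads 794 as the row total for small companies and 734, 505 and 237 as the column totals of the three education levels (they add up to 1476); the variable names are made up for illustration.

```python
# Row total for small companies and column totals per education level,
# as read from the computations above
row_total_small = 794
column_totals = {"low": 734, "average": 505, "high": 237}
n = sum(column_totals.values())  # total sample size

# Expected count under independence: row total * column total / n
expected_small = {
    level: row_total_small * total / n
    for level, total in column_totals.items()
}

for level, value in expected_small.items():
    print(level, round(value, 1))
```

The expected counts in a row add up to the row total, which is a quick check on the arithmetic.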
Let $\alpha = 0.05$. The chi-square statistic exceeds the critical value of the $\chi^2$ distribution
and hence we reject $H_0$. We conclude with 95% confidence that
company size and diploma are statistically related.
The assumptions of the test are satisfied:
o All expected counts are larger than or equal to 1
o There are no cells with an expected count smaller than 5, hence the
proportion of cells with $E_{ij} < 5$ is smaller than 20%.
Solution Exercise 3
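We test $H_0: \rho = 0$ against $H_A: \rho > 0$ with $\alpha = 0.05$.
The test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2} = 0.625\sqrt{848}/\sqrt{1-0.625^2} \approx 23.3$
exceeds the one-sided critical value of the $t$-distribution with 848 degrees of freedom (about 1.65),
and hence we reject $H_0$. We conclude with 95% confidence that the
population correlation between household income and years with current
employer is positive.
The scatterplot shows that the assumption of a bivariate Normal
distribution for the two variables is doubtful. However, as the sample size
is large ($n = 850$), the test statistic will have an approximate t-distribution
and hence the test is valid.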
Remark: to reduce the influence of outliers it is recommended to apply a
natural log transformation to household income.