Statistics I
Chapter 2: Univariate data analysis
Contents
- Graphical displays for categorical data (barchart, piechart)
- Graphical displays for numerical data (histogram, polygon, boxplot)
- Numerical measures to describe:
  - central tendency (mean, median, mode)
  - variation (variance, standard deviation, quasi-variance and quasi-standard-deviation, range, IQR, coefficient of variation)
  - others (quartiles, percentiles)
Recommended reading
- Peña, D., Romo, J., Introducción a la Estadística para las Ciencias Sociales
  - Chapters 4, 5
- Newbold, P., Estadística para los Negocios y la Economía (2009)
  - Chapter 2
Graphical presentation of data
Once we have a frequency distribution of the data, the following
graphical displays can be obtained:
Categorical
piechart
barchart
Numerical
histogram
polygon
boxplot
Graphs for qualitative data: piechart
Example 1: The frequency table below corresponds to the data
representing blood types reported for a sample of 40 individuals.
Class   Absolute frequency   Relative frequency
A       12                   0.300
B       11                   0.275
AB       8                   0.200
O        9                   0.225
Total   40                   1
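As a minimal sketch of how the relative frequencies in this table arise, the snippet below divides each absolute frequency by the sample size. The variable names are illustrative and not part of the course material.

```python
# Absolute frequencies from Example 1 (blood types of 40 individuals).
absolute = {"A": 12, "B": 11, "AB": 8, "O": 9}

n = sum(absolute.values())  # sample size

# Relative frequency = absolute frequency / sample size.
relative = {blood_type: count / n for blood_type, count in absolute.items()}

for blood_type, f in relative.items():
    print(f"{blood_type:>2}: {f:.3f}")
```

The relative frequencies always add up to 1, which is a quick check on any frequency table.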
Piechart
Example 1 cont.:
- Each slice is a fraction of the total size of the pie
- Many software packages order slices alphabetically
- Although visually appealing, piecharts are harder to read than barcharts
- Avoid 3D piecharts: slices in the background appear smaller than slices in the foreground
[Figure: piechart of the blood types: A 30%, B 27.5%, O 22.5%, AB 20%]
Graphs for qualitative data: barchart
Example 2: The frequency table below corresponds to levels of
satisfaction for 901 employees.
Class   Absolute    Relative    Cumulative absolute   Cumulative relative
        frequency   frequency   frequency             frequency
VU       62         0.07         62                   0.07
U       108         0.12        170                   0.19
S       319         0.35        489                   0.54
VS      412         0.46        901                   1
Total   901         1
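The cumulative columns are running totals of the absolute and relative frequencies. A short sketch using the counts from Example 2 (variable names are illustrative):

```python
from itertools import accumulate

classes = ["VU", "U", "S", "VS"]
absolute = [62, 108, 319, 412]

n = sum(absolute)  # 901 employees

# Running totals of the absolute frequencies.
cumulative_absolute = list(accumulate(absolute))

relative = [count / n for count in absolute]
cumulative_relative = [total / n for total in cumulative_absolute]

for row in zip(classes, absolute, relative, cumulative_absolute, cumulative_relative):
    print("{:<3} {:>4} {:.2f} {:>4} {:.2f}".format(*row))
```

Cumulative frequencies only make sense when the classes have a natural order, as the satisfaction levels do here.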
Barchart
Example 2 cont.:
- Bars are of the same width and equally spaced, with the heights corresponding to the frequencies
- There are gaps between the bars
- Bars are labeled with class names
- Many software packages order bars alphabetically
[Figure: barchart of the satisfaction levels (VU to VS); vertical axis: FREQUENCY, 0 to 400]
Barchart
- Barcharts can also be constructed for discrete data if there are not too many values
- This is a barchart for Example 3 of Ch.1, where we looked at the number of leaves attacked by a pest for a sample of 50 plants
[Figure: barchart of the number of leaves attacked; vertical axis: FREQUENCY, 0 to 12]
Graphs for quantitative data: histogram and polygon
Example 4: The frequency distribution of the daily high temperature (in degrees Fahrenheit) reported on 20 winter days is as follows:
Class interval   Midpoint   n_i   f_i    N_i   F_i
[10, 20)         15          3    0.15    3    0.15
[20, 30)         25          6    0.30    9    0.45
[30, 40)         35          5    0.25   14    0.70
[40, 50)         45          4    0.20   18    0.90
[50, 60)         55          2    0.10   20    1
Total                       20    1
Histogram and polygon
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies
[Figure: histogram with the frequency polygon superimposed; horizontal axis: TEMP (F), vertical axis: FREQUENCIES]
Histogram with area of 1 (on a density scale)
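- Bin widths = widths of class intervals (not necessarily identical)
- Bin heights = $\dfrac{f_i}{l_i - l_{i-1}}$, where $l_{i-1}$ and $l_i$ are the boundaries of the $i$th class interval
- Bin areas = $f_i$
- TOTAL AREA = 1
[Figure: density-scale histogram; horizontal axis: TEMP (F), vertical axis: 0.000 to 0.030]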
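To make the density scale concrete, here is a small sketch that computes the bin heights for Example 4 as relative frequency divided by class width, and checks that the bin areas add up to 1. The variable names are illustrative.

```python
# Class boundaries and relative frequencies from Example 4.
boundaries = [10, 20, 30, 40, 50, 60]
relative = [0.15, 0.30, 0.25, 0.20, 0.10]

# Width of each class interval: upper boundary minus lower boundary.
widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]

# Density-scale height: relative frequency divided by class width.
heights = [f / w for f, w in zip(relative, widths)]

# Each bin area is height * width, which recovers the relative frequency.
total_area = sum(h * w for h, w in zip(heights, widths))

print(heights)
print(total_area)
```

With unequal class widths the same formula still gives a total area of 1, which is why the density scale is preferred in that case.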
Describing data numerically
- Center: mean, median, mode
- Variation: range, interquartile range, variance, standard deviation, coeff. of variation
- Others: quartiles, percentiles
New notation:
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n$$
($\sum$: sum, $i = 1$: the lower limit, $n$: the upper limit, $x_i$: example of a formula depending on $i$)
Example:
$$\sum_{i=-1}^{3} i^2 = (-1)^2 + 0^2 + 1^2 + 2^2 + 3^2 = 15$$
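The summation example can be checked in one line. In this sketch, `range(-1, 4)` produces the index values -1, 0, 1, 2, 3, since the upper bound of `range` is excluded.

```python
# Sum of i^2 for i = -1, 0, 1, 2, 3.
total = sum(i**2 for i in range(-1, 4))
print(total)
```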
Central tendency: (arithmetic) mean
- The most common measure of central tendency
- Population mean:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \ldots + x_N}{N}$$
- Sample mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \ldots + x_n}{n}$$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $\bar{y} = a + b\bar{x}$
- Affected by extreme values (outliers)
Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200
$$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3, \qquad \bar{y} = \frac{3 + 1 + 5 + 4 + 200}{5} = 42.6$$
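The following sketch reproduces the two sample means with Python's standard `statistics` module and illustrates the linear-transformation property. The constants `a = 10` and `b = 2` are hypothetical values chosen only for illustration.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 5, 4, 200]  # same data, with one extreme value

mean_x = mean(x)
mean_y = mean(y)
print(mean_x, mean_y)  # a single outlier pulls the mean far from the bulk of the data

# Linear transformation y = a + b*x (a and b are hypothetical constants).
a, b = 10, 2
transformed = [a + b * xi for xi in x]
mean_transformed = mean(transformed)
print(mean_transformed, a + b * mean_x)  # the two values agree
```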
Central tendency: median
- In the ordered list, the median $M$ is the middle number:
$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd (the middle number)} \\ \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even (the average of the two middle numbers)} \end{cases}$$
($x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ means that the observations are ranked in increasing order, e.g. $x_{(1)} = x_{\min}$, $x_{(n)} = x_{\max}$)
- Not affected by outliers
Example: Given observations 3, 1, 5, 4, 2 (n = 5), first rank the data: 1, 2, 3, 4, 5; then identify the middle number:
$$M = x_{((5+1)/2)} = x_{(3)} = 3 \quad \text{(the 3rd smallest)}$$
Example: Given observations 3, 1, 5, 4, 2, 0 (n = 6), first rank the data: 0, 1, 2, 3, 4, 5; then identify the middle numbers:
$$M = \frac{x_{(6/2)} + x_{(6/2+1)}}{2} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{2 + 3}{2} = 2.5 \quad \text{(the average of the 3rd and 4th smallest)}$$
Central tendency: mode
The value that occurs most often
Not affected by outliers
Used for either numerical or categorical data
There may be no mode, or there may be several modes
Example: Given observations 3, 1, 5, 4, 2, there is no mode
Example: Given observations 3, 1, 5, 4, 2, 1, the mode is 1
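A possible sketch for finding the mode(s). It assumes, as in the first example above, that there is no mode when every value occurs exactly once. The function name `modes` is hypothetical.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or [] if no value repeats (assumed convention)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([3, 1, 5, 4, 2]))
print(modes([3, 1, 5, 4, 2, 1]))
print(modes(["A", "B", "A", "B", "O"]))  # categorical data, two modes
```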
Shape: comparing mean and median
Three types of distributions:
- Skewed to the left: Mean < Median
- Symmetric: Mean = Median
- Skewed to the right: Median < Mean
[Figure: three density curves: LEFT-SKEWED ($\bar{x} < M$), SYMMETRIC ($\bar{x} = M$), RIGHT-SKEWED ($M < \bar{x}$)]
Note: The distribution in the middle is known as bell-shaped or normal
Variation: range and interquartile range (IQR)
- Range is the simplest measure of variation:
$$R = x_{\max} - x_{\min}$$
- Ignores the way the data is distributed
- Sensitive to outliers
Example: Given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
Example: Given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$
- Interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50% of the data
$$\text{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$
Variation: Interquartile range and boxplot
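- Outliers are observations that fall
  - below the value of $Q_1 - 1.5 \cdot \text{IQR}$
  - above the value of $Q_3 + 1.5 \cdot \text{IQR}$
- For extreme outliers, replace 1.5 by 3 in the above definition
[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) = 31, $Q_3 = 42$, $x_{\max} = 58$ and IQR = 18; each of the four segments contains 25% of the data]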
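As an illustration of the outlier rule, the sketch below computes the fences for the quartiles shown in the boxplot ($Q_1 = 24$, $Q_3 = 42$). The function name `outlier_fences` and its parameter `k` are our own.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; use k=3 for extreme outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = outlier_fences(24, 42)
extreme_lower, extreme_upper = outlier_fences(24, 42, k=3)
print(lower, upper)
print(extreme_lower, extreme_upper)

# The smallest and largest observations in the boxplot are 12 and 58.
has_outliers = 12 < lower or 58 > upper
print(has_outliers)
```

Since both 12 and 58 lie inside the fences, this data set has no outliers under the 1.5 IQR rule.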
Quartiles and percentiles
Quartiles split the ranked data into four segments with an equal number
of values per segment
- The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$
- The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$
- The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$
Example: Given observations 22, 18, 17, 16, 16, 13, 12, 21, 11 (n = 9), first rank the data: 11, 12, 13, 16, 16, 17, 18, 21, 22; then identify the positions (a half-integer position is rounded up here):
$$Q_1 = x_{(2.5)} = x_{(3)} = 13, \qquad Q_2 = x_{(5)} = 16, \qquad Q_3 = x_{(7.5)} = x_{(8)} = 21$$
$k$th percentile, $k = 1, 2, \ldots, 99$: $P_k = x_{(k(n+1)/100)}$
Example cont.: 60th percentile $= x_{(60(9+1)/100)} = x_{(6)} = 17$
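A sketch of the position rule $k(n+1)/100$. It assumes that a non-integer position is rounded to the nearest integer with halves rounded up, as in $x_{(7.5)} = x_{(8)}$ above; other textbooks and software interpolate instead, so results may differ. The helper names are hypothetical.

```python
import math


def value_at_position(ranked, position):
    """Return x_(position) for a 1-based position, rounding halves up (assumed convention)."""
    index = math.floor(position + 0.5)
    index = min(max(index, 1), len(ranked))  # keep the position inside the data (assumption)
    return ranked[index - 1]


def percentile(values, k):
    """kth percentile using the position k(n + 1)/100 in the ranked data."""
    ranked = sorted(values)
    position = k * (len(ranked) + 1) / 100
    return value_at_position(ranked, position)


data = [22, 18, 17, 16, 16, 13, 12, 21, 11]
q1 = percentile(data, 25)   # position 2.5 -> x_(3)
q2 = percentile(data, 50)   # position 5
q3 = percentile(data, 75)   # position 7.5 -> x_(8)
p60 = percentile(data, 60)  # position 6
print(q1, q2, q3, p60)
```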
Measure of variation: variance
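- Average of squared deviations of values from the mean
- Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
- Sample variance (divided by $n$); the second form is faster to calculate:
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$
- Sample quasi-variance (corrected sample variance, divided by $n - 1$):
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
- They are related via
$$\hat{\sigma}^2 = \frac{n - 1}{n} s^2$$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$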
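The "faster to calculate" form of the numerator follows from expanding the square and using $\sum_{i=1}^{n} x_i = n\bar{x}$. A short derivation, written out as a sketch:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x} \sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\,(n\bar{x}) + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
\end{align*}
```

Dividing this numerator by $n$ or by $n - 1$ gives the sample variance or the sample quasi-variance, respectively.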
Measure of variation: standard deviation (SD)
- The most commonly used measure of spread
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are respectively
$$\sigma = \sqrt{\sigma^2}, \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2}, \qquad s = \sqrt{s^2}$$
- Shows variation about the mean
- Has the same units as the original data, whilst variance is in units$^2$
- Variance and SD are both affected by outliers
Calculating variance and standard deviation
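Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
$$\bar{x} = \frac{124}{8} = 15.5, \qquad \bar{y} = \frac{124}{8} = 15.5, \qquad \bar{z} = \frac{124}{8} = 15.5$$
$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \ldots + 21^2 = 2000$$
$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \ldots + 17^2 = 1928$$
$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \ldots + 20^2 = 2068$$
$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429, \qquad s_x = 3.3381$$
$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571, \qquad s_y = 0.9258$$
$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571, \qquad s_z = 4.5670$$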
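These hand calculations can be cross-checked with a few lines of Python. The sketch below implements the shortcut formula directly and compares it with `statistics.variance`, which also divides by $n - 1$. The helper name `quasi_variance_shortcut` is our own.

```python
import math
from statistics import variance  # sample variance with divisor n - 1


def quasi_variance_shortcut(values):
    """Quasi-variance via (sum of squares - n * mean^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(v**2 for v in values) - n * mean**2) / (n - 1)


samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in samples.items():
    s2 = quasi_variance_shortcut(values)
    # Both routes to the quasi-variance should agree up to rounding error.
    assert math.isclose(s2, variance(values))
    print(name, round(s2, 4), round(math.sqrt(s2), 4))
```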
Comparing standard deviations
Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20
[Figure: dot plots of X, Y and Z on a common axis from 11 to 21, labeled $\bar{x} = 15.5$, $s_x = 3.3$; $\bar{y} = 15.5$, $s_y = 0.9$; $\bar{z} = 15.5$, $s_z = 4.6$]
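Numerical summaries and frequency tables. Standardization.
- If the data is discrete, with distinct values $x_1, \ldots, x_k$ and absolute frequencies $n_1, \ldots, n_k$, then
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \qquad \text{and} \qquad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$
- If the data is continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals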
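As a sketch of the grouped-data formulas, the snippet below applies them to the temperature data of Example 4, using the class midpoints in place of $x_i$. The result is only an approximation of the mean and quasi-variance of the raw data, which are not given.

```python
# Class midpoints and absolute frequencies from Example 4.
midpoints = [15, 25, 35, 45, 55]  # 55 is the midpoint of [50, 60)
counts = [3, 6, 5, 4, 2]

n = sum(counts)  # 20 days

# Mean: sum of x_i * n_i divided by n.
mean = sum(x * c for x, c in zip(midpoints, counts)) / n

# Quasi-variance: (sum of x_i^2 * n_i - n * mean^2) / (n - 1).
s2 = (sum(x**2 * c for x, c in zip(midpoints, counts)) - n * mean**2) / (n - 1)

print(mean, round(s2, 4))
```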
- To standardize variable $x$ means to calculate
$$\frac{x - \bar{x}}{s}$$
- If you apply this formula to all observations $x_1, \ldots, x_n$ and call the transformed ones $z_1, \ldots, z_n$, then the $z$'s have mean zero and standard deviation one
- Standardization = finding the z-score
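A short sketch that standardizes a sample and checks the two properties numerically. It uses `statistics.stdev`, which divides by $n - 1$ and so corresponds to the quasi-standard deviation $s$; the data are sample X from the variance example.

```python
from statistics import mean, stdev


def standardize(values):
    """Return the z-scores (x - mean) / s, with s the quasi-standard deviation."""
    m = mean(values)
    s = stdev(values)  # divisor n - 1
    return [(v - m) / s for v in values]


x = [11, 12, 13, 16, 16, 17, 18, 21]
z = standardize(x)

print([round(zi, 3) for zi in z])
print(mean(z), stdev(z))  # approximately 0 and 1, up to floating-point error
```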
Empirical rule
If the data is bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:
- 68% of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95% of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7% of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$
Note: This rule is also known as the 68-95-99.7 rule
Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data is bell-shaped, give the limits of an interval that captures 95% of the observations.
95% of the $x_i$'s are in: $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
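The three intervals of the empirical rule for this example can be computed as follows. This is a minimal sketch and the function name `empirical_interval` is hypothetical.

```python
def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s) used by the empirical rule."""
    return mean - k * s, mean + k * s


# Approximate share of bell-shaped data within k standard deviations of the mean.
coverage = {1: "68%", 2: "95%", 3: "99.7%"}

for k, share in coverage.items():
    low, high = empirical_interval(40, 5, k)
    print(f"about {share} of the data in ({low}, {high})")
```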
Measure of variation: coefficient of variation (CV)
- Measures relative variation and is defined as
$$CV = \frac{s}{|\bar{x}|}$$
- Is a unitless number (sometimes given in %)
- Shows variation relative to the mean
Example: Stock A: Average price last year = 50, Standard deviation = 5; Stock B: Average price last year = 100, Standard deviation = 5
$$CV_A = \frac{5}{50} = 0.10, \qquad CV_B = \frac{5}{100} = 0.05$$
Both stocks have the same SD, but stock B is less variable relative to its mean price
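The stock comparison can be reproduced with a small helper. This is a sketch, and the function name `coefficient_of_variation` is our own; it rejects a zero mean, for which the CV is undefined.

```python
def coefficient_of_variation(mean, s):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a mean of zero")
    return s / abs(mean)


cv_a = coefficient_of_variation(50, 5)   # stock A
cv_b = coefficient_of_variation(100, 5)  # stock B
print(f"CV_A = {cv_a:.2f}, CV_B = {cv_b:.2f}")
```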