STATISTICS
IN ECONOMICS AND BUSINESS
Nguyen Huyen Trang
Faculty of Statistics - National Economics University
[email protected]
LECTURE 6: DATA MEASUREMENT
Summary
Measures
• Central Tendency: Mean, Median, Mode, Quartile
• Measures of Dispersion: Variance, Standard Deviation, Coefficient of Variation
OUTLINE
• Central Tendency
• Percentiles - Quartile
• Measures of Dispersion
CENTRAL TENDENCY
A summary measure that attempts to describe a whole
set of data with a single value that represents the middle
or center of its distribution.
• Mean
• Median
• Mode
MEAN
• The most common measure of central tendency
• Applies to quantitative data only
• Has the same unit as the original data
• Notation: μ for the population mean, x̄ for the sample mean
• Formula:
➢ Arithmetic mean
➢ Geometric mean
ARITHMETIC MEAN
• Example: Student A’s grade in some courses
Course                  Grade Points
Algebra                 3.63
Introduction to Logic   4.20
Microeconomics          3.46
Statistics              4.00
GPA = ?
Population mean: $\mu = \frac{\sum x_i}{N}$
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
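As a minimal sketch, the GPA asked for in this example can be computed as an unweighted arithmetic mean of the four grade points. The variable names are illustrative only.

```python
from statistics import mean

# Grade points from the example table (one value per course).
grades = [3.63, 4.20, 3.46, 4.00]

# Arithmetic mean: sum of the values divided by the number of values.
gpa_manual = sum(grades) / len(grades)

# The standard library gives the same result.
gpa_library = mean(grades)

print(round(gpa_manual, 4), round(gpa_library, 4))
```

This treats every course as equally important, which is the assumption the weighted mean relaxes.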
WEIGHTED ARITHMETIC MEAN
• Example: Does the result change if we also know the number of credits for each course?
Course                  Number of Credits (weight wi)   Grade Points (value xi)
Algebra                 3                               3.63
Introduction to Logic   2                               4.20
Microeconomics          3                               3.46
Statistics              3                               4.00
Each data value is given a weight that reflects its importance
WEIGHTED ARITHMETIC MEAN
Course                  Number of Credits   Grade Points   Grade Points × Credits
Algebra                 3                   3.63           10.89
Introduction to Logic   2                   4.20           8.40
Microeconomics          3                   3.46           10.38
Statistics              3                   4.00           12.00
Total                   11                  –              41.67
In general, for weighted data:
$\bar{x} = \frac{\sum x_i w_i}{\sum w_i}$
where:
x_i = value of observation i
w_i = weight for observation i
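The weighted mean formula can be checked against the credit table with a short sketch. The list names are illustrative, and the credits play the role of the weights.

```python
# Credits (weights w_i) and grade points (values x_i) from the table.
credits = [3, 2, 3, 3]
grades = [3.63, 4.20, 3.46, 4.00]

# Numerator: sum of x_i * w_i. Denominator: sum of w_i.
weighted_sum = sum(w * x for w, x in zip(credits, grades))
total_weight = sum(credits)

weighted_gpa = weighted_sum / total_weight
print(round(weighted_sum, 2), total_weight, round(weighted_gpa, 2))
```

The result, about 3.79, is slightly lower than the unweighted mean of the same grades because the highest grade belongs to the course with the fewest credits.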
GROUPED DATA
• The weighted mean computation can be used to obtain
approximations of the mean, variance, and standard deviation for the
grouped data.
• To compute the weighted mean, we treat the midpoint of each class
as though it were the mean of all items in the class.
• We compute a weighted mean of the class midpoints using the class
frequencies as weights.
• Similarly, in computing the variance and standard deviation, the class
frequencies are used as weights.
$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$
where:
x_i = midpoint of class i
f_i = frequency of class i
MEAN FOR GROUPED DATA
Example: SCCoast, an Internet provider in the Southeast, developed
the following frequency distribution on the age of Internet users.
Age           Number of users (frequency fi)   Midpoint xi   xi·fi
10 up to 20   3                                15            45
20 up to 30   7                                25            175
30 up to 40   18                               35            630
40 up to 50   20                               45            900
50 up to 60   12                               55            660
Total         60                                             2410
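A short sketch of the grouped-data mean for the SCCoast table follows. It assumes, as the method requires, that each class is represented by its midpoint, so the result is an approximation of the true mean age.

```python
# (lower bound, upper bound, frequency) for each age class in the table.
classes = [(10, 20, 3), (20, 30, 7), (30, 40, 18), (40, 50, 20), (50, 60, 12)]

# Class midpoints stand in for every observation in the class.
midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [freq for _, _, freq in classes]

n = sum(frequencies)
weighted_total = sum(x * f for x, f in zip(midpoints, frequencies))
grouped_mean = weighted_total / n

print(n, weighted_total, round(grouped_mean, 2))
```

The approximate mean age is 2410 / 60, roughly 40.17 years.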
THE MEAN
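• Compare the means of the following data sets:
– Data 1: {10, 10, 11, 12, 12}
– Data 2: {2, 3, 4, 6, 40}
• Both means equal 11, but in Data 2 the outlier 40 pulls the mean above four of the five values
• The mean is easily affected by extreme values (outliers), which can lead to a biased comparison
• In such cases, use another measure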
MEDIAN
• The median of a data set is the value in the middle
when the data items are arranged in ascending
order.
• For an odd number of observations, the median is
the middle value.
• For an even number of observations, the median is
the average of the two middle values.
MEDIAN
• The median is the cutoff point between the lower 50% and the upper 50% of the ordered data
• Denoted as Me
[Figure: ordered data split at the median into a lower 50% and an upper 50%]
MEDIAN
Example:
• Data: { 5, 6, 9, 5, 6}
Ordered data: { 5, 5, 6, 6, 9 }: Median = 6
• Ordered data: {6, 6, 7, 8, 9, 11}: Median = $\frac{7 + 8}{2} = 7.5$
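The odd and even cases can be sketched as a small function. The function name `median_of` is hypothetical and chosen only for illustration.

```python
def median_of(values):
    """Return the median: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)          # the median is defined on ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd number of observations
    return (ordered[mid - 1] + ordered[mid]) / 2  # even number of observations


odd_example = median_of([5, 6, 9, 5, 6])
even_example = median_of([6, 6, 7, 8, 9, 11])
print(odd_example, even_example)
```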
MEDIAN
• Compare the mean and median of the following data sets:
Data 1: {10, 10, 11, 12, 12}
Data 2: {2, 3, 4, 6, 40}
• The median is not affected by outliers
• It depends on the position of the values in the ordered data
• Applies to quantitative variables only
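The comparison can be sketched with the standard library. The third data set, in which the outlier 40 is replaced by 7, is a hypothetical variation added only to show how each measure reacts.

```python
from statistics import mean, median

data_1 = [10, 10, 11, 12, 12]
data_2 = [2, 3, 4, 6, 40]

# Hypothetical variation: the outlier 40 replaced by 7.
data_2_no_outlier = [2, 3, 4, 6, 7]

for name, data in (("Data 1", data_1), ("Data 2", data_2),
                   ("Data 2 without outlier", data_2_no_outlier)):
    print(name, "mean =", mean(data), "median =", median(data))
```

Both original data sets have mean 11, while their medians are 11 and 4. Removing the outlier moves the mean of Data 2 to 4.4 and leaves its median at 4.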
MODE
• Can be applied to both quantitative and qualitative variables
• The mode of a data set is the value that occurs with greatest
frequency
• Denoted as Mo
• Find the Mode:
➢ Qualitative Data
➢ Quantitative Data
MODE
• Qualitative Data
➢ Data: { Yellow, Yellow, Red, Blue, Green}
→ Mode is the category having the largest frequency
• Quantitative Data
➢ Data 1: { 5, 6, 6, 7, 7, 7, 9 }
➢ Data 2: { 5, 6, 7, 8, 9 }
➢ Data 3: { 5, 6, 9, 5, 6 }
➢ Data 4: { 5, 5, 5, 5, 5 }
→ Mode is the value having the largest frequency
There may be no mode or several modes
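A minimal sketch of finding the mode for the data sets above. The helper `modes` is hypothetical, and it assumes one common convention: when no value repeats, the data set is reported as having no mode.

```python
from collections import Counter


def modes(data):
    """Return all values tied for the highest frequency.

    Assumed convention: if several distinct values each occur once,
    the data set has no mode and an empty list is returned.
    """
    counts = Counter(data)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []
    return [value for value, count in counts.items() if count == top]


print(modes(["Yellow", "Yellow", "Red", "Blue", "Green"]))  # qualitative data
print(modes([5, 6, 6, 7, 7, 7, 9]))  # Data 1: one mode
print(modes([5, 6, 7, 8, 9]))        # Data 2: no mode
print(modes([5, 6, 9, 5, 6]))        # Data 3: two modes
print(modes([5, 5, 5, 5, 5]))        # Data 4: one mode
```

Some software instead reports every value as a mode when all frequencies are equal, so the treatment of Data 2 depends on the convention chosen.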
MEAN, MODE, MEDIAN
• Negatively skewed (left skewed): Mean < Median < Mode
• Symmetric: Mean = Median = Mode
• Positively skewed (right skewed): Mode < Median < Mean
PERCENTILES
❑ A percentile provides information about how the data are spread
over the interval from the smallest value to the largest value.
❑ The pth percentile is a value that divides the data into two parts:
• At least p% of the observations are less than or equal to the pth percentile
• At least (100 – p)% of the observations are greater than or equal to the pth percentile
PERCENTILES
Suppose your height is 1.85 m and 80% of people are shorter than you.
→ You are at the 80th percentile
→ Approximately 80% of people are shorter than 1.85 m and 20% are taller than 1.85 m
PERCENTILES
A total of 10,000 people visited the shopping mall over 12 hours:
Time (hours)   Cumulative Freq.
0              0
2              350
4              1100
6              2400
8              6500
10             8850
12             10000
[Figure: cumulative frequency curve of people against time in hours]
• Estimate the 30th percentile
• Estimate what percentile of visitors had arrived after 11 hours
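One way to answer both questions is to interpolate linearly between the tabulated points. This is an assumption made for the sketch: reading the values off the cumulative frequency curve may give slightly different estimates. The helper `interpolate` is hypothetical and expects a value inside the tabulated range.

```python
from bisect import bisect_left

hours = [0, 2, 4, 6, 8, 10, 12]
cumulative = [0, 350, 1100, 2400, 6500, 8850, 10000]
total = cumulative[-1]


def interpolate(x, xs, ys):
    """Linearly interpolate y at x, for increasing xs with xs[0] <= x <= xs[-1]."""
    i = bisect_left(xs, x)
    if xs[i] == x:
        return ys[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# 30th percentile: the time by which 30% of the 10,000 visitors had arrived.
time_p30 = interpolate(0.30 * total, cumulative, hours)

# Percentile reached after 11 hours: share of visitors who had arrived by then.
percentile_11h = 100 * interpolate(11, hours, cumulative) / total

print(round(time_p30, 2), round(percentile_11h, 2))
```

Under this assumption the 30th percentile is about 6.29 hours, and about 94.25% of visitors had arrived after 11 hours.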
QUARTILES
Quartiles are specific percentiles that divide the data into 4 equal parts using 3 cut-off points
• First Quartile Q1 = 25th Percentile
• Second Quartile Q2 = 50th Percentile = Median
• Third Quartile Q3 = 75th Percentile
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%
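A brief sketch of computing quartiles with Python's standard library, using the ordered data {6, 6, 7, 8, 9, 11} from the median example. Quartile conventions differ between textbooks and software, so Q1 and Q3 from `statistics.quantiles` (default `method="exclusive"`) may not match a hand calculation that uses another rule.

```python
from statistics import median, quantiles

data = [6, 6, 7, 8, 9, 11]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(data, n=4)

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("Q2 equals the median:", q2 == median(data))
```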
MEASURES OF VARIABILITY
            Firm A   Firm B
Worker 1    400      1480
Worker 2    400      1485
Worker 3    600      1486
Worker 4    600      1488
Worker 5    700      1490
Worker 6    800      1503
Worker 7    900      1505
Worker 8    2000     1520
Worker 9    2600     1521
Worker 10   6000     1522
Mean A = Mean B = 1500
Which firm's salaries fluctuate more, and which are more stable?
Central tendency alone may not provide sufficient information about the data.
Data sets may have the same mean or median but differ in variability (dispersion, spread).
MEASURES OF VARIABILITY
Measures of variability describe the spread of the data and help us compare the spread of two or more distributions
▪ Range
▪ Variance
▪ Standard Deviation
▪ Coefficient of Variation
RANGE
• The difference between the largest and the smallest value in a
data set.
$R = x_{max} - x_{min}$
• Example (Firm A and Firm B salaries from the table above):
Range (A) = 6000 – 400 = 5600
Range (B) = 1522 – 1480 = 42
• Pros: simple
• Cons: affected by outliers
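The two ranges can be verified with a short sketch using the Firm A and Firm B salaries. The helper name `data_range` is hypothetical.

```python
firm_a = [400, 400, 600, 600, 700, 800, 900, 2000, 2600, 6000]
firm_b = [1480, 1485, 1486, 1488, 1490, 1503, 1505, 1520, 1521, 1522]


def data_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


range_a = data_range(firm_a)
range_b = data_range(firm_b)
print(range_a, range_b)
```

Because the range uses only the two extreme values, the single salary of 6000 in Firm A determines most of its range.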
INTERQUARTILE RANGE
• The interquartile range (IQR) is the difference between the third quartile and the first quartile: IQR = Q3 – Q1
• The IQR is the width of the middle 50% of the data
• It overcomes the range's sensitivity to extreme data values
VARIANCE
• Overcome the weakness of the range by using all the
values
➢ Data: x1, x2, …, xn → compute the mean
➢ Deviation of each observation (xi) from the mean (x̄ for a sample, μ for a population): xi – x̄
VARIANCE
• Formula:
➢ Population Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
➢ Sample Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• If $s_x^2 > s_y^2$ then:
• x is more dispersed (more widely spread, more variable) than y
• y is more stable (more concentrated) than x
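As an illustration, both variance formulas can be applied to the Firm A and Firm B salaries from the measures-of-variability slide. Whether those ten workers are a whole population or a sample is not stated, so the sketch computes both versions.

```python
from statistics import pvariance, variance

firm_a = [400, 400, 600, 600, 700, 800, 900, 2000, 2600, 6000]
firm_b = [1480, 1485, 1486, 1488, 1490, 1503, 1505, 1520, 1521, 1522]

# pvariance divides by N (population); variance divides by n - 1 (sample).
pop_var_a, pop_var_b = pvariance(firm_a), pvariance(firm_b)
sample_var_a, sample_var_b = variance(firm_a), variance(firm_b)

print("Population variance:", pop_var_a, pop_var_b)
print("Sample variance:", round(sample_var_a, 2), round(sample_var_b, 2))
print("Firm A more dispersed:", sample_var_a > sample_var_b)
```

Both firms have a mean salary of 1500, yet Firm A's variance is several orders of magnitude larger, so Firm B's salaries are the more stable.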
VARIANCE FOR GROUPED DATA
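• Formula:
➢ Population Variance: $\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$
➢ Sample Variance: $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$
where M_i = midpoint of class i and f_i = frequency of class i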
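The grouped sample variance can be sketched with the SCCoast age distribution from the grouped-mean example. Treating the 60 users as a sample is an assumption of this sketch, and class midpoints again stand in for the individual ages.

```python
from math import sqrt

# Class midpoints M_i and frequencies f_i from the SCCoast age table.
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 7, 18, 20, 12]

n = sum(frequencies)
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n

# Sum of frequency-weighted squared deviations from the grouped mean.
squared_dev = sum(f * (m - grouped_mean) ** 2
                  for m, f in zip(midpoints, frequencies))

sample_variance = squared_dev / (n - 1)
sample_std = sqrt(sample_variance)

print(round(grouped_mean, 2), round(sample_variance, 2), round(sample_std, 2))
```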
STANDARD DEVIATION
• Is the square root of the variance
• It is measured in the same units as the data, so it is easier than the variance to compare with the mean
• Formula:
➢ Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
➢ Sample Standard Deviation: $s = \sqrt{s^2}$
COEFFICIENT OF VARIATION
• Indicates how large the standard deviation is in relation to the mean
• This is the ratio of the standard deviation to the mean
$CV = \frac{SD}{\text{mean}} \times 100$
Business Decision Making – Nguyen Minh Thu – [email protected]
COEFFICIENT OF VARIATION
An investor is considering the relative risks associated with two
projects:
• The first project has a mean expected profit of £5000 with a
standard deviation of £707.11
• The second project has a mean expected profit of £500 with a
standard deviation of £112.13
Use a measure of dispersion to determine which project has the lower degree of risk.
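Since the two projects have very different mean profits, the coefficient of variation is a natural measure to compare them. A minimal sketch, with a hypothetical helper name:

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return sd / mean * 100


cv_project_1 = coefficient_of_variation(707.11, 5000)
cv_project_2 = coefficient_of_variation(112.13, 500)

print(round(cv_project_1, 2), round(cv_project_2, 2))
```

The first project has the larger standard deviation in absolute terms, but its CV (about 14.14%) is smaller than the second project's (about 22.43%), so it carries less risk relative to its expected profit.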
EXPLORATORY DATA ANALYSIS
• Five-Number Summary
• Box Plot
• Detecting Outlier
FIVE-NUMBER SUMMARY
• Smallest Value
• First Quartile
• Median
• Third Quartile
• Largest Value
=> used to draw the box plot
BOX PLOT
• A box is drawn with its ends located at the first and third
quartiles.
• A vertical line is drawn in the box at the location of the
median.
• Limits are located (not drawn) using the interquartile range
(IQR).
✓ The lower limit is located 1.5(IQR) below Q1.
✓ The upper limit is located 1.5(IQR) above Q3.
✓ Data outside these limits are considered outliers.
(Value < Q1 – 1.5 IQR or Value > Q3 + 1.5 IQR)
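The limits and the outlier rule can be sketched on the Firm A salaries from the measures-of-variability slide. Quartiles here come from `statistics.quantiles` with its default method, which is an assumption: other quartile conventions shift Q1, Q3 and the limits somewhat.

```python
from statistics import quantiles

firm_a = [400, 400, 600, 600, 700, 800, 900, 2000, 2600, 6000]

q1, q2, q3 = quantiles(firm_a, n=4)
iqr = q3 - q1

# Limits are located 1.5 * IQR below Q1 and above Q3.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Values outside the limits are flagged as outliers.
outliers = [x for x in firm_a if x < lower_limit or x > upper_limit]

five_number_summary = (min(firm_a), q1, q2, q3, max(firm_a))
print(five_number_summary)
print(lower_limit, upper_limit, outliers)
```

With these data, the salary of 6000 lies above the upper limit and is flagged as an outlier.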
BOX AND WHISKER PLOT
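▪ Boxplot 1
[Figure: box from Q1 to Q3 with a line at Q2; whiskers extend to min and max]
▪ Boxplot 2
[Figure: box with IQR = Q3 – Q1; limits at Q1 – 1.5IQR and Q3 + 1.5IQR; a point beyond the limits is marked as an outlier]
• Lower limit: the maximum of (min, Q1 – 1.5*IQR)
• Upper limit: the minimum of (max, Q3 + 1.5*IQR)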
BOX AND WHISKER PLOT
       A     B     C      D      E     F
Max    6     6     7      9      6     4
Q3     5     4     6      6      4     3
Q2     4.5   2.5   5.5    4.5    2.5   2.5
Q1     3     2     4      4      1     2
Min    1     1     1      3      -1    1
       4.2   2.8   5.16   4.84   2.5   2.5