Week 1: Introduction
26134 Responsible Evidence-Based Decisions
UTS CRICOS PROVIDER CODE: 00099F
Subject Coordinator
Dr. Nathan Kettlewell
Senior lecturer in Economics
Ph.D. in Economics
All enquiries to
[email protected]
If the enquiry is about content, please use the discussion board so that every student
can benefit from the answer. I monitor it regularly (you can also email me to draw my
attention to a post on the discussion board).
Office Hours: Thursday 10:00-11:30 am (please make an appointment if you plan
to come).
*Note: Meetings outside this time may be available – send an email
*These office hours may change during the semester. You will be updated.
© University of Technology Sydney
Course Structure
• Lectures (1.5 h, Tuesday session will be recorded and posted
by Friday)
• Videos on Canvas allow me to explain the content in a bit more
detail (if you watch the lecture recording or read through the
slides and you understand everything, it’s fine to skip them)
• Tutorials: cover the content of the previous week's lecture. Most
important advice: try to solve the tutorial questions before
going to the tutorial. Even if you cannot answer them, you learn
much more from the tutorial if you have tried beforehand.
Assessment Structure
• 6 quizzes (the best 4 count, for a total of 20 per cent of the final grade)
• 1 hour, 1 attempt
• Format: fill in the blanks, multiple choice
• All materials are allowed
• Group assignment: Group of 5, 6 or 7.
• Why group/team work? https://www.uts.edu.au/current-students/support/helps/self-help-resources/group-work
• Case study, report writing, Excel (Microsoft's spreadsheet platform) skills.
• Teamwork and communication.
• See subject outline, or Assessment Overview on Canvas for details and due date
• It might be the case that groups get extra team members at the beginning of Week 6
• Final: AI-invigilated online exam, multiple choice and restricted open book
• Besides a calculator of any type, only paper-based materials are allowed. Internet resources,
ebooks, computer apps, and an extra iPad are not allowed.
• Most likely, an AI program called ProctorU will invigilate your exam via your webcam and mic.
Stay tuned for any updates on the assessment items.
Textbook
Using the textbook is voluntary.
• You will need to take notes during class. Examples and questions
will be illustrated in lectures, so you are encouraged (though
not required) to engage with the lectures.
• If you want to learn more, read more and practice more, you are
advised to choose the following textbook: Business Analytics and
Statistics or “BAS”.
• What I would do: I’d try to understand the content looking at slides,
lecture (recording), and (pre)recorded videos on Canvas. If I am
still puzzled, I would Google the content. If that does not help, I’d
buy/look into the textbook
Commitment
Keep in mind:
• Stats is a challenging subject
• A lot of new concepts
• Don’t feel bad and get demotivated if you struggle
• Struggling is just a sign that you will learn something new
• Things that look very complicated at first glance become approachable
once you have studied them for a while
• 3 hours facilitated learning (lecture and tutorial) per week
• 6 hours self-directed learning (self-study and group study) per week
U:PASS
http://www.uts.edu.au/current-students/support/upass/upass
http://tinyurl.com/upass2017
• Voluntary “study session”
• Led by a D or HD student of the subject with good WAM
• Sign up for U:PASS sessions via U:PASS website
• Sign up open in Week 2, but ONLY sign up if you are going to go
• U:PASS will start Week 3
• Questions? Contact Georgina at [email protected], or check out
the website.
Introduction
• Statistics is about quantifying the world (i.e., describing the
world in the most objective way possible)
• Important for good decision making:
• Descriptive Statistics and data communication (Module 1)
• Inferential Statistics (Module 2)
• Hypothesis Testing (Module 3)
• Regression Analysis (Module 4)
• Data ethics (Module 5)
Week 1: Exploring data
Foundations of
Descriptive Statistics
Data Types and Visualisation
Some preliminary terms and definitions
Variable: A variable is “any characteristic, number, or quantity that can
be measured or counted” (Australian Bureau of Statistics)
Random variable: a variable whose outcome is unknown before we
collect the data (e.g., the income of an Australian household)
Population: The complete pool of a particular random variable (e.g.,
the income of all Australian households)
Sample: A subset of the population (e.g., income of 100 households)
Our goal today is to be able to describe and visualise the
information contained in different types of variables.
Types of data
Variables can be broadly classified as either qualitative/categorical
or quantitative/numerical:
Qualitative/categorical:
• Nominal: categories that have no natural ordering. Example: a 0/1 variable for male/female.
• Ordinal: categories that have a natural order, but any numbers attached to the categories are meaningless. Example: how much do you agree with the following statement [don’t agree = -1 / somewhat agree = 0 / completely agree = 1].
Quantitative/numerical:
• Discrete: we can list all possible values (values are not infinitely divisible). Often arise from a counting process. Example: number of children in a household (0, 1, 2, 3, …).
• Continuous: can take on an infinite number of values within a given range. Often arise from a measurement process. Example: the heights of professional basketball players.
Frequency distributions
For qualitative/categorical data, a natural way to summarise it is
a table that displays frequencies.
Marital status of home loan applicants:
Category     Frequency   Relative frequency   Percent frequency
Single       102         0.1262               12.62
Married      341         0.4220               42.20
Widowed      155         0.1918               19.18
De Facto     50          0.0619               6.19
Separated    40          0.0495               4.95
Divorced     120         0.1485               14.85
Total        808         1                    100
Frequency counts: the total number of occurrences for each
category.
Relative frequency: the fraction or proportion of the total number
of data items belonging to the category.
Percent frequency: relative frequency × 100 (%)
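The three definitions above can be sketched in a few lines of Python. This is only an illustration, not the subject's required tool (the tutorials use Excel); the counts are copied from the marital-status table, and the names `counts` and `frequency_table` are made up for this example.

```python
# Frequency counts copied from the marital-status table in the slides.
counts = {
    "Single": 102,
    "Married": 341,
    "Widowed": 155,
    "De Facto": 50,
    "Separated": 40,
    "Divorced": 120,
}


def frequency_table(counts):
    """Map each category to (count, relative frequency, percent frequency)."""
    total = sum(counts.values())
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


total = sum(counts.values())
table = frequency_table(counts)

for category, (count, relative, percent) in table.items():
    print(f"{category:<10} {count:>4} {relative:.4f} {percent:6.2f}")
print(f"{'Total':<10} {total:>4}")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check that no category was missed.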
Excel
In Excel, use the function COUNTIF(range, criteria) to count how often each
category occurs (will be illustrated during the tutorial)
Data visualisation: Histograms
Histograms are most commonly used for continuous variables.
Let’s demonstrate through example:
Say we have a sample of observations of household incomes.
1. Choose a bin width to group the incomes into equally spaced
intervals (e.g., $0-$100, $101-$200, $201-$300, etc.)
2. Plot the frequencies for each group in a kind of bar chart,
usually with the frequencies on the y-axis and categories on the
x-axis.
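The two steps above can be sketched as follows. The incomes are made-up numbers for illustration only, and `bin_counts` is a hypothetical helper. It uses left-closed bins [a, b), which is a simplifying assumption and differs from the (a, b] intervals that Excel shows on its histogram axis.

```python
from collections import Counter


def bin_counts(values, width, start=0):
    """Step 1: count how many values fall in each bin [a, a + width).

    Bins that contain no observations are omitted from the result.
    """
    counts = Counter((value - start) // width for value in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }


# Made-up weekly incomes in dollars (not data from the slides).
incomes = [50, 120, 180, 90, 250, 310, 130, 170, 260, 95]
hist = bin_counts(incomes, width=100)

# Step 2: a crude text "plot" with one row per bin.
for (low, high), frequency in hist.items():
    print(f"[{low}, {high}): {'#' * frequency}")
```

Changing `width` changes the picture a lot, which is why the choice of bin size deserves some thought.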
Data visualisation: Histograms
[Figure: histogram of weekly income. x-axis: income bins [6, 526], (526, 1046], (1046, 1566], (1566, 2086], (2086, 2606], (2606, 3126]; y-axis: frequency.]
Histogram for Numerical Data
Q: What is wrong with the following histogram?
Data visualisation: Bar chart
Bar charts are visually similar to a histogram. However:
• the categories need not be equally ranged continuous values;
• the y-axis can represent things other than frequency; and
• there is usually white space between the bars.
Let’s consider a categorical variable from earlier: Marital Status of
home loan applicants
[Figure: bar chart "Marital status of home loan applicants". x-axis: Single, Married, Widowed, De Facto, Separated, Divorced; y-axis ticks from 0 to 400.]
Data visualisation: Pie chart
Pie charts are a way to visualise categorical data. This time, the
frequencies will be shown as segments of a circle (‘slices of the pie’)
Let’s consider again the categorical variable from earlier: Marital
Status of home loan applicants
[Figure: pie chart "Frequencies of home loan applicants" with slices Single, Married, Widowed, De Facto, Separated, Divorced.]
Tip: pie charts are rarely a good idea, and never when there are a large number of
categories.
Foundations of
descriptive statistics
Summary Statistics:
Central Tendency
Describing Data
• Summary Statistics:
• Central Tendency
• Variability
• Skewness
Notation
Random variables are usually denoted by capital letters 𝑋, 𝑌, …
• 𝑋 : The number of children in a household
• 𝑌 : The amount of time spent by the husband on housework per day
Realisations/observations of a random variable (r.v.) are denoted by lowercase letters with a
subscript, 𝑥𝑖 , 𝑦𝑖 , …, where 𝑖 ∈ {1, 2, …, 𝑁} or 𝑖 ∈ {1, 2, …, 𝑛}. (It reads: 𝑖 is in the set of
one, two, up to 𝑁; 𝑖 is in the set of one, two, up to 𝑛.)
• 𝑥1 : number of children in household 1
• 𝑦137 : amount of time spent by husband 137 on housework per day
𝑁 and 𝑛 denote the number of observations. Typically,
• 𝑁 refers to the population size (usually very large, and possibly infinite)
• 𝑛 denotes the sample size, i.e. the number of data points we collect in a sample.
Central tendency
Definition: Measures of central tendency yield information about the
centre of the distribution of a random variable. They give us some idea
what a typical, middle or average value is. They are sometimes called
measures of location.
Three measures of central tendency:
1. Mean: arithmetic average value
2. Mode: the most commonly occurring value
3. Median: middle value in an ordered array
Central Tendency: Mean
We can talk about either population mean or sample mean. If we
denote the random variable by 𝑋, we have:
• Population mean is denoted by $\mu$ (mu) or $E(X)$. We call the latter
the expectation of $X$. It is computed by
$\mu = E(X) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Sample mean is denoted by $\bar{X}$, called X bar. It is computed by
$\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example
• The random variable is the height of females aged between 25 and
40. John has a sample of randomly chosen females aged between 25 and
40; their heights are 157cm, 163cm, 166cm, 148cm, 174cm, 165cm,
168cm.
• Sample size $n$ equals 7. (There are 7 females in John’s sample,
$x_1 = 157$cm, …, $x_7 = 168$cm)
• The sample mean is $\bar{X} = \frac{157+163+166+148+174+165+168}{7} = 163$cm
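As a quick check of the arithmetic in this example, here is a minimal Python sketch of the sample mean formula. The variable names are chosen for this illustration; the heights are John's seven observations.

```python
# John's sample of heights in centimetres.
heights_cm = [157, 163, 166, 148, 174, 165, 168]

n = len(heights_cm)                 # sample size
sample_mean = sum(heights_cm) / n   # (x_1 + ... + x_n) / n

print(f"n = {n}, sample mean = {sample_mean} cm")
```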
Example
• Consider the following gamble: First, I toss a fair coin.
• If the coin lands on heads, I receive 10 dollars
• If the coin lands on tails, I pay 10 dollars
• Q: Suppose I play this gamble 100 times (i.e., I construct a
sample of 100 observations). Further, suppose 60 times the
coin falls on heads and 40 times it falls on tails. What’s the
sample mean?
• A: $\bar{X} = \frac{60 \times 10 + 40 \times (-10)}{100} = 2$ dollars
Example
• Consider the following gamble: First, I toss a fair coin.
• If the coin lands on heads, I receive 10 dollars
• If the coin lands on tails, I pay 10 dollars
• Q: What’s the population mean?
• A: $E(X) = 0.5 \times 10 + 0.5 \times (-10) = 0$ dollars
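The difference between the sample mean and the population mean in the coin gamble can be made concrete with a short sketch. The dictionaries below are just one assumed way to encode the payoffs, the observed counts (60 heads, 40 tails) and the fair-coin probabilities.

```python
payoffs = {"heads": 10, "tails": -10}          # dollars won or lost
observed = {"heads": 60, "tails": 40}          # counts in the sample of 100 plays
probabilities = {"heads": 0.5, "tails": 0.5}   # fair coin

n = sum(observed.values())

# Sample mean: average payoff over the 100 plays actually observed.
sample_mean = sum(payoffs[side] * observed[side] for side in payoffs) / n

# Population mean (expectation): probability-weighted average payoff.
expectation = sum(probabilities[side] * payoffs[side] for side in payoffs)

print(f"sample mean = {sample_mean} dollars, E(X) = {expectation} dollars")
```

The sample mean depends on the particular 100 tosses, while the expectation is a fixed property of the gamble.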
Central tendency: Mode
The mode is the most commonly occurring value.
Example: Waiting times of people in a queue.
We record the following:
2,3,3,3,4,2,2,2,2,3,3,3,1,1 minutes
(ordered: 1,1,2,2,2,2,2,3,3,3,3,3,3,4)
• Q: What is the mode?
• A: 3 (it occurs six times, which is more than any other value)
• Random variables where there are two modes are said to be bimodal,
or more generally multimodal for multiple modes.
Central tendency: Median
The median is the middle value in an ordered array.
Example: Waiting times of people in a queue.
We record the following:
2,3,3,3,4,2,2,2,2,3,3,3,1,1 minutes
(ordered: 1,1,2,2,2,2,2,3,3,3,3,3,3,4)
• Q: What is the median?
• A: The median is 2.5 minutes. With 14 observations, the middle lies between
the 7th and 8th ordered values, so we average them: $\frac{2+3}{2} = 2.5$
1,1,2,2,2,2,2, (2.5), 3,3,3,3,3,3,4
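Both the mode and the median of the waiting-time data can be checked with Python's standard `statistics` module. This is only a sketch for self-study; `multimode` needs Python 3.8 or later and is included to show how bimodal or multimodal data would be reported.

```python
import statistics

# Waiting times in minutes, in the order they were recorded.
waiting = [2, 3, 3, 3, 4, 2, 2, 2, 2, 3, 3, 3, 1, 1]

mode = statistics.mode(waiting)        # most commonly occurring value
modes = statistics.multimode(waiting)  # list of all modes, in case of ties
median = statistics.median(waiting)    # averages the two middle values when n is even

print(sorted(waiting))
print(f"mode = {mode}, all modes = {modes}, median = {median}")
```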
Central tendency: Qualitative Data
University major of employees. We label 1=marketing, 2=finance,
3=economics, 4=law, 5=others.
We record the following:
2,5,3,1,4,2,5,3,4,2,1 (ordered: 1,1,2,2,2,3,3,4,4,5,5)
Mode is 2, median is 3, mean is 2.9091.
Q: Which measure of central tendency is the most appropriate?
Foundations of
descriptive statistics
Summary Statistics:
Variability
Descriptive statistic: Variability
Definition: Measures of variability yield information about how dispersed the
values of a random variable are around the mean. They are sometimes
called measures of scale, spread, dispersion, or risk.
We discuss three commonly used measures of variability:
1. Variance (Var): average of squared distance from the mean
2. Standard deviation (std): square root of variance
3. Coefficient of variation: std / mean × 100 %
Variability: Example
Q: Which stock would you invest in, based on the data on their weekly returns?
[Figure: weekly returns of Stock X and Stock Y]
$E(X) = E(Y) = 1.5\%$, meaning that every week both stocks grow
1.5% in expectation (on average). But which one do you prefer?
Variability: Formulas
We can talk about either population variance or sample variance. If we
denote the random variable by 𝑋, we have:
• Population variance is denoted by $\sigma^2$ (sigma squared) or $\mathrm{Var}(X)$. We
call the latter the variance of $X$. It is computed by
$\sigma^2 = \mathrm{Var}(X) = \frac{(x_1-\mu)^2 + \cdots + (x_N-\mu)^2}{N} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
• Sample variance is denoted by $s^2$, called s squared. It is computed by
$s^2 = \frac{(x_1-\bar{X})^2 + \cdots + (x_n-\bar{X})^2}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{X})^2$
Variability: Variance
Variance: the average squared distance between the data points and their
mean, computed with the population or the sample formula as appropriate.
Illustration with three data points and $\bar{X} = 12$:
• $x_1 = 14$, $x_1 - \bar{X} = 2$, $(x_1 - \bar{X})^2 = 4$
• $x_2 = 6$, $x_2 - \bar{X} = -6$, $(x_2 - \bar{X})^2 = 36$
• $x_3 = 16$, $x_3 - \bar{X} = 4$, $(x_3 - \bar{X})^2 = 16$
Example
𝑋: waiting time of people in a queue (in minutes)
i     1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
x_i   12  9   8   8   11  9   10  9   14  9   9   7   10  10  14
• If this is the population, $N = 15$. We have $\mu = \frac{1}{15}\sum_{i=1}^{15} x_i = 9.93$ minutes.
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{15}(x_i-\mu)^2 = \frac{1}{15}\left[(12-9.93)^2 + \cdots + (14-9.93)^2\right] = 3.929$
• If this is a sample, $n = 15$. We have $\bar{X} = \frac{1}{15}\sum_{i=1}^{15} x_i = 9.93$ minutes.
$s^2 = \frac{1}{n-1}\sum_{i=1}^{15}(x_i-\bar{X})^2 = \frac{1}{14}\left[(12-9.93)^2 + \cdots + (14-9.93)^2\right] = 4.210$
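The only difference between the two variance formulas in the waiting-time example is the divisor ($N$ versus $n-1$). A minimal Python sketch, with function names chosen for this illustration, makes that explicit and cross-checks against the standard `statistics` module.

```python
import statistics

# Waiting times in minutes for the 15 observations in the example.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]


def population_variance(xs):
    """Average squared distance from the mean, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Same sum of squared distances, but dividing by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


sigma_sq = population_variance(waiting)
s_sq = sample_variance(waiting)

print(f"mean = {sum(waiting) / len(waiting):.2f} minutes")
print(f"population variance = {sigma_sq:.3f}, sample variance = {s_sq:.3f}")

# The standard library implements the same two formulas.
assert abs(sigma_sq - statistics.pvariance(waiting)) < 1e-9
assert abs(s_sq - statistics.variance(waiting)) < 1e-9
```

Because the sample formula divides by a smaller number, the sample variance is always a little larger than the population variance computed from the same data.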
Variance: Remarks
Q1: Why sum up or average the squared distances instead of the distances themselves?
• Distances in opposite directions (above and below the mean) cancel out, so
they are not suitable for measuring variability.
Q2: What is the unit?
• A distance such as $x_1 - \mu$ is in the unit of the data
• But a squared distance such as $(x_1 - \mu)^2$ is in the unit of the data raised
to the power of 2!
Example:
• A distance such as $x_1 - \mu = 12 - 9.93$ is in minutes.
• But a squared distance such as $(x_1 - \mu)^2 = (12 - 9.93)^2$ is in minutes²!
Standard Deviation
We can talk about either population standard deviation or sample
standard deviation. If we denote the random variable by 𝑋, we have:
• Population standard deviation is denoted by $\sigma$ (sigma) or $\mathrm{std}(X)$. We
call the latter the standard deviation of $X$. It is computed by
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{(x_1-\mu)^2 + \cdots + (x_N-\mu)^2}{N}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$
• Sample standard deviation is denoted by $s$, called s. It is computed
by
$s = \sqrt{s^2} = \sqrt{\frac{(x_1-\bar{X})^2 + \cdots + (x_n-\bar{X})^2}{n-1}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{X})^2}$
Standard Deviation
Standard deviation solves the problem of squared units. It has the
same units as the original data. In the waiting-time example, we have
• Population: $\sigma = \sqrt{\sigma^2} = \sqrt{3.929\ \text{minutes}^2} = 1.982$ minutes
• Sample: $s = \sqrt{s^2} = \sqrt{4.210\ \text{minutes}^2} = 2.052$ minutes
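A short sketch of the same calculation in Python: take the square root of each variance and compare with the ready-made functions in the standard `statistics` module (`pstdev` for a population, `stdev` for a sample). The data are the 15 waiting times from the example.

```python
import math
import statistics

# Waiting times in minutes for the 15 observations in the example.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]

# Standard deviation is the square root of the variance, so it is in minutes again.
sigma = math.sqrt(statistics.pvariance(waiting))  # population
s = math.sqrt(statistics.variance(waiting))       # sample

print(f"population std = {sigma:.3f} minutes, sample std = {s:.3f} minutes")
```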
Example
Q: Intuitively, what does variance or standard deviation really measure?
Example: $X$ and $Y$ are the time spent per day on work ($X$) and on leisure ($Y$)
i   Observation   x_i   y_i   (x_i - X̄)²   (y_i - Ȳ)²
1   Person A      4     1     4             25
2   Person B      5     4     1             4
3   Person C      6     7     0             1
4   Person D      7     8     1             4
5   Person E      8     10    4             16
[Figure: plot comparing time on work and time on leisure]
The means are the same: $\bar{X} = \bar{Y} = 6$
The variances are different: $s_X^2 = 2.5 < s_Y^2 = 12.5$
Variance and standard deviation measure how spread out the distribution of a
random variable is.
Coefficient of Variation
We can talk about either population or sample coefficient of
variation (CV). If we denote the random variable by 𝑋, we have:
• Population CV(%) is computed by
$\frac{\sigma}{\mu} \times 100\%$
• Sample CV(%) is computed by
$\frac{s}{\bar{X}} \times 100\%$
It is unit free, because both the numerator and denominator have
the same unit as the original data and they cancel each other.
Coefficient of Variation: Example
𝑋: waiting time of people in a queue (in minutes)
i     1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
x_i   12  9   8   8   11  9   10  9   14  9   9   7   10  10  14
Population CV(%): $\frac{\sigma}{\mu} \times 100\% = \frac{1.982\ \text{minutes}}{9.93\ \text{minutes}} \times 100\% = 19.96\%$
Sample CV(%): $\frac{s}{\bar{X}} \times 100\% = \frac{2.052\ \text{minutes}}{9.93\ \text{minutes}} \times 100\% = 20.66\%$
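The CV calculation can be sketched in Python as below; `cv_percent` is a hypothetical helper written for this illustration. Note that the code keeps full precision in the mean and standard deviation, so it gives about 19.95% and 20.65%, whereas the slide divides the already rounded values (1.982, 2.052 and 9.93) and therefore reports 19.96% and 20.66%.

```python
import statistics

# Waiting times in minutes for the 15 observations in the example.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]


def cv_percent(xs, population=False):
    """Coefficient of variation in percent: std / mean * 100."""
    std = statistics.pstdev(xs) if population else statistics.stdev(xs)
    return std / statistics.mean(xs) * 100


cv_population = cv_percent(waiting, population=True)
cv_sample = cv_percent(waiting)

print(f"population CV = {cv_population:.2f}%, sample CV = {cv_sample:.2f}%")
```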
CV: Interpretation
CV is unit free. It measures standard deviation per unit of mean. In
finance when the random variable 𝑋 denotes asset returns, CV
measures risk per unit of expected return.
Q: Why is a unit-free measure sometimes preferred?
Time on leisure per day in hours:
i   Observation   y_i   (y_i - Ȳ)²
1   Person A      1     25
2   Person B      4     4
3   Person C      7     1
4   Person D      8     4
5   Person E      10    16
$\bar{Y} = 6$, $s_Y^2 = 12.5$, $s_Y = 3.54$, $CV_Y = \frac{3.54}{6} \times 100\% = 59\%$
Time on leisure per day in minutes:
i   Observation   y_i   (y_i - Ȳ)²
1   Person A      60    90000
2   Person B      240   14400
3   Person C      420   3600
4   Person D      480   14400
5   Person E      600   57600
$\bar{Y} = 360$, $s_Y^2 = 45000$, $s_Y = 212.13$, $CV_Y = \frac{212.13}{360} \times 100\% = 59\%$
Q: What is the unit for $\bar{Y}$, $s_Y^2$, $s_Y$ and $CV_Y$, for both tables?
Q: Practice the calculation, now treating the data as a population.
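The hours-versus-minutes comparison can be reproduced with a small sketch (sample formulas only, so the population exercise is left for practice). `sample_cv_percent` is a name made up for this illustration.

```python
import statistics


def sample_cv_percent(xs):
    """Sample coefficient of variation in percent: s / mean * 100."""
    return statistics.stdev(xs) / statistics.mean(xs) * 100


# Time on leisure per day for Persons A to E, first in hours, then in minutes.
hours = [1, 4, 7, 8, 10]
minutes = [60 * h for h in hours]

cv_hours = sample_cv_percent(hours)
cv_minutes = sample_cv_percent(minutes)

# Changing the unit multiplies the variance by 60 squared, but leaves the CV alone.
variance_ratio = statistics.variance(minutes) / statistics.variance(hours)

print(f"CV in hours = {cv_hours:.1f}%, CV in minutes = {cv_minutes:.1f}%")
print(f"variance ratio (minutes / hours) = {variance_ratio}")
```

Mean and standard deviation both scale by 60, so the factor cancels in the ratio; that is why the CV lets us compare variability across data measured in different units.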
Variability: Excel
Using Excel, everything can be computed with built-in functions (for example
AVERAGE, MEDIAN, VAR.P, VAR.S, STDEV.P and STDEV.S).
Excel is our friend, for the assignment and your future career.
Google is our friend for learning Excel, e.g. search “how to calculate
inter-quartile range in excel?” in Google.
Foundations of
descriptive statistics
Summary Statistics:
Skewness
Descriptive statistics: Shape
Central tendency and variability are useful to describe and summarise data, or the
distribution of random variables.
They cannot summarise asymmetry. Skewness is a measure of asymmetry.
(Calculating skewness will not be examined. BAS Chap 3.4)
Skewness (“non-parametric skew”)
• Symmetric distribution (skewness = 0): median = mean
• Right-skewed distribution (skewness > 0, positively skewed): median < mean
• Left-skewed distribution (skewness < 0, negatively skewed): median > mean
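The comparison of mean and median above can be turned into a small sketch. It assumes the usual definition of the non-parametric skew, (mean − median) / standard deviation, which the slide names but does not spell out; the function names are made up for this illustration, and calculating skewness is not examinable.

```python
import statistics


def nonparametric_skew(xs):
    """(mean - median) / standard deviation (population formula assumed)."""
    return (statistics.mean(xs) - statistics.median(xs)) / statistics.pstdev(xs)


def describe_shape(xs):
    """Classify by the sign of the non-parametric skew."""
    skew = nonparametric_skew(xs)
    if skew > 0:
        return "right-skewed"   # median < mean
    if skew < 0:
        return "left-skewed"    # median > mean
    return "symmetric"          # median = mean


# Waiting times in minutes from the variance example, plus a symmetric toy sample.
waiting = [12, 9, 8, 8, 11, 9, 10, 9, 14, 9, 9, 7, 10, 10, 14]
symmetric = [1, 2, 3, 4, 5]

print(describe_shape(waiting))
print(describe_shape(symmetric))
```

For the waiting times the mean (about 9.93) is above the median (9), so a few long waits pull the distribution to the right.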
Summary for week 1:
1. We summarise categorical data using frequency tables (counts, relative and percent
frequencies), and visualise it using bar charts or pie charts; histograms are used for numerical data.
2. Distribution is the general shape that shows the probability that a random variable
takes a certain value.
3. Central tendency includes mean, mode (most commonly occurring value in an array of
numbers) and median (the middle number if you sort the array)
• Population mean: $\mu = E(X) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Sample mean: $\bar{X} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
4. Variability includes variance, standard deviation and coefficient of variation
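• Variance: population $\sigma^2 = \mathrm{Var}(X) = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$; sample $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{X})^2$
• Standard deviation: population $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$; sample $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{X})^2}$
• Coefficient of variation: population $\frac{\sigma}{\mu} \times 100\%$; sample $\frac{s}{\bar{X}} \times 100\%$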
5. Measure of shape: skewness