STAT 008 CH 1-3 p.1-37 Lecture Notes
What is Statistics?
Using information from collected data to make general statements or conclusions; it is the discipline that
encompasses data collection, evaluation, and decision-making while accounting for uncertainty
Two main branches:
1)
2)
Idea:
Terminology:
Population – the entire group of interest; often too large to collect data for every member
e.g.
Parameter – a fixed number describing the population; in practice, this value is unknown
e.g.
e.g.
Statistic – a number describing a particular sample; changes from sample to sample; used
to estimate the population parameter
e.g.
Experimental Unit – the object from which information (i.e., data) is observed; also called elements
e.g.
Variable – any characteristic of an experimental unit that is of interest for measuring, estimating,
or making statements about; think of this as a ‘description of the data’
e.g.
AFlores 10.23
STAT 008 | Chapter 1 – Introduction to Statistics Page 2
Distribution – the possible values a variable can take on and how often they occur
Shapes of Distributions
Symmetric
Example: A U.S. company composed of several departments with 4000 employees maintains a
secure database containing employee information. Among other things, the database
contains the following information for each employee: contact phone number, social
security number, department, position held, salary, number of days employed, and
number of days absent (sick/vacation). The owner of this company is interested in the
average salary of its employees. To report a value, the Human Resource Department
randomly selects 100 employees from the database and calculates their average salary.
Parameter –
Sample –
Statistic –
Experimental Unit –
Variable –
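The distinction between a parameter and a statistic in this example can be sketched with a short simulation. The salaries below are randomly generated placeholders (an assumption made only for illustration; they are not the company's data), and the seed value is arbitrary. The point is only that the population mean stays fixed while the sample mean changes from sample to sample.

```python
import random
import statistics

random.seed(8)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 4000 made-up salaries (NOT the company's data).
population = [random.uniform(30_000, 100_000) for _ in range(4000)]
mu = statistics.mean(population)  # parameter: fixed, usually unknown in practice

# Two independent random samples of 100 employees each.
sample_a = random.sample(population, 100)
sample_b = random.sample(population, 100)
x_bar_a = statistics.mean(sample_a)  # statistic computed from sample A
x_bar_b = statistics.mean(sample_b)  # statistic computed from sample B

print(f"parameter (population mean): {mu:,.2f}")
print(f"statistic, sample A:         {x_bar_a:,.2f}")
print(f"statistic, sample B:         {x_bar_b:,.2f}")
```

Each run of `random.sample` plays the role of the Human Resource Department drawing 100 employees; the two sample means differ from each other and from the population mean, which is why a statistic is only an estimate of the parameter.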
Example: A September 2023 Forbes Advisor article, Business Credit Cards: Statistics And Trends
In 2023, states “According to the data collected by Forbes Advisor, most new businesses
applying for business cards—88.61%—had at least three cards.”
Parameter –
Sample –
Statistic –
Experimental Unit –
Variable –
Types of Variables
1) Quantitative – numerical data for which mathematical operations (adding, subtracting, averaging)
can be applied
e.g.,
i) Discrete –
e.g.,
ii) Continuous –
e.g.
e.g.,
Scales of Measurement
Interval – quantitative data; data that is ordered and the interval between values is expressed in a fixed
unit of measurement; no true zero and values can be negative
e.g.,
Ratio – quantitative data; all properties of interval data; ratio of two values is meaningful; true zero
exists and values are non-negative
e.g.,
Nominal – qualitative data; can be coded numerically; calculations are meaningless; consider the
frequency of ‘categories’; no natural ordering of categories
e.g.,
Ordinal – qualitative data; can be coded numerically; calculations provide some insight; natural
ordering of categories exists
e.g.,
We would like to display data graphically to obtain a visual impression of the information contained
within the data. The type of graph we use will depend on the type of variable (qualitative or quantitative).
Frequency –
Relative Frequency –
Percentage –
STAT 008 | Chapter 2 – Describing Data Graphically Page 5
Bar Chart
vertical axis
Example: Refer to the Human Resource Department Example. Consider the variable “department”
and suppose the following summary data is reported:
Accounting 100
Customer Service 300
Executive 25
Human Resource 75
Labor 2000
Sales 1500
Total 4000
Note: 1)
2)
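A minimal sketch of the summary a bar chart is built from, assuming the usual definitions (relative frequency = f / n, percentage = relative frequency × 100%), which are left blank in these notes. The counts are the department totals listed in the example.

```python
counts = {
    "Accounting": 100,
    "Customer Service": 300,
    "Executive": 25,
    "Human Resource": 75,
    "Labor": 2000,
    "Sales": 1500,
}
n = sum(counts.values())  # total number of employees

# Relative frequency and percentage for each category.
rel_freq = {dept: f / n for dept, f in counts.items()}
percent = {dept: 100 * rf for dept, rf in rel_freq.items()}

for dept, f in counts.items():
    print(f"{dept:<17}{f:>6}{rel_freq[dept]:>10.5f}{percent[dept]:>9.3f}%")
```

The heights of the bars can be drawn from any of the three columns; the shape of the chart is the same, only the scale of the vertical axis changes.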
Pie Chart
Construction: Divide a ‘pie’/circle into slices that are proportional to the relative
frequencies/percentages for each category.
Note: The angle of each slice can be calculated as $\frac{f}{n} \times 360^\circ$.
Example: Refer to the previous example. Construct a pie chart to display the information.
Note:
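The slice angles for the department data can be worked out directly from the rule (f / n) × 360°. This is only a sketch of the arithmetic; drawing the pie itself is left to the reader.

```python
counts = {
    "Accounting": 100,
    "Customer Service": 300,
    "Executive": 25,
    "Human Resource": 75,
    "Labor": 2000,
    "Sales": 1500,
}
n = sum(counts.values())

# Angle of each slice in degrees: (f / n) * 360.
angles = {dept: f / n * 360 for dept, f in counts.items()}

for dept, angle in angles.items():
    print(f"{dept:<17}{angle:>8.2f} degrees")
print(f"{'Total':<17}{sum(angles.values()):>8.2f} degrees")
```

Very small categories (here, Executive) produce slices of only a few degrees, which is one reason a pie chart can be hard to read when category sizes are very unequal.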
Examples:
Dot Plot
Example: Refer to the Human Resource Department Example. Consider the variable “number of
days absent” and suppose the following data is reported for 20 employees:
0 2 6 4 3 4 3 5 12 0
1 5 7 3 2 2 1 6 3 4
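A text version of a dot plot for these 20 observations can be sketched as follows: one row per possible value, one "o" per observation. The layout (values listed down the page rather than along a horizontal axis) is a choice made for plain-text output, not a convention from these notes.

```python
from collections import Counter

days_absent = [0, 2, 6, 4, 3, 4, 3, 5, 12, 0,
               1, 5, 7, 3, 2, 2, 1, 6, 3, 4]
freq = Counter(days_absent)  # frequency of each observed value

# One row per value on the axis, including values that never occur.
for value in range(min(days_absent), max(days_absent) + 1):
    print(f"{value:>2} | " + "o" * freq[value])
```

Keeping the empty rows for 8 through 11 matters: the gap is what makes the observation at 12 stand apart from the rest of the data.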
Histogram
Idea:
vertical axis
Steps: 1)
Note:
2)
3)
Note: Each observation should fall into one and only one class. This means we
must decide which endpoint of the interval to include for each class. The
right inclusion rule indicates that an observation on the ‘right’ (or upper
bound) of the interval should be included. Alternatively, the left inclusion
rule indicates that an observation on the ‘left’ (or lower bound) of the interval should be included.
Observation Class 1 Interval Class 2 Interval
4) Construct the histogram similar to constructing a bar chart (where the classes are the
“categories”)
Note: Steps 2-3 can be summarized using a Frequency / Relative Frequency Distribution.
Class Interval Frequency Relative Frequency
1
2
3
k
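The two inclusion rules can be sketched as a small function. The function name `class_index` and the class boundaries 0, 10, 20 are hypothetical, chosen only to show what happens to an observation that lands exactly on a boundary.

```python
def class_index(value, edges, rule="left"):
    """Return the 1-based class number for value, or None if it is outside.

    edges: sorted class boundaries, e.g. [0, 10, 20] defines two classes.
    rule="left":  classes are [lower, upper), the lower bound is included.
    rule="right": classes are (lower, upper], the upper bound is included.
    """
    for i in range(len(edges) - 1):
        lower, upper = edges[i], edges[i + 1]
        if rule == "left" and lower <= value < upper:
            return i + 1
        if rule == "right" and lower < value <= upper:
            return i + 1
    return None


edges = [0, 10, 20]  # hypothetical boundaries for two classes
print(class_index(10, edges, "left"))   # boundary value is placed in class 2
print(class_index(10, edges, "right"))  # boundary value is placed in class 1
```

Under either rule the observation 10 falls in exactly one class. Note that the extreme boundaries need care: with left inclusion a value equal to the last upper boundary belongs to no class, so the boundaries should be chosen to extend past the maximum.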
Example: Refer to the Human Resource Department Example. Consider the variable “salary”. Since
there are 4000 employees and many different values for their salaries, it is best to display
the data by grouping the salaries into classes. Below is a partial list of the salaries:
[Partial list of salaries not reproduced.]
Assume the minimum and maximum salary in the employee database is $30,235 and
$99,999, respectively. Complete the following relative frequency distribution using 7
classes and then construct a relative frequency histogram.
Class    Interval    Frequency    Relative Frequency
1
2                    732          0.1830
3                    654          0.1635
4                    482          0.1205
5                    377          0.0943
6                    215          0.0537
7                    42           0.0105
Totals               4000         1
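The frequency for class 1 is not given in the table, but it is implied by the totals. The sketch below recovers it and also computes the raw class width from the stated minimum and maximum. Rounding the width up to a convenient value such as 10,000 is a common practice and an assumption here, not something the notes specify.

```python
n = 4000
known_freq = {2: 732, 3: 654, 4: 482, 5: 377, 6: 215, 7: 42}

# Class 1 must account for whatever the other six classes do not.
freq_1 = n - sum(known_freq.values())
rel_freq_1 = freq_1 / n
print(f"class 1: frequency = {freq_1}, relative frequency = {rel_freq_1:.4f}")

# Raw class width: (max - min) / number of classes.
minimum, maximum, k = 30_235, 99_999, 7
raw_width = (maximum - minimum) / k
print(f"raw class width = {raw_width:.2f}")
```

Because the first class holds the largest share of employees and the frequencies fall steadily afterward, the resulting histogram has a long tail toward the high salaries.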
Stem-and-Leaf Plot
Construction: 1) Split each observation into a “stem” and a “leaf” where the last digit of the
observation is the “leaf” and all preceding digits form the “stem”.
Note: Some rounding may be required so that the list in Step 2 is appropriate.
2) List the stems in numerical order, in a vertical line, from the minimum value to
the maximum value. Do not skip any stem values.
3) Draw a vertical line to the right of the stems. This line separates the ‘stems’ and
‘leaves’.
4) List the leaves in numerical order, horizontally, at the appropriate stem. Repeat
leaf values for multiple observations of the same value.
Example: Refer to the Human Resource Department example. Consider the variable “number of
days employed” for 18 summer interns. The data is provided below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189
Construct a stem-and-leaf plot to display the information.
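The four construction steps can be sketched in a few lines for the intern data. Splitting with `divmod(value, 10)` assumes whole-number observations whose last digit is the leaf, as in this example; data that needs rounding first would need an extra step.

```python
from collections import defaultdict

days_employed = [110, 119, 125, 126, 127, 127,
                 128, 129, 131, 136, 138, 139,
                 145, 146, 147, 151, 155, 189]

# Step 1: split each observation into stem (leading digits) and leaf (last digit).
leaves = defaultdict(list)
for value in sorted(days_employed):
    stem, leaf = divmod(value, 10)
    leaves[stem].append(leaf)

# Steps 2-4: list every stem from min to max (no skipping), then its leaves.
lines = []
for stem in range(min(leaves), max(leaves) + 1):
    row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
    lines.append(f"{stem} | {row}".rstrip())

print("\n".join(lines))
```

Stems 16 and 17 are printed with no leaves. Leaving them in preserves the gap between the bulk of the data and the observation 189.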
Box Plot
This graph is based on numerical measures of position. We will come back to this.
Once a graph is constructed, we would like to describe the overall pattern of the data. Specifically, we
are interested in the following:
1)
2)
3)
4) Are there any observations that don’t ‘fit’ with the rest of the data (i.e., unlikely
observations/outliers)? Are those observations affecting the shape of the distribution?
Example: Consider the dot plot for the “number of days absent” data
[Dot plot axis: number of days absent, 0 to 12 in steps of 1]
1) Shape?
2) Center?
3) Spread?
4) Outlier(s)?
Example: Consider the stem-and-leaf plot for “number of days employed” for the summer interns.
11 | 0 9
12 | 5 6 7 7 8 9
13 | 1 6 8 9
14 | 5 6 7
15 | 1 5
16 |
17 |
18 | 9
1) Shape?
2) Center?
3) Spread?
4) Outlier(s)?
The impression of the data displayed in the graphs is subjective. Therefore, we need supporting evidence
for the statements/conclusions we are drawing.
Numerical Measures
Measures of the Center
Measures of Spread
Measures of Position
Measures for Detecting Outliers
1) Mean –
population:
2) Median –
i.e.,
3) Mode –
These measures of the center can be used to determine the shape of the distribution:
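mean < median: the distribution is typically skewed left
mean = median: the distribution is typically symmetric
mean > median: the distribution is typically skewed right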
STAT 008 | Chapter 3 – Describing Data Numerically Page 14
Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189
a) Calculate the mean, median, and mode for this data. How do these values compare with the
‘center’ interpreted from the stem and leaf plot?
b) Use the measurements from part (a) to describe the shape of the data. How does this compare
with the shape interpreted from the stem and leaf plot?
c) The stem-and-leaf plot indicated a possible outlier (i.e., 189). Remove this observation and repeat
part (a).
d) Use the measurements from part (c) to describe the shape of the data.
e) Compare the measures of center and the shape for the data set with and without the outlier.
            n = 18      n = 17      Comments
Mean
Median
Mode
Shape
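The entries for the comparison table can be checked with Python's `statistics` module. This is a sketch of the arithmetic only; `multimode` is used so that ties would be reported rather than hidden.

```python
import statistics

days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]
without_outlier = [d for d in days if d != 189]


def center(data):
    """Return the three measures of center for a list of observations."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.multimode(data),
    }


summary_18 = center(days)
summary_17 = center(without_outlier)

for label, summary in (("n = 18", summary_18), ("n = 17", summary_17)):
    print(f"{label}: mean = {summary['mean']:.2f}, "
          f"median = {summary['median']}, mode = {summary['mode']}")
```

Removing 189 pulls the mean down by about three days while the median moves less and the mode does not move at all, which illustrates that the mean is the measure most sensitive to an extreme observation. In both cases the mean exceeds the median, which is consistent with a right-skewed shape.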
Measures of Spread
1) Range –
Notation / Formula:
2) Variance –
sample:
population:
Notes: 1)
2)
4)
3) Standard Deviation –
sample: population:
Notes: 1)
Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189
c) Compare the measures of spread for the data set with and without the outlier.
                      n = 18      n = 17      Comments
Range
Variance
Standard Deviation
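The measures of spread for the intern data, with and without the observation 189, can be sketched as below. The sample formulas are left blank in these notes; the code assumes the usual sample variance with divisor n − 1, which is what `statistics.variance` computes.

```python
import statistics

days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]
without_outlier = [d for d in days if d != 189]


def spread(data):
    """Return range, sample variance, and sample standard deviation."""
    return {
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # divisor n - 1
        "std_dev": statistics.stdev(data),      # square root of the variance
    }


spread_18 = spread(days)
spread_17 = spread(without_outlier)

for label, s in (("n = 18", spread_18), ("n = 17", spread_17)):
    print(f"{label}: range = {s['range']}, "
          f"variance = {s['variance']:.2f}, std dev = {s['std_dev']:.2f}")
```

All three measures shrink noticeably once 189 is removed, so none of them is resistant to outliers; the range is affected most directly because it depends only on the two extreme values.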
Chebyshev’s Theorem
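For a distribution of any shape, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the observations will fall in the interval $\mu \pm k\sigma$ (for a sample, $\bar{x} \pm ks$).
Interval              $\left(1 - \frac{1}{k^2}\right) \times 100\%$
$\bar{x} \pm s$
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$
$\bar{x} \pm 3.5s$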
Illustration #1:
Illustration #2:
Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189
Recall: $\bar{x} =$            $s =$
Interval              Lower Bound    Upper Bound    $f$    $\frac{f}{n} \times 100\%$
$\bar{x} \pm s$
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$
b) Compare the percentage of observations within the specified intervals for our data (summarized
in the previous table) with Chebyshev’s Theorem.
c) Use Chebyshev’s Theorem to make a statement about the percentage of observations within 1.5
standard deviations of the mean.
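Parts (b) and (c) can be checked with a short sketch that compares the percentage of intern observations actually inside each interval with the lower bound from Chebyshev's Theorem. The helper names `chebyshev_percent` and `observed_percent` are hypothetical.

```python
import statistics

days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]
n = len(days)
x_bar = statistics.mean(days)
s = statistics.stdev(days)  # sample standard deviation


def chebyshev_percent(k):
    """Lower bound (percent) on observations within k standard deviations."""
    return (1 - 1 / k**2) * 100


def observed_percent(k):
    """Percent of the data that falls in x_bar - k*s to x_bar + k*s."""
    lower, upper = x_bar - k * s, x_bar + k * s
    f = sum(lower <= d <= upper for d in days)
    return f / n * 100


for k in (1, 2, 3):
    print(f"k = {k}: observed {observed_percent(k):6.2f}%, "
          f"Chebyshev at least {chebyshev_percent(k):6.2f}%")

print(f"k = 1.5: Chebyshev at least {chebyshev_percent(1.5):.2f}%")
```

The observed percentages are well above the Chebyshev bounds, as expected: the theorem gives a guaranteed minimum for any shape, not a prediction for a particular data set.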
Empirical Rule
For a symmetric, mound-shaped distribution, the following statements are true:
Approximately 68% of the observations will fall in the interval $\mu \pm \sigma$.
Approximately 95% of the observations will fall in the interval $\mu \pm 2\sigma$.
Approximately 99.7% of the observations will fall in the interval $\mu \pm 3\sigma$.
Illustration #1:
Alternatively,
Example: Refer to the Human Resource Department Example. Consider the variable “number of
days employed”. Suppose a histogram indicates that this variable is approximately
symmetric with a mean of 3010 days (about 8.25 years) and a standard deviation of 915
days (about 2.5 years). Find the proportion of employees that…
b) have worked at the company between 1180 days and 4840 days?
d) have worked at the company between 2095 days and 5755 days?
e) What are the lower and upper bounds for the number of days worked for the middle 99.7% of all
employees?
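For parts (b), (d), and (e), the key step is to express each bound as a number of standard deviations from the mean and then combine the Empirical Rule percentages. The sketch below assumes each bound sits exactly 1, 2, or 3 standard deviations from the mean, on opposite sides, which holds for the bounds given here; the helper name `percent_between` is hypothetical.

```python
mean, sd = 3010, 915

# Empirical Rule: approximate percent within k standard deviations of the mean.
WITHIN = {1: 68.0, 2: 95.0, 3: 99.7}


def percent_between(lower, upper):
    """Approximate percent between a bound below and a bound above the mean."""
    k_low = round((mean - lower) / sd)    # standard deviations below the mean
    k_high = round((upper - mean) / sd)   # standard deviations above the mean
    # By symmetry, half of each band lies on each side of the mean.
    return WITHIN[k_low] / 2 + WITHIN[k_high] / 2


print(percent_between(1180, 4840))   # part (b): 2 sd below to 2 sd above
print(percent_between(2095, 5755))   # part (d): 1 sd below to 3 sd above
print(mean - 3 * sd, mean + 3 * sd)  # part (e): bounds for the middle 99.7%
```

Part (d) is the instructive case: the interval is not centered on the mean, so the answer is half of the 68% band plus half of the 99.7% band rather than a single Empirical Rule value.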
Measures of Position
1) $p$th percentile –
2) Quartiles –
Illustration:
Notation: Position:
Notation: Position:
1) Range
2) Variance
3) Standard Deviation
4) Inter-Quartile Range –
Notation/Formula:
Notes: 1) If the IQR is ‘small’ (relative to the overall Range), then the spread tends to be small
(since 50% of the observations are close to the center of the data within the range of the
IQR).
e.g.
2) If the IQR is ‘large’ (relative to the overall Range), then the data tends to be widely
dispersed (since 50% of the observations are widely dispersed).
e.g.
Empirical Rule
For a symmetric, mound-shaped distribution, the following statements are true:
Approximately 68% of the observations will fall in the interval $\bar{x} \pm 1s$
Approximately 95% of the observations will fall in the interval $\bar{x} \pm 2s$
Approximately 99.7% of the observations will fall in the interval $\bar{x} \pm 3s$
Chebyshev’s Theorem
For a distribution of any shape, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the observations will fall in $\bar{x} \pm ks$.
Consider the proportion/percentage of observations not in these intervals.
                      Empirical Rule                  Chebyshev’s Theorem
Interval              Inside the     Outside the      Inside the     Outside the
                      Interval       Interval         Interval       Interval
$\bar{x} \pm 1s$
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$
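The inside/outside comparison can be tabulated with a few lines of arithmetic. The Empirical Rule values are approximate percentages, while the Chebyshev values are bounds: "at least" this much inside, and therefore "at most" the remainder outside.

```python
empirical_inside = {1: 68.0, 2: 95.0, 3: 99.7}

rows = []
for k in (1, 2, 3):
    cheb_inside = (1 - 1 / k**2) * 100  # "at least" this percent inside
    rows.append((k,
                 empirical_inside[k], 100 - empirical_inside[k],
                 cheb_inside, 100 - cheb_inside))

print("k   ER inside  ER outside  Cheb inside (>=)  Cheb outside (<=)")
for k, er_in, er_out, ch_in, ch_out in rows:
    print(f"{k}   {er_in:9.1f}  {er_out:10.1f}  {ch_in:16.1f}  {ch_out:17.1f}")
```

For k = 1 Chebyshev's Theorem gives a bound of 0% inside, so it says nothing useful; its statements become informative only for k greater than 1.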
1) z-score –
sample:
population:
Rules: |z-score| > 2
|z-score| > 3
Note: This measure for detecting outliers is best used for symmetric (or approximately symmetric)
distributions.
2) Boundaries
Lower Fence/Limit =
Upper Fence/Limit =
i.e.,
Note: This measure for detecting outliers is best used for distributions that are skewed or when
the shape is unknown.
Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189
Min:
Q1:
Median:
Q3:
Max:
i) z-score =
ii) Fences/Limits:
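A sketch of both outlier checks for the intern data follows. Two assumptions are made because the notes leave these slots blank: quartiles are located with the (n + 1)-based position rule (Python's "exclusive" method), and the fences are Q1 − 1.5 × IQR and Q3 + 1.5 × IQR, the common convention. A different quartile rule would shift the values slightly.

```python
import statistics

days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]

# Five-number summary (quartile positions based on n + 1; an assumed convention).
q1, median, q3 = statistics.quantiles(days, n=4, method="exclusive")
five_number = (min(days), q1, median, q3, max(days))
print("five-number summary:", five_number)

# i) z-score for the suspect observation 189.
x_bar = statistics.mean(days)
s = statistics.stdev(days)
z_189 = (189 - x_bar) / s
print(f"z-score for 189: {z_189:.2f}")

# ii) Fences based on the inter-quartile range (1.5 * IQR; an assumed convention).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers_by_fence = [d for d in days if d < lower_fence or d > upper_fence]
print(f"fences: ({lower_fence}, {upper_fence}), flagged: {outliers_by_fence}")
```

The two methods need not agree in strength: the z-score for 189 falls just short of 3, while the fence method flags 189 outright. Since this data set is skewed, the fence result is the more appropriate one according to the notes above.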
Box Plot
Box plots depend on measures of position known as the Five Number Summary.
vertical lines
dots/points
asterisks (*)
Illustration:
Example: Refer to the previous example. Construct a box plot for the data. Identify the shape of the
distribution.
[Box plot axis: number of days employed, 90 to 190 in steps of 10]
STAT 008 | Chapter 3 – Bivariate Data Page 28
Bivariate data –
Notation:
We would like to display quantitative bivariate data graphically to obtain a visual impression of the
information contained within the data; specifically, any relationships that may exist between the two
variables. We would also like to support our graphs with numerical measurements.
Scatterplot
vertical axis
Illustration:
Once a graph is constructed, we would like to describe the overall pattern of the data. Specifically, we
are interested in the following:
1)
2)
3)
4) Are there any observations that don’t ‘fit’ with the rest of the data (i.e., unlikely
observations/outliers)?
Form:
Direction:
Strength:
Example: A small business owner opening a new pet store believes there is a relationship between
the number of pets a customer owns and the amount of money the customer will spend at
the pet store. To determine if there is in fact a relationship, she collects data from 9
customers.
Number of Pets        1    3    2    5    4    2    3    0    7
Amount Spent ($)     20   45   28   70   62   35   53   15  100
d) Construct a scatterplot to determine if a relationship exists between the number of pets a customer
owns and the total amount the customer spends at the pet store.
e) Consider the scatterplot in part (d). How would you describe the relationship?
Just as we learned with single variable data, the interpretations drawn from graphs are subjective.
Therefore, we must also consider numerical descriptions for relationships.
Correlation Coefficient –
sample:
population:
Note: $\frac{S_{xy}}{n-1}$, where $S_{xy} = \sum (x - \bar{x})(y - \bar{y})$, is called the covariance. Its sign identifies the direction of the relationship.
Interpretation:
1)
2)
3)
4)
5)
6)
7)
Example: Refer to the pet store example. Calculate the correlation coefficient and interpret
appropriately. The data is re-printed below.
x y
1 20
3 45
2 28
5 70
4 62
2 35
3 53
0 15
7 100
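The calculation can be sketched as follows. The sample formula is left blank in these notes, so the code assumes the standard Pearson form r = S_xy / sqrt(S_xx · S_yy), where S_xy, S_xx, and S_yy are the sums of cross-products and squared deviations.

```python
import math

x = [1, 3, 2, 5, 4, 2, 3, 0, 7]            # number of pets
y = [20, 45, 28, 70, 62, 35, 53, 15, 100]  # amount spent ($)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sums of cross-products and squared deviations.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

covariance = s_xy / (n - 1)          # its sign gives the direction
r = s_xy / math.sqrt(s_xx * s_yy)    # correlation coefficient

print(f"S_xy = {s_xy:.1f}, S_xx = {s_xx:.1f}, S_yy = {s_yy:.3f}")
print(f"covariance = {covariance:.3f}, r = {r:.4f}")
```

The covariance is positive and r is close to 1, which indicates a strong, positive, linear relationship between the number of pets and the amount spent.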
Once it has been determined that a linear relationship exists between two variables, we would like to
‘describe’ that relationship mathematically.
Regression Line –
sample:
population:
Notes: 1)
2)
3)
This regression line is called the Least Squares Regression Line because it minimizes the sum of the
squares of the error between the observed value of $y$ and the predicted value, $\hat{y}$.
Illustration:
Notes: 1)
2)
a) Calculate the regression line for describing the relationship between the number of pets a
customer owns and the total amount spent at the pet store.
Recall: $n =$
$r =$             $\sum xy =$       $S_{xy} =$
$\sum x =$        $\sum x^2 =$      $S_{xx} =$
$\sum y =$        $\sum y^2 =$      $S_{yy} =$
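A sketch of part (a), assuming the usual least squares formulas, slope = S_xy / S_xx and intercept = ȳ − slope · x̄ (the notes leave the formula slots blank). The computational forms S_xy = Σxy − (Σx)(Σy)/n and S_xx = Σx² − (Σx)²/n are used so that the intermediate sums match the "Recall" list.

```python
x = [1, 3, 2, 5, 4, 2, 3, 0, 7]            # number of pets
y = [20, 45, 28, 70, 62, 35, 53, 15, 100]  # amount spent ($)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)

# Computational (shortcut) forms of the sums of squares.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x**2 / n

slope = s_xy / s_xx
intercept = sum_y / n - slope * sum_x / n

print(f"sum x = {sum_x}, sum y = {sum_y}, sum xy = {sum_xy}, sum x^2 = {sum_x2}")
print(f"y-hat = {intercept:.3f} + {slope:.3f} x")
```

Under these assumptions the slope is read as the estimated increase in amount spent for each additional pet, and the intercept as the estimated amount spent by a customer with no pets.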
How can we determine if the calculated regression line describes the observed data well?
Coefficient of Determination –
sample:
population:
Note:
Interpretation:
1)
2)
3)
4)
5)
Example: Refer to the pet store example. Calculate the coefficient of determination and interpret.
Recall: r
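The coefficient of determination for the pet store data can be sketched two ways: as the square of r, and as the proportion of the total variation in y that the line accounts for, 1 − SSE / S_yy. Both forms are standard for simple linear regression and are assumed here, since the notes leave the formula blank; the sketch shows that they agree.

```python
x = [1, 3, 2, 5, 4, 2, 3, 0, 7]            # number of pets
y = [20, 45, 28, 70, 62, 35, 53, 15, 100]  # amount spent ($)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

# Least squares line and its sum of squared errors (SSE).
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

r_squared = 1 - sse / s_yy                   # proportion of variation explained
r_squared_from_r = s_xy**2 / (s_xx * s_yy)   # square of the correlation

print(f"r^2 = {r_squared:.4f} (from SSE), {r_squared_from_r:.4f} (from r)")
```

A value this close to 1 says that nearly all of the variation in the amount spent is accounted for by the linear relationship with the number of pets.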
Regression Warnings:
1)
2)
Example: Suppose it has been determined that there is a positive relationship between the
monthly payment a person makes for their car and their mortgage. Does this mean
a higher car payment causes a person’s mortgage payment to increase?
Explanation:
3)
Illustration:
If a scatterplot displays a linear relationship between two variables, that relationship can be described by
the correlation coefficient and regression line. Further, the coefficient of determination can be used to
evaluate the fit of the regression line. If it is found that the regression line provides a ‘good’ description
of the relationship, we would like to utilize it.
Recall:
a) If a customer entering the pet store has three pets, predict the total amount he will spend.
b) Compare your predicted value in part (a) with the observed values for customers with three pets.
Why is there a difference?
c) If a customer entering the pet store has eleven pets (maybe 2 cats, 3 dogs, and 6 birds), predict
the total amount she will spend. Are there any concerns with your prediction?
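Parts (a) through (c) can be sketched with the least squares line for the pet store data (slope = S_xy / S_xx, intercept = ȳ − slope · x̄, assumed as the standard formulas). The helper names `predict` and `in_observed_range` are hypothetical; the second one flags predictions made outside the range of the observed x values.

```python
x = [1, 3, 2, 5, 4, 2, 3, 0, 7]            # number of pets
y = [20, 45, 28, 70, 62, 35, 53, 15, 100]  # amount spent ($)
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar


def predict(pets):
    """Predicted amount spent for a customer with the given number of pets."""
    return intercept + slope * pets


def in_observed_range(pets):
    """True if the prediction is an interpolation, False if an extrapolation."""
    return min(x) <= pets <= max(x)


observed_at_3 = [b for a, b in zip(x, y) if a == 3]

print(f"predicted for 3 pets:  {predict(3):.2f}  (observed: {observed_at_3})")
print(f"predicted for 11 pets: {predict(11):.2f}  "
      f"(within observed range: {in_observed_range(11)})")
```

The prediction for three pets falls between the two observed amounts because the line describes the average trend, not individual customers. Eleven pets lies beyond the largest observed value of 7, so that prediction is an extrapolation and there is no evidence the linear pattern continues that far.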