
STAT 008 CH 1-3 p.1-37 Lecture Notes

Chapter 1 introduces statistics as the discipline of using data to make conclusions, covering key concepts such as population, sample, parameter, and statistic. It explains types of variables (quantitative and qualitative) and scales of measurement (nominal, ordinal, interval, ratio), along with graphical representations for data. The chapter emphasizes the importance of visual data representation and lays the groundwork for further statistical analysis.

Uploaded by

Mohnish Jindal

STAT 008 | Chapter 1 – Introduction to Statistics Page 1

CHAPTER 1 – INTRODUCTION TO STATISTICS

What is Statistics?
Using information from collected data to make general statements or conclusions; it is the discipline that
encompasses data collection, evaluation, and decision-making while accounting for uncertainty
Two main branches:
1)

2)

Idea:

Terminology:

• Population – the entire group of interest; often too large to collect data for every member

e.g.

• Parameter – a fixed number describing the population; in practice, this value is unknown

e.g.

• Sample – a sub-group of the population which is used to collect data

e.g.

• Statistic – a number describing a particular sample; changes from sample to sample; used
to estimate the population parameter

e.g.

• Experimental Unit – the object from which information (i.e., data) is observed; also called elements

e.g.

• Variable – any characteristic of an experimental unit that is of interest for measuring, estimating,
or making statements about; think of this as a ‘description of the data’

e.g.
AFlores 10.23

• Distribution – the possible values a variable can take on and how often they occur

Shapes of Distributions
Symmetric

Skewed Right Skewed Left

Example: A U.S. company composed of several departments with 4000 employees maintains a
secure database containing employee information. Among other things, the database
contains the following information for each employee: contact phone number, social
security number, department, position held, salary, number of days employed, and
number of days absent (sick/vacation). The owner of this company is interested in the
average salary of its employees. To report a value, the Human Resource Department
randomly selects 100 employees from the database and calculates their average salary.

Identify the following: Population –

Parameter –

Sample –

Statistic –

Experimental Unit –

Variable –

How would you describe the distribution?


Example: A September 2023 Forbes Advisor article, Business Credit Cards: Statistics And Trends
In 2023, states “According to the data collected by Forbes Advisor, most new businesses
applying for business cards—88.61%—had at least three cards.”

Identify the following: Population –

Parameter –

Sample –

Statistic –

Experimental Unit –

Variable –

How would you describe the distribution?

Types of Variables

1) Quantitative – numerical data for which mathematical operations (adding, subtracting, averaging)
can be applied

e.g.,

i) Discrete –

e.g.,

ii) Continuous –

e.g.

2) Qualitative – data that is classified according to some characteristic/category; not naturally a
numerical value but can be coded as such; also known as Categorical

e.g.,


Scales of Measurement

Interval – quantitative data; data that is ordered and the interval between values is expressed in a fixed
unit of measurement; no true zero and values can be negative

e.g.,

Ratio – quantitative data; all properties of interval data; ratio of two values is meaningful; true zero
exists and values are non-negative

e.g.,

Nominal – qualitative data; can be coded numerically; calculations are meaningless; consider the
frequency of ‘categories’; no natural ordering of categories

e.g.,

Ordinal – qualitative data; can be coded numerically; calculations provide some insight; natural
ordering of categories exists

e.g.,

We would like to display data graphically to obtain a visual impression of the information contained
within the data. The type of graph we use will depend on the type of variable (qualitative or quantitative).

Graphs for Qualitative Variables
1) Bar Graph
2) Pie Chart

Graphs for Quantitative Variables
1) Dot Plot
2) Histogram
3) Stem & Leaf Plot
4) Box Plot

Graphs for Qualitative Variables


Since the data collected for qualitative variables consist of characteristics/categories, it is generally of
interest to summarize the frequency (i.e., count), relative frequency, or percentage for each of the
categories. This can be done in a table format and is known as a frequency distribution, relative
frequency distribution, or percentage frequency distribution.

Frequency –

Relative Frequency –

Percentage –

CHAPTER 2 – DESCRIBING DATA GRAPHICALLY

Bar Chart

Construction: horizontal axis →

vertical axis →

Example: Refer to the Human Resource Department Example. Consider the variable “department”
and suppose the following summary data is reported:

Department          Frequency   Relative Frequency   Percentage

Accounting                100
Customer Service          300
Executive                  25
Human Resource             75
Labor                    2000
Sales                    1500
Total                    4000

Construct a bar chart to display the information.

Note: 1)

2)
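
The blank columns of the department table can be filled in mechanically. A minimal Python sketch, assuming the usual definitions (relative frequency = frequency divided by the total count; percentage = relative frequency × 100), which the notes leave as blanks to complete; the variable names are illustrative only:

```python
# Department counts taken from the Human Resource Department example.
frequency = {
    "Accounting": 100,
    "Customer Service": 300,
    "Executive": 25,
    "Human Resource": 75,
    "Labor": 2000,
    "Sales": 1500,
}

n = sum(frequency.values())  # total number of employees

# Assumed definitions: relative frequency = f / n, percentage = 100 * f / n.
relative_frequency = {dept: f / n for dept, f in frequency.items()}
percentage = {dept: 100 * f / n for dept, f in frequency.items()}

for dept, f in frequency.items():
    print(f"{dept:<18}{f:>6}{relative_frequency[dept]:>10.5f}{percentage[dept]:>9.3f}%")
```

The heights of the bars in a bar chart can be taken from any of the three columns, since they differ only by a constant scale factor.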

Pie Chart

Construction: Divide a ‘pie’/circle into slices that are proportional to the relative
frequencies/percentages for each category.

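Note: The angle of each slice can be calculated as $\frac{f}{n} \times 360^\circ$, where $f$ is the frequency of the category and $n$ is the total number of observations.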

Example: Refer to the previous example. Construct a pie chart to display the information.

Department          Frequency   Relative Frequency   Angle

Accounting                100   0.025
Customer Service          300   0.075
Executive                  25   0.00625
Human Resource             75   0.01875
Labor                    2000   0.50
Sales                    1500   0.375
Total                    4000   1
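
The Angle column follows from the slice-angle note (frequency over total, times 360 degrees). A short sketch using the department counts from the table; the names used here are illustrative, not part of the notes:

```python
# Department counts from the pie chart example.
frequency = {
    "Accounting": 100,
    "Customer Service": 300,
    "Executive": 25,
    "Human Resource": 75,
    "Labor": 2000,
    "Sales": 1500,
}

n = sum(frequency.values())

# Angle of each slice in degrees: (f / n) * 360.
angles = {dept: f * 360 / n for dept, f in frequency.items()}

for dept, angle in angles.items():
    print(f"{dept:<18}{angle:>8.2f} degrees")

total_angle = sum(angles.values())  # should close the full circle
```

Checking that the angles add up to 360 degrees is a quick way to catch an arithmetic slip when the table is completed by hand.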

Note:


Examples:


Graphs for Quantitative Variables


The data for a quantitative variable takes on numerical values. Often, the observations do not share one
repeated value but instead take many values within a range. The following types of graphs allow us to
analyze that range, as well as the values within the data that occur more often than others.

Dot Plot

Use: Small to medium data sets.

Construction: horizontal axis →

Example: Refer to the Human Resource Department Example. Consider the variable “number of
days absent” and suppose the following data is reported for 20 employees:

0 2 6 4 3 4 3 5 12 0
1 5 7 3 2 2 1 6 3 4

Construct a dot plot to display the information.

Histogram

Use: Large data sets.

Idea:

Construction: horizontal axis →

vertical axis →

Steps: 1)

Note: Range = max – min


Note: Class = Interval of Values
Note: The number of classes selected will depend on the data set.
Note:

Note:

2)

3)

Note: Each observation should fall into one and only one class. This means we
must decide which endpoint of the interval to include for each class. The
right inclusion rule indicates that an observation on the ‘right’ (or upper
bound) of the interval should be included. Alternatively, the left inclusion
rule indicates the ‘left’ (or lower bound) of the interval should be included.
Observation Class 1 Interval Class 2 Interval

4) Construct the histogram similar to constructing a bar chart (where the classes are the
“categories”)

Note: There should not be any space between bars.

Note: Steps 2-3 can be summarized using a Frequency / Relative Frequency Distribution.
Class Interval Frequency Relative Frequency
1
2
3

k


Example: Refer to the Human Resource Department Example. Consider the variable “salary”. Since
there are 4000 employees and many different values for their salaries, it is best to display
the data by grouping the salaries into classes. Below is a partial list of the salaries:

$45,453 $48,564 $48,721 $52,012


$54,675 $85,768 $57,332 $38,289
$60,342 $60,000 $64,847 $65,000
$67,298 $37,887 $70,000 $32,948
$35,569 $35,685 $76,032 $96,028

Assume the minimum and maximum salary in the employee database is $30,235 and
$99,999, respectively. Complete the following relative frequency distribution using 7
classes and then construct a relative frequency histogram.

Class    Interval    Frequency    Relative Frequency

1                         1498    0.3745
2                          732    0.1830
3                          654    0.1635
4                          482    0.1205
5                          377    0.0943
6                          215    0.0537
7                           42    0.0105
Totals                    4000    1
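
The class-building steps and the inclusion rules can be sketched in Python. This is only an illustration under stated assumptions: it bins the 20 salaries printed in the partial list (not all 4000), and it assumes the raw class width (range divided by the number of classes) is rounded up to 10,000 with the first class starting at 30,000. The notes leave those choices blank, so other choices are equally valid.

```python
import math

# Partial list of salaries printed in the example (20 of the 4000 values).
salaries = [
    45453, 48564, 48721, 52012,
    54675, 85768, 57332, 38289,
    60342, 60000, 64847, 65000,
    67298, 37887, 70000, 32948,
    35569, 35685, 76032, 96028,
]

num_classes = 7
minimum, maximum = 30235, 99999                  # stated in the example
raw_width = (maximum - minimum) / num_classes    # range / number of classes

# Assumption: round the width up to 10,000 and start the first class at 30,000.
width, start = 10_000, 30_000


def class_counts(values, rule="left"):
    """Count observations per class under the left or right inclusion rule."""
    counts = [0] * num_classes
    for v in values:
        if rule == "left":                       # class is [lower, upper)
            index = (v - start) // width
        else:                                    # class is (lower, upper]
            index = math.ceil((v - start) / width) - 1
        counts[index] += 1
    return counts


left_counts = class_counts(salaries, "left")
right_counts = class_counts(salaries, "right")
print(round(raw_width, 2))
print("left inclusion: ", left_counts)
print("right inclusion:", right_counts)
```

Under these assumed classes, only the salaries $60,000 and $70,000 sit exactly on a class boundary, so they are the only observations that change class when the inclusion rule changes.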


Stem-and-Leaf Plot

Use: Small to medium data sets.

Construction: 1) Split each observation into a “stem” and a “leaf” where the last digit of the
observation is the “leaf” and all preceding digits form the “stem”.

Note: Some rounding may be required so that the list in Step 2 is appropriate.

2) List the stems in numerical order, in a vertical line, from the minimum value to
the maximum value. Do not skip any stem values.

3) Draw a vertical line to the right of the stems. This line separates the ‘stems’ and
‘leaves’.

4) List the leaves in numerical order, horizontally, at the appropriate stem. Repeat
leaf values for multiple observations of the same value.

Example: Refer to the Human Resource Department example. Consider the variable “number of
days employed” for 18 summer interns. The data is provided below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189
Construct a stem-and-leaf plot to display the information.
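
The four construction steps can be sketched as a few lines of Python. This is an illustrative sketch, not a required method; it assumes whole-number observations so that the last digit is the leaf and no rounding is needed.

```python
from collections import defaultdict

# "Number of days employed" for the 18 summer interns.
days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]

# Step 1: split each observation into a stem (leading digits) and a leaf (last digit).
leaves = defaultdict(list)
for value in sorted(days):
    stem, leaf = divmod(value, 10)
    leaves[stem].append(leaf)

# Steps 2-4: list every stem from min to max (no skipping), draw the
# separator, and write the leaves in increasing order.
lines = []
for stem in range(min(leaves), max(leaves) + 1):
    row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
    lines.append(f"{stem:>3} | {row}".rstrip())

print("\n".join(lines))
```

Stems 16 and 17 are printed with no leaves, which is what makes the gap before 189 visible in the plot.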

Box Plot
This graph is based on numerical measures of position. We will come back to this.


Once a graph is constructed, we would like to describe the overall pattern of the data. Specifically, we
are interested in the following:

1)

2)

3)

4) Are there any observations that don’t ‘fit’ with the rest of the data (i.e., unlikely
observations/outliers)? Are those observations affecting the shape of the distribution?

Example: Consider the dot plot for the “number of days absent” data

[Dot plot of the “number of days absent” data; horizontal axis from 0 to 12]

1) Shape?

2) Center?

3) Spread?

4) Outlier(s)?

Example: Consider the stem-and-leaf plot for “number of days employed” for the summer interns.

11 | 0 9
12 | 5 6 7 7 8 9
13 | 1 6 8 9
14 | 5 6 7
15 | 1 5
16 |
17 |
18 | 9

1) Shape?

2) Center?

3) Spread?

4) Outlier(s)?

The impression of the data displayed in the graphs is subjective. Therefore, we need supporting evidence
for the statements/conclusions we are drawing.

Next, we consider describing data numerically.



CHAPTER 3 – DESCRIBING DATA NUMERICALLY

Numerical Measures
• Measures of the Center
• Measures of Spread
• Measures of Position
• Measures for Detecting Outliers

Measures of the Center

1) Mean –

sample:          where $n$ = number of observations

$x_i$ = value of the $i$th observation

population:

Note: The mean is sensitive / resistant to outliers.

2) Median –

Note: 1) position of the median =

i.e.,

2) The median is sensitive / resistant to outliers.

3) Mode –

Note: The mode is sensitive / resistant to outliers.

These measures of the center can be used to determine the shape of the distribution:

• mean < median →

• mean = median →

• mean > median →

Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189

a) Calculate the mean, median, and mode for this data. How do these values compare with the
‘center’ interpreted from the stem and leaf plot?


b) Use the measurements from part (a) to describe the shape of the data. How does this compare
with the shape interpreted from the stem and leaf plot?

c) The stem & leaf plot indicated a possible outlier (i.e., 189). Remove this observation and repeat
part (a).

d) Use the measurements from part (c) to describe the shape of the data.

e) Compare the measures of center and the shape for the data set with and without the outlier.

n  18 n  17 Comments
Mean
Median
Mode
Shape
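
The comparison table can be checked with Python's standard `statistics` module. A minimal sketch, assuming the usual definitions of the mean (sum divided by n), the median (middle value of the sorted data), and the mode (most frequent value); the helper name `center` is illustrative.

```python
import statistics

# "Number of days employed" for the 18 summer interns.
days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]


def center(values):
    """Return the three measures of center for a list of observations."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


with_outlier = center(days)                                # n = 18
without_outlier = center([d for d in days if d != 189])    # n = 17

for label, result in (("n = 18", with_outlier), ("n = 17", without_outlier)):
    print(label, {name: round(value, 2) for name, value in result.items()})
```

Comparing the two rows shows how far each measure moves when the single observation 189 is dropped, which is the basis for labeling a measure as sensitive or resistant to outliers.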


Measures of Spread

1) Range –

Notation / Formula:

Note: The range is sensitive / resistant to outliers.

2) Variance –

sample:

population:

Notes: 1)

2)

3) The denominator of the variance, $(n - 1)$, is called the degrees of freedom (df).

4)

5) The variance is sensitive / resistant to outliers.

3) Standard Deviation –

sample: population:

Notes: 1)

2) The standard deviation is sensitive / resistant to outliers.



Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189

a) Calculate the range and standard deviation for this data.


b) Remove the possible outlier and repeat part (a).

c) Compare the measures of spread for the data set with and without the outlier.
n  18 n  17 Comments
Range
Variance
Standard Deviation
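
The spread measures in the table can be computed the same way. A hedged sketch: it assumes range = max − min (as in the histogram note) and the sample variance with the $(n - 1)$ denominator mentioned in the notes, which is what `statistics.variance` and `statistics.stdev` use.

```python
import statistics

# "Number of days employed" for the 18 summer interns.
days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]


def spread(values):
    """Range, sample variance (n - 1 denominator), and sample standard deviation."""
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
        "stdev": statistics.stdev(values),
    }


with_outlier = spread(days)                                # n = 18
without_outlier = spread([d for d in days if d != 189])    # n = 17

for label, result in (("n = 18", with_outlier), ("n = 17", without_outlier)):
    print(label, {name: round(value, 2) for name, value in result.items()})
```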


Using the Standard Deviation


1) Chebyshev’s Theorem
2) Empirical Rule

Chebyshev’s Theorem

For a distribution of any shape, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the observations will fall in $\mu \pm k\sigma$.

Interval             Lower Bound   Upper Bound   $k$   $1 - \frac{1}{k^2}$   $\left(1 - \frac{1}{k^2}\right) \times 100\%$
$\bar{x} \pm s$
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$
$\bar{x} \pm 3.5s$

Note: Chebyshev’s Theorem can be applied to a sample, i.e., $\bar{x} \pm ks$.

Note: Chebyshev’s Theorem can also be interpreted as follows:

For a distribution of any shape, at most $\frac{1}{k^2} \times 100\%$ of the observations will not fall in $\bar{x} \pm ks$.
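
The two forms of Chebyshev's Theorem are complements of each other, which a short calculation makes concrete. A minimal sketch for the values of k used in the table; the function names are illustrative.

```python
def chebyshev_inside(k):
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1 - 1 / k**2)


def chebyshev_outside(k):
    """Maximum proportion of observations farther than k standard deviations."""
    return 1 - chebyshev_inside(k)


for k in (1, 2, 3, 3.5):
    inside = chebyshev_inside(k) * 100
    outside = chebyshev_outside(k) * 100
    print(f"k = {k}: at least {inside:.2f}% inside, at most {outside:.2f}% outside")
```

For k = 1 the guaranteed proportion is 0, so the theorem only becomes informative for k greater than 1.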

Illustration #1:

Illustration #2:


Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189

Recall: $\bar{x} =$            $s =$

a) Complete the following table:

Interval             Lower Bound   Upper Bound   $f$   $\frac{f}{n} \times 100\%$
$\bar{x} \pm s$
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$

b) Compare the percentage of observations within the specified intervals for our data (summarized
in the previous table) with Chebyshev’s Theorem.

c) Use Chebyshev’s Theorem to make a statement about the percentage of observations within 1.5
standard deviations of the mean.
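
Parts (a) through (c) can be checked numerically. A sketch under the assumption that the sample mean and sample standard deviation (n − 1 denominator) are used for $\bar{x}$ and $s$, and that observations exactly on a bound count as inside the interval.

```python
import statistics

# "Number of days employed" for the 18 summer interns.
days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]

x_bar = statistics.mean(days)
s = statistics.stdev(days)
n = len(days)

counts = {}
for k in (1, 2, 3):
    lower, upper = x_bar - k * s, x_bar + k * s
    f = sum(lower <= d <= upper for d in days)   # observations inside the interval
    counts[k] = f
    guaranteed = (1 - 1 / k**2) * 100            # Chebyshev lower bound
    print(f"k={k}: [{lower:.2f}, {upper:.2f}]  f={f}  "
          f"observed={f / n * 100:.1f}%  Chebyshev: at least {guaranteed:.1f}%")

# Part (c): the bound for k = 1.5.
chebyshev_1_5 = (1 - 1 / 1.5**2) * 100
print(f"k=1.5: at least {chebyshev_1_5:.1f}%")
```

The observed percentages should never fall below the Chebyshev bound, because the theorem gives a minimum that holds for any shape of distribution.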


Empirical Rule
For a symmetric, mound-shaped distribution, the following statements are true:
• Approximately ____ of the observations will fall in the interval $\mu \pm \sigma$.
• Approximately ____ of the observations will fall in the interval $\mu \pm 2\sigma$.
• Approximately ____ of the observations will fall in the interval $\mu \pm 3\sigma$.

Note: The Empirical Rule can be applied to a sample, i.e., $\bar{x} \pm ks$.

Illustration #1:

Illustration #2: 2.5% | 47.5% | 47.5% | 2.5%

Illustration #3: 0.15% | 49.85% | 49.85% | 0.15%

Alternatively,


Example: Refer to the Human Resource Department Example. Consider the variable “number of
days employed”. Suppose a histogram indicates that this variable is approximately
symmetric with a mean of 3010 days (about 8.25 years) and a standard deviation of 915
days (about 2.5 years). Find the proportion of employees who…

a) have worked at the company less than 3010 days?

b) have worked at the company between 1180 days and 4840 days?

c) have worked at the company less than 1180 days?

d) have worked at the company between 2095 days and 5755 days?

e) What are the lower and upper bounds for the number of days worked for the middle 99.7% of all
employees?
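
Each bound in parts (a) through (e) is a whole number of standard deviations from the mean, so the Empirical Rule percentages (68%, 95%, 99.7%, as restated later in these notes) can be combined directly. A minimal sketch; the helper names are illustrative and the results are approximations, as the rule itself is.

```python
mean, sd = 3010, 915

# Empirical Rule: approximate proportion within k standard deviations of the mean.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}


def half_area(k):
    """Approximate proportion between the mean and k standard deviations on one side."""
    return 0.0 if k == 0 else WITHIN[abs(k)] / 2


def sd_units(days):
    """Number of standard deviations between `days` and the mean (whole numbers here)."""
    return round((days - mean) / sd)


part_a = 0.5                                                    # below the mean, by symmetry
part_b = half_area(sd_units(1180)) + half_area(sd_units(4840))  # mean - 2s to mean + 2s
part_c = 0.5 - half_area(sd_units(1180))                        # below mean - 2s
part_d = half_area(sd_units(2095)) + half_area(sd_units(5755))  # mean - 1s to mean + 3s
part_e = (mean - 3 * sd, mean + 3 * sd)                         # middle 99.7%

print(part_a, round(part_b, 4), round(part_c, 4), round(part_d, 4), part_e)
```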


Measures of Position

1) $p$th percentile –

Notation/Formula for the percentile of a particular value:

Formula for the position of the $p$th percentile:

2) Quartiles –

Illustration:

• Lower Quartile = 25th percentile –

Notation: Position:

• Median = 50th percentile

Notation: m Position: $0.50\,(n + 1)$

• Upper Quartile = 75th percentile –

Notation: Position:
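
The median position $0.50\,(n + 1)$ suggests a general pattern. The sketch below assumes, as one common convention and not necessarily the one intended for the blanks above, that the $p$th percentile sits at position $\frac{p}{100}(n + 1)$ in the sorted data, with linear interpolation when the position is not a whole number. The function name is hypothetical.

```python
def percentile_value(sorted_values, p):
    """Value at the p-th percentile, assuming position = (p / 100) * (n + 1)."""
    n = len(sorted_values)
    position = p / 100 * (n + 1)            # 1-based position, may be fractional
    position = min(max(position, 1), n)     # keep the position inside the data
    lower = int(position)                   # whole part of the position
    fraction = position - lower             # fractional part, used to interpolate
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)


# "Number of days employed" for the 18 summer interns, already sorted.
days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]

q1 = percentile_value(days, 25)
median = percentile_value(days, 50)
q3 = percentile_value(days, 75)
print(q1, median, q3)
```

Other textbooks and software packages use slightly different position rules, so quartiles computed elsewhere may differ a little from these.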

Measures of Spread (additional)

1) Range

2) Variance

3) Standard Deviation

4) Inter-Quartile Range –

Notation/Formula:

Notes: 1) If the IQR is ‘small’ (relative to the overall Range), then the spread tends to be small
(since 50% of the observations are close to the center of the data within the range of the
IQR).

e.g.

2) If the IQR is ‘large’ (relative to the overall Range), then the data tends to be widely
dispersed (since 50% of the observations are widely dispersed).

e.g.

Measures for Detecting Outliers

Recall: Empirical Rule and Chebyshev’s Theorem

Empirical Rule
For a symmetric, mound-shaped distribution, the following statements are true:
• Approximately 68% of the observations will fall in the interval $\bar{x} \pm 1s$
• Approximately 95% of the observations will fall in the interval $\bar{x} \pm 2s$
• Approximately 99.7% of the observations will fall in the interval $\bar{x} \pm 3s$

Chebyshev’s Theorem

For a distribution of any shape, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the observations will fall in $\bar{x} \pm ks$.
Consider the proportion/percentage of observations not in these intervals.
                     Empirical Rule                 Chebyshev’s Theorem
Interval             Inside of      Outside of      Inside of      Outside of
                     the Interval   the Interval    the Interval   the Interval
$\bar{x} \pm 1s$
$\bar{x} \pm 2s$
$\bar{x} \pm 3s$

1) z-score –

sample:

population:

Rules: $|z\text{-score}| > 2$

$|z\text{-score}| > 3$

Note: This measure for detecting outliers is best used for symmetric (or approximately symmetric)
distributions.

2) Boundaries

Lower Fence/Limit =

Upper Fence/Limit =

Rule: Observations falling outside the fences/limits are considered outliers.

i.e.,

Note: This measure for detecting outliers is best used for distributions that are skewed or when
the shape is unknown.

Example: Refer to the Human Resource Department example where the variable of interest was
“number of days employed” for 18 summer interns. The data is re-printed below:
110 119 125 126 127 127
128 129 131 136 138 139
145 146 147 151 155 189

a) Calculate the following summary for this data.

Min:

Q1:

Median:

Q3:

Max:

b) Is 189 an outlier? Use both approaches.

i) z-score =

ii) Fences/Limits:
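
Both approaches can be sketched with the standard library. The formulas are assumptions here, since the notes leave them blank: the z-score is taken as $(x - \bar{x}) / s$, and the fences use the widely used 1.5 × IQR rule (lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR). Quartiles come from `statistics.quantiles` with `method="exclusive"`, which places them by a (n + 1)-based position rule.

```python
import statistics

# "Number of days employed" for the 18 summer interns.
days = [110, 119, 125, 126, 127, 127,
        128, 129, 131, 136, 138, 139,
        145, 146, 147, 151, 155, 189]

# Five-number summary.
q1, median, q3 = statistics.quantiles(days, n=4, method="exclusive")
five_number_summary = (min(days), q1, median, q3, max(days))

# Approach i) z-score of the suspect observation (assumed formula: (x - mean) / s).
suspect = 189
z_score = (suspect - statistics.mean(days)) / statistics.stdev(days)

# Approach ii) fences (assumed rule: 1.5 * IQR beyond the quartiles).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outside_fences = suspect < lower_fence or suspect > upper_fence

print("five-number summary:", five_number_summary)
print(f"z-score of {suspect}: {z_score:.2f}")
print(f"fences: [{lower_fence}, {upper_fence}]  outside: {outside_fences}")
```

The two approaches need not agree exactly, because the z-score depends on the mean and standard deviation (both affected by the suspect value), while the fences depend only on the quartiles.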


Box Plot
Box plots depend on measures of position known as the Five Number Summary.

• Five Number Summary:

Construction: horizontal axis →

vertical lines →

dots/points →

horizontal lines (“whiskers”) →

dashed vertical lines →

asterisks (*) →

Illustration:

Example: Refer to the previous example. Construct a box plot for the data. Identify the shape of the
distribution.

[Axis for the box plot: 90 to 190 in steps of 10]


CHAPTER 3 – BIVARIATE DATA

Bivariate data –

Notation:

Terminology: $x$ = explanatory / independent / predictor variable;

$y$ = response / dependent variable;

We would like to display quantitative bivariate data graphically to obtain a visual impression of the
information contained within the data; specifically, any relationships that may exist between the two
variables. We would also like to support our graphs with numerical measurements.

Scatterplot

Use: Determining relationships between two variables.

Construction: horizontal axis →

vertical axis →

Illustration:


Once a graph is constructed, we would like to describe the overall pattern of the data. Specifically, we
are interested in the following:

1)

2)

3)

4) Are there any observations that don’t ‘fit’ with the rest of the data (i.e., unlikely
observations/outliers)?

Form:

Direction:

Strength:


Example: A small business owner opening a new pet store believes there is a relationship between
the number of pets a customer owns and the amount of money the customer will spend at
the pet store. To determine if there is in fact a relationship, she collects data from 9
customers.
Number of Pets       1    3    2    5    4    2    3    0    7
Amount Spent ($)    20   45   28   70   62   35   53   15  100

a) Do you expect a relationship to exist? If so, what type of relationship?

b) Identify the explanatory variable: _______________________________________________

c) Identify the response variable: _______________________________________________

d) Construct a scatterplot to determine if a relationship exists between the number of pets a customer
owns and the total amount the customer spends at the pet store.

e) Consider the scatterplot in part (d). How would you describe the relationship?


Just as we learned with single variable data, the interpretations drawn from graphs are subjective.
Therefore, we must also consider numerical descriptions for relationships.

Numerical Measures for Bivariate Data

• Correlation Coefficient
• Regression Line
• Coefficient of Determination
• Predicted/Fitted Values

Correlation Coefficient –

sample:

population:

Note: $\frac{S_{xy}}{n-1}$ is called the covariance. Its sign ($+$ or $-$) identifies the direction of the relationship.

Interpretation:

1)

2)

3)

4)

5)

6)

7)


Example: Refer to the pet store example. Calculate the correlation coefficient and interpret
appropriately. The data is re-printed below.

x y

1 20
3 45
2 28
5 70
4 62
2 35
3 53
0 15
7 100
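
The calculation can be sketched in Python. The notes leave the formula blank, so this assumes the standard sample correlation coefficient $r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, with the sums of squares computed by the usual shortcut formulas; the variable names are illustrative.

```python
from math import sqrt

# Pet store data: x = number of pets, y = amount spent ($).
pets = [1, 3, 2, 5, 4, 2, 3, 0, 7]
spent = [20, 45, 28, 70, 62, 35, 53, 15, 100]

n = len(pets)
sum_x, sum_y = sum(pets), sum(spent)

# Shortcut formulas for the sums of squares and cross-products.
s_xy = sum(x * y for x, y in zip(pets, spent)) - sum_x * sum_y / n
s_xx = sum(x * x for x in pets) - sum_x**2 / n
s_yy = sum(y * y for y in spent) - sum_y**2 / n

covariance = s_xy / (n - 1)          # its sign gives the direction of the relationship
r = s_xy / sqrt(s_xx * s_yy)         # assumed formula for the sample correlation

print(f"S_xy = {s_xy:.2f}, S_xx = {s_xx:.2f}, S_yy = {s_yy:.2f}")
print(f"covariance = {covariance:.2f}, r = {r:.4f}")
```

Because $S_{xx}$ and $S_{yy}$ are never negative, $r$ always has the same sign as the covariance.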


Once it has been determined that a linear relationship exists between two variables, we would like to
‘describe’ that relationship mathematically.

Regression Line –

sample:

population:

Notes: 1)

2)

3)

This regression line is called the Least Squares Regression Line because it minimizes the sum of the
squares of the error between the observed value of $y$ and the predicted value, $\hat{y}$.

Illustration:


Notes: 1)

2)

Example: Refer to the pet store example.

a) Calculate the regression line for describing the relationship between the number of pets a
customer owns and the total amount spent at the pet store.

Recall: $n =$            $r =$            $\sum xy =$            $S_{xy} =$

$\sum x =$            $\sum x^2 =$            $S_{xx} =$

$\sum y =$            $\sum y^2 =$            $S_{yy} =$

b) Interpret the slope as it pertains to this problem.

c) Interpret the intercept as it pertains to this problem.
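
A sketch of the least squares calculation for part (a). The formulas are assumed, since the notes leave them blank: slope $= \frac{S_{xy}}{S_{xx}}$ and intercept $= \bar{y} - \text{slope} \cdot \bar{x}$, which are the standard least squares estimates. The names `slope` and `intercept` are illustrative.

```python
# Pet store data: x = number of pets, y = amount spent ($).
pets = [1, 3, 2, 5, 4, 2, 3, 0, 7]
spent = [20, 45, 28, 70, 62, 35, 53, 15, 100]

n = len(pets)
x_bar = sum(pets) / n
y_bar = sum(spent) / n

# Sums of squares and cross-products about the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pets, spent))
s_xx = sum((x - x_bar) ** 2 for x in pets)

slope = s_xy / s_xx                   # estimated change in spending per additional pet
intercept = y_bar - slope * x_bar     # estimated spending for a customer with zero pets

print(f"y-hat = {intercept:.2f} + {slope:.2f} x")
```

The data include a customer with zero pets, so the intercept can be read within the range of the observed x values in this example.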


How can we determine if the calculated regression line describes the observed data well?

Coefficient of Determination –

sample:

population:

Note:

Interpretation:

1)

2)

3)

4)

5)

Example: Refer to the pet store example. Calculate the coefficient of determination and interpret.

Recall: $r =$
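
One way to see what the coefficient of determination measures is to compute it from the residuals and compare it with $r^2$. This sketch assumes the standard definition $r^2 = 1 - \frac{SSE}{S_{yy}}$, where SSE is the sum of squared differences between the observed and predicted values; the notes leave the formula blank, so treat this as an illustration.

```python
# Pet store data: x = number of pets, y = amount spent ($).
pets = [1, 3, 2, 5, 4, 2, 3, 0, 7]
spent = [20, 45, 28, 70, 62, 35, 53, 15, 100]

n = len(pets)
x_bar = sum(pets) / n
y_bar = sum(spent) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(pets, spent))
s_xx = sum((x - x_bar) ** 2 for x in pets)
s_yy = sum((y - y_bar) ** 2 for y in spent)

# Least squares line (assumed standard formulas).
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Sum of squared errors between observed y and predicted y-hat.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(pets, spent))

r_squared = 1 - sse / s_yy            # proportion of variation in y explained by the line
r = s_xy / (s_xx * s_yy) ** 0.5       # correlation coefficient, for comparison

print(f"SSE = {sse:.2f}, r^2 = {r_squared:.4f}, r squared directly = {r**2:.4f}")
```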


Regression Warnings:

1)

2)

Example: Suppose it has been determined that there is a positive relationship between the
monthly payment a person makes for their car and their mortgage. Does this mean
a higher car payment causes a person’s mortgage payment to increase?

Explanation:

3)

Illustration:


If a scatterplot displays a linear relationship between two variables, that relationship can be described by
the correlation coefficient and regression line. Further, the coefficient of determination can be used to
evaluate the fit of the regression line. If it is found that the regression line provides a ‘good’ description
of the relationship, we would like to utilize it.

Example: Refer to the pet store example.

Recall:

a) If a customer entering the pet store has three pets, predict the total amount he will spend.

b) Compare your predicted value in part (a) with the observed values for customers with three pets.
Why is there a difference?

c) If a customer entering the pet store has eleven pets (maybe 2 cats, 3 dogs, and 6 birds), predict
the total amount she will spend. Are there any concerns with your prediction?
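
The predictions in parts (a) and (c) can be sketched as follows, again assuming the standard least squares formulas for the slope and intercept. The helper names `predict` and `is_extrapolation` are hypothetical.

```python
# Pet store data: x = number of pets, y = amount spent ($).
pets = [1, 3, 2, 5, 4, 2, 3, 0, 7]
spent = [20, 45, 28, 70, 62, 35, 53, 15, 100]

n = len(pets)
x_bar = sum(pets) / n
y_bar = sum(spent) / n
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(pets, spent))
         / sum((x - x_bar) ** 2 for x in pets))
intercept = y_bar - slope * x_bar


def predict(num_pets):
    """Predicted amount spent for a customer with `num_pets` pets."""
    return intercept + slope * num_pets


def is_extrapolation(num_pets):
    """True when the x value lies outside the range of the observed data."""
    return num_pets < min(pets) or num_pets > max(pets)


observed_for_three = [y for x, y in zip(pets, spent) if x == 3]

print(f"3 pets:  predicted {predict(3):.2f}, observed {observed_for_three}")
print(f"11 pets: predicted {predict(11):.2f}, extrapolation: {is_extrapolation(11)}")
```

The line gives a single predicted value per x, while individual customers with the same number of pets vary around it; and eleven pets lies beyond the largest observed value of seven, so that prediction relies on the linear pattern continuing outside the data.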
