STAT 614
Descriptive Statistics
Carly Metcalfe
Fall 2020
| 2
Topics
Sample versus Population
Types of Data
Numerical Summaries
Visual Summaries
Sample Versus Population
| 4
Population
A collection of all objects of interest
• All full-time students at RIT
• All tax-exempt property in Wisconsin
• All patients in a Corning area hospital at a certain time
• All people living in the US today
| 5
A Sample
A sample is a subset of the population
| 6
What makes a good Sample?
Random Sample!
A good sample is like the population, only smaller
The sample also needs to be large enough so that we
can draw reliable conclusions
At the same time, we do not want the sample to be too
large, because collecting data costs money
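A simple random sample can be sketched in a few lines of Python. The population of 1,000 student ID numbers, the sample size of 25, and the seed are all assumptions made for illustration; they do not come from the course data.

```python
import random

# Hypothetical population: ID numbers for 1,000 full-time students.
population = list(range(1, 1001))

# A fixed seed (arbitrary choice) makes the draw reproducible.
rng = random.Random(614)

# Simple random sample without replacement: every student has the
# same chance of being selected, and nobody is selected twice.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

Each element of `sample` plays the role of a sampling unit on which measurements would then be taken.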
| 7
Sampling Unit
The objects (or people) on which the measurements are
performed are called sampling units
Types of Data
| 9
Ways to collect data
Three basic methods of collecting data:
• A retrospective study using historical data
Data collected in the past for other purposes.
• An observational study
Data collected in the present by a passive observer.
• A designed experiment
Data collected in response to process input changes.
| 10
Types of Data
Categorical (Qualitative or Attribute)
• Ordinal: Values that can be ordered or ranked
Examples
> Shirt size (small, medium, large)
> Quality of product shipped (Grades A, B, C)
> Year in College (Freshman, Sophomore, Junior, Senior)
• Nominal: Categories that need not have an ordering
Examples
> In or out of specification (Yes, No)
> Make of car preferred (Toyota, Honda, etc.)
> Vehicle type (pick up truck, van, 2-door, 4-door)
| 11
Types of Data
Numeric (Quantitative)
• Discrete: Can only take on whole numbers
Examples
> Number of empty beds in a hospital at a certain time
> Number of misread scripts in a pharmacy over the course of a day
> Number of arrivals to a grocery store within an hour
• Continuous: Can, at least in theory, take on all values in a range
Examples
> Body temperature, in °F
> Weight, in lbs
> Commute time, in minutes
Numerical Summaries
| 13
Muffin Example
Blueberry Muffins
• Each package contains several muffins
• Target package weight is 427 grams
10 packages are selected at random
• Each muffin in the package is weighed
What do you do first?
| 14
In JMP
Explore this data set: first tab in Summarizing
Data.xlsx
• Number of muffins in each package?
• Distribution of individual muffin weight?
• Total weight of each package?
• Distribution of package weight?
| 15
Qualitative Numerical Summaries
Nominal or Ordinal data can be counted, tallied, and the
percentage (proportion) of each group can be found
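As a rough illustration of counting and tallying, the sketch below builds a count and percentage table. The ten shirt sizes are hypothetical values invented for this example.

```python
from collections import Counter

# Hypothetical shirt sizes (ordinal data) for ten customers.
sizes = ["small", "medium", "large", "medium", "small",
         "medium", "large", "medium", "small", "medium"]

counts = Counter(sizes)   # tally of each category
n = len(sizes)
proportions = {level: count / n for level, count in counts.items()}

# Print the levels in their natural order, since the data are ordinal.
for level in ["small", "medium", "large"]:
    print(f"{level:<6} count={counts[level]}  percent={100 * proportions[level]:.0f}%")
```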
| 16
Quantitative Numerical Summaries
Quantitative data summaries include measures of central tendency and variability
Measures of Location (center)
• Mean
• Median
• Mode
Measures of Variability (spread)
• Range
• Variance
• Standard Deviation
| 17
Measures of Central Tendency
Mean
• Average: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median
Middle observation or average of the middle two
Data: 16 7 8 12 7
> Sorted: 7 7 8 12 16, so Median = 8
Data: 12 15 10 10 9 16
> Sorted: 9 10 10 12 15 16, so Median = (10 + 12)/2 = 11
Mode
The most frequent observation
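The two small data sets above can be checked with Python's standard `statistics` module. This is only a sketch for verifying the hand calculations; the course itself uses JMP.

```python
import statistics

data_odd = [16, 7, 8, 12, 7]          # five observations
data_even = [12, 15, 10, 10, 9, 16]   # six observations

mean_odd = statistics.mean(data_odd)        # (16 + 7 + 8 + 12 + 7) / 5 = 10
median_odd = statistics.median(data_odd)    # middle of 7 7 8 12 16 -> 8
median_even = statistics.median(data_even)  # average of 10 and 12 -> 11
mode_odd = statistics.mode(data_odd)        # 7 appears twice
mode_even = statistics.mode(data_even)      # 10 appears twice

print(mean_odd, median_odd, median_even, mode_odd, mode_even)
```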
| 18
Mean versus Median
| 19
Measures of the Spread
Range
• Data: 8 7 8 3 7 9 5 6 8 8 5
• Max = 9, Min = 3, Range = Max – Min = 9 – 3 = 6
Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Standard Deviation: $s = \sqrt{s^2}$
• Square root of the variance. The advantage of the sample standard deviation
is that it is expressed in the original units that are measured (as opposed to
the variance, which is in square units).
| 20
Measures of the Spread
Standard Deviation
• Data: 8 6 1 5 10
Obs.   $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
1       8       6            2                   4
2       6       6            0                   0
3       1       6           -5                  25
4       5       6           -1                   1
5      10       6            4                  16
       Sum = 30                                 Sum = 46

mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{30}{5} = 6$
variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{46}{4} = 11.5$
standard deviation: $s = \sqrt{11.5} = 3.39$
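The worked table can be reproduced step by step in Python. The sketch below follows the same columns (deviation, squared deviation) and then compares the result with `statistics.variance`, which also divides by n - 1.

```python
import math
import statistics

data = [8, 6, 1, 5, 10]
n = len(data)

mean = sum(data) / n                                 # 30 / 5 = 6
deviations = [x - mean for x in data]                # 2, 0, -5, -1, 4
squared_deviations = [d ** 2 for d in deviations]    # 4, 0, 25, 1, 16

variance = sum(squared_deviations) / (n - 1)         # 46 / 4 = 11.5
std_dev = math.sqrt(variance)                        # about 3.39

# Cross-check against the standard library's sample variance.
library_variance = statistics.variance(data)

print(mean, variance, round(std_dev, 2), library_variance)
```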
| 21
Measures of Spread
Quartiles and Interquartile Range
After the observations are sorted in ascending order, imagine the sorted data split into
four equal parts. The value that splits the sorted data exactly in two is the median, which
is sometimes called Q2, the second quartile. Then the value that splits the lower half in two
is Q1, the first quartile, and the value that splits the upper half in two is Q3, the third
quartile.
Therefore, 25% of the data is below Q1, and 25% of the data is above Q3, with 50%
between Q1 and Q3.
The difference between Q3 and Q1 is called the interquartile range (IQR), which is
another measure of variability in data.
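The splitting rule described above can be sketched as a short function. One assumption is made here: when the number of observations is odd, the median itself is left out of both halves. Statistical software, including JMP, may use a different interpolation rule and report slightly different quartiles. The helper name `quartiles` is hypothetical.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) by taking the median of each half of the sorted data.

    Assumption: for an odd number of observations the middle value is
    excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # lower half of the sorted data
    upper = ordered[half + n % 2:]   # upper half (skips the middle value if n is odd)
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [8, 7, 8, 3, 7, 9, 5, 6, 8, 8, 5]   # data from the range example
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```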
| 22
Percentiles
Percentiles split the sorted data into 100 equal parts, with 1% in
each section. For example, to be in the 99th percentile on a test
would mean that 99% of the sample had lower scores than you,
and 1% of the sample had higher scores than you.
Ex 1: A score of 640 on the Math SAT is at the 85th percentile if 85%
of the test takers scored below 640.
Ex 2: Since the median splits the data in two, it is considered the
50th percentile.
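The "percent scoring below" idea in Ex 1 can be sketched as follows. The 20 scores are hypothetical and were chosen so that 17 of them fall below 640; the helper name `percent_below` is also an assumption.

```python
def percent_below(scores, value):
    """Percentage of scores strictly below the given value."""
    return 100 * sum(s < value for s in scores) / len(scores)

# Hypothetical Math SAT scores for 20 test takers.
scores = [400, 420, 450, 470, 480, 500, 510, 520, 530, 540,
          550, 560, 570, 580, 590, 600, 620, 640, 700, 760]

rank_640 = percent_below(scores, 640)   # 17 of 20 scores are below 640

print(rank_640)
```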
| 23
Numerical Summary Strategy
Certain summary measures are sometimes used together
If the dataset contains outliers or is highly skewed
• Use median as measure of central tendency
• Use interquartile range (IQR) as measure of variability
Otherwise
• Use mean and standard deviation
Use Distribution platform in JMP
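A small sketch shows why the median is preferred when outliers are present. It reuses the five values from the standard deviation example and appends one hypothetical extreme value of 100.

```python
import statistics

typical = [8, 6, 1, 5, 10]
with_outlier = typical + [100]   # 100 is a hypothetical extreme value

mean_typical = statistics.mean(typical)            # 30 / 5 = 6
median_typical = statistics.median(typical)        # 6

mean_outlier = statistics.mean(with_outlier)       # 130 / 6, about 21.7
median_outlier = statistics.median(with_outlier)   # average of 6 and 8 = 7

print(mean_typical, median_typical)
print(round(mean_outlier, 1), median_outlier)
```

A single extreme value more than triples the mean while the median moves by only one unit.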
Visual Summaries
| 25
Qualitative Graphical Summaries
Bar and Pie Charts
Bar Charts and Pie Charts can be used to graph the same
information as a count or percentage table.
Use JMP Graph Builder to create
| 26
Bar Chart
Shows graphically how many muffins are in each
package.
Allows easy identification
of differences between
packages
| 27
Pie Chart
Shows graphically how many muffins are in each
package:
| 28
Taxman Example
You prepare taxes for individuals and businesses
You collected tax preparation information for 2017
• Type: Individual, Business
• Extension needed: Extension, No Extension
• Fee collected per file ($)
• Days to complete
Let’s take a look
| 29
Taxman Example
For this example, let’s put together a graphical and
numerical summary for whether or not an extension is
needed and draw conclusions
We can take it a step further and instead summarize the
need for extension by type
| 30
Quantitative Graphical Summaries
Univariate Data
• Series (Run) charts
If time order is relevant
• Histograms
• Box plots
Bivariate Data
• Scatter plots
| 31
Series Plot
For data that have been ordered in some meaningful way, for
instance according to the time of collection, beginning with
the first and ending with the last
A time series plot looks at a single variable over time.
However, you can overlay several time series plots on one
graph
Patterns can occur over time, such as trends (up or down),
cycles, and alternating values
A type of time series plot is a run chart
| 32
Series Plot
In JMP: Analyze -> Specialized Modeling -> Time Series
Shows the data plotted in the order it occurs in the
worksheet
| 33
Example time series plots: Raleigh Temperature; CO2 at Mauna Loa Observatory
| 34
Static Data Plots
Used to answer the questions:
1. What is the spread?
2. What is the distribution?
3. Where is the center?
| 35
Static Data Plots
To accomplish this, we can use the following plots:
1. Dotplot – Based on the raw data
2. Histogram – Based on grouped data
3. Boxplot – Based on summarized data
| 36
Dotplot
Plot used to view the raw data
A symbol (dot) is used to represent each data point
No grouping, e.g. to nearest unit
Useful if:
• Data is continuous
• Small data sets
| 37
Dotplot
Use Graph Builder in JMP
A dotplot shows the spread and shape of the individual data points, in this
case by package
| 38
Taxman Example
Create a dotplot for one of the continuous variables in
the taxman example and draw conclusions
| 39
Histogram
Plot used to view data grouped in cells or intervals, usually
all of the same width
Bars are used to assess the frequency of occurrence in each
cell based on bar height
Useful if:
• Data is discrete or continuous
• Datasets are large or small
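The grouping step behind a histogram can be sketched without any plotting library. The twelve muffin weights, the starting value of 68 grams, and the cell width of 2 grams are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical individual muffin weights, in grams.
weights = [68.2, 70.1, 71.5, 69.8, 72.3, 70.9,
           71.1, 73.4, 69.2, 70.5, 71.8, 74.0]

start = 68       # left edge of the first cell (assumed)
bin_width = 2    # every cell has the same width (assumed)

# Assign each weight to a cell index, then count the weights per cell.
counts = Counter(int((w - start) // bin_width) for w in weights)

# Text histogram: bar length stands in for bar height.
for k in sorted(counts):
    left = start + k * bin_width
    print(f"[{left}, {left + bin_width})  {'#' * counts[k]}")
```

Changing `bin_width` changes the apparent shape, so it is worth trying more than one width before drawing conclusions.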
| 40
Histogram
Answers the questions
• What is the spread?
• What is the distribution?
• Where is the center?
Sorts data into bins (cells) by numerical values
Figure annotations: Center, Distribution, Spread
| 41
Shapes of Histograms
Common shapes: symmetric, positively skewed, negatively skewed
Also bimodal, multimodal, uniform
Histograms can tell you a lot, including whether outliers are present
| 42
Histogram
Use Graph Builder in JMP
| 43
Taxman Example
Create a histogram for one of the continuous variables in
the taxman example and draw conclusions
| 44
Boxplot
Created based on a summarization of the data
Emphasizes percentiles (quartiles)
Useful if:
• Data is continuous
• Data sets are large or small
| 45
Boxplot
A boxplot uses the five-number summary: minimum, Q1, Q2 (median), Q3,
maximum
If there are outliers, the line extends to the largest/smallest value that
is not an outlier, and the outliers are then added with a symbol (dot or star)
| 46
Boxplot
Use Graph Builder in JMP
Can also identify shape
| 47
Outliers
A point that is too extreme to assume it has occurred on
account of inherent process variability
An observation is considered an outlier if it is more than
1.5*IQR away from the nearest quartile
An observation is considered an extreme outlier if it is
more than 3*IQR away from the nearest quartile
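The two rules above can be sketched as a small classifier. The twelve data values are hypothetical, the helper names are assumptions, and the quartiles are computed as medians of the lower and upper halves of the sorted data; software such as JMP may compute quartiles slightly differently, which can shift the cutoffs.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves (assumed rule)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[half + n % 2:])

def classify_outliers(values):
    """Split values into (outliers, extreme outliers) using the 1.5*IQR and 3*IQR rules."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    outliers, extreme = [], []
    for x in values:
        # Distance beyond the nearest quartile; zero or negative inside the box.
        distance = max(q1 - x, x - q3)
        if distance > 3 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            outliers.append(x)
    return outliers, extreme

# Hypothetical data: Q1 = 4.5, Q3 = 7.5, IQR = 3.
data = [2, 4, 4, 5, 5, 6, 6, 7, 7, 8, 14, 25]
outliers, extreme = classify_outliers(data)

print(outliers, extreme)
```

Here 14 is 6.5 units above Q3 (more than 4.5 but not more than 9), and 25 is 17.5 units above Q3 (more than 9).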
| 48
Analysis – looking at the data
Boxplot vs … dotplot or histogram
• Dotplot and histogram:
Requires two dimensions
• Boxplot:
Requires one dimension
| 49
Multiple Boxplots
We can also view response variables based on a set of
categorical input variables
This can be done with boxplots
| 50
Taxman Example
Let’s do the taxman example with the variable ‘days to
complete’ by the two categorical input variables
| 51
Taxman Example
Should files with extensions be charged a higher fee?
| 52
Scatter Plot
A scatter plot is used to examine the association between
two variables
To construct a scatter plot, scale one variable on the
horizontal axis and scale the second variable on the vertical
axis. Plot a point for each pair of values for an observation
(usually a row in the data set). The points should not be
connected.
| 53
Association between X and Y
Can determine positive, negative,
or no associations
Can assess strong, moderate, or
weak associations
Can identify linear and non-linear
trends
| 54
Taxman Example
Use Graph Builder in JMP
| 55
Problem
You are given measurements of the diameter of 30 coils of
wire sampled from ABC Company's production
The measurements (in inches) are in the Summarizing
Data.xlsx file
The coils came from five days of production (the first six
from Day 1, the next six from Day 2, etc.)
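One possible way to recover the day of production from the row order is sketched below. The 30 diameters are randomly generated placeholders, not the values in the course file, so only the grouping logic should be taken from this example.

```python
import random
import statistics

# Placeholder data: 30 hypothetical diameters in inches (NOT the course data).
rng = random.Random(0)
diameters = [round(rng.gauss(0.50, 0.01), 3) for _ in range(30)]

coils_per_day = 6   # first six coils from Day 1, next six from Day 2, etc.

# Rows 0-5 map to Day 1, rows 6-11 to Day 2, and so on.
by_day = {}
for i, d in enumerate(diameters):
    day = i // coils_per_day + 1
    by_day.setdefault(day, []).append(d)

# Per-day summaries make day-to-day differences easier to see.
for day, values in by_day.items():
    print(f"Day {day}: mean={statistics.mean(values):.4f}  "
          f"sd={statistics.stdev(values):.4f}")
```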
| 56
Problem
Summarize the coil diameters
• Use descriptive statistics and graphical displays we’ve looked at in this
lesson.
Investigate whether the coil diameters likely vary from
one day to the next.