
Descriptive Statistics Overview for STAT 614

The document outlines key concepts in descriptive statistics, including the distinction between samples and populations, types of data, and methods for data collection. It discusses various numerical and visual summaries, such as measures of central tendency and variability, as well as graphical representations like bar charts, histograms, and boxplots. The document also includes examples and strategies for analyzing data to draw meaningful conclusions.

Uploaded by

Sanket Waghmare

STAT 614

Descriptive Statistics
Carly Metcalfe
Fall 2020

Topics
 Sample versus Population
 Types of Data
 Numerical Summaries
 Visual Summaries
Sample Versus Population

Population
 A collection of all objects of interest
• All full-time students at RIT
• All tax-exempt property in Wisconsin
• All patients in a Corning area hospital at a certain time
• All people living in the US today

A Sample
 A sample is a subset of the population

What makes a good Sample?


 Random Sample!
 A good sample is like a miniature version of the population
 The sample also needs to be large enough so that we
can draw reliable conclusions
 At the same time, we do not want the sample to be too
large, because collecting data costs money

Sampling Unit
 The objects (or people) on which the measurements are
performed are called sampling units
Types of Data

Ways to collect data


Three basic methods of collecting data:
• A retrospective study using historical data
 Data collected in the past for other purposes.
• An observational study
 Data, presently collected, by a passive observer.
• A designed experiment
 Data collected in response to process input changes.

Types of Data
 Categorical (Qualitative or Attribute)
• Ordinal: Values that can be ordered or ranked
 Examples
> Shirt size (small, medium, large)
> Quality of product shipped (Grades A, B, C)
> Year in College (Freshman, Sophomore, Junior, Senior)
• Nominal: Categories that need not have an ordering
 Examples
> In or out of specification (Yes, No)
> Make of car preferred (Toyota, Honda, etc.)
> Vehicle type (pick up truck, van, 2-door, 4-door)

Types of Data
 Numeric (Quantitative)
• Discrete: Can only take on whole numbers
 Examples
> Number of empty beds in a hospital at a certain time
> Number of misread scripts in a pharmacy over the course of a day
> Number of arrivals to a grocery store within an hour
• Continuous: Can, at least in theory, take on all values in a range
 Examples
> Body temperature, in °F
> Weight, in lbs
> Commute time, in minutes
Numerical Summaries

Muffin Example
 Blueberry Muffins
• Each package contains several muffins
• Target package weight is 427 grams
 10 packages are selected at random
• Each muffin in the package is weighed

 What do you do first?



In JMP
 Can explore this data set: first tab in Summarizing Data.xlsx
• Number of muffins in each package?
• Distribution of individual muffin weight?
• Total weight of each package?
• Distribution of package weight?

Qualitative Numerical Summaries


 Nominal or Ordinal data can be counted, tallied, and the
percentage (proportion) of each group can be found
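
As a rough illustration of counting and finding proportions outside JMP, here is a minimal Python sketch. The function name `tally` and the shirt-size values are hypothetical, invented only for this example.

```python
from collections import Counter


def tally(values):
    """Return {category: (count, proportion)} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {category: (count, count / total) for category, count in counts.items()}


# Hypothetical shirt sizes, invented for illustration only
sizes = ["small", "medium", "large", "medium", "medium", "small", "large", "medium"]
summary = tally(sizes)

# Print a simple count / percentage table
for category, (count, proportion) in summary.items():
    print(f"{category:<8}{count:>3}{proportion:>8.1%}")
```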

Quantitative Numerical Summaries


 Quantitative data summaries include measures of central tendency and variability

 Measures of Location (center)


• Mean
• Median
• Mode

 Measures of Variability (spread)


• Range
• Variance
• Standard Deviation

Measures of Central Tendency


 Mean
• Average $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
 Median
 Middle observation or average of the middle two
 Data: 16 7 8 12 7
> Sorted: 7 7 8 12 16 -- Median = 8
 Data: 12 15 10 10 9 16
> Sorted: 9 10 10 12 15 16 -- Median = (10 + 12) / 2 = 11
 Mode
 The most frequent observation
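
The two small data sets above can be checked with Python's standard `statistics` module. This is only a sketch alongside the JMP workflow used in the course; the variable names are made up for the example.

```python
import statistics

odd = [16, 7, 8, 12, 7]           # five observations
even = [12, 15, 10, 10, 9, 16]    # six observations

mean_odd = statistics.mean(odd)          # (16 + 7 + 8 + 12 + 7) / 5
median_odd = statistics.median(odd)      # middle value of 7 7 8 12 16
mode_odd = statistics.mode(odd)          # most frequent observation
median_even = statistics.median(even)    # average of the middle two, 10 and 12

print(mean_odd, median_odd, mode_odd, median_even)
```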

Mean versus Median



Measures of the Spread


 Range
• Data: 8 7 8 3 7 9 5 6 8 8 5
• Max = 9, Min = 3, Range = Max – Min = 9 – 3 = 6
 Variance $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
 Standard Deviation $s = \sqrt{s^2}$
• Square root of the variance. The advantage of the sample standard deviation
is that it is expressed in the original units that are measured (as opposed to
the variance, which is in square units).

Measures of the Spread


 Standard Deviation
• Data: 8 6 1 5 10

Obs.   $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
1      8       6           2                   4
2      6       6           0                   0
3      1       6           -5                  25
4      5       6           -1                  1
5      10      6           4                   16
       Sum = 30                                Sum = 46

• mean = $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{30}{5} = 6$
• variance = $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{46}{4} = 11.5$
• standard deviation = $s = \sqrt{11.5} = 3.39$
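
The hand calculation for the data 8 6 1 5 10 can be reproduced step by step in Python. This is a sketch for checking the arithmetic, not part of the original slides; it follows the sample formulas with the $n-1$ divisor and cross-checks against the standard `statistics` module.

```python
import math
import statistics

data = [8, 6, 1, 5, 10]
n = len(data)

mean = sum(data) / n                               # 30 / 5
squared_devs = [(x - mean) ** 2 for x in data]     # squared deviations from the mean
variance = sum(squared_devs) / (n - 1)             # sample variance, divisor n - 1
std_dev = math.sqrt(variance)                      # back in the original units

print(mean, sum(squared_devs), variance, round(std_dev, 2))
```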

Measures of Spread
Quartiles and Interquartile Range
 After the observations are sorted in ascending order, imagine the sorted data split into
four equal parts. The value that splits the sorted data exactly in two is the median, which
is sometimes called Q2, the second quartile. Then the value that splits the lower half in two
is Q1, the first quartile, and the value that splits the upper half in two is Q3, the third
quartile.
 Therefore, 25% of the data is below Q1, and 25% of the data is above Q3, with 50%
between Q1 and Q3.
 The difference between Q3 and Q1 is called the interquartile range (IQR), which is
another measure of variability in data.
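
The "split the sorted data in halves" description can be sketched in Python as follows. The function name `quartiles_by_halves` is hypothetical, and the handling of an odd number of observations (leaving the median out of both halves) is an assumption: software packages, including JMP, may use a different convention and report slightly different quartiles. The data are the 11 values from the range example.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) by splitting the sorted data at the median.

    Assumption: with an odd number of observations the median itself is
    excluded from both halves. Needs at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]      # observations below the median position
    upper = ordered[-half:]     # observations above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))


data = [8, 7, 8, 3, 7, 9, 5, 6, 8, 8, 5]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1                   # interquartile range
print(q1, q2, q3, iqr)
```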

Percentiles
 Percentiles split the sorted data into 100 equal parts, with 1% in
each section. For example, to be in the 99th percentile on a test
would mean that 99% of the sample had lower scores than you,
and 1% of the sample had higher scores than you.
 Ex 1: The score of 640 on Math SAT is the 85th percentile if 85%
of the test takers scored below 640.
 Ex 2: Since the median splits the data in two, it is considered the
50th percentile.
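
To make the "percent of the sample below you" reading concrete, here is a small Python sketch. The helper `percent_below` and the 20 scores are hypothetical, chosen so that a score of 640 lands at the 85th percentile as in Ex 1.

```python
def percent_below(scores, value):
    """Percentage of scores strictly below `value`."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)


# Hypothetical scores, invented for illustration only
scores = [410, 430, 450, 470, 480, 500, 510, 520, 540, 550,
          560, 570, 580, 590, 600, 610, 620, 640, 700, 760]

print(percent_below(scores, 640))   # 17 of the 20 scores are below 640
```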

Numerical Summary Strategy


 Certain summary measures are sometimes used together
 If the dataset contains outliers or is highly skewed
• Use median as measure of central tendency
• Use interquartile range (IQR) as measure of variability
 Otherwise
• Use mean and standard deviation
 Use the Distribution platform in JMP
Visual Summaries

Qualitative Graphical Summaries


Bar and Pie Charts
 Bar charts and pie charts can be used to graph the same
information as a count or percentage table.
 Use JMP Graph Builder to create them

Bar Chart
 Shows graphically how many muffins are in each
package.
 Allows easy identification of differences between packages

Pie Chart
 Shows graphically how many muffins are in each
package:

Taxman Example
 You prepare taxes for individuals and businesses

 You collected tax preparation information for 2017


• Type: Individual, Business
• Extension needed: Extension, No Extension
• Fee collected per file ($)
• Days to complete

 Let’s take a look



Taxman Example
 For this example, let’s put together a graphical and
numerical summary for whether or not an extension is
needed and draw conclusions
 We can take it a step further and instead summarize the
need for extension by type

Quantitative Graphical Summaries


 Univariate Data
• Series (Run) charts
 If time order is relevant
• Histograms
• Box plots
 Bivariate Data
• Scatter plots

Series Plot
 Used for data that have been ordered in some meaningful way, for
instance according to the time of collection, beginning with
the first and ending with the last
 A time series plot looks at a single variable over time.
However, you can overlay several time series plots on one
graph
 Patterns can occur over time, such as trends (up or down),
cycles, and alternating values
 A type of time series plot is a run chart

Series Plot
 In JMP: Analyze -> Specialized Modeling -> Time Series
 Shows the data plotted in the order it occurs in the
worksheet

[Figures: series plots of Raleigh Temperature and of CO2 at Mauna Loa Observatory]



Static Data Plots


 Used to answer the questions:
1. What is the spread?
2. What is the distribution?
3. Where is the center?

Static Data Plots


 To accomplish this, we can use the following plots:
1. Dotplot – Based on the raw data
2. Histogram – Based on grouped data
3. Boxplot – Based on summarized data

Dotplot
 Plot used to view the raw data
 A symbol (dot) is used to represent each data point
 No grouping, e.g. to nearest unit

 Useful if:
• Data is continuous
• Small data sets

Dotplot
 Use Graph Builder in JMP
 A dotplot shows the spread and shape of the individual data points, in this
case by package

Taxman Example
 Create a dotplot for one of the continuous variables in
the taxman example and draw conclusions

Histogram
 Plot used to view data grouped in cells or intervals, usually
all of the same width
 Bars are used to assess the frequency of occurrence in each
cell based on bar height

 Useful if:
• Data is discrete or continuous
• Datasets are large or small
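
The grouping into equal-width cells that a histogram relies on can be sketched in a few lines of Python. The function `bin_counts`, the starting point 2, and the cell width 2 are assumptions chosen for illustration; JMP picks its own cells. The data are the 11 values from the range example.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width cells.

    Cell k covers [start + k*width, start + (k+1)*width).
    Values outside all cells are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


data = [8, 7, 8, 3, 7, 9, 5, 6, 8, 8, 5]
counts = bin_counts(data, start=2, width=2, n_bins=4)   # cells [2,4) [4,6) [6,8) [8,10)

# Crude text histogram: bar length is the frequency in each cell
for k, count in enumerate(counts):
    low = 2 + 2 * k
    print(f"[{low}, {low + 2})  " + "#" * count)
```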

Histogram
 Answers the questions
• What is the spread?
• What is the distribution?
• Where is the center?
 Sorts data into bins (cells) by numerical values

Shapes of Histograms

[Figure panels: Symmetric, Positively Skewed, Negatively Skewed]

 Other shapes include bimodal, multimodal, and uniform
 Histograms can tell you a lot, including helping to identify outliers

Histogram
 Use Graph Builder in JMP

Taxman Example
 Create a histogram for one of the continuous variables in
the taxman example and draw conclusions

Boxplot
 Created based on a summarization of the data
 Emphasizes percentiles (quartiles)

 Useful if:
• Data is continuous
• Data sets are large or small

Boxplot
 A boxplot uses the five-number summary: minimum, Q1, Q2, Q3,
maximum
 If there are outliers, the line extends to the largest/smallest value that
is not an outlier, and the outliers are then added with a symbol (dot or star)

Boxplot
 Use Graph Builder in JMP
 Can also identify shape

Outliers
 A point that is too extreme to assume it has occurred on
account of inherent process variability
 An observation is considered an outlier if it is more than
1.5*IQR away from the nearest quartile
 An observation is considered an extreme outlier if it is
more than 3*IQR away from the nearest quartile
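
The 1.5*IQR and 3*IQR rules can be sketched in Python as below. The function name `classify_outliers` and the data values are hypothetical. Quartiles come from `statistics.quantiles` with its default method, which is one of several conventions and may differ slightly from what JMP reports. The "mild" list holds outliers that are not extreme, and the returned pair gives the most extreme values that are not outliers, which is where a boxplot's lines would end.

```python
import statistics


def classify_outliers(values):
    """Return (mild outliers, extreme outliers, (low end, high end) of non-outliers)."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # assumed quartile convention
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance beyond the nearest quartile; zero or negative inside the box
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            mild.append(v)
    inside = [v for v in values if v not in mild and v not in extreme]
    return mild, extreme, (min(inside), max(inside))


# Hypothetical "days to complete" values, invented for illustration only
days = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 14, 25]
mild, extreme, whisker_ends = classify_outliers(days)
print(mild, extreme, whisker_ends)
```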

Analysis – looking at the data


 Boxplot vs. dotplot or histogram
• Dotplot and histogram:
 Require two dimensions

• Boxplot:
 Requires one dimension

Multiple Boxplots
 We can also view response variables based on a set of
categorical input variables
 This can be done with boxplots

Taxman Example
 Let’s do the taxman example with the variable ‘days to
complete’ by the two categorical input variables

Taxman Example
 Should files with extensions be charged a higher fee?

Scatter Plot
 A scatter plot is used to examine the association between
two variables

 To construct a scatter plot, scale one variable on the
horizontal axis and scale the second variable on the vertical
axis. Plot a point for each pair of values for an observation
(usually a row in the data set). The points should not be
connected.

Association between X and Y


 Can determine positive, negative, or no associations
 Can assess strong, moderate, or weak associations
 Can identify linear and non-linear trends
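
The slides judge association visually. One common numerical companion, which is not defined in this lesson and is shown here only as an assumption-labeled sketch, is the Pearson correlation coefficient: values near +1 or -1 suggest a strong positive or negative linear association, and values near 0 suggest little linear association. The function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pairs, invented for illustration only
days = [1, 2, 3, 4, 5]
fee_up = [100, 150, 200, 250, 300]     # rises with days: positive association
fee_down = [300, 250, 200, 150, 100]   # falls with days: negative association

r_up = pearson_r(days, fee_up)
r_down = pearson_r(days, fee_down)
print(r_up, r_down)
```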

Taxman Example
 Use Graph Builder in JMP

Problem
 You are given measurements of the diameter of 30 coils of
wire sampled from ABC Company production

 The measurements (in inches) are in Summarizing Data.xlsx

 The coils came from five days of production (the first six
from Day 1, the next six from Day 2, etc.)

Problem
 Summarize the coil diameters
• Use the descriptive statistics and graphical displays we’ve looked at in this
lesson.
 Investigate whether the coil diameters likely vary from
one day to the next.
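
One possible starting point for the day-to-day comparison, outside JMP, is to split the 30 measurements into consecutive groups of six and summarize each group. The sketch below is not a solution: the function name `summarize_by_day` is hypothetical, and the values are generated placeholders, NOT the actual coil diameters from the workbook.

```python
import statistics


def summarize_by_day(diameters, per_day=6):
    """Split measurements into consecutive groups of `per_day` values and
    return {day: (mean, standard deviation)} with days numbered from 1."""
    summary = {}
    for start in range(0, len(diameters), per_day):
        day = start // per_day + 1
        group = diameters[start:start + per_day]
        summary[day] = (statistics.mean(group), statistics.stdev(group))
    return summary


# Placeholder values only; replace with the 30 measured diameters (in inches)
placeholder = [round(0.50 + 0.001 * ((i * 7) % 11), 3) for i in range(30)]

by_day = summarize_by_day(placeholder)
for day, (mean, sd) in by_day.items():
    print(f"Day {day}: mean = {mean:.4f}, sd = {sd:.4f}")
```

Comparing the per-day means against the per-day standard deviations, or drawing side-by-side boxplots by day, gives a first impression of whether the diameters shift from one day to the next.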
