Chocolate Cake Seminar
Series on Statistical Applications
Today's Talk:
Be an Explorer with Exploratory
Data Analysis!
By David Ramirez
Outline of Presentation
Exploratory v. Confirmatory Data Analyses
Exploratory Data Analysis Techniques
Examples of Graphical Techniques
Examples of Non-graphical Techniques
What is Exploratory Data Analysis (EDA)?
John Tukey (1915-2000), American statistician
It is important to understand what
you CAN DO before you learn to
measure how WELL you seem to
have DONE it.
Definition
EDA consists of methods of discovering unanticipated
patterns and relationships in a data set, by summarizing
data quantitatively or presenting them visually.
Exploratory v. Confirmatory
Exploratory Data Analysis
Descriptive Statistics - Inductive Approach
Look for flexible ways to examine data without preconceptions
Heavy reliance on graphical displays
Let data suggest questions
Advantages
Flexible ways to generate hypotheses
Does not require more than data can support
Promotes deeper understanding of processes
Disadvantages
Usually does not provide definitive answers
Requires judgment - cannot be cookbooked
Exploratory v. Confirmatory
Confirmatory Data Analysis
Inferential Statistics - Deductive Approach
Hypothesis tests and formal confidence interval estimation
Hypotheses determined at outset
Heavy reliance on probability models
Look for definite answers to specific questions
Emphasis on numerical calculations
Advantages
Provide precise information in the right circumstances
Well-established theory and methods
Disadvantages
Misleading impression of precision in less than ideal circumstances
Analysis driven by preconceived ideas
Difficult to notice unexpected results
EDA Techniques
Graphical presentation of distribution
- Continuous variables (stem-and-leaf plot, box plot,
histogram, bivariate scatterplot)
- Categorical variables (bar graph, pie chart)
Non-graphical summary of distribution
- Continuous variables (mean, median, mode, variance,
standard deviation, range, correlation coefficient, linear
regression)
- Categorical variables (frequency table, cross-tabulation)
Stem-and-Leaf Plot
What is it?
A plot where each data value is split into a "leaf"
(usually the last digit) and a "stem" (the other digits).
Useful for describing distributions in terms of
-- Symmetry or skewness (right-skewed=long right tail or
left-skewed=long left tail)
-- Unimodality, bimodality or multimodality (one, two,
or more peaks)
-- Presence of outliers (a few very large or very small
observations)
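As a minimal sketch of the stem/leaf split described above, the following Python snippet builds a text stem-and-leaf plot. The function names and the sample values are illustrative assumptions (they are not the rainfall data used later), and the sketch assumes non-negative integers with the last digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers into {stem: [leaves]}, leaf = last digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 28 -> stem 2, leaf 8
        stems[stem].append(leaf)
    return dict(stems)


def format_plot(stems):
    """Render one line per stem, including empty stems between min and max."""
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)


# Hypothetical sample, chosen only for illustration.
sample = [18, 25, 28, 30, 31, 31, 42, 45, 60]
plot = stem_and_leaf(sample)
print(format_plot(plot))
```

The empty stem (5) is printed on purpose: gaps in the stems are what make an isolated value such as 60 stand out as a possible outlier.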
How To Create Stem-and-Leaf Plot
Syntax
EXAMINE VARIABLES=Rain
/PLOT BOXPLOT STEMLEAF.
By Mouse
Analyze-> Descriptive Statistics-> Explore-> Plots->
Stem-and-leaf
Example: Stem-and-leaf Plot
We use SPSS to construct a stem-and-leaf plot of
rainfall in US metropolitan areas.
Frequency    Stem &  Leaf
     4.00 Extremes    (=<15)
     1.00        1 .  8
      .00        2 .
     2.00        2 .  58
    10.00        3 .  0001111234
    15.00        3 .  555556666677889
    16.00        4 .  0011222223333344
     7.00        4 .  5555566
     4.00        5 .  0234
     1.00 Extremes    (>=60)
Box Plot
What is it?
A way of graphically depicting groups of numerical data
through their five-number summaries: the smallest
observation (sample minimum), lower quartile (Q1),
median (Q2), upper quartile (Q3), and largest observation
(sample maximum). A box plot may also indicate which
observations, if any, might be considered outliers.
Useful in visualizing the following:
Location
Spread
Skewness
Outliers
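The five-number summary behind a box plot can be sketched in Python with the standard library. This is an illustration under stated assumptions: the sample values are hypothetical, the 1.5 x IQR rule used to flag outliers is one common convention, and statistical packages differ in how they compute quartiles, so the numbers may not match SPSS output exactly.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def flag_outliers(values):
    """Flag values more than 1.5 * IQR beyond the quartiles (a common rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one unusually large value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(five_number_summary(sample))
print(flag_outliers(sample))
```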
How To Create Box Plot
Syntax
EXAMINE VARIABLES=Rain
/PLOT=BOXPLOT.
By Mouse
Graphs-> Legacy Dialogs-> Boxplot-> Simple-> Summaries of
separate variables-> Define-> Scale Variable-> Optional:
Label Cases by-> OK
Example: Box Plot
Using the previous data on precipitation, we
would like to understand the distribution of
the rain and check for any outliers.
Example: Multiple Box Plots
Side-by-side box plots below display the
population distribution of large cities in 1960.
How To Create Box Plots
Syntax
EXAMINE VARIABLES=Population BY Country
/PLOT=BOXPLOT
/ID=City.
By Mouse
Graphs-> Legacy Dialogs-> Boxplot-> Simple-> Summaries
for groups of cases-> Define-> Variable (scale)->
Category Axis (the grouping variable)-> Label Cases by
(ID or name, optional)-> OK
Histogram
What is it?
A diagram consisting of rectangles whose area is
proportional to the frequency of a continuous variable
and whose width is equal to the class interval (bin).
Useful for describing distributions in terms of
-- Symmetry or skewness
-- Unimodality, bimodality or multimodality
-- Presence of outliers
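To make the role of the bin width concrete, here is a small Python sketch that counts observations per class interval and prints a text histogram. The function name, the `start` parameter, and the sample values are assumptions made for illustration only.

```python
import math
from collections import Counter


def histogram_counts(values, bin_width, start=0.0):
    """Count values in bins [start + k*w, start + (k+1)*w), keeping empty bins."""
    indices = Counter(math.floor((v - start) / bin_width) for v in values)
    first, last = min(indices), max(indices)
    return {
        (start + k * bin_width, start + (k + 1) * bin_width): indices.get(k, 0)
        for k in range(first, last + 1)
    }


# Hypothetical sample; 7.0 sits apart from the rest.
sample = [1.2, 1.9, 2.5, 3.1, 3.3, 3.8, 7.0]
hist = histogram_counts(sample, bin_width=2)
for (low, high), count in hist.items():
    print(f"[{low:g}, {high:g}): {'*' * count}")
```

Rerunning the sketch with a different `bin_width` shows how the apparent shape of the distribution depends on the choice of bins.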
How To Create Histogram
Automatically chosen Bins
Syntax
GRAPH
/HISTOGRAM(NORMAL)=Population.
By Mouse
Graphs-> Legacy Dialogs-> Histogram-> Variable (scale)-> OK
How To Create Histogram
User-selected number of bins
Syntax
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Frequency"))
ELEMENT: interval(position(summary.count(bin.rect(Population, binCount(5)))),
shape.interior(shape.square))
END GPL.
By Mouse
Graphs-> Chart Builder-> Histogram-> drag the scale variable to the x-axis-> Set Parameters-> Custom-> Number of intervals-> Continue-> OK
How To Create Histogram
User-selected bin width
Syntax
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Frequency"))
ELEMENT: interval(position(summary.count(bin.rect(Population, binWidth(1)))),
shape.interior(shape.square))
END GPL.
By Mouse
Graphs-> Chart Builder-> Histogram-> drag the scale variable to the x-axis-> Set Parameters-> Custom-> Interval width-> Continue-> OK
Example: Histogram
A researcher may need to choose the bins by hand to
see the shape of the distribution clearly and to
judge what type of distribution the data follow.
Scatterplot
What is it?
A scatterplot is a plot of data points in the xy-plane
that displays the strength, direction and shape of
the relationship between two variables.
Used for
Analyzing relationships between two variables
Looking to see if there are any outliers in the data
How To Create Scatterplot
Syntax
GRAPH
/SCATTERPLOT(BIVAR)=Height WITH Wieght
/MISSING=LISTWISE.
By Mouse
Graphs-> Legacy Dialogs-> Scatter/Dot-> Simple
Scatter-> Define-> Y Axis (outcome)-> X Axis (predictor)->
OK
Example: Scatterplot
Researchers wanted to see if there is a link
between Height and Weight.
Bar Graph
What is it?
-- A diagram consisting of rectangles whose height is
proportional to the frequency of each level of a
categorical variable.
-- A bar graph is similar to a histogram, but for
categorical variables.
Used for
-- comparison of frequencies for different levels
How To Create Bar Graph
Syntax
GRAPH
/BAR(SIMPLE)=COUNT BY Gender.
By Mouse
Graphs-> Legacy Dialogs-> Bar-> Simple-> Summaries for groups
of cases-> Define-> Categorical Variable-> Category Axis-> OK
Example: Bar Graph
Experimenters wanted to make sure they had
a roughly equal number of males and females
in a study.
Pie Chart
What is it?
A type of graph in which a circle is divided into
sectors, one for each level of a categorical
variable, with each sector showing the proportion
for that level.
Used for
-- comparison of proportions for different levels
How To Create Pie Chart
Syntax
GRAPH
/PIE=COUNT BY Bindedage.
By Mouse
Graphs-> Legacy Dialogs-> Pie-> Summaries for groups of
cases-> Define-> Categorical Variable-> Define Slices by-> OK
Example: Pie Chart
A researcher wants to recode the age
variable into a categorical variable in terms of
mental development (College Age, Older
Young Adult, Young Middle Age, Middle
Middle Age and up).
Non-Graphical Techniques
Measures of Central Tendency
Central tendency is the location of the middle
of the distribution.
Mean=sum of all data values divided by the
number of values (arithmetic average).
Measures of Central Tendency
Median=the middle value after all the values are
put in an ordered list (50% of observations lie below
and 50% above the median).
If there are two middle observations, the median is the average of
the two.
Mode=most frequently occurring value.
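The three measures of central tendency can be checked on a small example with Python's standard `statistics` module. The sample values below are hypothetical and were chosen so the results are easy to verify by hand.

```python
import statistics

# Hypothetical sample with an even number of values.
sample = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(sample)        # (2+3+3+5+7+10) / 6
median = statistics.median(sample)    # average of the two middle values, 3 and 5
modes = statistics.multimode(sample)  # every value tied for most frequent

print(mean, median, modes)
```

`multimode` returns a list because a data set can have more than one mode; for example, `[1, 1, 2, 2]` has two.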
Measures of Spread
Spread is how far observations lie from each
other.
-- Variance=average of the squared distances from
the mean (the sample variance divides by n-1 rather than n).
-- Standard deviation=square root of the variance.
-- Range=maximum-minimum.
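A short Python sketch of the measures of spread, using hypothetical values. It shows both the population variance (divide by n) and the sample variance (divide by n-1); SPSS reports the sample version.

```python
import statistics

# Hypothetical sample; the mean is 5 and the squared deviations sum to 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = statistics.pvariance(sample)  # 32 / 8
sample_variance = statistics.variance(sample)       # 32 / 7
population_sd = statistics.pstdev(sample)           # square root of 32 / 8
sample_sd = statistics.stdev(sample)                # square root of 32 / 7
data_range = max(sample) - min(sample)              # 9 - 2

print(population_variance, sample_variance, population_sd, sample_sd, data_range)
```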
How to Compute Measures of Central
Tendency and Spread
Syntax
FREQUENCIES VARIABLES=MORT
/STATISTICS=STDDEV VARIANCE RANGE MEAN MEDIAN MODE
/ORDER=ANALYSIS.
By Mouse
Analyze-> Descriptive Statistics-> Frequencies-> select a scale
variable-> Statistics-> select Mean, Median, Mode, Std. deviation,
Variance and Range-> Continue-> OK
Example: Central Tendency and Spread
We use SPSS to compute the central tendency
and spread of the mortality rates in
the 1960s.
Statistics
MORT
N    Valid              60
     Missing             0
Mean              940.3650
Median            943.7000
Mode                790.70(a)
Std. Deviation    62.20482
Variance          3869.439
Range               322.30
Correlation Coefficient
What is it?
-- A numeric measure of the linear relationship between two continuous
variables.
Properties of the correlation coefficient r:
-- Ranges between -1 and 1
-- The closer it is to -1 or 1, the stronger the linear relationship is
-- If r=0, the two variables have no linear relationship
-- If r is positive, the relationship is described as positive (larger values of one
variable tend to accompany larger values of the other variable)
-- If r is negative, the relationship is described as negative (larger values of one
variable tend to accompany smaller values of the other variable)
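The properties listed above can be checked with a small Python sketch of the Pearson correlation coefficient. The function name and the sample values are illustrative assumptions; the formula is the standard one (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y rises exactly with x, then falls exactly with x.
xs = [1, 2, 3, 4, 5]
r_positive = pearson_r(xs, [2, 4, 6, 8, 10])
r_negative = pearson_r(xs, [10, 8, 6, 4, 2])
print(r_positive, r_negative)
```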
Correlation
A word of caution:
The correlation coefficient measures only linear relationships;
two variables can be strongly related along a curve even when r is near 0.
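A quick way to see this caution in action is a perfectly curved relationship. In the hypothetical Python sketch below, y is completely determined by x (y = x squared), yet the correlation coefficient comes out as zero because the relationship is not linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


# Hypothetical data on a symmetric curve: y = x squared.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curve = pearson_r(xs, ys)
print(r_curve)
```

This is why a scatterplot should be examined alongside the correlation coefficient.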
Linear Regression
What is it?
-- Statistical technique of fitting a linear function to
data points in attempt to describe a relationship
between two variables.
Used for
-- prediction
-- interpretation of coefficients (change in y for a
unit increase in x)
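Fitting a line by least squares can be sketched in a few lines of Python. The function name and the data are hypothetical; the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:g} + {slope:g} * x")
```

The slope is read as the change in y for a one-unit increase in x, which is 2 in this example.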
How To Find Correlation and
Fitted Regression Line
Syntax
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Wieght
/METHOD=ENTER Height.
By mouse
Analyze-> Regression-> Linear-> Y (variable we want to
predict) to Dependent-> X (variable we are using to
predict Y) to Independent(s)-> OK
Example: Correlation
Referring to our weight and height scatterplot,
the researchers want to check how strongly
these two variables are related.
Correlations
                               Wieght    Hieght
Pearson Correlation   Wieght    1.000      .717
                      Hieght     .717     1.000
Sig. (1-tailed)       Wieght               .000
                      Hieght     .000
N                     Wieght      507       507
                      Hieght      507       507
Example: Regression
Researchers want to create a linear model
using the height as an independent variable
(predictor) and weight as a dependent variable
(outcome or response).
The fitted line can be written as
Weight= -105.011+1.018 (Height)
Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model             B           Std. Error        Beta                          t          Sig.
1  (Constant)     -105.011    7.539                                           -13.928    .000
   Hieght            1.018     .044             .717                           23.135    .000
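The fitted line above can be used for prediction. The Python sketch below plugs the reported intercept and slope into the equation; the function name and the example height of 170 are assumptions for illustration, and the slides do not state the measurement units.

```python
# Coefficients reported in the SPSS output above.
INTERCEPT = -105.011
SLOPE = 1.018


def predict_weight(height):
    """Predicted weight from the fitted line Weight = -105.011 + 1.018 * Height."""
    return INTERCEPT + SLOPE * height


# Hypothetical height value, in whatever units the data set uses.
predicted = predict_weight(170)
per_unit_change = predict_weight(171) - predict_weight(170)
print(round(predicted, 3), round(per_unit_change, 3))
```

The second value illustrates the interpretation of the slope: each additional unit of height is associated with a predicted increase of 1.018 units of weight.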
Frequency Table
What is it?
-- A table that shows frequency (count) for each
level of a categorical variable.
Used for
-- comparison of frequencies for different levels
How To Find Frequency Table
Syntax
FREQUENCIES VARIABLES=EDUbins
/ORDER=ANALYSIS.
By mouse
Analyze-> Descriptive Statistics-> Frequencies-> Variable
-> Display frequency tables-> OK
Example: Frequency Table
We want to know the frequencies of different educational
levels in US metropolitan areas in the 1960s. We first use
visual binning to define the bins. Using the range of the data,
we create bins for 9th, 10th, 11th, and 12th grade and up.
Syntax
* Visual Binning.
*EDU.
RECODE EDU (MISSING=COPY) (12 THRU HI=4) (11 THRU HI=3) (10 THRU HI=2)
  (LO THRU HI=1) (ELSE=SYSMIS) INTO EDUbins.
VARIABLE LABELS EDUbins 'EDU (Binned)'.
FORMATS EDUbins (F5.0).
VALUE LABELS EDUbins 1 '9th Grade' 2 '10th Grade' 3 '11th Grade' 4 '12th grade and up'.
VARIABLE LEVEL EDUbins (ORDINAL).
By Mouse
Transform-> Visual Binning-> select the variable to turn into an ordinal variable->
Continue-> Make Cutpoints-> enter the number of cutpoints and the width-> Apply-> OK
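The binning logic of the RECODE command above can be mirrored in Python. This is a sketch under assumptions: the function name is hypothetical, EDU is assumed to be a numeric number of school years, and the conditions are checked from the highest bin down, just as RECODE applies the first specification that matches.

```python
# Labels taken from the VALUE LABELS command above.
LABELS = {1: "9th Grade", 2: "10th Grade", 3: "11th Grade", 4: "12th grade and up"}


def bin_education(edu):
    """Map a numeric EDU value to bin 1-4; missing values stay missing."""
    if edu is None:
        return None
    if edu >= 12:
        return 4
    if edu >= 11:
        return 3
    if edu >= 10:
        return 2
    return 1


# Hypothetical EDU values for illustration.
for value in [9.6, 10.0, 11.9, 12.3]:
    print(value, LABELS[bin_education(value)])
```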
Example: Frequency Table
EDU (Binned)
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid   9th Grade                    9      15.0            15.0                 15.0
        10th Grade                  19      31.7            31.7                 46.7
        11th Grade                  20      33.3            33.3                 80.0
        12th grade and up           12      20.0            20.0                100.0
        Total                       60     100.0           100.0
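The Percent and Cumulative Percent columns follow directly from the counts. As a check, this Python sketch recomputes them from the frequencies reported in the table (9, 19, 20 and 12); the variable names are illustrative.

```python
# Frequencies taken from the EDU (Binned) table above.
counts = {
    "9th Grade": 9,
    "10th Grade": 19,
    "11th Grade": 20,
    "12th grade and up": 12,
}

total = sum(counts.values())
rows = []
cumulative = 0
for level, n in counts.items():
    cumulative += n
    percent = round(100 * n / total, 1)
    cumulative_percent = round(100 * cumulative / total, 1)
    rows.append((level, n, percent, cumulative_percent))

for level, n, percent, cumulative_percent in rows:
    print(f"{level:<18} {n:>3} {percent:>6} {cumulative_percent:>6}")
```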
Cross-tabulation
What is it?
A two-way table containing frequencies (counts)
for different levels of the column and row
variables.
Used for
Comparison of frequencies for different levels of
the variables (chi-squared test)
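The chi-squared statistic for a two-way table compares each observed count with the count expected if rows and columns were independent. The Python sketch below computes the statistic and its degrees of freedom; the function name and the 2x2 table are hypothetical, and the p-value is left out because it needs the chi-squared distribution, which is not in the standard library.

```python
def chi_square(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical 2x2 table of counts (every expected count is 15).
table = [[10, 20], [20, 10]]
statistic, df = chi_square(table)
print(round(statistic, 3), df)
```

For a table with 4 rows and 4 columns the same formula gives (4-1) x (4-1) = 9 degrees of freedom.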
How To Find Cross-tabulation
Syntax
CROSSTABS
/TABLES=EDUbins BY US
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT
/COUNT ROUND CELL.
By Mouse
Analyze-> Descriptive Statistics-> Crosstabs-> select
variable for row-> select variable for column->
Statistics-> Chi-square-> Continue-> OK
Example: Cross-tabulation
Researchers wish to understand whether the
educational levels in the SMSA data were
distributed equally across the regions of the US.
The small p-value of the Pearson chi-square test
(.002) suggests that the distribution of educational
levels differs among the regions of the US.
EDU (Binned) * US Crosstabulation
[Table of counts: EDU (Binned) levels (9th Grade, 10th Grade, 11th Grade, 12th grade and up) by US region (1.00, 2.00, 3.00, 4.00). Surviving marginal totals: EDU rows 19, 20, 12; US columns 21, 16, 14; overall total 60.]
Chi-Square Tests
                                 Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square               26.078(a)                         .002
Likelihood Ratio                 25.377                            .003
Linear-by-Linear Association      9.893                            .002
N of Valid Cases                 60
Recommended Readings/Citations
Hartwig, F., & Dearing, B. E. (1979). Exploratory Data
Analysis. Beverly Hills: Sage Publications.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983).
Understanding Robust and Exploratory Data Analysis. New
York: John Wiley & Sons.
Pampel, F. C. (2004). Exploratory Data Analysis. In M. S.
Lewis-Beck, A. Bryman, & T. F. Liao (Eds.), The SAGE
Encyclopedia of Social Science Research Methods (pp. 359-360).
Thousand Oaks, California: Sage Publications.
Vogt, W. P. (1999). Exploratory Data Analysis. In W. P. Vogt,
Dictionary of Statistics & Methodology: A Nontechnical
Guide for the Social Sciences (pp. 104-105). Thousand Oaks,
California: Sage Publications.