IPS7e LecturePowerPointSlides ch01
IPS Chapter 1
Variables
Types of variables
Graphs for categorical variables
Bar graphs
Pie charts
Graphs for quantitative variables
Histograms
Stemplots
Stemplots versus histograms
Interpreting histograms
Time plots
Variables
In a study, we collect information—data—from cases. Cases can be
individuals, companies, animals, plants, or any object of interest.
Examples of variables: age, height, blood pressure, ethnicity, leaf length, first language
The distribution of a variable tells us what values the variable takes and
how often it takes these values.
Two types of variables
Variables can be either quantitative…
… or categorical.
A categorical variable is something that falls into one of several categories.
What we can record is the count or proportion of cases in each category.
Example: Your blood type (A, B, AB, O), your hair color, your ethnicity,
whether you paid income tax last tax year or not.
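As a minimal sketch of what "count or proportion of cases in each category" means in practice, the snippet below tabulates a categorical variable. The blood-type sample is hypothetical and only serves to illustrate the idea.

```python
from collections import Counter

# Hypothetical sample: blood type recorded for 10 cases (not real data).
blood_types = ["O", "A", "A", "B", "O", "AB", "O", "A", "O", "B"]

counts = Counter(blood_types)          # count of cases in each category
n = len(blood_types)                   # number of cases in the sample
proportions = {category: count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category:>2}: count={counts[category]}, "
          f"proportion={proportions[category]:.2f}")
```

The resulting table of categories with their counts or proportions is the distribution of the categorical variable.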
How do you know if a variable is categorical or quantitative?
Ask:
What are the n cases/units in the sample (of size “n”)?
Categorical: Each individual is assigned to one of several categories.
Quantitative: Each individual is assigned a numerical value.
Bar graphs
Each category is represented by a bar.
Pie charts
The slices must represent the parts of one whole.
Example: Top 10 causes of death in the United States 2006
Rank  Cause of death         Count     % of top 10s  % of total deaths
1     Heart disease          631,636   34%           26%
2     Cancer                 559,888   30%           23%
3     Cerebrovascular        137,119   7%            6%
4     Chronic respiratory    124,583   7%            5%
5     Accidents              121,599   7%            5%
6     Diabetes mellitus      72,449    4%            3%
7     Alzheimer’s disease    72,432    4%            3%
8     Flu and pneumonia      56,326    3%            2%
9     Kidney disorders       45,344    2%            2%
10    Septicemia             34,234    2%            1%
For each individual who died in the United States in 2006, we record what was
the cause of death. The table above is a summary of that information.
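The "% of top 10s" column can be recomputed from the counts alone, as the sketch below shows. The "% of total deaths" column cannot be recomputed this way, because the total number of deaths from all causes is not given in the table.

```python
# Counts taken from the table above (top 10 causes of death, United States, 2006).
counts = {
    "Heart disease": 631_636,
    "Cancer": 559_888,
    "Cerebrovascular": 137_119,
    "Chronic respiratory": 124_583,
    "Accidents": 121_599,
    "Diabetes mellitus": 72_449,
    "Alzheimer's disease": 72_432,
    "Flu and pneumonia": 56_326,
    "Kidney disorders": 45_344,
    "Septicemia": 34_234,
}

top10_total = sum(counts.values())  # deaths from the top 10 causes combined

# Percent of top-10 deaths attributed to each cause.
pct_of_top10 = {cause: 100 * c / top10_total for cause, c in counts.items()}

for cause, pct in pct_of_top10.items():
    print(f"{cause:<22}{pct:5.1f}%")
```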
Bar graphs
Each category is represented by one bar. The bar’s height shows the count (or
sometimes the percentage) for that particular category.
[Figure: bar graph, Top 10 causes of deaths in the United States 2006; vertical axis: counts from 0 to 700,000.]
The number of individuals who died of an accident in 2006 is approximately 121,000.
[Figure: the same bar graph with the bars sorted alphabetically. Much less useful.]
Pie charts
Each slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.
[Figure: pie chart, Percent of people dying from top 10 causes of death in the United States in 2006.]
Make sure your labels match the data.
Make sure all percents add up to 100.
[Figure: two pie charts, Percent of deaths from top 10 causes and Percent of deaths from all causes.]
[Figure: Child poverty before and after government intervention (UNICEF, 2005).]
Histograms
A histogram breaks the range of values of a variable into classes and
displays only the count or percent of the observations that fall into each
class.
Stemplots: when the observed values have too many digits, trim the numbers
before making a stemplot.
The first column represents all states with a Hispanic percent in their
population between 0% and 4.99%. The height of the column shows how
many states (27) have a percent in this range.
The last column represents all states with a Hispanic percent in their
population between 40% and 44.99%. There is only one such state: New
Mexico, at 42.1% Hispanics.
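The class-counting step behind a histogram can be sketched as follows. The values are hypothetical (they are not the state data described above); only the class width of 5 percentage points matches the example.

```python
from collections import Counter

# Hypothetical percents, NOT the actual state data from the slide.
values = [0.7, 2.4, 3.1, 4.9, 5.0, 6.8, 9.9, 12.5, 13.0, 42.1]
class_width = 5  # each class covers 5 percentage points, e.g. 0 to 4.99


def class_start(x, width):
    """Lower bound of the class that contains x."""
    return int(x // width) * width


# Count how many observations fall into each class.
histogram = Counter(class_start(v, class_width) for v in values)

for start in sorted(histogram):
    print(f"{start:>2} to <{start + class_width:<2}: {'*' * histogram[start]}")
```

Each printed row corresponds to one column of the histogram; the number of stars is the column height.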
Stemplots versus histograms
Stemplots are quick-and-dirty histograms that can easily be done by
hand, and are therefore very convenient for back-of-the-envelope
calculations. However, they are rarely found in scientific or lay
publications.
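For comparison with a histogram, here is a minimal stemplot sketch. It assumes non-negative values, trims decimals before plotting, and uses the tens digit as the stem and the ones digit as the leaf; the data and the function name `stemplot` are hypothetical.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot lines: stem = tens digit(s), leaf = ones digit.

    Assumes non-negative values; decimals are trimmed before plotting.
    """
    stems = defaultdict(list)
    for v in sorted(int(v) for v in values):  # trim, then sort
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


data = [12.4, 15.9, 21.0, 23.3, 23.8, 41.5]  # hypothetical observations
print("\n".join(stemplot(data)))
```

Unlike a histogram, the stemplot keeps the (trimmed) individual values visible in the leaves.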
Interpreting histograms
When describing the distribution of a quantitative variable, we look for the
overall pattern and for striking deviations from that pattern. We can describe
the overall pattern of a histogram by its shape, center, and spread.
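[Figure: histogram of a complex, multimodal distribution.]
[Figure: Histogram of dry days in 1995, drawn with different class widths; one version is not summarized enough and another is too summarized.]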
IMPORTANT NOTE:
Your data are the way they are.
Do not try to force them into a
particular shape.
It is a common misconception
that if you have a large enough
data set, the data will eventually
turn out nice and symmetrical.
Line graphs: time plots
In a time plot, time always goes on the horizontal, x axis.
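We describe time series by looking for an overall pattern and for striking
deviations from that pattern. In a time series, a trend is a rise or fall that
persists over time, and seasonal variation is a pattern that repeats at regular
intervals.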
This time plot shows a regular pattern of yearly variations. These are seasonal
variations in fresh orange pricing most likely due to similar seasonal variations in
the production of fresh oranges.
There is also an overall upward trend in pricing over time. It could simply be
reflecting inflation trends or a more fundamental change in this industry.
A time plot can be used to compare two or more
data sets covering the same time period.
Week      # cases diagnosed   # deaths reported
week 4    8682                552
week 5    7164                738
week 6    2229                414
week 7    600                 198
week 8    164                 90
week 9    57                  56
week 10   722                 50
week 11   1517                71
week 12   1828                137
week 13   1539                178
week 14   2416                194
week 15   3148                290
week 16   3465                310
week 17   1440                149
[Figure: Incidence. Time plots of # cases and # deaths against week (weeks 1 to 17).]
The pattern over time for the number of flu diagnoses closely resembles that for the
number of deaths from the flu, indicating that about 8% to 10% of the people
diagnosed that year died shortly afterward, from complications of the flu.
Scales matter
Death rates from cancer (US, 1945-95)
[Figure: four time plots of the same death rates against years (1940 to 2000), each drawn with different axis scales.]
... BUT ... hard numbers.
Look at the scales.
Why does it matter?
Learn right away how to get the mean using your calculator.
Your numerical summary must be meaningful.
Height of 25 women in a class
Could we have
more than one
plant species or
phenotype?
[Figure: Height of Plants by Color. Histogram of number of plants (0 to 5) against height in centimeters (58 to 84) for red, pink, and blue plants; the three group means marked are x̄ = 63.9, x̄ = 70.5, and x̄ = 78.3.]
[Figure: Percent of people dying, with the mean and median marked. Without the outliers, x̄ = 3.4; with the outliers, x̄ = 4.2.]
Symmetric distribution…
Disease X: x̄ = 3.4, M = 3.4
Multiple myeloma: x̄ = 3.4, M = 2.5
[Figure: side-by-side boxplots for Disease X and Multiple Myeloma; vertical axis from 0 to 11.]
Boxplots remain true to the data and clearly depict symmetry or skew.
Suspected outliers
One way to raise the flag for a suspected outlier is to compare the
distance from the suspicious data point to the nearest quartile (Q1 or
Q3). We then compare this distance to the interquartile range
(distance between Q1 and Q3).
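The comparison described above can be sketched in code. The threshold of 1.5 × IQR used here is the commonly taught convention and is an assumption, since the text above does not state a multiplier. Quartiles are computed as medians of the lower and upper halves, which is one textbook convention; software may use slightly different rules. The data and function names are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves of the sorted data."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For odd n, the overall median is excluded from both halves.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(upper)


def suspected_outliers(values, k=1.5):
    """Flag points lying more than k * IQR beyond the nearest quartile.

    k = 1.5 is an assumed, conventional threshold.
    """
    q1, q3 = quartiles(values)
    iqr = q3 - q1  # interquartile range: distance between Q1 and Q3
    return [x for x in values if x < q1 - k * iqr or x > q3 + k * iqr]


data = [1, 2, 3, 4, 5, 6, 7, 30]  # hypothetical data with one large value
print(quartiles(data))            # Q1 and Q3
print(suspected_outliers(data))   # points flagged as suspected outliers
```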
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
[Figure: distribution with the mean and ± 1 s.d. marked.]
Calculations … Women’s height (inches)
$$ s = \sqrt{\frac{1}{df} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
s = 0 only when all observations have the same value and there is
no spread. Otherwise, s > 0.
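The arithmetic above can be checked with a short sketch. The values n = 14 and 85.2 come from the calculation above (df = n − 1 = 13); the helper `sample_sd` and the small data set are hypothetical and only illustrate the general formula.

```python
from math import sqrt
from statistics import stdev

# Numbers quoted above: df = n - 1 = 13, sum of squared deviations = 85.2.
n = 14
sum_sq_dev = 85.2

variance = sum_sq_dev / (n - 1)  # s squared, in inches squared
s = sqrt(variance)               # standard deviation, in inches
print(f"variance = {variance:.2f}, s = {s:.2f}")


def sample_sd(values):
    """Sample standard deviation: sqrt of sum of squared deviations over n - 1."""
    mean = sum(values) / len(values)
    return sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))


hypothetical_heights = [60, 62, 63, 65, 67]  # illustrative values only
print(sample_sd(hypothetical_heights), stdev(hypothetical_heights))
```

The hand-written function and `statistics.stdev` should agree, and `sample_sd` returns 0 when all observations are equal.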
Give common
statistics of your
sample data.
Minitab
Choosing among summary statistics
[Figure: Height in Inches.]
Plot the mean and use the …
Density curves
Normal distributions
Standardizing observations
The mean of a density curve is the balance point, at which the curve
would balance if it were made of solid material.
The median and mean are the same for a symmetric density curve.
The mean of a skewed curve is pulled in the direction of the long tail.
Normal distributions
Normal (or Gaussian) distributions are a family of symmetrical, bell-shaped
density curves defined by a mean µ (mu) and a standard
deviation σ (sigma): N(µ, σ).
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$
[Figure: two Normal curves plotted against x, with horizontal axes from 0 to 30.]
The 68-95-99.7% Rule for Normal Distributions
Reminder: µ (mu) is the mean of the idealized curve, while x̄ is the mean of a sample.
σ (sigma) is the standard deviation of the idealized curve, while s is the s.d. of a sample.
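The percentages in the rule's name are the areas under a Normal curve within 1, 2, and 3 standard deviations of the mean. A minimal check, using Python's standard-library `statistics.NormalDist` (the helper name `area_within` is made up for this sketch):

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Normal curve: mean 0, standard deviation 1


def area_within(k):
    """Area under the Normal curve within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} s.d.: {100 * area_within(k):.1f}%")
```

Because every Normal curve has the same shape once standardized, these areas hold for any µ and σ.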
The standard Normal distribution
Because all Normal distributions share the same properties, we can
standardize our data to transform any Normal curve N(µ, σ) into the
standard Normal curve N(0,1).
$$ z = \frac{x - \mu}{\sigma} $$
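Example: women’s heights, with mean µ = 64.5" and standard deviation σ = 2.5".
For x (height) = 67": z = (67 − 64.5)/2.5 = 1
[Figure: the height curve with µ = 64.5" and x = 67" marked, and the standardized height curve (no units) with z = 0 and z = 1 marked.]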
Because of the 68-95-99.7 rule, we can conclude that the percent of women
shorter than 67” should be, approximately, .68 + half of (1 - .68) = .84 or 84%.
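The same reasoning can be sketched in code: standardize the height, apply the rule-of-thumb estimate, and compare it with the area computed directly from the Normal curve. The variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 64.5, 2.5   # mean and standard deviation of heights, in inches
x = 67                  # height of interest, in inches

z = (x - mu) / sigma    # standardized height (no units)

# Rule-of-thumb estimate: 68% within 1 s.d., plus half of what remains.
rule_estimate = 0.68 + (1 - 0.68) / 2

# Area to the left of x under N(mu, sigma), computed directly.
exact = NormalDist(mu, sigma).cdf(x)

print(f"z = {z}, rule estimate = {rule_estimate:.2f}, computed area = {exact:.4f}")
```

The computed area is slightly above 0.84 because the 68% figure in the rule is itself rounded.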
Using the standard Normal table
Table A gives the area under the standard Normal curve to the left of any z value.
0.0082 is the area under N(0,1) to the left of z = -2.40.
[Figure: N(µ, σ) = N(64.5", 2.5") with area ≈ 0.84 shaded. A second curve shows area = 0.0099 to the left of z = -2.33 and area = 0.9901 to the right.]
x = 820, µ = 1026, σ = 209
$$ z = \frac{x - \mu}{\sigma} = \frac{820 - 1026}{209} = \frac{-206}{209} \approx -0.99 $$
Table A: area under N(0,1) to the left of z = -0.99 is 0.1611, or approx. 16%.
area right of 820 = total area - area left of 820
                  = 1 - 0.1611
                  ≈ 84%
Note: The actual data may contain students who scored exactly 820 on the SAT.
However, the proportion of scores exactly equal to 820 is 0 for a Normal
distribution. This is a consequence of the idealized smoothing of density curves.
The NCAA defines a “partial qualifier” eligible to practice and receive an athletic
scholarship, but not to compete, with a combined SAT score of at least 720.
What proportion of all students who take the SAT would be partial qualifiers?
That is, what proportion have scores between 720 and 820?
x = 720, µ = 1026, σ = 209
$$ z = \frac{x - \mu}{\sigma} = \frac{720 - 1026}{209} = \frac{-306}{209} \approx -1.46 $$
Table A: area under N(0,1) to the left of z = -1.46 is 0.0721, or approx. 7%.
area between 720 and 820 = area left of 820 - area left of 720
                         = 0.1611 - 0.0721
                         ≈ 9%
About 9% of all students who take the SAT have scores between 720 and 820.
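Both SAT calculations can be reproduced without Table A. The sketch below uses `statistics.NormalDist` with the mean 1026 and standard deviation 209 from the example. The results differ slightly in the third decimal from the table-based values, because Table A requires rounding z to two decimals.

```python
from statistics import NormalDist

sat = NormalDist(mu=1026, sigma=209)  # combined SAT scores, per the example

left_820 = sat.cdf(820)          # area to the left of 820
left_720 = sat.cdf(720)          # area to the left of 720

right_820 = 1 - left_820         # proportion scoring at least 820
between = left_820 - left_720    # proportion scoring between 720 and 820

print(f"area left of 820  = {left_820:.4f}")
print(f"area right of 820 = {right_820:.4f}")
print(f"area left of 720  = {left_720:.4f}")
print(f"between 720, 820  = {between:.4f}")
```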
The cool thing about working with
normally distributed data is that
we can manipulate it, and then
find answers to questions that
involve comparing seemingly non-
comparable distributions.
[Figure: two Normal curves, one with µ = 266 and σ = 15, the other with µ = 250 and σ = 20.]
Compared to vitamin supplements alone, vitamins and better food resulted in a much
smaller percentage of women with pregnancy terms below 8 months (4% vs. 31%).
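The two percentages can be reproduced from the two Normal curves shown above (means 266 and 250, standard deviations 15 and 20). This sketch assumes that "8 months" means 240 days, which the text does not state; under that assumption the curve with mean 266 gives about 4% and the curve with mean 250 gives about 31%.

```python
from statistics import NormalDist

THRESHOLD_DAYS = 240  # ASSUMPTION: "8 months" taken as 240 days

# Pregnancy length in days for the two groups, from the parameters shown above.
group_266 = NormalDist(mu=266, sigma=15)
group_250 = NormalDist(mu=250, sigma=20)

# Proportion of each group with pregnancy terms below the threshold.
below_266 = group_266.cdf(THRESHOLD_DAYS)
below_250 = group_250.cdf(THRESHOLD_DAYS)

print(f"N(266, 15): {100 * below_266:.0f}% below {THRESHOLD_DAYS} days")
print(f"N(250, 20): {100 * below_250:.0f}% below {THRESHOLD_DAYS} days")
```

Standardizing puts both groups on the same z scale, which is what makes the comparison possible even though their means and spreads differ.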
Inverse normal calculations
We may also want to find the observed range of values that corresponds
to a given proportion or area under the curve.
The longest 75% of pregnancies in this group last about 256 days or longer.
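An inverse Normal calculation goes from an area back to a value. The sketch below assumes that "this group" is the one with mean 266 days and standard deviation 15 days; that is an inference from the numbers, since the cutoff for the longest 75% is the value with 25% of the area to its left.

```python
from statistics import NormalDist

# ASSUMPTION: "this group" is N(266, 15), pregnancy length in days.
group = NormalDist(mu=266, sigma=15)

# The longest 75% lie to the right of the 25th percentile.
cutoff = group.inv_cdf(0.25)

print(f"25th percentile: {cutoff:.1f} days")
```

With Table A, the same steps are: find the z with area 0.25 to its left (about -0.67), then convert back with x = µ + zσ.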
Normal quantile plots
One way to assess if a distribution is indeed approximately normal is to
plot the data on a normal quantile plot.
The data points are ranked and the percentile ranks are converted to z-
scores with Table A. The z-scores are then used for the x axis against
which the data are plotted on the y axis of the normal quantile plot.
If the distribution is indeed normal the plot will show a straight line,
indicating a good match between the data and a normal distribution.
Normal quantile plots are complex to do by hand, but they are standard
features in most statistical software.
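The construction described above can be sketched without plotting: rank the data, convert each rank to a percentile, and convert each percentile to a z-score. The percentile formula (i − 0.5)/n is one common choice and is an assumption here; statistical software may use a slightly different one. The heights are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Return (z-score, data value) pairs for a normal quantile plot."""
    data = sorted(values)          # rank the data points
    n = len(data)
    std_normal = NormalDist()      # standard Normal curve N(0, 1)
    # ASSUMPTION: percentile rank of the i-th smallest value is (i - 0.5) / n.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(data, start=1)
    ]


heights = [61.0, 62.5, 63.0, 64.5, 64.5, 65.0, 66.0, 68.5]  # hypothetical
points = normal_quantile_points(heights)

for z, x in points:
    print(f"z = {z:+.2f}  x = {x}")
```

Plotting z on the x axis against the data on the y axis gives the normal quantile plot; points falling close to a straight line suggest an approximately Normal distribution.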
CrunchIt!: Stat > Summary Statistics > Columns
Select Women’s Heights (inches), choose n, Min, Q1, Median, Q3, Max, Mean, Std. Dev., then click OK.