cover
http://acr.keymath.com/KeyPressPortalV3.0/ImportingCourses/SIA2/fr...
1 de 1
10-03-2009 13:32
Statistics in Action
Understanding a World of Data
Second Edition
2008 Key Curriculum Press
copyright
Project Editor: Josephine Noah
Consulting Editor: Kendra Lockman
Project Administrators: Elizabeth Ball, Aaron Madrigal
Editorial Assistants: Aneesa Davenport, Nina Mamikunian
Teacher Consultant and Writer: Corey Andreasen, North High School, Sheboygan, Wisconsin
Mathematics Reviewer and Accuracy Checker: Cindy Clements, Trinidad State Junior College, Trinidad, Colorado
AP Sample Test Contributor: Joshua Zucker, Castilleja School, Palo Alto, California
AP Teacher Reviewers:
Angelo DeMattia, Columbia High School, Maplewood, New Jersey
Beth Fox-McManus, formerly of Alan C. Pope High School, Marietta, Georgia
Dan Johnson, Silver Creek High School, San Jose, California
Multicultural Reviewers:
Gil Cuevas, University of Miami, Coral Gables, Florida
Genevieve Lau, Skyline College, San Bruno, California
Beatrice Lumpkin, Malcolm X College (retired), Chicago, Illinois
Editorial Production Manager: Christine Osborne
Production Editor: Kristin Ferraioli
Copyeditor: Mary Roybal
Production Supervisor: Ann Rothenbuhler
Production Coordinator: Thomas Brierly
Text Designers: Graphic World, Thomas Brierly
Compositor: Graphic World
Art Editor and Technical Artist: LMP Media, Inc.
Cover Designers: Jensen Barnes, Nidaul Uk
Cover Photo Credit: Getty Images/Alberto Incrocci
Prepress and Printer: RR Donnelley
Textbook Product Manager: James Ryan
Executive Editor: Casey FitzSimons
Publisher: Steven Rasmussen
© 2008 by Key Curriculum Press. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, photocopying, recording, or otherwise, without the prior
written permission of the publisher.
Key Curriculum Press is a registered trademark of Key Curriculum Press. Fathom Dynamic
Data is a trademark of KCP Technologies. All other registered trademarks and trademarks in this
book are the property of their respective holders.
Key Curriculum Press
1150 65th Street
Emeryville, CA 94608
www.keypress.com
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1    11 10 09 08 07
ISBN: 978-1-55953-909-8
About the Authors
Ann E. Watkins is Professor of Mathematics at California State University,
Northridge (CSUN). She received her Ph.D. in education from the University of
California, Los Angeles. She is a former president of the Mathematical Association
of America (MAA) and a Fellow of the American Statistical Association (ASA).
Dr. Watkins has served as co-editor of College Mathematics Journal, as a member
of the Board of Editors of American Mathematical Monthly, and as Chair of the
Advanced Placement Statistics Development Committee. She was selected as the
1994-1995 CSUN Outstanding Professor and won the 1997 CSUN Award for the
Advancement of Teaching Effectiveness. Before moving to CSUN in 1990, she taught
for the Los Angeles Unified School District and at Pierce College in Los Angeles.
In addition to numerous journal articles, she is the co-author of books based on
work produced by the Activity-Based Statistics Project (co-authored with Richard
Scheaffer), the Quantitative Literacy Project, and the Core-Plus Mathematics Project.
Richard L. Scheaffer is Professor Emeritus of Statistics at the University of
Florida, where he served as chairman of the Department of Statistics for 12 years.
He received his Ph.D. in statistics from Florida State University. Dr. Scheaffer's
research interests are in the areas of sampling and applied probability, especially
in their applications to industrial processes. He has published numerous papers
and is co-author of four college-level textbooks.
In recent years, much of his effort has been directed toward statistics education
at the elementary, secondary, and college levels. He was one of the developers of the
Quantitative Literacy Project where he helped form the basis of the data analysis
emphasis in mathematics curriculum standards recommended by the National
Council of Teachers of Mathematics. Dr. Scheaffer has also directed the task force
that developed the Advanced Placement Statistics Program and served as its first
Chief Faculty Consultant. Dr. Scheaffer is a Fellow and past president of the American
Statistical Association, from which he received a Founders Award.
George W. Cobb is the Robert L. Rooke Professor of Statistics at Mt. Holyoke
College, where he served a three-year term as Dean of Studies. He received his Ph.D.
in statistics from Harvard University, and is an expert in statistics education with
a significant publication record. He chaired the joint committee on undergraduate
statistics of the Mathematical Association of America and the American Statistical
Association and is a Fellow of the American Statistical Association. He also led the
Statistical Thinking and Teaching Statistics (STATS) project of the Mathematical
Association of America, which helped professors of mathematics learn to teach
statistics. Over the past two decades, Dr. Cobb has frequently served as an expert
witness in lawsuits involving alleged employment discrimination.
Acknowledgments
This book is a product of what we have learned from the statisticians and teachers
who have been actively involved in helping the introductory statistics course
evolve into one that emphasizes activity-based learning of statistical concepts
while reflecting modern statistical practice. This book is written in the spirit of the
recommendations from the MAA's STATS project, the ASA's Quantitative Literacy
and GAISE projects, and the College Board's AP Statistics course. We hope that
it adequately reflects the wisdom and experience of those with whom we have
worked and who have inspired and taught us.
We owe special thanks to Corey Andreasen, an outstanding high school
mathematics and AP Statistics teacher at North High School in Sheboygan,
Wisconsin, for his insight into what makes a topic teachable to high school
students. His careful review of the manuscript led to many clarifications of
wording and improvements in exercises, all of which will make it easier for you to
learn the material. Corey also has made substantial contributions to the solutions
and teacher's notes, adding his unique perspective and sense of humor.
It has been an awesome experience to work with the Key Curriculum staff
and field-test teachers, who always put the interests of students and teachers first.
Their commitment to excellence has motivated us to do better than we ever could
have done on our own. Steve, Casey, Jim, Kristin B., Kristin F., and the rest of the
staff have been professional and astute throughout. Our deepest gratitude goes to
Cindy Clements and Josephine Noah, the editors of the first and second editions,
respectively, who have been a joy to work with. (Not all authors say that, and
mean it, about their editors.) Cindy and Josephine were outstanding high school
teachers before coming to Key. Their organizational skills, experience in the
classroom, and insight have improved every chapter of this text.
Contents

A Note to Students from the Authors

CHAPTER 1 Statistical Reasoning: Investigating a Claim of Discrimination
1.1 Discrimination in the Workplace: Data Exploration
1.2 Discrimination in the Workplace: Inference
Chapter Summary

CHAPTER 2 Exploring Distributions
2.1 Visualizing Distributions: Shape, Center, and Spread
2.2 Graphical Displays of Distributions
2.3 Measures of Center and Spread
2.4 Working with Summary Statistics
2.5 The Normal Distribution
Chapter Summary

CHAPTER 3 Relationships Between Two Quantitative Variables
3.1 Scatterplots
3.2 Getting a Line on the Pattern
3.3 Correlation: The Strength of a Linear Trend
3.4 Diagnostics: Looking for Features That the Summaries Miss
3.5 Shape-Changing Transformations
Chapter Summary

CHAPTER 4 Sample Surveys and Experiments
4.1 Why Take Samples, and How Not To
4.2 Random Sampling: Playing It Safe by Taking Chances
4.3 Experiments and Inference About Cause
4.4 Designing Experiments to Reduce Variability
Chapter Summary

CHAPTER 5 Probability Models
5.1 Constructing Models of Random Behavior
5.2 Using Simulation to Estimate Probabilities
5.3 The Addition Rule and Disjoint Events
5.4 Conditional Probability
5.5 Independent Events
Chapter Summary

CHAPTER 6 Probability Distributions
6.1 Random Variables and Expected Value
6.2 The Binomial Distribution
6.3 The Geometric Distribution
Chapter Summary

CHAPTER 7 Sampling Distributions
7.1 Generating Sampling Distributions
7.2 Sampling Distribution of the Sample Mean
7.3 Sampling Distribution of the Sample Proportion
Chapter Summary

CHAPTER 8 Inference for Proportions
8.1 Estimating a Proportion with Confidence
8.2 Testing a Proportion
8.3 A Confidence Interval for the Difference of Two Proportions
8.4 A Significance Test for the Difference of Two Proportions
8.5 Inference for Experiments
Chapter Summary

CHAPTER 9 Inference for Means
9.1 A Confidence Interval for a Mean
9.2 A Significance Test for a Mean
9.3 When Things Aren't Normal
9.4 Inference for the Difference Between Two Means
9.5 Paired Comparisons
Chapter Summary

CHAPTER 10 Chi-Square Tests
10.1 Testing a Probability Model: The Chi-Square Goodness-of-Fit Test
10.2 The Chi-Square Test of Homogeneity
10.3 The Chi-Square Test of Independence
Chapter Summary

CHAPTER 11 Inference for Regression
11.1 Variation in the Slope from Sample to Sample
11.2 Making Inferences About Slopes
11.3 Transforming for a Better Fit
Chapter Summary

CHAPTER 12 Statistics in Action: Case Studies
12.1 Mum's the Word!
12.2 Keeping Tabs on Americans
12.3 Baseball: Does Money Buy Success?
12.4 Martin v. Westvaco Revisited: Testing for Discrimination Against Employees

APPENDIX Statistical Tables
Table A: Standard Normal Probabilities
Table B: t-Distribution Critical Values
Table C: χ² Critical Values
Table D: Random Digits

Glossary
Brief Answers to Selected Problems
Index
Photo Credits
A Note to Students
from the Authors
Data enter the conversation whether you talk about income, sports, health,
politics, the weather, or prices of goods and services. In fact, in this age of
information technology, data come at you at such a rapid rate that you can catch
only a glimpse of the masses of numbers. You cannot cope intelligently in this
quantitative world unless you have an understanding of the basic concepts of
statistics and have had practice making informed decisions using real data.
Statistics in Action is designed for students taking an introductory high
school statistics course and includes all of the topics in the Advanced Placement
(AP) Statistics syllabus. Beginning in Chapter 1 with a court case about age
discrimination, you will be immersed in real-world problems that can be solved
only with statistical methods. You will learn to explore, summarize, and display data;
design surveys and experiments; use probability to understand random behavior;
make inferences about populations by looking at samples from those populations;
and make inferences about the effect of treatments from designed experiments.
After completing your statistics course, you will be prepared to take the AP
Statistics Exam, to take a follow-up college-level course, and, above all, to make
informed decisions in this world of data.
You will be using this book in a first course in statistics, so you aren't required
to know anything yet about statistics. You may find that your success in statistics
results more from your perseverance in trying to understand what you read than
from your skill with algebra. However, basic topics from algebra, such as slope,
linear equations, exponential equations, and the idea of a logarithm, will arise
throughout the book. Be prepared to review these as you go along.
Statistics from a Modern Perspective
Statistical work is more interactive than it was a generation ago. Computers and
graphing calculators have automated the graphical exploration of data and, in the
process, have made the practice of statistics a more visual enterprise. Statistical
techniques are also changing as simulations allow statisticians (and you) to shift
the emphasis from following recipes for calculations to paying more attention to
statistical concepts. Your instructor has selected this book so that you may learn
this modern, data-analytic approach to statistics and because he or she encourages
you to be an active participant in the classroom, wants you to see real data (if you
have only pretend data, you can only pretend to analyze it), believes that statistical
analyses must be tailored to the data, and uses graphing calculators or statistical
software for data analysis and for simulations.
These features grow out of the vigorous changes that have been reshaping the
practice of statistics and the teaching of statistics over the last quarter century.
The most basic question to ask about any data set is, "Where did the data come
from?" Good data for statistical analysis must come from a good plan for data
collection. Thus, Statistics in Action treats the design and analysis of experiments
honestly and thoroughly and discusses how these methods of collecting data differ
from observational studies. It then follows through on this theme by relating the
statistical analysis to the manner in which the data were collected.
What You Should Know About This Book
Throughout Statistics in Action you will see many graphical displays, lots of real
data, activities that introduce each major topic, computer printouts, questions
for you to discuss with your class, and practice problems so that you can be sure
you understand the basics before you move on. The practice problems are found
at the end of every section, organized by topic. You should work every problem
for those topics that you wish to learn. The answers to all practice problems and
odd-numbered exercises are in the back of the book.
Also in the back of the book you will find a glossary of statistical notation, a
glossary of statistical terminology, statistical tables, and an index so that you can
locate topics quickly. Two tables are reprinted inside the back cover of the book.
You should keep in mind that the emphasis throughout is on the development
of statistical concepts. Even when looking for a numerical answer in a practice
problem or exercise, think about the underlying concept that is being illustrated.
Concepts are carefully developed in the written text, so read the material
thoughtfully and strive to fully understand what is being explained. Concepts
are also developed in the discussion questions. These are designed for in-class
discussions, but regard them as part of your reading assignment, even if class
time is too limited for full discussion. Through reading, discussing, and working
practice problems and exercises, you will develop a profound understanding of
statistical thinking, and that will serve you well as a basis for a lifetime of coping
with a quantitative world.
Ann Watkins
Dick Scheaffer
George Cobb
Statistical Reasoning: Investigating a Claim of Discrimination
Were older workers discriminated against during a company's downsizing? When an
older worker felt he had been unfairly laid off, his lawyers called on a
statistician to help them evaluate the claim.
In the year Robert Martin turned 54, the Westvaco Corporation, which makes
paper products, decided to downsize. They laid off several members of the
engineering department, including Robert Martin. Later that year, he sued
Westvaco, claiming he had been laid off because of his age. A major piece of
Martin's case was based on a statistical analysis of the ages of the Westvaco
employees.
In the two sections of this chapter, you will get a chance to try your hand at
two very different kinds of statistical work, exploration and inference. Exploration
is an informal, open-ended examination of data. Your goal in the first section
will be to uncover and summarize patterns in data from Westvaco that bear on
the Martin case. You will try to formulate and answer basic questions such as
"Were those who were laid off older on average than those who weren't laid off?"
You can use any tools (graphs, averages, and so on) that you think might be
useful. Inference, which you'll use in the second section, is quite different from
exploration in that it follows strict rules and focuses on judging whether the
patterns you found are the sort you would expect. You'll use inference to decide
whether the patterns you find in the Westvaco data are the sort you would expect
from a company that does not discriminate on the basis of age, or whether further
investigation into possible age discrimination is needed.
The purpose of this first chapter is to familiarize you with the ideas of
statistical thinking before you involve yourself with the details of statistical
methods. It is easy to get caught in the trap of doing rather than understanding, of
asking how rather than why. You can't do statistics unless you learn the methods,
but you must not get so caught up in the details of the methods that you lose
sight of what they mean. Doing and thinking, method and meaning, will compete
for your attention throughout this course.
In this chapter, you will learn the basic ideas of
• exploring data: uncovering and summarizing patterns
• making inferences from data: deciding whether an observed feature of the
  data could reasonably be attributed to chance alone
These ideas will remain key components of the statistical concepts you'll develop
and study throughout this course.
1.1 Discrimination in the Workplace: Data Exploration
[Margin notes: Variables provide information about cases. Variability is what
statistics is all about. A distribution is a record of variability.]
Robert Martin was one of 50 people working in the engineering department of
Westvaco's envelope division. One spring, Westvaco's management went through
five rounds of planning for a reduction in their work force. In Round 1, they
eliminated 11 positions, and they eliminated 9 more in Round 2. By the time the
layoffs ended, after all five rounds, only 22 of the 50 workers had kept their jobs.
The average age in the department had fallen from 48 to 46.
After Martin, age 54, was laid off, he sued Westvaco for age discrimination.
Display 1.1 shows the data provided by Westvaco to Martin's lawyers. The
statistical analysis in the lawsuit used all 50 employees in the engineering
department of the envelope division, with separate analyses for salaried and
hourly workers. Each row in Display 1.1 corresponds to one worker, and each
column corresponds to a characteristic of the worker: job title, whether hourly
or salaried, month and year of birth, month and year of hire, and age at birthday
in 1991. The next-to-last column (Round) tells how the worker fared in the
downsizing: 1 means chosen for layoff in Round 1 of planning for the reduction
in force, 2 means chosen in Round 2, and so on for Rounds 3, 4, and 5; 0 means
not chosen for layoff.
The subjects (or objects) of statistical examination often are called cases.
In the rows in Display 1.1, the cases are individual Westvaco employees. Their
characteristics, in the columns, are the variables. If you pick a row and read
across, you find information about a single case. (For example, Robert Martin, in
Row 44, was salaried, was born in September 1937, was hired in October 1967,
was chosen for layoff in Round 2, and turned 54 in 1991.) Although reading
across might seem the natural way to read the table, in statistics you will often
find it useful to pick a column and read down. This gives you information about
a single variable as you range through all the cases. For example, pick Age, read
down the column, and notice the variability in the ages. It is variability like
this (the fact that individuals differ) that can make it a challenge to see patterns
in data and figure out what they mean.
Imagine: If there had been no variability (if all the workers had been of just
two ages, say, 30 and 50, and Westvaco had laid off all the 50-year-olds and kept
all the 30-year-olds), the conclusion would be obvious and there would be no
need for statistics. But real life is more subtle than that. The ages of the laid-off
workers varied, as did the ages of the workers retained. Statistical methods were
designed to cope with such variability. In fact, you might define statistics as the
science of learning from data in the presence of variability.
Although the bare fact that the ages vary is easy to see in the data table, the
pattern of those ages is not so easy to see. This pattern (what the values are and
how often each occurs) is their distribution. In order to see that pattern, a graph
is better than a table. The dot plot in Display 1.2 shows the distribution of the
ages of the 36 salaried employees who worked in the engineering department just
before the layoffs began.
[To learn how to create a dot plot on your calculator, see Calculator Note 1A.]
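The ideas of cases, variables, and a distribution can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the first row uses the values the text gives for Robert Martin, but every other row is a made-up placeholder (not data from Display 1.1), and the field names and helper functions are hypothetical.

```python
from collections import Counter

# Each dictionary is one case (a worker); each key is a variable.
# Only the first row comes from the text (Robert Martin, Row 44).
# The remaining rows are HYPOTHETICAL placeholders, not Westvaco data.
cases = [
    {"pay": "Salaried", "round": 2, "age": 54},  # Robert Martin (from the text)
    {"pay": "Salaried", "round": 0, "age": 30},  # hypothetical
    {"pay": "Salaried", "round": 0, "age": 48},  # hypothetical
    {"pay": "Salaried", "round": 1, "age": 54},  # hypothetical
    {"pay": "Salaried", "round": 1, "age": 61},  # hypothetical
    {"pay": "Salaried", "round": 0, "age": 48},  # hypothetical
]


def column(rows, variable):
    """Read down one column: the value of a single variable for every case."""
    return [row[variable] for row in rows]


def distribution(values):
    """Record what the values are and how often each occurs."""
    return Counter(values)


def dot_plot(values):
    """Build a rough text dot plot: one line per age, one 'o' per worker."""
    counts = distribution(values)
    lines = []
    for value in range(min(counts), max(counts) + 1):
        lines.append(f"{value:>3} | " + "o" * counts[value])
    return "\n".join(lines)


ages = column(cases, "age")
print(distribution(ages))
print(dot_plot(ages))
```

Reading across one dictionary gives everything about a single case; calling `column` gives one variable across all cases, which is the view a dot plot summarizes.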
Display 1.1 The data in Martin v. Westvaco. [ Source: Martin v. Envelope Division of Westvaco Corp.,
CA No. 92-03121-MAP, 850 Fed. Supp. 83 (1994).]
Display 1.2 Ages of the salaried workers. (Each dot represents a
worker; the age is shown by the position of the dot
along the scale below it.)
Display 1.2 provides some useful information about the variability in the ages,
but by itself doesn't tell anything about possible age discrimination in the layoffs.
For that, you need to distinguish between those salaried workers who lost their
jobs and those who didn't. The dot plot in Display 1.3, which shows those laid
off and those retained, provides weak evidence for Martin's case. Those laid off
generally were older than those who kept their jobs, but the pattern isn't striking.
Display 1.3 Salaried workers: ages of those laid off and those retained.
Display 1.3 shows that most salaried workers who were laid off were age 50
or older. However, this alone doesn't support Martin's case because most of the
workers were age 50 or older to begin with.
One way to proceed is to make a summary table. The table shown here
classifies the salaried workers according to age and whether they were laid off
or retained. (Using 50 as the dividing age between "younger" and "older" is a
somewhat arbitrary, but reasonable, decision.)
To decide whether Martin has a case, compare the proportion of salaried
workers under 50 who were laid off (6 out of 16, or 0.375) with the proportion of
those 50 or older who were laid off (12 out of 20, or 0.60). These proportions are
quite different, an argument in favor of Martin.
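The comparison above is simple arithmetic, and a short Python sketch makes each step explicit. The counts are the ones stated in the text (6 of 16 under 50, 12 of 20 age 50 or older); the variable and function names are assumptions chosen for illustration.

```python
# Counts taken from the text for the salaried workers:
# 6 of the 16 workers under 50 were laid off,
# 12 of the 20 workers age 50 or older were laid off.
laid_off = {"under 50": 6, "50 or older": 12}
total = {"under 50": 16, "50 or older": 20}


def proportion_laid_off(group):
    """Proportion of the workers in an age group who were laid off."""
    return laid_off[group] / total[group]


for group in total:
    retained = total[group] - laid_off[group]
    print(
        f"{group:>12}: laid off {laid_off[group]:>2}, retained {retained:>2}, "
        f"proportion laid off = {proportion_laid_off(group):.3f}"
    )

# How much larger the layoff proportion is for the older group.
difference = proportion_laid_off("50 or older") - proportion_laid_off("under 50")
print(f"difference in proportions = {difference:.3f}")
```

Note that each proportion is computed within an age group (laid off divided by the group total), which is the comparison that bears on whether older workers were laid off disproportionately.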
Looking at the layoffs of the salaried workers round by round provides further
evidence in favor of Martin. The dot plots in Display 1.4 show the ages of the
salaried workers laid off and retained by round. These new dot plots use different
symbols for laid-off workers and for retained workers. For example, in the top dot
plot, the open circles represent the salaried workers whose jobs did not survive
Round 1. In this round, the four oldest workers were laid off, but only one worker
under age 50 was laid off. In the second round, one of the two oldest employees
was laid off. But then this pattern stopped. Again, the evidence favors Martin but
is far from conclusive.
Display 1.4 Salaried workers: ages of those laid off (open circles)
and those retained (solid dots) in each round.
You might feel as if the analysis so far ignores important facts, such as worker
qualifications. That's true. However, the first step is to decide whether, based
on the data in Display 1.1, older workers were more likely to be laid off. If not,
Martin's case fails. If so, it is then up to Westvaco to justify its actions.
Exploring the Martin v. Westvaco Data
D1. Suppose you were the judge in the Martin v. Westvaco case. How would you
use the information in Display 1.1 to decide whether Westvaco tended to lay
off older workers in disproportionate numbers (for whatever reason)?
D2. Display 1.5 (on the next page) is like Display 1.3 except that it gives data
for the hourly workers. Compare the plots for the hourly and salaried
workers. Which provides stronger evidence in support of Martin's claim
of age discrimination?
Display 1.5 Hourly workers: ages of those laid off and those
retained.
D3. Whenever you think you have a message from data, you should be careful
not to jump to conclusions. The patterns in the Westvaco data might be
real (they reflect age discrimination on the part of management). On
the other hand, the patterns might be the result of chance (management
wasn't discriminating on the basis of age but simply by chance happened
to lay off a larger percentage of older workers). What's your opinion about
the Westvaco data: Do the patterns seem real, that is, too strong to be explained
by chance?
D4. The analysis up to this point ignores important facts such as worker
qualifications. Suppose Martin makes a convincing case that older workers
were more likely to be laid off. It is then up to Westvaco to justify its actions.
List several specific reasons Westvaco might give to justify laying off a
disproportionate number of older workers.
Summary 1.1: Data Exploration
Data exploration, or exploratory analysis, is a purposeful investigation to find
patterns in data, using tools such as tables and graphs to display those patterns
and statistical concepts such as di
Practice
Practice problems help you master basic concepts and computations. Throughout
this textbook, you should work all the practice problems for each topic you want
to learn. The answers to all practice problems are given in the back of the book.
Exploring the Martin v. Westvaco Data
P1. Construct a dot plot similar to Display 1.3, comparing the ages of hourly
workers who lost their jobs during Rounds 1-3 to the ages of hourly workers
who still had their jobs at the end of Round 3. How do the ages differ?
P2. This summary table classifies the hourly workers according to age and
whether they were laid off or retained.
a. What proportion of hourly workers under age 50 were laid off? Were not
laid off?
b. What proportion of laid-off hourly workers were under age 50? Were age 50
or older?
c. What two proportions should you compute and compare in order to decide
whether older hourly workers were disproportionately laid off? Make these
computations and give your conclusion.
d. Compare this table with the table for the salaried workers on page 6. For
which group does the evidence more strongly favor Martin's case?
P3. Display 1.6 shows layoffs and retentions by round for hourly workers. (There
is no plot for Round 5 because no hourly workers were chosen for layoff in that
round.) Compare the pattern for the hourly workers with the pattern for the
salaried workers in Display 1.4 on page 7. For which group does the evidence
more strongly favor Martin's case?
Display 1.6 Hourly workers: ages of those laid off (open circles) and those
retained (solid dots) in each round.
Exercises
E1. This summary table classifies salaried
workers as to whether they were laid off and
their age, this time using 40 as the cutoff
between younger and older workers.
2008 Key Curriculum Press
a. What proportion of workers age 40 or
older were laid off? What proportion of
laid-off workers were age 40 or older?
b. What proportion of workers under age 40
were laid off? What proportion were not
laid off?
c. What two proportions should you
compute and compare in order to
decide whether older workers were
disproportionately laid off? Make these
computations and give your conclusion.
d. Compare this table with the table for the
salaried workers on page 6, where 50 was
the age cutoff. If you were Martin's lawyer,
would you present a table using 40 or 50
as the cutoff?
E2. Explore whether hourly workers at Westvaco
were more likely than salaried workers to
lose their jobs.
a. Start by constructing a summary table to
display the relevant data.
b. Compute two proportions that will allow
you to make this comparison.
c. What do you conclude from comparing
the proportions?
E3. Twenty-two workers kept their jobs. Explore
whether the age distributions are similar for
the hourly and salaried workers who kept
their jobs.
a. Show the two age distributions on a pair
of dot plots that have the same scale. How
do these distributions differ?
b. Do your dot plots in part a support a
claim that Westvaco was more inclined to
keep older workers if they were salaried
rather than hourly?
E4. Consider these three facts from your work in
this section:
Salaried workers were more likely to keep
their jobs than were hourly workers.
Older workers were more likely to be laid
off than were younger workers.
Older workers were more likely to be
salaried than were younger workers.
Putting these three facts together, what can
you conclude? Is this evidence in favor of
Martin's case, or does it help Westvaco?
E5. Refer to Display 1.1 on page 5.
a. Create a summary table whose five cases
are Round 1 through Round 5 and
whose three variables are total number of
employees laid off in that round, number
of employees laid off in that round who
were 40 or older, and percentage laid off
in that round who were 40 or older.
b. Describe any patterns you find in the table
and what you think they might mean.
E6. "Last hired, first fired" is shorthand for
"When you have to downsize, start by laying
off the newest person, then the person hired
next before that, and work back in reverse
order of seniority." (The person who's been
working longest will be the last to be laid
off.) Examine the Westvaco data.
a. How was seniority related to the
decisions about layoffs in the engineering
department at Westvaco?
b. What explanation(s) can you suggest for
any patterns you find?
E7. Many tables in the media are arranged with
cases as rows and variables as columns. For
Displays 1.7 and 1.8 in parts a and b, identify
the cases and the variables. Then compute
the values missing from each table.
a.
Display 1.7 Number of bases stolen in a single season by the top five Major
League baseball players. [Source: mlb.mlb.com.]
b.
Display 1.8 New York Stock Exchange Activity for October 29, 1929.
[Source: marketplace.publicradio.org.]
E8. Suppose you are studying the effects of
poverty and plan to construct a data set
whose cases are the villages in Bolivia. Name
some variables that you might study.
1.2 Discrimination in the Workplace: Inference
Overall, the exploratory work on the Westvaco data set in Section 1.1 shows that
older workers were more likely than younger workers to be laid off and were laid
off earlier. One of the main arguments in the court case, along the lines set out
in D3, was about what those patterns mean: Can you infer from the patterns that
Westvaco has some explaining to do, or are they the sort of patterns that tend to
happen even in the absence of discrimination?
A comprehensive analysis of Martin v. Westvaco will have to wait for its
reappearance among the case studies of Chapter 12, when you'll be more familiar
with the concepts and tools of statistics. For now, you can get a pretty good idea
of how the analysis goes by working with a subset of the data.
The ages of the ten hourly workers involved in Round 2 of the layoffs,
arranged from youngest to oldest, were 25, 33, 35, 38, 48, 55, 55, 55, 56, and 64.
The three workers who were laid off were ages 55, 55, and 64. Display 1.9 shows
these data on a dot plot.
Display 1.9 Hourly workers: ages of those laid off and those
retained in Round 2.
Use a summary statistic
to condense the data.
To simplify the statistical analysis to come, it helps to condense the data into
a single number, called a summary statistic. One possible summary statistic is
the average, or mean, age of the three workers who lost their jobs:
\[ \frac{55 + 55 + 64}{3} = 58 \text{ years} \]
Knowing what to make of the data requires balancing two points of view. On
one hand, the pattern in the data is pretty striking. Of the five workers under age
50, all kept their jobs. Of the five who were 55 or older, only two kept their jobs.
On the other hand, the number of workers involved is small: only three out of ten.
Should you take seriously a pattern involving so few cases? Imagine two people
taking sides in an argument that was at the center of the statistical part of the
Martin case.
Martin: Look at the pattern in the data. All three of the workers laid
off were much older than the average age of all workers. That's
evidence of age discrimination.
Westvaco: Not so fast! You're looking at only ten workers total, and only
three positions were eliminated. Just one small change and the
picture would be entirely different. For example, suppose it had
been the 25-year-old instead of the 64-year-old who was laid off.
Switch the 25 and the 64, and you get a totally different set of
averages. (Ages marked with an asterisk are those selected for layoff.)
Actual Data:   25   33   35   38   48   55*  55*  55   56   64*
Altered Data:  25*  33   35   38   48   55*  55*  55   56   64
See! Make just one small change, and the average age of the three
who were laid off is lower than the average age of the others.
Martin: Not so fast yourself! Of all the possible changes, you picked the
one most favorable to your side. If you'd switched one of the
55-year-olds who got laid off with the 55-year-old who kept his
or her job, the averages wouldn't change at all. Why not compare
what actually happened with all the possibilities?
Westvaco: What do you mean?
Martin: Start with the ten workers, and pick three at random. Do this over
and over, to see what typically happens, and compare the actual
data with the results. Then we'll find out how likely it is that the
average age of those laid off would be 58 or greater.
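The averages that the two advocates argue about can be checked directly. The short Python sketch below is only an illustration, not part of the original case materials; the function name `group_means` is made up for this example. It computes the mean age of the laid-off and retained workers for the actual data and for Westvaco's altered data.

```python
from statistics import mean

# Ages of the ten hourly workers involved in Round 2
ages = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]

def group_means(all_ages, laid_off):
    """Return (mean age of those laid off, mean age of those retained)."""
    retained = list(all_ages)
    for age in laid_off:
        retained.remove(age)  # remove one matching worker per laid-off age
    return mean(laid_off), mean(retained)

actual = group_means(ages, [55, 55, 64])   # what actually happened
altered = group_means(ages, [25, 55, 55])  # Westvaco's "one small change"

print(f"Actual:  laid off {actual[0]:.1f}, retained {actual[1]:.1f}")
print(f"Altered: laid off {altered[0]:.1f}, retained {altered[1]:.1f}")
```

In the actual data the laid-off group is older on average than the retained group; after the switch the comparison reverses, which is exactly Westvaco's point.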
The dialogue between Martin and Westvaco describes one age-neutral
method for choosing which workers to lay off: Pick three workers completely at
random, with all sets of three having the same chance to be chosen.
Picking Workers at Random
D5. If you pick three of the ten ages at random, do you think you are likely to get
an average age of 58 or greater?
D6. If the probability of getting an average age of 58 or greater turns out to be
small, does this favor Martin or Westvaco?
Simulation requires a
chance model.
Activity 1.2a shows you how to estimate the probability of getting an average
age of 58 years or greater if you choose three workers at random. You will
use simulation, a procedure in which you set up a model of a chance process
(drawing three ages out of a box) that copies, or simulates, a real situation
(selecting three employees at random to lay off).
Activity 1.2a: By Chance or by Design?
What you'll need: paper or 3 × 5 cards, a box or other container
Let's test the process suggested by Martin's advocate.
1. Create a model of a chance process. Write each of the ten ages on identical
pieces of paper or 3 × 5 cards, and put the ten cards in a box. Mix them
thoroughly, draw out three (the ones to be laid off), and record the ages.
2. Compute a summary statistic. Compute the average of the three numbers in
your sample to one decimal place.
3. Repeat the process. Repeat steps 1 and 2 nine more times.
4. Display the distribution. Pool your results with the rest of your class and
display the average ages on a dot plot.
5. Estimate the probability. Count the number of times your class computed
an average age of 58 years or greater. Estimate the probability that simply
by chance the average age of those chosen would be 58 years or greater.
6. Interpret your results. What do you conclude from your class's estimate in
step 5?
[See Calculator Note 1B to learn how to do this kind of simulation with your
calculator.]
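If you would rather use a computer than cards or a calculator, the same steps can be sketched in Python. This is one possible version under stated assumptions, not the method of Calculator Note 1B: the function names, the seed, and the choice of 10,000 repetitions are all illustrative.

```python
import random
from statistics import mean

AGES = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
OBSERVED_MEAN = 58  # mean age of the three workers actually laid off

def simulate_mean_ages(ages, n_laid_off, repetitions, rng):
    """Steps 1-3: draw n_laid_off ages at random (without replacement),
    compute their mean, and repeat."""
    return [mean(rng.sample(ages, n_laid_off)) for _ in range(repetitions)]

def estimate_probability(means, threshold):
    """Step 5: proportion of repetitions with a mean at or above threshold."""
    return sum(m >= threshold for m in means) / len(means)

rng = random.Random(1)  # arbitrary seed so the run can be repeated
means = simulate_mean_ages(AGES, 3, 10_000, rng)
estimate = estimate_probability(means, OBSERVED_MEAN)
print(f"Estimated probability of a mean age of 58 or greater: {estimate:.3f}")
```

Each call to `rng.sample` plays the role of mixing the cards and drawing three. Because the draws are random, the estimate will vary a little from run to run, just as different classes will get somewhat different results.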
The simulation tells what
kind of data to expect if
workers are selected at
random for layoff.
Your simulation was completely age-neutral. All sets of three workers
had exactly the same chance of being selected for layoff, regardless of age. The
simulation tells you what results are reasonable to expect from that sort of
age-blind process.
Shown here are the first 4 of 200 repetitions from such a simulation. (The
ages in red are those selected for layoff.) The average ages of the workers selected
for layoff (42.7, 48.0, 42.7, and 37.0) are highlighted by the red dots in the
distribution of all 200 repetitions in Display 1.10.
Display 1.10 Results of 200 repetitions: the distribution of the
average age of the three workers chosen for layoff
by chance alone.
Here we take a closer
look at the logic.
Out of 200 repetitions, only 10, or 5%, gave an average age of 58 or greater. So
it is not at all likely that simply by chance you'd pick workers as old as the three
Westvaco picked. Did the company discriminate? There's no way to tell from the
numbers alone; Westvaco might have a good explanation. On the other hand,
if your simulation had told you that an average of 58 or greater is easy to get by
chance alone, then the data would provide no evidence of discrimination and
Westvaco wouldn't need to explain.
To better understand how this logic applies to Martin v. Westvaco, imagine a
realistic argument between the advocates for each side.
Martin: Look at the pattern in the data. All three of the workers laid off
were much older than average.
Westvaco: So what? You could get a result like that just by chance. If chance
alone can account for the pattern, there's no reason to ask us for
any other explanation.
Martin: Of course you could get this result by chance. The question is
whether it's easy or hard to do so. If it's easy to get an average
as large as 58 by drawing at random, I'll agree that we can't rule
out chance as one possible explanation. But if an average that
large is really hard to get from random draws, we agree that
it's not reasonable to say that chance alone accounts for the
pattern. Right?
Westvaco: Right.
Martin: Here are the results of my simulation. If you look at the three
hourly workers laid off in Round 2, the probability of getting an
average age of 58 or greater by chance alone is only 5%. And if you
do the same computations for the entire engineering department,
the probability is a lot lower, about 1%. What do you say to that?
Westvaco: Well . . . I'll agree that it's really hard to get an average age that
extreme simply by chance, but that by itself still doesn't prove
discrimination.
Martin: No, but I think it leaves you with some explaining to do!
In the actual case, Martin and Westvaco reached a settlement out of court
before the case went to trial.
The logic you've just seen is basic to all statistical inference, but it's not easy
to understand. In fact, it took mathematicians centuries to come up with the
ideas. It wasn't until the 1920s that a brilliant British biological scientist and
mathematician, R. A. Fisher, realized that results of agricultural experiments
may be analyzed in a way similar to that in Activity 1.2a to see whether observed
differences should be attributed to chance alone or to treatment. Calculus, in
contrast, was first understood in 1665. Precisely because it is so important, the
logic of using randomization as a basis for statistical inference will be seen over
and over again throughout this book. You'll have lots of time to practice with it.
Key Steps in a Simulation
Here is a summary of the steps in a simulation:
1. Model. Set up a model in which chance is the only cause of being selected.
In Activity 1.2a the model for an age-neutral chance process was to draw
three numbers at random from the set of ten ages. You put the ages of the
workers who could be laid off in a box and selected three by random draw.
2. Repetition. Repeat the process.
In the activity, you repeated the process of drawing ages many times.
3. Distribution. Display the distribution of the summary statistics, and
determine how likely the actual result or one even more extreme would be.
In the activity, you used the average age to summarize the results, although
other summary statistics also could be used. For each repetition, you
computed the average age of those laid off and plotted that average on your
dot plot. The simulation showed that the chance of an average age of 58 or
greater was only about 0.05.
4. Conclusion. If the probability is small (and the definition of small will vary
depending on the situation), conclude that some explanation other than
just chance should be considered. If the probability isn't small, conclude
that you can reasonably attribute the result to chance alone.
In Round 2 of the Martin v. Westvaco case, the small probability of 0.05
(1 chance out of 20) meant that Westvaco had some explaining to do. However,
the Round 2 evidence alone wouldn't have been enough to serve as evidence of
discrimination in a court case (which requires a probability of 0.025 or less).
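With only ten workers, the probability of 0.05 does not have to be estimated; it can be checked by listing every equally likely set of three workers. The Python sketch below is an illustration only; the function name `exact_tail_probability` is an assumption of this example. It counts the sets whose average age is 58 or greater.

```python
from fractions import Fraction
from itertools import combinations

AGES = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]

def exact_tail_probability(ages, k, threshold_mean):
    """Count the sets of k workers whose mean age is at least threshold_mean.

    combinations() chooses by position, so the three 55-year-olds are
    treated as three different workers, as they should be.
    """
    subsets = list(combinations(ages, k))
    # mean >= threshold is the same as sum >= threshold * k (avoids division)
    extreme = [s for s in subsets if sum(s) >= threshold_mean * k]
    return Fraction(len(extreme), len(subsets)), extreme

probability, extreme_sets = exact_tail_probability(AGES, 3, 58)
print(f"{len(extreme_sets)} extreme sets; probability = {float(probability)}")
```

Comparing sums rather than averages keeps the arithmetic exact, so no set is miscounted because of rounding.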
DISCUSSION
The Logic of Inference
D7. Why must you estimate the probability of getting an average age of 58
or greater rather than the probability of getting an average age of 58?
D8. How unlikely is too unlikely? The probability you estimated in Activity 1.2a
is in fact exactly equal to 0.05. In a typical court case, a probability of 0.025
or less is required to serve as evidence of discrimination.
a. Did the Round 2 layoffs of hourly workers in the Martin case meet the
court requirement?
b. If the probability in the Martin case had been 0.01 instead of 0.05, how
would that have changed your conclusions? 0.10 instead of 0.05?
D9. A friend wants to bet with you on the outcome of a coin flip. The coin looks
fair, but you decide to do a little checking. You flip the coin, and it lands
heads. You flip again: also heads. A third flip: heads. Flip: heads.
Flip: heads. You continue and the coin lands heads 19 times in 20 flips.
a. Explain why the evidence (19 heads in 20 flips) makes it hard to believe
the coin is fair.
b. Design and carry out a simulation to estimate how unusual this result
would be if the coin were fair.
Summary 1.2: Inference
Inference is a statistical procedure that involves deciding whether an event can
reasonably be attributed to chance or whether you should look for (and perhaps
investigate) some other explanation. In the Martin case, you used inference to
determine whether the relatively high average age of the laid-off hourly employees
in Round 2 could reasonably be due to chance.
Simulation is a useful device for inference.
First, you set up a model of a process in which chance is the only factor
influencing the outcomes.
The next stage is repetition: you repeat the process.
Then you plot the distribution of the summary statistics in order to determine
how likely the actual result or one even more extreme would be.
Finally, you reach (or infer) a conclusion. If the probability of getting a
summary statistic as extreme as that from your actual data is small, conclude
that chance isn't a reasonable explanation. If the probability isn't small,
conclude that you can reasonably attribute the result to chance alone.
In the Martin case, the probability was about 0.05, which was considered small
enough to warrant asking for an explanation from Westvaco but not small enough
to present in court as clear evidence of discrimination.
Practice
The Logic of Inference
P4. Suppose three workers were laid off from a
set of ten whose ages were the same as those
of the hourly workers in Round 2 in the
Martin case. This time, however, the ages of
those laid off were 48, 55, and 55.
25 33 35 38 48 55 55 55 56 64
a. Use the dot plot in Display 1.10 on page
14 to estimate the probability of getting
an average age as large as or larger than
that of those laid off in this situation.
b. What would your conclusion be if
Westvaco had laid off workers of these
three ages?
P5. At the beginning of Round 1, there were 14
hourly workers. Their ages were 22, 25, 33,
35, 38, 48, 53, 55, 55, 55, 55, 56, 59, and 64.
After the layoffs were complete, the ages
of those left were 25, 38, 48, and 56. Think
about how you would repeat Activity 1.2a
using these data.
a. What is the average age of the ten workers
laid off?
b. Describe a simulation for finding the
distribution of the average age of ten
workers laid off at random.
c. The results of 200 repetitions from a
simulation are shown in Display 1.11.
Suppose 10 workers are picked at random
for layoff from the 14 hourly workers.
Make a rough estimate of the probability
of getting, just by chance, the same or
larger average age as that of the workers
who actually were laid off (from part a).
d. Does this analysis provide evidence in
Martin's favor?
Display 1.11 Results of 200 repetitions.
Exercises
E9. Revisit the idea of the simulation in Activity
1.2a, this time for all 14 hourly workers and
using a different summary statistic. Use
as your summary statistic the number of
hourly workers laid off who were 40 or older.
The ages listed here are those of the hourly
workers, with the ages of those laid off
marked with an asterisk. Note that, of the ten hourly workers laid
off by Westvaco, seven were age 40 or older.
22* 25  33* 35* 38  48  53*
55* 55* 55* 55* 56  59* 64*
a. Write the 14 ages on 14 slips of paper
and draw 10 at random to be chosen for
layoff. How many of the 10 are age 40 or
older?
b. The dot plot in Display 1.12 shows the
results of 50 repetitions of this simulation.
Estimate the probability that, by chance,
seven or more of the ten hourly workers
who were laid off would be age 40 or older.
Display 1.12 Results of 50 repetitions: the
distribution of the number of workers
age 40 or older from ten randomly
selected workers.
c. Do you conclude that the proportion of
laid-off workers age 40 or older could
reasonably be due to chance alone,
or should Westvaco be asked for an
explanation?
E10. The ages of the ten hourly workers left after
Round 1 are given here. The ages of the four
workers laid off in Rounds 2 and 3 are marked
with an asterisk. Their average age is 57.25.
25  33  35  38  48  55* 55* 55* 56  64*
a. Describe how to simulate the chance of
getting an average age of 57.25 or more
using the methods of Activity 1.2a.
b. Perform your simulation once and
compute the average age of the four
hourly workers laid off.
c. The dot plot in Display 1.13 shows
the results of 200 repetitions of this
simulation. What is your estimate of the
probability of getting an average age as
great as or greater than Westvaco did if
four workers are picked at random for
layoff in Rounds 2 and 3 from the ten
hourly workers remaining after Round 1?
Display 1.13 Results of 200 repetitions: mean ages
of three randomly selected workers.
d. What is your conclusion?
E11. Mrs. Garcia was not happy when she found
that her baker had raised the price of a loaf
of bread, and she let him know it. However,
she did buy her usual three loaves of bread.
They seemed a little light, so she asked that
they be weighed and that the other eight
loaves the baker could have given her also
be weighed. The other eight loaves weighed
14, 15, 15, 16, 16, 17, 17, and 18 ounces. The
three loaves Mrs. Garcia was given weighed
14, 15, and 16 ounces. The baker claimed
that he picked the three loaves at random.
In Display 1.14, each dot represents the
average weight of a random sample of three
loaves.
Display 1.14 Results of 200 repetitions: average
weight of three randomly selected
loaves.
Which of these conclusions should Mrs.
Garcia draw?
A. Because the probability of getting an
average weight as low as or lower than that
of Mrs. Garcia's three loaves is small, Mrs.
Garcia should not be suspicious that the
baker deliberately gave her lighter loaves.
B. Because the probability of getting an
average weight as low as or lower than
that of Mrs. Garcia's three loaves is small,
Mrs. Garcia should be suspicious that the
baker deliberately gave her lighter loaves.
C. Because the probability of getting an
average weight as low as or lower than
that of Mrs. Garcia's three loaves is
fairly large, Mrs. Garcia should not be
suspicious that the baker deliberately gave
her lighter loaves.
D. Because the probability of getting an
average weight as low as or lower than that
of Mrs. Garcia's three loaves is fairly large,
Mrs. Garcia should be suspicious that the
baker deliberately gave her lighter loaves.
E12. Snow in July? You have spent some time in
Oz. You think the date back in Kansas is
July 4, but you can't be sure because days
might not have the same length in Oz as
on Earth. A friendly tornado puts you and
your dog Toto down in Kansas. However,
you see snow falling (data). Which of these
inferences should you make?
A. If this is Kansas, it is very unlikely to
be snowing on July 4. Therefore, this
probably isn't Kansas.
B. If it is July 4, it is very unlikely to be
snowing in Kansas. Therefore, this
probably isn't July 4.
C. If it is snowing in Kansas on July 4, it is
time to go back to Oz.
D. If this is Kansas and it is July 4, it
probably isn't really snowing.
E13. For some situations, instead of using
simulation, it is possible to find exact
probabilities by counting equally likely
outcomes. Suppose only two out of the ten
hourly workers had been laid off in Round
2 and that those two workers were ages 55
and 64, with an average age of 59.5. It is
straightforward, though tedious, to list all
possible pairs of workers who might have
been chosen. Here's the beginning of a
systematic listing. The first nine outcomes
include the 25-year-old and one other. The
next eight outcomes include the 33-year-old
and one other, but not the 25-year-old
because the pair {25, 33} was already counted.
a. How many possible pairs are there?
(Don't list them all!)
b. How many pairs give an average age of
59.5 or greater? (Do list them.)
c. If the pair is chosen completely at
random, then all possibilities are equally
likely and the probability of getting an
average age of 59.5 or greater equals the
number of pairs with an average of 59.5
or more divided by the total number of
possible pairs. What is the probability?
d. Is the evidence of age discrimination
strong or weak?
E14. In this exercise, you will follow the same
steps as in E13 to find the probability of
getting an average age of 58 or greater when
drawing three hourly workers at random in
Round 2. The number of ways to pick three
different workers from ten to lay off is
\[ \binom{10}{3} = \frac{10!}{3!\,7!} = 120 \]
[See Calculator Note 1C to learn how to
compute numbers of combinations.]
a. List the ways that give an average age of
58 or greater.
b. Compute the probability of getting an
average age of 58 or greater when three
workers are selected for layoff at random.
c. How does this probability compare to the
results of your class simulation in Activity
1.2a? Why do the two probabilities differ
(if they do)?
Chapter Summary
In this chapter, you explored the data from an actual case of alleged age
discrimination, looking for evidence you considered relevant. You then saw how
to use statistical reasoning to test the strength of the evidence: Are the patterns
in the data solid enough to support Martin's claim of age discrimination, or are
they the sort that you would expect to occur even if there was no discrimination?
Along the way you made a substantial start at learning many of the most
important statistical terms and concepts: distribution, cases and variables,
summary statistic, simulation, and how to determine whether the result from
the real-life situation can reasonably be attributed to chance alone or whether an
explanation is called for.
You have practiced both thinking like a statistician and reporting your results
like a statistician. Throughout this textbook, you will be asked to justify your
answers in the real-world context. This includes stating assumptions, giving
appropriate plots and computations, and writing a conclusion in context.
The last chapter of this book includes a final look at the Martin case.
Review Exercises
E15. A teacher had two statistics classes, and
students could enroll in either the earlier
class or the later class. Final grades in the
courses are given here.
Earlier class:
99 95 69
91 79 67 64 54 68
47 53 86 100 95 45 41 59 66
Later class:
84 68 94 77 88 75 88 91 83
61 97 75 37 82 62 49 43 93
a. Display these data on dot plots so that
you can compare the two classes.
b. What conclusion can you draw from the
dot plots? Could the difference between
the two classes reasonably be attributed
to chance, or do you think the teacher
should look for an explanation?
E16. Refer to the data in E15.
a. Make a table that divides the course
grades into fail (less than 60) and pass
(60 or more) for the two classes.
b. What proportion of the students in the
earlier class passed? What proportion of
students who passed were in the earlier
class? What proportion of students
passed overall?
c. What two proportions should you
compute and compare in order to decide
whether a disproportionate number of
passing students enrolled in the earlier
class? Make these computations and give
your conclusion.
E17. This table classifies the Westvaco workers by
whether they were laid off and whether they
were under age 50 or were age 50 or older.
Choose the best conclusion to draw from
this table.
A. The table supports a claim of age
discrimination because most of the
people who were laid off were 50 or older.
B. The table supports a claim of age
discrimination because a larger
percentage of the people age 50 or older
were laid off than people under 50.
C. The table supports a claim of age
discrimination because a larger
percentage of the laid-off people were
age 50 or older than were under 50.
D. The table does not support a claim of age
discrimination because the number of
people under age 50 who were laid off is
equal to the number of people age 50 or
older who were retained.
E. The table does not support a claim of
age discrimination because a larger
percentage of people were 50 or older to
begin with.
E18. Display 1.15 contains information about
the planets in our solar system. The radius
of each planet is given in miles, and the
temperature is the average at the surface.
What are the cases? What are the variables?
Display 1.15 Data about planets in our solar system.
[Source: solarsystem.nasa.gov.]
E19. Earlier you studied the summary table of
salaried workers classified according to age
and to whether they were laid off or retained,
using 50 as the dividing age. That table is
shown again here.
              Laid Off   Retained   Total
Under 50          6         10        16
50 or Older      12          8        20
Total            18         18        36
The proportion of those under age 50 who
were laid off (6 out of 16, or 0.375) is smaller
than the proportion of those age 50 or older
who were laid off (12 out of 20, or 0.60).
The key question, however, is "Is the actual
12 versus 6 split of those laid off consistent
with selecting workers at random for layoff?"
a. Which of these demonstrations would
help Martin's case?
A. Showing that it's not unusual to get
12 or more older workers if 18 workers
are selected at random for layoff
B. Showing that it's pretty unusual to get
12 or more older workers if 18 workers
are selected at random for layoff
b. To investigate this situation using
simulation, follow these steps once and
record your results.
1. Make 36 identical white cards. Label
20 cards with an O for "50 or over"
and 16 cards with a U for "under 50."
2. Mix the cards in a bag and select the
18 to be laid off at random.
3. Count the number of Os among the
18 selected.
c. Display 1.16 shows the results of a
computer simulation of 200 repetitions
conducted according to the rules given
in part b. Where would your result from
part b be placed on this dot plot?
Display 1.16 Results of 200 repetitions: the
distribution of the number of Os in
18 randomly selected cards.
d. From this simulation, a total of 26 out
of 200 repetitions gave counts of 50
or older that were 12 or more. What
percentage of the repetitions is this? Is
this percentage small enough to cast
serious doubt on a claim that those laid
off were chosen by chance?
E20. The Eastbanko Company had fifteen workers
before laying off five of them. The ages of the
fifteen workers were 22, 23, 25, 31, 34, 36,
37, 40, 41, 43, 44, 50, 55, 55, and 60, with the
ages of the five laid-off workers in bold.
The dot plot in Display 1.17 gives the results
of a simulation with 600 repetitions for the
average age of five of these workers chosen
for layoff at random. Each dot represents
the average of five ages, rounded down to a
whole number.
Display 1.17 Results of 600 repetitions: the
distribution of the average age of
five randomly chosen workers.
a. Estimate the probability that if workers
E23. The Society for the Preservation of Wild
Gnus held a raffle last week and sold 50
were selected by chance alone for layoff ,
the average age of those laid off would be
tickets. The two lucky participants whose
as large as or larger than the average age
tickets were drawn received all-day passes
of those in the actual layoffs.
to the Wild Gnu Park in Florida. But there
was a near riot when the winners were
b. If the 60-year-old sues for age
announcedboth winning tickets belonged
discrimination, would Eastbanko have
to society president Filbert Newmans
some explaining to do?
cousins. After some intense questioning
by angry ticket holders, it was determined
that only 4 of the 50 tickets belonged to
Newman's cousins and the other 46 tickets
belonged to people who were not part of his
family. Newman's final comment to the press
was "Hey kids, I guess we were just lucky.
Deal with it."
One member of the Gnu Society was taking
a statistics class and decided to "deal with it"
by simulating the drawing. He put 50 tickets
in a bowl; 4 of the tickets were marked "C"
for cousin and 46 were marked "N" for
not a cousin. The statistics student drew
two tickets at random and kept track of the
number of cousins picked. After doing this
1000 times, the student found that 844 draws
resulted in two "N"s, 149 in one "N" and one "C",
and only 7 in two "C"s.
a. Use the results of the simulation to
estimate the probability that, in a fair
drawing, both winning tickets would be
held by Newman's cousins.
b. Using the probability you estimated
in part a, write a short paragraph that
the statistics student can send to other
members of the Gnu Society.
c. Is it possible that Newman's cousins won
the prizes by chance alone? Explain.
d. Using reasoning like that in E13 and E14,
compute the exact probability that, in a
fair drawing, both winning tickets would
be held by Newman's cousins.
E21. In E9, you conducted a simulation to
estimate the probability that, just by chance,
seven or more of the ten hourly workers
who were laid off would be age 40 or older.
The ages of the fourteen hourly workers were
22, 25, 33, 35, 38, 48, 53, 55, 55, 55, 55, 56,
59, and 64.
a. How many ways can 10 workers be
selected from 14 workers for layoff?
b. If there are a total of 10 layoffs, what
numbers of older workers would it have
been possible to lay off?
c. Using your calculator, find the number of
ways that you can lay off
i. seven older workers and three
younger workers
ii. eight older workers and two younger
workers
iii. nine older workers and one younger
worker
d. What is the probability that you will get
7 or more workers age 40 or older if you
select 10 of the 14 workers completely at
random for layoff?
E22. Refer to your reasoning in E14, where you
computed the probability that the three
workers laid off in Round 2 would have an
average age of 58 or greater. Describe how
your reasoning and conclusions would be
different if the workers' ages were 25, 33, 35,
38, 48, 55, 55, 55, 55, and 55, and the three
workers chosen for layoff were all age 55. Is
the evidence stronger or weaker for Martin
in this situation than in E14?
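The ticket drawing that the statistics student simulated by hand in the Gnu Society exercise can also be sketched in code. The following is a minimal Python sketch, not part of the original exercise: the function name simulate_drawing, its parameters, and the seed are illustrative assumptions. Because the draws are random, the tallies will generally differ from the student's 844, 149, and 7.

```python
import random

def simulate_drawing(num_tickets=50, num_cousins=4, winners=2, reps=1000, seed=None):
    """Repeat a fair drawing `reps` times and tally how many cousins win each time."""
    rng = random.Random(seed)
    # Mark each ticket "C" (cousin) or "N" (not a cousin), as in the exercise.
    tickets = ["C"] * num_cousins + ["N"] * (num_tickets - num_cousins)
    tally = {k: 0 for k in range(winners + 1)}
    for _ in range(reps):
        drawn = rng.sample(tickets, winners)  # draw without replacement
        tally[drawn.count("C")] += 1
    return tally

tally = simulate_drawing(seed=1)  # the seed is an arbitrary choice for repeatability
for cousins, count in sorted(tally.items()):
    print(f"{cousins} cousin(s) among the winners: {count} of 1000 drawings")
```

Dividing the tally for two cousins by the number of repetitions gives a simulation-based estimate of the kind asked for in part a.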
AP1. This plot shows the ages of the part-time
and full-time students who receive financial
aid at a small college. Which of the following
is a conclusion about students at this college
that cannot be drawn from the plot alone?
Part-time students who receive financial aid tend to be older than full-time students who receive financial aid.
A larger proportion of part-time students than full-time students receive financial aid.
The oldest student receiving financial aid is a full-time student.
No student under age 18 receives financial aid.
More part-time students than full-time students receive financial aid.
AP2. This table classifies hourly and salaried
workers as to whether they were laid off.
Do the data support a claim that hourly
workers are being treated unfairly?
Yes, because most people laid off were hourly workers.
Yes, because a bigger proportion of hourly workers were laid off than salaried workers.
No, because half of hourly workers were laid off and half were not.
No, because more than half of workers were hourly and less than half salaried.
No, because half of hourly workers were laid off, but more than half of salaried workers were laid off.
AP3. This table shows the number of male
and female applicants who applied and
were either admitted to or rejected from
a graduate program. What proportion of
admitted applicants were female?
AP4. For the data in AP3, in order to determine
if there is evidence to continue investigating
whether the graduate admissions process
discriminates against females, a study
takes a random sample of 25 out of the
70 applicants to be the admitted group.
The proportion of females in the sample
was computed. This process was repeated
for a total of 50 random samples and the
results are graphed below. What is the best
conclusion to draw from this simulation?
The actual proportion of females among those admitted is very near the center of this distribution, so there is no evidence of discrimination.
The actual proportion of females among those admitted is very near the center of this distribution, so there is strong evidence of discrimination in favor of female applicants.
The actual proportion of females among those admitted is very near the center of this distribution, so there is strong evidence of discrimination against female applicants.
The actual proportion of females among those admitted is quite a bit above the center of this distribution, so there is strong evidence of discrimination in favor of female applicants.
The actual proportion of females among those admitted is quite a bit below the center of this distribution, so there is strong evidence of discrimination against female applicants.
Investigative Tasks
AP5. People with asthma often use an inhaler
to help open up their lungs and breathing
passages. A pharmaceutical company has
come up with a new compound to put in
the inhaler that, they believe, will open
up the lungs of the user even more than
the standard compound tends to do. Ten
volunteers with asthma are randomly split
into two groups: one group uses the new
compound B and the other uses the standard
compound A. The measurements listed in
Display 1.18 are the increase in lung capacity
(in liters) 1 hour after the use of the inhaler.
Display 1.18 Increase in lung capacity, in liters,
1 hour after use of an inhaler
containing a compound.
a. From simply studying the data in the
table, do you think compound B does
better than compound A in increasing
lung capacity?
b. Construct dot plots for compounds
A and B. Does it now appear that
compound B tends to give larger
measurements than compound A?
c. Find the average increase in lung
capacity for compound A and for
compound B. When you compare
these means, does it look to you as if
compound B is better than compound A
at opening up the lungs?
AP6. Refer to AP5. Your task now is to see
whether the observed difference in the
means of each treatment group reasonably
could be attributed to chance alone.
a. Place the ten measurements on separate
slips of paper and mix them in a bag.
Select five at random to play the role of
the A treatment group; the other five
play the role of the B treatment group.
This time you will use as your summary
statistic the difference between the
means of each treatment group. Calculate
this difference, mean (compound B) -
mean (compound A), for your sample.
b. The dot plot in Display 1.19 shows
the results of 50 repetitions of this
simulation. Compute the difference
between the means for the actual data.
Locate this difference on the dot plot.
How many simulated differences exceed
the actual difference? What proportion?
Display 1.19 Results of 50 repetitions: the
distribution of the difference between
the means of two randomly selected
groups of five values.
c. In light of this simulation, do you think
it is reasonable to attribute the actual
difference to chance alone? Explain.
Exploring Distributions
What does the
distribution of female
heights look like?
Statistics gives you
the tools to visualize
and describe large sets
of data.
Raw data (a long list of values) are hard to make sense of. Suppose, for
example, that you are applying to the University of Michigan at Ann Arbor and
wonder how your SAT I math score of 650 compares with those of the students
who attend that university. If all you have are raw data (a list of the SAT I math
scores of the 25,000 students at the University of Michigan), it would take a lot
of time and effort to make sense of the numbers.
Suppose instead that you read the summary in the university's guide, which
says that the middle 50% of the SAT math scores were between 630 and 720, with
half above 680 and half below. Now you know that your score of 650, though in
the bottom half of the scores, is not far from the center value of 680 and is above
the bottom quarter.
Notice that the summary of the scores gives you two different kinds of
information for the middle 50%: the center, 680, and the spread, from 630 to 720.
Often that's all you need, especially if the shape of the distribution is one of a few
standard shapes you'll learn about in this chapter.
These three features (shape, center, and spread) can sometimes take
you a surprisingly long way in data analysis. For example, in Chapter 1 you did
a simulation to answer this question: If you choose three people at random
from a set of ten people and compute the average age of the ones you choose,
how likely is it that you get an average of 58 years or more? Generally, you
don't need to do all this work! Using shape, center, and spread, you can get an
answer without doing a simulation. This remarkable fact, which first began to
come to light in the late 1600s, helped make statistical inference possible in the
20th century before the age of computers. In the next several chapters, you'll learn
how to make good use of this fact.
In this chapter, you begin your systematic study
of distributions by learning how to
make and interpret different kinds of plots
describe the shapes of distributions
choose and compute a measure of center
choose and compute a useful measure of spread (variability)
work with the normal distribution
2.1 Visualizing Distributions: Shape, Center, and Spread
Summaries simplify. In fact, summaries sometimes can oversimplify, which means
it is important to know when to use summaries and which summaries to use.
Often the right choice depends on the shape of your distribution. To help you
build your visual intuition about how shape and summaries are related, this first
section of the chapter introduces various shapes and asks you to estimate some
summary values visually. Later sections will tell you how to compute summary
values numerically.
Distributions come in a variety of shapes. Four of the most common shapes
are illustrated in the rest of this section.
Uniform (Rectangular) Distributions
The uniform distribution
is rectangular.
The plot to the left shows the shape of a uniform or rectangular distribution, in
which all values occur equally often. How uniform is a sample of values taken
from a uniform distribution? In the next activity, you will find out.
Distributing Digits
What you'll need: one page from a phone book for each member of the class,
and a box of slips of paper, with one slip for each member of your class, half
labeled "phone book" and half "fake it"
1. Suppose your class made a dot plot of the last digits of every phone
number in the phone book. (This would take a very long time!) Sketch what
you think this plot would look like.
2. Draw a slip of paper from the box.
If the slip you drew says "phone book," use the page from the phone
book, start at a random spot, and write down the last digit of each of the
next 30 phone numbers. Using a full sheet of paper, plot your 30 digits
on a dot plot, using a scale like the one here. Use big dots so that they
can be seen from across the room.
If the slip you drew says "fake it," don't use the page from the phone
book but instead make up and plot 30 digits on a dot plot using a scale
like the one on the previous page. Try to make the distribution look like
the digits might have come from the phone book.
Write your name on the front of the plot, but not which method you
used. Don't consult with other students while you are doing this step.
3. Post the dot plots around the room and compare. Which plots are you
confident came from the phone book? Which are you confident came from
made-up digits? (Don't say anything about your own plot.)
4. Find your plot and write a large "P" (for phone book) or "F" (for faked it)
on the front. Check your predictions from step 3. What differences do you
see in the two groups of plots?
5. In this activity, it was important that you sampled from the phone book
in such a way that all digits were equally likely to occur. Why did step 1
specify that you use the last digit of the phone number and not, say,
the first?
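If no phone book is at hand, the "phone book" half of this activity can be imitated with a random number generator. The following is a minimal Python sketch under the assumption that last digits behave like digits drawn with equal chances from 0 through 9; the function names and the seed are illustrative and do not come from the activity.

```python
import random
from collections import Counter

def sample_last_digits(n=30, seed=None):
    """Draw n digits, each of 0 through 9 equally likely (a stand-in for last digits)."""
    rng = random.Random(seed)
    return [rng.randrange(10) for _ in range(n)]

def text_dot_plot(digits):
    """Build a sideways dot plot: one row per digit, one 'o' per occurrence."""
    counts = Counter(digits)
    return "\n".join(f"{d}: {'o' * counts[d]}" for d in range(10))

digits = sample_last_digits(seed=2)  # the seed is an arbitrary choice for repeatability
print(text_dot_plot(digits))
```

Running the sketch with several different seeds gives a feel for how uneven a sample of only 30 values from a uniform distribution can look.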
The number of births per month in a year is another set of data you might expect
to be fairly uniform. Or, is there a reason to believe that more babies are born in
one month than in another? Display 2.1 shows a table and plot of U.S. births (in
thousands) for 2003.
Display 2.1
An example of a (roughly) uniform distribution:
births per month in the United States, 2003.
[Source: Centers for Disease Control and Prevention.]
The plot shows that there is actually little change from month to month;
that is, we see a roughly uniform distribution of births across the months. To
summarize this distribution, you might write "The distribution of births is
roughly uniform over the months January through December, with about
340,000 births per month."
Computers and many calculators generate random numbers between 0 and
1 with a uniform distribution. Display 2.2 shows a dot plot of 1000 random
numbers generated by statistical software. There is some variability in the
frequencies, but, as expected, about 20% of the random numbers fall between, for
example, 0.2 and 0.4. [See Calculator Note 2A to learn how to create a distribution of
random numbers using your calculator.]
Display 2.2
Dot plot of 1000 random numbers from a uniform
distribution. Each dot represents two points.
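The claim that about 20% of uniform random numbers fall between 0.2 and 0.4 can be checked with a few lines of code. This is a minimal Python sketch using the standard random module rather than the statistical software or calculator mentioned in the text; the helper name fraction_between and the seed are illustrative assumptions.

```python
import random

def fraction_between(values, low, high):
    """Share of values v with low <= v < high."""
    return sum(low <= v < high for v in values) / len(values)

rng = random.Random(3)  # the seed is an arbitrary choice for repeatability
values = [rng.random() for _ in range(1000)]  # uniform on the interval from 0 to 1

share = fraction_between(values, 0.2, 0.4)
print(f"Share of the 1000 values between 0.2 and 0.4: {share:.3f}")
```

The share will vary somewhat from run to run, which is the variability in the frequencies that the text describes.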
Uniform Distributions
D1. Think of other situations that you would expect to be uniform distributions
a. over the days of the week
b. over the digits 0, 1, 2, . . . , 9
D2. Think of situations that you would expect to be very nonuniform
distributions
a. over the months of the year
b. over the days of the month
c. over the digits 0, 1, 2, . . . , 9
d. over the days of the week
Normal Distributions
Activity 2.1b introduces one of the most important common shapes of
distributions and one of the common ways this shape is produced. What happens
when different people measure the same distance or the same feature of very
similar objects? In the activity, you'll measure a tennis ball with a ruler, but the
results you get will reflect what happens even if you use very precise instruments
under carefully controlled conditions. For example, a 10-gram platinum weight
is used for calibration of scales all across the United States. When scientists at the
National Institute of Standards and Technology use an analytical balance for the
weight's weekly weighing, they face a similar challenge due to variability.
Measuring Diameters
What you'll need: a tennis ball, a ruler with a centimeter scale
1. With your partner, plan a method for measuring the diameter of the tennis
ball with the centimeter ruler.
2. Using your method, make two measurements of
the diameter of your tennis ball to the nearest
millimeter.
3. Combine your data with those of the rest of the
class and make a dot plot. Speculate first about
the shape you expect for the distribution.
4. Shape. What is the approximate shape of the
plot? Are there clusters and gaps or unusual
values (outliers) in the data?
5. Center and spread. Choose two numbers that
seem reasonable for completing this sentence:
"Our typical diameter measurement is about ?, give or take about
?." (More than one reasonable set of choices is possible.)
6. Discuss some possible reasons for the variability in the measurements.
How could the variability be reduced? Can the variability be eliminated
entirely?
The normal distribution
is bell-shaped.
The measurements of the diameter of a tennis ball taken by your class in
Activity 2.1b probably were not uniform. More likely, they piled up around
some central value, with a few measurements far away on the low side and a
few far away on the high side. This common bell shape has an idealized version,
the normal distribution, that is especially important in statistics.
Pennies minted in the United States are supposed to weigh 3.110 g, but a
tolerance of 0.130 g is allowed in either direction. Display 2.3 shows a plot of the
weights of 100 pennies.
Display 2.3
Weights of pennies. [Source: W. J. Youden, Experimentation and
Measurement (National Science Teachers Association, 1985), p. 108.]
The smooth curve superimposed on the graph of the pennies is an example
of a normal curve. No real-world example matches the curve perfectly, but many
plots of data are approximately normal. The idealized normal shape is perfectly
symmetric: the right side is a mirror image of the left side, as in Display 2.4.
There is a single peak, or mode, at the line of symmetry, and the curve drops off
smoothly on both sides, flattening toward the x-axis but never quite reaching it
and stretching infinitely far in both directions. On either side of the mode are
inflection points, where the curve changes from concave down to concave up.
Display 2.4 A normal curve, showing the line of symmetry, mode,
mean, inflection points, and standard deviation (SD).
Use the mean and
standard deviation to
describe the center
and spread of a normal
distribution.
You should use the mean (or average) to describe the center of a normal
distribution. The mean is the value at the point where the line of symmetry
intersects the x-axis. You should use the standard deviation, SD for short, to
describe the spread. The SD is the horizontal distance from the mean to an
inflection point.
It is difficult to locate inflection points, especially when curves are drawn
by hand. A more reliable way to estimate the standard deviation is to use areas.
For a normal curve, 68% (roughly) of the total area under the curve is between
the vertical lines through the two inflection points. In other words, the interval
between one standard deviation on either side of the mean accounts for roughly
68% of the area under the normal curve.
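The 68% figure can be confirmed numerically. The following is a minimal Python sketch using NormalDist from the standard statistics module; the helper name area_within is an illustrative assumption, and the mean of 47 with SD 4 in the second call is only an example pair of values.

```python
from statistics import NormalDist

def area_within(k, mean=0.0, sd=1.0):
    """Area under a normal curve within k standard deviations of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

# The area within one SD does not depend on which mean and SD are chosen.
print(f"Within 1 SD (mean 0, SD 1):  {area_within(1):.4f}")
print(f"Within 1 SD (mean 47, SD 4): {area_within(1, mean=47, sd=4):.4f}")
```

Both calls give about 0.68, which is why the interval between the two inflection points can stand in for the harder-to-see inflection points themselves.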
Example: Averages of Random Samples
Display 2.5 shows the distribution of average ages computed from 200 sets of
five workers chosen at random from the ten hourly workers in Round 2 of the
Westvaco case discussed in Chapter 1. Notice that, apart from the bumpiness, the
shape is roughly normal. Estimate the mean and standard deviation.
Display 2.5 Distribution of average age for groups of five workers
drawn at random.
Solution
The curve in the display has center at 47, and the middle 68% of dots fall roughly
between 43 and 51. Thus, the estimated mean is 47, and the estimated standard
deviation is 4. A typical random sample of five workers has average age 47 years,
give or take about 4 years.
[You can graph a normal curve on your calculator by specifying the mean and standard
deviation. See Calculator Note 2B.]
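The visual estimate in the solution (find the interval around the center that holds the middle 68% of the dots, then take half its width) can be imitated in code. This is a minimal Python sketch, not the method of any particular calculator or software; the function name sd_from_middle_68 is hypothetical, and the data are simulated stand-ins drawn from a normal distribution with mean 47 and SD 4, not the actual averages plotted in Display 2.5.

```python
import random
from statistics import mean

def sd_from_middle_68(values):
    """Return the mean and the distance from it that encloses about 68% of the values."""
    center = mean(values)
    distances = sorted(abs(v - center) for v in values)
    index = max(round(0.68 * len(distances)) - 1, 0)  # position of the 68% cutoff
    return center, distances[index]

rng = random.Random(4)  # the seed is an arbitrary choice for repeatability
stand_in = [rng.gauss(47, 4) for _ in range(2000)]  # simulated, NOT the Display 2.5 data

center, spread = sd_from_middle_68(stand_in)
print(f"Typical value about {center:.1f}, give or take about {spread:.1f}")
```

For roughly normal data, the distance found this way should land close to the standard deviation, which is the idea behind the estimate of 4 years in the solution.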
In this section, you've seen the three most common ways normal distributions
arise in practice:
through variation in measurements (diameters of tennis balls)
through natural variation in populations (weights of pennies)
through variation in averages computed from random samples (average ages)
All three scenarios are common, which makes the normal distribution especially
important in statistics.
Normal Distributions
D3. Determine these summary statistics visually.
a. Estimate the mean and standard deviation of the penny weight data in
Display 2.3, and use your estimates to write a summary sentence.
b. Estimate the mean and standard deviation of your class data from
Activity 2.1b.
Skewed Distributions
Skewed left
Both the uniform (rectangular) and normal distributions are symmetric. That
is, if you smooth out minor bumps, the right side of the plot is a mirror image
of the left side. Not all distributions are symmetric, however. Many common
distributions show bunching at one end and a long tail stretching out in the other
direction. These distributions are called skewed. The direction of the tail tells
whether the distribution is skewed right (tail stretches right, toward the high
values) or skewed left (tail stretches left, toward the low values).
Skewed right
Display 2.6 Weights of bears in pounds. [Source: MINITAB data set from
MINITAB Handbook, 3rd ed.]
The dot plot in Display 2.6 shows the weights, in pounds, of 143 wild bears. It is
skewed right (toward the higher values) because the tail of the distribution stretches
out in that direction. In everyday conversation, you might describe the two parts of
the distribution as "normal" and "abnormal." Usually, bears weigh between about
50 and 250 lb (this part of the distribution even looks approximately normal), but
if someone shouts "Abnormal bear loose!" you should run for cover: that unusual
bear is likely to be big! The unusualness of the distribution is all in one direction.
Often the bunching in a skewed distribution happens because values bump
up against a wall: either a minimum that values can't go below, such as 0 for
measurements and counts, or a maximum that values can't go above, such as 100
for percentages. For example, the distribution in Display 2.7 shows the grade-point
averages of college students (mostly first-year students and sophomores) taking an
introductory statistics course at the University of Florida. It is skewed left (toward
the smaller values). The maximum grade-point average is 4.0, for all A's, so the
distribution is bunched at the high end because of this wall. A GPA of 0.0 wouldn't
be called a wall, even though GPAs can't go below 0.0, because the values aren't
bunched up against it. The skew is to the left: An unusual GPA would be one that
is low compared to most GPAs of students in the class.
Display 2.7 Grade-point averages of 62 statistics students. Each
dot represents two points.
Use the median along
with the lower and
upper quartiles to
describe the center
and spread of a skewed
distribution.
Because there is no line of symmetry in a skewed distribution, the ideas of
center and spread are not as clear-cut as they are for a normal distribution. To get
around this problem, typically you should use the median to describe the center
of a skewed distribution. To estimate the median from a dot plot, locate the value
that divides the dots into two halves, with equal numbers of dots on either side.
You should use the lower and upper quartiles to indicate spread. The lower
quartile is the value that divides the lower half of the distribution into two halves,
with equal numbers of dots on either side. The upper quartile is the value that
divides the upper half of the distribution into two halves, with equal numbers
of dots on either side. The three values (lower quartile, median, and upper
quartile) divide the distribution into quarters. This allows you to describe a
distribution as in the introduction to this chapter: "The middle 50% of the SAT
math scores were between 630 and 720, with half above 680 and half below."
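The halving procedure just described translates directly into code. The following is a minimal Python sketch; the function name median_and_quartiles and the data values are hypothetical, invented only for illustration. It also assumes one particular convention: when the number of values is odd, the middle value is left out of both halves. Other conventions exist and can give slightly different quartiles.

```python
from statistics import median

def median_and_quartiles(values):
    """Return (lower quartile, median, upper quartile) by repeated halving."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # values below the middle position
    upper_half = ordered[-half:]   # values above the middle position
    # Assumption: with an odd count, the middle value belongs to neither half.
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical right-skewed values, invented for illustration only.
weights = [40, 60, 75, 90, 110, 140, 150, 165, 200, 260, 340, 420, 510]
q1, med, q3 = median_and_quartiles(weights)
print(f"The middle 50% are between {q1} and {q3}, with half above {med} and half below.")
```

The printed sentence follows the same summary form used for the SAT math scores.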
Example: Median and Quartiles for Bear Weights
Divide the bears' weights in Display 2.8 into four groups of equal size, and
estimate the median and quartiles. Write a short summary of this distribution.
Solution
There are 143 dots in Display 2.8, so there are about 71 or 72 dots in each half and
35 or 36 in each quarter. The value that divides the dots in half is about 155 lb.
The values that divide the two halves in half are roughly 115 and 250. Thus, the
middle 50% of the bear weights are between about 115 and 250, with half above
about 155 and half below.
Display 2.8 Estimating center and spread for the weights
of bears.
Skewed Distributions
D4. Decide whether each distribution described will be skewed. Is there a wall
that leads to bunching near it and a long tail stretching out away from it? If
so, describe the wall.
a. the sizes of islands in the Caribbean
b. the average per capita incomes for the nations of the United Nations
c. the lengths of pant legs cut and sewn to be 32 in. long
d. the times for 300 university students of introductory psychology to
complete a 1-hour timed exam
e. the lengths of reigns of Japanese emperors
D5. Which would you expect to be the more common direction of skew, right or
left? Why?
Bimodal Distributions
A bimodal distribution
has two peaks.
Many distributions, including the normal distribution and many skewed
distributions, have only one peak (unimodal), but some have two peaks
(bimodal) or even more. When your distribution has two or more obvious peaks,
or modes, it is worth asking whether your cases represent two or more groups.
For example, Display 2.9 shows the life expectancies of females from countries
on two continents, Europe and Africa.
Display 2.9 Life expectancy of females by country on two
continents. [Source: Population Reference Bureau, World
Population Data Sheet, 2005.]
Europe and Africa differ greatly in their socioeconomic conditions, and the
life expectancies reflect those conditions. If you make a separate plot for each of
the two continents, the two peaks become essentially one peak in each plot, as in
Display 2.10.
Display 2.10 Life expectancy of females in Africa and Europe.
Although it makes sense to talk about the center of the distribution of life
expectancies for Europe or for Africa, notice that it doesn't really make sense to
talk about the center of the distribution for both continents together. You could
possibly tell the locations of the two peaks, but finding the reason for the two
modes and separating the cases into two distributions communicates even more.
Other Features: Outliers, Gaps, and Clusters
An unusual value, or outlier, is a value that stands apart from the bulk of the data.
Outliers always deserve special attention. Sometimes they are mistakes (a typing
mistake, a measuring mistake), sometimes they are atypical for other reasons
(a really big bear, a faulty lab procedure), and sometimes unusual features of the
distribution are the key to an important discovery.
In the late 1800s, John William Strutt, third Baron Rayleigh (English,
1842-1919), was studying the density of nitrogen using samples from the air
outside his laboratory (from which known impurities were removed) and samples
produced by a chemical procedure in his lab. He saw a pattern in the results that
you can observe in the plot of his data in Display 2.11.
Lord Rayleigh
Display 2.11 Lord Rayleigh's densities of nitrogen. [Source:
Proceedings of the Royal Society 55 (1894).]
Lord Rayleigh saw two clusters separated by a gap. (There is no formal
definition of a gap or a cluster; you have to use your best judgment about them.
For example, some people call a single outlier a "cluster of one"; others don't. You
also could argue that the value at the extreme right is an outlier, perhaps because
of a faulty measurement.)
When Rayleigh checked the clusters, it turned out that the ten values to
the left had all come from the chemically produced samples and the nine to the
right had all come from the atmospheric samples. What did this great scientist
conclude? The air samples on the right might be denser because of something
in them besides nitrogen. This hypothesis led him to discover inert gases in the
atmosphere.
Summary 2.1: Visualizing Distributions
Distributions have different shapes, and different shapes call for different
summaries.
If your distribution is uniform (rectangular), it's often enough simply to tell
the range of the set of values and the approximate frequency with which each
value occurs.
If your distribution is normal (bell-shaped), you can give a good summary
with the mean and the standard deviation. The mean lies at the center of
the distribution, and the standard deviation (SD) is the horizontal distance
from the center to the points of inflection, where the curvature changes. To
estimate the SD, find the distance on either side of the mean that defines the
interval enclosing about 68% of the cases.
If your distribution is skewed, you can give the values (median and quartiles)
that divide the distribution into fourths.
If your distribution is bimodal, reporting a single center isn't useful. One
reasonable summary is to locate the two peaks. However, it is even more
useful if you can find another variable that divides your set of cases into two
groups centered at the two peaks.
Later in this chapter, you will study the various measures of center and spread
in more detail and learn how to compute them.
Practice
Practice problems help you master basic concepts
and computations. Throughout this textbook, you
should work all the practice problems for each
topic you want to learn. The answers to all practice
problems are given in the back of the book.
Uniform (Rectangular) Distributions
P1. This diagram shows a uniform distribution
on [0, 2], the interval from 0 through 2.
a. What value divides the distribution in
half, with half the numbers below that
value and half above?
b. What values divide the distribution into
quarters?
c. What values enclose the middle 50% of
the distribution?
d. What percentage of the values lie between
0.4 and 0.7?
e. What values enclose the middle 95% of
the distribution?
P2. The plot in Display 2.12 gives the number
of deaths in the United States per month in
2003. Does the number of deaths appear to
be uniformly distributed over the months?
Give a verbal summary of the way deaths are
distributed over the months of the year.
Display 2.12 Deaths per month, 2003. [Source: Centers
for Disease Control and Prevention, National
Vital Statistics Report, 2004.]
Normal Distributions
P3. For each of the normal distributions
in Display 2.13, estimate the mean and
standard deviation visually, and use your
estimates to write a verbal summary of the
form "A typical SAT score is roughly (mean),
give or take (SD) or so."
a. SAT math scores
b. ACT scores
c. heights of women attending college
d. single-season batting averages for
professional baseball players in the 1910s
Display 2.13 Four distributions that are
approximately normal.
Skewed Distributions
P4. Estimate the median and quartiles for the
distribution of GPAs in Display 2.7 on
page 34. Then write a verbal summary of
the same form as in the example.
P5. Match each plot in Display 2.14 with its
median and quartiles (the set of values that
divide the area under the curve into fourths).
a. 15, 50, 85
b. 50, 71, 87
c. 63, 79, 91
d. 35, 50, 65
e. 25, 50, 75
Display 2.14 Five distributions with different
shapes.
Exercises
E1. Describe each distribution as bimodal,
skewed right, skewed left, approximately
normal, or roughly uniform.
a. ages of all people who died last year in the
United States
b. ages of all people who got their first
driver's license in your state last year
c. SAT scores for all students in your state
taking the test this year
d. selling prices of all cars sold by General
Motors this year
E2. Describe each distribution as bimodal,
skewed right, skewed left, approximately
normal, or roughly uniform.
a. the incomes of the world's 100 richest
people
b. the birthrates of Africa and Europe
c. the heights of soccer players on the last
Women's World Cup championship team
d. the last two digits of telephone numbers
in the town where you live
e. the length of time students used to
complete a chapter test, out of a
50-minute class period
The 2003 Women's World Cup Championship
team, from Germany
E3. Sketch these distributions.
a. a uniform distribution that shows the sort
of data you would get from rolling a fair
die 6000 times
b. a roughly normal distribution with mean
15 and standard deviation 5
c. a distribution that is skewed left, with half
its values above 20 and half below and
with the middle 50% of its values between
10 and 25
d. a distribution that is skewed right, with
the middle 50% of its values between 100
and 1000 and with half the values above
200 and half below
E4. The U.S. Environmental Protection Agency's
National Priorities List Fact Book tells
the number of hazardous waste sites for
each U.S. state and territory. For 2006, the
numbers ranged from 1 to 138, the middle
50% of the values were between 11 and 32,
half the values were above 18, and half were
below 18. Sketch what the distribution might
look like. [Source: U.S. Environmental Protection
Agency, www.epa.gov, 2006.]
E5. The dot plot in Display 2.15 gives the ages of
the officers who attained the rank of colonel
in the Royal Netherlands Air Force.
a. What are the cases? Describe the
variables.
b. Describe this distribution in terms of
shape, center, and spread.
c. What kind of wall might there be that
causes the shape of the distribution?
Generate as many possibilities as you can.
Display 2.15 Ages of colonels. Each dot represents
two points. [Source: Data and Story Library at
Carnegie-Mellon University, lib.stat.cmu.edu.]
E6. The dot plot in Display 2.16 shows the
distribution of the number of inches of
rainfall in Los Angeles for the seasons
1899–1900 through 1999–2000.
E9. Make up a scenario (name the cases and
variables) whose distribution you would
expect to be
a. skewed right because of a wall. What is
responsible for the wall?
b. skewed left because of a wall. What is
responsible for the wall?
E10. The plot in Display 2.18 shows the last
digit of the Social Security numbers of the
students in a statistics class. Describe this
distribution.
Display 2.16 Los Angeles rainfall. [Source: National
Weather Service.]
a. What are the cases? Describe the
variables.
b. Describe this distribution in terms of
shape, center, and spread.
c. What kind of wall might there be that
causes the shape of the distribution?
Generate as many possibilities as you can.
E7. The distribution in Display 2.17 shows
measurements of the strength in pounds of
22s yarn (22s refers to a standard unit for
measuring yarn strength). What is the basic
shape of this distribution? What feature
makes it uncharacteristic of distributions
with that shape?
Display 2.18 Last digit of a sample of Social Security
numbers.
E11. Although a uniform distribution gives a
reasonable approximation of the actual
distribution of births over months (Display
2.1 on page 29), you can blow up the graph
to see departures from the uniform pattern,
as in Display 2.19. Do these deviations from
the uniform shape form their own pattern,
or do they appear haphazard? If you think
there's a pattern, describe it.
Display 2.17 Strength of yarn. [Source: Data and
Story Library at Carnegie-Mellon University,
lib.stat.cmu.edu.]
E8. Sketch a normal distribution with mean 0
and standard deviation 1. You will study this
standard normal distribution in Section 2.5.
Display 2.19 A blow up of the distribution of births
over months, showing departures
from the uniform pattern.
E12. Draw a graph similar to that in Display 2.19
for the data on deaths in the United States
listed in Display 2.20, and summarize what
you find.
Display 2.20 Deaths in the United States, 2003.
[Source: Centers for Disease Control and
Prevention.]
E13. How do countries compare with respect to
the value of the goods they produce? Display
2.21 shows gross domestic product (GDP)
per capita, a measure of the total value of
all goods and services produced divided
by the number of people in a country, and
the average number of people per room in
housing units, a measure of crowdedness, for
a selection of countries in Asia, Europe, and
North America. You'll analyze these data in
parts a–d.
Display 2.21 Per capita GDP and crowdedness for
a selection of countries. [Source: United
Nations, unstats.un.org.]
A family in Albania
A dot plot of the per capita GDP data is
shown in Display 2.22.
d. Is it surprising to find clusters and gaps
in data that measure an aspect of the
economies of the countries?
E14. The dot plot in Display 2.23 gives a look
at how the countries listed in Display 2.21
compare in terms of the crowdedness of
their residents.
Display 2.22 Dot plot of per capita GDP.
a. Describe this distribution in terms of
shape, center, and spread.
b. Which two countries have the highest
per capita GDP? Do they appear to be
outliers?
c. A rather large gap appears near the
middle of the distribution. Which of the
two clusters formed by this gap contains
mostly Western European and North
American countries? In what part of the
world are most of the countries in the
other cluster?
2.2
Plots should present the
essentials quickly and
clearly.
Display 2.23 Dot plot of crowdedness.
a. Describe this distribution in terms of
shape, center, and spread.
b. Which countries appear to be outliers?
Are they the same as the countries that
appeared to be outliers for the per capita
GDP data?
c. Where on the dot plot is the cluster that
contains mostly Western European and
North American countries?
Graphical Displays of Distributions
As you saw in the previous section, the best way to summarize a distribution often
depends on its shape. To see the shape, you need a suitable graph. In this section,
you'll learn how to make and interpret three kinds of plots for quantitative variables
(dot plot, histogram, and stemplot) and one plot for categorical variables (bar chart).
Cases and Variables, Quantitative and Categorical
Pet cats typically live about 12 years, but some have been known to live 28 years.
Is that typical of domesticated predators? What about domesticated nonpredators,
such as cows and guinea pigs? What about wild mammals? The rhinoceros, a
nonpredator, lives an average of 15 years, with a maximum of about 45 years.
The grizzly bear, a wild predator, lives an average of 25 years, with a maximum of
about 50 years. Do meat-eaters typically outlive vegetarians in the wild? Often you
can find answers to questions like these in a plot of the data.
Many of the examples in this section are based on the data about mammals in
Display 2.24. Each row (type of mammal) is a case. As you learned in Chapter 1,
the cases in a data set are the people, cities, mammals, or other items being studied.
Measurements and other properties of the cases are organized into columns, with one
column for each variable. Thus, average longevity and speed are variables, and, for
example, 30 mi/h is the value of the variable speed for the case grizzly bear.
Display 2.24 Facts on mammals. (Asterisks (*) mark missing
values.) [Source: World Almanac and Book of Facts, 2001, p. 237.]
Counts of how many and measurements of how much are called quantitative
variables. Speed is a quantitative variable because speed is measured on a numerical
scale. A variable that groups cases into categories is called a categorical variable.
Predator is a categorical variable because it groups the mammals into those who
eat other animals and those who don't. Although the categories are coded 1 if a
mammal preys on other animals and 0 if it does not, these numbers just indicate
the appropriate category and are not meant to be quantitative. In Display 2.24, the
asterisks (*) mark missing values.
Cases and Variables, Quantitative and Categorical
D6. Classify each variable in Display 2.24 as quantitative or categorical.
More About Dot Plots
Dot plots show
individual cases as dots.
You've already seen dot plots, beginning in Chapter 1. As the name suggests, dot
plots show individual cases as dots (or other plotting symbols, such as x). When
you read a dot plot, keep in mind that different statistical software packages make
dot plots in different ways. Sometimes one dot represents two or more cases, and
sometimes values have been rounded. With a small data set, different rounding rules
can give different shapes.
Display 2.25 shows a dot plot of the speeds of the mammals from Display
2.24. The gap between the cheetah's speed and that of the other mammals shows up
clearly in the dot plot but not in the list of speeds in Display 2.24. Discoveries like
this demonstrate why you should always plot your data.
Display 2.25 Dot plot of the speeds of mammals.
When are dot plots most
useful?
As you saw in Section 2.1, a dot plot shows shape, center, and spread. Dot
plots tend to work best when
you have a relatively small number of values to plot
you want to see individual values, at least approximately
you want to see the shape of the distribution
you have one group or a small number of groups you want to compare
Histograms
Histograms show groups
of cases as bars.
A dot plot shows individual cases as dots above a number line. To make a
histogram, you divide the number line into intervals, called bins, and over
each bin construct a bar that has a height equal to the number of cases in that
bin. In fact, you can think of a histogram as a dot plot with bars drawn around
the dots and the dots erased. The height of the bar becomes a visual substitute
for the number of dots. The plot in Display 2.26 is a histogram of the mammal
speeds. Like the dot plot of a distribution, a histogram shows shape, center, and
spread. The vertical axis gives the number of cases (the frequency or count)
represented by each bar. For example, four mammals have speeds from 30 mi/h
up to 35 mi/h.
Display 2.26 Histogram of mammal speeds.
Borderline values go in
the bar to the right.
Most calculators and statistical software packages place a value that falls at the
dividing line between two bars into the bar to the right. For example, in Display
2.26, the bar going from 30 to 35 contains cases for which 30 ≤ speed < 35.
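The borderline rule can be sketched in a few lines of Python. This is only an illustration, not the procedure of any particular calculator or software package. The function name `bin_counts` and the sample speeds are hypothetical; they are not the values from Display 2.24.

```python
from collections import Counter


def bin_counts(values, start, width):
    """Count how many values fall in each bin [left, left + width).

    A value that sits exactly on a dividing line is counted in the bin
    to its right, because each bin includes only its left edge.
    """
    counts = Counter()
    for v in values:
        # Floor division gives the index of the left-closed bin holding v.
        index = int((v - start) // width)
        left = start + index * width
        counts[left] += 1
    return dict(sorted(counts.items()))


# Hypothetical speeds in mi/h, chosen only to illustrate the rule.
speeds = [28, 30, 32, 34.9, 35, 40]
print(bin_counts(speeds, start=25, width=5))
```

In this sketch the value 30 is counted in the bar from 30 to 35, and the value 35 is counted in the bar from 35 to 40.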
Changing the width of the bars in your histogram can sometimes change your
impression of the shape of the distribution. For example, the histogram of the
speeds of mammals in Display 2.27 has fewer and wider bars than the histogram
in Display 2.26. It shows a more symmetric, bell-shaped distribution, and there
appears to be one peak rather than two. There is no right answer to the question
of which bar width is best, just as there is no rule that tells a photographer
when to use a zoom lens for a close-up. Different versions of a picture bring out
different features. The job of a data analyst is to find a plot that shows important
features of the distribution.
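To experiment with bar widths without a calculator, a rough text histogram is enough. The sketch below is one possible approach; `text_histogram` and the data values are hypothetical and are not taken from Display 2.24.

```python
def text_histogram(values, start, width, stop):
    """Return one text line per bin [left, left + width), one * per case."""
    lines = []
    left = start
    while left < stop:
        count = sum(1 for v in values if left <= v < left + width)
        lines.append(f"{left:>3}-{left + width:<3} {'*' * count}")
        left += width
    return lines


# Hypothetical data, used only to compare two bar widths.
values = [12, 18, 22, 27, 31, 33, 36, 38, 41, 44, 47, 52]

print("Bar width 5:")
print("\n".join(text_histogram(values, start=10, width=5, stop=55)))
print("Bar width 15:")
print("\n".join(text_histogram(values, start=10, width=15, stop=55)))
```

With the narrow bars, each case is nearly visible on its own; with the wide bars, only the overall outline remains.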
Display 2.27 Speeds of mammals using a histogram with
wider bars.
You can use your calculator to quickly display histograms with different bar
widths. [See Calculator Note 2C.] Shown here are the mammal speed data. The
numbers below the calculator screens indicate the window settings (minimum x,
maximum x, x-scale, minimum y, maximum y, y-scale).
When are histograms
most useful?
Relative frequency
histograms show
proportions instead
of counts.
Histograms work best when
you have a large number of values to plot
you don't need to see individual values exactly
you want to see the general shape of the distribution
you have only one distribution or a small number of distributions you want to
compare
you can use a calculator or computer to make the plots for you
A histogram shows frequencies on the vertical axis. To change a histogram
into a relative frequency histogram, divide the frequency for each bar by the
total number of values in the data set and show these relative frequencies on the
vertical axis.
Example: Converting Frequencies to Relative Frequencies
Four of the 18 mammals have speeds from 30 mi/h up to 35 mi/h. Convert the
frequency 4 to a relative frequency.
Solution
Four out of 18 is $\frac{4}{18}$, or approximately 0.22. So about 0.22 of the mammals have
speeds in this range.
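The same conversion can be applied to every bar at once. In this sketch, the function name `relative_frequencies` and the list of bar counts are hypothetical; only the count of 4 out of 18 comes from the example above.

```python
def relative_frequencies(counts):
    """Divide each bar's frequency by the total number of values."""
    total = sum(counts)
    return [count / total for count in counts]


# The example from the text: 4 of the 18 mammals.
print(round(4 / 18, 2))

# Hypothetical bar counts that add up to 18 cases.
counts = [2, 3, 4, 5, 3, 1]
print([round(r, 2) for r in relative_frequencies(counts)])
```

The relative frequencies always add up to 1, which is a quick check on the arithmetic.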
Example: Relative Frequency of Life Expectancies
Display 2.28 shows the relative frequency
distribution of life expectancies for
203 countries around the world. How
many countries have a life expectancy
of at least 70 but less than 75 years? What
proportion of the countries have a life
expectancy of 70 years or more?
Display 2.28
Life expectancies of people in countries around the
world. [Source: Population Reference Bureau, World Population
Data Sheet, 2005.]
Solution
The bar including 70 years and up to 75 years has a relative frequency of about
0.30, so the number of countries with a life expectancy of at least 70 years but less
than 75 years is about 0.30 × 203, or about 61.
The proportion of countries with a life expectancy of 70 years or greater is the
sum of the heights of the three bars to the right of 70, about 0.30 + 0.19 + 0.07,
or 0.56.
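The arithmetic in this solution can be checked with a short calculation. The variable names are hypothetical; the numbers are the approximate bar heights read from Display 2.28, as quoted in the solution.

```python
n_countries = 203

# Approximate relative frequencies of the three bars at 70 years and above.
bars_70_and_up = [0.30, 0.19, 0.07]

# Relative frequency times the total gives an estimated count.
count_70_to_75 = bars_70_and_up[0] * n_countries

# Adding the bar heights gives the proportion at 70 years or more.
proportion_70_plus = sum(bars_70_and_up)

print(round(count_70_to_75))
print(round(proportion_70_plus, 2))
```

Because the bar heights are read from a graph, both results are estimates.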
Histograms
D7. In what sense does a histogram with narrow bars, as in Display 2.26, give
you more information than a histogram with wider bars, as in Display 2.27?
In light of your answer, why don't we always make histograms with very
narrow bars?
D8. Does using relative frequencies change the shape of a histogram? What
information is lost and gained by using a relative frequency histogram rather
than a frequency histogram?
Stemplots
The plot in Display 2.29 is a stem-and-leaf plot, or stemplot, of the mammal
speeds. It shows the key features of the distribution and preserves all the original
numbers.
Display 2.29 Stemplot of mammal speeds.
A stemplot shows cases
as digits.
In Display 2.29, the numbers on the left, called the stems, are the tens digits
of the speeds. The numbers on the right, called the leaves, are the ones digits of
the speeds. The leaf for the speed 39 mi/h is printed in bold. If you turn your book
90° counterclockwise, you will see that a stemplot looks something like a dot plot
or histogram; you can see the shape, center, and spread of the distribution.
The stemplot in Display 2.30 displays the same information, but with
split stems: Each stem from the original plot has become two stems. If the ones
digit is 0, 1, 2, 3, or 4, it is placed on the first line for that stem. If the ones digit is
5, 6, 7, 8, or 9, it is placed on the second line for that stem.
Display 2.30 Stemplot of mammal speeds, using split stems.
Spreading out the stems in this way is similar to changing the width of the
bars in a histogram. The goal here, as always, is to find a plot that conveys the
essential pattern of the distribution as clearly as possible.
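A stemplot, with or without split stems, can also be built by a short program. The sketch below assumes whole-number values from 0 to 99, so that stems are tens digits and leaves are ones digits. The function name `stemplot` and the data are hypothetical, not the speeds from Display 2.24.

```python
from collections import defaultdict


def stemplot(values, split=False):
    """Return the lines of a stem-and-leaf plot for whole numbers 0-99.

    Stems are tens digits and leaves are ones digits. With split=True,
    each stem gets two lines: leaves 0-4 first, then leaves 5-9.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))

    # Include every stem between the extremes so that gaps stay visible.
    stems = range(min(values) // 10, max(values) // 10 + 1)
    halves = (0, 1) if split else (0,)
    return [
        f"{stem} | {' '.join(rows[(stem, half)])}".rstrip()
        for stem in stems
        for half in halves
    ]


# Hypothetical data, used only to show the layout.
values = [12, 18, 22, 27, 31, 33, 36, 38, 41, 44, 47, 52]
print("\n".join(stemplot(values)))
print()
print("\n".join(stemplot(values, split=True)))
```

Stems with no leaves are printed anyway; leaving them out would hide gaps in the distribution.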
You have compared two data distributions by constructing dot plots on
the same scale (see, for example, Display 2.10). Another way to compare two
distributions is to construct a back-to-back stemplot. Such a plot for the speeds of
predators and nonpredators is shown in Display 2.31. The predators tend to have
the faster speeds, or, at least, there are no slow predators!
Display 2.31 Back-to-back stemplot of mammal speeds for
predators and nonpredators.
Usually, only two digits are plotted on a stemplot, one digit for the stem and
one digit for the leaf. If the values contain more than two digits, the values may
be either truncated (the extra digits simply cut off) or rounded. For example, if
the speeds had been given to the nearest tenth of a mile, 32.6 mi/h could be either
truncated to 32 mi/h or rounded to 33 mi/h.
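The difference between truncating and rounding is easy to see in code. This sketch uses the value 32.6 from the text; the variable names are hypothetical. Rounding is done here with `math.floor(speed + 0.5)` so that halves always go up, which is an assumption about the rounding rule; Python's built-in `round` sends exact halves to the nearest even number instead.

```python
import math

speed = 32.6  # mi/h, the value used in the text

truncated = math.trunc(speed)      # cut off the extra digit
rounded = math.floor(speed + 0.5)  # round to nearest, halves going up

print(truncated, rounded)
```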
As with the other types of plots, the rules for making stemplots are flexible.
Do what seems to work best to reveal the important features of the data.
The stemplot of mammal speeds in Display 2.32 was made by statistical
software. Although it looks a bit different from the handmade plot in Display 2.30,
it is essentially the same. In the first two lines, N = 18 means that 18 cases were
plotted; N* = 21 means that there were 21 cases in the original data set for which
speeds were missing; and Leaf Unit = 1.0 means that the ones digits were graphed
as the leaves. The numbers in the left column keep track of the number of cases,
counting in from the extremes. The 2 on the left in the first line means that there
are two cases on that stem. If you skip down three lines, the 4 on the left means
that there are a total of four cases on the first four stems (11, 12, 20, and 25).
Display 2.32 Stem-and-leaf plot of mammal speeds made by
statistical software.
When are stem-and-leaf
plots most useful?
Stemplots are useful when
you are plotting a single quantitative variable
you have a relatively small number of values to plot
you would like to see individual values exactly, or, when the values contain
more than two digits, you would like to see approximate individual values
you want to see the shape of the distribution clearly
you have two (or sometimes more) groups you want to compare
Stemplots
D9. What information is given by the numbers in the bottom half of the far
left column of the plot in Display 2.32? What does the 2 in parentheses
indicate?
D10. How might you construct a stemplot of the data on gestation periods for the
mammals listed in Display 2.24? Construct the stemplot and describe the
shape of the distribution.
Do Units of Measurement Affect Your Estimates?
In this activity, you will see whether you and your class estimate lengths better
in feet or in meters.
1. Your instructor will split the class randomly into two groups.
2. If you are in the first group, you will estimate the length of your classroom
in feet. If you are in the second group, you will estimate the length of the
classroom in meters (without estimating first in feet and then converting to
meters). Do this by looking at the length of the room; no pacing the length
of the room is allowed.
3. Find an appropriate way to plot the two data sets so that you can compare
their shapes, centers, and spreads.
4. Do the students in your class tend to estimate more accurately in feet or in
meters? What is the basis for your decision?
5. Why split the class randomly into two groups instead of simply letting the
left half of the room estimate in feet and the right half estimate in meters?
Bar Charts for Categorical Data
Bar charts show
the frequencies of
categorical data as
heights of bars.
You now have three different types of plots to use with quantitative variables.
What about categorical variables? You could make a dot plot, or you could make
what looks like a histogram but is called a bar chart or bar graph. There is one
bar for each category, and the height of the bar tells the frequency. A bar chart has
categories on the horizontal axis, whereas a histogram has measurements, that is, values
from a quantitative variable.
The bar chart in Display 2.33 shows the frequency of mammals that fall into
the categories wild and domesticated, coded 1 and 0, respectively. (Note that
the bars are separated so that there is no suggestion that the variable can take on a
value of, say, 0.5.)
Display 2.33 Bar chart showing frequency of domesticated (0)
and wild (1) mammals.
The relative frequency bar chart in Display 2.34 shows the proportion of the
female labor force age 25 and older in the United States who fall into various
educational categories. The coding used in the display is as follows:
1: none–8th grade
2: 9th grade–11th grade
3: high school graduate
4: some college, no degree
5: associate degree
6: bachelor's degree
7: master's degree
8: professional degree
9: doctorate degree
Display 2.34 The female labor force age 25 years and older by
educational attainment. [Source: U.S. Census Bureau, March
2005 Current Population Survey, www.census.gov.]
The educational categories in Display 2.34 have a natural order from least
education to most education and are coded with the numbers 1 through 9. Note
that if you compute the mean of this distribution, there is no reasonable way to
interpret it. However, it does make sense to summarize this distribution using the
mode: More women fall into the category high school graduate than into any
other category. Thus, the numbers 1 through 9 are best thought of as representing
an ordered categorical variable, not a quantitative variable.
You will learn more about the analysis of categorical data in Chapter 10.
Bar Charts
D11. In the bar chart in Display 2.33, would it matter if the order of the bars were
reversed? In the bar chart in Display 2.34, would it matter if the order of
the first two bars were reversed? Comment on how we might define two
different types of categorical variables.
Summary 2.2: Graphical Displays of Distributions
When a variable is quantitative, you can use dot plots, stemplots (stem-and-leaf
plots), and histograms to display the distribution of values. From each plot, you
can see the shape, center, and spread of the distribution. However, the amount of
detail varies, and you should choose a plot that fits both your data set and your
reason for analyzing it.
Stemplots can retain the actual data values.
Dot plots are best used with a small number of values and show roughly
where the values lie on a number line.
Histograms show only intervals of values, losing the actual data values, and
are most appropriate for large data sets.
A bar chart shows the distribution of a categorical variable.
When you look at a plot, you should attempt to answer these questions:
Where did this data set come from?
What are the cases and the variables?
What are the shape, center, and spread of this distribution? Does the
distribution have any unusual characteristics, such as clusters, gaps, or
outliers?
What are possible interpretations or explanations of the patterns you see in
the distribution?
Practice
More About Dot Plots
P6. In the listing of the Westvaco data in
Display 1.1 on page 5, which variables are
quantitative? Which are categorical?
P7. Select a reasonable scale, and make a dot
plot of the gestation periods of the mammals
listed in Display 2.24 on page 43. Write a
sentence using shape, center, and spread
to summarize the distribution of gestation
periods for the mammals. What kinds of
mammals have longer gestation periods?
Histograms
P8. Make histograms of the average longevities
and the maximum longevities from Display
2.24. Describe how the distributions differ in
terms of shape, center, and spread. Why do
these differences occur?
P9. Convert your histograms from P8 of
the average longevities and maximum
longevities of the mammals to relative
frequency histograms. Do the shapes of the
histograms change?
P10. Using the relative frequency histogram of
life expectancy in countries around the
world (Display 2.28 on page 47), estimate
the proportion of countries with a life
expectancy of less than 50 years. Then
estimate the number of countries with a life
expectancy of less than 50 years. Describe
the shape, center, and spread of this
distribution.
Stemplots
P11. Make a back-to-back stemplot of the average
longevities and maximum longevities from
Display 2.24 on page 43. Compare the two
distributions.
Bar Charts for Categorical Data
P12. Display 2.35, educational attainment of
the male labor force, is the counterpart of
Display 2.34. What are the cases, and what
is the variable? Describe the distribution
you see here. How does the distribution of
female education compare to the distribution
of male education? Why is it better to look
at relative frequency bar charts rather
than frequency bar charts to make this
comparison?
Display 2.35 The male labor force age 25 years
and older by educational attainment.
[Source: U.S. Census Bureau, March 2005 Current
Population Survey, www.census.gov.]
P13. Using the Westvaco data in Display 1.1
on page 5, make a bar chart showing the
number of workers laid off in each round. In
addition to a bar showing layoffs, for each of
the five rounds, include a bar showing the
number of workers not laid off. Then make
a relative frequency bar chart. Describe any
patterns you see.
Exercises
E15. The dot plot in Display 2.36 shows the
distribution of the ages of pennies
in a sample collected by a statistics
class.
a. Where did this data set come from? What
are the cases and the variables?
b. What are the shape, center, and spread of
this distribution?
c. Does the distribution have any unusual
characteristics? What are possible
interpretations or explanations of the
patterns you see in the distribution? That
is, why does the distribution have the
shape it does?
E16. Suppose you collect this information for
each student in your class: age, hair color,
number of siblings, gender, and miles he
or she lives from school. What are the
cases? What are the variables? Classify
each variable as quantitative or
categorical.
Display 2.36 Age of pennies. Each dot represents
four points.
E17. Using your knowledge of the variables and
what you think the shape of the distribution
might be, match each variable in this list
with the appropriate histogram in Display
2.37.
E20. Convert the histogram in Display 2.38 into a
relative frequency histogram.
I. scores on a fairly easy examination in
statistics
II. heights of a group of mothers and their
12-year-old daughters
III. numbers of medals won by medal
winning countries in the 2004 Summer
Olympics
IV. weights of grown hens in a barnyard
Display 2.38 Ages of 1000 people.
E21. Display 2.39 shows the distribution of the
heights of U.S. males between the ages of
18 and 24. The heights are rounded to the
nearest inch.
Display 2.37 Four histograms with different shapes.
E18. Using the technology available to you, make
histograms of the average longevity and
maximum longevity data in Display 2.24
on page 43, using bar widths of 4, 8, and
16 years. Comment on the main features
of the shapes of these distributions and
determine which bar width appears to
display these features best.
E19. Rewrite each sentence so that it states a
relative frequency rather than a count.
a. Six students in a class of 30 got an A.
b. Out of the 50,732 people at a concert,
24,021 bought a T-shirt.
Display 2.39 Heights of males, age 18 to 24. [Source:
U.S. Census Bureau, Statistical Abstract of the
United States, 1991.]
a. Draw a smooth curve to approximate the
histogram.
b. Without doing any computing, estimate
the mean and standard deviation.
c. Estimate the proportion of men age 18 to
24 who are 74 in. tall or less.
d. Estimate the proportion of heights that
fall below 68 in.
e. Why should you say that the distribution
of heights is approximately normal
rather than simply saying that it is
normally distributed?
E22. The histogram in Display 2.40 shows the
distribution of SAT I math scores for
2004–2005.
a. Without doing any computing, estimate
the mean and standard deviation.
b. Roughly what percentage of the SAT I
math scores would you estimate are
within one standard deviation of the
mean?
c. For SAT I critical reading scores, the
shape was similar, but the mean was 10
points lower and the standard deviation
was 2 points smaller. Draw a smooth
curve to show the distribution of SAT I
critical reading scores.
E24. The plots in Display 2.41 show a form of
back-to-back histogram called a population
pyramid. Describe how the population
distribution of the United States differs from
the population distribution of Mexico.
Display 2.40 Relative frequency histogram of SAT I
math scores, 2004–2005. [Source: College
Board Online, www.collegeboard.org.]
E23. In this section, you looked at various
characteristics of mammals.
a. Would you predict that wild mammals or
domesticated mammals generally have
greater longevity?
b. Using the data in Display 2.24 on page 43,
make a back-to-back stemplot to compare
the average longevities.
c. Write a short summary comparing the
two distributions.
Display 2.41 Population pyramids for the United
States and Mexico, 2005. [Source: U.S.
Census Bureau, International Data Base,
www.census.gov.]
E25. Examine the grouped bar chart in Display
2.42, which summarizes some of the
information from Display 2.24 on page 43.
Display 2.42 Bar chart for nonpredators and
predators, showing frequency of wild
and domesticated mammals.
a. Describe what the height of the first three
bars represents.
b. How can you tell from this bar chart
whether a predator from the list in
Display 2.24 is more likely to be wild or
domesticated?
2.3
c. How can you tell from this bar chart
whether a nonpredator or a predator is
more likely to be wild?
E26. Make a grouped bar chart similar to that in
E25 for the hourly and salaried Westvaco
workers (see Display 1.1 on page 5), with
bars showing the frequencies of laid off and
not laid off for the two categories of workers.
E27. Consider the mammals' speeds in Display
2.24.
a. Count the number of mammals that have
speeds ending in 0 or 5.
b. How many speeds would you expect to
end in 0 or 5 just by chance?
c. What are some possible explanations for
the fact that your answers in parts a and b
are so different?
E28. Look through newspapers and magazines
to find an example of a graph that is either
misleading or difficult to interpret. Redraw
the graph to make it clear.
Measures of Center and Spread
So far you have relied on visual methods for estimating summary statistics to
measure center and spread. In this section, you will learn how to compute exact
values of those same summary statistics directly from the data.
Measures of Center
The two most commonly used measures of center are the mean and the median.
The mean, $\bar{x}$, is the same number that many people call the average. To
compute the mean, sum all the values of x and divide by the number of
values, n:
$$\bar{x} = \frac{\sum x}{n}$$
(The symbol $\sum$, for sum, means to add up all the values of x.)
The mean is the
balance point.
The mean is the balance point of a distribution. To estimate the mean visually
on a dot plot or histogram, find where you would have to place a finger below the
horizontal axis in order to balance the distribution, as if it were a tray of blocks
(see Display 2.43).
Display 2.43 The mean is the balance point of a distribution.
The median is the value that divides the data into halves, as shown in
Display 2.44. To find it for an odd number of values, list all the values in order
and select the middle one. If there are n values and n is odd, you will find the
median at position $\frac{n + 1}{2}$. If n is even, the median is the average of the values
on either side of position $\frac{n + 1}{2}$.
Display 2.44 The median divides the distribution into two
equal areas.
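The two definitions translate directly into code. The sketch below is one way to write them out; the names `mean_value` and `median_value` and the sample lists are hypothetical. Positions in the text count from 1, whereas Python lists count from 0, so the code subtracts 1 when it looks up a position.

```python
def mean_value(values):
    """Sum all the values and divide by the number of values, n."""
    return sum(values) / len(values)


def median_value(values):
    """Find the median using position (n + 1) / 2 in the ordered list."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # counted from 1, as in the text
    if n % 2 == 1:
        # Odd n: the position is a whole number; take that value.
        return ordered[int(position) - 1]
    # Even n: average the values on either side of the position.
    below = ordered[int(position) - 1]
    above = ordered[int(position)]
    return (below + above) / 2


# Hypothetical lists, one with an even and one with an odd number of values.
print(mean_value([3, 1, 4, 8]), median_value([3, 1, 4, 8]))
print(mean_value([5, 2, 9, 4, 10]), median_value([5, 2, 9, 4, 10]))
```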
Example: Effect of Round 2 Layoffs on Measures of Center
The ages of the hourly workers involved in Round 2 of the layoffs at Westvaco
were 25, 33, 35, 38, 48, 55, 55*, 55*, 56, and 64* (* indicates laid off in Round 2).
The two dot plots in Display 2.45 show the distributions of hourly workers before
and after the second round of layoffs. What was the effect of Round 2 on the mean
age? On the median age?
Display 2.45 Ages of Westvaco hourly workers before and after
Round 2, showing the means and medians.
Solution
Means
Before: The sum of the ten ages is 464, so the mean age is $\frac{464}{10}$, or 46.4 years.
After: There are seven ages and their sum is 290, so the mean age is $\frac{290}{7}$, or
about 41.4 years.
The layoffs reduced the mean age by 5 years.
Medians
Before: Because there are ten ages, n = 10, so $\frac{n + 1}{2} = \frac{10 + 1}{2}$, or 5.5, and the
median is halfway between the fifth ordered value, 48, and the sixth
ordered value, 55. The median is $\frac{48 + 55}{2}$, or 51.5 years.
After: There are seven ages, so $\frac{n + 1}{2} = \frac{7 + 1}{2}$, or 4. The median is the fourth
ordered value, or 38 years.
The layoffs reduced the median age by 13.5 years.
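These results can be checked with Python's standard `statistics` module. The ages and the three Round 2 layoffs are the ones given in the example; the variable names are hypothetical.

```python
from statistics import mean, median

# Ages from the example; the three laid off in Round 2 are 55, 55, and 64.
before = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
laid_off = [55, 55, 64]

after = list(before)
for age in laid_off:
    after.remove(age)  # drop one matching age per laid-off worker

print(round(mean(before), 1), round(mean(after), 1))  # 46.4 41.4
print(median(before), median(after))                  # 51.5 38
```

The median drops by 13.5 years while the mean drops by only about 5 years, which matches the solution above.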
Measures of Center
D12. Find the mean and median of each ordered list, and contrast their behavior.
a. 1, 2, 3
b. 1, 2, 6
c. 1, 2, 9
d. 1, 2, 297
D13. As you saw in D12, typically the mean is affected more than the median by
an outlier.
a. Use the fact that the median is the halfway point and the mean is the
balance point to explain why this is true.
b. For the distributions of mammal speeds in Display 2.31 on page 48,
the means are 43.5 mi/h for predators and 31.5 mi/h for nonpredators.
The medians are 40.5 mi/h and 33.5 mi/h, respectively. What about the
distributions causes the means to be farther apart than the medians?
c. What about the shapes of the plots in Display 2.45 explains why the
means change so much less than the medians?
Measuring Spread Around the Median:
Quartiles and IQR
Pair a measure of center
with a measure of
spread.
Use IQR as a measure of
spread with the median.
You can locate the median of a distribution by dividing your data into a lower
and upper half. You can use the same idea to measure spread: Find the values that
divide each half in half again. These two values, the lower quartile, Q1, and the
upper quartile, Q3, together with the median, divide your data into four quarters.
The distance between the upper and lower quartiles, called the interquartile
range, or IQR, is a measure of spread.
IQR = Q3 − Q1
San Francisco, California, and Springfield, Missouri, have about the same
median temperature over the year. In San Francisco, half the months of the year
have a normal temperature above 56.5°F, half below. In Springfield, half the
months have a normal temperature above 57°F, half below. If you judge by these
medians, the difference hardly matters. But if you visit San Francisco, you had better
take a jacket, no matter what month you go. If you visit Springfield, take your
shorts and a T-shirt in the summer and a heavy coat in the winter. The difference
in temperatures between the two cities is not in their centers but in their variability.
In San Francisco, the middle 50% of normal monthly temperatures lie in a narrow
9° interval between 52.5°F and 61.5°F, whereas in Springfield the middle 50% of
normal monthly temperatures range over a 31° interval, varying from 40.5°F to
71.5°F. In other words, the IQR is 9°F for San Francisco and 31°F for Springfield.
Finding the Quartiles and IQR
If you have an even number of cases, finding the quartiles is straightforward:
Order your observations, divide them into a lower and upper half, and then divide
each half in half. If you have an odd number of cases, the idea is the same, but
there's a question of what to do with the middle value when you form the upper
and lower halves.
There is no one standard answer. Different statistical software packages use
different procedures that can give slightly different values for the quartiles. In this
book, the procedure is to omit the middle value when you form the two halves.
Example: Finding the Quartiles and IQR for Workers' Ages
Find the quartiles and IQR for the ages of the hourly workers at Westvaco before
and after Round 2 of the layoffs.
Solution
Before: There are ten ages: 25, 33, 35, 38, 48, 55, 55, 55, 56, 64. Because n is even,
the median is halfway between the two middle values, 48 and 55, so it is 51.5. The
lower half of the data is made up of the first five ordered values, and the median
of the lower half is the third value, so Q1 is 35. The upper half of the data is the
set of the five largest values, and the median of these is again the third value,
so Q3 is 55. The IQR is 55 − 35, or 20.
After: After the three workers are laid off in Round 2, there are seven ages: 25, 33,
35, 38, 48, 55, 56. Because n is odd, the median is the middle value, 38. Omit this
one number. The lower half of the data is made up of the three ordered values
to the left of position 4. The median of these is the second value, so Q1 is 33.
The upper half of the data is the set of the three ordered values to the right of
position 4, and the median of these is again the second value, so Q3 is 55. The
IQR is 55 − 33, or 22.
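The procedure used in this book (omit the middle value when n is odd, then take the median of each half) can be sketched in Python. The function name quartiles is an illustrative assumption. Statistical software may use a different convention and report slightly different quartiles, as noted above.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using this book's rule:
    when n is odd, the middle value is left out of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                  # values below the middle
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # values above the middle
    return median(lower), median(ordered), median(upper)

before = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
after = [25, 33, 35, 38, 48, 55, 56]

for label, ages in (("Before", before), ("After", after)):
    q1, med, q3 = quartiles(ages)
    print(label, "Q1 =", q1, "median =", med, "Q3 =", q3, "IQR =", q3 - q1)
```

The sketch assumes at least two values, because the median of an empty half is undefined.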
Finding the Quartiles and IQR
D14. Here are the medians and quartiles for the speeds of the domesticated and
wild mammals:
a. Use the information in Display 2.24 on page 43 to verify these values,
and then use them to summarize and compare the two distributions.
b. Why might the speeds of domesticated mammals be less spread out than
the speeds of wild mammals?
D15. The following quote is from the mystery The List of Adrian Messenger, by
Philip MacDonald (Garden City, NY: Doubleday, 1959, p. 188). Detective
Firth asks Detective Seymour if eyewitness accounts have provided a
description of the murderer:
"Descriptions?" he said. "You must've collected quite a few. How did they
boil down?"
"To a no-good norm, sir." Seymour shrugged wearily. "They varied so much,
the average was useless."
Explain what Detective Seymour means.
Five-Number Summaries, Outliers, and Boxplots
The visual, verbal, and numerical summaries you've seen so far tell you about
the middle of a distribution but not about the extremes. If you include the
minimum and maximum values along with the median and quartiles, you get
the five-number summary.
The five-number summary for a set of values:
Minimum: the smallest value
Lower or first quartile, Q1: the median of the lower half of the ordered set
of values
Median or second quartile: the value that divides the ordered set of values
into halves
Upper or third quartile, Q3: the median of the upper half of the ordered
set of values
Maximum: the largest value
The difference of the maximum and the minimum is called the range.
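As an illustration of the definitions in the box, the following Python sketch computes a five-number summary and the range for the ten Westvaco ages from the earlier example. The class and function names are assumptions made for this sketch, and the quartiles follow this book's convention of taking the median of each half.

```python
from statistics import median
from typing import NamedTuple

class FiveNumberSummary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (needs at least two values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half; the middle value is omitted if n is odd
    upper = ordered[-half:]   # upper half, same size as the lower half
    return FiveNumberSummary(ordered[0], median(lower), median(ordered),
                             median(upper), ordered[-1])

ages = [25, 33, 35, 38, 48, 55, 55, 55, 56, 64]
summary = five_number_summary(ages)
data_range = summary.maximum - summary.minimum
print(summary, "range =", data_range)
```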
Display 2.46 shows the five-number summary for the speeds of the mammals
listed in Display 2.24.
Display 2.46 Five-number summary for the mammal speeds.
Display 2.47 shows a boxplot of the mammal speeds. A boxplot (or
box-and-whiskers plot) is a graphical display of the five-number summary.
The box extends from Q1 to Q3, with a line at the median. The whiskers run
from the quartiles to the extreme values.
Display 2.47 Boxplot of mammal speeds.
The maximum speed of 70 mi/h for the cheetah is 20 mi/h from the next
fastest mammal (the lion) and 28 mi/h from the nearest quartile. It is handy to
have a version of the boxplot that shows isolated cases (outliers) such as the
cheetah. Informally, outliers are any values that stand apart from the rest. This
rule often is used to identify outliers.
1.5 · IQR rule for outliers
A value is an outlier if it is more than 1.5 times the IQR from the nearest
quartile.
Note that "more than 1.5 times the IQR from the nearest quartile" is another
way of saying "either greater than Q3 plus 1.5 times IQR or less than Q1 minus 1.5
times IQR."
Example: Outliers in the Mammal Speeds
Use the 1.5 · IQR rule to identify outliers and the largest and smallest
non-outliers among the mammal speeds.
Solution
From Display 2.46, Q1 = 30 and Q3 = 42, so the IQR is 42 − 30, or 12, and
1.5 · IQR equals 18.
At the low end:
Q1 − 1.5 · IQR = 30 − 18 = 12
The pig, at 11 mi/h, is an outlier.
The squirrel, at 12 mi/h, is the smallest non-outlier.
At the high end:
Q3 + 1.5 · IQR = 42 + 18 = 60
The cheetah, at 70 mi/h, is an outlier.
The lion, at 50 mi/h, is the largest non-outlier.
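The rule in this example can be written as a small Python sketch. The function name outlier_fences and the dictionary of speeds are assumptions for illustration; only the four speeds named in the solution are included, not the full data set from Display 2.24.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from Display 2.46.
low, high = outlier_fences(30, 42)

# The four speeds (mi/h) mentioned in the solution above.
speeds = {"pig": 11, "squirrel": 12, "lion": 50, "cheetah": 70}

# A value is an outlier only if it lies strictly beyond a cutoff.
outliers = {name: s for name, s in speeds.items() if s < low or s > high}
print("cutoffs:", low, high)
print("outliers:", outliers)
```

The strict inequalities matter here: the squirrel, at exactly 12 mi/h, sits on the lower cutoff and so is not an outlier.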
A modified boxplot, shown in Display 2.48, is like the basic boxplot except
that the whiskers extend only as far as the largest and smallest non-outliers
(sometimes called adjacent values) and any outliers appear as individual dots or
other symbols.
Display 2.48 Modified boxplot of mammal speeds.
Boxplots are particularly useful for comparing several distributions.
Example: Boxplots That Show Outliers
Display 2.49 shows side-by-side modified boxplots of average longevity for wild
and domesticated mammals. Compare the two distributions.
Display 2.49 Comparison of average longevity.
Solution
The boxplot for domesticated animals has no median line. So many domesticated
animals had an average longevity of 12 years that it is both the median and the
upper quartile. These plots show that species of domestic mammals typically have
median average longevities of about 12 years, with about the middle half of these
average longevities falling between 8 and 12 years. The average longevity of wild
mammals centers at about the same place, but the wild mammal average longevities
have more variability, with the middle half between about 7.5 and 15.5 years.
Both shapes are roughly symmetric except for some unusually large average
longevities: two wild mammals have average longevities of more than 30 years.
[See Calculator Note 2D to learn how to display regular and modified boxplots and
five-number summaries on your calculator.]
When are boxplots most
useful?
Boxplots are useful when you are plotting a single quantitative variable and
you want to compare the shapes, centers, and spreads of two or more distributions
you don't need to see individual values, even approximately
you don't need to see more than the five-number summary but would like outliers to be clearly indicated
Five-Number Summaries, Outliers, and Boxplots
D16. Test your ability to interpret boxplots by answering these questions.
a. Approximately what percentage of the values in a data set lie within the
box? Within the lower whisker, if there are no outliers? Within the upper
whisker, if there are no outliers?
b. How would a boxplot look for a data set that is skewed right? Skewed
left? Symmetric?
c. How can you estimate the IQR directly from a boxplot? How can you
estimate the range?
d. Is it possible for a boxplot to be missing a whisker? If so, give an example.
If not, explain why not.
e. Contrast the information you can learn from a boxplot with what you
can learn from a histogram. List the advantages and disadvantages of
each type of plot.
Measuring Spread Around the Mean:
The Standard Deviation
There are various ways you can measure the spread of a distribution around its
mean. Activity 2.3a gives you a chance to create a measure of your own.
Comparing Hand Spans: How Far Are You from the Mean?
What you'll need: a ruler
1. Spread your hand on a ruler and
measure your hand span (the distance
from the tip of your thumb to the tip of
your little finger when you spread your
fingers) to the nearest half centimeter.
2. Find the mean hand span for your
group.
3. Make a dot plot of the results for your
group. Write names or initials above
the dots to identify the cases. Mark the mean with a wedge
below the number line.
4. Give two sources of variability in the measurements. That is, give two
reasons why all the measurements aren't the same.
5. How far is your hand span from the mean hand span of your group? How
far from the mean are the hand spans of the others in your group?
6. Make a plot of differences from the mean. Again label the dots with names
or initials. What is the mean of these differences? Tell how to get the
second plot from the first without computing any differences.
7. Using the idea of differences from the mean, invent at least two measures
that give a typical distance from the mean.
8. Compare your measures with those of the other groups in your class.
Discuss the advantages and disadvantages of each group's method.
The differences from the mean, $x - \bar{x}$, are called deviations. The mean is
the balance point of the distribution, so the set of deviations from the mean will
always sum to zero.
Deviations from the mean sum to zero:
$\sum (x - \bar{x}) = 0$
Example: Deviations from the Mean
Find the deviations from the mean for the predators' speeds and verify that the
sum of these deviations is 0. Which predator's speed is farthest from the mean?
Solution
The speeds 30, 30, 39, 42, 50, and 70 mi/h have mean 43.5 mi/h. The deviations
from the mean are
30 − 43.5 = −13.5
30 − 43.5 = −13.5
39 − 43.5 = −4.5
42 − 43.5 = −1.5
50 − 43.5 = 6.5
70 − 43.5 = 26.5
The sum of the deviations is −13.5 + (−13.5) + (−4.5) + (−1.5) + 6.5 + 26.5,
which equals 0. The cheetah's speed, 70 mi/h, is farthest from the mean.
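The same check can be done in a few lines of Python. This is only a sketch with illustrative variable names. With other data the sum may come out as a tiny nonzero number because of floating-point rounding, which is why the second check uses a tolerance.

```python
from math import isclose

# Predators' speeds in mi/h, from the example above.
speeds = [30, 30, 39, 42, 50, 70]

mean_speed = sum(speeds) / len(speeds)
deviations = [x - mean_speed for x in speeds]

print("mean:", mean_speed)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
print("sums to zero:", isclose(sum(deviations), 0.0, abs_tol=1e-9))

# The speed farthest from the mean has the largest absolute deviation.
farthest = max(speeds, key=lambda x: abs(x - mean_speed))
print("farthest from the mean:", farthest)
```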
How can you use the deviations from the mean to get a measure of spread?
You can't simply find the average of the deviations, because you will get 0 every
time. As you might have suggested in the activity, you could find the average of
the absolute values of the deviations. That gives a perfectly reasonable measure
of spread, but it does not turn out to be very easy to use or very useful. Think of
how hard it is to deal with an equation that has sums of absolute values in it, for
example, y = |x − 1| + |x − 2| + |x − 3|. On the other hand, if you square the
deviations, which also gets rid of the negative signs, you get a sum of squares.
Such a sum is always quadratic no matter how many terms there are, for example,
y = (x − 1)² + (x − 2)² + (x − 3)² = 3x² − 12x + 14.
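The quadratic in the last equation comes from expanding each square and collecting like terms. Written out step by step:

```latex
\begin{align*}
y &= (x - 1)^2 + (x - 2)^2 + (x - 3)^2 \\
  &= (x^2 - 2x + 1) + (x^2 - 4x + 4) + (x^2 - 6x + 9) \\
  &= 3x^2 - 12x + 14
\end{align*}
```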
The measure of spread that incorporates the square of the deviations is the
standard deviation, abbreviated SD or s, that you met in Section 2.1. Because sums
of squares really are easy to work with mathematically, the SD offers important
advantages that other measures of spread don't give you. You will learn more
about these advantages in Chapter 7. The formula for the standard deviation, s, is
given in the box.
Formula for the Standard Deviation, s
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
The square of the standard deviation, $s^2$, is called the variance.
Calculators might label the two versions $\sigma_n$ and $\sigma_{n-1}$, or $\sigma$ and $s$.
It might seem more natural to divide by n to get the average of the squared
deviations. In fact, two versions of the standard deviation formula are used: One
divides by the sample size, n; the other divides by n − 1. Dividing by n − 1 gives
a slightly larger value. This is useful because otherwise the standard deviation
computed from a sample would tend to be smaller than the standard deviation of
the population from which the sample came. (You will learn more about this in
Chapter 7.) In practice, dividing by n − 1 is almost always used for real data even
if they aren't a sample from a larger population.
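To see the two versions side by side, here is a minimal Python sketch that applies both divisors to the predators' speeds from the earlier example. The function name and its divide_by_n parameter are assumptions made for this illustration.

```python
from math import sqrt

def standard_deviation(values, divide_by_n=False):
    """Square root of the sum of squared deviations divided by
    n - 1 (the default) or by n (if divide_by_n is True)."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n if divide_by_n else n - 1
    return sqrt(sum_of_squares / divisor)

speeds = [30, 30, 39, 42, 50, 70]  # predators' speeds, mi/h

s_n_minus_1 = standard_deviation(speeds)
s_n = standard_deviation(speeds, divide_by_n=True)

# Roughly 15.04 when dividing by n - 1 and 13.73 when dividing by n.
print(round(s_n_minus_1, 2), round(s_n, 2))
```

With only six values the two results differ noticeably; the ratio between them is the square root of n/(n − 1), which gets closer to 1 as n grows.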
Example: The Standard Deviation of Mammal Longevity
Compute the standard deviation of the average longevity of domesticated
mammals from Display 2.24 on page 43.
Solution
The table in Display 2.50 is a good way to organize the steps. First find the mean
longevity, $\bar{x}$, and then subtract it from each observed value x to get the deviations,
$x - \bar{x}$. Square each deviation to get $(x - \bar{x})^2$.
Display 2.50 Steps in computing the standard deviation.
To compute the standard deviation, sum the squared deviations, divide the
sum by n − 1, and finally take the square root:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}} \approx 4.67$ years
[You can organize the steps of calculating the standard deviation on your calculator.
See Calculator Note 2E.]
Your calculator will compute the summary statistics for a set of data. [See
Calculator Note 2F.] Here are the summary statistics for the domesticated mammal
longevity data. Note that the standard deviation calculated in the previous
example is denoted as Sx. Note also that the five-number summary is shown.
The Standard Deviation
D17. Refer to the previous example for mammal longevities.
a. Does 4.67 years seem like a typical distance from the mean of 11 years
for the average longevities in the example?
b. The average longevities are measured in years. What is the unit of
measurement for the mean? For the standard deviation? For the
variance? For the interquartile range? For the median?
D18. The standard deviation, if you look at it the right way, is a generalization
of the usual formula for the distance between two points. How does the
formula for the standard deviation remind you of the formula for the
distance between two points?
D19. What effect does dividing by n − 1 rather than by n have on the standard
deviation? Does which one you divide by matter more with a large number
of values or with a small number of values?
Summaries from a Frequency Table
To find the mean of the numbers 5, 5, 5, 5, 5, 5, 8, 8, and 8, you could sum them
and divide their sum by how many numbers there are. However, you could get
the same answer faster by taking advantage of the repetitions:
$\bar{x} = \dfrac{5 \cdot 6 + 8 \cdot 3}{6 + 3} = \dfrac{54}{9} = 6$
You can use formulas to find the mean and standard deviation of values
in a frequency table, like the one in Display 2.51 in the example on the
next page.
Formulas for the Mean and Standard Deviation of Values
in a Frequency Table
If each value x occurs with frequency f, the mean of a frequency table is
given by
$\bar{x} = \dfrac{\sum x \cdot f}{n}$
The standard deviation is given by
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2 \cdot f}{n - 1}}$
where n is the sum of the frequencies, or $n = \sum f$.
Example: Mean and Standard Deviation of Coin Values
Suppose you have five pennies, three nickels, and two dimes. Find the mean value
of the coins and the standard deviation.
Solution
The table in Display 2.51 shows a way to organize the steps in computing the
mean using the formula for the mean of values in a frequency table.
Display 2.51 Steps in computing the mean of a frequency table.
Display 2.52 gives an extended version of the table, designed to organize the
steps in computing both the mean and the standard deviation.
Display 2.52 Steps in computing the standard deviation of
values in a frequency table.
[See Calculator Note 2F to learn how to compute numerical summaries from a
frequency table.]
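The coin example can also be worked in Python as a rough check on the table method. In this sketch the dictionary coins maps each value in cents to its frequency; that layout and the function names are assumptions made for illustration. The last lines compare the result with the standard library's stdev applied to the ten coins listed one by one.

```python
from math import isclose, sqrt
from statistics import stdev

# value in cents -> frequency: five pennies, three nickels, two dimes
coins = {1: 5, 5: 3, 10: 2}

def frequency_mean(table):
    """Sum of x * f divided by n, where n is the sum of the frequencies."""
    n = sum(table.values())
    return sum(x * f for x, f in table.items()) / n

def frequency_sd(table):
    """Square root of the sum of (x - mean)^2 * f divided by n - 1."""
    n = sum(table.values())
    mean = frequency_mean(table)
    sum_of_squares = sum((x - mean) ** 2 * f for x, f in table.items())
    return sqrt(sum_of_squares / (n - 1))

mean_value = frequency_mean(coins)
sd_value = frequency_sd(coins)
print("mean:", mean_value, "SD:", round(sd_value, 2))

# Writing out every coin individually gives the same standard deviation.
expanded = [x for x, f in coins.items() for _ in range(f)]
print("matches the list formula:", isclose(sd_value, stdev(expanded)))
```

The mean is 4 cents and the standard deviation is about 3.65 cents.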
Summaries from a Frequency Table
D20. Explain why the formula for the standard deviation in the box on page 67
gives the same result as the formula on page 65.
Summary 2.3: Measures of Center and Spread
Your first step in any data analysis should always be to look at a plot of your data,
because the shape of the distribution will help you determine what summary
measures to use for center and spread.
To describe the center of a distribution, the two most common summaries are
the median and the mean. The median, or halfway point, of a set of ordered
values is either the middle value (if n is odd) or halfway between the two
middle values (if n is even). The mean, or balance point, is the sum of the
values divided by the number of values.
To measure spread around the median, use the interquartile range, or IQR,
which is the width of the middle 50% of the data values and equals the
distance from the lower quartile to the upper quartile. The quartiles are the
medians of the lower half and upper half of the ordered list of values.
To measure spread around the mean, use the standard deviation. To compute
the standard deviation for a data set of size n, first find the deviations from
the mean, then square them, sum the squared deviations, divide by n − 1, and
take the square root.
A boxplot is a useful way to compare the general shape, center, and spread
of two or more distributions with a large number of values. A modified boxplot
also shows outliers. An outlier is any value more than 1.5 times the IQR from the
nearest quartile.
Practice
Measures of Center
P14. Find the mean and median of these ordered
lists.
a. 1 2 3 4
b. 1 2 3 4 5
c. 1 2 3 4 5 6
d. 1 2 3 4 5 . . . 97 98
e. 1 2 3 4 5 . . . 97 98 99
P15. Five 3rd graders, all about 4 ft tall, are
standing together when their teacher, who
is 6 ft tall, joins the group. What is the new
mean height? The new median height?
P16. The stemplots in Display 2.53 show the life
expectancies (in years) for females in the
countries of Africa and Europe. The means
are 53.6 years for Africa and 79.3 years for
Europe.
a. Find the median life expectancy for each
set of countries.
b. Is the mean or the median smaller for
each distribution? Why is this so?
Display 2.53 Female life expectancies in Africa and
Europe. [Source: Population Reference Bureau,
World Population Data Sheet, 2005.]
Measuring Spread Around the Median:
Quartiles and IQR
P17. Find the quartiles and IQR for these ordered
lists.
a. 1 2 3 4 5 6
b. 1 2 3 4 5 6 7
c. 1 2 3 4 5 6 7 8
d. 1 2 3 4 5 6 7 8 9
P18. Display 2.54 shows a back-to-back stemplot
of the average longevity of predators and
nonpredators.
Display 2.54 Average longevities of predators and
nonpredators.
a. By counting on the plot, find the median
and quartiles for each group of mammals.
b. Write a pair of sentences summarizing
and comparing the shape, center, and
spread of the two distributions.
Five-Number Summaries, Outliers,
and Boxplots
P19. The boxplot in Display 2.55 shows the
number of viewers who watched the 101
prime-time network television shows in the
week that Seinfeld aired its last new episode.
Display 2.55 Modified boxplot of number of viewers
of prime-time television shows.
a. Seinfeld had more viewers than any other
show. About how many viewers did it
have?
b. Estimate the median number of viewers,
and use this median in a sentence.
P20. Use the medians and quartiles from D14 on
page 60 and the data in Display 2.24 on page
43 to construct side-by-side boxplots of the
speeds of wild and domesticated mammals.
(Don't show outliers in these plots.)
P21. The stemplot of average mammal longevities
is shown in Display 2.56.
Display 2.56 Average longevity (in years) of
38 mammals.
a. Use the stemplot to find the five-number
summary.
b. Find the IQR.
c. Compute Q1 − 1.5 · IQR. Identify any
outliers (give the animal name and
longevity) at the low end.
d. Identify an outlier at the high end and the
largest non-outlier.
e. Draw a modified boxplot.
The Standard Deviation
P22. Verify that the sum of the deviations from
the mean is 0 for the numbers 1, 2, 4, 6, and
9. Find the standard deviation.
P23. Without computing, match each list of
numbers in the left column with its standard
deviation in the right column. Check any
answers you aren't sure of by computing.
a. 1 1 1 1                  i. 0
b. 1 2 2                    ii. 0.058
c. 1 2 3 4 5                iii. 0.577
d. 10 20 20                 iv. 1.581
e. 0.1 0.2 0.2              v. 3.162
f. 0 2 4 6 8                vi. 3.606
g. 0 0 0 0 5 6 6 8 8        vii. 5.774
Summaries from a Frequency Table
P24. Display 2.57 shows the data on family size
for two representative sets of 100 families,
one set from 1967 and the other from 1997.
Display 2.57 Number of children in a sample of
families, 1967 and 1997. [Source: U.S.
Census Bureau, www.census.gov.]
a. Try to visualize the shapes of the two
distributions. Are they symmetric,
skewed left, or skewed right?
b. Find the median number of children per
family for 1967.
c. Use the formulas to compute the mean
and standard deviation of the 1967
distribution.
P25. Refer to Display 2.57.
a. Use the formulas for the mean and
standard deviation of values in a frequency
table to compute the mean number of
children per family and the standard
deviation for the 1997 distribution.
b. Find the median number of children
for 1997.
c. What are the positions of the quartiles in
an ordered list of 100 numbers? Find the
quartiles for the 1967 distribution and
compute the IQR. Do the same for the
1997 distribution.
d. Write a comparison of the shape, center,
and spread of the distributions for the
two years.
Exercises
E29. The mean of a set of seven values is 25. Six
of the values are 24, 47, 34, 10, 22, and 28.
What is the 7th value?
E30. The sum of a set of values is 84, and the
mean is 6. How many values are there?
E31. Three histograms and three boxplots appear
in Display 2.58. Which boxplot displays the
same information as
a. histogram A?
b. histogram B?
c. histogram C?
Display 2.58 Match the histograms with their
boxplots.
E32. The test scores of 40 students in a first-period
class were used to construct the first
boxplot in Display 2.59, and the test scores
of 40 students in a second-period class were
used to construct the second. Can the third
plot be a boxplot of the combined scores of
the 80 students in the two classes? Why or
why not?
Display 2.59 Boxplots of three sets of test scores.
E33. Make side-by-side boxplots of the speeds of
predators and nonpredators. (The stemplot
in Display 2.31 on page 48 shows the values
ordered.) Are the boxplots or the back-to-back
stemplot in Display 2.31 better for
comparing these speeds? Explain.
E34. The U.S. Supreme Court instituted a
temporary ban on capital punishment
between 1967 and 1976. Between 1977 and
2000, 31 states carried out 683 executions.
(The other 19 states either did not have
a death penalty or executed no one.) The
five states that executed the most prisoners
were Texas (239), Virginia (81), Florida
(50), Missouri (46), and Oklahoma (30).
The remaining 26 states carried out these
numbers of executions: 26, 25, 23, 23, 23, 22,
16, 12, 11, 8, 8, 7, 6, 4, 3, 3, 3, 3, 2, 2, 2, 1, 1,
1, 1, 1. For all 50 states, what was the mean
number of executions per state? The median
number? What were the quartiles? Draw a
boxplot, showing any outliers, of the number
of executions for all 50 states. [Source: U.S.
Department of Justice, Bulletin: Capital Punishment
2000.]
E35. Make a back-to-back stemplot comparing
the ages of those retained and those laid
off among the salaried workers in the
engineering department at Westvaco (see
Display 1.1 on page 5). Find the medians
and quartiles, and use them to write a verbal
comparison of the two distributions.
E36. The boxplots below show the average
longevity of mammals, from Display 2.24.
a. Using only the basic boxplot in Display
2.60, show that there must be at least one
outlier in the set of average longevity.
Display 2.60 Boxplot of average longevity of
mammals.
b. How many outliers are there in the
modified boxplot of average longevity in
Display 2.61?
Display 2.61 Modified boxplot of average longevity
of mammals, showing outliers.
c. How many outliers are shown in Display
2.49 on page 62? How can that be,
considering the boxplot in Display 2.61?
E37. No computing should be necessary to
answer these questions.
a. The mean of each of these sets of values is
20, and the range is 40. Which set has the
largest standard deviation? Which has the
smallest?
I. 0 10 20 30 40
II. 0 0 20 40 40
III. 0 19 20 21 40
b. Two of these sets of values have a
standard deviation of about 5. Which
two?
I. 5 5 5 5 5 5
II. 10 10 10 20 20 20
III. 6 8 10 12 14 16 18 20 22
IV. 5 10 15 20 25 30 35 40 45
E38. The standard deviation of the first set
of values listed here is about 32. What
is the standard deviation of the second
set? Explain. (No computing should be
necessary.)
16 23 34 56 78 92 93
20 27 38 60 82 96 97
E39. Consider the set of the heights of all female
National Collegiate Athletic Association
(NCAA) athletes and the set of the heights
of all female NCAA basketball players.
Which distribution will have the larger
mean? Which will have the larger standard
deviation? Explain.
E40. Consider the data set 15, 8, 25, 32, 14, 8, 25,
and 2. You can replace any one value with a
number from 1 to 10. How would you make
this replacement
a. to make the standard deviation as large as
possible?
b. to make the standard deviation as small
as possible?
c. to create an outlier, if possible?
E41. Another measure of center that sometimes is
used is the midrange. To find the midrange,
compute the mean of the largest value and
the smallest value.
The statistics in this computer output
summarize the number of viewers of
prime-time television shows (in millions)
for the week of the last new Seinfeld episode.
a. Using these summary statistics alone,
compute the midrange both with and
without the value representing the
Seinfeld episode. (Seinfeld had the largest
number of viewers and Seinfeld Clips,
with 58.53 million viewers, the second
largest.) Is the midrange affected much
by outliers? Explain.
b. Compute the mean of the ratings
without the Seinfeld episode, using only
the summary statistics in the computer
output.
E42. In computer output like that in E41,
TrMean is the trimmed mean. It typically
is computed by removing the largest 5% of
values and the smallest 5% of values from
the data set and then computing the mean
of the remaining middle 90% of values. (The
percentage that is cut off at each end can
vary depending on the software.)
a. Find the trimmed mean of the maximum
longevities in Display 2.24 on page 43.
b. Is the trimmed mean affected much by
outliers?
E43. This table shows the weights of the pennies
in Display 2.3 on page 31. For example, the
four pennies in the second interval, 3.0000 g
to 3.0199 g, are grouped at the midpoint of
this interval, 3.01.
a. Find the mean weight of the pennies.
b. Find the standard deviation.
c. Does the standard deviation appear to
represent a typical deviation from the
mean?
E44. Suppose you have five pennies, six nickels,
four dimes, and five quarters.
a. Sketch a dot plot of the values of the
20 coins, and use it to estimate the mean.
b. Compute the mean using the formula for
the mean of values in a frequency table.
c. Estimate the SD from your plot: Is it
closest to 0, 5, 10, 15, or 20?
d. Compute the standard deviation using
the formula for the standard deviation of
values in a frequency table.
E45. On the first test of the semester, the scores
of the first-period class of 30 students had a
mean of 75 and a median of 70. The scores of
the second-period class of 22 students had a
mean of 70 and a median of 68.
a. To the nearest tenth, what is the mean
test score of all 52 students? If you cannot
calculate the mean of the two classes
combined, explain why.
b. What is the median test score of all 52
students? If you cannot find the median
of the two classes combined, explain why.
E46. The National Council on Public Polls
rebuked the press for its coverage of a Gallup
poll of Islamic countries. According to the
Council:
News stories based on the Gallup poll
reported results in the aggregate without
regard to the population of the countries
they represent. Kuwait, with less than
2 million Muslims, was treated the same
as Indonesia, which has over 200 million
Muslims. The aggregate quoted in the
media was actually the average for the
countries surveyed regardless of the size
of their populations.
The percentage of people in Kuwait who
thought the September 11 terrorist attacks
were morally justified was 36%, while the
percentage in Indonesia was 4%. [Source:
www.ncpp.org.]
a. Suppose that the poll covered only
these two countries and that the people
surveyed were representative of the
entire country. What percentage of all the
people in these two countries thought
that the terrorist attacks were morally
justified?
b. What percentage would have been
reported by the press?
2.4 Working with Summary Statistics
Summary statistics are very useful, but only when they are used with good
judgment. This section will teach you how to tell which summary statistic to
use, how changing units of measurement and the presence of outliers affect your
summary statistics, and how to interpret percentiles.
Which Summary Statistic?
Plot first, then look for
summaries.
Which summary statistics should you use to describe a distribution? Should
you use mean and standard deviation? Median and quartiles? Something else?
The right choice can depend on the shape of your distribution, so you should
always start with a plot. For normal distributions, the mean and standard
deviation are nearly always the most suitable. For skewed distributions, the
median and quartiles are often the most useful, in part because they have a simple
interpretation based on dividing a data set into fourths.
Sometimes, however, the mean and standard deviation will be the right choice
even if you have a skewed distribution. For example, if you have a representative
sample of house prices for a town and you want to use your sample to estimate the
total value of all the town's houses, the mean is what you want, not the median.
In Chapter 7, you'll see why the mean and standard deviation are the most useful
choices when doing statistical inference.
Choosing the right summary statistics is something you will get better at
as you build your intuition about the properties of these statistics and how they
behave in various situations.
Which Summary Statistic?
D21. Explain how to determine the total amount of property taxes if you know
the number of houses, the mean value, and the tax rate. In what sense is
knowing the mean equivalent to knowing the total?
D22. When a measure of center for the income of a community's residents is
given, that number is usually the median. Why do you think that is the case?
The Effects of Changing Units
This discussion illustrates some important properties of summary statistics. It will
also help you develop your intuition about how the geometry and the arithmetic
of working with data are related.
The lowest temperature on record for Washington, D.C., is -15°F. How does
that temperature compare with the lowest recorded temperatures for capitals of
other countries? Display 2.62 gives data for the few capitals whose record low
temperatures turn out to be whole numbers on both the Fahrenheit and Celsius
scales.
Display 2.62 Record low temperatures for seven capitals.
[Source: National Climatic Data Center, 2002.]
The dot plot in Display 2.63 shows that the temperatures are centered at about
32°F, with an outlier at -22°F. The spread and shape are hard to determine with
only seven values.
Display 2.63 Dot plot for record low temperatures in degrees
Fahrenheit for seven capitals.
What happens to the shape, center, and spread of this distribution if you
convert each temperature to the number of degrees above or below freezing, 32°F?
To find out, subtract 32 from each value, and plot the new values. Display 2.64
shows that the center of the dot plot is now at 0 rather than 32 but the spread and
shape are unchanged.
Display 2.64
Dot plot of the number of degrees Fahrenheit
above or below freezing for record low
temperatures for the seven capitals.
Adding (or subtracting) a constant to each value in a set of data doesn't
change the spread or the shape of a distribution but slides the entire distribution
a distance equivalent to the constant. Thus, the transformation amounts to a
recentering of the distribution.
What happens to the shape and spread of this distribution if you convert each
temperature to degrees Celsius? The Celsius scale measures temperature based on
the number of degrees above or below freezing, but it takes 1.8°F to make 1°C. To
convert, divide each value in Display 2.64 by 1.8 (or, equivalently, multiply by 5/9),
and plot the new values. Display 2.65 shows that the center of the new dot plot
is still at 0 and the shape is the same. However, the spread has shrunk by a
factor of 5/9.
Display 2.65
Dot plot for record low temperatures in degrees
Celsius for the seven capitals.
Multiplying each value in a set of data by a positive constant doesn't change
the basic shape of the distribution. Both the mean and the spread are multiplied
by that number. This transformation amounts to a rescaling of the distribution.
[See Calculator Note 2G to explore on your calculator the effects of changing
units.]
Recentering and Rescaling a Data Set
Recentering a data set (adding the same number c to all the values in the
set) doesn't change the shape or spread but slides the entire distribution by
the amount c, adding c to the median and the mean.
Rescaling a data set (multiplying all the values in the set by the same
positive number d) doesn't change the basic shape but stretches or shrinks
the distribution, multiplying the spread (IQR or standard deviation) by d and
multiplying the center (median or mean) by d.
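The recentering and rescaling rules can be checked numerically. The Python sketch below is only an illustration: the seven Fahrenheit values are made up (they are not the data in Display 2.62), and `summarize` is a hypothetical helper. It subtracts 32 from each value and then divides by 1.8.

```python
import statistics

def summarize(values):
    """Return (mean, median, sample standard deviation) of a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

# Hypothetical record lows in degrees Fahrenheit (illustrative values only).
fahrenheit = [-22.0, 5.0, 14.0, 23.0, 32.0, 41.0, 50.0]

# Recenter: degrees above or below freezing.
recentered = [f - 32 for f in fahrenheit]

# Rescale: it takes 1.8 Fahrenheit degrees to make 1 Celsius degree.
celsius = [d / 1.8 for d in recentered]

f_mean, f_median, f_sd = summarize(fahrenheit)
r_mean, r_median, r_sd = summarize(recentered)
c_mean, c_median, c_sd = summarize(celsius)

print("Fahrenheit:", f_mean, f_median, f_sd)
print("Recentered:", r_mean, r_median, r_sd)
print("Celsius:   ", c_mean, c_median, c_sd)
```

Under these assumptions, the recentered list should have the same standard deviation as the original and a mean and median that are each 32 lower, while the Celsius list should have its center and spread divided by 1.8.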
The Effects of Changing Units
D23. Suppose a U.S. dollar is worth 9.4 Mexican pesos.
a. A set of prices, in U.S. dollars, has mean $20 and standard deviation $5.
Find the mean and standard deviation of the prices expressed in pesos.
b. Another set of prices, in Mexican pesos, has a median of 94 pesos and
quartiles of 47 pesos and 188 pesos. Find the median and quartiles of the
same prices expressed in U.S. dollars.
The Influence of Outliers
A summary statistic is resistant to outliers if the summary statistic is not changed
very much when an outlier is removed from the set of data. If the summary
statistic tends to be affected by the removal of outliers, it is sensitive to outliers.
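One way to see resistance and sensitivity is to compute each summary statistic with and without the outliers and compare. The Python sketch below uses a made-up data set (37 evenly spaced values plus three high outliers), not the television data, and a hypothetical helper named `summaries`. Note that `statistics.quantiles` uses its own quartile convention, which may differ slightly from the method used in this book.

```python
import statistics

def summaries(values):
    """Return common measures of center and spread for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }

# Hypothetical data: 37 typical values from 4.0 to 13.0, plus three high outliers.
typical = [4 + 0.25 * i for i in range(37)]
outliers = [24.0, 27.5, 30.0]

with_outliers = summaries(typical + outliers)
without_outliers = summaries(typical)

# How much each statistic moves when the outliers are removed.
change = {name: abs(with_outliers[name] - without_outliers[name])
          for name in with_outliers}
for name, amount in change.items():
    print(f"{name:>6}: changes by {amount:.2f}")
```

For this made-up data set, the median and IQR move much less than the mean, standard deviation, and range.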
Display 2.66 shows a dot plot of the number of viewers of prime-time
television shows (in millions) in a particular week. (A boxplot of these data is
shown in Display 2.55 on page 70.) The three highest values (the three shows
with the largest numbers of viewers) are outliers.
Display 2.66 Number of viewers of prime-time television shows
in a particular week.
The printout in Display 2.67 gives the summary statistics for all 101 shows.
Display 2.67 Printout of summary statistics for number of viewers.
The printout in Display 2.68 gives the summary statistics for the number of
viewers when the three outliers are removed from the set of data. Compare this
printout with the one in Display 2.67 and notice which summary statistics are
most sensitive to the outliers.
Display 2.68 Summary statistics for number of viewers without
outliers.
The Influence of Outliers
D24. Are these measures of center for the number of television viewers affected
much by the three outliers? (Refer to Displays 2.66-2.68.) Explain.
a. mean
b. median
D25. Are these measures of spread for the number of television viewers affected
much by the three outliers? Explain why or why not.
a. range
b. standard deviation
c. interquartile range
Percentiles and Cumulative Relative Frequency Plots
Percentiles measure position within a data set. The first quartile, Q1, of a
distribution is the 25th percentile, the value that separates the lowest 25% of the
ordered values from the rest. The median is the 50th percentile, and Q3 is the 75th
percentile. You can define other percentiles in the same way. The 10th percentile,
for example, is the value that separates the lowest 10% of ordered values in a
distribution from the rest. In general, a value is at the kth percentile if k% of all
values are less than or equal to it.
For large data sets, you might see data listed in a table or plotted in a graph,
like those for the SAT I critical reading scores in Display 2.69. Such a plot
is sometimes called a cumulative percentage plot or a cumulative relative
frequency plot. The table shows that, for example, 28% of the students received
a score of 450 or lower and about 13% received a score between 400 and 450.
Display 2.69
Cumulative relative frequency plot of SAT I critical
reading scores and percentiles, 2004-2005. [Source:
The College Board, www.collegeboard.org.]
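A cumulative relative frequency table can be built from bin counts by keeping a running total and dividing by the overall count. The Python sketch below illustrates the idea with made-up bins and counts (not the College Board data); the function names are hypothetical.

```python
from itertools import accumulate

def cumulative_relative_frequencies(counts):
    """Convert bin counts to cumulative proportions (the last one is 1.0)."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

def value_at_percentile(upper_bounds, cumulative, k):
    """Return the smallest bin upper bound with at least k% of values at or below it."""
    for bound, proportion in zip(upper_bounds, cumulative):
        if proportion * 100 >= k:
            return bound
    return upper_bounds[-1]

# Hypothetical score bins (upper bound of each bin) and counts.
upper_bounds = [300, 400, 500, 600, 700, 800]
counts = [20, 130, 300, 330, 170, 50]

cumulative = cumulative_relative_frequencies(counts)
for bound, proportion in zip(upper_bounds, cumulative):
    print(f"{bound} or lower: {proportion:.0%}")

score_40th = value_at_percentile(upper_bounds, cumulative, 40)
print("First bin boundary at or above the 40th percentile:", score_40th)
```

With bins this coarse, the lookup only locates a percentile to the nearest bin boundary; reading between the plotted points, as in the discussion questions, gives a finer estimate.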
[See Calculator Note 2H to learn how to construct cumulative relative frequency plots
on your calculator.]
Percentiles and Cumulative Relative Frequency Plots
D26. Refer to Display 2.69.
a. Use the plot to estimate the percentile for an SAT I critical reading score
of 425.
b. What two values enclose the middle 90% of SAT scores? The middle 95%?
c. Use the table to estimate the score that falls at the 40th percentile.
D27. What proportion of cases lie between the 5th and 95th percentiles of a
distribution? What percentiles enclose the middle 95% of the cases in a
distribution?
Summary 2.4: Working with Summary Statistics
Knowing which summary statistic to use depends on what use you have for that
summary statistic.
If a summary statistic doesn't change much whether you include or exclude
outliers from your data set, it is said to be resistant to outliers.
The median and quartiles are resistant to outliers.
The mean and standard deviation are sensitive to outliers.
Recentering a data set (adding the same number c to all the values) slides
the entire distribution. It doesn't change the shape or spread but adds c to the
median and the mean. Rescaling a data set (multiplying all the values by the same
nonzero number d) is like stretching or squeezing the distribution. It doesn't
change the basic shape but multiplies the spread (IQR or standard deviation) by
| d | and multiplies the measure of center (median or mean) by d.
The percentile of a value tells you what percentage of all values lie at or below
the given value. The 30th percentile, for example, is the value that separates the
distribution into the lowest 30% of values and the highest 70% of values.
Practice
Which Summary Statistic?
P26. A community in Nevada has 9751
households, with a median house price of
$320,000 and a mean price of $392,059.
a. Why is the mean larger than the median?
b. The property tax rate is about 1.15%.
What total amount of taxes will be
assessed on these houses?
c. What is the average amount of taxes per
house?
P27. A news release at www.polk.com stated that
the median age of cars being driven in 2004
was 8.9 years, the oldest to date. The median
was 8.3 years in 2000 and 7.7 years in 1995.
a. Why were medians used in this news
story?
b. What reasons might there be for the
increase in the median age of cars? (The
median age in 1970 was only 4.9 years!)
The Effects of Changing Units
P28. The mean height of a class of 15 children
is 48 in., the median is 45 in., the standard
deviation is 2.4 in., and the interquartile
range is 3 in. Find the mean, standard
deviation, median, and interquartile
range if
a. you convert each height to feet
b. each child grows 2 in.
c. each child grows 4 in. and you convert
the heights to feet
P29. Compute the means and standard deviations
(use the formula for s) of these sets of
numbers. Use recentering and rescaling
wherever you can to avoid or simplify the
arithmetic.
a. 1 2 3
b. 11 12 13
c. 10 20 30
d. 105 110 115
e. 800 900 1000
The Influence of Outliers
P30. The histogram and boxplot in Display 2.70
and the summary statistics in Display 2.71
show the record low temperatures for the
50 states.
a. Hawaii has a lowest recorded temperature
of 12°F. The boxplot shows Hawaii as an
outlier. Verify that this is justified.
b. Suppose you exclude Hawaii from the
data set. Copy the table in Display 2.71,
substituting the value (or your best
estimate if you don't have enough
information to compute the value) of each
summary statistic with Hawaii excluded.
Display 2.70 Record low temperatures for the
50 states. [Source: National Climatic Data
Center, 2002, www.ncdc.noaa.gov.]
Display 2.71 Summary statistics for lowest
temperatures for the 50 states.
Percentiles and Cumulative Relative
Frequency Plots
P31. Estimate the quartiles and the median of the
SAT I critical reading scores in Display 2.69
on page 78, and then use these values to
draw a boxplot of the distribution. What
is the IQR?
Exercises
E47. Discuss whether you would use the mean
or the median to measure the center of each
set of data and why you prefer the one you
chose.
a. the prices of single-family homes in your
neighborhood
b. the yield of corn (bushels per acre) for a
sample of farms in Iowa
c. the survival time, following diagnosis, of
a sample of cancer patients
E48. Mean versus median.
a. You are tracing your family tree and
would like to go back to the year 1700.
To estimate how many generations back
you will have to trace, would you need to
know the median length of a generation
or the mean length of a generation?
b. If a car trip takes 3 h, do you need to
know the mean speed or the median
speed in order to find the total distance?
c. Suppose all trees in a forest are right
circular cylinders with radius 3 ft. The
heights vary, but the mean height is 45
ft, the median is 43 ft, the IQR is 3 ft, and
the standard deviation is 3.5 ft. From this
information, can you compute the total
volume of wood in all the trees?
E49. The histogram in Display 2.72 shows record
high temperatures for the 50 states.
Display 2.72 Record high temperatures for the 50
U.S. states. [Source: National Climatic Data
Center, 2002, www.ncdc.noaa.gov.]
a. Suppose each temperature is converted
from degrees Fahrenheit, F, to degrees
Celsius, C, using the formula
C = (5/9)(F - 32)
If you make a histogram of the
temperatures in degrees Celsius, how
will it differ from the one in Display 2.72?
b. The summary statistics in Display 2.73
are for record high temperatures in
degrees Fahrenheit. Make a similar table
for the temperatures in degrees Celsius.
Display 2.73 Summary statistics for record high
temperatures for the 50 U.S. states.
c. Are there any outliers in the data in °C?
E50. Tell how you could use recentering and
rescaling to simplify the computation of the
mean and standard deviation for this list of
numbers:
5478.1 5478.3 5478.3 5478.9 5478.4 547
E51. Suppose a constant c is added to each value
in a set of data, x1, x2, x3, x4, and x5. Prove
that the mean increases by c by comparing
the formula for the mean of the original
data to the formula for the mean of the
recentered data.
E52. Suppose a constant c is added to each value
in a set of data, x1, x2, x3, x4, and x5. Prove
that the standard deviation is unchanged
by comparing the formula for the standard
deviation of the original data to the formula
for the standard deviation of the recentered
data.
E53. The cumulative relative frequency plot in
Display 2.74 shows the amount of change
carried by a group of 200 students. For
example, about 80% of the students had
$0.75 or less.
Display 2.74
Cumulative percentage plot of amount
of change.
a. From this plot, estimate the median
amount of change.
b. Estimate the quartiles and the
interquartile range.
c. Is the original set of amounts of change
skewed right, skewed left, or symmetric?
d. Does the data set look as if it should
be modeled by a normal distribution?
Explain your reasoning.
E54. Use Display 2.74 to make a boxplot of the
amounts of change carried by the students.
E55. Did you ever wonder how speed limits on
roadways are determined? Most government
jurisdictions set speed limits by this standard
practice, described on the website of the
Michigan State Police.
Speed studies are taken during times that
represent normal free-flow traffic. Since
modified speed limits are the maximum
allowable speeds, roadway conditions
must be close to ideal. The primary basis
for establishing a proper, realistic speed
limit is the nationally recognized method
of using the 85th percentile speed. This
is the speed at or below which 85% of the
traffic moves. [Source: www.michigan.gov.]
The 85th percentile speed typically is rounded
down to the nearest 5 miles per hour. The
table and histogram in Display 2.75 give the
measurements of the speeds of 1000 cars
on a stretch of road in Mellowville with no
curviness or other additional factors. At what
speed would the speed limit be set if the
guidelines described were followed?
Display 2.75 Speed of 1000 cars in Mellowville.
E56. Refer to the distribution of speeds in E55.
Make a cumulative relative frequency plot of
these speeds.
E57. The cumulative relative frequency plot in
Display 2.76 gives the ages of the CEOs
(Chief Executive Officers) of the 500 largest
U.S. companies. Does A, B, or C give its
median and quartiles? Using the diagram,
explain why your choice is correct.
A. Q1 51; median 56; Q3 60
B. Q1 50; median 60; Q3 70
C. Q1 25; median 50; Q3 75
E58. Refer to the distribution of ages in E57. Can
you give the median and quartiles of the
distribution of ages in months? If so, do it. If
not, explain why not.
Display 2.76 Cumulative relative frequency plot of
CEO ages. [Source: www.forbes.com]
2.5 The Normal Distribution
These are both the
same normal curve.
You have seen several reasons why the normal distribution is so important:
It tells you how variability in repeated measurements often behaves
(diameters of tennis balls).
It tells you how variability in populations often behaves (weights of pennies,
SAT scores).
It tells you how means (and some other summary statistics) computed from
random samples behave (the Westvaco case, Activity 1.2a).
In this section, you will learn that if you know that a distribution is normal
(shape), then the mean (center) and standard deviation (spread) tell you
everything else about the distribution. The reason is that, whereas skewed
distributions come in many different shapes, there is only one normal shape. It's
true that one normal distribution might appear tall and thin while another looks
short and fat. However, the x-axis of the tall, thin distribution can be stretched
out so that it looks exactly the same as the short, fat one.
The Standard Normal Distribution
Because all normal distributions have the same basic shape, you can use
recentering and rescaling to change any normal distribution to the one with
mean 0 and standard deviation 1. Solving problems involving normal
distributions depends on this important property.
The normal distribution with mean 0 and standard deviation 1 is called the
standard normal distribution. In this distribution, the variable along the
horizontal axis is called a z-score.
The standard normal distribution is symmetric, with total area under the
curve equal to 1, or 100%. To find the percentage, P, that describes the area to the
left of the corresponding z-score, you can use the z-table or your calculator.
The next two examples show you how to use the z-table, Table A on pages
824-825.
Example: Finding the Percentage When You Know the z-Score
Find the percentage, P, of values less than z = 1.23, the shaded area in Display
2.77. Find the percentage greater than z = 1.23.
Display 2.77 The percentage of values less than z = 1.23.
Solution
Think of 1.23 as 1.2 + 0.03. In Table A on pages 824-825, find the row labeled
1.2 and the column headed .03. Where this row and column intersect, you find
the number .8907. That means that 89.07% of standard normal scores are less
than 1.23.
The total area under the curve is 1, so the proportion of values greater than
z = 1.23 is 1 - 0.8907, or 0.1093, which is 10.93%.
A graphing calculator will give you greater accuracy in finding the proportion
of values that lie between two specified values in a standard normal distribution.
For example, you can find the proportion of values that are less than 1.23 in a
standard normal distribution like this:
[To learn more about calculating the proportion of values between two
z-scores, see Calculator Note 2I.]
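If a calculator is not at hand, the same area can be found in software. The sketch below is one possible approach in Python, using the standard library's `statistics.NormalDist` class (available in Python 3.8 and later); the variable names are only illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of values less than z = 1.23 (area to the left).
below = standard_normal.cdf(1.23)

# Proportion of values greater than z = 1.23 (total area is 1).
above = 1 - below

# Proportion between two z-scores: subtract the two left-hand areas.
between = standard_normal.cdf(1.23) - standard_normal.cdf(-1.23)

print(f"below 1.23: {below:.4f}")
print(f"above 1.23: {above:.4f}")
print(f"between -1.23 and 1.23: {between:.4f}")
```

The first two results should agree with the Table A values, .8907 and .1093, to four decimal places.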
Example: Finding the z-Score When You Know the Percentage
Find the z-score that falls at the 75th percentile of the standard normal distribution,
that is, the z-score that divides the bottom 75% of values from the rest.
Solution
First make a sketch of the situation, as in Display 2.78.
Display 2.78 The z-score that corresponds to the 75th percentile.
Look for .7500 in the body of Table A. No value in the table is exactly equal to
.7500. The closest value is .7486. The value .7486 sits at the intersection of the row
labeled .60 and the column headed .07, so the corresponding z-score is roughly
0.60 + 0.07, or 0.67.
You can use a graphing calculator to find the 75th percentile of a standard
normal distribution like this:
[To learn more about finding the z-score that has a specified proportion of values
below it, see Calculator Note 2J.]
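Software can also reverse the lookup. As a sketch, Python's `statistics.NormalDist` provides an `inv_cdf` method that returns the z-score with a given proportion of values below it; the variable name here is just for illustration.

```python
from statistics import NormalDist

# z-score with 75% of the standard normal distribution below it.
z_75 = NormalDist().inv_cdf(0.75)

print(f"75th percentile of the standard normal: z = {z_75:.4f}")
```

The result is a little more precise than the table-based estimate of 0.67, because the table only lists z-scores to two decimal places.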
The Standard Normal Distribution
D28. For the standard normal distribution,
a. what is the median?
b. what is the lower quartile?
c. what z-score falls at the 95th percentile?
d. what is the IQR?
Standard Units: How Many Standard Deviations Is It
from Here to the Mean?
Converting to standard units, or standardizing, is the two-step process of
recentering and rescaling that turns any normal distribution into the standard
normal distribution.
First you recenter all the values of the normal distribution by subtracting the
mean from each. This gives you a distribution with mean 0. Then you rescale by
dividing all the values by the standard deviation. This gives you a distribution
with standard deviation 1. You now have a standard normal distribution. You
can also think of the two-step process of standardizing as answering two
questions: How far above or below the mean is my score? How many standard
deviations is that?
The standard units or z-score is the number of standard deviations that a
given x-value lies above or below the mean.
How far and which way to the mean? x - mean
How many standard deviations is that? z = (x - mean) / SD
Example: Computing a z-Score
In a recent year, the distribution of SAT I math scores for the incoming class at the
University of Georgia was roughly normal, with mean 610 and standard deviation
69. What is the z-score for a University of Georgia student who got 560 on the
math SAT?
Solution
A score of 560 is 50 points below the mean of 610. This is 50/69, or 0.725,
standard deviation below the mean. Alternatively, using the formula,
z = (x - mean) / SD = (560 - 610) / 69 ≈ -0.725
So the student's z-score is -0.725.
To unstandardize, think in reverse. Alternatively, you can solve the z-score
formula for x and get
x = mean + z · SD
Example: Finding the Value When You Know the z-Score
What was a University of Georgia student's SAT I math score if his or her score
was 1.6 standard deviations above the mean?
Solution
The score that is 1.6 standard deviations above the mean is
x = mean + z · SD = 610 + 1.6(69) ≈ 720
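Standardizing and unstandardizing are each a single line of arithmetic. The Python sketch below uses two hypothetical helper functions, `z_score` and `unstandardize`, applied to the University of Georgia figures from the two examples (mean 610, standard deviation 69).

```python
def z_score(x, mean, sd):
    """Standardize: recenter by subtracting the mean, then rescale by the SD."""
    return (x - mean) / sd

def unstandardize(z, mean, sd):
    """Reverse the process: rescale by the SD, then recenter by adding the mean."""
    return mean + z * sd

# SAT I math scores: roughly normal with mean 610 and SD 69.
z_for_560 = z_score(560, mean=610, sd=69)
score_for_z = unstandardize(1.6, mean=610, sd=69)

print(f"z-score for 560: {z_for_560:.3f}")
print(f"score 1.6 SDs above the mean: {score_for_z:.1f}")
```

The two functions undo each other, so unstandardizing a z-score returns the value it came from.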
Example: Using z-Scores to Make a Comparison
In the United States, heart disease kills roughly one-and-a-quarter times as many
people as cancer. If you look at the death rate per 100,000 residents by state, the
distributions for the two diseases are roughly normal, provided you leave out
Alaska and Utah, which are outliers because of their unusually young populations.
The means and standard deviations for all 50 states are given here.
Alaska had 88 deaths per 100,000 residents from heart disease, and 111 from
cancer. Explain which death rate is more extreme compared to other states. [Source:
Centers for Disease Control, National Vital Statistics Report, vol. 53, no. 5, October 12, 2004.]
Solution
Alaska's death rate for heart disease is 2.88 standard deviations below the mean.
The death rate for cancer is 2.74 standard deviations below the mean. These rates
are about equally extreme, but the death rate for heart disease is slightly more
extreme.
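The comparison comes down to which z-score is farther from 0. The Python sketch below recomputes the cancer z-score from the mean of 196 and SD of 31 quoted later in this section; because the table of means and standard deviations is not reproduced here, the heart disease z-score is simply taken from the solution. The helper `z_score` is hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Cancer death rates per 100,000: mean 196, SD 31 (quoted later in this section).
cancer_z = z_score(111, mean=196, sd=31)

# Heart disease z-score as reported in the solution (assumed, not recomputed).
heart_z = -2.88

# The more extreme rate is the one whose z-score has the larger absolute value.
more_extreme = "heart disease" if abs(heart_z) > abs(cancer_z) else "cancer"

print(f"cancer z-score: {cancer_z:.2f}")
print(f"more extreme rate: {more_extreme}")
```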
Standard Units
D29. Standardizing is a process that is similar to other processes you have seen
already.
a. You're driving at 60 mi/h on the interstate and are now passing the
marker for mile 200, and your exit is at mile 80. How many hours from
your exit are you?
b. What two arithmetic operations did you do to get the answer in part a?
Which operation corresponds to recentering? Which corresponds to
rescaling?
Solving the Unknown Percentage Problem
and the Unknown Value Problem
Now you know all you need to know to analyze situations involving two related
problems concerning a normal distribution: finding a percentage when you know
the value, and finding the value when you know the percentage.
Example: Percentage of Males Taller Than 74 Inches
For groups of similar individuals, heights often are approximately normal in
their distribution. For example, the heights of 18- to 24-year-old males in
the United States are approximately normal, with mean 70.1 in. and standard
deviation 2.7 in. What percentage of these males are more than 74 in. tall?
[Source: U.S. Census Bureau, Statistical Abstract of the United States, 1991.]
Solution
First make a sketch of the situation, as in Display 2.79. Draw a normal shape
above a horizontal axis. Place the mean in the middle on the axis. Then mark and
label the points that are two standard deviations either side of the mean, 64.7 and
75.5, so that about 95% of the values lie between them. Next, mark and label the
points that are one and three standard deviations either side of the mean (67.4
and 72.8, and 62 and 78.2). Finally, estimate the location of the given value of x
and mark it on the axis.
Display 2.79 The percentage of heights greater than 74 in.
Standardize: z = (x - mean) / SD = (74 - 70.1) / 2.7 ≈ 1.44
Look up the proportion: The area to the left of the z-score 1.44 is 0.9251, so the
proportion of males taller than 74 in. is 1 - 0.9251, or 0.0749, or 7.49%.
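The same unknown percentage problem can be sketched in software without standardizing by hand. The Python example below assumes the stated model (normal, mean 70.1 in., SD 2.7 in.) and uses the standard library's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Heights of 18- to 24-year-old U.S. males: approximately normal.
male_heights = NormalDist(mu=70.1, sigma=2.7)

# Area to the right of 74 in. is 1 minus the area to the left.
taller_than_74 = 1 - male_heights.cdf(74)

print(f"proportion taller than 74 in.: {taller_than_74:.4f}")
```

The result is slightly smaller than the table-based 0.0749, because the table lookup rounds the z-score down to 1.44.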
Example: Percentage of Males Between 72 and 74 Inches Tall
The heights of 18- to 24-year-old
males in the United States are
approximately normal, with mean
70.1 in. and standard deviation 2.7 in.
What percentage of these males are
between 72 and 74 in. tall?
Solution
First make a sketch, as in Display 2.80.
Display 2.80 The percentage of male heights between 72 and
74 in.
Standardize: From the previous example, a height of 74 in. has a z-score of 1.44.
For a height of 72 in., z = (72 - 70.1) / 2.7 ≈ 0.70.
Look up the proportion: The area to the left of the z-score 1.44 is 0.9251. The
area to the left of the z-score 0.70 is 0.7580. The area you want is the area between
these two z-scores, which is 0.9251 - 0.7580, or 0.1671. So the percentage of
18- to 24-year-old males between 72 and 74 in. tall is about 16.71%.
You can also use a graphing calculator to find this value. [See Calculator
Note 2I for more details.]
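As a software counterpart to the calculator approach, the area between two values is the difference of two left-hand areas. This Python sketch assumes the same normal model for male heights and uses `statistics.NormalDist`; the names are illustrative.

```python
from statistics import NormalDist

# Heights of 18- to 24-year-old U.S. males: approximately normal.
male_heights = NormalDist(mu=70.1, sigma=2.7)

# Area between 72 in. and 74 in. = (area left of 74) - (area left of 72).
between_72_and_74 = male_heights.cdf(74) - male_heights.cdf(72)

print(f"proportion between 72 and 74 in.: {between_72_and_74:.4f}")
```

Expect a small difference from the table-based answer, since Table A rounds both z-scores to two decimal places.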
Example: 75th Percentile of Female Heights
The heights of females in the United States
who are between the ages of 18 and 24 are
approximately normally distributed, with
mean 64.8 in. and standard deviation
2.5 in. What height separates the shortest
75% from the tallest 25%?
Solution
First make a sketch, as in Display 2.81.
Display 2.81 The 75th percentile in height for women age 18
to 24.
Look up the z-score: If the proportion P is 0.75, then from Table A, you find that z
is approximately 0.67.
Unstandardize: x = mean + z · SD = 64.8 + 0.67(2.5) ≈ 66.5 in.
For an unknown percentage problem:
First standardize by converting the given value to a z-score:
z = (x - mean) / SD
Then look up the percentage.
For an unknown value problem, reverse the process:
First look up the z-score corresponding to the given percentage. Then
unstandardize:
x = mean + z · SD
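The unknown value procedure (look up z, then unstandardize) can be sketched in Python for the female heights example. The code assumes the stated model (mean 64.8 in., SD 2.5 in.) and shows both the two-step route and the one-step `inv_cdf` call on the fitted distribution; variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 64.8, 2.5  # heights of 18- to 24-year-old U.S. females, in inches

# Step 1: look up the z-score for the given percentage (75th percentile).
z = NormalDist().inv_cdf(0.75)

# Step 2: unstandardize.
height_two_step = mean + z * sd

# Equivalent one-step lookup on the normal distribution of heights itself.
height_one_step = NormalDist(mu=mean, sigma=sd).inv_cdf(0.75)

print(f"z = {z:.4f}")
print(f"75th percentile height: {height_two_step:.2f} in.")
```

Both routes should give the same height, close to the table-based answer obtained with z rounded to 0.67.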
Solving the Unknown Percentage Problem and the
Unknown Value Problem
D30. Age of cars. The cars in Clunkerville have a mean age of 12 years and a
standard deviation of 8 years. What percentage of cars are more than 4 years
old? (Warning: This is a trick question.)
Central Intervals for Normal Distributions
You learned in Section 2.1 that if a distribution is roughly normal, about 68% of
the values lie within one standard deviation of the mean. It is helpful to memorize
this fact as well as the others in the box on the next page.
Example: Middle 90% of Death Rates from Cancer
According to the table on page 87, the death rates per 100,000 residents from
cancer are approximately normal, with mean 196 and SD 31. The middle 90% of
death rates are between what two numbers?
Solution
The middle 90% of values in this distribution lie within 1.645 standard
deviations of the mean, 196. That is, about 90% of the values lie in the interval
196 ± 1.645(31), or between about 145 and 247.
You can confirm this result using your calculator: Shade the area under
this normal curve between 145 and 247 and calculate the area. [See Calculator
Note 2K.]
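A central interval can also be confirmed in software. The Python sketch below defines a hypothetical helper, `central_interval`, that finds the z-score cutting off each tail and then unstandardizes; it is applied to the cancer death rates (mean 196, SD 31).

```python
from statistics import NormalDist

def central_interval(mean, sd, coverage):
    """Return (low, high) enclosing the middle `coverage` proportion of a normal distribution."""
    tail = (1 - coverage) / 2              # area left over in each tail
    z = NormalDist().inv_cdf(1 - tail)     # z-score that cuts off the upper tail
    return mean - z * sd, mean + z * sd

# Cancer death rates per 100,000 residents: approximately normal.
low, high = central_interval(mean=196, sd=31, coverage=0.90)

# Check: the area under the curve between the two endpoints.
rates = NormalDist(mu=196, sigma=31)
area = rates.cdf(high) - rates.cdf(low)

print(f"middle 90%: about {low:.0f} to {high:.0f}")
print(f"area between endpoints: {area:.4f}")
```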
Central Intervals for Normal Distributions
D31. Use Table A on pages 824-825 to verify that 99.7% of the values in a normal
distribution lie within three standard deviations of the mean.
Summary 2.5: The Normal Distribution
The standard normal distribution has mean 0 and standard deviation 1. All
normal distributions can be converted to the standard normal distribution by
converting to standard units:
First, recenter by subtracting the mean.
Then rescale by dividing by the standard deviation:
z = (x - mean) / SD
Standard units z tell how far a value x is from the mean, measured in standard
deviations. If you know z, you can find x by using the formula x = mean + z · SD.
If your population is approximately normal, you can compute z and then
use Table A or your calculator to find the corresponding proportion. Be sure to
make a sketch so that you know whether to use the proportion in the table or to
subtract that proportion from 1.
For any normal distribution,
68% of the values lie within 1 standard deviation of the mean
90% of the values lie within 1.645 standard deviations of the mean
95% of the values lie within 1.960 (or about 2) standard deviations of the mean
99.7% (or almost all) of the values lie within 3 standard deviations of the mean
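These four percentages can be checked against the standard normal curve. The Python sketch below uses a hypothetical helper, `within`, which returns the area between -k and k.

```python
from statistics import NormalDist

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# The four multiples of the standard deviation listed in the summary.
coverage = {k: within(k) for k in (1, 1.645, 1.960, 3)}

for k, proportion in coverage.items():
    print(f"within {k} SD of the mean: {proportion:.2%}")
```

Because every normal distribution standardizes to this one, the same proportions hold for any mean and standard deviation.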
Practice
The Standard Normal Distribution
P32. Find the percentage of values below
each given z-score in a standard normal
distribution.
a. 2.23 b. 1.67 c. 0.40 d. 0.80
P33. Find the z-score that has the given
percentage of values below it in a standard
normal distribution.
a. 32%
b. 41%
c. 87%
d. 94%
P34. What percentage of values in a standard
normal distribution fall between
a. −1.46 and 1.46?
b. −3 and 3?
P35. For a standard normal distribution, what
interval contains
a. the middle 90% of z-scores?
b. the middle 95% of z-scores?
Standard Units
P36. Refer to the table in the example on page 87.
a. California had 196 deaths from heart
disease and 154 deaths from cancer per
100,000 residents. Which rate is more
extreme compared to other states? Why?
b. Florida had 295 deaths from heart disease
and 234 deaths from cancer per 100,000
residents. Which rate is more extreme?
c. Colorado had an unusually low rate of
heart disease, 143 deaths per 100,000
residents. Hawaii had an unusually low
rate of cancer, 156 deaths per 100,000
residents. Which is more extreme?
Solving the Unknown Percentage Problem
and the Unknown Value Problem
P37. The heights of 18- to 24-year-old males in
the United States are approximately normal,
with mean 70.1 in. and standard deviation
2.7 in. The heights of 18- to 24-year-old
females are also approximately normally
distributed and have mean 64.8 in. and
standard deviation 2.5 in.
a. Estimate the percentage of U.S. males
between 18 and 24 who are 6 ft tall or
taller.
b. How tall does a U.S. woman between 18
and 24 have to be in order to be at the
35th percentile of heights?
Central Intervals for Normal Distributions
P38. Refer to the table in the example on page 87.
a. The middle 90% of the states' death rates
from heart disease fall between what two
numbers?
b. The middle 68% of death rates from heart
disease fall between what two numbers?
P39. Refer to the information in P37. Which of
the following heights are outside the middle
95% of the distribution? Which are outside
the middle 99%?
A. a male who is 79 in. tall
B. a female who is 68 in. tall
C. a male who is 65 in. tall
D. a female who is 65 in. tall
Exercises
E59. What percentage of values in a standard
normal distribution fall
a. below a z-score of 1.00? 2.53?
b. below a z-score of 1.00? 2.53?
c. above a z-score of 1.5?
d. between z-scores of −1 and 1?
E60. On the same set of axes, draw two normal
curves with mean 50, one having standard
deviation 5 and the other having standard
deviation 10.
E61. Standardizing. Convert each of these values
to standard units, z. (Do not use a calculator.
These are meant to be done in your head.)
a. x = 12, mean 10, SD 1
b. x = 12, mean 10, SD 2
c. x = 12, mean 9, SD 2
d. x = 12, mean 9, SD 1
e. x = 7, mean 10, SD 3
f. x = 5, mean 10, SD 2
E62. Unstandardizing. Find the value of x that was
converted to the given z-score.
a. z = 2, mean 20, SD 5
b. z = 1, mean 25, SD 3
c. z = 1.5, mean 100, SD 10
d. z = 2.5, mean 10, SD 0.2
E63. SAT I critical reading scores are scaled so that
they are approximately normal, with mean
about 505 and standard deviation about 111.
a. Find the probability that a randomly
selected student has an SAT I critical
reading score
i. between 400 and 600
ii. over 700
iii. below 450
b. What SAT I critical reading scores fall in
the middle 95% of the distribution?
E64. SAT I math scores are scaled so that they are
approximately normal, with mean about 511
and standard deviation about 112. A college
wants to send letters to students scoring in
the top 20% on the exam. What SAT I math
score should the college use as the dividing
line between those who get letters and those
who do not?
E65. Height limitations for flight attendants.
To work as a flight attendant for United
Airlines, you must be between 5 ft 2 in. and
6 ft tall. [Source: www.ual.com.] The mean height
of 18- to 24-year-old males in the United
States is about 70.1 in., with a standard
deviation of 2.7 in. The mean height of
18- to 24-year-old females is about 64.8 in.,
with a standard deviation of 2.5 in. Both
distributions are approximately normal.
What percentage of men this age meet
the height limitation? What percentage of
women this age meet the height limitation?
E66. Where is the next generation of male
professional basketball players coming from?
a. The mean height of 18- to 24-year-old
males in the United States is approximately
normally distributed, with mean 70.1 in.
and standard deviation 2.7 in. Use this
information to approximate the percentage
of men in the United States between the
ages of 18 and 24 who are as tall as or taller
than each basketball player listed here.
Then, using the fact that there are about
13 million men between the ages of 18 and
24 in the United States, estimate how many
are as tall as or taller than each player.
i. Shawn Marion, 6 ft 7 in.
ii. Allen Iverson, 6 ft 0 in.
iii. Shaquille O'Neal, 7 ft 1 in.
b. Distributions of real data that are
approximately normal tend to have
heavier tails than the ideal normal curve.
Does this mean your estimates in part a
are too small, too big, or just right?
E67. Puzzle problems. Problems that involve
computations with the normal distribution
have four quantities: mean, standard
deviation, value x, and proportion P below
value x. Any three of these values are enough
to determine the fourth. Think of each row
in this table as a little puzzle, and find the
missing value in each case. This isn't the sort
of thing you are likely to run into in practice,
but solving the puzzles can help you become
more skilled at working with the normal
distribution.
E68. More puzzle problems. In each row of this
table, assume the distribution is normal.
Knowing any two of the mean, standard
deviation, Q1, and Q3 is enough to determine
the other two. Complete the table.
E69. ACT scores are approximately normally
distributed, with mean 18 and standard
deviation 6. Without using your calculator,
roughly what percentage of scores are
between 12 and 24? Between 6 and 30?
Above 24? Below 24? Above 6? Below 6?
E70. A group of subjects tested a certain brand of
foam earplug. The number of decibels (dB)
that noise was reduced for these subjects was
approximately normally distributed, with
mean 30 dB and standard deviation 3.6 dB.
The middle 95% of noise reductions were
between what two values?
E71. The heights of 18- to 24-year-old males in
the United States are approximately normally
distributed with mean 70.1 in. and standard
deviation 2.7 in.
a. If you select a U.S. male between ages
18 and 24 at random, what is the
approximate probability that he is less
than 68 in. tall?
b. There are roughly 13 million 18- to
24-year-old males in the United States.
About how many are between 67 and
68 in. tall?
c. Find the height of 18- to 24-year-old
males that falls at the 90th percentile.
E72. If the measurements of height are
transformed from inches into feet, will that
change the shape of the distribution in E71?
Describe the distribution of male heights in
terms of feet rather than inches.
E73. The British monarchy. Over the 1200 years
of the British monarchy, the average reign of
kings and queens has lasted 18.5 years, with
a standard deviation of 15.4 years.
a. What can you say about the shape of the
distribution based on the information
given?
b. Suppose you made the mistake of
assuming a normal distribution. What
fraction of the reigns would you estimate
lasted a negative number of years?
c. Use your work in part b to suggest a
rough rule for using the mean and
standard deviation of a set of positive
values to check whether it is possible that
a distribution might be approximately
normal.
E74.NCAA scores. The histogram in Display 2.82
was constructed from the total of the scores
of both teams in all NCAA basketball
play-off games over a 57-year period.
Display 2.82 Total points scored in NCAA play-off
games. [Source: www.ncaa.com.]
a. Approximate the mean of this distribution.
b. Approximate the standard deviation of
this distribution.
c. Between what two values do the middle
95% of total points scored lie?
d. Suppose you choose a game at random
from next year's NCAA play-offs. What is
the approximate probability that the total
points scored in this game will exceed
150? 190? Do you see any potential
weaknesses in your approximations?
Chapter Summary
Distributions come in various shapes, and the appropriate summary statistics (for
center and spread) usually depend on the shape, so you should always start with a
plot of your data.
Common symmetric shapes include the uniform (rectangular) distribution
and the normal distribution. There are also various skewed distributions. Bimodal
distributions often result from mixing cases of two kinds.
Dot plots, stemplots, and histograms show distributions graphically and let
you estimate center and spread visually from the plot.
For approximately normal distributions, you ordinarily use the mean (balance
point) and standard deviation as the measure of center and spread. If you know
the mean and standard deviation of a normal distribution, you can use z-scores
and Table A or your calculator to find the percentage of values in any interval.
The mean and standard deviation are not resistant: their values are sensitive
to outliers. For a description of a skewed distribution, you should consider using
the median (halfway point) and quartiles (medians of the lower and upper halves
of the data) as summary statistics.
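To see what "not resistant" means in practice, here is a small Python sketch. The data values are made up for illustration; the second list is the first with its largest value replaced by an outlier.

```python
from statistics import mean, median, stdev

# Hypothetical data, then the same data with one value replaced by an outlier
values = [12, 14, 15, 15, 16, 17, 18]
with_outlier = [12, 14, 15, 15, 16, 17, 80]

for label, data in (("original", values), ("with outlier", with_outlier)):
    # stdev divides by n - 1, the sample standard deviation
    print(f"{label}: mean {mean(data):.1f}, SD {stdev(data):.1f}, "
          f"median {median(data)}")
```

The single outlier pulls the mean upward and inflates the standard deviation, while the median stays where it was.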
Later on, when you make inferences about the entire population from a
sample taken from that population, the sample mean and standard deviation will
be the most useful summary statistics, even if the population is skewed.
E75. The map in Display 2.83, from the U.S.
National Weather Service, gives the number of
tornadoes by state, including the District of
Columbia.
Display 2.83 The number of tornadoes per state in a
recent year. [Source: www.ncdc.noaa.gov.]
a. Make a stemplot of the number of
tornadoes.
b. Write the five-number summary.
c. Identify any outliers.
d. Draw a boxplot.
e. Compare the information in your
stemplot with the information in your
boxplot. Which plot is more informative?
f. Describe the shape, center, and spread
of the distribution of the number of
tornadoes.
E76. Display 2.84 shows some results of the Third
International Mathematics and Science Study
for various countries. Each case is a school.
Display 2.84 Boxplots of mathematics instruction
time by country for 9-year-olds. [Source:
Report #8, April 1998, of the Third International
Mathematics and Science Study (TIMSS), p. 6.]
a. Estimate the median for the United
States. Use this median value in a
sentence that makes it clear what the
median represents in this context.
b. Why are there only lines and no boxes for
Norway and Singapore?
c. Describe how the distribution of values
for the United States compares to the
distributions of values for the other
countries.
E77. A university reports that the middle 50% of
the SAT I math scores of its students were
between 585 and 670, with half the scores
above 605 and half below.
a. What SAT I math scores would be
considered outliers for that university?
b. What can you say about the shape of this
distribution?
E78. These statistics summarize a set of television
ratings from a week without any special
programming. Are there any outliers among
the 113 ratings?
E79. The boxplots in Display 2.85 show the life
expectancies for the countries of Africa,
Europe, and the Middle East. The table
shows a few of the summary statistics for
each of the three data sets.
Display 2.85 Life expectancies for the countries
of Africa, Europe, and the Middle East.
[Source: Population Reference Bureau, World
Population Data Sheet, 2005.]
a. From your knowledge of the world,
match the boxplots to the correct region.
b. Match the summary statistics (for
Groups A–C) to the correct boxplot (for
Regions 1–3).
E80. The National Climatic Data Center records
high and low temperatures by state since
1890. Stem-and-leaf plots of the years each
state had its lowest temperature and the years
each state had its highest temperature are
shown in Display 2.86. What do the stems
represent? What do the leaves represent?
Compare the two distributions with respect
to shape, center, spread, and any interesting
features.
Display 2.86 Stem-and-leaf plots of record low and
high temperatures of states. [Source: National
Climatic Data Center, 2002.]
E81. A distribution is symmetric with
approximately equal mean and median. Is
it necessarily the case that about 68% of the
values are within one standard deviation of
the mean? If yes, explain why. If not, give an
example.
E82. Display 2.87 shows two sets of graphs. The
first set shows smoothed histograms I–IV for
four distributions. The second set shows the
corresponding cumulative relative frequency
plots, in scrambled order A–D. Match each
plot in the first set with its counterpart in the
second set.
Display 2.87 Four distributions with different
shapes and their cumulative relative
frequency plots.
E83. The average number of pedestrian deaths
annually for 41 metropolitan areas is given
in Display 2.88.
Display 2.88 Average annual pedestrian deaths.
[Source: Environmental Working Group and
the Surface Transportation Policy Project.
Compiled from National Highway Traffic Safety
Administration and U.S. Census data. USA Today,
April 9, 1997.]
a. What is the median number of deaths?
Write a sentence explaining the meaning
of this median.
b. Is any city an outlier in terms of the
number of deaths? If so, what is the city,
and what are some possible explanations?
c. Make a plot of the data that you think
will show the distribution in a useful
way. Describe why you chose that plot
and what information it gives you about
average annual pedestrian deaths.
d. In which situations might giving the
death rate be more meaningful than
giving the number of deaths?
E84. The side-by-side boxplots in Display 2.89
give the percentage of 4th-grade-age
children who are still in school on various
continents according to the United Nations.
Each case is a country. The four regions
marked 1, 2, 3, and 4 are Africa, Asia,
Europe, and South/Central America, not
necessarily in that order.
Display 2.89 Boxplots of the percentage of
4th-grade-age children still in school
in countries of the world, by continent.
[Source: 1993 Information Please Almanac.]
a. Which region do you think corresponds
to which number?
b. Is the distribution of values for any
region skewed left? Skewed right?
Symmetrical?
Display 2.90 shows dot plots of the same
data.
Display 2.90 Dot plots of the percentage of
4th-grade-age children still in school in
countries of the world, by continent.
c. Match each dot plot to the corresponding
boxplot.
d. In what ways do the boxplots and
dot plots give different impressions?
Why does this happen? Which type
of plot gives a better impression of the
distributions?
E85. The first AP Statistics Exam was given in
1997. The distribution of scores received
by the 7667 students who took the exam is
given in Display 2.91. Compute the mean
and standard deviation of the scores.
Display 2.91 Scores on the first AP Statistics Exam.
[Source: The College Board.]
E86. For the countries of Europe, many average
life expectancies are approximately the
same, as you can see from the stemplot in
Display 2.53 on page 69. Use the formulas
for the summary statistics of values in a
frequency table to compute the mean and
standard deviation of the life expectancies
for the countries of Europe.
E87. Construct a set of data in which all values
are larger than 0, but one standard deviation
below the mean is less than 0.
E88. Without computing, what can you say about
the standard deviation of this set of values:
4, 4, 4, 4, 4, 4, 4, 4?
E89. In this exercise, you will compare how
dividing by n versus n − 1 affects the SD for
various values of n. So that you don't have to
compute the sum of the squared deviations
each time, assume that this sum is 400.
a. Compare the standard deviation that
would result from
i. dividing by 10 versus dividing by 9
ii. dividing by 100 versus dividing by 99
iii. dividing by 1000 versus dividing
by 999
b. Does the decision to use n or n − 1 in the
formula for the standard deviation matter
very much if the sample size is large?
E90. If two sets of test scores aren't normally
distributed, it's possible to have a larger
z-score on Test II than on Test I yet be in
a lower percentile on Test II than on Test I.
The computations in this exercise will
illustrate this point.
a. On Test I, a class got these scores: 11, 12,
13, 14, 15, 16, 17, 18, 19, 20. Compute the
z-score and the percentile for the student
who got a score of 19.
b. On Test II, the class got these scores:
1, 1, 1, 1, 1, 1, 1, 18, 19, 20. Compute the
z-score and the percentile for the student
who got a score of 18.
c. Do you think the student who got a score
of 19 on Test I or the student who got a
score of 18 on Test II did better relative to
the rest of the class?
E91. The average income, in dollars, of people in
each of the 50 states was computed for 1980
and for 2000. Summary statistics for these
two distributions are given in Display 2.92.
Display 2.92 Summary statistics of the average
income, in dollars, for the 50 states
for 1980 and 2000. [Source: U.S. Census
Bureau, Statistical Abstract of the United States,
2004–2005.]
a. Explain the meaning of $7,007 for the
minimum in 1980.
b. Are any states outliers for either year?
c. In 2000 the average personal income
in Alabama was $23,768, and in 1980 it
was $7,836. Did the income in Alabama
change much in relation to the other
states? Explain your reasoning.
E92. For these comparisons, you will either use
the SAT I critical reading scores in Display
2.69 on page 78 or assume that the scores
have a normal distribution with mean 505
and standard deviation 111.
a. Estimate the percentile for an SAT I
critical reading score of 425 using the
cumulative relative frequency plot. Then
find the percentile for a score of 425 using
a z-score. Are the two values close?
b. Estimate the SAT I critical reading score
that falls at the 40th percentile, using the
table in Display 2.69. Then find the 40th
percentile using a z-score. Are the two
values close?
c. Estimate the median from the cumulative
relative frequency plot. Is this value close
to the median you would get by assuming
a normal distribution of scores?
d. Estimate the quartiles and the
interquartile range using the plot. Find
the quartiles and interquartile range
assuming a normal distribution of scores.
E93. For 17-year-olds in the United States, blood
cholesterol levels in milligrams per deciliter
have an approximately normal distribution
with mean 176 mg/dL and standard deviation
30 mg/dL. The middle 90% of the cholesterol
levels are between what two values?
E94. Display 2.93 shows the distribution of
batting averages for all 187 American League
baseball players who batted 100 times or
more in a recent season. (A batting average
is the fraction of times that a player hits
safely, that is, the hit results in a player
advancing to a base; it is usually reported to
three decimal places.)
Display 2.93 American League batting averages.
[Source: CBS SportsLine.com, www.sportsline.com.]
a. Do the batting averages appear to be
approximately normally distributed?
b. Approximate the mean and standard
deviation of the batting averages from the
histogram.
c. Use your mean and SD from part b to
compute an estimate of the percentage of
players who batted over .300 (or 300).
d. Now use the histogram to estimate the
percentage of players who batted over .300.
Compare to your estimate from part c.
E95. How good are batters in the National
League? Display 2.94 shows the distribution
of batting averages for all 223 National
Leaguers who batted 100 times or more in a
recent season.
Display 2.94 National League batting averages.
[Source: CBS SportsLine.com, www.sportsline.com.]
a. Approximate the mean and standard
deviation of the batting averages from the
histogram.
b. Compare the distributions of batting
averages for the two leagues. (See E94
for the American League.) What are
the main differences between the two
distributions?
c. A batter hitting .300 in the National
League is traded to a team in the
American League. What batting average
could be expected of him in his new
league if he maintains about the same
position in the distribution relative to his
peers?
AP Sample Test
AP1. These summary statistics are for the
distribution of the populations of the major
cities in Brazil.
Which of the following best describes the
shape of this distribution?
skewed right without outliers
skewed right with at least one outlier
roughly normal, without outliers
skewed left without outliers
skewed left with at least one outlier
AP2. Which of these lists contains only summary
statistics that are sensitive to outliers?
mean, median, and mode
standard deviation, IQR, and range
mean and standard deviation
median and IQR
five-number summary
AP3. This stem-and-leaf plot shows the ages of
CEOs of 60 corporations whose annual sales
were between $5 million and $350 million.
Which of the following is not a correct
statement about this distribution?
The distribution is skewed left (towards
smaller numbers).
The oldest of the 60 CEOs is 74 years old.
The distribution has no outliers.
The range of the distribution is 42.
The median of the distribution is 50.
AP4. A traveler visits Europe and stays thirty
days in thirty different hotels, paying each
day with her credit card. The hotels charged
a mean price of 50 euros, with a standard
deviation of 10 euros. When the charges
appear on her credit card statement in
the United States, she finds that her bank
charged her $1.20 per euro, plus a $5 fee
for each transaction. What is the mean and
standard deviation of the thirty daily hotel
charges in dollars, including the fee?
mean $50, standard deviation $17
mean $60, standard deviation $12
mean $60, standard deviation $17
mean $65, standard deviation $12
mean $65, standard deviation $17
AP5. The scores on a nationally administered test
are approximately normally distributed with
mean 47.3 and standard deviation 17.3.
Approximately what must a student have
scored to be in the 95th percentile nationally?
AP6. A particular brand of cereal boxes is labeled
16 oz. This dot plot shows the actual
weights of 100 randomly selected boxes.
Which of the following is the best estimate
of the standard deviation of these weights?
AP7. The distribution of the number of points
earned by the thousands of contestants in
the Game of Pig World Championship has
mean 20 and standard deviation 6. What
proportion of the contestants earned more
than 26 points?
Less than 1%
16%
32%
84%
This proportion cannot be determined
from the information given.
AP8. Anya scored 70 on a statistics test for which
the mean was 60 and the standard deviation
was 10. She also scored 60 on a chemistry
test for which the mean was 50 and the
standard deviation was 5. If the scores for
both tests were approximately normally
distributed, which best describes how Anya
did relative to her classmates?
Anya did better on the statistics test than
she did on the chemistry test because she
scored 10 points higher on the statistics
test than on the chemistry test.
Anya did equally well relative to her
classmates on each test, because she
scored 10 points above the mean on each.
Anya did better on the chemistry test
than she did on the statistics test because
she scored 2 standard deviations above
the mean on the chemistry test and only
1 standard deviation above the mean on
the statistics test.
It's impossible to tell without knowing the
number of points possible on each test.
It's impossible to tell without knowing
the number of students in each class.
Investigative Task
AP9. A game invented by three college students
involves giving the name of an actor and
then trying to connect that actor with
actor Kevin Bacon, counting the number
of steps needed. For example, Sarah Jessica
Parker has a Bacon number of 1 because
she appeared in the same movie as Kevin
Bacon, Footloose (1984). Will Smith has a
Bacon number of 2. He has never appeared
in a movie with Kevin Bacon; however, he
was in Bad Boys II (2003) with Michael
Shannon, who was in The Woodsman (2004)
with Kevin Bacon. Display 2.95 gives the
number of links required to connect each
of the 645,957 actors in the Internet Movie
Database to Kevin Bacon.
Display 2.95 Bacon numbers. [Source: www.cs.virginia.edu.]
a. How many people have appeared in a
movie with Kevin Bacon?
b. Who is the person with Bacon number 0?
It has been questioned whether Kevin Bacon
was the best choice for the center of the
Hollywood universe. A possible challenger
is Sean Connery. See Display 2.96.
Display 2.96 Connery numbers.
c. Do you think Kevin Bacon or Sean
Connery better deserves the title
"Hollywood center"? Make your case
using statistical evidence (as always).
d. (For movie fans.) What is Bacon's
Connery number? What is Connery's
Bacon number?
Relationships Between Two Quantitative Variables
What variables contribute to a college having a high graduation rate?
Scatterplots, correlation, and regression are the basic tools used to describe
relationships between two quantitative variables.
In Chapter 2, you compared the speeds of predators and nonpredators. Not
surprisingly, among mammals meat eaters were usually faster than vegetarians.
Some nonpredators, however, such as the horse (48 mi/h) and the elk (45 mi/h),
were faster than some predators, such as the dog (39) and the grizzly (30). Because
of this variability, comparing the two groups was a matter for statistics; that is, you
needed suitable plots and summaries.
The comparison involved a relationship between two variables, one
quantitative (speed) and one categorical (predator or not). In this chapter, you'll
learn how to explore and summarize relationships in which both variables are
quantitative. The data set on mammals in Display 2.24 on page 43 raises many
questions of this sort: Do mammals with longer average longevity also have
longer maximum longevity? Is there a relationship between speed and longevity?
The approach to describing distributions in Chapter 2 boiled down to finding
shape, center, and spread. For distributions that are approximately normal, two
numerical summaries (the mean for center, the standard deviation for spread)
tell you basically all you need to know. When comparing two quantitative
variables, you can see the shape of the distribution by making a scatterplot. For
scatterplots with points that lie in an oval cloud, it turns out once again that two
summaries tell you pretty much all you need to know: the regression line and the
correlation. The regression line tells about center: What is the equation of the line
that best fits the cloud of points? The correlation tells about spread: How spread
out are the points around the line?
In this chapter, you will learn to
describe the pattern in a scatterplot, and decide what its shape tells you
about the relationship between the two variables
find a regression line through the center of a cloud of points to summarize
the relationship
use the correlation as a measure of how spread out the points are from
this line
use diagnostic tools to check for information the summaries don't tell you,
and decide what to do with that information
make shape-changing transformations to re-express a curved relationship so
that you can use a line as a summary
3.1 Scatterplots
In Chapter 1, you explored the relationship between the ages of employees at
Westvaco and whether the employees were laid off when the company downsized.
There is more to see. In the scatterplot in Display 3.1, for example, each employee
is represented by a dot that shows the year the employee was born plotted against
the year the employee was hired.
A scatterplot shows the relationship between two quantitative variables.
Display 3.1 Year of birth versus year of hire for the 50 employees
in Westvaco Corporation's engineering department.
In this scatterplot, you can see a moderate positive association: Employees
hired in an earlier year generally were born in an earlier year, and employees hired
in a later year generally were born in a later year. This trend is fairly linear. You
can visualize a summary line going through the center of the data from lower left
to upper right. As you move to the right along this line, the points fan out and
cluster less closely around the line.
Sometimes it's easier to think about people's ages than about the years
they were born. The scatterplot in Display 3.2 shows the ages of the Westvaco
employees at the time layoffs began plotted against the year they were hired. This
scatterplot shows a moderate negative association: Those people hired in later
years generally were younger at the time of the layoffs than people hired in
earlier years.
Display 3.2 Age at layoffs versus year of hire for the 50 employees
in Westvaco Corporation's engineering department.
[You can use a graphing calculator to create scatterplots. See Calculator Note 3A.]
Interpreting Scatterplots
D1. You will examine Displays 3.1 and 3.2 more closely in these questions and in
E7 on page 114.
a. Why should the two variables plotted in Display 3.1 show a positive
association and the two variables plotted in Display 3.2 show a negative
association?
b. Why do all but one of the points in Display 3.1 lie on or below a diagonal
line running from the lower left to the upper right?
c. Is this sentence a reasonable interpretation of Display 3.2? "As time
passed, Westvaco tended to hire younger and younger people."
Describing the Pattern in a Scatterplot
Shape, trend, and
strength
For the distribution of a single quantitative variable, shape, center, and spread
is a useful summary. For bivariate (two-variable) quantitative data, the summary
becomes shape, trend, and strength.
You might find it helpful to follow this set of steps as you practice describing
scatterplots.
1. Identify the variables and cases. On a scatterplot, each point represents
a case, with the x-coordinate equal to the value of one variable and the
y-coordinate equal to the value of the other variable. You should describe
the scale (units of measurement) and range of each variable.
2. Describe the overall shape of the relationship, paying attention to
linearity: Is the pattern linear (scattered about a line) or curved?
clusters: Is there just one cluster, or is there more than one?
outliers: Are there any striking exceptions to the overall pattern?
3. Describe the trend. If as x gets larger y tends to get larger, there is a positive
trend. (The cloud of points tends to slope upward as you go from left to
right.) If as x gets larger y tends to get smaller, there is a negative trend. (The
cloud of points tends to slope downward as you go from left to right.)
4. Describe the strength of the relationship. If the points cluster closely around
an imaginary line or curve, the association is strong. If the points are
scattered farther from the line, the association is weak.
If, as in Display 3.1, the points tend to fan out at one end (a tendency called
heteroscedasticity), the relationship varies in strength. If not, it has constant
strength.
5. Does the pattern generalize to other cases, or is the relationship an instance
of "what you see is all there is"?
6. Are there plausible explanations for the pattern? Is it reasonable to conclude
that one variable causes the other? Is there a third or lurking variable that
might be causing both?
A plot of variable A
against (or versus)
variable B shows A on
the y-axis and B on the
x-axis.
Example: Dormitory Populations
The plot in Display 3.3 shows, for the 50 states in the United States, the number of
people living in college dormitories versus the number of people living in cities,
in thousands. Describe the pattern in the plot.
Display 3.3
Number of people living in college dormitories
versus number of people living in cities for the
50 states in the United States. [Source: U.S. Census Bureau,
2000 Census of Population and Housing.]
Solution
1. Variables and cases. The scatterplot plots dormitory population against urban
population, in thousands, for the 50 U.S. states. Dormitory population
ranges from near 0 to a high of more than 174,000 in New York. The
urban population ranges from near 0 to about 17 million in Texas and
New York and 32 million in California.
2. Shape. While most states follow a linear trend, the three states with the largest
urban population suggest curvature in the plot because, for those states,
the number of people living in dormitories is proportionately lower than in
the smaller states. California can be considered an outlier with respect to its
urban population, which is much larger than that of other states. It is also
an outlier with respect to the overall pattern, because it lies far below the
generally linear trend.
3. Trend. The trend is positive: states with larger urban populations tend to
have larger dormitory populations, and states with smaller urban populations
tend to have smaller dormitory populations.
4. Strength. The relationship varies in strength. For the states with the smallest
urban populations, the points cluster rather closely around a line. For the
states with the largest urban populations, the points are scattered farther from
the line. Overall, the strength of the relationship is moderate.
5. Generalization. The 50 states aren't a sample from a larger population of
cases, so the relationship here does not generalize to other cases. Because
both variables tend to change rather slowly, however, we can expect the
relationship in Display 3.3 to be similar to that of other years.
6. Explanation. It is tempting to attribute the positive relationship to the idea
that cities attract colleges. (Just pick a large city nearby and see how many
colleges you can name that are located there.) The main reason for the
positive relationship, however, is not nearly so interesting: Both variables are
related to a state's population. The more people in a state, the more people
live in dormitories and the more people live in cities. (There's a moral here:
Interpreting association can be tricky, in part because the two variables you see
in a plot often will be related to some lurking variable that you don't see.)
Describing the Pattern in a Scatterplot
D2. Display 3.4 is derived from the data for Display 3.3 by converting
the variables to the proportion of a state's population living in college
dormitories (given as the number living in dorms per 1000 state residents)
and the proportion of the state's population living in cities.
a. Follow steps 1–6 in the previous example to describe what you see in this
new plot.
b. When you go from totals (Display 3.3) to proportion of total population
(Display 3.4), the relationship changes from positive to negative and
becomes weaker. Give an explanation for the differences in these
two plots.
Display 3.4
The proportion of people living in college
dormitories versus the proportion of people living
in cities for the 50 U.S. states.
Summary 3.1: Scatterplots
A scatterplot shows the relationship between two quantitative variables. Each
case is a point, with the x-coordinate equal to the value of one variable and the
y-coordinate equal to the value of the other variable.
In describing a scatterplot, be sure to cover all of the following:
cases and variables (What exactly does each point represent?)
shape (linear or curved, clusters, outliers)
trend (positive, negative, or none)
strength (strong, moderate, or weak; constant or varying)
generalization (Does the pattern generalize?)
explanation (Is there an explanation for the pattern?)
When asked to compare two scatterplots, don't just describe each separately.
Be sure to describe how their shapes, trends, and strengths are similar and
how they differ.
Practice
Describing the Pattern in a Scatterplot
P1. Growing kids. This table gives median heights
of boys at ages 2, 3, 4, 5, 6, and 7 yr.
a. Scatterplot. Plot height versus age; that
is, put height on the y-axis and age on
the x-axis.
b. Shape, trend, and strength. Describe
the shape, trend, and strength of the
relationship.
c. Generalization. Would you expect
these data to allow you to make good
predictions of the median height of
8-year-olds? Of 50-year-olds?
d. Explanation. It doesn't quite fit to say that
age causes height, but there is still an
underlying cause-and-effect relationship.
How would you describe it?
P2. Late planes and lost bags. A great way to cap
off a long day of travel is to have your plane
arrive late and then find that the airline has
lost your luggage. As Display 3.5 shows, some
airlines handle baggage better than others.
a. Which airline has the worst record for
mishandled baggage? For being on time?
b. Where on the plot would you find the
airline with the best on-time record and
the best mishandled-baggage rate? Which
airlines are best in both categories?
c. Determine whether this statement is
true or false, and explain your answer:
American had a mishandled-baggage
rate that was more than twice the rate of
Southwest.
Display 3.5 On-time arrivals versus mishandled
baggage. [Source: U.S. Department of
Transportation, Air Travel Consumer Report,
October 2005.]
d. Is there a positive or a negative
relationship between the on-time
percentage and the rate of mishandled
baggage? Is it strong or weak?
e. Would you expect the relationship in
this plot to generalize to some larger
population of commercial airlines?
Why or why not? Would you expect
the relationship in this plot to be roughly
the same for data from 10 years ago? For
next year?
Exercises
E1. For each of the lettered scatterplots in
Display 3.6, give the trend (positive or
negative), strength (strong, moderate, or
weak), and shape (linear or curved). Which
plots show varying strength?
a.
b.
c.
d.
e.
f.
g.
h.
Display 3.6 Eight scatterplots with various
distributions.
E2. For each set of cases and variables, tell
whether you expect the relationship to
be (i) positive or negative and (ii) strong,
moderate, or weak.
LaTasha Colander crosses the finish line of the
women's 100-meter dash final at the 2004 U.S.
Olympic Team Track and Field Trials.
E3. Match each set of cases and variables (A–D)
with the short summary (I–IV) of its
scatterplot.
A.
B.
C.
D.
I. strong negative relationship, somewhat
curved
II. strong, curved positive relationship
III. moderate, roughly linear, positive
relationship
IV. moderate negative relationship
E4. SAT I math scores. In 2005, the average
SAT I math score across the United States
was 520. North Dakota students averaged
605, Illinois students averaged 606, and
students from the nearby state of Iowa
did even better, averaging 608. Why do
states from the Midwest do so well? It is
easy to jump to a false conclusion, but the
scatterplot in Display 3.7 can help you find
a reasonable explanation.
Display 3.7 Average SAT I math scores by state
versus the percentage of high school
graduates who took the exam. [Source:
College Board, www.collegeboard.com.]
a. Estimate the percentage of students in
Iowa and in Illinois who took the SAT I.
New York had the highest percentage of
students who took the SAT I. Estimate
that percentage and the average SAT I
math score for students in that state.
b. Describe the shape of the plot. Do you
see any clusters? Are there any outliers?
Is the relationship linear or curved? Is the
overall trend positive or negative? What
is the strength of the relationship?
c. Is the distribution of the percentage
of students taking the SAT I bimodal?
Explain how the scatterplot shows this.
Is the distribution of SAT I math scores
bimodal?
d. The cases used in this plot are the 50 U.S.
states in 2005. Would you expect the
pattern to generalize to some other set of
cases? Why or why not?
e. Suggest an explanation for the trend.
(Hint: The SAT is administered from
Princeton, New Jersey. An alternative
exam, the ACT, is administered from
Iowa. Many colleges and universities in
the Midwest either prefer the ACT or at
least accept it in place of the SAT, whereas
colleges in the eastern states tend to
prefer the SAT.) Is there anything in the
data that you can use to help you decide
whether your explanation is correct?
E5. Each of the 51 cases plotted on the
scatterplots in Display 3.8 is a top-rated
university. The y-coordinate of a point tells
the graduation rate, and the x-coordinate
tells the value of some other quantitative
variable: the percentage of alumni who
gave that year, the student/faculty ratio, the
75th percentile of the SAT scores (math plus
critical reading) for a recent entering class,
and the percentage of incoming students
who ranked in the top 10% of their high
school graduating class.
Display 3.8 Scatterplots showing the relationship
between graduation rate and four other
variables for 51 top-rated universities.
[Source: U.S. News and World Report, 2000.]
a. Compare the shapes of the four plots.
i. Which plots show a linear shape?
Which show a curved shape?
ii. Which plots show just one cluster?
Which show more than one?
iii. Which plots have outliers?
b. Compare the trends of the relationships:
Which plots show a positive trend? A
negative trend? No trend?
c. Compare the strengths of the
relationships: Which variables give more
precise predictions of the graduation
rate? Which variable is almost useless for
predicting graduation rate?
d. Generalization. The cases in these plots
are the 51 universities that happened
to come out at the top of one particular
rating scheme. Do you think the complete
set of all U.S. universities would show
pretty much the same relationships? Why
or why not?
e. Explanation. Consider the two variables
with the strongest relationship to
graduation rates. Offer an explanation
for the strength of these particular
relationships. In what ways, if any, can
you use the data to help you decide
whether your explanation is in fact
correct?
E6. Hat size. What does hat size really measure?
A group of students investigated this
question by collecting a sample of hats.
They recorded the size of the hat and then
measured the circumference, the major
axis (the length across the opening in the
long direction), and the minor axis. (See
Display 3.9 on the next page. Hat sizes
have been changed to decimals; all other
measurements are in inches.) Is hat size
most closely related to circumference, major
axis, or minor axis? Answer this question by
making appropriate plots and describing the
patterns in those plots.
Display 3.9 Hat sizes, with circumference and axes
in inches. [Source: Roger Johnson, Carleton
College, data from student project.]
E7. Westvaco, revisited. To determine
whether Westvaco discriminated by age in
laying off employees, you could investigate
whether it might have discriminated in
hiring. Display 3.10 shows the age at hire
plotted against the year the person was
hired.
Display 3.10 Age at hire versus year of hire for
the 50 employees in Westvaco
Corporation's engineering department.
a. Describe the pattern in the plot, following
the six-step model.
b. Does this plot provide evidence that
Westvaco discriminated by age in hiring?
c. Display 3.11 shows the year of birth of
the Westvaco employees plotted against
the year they were hired. Open circles
represent employees laid off, and solid
circles represent employees kept. Does
this scatterplot suggest a reason why
older employees tended to be laid off
more frequently?
Display 3.11 Year of birth versus year of hire for
Westvaco employees.
E8. Passenger aircraft. Airplanes vary in their
size, speed, average flight length, and cost
of operation. You can probably guess that
larger planes use more fuel per hour and
cost more to operate than smaller planes,
but the shapes of the relationships are less
obvious. Display 3.12 lists data on the 33
most commonly used passenger airplanes
in the United States. The variables are the
number of seats, average cargo payload in
tons, airborne speed in miles per hour, flight
length in miles, fuel consumption in gallons
per hour, and operating cost per hour in
dollars.
Display 3.12 Data on passenger aircraft. [Source: Air Transport Association of America, 2005,
www.air-transport.org.]
a. cost per hour
i. Make scatterplots with cost per hour
on the y-axis to explore this variable's
dependence on the other variables.
Report your most interesting findings.
Here are examples of some questions
you could investigate: For which
variable is the relationship to the
cost per hour strongest? Is there any
one airplane whose cost per hour, in
relation to other variables, makes it
an outlier?
ii. Do your results mean that larger
planes are less efficient? Define your
own variable, and plot it against
other variables to judge the relative
efficiency of the larger planes.
b. flight length
i. Make scatterplots with length of flight
on the x-axis to explore this variable's
relationship to the other variables.
Report your most interesting
findings. Here is an example of a
question you could investigate:
Which variable, cargo or number of
seats, shows a stronger relationship
to flight length? Propose a reasonable
explanation for why this should be so.
ii. Do planes with a longer flight length
tend to use less fuel per mile than
planes with a shorter flight length?
c. speed, seats, and cargo
i. Make scatterplots to explore the
relationships between the variables
speed, seats, and cargo. Report your
most interesting findings. Here are
some examples of questions you
could investigate: For which variable,
cargo or number of seats, is the
relationship to speed more obviously
curved? Explain why that should be
the case. Which plane is unusually
slow for the amount of cargo it
carries? Which plane is unusually
slow for the number of seats it has?
ii. The plot of cargo against seats has two
parts: a flat stretch on the left and a fan
on the right. Explain, in the language
of airplanes, seats, and cargo, what
each of the two patterns tells you.
3.2 Getting a Line on the Pattern
In this section, you will learn how to use a regression line to summarize the
relationship between two quantitative variables. This section deals first with
the simple situation in which all the data points lie close to a line. In practice,
however, data points are often more scattered. The second part of this section
shows how to choose a summary line when your data points form an oval cloud.
To begin, you will review the properties of a linear equation.
Lines as Summaries
You've seen the equation of a line, y = slope · x + y-intercept, so the review here
will be brief. Linear relationships have the important property that for any two
points (x1, y1) and (x2, y2) on the line, the ratio
rise/run = (y2 − y1)/(x2 − x1)
is a constant. This ratio is the slope of the line. The rise and run are illustrated
in Display 3.13, where the slope is the ratio of the two sides of the right triangle.
This ratio is the same for any two points on the line because all the triangles
formed are similar.
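As a quick numerical check of this property, here is a minimal Python sketch. The line y = 2x + 1 and the sample x-values are hypothetical, chosen only for illustration; the helper name slope is also an assumption.

```python
def slope(p, q):
    """Rise over run between two points p = (x1, y1) and q = (x2, y2)."""
    (x1, y1), (x2, y2) = p, q
    return (y2 - y1) / (x2 - x1)

# Hypothetical line y = 2x + 1, sampled at a few x-values.
points = [(x, 2 * x + 1) for x in (0, 1, 4, 10)]

# Compute the ratio for every pair of distinct points on the line.
slopes = [slope(points[i], points[j])
          for i in range(len(points))
          for j in range(i + 1, len(points))]
print(slopes)
```

Every pair of points should give the same ratio, which is the slope of the line.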
The slope of a line is the
change in y divided by
the change in x.
How thick is a single sheet of your book? One sheet alone is too thin to
measure directly with a ruler, but you could measure the thickness of 100 sheets
together, then divide by 100. This method would give you an estimate of the
thickness but no information about how much your estimate is likely to vary from
the true thickness. The approach in the next activity lets you judge precision as
well as thickness.
Pinch 50 sheets of
paper, not up to
page 50.
Pinching Pages
What you'll need: a ruler with a millimeter scale, a copy of your textbook
1. Pinch together the front cover and first
50 sheets of your textbook. Then
measure and record the thickness to
the nearest millimeter.
2. Repeat for the front cover plus 100,
150, 200, and 250 sheets.
3. Plot your data on a scatterplot, with
number of sheets on the horizontal
scale and total thickness on the vertical
scale.
4. Does the plot look linear? Should it?
Discuss why or why not, and make
your measurements again if necessary. On the plot, place a straight line
that best fits the cloud of points.
5. Find the slope and y-intercept of your line. What does the y-intercept tell
you? What does the slope tell you? What is your estimate of the thickness
of a sheet?
6. Use the information in your graph to discuss how much your estimate in
step 5 is likely to vary from the true thickness.
7. How would your line have changed if you hadn't included the front cover?
Interpreting Slope
D3. Decide what the variables y and x represent in these situations.
a. Suppose regular unleaded gasoline costs $2.60 per gallon. The number
2.60 is the slope of the line you get if you plot y versus x.
b. Suppose your car averages 30 mi/gal. The number 30 is the slope of a line
fitted to a scatterplot of y versus x.
c. "A pint's a pound the world around." This slogan summarizes the slope of
a line fitted to a scatterplot of y versus x for various quantities of various
kinds of liquids.
The next example illustrates how to find the equation of a line when you know
two points that fall on the line.
Example: Minimum Wage
In an effort to keep wages of hourly workers at a level that allows some possibility
of making a decent living, the United States government establishes a minimum
hourly wage rate. The scatterplot in Display 3.14 shows the minimum wage (in
dollars) for every five years from 1960 through 2005. The line on the plot is the
least squares regression line, which you will learn about later in this section.
Estimate the slope of the line. What does the slope tell you? Estimate the equation
of the line.
Display 3.14 Minimum wages at five-year intervals, 1960
through 2005.
Solution
In theory, you can find the slope from any two points on the line. Here, however,
you have to estimate the coordinates from the graph. In such cases, you usually
can produce a better estimate of the slope by choosing two points that are far
apart. For this plot, choosing the points on the line for the years 1960 and 2000
works well. Approximate points are (1960, 0.80) for (x1, y1) and (2000, 4.80) for
(x2, y2). The estimated slope is
slope = (y2 − y1)/(x2 − x1) = (4.80 − 0.80)/(2000 − 1960) = 4.00/40 = 0.10
Be sure to qualify your
interpretations of the
slope. The minimum
wage didn't go up
exactly $0.10 per year.
It "tended to" go up this
amount or went up
"about" this amount.
The slope tells you that the minimum wage increased by about $0.10 per year over
the 45-yr period 1960 through 2005.
You can write the equation of a line in terms of its slope and y-intercept.
y = slope · x + y-intercept
You have the slope, but you cannot read the y-intercept from the plot. To find it,
use the slope and a point such as (1960, 0.80) to solve for the y-intercept.
y = slope · x + y-intercept
0.80 = 0.10(1960) + y-intercept
y-intercept = 0.80 − 196 = −195.20
The equation of the line using these approximate values is
y = 0.10x − 195.20
In statistics this equation usually is written with the intercept first, becoming
y = −195.20 + 0.10x
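The two-point calculation above can be sketched in Python as follows. The function name line_through is an assumption made for illustration; the two points are the approximate ones read from the plot in the example.

```python
def line_through(p, q):
    """Return (intercept, slope) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)   # change in y divided by change in x
    intercept = y1 - slope * x1     # solve y1 = slope * x1 + intercept
    return intercept, slope

# Approximate points on the line for the years 1960 and 2000.
intercept, slope = line_through((1960, 0.80), (2000, 4.80))
print(f"y = {intercept:.2f} + {slope:.2f}x")
```

Because the coordinates are only estimates read from a graph, the resulting slope and intercept are estimates too.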
Lines as Summaries
D4. The Consumer Price Index (CPI) is a measure of the prices paid by urban
consumers for a selected group of goods and services (called a market
basket) thought to be typical of urban households. The CPI often is used to
adjust salaries, rents, and other segments of the economy. The CPI is, itself,
a statistical estimate based on a number of large-scale surveys conducted by
agencies of the federal government. Display 3.15 shows the CPI for every five
years from 1970 to 2005. Fit a line to these data by eye, and use two points
to estimate its slope. What is the annual rate of increase for the CPI over this
35-yr period? What is the equation of your line?
Display 3.15 CPI at five-year intervals, 1970–2005.
Using Lines for Prediction
There are two main reasons why you would want to fit a line to a set of data:
to find a summary, or model, that describes the relationship between the two
variables
to use the line to predict the value of y when you know the value of x. In
cases where it makes sense to do this, the variable on the x-axis is called the
predictor or explanatory variable, and the variable on the y-axis is called the
predicted or response variable.
Lines are summaries and
can be used to predict.
In the previous example, the equation
y = −195.20 + 0.10x
models the rise in the minimum wage for the years 1960 through 2005. Knowing
this equation enables you to make a general statement about the minimum wage
throughout these years: The minimum wage went up roughly $0.10 per year.
You might instead want to use the line to predict the minimum wage in one of
the years for which no amount is given or for years before 1960 or after 2005.
Example: Predicting the Minimum Wage
Use the equation y = −195.20 + 0.10x to predict the minimum wage in the years
2003 and 1950.
Solution
The predicted minimum wage for 2003 is
y = −195.20 + 0.10x = −195.20 + 0.10(2003) = 5.10
Assuming the linear trend continues back to earlier years, the predicted minimum
wage for 1950 is
y = −195.20 + 0.10x = −195.20 + 0.10(1950) = −0.20
The predicted minimum wage for 2003 is very close to the actual minimum
wage of $5.15 per hour. But the actual minimum wage in 1950 was $0.75 per hour,
not a negative number! As you can see, making the assumption that the linear
trend continues can be risky. This type of prediction, making a prediction when
the value of x falls outside the range of the actual data, is called extrapolation.
Interpolation (making a prediction when the value of x falls inside the range of
the data, as does 2003) is safer.
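The distinction between interpolation and extrapolation can be sketched in Python. The names predict, kind, and DATA_RANGE are assumptions for illustration; the coefficients are the approximate ones from the minimum wage example, and the range is the span of years plotted there.

```python
INTERCEPT, SLOPE = -195.20, 0.10   # approximate line from the example
DATA_RANGE = (1960, 2005)          # years covered by the plotted data

def predict(year):
    """Predicted minimum wage (dollars) from the fitted line."""
    return INTERCEPT + SLOPE * year

def kind(year):
    """Label a prediction by whether the year lies inside the data range."""
    lo, hi = DATA_RANGE
    return "interpolation" if lo <= year <= hi else "extrapolation"

for year in (2003, 1950):
    print(year, round(predict(year), 2), kind(year))
```

The negative prediction for 1950 is a reminder that a line fitted to one range of years need not hold outside it.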
Suppose you know the value of x and use a line to predict the corresponding
value of y. You know that your prediction for y won't be exact, but you hope
that the error will be small. The prediction error is the difference between the
observed value of y and the predicted value of y, or ŷ. You usually don't know
what that error is. If you did, you wouldn't need to use the line to predict the value
of y. You do, however, know the errors for the points used to construct the line.
These differences are called residuals:
residual = observed value of y − predicted value of y = y − ŷ
ŷ is read "y-hat" and may
be called the predicted
value or the fitted value.
The residual is the
signed vertical distance
of the data point from
the line.
The geometric interpretation of the residual is shown in Display 3.16. A residual
is the signed vertical distance from an observed data point to the regression line.
The residual is positive if the point is above the line and negative if the point is
below the line.
Display 3.16 Residual = y − ŷ
Example: Finding Residuals
Display 3.17 shows the mean net income (after expenses and before taxes, in
thousands of dollars) for doctors who were board-certified in family practice and
working during the years 1990–1998 and 2001.
Display 3.17 Mean net income for family practitioners,
1990–2001. [Source: U.S. Census Bureau, Statistical Abstract of
the United States, 2004–2005.]
The equation of the fitted line is ŷ = −8300.6 + 4.2248x, where x is the year
and ŷ is the income in thousands of dollars.
Graph the fitted line with the data points. What is the residual for the year
1996?
Solution
You can use a graphing calculator to graph a scatterplot with a summary line.
[See Calculator Note 3B.]
The actual net income value for 1996 was $139,000. Using the equation of the
fitted line, the prediction for 1996 is
ŷ = −8300.6 + 4.2248x = −8300.6 + 4.2248(1996) = 132.1008, or $132,101
You also can use your calculator to calculate a predicted value quickly. [See
Calculator Note 3C.]
To find the residual, subtract the predicted value from the observed value:
y − ŷ = 139 − 132.1008 = 6.8992
or about $6899. The residual is positive because the observed value is higher than
the predicted value. That is, the point lies above the line.
You can use a calculator to calculate residuals for all points in a data set
simultaneously. [See Calculator Note 3D.]
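For readers working without a graphing calculator, the residual for 1996 can be reproduced with a short Python sketch. The function names fitted and residual are assumptions for illustration; the coefficients and the observed value of 139 (thousands of dollars) come from the example.

```python
def fitted(year):
    """Predicted mean net income (thousands of dollars) from the fitted line."""
    return -8300.6 + 4.2248 * year

def residual(observed, predicted):
    """Residual = observed value of y minus predicted value of y."""
    return observed - predicted

y_hat = fitted(1996)
res = residual(139, y_hat)
print(round(y_hat, 4), round(res, 4))
```

A positive result means the observed point lies above the line.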
Using Lines for Prediction
D5. Test how well you understand residuals.
a. If a residual is large and negative, where is the point located with respect
to the line? Draw a diagram to illustrate. What does it mean if the
residual is 0?
b. If someone said that they had fit a line to a set of data points and all their
residuals were positive, what would you say to them?
c. Interpret the y-intercept of the regression line in the previous example.
Does this make sense?
D6. What do you think of the arithmetic and the reasoning in this passage from
Mark Twain's Life on the Mississippi?
In the space of one hundred and seventy-six years the Lower Mississippi
has shortened itself two hundred and forty-two miles. That is an average
of a trifle over one mile and a third per year. Therefore, any calm person,
who is not blind or idiotic, can see that in the Old Oolitic Silurian
Period, just a million years ago next November, the Lower Mississippi
River was upwards of one million three hundred thousand miles long,
and stuck out over the Gulf of Mexico like a fishing rod.
And by the same token any person can see that seven hundred and
forty-two years from now the Lower Mississippi will be only a mile and
three quarters long, and Cairo and New Orleans will have joined their
streets together, and be plodding comfortably along under a single mayor
and a mutual board of aldermen. There is something fascinating about
science. One gets such wholesale returns of conjecture out of such a
trifling investment of fact. [Source: James R. Osgood and Company, 1883, p. 208.]
Given that the Mississippi/Missouri river system was about 3710 mi long in
the year 2000, write an equation that Twain would say gives the length of the
river in terms of the year.
Least Squares Regression Lines
The general approach to fitting lines to data is called the method of least
squares. The method was invented about 200 years ago by Carl Friedrich
Gauss (1777–1855), Adrien-Marie Legendre (1752–1833), and Robert Adrain
(1775–1843), who were working independently of one another in Germany,
France, and Ireland, respectively.
The least squares regression line, also called least squares line or regression
line, for a set of data points (x, y) is the line for which the sum of squared errors
(residuals), or SSE, is as small as possible.
Example: Regression Line for the Passenger Jets
This table shows cost per hour versus number of seats for three models of
passenger jets from the data in Display 3.12 on page 115. (Some of the values
have been rounded.)
Which of these two equations gives the least squares regression line for predicting
cost from number of seats?
ŷ = 367 + 16x
ŷ = 300 + 16x
Solution
The least squares regression line minimizes the sum of the squared errors,
SSE, so the equation with the smaller SSE must be the equation of the
regression line.
For the equation ŷ = 367 + 16x:
For the equation ŷ = 300 + 16x:
[You can use your calculator to calculate the SSE quickly. See Calculator Note 3E.]
The first equation has the smaller SSE, so it must be the equation of the least
squares regression line. Note that for this line, except for rounding error, the sum
of the residuals, Σ(y − ŷ), is equal to 0. This is always the case for the least squares
regression line, but it can be true for other lines, too.
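The comparison of two candidate lines by SSE can be sketched in Python. The three (seats, cost) pairs below are hypothetical values made up for illustration and are not the table from the example; the function name sse is also an assumption.

```python
# Hypothetical (seats, cost per hour) values for three planes; illustrative only.
seats = [50, 100, 150]
cost = [1100, 2100, 2700]

def sse(b0, b1, xs, ys):
    """Sum of squared errors for the line y-hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

sse_a = sse(367, 16, seats, cost)   # candidate line 367 + 16x
sse_b = sse(300, 16, seats, cost)   # candidate line 300 + 16x
print(sse_a, sse_b)
```

Of the two candidates, the one with the smaller SSE is the better fit in the least squares sense. With these made-up values the first line has the smaller SSE.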
[You can use a calculator program to visually explore the least squares regression line
and SSE. See Calculator Note 3F.]
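The SSE comparison in this example can also be sketched in a few lines of Python. This is only an illustration: the data values below are hypothetical (they are not the passenger jet values), and the function name `sse` is an arbitrary choice.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared errors for the line y-hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two candidate lines with the same slope but different y-intercepts.
sse_a = sse(xs, ys, b0=2.2, b1=0.6)
sse_b = sse(xs, ys, b0=1.5, b1=0.6)

print(f"SSE for y-hat = 2.2 + 0.6x: {sse_a:.2f}")
print(f"SSE for y-hat = 1.5 + 0.6x: {sse_b:.2f}")
```

Of these two candidates, the line with the smaller SSE is the only one that could be the least squares line for the hypothetical data.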
In addition to making the sum of the squared errors as small as possible, the
least squares regression line has some other properties, given in the box on the
next page.
Properties of the Least Squares Regression Line
The fact that the sum of squared errors, or SSE, is as small as possible means
that, for the least squares regression line, these properties also hold:
The sum (and mean) of the residuals is 0.
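The line contains the point of averages, $(\bar{x}, \bar{y})$.
The standard deviation of the residuals is smaller than for any other line
that goes through the point $(\bar{x}, \bar{y})$.
The line has slope b1, where
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$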
There are some appealing mathematical relationships among these properties,
which, taken together, show that the line through the point of averages $(\bar{x}, \bar{y})$
having slope b1 does, in fact, minimize the sum of the squared errors. This gives
you a way to find the equation of the least squares line: First,
compute the slope b1 using the formula in the box. Then find the y-intercept, b0,
using the point $(\bar{x}, \bar{y})$ and solving the equation $\bar{y} = b_0 + b_1\bar{x}$ for $b_0$.
Example: Least Squares Line for the Passenger Jets
Find the least squares line for the passenger jets data given in the previous
example.
Solution
Finding the line requires three main steps: Find the point of averages $(\bar{x}, \bar{y})$, find
the slope b1, and use the point and slope to find the y-intercept b0.
A convenient way to organize the computations is to work from a table.
Point of averages: The point of averages is $(\bar{x}, \bar{y})$, and the least
squares regression line passes through this point.
Slope: To compute the slope, first create two new columns for deviations from
the mean, one for x and the other for y: $x - \bar{x}$ and $y - \bar{y}$.
Now create two more columns, one for $(x - \bar{x})(y - \bar{y})$ and the other for $(x - \bar{x})^2$.
The ratio of the sums of the last two columns gives the slope.
y-intercept: Now that you have a point on the line, $(\bar{x}, \bar{y})$, and the
slope, 16, you can find the y-intercept from the equation $\bar{y} = b_0 + 16\bar{x}$.
This agrees with what you found in the previous example. That is, the
equation of the least squares regression line (with rounded y-intercept) is
$\hat{y} = 367 + 16x$.
[You also can use your calculator to find the equation of the least squares line. See
Calculator Note 3G.]
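The three steps of the solution (point of averages, slope, y-intercept) can be sketched in Python as follows. The data are hypothetical, and `least_squares_line` is an illustrative name rather than a function from any particular library.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(xs)
    # Step 1: point of averages
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Step 2: slope, from the columns of deviations from the mean
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    b1 = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
    # Step 3: y-intercept, because the line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"y-hat = {b0:.2f} + {b1:.2f}x")
print(f"sum of residuals = {sum(residuals):.6f}")
```

Except for floating-point rounding, the residuals sum to 0, which matches the first property in the box.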
Least Squares Regression Lines
D7. You might have wondered why statisticians don't fit a regression line by
minimizing the sum of the absolute values of the residuals, $\sum |y - \hat{y}|$, rather
than the sum of the squares of the residuals. Here you will learn one
reason why.
a. Plot the points in the table and the line y = 1 + x. Explain why this is the
line that best fits these points. Compute the sum of the absolute values of
the residuals.
b. Draw another line that passes between the two points at x = 0 and also
passes between the two points at x = 2. Compute the sum of the absolute
values of the residuals for this line and compare it to your sum from
part a.
c. Draw yet another line that passes between the two points at x = 0 and
also passes between the two points at x = 2. Find the sum of the absolute
values of the residuals for this line and compare it to your sums from
parts a and b.
d. Draw a line that does not pass between the two points at x = 0. Find the
sum of the absolute values of the residuals for this line and compare it to
your sums from parts a and b.
e. Now find the least squares regression line and compute the sum of the
squared residuals. Compute the sum of the squared residuals for your
lines in parts b, c, and d. What can you conclude?
f. Find the standard deviation of the residuals for the least squares
regression line and for the lines in parts b, c, and d. What can you
conclude?
Reading Computer Output
When you are working with real data, the best way to get the least squares line
is by computer or calculator. Display 3.18 shows typical computer output for the
minimum wage data in Display 3.14 on page 118.
Display 3.18 Data Desk output giving the equation of the least
squares line for the minimum wage data.
You can ignore most of the output for now. You will learn how to interpret it
in Chapter 11. For the time being, focus on the first two columns in the last three
rows, which are reproduced in Display 3.19.
Display 3.19 The lower-left corner of the computer output gives
the y-intercept and slope.
The y-intercept is the coefficient in the row labeled Constant and is −196.977.
The slope is the coefficient of the predictor variable Year and is 0.100909.
The SSE for the regression line is found in the Residual row and is 0.354545.
Reading Computer Output
D8. Doctors' incomes. Display 3.20 shows the mean net income y of family
practitioners versus year x (from page 121), with Data Desk computer
output for the least squares line.
Display 3.20 Scatterplot of mean net income (in thousands
of dollars) of doctors board-certified in family
practice, 1990–2001, and Data Desk output for the
regression.
a. What is the equation of the least squares line? Estimate the SSE from the
scatterplot in Display 3.20, and then find it in the computer output.
b. The Minitab software output for this regression is shown in Display 3.21.
How is it different from the Data Desk output?
Display 3.21 Minitab output for the regression of family
practitioners' income versus year.
Summary 3.2: Getting a Line on the Pattern
For many quantitative relationships, it makes sense to use one variable, x, called
the predictor or explanatory variable, to predict values of the other variable, y,
called the predicted or response variable. When the data are roughly linear, you
can use a fitted line, called the least squares regression line, as a summary or
model that describes the relationship between the two variables. You might also
use it to predict the value of an unknown value y when you know the value of x.
Interpolation (using a fitted relationship to predict a response value when
the predictor value falls within the range of the data) generally is much more
trustworthy than extrapolation (predicting response values based on the
assumption that a fitted relationship applies outside the range of the observed data).
Each residual from a fitted line measures the vertical distance from a data
point to the line:
residual = observed value − predicted value = $y - \hat{y}$
The least squares regression line for a set of pairs (x, y) is the line for which
the sum of squared errors, or SSE, is as small as possible. For this line, these
properties hold:
The sum (and mean) of the residuals is 0.
The line contains the point of averages, $(\bar{x}, \bar{y})$.
The variation in the residuals is as small as possible.
The line has slope b1, where
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
To find the equation of the regression line,
compute $\bar{x}$ and $\bar{y}$.
find the slope using the formula for b1.
compute the y-intercept: $b_0 = \bar{y} - b_1\bar{x}$.
The equation is $\hat{y} = b_0 + b_1 x$. Remember to use a hat, $\hat{y}$, to indicate a predicted
value of y.
Practice
Lines as Summaries
P3. Display 3.22 shows the weight of a student's
pink eraser, in grams, plotted against
the number of days into the school year.
Estimate the slope of the line drawn on the
graph. Interpret the slope in the context
of the situation. [Source: Zach's Eraser, CMC
ComMuniCator, 28 (June 2004): 28.]
Display 3.22 Weight of pink eraser.
P4. Display 3.23 shows the hand width of 383
students plotted against hand length. The line
drawn on the plot is the least squares line.
a. Estimate the slope of the line.
b. What does the slope tell you?
c. Estimate the equation of the line.
d. Students were instructed to measure their
hand width with their fingers spread
apart as far as possible. The scatterplot
shows a smaller cloud of points below
the main one. Why do you think that
is the case? What would happen to the
regression line if those points were
removed?
Display 3.23 Hand width and hand length, in
inches, for 383 students.
Using Lines for Prediction
P5. If you attend a university where class sizes
tend to be small, are you more likely to give
to your alumni fund after you graduate than
if you graduate from a university with large
classes? Display 3.24 shows a scatterplot of
a sample of 40 universities. Each university
appears as a point. The vertical coordinate,
y, tells the percentage of alumni who gave
money. Each x-coordinate tells the student/
faculty ratio (number of students per faculty
member). The equation of the fitted line is
approximately $\hat{y} = 55 - 2x$.
a. Which is the explanatory variable and
which is the response variable?
b. Explain how you can see from the
graph that an increase of five students
per faculty member corresponds to a
decrease of about 10 percentage points in
the giving rate. Explain how you can see
this from the equation of the fitted line.
c. Does the y-intercept have a useful
interpretation in this situation?
d. Use the regression line to predict the
giving rate for a university with a student/
faculty ratio of 16. When you use the
regression line to predict the giving rate,
would you expect a rather large error or a
relatively small error in your prediction?
e. Use the plot to estimate the residual for
the university with the highest student/
faculty ratio and for the university with
the highest giving rate.
f. The university with the lowest student/
faculty ratio, 6 to 1, had a giving rate of
32%. Use the equation of the fitted line to
find the residual for that university.
g. Suppose the Alumni Association at
Piranha State University boasts a giving
rate of 80%. Without knowing the
student/faculty ratio at PSU, can you
tell whether the prediction error will be
positive or negative?
Display 3.24 Percentage of alumni giving to the
alumni fund versus the student/
faculty ratio for 40 highly rated U.S.
universities.
Least Squares Regression Line
P6. The fat and calorie contents of 5 oz of three
kinds of pizza are represented by the data
points (9, 305), (11, 309), and (13, 316).
a. Plot the points.
b. Compute the equation of the least squares
regression line by hand, and draw the line
on your plot.
c. Interpret the slope and y-intercept in the
context of this situation.
d. Verify that the least squares regression
line goes through the point of averages.
e. Verify that the sum of the residuals is 0.
P7. Use the statistical functions of your
calculator to make a scatterplot, find the
regression equation for predicting percentage
on-time arrivals from mishandled baggage,
and compute residuals for the airline data
from P2 on page 110. [See Calculator Notes
3A, 3G, and 3D.] The data values are given in
Display 3.25.
Display 3.25 Comparison, by airline, of mishandled
baggage and on-time arrival rate.
[Source: U.S. Department of Transportation, Air
Travel Consumer Report, October 2005.]
Reading Computer Output
P8. The JMP-IN computer output in Display
3.26 is for the pizza data in P6. Does it give
the same results that you computed by hand?
Where in the output is the SSE found?
Display 3.26 JMP-IN computer output for pizza data.
Exercises
E9. Display 3.27 shows cost in dollars per hour
versus number of seats for three aircraft
models. Five lines, labeled A–E, are shown
on the plot. Their equations, listed below, are
labeled I–V.
a. Match each line (A–E) with its equation
(I–V).
I. cost = 290 + 15.8 seats
II. cost = 400 + 15.8 seats
III. cost = 1000 + 15.8 seats
IV. cost = 370 + 25 seats
V. cost = 900 + 10 seats
b. Match each line (A–E) with the
appropriate verbal description (I–V):
I. This line overestimates cost.
II. This line underestimates cost.
III. This line overestimates cost for the
smallest plane and underestimates
cost for the largest plane.
IV. This line underestimates cost for the
smallest plane and overestimates cost
for the largest plane.
V. On balance, this line gives a better fit
than the other lines.
Display 3.27 Cost in dollars per hour versus number
of seats for three aircraft models.
E10. Examine the scatterplot in Display 3.28.
Display 3.28 Calories versus fat, per 5-oz serving,
for seven kinds of pizza. [Source: Consumer
Reports, July 2003.]
a. Which two kinds of pizza in Display 3.28
have the fewest calories? Which two have
the least fat? Which region of the graph
has the pizzas with the most fat?
b. Display 3.29 shows the data again, with
five possible summary lines. Match each
equation (I–V) with the appropriate
line (A–E).
I. calories = 70 + 15 fat
II. calories = −10 + 25 fat
III. calories = 150 + 15 fat
IV. calories = 110 + 15 fat
V. calories = 170 + 10 fat
Display 3.29 Five possible fitted lines for the
pizza data.
c. Consider the possible summary lines in
Display 3.29.
i. Which line gives predicted values
for calorie content that are too high?
How can you tell this from the plot?
ii. Which line tends to give predicted
calorie values that are too low?
iii. Which line tends to overestimate
calorie content for lower-fat pizzas
and underestimate calorie content for
higher-fat pizzas?
iv. Which line has the opposite problem,
underestimating calorie content
when fat content is lower and
overestimating calorie content when
fat content is higher?
v. Which line fits the data best overall?
E11. Heights of boys. The scatterplot in Display
3.30 shows the median height, in inches, for
boys ages 2 through 14 years.
Display 3.30 Median height versus age for boys.
[Source: National Health and Nutrition
Examination Survey (NHANES), 2002,
www.cdc.gov.]
a. Estimate the slope of the line that
summarizes the relationship between age
and median height.
b. Explain the meaning of the slope with
respect to boys and their median height.
c. Write the equation of the line using the
slope from part a and a point on the line.
d. Interpret the y-intercept. Does the
interpretation make sense in this context?
E12. Pizza again. Display 3.31 shows the calorie
and fat content of 5 oz of various kinds of
pizza.
Display 3.31 Calories and fat content per 5-oz
serving, for seven kinds of pizza. [Source:
Consumer Reports, January 2002.]
a. Use the line on the scatterplot to predict
the calorie content of a pizza with 10.5 g
of fat. Then use the line to predict the
calorie content of a pizza with 15 g of fat.
b. Use the two predictions in part a to
estimate the slope of the line. Write the
equation of the line using this slope and a
point on the line.
c. There are 9 calories in a gram of fat. How
is your estimated slope related to this
number?
E13. Stopping on a dime? In an emergency, the
typical driver requires about 0.75 second to
get his or her foot onto the brake pedal. The
distance the car travels during this reaction
time is called the reaction distance. Display
3.32 shows the reaction distances for cars
traveling at various speeds.
Display 3.32 Reaction distance at various speeds.
a. Plot reaction distance versus speed, with
speed on the horizontal axis. Describe the
shape of the plot.
b. What should the y-intercept be?
c. Find the slope of the line of best fit
by calculating the change in y per
unit change in x. What does the slope
represent in this situation?
d. Write the equation of the line that fits
these data.
e. Use the equation of the line in part d to
predict the reaction distance for a car
traveling at a speed of 55 mi/h and at
75 mi/h.
f. How would the equation change if it
actually took 1 second, instead of 0.75
second, for drivers to react?
E14. The scatterplot in Display 3.33 shows
operating cost (in dollars per hour) versus
fuel consumption (in gallons per hour) for
a sample of commercial aircraft.
Display 3.33 Operating cost versus fuel
consumption for commercial aircraft.
a. Which is the explanatory variable and
which is the response variable?
b. Estimate the slope of the regression line
from the graph, and interpret it in the
context of this situation.
c. The y-intercept is 470. Does this value
have a reasonable interpretation in this
situation?
d. Use the line to predict the cost per hour for
a plane that consumes 1500 gal/h of fuel.
E15. Arsenic is a potent poison sometimes found in
groundwater. Long-term exposure to arsenic
in drinking water can cause cancer. How
much arsenic a person has absorbed can be
measured from a toenail clipping. The plot in
Display 3.34 shows the arsenic concentrations
in the toenails of 21 people who used water
from their private wells plotted against the
arsenic concentration in their well water. Both
measurements are in parts per million.
Display 3.34 Arsenic concentrations. [Source: M. R.
Karagas et al. Toenail Samples as an Indicator
of Drinking Water Arsenic Exposure, Cancer
Epidemiology, Biomarkers and Prevention 5 (1996):
849–52.]
a. What is the predictor variable, and what
is the response variable?
b. Describe the relationship.
c. Estimate the residual for the person with
the highest concentration of arsenic in
the well water.
d. Find the person on the plot with
the largest residual. What was the
concentration of arsenic in that person's
toenails?
e. The World Health Organization has set a
standard that the concentration of arsenic
in drinking water should be less than
0.01 mg/L. (1 mg/L = 1 ppm.) Is this
standard exceeded in any of these wells?
[Source: www.who.int.]
E16. More pizza. Refer to the pizza data in E12.
a. The least squares residuals for the pizza
data are, in order from smallest to largest,
−40.58, −17.66, −15.95, −1.03, 14.28,
26.44, and 34.50. Match each residual
with its pizza.
b. What does the residual for Pizza Hut's
Pan pizza tell you about the pizza's
number of calories versus fat content?
c. For Pizza Hut's Hand Tossed and
Domino's Deep Dish, are the residuals
positive or negative? How can you tell
this from the scatterplot in Display 3.31?
E17. The level of air pollution is indicated by a
measure called the air quality index (AQI).
An AQI greater than 100 means the air
quality is unhealthy for sensitive groups such
as children. The table and plot in Display
3.35 show the number of days in Detroit that
the AQI was greater than 100 for the years
2001, 2002, and 2003.
Display 3.35 Air quality index for 2001–2003.
[Source: U.S. Environmental Protection Agency,
www.epa.gov.]
a. By hand, compute the equation of the
least squares line.
b. Interpret the slope in the context of this
situation.
c. Which year has the largest residual? What
is this residual?
d. Compute the SSE for this line.
e. Verify that the sum of the residuals is 0.
f. Find the SSE for the line that has the
same slope as the least squares line but
passes through the point for 2002. Is
this SSE larger or smaller than the SSE
for the least squares line? According to
the least squares approach, which line
fits better?
g. Find the slope of the line that passes
through the points for 2001 and 2003.
Then find the fitted value for 2002.
Finally, find all three residuals and the
value of the SSE for this line.
h. The least squares line doesn't pass
through any of the points, and yet
judging by the SSE that line fits better
than the one in part g. Do you agree that
the least squares line fits better than the
lines in parts f and g? Explain why or
why not.
E18. Even more pizza. Refer again to the table and
scatterplot in Display 3.31 on page 134.
a. By hand, compute the equation of the
least squares regression line for using fat
to predict calories. How close was your
estimate of the equation in E12?
b. Which of these values must be the SSE for
this regression? Explain your answer.
0 29.3 861.4 4307
E19. Heights of girls. Display 3.36 gives the
median height in inches for girls ages 2–14.
Display 3.36 Median height for girls ages 2–14.
a. Practice using your calculator by making
a scatterplot, finding the equation of
the least squares line for median height
versus age, and graphing the equation on
the plot.
b. Judging from the plot, is the residual
for 11-year-olds positive or negative?
Compute this residual to check your
answer.
c. Verify that the line contains the point of
averages, $(\bar{x}, \bar{y})$.
d. How does the regression line for girls
compare to the line for boys in E11?
E20. Sum of residuals. In this exercise, you will
show that the sum of the residuals is equal
to 0 if and only if the regression line passes
through the point of averages, $(\bar{x}, \bar{y})$.
a. Show that for a horizontal line the sum
of the residuals will be 0 if and only if
the line passes through the point of
averages.
b. Show that no matter what the slope of the
line is, the sum of the residuals will be 0
if and only if the line passes through the
point of averages.
c. Why isn't it good enough to define the
regression line as the line that makes the
sum of the residuals equal 0?
E21. Height versus age. Display 3.37 shows a
standard computer printout for the median
height versus age data of E11.
Display 3.37 Computer output of median height
versus age data.
a. Write the equation of the regression line.
How does it compare to your estimate of
the equation in E11?
b. What is the SSE for this least squares line?
Does its value seem reasonable given the
scatterplot in Display 3.30 on page 133?
E22. Part of a printout for the percentage of
alumni who give to their colleges versus the
student/faculty ratio is shown in Display
3.38. (These are the data in the scatterplot
shown in Display 3.24 on page 131.)
Display 3.38 Computer output: regression analysis
of percentage giving to alumni fund
versus student/faculty ratio.
a. What equation is given in the printout for
the least squares regression line?
b. Examine the table of unusual
observations. What is the student/faculty
ratio at the college with the largest
residual (in absolute value)? Find this
college in Display 3.24 on page 131.
c. Verify that the fit and the value of the
largest residual were computed correctly.
d. Locate the SSE on the printout. Why is
this value so large?
E23. For the least squares regression line you
found in E19, calculate the residuals for
girls ages 2, 8, and 14. What does this
suggest about the pattern of growth beyond
what is summarized in the equation of the
regression line?
E24. More about slope.
a. You and three friends, one right after
the other, each buy the same kind of
gas at the same pump. Then you make a
scatterplot of your data, with one point
per person, plotting the number of
gallons on the x-axis and the total price
paid on the y-axis. Will all four points lie
on the same line? Explain.
b. You and the same three friends each
drive 80 mi but at different average
speeds. Afterward, you plot your data
twice, first as a set of four points with
coordinates average speed, x, and elapsed
time, y, and then as a set of points with
coordinates average speed, x, and y*,
defined as $y^* = \frac{1}{\text{elapsed time}}$. Which plot will give
a straight line? Explain your reasoning.
Will the other plot be a curve opening up,
a curve opening down, or neither?
E25. The data set in Display 3.39 is the pizza data
of E12 augmented by other brands of cheese
pizza typically sold in supermarkets.
a. Plot calories versus fat. Does there appear
to be a linear association between calories
and fat? If so, fit a least squares line to the
data, and interpret the slope of the line.
b. Plot fat versus cost. Does there appear to
be a linear association between cost and
fat? If so, fit a least squares line to the
data, and interpret the slope of the line.
c. Plot calories versus cost. Does there
appear to be a linear association between
cost and calories?
d. Write a summary of your findings.
E26. Poverty. What variables are most closely
associated with poverty? Display 3.40
provides information on population
characteristics of the 50 U.S. states plus
the District of Columbia. Each variable
is measured as a percentage of the state's
population, as described here:
Percentage living in metropolitan areas
Percentage white
Percentage of adults who have graduated
from high school
Percentage of families with incomes
below the poverty line
Percentage of families headed by a single
parent
Construct scatterplots to determine which
variables are most strongly associated with
poverty.
Write a letter to your representative in
Congress about poverty in America, relying
only on what you find in these data. Point
out the variables that appear to be most
strongly associated with poverty and those
that appear to have little or no association
with poverty.
Display 3.39 Food values and cost per 5-oz serving
of pizza. [Source: Consumer Reports,
January 2002.]
Display 3.40 Characteristics of state populations, as percentage of population.
[Source: U.S. Census Bureau, www.census.gov.]
3.3 Correlation: The Strength of a Linear Trend
The correlation coefficient measures the amount of variation from the regression line.
Some of the linear relationships you've seen in this chapter have been extremely
strong, with points packed tightly around the regression line. Other linear
relationships have been quite weak: although there was a general linear trend, a
lot of variation was present in the values of y associated with a given value of x.
Still other linear relationships have been in between.
In this section, you'll learn how to measure the strength of a linear
relationship numerically by using the correlation coefficient, r (which, from this
point on, will be referred to simply as the correlation), where $-1 \le r \le 1$. Just as
in the last section, you'll start by working intuitively and visually and then move
on to a computational approach.
Estimating the Correlation
Examine the scatterplots and their correlations in Display 3.41. To get a rough
idea of the size of a correlation, it is helpful to sketch an ellipse around the cloud
of points in the scatterplot. If the ellipse has points scattered throughout and the
points appear to follow a linear trend, then the correlation is a reasonable measure
of the strength of the association. If the ellipse slants upward as you go from left to
right, the correlation is positive. If the ellipse slants downward as you go from left
to right, the correlation is negative. If the ellipse is fat, the correlation is weak and
the absolute value of r is close to 0. If the ellipse is skinny, the correlation is strong
and the absolute value of r is close to 1.
Display 3.41 Scatterplots with ellipses and their correlations.
[Source: George Cobb, Electronic Companion to Statistics (Cogito
Learning Media, Inc., 1997), p. 114.]
In Activity 3.3a, you will learn more about correlation and you'll practice
finding the value of r using your calculator.
Was Leonardo Correct?
What you'll need: a measuring tape, yardstick, or meterstick
Leonardo da Vinci was a scientist and an artist who combined these skills to
draft extensive instructions for other artists on how to proportion the human
body in painting and sculpture. Three of Leonardo's rules were
height is equal to the span of the outstretched arms
kneeling height is three-fourths of the standing height
the length of the hand is one-ninth of the height
1. Work with a partner to measure your height, kneeling height, arm span,
and hand length. Combine your data with the rest of your class.
2. Check Leonardo's three rules visually by plotting the data on three scatterplots.
3. For the plots that have a linear trend, use your calculator to find the
equation of the regression line and the value of r (the correlation). [See
Calculator Notes 3G and 3H.]
4. Interpret the slopes of the regression lines. Interpret the correlations.
5. Do the three relationships described by Leonardo appear to hold? Do they
hold strongly?
Estimating the Correlation
D9. Match each of the four scatterplots with its correlation, choosing from the
values −0.783, 0.783, 0.908, and 0.999.
a. year of birth versus year of hire (Display 3.1 on page 106)
b. age at layoffs versus year of hire (Display 3.2 on page 106)
c. median height of boys versus age (Display 3.30 on page 133)
d. calories in pizza versus fat (Display 3.31 on page 134)
D10. Four relationships are described here.
I. For a random sample of students from the senior class, x represents
the day of the month of the person's birthday and y represents the
cost of the person's most recent haircut.
II. For a random collection of U.S. coins, x represents the diameter and
y represents the circumference.
III. For a random sample of bags of white socks, x represents the
number of socks and y represents the price per bag.
IV. For a random sample of bags of white socks, x represents the
number of socks and y represents the price per sock.
a. Which of these relationships have a positive correlation, and which a
negative correlation? Which has the strongest relationship? The weakest?
b. For each of the four relationships, discuss the connection between your
ability to precisely predict a value of y for a given value of x and the
strength of the correlation.
A Formula for the Correlation, r
A formula for the correlation, r, follows. It looks impressive, but the basic idea is
simple: you convert x and y to standardized values (z-scores), and then find their
average product (dividing by n − 1).
$$r = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right) = \frac{1}{n-1} \sum z_x \cdot z_y$$
You can think of the correlation, r, as the average product of the z-scores.
In this formula, sx is the standard deviation of the x's, and sy is the standard
deviation of the y's. Remember that the z-score tells you how many standard
deviations the value lies above or below the mean.
Example: Computing r for the Airline Data
Compute the correlation for the relationship between the number of mishandled
bags per thousand passengers and the percentage of on-time arrivals for the
airline data in Display 3.25 on page 131.
Solution
When you are on a desert island or taking a test where you must compute r by hand,
it is easiest to organize your work as in Display 3.42. First, compute the average
values of x and y, $\bar{x}$ and $\bar{y}$. Then compute their standard deviations, $s_x$ and $s_y$.
Display 3.42 Calculations for the correlation for airline data.
Then convert each value of x and y to standard units, or z-scores. The correlation,
r, is the average product of the z-scores: the sum of the products in the last
column divided by n − 1.
[You can use a calculator to find the value of r. See Calculator Note 3H.]
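The hand computation described above can also be sketched in a few lines of Python. This is a minimal illustration rather than the calculator procedure the book refers to. The data values are hypothetical (they are not the airline data from Display 3.25), and the function name `correlation` is chosen here only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Return r as the average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # One product z_x * z_y per data point, like the last column of a hand table
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, used only to exercise the formula
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 3))
```

Each entry of `products` plays the role of one row in the last column of a table organized like Display 3.42.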
A way to visualize the computations in the example is to look at the four
quadrants formed on the scatterplot by dividing it vertically at the mean value of
x and horizontally at the mean value of y. Such a plot is shown in Display 3.43.
Points in Quadrant I, such as the point for US Airways, have positive z-scores for
both x and y, so their product contributes a positive amount to the calculation
of the correlation. Points in Quadrant III, such as the point for Northwest, have
negative z-scores for both x and y, so their product also contributes a positive
amount to the calculation of the correlation. Points in Quadrants II and IV, such
as America West and Delta, contribute negative amounts to the calculation of the
correlation because one z-score is positive and the other is negative.
The points in Quadrants II and IV have negative products $z_x \cdot z_y$.
The correlation, r, is a quantity without units.
Display 3.43 Scatterplot divided into quadrants at the point of averages, $(\bar{x}, \bar{y})$.
Because r is the average of the products $z_x \cdot z_y$ and z-scores have no units,
r has no units. In fact, r does not depend on the units of measurement in the
original data. In Activity 3.3a, if you measure arm spans and heights in inches and
your friend measures them in centimeters, you will both compute the same value
for the correlation.
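A short Python sketch can illustrate this claim about units. The arm span and height values below are hypothetical, not data from Activity 3.3a, and `correlation` is an illustrative helper that computes the average product of z-scores.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical measurements in inches
arm_in = [60.0, 63.5, 66.0, 68.5, 72.0]
height_in = [61.0, 64.0, 65.5, 69.0, 71.5]

# The same measurements converted to centimeters (1 in. = 2.54 cm)
arm_cm = [2.54 * v for v in arm_in]
height_cm = [2.54 * v for v in height_in]

r_in = correlation(arm_in, height_in)
r_cm = correlation(arm_cm, height_cm)
print(r_in, r_cm)  # the two values agree up to floating-point rounding
```

Changing units multiplies each deviation from the mean and the standard deviation by the same factor, so every z-score, and therefore r, is unchanged.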
A Formula for the Correlation, r
D11. Look at Display 3.42, showing the calculations for the correlation, r.
a. Confirm the calculations for the first row, America West.
b. Which point makes the largest contribution to the correlation? Where is
this point on the scatterplot?
c. Which point makes the smallest contribution to the correlation? Where
is this point on the scatterplot?
D12. Understanding r.
a. Explain in your own words what the correlation measures.
b. Explain in your own words why r has no units.
c. When computing the correlation between two variables, does it matter
which variable you select as y and which you select as x? Explain.
D13. Refer to Display 3.43.
a. What can you say about r if there are many points in Quadrants I and III
and few in Quadrants II and IV?
b. What can you say about r if there are many points in Quadrants II and
IV and few in Quadrants I and III?
c. What can you say about r if points are scattered randomly in all four
quadrants?
Correlation and the Appropriateness
of a Linear Model
It is tempting to believe that a high correlation (either positive or negative) is
evidence that a linear model is appropriate for your data. Alas, the real world
is not so simple. For example, Display 3.44 shows the number of blogs
(Web-based periodic postings of a person's thoughts) for the first few years
after 2003, when blogging's popularity took off. The growth is exponential, yet
the (linear) correlation is very strong, r = 0.91. The points do cluster fairly
tightly about the linear regression line, but that line is not the best model for
the data. The plot on the right shows the points and the graph of the best-fitting
exponential equation, $y = 0.353 \cdot 1.140^x$. (You will learn how to fit exponential
equations to data in Section 3.5.)
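The point that a curved pattern can still produce a large r can be checked with a small Python sketch. It does not use the actual blog counts. As an assumption, it takes values lying exactly on the fitted curve y = 0.353 · 1.140^x for months 0 through 30 (the range of months is also an assumption).

```python
from math import log
from statistics import mean, stdev

def correlation(xs, ys):
    """Average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Assumed range of months; values lie exactly on the exponential curve
months = list(range(0, 31))
blogs = [0.353 * 1.140 ** m for m in months]

r_linear = correlation(months, blogs)                 # strong, yet the pattern is curved
r_log = correlation(months, [log(b) for b in blogs])  # log(y) is exactly linear in x
print(round(r_linear, 2), round(r_log, 2))
```

Even with perfectly exponential values, the linear correlation is high, so a large r by itself cannot show that a line is the right model.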
Display 3.44 The number of blogs, in millions, versus the
number of months after March 2003. [Source: "State of
the Blogosphere, October 2005, Part 1: Blogosphere Growth," posted
by Dave Sifry, October 17, 2005. Technorati News, www.technorati.com.
Table numbers estimated from graph.]
The moral: Always
plot your data before
computing summary
statistics!
The quiz scores for 22 students in Display 3.45, on the other hand, have a
correlation, r, of only 0.48. There is quite a bit of scatter, partly because the quizzes
covered very different topics. Quiz 2 covered exponential growth, and Quiz 3
covered probability. In spite of the scatter, a line is the most appropriate model
because there is no curvature in the pattern of data points.
Display 3.45 Scores on two consecutive 30-point quizzes.
Correlation and the Appropriateness of a Linear Model
D14. When the correlation is small in absolute value, what does it mean for the
prediction error? Why would anyone want to fit a line to data in a case in
which the correlation is small, as in Display 3.45 (quiz scores)?
D15. Provide a real-life scenario involving two variables for each situation.
Assume r is positive in each case.
a. r is small and you do not want to fit a line.
b. r is small and you do want to fit a line.
c. r is large and you do not want to fit a line.
d. r is large and you do want to fit a line.
D16. It's common in situations similar to the growth in blogs for the numbers
to increase exponentially for the first few years. What do you think would
happen if the table in Display 3.44 were continued to include numbers for
months up to the current year?
The Relationship Between the Correlation
and the Slope
By now you might have observed that the slope of the regression line, b1, and the
correlation, r, always have the same sign. But these have a more specific relationship.
The slope varies
directly as the
correlation.
Finding the Slope from the Correlation and the SDs
The slope of a least squares regression line, $b_1$, and the correlation, r, are related
by the equation
\[ b_1 = r \cdot \frac{s_y}{s_x} \]
where $s_x$ is the standard deviation of the x's and $s_y$ is the standard deviation of
the y's. This means that if you standardize the data so that $s_x = 1$ and $s_y = 1$,
then the slope of the regression line is equal to the correlation.
Example: Critical Reading and Math SAT Scores
In 2005, the mean critical reading score for all SAT I test takers was 508, with a
standard deviation of 113. For math scores, the mean was 520, with a standard
deviation of 115. The correlation between the two scores was not given but is
known to be quite high. If you can estimate this correlation as, say, 0.7, you can
find the equation of the regression line and use it to estimate the math score from
a student's critical reading score. [Source: The College Board, 2005 College Bound Seniors:
A Profile of SAT Program Test Takers.]
Solution
The formula gives an estimate of the slope of
\[ b_1 = r \cdot \frac{s_y}{s_x} = 0.7 \cdot \frac{115}{113} \approx 0.71 \]
To find the y-intercept, use the fact that the point $(\bar{x}, \bar{y}) = (508, 520)$ is on the
regression line:
y = slope · x + y-intercept
520 = 0.71(508) + y-intercept
y-intercept = 159.32
The equation is $\hat{y} = 0.71x + 159.32$.
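The arithmetic in this example can be retraced in Python. The sketch below uses only the summary values quoted in the example, including the estimated correlation of 0.7. The variable names and the critical reading score of 600 used at the end are illustrative choices.

```python
# Summary values from the SAT example (r = 0.7 is the example's own estimate)
r = 0.7
mean_reading, sd_reading = 508, 113   # x: critical reading
mean_math, sd_math = 520, 115         # y: math

# Slope from the correlation and the standard deviations: b1 = r * s_y / s_x
slope = r * sd_math / sd_reading
slope_rounded = round(slope, 2)       # the example rounds the slope to 0.71

# The regression line passes through the point of averages (508, 520)
intercept = mean_math - slope_rounded * mean_reading

def predict_math(reading_score):
    """Predicted math score for a given critical reading score."""
    return slope_rounded * reading_score + intercept

print(slope_rounded, round(intercept, 2))
print(round(predict_math(600), 1))    # illustrative reading score of 600
```

Rounding the slope before computing the intercept matches the hand calculation. Carrying the unrounded slope through would change the intercept slightly.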
The Relationship Between the Correlation and the Slope
D17. Find the equation of the regression line for predicting an SAT I critical
reading score given the student's SAT I math score.
Correlation Does Not Imply Causation
In a sample of elementary school students, there is a strong positive relationship
between shoe size and scores on a standardized test of ability to do arithmetic.
Does this mean that studying arithmetic makes your feet bigger? No. Shoe size
and skill at arithmetic are related to each other because both increase as a child
gets older. Age is an example of a lurking variable.
Beware the Lurking Variable
A lurking variable is a variable that you didn't include in your analysis but
that might explain the relationship between the variables you did include.
That is, when variables x and y are correlated, it might be because both are
consequences of a third variable, z, that is lurking in the background.
Two variables might
be highly correlated
without one causing
the other.
Even if you can't identify a lurking variable, you should be careful to avoid
jumping to a conclusion about cause and effect when you observe a strong
relationship. The value of r does not tell you anything about why two variables
are related. The statement "Correlation does not imply causation" can help you
remember this. To conclude that one thing causes another, you need data from a
randomized experiment, as you'll learn in the next chapter.
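The shoe size example can be imitated with a small simulation. Everything in the sketch below is hypothetical: the age range, the made-up formulas tying shoe size and test score to age, and the amount of random noise are assumptions chosen only to show how a lurking variable can create a strong correlation between two variables that do not affect each other.

```python
import random
from statistics import mean, stdev

def correlation(xs, ys):
    """Average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical model: both variables depend on age, not on each other
ages = [rng.uniform(6, 12) for _ in range(200)]
shoe = [a + rng.gauss(0, 0.5) for a in ages]       # shoe size grows with age
score = [10 * a + rng.gauss(0, 5) for a in ages]   # test score grows with age

r_raw = correlation(shoe, score)

# Remove the age effect (known here because the data are simulated)
shoe_adj = [s - a for s, a in zip(shoe, ages)]
score_adj = [t - 10 * a for t, a in zip(score, ages)]
r_adjusted = correlation(shoe_adj, score_adj)

print(round(r_raw, 2), round(r_adjusted, 2))
```

In this simulation the raw correlation is strong, while the correlation that remains after removing the age effect is close to zero. With real data the age effect would have to be estimated rather than known.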
Correlation Does Not Imply Causation
D18. Display 3.43 on page 143 shows a negative association between the
percentage of on-time arrivals and the number of mishandled bags per
thousand passengers. Discuss whether you think one of these variables
might cause the other, or whether a lurking variable might account for both.
D19. For the sample of 50 top-rated universities in E5 on page 113, there's a
very strong positive relationship between acceptance rate (percentage of
applicants who are offered admission) and SAT scores (the 75th percentile
for an entering class). Explain why these two variables have such a strong
relationship. Does one cause the other? If not, how might you account for
the strong relationship?
D20. People who argue about politics and public policy often point to relationships
between quantitative variables and then offer a cause-and-effect explanation
to support their points of view. For each of these relationships, first give a
possible explanation by assuming a causal relationship and then give
another possible explanation based on a lurking variable.
a. Faculty positions in academic subjects with a higher percentage of
male faculty tend to pay higher salaries. (For example, engineering and
geology have high percentages of male faculty and high average salaries;
journalism, music, and social work have much lower percentages of male
faculty and much lower academic salaries.)
b. States with larger reported numbers of hate groups tend to have more
people on death row. (Here, also tell how you could adjust for the lurking
variable to uncover a more informative relationship.)
c. States with higher reported rates of gun ownership tend to have lower
reported rates of violent crime.
Interpreting $r^2$
You might have noticed that computer outputs for regression analysis, like that in
Display 3.18 on page 127, give the value of R-squared, or $r^2$, rather than the value
of r. The student in this discussion will show you how to think about $r^2$ as the
fraction of the variation in the values of y that you can eliminate by taking x into
account.
Alexis: I've invented another way to measure the strength of the
relationship between x and y. It's based on the idea that the
less variation there is from the linear trend, the stronger the
correlation.
Statistician: How does it work?
Alexis: Let me ask the questions for a change. What is the best way to
predict the values of y?
Statistician: I'd fit a least squares line and use the equation $\hat{y} = b_0 + b_1 x$.
Alexis: And how would you measure your total error?
Statistician: Well, I'd use the sum of the squares of the residuals, just like we
did in the last section:
\[ SSE = \sum (y - \hat{y})^2 \]
Alexis: That's what I hoped you would say.
Statistician: We consultants try to be helpful.
Alexis: Don't get cocky. I'm about to change the rules. Pretend that you
can't use the information about x. You don't have the x-values, and
you want a single fitted value for y. What value would you choose?
Statistician: I'd use $\bar{y}$.
Alexis: And could you again use the sum of the squared errors to measure
your total error?
Statistician: Sure, it's almost like the standard deviation. I'd find the sum of the
squares of the deviations from the mean, which I'll call the total
sum of squared error, SST:
\[ SST = \sum (y - \bar{y})^2 \]
Alexis: Okay, now for my new way to measure the strength of the
relationship. If you have a strong relationship, the SSE will be
small compared to the SST, right?
Statistician: Right.
Alexis: On the other hand, if you have a weak relationship, x isn't much
use for predicting y, and the SSE will be almost as big as the SST.
Right?
Statistician: I see where you're going with this. You can use a ratio to measure
the strength of the relationship.
Alexis: Exactly. Except now I have a problem. My ratio is near 0 when the
relationship is strong, and it's near 1 when the relationship is weak.
That's backward! Oh, I see how to fix it. Just subtract my ratio
from 1.
\[ 1 - \frac{SSE}{SST} = \frac{SST - SSE}{SST} \]
Statistician: Good. Your new ratio is near 1 when the relationship is strong and
near 0 when the relationship is weak. Your old ratio, SSE/SST, gave
the proportion of error still there after the regression, so your new
ratio . . .
Alexis: I can handle it from here. SST is the total error I started with.
SST minus SSE is the amount of error I get rid of by using the
relationship of y with x. So my new ratio is the proportion of error
I eliminate by using the regression.
Statistician: Right!
Alexis: But now I have two measures of strength: the correlation, r, and
my new ratio. Which one should I use?
Statistician: Lucky for you, with a little algebra, they turn out to be equivalent.
\[ r^2 = \frac{SST - SSE}{SST} \]
Alexis: Cool!
Statistician: We statisticians call $r^2$ the coefficient of determination. It tells us
the proportion of variation in the y's that is explained by x.
Alexis: I like it. Anything's better than those z-scores!
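The equivalence mentioned in the dialogue can be checked numerically. The Python sketch below uses a small hypothetical data set. It fits the least squares line, computes SSE and SST, and compares 1 − SSE/SST with the square of the correlation. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, used only to check the identity
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares line: y-hat = b0 + b1 * x
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Error using the line (SSE) and error using only the mean of y (SST)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
sst = s_yy

r_squared = 1 - sse / sst        # proportion of error eliminated by the regression
r = s_xy / sqrt(s_xx * s_yy)     # correlation, computed directly

print(round(r_squared, 3), round(r ** 2, 3))
```

For these values, the regression removes 60% of the squared error that remains when only the mean of y is used for prediction.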
Predicting Pizzas
Display 3.46 compares two sets of predictions of the calorie content in the seven
kinds of pizza from E12 on page 134.
Display 3.46 Two ways of predicting calorie content from fat
content for seven kinds of pizza.
If you had to pick a single number as your predicted calorie content, you
might choose the mean, 307.14 calories per serving. The fourth column in the
table and the plot in Display 3.47 show the resulting errors.
Display 3.47 Squared deviations around the mean of y.
If you use the regression equation, calories = 112 + 14.9 · fat, to predict
calories, the resulting errors are much smaller in most cases and are given in the
second-to-last column of Display 3.46 and shown on the plot in Display 3.48.
Display 3.48 Squared deviations around the least squares line.
The coefficient of determination tells the proportion of the total
variation in y that can be explained by x.
Using Alexis's formula,
\[ r^2 = \frac{SST - SSE}{SST} \approx 0.82 \]
About 82% of the variation in calories among these brands of pizza can be
attributed to fat content.
Taking the square root gives r = 0.908.
Interpreting $r^2$
D21. The scatterplot in Display 3.49 shows IQ plotted against head circumference,
in centimeters, for a sample of 20 people. The mean IQ was 101, and the
mean head circumference was 56.125 cm. The correlation is 0.138.
Display 3.49 [Source: M. J. Tramo, W. C. Loftus, R. L. Green, T. A. Stukel, J. B.
Weaver, and M. S. Gazzaniga, "Brain Size, Head Size, and IQ in
Monozygotic Twins," Neurology 50 (1998): 1246–52.]
a. If you knew nothing about any possible relationship between head
circumference and IQ, what IQ would you predict for a person with a
head circumference of 54 cm?
b. The regression equation is IQ = 0.997 · head circumference + 45. By
about how much does this equation predict IQ will change with a
1-cm increase in head circumference?
c. What IQ does this equation predict for a person with a head
circumference of 54 cm? How much faith do you have in this prediction?
d. How much of the variability in IQ is accounted for by the regression?
Does the regression equation help you predict IQ in any practical sense?
e. If more people were added to the plot, how do you think the regression
equation and correlation would change?
D22. Use Alexis's formula for $r^2$ to explain why
D23. In a study of the effect of temperature on household heating bills, an
investigator said, "Our research shows that about 70% of the variability in
the number of heating units used by a particular house over the years can be
explained by outside temperature." Explain what the investigator might have
meant by this statement.
Regression Toward the Mean
The regression line is a
line of means.
Regression toward the
mean is another term
for regression effect.
Display 3.50 shows a hypothetical data set with the height of younger sisters
plotted against the height of their older sisters. There is a moderate positive
association: r = 0.337. For both younger and older sisters, the mean height is
65 in. and the standard deviation is 2.5 in. The line drawn on the first plot,
y = x, indicates the location of points representing the same height for both
sisters. If you rotate your book and sight down the line, you can see that the
points are scattered symmetrically about it.
In the second plot, look at the vertical strip for the older sisters with heights
between 62 in. and 63 in. The X is at the mean height of the younger sisters with
older sisters in this height range. It falls at about 64 in., not between 62 in. and
63 in. as you would expect. Looking at the vertical strip on the right, the mean
height of younger sisters with older sisters between 68 in. and 69 in. is only
about 66 in. If you were to use the line y = x to predict the height of the younger
sister, you would tend to predict a height that is too small if the older sister is
shorter than average and a height that is too large if the older sister is taller than
average.
The flatter line through the third scatterplot in Display 3.50 is the least
squares regression line. Notice that this line gets as close as it can to the center
of each vertical strip. Thus, the least squares line is sometimes called the line of
means. The predicted value of y at a given value of x, using the regression line
as the model, is the estimated mean of all responses that can be produced at that
particular value of x.
Notice that the regression line has a smaller slope than the major axis ( y = x )
of the ellipse. This means that the predicted values are closer to the mean than
you might expect, which will always be the case for positively correlated data
following a linear trend. The difference between these two lines is sometimes
called the regression effect. If the correlation is near +1 or −1, the two lines will
be nearly on top of each other and the regression effect will be minimal. For a
moderate correlation such as that for the sisters' heights, the regression effect
will be quite large.
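The size of the regression effect for the sisters' heights can be worked out from the summary values alone. The Python sketch below uses the stated mean (65 in.), standard deviation (2.5 in.), and correlation (0.337). The midpoints 62.5 in. and 68.5 in. are chosen here to stand for the two vertical strips, and the function name is illustrative.

```python
# Summary values stated for both older and younger sisters
r = 0.337
mean_height = 65.0   # inches
sd_height = 2.5      # inches

# Least squares slope: b1 = r * s_y / s_x (equal SDs, so the slope equals r)
slope = r * sd_height / sd_height
intercept = mean_height - slope * mean_height

def predicted_younger(older_height):
    """Predicted height of the younger sister from the regression line."""
    return intercept + slope * older_height

# Midpoints of the two vertical strips (an illustrative choice)
for older in (62.5, 68.5):
    by_regression = predicted_younger(older)
    by_y_equals_x = older  # what the line y = x would predict
    print(older, round(by_regression, 1), by_y_equals_x)
```

The regression predictions, about 64.2 in. and 66.2 in., sit closer to the mean of 65 in. than the older sisters' heights do, which agrees with the strip means of about 64 in. and 66 in. described for Display 3.50.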
The regression effect was first noticed by British scientist Francis Galton
around 1877. Galton noticed that the largest sweet-pea seeds tended to produce
daughter seeds that were large but smaller than their parent. The smallest
sweet-pea seeds tended to produce daughter seeds that were also small but larger
than their parent. There was, in Galton's words, a "regression toward the mean."
This is the origin of the term regression line. [Source: D. W. Forrest, Francis Galton: The Life and
Work of a Victorian Genius (Taplinger, 1974).]
The regression effect is with us in everyday life whenever some element of
chance is involved in a person's score. For example, athletes are said to experience
a "sophomore slump." That is, athletes who have the best rookie seasons do not
tend to be the same athletes who have the best second year. The top students on
the second exam in your class probably did not do as well, relative to the rest of
the class, on the first exam. The children of extremely tall or short parents do not
tend to be as extreme in height as their parents. There does, indeed, seem to be a
phenomenon at work that pulls us back toward the average. As Galton noticed,
this prevents the spread in human height, for example, from increasing. Look for
this effect as you work on regression analyses of data.
Display 3.50 Scatterplots showing the regression effect.
Regression Toward the Mean
D24. Why is the regression line sometimes called the line of means?
D25. The equation of the regression line for the scatterplot in Display 3.50
is y = 43.102 + 0.337x. Interpret the slope of this line in the context
of the situation and compare it to the interpretation of the slope of the
line y = x.
Summary 3.3: Correlation
In your study of normal distributions in Chapter 2, you used the mean to tell
the center and then used the standard deviation as the overall measure of how
much the values deviated from that center. For well-behaved quantitative
relationships (that is, those whose scatterplots look elliptical), you use the
regression line as the center and then measure the overall amount of variation
from the line using the correlation, r. You can think of the correlation, r, as the
average product of the z-scores.
Geometrically, the correlation measures how tightly packed the points of the
scatterplot are about the regression line.
The correlation has no units and ranges from −1 to +1. It is unchanged if you
interchange x and y or if you make a linear change of scale in x or y, such as
from feet to inches or from pounds to kilograms.
In assessing correlation, begin by making a scatterplot and then follow these
steps:
1. Shape: Is the plot linear, shaped roughly like an elliptical cloud, rather
than curved, fan-shaped, or formed of separate clusters? If so, draw
an ellipse to enclose the cloud of points. The data should be spread
throughout the ellipse; otherwise, the pattern might not be linear or
might have unusual features that require special handling. You should
not calculate the correlation for patterns that are not linear.
2. Trend: If your ellipse tilts upward to the right, the correlation is
positive; if it tilts downward to the right, the correlation is negative. The
relationship between the correlation and the slope, $b_1$, of the regression
line is given by
\[ b_1 = r \cdot \frac{s_y}{s_x} \]
3. Strength: If your ellipse is almost a circle or is horizontal, the relationship
is weak and the correlation is near zero. If your ellipse is so thin that it
looks like a line, the relationship is very strong and the correlation is
near +1 or −1.
Correlation is not the same as causation. Two variables may be highly
correlated without one having any causal relationship with the other. The
value of r tells nothing about why x and y are related. In particular, a strong
relationship between x and y might be due to a lurking variable.
You can interpret the value $r^2$ as the proportion of the total variation in y that
can be accounted for by using x in the prediction model:
\[ r^2 = \frac{SST - SSE}{SST} \]
The regression effect (or regression toward the mean) is the tendency of
y-values to be closer to their mean than you might expect. That is, the regression
line is flatter than the major axis of the ellipse surrounding the data.
Practice
Estimating the Correlation
P9. By comparing to the plots in Display 3.41 on
page 140, match each of the five scatterplots
in Display 3.51 with its correlation, choosing
from −0.95, −0.5, 0, 0.5, and 0.95.
Display 3.51 Five scatterplots.
P10. The table in E12 (Display 3.31 on page
134) gives the amount of fat and number of
calories in various pizzas.
a. Guess a value for the correlation, r.
b. Calculate r using your calculator.
A Formula for the Correlation, r
P11. Eight artificial data sets are shown here.
For each one, find the value of r, without
computing if possible. Drawing a quick
sketch might be helpful.
a. b. c. d. e. f. g. h.
P12. The table in E12 (Display 3.31 on page 134)
gives the amount of fat and number of calories
in various pizzas. In P10, you used your
calculator to find the correlation, r. This
time, make a table like that in Display 3.42 on
page 142, and use the formula to find r. What
do you notice about the products $z_x \cdot z_y$?
P13. The scatterplot in Display 3.52 is divided into
quadrants by vertical and horizontal lines that
pass through the point of averages, $(\bar{x}, \bar{y})$.
Display 3.52 Scatterplot divided into quadrants at
the point of averages, $(\bar{x}, \bar{y})$.
a. Is the correlation positive or negative?
b. Give the coordinates of the point that will
contribute the most to the correlation, r.
c. Consider the product $z_x \cdot z_y$.
Where are the points that have a positive
product? How many of the 30 points have
a positive product?
d. Where are the points that have a negative
product? How many of the 30 points have
a negative product?
Correlation and the Appropriateness
of a Linear Model
P14. Both plots in Display 3.53 have a correlation
of 0.26. For each plot, is fitting a regression
line (as shown on the plot) an appropriate
thing to do? Why or why not?
Display 3.53 Two scatterplots with the same
correlation.
The Relationship Between the Correlation
and the Slope
P15. Imagine a scatterplot of two sets of exam
scores for students in a statistics class. The
score for a student on Exam 1 is graphed
on the x-axis, and his or her score on Exam
2 is graphed on the y-axis. The slope of the
regression line is 0.368. The mean of the
Exam 1 scores is 72.99, and the standard
deviation is 12.37. The mean of the Exam 2
scores is 75.80, and the standard deviation
is 7.00.
a. Find the correlation of these scores.
b. Find the equation of the regression line for
predicting an Exam 2 score from an Exam
1 score. Predict the Exam 2 score for a
student who got a score of 80 on Exam 1.
c. Find the equation of the regression line
for predicting an Exam 1 score from an
Exam 2 score.
d. Sketch a scatterplot that could represent
the situation described.
Correlation Does Not Imply Causation
P16. If you take a random sample of U.S. cities
and measure the number of fast-food
franchises in each city and the number of
cases of stomach cancer per year in the city,
you find a high correlation.
a. What is the lurking variable?
b. How would you adjust the data for the
lurking variable to get a more meaningful
comparison?
P17. If you take a random sample of public school
students in grades K–12 and measure weekly
allowance and size of vocabulary, you will
find a strong relationship. Explain in terms
of a lurking variable why you should not
conclude that raising a student's allowance
will tend to increase his or her vocabulary.
P18. For the countries of the United Nations,
there is a strong negative relationship
between the number of TV sets per
thousand people and the birthrate. What
would be a careless conclusion about
cause and effect? What is the lurking
variable?
Interpreting $r^2$
P19. Data on the association between high
school graduation rates and the percentage
of families living in poverty for the 50 U.S.
states were presented in E26. Display 3.54
contains the scatterplot and a standard
computer output of the regression analysis.
Display 3.54 Poverty rates versus high school graduation rates.
a. Under "SOURCE," the "Total" variation
is the SST, and the "Error" variation is the
SSE. From this information, find r, the
correlation.
b. Write an interpretation for $r^2$ in the
context of these data.
c. Does the presence of a linear relationship
here imply that a state that raises its
graduation rate will cause its poverty rate
to go down? Explain your reasoning.
d. What are the units for each of the values
x, y, b1, and r?
Regression Toward the Mean
P20. The plot in Display 3.55 shows the heights
of older sisters plotted against the heights
of their younger sisters. On a copy of this
scatterplot, draw vertical lines to divide the
points into six groups. Mark the approximate
location of the mean of the y-values of each
vertical strip. Sketch the regression line,
y = 43 + 0.337x. Note that the regression
line comes as close as possible to the mean
of each vertical strip. Now draw an ellipse
around the data and connect the two ends
of the ellipse. Is the regression line flatter
than this line? Does this plot show the
regression effect?
P21. Display 3.56 shows the first two exam
scores for 29 college students enrolled in an
introductory statistics course. Do you see
any evidence of regression to the mean? If so,
explain the nature of the evidence.
Display 3.55 The heights of older sisters versus the
heights of their younger sisters.
Display 3.56 Exam scores.
Exercises
E27. Each scatterplot in Display 3.57 was made on
the same set of axes. Match each scatterplot
with its correlation, choosing from 0.06,
0.25, 0.40, 0.52, 0.66, 0.74, 0.85, and 0.90.
Display 3.57 Eight scatterplots with various correlations.
E28. Estimate the correlation between the
variables in these scatterplots.
a. The proportion of the state population
living in dorms versus the proportion
living in cities in Display 3.4 on page 109.
b. The graduation rate versus the 75th
percentile of SAT scores in E5 on
page 113.
c. The college graduation rate versus the
percentage of students in the top 10%
of their high school graduating class
in E5 on page 113.
E29. For each set of pairs, ( x, y ), compute the
correlation by hand, standardizing and
finding the average product.
a. ( 2, 1 ), ( 1, 1 ), ( 0, 0 ), ( 1, 1 ), ( 2, 1 )
b. ( 2, 2 ), ( 0, 2 ), ( 0, 3 ), ( 0, 4 ), ( 2, 4 )
E30. For each artificial data set in P11 on
page 155, compute the correlation by hand,
standardizing and finding the average
product.
E31. The scatterplot in Display 3.58 shows part
of the hat size data of E6 on page 113. The
plot is divided into quadrants by vertical and
horizontal lines that pass through the point
of averages, $(\bar{x}, \bar{y})$.
Display 3.58 Head circumference, in inches, versus hat size.
a. Estimate the value of the correlation.
b. Using the idea of standardized scores,
explain why the correlation is positive.
c. Identify the point that contributes the
most to the correlation. Explain why the
contribution it makes is large.
d. Identify a point that contributes little
to the correlation. Explain why the
contribution it makes is small.
E32. The ellipses in Display 3.59 represent
scatterplots that have a basic elliptical shape.
Display 3.59 Three pairs of elliptical scatterplots.
a. Match these conditions with the
corresponding pair of ellipses.
I. One […] is larger than the other, the […]
are equal, and the correlations are
strong.
II. One of the correlations is stronger
than the other, the […] are equal, and
the […] are equal.
III. One […] is larger than the other, the […]
are equal, and the correlations are
weak.
b. Draw a pair of elliptical scatterplots to
illustrate each comparison.
i. One […] is larger than the other, the […]
are equal, and the correlations are
weak.
ii. One […] is larger than the other, the […]
are equal, and the correlations are
strong.
E33. Several biology students are working
together to calculate the correlation for the
relationship between air temperature and
how fast a cricket chirps. They all use the
same crickets and temperatures, but some
measure temperature in degrees Celsius and
others measure it in degrees Fahrenheit.
Some measure chirps per second, and others
measure chirps per minute. Some use x
for temperature and y for chirp rate, while
others have it the other way around.
a. Will all the students get the same value
for the slope of the least squares line?
Explain why or why not.
b. Will they all get the same value for the
correlation? Explain why or why not.
E34. For the sample of top-rated universities
in E5 on page 113, the graduation rate has
mean 82.7% and standard deviation 8.3%.
The student/faculty ratio has mean 11.7
and standard deviation 4.3. The correlation
is 0.5.
a. Find the equation of the least squares
line for predicting graduation rate from
student/faculty ratio.
b. Find the equation of the least squares line
for predicting student/faculty ratio from
graduation rate.
E35. These questions concern the relationship
between the correlation, r, and the slope, $b_1$,
of the regression line.
a. If y is more variable than x, will the slope
of the least squares line be greater (in
absolute value) than the correlation?
Justify your answer.
b. For a list of pairs (x, y), r = 0.8, $b_1$ = 1.6,
and the standard deviations of x and y are
25 and 50. (Not necessarily in that order.)
Which is the standard deviation for x?
Justify your answer.
c. Students in a statistics class estimated and
then measured their head circumferences
in inches. The actual circumferences had
SD 0.93, and the estimates had SD 4.12.
The equation of the least squares line for
predicting estimated values from actual
values was $\hat{y} = 11.97 + 0.36x$. What was
the correlation?
d. What would be the slope of the least
squares line for predicting actual head
circumferences from the estimated
values?
E36. Lost final exam. After teaching the same
history course for about a hundred years, an
instructor has found that the correlation, r,
between the students' total number of points
before the final examination and the number
of points scored on their final examination
is 0.8. The pre-final-exam point totals for all
students in this year's course have mean 280
and SD 30. The points on the final exam have
mean 75 and SD 8. The instructor's dog ate
Julie's final exam, but the instructor knows
that her total number of points before the
exam was 300. He decides to predict her final
exam score from her pre-final-exam total.
What value will he get?
E37. Lurking variables. For each scenario, state
a careless conclusion assuming cause and
effect, and then identify a possible lurking
variable.
a. For a large sample of different animal
species, there is a strong positive
correlation between average brain
weight and average life span.
b. Over the last 30 years, there has been
a strong positive correlation between
the average price of a cheeseburger and
the average tuition at private liberal arts
colleges.
c. Over the last decade, there has been
a strong positive correlation between
the price of an average share of stock,
as measured by the S&P 500, and the
number of Web sites on the Internet.
E38. Manufacturers of low-fat foods often
increase the salt content in order to keep
the flavor acceptable to consumers. For a
sample of different kinds and brands of
cheeses, Consumer Reports measured several
variables, including calorie content, fat
content, saturated fat content, and sodium
content. Using these four variables, you can
form six pairs of variables, so there are six
different correlations. These correlations
turned out to be either about 0.95 or
about -0.5.
a. List all six pairs of variables, and for each
pair decide from the context whether the
correlation is close to 0.95 or to -0.5.
b. State a careless conclusion based on
taking the negative correlations as
evidence of cause and effect.
c. Explain the negative correlation using the
idea of a lurking variable.
E39. A study to determine whether ice cream
consumption depends on the outside
temperature gave the results shown in
Display 3.60.
Display 3.60 Data table, scatterplot, and regression analysis for the effects of outside
temperature on ice cream consumption. [Source: Koteswara Rao Kadiyala, "Testing for
the Independence of Regression Disturbances," Econometrica 38 (1970): 97–117.]
a. Use the values of SST and SSE in the
regression analysis to compute r, the
correlation for the relationship between
the temperature in degrees Fahrenheit
and the number of pints of ice cream
consumed per person. Check your
answer against R-sq in the analysis.
b. Compute the value of the residual that is
largest in absolute value.
c. Is there a cause-and-effect relationship
between the two variables?
d. What are the units for each of x, y, $b_1$,
and r?
e. The letters MS stand for mean square.
How do you think the MS is computed?
E40. The scatterplot in Display 3.61 shows part of
the aircraft data of Display 3.12 on page 115.
For these data, $r^2 = 0.83$. Should $r^2$ be used
as a statistical measure for these data? If so,
interpret this value of $r^2$ in the context of
the data. If not, explain why not.
E41. Suppose a teacher always praises students
who score exceptionally well on a test
and always scolds students who score
exceptionally poorly. Use the notion of
regression toward the mean to explain
why the results will tend to suggest the
false conclusion that scolding leads to
improvement whereas praise leads to
slacking off.
E42. A few years ago, a school in New Jersey
tested all its 4th graders to select students for
a program for the gifted. Two years later, the
students were retested, and the school was
shocked to find that the scores of the gifted
students had dropped, whereas the scores of
the other students had remained, on average,
the same. What is a likely explanation for
this disappointing development?
Display 3.61 Scatterplot of number of seats versus
fuel consumption (gal/h) for passenger
aircraft.
3.4 Diagnostics: Looking for Features That the Summaries Miss
As you learned in Chapter 2, summaries simplify. They are useful because they
omit detail in order to emphasize a few general features. This quality also makes
summaries potentially misleading, because sometimes the detail that is ignored
has an important message to convey. Knowing just the mean and the standard
deviation of a distribution doesn't tell you if there are any outliers or whether the
distribution is skewed. The same is true of the regression line and the correlation.
This section is about diagnostics: tools for looking beyond the summaries
to see how well they describe the data and what features they leave out. The first
part of this section deals with individual cases that stand apart from the overall
pattern and with how these cases influence the regression line and the correlation.
The second part shows you how to identify systematic patterns that involve many
or all of the cases, that is, the shape of the scatterplot.
Which Points Have the Influence?
Just as among people,
some data points have
more influence than
others.
Not all data points are created equal. You saw in the calculation of the correlation
in Display 3.42 on page 142 that some points make large contributions and some
small. Some make positive contributions and some negative. Your goal is to learn
to recognize the points in a data set that might have an unusually large influence
on where the regression line goes or on the size and sign of the correlation.
Near and Far
What you'll need: an open area in which to step off
distances
In this activity, you compare the actual distance to an
object with what the distance appears to be.
1. Go to an open area, such as the hall or lawn of your
school, and pick a spot as your origin. Choose six
objects at various distances from the origin. Five of
the objects should be within 10 to 20 paces, and the
other should be a long way away (at least 100 paces).
2. For each of the six objects, estimate the number of paces
from the origin to the object. Record your estimates.
3. From your origin, walk to each of your objects and
count the actual number of regular paces it takes you
to get there. Record this number beside your estimate.
4. Plot your data on a scatterplot, with your estimated
value on the x-axis and the actual value on the y-axis.
Does the plot show a linear trend?
5. Determine the equation of the regression line, and calculate the correlation.
6. Delete the point for the object that is farthest away from the origin.
Determine the equation of the regression line and calculate the correlation
for the reduced data set.
7. Did the extreme point have any influence on the regression line? On the
correlation? Explain.
In Chapter 2, you learned about outliers for distributions: values that are
separated from the bulk of the data. Outliers are atypical cases, and they can
exert more than their share of influence on the mean and standard deviation. For
scatterplots, as you will soon see, working with two variables together means that
there can be outliers of various kinds. Different kinds of outliers can have different
types of influence on the least squares line and the correlation. Unfortunately,
there is no rule you can use to identify outliers in bivariate data. Just look for
points surrounded by white space.
Judging a Point's Influence
Points separated from the bulk of the data by white space are outliers and are
potentially influential. To judge a point's influence, compare the regression
equation and correlation computed first with and then without the point
in question.
To see these ideas in action, turn to the data on mammal longevity in
Display 2.24 on page 43 and think about how to summarize the relationship
between maximum and average longevity.
Example: Influential Mammals
The average elephant lives 35 years. The oldest elephant on record lived 70 years.
The average hippo lives 41 years, longer than the average elephant, but the
record-holding hippo lived only 54 years. The oldest-known beaver lived 50 years,
almost as long as the champion hippo, but the average beaver cashes in his wood
chips after only 5 short years of making them. Other mammals, however, are
more predictable. If you look at the entire sample, shown in Display 3.62, it
turns out that the elephant (E), hippo (H), and beaver (B) are the oddballs of
the bunch. For the rest, there's an almost linear relationship between average
longevity and maximum longevity. The least squares line for the entire sample
has the equation
$\hat{M} = 10.53 + 1.58A$
where $\hat{M}$, or M-hat, stands for predicted maximum longevity and A stands for
observed average longevity. For every increase of 1 year in average longevity, the
model predicts a 1.58-year increase in maximum longevity. The correlation for
the relationship between these two variables is 0.77. How much influence do the
oddballs have on these summaries?
Points surrounded by
white space might have
strong influence.
Display 3.62 Maximum longevity versus average longevity.
Solution
The hippo has the effect of pulling the right end of the regression line downward
( like putting a heavy weight on one end of a seesaw ), as you can see in Display 3.63.
When the hippo is removed, that end of the regression line will spring upward
and the slope will increase. Because one large residual has been removed and many
of the remaining residuals have been reduced in size, the correlation will increase.
The new slope is 1.96, and the new correlation is 0.80. The hippo has considerable
influence on the slope and some influence on the correlation.
Now envision the scatterplot with just the elephant, E, missing. Because E
is close to the straight line fit to the data, it produces a small residual. Thus, you
would expect that removing E should not change the slope of the regression line
much (not nearly as much as removing H did ) and should reduce the correlation
just a bit. In fact, the correlation does decrease some, to 0.72 from 0.77. However,
the new slope is 1.53. It turns out that removing the elephant gives the hippo even
more influence, and the slope decreases.
Display 3.63 Regression lines for maximum longevity versus
average longevity, with and without the hippo.
Finally, envision the scatterplot with just the beaver, B, removed. B produces
a large, positive residual close to the left end of the regression line. Thus,
removing B should allow the left end of the line to drop, increasing the slope,
and removing a large residual should increase the correlation. The new slope
is 1.69 (an increase from 1.58), and the new correlation is 0.83 (an increase
from 0.77). The beaver also has considerable influence on both slope and
correlation.
With a little practice, you often can anticipate the influence of certain points
in a scatterplot, as in the previous example, but it is difficult to state general rules.
The best rule is the one given in the box on page 163: Fit the line with and without
the questionable point and see what happens. Then report all the results, with
appropriate explanations.
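The with-and-without comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `fit_line` and `with_and_without` and the six data points are made up for the example (five points on a straight line plus one point far to the right), and they do not come from any display in this chapter.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (intercept, slope, correlation) for the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

def with_and_without(xs, ys, index):
    """Fit the line to all points, then again with the point at `index` left out."""
    full = fit_line(xs, ys)
    reduced = fit_line(xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:])
    return full, reduced

# Hypothetical data: five points on y = 2x + 1 and one point far from the rest.
xs = [1, 2, 3, 4, 5, 20]
ys = [3, 5, 7, 9, 11, 15]

full, reduced = with_and_without(xs, ys, index=5)
print("with the point:    slope = %.2f, r = %.2f" % (full[1], full[2]))
print("without the point: slope = %.2f, r = %.2f" % (reduced[1], reduced[2]))
```

In this made-up data set, leaving out the far-off point raises both the slope and the correlation, so both sets of summaries would be worth reporting.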
Why the Anscombe Data Sets Are Important
Display 3.64 shows four scatterplots. These plots, known as the Anscombe data
after their inventor, are arguably the most famous set of scatterplots in all of
statistics. The questions that follow invite you to figure out why statistics books
refer to them so often. In the process, you'll learn more about what a summary
doesn't tell you about a data set.
Display 3.64 Four regression data sets invented by Francis J.
Anscombe. [Source: Francis J. Anscombe, "Graphs in Statistical
Analysis," American Statistician 27 (1973): 17–21.]
D26. For each plot in Display 3.64, first give a short verbal description of the
pattern in the plot. Then
a. either fit a line by eye and estimate its slope or tell why you think a line is
not a good summary
b. either estimate the correlation by eye or tell why you think a correlation
is not an appropriate summary
D27. Display 3.65 shows a computer output for one of the four Anscombe data set
plots. Can you tell which one? If so, tell how you know. If not, explain why
you can't tell.
Display 3.65 Regression analysis for one of the Anscombe data
sets.
D28. Display 3.66 lists values for the Anscombe plots.
Display 3.66
Anscombe plot data values.
a. Which plot has a point that is highly influential both with respect to the
slope of the regression line and with respect to the correlation?
b. Compared to the other points in the plot, does the influential point lie far
from the least squares line or close to it?
c. How would the slope and correlation change if you were to remove this
point? Discuss this first without actually performing the calculations.
Then carry out the calculations to verify your conjectures.
Residual Plots: Putting Your Data Under a Microscope
As you can see from the Anscombe plots, there are many features of the shape
of a scatterplot that you can't learn from the standard set of summary numbers.
Only when the cloud of points is elliptical, as in Display 3.41 on page 140, does
the least squares line, together with the correlation, give a good summary of
the relationship described by the plot. If the cloud of points isn't elliptical, these
summaries aren't appropriate. How can you decide?
Residual plots may
uncover more
detailed patterns.
A special kind of scatterplot, called a residual plot, often can help you see
more clearly what's going on. For some data sets, a residual plot can even show
you patterns you might otherwise have overlooked completely. Statisticians use
residual plots the way a doctor uses a microscope or an X ray: to get a better
look at less obvious aspects of a situation. (Plots you use in this way are called
diagnostic plots because of the parallel with medical diagnosis.) Push the
analogy just a little. You're the doctor, and data sets are your patients. Sets with
elliptical clouds of points are the healthy ones; they don't need special attention.
A residual plot is a scatterplot of residuals, $y - \hat{y}$, versus predictor values, x
(or, sometimes, versus predicted values, $\hat{y}$).
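As a rough sketch of the definition, the Python below computes the residuals you would plot against x. The helper names `least_squares` and `residuals` and the five data points are assumptions made for this illustration; they are not the airline data used in the example that follows.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """Return the list of residuals, observed y minus predicted y, one per point."""
    intercept, slope = least_squares(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data, roughly linear.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

res = residuals(xs, ys)
# Each pair (x, residual) is one point of the residual plot.
for x, e in zip(xs, res):
    print(x, round(e, 3))
```

The residuals from a least squares line always add up to zero (apart from rounding), which is why 0 sits in the middle of the vertical scale of a residual plot.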
Example: Constructing a Residual Plot
Return to the data on percentage of on-time arrivals versus mishandled
baggage for airlines, introduced in P2 on page 110. Calculate the residuals and
make a plot.
Solution
Visualize each residual (the difference between the observed value of y and the
predicted value, $\hat{y}$) as a vertical segment on the scatterplot in Display 3.67.
Display 3.67
Scatterplot of airline data.
The calculated residuals are shown in Display 3.68, with the list of carriers
ordered from smallest to largest on the x-scale. This allows the size of the residuals
in the far right column to appear in the same order as in Display 3.67. Alaska
produces a negative residual of modest size, whereas US Airways produces a large
positive residual.
The residual plot, Display 3.69, is simply a scatterplot of the residuals versus
the original x-variable, mishandled baggage. Note that 0 is at the middle of the
residuals on the vertical scale.
Display 3.68 Table showing residuals for the airline data.
Display 3.69
Residual plot for the airline data.
The residual plot shows nearly random scatter, with no obvious trends. This
is the ideal shape for a residual plot, because it indicates that a straight line is
a reasonable model for the trend in the original data. [You can use your calculator to
create residual plots. See Calculator Note 3I.]
Residual Plots
D29. In Display 3.69, identify which residual belongs to Delta and which
to Northwest.
D30. To see how residual plots magnify departures from the regression line,
compare the Anscombe plots in Display 3.64 with Display 3.70, which shows
the four corresponding residual plots in scrambled order.
Display 3.70 Residual plots for the four Anscombe data sets.
a. Match each of the original scatterplots in Display 3.64 with its
corresponding residual plot in Display 3.70.
b. Describe the overall difference between the original scatterplots and
the residual plots. What do the scatterplots show that the residual plots
don't? What do the residual plots show that the scatterplots don't?
What to Look For in a Residual Plot
A careful data analyst always looks at a residual plot.
If the original cloud of points is elliptical, so that a line is an appropriate
summary, the residual plot will look like a random scatter of points.
Residual plots
sometimes yield
surprises.
Use residual plots to check for systematic departures from constant slope
(linear trend) and constant strength (same vertical spread). Look in particular
for plots that are curved or fan-shaped. It's true that for data sets with only one
predictor value (like those in this chapter), you often can get a good idea of what
the residual plot will look like by carefully inspecting the original scatterplot.
Once in a while, however, you get a surprise.
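To see what a curved residual plot looks like numerically, here is a small Python sketch. It fits a least squares line to made-up data that lie exactly on the curve y = x^2 (an assumption chosen for illustration, not data from this chapter) and prints the sign of each residual from left to right. The helper name `least_squares` is also hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical curved data: y = x^2 for x = 0, 1, ..., 10.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope = least_squares(xs, ys)
res = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# A line fit to curved data leaves residuals that change sign in runs.
signs = "".join("+" if e > 0 else "-" for e in res)
print(signs)  # prints ++-------++
```

The run of positive residuals at both ends and negative residuals in the middle is the curved pattern to watch for. A random scatter of signs would suggest that a line is a reasonable model.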
Example: Interpreting a Residual Plot
E19 on page 136 introduced data on median height versus age for young girls.
Display 3.71 shows the scatterplot of these data, with the regression line. The overall
average growth rate for the 12-year period is the slope of the regression line.
The plot looks nearly linear, but is a line a suitable model?
Display 3.71 Median height versus age for young girls.
Solution
The residual plot, shown in Display 3.72, quite dramatically reveals that the trend
is not as linear as first imagined. The curvature in the residual plot mimics the
curvature in the original scatterplot, which is harder to see. A line is not a good
model for these data.
Display 3.72 Residual plot of median height versus age for
young girls.
Residuals sometimes
are plotted against the
predicted values, $\hat{y}$.
Statistical software often plots residuals against the predicted values, $\hat{y}$, rather
than against the predictor values, x. For simple linear regression, both plots have
exactly the same shape as long as the slope of the regression line is positive.
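The reason is that each predicted value is a linear function of x, so a positive slope keeps the points in the same left-to-right order, while a negative slope reverses the order and gives a mirror image. The Python sketch below checks this on two made-up data sets; the helper names `least_squares` and `order_of` and the data are assumptions for illustration only.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def order_of(values):
    """Return the indices of the points in left-to-right order along an axis."""
    return sorted(range(len(values)), key=lambda i: values[i])

xs = [1, 2, 3, 4, 5]
ys_up = [2, 1, 4, 3, 6]    # hypothetical data with a positive trend
ys_down = [6, 3, 4, 1, 2]  # hypothetical data with a negative trend

b0_up, b1_up = least_squares(xs, ys_up)
b0_down, b1_down = least_squares(xs, ys_down)
fitted_up = [b0_up + b1_up * x for x in xs]
fitted_down = [b0_down + b1_down * x for x in xs]

# Positive slope: same horizontal order whether you plot against x or fitted values.
print(order_of(xs) == order_of(fitted_up))
# Negative slope: the horizontal order is reversed, so the plot is flipped.
print(order_of(xs)[::-1] == order_of(fitted_down))
```

The residuals themselves are the same in both kinds of plot. Only the horizontal positions change.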
Types of Residual Plots
D31. Display 3.73 shows a scatterplot and two residual plots for the data set
consisting of these three ordered pairs (x, y): (0, 1), (1, 0), and (2, 2). One
residual plot plots residuals versus predictor values, x, the sort of plot you
get from graphing calculators. The other plots residuals versus predicted
(fitted) y-values, or $\hat{y}$, the sort of plot you get from computer software
packages. Explain how the residual plots were produced and how you can
tell which residual plot is which. The equation of the least squares line is
$\hat{y} = 0.5 + 0.5x$.
Display 3.73
A scatterplot and two residual plots.
Summary 3.4: Diagnostics: Looking for Features That the Summaries Miss
For the simplest clouds of data points (elliptical in shape, with linear trend and
no outliers), you can summarize all the main features of a scatterplot with just a
few numbers, mainly the slope of the fitted line, y-intercept, and correlation. Not
all plots are this simple, however, and a good statistician always does diagnostic
checks for outliers and influential points and for departures from constant slope
or constant strength.
Points separated from the bulk of the data by white space are outliers and
potentially influential.
To judge a point's influence, fit a line to the data and compute a regression
equation and a correlation first with and then without the point in question.
If the change in the regression equation and correlation is meaningful in your
situation, report both sets of summary statistics.
For some data sets, a residual plot can show patterns you might otherwise
overlook. A residual plot is a scatterplot of residuals, $y - \hat{y}$, versus predictor
values, x. A residual plot also can be constructed as a scatterplot of residuals,
$y - \hat{y}$, versus fitted values, $\hat{y}$. Use residual plots to check for systematic departures
from linearity and for constant variability in y across the values of x. If the data
aren't linear, the residual plot doesn't look random. If the data have nonconstant
variability, the residual plot is fan-shaped.
Practice
Which Points Have the Influence?
P22. The data in Display 3.74 show some
interesting patterns in the relationship
between domestic and international gross
income from the ten movies with the highest
domestic gross ticket sales.
a. Construct a scatterplot suitable for
predicting international sales from
domestic sales. Describe the pattern in
the data.
b. Find the least squares line and the
correlation for these data.
c. Remove the most influential data point
and recalculate the least squares line and
correlation. Describe the influence of the
removed point.
P23. A data table and scatterplot of one student's
results from Activity 3.4a are shown in
Display 3.75.
a. How well did the student do in estimating
the number of paces?
b. Which point appears to be most
influential?
c. Calculate the slope of the regression line
and the correlation with and without
this point. Describe the influence of this
point.
Display 3.74 Ticket sales for the ten highest-grossing
domestic (United States and Canada)
movies of all time. [Source: Internet Movie
Database, us.imdb.com, September 12, 2006.]
Display 3.75
Sample data from Activity 3.4a.
Residual Plots
P24. For the set of (x, y) pairs (0, 0), (0, 1), (1, 1),
and (3, 2), the equation of the least squares
line is $\hat{y} = 0.5 + 0.5x$.
a. Plot the data and graph the least squares
line.
b. Next complete a table for the predicted
values and residuals, like the table in
Display 3.68 on page 169.
c. Using the values in your table, plot
residuals versus predictor, x.
d. How does the residual plot differ from the
scatterplot?
P25. Display 3.76 shows four scatterplots
(A–D) for the data from a sample of
commercial aircraft. Display 3.77 shows four
corresponding residual plots (I–IV).
a. Match the residual plots to the
scatterplots.
b. Using scatterplots A–D as examples,
describe how you can identify each of
these in a scatterplot from the residual
plot.
i. a curve with increasing slope
ii. unequal variation in the responses
iii. a curve with decreasing slope
iv. two linear patterns with different
slopes
c. For one of the plots, two line segments
joined together seem to give a better fit
than either a single line or a curve. Which
plot is this? Is this pattern easier to see in
the original scatterplot or in the residual
plot?
Display 3.76 Four scatterplots for the sample of commercial aircraft.
Display 3.77 Four residual plots corresponding to the scatterplots in Display 3.76.
Exercises
E43. Extreme temperatures. The data in Display
3.78 provide the maximum and minimum
temperatures ever recorded on each continent.
Display 3.78 Maximum and minimum recorded
temperatures for the continents.
[Source: National Climatic Data Center, 2005,
www.ncdc.noaa.gov.]
a. Construct a scatterplot of the data
suitable for predicting the minimum
temperature from a given maximum
temperature. Is a straight line a good
model for these points? Explain.
b. Fit a least squares line to the points and
calculate the correlation, even if you
thought in part a that a straight line was
not a good model.
c. Explain, in words and numbers, what
influence Antarctica has on the slope of
the regression line and on the correlation.
How could an account of these data be
misleading if it were not accompanied by
a plot?
Two climbers stand on Mount Erebus,
Antarctica, 12,500 ft above sea level.
E44. The data and plot in Display 3.79 are from
E15 on page 135. They show the arsenic
concentrations in the toenails of 21 people
who used water from their private wells.
Both measurements are in parts per million.
Display 3.79 Arsenic concentrations.
a. Which point do you think has the most
influence on the slope and correlation?
What would be the effect of removing
this point? Perform the calculations to see
if your intuition is correct.
b. Find a point that you think has almost no
influence on the slope and correlation.
Perform the calculations to see if your
intuition is correct.
c. Find a point whose removal you think
would make the correlation increase.
Perform the calculations to see if your
intuition is correct.
E45. How effective is a disinfectant? The data in
Display 3.80 show (coded) bacteria colony
counts on skin samples before and after a
disinfectant is applied.
Display 3.80 Coded bacteria colony counts before
(x) and after (y) treatment. [Source:
Snedecor and Cochran, Statistical Methods (Iowa
State University Press, 1967), p. 422.]
a. Plot the data, fit a regression line to them,
and complete a copy of the table, filling in
the predicted values and residuals.
b. Plot the residuals versus x, the count
before the treatment. Comment on the
pattern.
c. Use the residual plot to determine for
which skin sample the disinfectant was
unusually effective and for which skin
sample it was not very effective.
E46. Textbook prices. Display 3.81 compares
recent prices at a college bookstore to those
of a large online bookstore.
a. The equation of the regression line is
online = 3.57 + 1.03 college. Interpret
this equation in terms of textbook prices.
Display 3.81 Prices for a sample of textbooks at
a college bookstore and an online
bookstore.
b. Construct a residual plot. Interpret it and
point out any interesting features.
c. In comparing the prices of the textbooks,
you might be more interested in a
different line: y = x. Draw this line on a
copy of the scatterplot in Display 3.81.
What does it mean if a point lies above
this line? Below it? On it?
d. A boxplot of the differences
college price - online price is shown
in Display 3.82. Interpret this boxplot.
Display 3.82 A boxplot of the differences between
the college price and the online price
for various textbooks.
E47. Pizzas, again. Display 3.83 shows the
pizza data from E12 on page 134, with its
regression line.
Display 3.83 Calories versus fat, per 5-oz serving, for
seven kinds of pizza.
a. Estimate the residuals from the graph,
and use your estimates to sketch a rough
version of a residual plot for this data set.
b. Which pizza has the largest positive
residual? The largest negative residual?
Are any of the residuals so extreme as
to suggest that those pizzas should be
regarded as exceptions?
c. Is any one of the pizzas a highly
influential data point? If so, specify
which one(s), and describe the effect
on the slope of the fitted line and the
correlation of removing the influential
point or points from the analysis.
E48. Aircraft. Look again at Display 3.76 on page
174, which shows a scatterplot of flight length
versus number of seats.
a. Does the slope of the pattern increase,
decrease, or stay roughly constant as you
move from left to right across the plot?
b. Focusing on the variation (spread) in
flight length, y, for planes with roughly
the same seating capacity, compare
the spreads for planes with few seats, a
moderate number of seats, and a large
number of seats. As you move from left to
right across the plot, how does the spread
change, if at all?
c. Suppose a friend chose a plane from
the sample at random and told you the
approximate number of seats. Could you
guess its flight length to within 500 miles
if the number of seats was between 50
and 150? If it was between 200 and 300?
Explain.
d. What is the relationship between your
answer in part b and residual plot I in
Display 3.77?
e. Give an explanation for why the variation
in flight length shows the pattern it does.
E49. Match each scatterplot (A–D) in Display
3.84 with its residual plot (I–IV) in Display
3.85. For which plots is a linear regression
appropriate?
Display 3.84 Four scatterplots.
Display 3.85 Four residual plots.
E50. Can either of the plots in Display 3.86 be a
residual plot? Explain your reasoning.
Display 3.86 Residual plots?
E53. Can you recapture the scatterplot from the
residual plot? The residual plot in Display
3.88 was calculated from data showing the
recommended weight (in pounds) for men at
various heights over 64 in. The fitted weights
ranged from 145 lb to 187 lb. Make a rough
sketch of the scatterplot of these data.
E51. Display 3.87 gives the data set for the three
passenger jets from the example on page 123,
along with a scatterplot showing the least
squares line. (Values have been rounded.)
a. Use the equation of the line to find
predicted values and residuals to
complete the table in Display 3.87.
b. Use your numbers from part a to construct
two residual plots, one with the predictor,
x, on the horizontal axis and the other with
the predicted value, ŷ, on the horizontal
axis. How do the two plots differ?
Display 3.88 Residuals of recommended weight
versus height for men.
E54. The plot in Display 3.89 shows the residuals
resulting from fitting a line to the data for
female life expectancy (life exp) versus gross
national product (GNP, in thousands of
dollars per capita) for a sample of countries
from around the world. The regression
equation for the sample data was
life exp = 67.00 + 0.63 GNP
Sketch the scatterplot of life exp versus GNP.
Display 3.87 Cost per hour versus number of seats for
three models of the passenger aircraft.
E52. Explain why a residual plot of ( x, residual )
and a plot of ( predicted value, residual ) have
exactly the same shape if the slope of the
regression line is positive. What changes if
the slope is negative?
Display 3.89 Residuals of female life expectancy
versus gross national product.
3.5 Shape-Changing Transformations
Transforming data is sometimes called re-expressing data.
For scatterplots in which the points form an elliptical cloud, the regression line
and correlation tell you pretty much all you need to know. But data don't always
behave so obligingly. For plots in which the points are curved, fan out, or contain
outliers, the usual summaries do not tell you everything and can actually be
misleading. What do you do then? This section shows you one possible remedy:
Transform the data to get the shape you want.
You're already familiar with linear transformations from Section 2.4: things
like changing temperatures from degrees Fahrenheit to degrees Celsius or
changing distances from feet to inches or times from minutes to seconds. These
linear transformations (adding or subtracting a constant or multiplying or
dividing by a constant) can change the center and spread of the distribution
without changing its basic shape.
Nonlinear transformations, such as squaring each value or taking logarithms,
do change the basic shape of the plot. This section shows how a transformation of
a measurement scale can lead to simplified statistical analyses. One of the most
common nonlinear relationships is the exponential, and that is where we begin
our discussion.
Exponential Growth and Decay
Whenever a quantity changes by an amount proportional to the amount present, consider logarithms.
Data over time will have more varied patterns, sometimes difficult to see.
Exponential functions often arise when you study how a quantity grows or decays
as time passes. Many such quantities grow by an amount that is proportional to
the amount present. This means that the amount present at one point in time is
multiplied by a fixed constant to get the amount present at the next point in time,
resulting in a function of the form y = ab^x. The amount of growth of a population
is proportional to population size, the growth of a bank account is proportional
to the amount of money in the account, the amount by which a radioactive
substance decays is proportional to how much is left, and the amount by which
a cup of coffee cools is proportional to how much hotter it is than the air around
it. For such situations, replacing y with log y often gives a plot that is much more
nearly linear than the original plot.
The examples in this section illustrate exponential growth and decay. At
the same time, they illustrate another important feature of measurements taken
over time: There is often an up-and-down pattern to the residuals. This pattern
cannot be removed by a simple transformation.
Whenever your measurements come in chronological order, what happens
next is likely to depend on what just happened. As a result, the patterns in your
data might be subtler than the ones you've seen up to this point. In addition, the
difference between a meaningful pattern and a quirk in the data might be harder
to detect because the data typically show only one observation for a single point
in time.
An Example of Exponential Decay
In Activity 3.5a, you will study a population that decreases exponentially
over time.
Activity 3.5a: Copper Flippers
What you'll need: 200 pennies, a paper cup
Count 200 pennies. You'll use these for an exercise
to see how fast pennies "die."
A certain insect, the copper flipper, has a life
span determined by the fact that there is a 50–50
chance a particular live flipper will die at the end
of the day. So, on average, half of any population of
copper flippers will die during the first day of life. Of those that survive the first
day, on average half die during the second day, and so on.
By extraordinary coincidence, these bugs behave like a bunch of tossed
pennies. If you toss 200 pennies, about half should come up heads and half
tails. The heads represent the insects that survive the first day, and the tails
represent those that die. You can collect the pennies that came up heads and
toss them again to see how many survive the second day, and so on. This gives
you a physical model for the distribution of the life span of the insects.
1. Place your pennies in a cup, shake them up, and toss them on a table.
Count the number of heads, and record the number.
2. Set aside the pennies that came up tails. Place the pennies that came up
heads back in the cup, and repeat the process.
3. After each toss (day), set the tails aside, collect the heads, count them, and
toss them again. Repeat this process until you have fewer than five pennies
left, but stop before you get to zero heads.
4. Construct a scatterplot of your data, with the number of the toss on the
horizontal axis and the number of heads on the vertical axis. Does the
pattern look linear?
5. How might you find an equation to summarize this pattern?
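If pennies are not at hand, the activity can be imitated on a computer. The following is only a sketch: the function name simulate_flippers, its parameters, and the fixed random seed are assumptions made for illustration, not part of the activity.

```python
import random

def simulate_flippers(start=200, p_survive=0.5, min_left=5, seed=1):
    """Toss the surviving 'pennies' once per day.

    Returns the number of heads (survivors) recorded after each toss.
    The seed is fixed only so that the run is repeatable.
    """
    rng = random.Random(seed)
    alive = start
    counts = []
    while alive >= min_left:          # stop once fewer than five pennies remain
        heads = sum(1 for _ in range(alive) if rng.random() < p_survive)
        if heads == 0:                # stop before recording zero heads (step 3)
            break
        counts.append(heads)
        alive = heads                 # only the heads are tossed again
    return counts

counts = simulate_flippers()
for toss, heads in enumerate(counts, start=1):
    print(toss, heads)
```

Plotting the printed pairs (toss number, number of heads) gives a scatterplot of the kind asked for in step 4. A different seed plays the role of a different student's set of tosses.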
Exponential Functions and Log Transformations
How do you know if you have an exponential relationship of the form y = ab^x?
One clue is that the points cluster about a function similar to those in Display
3.90. Another clue is that you have a variable whose values are mostly clustered
at one end but range over two or more orders of magnitude (powers of 10).
The best test, however, is to take the log of each value of y and see if this will
straighten the points.
Replacing y with log y will straighten y = ab^x.
Exponential relationships have an underlying model of the form y = ab^x. If
b > 1, b − 1 gives the growth rate per time period. If 0 < b < 1, b − 1 is
negative and its magnitude, 1 − b, gives the decay rate. The points can be linearized (straightened)
by taking the logarithm (base 10 or base e) of each value of y. The result will be
a linear equation of the form
log y = log a + (log b)x
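The linear form follows from the laws of logarithms. As a short derivation (the logarithm may be base 10 or base e, provided the same base is used throughout, and a and b are assumed positive):

```latex
\begin{align*}
y      &= a\,b^{x} \\
\log y &= \log\!\left(a\,b^{x}\right)
        = \log a + \log\!\left(b^{x}\right)
        = \log a + (\log b)\,x
\end{align*}
```

So a plot of log y against x has intercept log a and slope log b. With base 10 logs, the original constants are recovered as a = 10^(intercept) and b = 10^(slope); with natural logs, use e in place of 10.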
Display 3.90 The exponential functions
Analyzing the Copper Flippers
Display 3.91 shows the scatterplot for one student's data from Activity 3.5a.
Display 3.91 Number of heads versus number of the toss.
Does a log transformation appear to be appropriate here? The pattern looks
much like the left-hand curve in Display 3.90, and the values for the number of
heads are clustered at the smaller values but range over two orders of magnitude.
In addition, the number of heads remaining after each toss of the coins is
roughly proportional to the number of coins tossed. A log transformation is
worth a try. Display 3.92 shows the natural log (base e) of the number of heads
plotted against the toss number, along with the regression line, for the data of
Display 3.91.
Compare the scatterplot in Display 3.92 to the residual plot in Display 3.93.
(The line segments are added to help your eye follow the time sequence.) Does
the model appear to fit well if you look only at the scatterplot? How, if at all,
does the residual plot alter your judgment of how well the line fits? The cyclical
up-and-down pattern of residuals is common in such time series data.
The equation of the regression line (shown in Display 3.92) for the
transformed data is ln ŷ = 5.21 − 0.66x. If you solve this
equation for ŷ, you get ŷ = e^(5.21 − 0.66x) = e^(5.21) · (e^(−0.66))^x ≈ 183(0.52)^x.
The number of copper flippers that are
alive each day is about 52% of the number alive the previous day. In other words,
the decay rate is estimated to be 48%, or 0.48.
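The back-transformation can be checked numerically. This sketch assumes the fitted line is ln(heads) = 5.21 − 0.66·(toss number), as read from the text; the variable names are chosen here for illustration.

```python
import math

# Fitted line on the log scale: ln(heads) = intercept + slope * toss
intercept, slope = 5.21, -0.66

a = math.exp(intercept)   # multiplier a in y = a * b**x
b = math.exp(slope)       # fraction alive from one day to the next
decay_rate = 1 - b        # fraction dying each day

print(f"a = {a:.1f}, b = {b:.2f}, decay rate = {decay_rate:.2f}")
# prints: a = 183.1, b = 0.52, decay rate = 0.48
```

The estimated survival fraction of about 0.52 is close to the 0.50 built into a fair coin.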
Display 3.92 A plot of ln(heads) versus the number of the toss,
with the regression line.
Display 3.93 Residual plot for the ln(heads) regression.
Exponential Growth and Decay
D32. How would the scatterplot and the least squares line change if the coins had
a probability of 0.6 of coming up heads? What insect death rate does this
situation model?
An Example of Exponential Growth
Display 3.94 shows the population density (people per square mile) of the
United States for all census years through 2000. For the years prior to 1960,
only the 48 contiguous states are included. Alaska and Hawaii were added to
the census in 1960. To find a reasonable model for this situation, start with a
scatterplot.
Plot. Obviously, the pattern here is not linear. A curve of this type can be
straightened by proportionally decreasing the large y-values (population densities,
in this case). For variables like population growth (or the growth of many other
phenomena), the logarithmic transformation works well. This transformation not
only solves the data analysis problem nicely but also gives a neat interpretation to
the resulting model.
Display 3.94 Population density of the United States over the
census years. [Source: U.S. Census Bureau, Statistical Abstract of
the United States, 2004–2005, p. 8.]
Transform and Plot Again. You can see the transformed points in Display
3.95. For example, for the year 1800, the point (1800, ln 6.1) or (1800, 1.808) is
plotted.
Fit. Although there is still some curvature, the pattern is much more nearly
linear, and a straight line might be a reasonable model to fit these data. The
regression line and regression analysis also are shown in Display 3.95 (on the
next page).
The equation of the regression line is ln ŷ = −25.118 + 0.0148x. Solving for
ŷ gives ŷ = e^(−25.118 + 0.0148x) = e^(−25.118) · (e^(0.0148))^x ≈ e^(−25.118) · (1.0149)^x.
This means that the population density is growing at about 1.5% per year.
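As a rough check of the fitted line, the sketch below evaluates it at the year 1800, where the observed density is 6.1 people per square mile. It assumes the intercept is negative (−25.118), which is the only sign that places the line near the plotted point (1800, 1.808); the function name is made up for this illustration.

```python
import math

def predicted_ln_density(year):
    """Fitted line on the log scale (intercept sign assumed negative)."""
    return -25.118 + 0.0148 * year

year, observed_density = 1800, 6.1

fitted = predicted_ln_density(year)              # about 1.522
residual = math.log(observed_density) - fitted   # observed minus fitted, on the ln scale
annual_growth = math.exp(0.0148) - 1             # growth factor per year, minus 1

print(f"fitted ln(density) = {fitted:.3f}")
print(f"residual           = {residual:.3f}")
print(f"annual growth      = {annual_growth:.3f}")
```

The residual for 1800 is positive, so the observed density that year lies above the fitted line, and the annual growth comes out at about 0.015, or 1.5% per year.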
Display 3.95 ln (population density) versus year, with regression
line and computer output.
Residuals. In the case of data over time, it is often advantageous to plot the
residuals over time, as in Display 3.96.
Display 3.96 Residual plot of ln (population density) versus year.
The result is not exactly random scatter! Well, it is about the best you can do.
The problem is that there are subtle patterns in the data (and in the residuals)
that no simple model will adequately account for.
Exponential Growth and Decay
D33. Relate the pattern you see in the residual plot in Display 3.96 to the pattern
of the data in the original scatterplot (Display 3.94).
D34. Why is there a huge jump from a large positive residual to a large negative
residual as you move from 1800 to 1810? What events in U.S. history explain
some of the other features of the residual plot?
D35. If you use a computer to fit a line to the (year, density) data, it will
automatically compute a correlation.
a. Explain why a correlation is not a very useful summary for this data set.
b. In Display 3.95, the computer gave the value of R-sq as 97.3% for the
transformed data. Statisticians ordinarily are not very interested in the
size of this diagnostic measure for time-ordered data. Can you think of
any reasons why?
D36. What is the estimated annual rate of growth of the population density of the
United States? What is the estimated rate of growth over a decade?
Example: Logarithmic Transformation
Do aircraft with a higher typical speed also have a greater average flight length? As
you might expect, the answer is yes, but the relationship is nonlinear. (See Display
3.97.) Is there a simple equation that relates typical speed and flight length? The
solution that follows leads you through one approach to these questions.
Display 3.97 Data table and scatterplot for the flight length and
speed of various aircraft. [Source: Air Transport Association
of America, 2005, www.air-transport.org.]
Solution
The plot is curved much in the manner of exponential growth, so shrinking the
y-scale is in order. Display 3.98 shows a scatterplot of ln( flight length) versus
speed, along with a least squares line and computer output.
Display 3.98 ln(flight length) versus speed.
Although the pattern of points appears much more linear and the fit looks
pretty good, the lack of randomness in the residual plot in Display 3.99 indicates
that a linear model still does not really fit the points. D38 and E58 will give you a
chance to continue the detective work.
Display 3.99 Residual plot for ln(flight length) versus speed.
Log Transformations
D37. How can you tell from Display 3.97 that the flight length values range over
two orders of magnitude? Show how transforming from flight length to
ln( flight length) will shrink the larger y-values more than the smaller ones
and thus help straighten the plot.
D38. Describe the pattern in the residual plot in Display 3.99, and tell what it
suggests as a next step in the analysis.
D39. In what sense, if any, is the relationship in Display 3.97 one of cause and
effect? What is your evidence?
Log-Log Transformations of Power Functions
Wildlife biologists can estimate the length of an alligator without getting very
close to it. However, to get its weight with any accuracy, they have to wrestle
it onto a scale. This is a procedure that neither the biologist nor the alligator is
happy to participate in.
Perhaps you can help the biologist predict the weight of an alligator spotted in
the swamp from an estimate of its length. One way to do that is to collect data on
both weight and length and then find a regression line that provides a good model
of the relationship. Display 3.100 shows the weights and lengths of 25 alligators as
measured by experts from the Florida Game and Freshwater Fish Commission.
Display 3.100 Alligator weights and lengths. [ Source: Florida Game
and Freshwater Fish Commission.]
A scatterplot of the data (Display 3.101, on the next page) shows that the
relationship is not linear. On thinking carefully about an appropriate model, you
might realize that length is a linear measure while weight is related to volume, a
cubic measure. So perhaps weight is related to the cube of length, or some power
close to that. That is, the relationship is of the form weight = a · length^b,
with b equal to or close to 3, where a
is constant.
Display 3.101 Plot of the alligator data.
Power relationships have an equation of the form
y = ax^b
as the underlying model. The points can be linearized (straightened) by
taking the logarithm (base 10 or base e) of both the values of x and the values
of y. The result will be a linear equation of the form
log y = log a + b log x
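To see the log-log idea at work, the sketch below fits a least squares line to (ln x, ln y) for made-up data that follow y = 2x^3 exactly. The data and the helper least_squares are illustrative assumptions, not values from the text.

```python
import math

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line for ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data lying exactly on the power curve y = 2 * x**3
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x ** 3 for x in xs]

intercept, slope = least_squares([math.log(x) for x in xs],
                                 [math.log(y) for y in ys])
b_hat = slope                 # slope of the log-log line estimates the power b
a_hat = math.exp(intercept)   # intercept is ln a, so a = e**intercept

print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Because these points lie exactly on a power curve, the fit returns a = 2 and b = 3 up to rounding error. Real data scatter about the line, so the slope only estimates the power.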
Thus, if ln(weight) is plotted against ln(length), the plot should be fairly linear
and the slope of the least squares line will provide an estimate of the power, b. The
result of this transformation is shown in Display 3.102. The regression equation is
ln(weight) = 3.29 ln(length) − 10.2.
The plot does indeed look linear, and the estimate of b is 3.29. (Natural logs
are used here, but logs to base 10 will produce essentially the same results.) The
biologist can use this model to predict ln(weight) from ln(length) and then change
the predicted value back to the original scale, if he or she chooses.
Note that the residual plot still shows a bit of curvature, mainly because the
three largest alligators are somewhat influential. But the offsetting advantage
of the power model is that the residuals are fairly homogeneous; that is, they
don't tend to grow or shrink as length increases. This means that the error in
the prediction of weight will be relatively constant for alligators of all lengths.
The exponential model also fits these data well, but the residuals then lose their
homogeneity with no substantial decrease in their size.
Ultimately, a model should be selected based on its intended use. These
biologists wanted a model that predicts weight well for all reasonable values of
length, not just for large alligators. Furthermore, a model should make sense to
the experts in the field of use. The biologists could understand why weight (or
volume) should have a cubic relationship with length but could see no reason why
weight should grow exponentially with length.
Display 3.102 Linear regression of ln(weight) versus ln(length),
with residual plot.
Example: Using the Regression Equation
Use the regression equation to predict the weight of an alligator that is 100 inches
in length.
Solution
The prediction is ln(weight) = 3.29 ln(100) − 10.2 ≈ 4.951. The predicted weight
is found by solving ln(weight) = 4.951 for weight:
weight = e^(4.951) ≈ 141.3 lb
Both biologist and alligator can rest more easily.
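The same two steps (predict on the log scale, then undo the logarithm) can be written as a small function. This is a sketch that assumes the regression equation ln(weight) = 3.29 ln(length) − 10.2, with length in inches and weight in pounds; the name predict_weight is chosen for illustration.

```python
import math

def predict_weight(length_in):
    """Predicted weight (lb) for an alligator of the given length (inches)."""
    ln_weight = 3.29 * math.log(length_in) - 10.2   # prediction on the ln scale
    return math.exp(ln_weight)                      # back to pounds

print(round(predict_weight(100), 1))   # prints 141.3
```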
Log-Log Transformations of Power Functions
D40. For variable x taking on the integer values from 1 through 10, sketch
the graph of a power function with a = 1 and b = 2. Compare it to the
exponential equation with a = 1 and b = 2. Discuss the differences between
exponential models and power models.
D41. Use the alligator data to show that you get the same predicted weight for a
given length whether you use base 10 or base e logarithms.
Power Transformations
You always can use a log-log transformation to straighten points that follow a power
relationship, y = ax^b. However, sometimes your knowledge of a situation can allow
you, with a little thought, to go directly to an appropriate power transformation
without transforming through logarithms. For example, in exploring the
relationship between time in free-fall and the distance an object falls, it is the square
root of the distance that is linearly related to the time. In relating gas mileage to
size of cars, miles per gallon could just as well be gallons per mile (a reciprocal
transformation). Commonly used power transformations are given in the box.
To use a power transformation, transform y by replacing it with a power of y,
such as √y, 1/y, or y^2.
The next example describes this process.
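The free-fall case mentioned above makes a quick illustration. The sketch assumes the idealized model distance = (1/2)·g·t^2 with g = 9.8 m/s^2 and no air resistance. The times are made up, and the "steps" are differences between successive values at equally spaced times.

```python
import math

G = 9.8                                  # assumed gravitational acceleration, m/s^2
times = [0.5, 1.0, 1.5, 2.0, 2.5]        # equally spaced times, in seconds
distances = [0.5 * G * t ** 2 for t in times]

# Successive differences: constant steps mean the points lie on a line.
raw_steps = [d2 - d1 for d1, d2 in zip(distances, distances[1:])]
root_steps = [math.sqrt(d2) - math.sqrt(d1)
              for d1, d2 in zip(distances, distances[1:])]

print("steps in distance:      ", [round(s, 3) for s in raw_steps])
print("steps in sqrt(distance):", [round(s, 3) for s in root_steps])
```

The steps in distance keep growing, which is the sign of a curved plot. The steps in √distance are all equal, so √distance plotted against time is a straight line.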
Example: A Power Transformation
The success of sustainable harvesting of timber depends on how fast trees grow,
and one way to measure a tree's growth rate is to find the relationship between its
diameter and its age. If you know this relationship and the relationship between the
age and height of a tree, then you can estimate the growth rate for total volume of
timber. Displays 3.103 through 3.106 give data on the age (in years) and diameters
(in inches at chest height) of a sample of oak trees, a scatterplot of diameter versus
age, a residual plot, and numerical summaries in the form of computer output.
Based on these data, find a good model for predicting tree diameter from age.
Display 3.103 Diameters and ages of oak trees. [Source: Herman
H. Chapman and Dwight B. Demeritt, Elements of Forest
Mensuration, 2nd ed. (J. B. Lyon Company, 1936).]
Display 3.104 Diameter versus age of oak trees, with regression
line.
Display 3.105 Residual plot for the oak tree data.
Display 3.106 Numerical summary in the form of computer
output of the ages and diameters (inches at chest
height) of a sample of oak trees.
Solution
Inspection reveals that the point cloud is roughly elliptical with a slight downward
curvature. Straightening this curve will require expanding the y-scale. You can
formulate a power transformation by thinking carefully about the practical
situation, rather than jumping immediately to the log-log transformation. If
the diameter does not grow linearly with age, perhaps the cross-sectional area,
π(diameter/2)^2, does. Try squaring the diameters.
Display 3.107 shows a scatterplot of diameter squared versus age, together with
the equation of the least squares line. Display 3.108 shows the plot of residuals
versus age. Even though the value of r^2
is about the same as before, the scatterplot
and residual plot are now less curved. Taking all the evidence together, a linear
model appears to fit the transformed data better than it does the original data.
Display 3.107 Diameter squared versus age for the oak tree data,
with the regression line.
Display 3.108 Residual plot for diameter squared versus age.
Measurement scales are selected by the user, so select one that meets your needs.
Power transformations like the ones you have just seen can straighten a curved
plot (or change a fan shape to a more nearly oval shape). By choosing the right
power, you often can take a data set for which a fitted straight line and correlation
are not suitable and convert it into one for which those summaries work well.
Is it cheating to change the shape of your data? (You wanted a linear cloud,
but you got a curved wedge. You didn't like that, so you fiddled with the data
until you got what you wanted.) In fact, as you'll see, changing scale is a matter
of re-expressing the same data, not replacing the data with entirely new facts.
The intelligent measurer selects a scale that is most useful for the problem the
measurements were taken to solve.
[You can use your calculator to perform all the transformations you've learned
about in this section. See Calculator Note 3J.]
Power Transformations
D42. Give a plausible explanation for why a tree
might grow at a rate that makes the
square of its diameter proportional to
its age.
D43. If you need to predict the diameter of some
oak trees for which you know only the age,
would you rather do the predicting for
10-year-old trees or for 40-year-old trees?
Explain your reasoning.
Summary 3.5: Shape-Changing
Transformations
For many scatterplots that show slopes or spreads that change as x changes, you
can find a shape-changing transformation that brings your scatterplot much
closer to the form for which the summaries of this chapter work best: a cloud
of points in the shape of an ellipse with a linear trend. A transformation will not
always make the pattern linear, however.
Transformations should have some basis in reality; they should not be simply
chosen arbitrarily to see what might happen. Ideally, the transformation you use
will be related in a plausible way to the situation that created your data.
The most common transformations are powers, in which you replace y with
√y, 1/y, y^2, and so forth, and logarithms, in which you
replace y with log10 y or ln y. You can also transform x.
Here are some other helpful facts to consider when choosing a transformation
involving logarithms:
A log-log transformation replacing x with log x and y with log y will straighten
data modeled by a power function of the form y = ax^b.
Replacing y with log y will straighten data modeled by an exponential
function of the form y = ab^x.
If you have data on a quantity that changes over time by an amount roughly
proportional to the quantity at a given time, then the logarithm of the
quantity will be roughly a linear function of time.
Consider a log transformation whenever you have a variable whose values
are clustered at one end and range over two or more orders of magnitude
(powers of 10).
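One informal way to compare the two log transformations on a given data set is to see which transformed plot is closer to a straight line. The sketch below uses made-up data that follow the exponential curve y = 3(1.5)^x exactly. The helper correlation and the data are assumptions for illustration, and the correlation is only a rough guide that does not replace looking at a residual plot.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data lying exactly on the exponential curve y = 3 * 1.5**x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3 * 1.5 ** x for x in xs]

log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]

r_semilog = correlation(xs, log_y)     # y replaced with ln y (exponential model)
r_loglog = correlation(log_x, log_y)   # both replaced with logs (power model)

print(f"r for (x, ln y):    {r_semilog:.4f}")
print(f"r for (ln x, ln y): {r_loglog:.4f}")
```

For these data the plot of ln y against x is perfectly linear, while the log-log plot is still curved, which points to the exponential model over the power model.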
For data collected over time (or over some other sequential ordering, such as
distance along a path), there generally is one data value for each time, and each
data value usually is correlated quite highly with the values to either side (its close
neighbors). So the data will tend to have a much more intricate pattern than can
be modeled by a straight line. Careful analysis of the pattern in the residuals often
can help you see what is really happening over time.
Practice
Exponential Growth and Decay
P26. Dying dice. One of the authors gathered data
on dying dice, starting with 200, and used
the rule that a die lives if it lands showing
1, 2, 3, or 4. Here are the results:
200 122 81 58 29 19 11 8 6 4 2 2
a. Construct a scatterplot of the number of
"live" dice versus the roll number.
b. Transform the number of live dice using
natural logs, and construct a scatterplot
of ln(dice) versus roll number.
c. Fit a line by the least squares method, and
use its slope to estimate the rate of dying.
d. Plot residuals versus roll number, and
describe the pattern.
P27. Florida is one of the fastest-growing states in
the United States. The population figures for
each census year from 1830 through 2000 are
given in Display 3.109.
Display 3.109 Population of Florida over the years
1830–2000. [Source: U.S. Census Bureau,
Statistical Abstract of the United States,
2004–2005.]
a. Find a transformation that straightens the
relationship.
b. Fit a line to the transformed data, and
use it to estimate the rate of population
growth.
c. Produce a residual plot for your model,
and comment on the pattern in the
residuals.
Exponential Functions and Log
Transformations
P28. The logarithm of a number is the exponent
when you write the number in the form
base raised to a power. Thus, log10 1000 = 3
means the same thing as 1000 = 10^3. Use
this fact to transform y into logs (in base 10)
in this set of (x, y) pairs: (2, 1000), (1, 100),
(0, 10), and (−1, 1). Then plot log10 y versus
x and check that the points lie in a straight
line. Find the slope and y-intercept of the
line.
P29. Repeat P28 for these two sets of pairs.
a. (6, 1000), (4, 100), (2, 10), (0, 1)
b. (5, 0.0001), (6, 0.01), (8, 100)
P30. Verify that if log10 y = c + dx, then y = ab^x,
where a = 10^c and b = 10^d. Use this fact to
rewrite each of your fitted equations in P28
and P29 in the form y = ab^x.
P31. Rewrite the fitted equation in Display 3.98
on page 186 in the form
flight length = a · b^speed
P32. In setting standards for the consumption
of fish tainted by chemicals in the water
from which they were taken, the U.S.
government commissioned a study of fish
consumption for one such contaminated
area. A part of the study involved interviews
with a sample of noncommercial fishers
that asked, among other things, how often
the person fished in this water over the past
month and how many fish meals his or
her family consumed over the past month.
The number of meals was then converted
to grams of fish consumed (a statistical
process in itself). The number of fishing
trips is fairly easy for people to remember,
but they aren't so accurate at reporting the
amount of fish their family ate. A good
model relating the number of fishing trips to
consumption would be helpful in estimating
fish consumption. From the set of data in
Display 3.110, find such a model.
Display 3.110 Fish consumption data.
Log-Log Transformations of Power
Functions
P33. Use the regression equation on page 189 to
predict the weight of an alligator that is
75 inches long.
P34. Having a good measure of tidal velocity (the
speed at which water depth increases) in
an estuary is critically important, especially
during storms. Tidal velocity is difficult
to measure, but it is related to the depth
of the water. Thus, a good model of this
relationship would allow scientists to predict
the velocity from measurements of water
depth. Display 3.111 shows measurements
of the depth of water (in meters) and tidal
velocities (in meters per second) for certain
locations in an estuary.
Display 3.111 Tidal velocity versus the water depth.
[Source: Shi et al., International Journal of
Numerical Methods for Fluids, 2003.]
a. Describe the nature of the relationship
between depth and velocity.
b. Fit an appropriate model that would
allow the prediction of velocity from
depth.
Power Transformations
P35. Three sets of pairs (x, y) are given here. For
each set, (i) plot y versus x, (ii) find a power
of y to use in place of y itself to get a linear
plot, and (iii) plot that power of y versus x.
a. (0, 0), (1, 1), (2, 8), (3, 27)
b. (1, 10), (2, 5), (5, 2), (10, 1)
c. (100, 10), (64, 8), (25, 5), (9, 3), (1, 1)
P36. In P35, suppose that, instead of plotting
a power of y versus x, you plot y versus a
power of x. For each of the three sets of
pairs (x, y), find the power of x for which the
points lie on a line. What is the relationship
between the powers of y in P35 and the
powers of x in this problem?
P37. For each of these relationships, first write the
equation that relates x and y. Then use this
equation to find a power of y that you could
plot against x in order to get a linear plot.
a. y is the area of a circle, and x is the radius
of the circle.
b. y is the volume of a block whose sides
all have equal lengths, and x is the side
length.
c. y is the volume of an 8-ft section of log
with a circular cross section, and x is the
diameter of the log's cross section.
P38. If the square of a tree's diameter is roughly
proportional to its age, then you can
expect the diameter itself to be roughly
proportional to the square root of the
tree's age. For the data in Display 3.103
on page 190, use a computer or calculator
to make a scatterplot of diameter versus
square root of age, fit a least squares line,
and plot the residuals. Which residual plot
shows more of a fan shape, the plot of
diameter squared versus age or the plot of
diameter versus square root of age? If you
want a plot that shows an elliptical cloud,
which transformation should you choose?
P39. Display 3.112 shows the brain weights and
body weights for a collection of mammals.
The goal is to establish the relationship of
brain weight to body weight.
a. Assuming that there is a power
relationship here, can you guess what it
is from the scatterplot? If y is written as a
function of x to some power, should the
power be greater than 1 or less than 1?
b. Plot log(brain) versus log(body). Describe
the pattern of the plot.
c. Fit a line to the plot in part b. Write
an equation relating y to x. Does your
equation support your answer to part a?
Display 3.112 The brain weights and body weights of mammals. [Source: T. Allison and
D. V. Cicchetti, "Sleep in Mammals: Ecological and Constitutional Correlates," Science 194
(1976): 732–34.]
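A log-log fit of the kind described in this section can be sketched in Python. This is a minimal illustration, not the analysis of Display 3.112: the data are hypothetical, and `fit_line` is a helper name introduced here. If log10(y) is linear in log10(x) with slope p and intercept log10(a), then y = a·x^p.

```python
import math


def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line for ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data (not from Display 3.112) that follow y = 3 * x**0.5.
xs = [1, 4, 9, 16, 25]
ys = [3, 6, 9, 12, 15]

# Fit log10(y) = log10(a) + power * log10(x).
log_xs = [math.log10(x) for x in xs]
log_ys = [math.log10(y) for y in ys]
power, log_a = fit_line(log_xs, log_ys)

# Back-transform the intercept to get the multiplier a in y = a * x**power.
a = 10 ** log_a
print(f"y = {a:.2f} * x**{power:.2f}")
```

A fitted power below 1, as in this made-up example, corresponds to a curve that rises but flattens as x grows.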
Exercises
E55. For the data in Display 3.97 on page 185, try
fitting a straight line to the square root
of flight length as a function of speed.
Does this transformation work as well
as the log transformation? Explain your
reasoning.
E56. More dying dice. Follow the same steps as
in P26 on page 194 for these numbers of
surviving dice: 200, 72, 28, 9, 5, 2, and 1. Use
your data to estimate what the probability
of dying was in order to generate these
numbers.
E57. Growing kids. Median heights and weights
of growing boys are presented in Display
3.113. What model would you choose to
predict weight from a boy's known height?
Display 3.113 Median heights and weights of
growing boys. [Source: National Health
and Nutrition Examination Survey (NHANES),
2002, www.cdc.gov.]
E58. Cost per seat per mile and flight length,
revisited. As you saw in P25 on page 174,
when cost per seat per mile is plotted against
flight length, the pattern is not linear. The
residual plot in Display 3.77 on page 174
strongly suggests that two line segments
might provide a better model than a single
line. Apparently, there is one relationship
for aircraft meant for longer routes and
another for aircraft meant for shorter routes.
Look through the complete listing of the
data (Display 3.12 on page 115) and the
scatterplots from P25 to see whether any
features other than flight length separate the
aircraft into the same two groups.
E59. Chimp hunting parties. After Jane Goodall
discovered that chimpanzees are not solely
vegetarian, much research began into the
behavior of chimpanzees as hunters. Some
animals hunt alone or in small groups, while
others hunt in large groups. Where does the
chimp fit in, and what is the success rate of
chimps' hunting parties? Not surprisingly,
the success of the hunt depends in part on
the size of the hunting party. Display 3.114
gives some data on the number of chimps
in a hunting party and the success rate of
parties of that size.
Display 3.114 Hunting party size and percentage
of success. [Sources: Mathematics Teacher,
August 2005, p. 13; C. B. Stanford, "Chimpanzee
Hunting Behavior and Human Evolution,"
American Scientist 83 (1995).]
a. Plot the data in a way that allows the
building of a model to predict success
from size of hunting party. Describe the
pattern you see.
b. Will a simple linear regression model
work well here? Why or why not?
c. Look for a transformation that will
produce a model with better predicting
ability than the simple linear one. Fit the
model to the data.
d. Investigate the residuals from the model
in part c. Are you happy with the fit of
that model?
E60. The data in Display 3.115 are the population
of the United States from 1830 through 2000
and the number of immigrants entering the
country in the decade preceding the given
year.
Display 3.115 Population and immigration in the
United States, 1830–2000. [Source: U.S.
Census Bureau, Statistical Abstract of the
United States, 2004–2005.]
a. Find the population growth for each
decade. Was the increase in population
constant from decade to decade? How
would you describe the pattern?
b. Fit a model to the (year, population) data
and defend your model as representative
of the major trend(s) in U.S. population
growth.
c. Make a plot over time of the immigration
by decade. Describe the pattern you see
here. Can you fit one of the models from
this section to data that look like this?
E61. Display 3.116 gives data about passengers on
United Airlines flight 815, Chicago-O'Hare
to Los Angeles, on October 31, 1997. There
were 186 passengers, but the data concern
those 33 passengers who had tickets for
the Chicago-to-Los Angeles leg only. The
variables are
X: number of days before the flight that the
ticket was purchased
Y: price of the ticket
Display 3.116 Number of days before the flight that
the ticket was purchased and price
of airline ticket. [Source: New York Times,
Weekly Review, April 12, 1998.]
Because it is in the airline's interest to sell
tickets early, you might expect Y to be
negatively associated with X.
It happens that the first four cases in Display
3.116 are for passengers who flew first class,
and those passengers pay more than other
passengers no matter when they purchase
their ticket. So you can justify examining the
data for the 29 economy-class passengers
only.
Finally, one passenger paid $0 because he
or she used frequent-flyer miles. You are
justified in deleting this value from the data
set if the goal is to find a model that relates
price to time of purchase.
Can you find a model that relates the cost of
the ticket to the number of days in advance
that the ticket was purchased? Explain the
problems you encounter in doing this.
E62. Different body organs use different amounts
of oxygen, even when you take their mass
into consideration. For example, the brain
uses more oxygen per kilogram of tissue
than the lungs do. Scientists are interested
in how oxygen consumption is related to
the mass of an animal and whether that
relationship differs from organ to organ.
The data in Display 3.117 show typical body
mass, oxygen consumption in brain tissue,
and oxygen consumption in lung tissue for a
selection of animals. (Oxygen consumption
often is measured in milliliters per hour per
gram of tissue, but the actual units were not
recorded for these data.)
Display 3.117 Oxygen use by certain animal organs.
The oxygen measurements are coded
values (original measurements not
given) but are still comparable.
[Source: K. Schmidt-Nielsen, Why Is Animal
Size So Important? (Cambridge University Press,
1984), p. 94.]
a. As you can see from the table, as
total body mass increases, the oxygen
consumption in brain tissue tends to go
down. Define a function that models this
situation. Then find a way to describe the
rate of decrease.
b. Repeat part a for lung tissue. How does
this relationship differ from that of brain
tissue?
c. It is known that the proportion of
body mass concentrated in the brain
decreases appreciably as the size of the
animal increases, whereas the proportion
concentrated in the lungs remains
relatively constant. One possible theory
on oxygen consumption is that the rates
of consumption within organ tissue can
be explained largely by the relative size of
the organ within the body. Is this theory
supported by the data? Explain your
reasoning.
E63. How is the birthrate of countries related to
their economic output? Do richer countries
have higher birthrates, perhaps because
families can afford more children? Or do
poorer countries have higher birthrates,
perhaps due to the need for family workers
and a lack of education? Display 3.118
shows the birthrates (number of births
per thousand population) and the GNP
(in thousands of dollars per capita) for
a selection of countries from around the
world.
a. Construct a scatterplot of these data and
comment on the pattern you observe.
b. Fit a statistical model to these data and
interpret the slope and intercept of the
model in the context of the data.
Display 3.118 Birthrates and GNP for selected
countries, 2002. [Source: U.S. Census
Bureau, Statistical Abstract of the United States,
2004–2005.]
E64. According to the National Center for Health
Statistics, the percentage of males who
smoke has decreased markedly over the
past 40 years, but there still may be some
interesting trends to observe. Display 3.119
shows the percentage of males who smoke in
selected years for various age groups.
a. Study the trend in the percentage of
smokers for the entire male population
age 18 and over. The points follow the
pattern of exponential decay. How should
you modify the percentages before taking
their logarithms? Fit the model and
interpret the slope.
b. Study the trend in the percentage of
smokers for the group age 18 to 24. What
model would you use to explain the
relationship between the percentage of
smokers and the year for this age group?
Explain your reasoning. What feature
makes this data set more difficult to
model than the data set in part a?
c. Study the trend in the percentage of
smokers for the group age 65 and over.
Does this group show the same kind of
trend as seen in the two groups studied in
parts a and b? Explain.
Display 3.119
Percentage of males who smoke by
age group and year. [Source: National
Center for Health Statistics, 2003.]
E65. Is global warming a reality? One measure
of global warming is the amount of carbon
dioxide (CO₂) in the atmosphere. Display
3.120 gives the annual average carbon
dioxide levels (in parts per million) in the
atmosphere over Mauna Loa Observatory in
Hawaii for the years 1959 through 2003.
a. Plot the data and describe the trend over
the years.
b. Fit a straight line to the data and look at
the residuals. Describe the pattern you
see.
c. Suggest another model that might fit
these data well. Fit the model and assess
how well it removes the pattern from the
residuals.
d. Use the model you like best to describe
numerically the growth rate in
atmospheric carbon dioxide over Hawaii.
Display 3.120 Carbon dioxide in the atmosphere.
[Source: Mauna Loa Observatory.]
E66. How does the average SAT math score for
students in a state relate to the percentage
of students taking the exam? Display 3.121
shows the average SAT math score for each
state in 2005, along with the percentage of
high school seniors taking the exam. Find
a model that seems like a good predictor
of average SAT math scores based on
knowledge of the percentage of seniors
taking the exam.
Display 3.121 Average SAT math scores by state.
[Source: College Board, www.collegeboard.com.]
Chapter Summary
In Chapter 2, you worked with univariate data. In Chapter 3, you learned how to
uncover information for bivariate (two-variable) data, using plots and numerical
summaries of center and spread. In Chapter 2, the basic plot was the histogram.
For histograms of distributions that are approximately normal, the fundamental
measures of center and spread are the mean and the standard deviation. For
bivariate data, the basic plot is the scatterplot. For scatterplots that have an
elliptical shape, the fundamental summary measures are the regression line
(which you can think of as the measure of center) and the correlation (which
you can think of as the measure of spread).
For now, correlation and regression merely describe your data set. In
Chapter 11, you will learn to use numerical summaries computed from a sample
to make inferences about a larger population. Using diagnostic tools such as
residual plots and finding transformations that re-express a curved relationship as
a linear one will come in especially handy because you won't be able to make valid
inferences unless the points form an elliptical cloud.
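The summary tools named above (the regression line, the correlation, and the residuals) can be computed in a few lines. The Python sketch below is an illustration only: the data are hypothetical, and `least_squares` and `residuals` are helper names introduced here. It fits a line to curved data, shows the U-shaped residuals, and then shows how re-expressing y as its square root removes the pattern.

```python
import math


def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar, sxy / math.sqrt(sxx * syy)


def residuals(xs, ys):
    """Return observed minus predicted y for the least squares line."""
    slope, intercept, _ = least_squares(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical curved data (not from the text): y = x**2.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

slope, intercept, r = least_squares(xs, ys)
raw_resid = residuals(xs, ys)                             # U-shaped pattern
sqrt_resid = residuals(xs, [math.sqrt(y) for y in ys])    # pattern removed

print(f"line: y = {intercept:.1f} + {slope:.1f}x, r = {r:.3f}")
print("residuals, raw y:    ", [round(e, 3) for e in raw_resid])
print("residuals, sqrt of y:", [round(e, 3) for e in sqrt_resid])
```

Note that the correlation for the raw data is high even though the residuals show a clear curve, which is why a residual plot is worth checking before relying on r alone.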
Review Exercises
E67. Leonardo's rules. A class of 15 students
recorded the measurements in Display 3.122
for Activity 3.3a.
Display 3.122 Sample measurements, in
centimeters, for Activity 3.3a.
a. Construct scatterplots and fit least
squares lines for each of Leonardo's rules
in Activity 3.3a. Do the rules appear to
hold?
b. Interpret the slopes of your regression
lines.
c. If appropriate, find the value of r for
each of the three relationships. Which
correlation is strongest? Which is
weakest?
E68. Space Shuttle Challenger. On January 28,
1986, because two O-rings did not seal
properly, Space Shuttle Challenger exploded
and seven people died. The temperature
predicted for the morning of the flight was
between 26°F and 29°F. The engineers were
concerned that the cold temperatures would
cause the rubber O-rings to malfunction.
On seven previous flights at least one of the
twelve O-rings had shown some distress.
The NASA officials and engineers who
decided not to delay the flight had available
to them data like those on the scatterplot
in Display 3.123 before they made that
decision.
Display 3.123 Flights when at least one O-ring
showed some distress.
a. Why did it seem reasonable to launch
despite the low temperature?
b. Display 3.123 contains information only
about flights that had O-ring failures.
Data for all flights were available on
a table like the one in Display 3.124.
Add the missing points to a copy of the
scatterplot in Display 3.123. How do these
data affect any trend in the scatterplot?
Would you have recommended launching
the space shuttle if you had seen the
complete plot? Why or why not?
Display 3.124 Challenger O-ring data. [Source:
Siddhartha R. Dalal et al., "Risk Analysis of
the Space Shuttle: Pre-Challenger Prediction
of Failure," Journal of the American Statistical
Association 84 (1989): 945–47.]
E69. Exam scores. Students' scores on two exams
in a statistics course are given in Display
3.125 along with a scatterplot with regression
line and a residual plot. The regression
equation is Exam 2 = 51.0 + 0.430(Exam 1),
and the correlation, r, is 0.756.
Display 3.125 Data for exam scores in a statistics
class, with scatterplot and residual
plot.
a. Is there a point that is more influential
than the other points on the slope of the
regression line? How can you tell from
the scatterplot? From the residual plot?
b. How will the slope change if the
scores for this one influential point are
removed from the data set? How will the
correlation change? Calculate the slope
and correlation for the revised data to
check your estimate.
c. Construct a residual plot of the revised
data. Does a linear model fit the data
well?
d. Refer to the scatterplot of Exam 2 versus
Exam 1 in Display 3.125. Does this plot
illustrate regression to the mean? Explain
your reasoning.
E70. Suppose you have the Exam 1 and Exam 2
scores of all students enrolled in U.S. History.
a. The slope of the regression line for
predicting the scores on Exam 2 from the
scores on Exam 1 is 0.51. The standard
deviation for Exam 1 scores is 11.6,
and the standard deviation for Exam 2
scores is 7.0. Use only this information to
find the correlation coefficient for these
scores.
b. Suppose you know, in addition, that the
means are 82.3 for Exam 1 and 87.8 for
Exam 2. Find the equation of the least
squares line for predicting Exam 2 scores
from Exam 1 scores.
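For reference, the relationships that connect the quantities in an exercise like this one are the standard least squares identities, written here in LaTeX. The symbols $s_x$, $s_y$, and $b_0$ are notation assumed for this note ($b_1$ is the least squares slope, as in E73), and no values from E70 are substituted.

```latex
b_1 = r \, \frac{s_y}{s_x}
\qquad\Longleftrightarrow\qquad
r = b_1 \, \frac{s_x}{s_y},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The last identity says that the least squares line passes through the point of averages $(\bar{x}, \bar{y})$.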
E71. You are given a list of six values, −1.5, −0.5,
0, 0, 0.5, and 1.5, for x and the same list of six
values for y. Note that the list has mean 0 and
standard deviation 1.
a. Match each x-value with a y-value so
that the resulting six pairs (x, y) have
correlation 1.
b. Match the x- and y-values again so that
the points have the largest possible
correlation less than 1.
c. Match the values again, this time to get a
correlation as close to 0 as possible.
d. Match the values a fourth time to get a
correlation of −1.
E72. Display 3.126 lists the values of six variables,
with a scatterplot matrix showing all 30
possible scatterplots for these variables. For
example, the first scatterplot in the first row
has variable B on the x-axis and variable
A on the y-axis. The first scatterplot in the
second row has variable A on the x-axis and
variable B on the y-axis.
Display 3.126 Data table for 6 variables and a scatterplot matrix of all 30 possible
scatterplots for the variables.
a. For five pairs of variables the correlation
is exactly 0, and for one other pair it
is 0.02, or almost 0. Identify these six
pairs of variables. What do they have in
common?
b. At the other extreme, one pair of
variables has correlation 0.87; the next
highest correlation is 0.58, and the third
highest is 0.45. Identify these three pairs,
and put them in order from strongest to
weakest correlation.
c. Of the remaining six pairs, four have
correlations of about 0.25 (give or take a
little) and two have correlations of about
0.1 (give or take a little). Which four pairs
have correlations around 0.25?
d. Choose several scatterplots that you think
best illustrate the phrase "the correlation
measures direction and strength but not
shape," and use them to show what you
mean.
E73. Decide whether each statement is true or
false, and then explain your decision.
a. The correlation is to bivariate data what
the standard deviation is to univariate
data.
b. The correlation measures direction and
strength but not shape.
c. If the correlation is near 0, knowing the
value of one variable gives you a narrow
interval of likely values for the other
variable.
d. No matter what data set you look at, the
correlation coefficient, r, and least squares
slope, $b_1$, will always have the same sign.
E74. Look at the scatterplot of average SAT I math
scores versus the percentage of students
taking the exam in Display 3.7 on page 112.
a. Estimate the correlation.
b. What possibly important features of
the plot are lost if you give only the
correlation and the equation of the least
squares line?
c. Sketch what you think the residual plot
would look like if you fitted one line to all
the points.
E75. The correlation between in-state tuition and
out-of-state tuition, measured in dollars, for
a sample of public universities is 0.80.
a. Rewrite the sentence above so that
someone who does not know statistics
can understand it.
b. Does the correlation change if you
convert tuition costs to thousands of
dollars and recompute the correlation?
Does it change if you take logarithms
of the tuition costs and recompute the
correlation?
c. Does the slope of the least squares line
change if you convert tuition costs to
thousands of dollars and recompute
the slope? Does it change if you take
logarithms of the tuition costs and
recompute the slope?
E76. Display 3.127 shows a scatterplot divided
into quadrants by vertical and horizontal
lines that pass through the point of averages,
$(\bar{x}, \bar{y})$.
a. For each of the four quadrants, give the
sign of $z_x$ (the standardized value of x),
$z_y$ (the standardized value of y), and their
product $z_x z_y$.
b. Which point(s) make the smallest
contribution to the correlation? Explain
why the contributions are small.
Display 3.127
A scatterplot divided into quadrants
by lines passing through the means.
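The link between quadrants and the correlation can be checked numerically. The Python sketch below is an illustration on hypothetical data (not the points in Display 3.127); it assumes the usual definition of r as the sum of the products of standardized values divided by n − 1, with standard deviations computed using the n − 1 divisor. The helper name `z_scores` is introduced here.

```python
from statistics import mean, stdev

# Hypothetical data (not from Display 3.127).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 3, 1, 5]


def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx, zy = z_scores(xs), z_scores(ys)

# One product per point: positive in the upper-right and lower-left
# quadrants, negative in the other two, and zero on either mean line.
products = [a * b for a, b in zip(zx, zy)]
r = sum(products) / (len(xs) - 1)

for x, y, p in zip(xs, ys, products):
    print(f"({x}, {y}): zx*zy = {p:+.2f}")
print(f"r = {r:.2f}")
```

In this made-up set, the point (3, 3) sits at the point of averages, so its product is zero and it contributes nothing to r.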
E77. Rank these summaries for three sets
of bivariate data by the strength of the
relationship, from weakest to strongest.
E78. There's an extremely strong relationship
between the price of books online and the
price at your local bookstore.
a. Does this mean the prices are almost the
same?
b. Explain why it is wrong to say that the
prices online cause the prices at your
local bookstore. Why is the relationship
so strong if neither set of prices causes the
other?
E79. Describe a set of cases and two variables for
which you would expect to see regression
toward the mean.
E80. Life spans. In Chapter 2, you looked at
the characteristics of mammals (given in
Display 2.24 on page 43) one at a time. Now
you can look at the relationship between
two variables. For example, is longevity
associated with gestation period? The
variables are average longevity in years,
maximum longevity in years, gestation
period in days, and speed in miles per hour.
a. Construct a scatterplot of gestation
period versus maximum longevity.
Describe what you see, including an
estimate of the correlation.
b. Repeat part a, with average longevity
in place of maximum longevity. Does
the average longevity or the maximum
longevity give a better prediction of the
gestation period?
c. Does speed appear to be associated with
average longevity?
E81. Spending for police. The data in Display
3.128 give the number of police officers, the
total expenditures for police officers, the
population, and the violent crime rate for a
sample of states in 2000.
Display 3.128 Number of police officers and related variables. [Source: U.S. Census Bureau,
Statistical Abstract of the United States, 2004–2005.]
a. Explore and summarize the relationship
between the number of police officers and
total expenditures for police.
b. Explore and summarize the relationship
between the population of the states
and the number of police officers they
employ.
c. Is the number of police officers
strongly related to the rate of violent
crime in these states? Explain. Find
a transformation that straightens
these data. Check the linearity of your
transformed data with a residual plot.
E82. House prices. Display 3.129 gives the selling
prices for all houses sold in a Florida
community in one month.
a. Construct a model to predict the selling
price from the area, transforming any
variables, if necessary. Would you use
the same model for both new and used
houses?
b. Are there any influential observations
that have a serious effect on the model? If
so, what would happen to the slope of the
prediction equation and the correlation if
you removed this (or these) point(s) from
the analysis?
c. Predict the selling price of an old house
measuring 1000 sq ft. Do the same for an
old house measuring 2000 sq ft. Which
prediction do you feel more confident
about? Explain.
d. Explain the effect of the number of
bathrooms on the selling price of the
houses. Is it appropriate to fit a regression
model to price as a function of the
number of bathrooms and interpret
the results in the usual way? Why or
why not?
Display 3.129
Selling prices of houses in Gainesville, Florida. [Source: Gainesville Board of
Realtors, 1995.]
E83. Spending for schools. Display 3.130 provides
data on spending and other variables related
to public school education for 2001. The
variables are defined as
ExpPP: expenditure per pupil (in dollars)
ExpPC: expenditure per capita (per person in the state, in dollars)
TeaSal: average teacher salary (in thousands of dollars)
%Dropout: percentage who drop out of school
Enroll: number of students enrolled (in thousands)
Teachers: number of teachers (in thousands)
a. Examine the association between per-pupil
expenditure and average teacher salary, with
the goal of predicting per-pupil expenditure.
Is this a cause-and-effect relationship?
b. Analyze the effect of average teacher
salary on per-capita expenditure
(spending on public schools divided
by the number of people in the state).
Compare the association to the
association in part a. Are the relative
sizes of the correlations about what you
would expect?
c. Are any variables good predictors of the
percentage of dropouts? Explain your
reasoning.
Display 3.130 Public school education by state in 2001. [Source: U.S. Census Bureau, Statistical
Abstract of the United States, 2004–2005.]
AP1. This scatterplot shows the age in years
of the oldest and youngest child in 116
households with two or more children age
18 or younger living with their parents.
(Some points have been moved slightly to
show that multiple households are at each
coordinate.) Which of the following is not a
reasonable interpretation of this scatterplot?
There aren't any points in the upper-left
region because older children tend
to move out to go to college or to get
married.
The older the oldest child in a household,
the older the youngest child tends to be.
There are no households represented
here in which the only children are twins.
The variability in the age of the youngest
child in these households tends to
increase with the age of the oldest child.
Few households have a range of more
than 12 years in the ages of all of the
children in the household.
AP2. In a study of 190 nations, the least squares
line for the relationship between birthrate
(per thousand per year) and female literacy
rate (in percent) is birthrate = −0.38 ·
literacy + 53.5, with r = −0.8. Uganda has a
birthrate of 47 and a female literacy rate of
60. What is the residual for Uganda?
17.1
64.1
29.3
69.8
16.3
AP3. In a linear regression of the heights of a
group of trees versus their circumferences,
the pattern of residuals is U-shaped. Which
of the following must be true?
I. A nonlinear regression would be a
better model.
II. For trees near the middle of the range
of tree circumferences studied, the
predicted tree height tends to be too tall.
III. r will be close to 0.
II only
III only
I and II
I and III
I, II, and III.
AP4. A recent study models the relationship
between the number of teachers at a high
school and the number of sick days these
teachers take in a year. This scatterplot
shows data for all high schools in a county
during one year.
Which of the following is not a reasonable
way to proceed with the analysis?
Remove the four outliers permanently
from the data set.
Run the regression again without the
four outliers to see how much the slope
and correlation change.
Verify the values for the four outliers to
make sure they are correct.
Try to find a transformation that makes
the cloud of points more elliptical.
Make a residual plot in order to judge
the linearity of the points.
AP5. Upon checking out of a large hospital,
2000 patients rated their satisfaction
with their stay on a scale of 0–10, with
10 indicating complete satisfaction. The
relationship between the satisfaction rating
and the patient's length of stay (in days) was
analyzed using linear regression. Here is part
of the computer printout for this regression.
Which is a correct interpretation of this
regression?
Patients who stay longer in the hospital
tend to be more satisfied than patients
with shorter stays, although this
relationship is weak.
The correlation is 0.228. However,
the sign on the correlation cannot be
determined.
The value of $R^2$ indicates that the
relationship between satisfaction rating
and length of stay is weak but linear.
The y-intercept of the regression line
indicates that no patients rated their
satisfaction less than 4.
The slope of the regression line indicates
that as a patient stays longer in the
hospital, his or her satisfaction tends
to increase day by day.
AP6. The least squares equation to estimate
the population of the fictional country of
Barbaria is $\log_{10}(\text{population}) = 0.01t + 7$,
where t is the number of years since 1950.
Which of the following is closest to the
predicted population of Barbaria for the
year 2000?
7.5
27
7,500,000
31,600,000
1,000,000,000,000,000,000,000,000,000
AP7. The Barbarian Aptitude Test (BAT) gives
each Barbarian two scores, one for pillaging
and one for burning. The scores range from
a low of 0 to a high of 50. The least squares
equation for a large group of Barbarians who
took the BAT is burning = 0.3 pillaging + 19,
with r = 0.6. Which is the best interpretation
of the slope of this line?
A Barbarian who studies harder and
improves her pillaging score by 1 point
on the next BAT will tend to increase her
burning score by about 0.3 point as well.
Barbarians tend to score about 0.3 point
higher on burning than on pillaging.
Barbarians score about 30% as many
points on burning as on pillaging.
The burning score is highly correlated
with the pillaging score.
A Barbarian who earned 1 more point
on pillaging than another Barbarian
tended to earn only 0.3 point more on
burning.
AP8. A least squares regression analysis using
a rating of each Barbarian's personal
cleanliness as the explanatory variable and
the number of raids he or she has carried
out as the response variable found a positive
relationship with R² = 0.81. Which is not a
correct interpretation of this information?
The correlation between personal
cleanliness and the number of raids is 0.9.
There is a strong relationship between
personal cleanliness and number of raids
among Barbarians.
A Barbarian who is more personally
clean than another also tends to have
made more raids.
There is an 81% chance that the
relationship between personal
cleanliness and number of raids is linear.
The variation in the residuals for the
number of raids among Barbarians
is about 19% of the variation in the
original responses.
Investigative Tasks
AP9. Siri's equation. Athletes and exercise
scientists sometimes use the proportion
of fat to overall body mass as one measure
of fitness, but measuring the percentage
of body fat directly poses a challenge.
Fortunately, some good statistical detective
work by W. E. Siri in the 1950s provided
an alternative to direct measurement that
is still in use today. Siri's method lets you
estimate the percentage of body fat from
body density, which you can measure
directly by hydrostatic (underwater)
weighing. In this exercise, you'll see how
transformations and residual plots play a
crucial role in finding Siri's model.
a. A first model. Use the data in Display
3.131.
Display 3.131 The percentage of body fat and body
density of 15 women. [Source: M. L. Pollock,
University of Florida, 1956.]
i. Plot percentage of body fat versus
body density, and from your plot
explain whether you think a line gives
a poor fit, a moderately good fit, or an
extremely good fit.
ii. Write the equation of the least squares
line.
iii. Does the value of r² tend to confirm
your opinion about how well the line
fits?
iv. Construct a residual plot and describe
the pattern. Does the plot tend to
confirm or raise questions about your
opinion? Explain.
b. A new model.
i. Explain how knowing that fat is less
dense than the rest of the body might
have led Siri to plot the percentage of
body fat against the reciprocal
of density, 1/density. Construct this
plot and fit a least squares line, and
compare its equation to the one Siri
found:
% body fat = -450 + 495(1/density)
Next, plot residuals versus 1/density.
What features of this plot confirm that
the transformation has improved the
linear fit?
ii. Find the correlation between the
percentage of body fat and body
density and the correlation between
the percentage of body fat and the
reciprocal of body density. Comment
on using correlation as the only
criterion for assessing the usefulness
of a model.
iii. Suggest another model for the
percentage of body fat and body
density data that might work nearly
as well as Siri's.
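As a rough illustration of how Siri's equation is used, the sketch below evaluates % body fat = -450 + 495(1/density) for a few densities. The function name, the units (g/cm³), and the density values are assumptions made for this example; they are not the data from Display 3.131.

```python
def siri_percent_body_fat(density):
    """Siri's equation: % body fat = -450 + 495 * (1 / density).

    `density` is body density, assumed here to be in g/cm^3.
    """
    return -450 + 495 * (1 / density)


# Hypothetical densities chosen only to show the shape of the model.
for density in (1.02, 1.04, 1.06, 1.08):
    fat = siri_percent_body_fat(density)
    print(f"density = {density:.2f}  ->  % body fat = {fat:.1f}")
```

Because the model is linear in 1/density rather than in density, equal steps in density do not give exactly equal steps in predicted percentage of body fat.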
AP10. In AP9, you explored the relationship
between the percentage of body mass that
is fat and body density. Display 3.132 is
an extension of the data in AP9, including
skinfold measurements and data for men.
a. Does Siri's model for relating percentage
of body fat to body density hold for men
as well as it did for women? That model
was
% body fat = -450 + 495(1/density)
b. The variable skinfold is the sum of a
number of skinfold thicknesses taken
at various places on the body. ( The
units are millimeters.) The skinfold
measurements are used to predict
body density. Find a good model
for predicting density from skinfold
measurements based on these data
for women. Do your models require
re-expression?
Display 3.132 Percentage of body fat, body density,
and skinfold for 15 women and
14 men. [Source: M. L. Pollock, University
of Florida, 1956.]
Sample Surveys
and Experiments
What prompts a
hamster to prepare
for hibernation? A
student designed an
experiment to see
whether the number
of hours of light in
a day affects the
concentration of a key
brain enzyme.
Most of what you've done in Chapters 2 and 3, as well as in the first part of
Chapter 1, is part of data exploration: ways to uncover, display, and describe
patterns in data. Methods of exploration can help you look for patterns in just
about any set of data, but they can't take you beyond the data in hand. With
exploration, what you see is all you get. Often, that's not enough.
Pollster: I asked a hundred likely voters who they planned to vote for, and
fifty-two of them said they'd vote for you.
Politician: Does that mean I'll win the election?
Pollster: Sorry, I can't tell you. My stat course hasn't gotten to inference yet.
Politician: What's inference?
Pollster: Drawing conclusions based on your data. I can tell you about the
hundred people I actually talked to, but I don't yet know how to
use that information to tell you about all the likely voters.
Methods of inference can take you beyond the data you actually have, but
only if your numbers come from the right kind of process. If you want to use
100 likely voters to tell you about all likely voters, how you choose those 100
voters is crucial. The quality of your inference depends on the quality of your
data; in other words, bad data lead to bad conclusions. This chapter tells you
how to gather data through surveys and experiments in ways that make sound
conclusions possible.
Here's a simple example.1 When you taste a spoonful of chicken soup and
decide it doesn't taste salty enough, that's exploratory analysis: You've found a
pattern in your one spoonful of soup. If you generalize and conclude that the
whole pot of soup needs salt, that's an inference. To know whether your inference
is valid, you have to know how your one spoonful (the data) was taken from
the pot. If a lot of salt is sitting on the bottom, soup from the surface won't be
representative, and you'll end up with an incorrect inference. If you stir the soup
thoroughly before you taste, your spoonful of data will more likely represent the
whole pot. Sound methods for producing data are the statistician's way of making
sure the soup gets stirred so that a single spoonful (the sample) can tell you
about the whole pot. Instead of using a spoon, the statistician relies on a chance
device to do the stirring and on probability theory to make the inference.
Soup tasting illustrates one kind of question you can answer using statistical
methods: Can I generalize from a small sample (the spoonful) to a larger
population (the whole pot of soup)? To use a sample for inference about a
population, you must randomize, that is, use chance to determine who or what
gets into your sample.
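One way to let chance decide who gets into a sample is sketched below in Python. The list of voter ID numbers, the function name, and the seed are all hypothetical; the only point is that the computer, not the pollster, picks the 100 voters.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Return n distinct units chosen by chance from the list `frame`."""
    rng = random.Random(seed)
    return rng.sample(frame, n)


# Hypothetical list of 1000 likely voters, identified by ID number.
likely_voters = list(range(1, 1001))

sample = simple_random_sample(likely_voters, 100, seed=1)
print(sorted(sample)[:10])  # the first few selected IDs
```

Fixing the seed only makes the example repeatable; in a real survey you would let the seed vary so that the selection is not known in advance.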
1 The inspiration for this metaphor came from Gudmund Iversen, who teaches statistics at Swarthmore College.
The other kind of question is about comparison and cause. For example,
if people eat chicken soup when they get a cold, will this cause the cold to go
away more quickly? When designing an experiment to determine if a pattern in
the data is due to cause and effect, you also must randomize. That is, you must
use chance to determine which subjects get which treatments. To answer the
question about chicken soup, you would use chance to decide which of your
subjects eat chicken soup and which don't, and then compare the duration of
their colds.
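Random assignment of treatments can be sketched in the same spirit. In this minimal example the subject labels, the group names, and the even split into two groups are assumptions made for illustration.

```python
import random


def randomly_assign(subjects, seed=None):
    """Shuffle the subjects by chance and split them into two treatment groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy, so the original order is left alone
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical subjects who have just caught a cold.
subjects = [f"subject_{i}" for i in range(1, 21)]

soup_group, no_soup_group = randomly_assign(subjects, seed=4)
print("chicken soup:", soup_group)
print("no soup:     ", no_soup_group)
```

Every subject lands in exactly one group, and neither the experimenter nor the subject chooses which.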
The first part of this chapter is about designing surveys. A well-designed
survey enables you to make inferences about a population by looking at a sample
from that population. The second part of the chapter introduces experiments. An
experiment enables you to determine cause by comparing the effects of two or
more treatments.
In this chapter, you will learn
reasons for using samples when conducting a survey
how to design a survey by randomly selecting participants
how surveys can go wrong (bias)
how to design a sound experiment by randomly assigning treatments to
subjects
how experiments can determine cause
how experiments can go wrong (confounding)
4.1 Why Take Samples, and How Not To
Why samples are necessary
Sampling saves money and time.
Testing every unit can be destructive.
Samples can give more information than can a cursory study of all individuals.
Taking a sample survey can help you determine the percentage of people in a
population who have a particular characteristic. For example, the Gallup poll
periodically asks adults in the United States questions such as whether they
approve of the job the president is doing. Polls or surveys such as this rely on
samples to get their percentages; that is, they don't ask every adult in the United
States but instead ask only a sample of about 1500. Similarly, quality assurance
methods in a manufacturing plant do not call for checking the quality of every
item coming off the production line; rather, they recommend that a limited
number of items (a sample) be checked carefully for quality.
Cost in money or time is a primary reason to use samples. Imagine: It's
Sunday night at 8:00, and Nielsen Media Research is gathering data about what
proportion of TV sets are tuned to a particular program and what kinds of
people are watching that program. To find out how many TV sets are tuned to
the program, electronic meters have been hooked up to televisions in a sample
of households. To find out who is watching it, a sample of people are filling out
diaries. Why doesn't Nielsen Media Research include everyone in the United
States in these surveys? To hook up a meter to every TV set would cost more than
anyone would be willing to pay for the information. Also, to try to get a diary
from every TV viewer about what they were watching at 8:00 p.m. on Sunday
would take so much time that the information would no longer be very useful.
So for two reasons, money and time, Nielsen ratings are based on samples.
Sampling lets a cook know how the soup will taste without eating it all just to
make sure. A light bulb manufacturer can't test the life of every bulb produced,
or there would be none left to sell. Whenever testing destroys the things you test,
your only choice is to work with a sample.
If time and money are limited (and they always are), there's a tradeoff
between the number of people in your sample and the amount and quality of
information you can expect to get from each person. Using a sample allows you
to spend more time and money gathering high-quality information from each
individual. This often produces greater accuracy in the results than you could get
from a quick, but error-prone, study of every individual.
Census Versus Sample
In statistics, the set of people or things that you want to know about is called
the population. The individual elements of the population sometimes are called
units. In everyday language, population often refers to the number of units in
the set, as when you say "In 1990, the population of Massachusetts was about
6 million." In statistics, population refers to the set itself (for example, the people
of Massachusetts). The number of units is called the population size. Ordinarily,
you don't get to record data on all the units in the population, so you use a sample.
The sample is the set of units you do get to study. The special case where you
collect data on the entire population is a census.
Nielsen Media Research takes a survey so they can get an estimate of the
proportion of all U.S. households that are tuned to a particular television program.
S is for statistic and
sample: A statistic is a
numerical summary
of the sample. P is
for parameter and
population: A parameter
is a numerical summary
of the population.
The true proportion that Nielsen would get from a survey of every household is
called a population parameter. Nielsen uses the proportion in the sample as an
estimate of this parameter. Such an estimate from a sample is called a statistic.
Not all statistics are created equal. Some aren't very good estimators of the
population parameter. For example, the maximum in the sample isn't a very good
estimator of the maximum in the population; it is almost always too small. You
will learn more about the properties of estimators in Chapter 7.
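A small simulation can make the point about the sample maximum concrete. The population below (10,000 values spread evenly between 0 and 100), the sample size of 25, and the number of repetitions are all made up for this sketch.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 10,000 values between 0 and 100.
population = [rng.uniform(0, 100) for _ in range(10_000)]
pop_max = max(population)
pop_mean = statistics.mean(population)

sample_maxes, sample_means = [], []
for _ in range(2000):
    sample = rng.sample(population, 25)      # a random sample of 25 units
    sample_maxes.append(max(sample))
    sample_means.append(statistics.mean(sample))

avg_sample_max = statistics.mean(sample_maxes)
avg_sample_mean = statistics.mean(sample_means)

print(f"population maximum: {pop_max:.1f}, average sample maximum: {avg_sample_max:.1f}")
print(f"population mean:    {pop_mean:.1f}, average sample mean:    {avg_sample_mean:.1f}")
```

No sample maximum can exceed the population maximum, so the sample maximums average out below it, whereas the sample means scatter on both sides of the population mean.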
Census Versus Sample
D1. In which of these situations do you think a census is used to collect data, and
in which do you think sampling is used? Explain your reasoning.
a. An automobile manufacturer inspects its new models.
b. A cookie producer checks the number of chocolate chips per cookie.
c. The U.S. president is determined by an election.
d. Weekly movie attendance figures are released each Sunday.
e. A Los Angeles study does in-depth interviews with teachers in order
to find connections between nutrition and health.
Bias: A Potential Problem with Survey Data
Samples offer many advantages, but some samples are more trustworthy than
others. In this section, you will learn about two ways to get untrustworthy results:
bias in the way you select your sample
bias in the way you get a response from the units in your sample
In Activity 4.1a, you'll examine the kind of problem that can lie in wait for
the unwary sampler. Suppose you have just won a contract to estimate the average
length of stay in the children's ward of a hospital. How will you gather the data?
Time in the Hospital
What you'll need: one deck of cards for you and your partner
In this activity, you'll estimate the average length of stay in a five-bed hospital
ward. You'll sample from the population of lengths of stay, represented by the
numbers on the 40 cards (not counting face cards) in an ordinary deck of
cards. You'll estimate the average length of stay from a sample of five patients.
1. Shuffle your 40 cards several times. Deal out a row of five cards to represent
your first patients. The numbers on the cards tell you how many days they
will be on your ward. For example, suppose your first five cards are
These cards mean the patient in bed 1 will be in the hospital for 1 day, the
patient in bed 2 will be there for 2 days, the patient in bed 3 will be there
for 8 days, and so forth. Your partner should record this information on a
chart like the one in Display 4.1. Place the cards that represent patients in a
stack separate from unused cards.
Display 4.1 Lengths of stay for the first five patients.
2. Deal out the other cards one at a time, assigning the next patient to a bed
as it becomes available. To continue with the example, suppose the next
patient is a 9. The first available bed is bed 1, and the patient will be in it
for 9 days. The next available bed is bed 2, and the next patient, say, another 9, will
be there for 9 days. Display 4.2 shows a chart of the lengths of stays at this
stage. The next patient will go into bed 5.
Display 4.2 Lengths of stay for the first seven patients.
3. Continue dealing out patient cards, each representing a hospital stay. Record
these stays until all five beds have been filled for at least 20 days. You should
end up with something like Display 4.3. (Save your chart; you'll need it later.)
Display 4.3 Lengths of stay for the patients during the first 20 days.
4. Select a day at random from days 1 through 20 (or however many days all
of your beds are full).
5. Compute the average length of stay for the five patients in the beds on
that day.
6. Pool your results with those of the rest of your class until you have
about 30 estimates of the average length of a stay. Make a dot plot of the
30 averages from your class. Where is this distribution centered? Is there
much variability in your estimates?
7. Compute the average stay for the whole population (the original deck of all
40 cards).
8. Compare your results in step 6 and step 7. On average, are your estimates
generally too low, too high, or about right? Why is this the case?
9. What are the units in this situation? Did every unit have an equal chance
of being in the sample? If so, explain. If not, which units had the greater
chance?
10. How could you improve the sampling method?
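If you would like to repeat the activity many more times than a class can by hand, steps 1 through 7 can be sketched as a simulation. This is only one possible reading of the rules: it assumes aces count as 1, that ties between empty beds go to the lower-numbered bed, and that the random day is chosen from days 1 through 20. The function name and the seed are hypothetical.

```python
import random
import statistics


def simulate_ward(rng, n_beds=5, n_days=20):
    """Run the activity once; return the average stay of patients in beds on a random day."""
    deck = [value for value in range(1, 11) for _ in range(4)]  # 40 cards, ace = 1
    rng.shuffle(deck)

    beds = [[] for _ in range(n_beds)]   # each bed holds (first_day, last_day, stay)
    next_free = [1] * n_beds             # first day each bed is available

    # Steps 1-3: deal patients into the earliest available bed until
    # every bed has been filled through day n_days.
    while min(next_free) <= n_days:
        bed = next_free.index(min(next_free))
        stay = deck.pop()
        beds[bed].append((next_free[bed], next_free[bed] + stay - 1, stay))
        next_free[bed] += stay

    # Step 4: select a day at random.
    day = rng.randint(1, n_days)

    # Step 5: average the lengths of stay of the patients in the beds that day.
    stays_on_day = [
        stay
        for schedule in beds
        for first, last, stay in schedule
        if first <= day <= last
    ]
    return statistics.mean(stays_on_day)


rng = random.Random(2008)
estimates = [simulate_ward(rng) for _ in range(3000)]   # step 6, many times over
population_mean = statistics.mean(range(1, 11))         # step 7: the whole deck

print(f"population mean stay:  {population_mean}")
print(f"average of estimates:  {statistics.mean(estimates):.2f}")
```

Comparing the two printed values is the computer's version of step 8.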
In everyday language, we say an opinion is biased if it unreasonably favors
one point of view over others. A biased opinion is not balanced, not objective.
In statistics, bias has a similar meaning in that a biased sampling method is
unbalanced.
A sampling method is biased if it produces samples such that the estimate
from the sample is larger or smaller, on average, than the population parameter
being estimated.
Sampling bias lies in
the method, not in the
sample.
There's an important distinction here between the sample itself and the
method used for choosing the sample.
Investigator: What makes a good sample?
Statistician: A good sample is representative. That is, it looks like a small
version of the population. Proportions you compute from the
sample are close to the corresponding proportions you would
get if you used the whole population. The same is true for other
numerical summaries, such as averages and standard deviations
or medians and IQRs.
Investigator: How can I tell if my sample is representative?
Statistician: There's the rub. In practice, you can't. You can tell only by
comparing your sample with the population, and if you know
that much about the population, why bother to take a sample?
Investigator: Great! First you tell me my sample should be representative, and
then you tell me there's no way to know whether it is. Is that the
best statisticians can do?
Statistician: Nope. Although you can't tell about any particular sample, it is
possible to tell whether a sampling method is good or not. That's
where bias comes in.
Investigator: I thought "biased" was just a fancy word for "nonrepresentative."
Not true?
Statistician: Now we're getting to the point. Bias refers to the method, not the
samples you get from it. A sampling method is biased if it tends to
give nonrepresentative samples.
Investigator: Now I get it. I may not be able to tell whether my sample is
representative, but if I use an unbiased method, then I can be
confident that my sample is likely to be representative. Right?
Statistician: Now you're thinking like a statistician. There's more detail to come,
but you have the big picture in focus.
Bias
D2. Explain the difference between nonrepresentative and biased as these
terms pertain to sampling.
D3. Which statements describe an event that is possible? Which describe an
event that is impossible?
A. A representative sample results from a biased sample-selection method.
B. A nonrepresentative sample results from a biased sample-selection
method.
C. A representative sample results from an unbiased sample-selection
method.
D. A nonrepresentative sample results from an unbiased sample-selection
method.
Sample Selection Bias
Size bias is one kind of
sample selection bias.
Sample selection bias, or sampling bias, is present in a sampling method if
samples tend to result in estimates of population parameters that systematically
are too high or too low. Various forms of this selection bias can undermine the
usefulness of samples and surveys.
You explored one kind of sampling bias in Activity 4.1a, in which patients
who spent more days in the hospital were more likely to be selected for the
sample. In fact, the chance of selection is proportional to the length of stay. A
5-day stay is five times as likely to be chosen as a 1-day stay. This type of sample
selection bias is called size bias. Suppose a wildlife biologist samples lakes in a
state by dropping grains of rice at random onto a map of the state and then selects
for study the lakes that have rice on them. This is another example of size bias.
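To see how much size bias could matter in the hospital activity, here is a short calculation. It assumes, as the paragraph above states, that the chance of selecting a stay is exactly proportional to its length, and that stays of 1 through 10 days are equally common (as in the deck of cards). The variable names are made up for this sketch.

```python
from fractions import Fraction

stays = range(1, 11)   # lengths of stay, each equally common in the deck

# Ordinary (unbiased) average length of stay.
population_mean = Fraction(sum(stays), len(stays))

# With selection proportional to length, a stay of x days gets weight x,
# so the expected sampled stay is sum(x * x) / sum(x).
size_biased_mean = Fraction(sum(x * x for x in stays), sum(stays))

print(float(population_mean))    # 5.5
print(float(size_biased_mean))   # 7.0
```

Under these assumptions, the biased method would aim at 7 days rather than the true average of 5.5 days.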
Voluntary response
bias is another kind of
sampling bias.
Convenience sampling is
almost surely biased.
Judgment sampling,
even when taken by
experts, is usually biased.
The quality of your
sample depends on
having a good sampling
frame.
When a television or radio program asks people to call in and take sides on
some issue, those who care about the issue will be overrepresented, and those
who don't care as much might not be represented at all. The resulting bias from
such a volunteer sample is called voluntary response bias and is a second type of
sample selection bias.
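The effect of a call-in poll can be imitated with a simulation. Everything about the population below is invented for illustration: 60% favor a proposal, and each opponent is assumed to be ten times as likely to call in as each supporter.

```python
import random

rng = random.Random(7)

# Hypothetical population: True means the person favors the proposal.
population = [True] * 6000 + [False] * 4000

# Assumed chances of calling in; opponents care more about the issue.
call_in_chance = {True: 0.02, False: 0.20}

callers = [favors for favors in population if rng.random() < call_in_chance[favors]]

# For comparison, a random sample of the same size as the call-in sample.
random_sample = rng.sample(population, len(callers))

true_share = sum(population) / len(population)
caller_share = sum(callers) / len(callers)
random_share = sum(random_sample) / len(random_sample)

print(f"true share in favor:          {true_share:.2f}")
print(f"share among call-in sample:   {caller_share:.2f}")
print(f"share among random sample:    {random_share:.2f}")
```

The call-in sample is not small, yet it badly understates support, because who responds depends on how strongly they feel.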
Here's a simple sampling method: Take whatever's handy. For example, what
percentage of the students in your graduating class plan to go to work immediately
after graduation? Rather than find a representative sample of your graduating class,
it would be a lot quicker to ask your friends and use them as your sample: quicker
and more convenient, but almost surely biased because your friends are likely to
have somewhat similar plans. A convenience sample is one in which the units
chosen are those that are easy to include. The likelihood of bias makes convenience
samples about as worthless as voluntary response samples.
Because voluntary response sampling and convenience sampling tend to
be biased methods, you might be inclined to rely on the judgment of an expert
to choose a sample that he or she considers representative. Such samples, not
surprisingly, are called judgment samples. Unfortunately, though, experts might
overlook important features of a population. In addition, trying to balance
several features at once can be almost impossibly complicated. In the early days
of election polling, local experts were hired to sample voters in their locale by
filling certain quotas (so many men, so many women, so many voters over the
age of 40, so many employed workers, and so on). The poll takers used their own
judgment as to whom they selected for the poll. It took a very close election (the
1948 presidential election, in which polls were wrong in their prediction) for the
polling organizations to realize that quota sampling was a biased method.
An unbiased sampling method requires that all units in the population have
a known chance of being chosen, so you must prepare a list of population
units, called a sampling frame or, more simply, frame, from which you select
the sample. If you think about enough real examples, you'll come to see that
making this list is not something you can take for granted. For the Westvaco
employees in Chapter 1 or for the 50 U.S. states, creating the list is not hard, but
other populations can pose problems. How would you list all the people using the
Internet worldwide or all the ants in Central Park or all the potato chips produced
in the United States over a year? For all practical purposes, you can't. There will
often be a difference between the population (the set of units you want to know
about) and the sampling frame (the list of units you use to create your sample).
A sample might represent the units in the frame quite well, but how well your
sample represents the population depends on how well you've chosen your frame.
Quite often, a convenient frame fails to cover the population of interest (using a
telephone directory to sample residents of a neighborhood, for example), and a
bias is introduced by this incomplete coverage. If you start from a bad frame, even
the best sampling methods can't save you: Bad frame, bad sample.
Sample Selection Bias
D4. Identify the type of sampling method used in each of these surveys. Would
you expect the estimate of the parameter to be too high or too low?
a. You use your statistics class to estimate the percentage of students in your
school who study at least 2 hours a night.
b. You send a survey to all people who have graduated from your school in
the past 10 years. You use the mean annual income of those who reply
to estimate the mean annual income of all graduates of your school in
the past 10 years.
c. A study was designed to estimate how long people live after being
diagnosed with dementia. The researchers took a random sample of the
people with dementia who were alive on a given day. The date the person
had been diagnosed was recorded, and after the person died the date of
death was recorded.
D5. You want to know the percentage of voters who favor state funding for
bilingual education. Your population of interest is the set of people likely to
vote in the next election. You use as your frame the phone book listing of
residential telephone numbers. How well do you think the frame represents
the population? Are there important groups of individuals who belong to
the population but not to the frame? To the frame but not to the population?
If you think bias is likely, identify what kind of bias and how it might arise.
Response Bias
Bias doesnt always
come from the sampling
method.
Nonresponse bias can
occur when people do
not respond to surveys.
In all the examples so far, bias has come from the method of taking the sample.
Unfortunately, bias from other sources can contaminate data even from well-chosen sampling units.
Perhaps the worst case of faulty data is no data at all. It isn't uncommon
for 40% of the people contacted to refuse to answer a survey. These people
might be different from those who agree to participate. An example of this
nonresponse bias came from a controversial study that found that left-handers
died, on average, about 9 years earlier than right-handers. The investigators sent
questionnaires to the families of everyone listed on the death certificates in two
counties near Los Angeles asking about the handedness of the person who had
died. One critic noted that only half the questionnaires were returned. Did that
change the results? Perhaps. [Source: "Left-Handers Die Younger, Study Finds,"
Los Angeles Times, April 4, 1991.]
Questionnaire bias
Nonresponse bias, like bias that comes from the sampling method, arises
from who replies. Questionnaire bias arises from how you ask the questions.
The opinions people give can depend on the tone of voice of the interviewer, the
appearance of the interviewer, the order in which the questions are asked, and
many other factors. But the most important source of questionnaire bias is the
wording of the questions. This is so important that those who report the results of
surveys should always provide the exact wording of the questions.
For example, Reader's Digest commissioned a poll to determine how the
wording of questions affected people's opinions. The same 1031 people were asked
to respond to these two statements:
1. I would be disappointed if Congress cut its funding for public television.
2. Cuts in funding for public television are justified as part of an overall effort to
reduce federal spending.
Note that agreeing with the first statement is pretty much the same as disagreeing
with the second. However, 54% agreed with the first statement, 40% disagreed, and
6% didn't know, while 52% agreed with the second statement, 37% disagreed, and
10% didn't know. [Source: Fred Barnes, "Can You Trust Those Polls?" Reader's Digest, July 1995,
pp. 49–54.]
Incorrect response
Another problem polls and surveys have is trying to ensure that people tell
the truth. Often, the people being interviewed want to be agreeable and tend to
respond in the way they think the interviewer wants them to respond. Newspaper
columnist Dave Barry reported that he was called by Arbitron, an organization
that compiles television ratings. Dave reports:
So I figured the least I could do, for television, was be an Arbitron household.
This involves two major responsibilities:
1. Keeping track of what you watch on TV.
2. Lying about it.
At least that's what I did. I imagine most people do. Because let's face it:
Just because you watch a certain show on television doesn't mean you want
to admit it. [Source: Dave Barry, Dave Barry Talks Back, copyright 1991 by Dave Barry.
Used by permission of Crown Publishers, a division of Random House, Inc.]
Bias from incorrect responses might be the result of intentional lying, but it
is more likely to come from inaccurate measuring devices, including inaccurate
memories of people being interviewed in self-reported data. Patients in medical
studies are prone to overstate how well they have followed the physician's orders,
just as many people are prone to understate the amount of time they actually
spend watching TV. Measuring the heights of students with a meterstick that
has one end worn off leads to a measurement bias, as does weighing people on a
bathroom scale that is adjusted to read on the light side.
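The difference between a measurement bias and ordinary random error can be sketched with a simulation. The heights, the 2 cm offset attributed to the worn meterstick, and the size of the random error are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical true heights of 500 students, in centimeters.
true_heights = [rng.gauss(165, 8) for _ in range(500)]

WORN_OFFSET_CM = 2.0   # assumed constant error caused by the worn end


def measure(height):
    """One reading: the true height, plus the constant offset, plus a small random error."""
    return height + WORN_OFFSET_CM + rng.gauss(0, 0.5)


measured = [measure(h) for h in true_heights]
errors = [m - t for m, t in zip(measured, true_heights)]
mean_error = statistics.mean(errors)

print(f"average measurement error: {mean_error:.2f} cm")
```

The random part of the error largely cancels when you average over many students, but the constant offset does not; taking more measurements with the same meterstick will not remove it.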
Response Bias
D6. Like Dave Barry, people generally want to appear knowledgeable and
agreeable, and they want to present a favorable face to the world. How might
that affect the results of a survey conducted by a school on the satisfaction of
its graduates with their education?
D7. Another part of the Reader's Digest poll described on page 226 asked
Americans if they agree with the statement that it is not the government's
job to financially support television programming. The poll also asked them
if they'd be disappointed if Congress cut its funding for public television.
Which question do you think brought out more support for public television?
D8. How is nonresponse bias different from voluntary response bias?
Summary 4.1: Why Take Samples, and How Not To
The population is the set of units you want to know about. The sample is the set of
units you choose to examine. A census is an examination of all units in the entire
population. Important reasons for using a sample in many situations rather than
taking a census include these:
Testing sometimes destroys the items.
Sampling can save money.
Sampling can save time.
Sampling can make it possible to collect more or better information on
each unit.
A sampling method is biased if it tends to give results that, on average, are too low
or too high. This can happen if the method of taking the sample or the method of
getting a response is flawed.
Sources of bias from the method of taking the sample include
using a method that gives larger units a bigger chance of being in the sample
(size bias)
letting people volunteer to be in the sample
using a sample just because it is convenient
selecting the sampling units based on expert judgment
constructing an inadequate sampling frame
Types of bias derived from the method of getting the response from the sample
include
nonresponse bias
questionnaire bias
incorrect response or measurement bias
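To see what "on average, too high" means for size bias, here is a minimal simulation sketch. The unit sizes are invented for illustration, and the helper names (average_of_sample_means, simple_random_sample, size_biased_sample) are hypothetical, not from the text. One method gives every unit the same chance of selection; the other gives each unit a chance proportional to its size.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical population: 2000 units with made-up sizes.
sizes = [rng.uniform(50, 1000) for _ in range(2000)]
true_mean = statistics.mean(sizes)

def simple_random_sample(n=50):
    """Every unit has the same chance of being chosen."""
    return rng.sample(sizes, n)

def size_biased_sample(n=50):
    """Chance of selection is proportional to size.
    Drawn with replacement, to keep the sketch simple."""
    return rng.choices(sizes, weights=sizes, k=n)

def average_of_sample_means(draw_sample, repetitions=500):
    """Repeat the sampling method many times and average the sample means."""
    means = [statistics.mean(draw_sample()) for _ in range(repetitions)]
    return statistics.mean(means)

srs_avg = average_of_sample_means(simple_random_sample)
biased_avg = average_of_sample_means(size_biased_sample)

print("population mean:             ", round(true_mean, 1))
print("average, simple random:      ", round(srs_avg, 1))
print("average, size-biased method: ", round(biased_avg, 1))
```

In this setup the simple random samples center near the population mean, while the size-biased samples center well above it. Individual samples vary under both methods; bias refers to where the results center over many repetitions.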
Practice
Census Versus Sample
P1. You want to estimate the average number of
TV sets per household in your community.
a. What is the population? What are the
units?
b. Explain the advantages of sampling over
conducting a census.
c. What problems do you see in carrying
out this sample survey?
Bias
P2. Four people practicing shooting a bow and
arrow made these patterns on their targets.
Display 4.4 Results of four archers.
a. Which person had shots that were biased
and had low variability?
b. Which person had shots that were biased
and had high variability?
c. Which person had shots that were
unbiased and had low variability?
d. Which person had shots that were
unbiased and had high variability?
e. Do you think it would be easiest to help
Al, Cal, or Dal improve?
Sample Selection Bias
P3. Describe the type of sample selection
bias that would result from each of these
sampling methods.
a. A county official wants to estimate the
average size of farms in a county in Iowa.
He repeatedly selects a latitude and
longitude in the county at random and
places the farms at those coordinates in
his sample. If something other than a
farm is at the coordinates, he generates
another set of coordinates.
b. In a study about whether valedictorians
succeed big in life, a professor traveled
across Illinois, attending high school
graduations and selecting 81 students to
participate. . . . He picked students from
the most diverse communities possible,
from little rural schools to rich suburban
schools near Chicago to city schools.
[Source: Michael Ryan, "Do Valedictorians Succeed Big in
Life?" PARADE, May 17, 1998, pp. 14-15.]
c. To estimate the percentage of students
who passed the most recent AP Statistics
Exam, a teacher on an Internet discussion
list for teachers of AP Statistics asks
teachers on the list to report to him how
many of their students took the test and
how many passed.
d. To estimate the average length of the
pieces of string in a bag, a student reaches
in, mixes up the strings, selects one,
mixes them up again, selects another,
and so on.
e. In 1984, Ann Landers conducted a poll
on the marital happiness of women by
asking women to write to her.
P4. Suppose the Museum of Fine Arts in Boston
wants to estimate what proportion of people
who come to Boston from out of town
planned their trip to Boston mainly to visit
the museum. The sample will consist of
all out-of-town visitors to the museum on
several randomly selected days. On buying
a ticket to the museum, people will be
asked whether they came from out of town
and, if so, what the main reason for their
trip was. Do you expect the museum's
estimate to be too high, too low, or just
about right? Why? What kind of sampling
method is this?
Response Bias
P5. Consider this pair of questions related to gun
control:
I. Should people who want to buy guns
have to pass a background check to make
sure they have not been convicted of a
violent crime?
II. Should the government interfere with an