Introduction and Descriptive Stats 1
Welcome Everyone!
Statistics is divided into two major types: Descriptive and Inferential. We will spend the first 3
weeks covering descriptive statistics, then about 2 weeks on some background material that will
help you understand inferential statistics, THEN the rest of the semester on inferential
techniques. It may seem as if descriptive statistics is not that important, but descriptive stats is
always used when we apply inferential techniques. Everything in this class builds, so you will
get lots of practice on all of it by the time we are done!
For each week, there is a Study Guide. It will have everything you need to know to complete
each week's material. It is designed to walk you through the text chapters. It contains additional
notes and explanations as well as suggested review exercises with solutions. It starts with
an Overview for the module, followed by objectives. Module Notes follow as a series of pages.
Suggested review exercises with solutions follow each section. To be successful, you should
complete these problems as you get to them before moving on. Just reading through the
material without working the exercises will not help you learn the material.
There are arrows to navigate forward and backward through the pages within each module. I
am new to Canvas and thinking through places where links back to pages other than just
previous and next would be helpful. Please don't hesitate to drop me a quick note if you come
across a spot where you would like to be able to jump back to somewhere else! I am happy to
add anything that would make navigation easier!
Here, each week, I will simply introduce the unit and give a brief summary of key activities. The
study guide will have the details.
Our Descriptive Statistics Unit will cover Chapters 1-4. This week we will cover Chapters
1 and 2.
Get started with "What to do in Week 1"! ("Next" will take you there!)
Please do email me anytime with any questions you have about anything!
(I'm revising everything with the new edition and new platform and making changes to improve
the course. While I go through all the course materials very carefully, I am sure I still miss some
things. So, please let me know if there is anything confusing, links that don't work (Sometimes,
they work fine for me, then not for you! Grrrr!) or that you think may need correction. Thanks!)
Chapter 1
Statistics is simply about changing data into useful information. Think about some data you
might use in your daily life. A checkbook register or some record of spending is something
most of us have. Well, if you simply look at a list of all the numbers for, say, a 6 month period,
how much can you see by looking just at the numbers? But, we can use statistics to take that
list of numbers (data) and get some useful information from it: average expense, maximum
expense, minimum expense, distribution of expense amounts (lots of big ones, lots of little ones,
or a bunch in the middle). When you just look at a list of hundreds of numbers, it is difficult to
really draw any conclusions from that data or use it to help make decisions. But the variety of
statistical techniques available to us helps us to generate information we can use.
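For instance, here is a quick sketch of what that might look like in Excel (the range B2:B200 is just a made-up example -- point the formulas at wherever your spending amounts actually live):

    =AVERAGE(B2:B200)   average expense over the period
    =MAX(B2:B200)       largest single expense
    =MIN(B2:B200)       smallest single expense

Getting a picture of how the amounts are distributed (lots of big ones, little ones, or a bunch in the middle) takes a frequency table or histogram, which is exactly where Chapters 2 and 3 are headed.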
There are two types of statistical analysis: descriptive statistics -- using tabular, graphical, or
numerical methods to summarize and describe data -- and inferential statistics -- methods that
help us to draw conclusions about a population from a sample drawn from that population. (I'll
explain these terms shortly!)
While the bulk of the course appears to be on inferential techniques (there are a LOT more of
them and they are harder to explain!), the descriptive techniques are key aspects of any
statistical study. Any time we have a data set that we want to analyze by inferential statistics,
we would first need to do some descriptive analysis. We need to understand our data before
we use it! There are many things we can learn from our interpretations of descriptive analysis.
The techniques we learn in the first 4 chapters will be used throughout the course. Our first
case will be a descriptive statistics case -- and descriptive techniques will be part of the
inferential cases as well.
Take a few minutes and read through the introductory section of Chapter 1 on pages 2-4
for some examples from later chapters in the text.
1.1 – Key Statistical Concepts
Now, some terminology we will use in inferential statistics. In inferential statistics, we
draw a sample from a population, calculate statistics (numerical measures) from that sample data,
and then draw conclusions about the population parameters (the corresponding numerical measures
for the population). So, a population is what we are interested in, what we want to study. It need
not be people or about people. It could be sales figures or stock prices. A parameter is a
descriptive measure (an average, for example) of a population. A sample is a subset of the
"members" of the population of interest, while a statistic is a descriptive measure that we
calculate from that sample.
As a basic example, suppose we want to know the average age of MBA students in business
programs in the United States. Rather than collect the ages of all MBA students
(the population), we draw a random sample from that population and calculate the average age
of the students in our sample -- this is a statistic. But, what we want to know is the average age
of the population (a parameter). We can use our sample average to "infer" something about the
population average. There are different ways to use the sample average. We can use it to
"estimate" the population parameter we want or we can use it to test some "hypothesis" or belief
we have about the population parameter.
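As a tiny made-up illustration: suppose our random sample contained just five MBA students, ages 24, 27, 29, 31, and 34. The sample average is (24 + 27 + 29 + 31 + 34) / 5 = 29 years -- that 29 is a statistic. The average age of ALL MBA students (the parameter) remains unknown; 29 is simply our sample-based estimate of it, or the value we would use to test a belief about it. In Excel, with those ages in (hypothetical) cells A2:A6:

    =AVERAGE(A2:A6)   sample mean = 29, our estimate of the unknown population mean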
One problem with all of this is that because we have only used a sample, we will not really know
what the population parameter is EXACTLY. So, when we do inference, we need measures of
reliability. If we are using our sample statistic for estimation, we use the confidence
level, which is the proportion of times we will be correct. When we use our sample statistic to
test a hypothesis, we use the significance level which is the proportion of times our conclusion
will be wrong. These terms will make a lot more sense after we get into statistical inference, so
do not worry too much about them for now. Just be clear that confidence level is used
for estimation and significance level is used for hypothesis testing. For some reason, later on,
students seem to want to state "confidence" when testing a hypothesis and this
is NOT statistically correct.
Now, you might skim through Section 1.1 for more clarification of these terms. But, you
will be seeing them all again and they will become more clear as we move on.
1.2 – Statistical Applications in Business
Statistics is used in all functions of organizations. Whether you are in marketing, accounting,
HR, operations, or finance, statistics is applicable and useful for helping make business
decisions. This text incorporates examples of applications in the different disciplines through
separate sections in the text and through application boxes throughout the text and individual
problems. I have tried to assign these when possible and will try to point out the discipline
represented by the problems we encounter within the chapter material.
1.3 – Large Real Data Sets
As you will see, many of the "small" example problems we do use real data from larger studies,
but the first time you see a "real" data set with ALL of the "variables" that were collected in that
study, it can seem overwhelming. So, with this edition, the authors have provided eight large
data sets. I was going to say "full" data sets, but they have reduced the number of variables
(only 60!! in the GSS data sets). I will include problems using these data sets in the suggested
problems so that you get more practice with real data sets.
1.4 – Statistics and the Computer
There are many statistical software packages out there. Many of these are easy to use once
you know some statistics, but very confusing and difficult to use when you don't. There are so
many different statistical techniques and different ways to organize data, that you not only need
a manual for the programs, you also better know what you need to look up! Fortunately, Excel,
which most folks already have, can pretty much do what we need to do without the
complications of using more complex software.
We will be using Excel to do most of the calculations for us (although there are some concepts I
will want you to do by hand first so that you can better understand what Excel is doing). There
are several tools we will use with Excel and you need to make sure that you have them
installed.
Data Analysis: This is a standard Excel data analysis Add-in, but you may not have it activated.
To check, click on your Data tab. If it is activated, you will have "Data Analysis" on the far right
as shown below:
If you do not see that, you will need to activate it. This is done from the Options button under the
File menu. Under File, click Options, then Add-ins. At the bottom of the right-hand window, in
the "Manage" dropdown, select Excel Add-ins if not already selected, then Click Go. In the box
that appears, select both Analysis ToolPak and Analysis ToolPak - VBA and click OK. Now, you
should have the Data Analysis as shown above.
I show this process at the end of the course intro video on the syllabus page! Let me know if you
have any questions!
Excel Workbooks: These are available from the companion website for the text, but the
specific ones we will use are available from the text resources section at the top of the modules
page. I will also include the data sets on that page.
You need to make sure you have the Data Analysis activated and the workbooks and the data
sets downloaded before you move on to Chapter 2. We will be doing some work in XLSTAT, so
you will want to get that working as well. The instructions are with your text. We will cover some
Excel basics as we get into descriptive stats, but for the most part, I expect that you are
comfortable with Excel Basics. You will be on your own for figuring out a lot of the Excel used in
this course. But, it is not intense, and I am happy to help you with some stuff outside of class.
If you are not reasonably competent with Excel basics, you should review right away. But
don't panic! Holler if you have any questions or issues with anything!
Do the suggested chapter exercises for Chapter 1 and review the solutions before
moving on!!
Chapter 2 Section 2.1
Chapter 2
Now we jump into descriptive stats -- starting with graphs and tables!
In learning statistics, at first, it is easy to think that the hardest part is the math (especially for
those students who feel they are not strong in math), but you will soon discover that one of the
most difficult parts is determining what technique to use. There are 2 important things to think
about:
1. What kind of data do we have?
2. What is it we are trying to do?
So, before we can even get started with descriptive techniques, we need to tackle these two
topics.
We first need to learn some important terminology and a very important concept – there are
different types of data and in order to decide what technique to use, we need to determine what
kind we have!
2.1 – Types of Data and Information
First, some terminology:
Variable: some characteristic of a population or sample, such as a company's sales in
dollars, a company's number of employees, or the cereal choice a person makes
Data: the observed values of the variables -- the facts & figures that are collected,
summarized, analyzed, and interpreted.
Elements: the items that variables are collected about -- e.g., companies or individuals
Observations: the set of all measurements for an element -- e.g., we may collect data
for several variables about each company or individual, like company size in sales dollars
and in number of employees. The observation for the element Company A includes
both variables for company size.
Types of data (often called "Scales of measurement")
While we often think of data as numbers, there are actually different types of data, and they
form a hierarchy based on the amount of information we "really" have to work with. The amount
of information we have affects how we can use the data.
Nominal data is the lowest level of data in the hierarchy -- it is just categories or
classifications (labels or names) and all we can do is count the number of observations in
each category. We can't do math on these! Sometimes, when nominal data is recorded, it
is "coded" by assigning a number to each category. For example, female = 1; male = 2. But,
even if they are numbers (coded). They are still categories. This data is often
called qualitative or categorical data.
Ordinal data is nominal (categorical) data, but the categories can be put into a
meaningful order (i.e., freshmen, sophomore, junior, senior). This data is a little more
powerful than nominal data and there are some different tests we can do on these types of
data besides just counting. There are computations that are based on a ranking process.
This is still qualitative data.
Interval data is numerical data with fixed intervals between values (any numbers,
really!!). Examples are SAT scores and temperature. An additional type of data is ratio data
(most numerical data is ratio data), which is interval data with a true zero point -- distance,
height, weight. This is quantitative data, and this is data on which we can actually do
MATH!!
Below is a flow chart giving a visual picture of these concepts. One thing to remember is that
you need to think through how a variable is defined to be sure whether it is quantitative or
qualitative data.
*Adapted from the authors' slides from an earlier edition of our text
Now that you know the different types of data, we can break descriptive statistics down a bit
further.
Before we move on to talk about the nominal data techniques, we need to talk about the second
question we need to answer in determining the correct technique to use. "What is it we are
trying to do?" This is really asking you to identify what your text calls the "Problem Objective".
For descriptive statistics, we will consider two different problem objectives: describe a single set
of data and describe the relationship between two variables (sets of data). The flow chart below
is from Appendix 4 at the end of chapter 4 in your text.
Charts like these can be very helpful when trying to determine the correct technique to use. I
will encourage their use and refer to them often! Many students do not have any trouble with
descriptive stats techniques, but when we get to inferential stats, thinking about problem
objective and type of data and using flow charts like this to guide you can be invaluable when
trying to figure out what you need to do.
Do the suggested exercises for Section 2.1 and review the solutions before moving on!
Chapter 2 Section 2.2
2.2 – Describing a set of Nominal Data
Remember that for nominal and ordinal data, we only have frequency counts for each category.
Given the counts we can calculate the proportion or percentage that each category represents.
With this information, we can develop tables: frequency distributions and relative
frequency distributions. Or, there are two types of graphs we can do: bar charts to emphasize
the frequency or pie charts to emphasize the proportions (relative frequency) in each category.
While the text shows how to manually do all of these techniques, for descriptive stats, we are
going to be using Excel. Your text walks you through an example.
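If you want a peek at what Excel is actually counting, here is a minimal formula sketch using COUNTIF (the range A2:A201 and the category label are made up -- substitute the actual data column and category values from the example):

    =COUNTIF(A2:A201, "Working full time")                     frequency: number of observations in that category
    =COUNTIF(A2:A201, "Working full time")/COUNTA(A2:A201)     relative frequency: that count as a proportion of all observations

Repeat the formula once per category; the relative frequencies across all categories should add up to 1.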
Watch the video below for a demonstration of using Excel to develop frequency
distributions and relative frequency distributions using Example 2.1. This will also give you
a review of some basic Excel skills. For those of you who are Excel gurus, I promise, there will
only be a small amount of Excel basics!
(Note that the new edition of the text has a new data set for this problem, but I have not redone the video.
Everything works the same, but the WRKSTAT variable is in column X instead of column P.)
Frequency Distributions Video
After you have developed the table, remember that our purpose is to summarize our data in a
way that is meaningful, so be sure to pay particular attention to the INTERPRETATION of the
tables.
The next video demonstrates how to create bar charts and pie charts from the tables just
developed.
Watch the video below for a demonstration of using Excel to develop bar charts and pie
charts.
Bar charts and Pie Charts Video
While pie charts and bar charts are technically designed for representing frequency
distributions, they are often used simply to represent numbers associated with categories. See
Examples 2.2 and 2.3.
Do the suggested exercises for Section 2.2 and review the solutions before moving on.
Chapter 2 Section 2.3
2.3 – Describing the Relationship between Two Nominal Variables and Comparing Two or More Nominal Data Sets
What if we have 2 variables for each observation? Now we need some bivariate techniques.
We can describe the relationship between them with a cross-classification table (or cross
tabulation) and then use bar charts to represent them graphically.
A cross-classification table is a table that shows the frequency of each combination of the values of
the two variables. It is a table with one variable and its categories across the top and the other
down the side. Each cell in the table represents one combination of the different categories of
the two variables and the frequency is a count of the number of observations that fall into that
particular combination of categories.
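Before watching the demo, it may help to see that counting logic written out as a formula. A hedged sketch using COUNTIFS, assuming (hypothetically) that one variable is in column A and the other in column B for rows 2 through 201:

    =COUNTIFS(A2:A201, "category 1 of first variable", B2:B201, "category 1 of second variable")

Each cell of the cross-classification table is just one of these counts, with the pair of category values changed to match that particular row-and-column combination. The pivot table simply builds the whole table of counts for you at once.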
Below is the completed table from Example 2.4 in your text. Open the data file Xm02-04.
Look at the data and make sure you can see what it is that is counted to develop the table.
Once you have your table, you can look for patterns in the numbers themselves or develop a
series of bar charts or combination of bar charts to examine them visually. Again, our goal is to
interpret the graphs!!
In Excel, we can use the Pivot Table tool to do the counts for us and develop our table. Watch
the next video for a demo showing how the pivot table and corresponding graph in Example 2.4
was created.
Click the link below for a video demonstration of using Excel to develop cross-
tabulations.
Cross Tabulation Table Video
One more important topic is data formats. Data can be stored in lots of different ways and
software packages usually require a specific format for different techniques. Software packages
often differ on what they require for the same technique. Whether using Excel or any other
software, you often will need to reformat your data depending on what you are trying to do.
Some possible formats for multivariate data are described in section 2.3c.
Do the suggested exercises for Section 2.3 and the suggested Chapter Review exercises
and review the solutions before moving on!
Descriptive Stats II
Welcome to Week 2!!
Our Descriptive Statistics Unit covers Chapters 1-4. This week we will cover Chapter 3 and get
started on Chapter 4 with Sections 4.1 and 4.2.
You should be getting started on Case 1! By the completion of this week's material, you will have a
lot you can get going on! Your draft is due in Week 4, then the final paper in Week 5!
IMPORTANT!! The optional Team Contract assignment is due Saturday of this week!
To get started, "Next" will take you to "What to Do in Week 2"!
What to Do in Week 2!
1. Review the optional team contract information and discuss the issues with your
teammates! Do not wait to do this! Get your communication established now! The first case
starts NOW!
2. Submit your team contract by FRIDAY if you choose to do one. While submission is
optional, thinking through the issues is important!
3. Chapters 3 and 4.1 - 4.2: Use the Study Guide as you go through the chapter material,
completing any suggested review exercises as you go. It is important that you work through
problems for each section as you finish it. Don't go through the whole chapter before you
start working problems … take it one section at a time! All of the material builds, so work
problems as you complete each section.
4. Get started on Case 1!! Don't wait until you have finished all the material. As you work
through it, use the case data for sample problems!
5. Discuss the material with your team and work together to understand the material.
Remember that the best way to learn is to teach someone else!!
Return to the top of Week 2 Module
The "Study Guide" link above will take you to the start of the study guide!
Case 1 – Cadillac’s Lagging Sales
Note: This case is a bit old, but there are some really interesting things you can find in the data.
The purpose of the case is to have you really explore the data using both graphical and
numerical techniques AND give insights into what the data really mean and the implications
of your findings. Be sure to review the case rubric for guidance!
For many years, the top selling luxury car in North America was the Cadillac. It reigned from
1950 to 1998. In 2000, it sank to 6th, behind Lexus, BMW, and Mercedes. Cadillac’s sales
during 2000 were half those of 1978, Cadillac’s peak year. The problem is that Cadillac
appears to appeal mainly to older males. Younger people seeking luxury cars shop
elsewhere. Although Cadillac made $700 million in 2000, if the company does not pick up sales
among younger shoppers, profits will go the way of fins. To put Cadillac back on top, it is
necessary to understand who is buying. A survey of luxury car buyers was undertaken. The
following information was gathered from random samples of recent buyers of luxury cars.
(Keller, 8th edition)
In the attached data file, Columns 1-5 contain the ages of the owners of five luxury cars
(BMW, Cadillac, Lexus, Lincoln, Mercedes), Columns 6-10 show their household incomes, and
Columns 11-15 have their years of education.
Thoroughly analyze the data for all five groups of car buyers using both numerical and
graphical descriptive statistics. I really want you to explore the data. Compare all the
groups to each other. Be sure you consider both central tendency and variability in both your
graphs and numerical analysis. Also be sure to consider any relationships between variables.
You should assume that the data is a properly drawn random sample of luxury car buyers, so
that the sample represents the overall market.
You should be sure that your report includes the following:
Based on your descriptive statistics, I want you to give some insights and interpretations of the
data. Discuss the descriptive statistics for each variable. What do these variables mean? If a
given average is high or low, does that make logical sense based on what these variables are?
You must include a discussion of variability. Does the variability of a given variable seem
reasonable? Is it high or low? What implications does this have for the analysis? What do the
graphs really show you?
Key things I’ll be looking for on Case 1 ---
Insights! Not just numbers! Show me that you are thinking -- if you make some
assumptions, state them. When you present a table or graph, ask WHY?
Demonstrated understanding of the concept of and importance of variability.
Creativity in using graphs and tables in your analysis and insights about those
comparisons.
Some conclusions based on your analysis.
A note about format: It is best to organize the paper by variable to make it easier to compare the
manufacturers. Use only summary tables within the text of your report. If any data is within your
text, it should be something you are discussing!! I do not want to see the raw data anywhere. I
do not need to see any calculations. You should use appendices appropriately. If you have
questions about this, please ask. The reader should not have to flip back and forth. This
should be completed as a WORD document with any Excel pasted in.
Please let me know if you have any questions about anything!
HAVE FUN!!! ;-)
Week 2 Overview
This module covers Chapter 3 and gets started on Chapter 4. We finish up graphical descriptive
statistics and then cover numerical descriptive statistics.
Chapter 3 covers graphical descriptive statistics for interval data. So we will
cover histograms (which you get to use A LOT!), line charts for time-series data, and scatter
diagrams to look for relationships between two interval variables.
Chapter 4 completes our discussion of descriptive statistics by introducing you to numerical
descriptive statistics. This week, we just get started. We cover a variety of calculations we can
do to better understand our data. Many of these you have seen before and some of them you
will already know how to do!! We will look at four types of measures: central
location, variability, relative standing, and linear relationship. This week, we see central location
and variability. An important point here is that most of these measures involve math! So …
guess what? For those that do, we are dealing only with interval data.
To this point, we have been using Excel a lot, but now we will switch gears to working out some
things by hand. I have found that this can be valuable for really intuitively understanding what
some of these measures mean. And that is key to truly understanding what is going on when we
get to inferential statistics and start using these measures!
Objectives Chapters 3 and 4.1-4.2
After completion of Chapter 3, you will be able to:
Define and give examples of the following: positively skewed, negatively skewed, cross-
sectional data, time-series data.
Explain the use of, develop (in Excel), and interpret frequency distributions and
histograms.
Recognize the characteristics of histograms.
Draw conclusions from comparisons of histograms.
Distinguish between cross-sectional and time series data.
Explain the use of, develop (Excel), and interpret line charts.
Explain the use of and develop scatter diagrams using Excel.
Recognize patterns in scatter diagrams.
Apply techniques that ensure graphical integrity and avoid graphical deception.
After completion of Sections 4.1 and 4.2 of Chapter 4, you will be able to:
Calculate and interpret measures of central tendency by hand and using Excel.
Recognize situations where median would be a better measure than mean.
Evaluate the general shape of a distribution by comparing the three measures of central
tendency.
Calculate and interpret measures of variability by hand and using Excel.
Use the Empirical Rule to show how values are distributed.
Explain situations when coefficient of variation is best for comparing variability.
Course Road Map
First, let's revisit our Course Road Map.
We continue with the descriptive statistics segment of our overall Course Road Map. As we
break down descriptive statistics further as shown in the next figure (a "descriptive stats
roadmap"), we have completed graphical techniques for nominal data and will finish up graphs
with interval data, and then get started talking about numerical techniques. We will hit it hard in
this module with Central Location and Variability. The next module will seem light in
comparison!
Chapter 3 Introduction
Before we jump in here, let's re-cap a bit. As we cover more descriptive techniques, the
difficulty will be in deciding what technique to use. This gets MUCH more complex when we get
into inferential stats! The above descriptive stats road map considers the breakdown by data
type, but remember that when deciding what test to do, we also need to consider the problem
objective. So, let's revisit the Flow Chart from Appendix 4 at the end of your text.
First, we need to determine our problem objective. For descriptive stats, there are two
possibilities we will consider: describe a single variable (data set) or describe the relationship
between 2 variables (data sets). Once we have decided that, we determine our data type. Note
that since ordinal data is categorical data just like nominal, all the techniques we use for nominal
data apply to ordinal. Then, following along, we decide whether we want to use numerical
or graphical techniques. So, in the above flow chart, we have covered the graphical/nominal
and graphical/ordinal options for both problem objectives.
Now, we will address the graphical techniques for interval data. You can see that when we get
to numerical techniques, things get a bit more complex, but that is the topic for Chapter 4. Let's
finish up the graphical techniques first.
One point I want to emphasize about all these graphical techniques is that tables can show the
SAME information! For many of the graphs we do, we start with a table! You do not ALWAYS
need to use graphs to provide a visual or to learn something about your data. For example,
sometimes, looking at a table of cumulative relative frequency might show you something that
you miss looking at just a pie chart.
Chapter 3 Section 3.1
3.1 -- Graphical Techniques to Describe a Set of
Interval Data
If you read through the paragraph introducing Example 3.1 in your text, the key phrase
is, "What information can be extracted from these data?" This is why we do the analysis! Just
looking at 200 numbers, it is difficult to see anything useful! But, by using descriptive statistical
techniques, we are better able to interpret the data.
When we have interval data, there are several graphical techniques we can use: histograms,
stem-and-leaf plots, or ogives. Each of these displays the data in a different way and provides us
with somewhat different information. We are only going to do histograms.
Histograms
A histogram graphically displays the information included in a frequency distribution. We
created frequency distributions with nominal data by counting the number of times each
category occurred and developed Bar Charts to represent that table. A histogram will do the
same thing for interval data. See the example below.
The histogram is extremely useful at showing us the "shape" of our data – how it is "distributed"
over the range of its values. For example, are there lots of low values and only a few high
values? Or is most of the data clustered in the center of the range of values? So, before we can
develop a histogram, we need to convert our data into a frequency distribution. But, interval
data is what we call "continuous" data (rather than "discrete" -- terms that will be coming up
again later!), so we do not have categories. So what we need to do is to establish a set of
intervals (classes) that cover the full range of our data values.
Note that in the example above created with Excel, the values on the horizontal axis fall right in
the middle of the "bars". Despite fiddling with it for hours (ok .. this was years ago, but I don't
think it has changed!), there does not seem to be a way to place the values at the upper tic
mark, which is where they really belong! A workaround is to use "right justify" for the axis labels.
When we set up our intervals for a histogram, we are establishing a lower limit and an upper
limit for the category, so the first class in the histogram above goes from 0 to 4; the next goes
from values greater than 4 to 6; then greater than 6 to 8; and so on. When we do these in
Excel, we will tell Excel the upper limit for our classes, but remember that your interval covers
arrange of values. (Note: for any techniques that are part of Data Analysis in Excel, that is the
tool we will use. You are welcome to use XLStat if you wish, following the instructions in the
text.)
Steps for developing a histogram
Here are the steps needed to create a histogram. It will seem complex at first, but you will be
doing lots of these throughout the semester, so you will get very good at it! And Excel will do
most of the work once you determine your class intervals.
Step 1: Establish classes to cover the range of your data.
This means you need to establish lower and upper numerical limits for each class into which
you will categorize your data. You can do this several ways and your text gives some
guidelines. There is no right or wrong, but it is important to use enough classes to show the
variation in the data, but not so many that you have classes with a frequency of only 1 or 2. The
text recommends first determining the number of classes you will use. This is dependent on the
size of your data set and there is a table for guidance. Then use the range of your data divided
by the number of classes to get the required width of each class. I found this to be
cumbersome, so I tend to do it a little differently. No matter how you do it, it is somewhat trial-
and-error. You may need to change your limits and recreate the histogram depending on what
you see, but that is very quick and easy to do!
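As a quick made-up illustration of the text's approach: if your data run from about 3 to 79 and you decide on 8 classes, the range is 79 - 3 = 76, so each class needs to be about 76 / 8 = 9.5 wide -- which you would probably round up to a convenient width of 10. In Excel, with the data in a hypothetical range A2:A201:

    =(MAX(A2:A201)-MIN(A2:A201))/8   approximate class width for 8 classes; round to something convenient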
Here is what I do: I look at the range of the data (highest and lowest values) and use trial and
error to determine some convenient class widths that give what seem to be a reasonable
number of classes and frequency counts. And there is no rule that says your lowest value has
to be the lower limit of your first class (or the highest one, the upper limit of the last class). If
your lowest value is 5, you could still use, for example, 0 to 10 as your first class. However you
do it, the first thing you need to do is determine the highest & lowest values in your data set.
With Excel, you could use the "data/sort" menu option to find these, or the minimum and
maximum functions ("=max(input range)" or "=min(input range)"). Use the fx button to access
the function wizard if you are unsure of how to use these functions.
It is important that no value could be included in more than one class (no overlap), so do not
use the same number as the upper limit of one class and the lower limit of the next. For
example, use 1 to 9.99 and 10 to 19.99. Excel will assume this for you … you just use your
upper class limit.
Set up a table with your classes in the first column. Using Excel, you set up a column with only
the upper limits. Excel calls these "bins".
Steps 2 and 3: Go through your data and count the number of values that fall into each
class and record the frequency counts in the second column of your table.
Excel will do this for you with the histogram tool.
Follow these steps:
1. Determine your class intervals
2. Create a column with the upper limits of these classes
3. Go to "Data Analysis", Histogram, and fill in the dialog box.
Input range is your data column, and
Bin range is the column you created with the upper limits of your class intervals.
Notice the "Labels" box. Many of the tools we will use have this. You can include your
column label (variable name for your input range; bins or classes or something similar for
your bin range). Be sure that you are consistent. If you included a column label when you
selected your input range, you need to include one for your bin range. And, then be sure to
click the Labels box. Be careful here … if you click Labels but did not include the label in
your selected ranges, Excel will think the first data value is the label and not include it in your
data.
Be sure that you click the "Chart output" box. Otherwise, you won't get the graph. (I am
always doing this!! I get the frequency distribution, but no histogram!)
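(If you are curious, there is also a formula-only way to get the counts without the Data Analysis tool: Excel's FREQUENCY function. A minimal sketch, assuming your data are in A2:A201 and your bin upper limits in C2:C9 -- both hypothetical ranges:

    =FREQUENCY(A2:A201, C2:C9)   counts of values at or below each upper limit (and above the previous one)

In current versions of Excel the results spill into the cells below automatically; in older versions you select the output cells first and confirm with Ctrl+Shift+Enter. We will still use the Histogram tool, though, since it also builds the chart for us.)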
Once you have your histogram, it is IMPORTANT that you change the gap width to zero! A histogram
represents continuous data and there should be no gaps between the bars. Right click one of the bars
and select "Format Data Series" to do this. The first graph we will get has spaces between the bars -- this
is a bar chart, not a histogram and is NOT CORRECT for interval data.
Watch the video below for a demonstration of using Excel to create a histogram.
Histogram Video
Pay particular attention to the interpretation of the histogram in the text. The whole point of this
exercise is to get some information from the graph, so always interpret the graph!!
Shapes of histograms
One key piece of information that histograms can give us concerns the distribution of the data.
So, we need to look at the shape of the histogram. In evaluating shape, we
consider symmetry, skewness, and modality.
A symmetric histogram is one which essentially shows a mirror image such that if we draw a
vertical line down the center of this graph, each side reflects the other. You could fold it over on
the center line and everything would match up. The figure below shows three symmetric
histograms.
Skewness occurs when there are extreme data values on one end or the other -- either
extremely high or extremely low. Income data would be skewed if Bill Gates was included!! The
histograms below represent positively skewed and negatively skewed data.
Here, I want to point out something that is often misunderstood. Positively skewed data has an
extreme point on the positive (right, high) side, so the tail is to the right with the bulk of the data
to the left. Conversely, negatively skewed data has the tail to the left (low) side and the bulk of
the data is to the right.
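Looking ahead a bit to the numerical measures in Chapter 4, you can see the same thing in the numbers: in positively skewed data, the mean gets pulled toward the extreme values while the median does not. For a tiny made-up data set of 20, 22, 25, 27, and 150:

    =AVERAGE(20,22,25,27,150)   mean = 48.8, dragged upward by the one extreme value
    =MEDIAN(20,22,25,27,150)    median = 25, unaffected by how extreme that value is

So a mean noticeably larger than the median is a hint of positive skewness (and a mean noticeably smaller hints at negative skewness).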
To evaluate modality, we look for what we call "modal classes" -- obvious peaks in the data. If
there is only one obvious peak, the data is unimodal; if two, it is bi-modal.
Also note that for a bimodal histogram, we are looking for 2 obvious peaks in the data -- the
peaks do not necessarily need to be the same height.
In many of the inferential techniques we will do, a "required condition" of conducting the test is
that the data be approximately "Normally" distributed. A normal distribution is a specific
symmetrical, unimodal distribution that has a bell-shape as shown below. We will be using
histograms later on to check for "normality".
Work through Examples 3.2, 3.3, and 3.4 in your text for some practice creating,
interpreting and comparing histograms.
Do the suggested exercises for Section 3.1 and review the solutions before moving on.