
Lecture Notes Chapter 2 - Organizing Data

Uploaded by

adam.ibr1234

Chapter 2

ORGANIZING AND GRAPHING DATA

In this chapter we discuss methods for organizing data using tables and graphical representations.
Data recorded in the sequence in which they are collected, before they are processed or ranked, are called raw data or ungrouped data.
Grouped data is a data set presented in a frequency distribution.
Why do we group data?
We group data because in their original form they are often too large and unmanageable, so we might not be able to see patterns and trends.
The way we organize our data depends to a certain extent on the type of data we are working with (qualitative vs. quantitative data).
We start with qualitative data since it is simpler to group.

Qualitative Data
A frequency distribution of a qualitative variable lists all categories and the number of
elements that belong to each of the categories.
Exercise 2.6
Thirty adults were asked which of the following conveniences they would find most difficult to
do without: television (T), refrigerator (R), air conditioning (A), computer (C), or mobile phone
(M). Their responses are listed below:
R A R A A T R M C A
A R R T C C T R A A
R A A T M C R A C R

The data as presented above are an example of raw data: the values are in no particular order.
The variable is the type of convenience, and according to the data it has 5 categories: television, refrigerator, air conditioning, computer, and mobile phone.
To prepare a frequency distribution, we list these 5 categories in the first column of the table.
The second column is the frequency (count) of each category, that is, how many observations in our data fall in that category. For example, Television has a frequency of 4, which means that 4 people in our sample consider the television the most difficult convenience to give up.
Frequency Distribution of Convenience Most Difficult to Give Up

Convenience         Frequency    Relative Frequency    Percentage
Television               4       4/30  = 0.133            13.3
Refrigerator             9       9/30  = 0.300            30.0
Air conditioning        10       10/30 = 0.333            33.3
Computer                 5       5/30  = 0.167            16.7
Mobile Phone             2       2/30  = 0.067             6.7
TOTAL                   30               1.000           100.0

The table above is an example of grouped data (data in a frequency distribution).

The third column is the relative frequency, which is found by dividing the frequency of the category by the total frequency:

$$\text{Relative Frequency} = \frac{\text{Frequency of the category}}{\text{Total}}$$

The relative frequency shows what proportion of the total frequency belongs to the
corresponding category. The relative frequencies always sum to 1 (apart from rounding).

$$\text{Percentage} = (\text{Relative Frequency}) \times 100$$

The percentages always sum to 100 (apart from rounding).
Relative frequencies and percentages make it easier to understand the distribution and to
compare different data sets, especially if the totals are not the same.
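The three columns can also be computed directly from the raw responses. The following is a minimal Python sketch, not part of the original notes; the variable names (`responses`, `labels`, `table`) are chosen here for illustration only.

```python
from collections import Counter

# Raw responses from Exercise 2.6, coded T, R, A, C, M as in the exercise.
responses = (
    "R A R A A T R M C A "
    "A R R T C C T R A A "
    "R A A T M C R A C R"
).split()

# Category codes and their names, in the order used in the table.
labels = {
    "T": "Television",
    "R": "Refrigerator",
    "A": "Air conditioning",
    "C": "Computer",
    "M": "Mobile phone",
}

counts = Counter(responses)      # frequency of each category
total = sum(counts.values())     # total frequency

# One row per category: (name, frequency, relative frequency, percentage).
table = []
for code, name in labels.items():
    frequency = counts[code]
    relative = frequency / total
    table.append((name, frequency, relative, relative * 100))

for name, frequency, relative, percentage in table:
    print(f"{name:<18}{frequency:>3}{relative:>8.3f}{percentage:>7.1f}")
```

Dividing each count by the same total is what makes two data sets of different sizes comparable.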
Graphical Presentation of Qualitative Data
A graph made of bars whose heights represent the frequencies of the respective categories is called
a bar graph.
To construct a bar graph, we place the categories on the horizontal axis. All bars have the
same width, and we leave a gap of equal size between neighboring bars. On the vertical axis we
mark the frequency, relative frequency, or percentage.
Note: graphs of qualitative data cannot be classified as right or left skewed.

[Figure: bar graph "Convenience most difficult to give up"; horizontal axis: Television, Refrigerator, Air conditioning, Computer, Mobile Phone; vertical axis: frequency, 0 to 12]

A Pareto chart is a bar graph with bars arranged by their heights in descending order. To make
a Pareto chart, arrange the bars according to their heights such that the bar with the largest
height appears first on the left side, and then subsequent bars are arranged in descending
order with the bar with the smallest height appearing last on the right side.
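The ordering step for a Pareto chart amounts to sorting the categories by frequency, largest first. A small Python sketch, using the frequencies from the table for Exercise 2.6 (the names `frequencies` and `pareto_order` are illustrative):

```python
# Frequencies from the frequency distribution of Exercise 2.6.
frequencies = {
    "Television": 4,
    "Refrigerator": 9,
    "Air conditioning": 10,
    "Computer": 5,
    "Mobile phone": 2,
}

# Sort categories by frequency in descending order: tallest bar first.
pareto_order = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)

# Print a rough text version of the chart, one bar per line.
for name, frequency in pareto_order:
    print(f"{name:<18}{'#' * frequency} ({frequency})")
```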

A pie chart is a circle divided into portions that represent the relative frequencies or
percentages of a population or a sample belonging to different categories.

[Figure: pie chart "Convenience most difficult to give up" with slices for Television, Refrigerator, Air conditioning, Computer, Mobile Phone]
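When a pie chart is drawn by hand, each slice is usually given a central angle equal to its relative frequency times 360 degrees. The notes do not state this step, so the Python sketch below is offered as an assumption about how the slices would be sized; the name `angles` is illustrative.

```python
# Frequencies from the frequency distribution of Exercise 2.6.
frequencies = {
    "Television": 4,
    "Refrigerator": 9,
    "Air conditioning": 10,
    "Computer": 5,
    "Mobile phone": 2,
}
total = sum(frequencies.values())

# Central angle of each slice in degrees: (frequency / total) * 360.
angles = {name: f * 360 / total for name, f in frequencies.items()}

for name, angle in angles.items():
    print(f"{name:<18}{angle:>6.1f} degrees")
```

The angles add up to 360 degrees, just as the relative frequencies add up to 1.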


Quantitative Data
We now discuss different types of frequency distributions for quantitative data. The choice
depends on the type of data and the range of values.
The types of frequency distributions are:
1. Single-valued class
If the quantitative variable can assume only a limited number of values, it is
better to prepare a frequency distribution in which each class refers to a single value of the
variable (as opposed to interval classes, which we discuss later).
Here a class in the frequency distribution is a single value that the variable can assume.

Example:

The following data represent the number of siblings (brothers or sisters) that each student in a
sample of 20 students has:

1 2 0 5 1 1 3 2 0 5
2 1 2 1 2 0 1 3 1 2

Frequency Distribution

Number of Siblings    Frequency    Relative Frequency    Percentage
        0                 3        3/20 = 0.15               15
        1                 7        7/20 = 0.35               35
        2                 6        6/20 = 0.30               30
        3                 2        2/20 = 0.10               10
        4                 0        0/20 = 0.00                0
        5                 2        2/20 = 0.10               10
      TOTAL              20               1.00              100

The first column represents the class, which is made up of a single value. The second column is
the frequency, that is, the count of how many observations in the sample have this value of
the variable.
The third column is the relative frequency, which gives the proportion of the sample with
this value of the variable. The relative frequency is calculated as

$$\text{Relative Frequency} = \frac{\text{Class frequency}}{\text{Total}}$$

The fourth column is the percentage, which is equal to the relative frequency × 100.
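The single-valued frequency distribution can be built from the raw data in a few lines. This Python sketch is illustrative only (the names `siblings` and `rows` are not from the notes). It loops over every value from the minimum to the maximum so that a value with no observations still gets a row.

```python
from collections import Counter

# Number of siblings for the sample of 20 students.
siblings = [1, 2, 0, 5, 1, 1, 3, 2, 0, 5,
            2, 1, 2, 1, 2, 0, 1, 3, 1, 2]

counts = Counter(siblings)   # a Counter returns 0 for values that never occur
n = len(siblings)

# One row per possible value: (value, frequency, relative frequency, percentage).
rows = [
    (value, counts[value], counts[value] / n, 100 * counts[value] / n)
    for value in range(min(siblings), max(siblings) + 1)
]

for value, frequency, relative, percentage in rows:
    print(f"{value:>2}{frequency:>4}{relative:>7.2f}{percentage:>6.0f}")
```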
Graphical Representation

• Bar Diagram

[Figure: bar diagram "Number of Siblings"; horizontal axis: number of siblings, 0 to 5; vertical axis: frequency, 0 to 8]

• Dot Plots

How to draw a dot plot:

The minimum and maximum values in the data set are 0 and 5, so we draw a horizontal line and mark on it the values that the variable can assume.
Then, for each observation, we place a dot above the value that represents its number of siblings.
If we don't have the frequency distribution, we use the raw data:
The first value in our data set is "1", so we place a dot above the value "1"; the second
value is "2", so we place a dot above "2"; and we continue in this way until the last value in the data set.
If two or more observations have the same value, we stack the dots above each other.
The benefit of a dot plot is that it is quick to draw and it shows where the data points in a
set naturally cluster.
In our situation, since we have already grouped the data, we can use the values of the
variable and the frequency column directly to draw the dot plot.
Note: in our example no student has 4 siblings. We do not delete the
value 4 from the table or from the graphical representations; we keep it and show that it has
a frequency of zero.
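The stacking rule for a dot plot can be illustrated with a rough text rendering. The Python sketch below is only an illustration (the plotting character and the names `values` and `lines` are choices made here); it keeps the value 4 on the axis even though no dot sits above it.

```python
from collections import Counter

# Number of siblings for the sample of 20 students.
siblings = [1, 2, 0, 5, 1, 1, 3, 2, 0, 5,
            2, 1, 2, 1, 2, 0, 1, 3, 1, 2]

counts = Counter(siblings)
values = range(min(siblings), max(siblings) + 1)   # 0..5, including the empty value 4
tallest = max(counts.values())                     # height of the tallest stack

# Build the plot from the top row down: a dot appears at a given level
# only if that value occurs at least that many times.
lines = []
for level in range(tallest, 0, -1):
    row = " ".join("o" if counts[v] >= level else " " for v in values)
    lines.append(row.rstrip())
lines.append(" ".join(str(v) for v in values))     # the horizontal axis

print("\n".join(lines))
```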
2. Interval classes

If the data set extends over a large interval, then using single-valued classes becomes
impractical.
Instead, we group the values into class intervals. The classes should be
non-overlapping, that is, each observation can belong to one class only. We usually group
the data by using classes with limits.

Number of Text Messages    Frequency
(the variable)
       32 – 37                 10
       38 – 43                  9
       44 – 49                 13      <- third class; 13 is the frequency of the third class
       50 – 55                  6
       56 – 61                  2      <- fifth class; 56 is its lower limit and 61 is its upper limit

Class Frequency: the number of values in a data set that belong to a certain class.
Class Limits: each class has a lower limit, which is the smallest value that can go
into the class, and an upper limit, which is the largest value that can go into the
class.
The values 32, 38, 44, 50, and 56 are the lower limits.
The values 37, 43, 49, 55, and 61 are the upper limits.
Exercise 2.11
Frequency Distribution of the number of text messages sent on 40 randomly selected days
during 2015 by a high school student

Number of Text Messages    Midpoint    Frequency    Relative Frequency    Percentage
       32 – 37               34.5         10        10/40 = 0.250            25.0
       38 – 43               40.5          9        9/40  = 0.225            22.5
       44 – 49               46.5         13        13/40 = 0.325            32.5
       50 – 55               52.5          6        6/40  = 0.150            15.0
       56 – 61               58.5          2        2/40  = 0.050             5.0
       Total                            Σf = 40             1.000           100.0

In a class-interval frequency distribution, both limits are included in the class.
For example, both 32 and 37 are included in class 1, which can be written as [32 – 37].

$$\text{Class midpoint} = \frac{\text{Lower Limit} + \text{Upper Limit}}{2}$$

For example, the midpoint of class 1 is equal to $\frac{32 + 37}{2} = 34.5$.

$$\text{Relative Frequency} = \frac{\text{Class frequency}}{\text{Total}}$$

For example, the relative frequency of class 1 = 10/40 = 0.250.
The total is the sum of the frequencies, written Σf.

Class Width – Although we sometimes have frequency distributions with classes of unequal
width, it is better to construct frequency distributions with classes of the same width.

Class width is the difference between two consecutive lower limits.

For example: Width of class 1 = Lower Limit of class 2 – Lower Limit of class 1 = 38 – 32 = 6
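The midpoint, width, and relative frequency formulas can be checked against the table for Exercise 2.11. A minimal Python sketch, using only the class limits and frequencies given in the notes (the variable names are illustrative):

```python
# Class limits (lower, upper) and frequencies from Exercise 2.11.
classes = [(32, 37), (38, 43), (44, 49), (50, 55), (56, 61)]
frequencies = [10, 9, 13, 6, 2]

total = sum(frequencies)                                   # Σf

# Midpoint = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Width = difference between two consecutive lower limits
widths = [classes[i + 1][0] - classes[i][0] for i in range(len(classes) - 1)]

# Relative frequency = class frequency / total
relative = [f / total for f in frequencies]

for (lower, upper), mid, f, rel in zip(classes, midpoints, frequencies, relative):
    print(f"{lower} - {upper}  midpoint {mid:>5}  f = {f:>2}  relative = {rel:.3f}")
print("widths:", widths)
```

Because the width is measured between lower limits, it is 6 here even though 37 minus 32 is only 5: each class contains the six whole numbers from its lower limit to its upper limit.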

Graphical Representation
A histogram is a graph in which classes are marked on the horizontal axis and the frequencies,
relative frequencies, or percentages are marked on the vertical axis. In a histogram, the bars are
drawn adjacent to each other. It is the area of the bar which reflects the frequency of the class.
Therefore if classes are of unequal width the frequency must be adjusted before drawing the
histogram.
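The notes do not say which adjustment to use for unequal widths. One common choice is to plot the frequency density (frequency divided by class width) as the bar height, so that the area of each bar equals the class frequency. The Python sketch below assumes that choice, and the three classes in it are hypothetical values made up for illustration, not data from the notes.

```python
# Hypothetical classes of unequal width, written as (lower, upper) with the
# upper end excluded, and hypothetical frequencies. Illustration only.
classes = [(0, 10), (10, 20), (20, 40)]
frequencies = [5, 8, 8]

# Assumed adjustment: bar height = frequency density = frequency / class width.
widths = [upper - lower for lower, upper in classes]
densities = [f / w for f, w in zip(frequencies, widths)]

# With this adjustment the area of each bar (height * width) equals the frequency.
areas = [d * w for d, w in zip(densities, widths)]

for (lower, upper), f, d in zip(classes, frequencies, densities):
    print(f"{lower} to less than {upper}: frequency {f}, bar height {d}")
```

In this illustration the second and third classes have the same frequency, but the third bar is half as tall because it is twice as wide.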
[Figure: histogram of the number of text messages; horizontal axis: classes 32 – 37, 38 – 43, 44 – 49, 50 – 55, 56 – 61; vertical axis: frequency, 0 to 14]

A frequency polygon is a graphical representation in which the midpoint of each class is used to
represent the class interval.

[Figure: frequency polygon; horizontal axis: class midpoints 34.5, 40.5, 46.5, 52.5, 58.5; vertical axis: frequency, 0 to 14]

A cumulative frequency distribution gives the total number of values that fall at or below the
upper limit of each class.

Number of Text Messages    Frequency
       32 – 37                 10
       38 – 43                  9
       44 – 49                 13
       50 – 55                  6
       56 – 61                  2
       Total                Σf = 40

Number of Text Messages    Cumulative Frequency
       32 – 37             10
       32 – 43             10 + 9 = 19
       32 – 49             10 + 9 + 13 = 32
       32 – 55             10 + 9 + 13 + 6 = 38
       32 – 61             10 + 9 + 13 + 6 + 2 = 40

To calculate the cumulative frequency of a class, we add its frequency to the frequencies of all the classes that
precede it, as shown in the table above. Note that each cumulative class starts at 32, which is the lower
limit of the first class, to indicate that the count runs from the beginning of the distribution to the end of the
current class.

Number of Text Messages    Cumulative Frequency         Cumulative Relative Frequency
       32 – 37             10                           10/40 = 0.25
       32 – 43             10 + 9 = 19                  19/40 = 0.475
       32 – 49             10 + 9 + 13 = 32             32/40 = 0.80
       32 – 55             10 + 9 + 13 + 6 = 38         38/40 = 0.95
       32 – 61             10 + 9 + 13 + 6 + 2 = 40     40/40 = 1.00

$$\text{Cumulative Relative Frequency} = \frac{\text{Cumulative frequency of a class}}{\text{Total observations in the data set}}$$

For example, if we look at class 4 in the above table (32 – 55), we see that the cumulative
relative frequency is 0.95. This means that on 95% of the days, 55 or fewer text
messages were sent.
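The running totals can be computed with a cumulative sum. A short Python sketch using the frequencies from Exercise 2.11 (the variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies and upper limits from Exercise 2.11.
frequencies = [10, 9, 13, 6, 2]
upper_limits = [37, 43, 49, 55, 61]

# Cumulative frequency: each class frequency added to all preceding ones.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: cumulative frequency / total observations.
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

for upper, c, rel in zip(upper_limits, cumulative, cumulative_relative):
    print(f"32 - {upper}: cumulative frequency {c:>2}, cumulative relative frequency {rel:.3f}")
```

The last cumulative frequency always equals the total number of observations, so the last cumulative relative frequency is always 1.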
3. Less than method

The classes in the previous example were 32 – 37, 38 – 43, etc., where both the lower
limit and the upper limit were included in the class. We can also write the classes in a
different way, using the less than method.

The less than method is more appropriate when the data contain decimal values.
The first column in the table above shows the class intervals using boundaries, not limits. When we construct a
frequency distribution using the less than method, we call the end points of a class the lower
boundary and the upper boundary.
For class 1, 72 is the lower boundary and 82 is the upper boundary.
The upper boundary is not included in the class.
For example, class 1 is "72 to less than 82": 72 is included in class 1, but 82 belongs to class 2.

The width is the difference between the two boundaries of a class:

Width = Upper Boundary – Lower Boundary
Width of class 1: 82 – 72 = 10
All classes in the above table are of equal width.

$$\text{Midpoint of a class} = \frac{\text{Lower Boundary} + \text{Upper Boundary}}{2}$$

$$\text{Midpoint of class 1} = \frac{72 + 82}{2} = 77$$
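The rule that the lower boundary is included and the upper boundary is excluded can be sketched in Python. Only the first class (72 to less than 82) and the common width of 10 come from the notes; the function name `class_for` and the test values 81.9 and 95.5 are hypothetical, and the sketch assumes values are at least 72.

```python
def class_for(value, first_lower=72, width=10):
    """Return the (lower, upper) boundaries of the class containing value.

    The lower boundary is included in the class; the upper boundary is not.
    Assumes equal-width classes starting at first_lower and value >= first_lower.
    """
    index = int((value - first_lower) // width)   # how many classes to skip
    lower = first_lower + index * width
    return lower, lower + width


def midpoint(lower, upper):
    """Midpoint of a class = (lower boundary + upper boundary) / 2."""
    return (lower + upper) / 2


print(class_for(72))      # 72 sits on the lower boundary of class 1
print(class_for(81.9))    # hypothetical value just below 82, still class 1
print(class_for(82))      # 82 belongs to class 2
print(midpoint(72, 82))   # midpoint of class 1
```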

The graphical representations are the same as for the interval method.

In the graphical representation below, the frequency polygon is superimposed on the
histogram.
Shapes of Histograms

An important aspect of a distribution is its shape

1. Symmetric
A histogram is symmetric if, when you cut it down the middle, the left-hand and right-hand
sides resemble mirror images of each other. Very few distributions are perfectly
symmetric, but many are approximately symmetric.

Another piece of information we can add about the shape of a distribution is how many peaks,
or high points, it has.
A histogram is unimodal if it has only one peak. The first distribution is unimodal: it has
one peak, that is, one class with a frequency higher than the rest of the classes, which is the 5th class
(this is called the mode; we will discuss it in more detail in Chapter 3). So it can be described as
symmetric unimodal.
The second distribution is bimodal because it has 2 distinct peaks, the 3rd and the 7th
classes. It can be described as symmetric bimodal.
If the distribution has 3 or more distinct peaks, it is called multimodal; we say there are multiple modes.

One peak: unimodal

Two peaks: bimodal

More than two peaks: multimodal

2. Skewed

[Figure: two histograms, one right skewed and one left skewed]

In a right-skewed histogram, most values are clustered at the lower values of the variable, and
the frequencies decrease as the value of the variable increases, so the distribution has a tail to
the right. It is also known as positively skewed.
In a left-skewed histogram, most observations are clustered at the higher values of the variable,
and the frequencies decrease as you move to lower values, so the distribution has a tail
to the left. It is also known as negatively skewed.
Fig. a: unimodal right skewed. Fig. b: unimodal left skewed.

3. Uniform or Rectangular

All the classes have the same frequency.

Outliers, or extreme values, are values that are very small or very large relative to
the majority of the values in a data set. (In Chapter 3 we will learn how to determine whether a data set
has outliers.)

How would you describe the shape of the distribution in ex 2.11? ex 2.18?
What about ex 2.6?
Stem and Leaf Display:
In a stem-and-leaf display of quantitative data, each value is divided into two portions–a stem
and a leaf.
Ex. 2.26
This example gives the time in minutes that each of 20 students waits in line to pay for
textbooks at their bookstore:

15 8 23 21 4 17 31 22 31 6
5 6 14 17 16 25 27 33 31 8
Construct a stem and leaf display.
To construct a stem-and-leaf display for these data, we split each value into 2 parts.
• The first part is the first digit, which is called the stem. For 15 the stem is 1.
For the second student the time is 8, which we can rewrite as 08, so the stem is 0.
Following this rule, the stems for this data set are 0, 1, 2, 3.
Make a vertical list of all the stems in the data:

Stems

0 |
1 |
2 |
3 |

• The second part is the second digit, which is called the leaf.
For the first student the time is 15, so the stem is 1 and the leaf is 5.
For the second student the time is 08, so the stem is 0 and the leaf is 8.

Stems

0 | 8     (leaf of 08)
1 | 5     (leaf of 15)
2 |
3 |

• For each of the remaining values, write the leaf next to its stem.
The result is an unordered stem-and-leaf display, shown below:

0 | 8 4 6 5 6 8
1 | 5 7 4 7 6
2 | 3 1 2 5 7
3 | 1 1 3 1

• For each stem, arrange its leaves in increasing order. The result is an ordered stem-and-leaf
display, shown below:

0 | 4 5 6 6 8 8
1 | 4 5 6 7 7
2 | 1 2 3 5 7
3 | 1 1 1 3

One advantage of a stem-and-leaf display over a frequency distribution is that we do not lose
the value of each observation. We can still recover the original data, which we cannot do with a
frequency distribution.
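The construction steps for Ex. 2.26 can be sketched in Python. This is an illustration, not part of the notes: the names `times`, `stems`, and `ordered` are chosen here, and the eighteenth value is taken as 33, the value implied by the leaf 3 on stem 3 in the displays of the notes. The last lines show that the original values can be recovered from the display.

```python
from collections import defaultdict

# Waiting times in minutes for the 20 students (eighteenth value taken as 33,
# as implied by the stem-and-leaf displays in the notes).
times = [15, 8, 23, 21, 4, 17, 31, 22, 31, 6,
         5, 6, 14, 17, 16, 25, 27, 33, 31, 8]

# Split each value into stem (tens digit) and leaf (units digit);
# a single-digit value such as 8 is treated as 08, so its stem is 0.
stems = defaultdict(list)
for t in times:
    stem, leaf = divmod(t, 10)
    stems[stem].append(leaf)          # unordered display: leaves in data order

# Ordered display: sort the leaves of every stem from smallest to largest.
ordered = {stem: sorted(stems[stem]) for stem in range(min(stems), max(stems) + 1)}

for stem, leaves in ordered.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

# The display keeps every observation: rebuild the values from stems and leaves.
recovered = [10 * stem + leaf for stem, leaves in ordered.items() for leaf in leaves]
print(recovered == sorted(times))
```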

In the example below the data range from 61 to 151, so for the values in the hundreds we use
the first 2 digits as the stem and the last digit as the leaf. Notice that there are no leaves next to the
stem 10, which means that there are no values between 100 and 109.

The following example is called a back-to-back stem-and-leaf display.


The stem is the middle column, and the first and third columns are the leaves. This display is
sometimes used when we have two sets of data and we need to compare them. Using a back-to-back
stem-and-leaf display makes the comparison easier.
What is the minimum pulse rate before?
What is the minimum pulse rate after?
What is the maximum pulse rate before?
What is the maximum pulse rate after?
What is the shape of the distribution before? after?
