ORGANIZATION/CLASSIFICATION OF DATA
Measurement or counting gives rise to raw data. Raw data are collected data that have not been
organized numerically. Raw data are difficult to comprehend because they lack organization and
summarization, which makes them hard to interpret. The raw data therefore have to be put in some
order through classification and tabulation so as to reduce their volume and heterogeneity.
Characteristics of a Good Classification
• Comprehensiveness: Classification should cover all the items of the data. In other words, it
should be so comprehensive that it classifies all items in some group or class.
• Clarity: There should be no confusion about the placement of any data item in a group or class.
That is, classification should be absolutely clear.
• Homogeneity: The items within a specific group or class should be similar to each other.
• Suitability: The attribute or characteristic according to which classification is done should
agree with the purpose of classification.
• Stability: The basis of classification should remain fixed, so that a particular kind of
investigation is always carried out on the same set of classes.
• Elasticity: As the purpose of classification changes, one should be able to change the basis of
classification.
Construction of frequency distribution
When summarizing large masses of raw data, it is often useful to distribute the data into classes,
or categories, and determine the number of individuals belonging to each class, called the class
frequency.
Frequency: The number of times a certain value or class of values occurs in the data. The
sum of the frequencies is equal to the total number of observations, or sample size, n.
Frequency Distribution
A tabular arrangement of data by their classes together with the corresponding class
frequencies is called frequency distribution or frequency table.
The techniques used to organize data depend on the type of variable (quantitative (numerical) or
qualitative (categorical)) associated with the data.
TYPES OF FREQUENCY DISTRIBUTION
Two types of frequency distributions that are most often used are the:
i. Categorical frequency distribution
ii. Quantitative frequency distribution
Categorical Frequency Distribution:
A categorical frequency distribution is a table used to organize data that can be placed in specific
categories, such as nominal- or ordinal-level data. A categorical frequency distribution is
constructed by tallying responses by category and placing the results in a table. This can be
done with a summary table, which organizes the data for a single categorical variable,
or with a contingency table, which organizes the data from two or more categorical
variables.
a. Summary Table:
A summary table is usually constructed for a single categorical variable. The table presents the
tallied responses as frequencies or percentages for each category. It helps you to see the
differences among the categories by displaying the frequency (amount) or percentage of items in
each category in a separate column.
For example, the data below represent the blood groups of 40 students in a Biostatistics class.
Construct a frequency distribution for the data.
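The tallying behind a summary table can be sketched in Python with collections.Counter. The responses listed in the code are made-up placeholders, not the 40 blood groups of the class, and the function name summary_table is an assumption used only for illustration.

```python
from collections import Counter

# Hypothetical responses (placeholders, NOT the 40 observations of the class).
blood_groups = ["A", "O", "B", "O", "AB", "A", "O", "B", "O", "A"]


def summary_table(values):
    """Return (category, frequency, percentage) rows for one categorical variable."""
    counts = Counter(values)          # tally the responses by category
    n = len(values)                   # total number of observations
    return [(cat, f, 100 * f / n) for cat, f in sorted(counts.items())]


rows = summary_table(blood_groups)
for category, freq, pct in rows:
    print(f"{category:<4}{freq:>4}{pct:>8.1f}%")
```

The frequencies in the second column always add up to the number of observations, which is a quick check that no response was missed while tallying.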
B. Quantitative variable
i. Ungrouped/ discrete frequency distribution
Each class consists of a single data value, and the frequency of a class is the number of times
that value occurs. This is used when the data take only a few distinct values, each of which
occurs a number of times.
Example: Given below, are the wing length measurements (to the nearest whole millimeter) of 50
laughing doves.
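An ungrouped frequency distribution lists every distinct value with its frequency. The short Python sketch below illustrates this with made-up wing lengths; the ten values are placeholders, not the 50 dove measurements, and ungrouped_distribution is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical wing lengths in mm (placeholders, NOT the 50 dove measurements).
wing_lengths = [141, 143, 142, 141, 144, 142, 142, 143, 141, 142]


def ungrouped_distribution(values):
    """Return (value, frequency) pairs, one per distinct value, in ascending order."""
    counts = Counter(values)
    return [(x, counts[x]) for x in sorted(counts)]


table = ungrouped_distribution(wing_lengths)
for value, freq in table:
    print(value, freq)
```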
ii. Grouped Frequency Distribution: The data are organized into groups or
intervals with their corresponding frequencies.
Table 2.0: Height of 100 students in STA 131 class
TERMS ASSOCIATED WITH FREQUENCY DISTRIBUTION
i. Class intervals and class limits
Class Interval: A class interval is a range of values into which data are grouped for the
purpose of organizing a large data set. A class interval is defined by two values:
a. Lower Class Limit
This is the smallest value that belongs to a class interval. It is inclusive.
b. Upper Class Limit
This is the largest value that belongs to a class interval. It is also inclusive.
For the class interval 10-19, the end numbers 10 and 19 are called class limits; the smaller
number (10) is the lower-class limit, and the larger number (19) is the upper-class limit. A class
interval with no upper- or lower-class limit indicated is called an open class interval. For example,
referring to age groups of individuals, the class interval "65 years and above" is an open class interval.
c. Class boundaries:
Class boundaries are defined to eliminate any gaps between the classes; they have one more decimal
place than the data. Class boundaries are the limits determined mathematically to make the
intervals of a continuous variable continuous in both directions, so that no gap exists between
classes. They are the actual or real limits of a class interval.
a. The lower extreme point is called the lower class boundary.
b. The upper extreme point is called the upper class boundary.
Class boundaries are obtained as follows:
\[ \text{Lower class boundary} = \text{lower class limit} - \tfrac{1}{2}\alpha \]
\[ \text{Upper class boundary} = \text{upper class limit} + \tfrac{1}{2}\alpha \]
where \(\alpha\) is the difference between the upper-class limit of any class interval and the
lower-class limit of the next class interval.
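As a rough illustration of the rule above, the Python sketch below derives the boundaries from a list of class limits. The class 10-19 is taken from these notes; the classes 20-29 and 30-39 and the function name class_boundaries are assumptions added for the example, and the gap between consecutive classes is assumed to be the same throughout.

```python
def class_boundaries(limits):
    """Turn (lower limit, upper limit) pairs into (lower boundary, upper boundary) pairs.

    Assumes the classes are in ascending order with the same gap between
    every pair of consecutive classes.
    """
    # alpha: gap between the upper limit of one class and the lower limit of the next
    alpha = limits[1][0] - limits[0][1]
    return [(lower - alpha / 2, upper + alpha / 2) for lower, upper in limits]


limits = [(10, 19), (20, 29), (30, 39)]   # 20-29 and 30-39 are illustrative
boundaries = class_boundaries(limits)
print(boundaries)
```

Each upper boundary coincides with the lower boundary of the next class, which is exactly what removes the gaps.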
d. Class mark (xc) or Mid-point of an interval:
1. The class mark is the midpoint of the class interval and is obtained by adding the lower-
class limit and upper-class limit and dividing by 2.
2. The class mark is also called the class midpoint.
3. It is used as representative value of the class interval for the calculation of mean, standard
deviation and other measures.
4. Class mark is the value representing the class interval. It is calculated as:
\[ X_c = \frac{\text{Lower class limit} + \text{Upper class limit}}{2} \]
e. The size, or width, of a class interval
The size, or width, of a class interval is the difference between the lower- and upper-class
boundaries and is also referred to as the class width, class size, or class length. If all class intervals
of a frequency distribution have equal widths, this common width is denoted by c. In such a case, c is
equal to the difference between two successive lower-class limits or two successive upper-class
limits. For example, the boundaries of the class interval 10-19 are 9.5 and 19.5, so its size is
19.5 − 9.5 = 10.
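The class mark and class width of the interval 10-19 can be checked with a few lines of Python. This is only a sketch; the helper names class_mark and class_width are assumptions, not part of the notes.

```python
def class_mark(lower_limit, upper_limit):
    """Midpoint of a class: the average of its lower and upper class limits."""
    return (lower_limit + upper_limit) / 2


def class_width(lower_boundary, upper_boundary):
    """Size of a class: upper class boundary minus lower class boundary."""
    return upper_boundary - lower_boundary


mark = class_mark(10, 19)         # class limits of the interval 10-19
width = class_width(9.5, 19.5)    # class boundaries of the same interval
print(mark, width)
```

Note that the width is computed from the boundaries, not the limits: 19 − 10 would give 9, one unit short of the true class size.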
Table: Class limits, class boundaries, class mark, class width, relative frequency and percentage relative frequency
Class     Class      Class limits   Class boundaries  Class  Class  Relative  % Relative
Interval  Frequency  Lower  Upper   Lower   Upper     Mark   Width  Freq.     Freq.
15-19     18         15     19      14.5    19.5      17     5      0.18      18%
20-24     34         20     24      19.5    24.5      22     5      0.34      34%
25-29     21         25     29      24.5    29.5      27     5      0.21      21%
30-34     12         30     34      29.5    34.5      32     5      0.12      12%
35-39     9          35     39      34.5    39.5      37     5      0.09      9%
40-44     6          40     44      39.5    44.5      42     5      0.06      6%
Total     100                                                       1.00      100%
Class Frequency: The number of observations falling within a class is called its class frequency.
Total Frequency: The sum of all the frequencies is called total frequency.
Relative frequency: It is the ratio of the frequency of the class to the total frequency. It is used
to compare two or more frequency distributions, or two or more items in the same frequency
distribution. The relative frequency is not expressed as a percentage and is defined as:
\[ \text{Relative frequency} = \frac{\text{Frequency of the class}}{\text{Total frequency}} \]
Percentage Relative Frequency: This is the ratio of the frequency of a class to the total
frequency, expressed as a percentage. It is defined as:
\[ \text{Percentage (\%) relative frequency} = \frac{\text{Frequency of the class}}{\text{Total frequency}} \times 100 \]
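The two definitions can be applied to the class frequencies of the table above (18, 34, 21, 12, 9, 6). The Python sketch below is illustrative only; the variable names are assumptions.

```python
# Class frequencies taken from the table of class limits and boundaries above.
frequencies = {"15-19": 18, "20-24": 34, "25-29": 21,
               "30-34": 12, "35-39": 9, "40-44": 6}

total = sum(frequencies.values())                       # total frequency
relative = {c: f / total for c, f in frequencies.items()}
percent = {c: 100 * f / total for c, f in frequencies.items()}

for c in frequencies:
    print(f"{c:<8}{frequencies[c]:>4}{relative[c]:>7.2f}{percent[c]:>6.0f}%")
```

The relative frequencies add up to 1 and the percentages to 100, apart from small rounding differences.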
Guidelines for Number of Classes
i. There should be between 5 and 20 classes.
ii. The class width should preferably be an odd number. When the class limits are whole numbers,
this guarantees that the class midpoints are whole numbers rather than decimals.
iii. The classes must be mutually exclusive. This means that no data value can fall into two
different classes.
iv. The classes must be all inclusive or exhaustive. This means that all data values must be
included.
v. The classes must be continuous. There are no gaps in a frequency distribution.
vi. The classes must be equal in width. The exception here is the first or last class. It is possible
to have an open-ended "or below" or "and above" class. This is often used with ages.
Creating a Grouped Frequency Distribution
1. Find the largest and smallest values.
2. Compute the Range(R): Maximum - Minimum
3. Select the number of classes(K) desired. This is usually between 5 and 20.
4. Find the class width by dividing the range by the number of classes and rounding up.
Size(Width) = R/ K
a. You must round up, not off. Normally 3.2 would round to 3, but in rounding up, it
becomes 4. If the range divided by the number of classes gives an integer value (no
remainder), then you can either add one to the number of classes or add one to the class
width. Sometimes you are instructed to use a certain number of classes.
b. Pick a suitable starting value less than or equal to the minimum value.
c. Your starting value is the lower limit of the first class. Continue to add the class width to
this lower limit to get the rest of the lower limits.
d. To find the upper limit of the first class, subtract one from the lower limit of the second
class (for data recorded as whole numbers). Then continue to add the class width to this
upper limit to find the rest of the upper limits.
5. Tally the data.
6. Find the frequencies.
7. Find the boundaries by subtracting 0.5 units from the lower limits and adding 0.5
units to the upper limits (if the data are recorded as whole numbers).
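The steps above can be sketched in Python for data recorded as whole numbers. The twelve values and the function name grouped_distribution are assumptions made up for the illustration; the starting value is taken to be the minimum, and when the range divides exactly by the number of classes the sketch adds one to the class width (the notes also allow adding one to the number of classes instead).

```python
import math


def grouped_distribution(data, k):
    """Group whole-number data into k equal-width classes starting at the minimum.

    Returns a list of (lower limit, upper limit, frequency) and a list of
    (lower boundary, upper boundary).
    """
    smallest, largest = min(data), max(data)          # step 1
    data_range = largest - smallest                   # step 2
    width = math.ceil(data_range / k)                 # step 4: round up, not off
    if data_range % k == 0:                           # no remainder: widen by one
        width += 1
    lowers = [smallest + i * width for i in range(k)]
    uppers = [low + width - 1 for low in lowers]      # one less than next lower limit
    classes = [(low, up, sum(1 for x in data if low <= x <= up))   # steps 5 and 6
               for low, up in zip(lowers, uppers)]
    boundaries = [(low - 0.5, up + 0.5) for low, up in zip(lowers, uppers)]  # step 7
    return classes, boundaries


# Hypothetical whole-number observations (illustrative only).
data = [12, 15, 21, 23, 23, 27, 30, 34, 35, 38, 41, 44]
classes, boundaries = grouped_distribution(data, k=4)
for low, up, f in classes:
    print(f"{low}-{up}: {f}")
```

Here the range is 32 and 32/4 = 8 with no remainder, so the width becomes 9; with a width of 8 the largest value, 44, would fall outside the last class.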
Cumulative Frequency Distribution:
Cumulative frequency corresponding to a class is the sum of all the frequencies up to and including
that class. It is obtained by adding the frequency of that class to all the frequencies of the previous
classes. Cumulative frequencies are of two types:
i. Less than cumulative frequency: The number of observations up to a given value is
called the less than cumulative frequency. It is obtained by adding the frequencies of all
classes up to and including the class concerned, and gives the number of observations
below the upper boundary of that class.
ii. More than cumulative frequency: The number of observations "greater than" a value is
called the more than cumulative frequency. It is the number of observations above the
lower boundary of a class interval, obtained by adding the frequency of that class to the
frequencies of all the classes after it.
Uses of Cumulative Frequency
1. It’s used to find out the number of observations less than or more than any given value.
2. It’s used to find out the number of observations falling between any two specified values of
the variable.
3. It’s used to find median, quartiles and percentiles.
Table: Less than and more than cumulative frequency
Class     Class      Less than cumulative  More than cumulative  Class  Class
Interval  Frequency  frequency             frequency             Mark   Width
15-19     18         < 19.5    18          > 14.5    100         17     5
20-24     34         < 24.5    52          > 19.5    82          22     5
25-29     21         < 29.5    73          > 24.5    48          27     5
30-34     12         < 34.5    85          > 29.5    27          32     5
35-39     9          < 39.5    94          > 34.5    15          37     5
40-44     6          < 44.5    100         > 39.5    6           42     5
Total     100
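Both cumulative columns of the table can be reproduced from the class frequencies with running totals. The Python sketch below uses itertools.accumulate; the variable names are assumptions.

```python
from itertools import accumulate

# Class frequencies for the classes 15-19, 20-24, ..., 40-44 from the table above.
frequencies = [18, 34, 21, 12, 9, 6]

# Less than: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# More than: running total from the last class upwards, put back in class order.
more_than = list(accumulate(reversed(frequencies)))[::-1]

print(less_than)
print(more_than)
```

The last less than value and the first more than value both equal the total frequency, which is a useful check on the arithmetic.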
CONSTRUCTION OF FREQUENCY DISTRIBUTION
The following steps are involved in the construction of a frequency distribution.
(1) Find the range of the data: The range is the difference between the largest and the
smallest values.
(2) Decide the approximate number of classes into which the data are to be grouped. Where
the number of classes to be used is not given, it can be estimated using
H.A. Sturges' formula, given as:
K = 1 + 3.322 log N
where K = number of classes, N = total number of observations, and log is the logarithm to base 10.
(3) Determine the approximate class size: The size (width) of the class interval, denoted
by W, is obtained by dividing the range of the data by the number of classes:
W = Range / K
In the case of a fractional result, the next higher whole number is taken as the size of the
class interval.
(4) Decide the starting point: The lowest class limit or class boundary should cover the
smallest value in the raw data. Class intervals that are multiples of 5 are commonly used.
(5) Determine the remaining class limits (boundary): When the lowest class boundary has
been decided, you can compute the upper-class boundary by adding the class interval size
to the lower-class boundary. The remaining lower- and upper-class limits may be
determined by adding the class interval size repeatedly until the largest value of the data
falls within a class.
(6) Distribute the data into respective classes: All the observations are divided into
respective classes by using the tally bar (tally mark) method, which is suitable for
tabulating the observations into respective classes. The number of tally bars is counted to
get the frequency against each class. The frequency of all the classes is noted to get the
grouped data or frequency distribution of the data. The total of the frequency column
must be equal to the number of observations.
2. Number of Classes = 1+3.322logN
Number of Classes = 1+3.322log57
Number of Classes = 1+3.322(1.75587) = 6.833
Approximately 7 class intervals
3. Class Interval Size (W) = Range / No. of Classes = 67 / 7 = 9.57, rounded up to 10
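The arithmetic of steps 2 and 3 can be checked with a short Python sketch. N = 57 and a range of 67 are the figures used in the calculation above; the function name sturges_classes is an assumption.

```python
import math


def sturges_classes(n):
    """Sturges' estimate of the number of classes: K = 1 + 3.322 log10(N)."""
    return 1 + 3.322 * math.log10(n)


k_exact = sturges_classes(57)      # about 6.833
k = round(k_exact)                 # taken as approximately 7 classes
width = math.ceil(67 / k)          # 67 / 7 = 9.57, next higher whole number

print(round(k_exact, 3), k, width)
```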
Effect of grouping:
As a result of grouping, it is possible to detect a pattern in the figures, but grouping results in a
loss of information: calculations made from a grouped frequency distribution can never be
exact, and consequently quoting their results to many decimal places gives only spurious accuracy.
The reasons for constructing a frequency distribution are:
1. To organize the data in a meaningful, intelligible way.
2. To enable the reader to determine the nature and shape of the distribution.
3. To facilitate computational procedures for measures of average and spread.
4. To enable the researcher to draw charts and graphs for the presentation of data.
5. To enable the reader to make comparisons among different data sets.
Construction of Frequency Distribution, Relative frequency and Cumulative Relative
Frequency
Example 2:
The following data represent the percent change in tuition levels at public, four-year colleges
(inflation adjusted) from 2008 to 2013 (Weismann, 2013). Create a frequency distribution,
histogram, and ogive for the data.
19.5 40.8 57.0 15.1 17.4 5.2 13.0 15.6 51.5 15.6 14.5 22.4 19.5 31.3 21.7 27.0
13.1 26.8 24.3 38.0 21.1 9.3 46.7 14.5 78.4 67.3 21.1 22.4 5.3 17.3 17.5 36.6
72.0 63.2 15.1 2.2 17.5 36.7 2.8 16.2 20.5 17.8 30.1 63.6 17.8 23.2 25.3 21.4
28.5 9.4
Solution:
1. Find the range:
largest value - smallest value = 78.4 −2.2 =76.2
2. Pick the number of classes: Since there are 50 data points, let us use 8.
3. Find the class width:
width = range/8 = 76.2/8 = 9.525
Since the data have one decimal place, the class width should also be rounded to one decimal
place. Make sure you round up.
width = 9.6
4. Find the class limits: starting from the smallest value, add the class width repeatedly to get
the lower-class limits:
2.2 + 9.6 = 11.8; 11.8 + 9.6 = 21.4; 21.4 + 9.6 = 31.0; and so on.
Each upper-class limit is 0.1 less than the lower-class limit of the next class, e.g. 11.8 − 0.1 = 11.7.
5. Find the class boundaries:
Since the data have one decimal place, the class boundaries should have two decimal places,
so subtract 0.05 from each lower-class limit to get the lower-class boundaries. Add 0.05 to the
upper-class limit of the last class to get its upper boundary.
2.2 − 0.05 = 2.15; 11.8 − 0.05 = 11.75; 21.4 − 0.05 = 21.35; and so on.
Every value in the data should fall into exactly one of the classes. No data values should fall right
on the boundary of two classes.
6. Find the class midpoints:
midpoint = (lower limit + upper limit)/2
(2.2 + 11.7)/2 = 6.95; (11.8 + 21.3)/2 = 16.55; and so on.
7. Tally and find the frequency of the data:
Table 2.2. Frequency Distribution for Tuition Levels at Public, Four-Year Colleges
Class        Class          Class     Tally                    F   RF    CF
Limits       Boundaries     Midpoint
2.2-11.7     2.15-11.75     6.95      ||||| |                  6   0.12  6
11.8-21.3    11.75-21.35    16.55     ||||| ||||| ||||| |||||  20  0.40  26
21.4-30.9    21.35-30.95    26.15     ||||| ||||| |            11  0.22  37
31.0-40.5    30.95-40.55    35.75     ||||                     4   0.08  41
40.6-50.1    40.55-50.15    45.35     ||                       2   0.04  43
50.2-59.7    50.15-59.75    54.95     ||                       2   0.04  45
59.8-69.3    59.75-69.35    64.55     |||                      3   0.06  48
69.4-78.9    69.35-78.95    74.15     ||                       2   0.04  50
RF= Relative Frequency and CF=Cumulative Frequency
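The frequencies, relative frequencies and cumulative frequencies in Table 2.2 can be re-derived from the 50 data values of Example 2. The Python sketch below works in tenths of a percent (2.2 becomes 22, the width 9.6 becomes 96) so that the class limits are whole numbers and no rounding error can push a value into the wrong class. That choice, and the variable names, are assumptions of this sketch rather than part of the worked solution.

```python
from itertools import accumulate

tuition = [
    19.5, 40.8, 57.0, 15.1, 17.4, 5.2, 13.0, 15.6, 51.5, 15.6, 14.5, 22.4,
    19.5, 31.3, 21.7, 27.0, 13.1, 26.8, 24.3, 38.0, 21.1, 9.3, 46.7, 14.5,
    78.4, 67.3, 21.1, 22.4, 5.3, 17.3, 17.5, 36.6, 72.0, 63.2, 15.1, 2.2,
    17.5, 36.7, 2.8, 16.2, 20.5, 17.8, 30.1, 63.6, 17.8, 23.2, 25.3, 21.4,
    28.5, 9.4,
]

start, width, k = 22, 96, 8                  # 2.2, 9.6 and 8 classes, in tenths
tenths = [round(x * 10) for x in tuition]    # 19.5 -> 195, and so on

freq = [0] * k
for t in tenths:
    freq[(t - start) // width] += 1          # index of the class containing t

n = len(tuition)
rel = [f / n for f in freq]                  # relative frequency (RF)
cum = list(accumulate(freq))                 # cumulative frequency (CF)

lower = [(start + i * width) / 10 for i in range(k)]
upper = [(start + (i + 1) * width - 1) / 10 for i in range(k)]
mid = [(2 * start + (2 * i + 1) * width - 1) / 20 for i in range(k)]

for row in zip(lower, upper, mid, freq, rel, cum):
    print("{:>5.1f}-{:<5.1f} {:>6.2f} {:>3} {:>5.2f} {:>3}".format(*row))
```

The last cumulative frequency equals 50, the number of observations, confirming that every value has been placed in exactly one class.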
Tutorial Questions
1. a. Construct a frequency distribution with a suitable class interval size for the
marks obtained by a class of 50 students, as given below:
23, 50, 38, 42, 63, 75, 12, 33, 26, 39, 35, 47, 43, 52, 56, 59, 64, 77, 15, 21, 51, 54,
72, 68, 36, 65, 52, 60, 27, 34, 47, 48, 55, 58, 59, 62, 51, 48, 50, 41, 57, 65, 54, 43,
56, 44, 30, 46, 67, 53
b. Create the column of class boundaries, Class marks, %Relative frequency and
Cumulative frequency.
2. The following is the distribution of ages of new employees at a factory
a. Obtain the class boundaries and class marks of the class intervals
b. What is the upper-class limit of the class 30-39?
c. What is the lower-class limit of the class 50-59?
d. What is the class mark of the class 40-49?
e. What is the class width of the class 40-49?
f. What is the lower-class boundary of the class 30-39?
3. A medical research team studied the ages of patients who had strokes caused by stress. The
ages of 34 patients who suffered stress strokes were as follows.
29 30 36 41 45 50 57 61 28 50 36 58
60 38 36 47 40 32 58 46 61 40 55 32
61 56 45 46 62 36 38 40 50 27
Use 8 classes beginning with a lower-class limit of 25
i. Construct a frequency distribution for these ages.
ii. Create columns for class boundaries, class marks, cumulative frequency, relative
frequency and cumulative relative frequency.
iii. Draw the cumulative frequency curve.
iv. Draw the histogram and frequency polygon.