CL 3 - Data Classification and Frequency Distribution
______________________________________________________________________________
Data can be collected through primary and/or secondary sources. Whenever a set of data collected
contains a large number of observations, the best way to examine such data is to present it in some
compact and orderly form. Such a need arises because data contained in a questionnaire are in a form
which does not give any idea about the salient features of the problem under study. Such data are not
directly suitable for analysis and interpretation. For this reason the data set is organized and
summarized in such a way that patterns are revealed and are more easily interpreted. Such an
arrangement of data is known as the distribution of the data. Distribution is important because it
reveals the pattern of variation and helps in a better understanding of the phenomenon the data
present.
Classification of data
Classification of data is the process of arranging data in groups/classes on the basis of certain properties.
The classification of statistical data serves the following purposes:
(i) It condenses the raw data into a form suitable for statistical analysis.
(ii) It removes complexities and highlights the features of the data.
(iii) It facilitates comparisons and the drawing of inferences from the data. For example, if university
students in a particular course are divided according to sex, their results can be compared.
(iv) It provides information about the mutual relationships among elements of a data set. For
example, based on the literacy and criminal tendency of a group of people, it can be
established whether or not literacy has any impact on criminal tendency.
(v) It helps in statistical analysis by separating elements of the data set into homogeneous
groups and hence brings out the points of similarity and dissimilarity.
Basis of Classification
Statistical data are classified after taking into account the nature, scope, and purpose of an
investigation. Generally, data are classified on the following four bases:
Geographical Classification
In geographical classification, data are classified on the basis of geographical or locational differences,
such as cities, districts, or villages, between various elements of the data set. The following is an
example of a geographical distribution:
City                               : Mumbai  Kolkata  Delhi  Guwahati
Population density (per square km) : 654     685      423    205
Such a classification is also known as spatial classification.
Chronological Classification
When data are classified on the basis of time, the classification is known as chronological classification.
Such classifications are also called time series because data are usually listed in chronological order
starting with the earliest period. The following example would give an idea of chronological
classification:
Year : 1941 1951 1961 1971 1981 1991 2001
Population (crore): 31.9 36.9 43.9 54.7 75.6 85.9 98.6
Qualitative Classification
In qualitative classification, data are classified on the basis of descriptive characteristics or on the basis
of attributes like sex, literacy, region, caste, or education, which cannot be quantified. This is done in
two ways:
(i) Simple classification: In this type of classification, each class is subdivided into two sub-classes
and only one attribute is studied such as: male and female; blind and not blind, educated and
uneducated, and so on.
(ii) Manifold classification: In this type of classification, a class is subdivided into more than two
sub-classes which may be sub-divided further. An example of this form of classification is
shown in the box:
Quantitative Classification
In this classification, data are classified on the basis of some characteristics which can be measured such
as: height, weight, income, expenditure, production, or sales.
Quantitative variables can be divided into the following two types. The term variable refers to any
quantity or attribute whose value varies from one investigation to another.
(i) A continuous variable is one that can take any value within a range of numbers. Thus
the height or weight of individuals can be of any value within the limits. In such a case data
are obtained by measurement.
(ii) A discrete (also called discontinuous) variable is one whose values change by steps or
jumps and cannot assume a fractional value. The number of children in a family, the number of
workers (or employees), and the number of students in a class are a few examples of a discrete
variable. In such a case data are obtained by counting.
The following are examples of continuous and discrete variables in a data set:
Organizing of Data
The best way to examine a large set of numerical data is first to organize and present it in an
appropriate tabular and graphical format.
When the numerical observations of a data set are not arranged in any particular order or
sequence, the data are said to be in raw form.
These data do not highlight any characteristic/trend, such as the highest, the lowest, and the average
weekly hours. A careful look at these data does not easily expose any significant trend regarding the
nature and pattern of variations therein. As such no meaningful inference can be drawn, unless these
data are reorganized to make them more useful.
Moreover, as the number of observations gets large, it becomes more and more difficult to focus on the
specific features in a set of data. Thus we need to organize the observations so that we can better
understand the information that the data are revealing.
The raw data can be reorganized in a data array and frequency distribution. Such an arrangement
enables us to see quickly some of the characteristics of the data we have collected.
When a raw data set is arranged in rank order, from the smallest to the largest observation or vice-
versa, the ordered sequence obtained is called an ordered array.
It may be observed that an ordered array does not summarize the data in any way, as the number of
observations in the array remains the same. However, an ordered array has a few advantages, listed below.
Advantages
(i) It provides a quick look at the highest and lowest observations in the data within which
individual values vary.
(ii) It helps in dividing the data into various sections or parts.
(iii) It enables us to know the degree of concentration around a particular observation.
(iv) It helps to identify whether any values appear more than once in the array.
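The points above can be illustrated with a short sketch. The observations in `raw` are hypothetical values chosen only for illustration; they are not taken from these notes.

```python
from collections import Counter

# Hypothetical raw observations (illustrative only, not from the notes).
raw = [42, 37, 45, 40, 37, 50, 42, 39, 42, 44]

# Ordered array: smallest to largest, and the reverse ordering.
ascending = sorted(raw)
descending = sorted(raw, reverse=True)

# The extremes can be read directly from the ends of the ordered array.
lowest, highest = ascending[0], ascending[-1]

# Values that appear more than once, with the number of times they occur.
repeats = {value: n for value, n in Counter(ascending).items() if n > 1}

print(ascending)
print(lowest, highest)
print(repeats)
```

Note that the ordered array still contains all ten observations, which is why it does not by itself summarize the data.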
Disadvantages
In spite of various advantages on converting a set of raw data into an ordered array, an array is a
cumbersome form of presentation which is tiresome to construct. It neither summarizes nor organizes
the data to present them in a more meaningful way. It also fails to highlight the salient characteristics of
the data which may be crucial in terms of their relevance to decision-making.
The above task cannot be accomplished unless the observations are appropriately condensed. The best
way to do so is to arrange them into a convenient number of groupings, with the number of observations
falling in each group indicated against it. Such a tabular summary showing the
number (frequency) of observations in each of several non-overlapping classes or groups is known as a
frequency distribution (also referred to as grouped data).
Frequency Distribution
A frequency distribution divides observations in the data set into ordered classes or groups or
categories. The number of observations in each class is referred to as frequency and denoted as 𝑓.
Advantages
The following are a few advantages of grouping and summarizing raw data in this compact form:
(i) The data are expressed in a more compact form. One can get a deeper insight into the
salient characteristics of the data at the very first glance.
(ii) One can quickly note the pattern of distribution of observations falling in various classes.
(iii) It permits the use of more complex statistical techniques which help reveal certain other
obscure and hidden characteristics of the data.
Disadvantages
(i) In the process of grouping, individual observations lose their identity. It becomes difficult to
notice how the observations contained in each class are distributed. This applies more to a
frequency distribution which uses the tally method in its construction.
(ii) A serious limitation inherent in this kind of grouping is that there will be too much clustering
of observations in various classes in case the number of classes is too small. This will cause
some of the essential information to remain unexposed.
Hence, it is important that summarizing data should not be at the cost of losing essential details. The
purpose should be to seek an appropriate compromise between having too much of details or too little.
To be able to achieve this compromise, certain criteria are discussed for constructing a frequency
distribution.
The frequency distribution table of most biological variables develops a distribution which can be
compared with standard distributions such as the normal, binomial, or Poisson. Tabulation of
frequencies may be for
a) Qualitative data
b) Quantitative data
In qualitative data, there is no notion of magnitude or size of an attribute, hence the presentation of the
frequency distribution is very simple because the characteristic is not variable but discrete. In these
tables, each characteristic, such as births, deaths, attacks, etc., forms a complete group and cannot be
split into sub-groups or sub-classes because there is no range of variability and no class interval.
These events do not have fractional parts. The following tables represent frequency
distributions of this type:
Each day at a large hospital, several hundred laboratory tests are performed. The laboratory tests were
subdivided by the shift of workers who performed the lab tests. The results are as follows:
b) Quantitative Data
Presentation of quantitative data is bulkier because each characteristic has a measured magnitude or size as
well as a frequency. The data of variable characteristics such as height, weight, pulse
rate, bleeding time, etc. are continuous. They have a range from the lowest to the highest value in the data set. This
range can be divided into groups or classes. The number of units or persons falling into a particular
group or class is said to be the class frequency.
As the number of observations obtained gets large, the method discussed above to condense the data
becomes quite difficult and time consuming. Thus to further condense the data into frequency
distribution tables, the following steps should be taken:
Decide the number of class intervals. The decision on the number of class groupings depends largely on
the judgment of the individual investigator and/or the range that will be used to group the data,
although there are certain guidelines that can be used. As a general rule, a frequency distribution should
have at least five class intervals (groups), but not more than fifteen. The following two rules are often
used to decide approximate number of classes in a frequency distribution:
I. If k represents the number of classes and N the total number of observations, then the value of
k will be the smallest exponent of the number 2 such that $2^k \ge N$.
Suppose N = 30 observations. Applying this rule, we have
$2^3 = 8$ (< 30)
$2^4 = 16$ (< 30)
$2^5 = 32$ (> 30)
Thus we may choose k = 5 as the number of classes.
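The search for the smallest exponent can be sketched in a few lines. The function name `classes_by_power_of_two` is a hypothetical helper introduced here for illustration.

```python
def classes_by_power_of_two(n_obs: int) -> int:
    """Return the smallest k such that 2**k >= n_obs."""
    if n_obs < 1:
        raise ValueError("the number of observations must be at least 1")
    k = 0
    # Keep raising the exponent until 2**k reaches the number of observations.
    while 2 ** k < n_obs:
        k += 1
    return k


print(classes_by_power_of_two(30))  # N = 30 as in the text
```

For N = 30 the loop stops at k = 5, since 2^4 = 16 is still below 30 and 2^5 = 32 is not.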
II. According to Sturges' rule, the number of classes can be determined by the formula
$k = 1 + 3.322 \log_{10} N$
where k is the number of classes and $\log_{10} N$ is the common (base-10) logarithm of the total
number of observations.
Applying this rule when N = 30, we get $k = 1 + 3.322 \log_{10} 30 = 1 + 3.322 (1.4771) = 5.907 \approx 6$
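As a rough check of the arithmetic, the rule can be evaluated directly. This sketch assumes the usual statement of Sturges' rule, with the constant 3.322 (approximately 1 / log10(2)) and a base-10 logarithm; the function name `sturges_classes` is hypothetical.

```python
import math


def sturges_classes(n_obs: int) -> float:
    """Sturges' rule in its common form: k = 1 + 3.322 * log10(N).

    The constant 3.322 and the base-10 logarithm are assumptions of this sketch.
    """
    return 1 + 3.322 * math.log10(n_obs)


k = sturges_classes(30)
print(round(k, 3), round(k))
```

The result is not a whole number, so it is rounded to a convenient integer number of classes.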
When constructing the frequency distribution it is desirable that the width of each class interval should
be equal in size. The size (or width) of each class interval can be determined by first taking the difference
between the largest and smallest numerical values in the data set and then dividing it by the number of
class intervals desired.
Width of class interval (h) = $\dfrac{\text{largest numerical value} - \text{smallest numerical value}}{\text{number of classes desired}}$
The value obtained from this formula can be rounded off to a more convenient value based on the
investigator's preference.
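A minimal sketch of the width calculation follows. The observations in `values` and the choice of five classes are hypothetical, and rounding up to the next whole number is only one of several convenient rounding choices.

```python
import math


def class_width(values, n_classes):
    """Width h = (largest value - smallest value) / number of classes desired."""
    return (max(values) - min(values)) / n_classes


# Hypothetical observations (illustrative only, not from the notes).
values = [12, 47, 33, 25, 58, 41, 19, 36]

h = class_width(values, 5)   # (58 - 12) / 5
h_rounded = math.ceil(h)     # rounded up to a convenient whole number

print(h, h_rounded)
```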
The limits of each class interval should be clearly defined so that each observation (element) of the data
set belongs to one and only one class. Each class has two limits: a lower limit and an upper limit.
The usual practice is to let the lower limit of the first class be a convenient number slightly below or
equal to the lowest value in the data set.
There are two ways in which observations in the data set are classified on the basis of class intervals,
namely (i) the exclusive method, and (ii) the inclusive method.
Exclusive Method: When the data are classified in such a way that the upper limit of a class interval is
the lower limit of the succeeding class interval (i.e. no data point falls into more than one class interval),
then it is said to be the exclusive method of classifying data. This method is illustrated in Table below
Exclusive Method of Data Classification
Such classification ensures continuity of data because the upper limit of one class is the lower limit of
the succeeding class. As shown in the Table, 5 companies declared dividends ranging from 0 to 10 per cent. This
means that a company which declared exactly 10 per cent dividend would not be included in the class 0–10
but would be included in the next class 10–20.
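The boundary rule of the exclusive method can be sketched as follows. Only the classes 0–10 and 10–20 are named in the text; the remaining class limits and the helper name `exclusive_class` are assumptions made for illustration.

```python
from bisect import bisect_right

# Class limits for exclusive classes 0-10, 10-20, 20-30, 30-40.
# Only the first two classes appear in the text; the rest are hypothetical.
limits = [0, 10, 20, 30, 40]


def exclusive_class(value):
    """Return the (lower, upper) class with lower <= value < upper."""
    if not limits[0] <= value < limits[-1]:
        raise ValueError("value lies outside the class limits")
    # bisect_right places a value equal to a limit in the class that starts there.
    i = bisect_right(limits, value) - 1
    return limits[i], limits[i + 1]


print(exclusive_class(9.99))
print(exclusive_class(10))
```

A dividend of exactly 10 per cent is therefore assigned to the class 10–20, never to 0–10.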
Inclusive Method: When the data are classified in such a way that both lower and upper limits of a class
interval are included in the interval itself, then it is said to be the inclusive method of classifying data.
This method is shown in the following table.
Remarks:
1. An exclusive method should be used to classify a set of data involving continuous variables and an inclusive
method should be used to classify a set of data involving discrete variables.
2. If a continuous variable is classified according to the inclusive method, then a certain adjustment in the class
intervals is needed to obtain continuity, as shown in the Table below. To make the adjustment, find the correction
factor x = (lower limit of the second class − upper limit of the first class) ÷ 2,
and then subtract it from the lower limits of all the classes and add it to the upper limits of all the classes.
From the above Table, we have x = (45 – 44) ÷ 2 = 0.5. Subtract 0.5 from the lower limits of all the
classes and add 0.5 to the upper limits. The adjusted classes would then be as shown in Table below
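The adjustment can be sketched as below. Only the limits 44 and 45 come from the text; the other class limits in `inclusive` are hypothetical values chosen to complete the illustration.

```python
# Inclusive classes; only the limits 44 and 45 are taken from the text,
# the remaining limits are hypothetical.
inclusive = [(30, 44), (45, 59), (60, 74), (75, 89)]

# Correction factor: half the gap between the upper limit of the first class
# and the lower limit of the second class.
x = (inclusive[1][0] - inclusive[0][1]) / 2

# Subtract x from every lower limit and add it to every upper limit.
adjusted = [(lower - x, upper + x) for lower, upper in inclusive]

print(x)
print(adjusted)
```

After the adjustment the upper boundary of each class coincides with the lower boundary of the next, so the classes are continuous.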
Example 1) A computer company received a rush order for as many home computers as could be
shipped during a six-week period. Company records provide the following daily shipments:
22 65 65 67 55 50 65
77 73 30 62 54 48 65
79 60 63 45 51 68 79
83 33 41 49 28 55 61
65 75 55 75 39 87 45
50 66 65 59 25 35 53
Group these daily shipment figures into a frequency distribution having a suitable number of classes.
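One possible grouping is sketched below; it is not necessarily the grouping intended by the example. The sketch assumes the number of classes is chosen by the rule 2^k ≥ N, uses inclusive classes because shipments are counts, and takes the smallest whole-number width that lets k classes cover the range.

```python
shipments = [
    22, 65, 65, 67, 55, 50, 65,
    77, 73, 30, 62, 54, 48, 65,
    79, 60, 63, 45, 51, 68, 79,
    83, 33, 41, 49, 28, 55, 61,
    65, 75, 55, 75, 39, 87, 45,
    50, 66, 65, 59, 25, 35, 53,
]

n = len(shipments)

# Number of classes: smallest k with 2**k >= N (an assumed choice of rule).
k = 0
while 2 ** k < n:
    k += 1

lowest, highest = min(shipments), max(shipments)

# Smallest whole-number width such that k inclusive classes cover the range.
width = (highest - lowest) // k + 1

# Tally each observation into its class.
freq = [0] * k
for s in shipments:
    freq[(s - lowest) // width] += 1

# Inclusive class limits, e.g. 22-32, 33-43, ...
classes = [(lowest + i * width, lowest + (i + 1) * width - 1) for i in range(k)]

for (lower, upper), f in zip(classes, freq):
    print(f"{lower}-{upper}: {f}")
```

A different starting point (for example 20) or a rounder width (for example 10) would give an equally acceptable table with slightly different frequencies.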
Example 2): A firm increases the D.A. in the salaries of its employees at the following rates.
Rs 250 for the salary range up to Rs 4749
Rs 260 for the salary range from Rs 4750
Rs 270 for the salary range from Rs 4950
Rs 280 for the salary range from Rs 5150
Rs 290 for the salary range from Rs 5350
No increase of D.A for salary of Rs 5500 or more. What will be the additional amount required to be paid
by the firm in a year which has 32 employees with the following salaries (in Rs)?
5422 4714 5182 5342 4835 4719 5234 5035
5085 5482 4673 5335 4888 4769 5092 4735
5542 5058 4730 4930 4978 4822 4686 4730
5429 5545 5345 5250 5375 5542 5585 4749
Solution: Performing the actual tally and counting the number of employees in each salary range (or
class), we get the following frequency distribution as shown in table below:
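The tally and the resulting amount can be reproduced with the sketch below. It rests on two assumptions that the example does not state outright: each salary range ends just below the start of the next one (so the last range runs from Rs 5350 to Rs 5499), and the D.A. increase is paid monthly, so the yearly amount is twelve times the monthly total.

```python
salaries = [
    5422, 4714, 5182, 5342, 4835, 4719, 5234, 5035,
    5085, 5482, 4673, 5335, 4888, 4769, 5092, 4735,
    5542, 5058, 4730, 4930, 4978, 4822, 4686, 4730,
    5429, 5545, 5345, 5250, 5375, 5542, 5585, 4749,
]

# (lower limit, upper limit, D.A. increase in Rs); the upper limits of the
# last four ranges are assumed from the start of the following range.
slabs = [
    (0, 4749, 250),
    (4750, 4949, 260),
    (4950, 5149, 270),
    (5150, 5349, 280),
    (5350, 5499, 290),
]

counts = [0] * len(slabs)
no_increase = 0  # employees with a salary of Rs 5500 or more

for s in salaries:
    for i, (lower, upper, _) in enumerate(slabs):
        if lower <= s <= upper:
            counts[i] += 1
            break
    else:
        no_increase += 1

monthly_total = sum(c * da for c, (_, _, da) in zip(counts, slabs))
yearly_total = 12 * monthly_total  # assumes the D.A. is paid every month

print(counts, no_increase)
print(monthly_total, yearly_total)
```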
If the data corresponding to one variable, say x, are grouped into m classes and the data corresponding to
another variable, say y, are grouped into n classes, then the bivariate frequency table will have m × n cells.
The frequency distribution of variable x for a given value of y is obtained from the values of x, and vice
versa. Such frequencies in each cell are called conditional frequencies.
The frequencies of the values of variables x and y together with their frequency totals are called the
marginal frequencies.
Example 2): The following data give the points scored in a tennis match by two players X and Y at the
end of twenty games:
(10, 12) (7, 11) (7, 9) (15, 19) (17, 21) (12, 8) (16, 10) (14, 14) (22, 18) (16, 7)
(15, 16) (22, 20) (19, 15) (7, 18) (11, 11) (12, 18) (10, 10) (5, 13) (11, 7) (10, 10)
Taking class intervals as: 5–9, 10–14, 15–19 . . ., for both X and Y, construct
a) Bivariate frequency table.
b) Conditional frequency distribution for Y given X > 15
Solution: a) The two-way frequency distribution is shown in the Table below.
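The cell, marginal and conditional frequencies for this example can be tallied with the sketch below. The helper `class_label` is hypothetical, and the condition X > 15 is applied here to the raw scores, which is one reading of part b); applying it to whole classes would give a different count.

```python
from collections import Counter

points = [
    (10, 12), (7, 11), (7, 9), (15, 19), (17, 21),
    (12, 8), (16, 10), (14, 14), (22, 18), (16, 7),
    (15, 16), (22, 20), (19, 15), (7, 18), (11, 11),
    (12, 18), (10, 10), (5, 13), (11, 7), (10, 10),
]


def class_label(v, start=5, width=5):
    """Label of the inclusive class 5-9, 10-14, 15-19, ... containing v."""
    lower = start + ((v - start) // width) * width
    return f"{lower}-{lower + width - 1}"


# Cell frequencies of the bivariate table, keyed by (X class, Y class).
cells = Counter((class_label(x), class_label(y)) for x, y in points)

# Marginal frequencies: totals for each X class and each Y class.
x_marginal = Counter(class_label(x) for x, _ in points)
y_marginal = Counter(class_label(y) for _, y in points)

# Conditional distribution of Y, restricted to raw scores with X > 15.
y_given_x_gt_15 = Counter(class_label(y) for x, y in points if x > 15)

print(dict(x_marginal))
print(dict(y_marginal))
print(dict(y_given_x_gt_15))
```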
equal to the class frequency to which it relates. But in the 'more than' cumulative frequency
distribution, the frequencies of each class interval are added successively from bottom to top and
represent the cumulative number of observations greater than or equal to the lower limit of the class to which
it relates. The frequency distributions given in the following tables illustrate the concept of cumulative
frequency distribution:
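The two directions of accumulation can be sketched as follows. The classes and frequencies are hypothetical and serve only to show how the running totals are formed.

```python
from itertools import accumulate

# Hypothetical classes and frequencies (illustrative only).
classes = ["0-10", "10-20", "20-30", "30-40"]
freq = [5, 12, 9, 4]

# 'Less than' type: add frequencies successively from top to bottom.
less_than = list(accumulate(freq))

# 'More than' type: add frequencies successively from bottom to top.
more_than = list(accumulate(reversed(freq)))[::-1]

for c, f, lt, mt in zip(classes, freq, less_than, more_than):
    print(c, f, lt, mt)
```

The last 'less than' value and the first 'more than' value both equal the total number of observations.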
Example 3) Convert the following frequency distribution of diastolic blood pressure for persons in the
age group 25-34 into a corresponding
i. percentage frequency distribution and
ii. percentage cumulative frequency distribution
Blood pressure   Number of males   Number of females
70-75            6                 8
76-80            18                12
81-85            46                25
86-90            17                8
91-95            6                 2
96-100           2
Total            95                55
Solution:
i. The cumulative frequency (c.f.) and percentage cumulative frequency distributions of diastolic blood
pressure, by sex, for persons in the age group 25-34 are shown below:
Blood      Number     c.f.      c.f. males   Number       c.f.        c.f. females
pressure   of males   (males)   (%)          of females   (females)   (%)
70-75      6          6         6.3          8            8           14.5
76-80      18         24        25.3         12           20          36.4
81-85      46         70        73.7         25           45          81.8
86-90      17         87        91.6         8            53          96.4
91-95      6          93        97.9         2            55          100.0
96-100     2          95        100.0
Total      95                                55
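The percentages in this example can be recomputed with the sketch below. The blank female entry for the class 96-100 is treated as a frequency of 0, which is an assumption; the helper name `percentage_tables` is hypothetical.

```python
from itertools import accumulate

males = [6, 18, 46, 17, 6, 2]
females = [8, 12, 25, 8, 2, 0]  # blank entry for 96-100 assumed to be 0


def percentage_tables(freq):
    """Return (percentage frequencies, percentage cumulative frequencies)."""
    total = sum(freq)
    pct = [round(100 * f / total, 1) for f in freq]
    cum_pct = [round(100 * c / total, 1) for c in accumulate(freq)]
    return pct, cum_pct


male_pct, male_cum_pct = percentage_tables(males)
female_pct, female_cum_pct = percentage_tables(females)

print(male_pct)
print(male_cum_pct)
print(female_cum_pct)
```

The first list answers part i for males (the percentage frequency of each class), while the cumulative lists correspond to the c.f. (%) columns of the table.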
Prepared by : Dr Anil K Bhatia, Associate Professor (Statistics), The Assam Kaziranga University, Jorhat