Chapter 4 Organizing Data, Classification and Tabulation
Chapter 4 Organizing Data, Classification and Tabulation
4.1 Introduction
Data in itself is not information. We need to summarize and present data in useful ways to
support insight and help take effective decisions. Classification is the process of arranging
things (either actually or notionally) in groups or classes according to their resemblances
and affinities, and gives expression to the unity of attributes that may subsist amongst a
diversity of individuals. It serves the following purposes:
eliminates unnecessary details,
brings out clearly the points of similarity and dissimilarity, and
allows comparisons and drawing of inferences.
The first step in the process of classification is to select the basis of classification. Statistical
facts are classified according to their characteristics. Thus, the students of a college may be
classified according to their marital status, height, religion, etc. When a particular
characteristic has been chosen for this purpose, the next step in the process of classification
would be to note the similarity and dissimilarity as regards this chosen characteristic in the
various items. Items that would be alike in respect of this characteristic will be grouped
together. Thus, if the students are to be classified according to their marital status, all
married students would be put into one group and all unmarried in another. If the students
are classified on the basis of religion, there will be different groups for Hindus, Muslims,
Sikhs, etc. When the classification is made according to heights, each group will include only
those students whose heights lie within a certain range. It will be noted that the three
different characteristics (marital status, height and religion) give us groups that are
significantly different from one another. Thus, in the first case we have groups where the
characteristic is either present or absent, e.g., married or unmarried. In the second we have
groups where the characteristic is of differing quality e.g., students may be Hindus, Muslims
or Sikhs.
The characteristics of a population may be broadly divided into two categories: attributes
and variables. Attributes are qualitative characteristics which are not capable of being
described numerically, e.g., sex, nationality, colour of eye, etc. These characteristics are
called attributes or descriptive characteristics. When classification is to be made on the basis
or attributes, groups are differentiated either by the presence and absence of the attribute
(e.g., married or not married), or by its differing qualities. The qualities of an attribute can
easily be differentiated by means of some natural or physical line of demarcation, and their
natural differences determine the group into which a particular item is to be placed. Thus, if
we select colour of eye as the basis of classification, there will be a group of brown-eyed
people, another of blue-eyed people, and so on.
Data is organized by creating summaries in the form of tables and are, more often than not,
presented in the form of visualizations. Large amounts of data can be reviewed rapidly by
visual summaries in the form of graphs and charts. Such visualizations reveal significant
patterns to the data. More on visualization in the next chapter.
Because the methods used to organize categorical variables differ from the methods used to
organize and visualize numerical variables, this chapter discusses methods for classification
for numerical and categorical variables in separate sections.
It may be noted that creating tabular summaries may often risk distorting the information
that they present and undermining the usefulness of those summaries for decision making.
Surely, a summary of any kind means that some details have been suppressed. Resulting
information can end up being distorted if the class intervals are not chosen appropriately.
Table 4.5 Lower limit excluded classes Table 4.6 Upper limit excluded classes
size of either the upper limit or the lower limit are excluded from the frequency of that
class. These may be either of the lower limit excluded type as given in Table 4.5, or the
upper limit excluded type as given in Table 4.6.
When the variable involved is a continuous one, i.e., when the variable takes values on a
continuous scale, the meaning of the class limit becomes slightly different. Now, we cannot
have the limits as in Tables 4.4 to 4.6. The intervals must be such that they cover the entire
range and therefore there cannot be any gaps. When classifying weights of pre-teen kids,
we cannot use classes like 31-35, 36-40, 41-45, etc. (all in kg). This is so, because the weights
are not restricted to whole numbers. A weight of 40.6 kg will not fit any of these classes. A
little thought will reveal that only such classes as 30-35, 35-40, 40-45, etc. will do. Here any
weight less than 35 will be included in the first class, and any greater than 35 in the second
class. How about a weight of exactly 35 kg? Theoretically, a weight of exactly 35 kg has very
little chance of occurring. If it was measured accurately enough, maybe we can find that it is
1 gm (or even less) on one side or the other. But usually these weights-are not measured to
such accuracy, and if we get one that is reported at 35 kg. Where do we place it?
There are at least three courses that are available to a statistician: (1) include half an item in
the class 30-35 and the other half in 35-40, (2) toss a coin to decide, and (3) make a thumb
rule to include in the higher class the first time such an item occurs, and in the lower class
when it occurs the second time. The first option is perhaps the best statistically, but the idea
of half a boy in 35-40 class is diverting, and is best avoided. The second option suffers from
the fact that it will not permit rechecking the results. The third, or some variant of it, is
usually employed.
Illustration 4.1
From the following observations prepare a frequency distribution in ascending order
starting with the class of 5-10 (exclusive method):
Marks in English:
12 36 40 30 28 20 19 10 10 16
19 27 15 26 20 19 7 45 33 21
56 37 6 20 11 17 37 30 20 5
Solution:
Note the variable involved is a discrete one. We chose the 'exclusive' type of limits. The
resulting distributions with upper limit excluded and with lower limit excluded are as given in
Table 4.7.
Table 4.7
Table 4.8
Apparent Real class
class limits limits Frequency
43-47 42.5-47.5 3
48-52 47.5-52.5 3
53-57 52.5-57.5 6
58-62 57.5-62.5 8
63-67 62.5-67.5 6
68-72 67.5-72.5 8
73-77 72.5-77.5 3
78-82 77.5-82.5 3
N=30
The example below illustrates the procedure for calculating the percentile rank for a given
value of the variable.
Illustration 4.3
Let us determine from the data of Table 4.14 the percentile rank of an individual handloom
employee who earns ₹ 5,300 per week. This earning falls in the class ₹ 5000 – 6000. The
percentage of employees earning less than ₹ 6,000 (the upper limit of the class) is 65 and of
those earning less than ₹ 5,000 is 27. So, the percentile rank for one earning 5,300 is
between these two limits, 27 and 65. In the absence of any other information, we assume
that the earnings of 100 employees within this band of ₹ 5,000 – 6,000 are uniformly
distributed. We can, thus, use linear interpolation as illustrated in Fig. 4.1.
6000−5300
PRX ¿ 27+ × ( 65−27 )=53.6
6000−5000
Fig. 4.1 Calculating the Percentile Rank for Weekly Earnings of ₹ 5,300 in Data of Table
4.14
The following formula has been used to calculate the percentile rank for a variable value of
X:
( ULn −X )
PR X =PRUL + ×d (4.1)
i
where: n is the serial number of the class in which the required percentile is located,
PR UL is the percentile rank of the upper limit of this nth class,
d is percentage of all items contained within the nth class interval,
UL n is the upper limit of the nth class interval, and
i is the width of the nth class interval.
Calculating percentile point
As stated above, percentile point (or more simply, percentile) is the point on the
measurement scale below which a given percentage of the cases falls. We use the following
to illustrate its approximate calculation.
Illustration 4.4
Consider the marks obtained out of a maximum of 300 in the Joint Entrance Examination -
Main (JEE). In one particular year 10,25,029 candidates appeared in the examination. Table
4.15 shows the distribution of the candidates according to the scores obtained.
Table 4.15 Distribution of Marks Obtained in JEE – Main
Class interval Real limits
Lower Upper Lower Upper Frequency Less than Percentile rank
limit limit limit limit upper limit at upper limit
275 300 274.5 300 106 10,25,029 100.00
250 274 249.5 274.5 418 10,24,923 99.99
225 249 224.5 249.5 1,474 10,24,505 99.95
200 224 199.5 224.5 2,669 10,23,031 99.81
175 199 174.5 199.5 6,079 10,20,362 99.54
150 174 149.5 174.5 10,499 10,14,283 98.95
125 149 124.5 149.5 16,929 10,03,784 97.93
100 124 99.5 124.5 27,584 9,86,855 96.28
75 99 74.5 99.5 63,571 9,59,271 93.58
50 74 49.5 74.5 97,188 8,95,700 87.38
25 49 24.5 49.5 4,07,563 7,98,512 77.90
0 24 0 24.5 3,90,949 3,90,949 38.14
This is the starting point for determining percentiles. Suppose that we want to find the value
of 90th percentile, denoted as P9. It is, by definition, the point on the variable scale below
which 90% of the actual scores are. The first step in the process is to determine the class in
which this percentile is located. From the last column of Table 4.15, we note that 93.58% of
case are below the real upper limit of class 75 - 99, that is, above 99.5 marks, and that
87.38% of case are below the real upper limit of class 50 – 74 (which is also the real lower
limit of the class 75-99), that is, above 74.5 marks. This clearly means that 90 th percentile
lies within the class with real limits as 74.5 and 99.5.
At this point, it is not clear what score value we should assign because the point we want
lies somewhere within this interval. There are 63,571 scores in this interval. We will assume
that the scores are uniformly distributed throughout the interval. This assumption underlies
the procedure termed as linear interpolation, the same that was used for locating percentile
rank in Illustration 4.3.
P90 is the marks obtained by the candidate below which there are 90% of the total
candidates, which are the marks obtained by the candidate who has
90
10 , 25,029 × =9 , 22,526 candidates above him.
100
Fig. 4.2 shows the linear interpolation calculations using this assumption of uniform
Fig. 4.2 Location of the 90th Percentile Point in the Data of Table 4.15
distribution within this class. The value of the 90 th percentile will be located at a point
2,05,005 scores up from the bottom of the distribution. Because there are 8,95,700 cases
above the lower limit of this class interval, we must come up ( 9 , 22,526−8 , 95,700 ) more to
reach P90. This means that we must come up ( 9 , 22,526−8 , 95,700 ) out of the
(9 , 59,271−8 ,95,700) equal parts in the interval’s real width of 25 marks. We add this
quantity to the lower limit of the interval, which is 50.5. Thus, we get P90 ¿ 85.05.
What we did above can be converted into the following formula to calculate Pth percentile:
( c . f for the percentile−c . f for ( n−1 ) th class )
P=¿+i×
frequency of the class
(4.2)
where:
n represents the class number in which the required percentile is located,
LL is the real lower limit of this nth class interval,
i is width of this class interval,
c . f for the percentile is the number of items lying below the percentile, and
c . f for ( n−1 ) th class is the number of scores lying below LL.
Table 4.17. Outlay for Village and Small Industries in Public Sector, 2022-23
(in thousand crores of rupees)
Industry Expenditure
Small scale
Industry 39.35
Industrial Estates 7.58
Handloom
Industry 14.05
Village Industry 89.33
Coir Industry 1.79
In cases, where more than one attribute is studied, resulting in a subdivision of classes, the
classification is known as manifold. Thus, the population of a city may be divided into
literate and illiterate. Literate persons may again be divided into literate males and literate
females. The following illustration depicts an example of manifold classification:
Table 4.18 Distribution of Buildings in a District According to Habitation
(in thousands)
Under
District Inhabited Uninhabited Construction Total
Administrative 571 40 5 616
Other Urban 4,064 285 45 4,394
Rural 1,625 124 12 1,761
Total 6,260 449 62 6,771
When it is desired to represent three or more characteristics in a single table, such a table is
called higher-order table. Thus, if it is desired to represent the age, sex and course, of the
students, the table would take the form as shown in Table 4.19, and would be called a
higher order table.
Table 4.19 Skelton Table Showing Distribution of Students in a High School According to Age,
Sex and Course
Course
Arts Science Commerce
Age in years Male Female Male Female Male Female
14-15
15-16
16-17
17 and over
Illustration 4.5
In a trip organized by a college, there were 80 persons, each of whom paid ₹ 202.50 on an
average. There were 60 students, each of whom paid ₹ 200. Members of teaching staff were
charged at a higher rate. The number of helpers (all males) was six, and they were not
charged anything. The number of women was 20 per cent of the total, and there was only
one women staff member. Tabulate this information.
Solution:
Table 4.20 shows the data. Noting (G) denotes given information. Numbers in brackets
denote the sequence in which information is obtained from the given information. Final
calculation is (9), the rate of contribution of teachers which was determined as ₹ 300.
Sex Total
Rate of
Totals contribution
Female Male contribution
Students (4) 15 (5) 45 (G) 60 (G) 200 (7) 12,000
Teaching staff (G) 1 (2) 13 (1) 14 (9) 300 (8) 4,200
Helpers (G) 0 (G) 6 (G) 6 (G) 0 (G) 0
Totals (G) 16 (3) 64 (G) 80 (G) 202.50 (6) 16,200
Illustration 4.6
Classify the data of Table 4.1 using EXCEL.
Solution:
Frequency distributions are constructed using the Histogram module in the Data Analysis
package1 of the EXCEL.
Open a sheet and copy the data onto it. Then determine the minimum and maximum
values of the variables in this data. For this you use the min and max function commands by
typing =max(Range of cells in which data is located).
1
There is a good video of the process titled Use Excel 2016 to make Frequency distribution and Histogram for
quantitative data by Kwai Chan at https://www.youtube.com/watch?v=Giewd9yH4q0
Next, we have to decide on the class intervals. The minimum and maximum values suggest
using class intervals 61-65, 66-70, etc., till 85-90, a total of 6 classes. The real limits of these
are 60.5-65.5, 65.5-70.5, etc.
EXCEL uses the terminology bin for classes. The bins are specified by the upper limits of the
classes they represent. We enter the upper limits of the six classes as shown in the Screen
shot 4.1 here. We are now ready for using the histogram module.
Screen Shot 4.1
Under the Data tab, choose Data Analysis package. Then choose Histogram module in the
pop-up menu. A form titled Histogram is displayed as shown in Screen shot 4.2a. Fill in the
Input range (A1:K4 here), and the Bin range (A10:A150). Fill in the Output range too. Here
we want the output to be displayed at Cell C9. On pressing OK, we get the output table
starting at cell C9 as shown in Screen shot 4.2b
Screen Shot 4.2
If we had selected the chart output in the histogram form, we would have got a plot of this
data as well. (That is why this software module is named Histogram). The resulting table
can be edited to insert proper class intervals in place of bins.
Illustration 4.7
Thirty kids in a primary school were surveyed to determine the connection between intake
of fast food and the general heath. The data is shown in the Screen shot 4.3. The codes for
intake and general health are shown on the sheet.
Screen Shot 4.3
Illustration 4.8
Use PSPP for tabulating the data given in Table 4.2.
Solution:
After we open the saved data file (with extension .sav), choose the tab Descriptive Statistics
> Descriptive Statistics > Frequencies. A form titled Frequencies as shown in Screen Shot 4.5
pops up
2
There are many good videos on YouTube to explain this procedure in details. One such video is: Creating
Contingency Tables in Excel by Erik Heineman at https://www.youtube.com/watch?v=hpiI_HZfmIY
3
A very good video tutorial on YouTube on this topic:
Toussaint, L, Frequencies Analysis in SPSS: https://www.youtube.com/watch?v=JNGI_-n3dKo
Screen Shot 4.5
Here the variables are listed in the panel on the left from which we can select the variables
that we want to tabulate. There is only one variable here. We select it and press the arrow
key to transfer it to the Variable(s) panel on the right. We can select the statistics that we
want as the output. Since we are only interested in tabulation here, we can deselect all of
them, or leave Minimum and Maximum as selected. We could also choose what charts to
plot. We will not do so here. By pressing the Frequency Tables button, we can choose some
characteristics of how the data is plotted, like whether in ascending or descending order. On
pressing OK, a new window titled Output – PSPPIRE Output Viewer opens. It contains the
output of the operation and is shown in Screen Shot 4.6. The table shows the frequencies,
Percent, Valid Percent (which is the percent when missing data are excluded from the
calculations), and the cumulative percent.
Screen Shot 4.6
We next illustrate creating grouped-frequency tables. This is a bit more complicated, since
there is no direct way to do so.
Illustration 4.9
Tabulate the data of Table 4.2 as a grouped-frequency distribution.
Solution:
Since there is no direct way to group data in PSPP, the trick lies in transforming the variable
values, with one value for a group4. For this we use the Transform tab. Let us say we group
the data of Table 4.2 into six groups, namely 61-65, 66-70, 71-75, 76-80, 81-85, and 86-90 with
mid-points at 63, 68, 73, 78, 83 and 87, respectively. Let us name these groups by their mid-points.
After selecting the tab Transform, select Recode into Different Variables within the pop-up. A new
screen opens up as shown in Screen Shot 4.7. The panel on left shows the name of variables. There
is only one variable, Marks, which we select. Press the arrow to transfer it to the right panel. Then
under the group Output Variable, enter a name for the grouped variable. We have entered it as
simply GroupedVariable. The same is entered as label, and Old and New Values button at the bottom
is pressed. A form titled Recode into Different Variable: Old and New Values opens up, as shown in
Screen Shot 4.8, ready to accept the desired transformation.
Screen Shot 4.7
4
A video tutorial on this subject: Dr J., Grouped Frequency Table Trick in PSPP SPSS:
https://www.youtube.com/watch?v=-3Vq2MP6hK8
Screen Shot 4.9
We select the Range radio button and enter the limits of the first group, 61 through 65, and
enter 63, the mid-value of this group into New Value field, and press Add. The old and the
new values show up in the window on the right. Repeat this for each of the eight groups
that we have. After all groups show up, press Continue to go back to the earlier form Recode
into Different Variable. Press Change and then OK. The Data view of PSPP opens up with the
new variable GroupMarks values for each item (Screen Shot 4.9).
If we construct a frequency table now (using the procedure explained in Illustration 4.8, all
the cases for class 61-68 would be listed against the mid-value 63. It is useful, then, to label
63 as 61-65, and similarly the other mid-values. This is achieved by opening the variable
view of PSPP, and click on the value label cell against the variable GroupVariable. A form
titled Value Labels opens up (Screen Shot 4.10).
Screen Shot 4.10
The grouped frequencies can now be obtained using the process outlined in Illustration 4.8.
The navigation is the usual: Analyze > Descriptive Statistics > Frequencies and the choosing
GroupedVariable. The output is as shown in Screen Shot 4.11.
Screen Shot 4.11
Concepts Introduced
It is easier to understand the meaning of a large number of observations when they are
ordered and are grouped. A frequency distribution shows the number of observations for
various groups. Although putting scores into class intervals is convenient, we lose the
detailed information. For this reason, class intervals should not be too wide. A good
compromise is to use between 5 and 15 intervals.
We can construct distributions to show actual frequency (how many?) or relative frequency
(what percentage of the whole?). Relative frequency distributions are usually best for
comparing two or more distributions containing different numbers of cases.
The cumulative frequency distribution gives the number of scores below the upper real limit
of each score interval.
A percentile rank is the percentage of cases falling below a given point on the measurement
scale. Thus, it allows us to see how an individual has performed relative to the entire group.
A percentile is a point along the measurement scale below which a specified percentage of
the cases in the distribution falls.
The following formula can be used to calculate the percentile rank for a variable value of X:
( ULn −X )
PR X =PRUL + ×d (4.1)
i
where:
n is the serial number of the class in which the required percentile is located,
PRUL is the percentile rank of the upper limit of this nth class,
d is percentage of all items contained within the nth class interval,
UL n is the upper limit of the nth class interval, and
i is the width of the nth class interval.
Percentile point (or more simply, percentile) is the point on the measurement scale below
which a given percentage of the cases falls. The following formula can be used to calculate
the Pth percentile:
( c . f for the percentile−c . f for ( n−1 ) th class )
P=¿+i× (4.2)
frequency of the class
where:
n represents the class number in which the required percentile is located,
LL is the real lower limit of this nth class interval,
i is width of this class interval,
c . f for the percentile is the number of items lying below the percentile, and
c . f for ( n−1 ) th class is the number of scores lying below LL.
Categorical variables are organized by tallying values for the variable by categories and
tabulating the results. If the data are classified on the basis of one attribute only, the
process is termed as simple classification. These are also known as summary tables.
In cases, where more than one attribute is studied resulting in a subdivision of classes, the
classification is known as manifold.
Frequency tables in EXCEL can be calculated by using navigation Data> Data Analysis>
Histogram. EXCEL uses the terminology bin for classes. The bins are specified by the upper
limits of the classes they represent. We enter the upper limits of the various classes in a
column, and use it according to the procedure described in Sec. 4.8.
In PSPP we use the navigation Descriptive Statistics > Descriptive Statistics > Frequencies.
Since there is no direct way to group data in PSPP, the trick lies in transforming the variable
values, with one value for a group. For this we use the Transform tab. The process is
explained in Illustration 4.9 above.
4.23 A class of 32 students obtained the following marks in 2020 and 2021.
Unmarried
Unmarried
Unmarried
Unmarried
Total
Married
Married
Married
Married
s
M F M F M F M F M F M F M F M F
Literate
Illiterat
e