Lecture Notes STA 121 - NEW

The document provides an overview of descriptive statistics, emphasizing its importance in decision making across various fields. It defines key concepts such as population, sample, measurement, parameter, and statistic, illustrating their relationships with practical examples. Additionally, it highlights the roles of statistics in scientific research, data collection, analysis, and interpretation.

LECTURE NOTES

DESCRIPTIVE STATISTICS (STA 121)


1. INTRODUCTION TO STATISTICS
1.0 Origin of Statistics
The word “Statistics” appears to have been derived from the Latin word Status or the Italian word Statista, both meaning a “manner of standing” or “position”.

1.1 Nature of Statistics


Decision making is part of everyday life. Every moment there is some kind of decision to make: what to eat, what to wear, which course to offer in a higher institution, what to buy, and so on. Decisions, however, are made on the basis of numerical information, supplied or implied. The aspect of decision making which has to do with numerical information is Statistics. Just as no article can be produced without the relevant raw materials, so also an establishment cannot arrive at any accurate, reasonable, reliable and acceptable decision on a policy without relevant statistical data. The nature of statistics concerns whether it is a science or an art. As an art, it makes use of data to solve the problems of real life. As a science, it studies numerical data in a systematic manner.
Why do we need to study statistics? The study of Statistics is essential for the following
reasons:
• Knowledge of statistics provides you with the necessary tools and conceptual foundations in quantitative reasoning to extract information intelligently from an array of data.
• Statistical methods and analyses are often used to communicate research findings and
to support hypotheses and give credibility to research methodology and conclusions.
• It is important for researchers and also consumers of research to understand statistics
so that they can be informed, evaluate the credibility and usefulness of information,
and make appropriate decisions.
1.2 Importance of Statistics in Different Fields of Study
Since Statistics is a scientific method of decision making, it has become a useful tool in the
world’s affairs. These days statistical methods are applicable everywhere. There is no field of
study in which statistical methods are not applied. According to A. L. Bowley, “A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances”. The importance of the statistical science is increasing in almost all spheres of knowledge, e.g., astronomy, biology, meteorology, demography, economics and mathematics.
Economic planning without statistics is bound to be baseless. Statistics serve in administration and facilitate the formulation of new policies. Financial institutions and investors utilize statistical data to summarize past experience. Statistics are also helpful to an auditor when he uses sampling techniques or test checking to audit the accounts of his client. In short, its application is found in all disciplines of human endeavour where reasonable decisions are important.
Statistics now holds a central position in almost every field, such as industry, commerce, trade, physics, chemistry, economics, mathematics, biology, botany, psychology, astronomy and information technology, so the application of statistics is very wide. Specialties have evolved to apply statistical theory and methods to various disciplines. Some of these fields of application are described below.
• Astrostatistics is the discipline that applies statistical analysis to the understanding of
astronomical data.
• Biostatistics is a branch of biology that studies biological phenomena and observations
by means of statistical analysis, and includes medical statistics.
• Econometrics is a branch of economics that applies statistical methods to the
empirical study of economic theories and relationships.
• Business analytics is a rapidly developing business process that applies statistical methods to data sets to develop new insights and understanding of business performance and opportunities.
• Environmental statistics is the application of statistical methods to environmental science. Weather, climate, air and water quality are included, as are studies of plant and animal populations.
• Statistical mechanics is the application of probability theory, which includes mathematical tools for dealing with large populations, to the field of mechanics, which is concerned with the motion of particles or objects when subjected to a force.
• Statistical physics is one of the fundamental theories of physics, and uses methods of probability theory in solving physical problems.
• Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in the insurance and finance industries.

Advantages of Primary Data
• It gives more accurate answers.
• The enumerator has control over the data collection process; he selects the mode and method of collection.
• Errors arising during collection can be identified and corrected.
• It meets the specific researcher's needs.

Disadvantages of Primary Data
• It is expensive to collect.
• It is time consuming.
• It involves greater cost and effort.

Advantages of Secondary Data
• It involves less cost and effort.
• It is less time consuming.
• It is easily accessible, and the user can consult other documents.
• It has a pre-established degree of validity and reliability.

Disadvantages of Secondary Data
• The user is unaware of the problems encountered during collection, so the data may be highly biased.
• It may not meet the specific researcher's needs.
• The desired data may simply not be available.

Advantages of Personal Interview
• Spontaneous answers can be obtained.
• A great deal of information can be collected.
• The enumerator is able to explain questions that the respondent might not understand.
• The interviewer can persuade the respondent to answer, so the response rate is high.
• It can be used in conjunction with other methods.

Disadvantages of Personal Interview
• It is the most expensive method, especially where the number of people to be interviewed is large and spread over a wide area.
• It is time consuming.
• The respondent may not want to give confidential or embarrassing answers.
• It is prone to interviewer bias, and the respondent may give a biased answer to show himself in a better light.
• The respondent might refuse to be interviewed, and some people are more difficult to locate and interview than others, e.g. travellers.

Advantages of Telephone Interview
• It is cheaper than personal interview.
• It is less time consuming.
• Spontaneous answers can be obtained.

Disadvantages of Telephone Interview
• It can be used only on people possessing a telephone.
• It may not be suitable for a large survey.
• Very few questions can be asked on the telephone.
• The respondent can refuse the interview simply by replacing the receiver.
• Information that is too personal cannot be obtained.

Advantages of Mail or Postal Questionnaire
• It has low cost; it can be sent to a very large number of people.
• There is no interviewing bias.
• It gives wider geographical coverage and is suitable where respondents are spread over a wide area.
• Personal and embarrassing questions are more likely to be willingly and accurately answered.

Disadvantages of Mail or Postal Questionnaire
• The response rate is usually low; there is a high risk of non-response.
• It can be adopted only where the respondents are literate.
• It takes a longer period for respondents to fill and return the form.
• Since the interviewer is absent, the respondent might not understand some questions, and some questions may be wrongly interpreted.
• It is prone to incomplete information; respondents may give vague and reckless responses.
• It is difficult to persuade some respondents to return the questionnaire posted to them.

Advantages of Direct Observation
• It is free from memory error, exaggeration and prestige effects on the part of the respondents, which are encountered in other methods.

Disadvantages of Direct Observation
• The results depend on the skill and impartiality of the observer.

1.3 Roles of Statistics in Scientific Research


• Statistics plays a vital role in research. For example, statistics can be used in data collection, analysis, interpretation, explanation and presentation. The use of statistics will guide researchers toward proper characterization, summarization, presentation and interpretation of research results.
• Statistics provides a platform for research: how to go about your research, whether to consider a sample or the whole population, the techniques to use in data collection and observation, and how to go about the data description (using measures of central tendency).
• Statistical methods and analyses are often used to communicate research findings and
to support hypotheses and give credibility to research methodology and conclusions.
• It is important for researchers and also consumers of research to understand statistics
so that they can be informed, evaluate the credibility and usefulness of information,
and make appropriate decisions.
• Statistics is very important when it comes to the conclusion of the research. In this aspect, the major purposes of statistics are to help us understand and describe phenomena in our world and to help us draw reliable conclusions about those phenomena.

1.4 Basic definitions and Concepts in Statistics


Statistics, like every discipline, has its own language. The language is what helps you know what a problem is asking for, what results are needed, and how to describe and evaluate the results in a statistically correct manner. To shed light on statistical terminologies, we begin with a simple and familiar analogy.
There are millions of passenger automobiles in the United States. What is their average
value? It is obviously impractical to attempt to solve this problem directly by assessing the
value of every single car in the country, adding up all those numbers, and then dividing by
however many numbers there are. Instead, the best we can do would be to estimate the
average. One natural way to do so would be to randomly select some of the cars, say 200 of
them, ascertain the value of each of those cars, and find the average of those 200 numbers.
The set of all those millions of vehicles is called the population of interest (universe), and
the number attached to each one, its value, is a measurement. The average value is a
parameter: a number that describes a characteristic of the population, in this case monetary
worth. The set of 200 cars selected from the population is called a sample, and the 200
numbers, the monetary values of the cars we selected, are the sample data. The average of
the data is called a statistic: a number calculated from the sample data. This example
illustrates the meaning of the following definitions.
A population is any specific collection of objects of interest. A sample is any subset or subcollection of the population, including the case that the sample consists of the whole population, in which case it is termed a census.
A measurement is a number or attribute computed for each member of a population or of a
sample. The measurements of sample elements are collectively called the sample data.
A parameter is a number that summarizes some aspect of the population as a whole.
A statistic is a number computed from the sample data.
Variable: A characteristic about each individual element of a population or sample
(monetary worth)
Data (singular): The value of the variable associated with one element of a population or
sample. This value may be a number, a word, or a symbol.
Data (plural): The set of values collected for the variable from each of the elements
belonging to the sample.
Experiment: A planned activity whose results yield a set of data. The recording or measurement of the monetary values of the 200 cars in the sample and the computation of the mean is an experiment.
Continuing with our example, if the average value of the cars in our sample was $8,357, then
it seems reasonable to conclude that the average value of all cars is about $8,357. In
reasoning this way we have drawn an inference about the population based on information
obtained from the sample. In general, statistics is a study of data: describing properties of the
data, which is called descriptive statistics, and drawing conclusions about a population of
interest from information extracted from a sample, which is called inferential statistics.
Computing the single number $8,357 to summarize the data was an operation of descriptive
statistics; using it to make a statement about the population was an operation of inferential
statistics.
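The two operations can be sketched in a few lines of Python. This is a minimal illustration, not part of the original example: the population values below are simulated, purely hypothetical numbers standing in for the millions of real cars.

```python
import random

random.seed(1)  # fix the random draw so the illustration is reproducible

# Hypothetical population: simulated values (in $) for 10,000 cars.
population = [random.uniform(500, 20000) for _ in range(10_000)]

# Parameter: a number describing the population as a whole.
population_mean = sum(population) / len(population)

# Sample: 200 cars selected at random; their values are the sample data.
sample = random.sample(population, 200)

# Statistic (descriptive statistics): a number computed from the sample data.
sample_mean = sum(sample) / len(sample)

# Inferential statistics: take the statistic as an estimate of the parameter.
print(f"sample mean:     {sample_mean:10.2f}")
print(f"population mean: {population_mean:10.2f}")
```

The sample mean will rarely equal the population mean exactly, but for a random sample of 200 it is usually close to it, which is what justifies the inference.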

Statistics is a scientific method of collection, organization, summarization, presentation,


analysis and interpretation of data (COSPAID) as well as drawing valid conclusion and make
reasonable decisions on the basis of this analysis.
Descriptive statistics is the branch of statistics that involves organizing, displaying, and
describing data.
Inferential statistics is the branch of statistics that involves drawing conclusions about a
population based on information contained in a sample taken from that population.

The relationship between a population of interest and a sample drawn from that population is
perhaps the most important concept in statistics, since everything else rests on it. This
relationship is illustrated graphically in Figure 1.1 "The Grand Picture of Statistics". The
circles in the large box represent elements of the population. In the figure there was room for
only a small number of them but in actual situations, like our automobile example, they could
very well number in the millions. The solid black circles represent the elements of the
population that are selected at random and that together form the sample. For each element of the sample there is a measurement of interest, denoted by a lower case x (which we index as x1, x2, …, xn to tell them apart); these measurements collectively form the sample data set. From the data we may calculate various statistics. To anticipate the notation that will be used later, we might compute the sample mean x̄ and the sample proportion p̂, and take them as approximations to the population mean μ (μ is the lower case Greek letter mu, the traditional symbol for this parameter) and the population proportion p, respectively. The other symbols in the figure stand for other parameters and statistics that we will encounter.

Figure 1.1 The Grand Picture of Statistics


The measurement made on each element of a sample need not be numerical. In the case of
automobiles, what is noted about each car could be its color, its make, its body type, and so
on. Such data are categorical or qualitative, as opposed to numerical or quantitative data
such as value or age. This is a general distinction.
Qualitative data/variables are measurements for which there is no natural numerical scale, but which consist of attributes, labels, or other nonnumerical characteristics. Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable. However, qualitative data can generate numerical sample statistics.
In the automobile example, for instance, we might be interested in the proportion of all cars
that are less than six years old. In our same sample of 200 cars we could note for each car
whether it is less than six years old or not, which is a qualitative measurement. If 172 cars in the sample are less than six years old, the sample proportion is 172/200 = 0.86, or 86%, and we would estimate the parameter of interest, the population proportion, to be about the same as the sample statistic, the sample proportion, that is, about 0.86. Alternatively, qualitative data can be analysed using charts, frequency counts and percentages.
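A minimal sketch of this calculation in Python, using a hypothetical hard-coded sample in which 172 of the 200 cars qualify, as in the text:

```python
# Qualitative measurement on each of 200 sampled cars:
# True if the car is less than six years old, False otherwise.
is_young = [True] * 172 + [False] * 28

# The qualitative data still generate a numerical sample statistic:
# the sample proportion of cars less than six years old.
sample_proportion = sum(is_young) / len(is_young)
print(sample_proportion)  # 0.86
```

Note that although each individual measurement is categorical (True/False), counting and dividing turn the collection of measurements into a quantitative statistic.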

Quantitative data/variables are numerical measurements that arise from a natural numerical scale. Arithmetic operations, such as addition and averaging, are meaningful for data resulting from a quantitative variable. In the automobile example, the monetary worth of the cars is a quantitative variable, so any form of arithmetic operation can be performed on it.
1.5 Measurement scales.
Variables differ in "how well" they can be measured, i.e., in how much measurable
information their measurement scale can provide. There is obviously some measurement
error involved in every measurement, which determines the "amount of information" that we
can obtain. Another factor that determines the amount of information that can be provided by
a variable is its "type of measurement scale." Specifically, variables are classified as (a)
nominal, (b) ordinal, (c) interval or (d) ratio.
a. Nominal variables allow for only qualitative classification. That is, they can be
measured only in terms of whether the individual items belong to some distinctively
different categories, but we cannot quantify or even rank order those categories. For
example, all we can say is that 2 individuals are different in terms of variable A (e.g.,
they are of different race), but we cannot say which one "has more" of the quality
represented by the variable. Typical examples of nominal variables are gender, race,
color, city, etc.

b. Ordinal variables allow us to rank order the items we measure in terms of which has
less and which has more of the quality represented by the variable, but still they do
not allow us to say "how much more." A typical example of an ordinal variable is the
socioeconomic status of families. For example, we know that upper-middle is higher
than middle but we cannot say that it is, for example, 18% higher. Also this very
distinction between nominal, ordinal, and interval scales itself represents a good
example of an ordinal variable. For example, we can say that nominal measurement
provides less information than ordinal measurement, but we cannot say "how much
less" or how this difference compares to the difference between ordinal and interval
scales.

c. Interval variables allow us not only to rank /order the items that are measured, but
also to quantify and compare the sizes of differences between them. For example,
temperature, as measured in degrees Fahrenheit or Celsius, constitutes an interval
scale. We can say that a temperature of 40 degrees is higher than a temperature of 30
degrees, and that an increase from 20 to 40 degrees is twice as much as an increase
from 30 to 40 degrees.

d. Ratio variables are very similar to interval variables; in addition to all the properties of interval variables, they feature an identifiable absolute zero point, and thus they allow for statements such as one value being two times more than another. Typical examples of ratio scales are
measures of time or space. For example, as the Kelvin temperature scale is a ratio
scale, not only can we say that a temperature of 200 degrees is higher than one of 100
degrees, we can correctly state that it is twice as high. Interval scales do not have the
ratio property. Most statistical data analysis procedures do not distinguish between the
interval and ratio properties of the measurement scales.
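The interval-versus-ratio distinction from the temperature example can be checked numerically. In this small sketch, differences agree on the Celsius (interval) and Kelvin (ratio) scales, but ratios are physically meaningful only on the Kelvin scale:

```python
# Two temperatures on the Celsius (interval) scale.
c1, c2 = 20.0, 40.0

# Convert to the Kelvin (ratio) scale, which has an absolute zero point.
k1, k2 = c1 + 273.15, c2 + 273.15

# Differences are meaningful on both scales, and they agree:
print(round(c2 - c1, 2), round(k2 - k1, 2))  # 20.0 20.0

# Ratios are meaningful only on the ratio scale:
print(c2 / c1)            # 2.0, but 40 °C is not "twice as hot" as 20 °C
print(round(k2 / k1, 3))  # 1.068, the physically meaningful ratio
```

The difference 40 − 20 survives the change of scale; the ratio 40/20 does not, because the Celsius zero point is arbitrary.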
1.6 Discrete versus continuous variable
Discrete Variable: A quantitative variable that can assume a countable number of values.
Intuitively, a discrete variable can assume values corresponding to isolated points along a
line interval. That is, there is a gap between any two values.
Continuous Variable: A quantitative variable that can assume an uncountable number of
values. Intuitively, a continuous variable can assume any value along a line interval,
including every possible value between any two values.
In many cases, discrete and continuous variables may be distinguished by determining whether the variables are related to a count or a measurement. Discrete variables are usually associated with counting. If the variable cannot be further subdivided, it is a clue that you are probably dealing with a discrete variable. Continuous variables are usually associated with measurements. The values of continuous variables are only limited by your ability to measure them.
Examples of Discrete variables are
• Number of printing mistakes in a book.
• Number of road accidents in New Delhi.
• Number of siblings of an individual.
• Number of students admitted in a particular session

Examples of Continuous Variable


• Height of a person
• Weight of a person
• Age of a person
• Profit earned by the company.
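The count-versus-measurement distinction can be made concrete with small, purely hypothetical data sets: the discrete variable takes only isolated integer values, while the continuous one can take any value in an interval.

```python
# Discrete variable: number of siblings (a count; only isolated values).
siblings = [0, 1, 2, 2, 3, 1]
assert all(isinstance(s, int) for s in siblings)  # no value like 2.5 occurs

# Continuous variable: height in metres (a measurement; any value in an
# interval is possible, limited only by the precision of the instrument).
heights = [1.62, 1.705, 1.7049, 1.851]
assert all(1.0 < h < 2.5 for h in heights)
```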
1.7 Functions of Statistics
i. Statistics presents facts in a definite and precise form so that the facts are readily available for making valid conclusions.
ii. It simplifies a large mass of data and presents it in an intelligible manner so that it conveys meaning to the user/reader.
iii. Statistics allow data to be classified which may subsequently be used for
comparison.
iv. Formulating and testing of hypothesis: This is an important function of statistics.
This helps in developing new theories. So statistics examines the truth and helps in
innovating new ideas.
v. Forecasting: The future is uncertain. Statistics helps in forecasting the trend and
tendencies. Statistical techniques are used for predicting the future values of a
variable. For example, a producer forecasts his future production on the basis of the
present demand conditions and his past experiences. Similarly, the planners can
forecast the future population etc. considering the present population trends.
1.8 Limitations of Statistics
Statistics with all its wide application in every sphere of human activity has its own
limitations. Some of them are given below.
1. Statistics is not suitable to the study of qualitative phenomenon: Since statistics is
basically a science and deals with sets of numerical data, it is applicable only to those subjects of enquiry which can be expressed in terms of quantitative measurements. As
cannot be expressed numerically and any statistical analysis cannot be directly applied on
these qualitative phenomena. Nevertheless, statistical techniques may be applied indirectly by
first reducing the qualitative expressions to accurate quantitative terms. For example, the
intelligence of a group of students can be studied on the basis of their marks in a particular
examination.
2. Statistics does not study individuals: Statistics does not give any specific importance to
the individual items; in fact, it deals with an aggregate of objects. Individual items, when they
are taken individually do not constitute any statistical data and do not serve any purpose for
any statistical enquiry.
3. Statistical laws are not exact: It is well known that mathematical and physical sciences
are exact. But statistical laws are not exact and statistical laws are only approximations.
Statistical conclusions are not universally true. They are true only on an average.
4. Statistics may be misused: Statistics must be used only by experts; otherwise, statistical methods are the most dangerous tools in the hands of the inexpert. The use of statistical tools by inexperienced and untrained persons might lead to wrong conclusions. Statistics can be easily misused by quoting wrong figures. As King aptly says, ‘statistics are like clay of which one can make a God or Devil as one pleases’.
5. Statistics is only one of the methods of studying a problem: Statistical methods do not provide complete solutions to problems, because problems must be studied taking the background of the country’s culture, philosophy or religion into consideration. Thus a statistical study should be supplemented by other evidence.

2. STATISTICAL DATA: TYPES, SOURCES AND METHOD OF COLLECTION


2.1. TYPES OF STATISTICAL DATA
Statistical data are a numerical statement of aggregates. Data, generally, are obtained through
properly organized statistical inquiries conducted by the investigators. There are many ways
of classifying data. A common classification is based upon who collected the data. Data can be collected by the investigator or by enumerator(s) employed by him. Alternatively, the investigator may use already published data for his research. Data freshly generated by the investigator/researcher is called primary data, while data obtained by a separate person, organization or institution but used by the investigator for his research is called secondary data. These two data types are discussed in detail in what follows:
2.1.1 Primary Data
Data is said to be primary if it is collected for a particular purpose and used for the original purpose of collection. Such data are either collected by, or under the direct supervision and instruction of, the investigator. Primary data usually imply considerable knowledge of the conditions under which the data are collected and also of the limitations which must be placed on their use. The familiarity of the investigator with the background of the data is valuable and can often prevent errors which would otherwise go undetected. The main sources of primary data are censuses and surveys. In these exercises, information varying from age, sex, educational level and health to economic level is obtained from the population. They are used by government and its functionaries to formulate policies.

2.1.2 Secondary Data


Data is said to be secondary if it is collected for a particular purpose and used for another purpose. Put differently, secondary data are data which have already been collected for purposes other than the problem at hand. In other words, secondary data are not fresh and might have been processed a bit. It is a type of data over which the investigator has a lesser degree of control, and the degree of misinterpretation increases as the degree of control decreases. Secondary data may be available in published or unpublished form. When it is not possible to collect data by the primary method, the investigator goes for the secondary method.
However, the following factors must be considered before using secondary data in
research:
• Specifications: Methodology used to collect the data
• Error: Accuracy of the data
• Currency: When was the data collected? What is the time lag between collection and publication? What is the frequency of updates?
• Objective(s): The Purpose for which the data was collected
• Nature: The content of the Data: definition of key variables, unit of measurements,
categories used and relationship examined
• Dependability: Overall, how dependable are the data? What is the expertise,
credibility, trustworthiness and reputation of the source?

2.2 SOURCES OF STATISTICAL DATA

The type of problem at hand dictates where an investigator could obtain statistical data. Statistical data could therefore be obtained from the public, commercial and industrial sectors, institutions, and laboratory and field experiments. Government establishments such as the National Bureau of Statistics (NBS) and State Government Ministries also provide statistical data for researchers’ use. Specifically, data, which is a necessary tool for statistical analysis, can be collected from two main sources: internal and external sources.
2.2.1. Internal Sources: Internal sources of data are those which are obtained from the
internal reports of an organization. For instance, a factory publishes its annual report on total
production, total profit and loss, total sales, loans, wages to employees, bonus and other
facilities to employees etc.
2.2.2 External Sources: The desired data or information may not be available within an organization but may be obtained elsewhere. External sources refer to information collected outside the organization/agency. For instance, to study the problem of transportation in Lagos, if we obtain the information from the Lagos National Union of Road and Transport Workers, it would be known as an external source of data.
External data can be divided into the following classes.
• Government Publications - Government sources provide an extremely rich pool of data for researchers. In addition, many of these data are available free of cost on internet websites. There are a number of government agencies generating data. These are:

i. National Bureau of Statistics - One of the mandates of the NBS is to collect, compile, analyse and publish statistical information relating to the commercial, industrial, agricultural, mining, social, educational and general activities and conditions of the inhabitants of the federation.

ii. National Population Commission (NPC) - The legal instrument which created the body charged it to undertake complete enumeration of the population of Nigeria periodically through censuses and sample surveys. The body publishes information on age structure, gender, household welfare, unemployment rate, etc.

iii. Central Bank of Nigeria (CBN) - The official publications of the Central Bank of Nigeria, like the CBN Statistical Bulletin and CBN Annual Reports, are rich sources of information for solving economy-related problems in Nigeria. The CBN publishes information on commodity prices, monetary policy, inflation rates, education, agriculture, etc.
• Non-Government Publications - These include publications of various industrial and trade associations, such as

i. The Nigerian Labour Congress


ii. African Youths International Development Foundation
iii. Association for Adolescent Reproductive Health and Action
iv. Global Alert for Defense of You and the Less Privilege
v. Health and Sustainable Development Association of Nigeria
vi. National Association of Nigerian Traders
vii. Nigerian Media Women Association
viii. Women Farmers Association of Nigeria
• International Organizations - These include
• The International Labour Organization (ILO)- It publishes data on the total and
active population, employment, unemployment, wages and consumer prices
• The Organization for Economic Co-operation and Development (OECD) - It publishes data on foreign trade, industry, food, transport, and science and technology.
• The International Monetary Fund (IMF) - It publishes reports on national and international foreign exchange regulations.
• The World Bank - It collates and publishes development indicators for more than 170 countries.
• Food and Agriculture Organization (FAO) - The Food and Agriculture Organization of the United Nations leads international efforts to defeat hunger. The FAO is a rich source of information for researchers.
• UNICEF - The United Nations Children’s Fund (UNICEF) conducts and publishes the Demographic and Health Survey (DHS) in over 90 countries. The survey covers a wide range of research topics, including fertility, family planning, maternal and child health, nutrition, and HIV, in addition to emerging health issues and changing trends in gender and youth.
D. Unpublished Sources: e.g. hospital records
2.3 METHOD OF DATA COLLECTION

Method of data collection is the term used to describe the approach adopted to gather data for
a particular purpose from various sources, systematically observed, recorded and organized.
There is no one "best" data collection method. Each method has its pros and cons. Which one
you choose depends on what kind of data you need (i.e. qualitative or quantitative data) and
which pros and cons are important for your study. However, the objective of the study, the
type of variables needed, the collection point, the skill of the enumerator(s), and the degree
of accuracy and precision desired are factors to be considered in selecting a suitable method
of data collection. Common data collection methods with their advantages and
disadvantages are discussed below.

2.3.1 Direct Observation: This has to do with careful watch over a situation or condition, or
even over individuals, with the intention of eliciting relevant facts or behaviour from them
under different conditions. This method is suitable in traffic censuses, statistical quality
control and market research; for example, counting the number of vehicles that pass through
a busy road per minute for 30 consecutive minutes.

2.3.2 Direct/Personal Interview: This is a situation in which the researcher/interviewer
locates the respondents and conducts the interview pertaining to the survey in a face-to-face
situation. The accuracy and the quality of the data obtained depend on the skill and the
experience of the interviewer.

2.3.3 Telephone Interview Method: This is very similar to the personal interview except that
the respondents are engaged in a relevant question-and-answer conversation via telephone
and not face-to-face as in the personal interview.
2.3.4 Mail or Postal Questionnaire: This is a set of questions relating to the aim and
objectives of the investigation, sent to respondents by mail or other means to answer and
return to the user. It is used when the respondents to be covered are many and are mainly literate.

2.3.5 Questionnaire Completed by the Enumerator: This is a method of data collection
whereby the enumerator contacts the respondents, gets replies to the questions contained in
the schedule and fills them in his own handwriting.

2.4 SCHEDULE/QUESTIONNAIRE
A schedule is a list containing questions on a research work to which the respondents are
expected to provide answers. The design of such a list may be such that it does not obey the
format of a questionnaire, as responses are to be recorded by the researcher himself rather
than by the respondents. A questionnaire, on the other hand, is a well-prepared list of
questions meant for respondents to complete unaided. The following, among others, are
features of a questionnaire:
➢ Introduction/Title
➢ Objective of the survey
➢ Clear, simple and unambiguous questions

2.5 TYPE OF QUESTIONNAIRE

(a) Closed-ended Questionnaire: A questionnaire is said to be closed-ended if it is designed in
such a way that respondents are limited to stated alternatives/options and are not permitted
further or additional explanation. This is useful when all possible responses of the
respondents can be exhaustively listed and are not too many. It is otherwise called a
structured questionnaire.

(b) Open-ended Questionnaire: This is a questionnaire designed in such a way that it permits
respondents to express their views in the way they know best. In other words, the respondents
are not in any way restricted to options. It is otherwise called an unstructured questionnaire.

2.6 QUALITIES OF QUESTIONNAIRE


A good questionnaire must possess the following qualities:
i. The questions should be presented in a systematic and well-organized form such that they
arouse the interest of respondents
ii. The questions must not be ambiguous; they should have a clear interpretation
iii. The questions should cover the subject matter
iv. The questions should have precise answers
v. Leading and biased questions must be avoided

2.7. PROBLEMS OF DATA COLLECTION


• Manipulation of figures
• Accessibility problems
• Lack of dedication on the part of some data collectors
• Illiteracy
• Religion and Culture
• Language barrier
• Landscape (Terrain)

Below are some errors that might arise during collection of data:
• Non-coverage: Failure to locate or visit some units in the sample
• Not at home: Respondents temporarily away, e.g. working parents
• Unable to answer: The respondent may not know the answer or may be unwilling to answer
• Hard core: Those who refuse to be interviewed, i.e. respondents who do not
cooperate with the interviewer

3. PRESENTATION OF STATISTICAL DATA

The collected data, also known as raw data or ungrouped data, are always in an unorganized
form and need to be organized and presented in a meaningful and readily comprehensible form
in order to facilitate further statistical analysis. It is, therefore, essential for an investigator to
condense a mass of data into a more comprehensible and meaningful form. Usually,
data can be presented in the form of tables, graphs and diagrams.

3.1 TABULATION OF STATISTICAL DATA


The process of grouping data into different classes or sub-classes according to some
characteristics is known as classification. This is usually the first step in tabular presentation of data.
Tabulation is the process of summarizing classified or grouped data in the form of a table so
that it is easily understood and an investigator is quickly able to locate the desired
information. A table is a systematic arrangement of classified data in columns and rows.
Thus, a statistical table makes it possible for the investigator to present a huge mass of data in
a detailed and orderly form. It facilitates comparison and often reveals certain patterns in data
which are otherwise not obvious.

3.1.1 Advantages of Tabulation:

Statistical data arranged in tabular form serve the following objectives:
1. It simplifies complex data and the data presented are easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like mean, variance, correlation
etc.
4. It presents facts in minimum possible space and unnecessary repetitions and explanations
are avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references and they make it easier to present the information in
the form of graphs and diagrams.

3.1.2 Important Parts of a Table

A typical table should have, among other things, the following parts:
1. Table number: For ease of identification and reference, every table should be
numbered. The pattern of numbering and the place where the numbers are put are more a
matter of personal taste than of rule.
2. Title of the table: A good table should have a clearly worded, brief but unambiguous title
explaining the nature of data contained in the table. It should also state arrangement of data
and the period covered.
3. Caption: In a table, the caption stands for the titles of the vertical columns. In case there is
a sub-division of any column, there will also be sub-caption headings, called "SPANNERS".
4. Stubs: The titles of the horizontal rows in a table are called stubs.
5. Body of the table: The body of a table contains the numerical information
6. Footnotes: Footnotes are given at the foot of the table for explanation of any fact or
information included in the table which needs some explanation. Thus, they are meant for
explaining or providing further details about the data, that have not been covered in title,
captions and stubs.
7. Sources of data: This reveals where the information contained in the table has been
obtained.

3.2 Types of Tables
Tables can be classified according to their purpose, stage of enquiry, nature of data or number
of characteristics used. On the basis of the number of characteristics, tables may be classified
as follows:
3.2.1 Simple or One-way table: A simple or one-way table is the simplest table, containing
data on one characteristic only. A simple table is easy to construct and simple to follow. For
example, the blank table given below may be used to show the age distribution of students in
a STA 211 class.

Table 1: Age distribution of students in STA 211 class


Age-group Number of Student
15-16
17-18
19-20
Source: Class governor’s record sheet

3.2.2 Two-way table: A table which contains data on two characteristics is called a two-way
table. In such a case, either the stub or the caption is divided into two co-ordinate parts. For
example, the caption may be further divided by another distinguishing characteristic of the
students in STA 211. If the characteristic of interest is gender, the table will look like:
Table 2: Age and gender distribution of students in STA 211 class
Age-group Number of Students Total
Male Female
15-16
17-18
19-20
Total
Source: ICT unit

3.2.3 Higher order/Manifold table: When more than two characteristics of the data are
considered in a single table, such a table is called a higher order or manifold table. For example,
the sub-captions in Table 2 may be further classified with respect to the departments of the
students offering STA 211. Thus, the table will be as follows:

Table 3: Age, gender and departmental distribution of students in STA 211 class
Age- Number of students G.T
group
Male Female
Physics Chemistry Mathematics Total Physics Chemistry Mathematics Total
15-16
17-18
19-20
Total
Source: Faculty office

4. FREQUENCY DISTRIBUTION
A frequency distribution is any device that displays the various categories or values that a
variable can assume as well as their frequencies of occurrence. For example, if a value of a
variable, e.g., height, weight, etc. (continuous), number of students in a class, readings of a
taxi-meter (discrete), etc., occurs twice or more in a given series of observations, then the
number of occurrences of the value is termed the "frequency" of that value, and the way of
tabulating a pool of data/measurements on any of these variables and their respective
frequencies side by side is called a "frequency distribution" of those data. Croxton and
Cowden defined a frequency distribution as "a statistical table which shows the sets of all
distinct values of the variable arranged in order of magnitude, either individually or in
groups, with their corresponding frequencies side by side."

For instance, suppose the marks obtained by 100 students in a STA 211 examination are as shown below.
Table 4: Marks of 100 students in STA 211 examination
72 61 63 65 62 68 69 64 65 67 69 56 60 66 62 57 72 67 65 70
64 66 71 73 67 65 64 63 61 58 64 62 69 66 65 63 63 59 61 64
65 57 66 71 68 70 67 66 60 62 65 58 63 68 64 61 62 65 66 59
62 65 65 60 64 61 64 69 62 64 62 63 68 67 65 62 65 68 61 63
62 72 62 66 66 65 63 67 66 63 63 66 65 63 62 62 66 64 62 62
Source: Hypothetical figure

If the raw data of Table 4 are arranged in either ascending or descending order of magnitude,
we get a better way of presentation, usually called an "array".
Table 5: Array of marks of 100 students in STA 211 examination
56 57 57 58 58 59 59 60 60 60 61 61 61 61 61 61 62 62 62 62
62 62 62 62 62 62 62 62 62 62 62 63 63 63 63 63 63 63 63 63
63 63 64 64 64 64 64 64 64 64 64 64 65 65 65 65 65 65 65 65
65 65 65 65 65 65 66 66 66 66 66 66 66 66 66 66 66 67 67 67
67 67 67 68 68 68 68 68 69 69 69 69 70 70 71 71 72 72 72 73
Source: Hypothetical figure
The construction of a frequency distribution involves two things: the listing of the various
categories and the listing of the corresponding frequencies. Generally, data can be categorized
singly, as 56, 62, 63, 65, …, 73, or in groups, as 56-57, 58-59, 60-61, …, 72-73. When the data
are categorized singly, we refer to them as ungrouped data. However, if the data are
categorized in groups, we refer to them as grouped data.

4.1 FREQUENCY DISTRIBUTION FOR UNGROUPED DATA
The data in Table 4 can be presented in a simple (or ungrouped) frequency distribution with
the aid of tally marks. A tally mark is an upward slanted stroke (/) which is put against a
value each time it occurs in the raw data. The fifth occurrence of a value is represented by a
cross tally mark (\) drawn across the first four tally marks. Finally, the tally marks are
counted and the total of the tally marks against each value is its frequency.

Table 6: Ungrouped frequency distribution of marks of 100 students


Marks Tally marks Frequency Relative Frequency
56 / 1 1/100= 0.01
57 // 2 2/100= 0.02
58 // 2 2/100= 0.02
59 // 2 2/100= 0.02
60 /// 3 3/100= 0.03
61 //// / 6 6/100= 0.06
62 //// //// //// 15 15/100= 0.15
63 //// //// / 11 11/100= 0.11
64 //// //// 10 10/100= 0.10
65 //// //// //// 14 14/100= 0.14
66 //// //// / 11 11/100= 0.11
67 //// / 6 6/100= 0.06
68 //// 5 5/100= 0.05
69 //// 4 4/100= 0.04
70 // 2 2/100= 0.02
71 // 2 2/100= 0.02
72 /// 3 3/100= 0.03
73 / 1 1/100= 0.01
Total 100 1.00
Source: Hypothetical figures
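The tallying of Table 6 can be mimicked programmatically. The sketch below is illustrative only (not part of the original notes) and, for brevity, uses just the first row of the marks in Table 4; it counts frequencies and relative frequencies with Python's collections.Counter:

```python
from collections import Counter

# First row of the marks in Table 4 (20 of the 100 scores), used for brevity.
marks = [72, 61, 63, 65, 62, 68, 69, 64, 65, 67,
         69, 56, 60, 66, 62, 57, 72, 67, 65, 70]

freq = Counter(marks)                            # value -> frequency
n = sum(freq.values())                           # total number of observations
rel_freq = {x: f / n for x, f in freq.items()}   # value -> relative frequency

for x in sorted(freq):
    print(x, freq[x], rel_freq[x])
```

Running the loop prints each distinct mark alongside its frequency and relative frequency, exactly as the columns of Table 6 are built.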

4.2 FREQUENCY DISTRIBUTION FOR GROUPED DATA
The data in Table 4 can be further condensed by putting them into smaller groups, or, classes
called “class-Intervals”. The number of items which fall in a class-interval is called its “class
frequency”. The tabulation of raw data by dividing the whole range of observations into a
number of classes and indicating the corresponding class-frequencies against the class-
intervals, is called “grouped frequency distribution”. The data in Table 4 can be presented
in a grouped frequency distribution as follows.
• Find the range of the data set, Range= Maximum observation- Minimum observation.
In this case the range is, R= 73-56= 17
• If the desired number of class intervals is 9, then the class interval size or class width
will be 17/9 = 1.89, which is rounded up to 2.
Note: The number of class intervals can be fixed arbitrarily, keeping in view the nature
of the problem under study, or it can be decided with the help of Sturges' Rule. According
to Sturges, the number of classes can be determined by the formula
K = 1 + 3.322 log₁₀ N
where N = total number of observations
and K = number of class intervals.
Thus, if the number of observations is 10, the number of class intervals is
K = 1 + 3.322 log₁₀ 10 = 4.32 ≈ 4.
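Sturges' Rule is easy to evaluate directly. The following minimal Python sketch (an illustration, not part of the original notes) computes the suggested number of classes:

```python
import math

def sturges_classes(n: int) -> float:
    """Sturges' Rule: K = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n)

print(sturges_classes(10))    # about 4.32 -> roughly 4 classes
print(sturges_classes(100))   # about 7.64 -> roughly 8 classes
```

For the 100 marks of Table 4 the rule suggests about 8 classes; the notes use 9, which is an equally reasonable choice since the rule is only a guideline.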

Table 7: Grouped frequency distribution of marks of 100 students


Marks Tally marks Frequency Relative Frequency Cumulative
frequency
56-57 /// 3 3/100= 0.03 3
58-59 //// 4 4/100= 0.04 7
60-61 //// //// 9 9/100= 0.09 16
62-63 //// //// //// //// //// / 26 26/100= 0.26 42
64-65 //// //// //// //// //// 24 24/100= 0.24 66
66-67 //// //// //// // 17 17/100= 0.17 83
68-69 //// //// 9 9/100= 0.09 92
70-71 //// 4 4/100= 0.04 96
72-73 //// 4 4/100= 0.04 100
Total 100
Source: Hypothetical figures
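As an illustration of how a grouped table like Table 7 is built, the sketch below (hypothetical helper code, not from the notes) bins the ungrouped frequencies of Table 6 into the nine class intervals:

```python
# Ungrouped frequencies from Table 6 (mark -> frequency).
freq = {56: 1, 57: 2, 58: 2, 59: 2, 60: 3, 61: 6, 62: 15, 63: 11, 64: 10,
        65: 14, 66: 11, 67: 6, 68: 5, 69: 4, 70: 2, 71: 2, 72: 3, 73: 1}

# The nine class intervals 56-57, 58-59, ..., 72-73 (width 2).
classes = [(lo, lo + 1) for lo in range(56, 73, 2)]

grouped = []
for lo, hi in classes:
    # Class frequency = sum of the ungrouped frequencies falling in the class.
    f = sum(c for mark, c in freq.items() if lo <= mark <= hi)
    grouped.append((f"{lo}-{hi}", f, f / 100))

for interval, f, rel in grouped:
    print(interval, f, rel)
```

The computed class frequencies (3, 4, 9, 26, 24, 17, 9, 4, 4) reproduce the Frequency column of Table 7.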

4.3 DEFINITIONS OF TERMS ASSOCIATED WITH GROUPED FREQUENCY TABLES
(a) Class limits and Class interval: The maximum and minimum values of a class-interval
are called the upper class-limit and lower class-limit respectively. In Table 7, the lower
class-limits of the nine classes are 56, 58, 60, 62, 64, 66, 68, 70, 72 and the upper class-limits
are 57, 59, 61, 63, 65, 67, 69, 71, 73. The symbols that define a class, such as 56-57, 58-59,
etc., are called class intervals. Different types of class-intervals with their class-limits are
given below:

Class-intervals (A) like 10–15, 15–20, 50–100, 100–150; 30–, 40–; are upper-limit exclusive
type, i.e., items exactly equal to 15, 100 and 40 are put in the intervals 15–20, 100–150 and
40–, respectively, and not in the intervals 10–15, 50–100 and 30–, respectively. Similarly, 15 is
included and 10 excluded (lower limit) in the "above 10 but not more than 15" class-interval. In
the exclusive type the class-limits are continuous, i.e., the upper limit of one class-interval is
the lower limit of the next class-interval, and the class-limits of a class-interval coincide with the
class boundaries of that class-interval. It is suitable for continuous variable data and
facilitates mathematical computations.
Again class-intervals (B) like 60–69, 70–79, 80–89, etc., are inclusive type. Here both the
upper and lower class-limits are included in the class-intervals, e.g., 60 and 69 both are
included in the class-interval 60–69. This is suitable for discrete variable data. There is no
ambiguity to which an item belongs but the idea of continuity is lost. To make it continuous,
it can be written as 59.5–69.5, 69.5–79.5, 79.5–89.5, etc.
In ‘open-end’ class-interval (C) either the lower limit of the first class- interval, or, upper
limit of the last class-interval, or, both are missing. It is difficult to determine the mid-values
of the first and last class-intervals without an assumption. If the other closed class-intervals
have equal width, then we can assume that the open-end class-intervals also have the same
common width of the closed class-intervals. Grouped frequency distributions are kept open
ended when there are limited number of items scattered over a long interval. Unequal class-
intervals (D) are preferred only when there is a great fluctuation in the data.
(b) Class-mark, or, Mid-value: The class-mark, or mid-value, of the class-interval lies
exactly at the middle of the class-interval and is given by:
Class-mark = (lower class-limit + upper class-limit) / 2
           = (lower class boundary + upper class boundary) / 2
(c) Class boundaries: Class boundaries are the true limits of a class interval. They are
associated with grouped frequency distributions where there is a gap between the upper
class-limit of a class and the lower class-limit of the next class. They can be determined
using the formulae:
Lower class boundary = lower class-limit − d/2
Upper class boundary = upper class-limit + d/2
where d = common difference between the upper class-limit of a class-interval and the
lower class-limit of the next higher class-interval. The class boundaries of the class-intervals
of Table 7 will be 55.5 – 57.5; 57.5 – 59.5; 59.5 – 61.5; etc., since d = 58 – 57 = 60 – 59 = ... = 1.
The class boundaries convert a grouped frequency distribution (inclusive type) into a
continuous frequency distribution.
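The boundary computation can be sketched as a small helper function (illustrative Python, with the gap d assumed to be 1 as in Table 7):

```python
def class_boundaries(lower_limit, upper_limit, d=1):
    """True limits of an inclusive class interval: shift each limit by d/2."""
    return (lower_limit - d / 2, upper_limit + d / 2)

# First three class intervals of Table 7; d = 58 - 57 = 1.
for lo, hi in [(56, 57), (58, 59), (60, 61)]:
    print(class_boundaries(lo, hi))   # (55.5, 57.5), (57.5, 59.5), (59.5, 61.5)
```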
(d) Width or Length (or Size) of a Class-interval:
Width of a class-interval = Upper class boundary − Lower class boundary
When the class-intervals have equal widths, the common width = difference between two
successive upper class-limits (or two successive lower class-limits) = difference between
two successive upper class boundaries (or two successive lower class boundaries) =
difference between two successive class-marks or mid-values.
MEASURES OF LOCATION OR CENTRAL TENDENCY
A measure of central tendency or location is a single value that attempts to describe a set of
data by identifying the central position within that set of data. As such, measures of
central tendency are sometimes called measures of location. The most commonly used
measures of location are the arithmetic mean (or simply the mean), the median and the mode.
The geometric mean and the harmonic mean are used in special cases.

ARITHMETIC MEAN
The mean is commonly called the average. From a mathematical standpoint there are several
types of means, each legitimate in different scenarios:

Arithmetic Mean: It is calculated by adding all the values and dividing the sum by the
number of values. Let xᵢ (i = 1, 2, …, n) represent the values of a variable X; the mean for
ungrouped data is defined as:
x̄ = Σxᵢ / n
For grouped frequency data,
x̄ = Σfᵢxᵢ / Σfᵢ
Alternatively, using an assumed mean,
x̄ = A + (ΣfᵢDᵢ / Σfᵢ)
where A = the assumed mean
and Dᵢ = xᵢ − A is the deviation of each observation from the assumed mean.

Note that when the word "mean" is used without a modifier, it usually refers to
the arithmetic mean as defined by the formulae above.

Harmonic Mean: In calculations where reciprocal values are important (e.g. involving
proportions, or velocities over constant distances) the calculation of the mean
should be based on the harmonic mean, which is defined as the reciprocal of
the mean of the reciprocals:
H.M = n / Σ(1/xᵢ) for ungrouped data
H.M = Σfᵢ / Σ(fᵢ/xᵢ) for grouped data
The harmonic mean never becomes greater than the arithmetic mean.

Geometric Mean: The geometric mean has to be used for averaging multiplicative factors
(e.g. the average increase of stock prices). It is calculated as the n-th root of the
product of all values:
G.M = (x₁ · x₂ · … · xₙ)^(1/n) for ungrouped data
G.M = (x₁^f₁ · x₂^f₂ · … · xₖ^fₖ)^(1/Σfᵢ) for grouped data
The geometric mean is closely related to the log-normal distribution.

The empirical relationship between the arithmetic mean (x̄), geometric mean (G.M) and
harmonic mean (H.M) is:
x̄ ≥ G.M ≥ H.M (with G.M² = x̄ × H.M holding exactly for two observations)
Please note that there are different notations for the mean: the mean of a population is
denoted by µ, whereas the mean of the scores of a sample is denoted either by m or by x̄.
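The three means and the relationship between them can be checked with a short Python sketch (the data set here is hypothetical and chosen for convenience):

```python
import math

data = [2.0, 4.0, 8.0]
n = len(data)

arithmetic = sum(data) / n                  # (2 + 4 + 8) / 3
geometric = math.prod(data) ** (1 / n)      # (2 * 4 * 8) ** (1/3)
harmonic = n / sum(1 / x for x in data)     # 3 / (1/2 + 1/4 + 1/8)

# The ordering H.M <= G.M <= A.M always holds for positive data.
assert harmonic <= geometric <= arithmetic
print(arithmetic, geometric, harmonic)
```

For these three values the geometric mean works out to exactly 4, sitting between the harmonic mean (about 3.43) and the arithmetic mean (about 4.67).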
The mean is a good approximation of the central tendency for unimodal symmetric
distributions, but can be misleading in skewed or multimodal distributions. Therefore, it can
be useful to specify other additional measures of location for skewed distributions (i.e. the
median is more robust in case of skewed distributions or in case of outliers). Another way to
deal with outliers is to use a trimmed mean, which is calculated after the lower and upper
fraction (typically 5%) of the values have been discarded.
MEDIAN
When considering the distribution curve (or the histogram) of a sample, the median is the
location which divides the area under the curve (or the area of the histogram) into two equal
halves. The relative position of the mode, the median, and the mean provides an indication of
the skewness of a distribution.
The median is calculated as follows:
• Sort all values in ascending or descending order.
• If the number of values is odd, take the middle number.
• If the number of values is even, take the average of the middle two numbers.
The sum of absolute deviations of sample scores from their median is lower than the sum of
absolute deviations from any other value. Under certain circumstances the median may be a more
stable measure of location than the mean. The median in particular is less prone to outliers
(extreme values) than the mean, and is therefore often used in robust statistics.

Example: Calculate the median of the following values:
4.4, 5.1, 4.1, 6.2, 5.7, 5.6, 7.0
1. Sort the seven values: 4.1, 4.4, 5.1, 5.6, 5.7, 6.2, 7.0
2. Pick the middle value (since the number of values is odd) as the median: 5.6
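The steps above can be sketched as a small Python function (an illustration, not a prescribed implementation):

```python
def median(values):
    s = sorted(values)                 # step 1: sort the values
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # odd count: take the middle value
    return (s[mid - 1] + s[mid]) / 2   # even count: average the two middle values

print(median([4.4, 5.1, 4.1, 6.2, 5.7, 5.6, 7.0]))  # 5.6, as in the example
```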

Outliers
Outliers are extreme values that stand out from the other values of a sample. Outliers
normally have a considerable influence on the calculation of statistics (see e.g. the leverage
effect with linear regression) and should be removed in most cases. You should also note that
outliers may result simply from the fact that you assume a distribution which does not fit the
real distribution of the data.
Typical examples of outliers are errors in measurement, errors in acquisition (human
influences...), or (rare) outstanding values. An important question concerning outliers is
whether it is legitimate to remove a particular value after it has been recognized as an outlier.
Of course, statistical tests cannot decide whether it is appropriate to remove such values. They can
only give you a hint as to whether a significant deviation exists (basically, outlier tests are based on
the probability of a value belonging to the assumed distribution).

Leverage Effect
The term "leverage" is commonly used for an undesirable effect which is experienced with
regression analysis (as well as with other methods). It basically means that a single data point
which is located well outside the bulk of the data (an "outlier") has an over-proportional
effect on the resulting regression curve.
The origin of this effect can be found in the method of least squares. As the regression line is
determined by minimizing the sum of squared residuals, a value far off the trend line of the
data has much more influence on the results than the "correct" data points. This effect may
become so strong that the regression line completely "tilts".

Mode of a Distribution
The mode is the value which occurs most frequently in a data set. The mode is only of
interest for large data sets, as in small samples the mode strongly depends on random
variations of the data. The term "mode" is contained in the expressions "unimodal",
"bimodal", and "multimodal". Distributions showing only a single peak are called unimodal,
bimodal distributions show two peaks in their frequency diagrams.
The relative position of the mode, the median, and the mean provides an indication of the
skewness of a distribution. The calculation of the mode D of classed data with equal bin
widths (as in histograms) can be done according to the following equation:

D = x_lk + b · (f_k − f_k−1) / (2·f_k − f_k−1 − f_k+1)

with
k       the index of the bin containing the greatest number of objects
x_lk    lower border of the bin containing the greatest number of objects
b       bin width
f_k     frequency of the k-th class
f_k−1, f_k+1   frequencies of the neighbouring bins
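Applied to Table 7 (modal class 62-63, class boundaries 61.5-63.5, width 2, neighbouring frequencies 9 and 24), the grouped-mode formula can be sketched as follows; the function name is ours, not the notes':

```python
def grouped_mode(lower_boundary, width, f_k, f_prev, f_next):
    """Mode of classed data: D = x_lk + b*(f_k - f_prev) / (2*f_k - f_prev - f_next)."""
    return lower_boundary + width * (f_k - f_prev) / (2 * f_k - f_prev - f_next)

# Modal class of Table 7 is 62-63: boundaries 61.5-63.5, width 2, f = 26,
# with neighbouring class frequencies 9 (60-61) and 24 (64-65).
mode = grouped_mode(61.5, 2, 26, 9, 24)
print(round(mode, 2))   # 63.29
```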

MEASURES OF DISPERSION OR VARIATION

Variance
In addition to the measures of location describing the position of the distribution of a
variable, one has to know the spread of the distribution (and, of course, its form).

The spread of a distribution may be described using various parameters, of which variance is
the most common one. Mathematically speaking, the variance s² is the sum of the squared
deviations from the mean divided by the number of samples less 1:

s² = Σ(xᵢ − x̄)² / (n − 1)
Examination of this formula should lead to at least three questions:


• Why take the sum of squares and not, for example, the sum of absolute deviations from
the mean? The answer to this is quite simple: the mathematical analysis is simpler, if
the sum of squares is used.
• Why is the sum divided by n-1; wouldn't it be more logical to take just n? Here again,
the answer is simple, but requires the introduction of the concept of the degrees of
freedom.
• What about the s² in the formula? The parameter s, which is apparently the square root of
the variance, is called the standard deviation.
Please note the notation concerning the variance and the standard deviation: they are depicted
as s² (or s, respectively) if they have been calculated from a sample. If computed from a
population, the standard deviation is depicted by the Greek letter σ (sigma).
The variance of some data is closely related to the precision of the measuring process.
Standard Deviation
The standard deviation is the positive square root of the variance, and is depicted by s for
samples, or by σ for populations. The standard deviation is a useful measure of variability
because of its mathematical tractability:

s = √( Σ(xᵢ − x̄)² / (n − 1) )
There is often some confusion about the standard deviation and its interpretation. One should
carefully distinguish between the formal definition of the standard deviation and the
interpretation of it. The standard deviation as a numerical value can always be calculated
provided that enough samples are available. In contrast to this, the interpretation of the
standard deviation as a measure of spread can be fully utilized only if the type of the
distribution is known. However, the theorem of Chebyshev gives some guidelines for any (!)
distribution.

In the case of a normal distribution the following rules of thumb can be applied:
(µ ± σ) contains about 68% of the observations
(µ ± 2σ) contains about 95% of the observations
(µ ± 3σ) contains more than 99% of the observations
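A minimal sketch of the sample variance and standard deviation as defined above (Python used purely for illustration; the data set is hypothetical):

```python
import math

def sample_variance(data):
    """Sum of squared deviations from the mean divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
v = sample_variance(data)   # 32 / 7, since the squared deviations sum to 32
s = math.sqrt(v)            # standard deviation
print(v, s)
```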
Coefficient of Variation

Specifying the standard deviation is more or less useless without the additional specification
of the mean (and of course the type of distribution). It makes a big difference whether s = 5
goes with a mean of x̄ = 100 or with a mean of x̄ = 3. Relating the standard deviation to the
mean resolves this problem. The coefficient of variation is therefore defined by

C.V = s / x̄ (often expressed as a percentage, 100·s/x̄),

which is a relative measure of the variation. However, you should be aware of the fact that the
coefficient of variation becomes more or less useless when the mean approaches zero. So
don't use the coefficient of variation for comparing detection limits (where the signal-to-noise
ratio falls below 3).
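The point about equal spreads around different means can be illustrated directly (hypothetical data, chosen so that s = 5 in both cases):

```python
def coefficient_of_variation(data):
    """C.V = s / mean, expressed here as a percentage."""
    n = len(data)
    mean = sum(data) / n
    s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5
    return 100 * s / mean

# Both samples have standard deviation 5, but very different means.
print(coefficient_of_variation([95, 100, 105]))  # mean 100: small relative variation
print(coefficient_of_variation([-2, 3, 8]))      # mean 3: much larger C.V
```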

MEASURES OF PARTITION
I. Deciles
II. Percentiles
III. Quartiles
MEASURES OF DISTRIBUTION
Moments of a Distribution
Moments can be used to describe several properties of a distribution. There are two different
ways to define moments. One kind is called "moments about zero" and is calculated
according to the following equation:

m′ᵣ = Σxᵢʳ / n

Another kind is called "moments about the mean"; it is calculated according to

mᵣ = Σ(xᵢ − x̄)ʳ / n

The exponent r defines the r-th moment. As you can easily see, the first moment about zero is
equal to the mean, and the second moment about the mean is the (population) variance. The
third moment is related to the skewness, and the fourth moment is related to the kurtosis of a
distribution.
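Both kinds of moments can be sketched in a few lines (illustrative Python; note that the second moment about the mean here is the population-style variance, i.e. divided by n):

```python
def moment_about_zero(data, r):
    """r-th moment about zero: sum(x**r) / n."""
    return sum(x ** r for x in data) / len(data)

def moment_about_mean(data, r):
    """r-th central moment: sum((x - mean)**r) / n."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

data = [1, 2, 3, 4, 5]
print(moment_about_zero(data, 1))   # 3.0 -> the mean
print(moment_about_mean(data, 2))   # 2.0 -> the population variance
print(moment_about_mean(data, 3))   # 0.0 -> symmetric data, no skew
```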
SKEWNESS
A distribution is said to be skewed to the right (left) if it shows a tailing at the right (left) end.
The amount of skewing can be determined from the third moment of the distribution; the
resulting measure is usually called the skewness:

g₁ = m₃ / m₂^(3/2)

This definition of the skewness of a sample is a biased estimator of the skewness of the
population. In order to estimate the skewness of the population, the following adjusted formula
has to be used:

G₁ = g₁ · √(n(n − 1)) / (n − 2)

In order to test whether the calculated sample skewness originates from a skewed
distribution, the following test statistic may be calculated (for large n the standard error of
the skewness is approximately √(6/n)):

Zₙ = G₁ / √(6/n)

If the test statistic Zₙ exceeds the (1 − α/2) quantile of the standard normal distribution, the null
hypothesis of a symmetric distribution has to be rejected at a level of significance of α. The
number of data points n should be greater than 100 for this test.
Hint: In some software packages the test statistic Zₙ is called "standardized
skewness". As a rule of thumb you can assume a skewed distribution with
95% confidence if the standardized skewness is less than −2 or greater than
+2.
Please note that the skewness is occasionally defined by a somewhat different formula,
leading to different values.

Kurtosis

The kurtosis (or excess) measures the relative flatness of a distribution (as compared to the
normal distribution, which shows a kurtosis of zero). A positive kurtosis indicates a tapering
distribution (also called a leptokurtic distribution), whereas a negative kurtosis indicates a flat
distribution (a platykurtic distribution). Distributions resembling a normal distribution are
sometimes called mesokurtic distributions.
The kurtosis is defined by the following formula: (1)

g₂ = m₄ / m₂² − 3

This equation of the kurtosis is valid for a sample and is a biased estimator of the kurtosis of
the population. In order to estimate the kurtosis of the population you have to use the
following formula:

G₂ = ((n + 1)·g₂ + 6) · (n − 1) / ((n − 2)(n − 3))

(1) Note that the kurtosis is sometimes defined by another formula, omitting the term "−3" in
the formula above. In this case a normal distribution would yield a kurtosis of 3.

SKEWNESS
Skewness is the degree of asymmetry, or departure from symmetry, of the distribution of a
real-valued random variable.

Positive Skewness
If the frequency curve of a distribution has a longer tail to the right of the central maximum
than to the left, the distribution is said to be skewed to the right, or to be positively skewed.
In a positively skewed distribution the mean is greater than the median and the median is
greater than the mode, i.e. Mean > Median > Mode.

Negative Skewness
If the frequency curve has a longer tail to the left of the central maximum than to the right,
the distribution is said to be skewed to the left, or to be negatively skewed. In a negatively
skewed distribution the mode is greater than the median and the median is greater than the
mean, i.e. Mode > Median > Mean.

In a symmetrical distribution the mean, median and mode coincide. In skewed distribution
these values are pulled apart. Pearson's Coefficient of Skewness Karl Pearson, (1857-1936)
introduced a coefficient of skewness to measure the degree of skewness of a distribution or
curve, which is denote by and denoted by

Usually this coefficient varies between -3 (for negative skewness) and +3 (for positive skewness), and its sign indicates the direction of skewness.
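As a minimal sketch of Pearson's coefficient in Python, using the 3(Mean - Median)/s form (the function name and data are my own):

```python
import statistics

def pearson_skewness(data):
    """Pearson's second coefficient of skewness:
    Sk = 3 * (mean - median) / standard deviation."""
    s = statistics.stdev(data)  # sample standard deviation
    return 3 * (statistics.mean(data) - statistics.median(data)) / s

print(pearson_skewness([1, 2, 2, 2, 3, 4, 5, 10]))  # positive: right-skewed
print(pearson_skewness([1, 2, 3, 4, 5]))            # 0.0: symmetric
```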
Bowley's Coefficient of Skewness or Quartile Coefficient of Skewness
Arthur Lyon Bowley (1869-1957) proposed a measure of skewness based on the median and the two quartiles:

SkB = (Q3 - 2Q2 + Q1) / (Q3 - Q1)

Its value lies between -1 and +1.
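Bowley's coefficient can be sketched in Python with the standard library's quartile helper (`statistics.quantiles`, available from Python 3.8; the function name is my own):

```python
import statistics

def bowley_skewness(data):
    """Bowley's quartile coefficient of skewness:
    (Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    q1, q2, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    return (q3 - 2 * q2 + q1) / (q3 - q1)

print(bowley_skewness([1, 2, 3, 4, 5, 6, 7]))      # 0.0: symmetric
print(bowley_skewness([1, 2, 2, 2, 3, 4, 5, 10]))  # positive: right-skewed
```

Because it uses only the quartiles, this measure is insensitive to extreme values in the tails, unlike the moment-based coefficient below.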

Moment Coefficient of Skewness

This measure of skewness is the third moment expressed in standard units (or the moment ratio), given by:

For a population:  γ1 = μ3 / μ2^(3/2) = μ3 / σ^3
For a sample:      g1 = m3 / m2^(3/2)

where μr (respectively mr) is the r-th moment about the mean. Its value usually lies between -2 and +2.

If the coefficient is greater than zero, the distribution or curve is said to be positively skewed; if it is less than zero, negatively skewed; and if it is zero, the distribution or curve is said to be symmetrical.
The skewness of the distribution of a real-valued random variable can easily be seen by drawing a histogram or frequency curve. The skewness may be very extreme; such distributions are called J-shaped distributions.

KURTOSIS
Karl Pearson introduced the term kurtosis for the degree of peakedness or flatness of a unimodal frequency curve. Measures of kurtosis describe the shape of the top of a frequency curve: they tell us the extent to which a distribution is more peaked or more flat-topped than the normal curve, which is symmetrical and bell-shaped.

➢ When the peak of a curve becomes relatively high, that curve is called Leptokurtic.

➢ When the curve is flat-topped, it is called Platykurtic.

➢ Since the normal curve is neither very peaked nor very flat-topped, it is taken as the basis for comparison.

➢ The normal curve is called Mesokurtic.

Kurtosis is usually measured by the moment ratio b2 (β2 for a population):

For a population:  β2 = μ4 / μ2^2
For a sample:      b2 = m4 / m2^2

For a normal distribution, kurtosis is equal to 3.


When the moment ratio is greater than 3, the curve is more sharply peaked and has narrower tails than the normal curve and is said to be leptokurtic.
When it is less than 3, the curve has a flatter top and relatively wider tails than the normal curve and is said to be platykurtic.
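The sample moment ratio can be sketched in Python (the function name and data are my own); for a flat, evenly spread sample it comes out well below 3, i.e. platykurtic:

```python
import statistics

def moment_kurtosis(data):
    """Sample moment ratio of kurtosis: b2 = m4 / m2**2.
    Roughly 3 for data drawn from a normal distribution."""
    n = len(data)
    mean = statistics.mean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2

print(moment_kurtosis([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # about 1.77: platykurtic
```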
Another measure of kurtosis, known as the percentile coefficient of kurtosis, is:

Kurt = Q.D / (P90 - P10)

where Q.D = (Q3 - Q1)/2 is the semi-interquartile range, P90 is the 90th percentile and P10 is the 10th percentile.
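A sketch of the percentile coefficient in Python, using `statistics.quantiles` for both the quartiles and the deciles (the function name and data are my own):

```python
import statistics

def percentile_kurtosis(data):
    """Percentile coefficient of kurtosis: Kurt = Q.D / (P90 - P10),
    where Q.D = (Q3 - Q1) / 2 is the semi-interquartile range."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    deciles = statistics.quantiles(data, n=10)  # P10, P20, ..., P90
    p10, p90 = deciles[0], deciles[-1]
    return ((q3 - q1) / 2) / (p90 - p10)

print(percentile_kurtosis(list(range(1, 100))))  # 0.3125 for this flat sample
```

For comparison, a normal distribution has a percentile coefficient of kurtosis of about 0.263.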

