Introduction to Statistics and Data
___________________________________________________________________
Consider the following examples:
o To travel from Jorhat to Golaghat, the average one-way travel time is 55.3 minutes.
o The number of Twitter users has risen by 176 million in the last year.
o 2 million new WhatsApp users are added every day. This comes to about 23 each second.
o In 2016, XYZ Corp. predicts that 37.5 million people in India (19% of smartphone users) will
perform transactions with their phones at sale terminals, which is an approximate 61% increase
from the last year.
Each of the points above contains numerical figures or facts, such as 55.3 minutes, 176
million, 23 users per second, and 37.5 million. These numerical facts are called statistics, and such
numbers, percentages, and figures allow us to understand business and economic situations.
From the above examples, it is clear that numbers and figures are the heart of statistics, without
which the discipline cannot survive. However, statistics is not limited to numbers and the graphical
representation of data. It is about extracting valuable information and conclusions from
such data.
“Statistics is a way to get information from data.” Data are facts, especially numerical facts, collected
together for reference or information. Information is knowledge communicated concerning some
particular fact. Thus, statistics is a tool for creating new understanding from a set of numbers.
Statistics comprises the methods that allow us to work with data effectively. It focuses on interpreting
the results of applying those methods and helps us make better decisions. It provides a formal basis to
summarize and visualize data, reach conclusions about those data, make reliable predictions about
activities, and improve processes.
Definitions of statistics
The Merriam-Webster dictionary defines statistics as “classified facts representing the conditions of a
people in a state – especially the facts that can be stated in numbers or any other tabular or classified
arrangement”. The renowned statistician Arthur Lyon Bowley defines statistics as “numerical statements
of facts in any department of inquiry placed in relation to each other”. He also writes, “Statistics may
be called the science of counting”, and elsewhere, “Statistics may be called the science of averages”.
According to Horace Secrist, “By statistics we mean aggregates of facts affected to a marked extent by
a multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable
standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in
relation to each other”.
Thus, based on the definitions above, we can say that statistics has the following features:
o It is an aggregate of facts.
o It is affected by several causes.
o It is expressed numerically.
o It is enumerated or estimated to a reasonable standard of accuracy.
o It is represented in some form.
Statistics may be defined as “a collection of procedures and principles useful as an aid in gathering and
analyzing numerical information for the purpose of drawing conclusions and making decisions.”
Statistics is the science of collecting, analyzing, presenting, and interpreting data, as well as of making
decisions based on such analyses. The following are some important definitions of statistics.
Statistics is the branch of science which deals with the collection, classification and tabulation of
numerical facts as the basis for explanation, description and comparison of phenomena.
– Lovitt
The science which deals with the collection, analysis and interpretation of numerical data.
– Croxton & Cowden
The science of statistics is the method of judging collective, natural or social phenomena from
the results obtained from the analysis or enumeration or collection of estimates.
– King
Thus, statistics helps in designing systems, describing them, and drawing inferences from
them. Learning statistics therefore helps the decision-maker understand how to:
o Present and describe information (data) to improve decisions.
o Draw conclusions about a large population based on information obtained from
samples.
o Seek out relationships between pairs of variables to improve processes.
Division of Statistics
Statistics is concerned with the collection of data, its subsequent description, and its analysis, which
often leads to the drawing of conclusions. Statistical methods are broadly divided into two categories:
a) Data collection and descriptive statistics
b) Inferential statistics and probability models
Descriptive statistics is concerned with organizing data and bringing their essential characteristics
into focus so that conclusions can be drawn. It includes the statistical methods for collecting,
classifying, tabulating, analyzing, and interpreting data. These methods include diagrams and
graphs, as well as numerical techniques such as averages, dispersion, correlation, and regression.
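As a minimal sketch of the numerical techniques mentioned above, the following Python snippet computes two averages and two measures of dispersion using the standard `statistics` module. The `daily_sales` figures are hypothetical and chosen only for illustration; they do not come from these notes.

```python
import statistics

# Hypothetical daily sales figures (illustrative data, not from the notes).
daily_sales = [12, 15, 11, 18, 14, 15, 13]

# Averages: two common measures of central tendency.
mean_sales = statistics.mean(daily_sales)
median_sales = statistics.median(daily_sales)

# Dispersion: how spread out the observations are.
sales_range = max(daily_sales) - min(daily_sales)
sample_std_dev = statistics.stdev(daily_sales)  # sample standard deviation

print(f"mean = {mean_sales}, median = {median_sales}")
print(f"range = {sales_range}, standard deviation = {sample_std_dev:.2f}")
```

Each value summarizes only the seven observed days; none of them makes a claim about days that were not observed.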
Inferential statistics consists of the methods that use data collected from a small group (called a sample)
to reach conclusions about a larger group (called the population). It is the process of
making an estimate, prediction, or decision about a population based on a sample. Estimation
theory and hypothesis testing belong to inferential statistics. Estimating population
values is imperative in decision-making.
An understanding of statistical inference requires some knowledge of probability theory, which forms
its basis.
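The idea of estimating a population value from a sample can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 1,000 values, the sample size of 100, and the fixed random seed are all hypothetical choices, not part of these notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 values (illustrative only).
population = list(range(1, 1001))
population_mean = statistics.mean(population)

# Draw a simple random sample of 100 values without replacement.
sample = random.sample(population, 100)

# The sample mean serves as a point estimate of the population mean.
sample_mean = statistics.mean(sample)
estimation_error = sample_mean - population_mean

print(f"population mean = {population_mean}")
print(f"sample mean     = {sample_mean}")
print(f"error           = {estimation_error}")
```

In practice the population mean is unknown, which is why the sample estimate is needed. Probability theory is what describes how far such an estimate is likely to fall from the true value.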
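Example: Mr X sold 2, 1, and 0 mobile phones, respectively, on the last three Sundays.
A descriptive statement summarizes only these observations: “Mr X sold an average of 1 mobile phone
per Sunday over the last three Sundays.” An inferential statement generalizes beyond them: “Mr X never
sells more than 2 mobile phones on a Sunday.”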
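The Mr X example can be sketched in Python to show where description ends and inference begins. The sales figures are those given in the example; the helper `consistent_with_claim` is a hypothetical name introduced here only for illustration.

```python
import statistics

# Phones sold on the last three Sundays (figures from the example).
sunday_sales = [2, 1, 0]

# Descriptive: statements about the observed data only.
average_sold = statistics.mean(sunday_sales)
most_sold = max(sunday_sales)

# Inferential: "never more than 2 on a Sunday" is a claim about ALL Sundays,
# including unobserved ones. The sample can be consistent with the claim or
# contradict it, but it cannot prove it.
def consistent_with_claim(sales, limit=2):
    """Return True if no observed value exceeds the claimed limit."""
    return all(sold <= limit for sold in sales)

print(f"average sold = {average_sold}, most sold = {most_sold}")
print(f"consistent with the claim: {consistent_with_claim(sunday_sales)}")
```

A single future Sunday with 3 sales would contradict the inferential claim, while the descriptive summaries of the first three Sundays would remain true.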
Comparison between Descriptive and Inferential Statistics
Basis | Descriptive Statistics | Inferential Statistics
Meaning | The branch of statistics concerned with describing the population under study. | The branch of statistics that focuses on drawing conclusions about a population on the basis of sample analysis and observation.
What it does | Organizes, analyzes, and presents data in a meaningful way. | Compares, tests, and predicts data.
Form of final result | Charts, graphs, and tables | Probability
Usage | To describe a situation. | To explain the chances of occurrence of an event.
Function | It explains data that are already known, in order to summarize the sample. | It attempts to reach conclusions about the population that extend beyond the data available.
Importance of Statistics: Statistics has wide applications in almost all sciences, namely biology,
psychology, education, economics, business, management, commerce, etc. It has become
vital in all stages of human endeavour. The use of statistical data and statistical techniques is so wide that
today almost every ministry and department in the government has a separate statistical section. In
business, statistics has already made fundamental changes in maintaining and improving output
quality, in selecting and promoting personnel, in the efficient use of materials, in projecting long-term
capital requirements and forecasting sales, in estimating consumers’ preferences, and in various other
phases of business research and management. It has become increasingly evident that statistics.
What is Data?
Data are facts about the world that one seeks to study and explore. They are a collection of observations
of one or more variables of interest. Data may be numeric or non-numeric. Data that can be
presented with numbers are called numeric. For example, when a nurse weighs a patient or takes a
patient’s temperature, a measurement consisting of a number, such as 150 pounds or 100 degrees
Fahrenheit, is obtained. Likewise, when a hospital administrator counts the number of patients
discharged from the hospital on a given day, a number (perhaps 20) is obtained. Data that cannot be
presented in numbers are called non-numeric. Examples include patient name, seat location, and mode
of payment at the counter.
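The distinction between numeric and non-numeric data can be illustrated with a small Python sketch. The `record` below is a hypothetical patient record built from the kinds of fields mentioned above; the field names, the sample values, and the helper `is_numeric` are assumptions made for illustration.

```python
# A hypothetical patient record mixing both kinds of data.
record = {
    "patient_name": "A. Sharma",  # non-numeric
    "weight_lb": 150,             # numeric (a measurement)
    "temperature_f": 100.0,       # numeric (a measurement)
    "mode_of_payment": "card",    # non-numeric
}

def is_numeric(value):
    """Return True for int or float values (bool is excluded on purpose,
    because bool is a subclass of int in Python)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

numeric_fields = [name for name, value in record.items() if is_numeric(value)]
non_numeric_fields = [name for name, value in record.items() if not is_numeric(value)]

print("numeric:", numeric_fields)
print("non-numeric:", non_numeric_fields)
```

Arithmetic such as averaging is meaningful for the numeric fields only; non-numeric fields can be counted or grouped but not averaged.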
What is Big Data? A collection of data that cannot be easily analysed using traditional methods is called
big data. Big data implies data that
o are being collected in huge volumes,
o arrive at very fast rates or velocities, and
o come in a variety of forms other than traditional structured forms such as data processing records,
files, tables and worksheets.
Structured and Unstructured Data:
Structured data is data that is highly organized and neatly formatted. It is the type of data that can be
put into tables and spreadsheets. Structured data is also often referred to as quantitative data.
Unstructured data is unorganized and requires more work to investigate properly. Collecting,
processing, and analyzing unstructured data is a significant challenge. Unstructured data is
also called qualitative data, and it basically covers everything that structured data does not.
Structured data includes credit card numbers, dates, financial amounts, phone numbers, addresses,
product names, and more. These are all data points that are not open to interpretation, which makes
them easy for big data applications to collect and analyze. Customer data, for example, would include
facts like the customer’s name and the transactions he or she engaged in. Unstructured data includes
reports, audio files, images, video files, text files, social media comments and opinions, emails, and more.
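The practical difference between the two kinds of data can be sketched in Python. The transactions and the comment below are hypothetical, and the regular expression is only one simple way to pull numbers out of free text.

```python
import re

# Structured data: every record has the same named fields,
# so a question like "what is the total amount?" is a one-line query.
transactions = [
    {"customer": "Asha", "date": "2016-03-01", "amount": 250.0},
    {"customer": "Ravi", "date": "2016-03-02", "amount": 120.5},
]
total_amount = sum(t["amount"] for t in transactions)

# Unstructured data: free text has no fields, so the same information
# must first be extracted, here with a simple pattern for numbers.
comment = "Paid 250 rupees at the counter yesterday, service was quick."
amounts_in_text = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", comment)]

print(f"total from structured records: {total_amount}")
print(f"numbers found in free text: {amounts_in_text}")
```

The pattern finds the number but not its meaning: nothing in the text says that 250 is an amount paid rather than, say, a room number. That extra interpretation step is what makes unstructured data harder to analyze.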
Quantitative and Qualitative data
Quantitative data are data that represent values and counts and can be expressed in
numerical terms. They can be quantified in definite units of measurement and are concerned with
measurements such as height, weight, volume, length, size, humidity, speed, and age.
In statistics, most analyses are conducted using quantitative data. Such data can also be presented in
tabular and diagrammatic form, as charts, graphs, tables, etc. Further, quantitative
data can be classified as discrete or continuous. The methods used to collect such data are
surveys, experiments, observations, and interviews.
Qualitative data is non-statistical and is typically unstructured or semi-structured in nature. It is
not measured using the hard numbers used to develop graphs and charts. Instead, it is categorized based on
properties, attributes, labels, and other identifiers.
Qualitative data is investigative and is often open-ended until further research is conducted. Data
generated from qualitative research are used for theorizing, interpretation, developing hypotheses, and
forming an initial understanding. Qualitative data can be generated through:
o Texts and documents
o Audio and video recordings
o Images and symbols
o Interview transcripts and focus groups
o Observations and notes
Sources of Data
Statistical activity is motivated by the need to answer a question. For example,
clinicians may want answers to questions regarding the relative advantages of competing treatment
procedures. Administrators may want answers to questions regarding such areas of concern as
employee morale or facility utilization. Answering such a question with statistics begins with a
search for suitable data, which serve as the raw material for the investigation.
The available sources of data are:
1. Routinely kept records: Hospital medical records, for example, contain immense
amounts of information on patients, while hospital accounting records contain a wealth of data
on the facility’s business activities.
2. Surveys: If the data needed to answer a question are not available from routinely kept records,
the logical source may be a survey. For example, the administrator of a clinic may wish to
obtain information regarding the mode of transportation used by patients to visit the clinic.
3. Experiments: A nurse may wish to know which of several strategies is best for maximizing
patient compliance. The nurse might conduct an experiment in which the different strategies of
motivating compliance are tried with different patients. Subsequent evaluation of the responses
to the different strategies might enable the nurse to decide which is most effective.
4. External sources: The data needed to answer a question may already exist in the form of
published reports, commercially available data banks, or the research literature. In other words,
we may find that someone else has already asked the same question, and the answer obtained
may be applicable to our present situation.
_____________________________________________________________________________________
Prepared by: Dr Anil K Bhatia, Assistant Professor (Statistics), Kaziranga University, Jorhat