Introduction to Statistics
OR
Descriptive Statistics
What is Statistics?
Statistics is a collection of procedures and principles
for gathering data and analyzing information in order
to help people make decisions when faced with
uncertainty.
What is Statistics?
“Statistics is a way to get information from data”
Data → Statistics → Information
Statistics is a tool for creating new understanding from a
set of numbers.
Getting started
In simple words, statistics is a way to get information
from data. More precisely, it is a science of collecting
and analyzing data to draw conclusions and make
decisions.
Every application of statistics involves one or more of
the following three tasks:
• Collecting data,
• Summarizing and exploring data,
• Drawing conclusions and making decisions based on
data
Statistics Defined
“Statistics is the science and practice of developing
human knowledge through empirical data expressed in
quantitative form. It is based on statistical theory, a
branch of applied mathematics. Within the statistical
theory, randomness and uncertainty are modelled by
probability theory.”
Why study statistics?
Areas that rely on statistical information and
techniques include:
• Quality control
• Product planning
• Yearly reports
• Forecasting
• Market research
• Medical research
Applications of Statistics in Computer Science
A statistical background is essential for understanding
algorithms and statistical properties that form the
backbone of computer science.
Statistics is used for
• Data mining and analytics
• Machine learning
• Vision and image analysis
• Data compression
• Network and traffic modeling
Types of statistics
There are two types of statistics, namely, descriptive
and inferential statistics.
Descriptive Statistics:
Descriptive statistics are the methods that help collect,
summarize, present, and analyze data.
Inferential Statistics:
Inferential statistics are the methods that use the data
collected from a small group to draw conclusions about a
larger group.
Population
• The entire collection of
individuals or objects about which
information is desired
• What do you call it when you collect data about the
entire population?
A census is performed to gather data about the entire
population.
Suppose we wanted to know the
average GPA of high school
graduates in the nation this year.
We could collect data from all
high schools in the nation.
Sample
• A subset of the population, selected
for study in some prescribed manner
What would a sample of all high school
graduates across the nation look like?
High school graduates from each state
(region), ethnicity, gender, etc.
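The population/sample distinction can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the GPA values are randomly generated rather than real, and a simple random sample of 200 is used as the simplest "prescribed manner" of selection (a sample balanced across state, ethnicity, gender, etc. would need a stratified design instead).

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 made-up GPAs between 1.0 and 4.0.
population = [round(random.uniform(1.0, 4.0), 2) for _ in range(10_000)]

# A sample is a subset of the population chosen in a prescribed manner;
# here, a simple random sample of 200 graduates without replacement.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)  # would require a census
sample_mean = statistics.mean(sample)          # requires only the sample

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why conclusions drawn from a sample carry uncertainty.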
Descriptive statistics
• the methods of organizing &
summarizing data
If the sample of high school GPAs contained
1,000 numbers, how could the data be organized
or summarized?
• Create a graph
• State the range of GPAs
• Calculate the average GPA
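As a minimal sketch of these three summaries, the Python below uses a handful of made-up GPAs (the values and the whole-number bins are assumptions for illustration, not data from the example) and produces a crude text graph, the range, and the average.

```python
import statistics

# Hypothetical sample of GPAs (made-up values for illustration).
gpas = [2.1, 2.8, 3.0, 3.0, 3.4, 3.7, 3.9, 4.0]

gpa_range = (min(gpas), max(gpas))  # state the range of GPAs
average = statistics.mean(gpas)     # calculate the average GPA

# A crude text "graph": count how many GPAs fall in each whole-number bin.
bins = {}
for g in gpas:
    bins[int(g)] = bins.get(int(g), 0) + 1

for low in sorted(bins):
    print(f"{low}.0-{low}.9 | {'*' * bins[low]}")
print("range:", gpa_range)
print(f"mean: {average:.4f}")
```

With 1,000 values instead of 8 the code is unchanged; only the list grows.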
Variable
• any characteristic whose value may
change from one individual to
another
• Suppose we wanted to know the
average GPA of high school
graduates in the nation this year.
Define the variable of interest.
The variable of interest is the
GPA of high school graduates.
Two types of variables
• Categorical
• Numerical
– Discrete
– Continuous
Categorical variables
• Qualitative
• Categorical variables (also known as qualitative
variables) have values that can only be placed
into categories such as yes and no.
• Identifies basic differentiating
characteristics of the population
• “Do you currently own bonds?” (yes or no) and
the level of risk of a bond fund (below average,
average, or above average) are examples of
categorical variables.
Numerical variables
• Quantitative
• observations or measurements take on
numerical values
• Numerical variables are further
identified as being either discrete or
continuous variables.
• makes sense to average these values
Discrete (numerical)
• Isolated points along a number line
• usually counts of items
• EXAMPLE: The number of rooms in a
house, or the number of hammers
sold at the local Home Depot
(1,2,3,…, etc).
Continuous (numerical)
• Variable that can be any value in a given
interval
• usually measurements of something
• The time you wait for teller service at a bank
is an example of a continuous numerical
variable
• EXAMPLE: The pressure in a tire, the weight
of a pork chop, or the height of students in a
class.
Identify the following variables:
1. the color of cars in the teacher’s lot
Categorical
2. the number of calculators owned by
students at your school
Discrete numerical
3. the zip code of an individual
Categorical
4. the amount of time it takes students to
drive to school
Continuous numerical
5. the surveyed value of homes in your city
Discrete numerical
Classifying variables by the number of variables
in a data set
Suppose that the PE coach records the height of each
student in his class.
This is an example of
univariate data
Univariate - data that describes a single characteristic
of the population
Classifying variables by the number of variables
in a data set
Suppose that the PE coach records the height and
weight of each student in his class.
This is an example of
bivariate data
Bivariate - data that describes two characteristics of
the population
Classifying variables by the
number of variables in a data set
Suppose that the PE coach records the height,
weight, number of sit-ups, and number of push-ups
for each student in his class.
This is an example of
multivariate data
Multivariate - data that describes more than
two characteristics of the population
How many variables have you
measured?
• Univariate data: One variable is
measured on a single experimental unit.
• Bivariate data: Two variables are
measured on a single experimental unit.
• Multivariate data: More than two
variables are measured on a single
experimental unit.
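One way to picture the three cases is as records with one, two, or more than two fields per experimental unit. The sketch below is only an illustration: the student measurements are invented, and `select` and `describe` are hypothetical helper names, not part of any library.

```python
# Hypothetical records for three students (made-up values).
students = [
    {"height_cm": 160, "weight_kg": 55, "sit_ups": 30, "push_ups": 12},
    {"height_cm": 172, "weight_kg": 68, "sit_ups": 41, "push_ups": 20},
    {"height_cm": 181, "weight_kg": 77, "sit_ups": 35, "push_ups": 25},
]


def select(records, *variables):
    """Keep only the named variables for each experimental unit."""
    return [tuple(r[v] for v in variables) for r in records]


def describe(n_variables):
    """Name a data set by how many variables are measured per unit."""
    if n_variables < 1:
        raise ValueError("need at least one variable")
    return {1: "univariate", 2: "bivariate"}.get(n_variables, "multivariate")


univariate = select(students, "height_cm")
bivariate = select(students, "height_cm", "weight_kg")
multivariate = select(students, "height_cm", "weight_kg", "sit_ups", "push_ups")

for data in (univariate, bivariate, multivariate):
    print(describe(len(data[0])), data)
```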
Variables and Data
• A variable is a characteristic that
changes or varies over time and/or
for different individuals or objects
under consideration.
• Examples: Hair color, white blood
cell count, time to failure of a
computer component.
Examples
• For each orange tree in a grove, the number of
oranges is measured.
• Quantitative discrete
• For a particular day, the number of cars entering a
college campus is measured.
• Quantitative discrete
• Time until a light bulb burns out
• Quantitative continuous
Examples
• The number of hammers sold at
the local Home Depot is measured.
• Quantitative discrete
• The number of rooms in a house is measured.
• Quantitative discrete
• The pressure in a tire, the height of students
in a class
• Quantitative continuous
Types of Variables
Raw Data
• When measurements are taken from a subset
of individuals in a population, they represent
sample data.
• When all individuals in a population are
measured, the measurements represent
population data.
• Descriptive statistics are summaries of the
raw data for all the individuals in a population
or a sample.
Levels of measurement
MUTUALLY EXCLUSIVE:
A property of a set of categories such that an
individual or object is included in only one category.
EXHAUSTIVE:
A property of a set of categories such that each
individual or object must appear in a category.
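Both properties can be checked mechanically when the categories are written as sets of individuals. The following is a minimal sketch under that assumption; the names, the eye-color grouping, and the two helper functions are hypothetical and introduced only for illustration.

```python
def is_mutually_exclusive(categories):
    """True if no individual appears in more than one category."""
    seen = set()
    for members in categories.values():
        if seen & members:  # someone was already placed elsewhere
            return False
        seen |= members
    return True


def is_exhaustive(categories, individuals):
    """True if every individual appears in at least one category."""
    covered = set().union(*categories.values())
    return individuals <= covered


# Hypothetical individuals and a made-up grouping by eye color.
individuals = {"Ann", "Ben", "Cho", "Dev"}
by_eye_color = {
    "brown": {"Ann", "Cho"},
    "blue": {"Ben"},
    "green": {"Dev"},
}

print(is_mutually_exclusive(by_eye_color))       # each person in one category only
print(is_exhaustive(by_eye_color, individuals))  # nobody left uncategorized
```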
Data
• In general, there are many types of data that can be used to
measure the properties of an entity.
• A good understanding of data scales (also called scales of
measurement) is important.
• Depending on the scale of measurement, different
techniques are used to derive previously unknown knowledge
in the form of patterns, associations, anomalies, or
similarities from a volume of data.
NOIR
Classification of scales of
Measurement
NOIR classification
• The most commonly recommended scales of
measurement are
N: Nominal
O: Ordinal
I: Interval
R: Ratio
The NOIR scale is the fundamental building
block for the extended data types.
Levels of Measurement: Nominal, Ordinal, Interval, Ratio
Levels of measurement
Nominal = data that are classified into categories and
cannot be arranged in any particular order.
EXAMPLES: eye color, gender, religious affiliation.
To summarize, nominal-level data have the following
properties:
1. Data categories are mutually exclusive and
exhaustive.
2. Data categories have no logical order.
Levels of measurement
• A nominal scale classifies data into distinct categories in
which no ranking is implied.
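Because nominal categories have no order, the natural summaries are counts and the most frequent category (the mode). A small sketch with made-up eye-color observations, using the standard-library `collections.Counter`:

```python
from collections import Counter

# Hypothetical nominal observations (made-up values).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(eye_colors)                 # frequency of each category
mode, mode_count = counts.most_common(1)[0]  # most frequent category

print(dict(counts))
print("mode:", mode, "with", mode_count, "observations")

# Sorting the labels is possible, but the result is only alphabetical:
# it says nothing about one eye color being "greater" than another.
print(sorted(counts))
```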
Levels of measurement
Ordinal = data arranged in some order, but the
differences between data values cannot be
determined or are meaningless.
EXAMPLE: During a taste test of 4 soft drinks,
Mellow Yellow was ranked number 1, Sprite number 2,
Seven-up number 3, and Orange Crush number 4.
The properties of ordinal-level data are:
1. The data classifications are mutually exclusive and
exhaustive.
2. Data classifications are ranked or ordered according
to the particular trait they possess.
Levels of measurement
• An ordinal scale classifies data into distinct
categories in which ranking is implied
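With ordinal data the ranking can be used for sorting and for finding a middle category, even though the gaps between categories carry no meaning. The sketch below reuses the bond-fund risk levels mentioned earlier; the list of fund ratings is invented for illustration, and `LEVELS` and `RANK` are hypothetical names.

```python
# The implied ranking, from lowest to highest risk.
LEVELS = ["below average", "average", "above average"]
RANK = {level: position for position, level in enumerate(LEVELS)}

# Hypothetical risk ratings for five bond funds (made-up values).
funds = ["average", "above average", "below average", "average", "above average"]

# Sort by position in the ranking, not alphabetically.
ordered = sorted(funds, key=lambda level: RANK[level])

# The middle value of an odd-length ordered list is a meaningful summary.
median_level = ordered[len(ordered) // 2]

print(ordered)
print("median risk level:", median_level)
```

The integers in `RANK` only encode order; subtracting or averaging them would assume equal spacing between levels, which an ordinal scale does not provide.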
Levels of measurement
Interval = similar to the ordinal level, with the additional
property that meaningful amounts of differences
between data values can be determined. There is no
natural zero point.
EXAMPLE: Temperature on the Fahrenheit scale.
The properties of interval-level data are:
1. Data classifications are mutually exclusive and
exhaustive.
2. Data classifications are ordered according to the
amount of the characteristics they possess.
3. Equal differences in the characteristic are
represented by equal differences in the measurements.
Levels of measurement
Ratio = similar to interval level with an inherent zero
starting point. Differences and ratios are meaningful
for this level of measurement.
EXAMPLES: Monthly income of surgeons, or distance
traveled by manufacturer’s representatives per month.
The properties of the ratio-level data are:
1. Data classifications are mutually exclusive and
exhaustive.
2. Data classifications are ordered according to the
amount of the characteristics they possess.
3. Equal differences in the characteristic are
represented by equal differences in the numbers
assigned to the classifications.
4. The zero point is the absence of the characteristic.
Levels of measurement
The difference between interval and ratio
measurements can be confusing.
The fundamental difference involves the definition of a
true zero and the ratio between two values.
An interval scale is an ordered scale in which the
difference between measurements is a meaningful
quantity but the measurements do not have a true
zero point.
A ratio scale is an ordered scale in which the
difference between the measurements is a
meaningful quantity and the measurements have a
true zero point.
Interval and Ratio Scales
Example 1: Levels of measurement
Identify the type of data.
Taos, Acoma, Zuni, and Cochiti are the names of four Native
American villages from the population of all Native American
villages in Arizona and New Mexico.
Solution:
These data are at the nominal level. Notice that these data values
are simply names. By looking at the name alone, we cannot
determine if one name is “greater than or less than” another. Any
ordering of the names would be numerically meaningless.
Example 2: Levels of measurement
In a high school graduating class of 319 students, Jim
ranked 25th, Kim ranked 19th, Walter ranked 10th, and
Julia ranked 4th, where 1 is the highest.
Solution:
These data are at the ordinal level. Ordering the data
makes sense. Walter ranked higher than Kim. Jim had
the lowest rank, and Julia the highest.
However, numerical differences in ranks do not have
meaning.
Example 2: Levels of measurement
The difference between Kim’s and Jim’s ranks is 6,
which is the same as the difference between Walter’s and
Julia’s ranks. However, this difference doesn’t mean
anything significant.
For instance, if you looked at the grade point average,
Walter and Julia may have had a large gap between
their grades, whereas Kim and Jim may have had closer
grades.
In any ranking system, only relative standing matters.
Differences between ranks are meaningless.
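The point can be checked with the ranks from this example. The GPAs in the sketch below are hypothetical, invented only to show how equal rank gaps can hide unequal gaps in the underlying grades.

```python
# Ranks from the example (1 is the highest).
ranks = {"Jim": 25, "Kim": 19, "Walter": 10, "Julia": 4}

# Hypothetical GPAs, made up for illustration; they are not part of the example.
gpas = {"Jim": 3.30, "Kim": 3.35, "Walter": 3.60, "Julia": 3.95}

rank_gap_kim_jim = ranks["Jim"] - ranks["Kim"]
rank_gap_julia_walter = ranks["Walter"] - ranks["Julia"]

gpa_gap_kim_jim = gpas["Kim"] - gpas["Jim"]
gpa_gap_julia_walter = gpas["Julia"] - gpas["Walter"]

print("rank gaps:", rank_gap_kim_jim, rank_gap_julia_walter)
print(f"GPA gaps:  {gpa_gap_kim_jim:.2f} {gpa_gap_julia_walter:.2f}")
```

Both rank gaps equal 6, yet under these assumed grades the gap between Julia and Walter is several times the gap between Kim and Jim.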
Levels of measurement
Another example is a sales representative who travels
250 miles on Monday and 500 miles on Tuesday. The
ratio of the distances travelled on the two days is 2/1;
converting these distances to kilometres, or even
inches, will not change the ratio. It is still 2/1.
Suppose the sales representative works at home on
Wednesday and does not travel. The distance travelled
on this day is zero, and this is a meaningful value.
Hence, the variable distance has a true zero point.
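The unit-independence of the ratio is easy to verify numerically. A short sketch using the distances from this example; the conversion constant is the standard 1.609344 km per mile, and the variable names are chosen for illustration.

```python
import math

KM_PER_MILE = 1.609344  # standard definition of the international mile

monday_miles, tuesday_miles = 250, 500

ratio_miles = tuesday_miles / monday_miles
ratio_km = (tuesday_miles * KM_PER_MILE) / (monday_miles * KM_PER_MILE)

print("ratio in miles:     ", ratio_miles)
print("ratio in kilometres:", ratio_km)

# Zero miles is also zero kilometres: the zero point does not depend on the unit.
wednesday_miles = 0
print("Wednesday in km:", wednesday_miles * KM_PER_MILE)
```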
Example 3: Levels of measurement
Body temperatures (in degrees Celsius) of fish in the Yellowstone
River.
Solution:
These data are at the interval level. We can certainly order the
data, and we can compute meaningful differences. However, for
Celsius-scale temperatures, there is not an inherent starting point.
The value 0 °C may seem to be a starting point, but this value does
not indicate the state of “no heat.”
Furthermore, it is incorrect to say that 20 °C is twice as hot as
10 °C.
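A quick numerical check shows why ratios fail on an interval scale. The sketch below converts the same two temperatures to Fahrenheit and to kelvin (a scale that does have a true zero); the function names are chosen for illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def c_to_k(celsius):
    """Convert degrees Celsius to kelvin."""
    return celsius + 273.15


# The apparent "twice as hot" ratio depends on the unit chosen.
ratio_c = 20 / 10
ratio_f = c_to_f(20) / c_to_f(10)
ratio_k = c_to_k(20) / c_to_k(10)
print(f"ratio in C: {ratio_c:.3f}, in F: {ratio_f:.3f}, in K: {ratio_k:.3f}")

# Differences, by contrast, stay comparable: equal gaps in Celsius
# remain equal gaps in Fahrenheit.
gap_low = c_to_f(20) - c_to_f(10)
gap_high = c_to_f(30) - c_to_f(20)
print("Fahrenheit gaps:", gap_low, gap_high)
```

The ratio changes with the unit, so it carries no physical meaning, while equal differences are preserved, which is exactly the interval-level property.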
Example 4: Levels of measurement
Length of fish swimming in the Yellowstone River.
Solution:
These data are at the ratio level. An 18-inch fish is
three times as long as a 6-inch fish. Observe that we
can divide 6 into 18 to determine a meaningful ratio of
fish lengths.
Example: Levels of measurement
What is the level of measurement for each of the following
variables?
a. Distance students travel to class. Ratio
b. Student scores on the first statistics test. Interval
c. A classification of students by state of birth. Nominal
d. A ranking of students by freshman, sophomore, junior, and senior. Ordinal
e. Number of hours students study per week. Ratio