
01 - Intro and Descriptive Statistics

The document provides an overview of a statistics and probability course, covering topics such as descriptive and inferential statistics, probability distributions, and data analysis techniques. It emphasizes the importance of statistics in decision-making across various fields and introduces measurement scales and types of variables. Additionally, it includes practical applications using software like MS Excel, Minitab, and R for statistical computing.

Uploaded by

majidali365015
Copyright
© All Rights Reserved

STAT-402

3(2-1)

Statistics and Probability

1
Theory
Introduction to statistics, Variables, Types of Measurement, Population
and Sample, Descriptive Statistics and decision making with Statistics,
Graphical representation of Data, Bar charts, Pie charts, Stem and leaf
plots, Box plots, Histograms, Frequency curves, Measures of Central
tendency, Measures of dispersion, Moments of frequency distribution,
real-life examples, Use of elementary statistical packages for
exploratory data analysis. Counting techniques, definition of probability
with classical, relative frequency and subjective approaches, sample
space, events, Laws of probability. Conditional probability and Bayes'
theorem with applications. Random variables (discrete and continuous):
Binomial, Poisson, Geometric, Negative binomial distributions;
Exponential, Gamma and Normal distributions.
2
Practical

Working with MS Word and Excel. Use of Minitab and
biostatistics software packages for data analysis.

3
R language
• R is an open-source language used for statistical computing
and graphics. It is often used in statistical analysis and data
mining, and it can be used in analytics to identify patterns
and build practical models.

4
What is Statistics?
Statistics is the art and science of extracting information from data.

Data → Statistics → Information

Data: Raw facts and figures, especially numerical facts, collected together for
information.
Information: Knowledge communicated concerning some particular facts.

5
Why study statistics?
1. Data are everywhere
2. Statistical techniques are used to make many decisions that affect
our lives
3. No matter what your career, you will make professional decisions
that involve data. An understanding of statistical methods will help
you make these decisions effectively.
Statistics is the science of collecting, organizing,
summarizing, and analyzing data and making
predictions or decisions about a population based on a
sample, while accounting for uncertainty.
6
Statistics

• Statistics is the science of collecting, organizing, analyzing,
interpreting, and presenting data to uncover patterns, relationships,
and trends.
• It provides methods for designing studies, conducting surveys, and
making predictions based on data.
• By studying a small group (sample), statistics helps draw conclusions
about a larger group (population) while accounting for uncertainty.
• It is widely used in various fields, including science, business,
economics, and social sciences, to support decision-making and
problem-solving.
7
Variation and Uncertainty
• Statistics is the subject that deals with variability. No two
objects in the universe are exactly alike. If they were, there would
be no statistical problem.

• It also deals with uncertainty, since every process of obtaining
observations, whether controlled or uncontrolled, involves
deficiencies or chance variation. That is why we must speak in
terms of probability: inferences made about
the population on the basis of sample evidence cannot be
absolutely certain.
8
Population vs Sample
Statistical inference links the sample back to the population.

Population (has parameters): µ, σ, ρ
Sample (has statistics): X̄, S, r

Population: A population is the group of all objects/elements/items
under investigation.
Sample: A representative part/subset of the population.
9
Why Sampling?

• The process of drawing a sample from a population is called
sampling.
• Reduced cost
• Greater speed
• Greater accuracy
• Sometimes it is the only option (e.g., testing the life of bulbs/bullets)

10
Branches of Statistics

Statistics: Descriptive and Inferential

Descriptive: Involves the organization, summarization, and display of
data in tables, graphs and summary numbers such as X̄, S, r, p.

Inferential: Uses sample information such as X̄, S, r, p to draw
inferences about unknown population parameters.
11
Branches of Statistics

Statistics: Descriptive and Inferential

Descriptive: Descriptive statistics are brief descriptive
coefficients that summarize a given data set, which can be either a
representation of the entire population or a sample of it. Descriptive
statistics are broken down into measures of central tendency and
measures of variability (spread).

Inferential: Uses sample information such as X̄, S, r, p to draw
inferences about unknown population parameters.
12
Variable: Any characteristic that varies from object to object, place to place
or over time is known as a variable, e.g., marks, age, height, sex,
temperature, sales, revenue, time, etc.

Variable: Qualitative or Quantitative (Discrete or Continuous)

Qualitative: A characteristic which varies in quality (not numerically), e.g.,
eye colour, education level, behaviour, quality, design, performance.

Quantitative, discrete: No. of students, no. of chairs, no. of deaths,
no. of births in a hospital, no. of accidents.

Quantitative, continuous: Height, weight, marks, time, distance, temperature.
13
Guess the type of Variable!

ID Sex Age Smoke Vitamin VitaminUse Quetelet Calories Fat Fiber Cholesterol
1 Female 64 No 1 Regular 21.4838 1298.8 57 6.3 170.3
2 Female 76 No 1 Regular 23.8763 1032.5 50.1 15.8 75.8
3 Female 38 No 2 Occasional 20.0108 2372.3 83.6 19.1 257.9
4 Female 40 No 3 No 25.1406 2449.5 97.5 26.5 332.6
5 Female 72 No 1 Regular 20.985 1952.1 82.6 16.2 170.8
6 Female 40 No 3 No 27.5214 1366.9 56 9.6 154.6
7 Female 65 No 2 Occasional 22.0115 2213.9 52 28.7 255.1
8 Female 58 No 1 Regular 28.757 1595.6 63.4 10.9 214.1
9 Female 35 No 3 No 23.0766 1800.5 57.8 20.3 233.6
10 Female 55 No 3 No 34.9699 1263.6 39.6 15.5 171.9
11 Female 66 No 1 Regular 20.9465 1460.8 58 18.2 137.4
12 Female 40 No 2 Occasional 36.4316 1638.2 49.3 14.9 130.7
13 Male 57 No 3 No 31.7304 2072.9 106.7 9.6 420
14 Female 66 No 1 Regular 21.7885 987.5 35.6 10.3 254.9
15 Male 66 No 3 No 27.3192 1574.3 75 7.1 361.5
14
Measurement

• The concept of measurement scales was introduced by psychologist
Stanley Smith Stevens in 1946.
• He developed a framework to classify data based on how numbers or
labels are assigned to represent attributes or characteristics of objects,
people, states, or events.
• His classification identifies four levels of measurement: nominal, ordinal,
interval, and ratio, each reflecting different properties and relationships
among data points. This system allows researchers to determine the most
suitable analytical methods based on the scale of measurement.

15
Measurement Scales

• Measurement scales refer to the system used to categorize and
quantify variables based on the relationship between values. Stanley
Smith Stevens defined four types of measurement scales: Nominal,
Ordinal, Interval, and Ratio. Each scale has specific characteristics
and influences how data is analyzed. Let’s look at each scale in detail
with examples, including general scenarios and those specific to
computer science.

16
Nominal
• The nominal scale is the simplest type of measurement. It categorizes data
into distinct groups or categories without any quantitative value or order.

• Nominal scales consist of mutually exclusive (non-overlapping) categories where
the order of the categories is not important.

• Only distinguishes between categories
• No inherent order among the categories
• Suitable for labeling and counting frequencies
Examples: Sex, Blood Groups, Religion, Marital status, Political affiliation, Eye color
Programming Languages: Categories like Python, Java, C++, and JavaScript.
Operating System: Categories like Windows, macOS, Linux, and Android.
17
Ordinal
• The order of the values is important and significant, but the differences
between them are not really known.
Poor → Fair → Good → Very Good → Excellent

• But, Is the difference between “Very Good” and “Excellent” the same as
the difference between “Good” and “Very Good?” We can’t say.

Example:
Students’ Grades
Class Positions
Cricket teams standings in ICC ranking
18
Ordinal …continue

• The ordinal scale categorizes data with a meaningful order or rank
among categories but without precise intervals between them.
• Data can be ordered or ranked.
• Differences between ranks are not consistent or measurable.
• Example: Education Level: Categories include High School, Bachelor’s,
Master’s, PhD. (There is an order, but the difference in years of study
or expertise between each level is not uniform)
• Example: User Access Levels: Categories like Guest, User, Moderator,
Administrator. (These represent a hierarchy of permissions, though
not with equal intervals between levels).
19
Interval
• Interval scales are numeric scales in which we know not only the order but
also the exact differences between the values, i.e., the interval size is constant.
• There is no “true zero” point, i.e., zero does not mean absence of the quantity.

• With interval data, we can add and subtract, but ratios are not meaningful,
so we cannot multiply or divide.
Example:
Temperature (in °C or °F)
Shoe size
IQ scores

20
Ratio

• Ratio scales tell us about the order and the exact
difference between values, AND they also have a “true zero” point.

Example:
Height, Weight, Speed, Length, Age
Storage Capacity: Measured in bytes; a storage capacity of 0 GB means
no storage. A 500 GB hard drive has twice the capacity of a 250 GB
hard drive.
Memory Usage: Memory is measured in MB or GB, where 0 means no
memory usage. 8 GB of RAM is twice as much as 4 GB.
21
Just look at some of the Graphs …

22
[Figure: Line chart with respect to time]

23
[Figure: Bar chart]

24
[Figure: Multiple bar chart (cluster bar chart)]

25
Types of bar charts

26
A histogram is a type of chart that represents
the distribution of a numeric (continuous)
variable by grouping the data into bins
(intervals) and displaying the frequency or count
of data points within each bin. Unlike a bar
chart, which typically represents categorical
data, a histogram is used for continuous or
quantitative data.
27
Qualitative data
Example 1: Consider the data about Sex of 10 students

Sex M F M M F M F M M M

• Make a frequency distribution with relative frequency and % frequency for the
above data and interpret your results. Make an appropriate graph.
Example 2: Suppose we have also collected data on the Sections of these 10
students as
Sex M F M M F M F M M M
Section A A A B B B A B A B
• Construct the cross tabulation of the above data and interpret your results.
Also make an appropriate graph.
28
Solution
Example 1
Sex      f    Relative freq   % freq
Male     7    0.7             70
Female   3    0.3             30
Total    10   1.0             100

Example 2
Sex      Sec A   Sec B   Total
Male     3       4       7
Female   2       1       3
Total    5       5       10

[Figure: Bar chart of frequency by Sex (Male 7, Female 3)]
[Figure: Multiple bar chart of frequency by Sex for Sec A and Sec B]
29
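The two tables above can be reproduced with a few lines of code. The following is a minimal sketch in Python (the course itself names Excel, Minitab and R, so the language is an assumption made here for illustration); the variable names `sex`, `section`, `freq_table` and `cross` are hypothetical.

```python
from collections import Counter

# Data copied from Examples 1 and 2 (10 students).
sex = list("MFMMFMFMMM")
section = list("AAABBBABAB")

n = len(sex)
freq = Counter(sex)  # frequency of each category

# Frequency, relative frequency and percentage frequency per category.
freq_table = {k: (f, f / n, 100 * f / n) for k, f in freq.items()}

# Cross tabulation: count each (sex, section) pair.
cross = Counter(zip(sex, section))

for k, (f, rf, pct) in freq_table.items():
    print(f"{k}: f={f}, rf={rf:.1f}, %={pct:.0f}")
for s in ("M", "F"):
    print(s, [cross[(s, sec)] for sec in ("A", "B")])
```

Reading the output: 70% of the students are male, and the male students are split 3 to 4 between Sections A and B.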
Simple Bar Chart
• A bar chart is a type of chart which shows the values of different
categories of data as rectangular bars with different lengths.
Example: Draw a Simple Bar Chart to represent the Population of 5
cities of the province Punjab.

Cities        Population (000)
Lahore        10,355
Rawalpindi    4,765
Faisalabad    3,675
Sargodha      1,550
Multan        3,100

[Figure: Bar diagram showing Population of 5 cities of Punjab; x-axis: Cities, y-axis: Population in ‘000’]
30
Multiple Bar Chart

Cities        Population (000)   Male    Female
Lahore        10,355             5,385   4,970
Rawalpindi    4,765              2,478   2,287
Faisalabad    3,675              1,911   1,764
Sargodha      1,550              806     744

[Figure: Multiple Bar Chart showing Population of Males and Females; x-axis: Cities, y-axis: Population]
31
Component Bar Chart

Cities        Pop (000)   Male    Female
Lahore        10,355      5,385   4,970
Rawalpindi    4,765       2,478   2,287
Faisalabad    3,675       1,911   1,764
Sargodha      1,550       806     744

[Figure: Component Bar Chart showing population of both Males and Females and Total; x-axis: Cities, y-axis: Population]
32
Discrete data – Frequency Distribution

Example:
• Following data represents the number of infected plants from a
sample of twenty experimental plots. Your task is to present it in
tabular form.

1 2 4 3 0 1 2 3 1 1 0
2 1 0 2 3 0 0 1 3

33
Discrete Frequency Distribution

No. of infected items (X)   Tally      Frequency (f)   Relative frequency
0                           |||||      5               5/20 = 0.25
1                           ||||| |    6               0.30
2                           ||||       4               0.20
3                           ||||       4               0.20
4                           |          1               0.05
Total                                  20              1.00
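
The tally above can be checked mechanically. This is a small sketch in Python, assuming the 20 plot counts are typed in exactly as listed in the example; `plots` and `rows` are illustrative names.

```python
from collections import Counter

# Number of infected plants in each of the 20 experimental plots.
plots = [1, 2, 4, 3, 0, 1, 2, 3, 1, 1, 0,
         2, 1, 0, 2, 3, 0, 0, 1, 3]

n = len(plots)
freq = Counter(plots)

# One row per distinct value: (X, frequency, relative frequency).
rows = [(x, freq[x], freq[x] / n) for x in sorted(freq)]

for x, f, rf in rows:
    print(f"{x}  {f}  {rf:.2f}")
print("Total", sum(f for _, f, _ in rows))
```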

34
Graphical Representation of Discrete Data
[Figure: Bar chart representing the infected items; x-axis: No. of infected items (0, 1, 2, 3, 4), y-axis: Frequency (5, 6, 4, 4, 1)]
35
Pie Chart
• A pie chart is a type of graph in which a circle is divided into sectors
that each represent a proportion of the whole.
Example: The blood groups of 70 students were tested and the following
results were obtained.

Blood Groups   No. of Students (f)
A              8
B              30
O              20
AB             12

[Figure: Pie chart of Blood Groups of Students (A 11%, B 43%, O 29%, AB 17%)]
36
Pie Chart
Blood Groups   No. of Students (f)   Relative frequency (rf)   Percent frequency   Angle (rf x 360)
A              8                     8/70 = 0.11               0.11*100 = 11       39.6
B              30                    0.43                      43                  154.8
O              20                    0.29                      29                  104.4
AB             12                    0.17                      17                  61.2
Total          70                    1.00                      100                 360

Divide the total angle of the circle (360°) into four segments as calculated.
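
The angle column can be sketched in code as follows. The example is in Python and follows the table's convention of rounding the relative frequency to two decimals before multiplying by 360 (an assumption inferred from the printed values); the names `blood` and `angles` are illustrative.

```python
# Blood group counts for the 70 students.
blood = {"A": 8, "B": 30, "O": 20, "AB": 12}
total = sum(blood.values())

angles = {}
for group, f in blood.items():
    rf = round(f / total, 2)             # relative frequency, 2 decimals as in the table
    angles[group] = round(rf * 360, 1)   # sector angle in degrees

for group, angle in angles.items():
    print(group, angle)
```

Using the unrounded relative frequency (`f / total * 360`) gives slightly different angles, for example about 41.1° for group A, because 8/70 is closer to 0.114 than to 0.11.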

37
Simple Bar Chart

• Consider the same example of the blood groups of 70 students.

Blood Groups   No. of Students (f)
A              8
B              30
O              20
AB             12

[Figure: Bar chart of Blood Groups (A 8, B 30, O 20, AB 12)]
38
Simple Bar Chart

Example: Draw a Simple Bar Chart to represent the turnover of a
company for 6 years.

Years   Turnover (Rs)
2002    25,000
2003    29,000
2004    44,000
2005    49,000
2006    60,000
2007    64,000

[Figure: Bar diagram showing the Turnover of a company for 6 years; x-axis: Years, y-axis: Turnover in Rs.]
39
Obtaining Data
Published source
book, journal, newspaper, Published reports
Designed experiment
researcher exerts strict control over units
Survey
a group of people are surveyed and their responses are recorded
Administrative Records

40
We will be dealing with various techniques for summarizing and describing
qualitative data.

Qualitative data
• Univariate: Frequency Table → Percentages → Pie Chart, Bar Chart
• Bivariate: Frequency Table → Component Bar Chart, Multiple Bar Chart

We will begin with the univariate situation, and will proceed to the
bivariate situation.
41
Frequency Distribution &
Histogram

42
Following data represents the plant height (cm) of a sample of 30 plants.
87 91 89 88 89 91 87 92 90 98
95 97 96 100 101 96 98 99 98 100
102 99 101 105 103 107 105 106 107 112

Frequency distribution
Classes    Frequency (f)   c.f.   r.f.    % freq
86–90      6               6      0.200   20.0
91–95      4               10     0.133   13.3
96–100     10              20     0.333   33.3
101–105    6               26     0.200   20.0
106–110    3               29     0.100   10.0
111–115    1               30     0.033   3.3
Total      30                     1.000   100.0

[Figure: Histogram; x-axis: Class Boundaries (85.5–90.5, 90.5–95.5, 95.5–100.5, 100.5–105.5, 105.5–110.5, 110.5–115.5), y-axis: Frequency (6, 4, 10, 6, 3, 1)]
43
Frequency Distribution

• A tabular arrangement of data in which various items are
arranged into classes or groups and the number of items
falling in each class is stated.
• The number of observations falling in a particular class is
referred to as the class frequency "f".
• Data presented in the form of a frequency distribution is also
called grouped data.

44
Some definitions
Class Limits
• The class limits are the values of the variable that separate one class from the
next. Sometimes classes are written as 20–25, 25–30, etc. In such a case, the class
limits mean "20 but less than 25", "25 but less than 30", etc.
Class marks or midpoints
• The class mark or midpoint is the value which divides a class into two equal parts.
It is obtained by dividing the sum of the lower and upper class limits (or class
boundaries) of a class by 2.
Class interval
• The difference between either two successive lower class limits or two successive
upper class limits, OR
• The difference between two successive midpoints.
• Denoted by "h".
45
Example

• The following data represents the height of 30 wheat plants taken from the
experimental area. Construct a frequency distribution and appropriate
graphs to explain the distribution of data:

87 91 89 88 89 91 87 92 90 98 95
97 96 100 101 96 98 99 98 100 102 99
101 105 103 107 105 106 107 112

46
Construction of a frequency distribution

• Decide the number of classes:
\( K = 1 + 3.3 \log_{10}(n) = 5.87 \) or \( K = \sqrt{n} = 5.47 \) → 6 classes
• Determine the range of variation of the data, i.e.,
\( R = \text{Max} - \text{Min} = 112 - 87 = 25 \)
• Determine the approximate size of the class interval:
\( h = \frac{R}{K} = \frac{25}{6} = 4.17 \) → class interval of 5
• Decide where to locate the class limits → 86–90, 91–95, …
• Distribute the data into appropriate classes
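
The steps above can be sketched as a short program. This is an illustrative Python version, assuming the 30 plant heights from the example and the starting limit 86 chosen on the slide; the names `classes`, `freq` and `cum_freq` are hypothetical.

```python
import math
from itertools import accumulate

# Heights of the 30 wheat plants from the example.
heights = [87, 91, 89, 88, 89, 91, 87, 92, 90, 98, 95,
           97, 96, 100, 101, 96, 98, 99, 98, 100, 102, 99,
           101, 105, 103, 107, 105, 106, 107, 112]

n = len(heights)
k = math.ceil(1 + 3.3 * math.log10(n))      # 5.87 rounded up to 6 classes
data_range = max(heights) - min(heights)    # 112 - 87 = 25
h = math.ceil(data_range / k)               # 4.17 rounded up to 5

lower = 86  # first lower class limit, as chosen on the slide
classes = [(lower + i * h, lower + (i + 1) * h - 1) for i in range(k)]

# Count the observations falling inside each pair of class limits.
freq = [sum(lo <= x <= hi for x in heights) for lo, hi in classes]
cum_freq = list(accumulate(freq))

for (lo, hi), f, cf in zip(classes, freq, cum_freq):
    print(f"{lo}-{hi}: f={f}, c.f.={cf}")
```

Rounding both K and h upward is a design choice that guarantees the classes cover the whole range of the data.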
47
Frequency Distribution
Classes    Class Boundaries   Freq (f)   c.f.   r.f.    % freq   Cumulative % freq
86–90      85.5–90.5          6          6      0.200   20.0     20.0
91–95      90.5–95.5          4          10     0.133   13.3     33.3
96–100     95.5–100.5         10         20     0.333   33.3     66.7
101–105    100.5–105.5        6          26     0.200   20.0     86.7
106–110    105.5–110.5        3          29     0.100   10.0     96.7
111–115    110.5–115.5        1          30     0.033   3.3      100.0
Total                         30                1.000   100.0

48
Class Boundaries

• Class Boundaries
• Subtract any upper class limit from the subsequent lower class limit and
divide the difference by 2; this gives the continuity correction factor.
• Subtract this factor from all lower class limits and add it to all upper class
limits.

For example, (91 − 90)/2 = 0.5 or (96 − 95)/2 = 0.5, so the class 86–90 has
boundaries 85.5–90.5.

49
Histogram
[Figure: Histogram of Height of 30 Plants; x-axis: Class Boundaries (85.5–90.5, 90.5–95.5, 95.5–100.5, 100.5–105.5, 105.5–110.5, 110.5–115.5), y-axis: Frequency (6, 4, 10, 6, 3, 1)]
50
Frequency Polygon

• Frequency polygons are a graphical device for understanding the shapes
of distributions. They serve the same purpose as histograms, but are
especially helpful for comparing sets of data.
• Mid Points vs Frequency

[Figure: Frequency Polygon; x-axis: Mid Points (88, 93, 98, 103, 108, 113), y-axis: Frequency]
51
Cumulative Frequency Polygon / Ogive

• A cumulative frequency polygon is a plot of the cumulative
frequency against the upper class boundary, with the points joined by line
segments.
• Upper Class Boundaries vs Cumulative Frequency

[Figure: Cumulative Frequency Polygon / Ogive; x-axis: Upper Class Boundaries (90.5, 95.5, 100.5, 105.5, 110.5, 115.5), y-axis: Cumulative Frequency]
52
Stem & Leaf Display

• A relatively small data set can be represented by a stem and leaf
display.
• In addition to information on the number of observations falling in
the various classes, it displays details of what those observations
actually are.
• Each number in the data set is divided into two parts, a Stem and a
Leaf. A stem is the leading digit(s) of each number and is used in
sorting, while a leaf is the rest of the number or the trailing digit(s)
and is shown in the display.

53
Example
Use the data below to make a stem-and-leaf plot by taking 10 as a unit.
85 115 126 92 104
85 116 100 121 123
79 90 110 129 108
107 78 131 114 92
131 88 97 99 116
93 84 75 70 132

Stem   Leaf
7      0589
8      4558
9      022379
10     0478
11     04566
12     1369
13     112

For example, the row "7 | 0589" holds the values 70, 75, 78 and 79.
54
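A stem-and-leaf display with a stem unit of 10 amounts to splitting each value into its tens part and its units digit. The sketch below is an illustrative Python version using the 30 values from this example; `stems` is a hypothetical name.

```python
from collections import defaultdict

data = [85, 115, 126, 92, 104, 85, 116, 100, 121, 123,
        79, 90, 110, 129, 108, 107, 78, 131, 114, 92,
        131, 88, 97, 99, 116, 93, 84, 75, 70, 132]

# Stem = value // 10 (leading digits), leaf = value % 10 (trailing digit).
stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:>2} | {leaves}")
```

Sorting the data first gives the arranged display directly; appending the leaves in the original order would give the unarranged one.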
Example

Represent the following data by a Stem and Leaf display by
(i) taking 10 units as the width of the class
(ii) taking 5 units as the width of the class
32 45 38 41 49 36 52 56 51 62
63 59 68

(i) Width of class = 10
Stem   Leaf
3      2 8 6
4      5 1 9
5      2 6 1 9
6      2 3 8

(ii) Width of class = 5
Stem   Leaf
3*     2
3.     8 6
4*     1
4.     5 9
5*     2 1
5.     6 9
6*     2 3
6.     8

* indicates leaves 0–4 and . indicates leaves 5–9; * and . are called placeholders.
55
Example
Minimum value = 87 Maximum value = 112
Stem unit = 10 Width of class = 10

Stem   Leaf
8      98797
9      1120857668989
10     01021537567
11     2

Arranged stem and leaf table
Stem   Leaf
8      77899
9      0112566788899
10     00112355677
11     2
56
Example of Stem & Leaf display – Class Width = 5
Example
Width of class = 5

Stem   Leaf
8*     -
8.     79897
9*     1120
9.     857668989
10*    010213
10.    57567
11*    2
11.    -
* indicates 0–4 and . indicates 5–9 [(*, .) are called placeholders]
57
Back to Back Stem and Leaf display

• Two data sets can be compared using Back-to-Back stem and leaf
display. In this case a single stem is constructed and the values of one
data set are assigned on the left and the value of second data set are
assigned on the right of the stem.

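Example: Compare the following two data sets using a back-to-back stem and leaf display.

Data 1) 32, 45, 38, 41, 49, 36, 52, 56, 51, 62, 63, 59, 68
Data 2) 23, 58, 26, 57, 55, 65, 29, 36, 59, 69, 60

58
Stem unit = 10 Width of class = 10
Reading the display: leaves of Data 1 are written to the left of the stem
(leaf 2 beside stem 3 is 32, leaf 8 beside stem 6 is 68) and leaves of Data 2
to the right (stem 2 with leaf 3 is 23, stem 6 with leaf 9 is 69).

Data 1 (#13)          Data 2 (#11)
Leaf         Stem     Leaf
             2        369
682          3        6
915          4
9162         5        8759
832          6        590
59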
Measures of
Central Tendency

60
Measures of Central Tendency

• A Measure of Location summarizes a data set by giving a “single
quantitative value” within the range of the data values that describes
its location relative to the entire data set.
• Some Common Measures are:
• Arithmetic Mean / Average
• Median
• Mode
• Geometric Mean
• Harmonic Mean

61
Arithmetic mean / mean
• Most common measure of the center
• Obtained by dividing the SUM of all the observations by the total
number of observations

Population Mean:
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N} \]

Sample Mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
62
Properties of Arithmetic mean
1. The mean of a constant is equal to that constant.
2. The sum of the deviations of the observations from their mean is
equal to zero, i.e.,
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0 \]

3. The sum of squared deviations of the observations from their mean is
a minimum:
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 < \sum_{i=1}^{n} (X_i - a)^2 \]

where a is any constant other than \( \bar{X} \).

63
Properties of Arithmetic mean
X (𝑿 − 𝟔𝟖. 𝟓) (𝑿 − 𝟔𝟖. 𝟓)𝟐 (𝑿 − 𝟕𝟎) (𝑿 − 𝟕𝟎)𝟐
65 -3.5 12.25 -5 25
71 2.5 6.25 1 1
67 -1.5 2.25 -3 9
75 6.5 42.25 5 25
63 -5.5 30.25 -7 49
69 0.5 0.25 -1 1
75 6.5 42.25 5 25
63 -5.5 30.25 -7 49
548 0 166 -12 184

\[ \bar{X} = \frac{\sum X}{n} = \frac{548}{8} = 68.5 \]
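
Properties 2 and 3 can be checked numerically with the eight values in the table. The following Python sketch is illustrative only; `sum_sq` is a hypothetical helper, and 70 is the comparison constant used in the table.

```python
x = [65, 71, 67, 75, 63, 69, 75, 63]
mean = sum(x) / len(x)  # 548 / 8 = 68.5


def sum_sq(a):
    """Sum of squared deviations of the observations from the value a."""
    return sum((xi - a) ** 2 for xi in x)


sum_dev = sum(xi - mean for xi in x)   # property 2: deviations from the mean sum to zero
ss_mean = sum_sq(mean)                 # property 3: smallest possible sum of squares
ss_other = sum_sq(70)                  # any other constant gives a larger value

print(mean, sum_dev, ss_mean, ss_other)
```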
64
Properties of Arithmetic mean

4. If X1, X2, …, Xn have mean \( \bar{X} \), then the mean after multiplying each
observation by a constant ‘a’ is the mean multiplied by that constant.

\[ \bar{X}^{*} = \frac{\sum_{i=1}^{n} a X_i}{n} = a \times \bar{X} \]

5. If a constant ‘a’ is added to each of the observations X1, X2, …, Xn having
mean \( \bar{X} \), then the mean increases by that constant.

\[ \bar{X}^{*} = \frac{\sum_{i=1}^{n} (a + X_i)}{n} = \bar{X} + a \]

65
Weighted arithmetic mean

• In a data set, if some observations have more importance
than other observations, then taking a simple average is
misleading.
• In this case, different weights are assigned to different
observations according to their relative importance, and the average
is then calculated by taking the weights into account.
• This average is called the weighted arithmetic mean, or simply the weighted
mean, denoted by \( \bar{X}_w \).

66
Weighted arithmetic mean

Computations:
For ‘n’ observations \( x_1, x_2, \ldots, x_n \) of a data set with corresponding
weights \( w_1, w_2, \ldots, w_n \), the weighted arithmetic mean is defined as:

\[ \bar{X}_w = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i} = \frac{\sum WX}{\sum W} \]

67
Weighted A.M - Calculations
Subjects      Marks (Xi)   Weights (Wi)   WiXi
Statistics    80           20             1600
Mathematics   75           10             750
Chemistry     50           40             2000
English       60           30             1800
Total         265          100            6150

Simple mean: \( \bar{X} = \frac{\sum X}{n} = \frac{265}{4} = 66.25 \)
Weighted mean: \( \bar{X}_w = \frac{\sum WX}{\sum W} = \frac{6150}{100} = 61.5 \)
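
The contrast between the simple and the weighted mean can be reproduced in a few lines. This is a minimal Python sketch using the marks and weights from the table; the dictionary layout is an illustrative choice.

```python
# subject: (marks, weight), copied from the table above
subjects = {
    "Statistics": (80, 20),
    "Mathematics": (75, 10),
    "Chemistry": (50, 40),
    "English": (60, 30),
}

marks = [m for m, _ in subjects.values()]
weights = [w for _, w in subjects.values()]

simple_mean = sum(marks) / len(marks)
weighted_mean = sum(w * m for m, w in subjects.values()) / sum(weights)

print(simple_mean, weighted_mean)
```

The weighted mean is lower than the simple mean here because the heaviest weight (40) falls on the lowest mark (50).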
68
Combined Mean

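• For ‘k’ subgroups of data consisting of \( n_1, n_2, \ldots, n_k \) observations
(with \( \sum_{i=1}^{k} n_i = n \)) and having respective means \( \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k \),
the combined mean (the mean of all n observations, obtained from the ‘k’ subgroup means) is given by:

\[ \bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{n} \]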

71
Combined Mean
Example: The mean heights and the number of students in three sections of a
statistics class are given below. Calculate overall (or combined) mean height of the
students?
Sections Number of Mean height
students (inches)
A 40 62
B 37 58
C 43 61
Solution:
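Note that we have \( n_1 = 40 \), \( n_2 = 37 \), \( n_3 = 43 \) and \( \bar{x}_1 = 62 \), \( \bar{x}_2 = 58 \), \( \bar{x}_3 = 61 \). So the
combined mean is:
\[ \bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = \frac{40(62) + 37(58) + 43(61)}{120} = \frac{7249}{120} = 60.4 \]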
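
The same calculation as a short Python sketch, with the section sizes and mean heights from the example; `sections` is a hypothetical name.

```python
# (number of students, mean height in inches) for sections A, B, C
sections = [(40, 62), (37, 58), (43, 61)]

total_n = sum(n for n, _ in sections)
combined_mean = sum(n * mean for n, mean in sections) / total_n

print(total_n, round(combined_mean, 1))
```

Note that simply averaging the three section means, (62 + 58 + 61) / 3, would ignore the different section sizes.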
72
Tasks
1. The mean weight of 10 students is 50 kg. When two students leave the class,
the mean weight becomes 48 kg. Find the mean weight of the students who
left the class. Answer = 58
2. There are 30 students in a class. On Thursday, 18 students took a
math test and their mean mark was 80. The remaining 12 students took
the math test on Friday and their mean mark was 90. Find the mean mark
of the entire class. Answer = 84
3. Ali took five math tests during the semester and the mean of his test
scores was 85. If his mean after the first three was 83, what was the mean
of his 4th and 5th tests? Answer = 88

73
Geometric mean & harmonic mean
• The Geometric Mean (G.M) of a set of n positive values \( x_1, x_2, \ldots, x_n \) is the positive nth root of the
product of the values.
\[ G.M = \left( \prod_{i=1}^{n} X_i \right)^{1/n} \]

\[ G.M = \text{Antilog}\left( \frac{\sum_{i=1}^{n} \log X_i}{n} \right) \]
• The Harmonic Mean (H.M) of a set of n values \( x_1, x_2, \ldots, x_n \) is defined as the reciprocal of
the arithmetic mean of the reciprocals of the values.
\[ H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}} \]
74
Example
Find Geometric Mean and Harmonic Mean from the following data?

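X       Log (X)   1/X
3       0.477     0.333
5       0.699     0.200
6       0.778     0.167
6       0.778     0.167
7       0.845     0.143
10      1.000     0.100
12      1.079     0.083
Total   49        5.6567    1.1929

\[ G.M = \text{Antilog}\left( \frac{\sum_{i=1}^{n} \log X_i}{n} \right) = \text{Antilog}\left( \frac{5.6567}{7} \right) = 6.43 \]

\[ H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}} = \frac{7}{1.1929} = 5.87 \]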
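
Both means can be computed directly from their definitions. The following Python sketch uses base-10 logarithms to mirror the Log and Antilog columns; the variable names are illustrative.

```python
import math

x = [3, 5, 6, 6, 7, 10, 12]
n = len(x)

# Geometric mean via logs: antilog of the mean of log10(x).
sum_logs = sum(math.log10(v) for v in x)
gm = 10 ** (sum_logs / n)

# Harmonic mean: n divided by the sum of reciprocals.
sum_recip = sum(1 / v for v in x)
hm = n / sum_recip

print(round(sum_logs, 4), round(sum_recip, 4))
print(round(gm, 2), round(hm, 2))
```

For positive data that are not all equal, the harmonic mean is smaller than the geometric mean, which in turn is smaller than the arithmetic mean (here 49/7 = 7).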
75
Mode

• The most frequent value, also called the nominal average
• can be used for qualitative as well as quantitative data
• may not be unique
• may not exist
• computation of the mode for ungrouped or raw data

[Figure: Two dot plots, on axes 0–14 and 0–6, labelled "Mode = 9" and "No Mode"]
76
Median

• Numerical measures that give the position of a data value
relative to the entire data set.
• Divides the observations into two equal parts after arranging the
values in ascending order of magnitude
• If n is odd, the median is the middle number.
• If n is even, the median is the average of the 2 middle numbers.

\[ \text{Median} = \text{Size of } \left( \frac{n+1}{2} \right)\text{th observation} \]

77
Quartiles
▪ Divide an array into four equal parts, each part having
25% of the distribution of the data values, denoted by \( Q_j \)
▪ 25% of the observations are below the 1st quartile.
▪ The 1st quartile is the 25th percentile; the 2nd quartile is the
50th percentile, also the median; and the 3rd quartile is
the 75th percentile.
\[ Q_j = \text{Size of } j\left( \frac{n+1}{4} \right)\text{th observation} \]

where j = 1, 2, 3
78
Deciles
▪ Divide an array into ten equal parts, each part having ten
percent of the distribution of the data values, denoted by \( D_j \)
▪ 10 percent of the total observations fall below \( D_1 \) and the
remaining 90% are above it.
▪ The 5th decile is equal to \( Q_2 \) and the median

\[ D_j = \text{Size of } j\left( \frac{n+1}{10} \right)\text{th observation} \]

where j = 1, 2, 3, …, 9
79
Percentiles
▪ Divide an array (raw data arranged in increasing or
decreasing order of magnitude) into 100 equal parts.
▪ The jth percentile, denoted as \( P_j \), is the data value in the data
set that separates the bottom j% of the data from the top
(100 − j)%.

\[ P_j = \text{Size of } j\left( \frac{n+1}{100} \right)\text{th observation} \]

where j = 1, 2, 3, …, 99
80
Example
▪ Suppose Ali was told that, relative to the other scores on an NTS
test, his score was at the 95th percentile, i.e., his percentile score
is 95. How do we interpret it?

➔ This means that 95% of those who took the test had scores
less than or equal to Ali’s score, while 5% had scores higher than
Ali’s.

81
Exercise

• Find the Median, Q1, Q2 and Q3 of the following data of marks obtained by
20 students. Also show that Median = Q2, and interpret the
results.
53 74 82 42 39 28 20 18 68 58 54 93 70
30 61 55 36 37 29 94

• First of all arrange the data in ascending order of magnitude

Sr. No. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
18 20 28 29 30 36 37 39 42 53 54 55 58 61 68 70 74 82 93 94

82
Median & Quartiles
• \( \text{Median} = \text{Size of } \left( \frac{n+1}{2} \right)\text{th observation} \)
= Size of 10.5th Observation
= 10th Observation + 0.5 (11th Observation – 10th Observation)
= 53 + 0.5 (54 – 53)
= 53.5

• \( Q_3 = \text{Size of } 3\left( \frac{n+1}{4} \right)\text{th observation} \)
= Size of 15.75th observation
= 15th Observation + 0.75 (16th Observation – 15th Observation)
= 68 + 0.75 (70 – 68)
= 69.5
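
The positional rule used above (find the fractional position, then interpolate between the two neighbouring ordered values) can be sketched in Python as follows. The helper `value_at` is a hypothetical name introduced for illustration; the cross-check with `statistics.quantiles` relies on its default "exclusive" method, which uses the same (n + 1) positions.

```python
import statistics

marks = [53, 74, 82, 42, 39, 28, 20, 18, 68, 58,
         54, 93, 70, 30, 61, 55, 36, 37, 29, 94]


def value_at(sorted_x, pos):
    """Value at a fractional 1-based position, by linear interpolation."""
    lo = int(pos)        # whole part of the position
    frac = pos - lo      # fractional part
    if lo < 1:
        return sorted_x[0]
    if lo >= len(sorted_x):
        return sorted_x[-1]
    return sorted_x[lo - 1] + frac * (sorted_x[lo] - sorted_x[lo - 1])


s = sorted(marks)
n = len(s)
q1 = value_at(s, 1 * (n + 1) / 4)    # 5.25th observation
median = value_at(s, (n + 1) / 2)    # 10.5th observation
q3 = value_at(s, 3 * (n + 1) / 4)    # 15.75th observation

print(q1, median, q3)
print(statistics.quantiles(marks, n=4))  # cross-check with the standard library
```

Since Q2 sits at position 2(n + 1)/4 = (n + 1)/2, it is the same observation as the median.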
83
Example

Consider the following data of marks of 20 students:-


53 74 82 42 39 28 20 81 68 58
54 93 70 30 61 55 36 37 29 94
Construct Boxplot of the data and interpret it.

Minimum = 20
Q1 = 36.25
Median = 54.5
Q3 = 73
Maximum = 94

84
Measures of Variation

85
Measures of Variation/ Dispersion
• In Statistics, Dispersion (also called variability, scatter, or spread) denotes
how stretched or squeezed a distribution is.
• Variability is the extent to which data points in a statistical distribution or
data set diverge from the average, or mean, value as well as the extent to
which these data points differ from each other.
• Following are the commonly used measures of variability
• Variance
• Standard Deviation
• Range
• Inter Quartile Range
• Semi Inter Quartile Range
• Mean Deviation

86
Variance

• Variance is a measure of the spread between observations in a
dataset.
• The variance measures the average squared distance of the observations from their
mean.

Population Variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

Sample Variance:
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \]
87
Standard Deviation

• It is the positive square root of the Variance

Population Standard Deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]

Sample Standard Deviation:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \]

88
Example 1
• Consider the following data of height (cm) of 5 plants.
2, 4, 6, 8, 10
• Find the average, variance and the standard deviation of the heights.

X       (X − X̄)   (X − X̄)^2
2       -4         16
4       -2         4
6       0          0
8       2          4
10      4          16
Total   30         0          40

\[ \bar{X} = \frac{\sum X}{n} = \frac{30}{5} = 6 \]
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{40}{5-1} = 10 \]
\[ S = \sqrt{10} = 3.16 \]

89
Example 2
• Consider the following data of yield of wheat (in kgs) from 8 experimental
plots.
65, 71, 67, 75, 63, 69, 75, 63
• Find the average, variance and the standard deviation of the yield.

X       (X − 68.5)   (X − 68.5)^2
65      -3.5         12.25
71      2.5          6.25
67      -1.5         2.25
75      6.5          42.25
63      -5.5         30.25
69      0.5          0.25
75      6.5          42.25
63      -5.5         30.25
Total   548          0            166

\[ \bar{X} = \frac{\sum X}{n} = \frac{548}{8} = 68.5 \]
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{166}{8-1} = 23.71 \]
\[ S = \sqrt{23.71} = 4.87 \]
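
The hand calculation for the wheat yields can be verified with a short Python sketch. It computes the sample variance from the definition (divisor n − 1) and, as a cross-check, with `statistics.variance`, which uses the same divisor.

```python
import statistics

yields = [65, 71, 67, 75, 63, 69, 75, 63]
n = len(yields)

mean = sum(yields) / n
sum_sq_dev = sum((x - mean) ** 2 for x in yields)  # 166

s2 = sum_sq_dev / (n - 1)   # sample variance
s = s2 ** 0.5               # sample standard deviation

print(mean, sum_sq_dev, round(s2, 2), round(s, 2))
print(statistics.variance(yields), statistics.stdev(yields))
```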
90
The Range & Coefficient of Range

• The Range R is defined as the difference between the largest and the
smallest observations in a dataset, i.e.,

\[ R = X_{max} - X_{min} \]

• The Coefficient of Dispersion or Coefficient of Range is defined as

\[ \text{Coeff. of Range} = \frac{X_{max} - X_{min}}{X_{max} + X_{min}} \]
91
Example

The marks obtained by 9 students are given below:-


45, 32, 37, 46, 39, 36, 41, 48, 36
Find the range and the Coefficient of Range.
Maximum Obs is 48 and Minimum 32, therefore
Range = 16 marks
Co-efficient of Range = 0.2
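
As a quick check of the two results above, here is a minimal Python sketch using the nine marks from the example.

```python
marks = [45, 32, 37, 46, 39, 36, 41, 48, 36]

x_max, x_min = max(marks), min(marks)
data_range = x_max - x_min                        # 48 - 32
coeff_range = (x_max - x_min) / (x_max + x_min)   # 16 / 80

print(data_range, coeff_range)
```

The coefficient is unit-free, so it can be compared across data sets measured in different units.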

92
Semi Inter Quartile Range / Quartile Deviation
• The inter quartile range (IQR) is a measure of dispersion, defined as
\[ IQR = Q_3 - Q_1 \]
• The Semi Inter Quartile Range or Quartile Deviation (QD) is defined as
\[ QD = \frac{Q_3 - Q_1}{2} \]
• The Co-efficient of Quartile Deviation is defined as
\[ \text{Coeff. of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]

93
The Mean Deviation OR Average Deviation

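• The mean deviation (M.D.) of a set of data is defined as the arithmetic mean of the absolute deviations measured either from the mean or from the median,

$$M.D. = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} \quad \text{OR} \quad M.D. = \frac{\sum_{i=1}^{n} |X_i - \text{median}|}{n}$$

• Co-efficient of Mean Deviation is given as

$$\text{Coeff. of } M.D. = \frac{M.D.}{\text{Mean}} \quad \text{OR} \quad \frac{M.D.}{\text{Median}}$$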
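A short sketch may help here, because the deviations must be taken in absolute value (otherwise they sum to zero about the mean). The Python code below is illustrative only: it reuses the plant heights from Example 1, and the helper name `mean_deviation` is an assumption.

```python
import statistics

heights = [2, 4, 6, 8, 10]  # plant heights (cm) reused from Example 1


def mean_deviation(data, center):
    """Average of the absolute deviations of data from the given center."""
    return sum(abs(x - center) for x in data) / len(data)


mean = statistics.mean(heights)
median = statistics.median(heights)

md_mean = mean_deviation(heights, mean)        # M.D. about the mean
md_median = mean_deviation(heights, median)    # M.D. about the median
coeff_md = md_mean / mean                      # coefficient of M.D. (about the mean)

print("M.D. about mean:", round(md_mean, 2))
print("M.D. about median:", round(md_median, 2))
print("Coeff. of M.D.:", round(coeff_md, 2))
```

For this symmetric data set the mean and the median coincide, so both versions of the mean deviation agree.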

Co-efficient of Variation (CV)

• The coefficient of variation is a measure of spread that describes the amount of variability relative to the mean. Because the coefficient of variation is unitless, you can use it instead of the standard deviation to compare the spread of data sets that have different units or different means.

$$\text{Coeff. of Variation } (CV) = \frac{S}{\bar{X}} \times 100$$

Example
Following data represents the prices in Rs. of a certain commodity
8, 13, 18, 23, 30
Sol:
$\bar{X} = 18.4$ Rs.
$S_x = 8.56$ Rs.
C.V = 46.5

Following data represents the life of car battery in hours
130, 150, 180, 250, 345
Sol:
$\bar{Y} = 211$ Hrs.
$S_y = 87.63$ Hrs.
C.V = 41.5
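The two coefficients of variation can be verified with a few lines of code. This is a sketch in Python (illustrative only; the function name `coefficient_of_variation` is an assumption), using the sample standard deviation with divisor n − 1.

```python
import statistics

prices = [8, 13, 18, 23, 30]               # prices in Rs.
battery_life = [130, 150, 180, 250, 345]   # battery life in hours


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


cv_prices = coefficient_of_variation(prices)
cv_battery = coefficient_of_variation(battery_life)

print("CV of prices:", round(cv_prices, 1))
print("CV of battery life:", round(cv_battery, 1))
```

Although the battery data have a much larger standard deviation in absolute terms, their CV is smaller, so relative to the mean the prices are the more variable of the two.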
Types of Distribution

Measures of Skewness

• Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. The Coefficient of Skewness is given as

$$Sk = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n S^3} \quad \text{or} \quad Sk = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1}$$

• If Sk = 0 the distribution is Symmetrical
• If Sk > 0 the distribution is +vely skewed
• If Sk < 0 the distribution is -vely skewed
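The quartile-based coefficient, (Q3 − 2Q2 + Q1) / (Q3 − Q1), needs only the three quartiles. As a rough sketch in Python (illustrative only), the code below applies it to the marks of the nine students from the Range example. The data choice, the variable names and the quartile rule (`method="exclusive"`, position (n + 1)p) are all assumptions made for the illustration.

```python
import statistics

marks = [45, 32, 37, 46, 39, 36, 41, 48, 36]  # reused from the Range example

q1, q2, q3 = statistics.quantiles(marks, n=4, method="exclusive")

# Compares the distance Q3 - Q2 with the distance Q2 - Q1
quartile_skewness = (q3 - 2 * q2 + q1) / (q3 - q1)

if quartile_skewness > 0:
    shape = "positively skewed"
elif quartile_skewness < 0:
    shape = "negatively skewed"
else:
    shape = "symmetrical"

print("Q1, Q2, Q3:", q1, q2, q3)
print("Sk:", round(quartile_skewness, 3), "->", shape)
```

Here the upper half of the box (Q3 − Q2) is wider than the lower half (Q2 − Q1), so the coefficient is positive.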

Measures of Kurtosis
• Describes the extent of peakedness or flatness of the distribution of the data.
• Measured by Coefficient of Kurtosis (K) computed as,

$$K = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n S^4} - 3$$
Interpretation

• K = 0: mesokurtic
• K > 0: leptokurtic
• K < 0: platykurtic
Example
Consider the following data:-
25, 27, 36, 31, 33, 35, 37
Find Mean, Variance, Coefficient of Skewness and Coefficient of Kurtosis and interpret the results.

Mean                  32
Standard Error        1.73
Median                33
Standard Deviation    4.58
Sample Variance       21
Kurtosis              -1.65
Skewness              -0.39
Range                 12
Minimum               25
Maximum               37
Sum                   224
Count                 7

How to do it…
  X    X − X̄   (X − X̄)²   (X − X̄)³   (X − X̄)⁴
 25     -7        49        -343        2401
 27     -5        25        -125         625
 36      4        16          64         256
 31     -1         1          -1           1
 33      1         1           1           1
 35      3         9          27          81
 37      5        25         125         625
224      0       126        -252        3990    (totals)
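The column totals can be fed straight into the skewness and kurtosis formulas given earlier. The sketch below does this in Python (an illustrative choice; the variable names are assumptions). With these formulas the results come out near -0.37 and -1.71, which do not exactly reproduce the Skewness and Kurtosis figures in the summary output. Statistical packages often apply sample-size adjustments, and the slides do not state which convention produced those figures. Either way the signs agree: the data are negatively skewed and platykurtic.

```python
import math

data = [25, 27, 36, 31, 33, 35, 37]

n = len(data)
mean = sum(data) / n

# Column totals of the working table
sum_sq = sum((x - mean) ** 2 for x in data)      # 126
sum_cube = sum((x - mean) ** 3 for x in data)    # -252
sum_fourth = sum((x - mean) ** 4 for x in data)  # 3990

s = math.sqrt(sum_sq / (n - 1))  # sample standard deviation

skewness = sum_cube / (n * s ** 3)        # Sk = sum of cubes / (n * S^3)
kurtosis = sum_fourth / (n * s ** 4) - 3  # K = sum of 4th powers / (n * S^4) - 3

print("S:", round(s, 2))
print("Sk:", round(skewness, 2))
print("K:", round(kurtosis, 2))
```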
Five Number Summary

• The five number summary of a set of data consists of the minimum, first quartile, median, third quartile, and maximum.
Minimum, Q1, Median, Q3, Maximum

Boxplot / Box & Whisker plot

1. Line within the box (median) indicates the average size of the data
2. Length of the graph / box indicates variation in the data
3. Position of the line within the box indicates the shape of the data
✓ Line at the center of the box indicates data is symmetrical
✓ Line above the center of the box indicates data is -vely skewed
✓ Line below the center of the box indicates data is +vely skewed
Example
Consider the following data of marks of 20 students:-

53 74 82 42 39 28 20 81 68 58
54 93 70 30 61 55 36 37 29 94
Construct Boxplot of the data and interpret it.

Minimum = 20
Q1 = 36.25
Median = 54.5
Q3 = 73
Maximum = 94
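The five values above can be reproduced as follows. This is a sketch in Python (illustrative only; the variable names are assumptions). The quartile rule `method="exclusive"` (position (n + 1)p) is the one that matches the Q1 and Q3 shown here; other quartile rules would give slightly different values.

```python
import statistics

marks = [53, 74, 82, 42, 39, 28, 20, 81, 68, 58,
         54, 93, 70, 30, 61, 55, 36, 37, 29, 94]

# quantiles() sorts the data and returns Q1, median, Q3
q1, median, q3 = statistics.quantiles(marks, n=4, method="exclusive")

five_number = (min(marks), q1, median, q3, max(marks))
print(five_number)  # (20, 36.25, 54.5, 73.0, 94)

# Position of the median inside the box
lower_half = median - q1   # 18.25
upper_half = q3 - median   # 18.5
print(lower_half, upper_half)
```

The median sits almost exactly midway between Q1 and Q3 (18.25 below, 18.5 above), which by the boxplot rules suggests a roughly symmetrical distribution of marks.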
Question
The breaking strength of 20 test pieces of a certain alloy is given as:-
95, 97, 96, 73, 78, 95, 89, 68, 82, 79, 69, 67, 83, 94, 87, 93, 103, 108, 117, 130
Calculate the average breaking strength of the alloy and the standard deviation.
Calculate the percentage of observations lying within the limits:-
(i) Mean ± S
(ii) Mean ± 2S
(iii) Mean ± 3S

Mean = 90.15
Variance S² = 269.08
SD S = 16.4
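One way to work the three parts is to count the observations inside each pair of limits. The sketch below is in Python for illustration only; the helper name `percent_within` is an assumption, and the sample standard deviation (divisor n − 1) is used.

```python
import statistics

strength = [95, 97, 96, 73, 78, 95, 89, 68, 82, 79,
            69, 67, 83, 94, 87, 93, 103, 108, 117, 130]

mean = statistics.mean(strength)
sd = statistics.stdev(strength)


def percent_within(data, center, spread, k):
    """Percentage of observations inside center ± k * spread."""
    lower, upper = center - k * spread, center + k * spread
    inside = sum(1 for x in data if lower <= x <= upper)
    return 100 * inside / len(data)


print("mean:", round(mean, 2), "SD:", round(sd, 2))
for k in (1, 2, 3):
    print(f"Mean ± {k}S:", percent_within(strength, mean, sd, k), "%")
```

For this sample the three percentages come out as 65, 95 and 100, which can be set against the percentages expected under a normal curve.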
Characteristics of Normal Curve: About 68 percent of the observations fall between plus and minus one SD from the mean; about 95 percent fall between plus and minus two SD from the mean; and about 99.7 percent fall between plus and minus three SD from the mean.
Standardized Variable

• A standardized variable (sometimes called a z-score or a standard score) is a variable that has been rescaled to have a mean of zero and a standard deviation of one.

$$Z_i = \frac{X_i - \bar{X}}{S}$$

Example: Consider the following data:- 25, 26, 23, 25, 45, 45, 58, 58, 50, 25. Calculate its mean and variance and make a standardized variable $Z_i$. Verify that the mean and variance of $Z_i$ are zero and 1 respectively.

Solution

$$Z_i = \frac{X_i - \bar{X}}{S}$$

 Xᵢ    Xᵢ − X̄   (Xᵢ − X̄)²     Zᵢ
 25     -13       169       -0.8905
 26     -12       144       -0.8220
 23     -15       225       -1.0275
 25     -13       169       -0.8905
 45       7        49        0.4795
 45       7        49        0.4795
 58      20       400        1.3700
 58      20       400        1.3700
 50      12       144        0.8220
 25     -13       169       -0.8905
380       0      1918        0.00     (totals)

$\bar{X} = 38$
$S_X = 14.598$
$\bar{Z} = 0$
$Var(Z) = 1$
$S_Z = 1$
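The verification asked for in the example can also be done numerically. This is a minimal sketch in Python (illustrative only; the variable names are assumptions), using the sample standard deviation with divisor n − 1 both for standardizing and for checking the variance of Z.

```python
import statistics

x = [25, 26, 23, 25, 45, 45, 58, 58, 50, 25]

mean_x = statistics.mean(x)
sd_x = statistics.stdev(x)  # sample SD, divisor n - 1

# Standardize each observation
z = [(xi - mean_x) / sd_x for xi in x]

mean_z = statistics.mean(z)
var_z = statistics.variance(z)

print("mean of X:", mean_x, "SD of X:", round(sd_x, 3))
print("Z values:", [round(zi, 4) for zi in z])
print("mean of Z:", round(mean_z, 4))
print("variance of Z:", round(var_z, 4))
```

Up to floating-point rounding, the mean of Z is zero and its variance is one, whatever the units of the original data.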
