Chapter 2 - Descriptive Statistics

Basics of Statistics

Statistics
Statistics is a tool for converting data into information.

Remark
Where then does data come from? How is it gathered?
How do we ensure it is accurate? Is the data reliable?
Is it representative of the population from which it was drawn?

Author: Dr. Kamal Mamehrashi (UKH) Probability and Statistics Semester II 2024-25 33 / 195

Methods of Collecting Data


There are many methods used to collect or obtain data for statistical
analysis. Three of the most popular methods are:
1 Direct observation
2 Experiments
3 Surveys.

Remark
A survey solicits information from people, e.g. Gallup polls, pre-election
polls, and marketing surveys. Surveys may be administered in a variety of ways:
1 Personal Interview,
2 Telephone Interview,
3 Self-Administered Questionnaire.


Sampling
Sampling is selecting a subset of a whole population.

Remark
Sampling is often done for reasons of cost (e.g. sampling 1,000 television
viewers out of 100 million) and practicality (e.g. it is not feasible to
perform a crash test on every automobile).

Remark
In any case, the sampled population and the target population should be
similar to one another.

Sampling

Sampling Methods
A sampling plan is a method or procedure for specifying how a
sample will be taken from a population. Three commonly used methods are:
1 Simple Random Sampling,
2 Stratified Random Sampling,
3 Cluster Sampling.

Simple Random Sampling


A simple random sample is a sample selected in such a way that every
possible sample of the same size is equally likely to be chosen.


Example
Drawing three names from a hat containing all the names of the
students in the class is an example of a simple random sample:
any group of three names is as likely to be picked as any other group of
three names.
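
The hat example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the roster below is hypothetical, and the fixed seed is there just to make the run repeatable.

```python
import random

# Hypothetical class roster; the names are placeholders, not from the lecture.
students = ["Ava", "Ben", "Chen", "Dara", "Emil", "Fatima", "Goran", "Hana"]

rng = random.Random(2024)            # fixed seed only so the sketch is repeatable
sample = rng.sample(students, k=3)   # 3 distinct names, drawn without replacement

print(sample)
```

`random.sample` draws without replacement, so every group of three names has the same chance of being selected.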

Stratified Random Sampling


A stratified random sample is obtained by separating the population into
mutually exclusive sets, or strata, and then drawing simple random
samples from each stratum.

Remark
Age, socioeconomic division, nationality, religion, educational
achievement and other such classifications are typical criteria for
forming strata in stratified random sampling.

Cluster Sampling
A cluster sample is a simple random sample of groups or clusters of
elements (vs. a simple random sample of individual objects).

Remark
This method is useful when it is difficult or costly to develop a complete
list of the population members or when the population elements are widely
dispersed geographically.

Remark
Cluster sampling may increase sampling error due to similarities among
cluster members.

Variable and Data

Variable
A characteristic that varies from one person or thing to another is called a
variable.

Example
Examples of variables for humans are height, weight, number of siblings,
sex, marital status, and eye color.

Kinds of Variables
Variables can be classified as quantitative or qualitative.
1 Quantitative variables are numerical and can be ordered or ranked.
2 Qualitative (categorical) variables are variables that can be placed
into distinct categories, according to some characteristic or attribute.


Example
Height, weight, and number of siblings are quantitative variables; sex,
marital status, and eye color are qualitative variables.

Remark
Quantitative variables can be classified as either discrete or continuous.
1 Discrete variables assume values that can be counted.
2 Continuous variables can assume an infinite number of values
between any two specific values. They are obtained by measuring.
They often include fractions and decimals.


Example
Examples of Quantitative Variables:
1 High school Grade Point Average (e.g. 4.0, 3.2, 2.1).
2 Number of pets owned (e.g. 1, 2, 4).
3 Bank account balance (e.g. $100, $987, -$42).
4 Number of stars in a galaxy (e.g. 100, 2301, 1 trillion).

Example
Examples of Categorical Variables:
1 Class in college (e.g. freshman, sophomore, junior, senior).
2 Party affiliation (e.g. Republican, Democrat, Independent).
3 Type of pet owned (e.g. dog, cat, rodent, fish).

Frequency Distributions
Frequency Distribution of Qualitative Data
A frequency distribution of qualitative data is a listing of the distinct
values and their frequencies.

Remark
A frequency distribution provides a table of the values of the observations
and how often they occur.

To Construct a Frequency Distribution of Qualitative Data


1 List the distinct values of the observations in the data set in the first
column of a table.
2 For each observation, place a tally mark in the second column of the
table in the row of the appropriate distinct value.
3 Count the tallies for each distinct value and record the totals in the
third column of the table.

Example
Professor Weiss asked his introductory statistics students to state their
political party affiliations as Democratic (D), Republican (R), or Other
(O). The responses of the 40 students in the class are given in the following
table. Determine a frequency distribution of these data.


Relative Frequency

\[ \text{Relative frequency} = \frac{\text{Frequency}}{\text{Number of observations}} \]

To Construct a Relative-Frequency Distribution of Qualitative Data


1 Obtain a frequency distribution of the data.
2 Divide each frequency by the total number of observations.

Remark
Relative-frequency distributions are better than frequency distributions for
comparing two data sets. Because relative frequencies always fall between
0 and 1, they provide a standard for comparison.
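
A minimal sketch of the two construction procedures above, counting each distinct value and then dividing by the number of observations. The responses listed here are hypothetical and much shorter than the 40 responses in the example's table.

```python
from collections import Counter

# Hypothetical responses (D, R, O); NOT the 40 responses from the example.
responses = ["D", "R", "O", "R", "R", "D", "O", "D", "R", "D"]

frequency = Counter(responses)   # distinct value -> number of occurrences
n = len(responses)               # total number of observations
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], relative_frequency[value])
```

The relative frequencies always sum to 1, which is what makes two data sets of different sizes comparable.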

Organizing Data

Example
For the previous example, construct a relative-frequency distribution.

Pie Chart
A pie chart is a disk divided into wedge-shaped pieces proportional to the
relative frequencies of the qualitative data.

Graphs

To Construct a Pie Chart


1 Obtain a relative-frequency distribution of the data.
2 Divide a disk into wedge-shaped pieces proportional to the relative
frequencies.
3 Label the slices with the distinct values and their relative frequencies.

Bar Chart
A bar chart displays the distinct values of the qualitative data on a
horizontal axis and the relative frequencies (or frequencies or percents) of
those values on a vertical axis.


Example
Construct a pie chart of the political party affiliations of the students in
Professor Weiss’s introductory statistics class.


To Construct a Bar Chart


1 Obtain a relative-frequency distribution of the data.
2 Draw a horizontal axis on which to place the bars and a vertical axis
on which to display the relative frequencies.
3 For each distinct value, construct a vertical bar whose height equals
the relative frequency of that value.
4 Label the bars with the distinct values.

Example
Construct a bar chart of the political party affiliations of the students in
Professor Weiss’s introductory statistics class.

Organizing Quantitative Data

Remark
To organize quantitative data, we first group the observations into classes
(also known as categories or bins) and then treat the classes as the
distinct values of qualitative data.

Single-Value Grouping
In some cases, the most appropriate way to group quantitative data is to
use classes in which each class represents a single possible value. Such
classes are called single value classes.
This method of grouping quantitative data is called single-value
grouping.


Example
The Television Bureau of Advertising publishes information on television
ownership in Trends in Television. Table 4 gives the number of TV sets per
household for 50 randomly selected households. Use single-value grouping
to organize these data into frequency and relative-frequency distributions.

Limit Grouping

Class Limit
A second way to group quantitative data is to use class limits. With this
method, each class consists of a range of values.
This method of grouping quantitative data is called limit grouping.

Terms Used in Limit Grouping


1 Lower class limit: The smallest value that could go in a class.
2 Upper class limit: The largest value that could go in a class.
3 Class width: The difference between the lower limit of a class and the
lower limit of the next-higher class.
4 Class mark: The average of the two class limits of a class.


Example
Table 6 displays the number of days to maturity for 40 short-term
investments. The data are from BARRON’S magazine. Use limit
grouping, with grouping by 10s, to organize these data into frequency and
relative-frequency distributions.


Solution
Because we are grouping by 10s and the shortest maturity period is 36
days, our first class is 30-39.


Cutpoint Grouping
A third way to group quantitative data is to use class cutpoints. As with
limit grouping, each class consists of a range of values.
The method of grouping quantitative data by using cutpoints is called
cutpoint grouping.

Terms Used in Cutpoint Grouping


1 Lower class cutpoint: The smallest value that could go in a class.
2 Upper class cutpoint: The smallest value that could go in the
next-higher class.
3 Class width: The difference between the cutpoints of a class.
4 Class midpoint: The average of the two cutpoints of a class.


Example
The U.S. National Center for Health Statistics publishes data on weights
and heights by age and sex in the document Vital and Health Statistics.
The weights shown in Table 8, given to the nearest tenth of a pound, were
obtained from a sample of 18- to 24-year-old males. Use cutpoint
grouping to organize these data into frequency and relative-frequency
distributions. Use a class width of 20 and a first cutpoint of 120.


Solution
Tallying the data in Table 8 gives us the frequencies in the second column
of Table 9. Dividing each such frequency by the total number of
observations, 37, we get the relative frequencies.


Remark
When the class width is not given, it can be found as follows:
1 Range = Max Value − Min Value. The range is denoted by R.
2 The number of classes is calculated by K = 1 + 3.3 log n, where n is the
number of observations and log is the base-10 logarithm.
3 Width = \(\frac{\text{Range}}{\text{Number of classes}} = \frac{R}{K}\).

Remark
If needed, round the width up to the next whole number when there is a
remainder.

Remark
Rounding up is different from rounding off. A number is rounded up if
there is any decimal remainder when dividing.
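
The three steps above, together with the rounding-up rule, might be sketched as follows. The sample size and extreme values are hypothetical, and rounding K to the nearest whole number is an assumption; the slides only specify rounding for the width.

```python
import math

def class_width(n, min_value, max_value):
    """Return (number of classes K, class width) for n observations."""
    data_range = max_value - min_value       # R = Max Value - Min Value
    k = round(1 + 3.3 * math.log10(n))       # assumption: K rounded to nearest whole number
    width = math.ceil(data_range / k)        # round UP whenever there is a remainder
    return k, width

# Hypothetical inputs: 50 observations ranging from 100 to 134.
k, width = class_width(n=50, min_value=100, max_value=134)
print(k, width)
```

With these inputs, R = 34 and K = 1 + 3.3 log 50 ≈ 6.6, so K rounds to 7 and the width 34/7 ≈ 4.86 rounds up to 5.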


Example
These data represent the record high temperatures in degrees Fahrenheit
(°F) for each of the 50 states. Construct a grouped frequency distribution
for the data.


Histogram
A histogram displays the classes of the quantitative data on a horizontal
axis and the frequencies (relative frequencies, percents) of those classes on
a vertical axis.

Remark
The frequency (relative frequency, percent) of each class is represented by
a vertical bar whose height is equal to the frequency (relative frequency,
percent) of that class.

Remark
For limit grouping or cutpoint grouping, we use the lower class limits (or,
equivalently, lower class cutpoints) to label the bars.

Example
Construct frequency histograms and relative-frequency histograms for the
data on number of televisions per household, days to maturity for
short-term investments, and weights of 18- to 24-year-old males from the
earlier grouping examples.

Measures of central tendency

Measures of center
Descriptive measures that indicate where the center or most typical value
of a data set lies are called measures of central tendency or, more
simply, measures of center.
Measures of center are often called averages.

Mean of a Data Set


The mean of a data set is the sum of the observations divided by the
number of observations.


Mean
The mean is the sum of the values, divided by the total number of values.
The symbol X̄ represents the sample mean:
\[ \bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n} \]
where n represents the total number of values in the sample.
For a population, the Greek letter µ (mu) is used for the mean:
\[ \mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N} \]
where N represents the total number of values in the population.

Example
The data represent the number of days off per year for a sample of
individuals selected from nine different countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
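
As a quick check of the sample-mean formula on the days-off data, a short sketch (the variable names are illustrative only):

```python
# Days off per year for the nine sampled individuals (from the example).
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]

# Sample mean: sum of the values divided by the number of values.
sample_mean = sum(days_off) / len(days_off)

print(round(sample_mean, 2))  # 30.67
```

The sum of the nine values is 276, so the mean is 276 / 9 ≈ 30.67 days.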

Median of a Data Set


Arrange the data in increasing order.
1 If the number of observations is odd, then the median is the
observation exactly in the middle of the ordered list.
2 If the number of observations is even, then the median is the mean of
the two middle observations in the ordered list.


Example
1 The number of rooms in the seven hotels in downtown Pittsburgh is
713, 300, 618, 595, 311, 401, and 292. Find the median.
2 The number of tornadoes that have occurred in the United States
over an 8-year period follows. Find the median.
684, 764, 656, 702, 856, 1133, 1132, 1303.
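
The median rule for odd and even numbers of observations could be sketched like this and applied to both data sets in the example. The helper name `median_of` is hypothetical.

```python
def median_of(values):
    """Median following the slide's rule: sort, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the exact middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle ones

rooms = [713, 300, 618, 595, 311, 401, 292]
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]

print(median_of(rooms))      # 401
print(median_of(tornadoes))  # 810.0
```

For the tornado data, the two middle values of the ordered list are 764 and 856, whose mean is 810.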

Mode of a Data Set


Find the frequency of each value in the data set.
If no value occurs more than once, then the data set has no mode.
Otherwise, any value that occurs with the greatest frequency is a mode of
the data set.

Example
1 Find the mode of the signing bonuses of eight NFL players for a
specific year. The bonuses in millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.
2 Find the mode for the number of branches that six banks have.
401, 344, 209, 201, 227, 353
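
A small sketch of the mode rule stated above, including the "no mode" case. The function name `modes` is hypothetical; it returns an empty list when no value occurs more than once.

```python
from collections import Counter

def modes(values):
    """All values with the greatest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                 # no value occurs more than once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
branches = [401, 344, 209, 201, 227, 353]

print(modes(bonuses))   # [10]
print(modes(branches))  # []
```

The bonus of 10 million dollars occurs three times, so it is the mode; the branch counts are all distinct, so that data set has no mode.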

Finding the Mean for Grouped Data


The formula for the mean for grouped data is

\[ \bar{X} = \frac{\sum f \cdot X_m}{n} \]
where f is the frequency and \(X_m\) is the midpoint for each class.


Example
Using the frequency distribution in the given table, find the mean. The
data represent the number of miles run during one week for a sample of 20
runners.


Remark
1 A data set that has only one value that occurs with the greatest
frequency is said to be unimodal.
2 If a data set has two values that occur with the same greatest
frequency, both values are considered to be the mode and the data
set is said to be bimodal.
3 If a data set has more than two values that occur with the same
greatest frequency, each value is used as the mode, and the data set
is said to be multimodal.

Remark
The mode for grouped data is the modal class. The modal class is the
class with the largest frequency.


Example
Find the modal class for the frequency distribution of miles that 20
runners ran in one week, using the table below.


Median for grouped data


The median value corresponds to a cumulative percentage of 50%. The
position of the median is the \(\frac{n+1}{2}\)th value, where n is the
number of values in a set of data.

Median Class
The median class is the first class whose cumulative frequency is greater
than or equal to half the number of observations (n/2).

Median formula
\[ \text{Median} = L_M + \frac{\frac{n}{2} - F_c}{f_M} \cdot w \]

where \(L_M\) is the lower boundary of the median class, \(F_c\) is the
cumulative frequency of the class before the median class, \(f_M\) is the
frequency of the median class, and w is the class width.

Example
Find the median for the following grouped data.


Measures of variation
To describe quantitatively the difference between two data sets that have
the same measures of center, we use a descriptive measure that indicates the
amount of variation, or spread, in a data set. Such descriptive measures
are referred to as measures of variation or measures of spread.

The Range
The range of a data set is the difference between the maximum (largest)
and minimum (smallest) observations.


Sample Standard Deviation


For a variable x, the standard deviation of the observations for a sample is
called a sample standard deviation. It is denoted \(s_x\) or, when no confusion
will arise, simply s. We have

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \]

where n is the sample size and x̄ is the sample mean.

Example
Determine the sample standard deviation of the heights of the starting
players on a team, given by 6′, 6′1″, 6′4″, 6′4″, 6′6″ (feet and inches).


Computing Formula for a Sample Standard Deviation


A sample standard deviation can be computed using the formula

\[ s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n-1}} \]
where n is the sample size.

Example
Find the sample standard deviation of the heights for the five starting
players on a team as given by 67, 72, 76, 76, 84 (Inches) by using the
computing formula.
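
A short sketch comparing the defining formula and the computing formula on the five heights from this example. Both should give the same value; the variable names are illustrative.

```python
import math

heights = [67, 72, 76, 76, 84]   # inches, from the example
n = len(heights)
mean = sum(heights) / n

# Defining formula: square root of the sum of squared deviations over n - 1.
s_defining = math.sqrt(sum((x - mean) ** 2 for x in heights) / (n - 1))

# Computing formula: uses the sum of x^2 and the square of the sum of x.
sum_x = sum(heights)
sum_x2 = sum(x * x for x in heights)
s_computing = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

print(round(s_defining, 3), round(s_computing, 3))  # 6.245 6.245
```

Here the sum of the squared values is 28281 and the squared sum divided by n is 140625 / 5 = 28125, so s = √(156 / 4) = √39 ≈ 6.245 inches.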

Variance
The first step in computing a sample standard deviation is to take a kind
of average of the squared deviations. We do so by dividing the sum of
squared deviations by n − 1 (1 less than the sample size). The resulting
quantity is called a sample variance and is denoted \(s_x^2\) or, when no
confusion can arise, \(s^2\). In symbols,
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} \]

Remark
The symbol for population variance is \(\sigma^2\). The formula for the
population variance is
\[ \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \]

Remark
The corresponding formula for the population standard deviation is

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} \]

Example
A testing lab wishes to test an experimental brand (a small population) of
outdoor paint to see how long it will last before fading. The testing lab
makes 6 gallons to test. The results (in months) are 35, 45, 30, 35, 40,
25. Find the variance and standard deviation for the brand.
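
Because the six gallons are treated as a small population, the population formulas (dividing by N) apply. A minimal sketch of the calculation:

```python
import math

months = [35, 45, 30, 35, 40, 25]   # months before fading, from the example
N = len(months)
mu = sum(months) / N                # population mean

# Population variance divides by N (not N - 1).
variance = sum((x - mu) ** 2 for x in months) / N
std_dev = math.sqrt(variance)

print(mu, round(variance, 1), round(std_dev, 1))  # 35.0 41.7 6.5
```

The squared deviations from µ = 35 sum to 250, so σ² = 250 / 6 ≈ 41.7 and σ ≈ 6.5 months.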


Sample variance and standard deviation for grouped data


1 Find \(\sum f \cdot X_m\) and \(\sum f \cdot X_m^2\).
2 Compute the variance:
\[ s^2 = \frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n-1)} \]
3 Take the square root to get the standard deviation.
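
The three steps can be sketched as below. The class midpoints and frequencies are a hypothetical frequency table chosen for illustration (20 observations in seven classes); they are an assumption and are not taken from a table in these notes.

```python
import math

# Hypothetical grouped data: (class midpoint X_m, frequency f). Assumed values.
classes = [(8, 1), (13, 2), (18, 3), (23, 5), (28, 4), (33, 3), (38, 2)]

n = sum(f for _, f in classes)                 # total number of observations
sum_fx = sum(f * xm for xm, f in classes)      # sum of f * X_m
sum_fx2 = sum(f * xm ** 2 for xm, f in classes)  # sum of f * X_m^2

grouped_mean = sum_fx / n
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
std_dev = math.sqrt(variance)

print(grouped_mean, round(variance, 1), round(std_dev, 1))  # 24.5 68.7 8.3
```

For this table, ∑ f·Xm = 490 and ∑ f·Xm² = 13310, so s² = (20·13310 − 490²) / (20·19) = 26100 / 380 ≈ 68.7. The same two sums also give the grouped mean, 490 / 20 = 24.5.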


Example
Find the variance and the standard deviation for the frequency distribution
of the data in the table below. The data represent the number of miles that
20 runners ran during one week.


Remark
A statistic that allows you to compare standard deviations when the units
are different is called the coefficient of variation.

Coefficient of variation
The coefficient of variation, denoted by CVar, is the standard deviation
divided by the mean. The result is expressed as a percentage:
1 For a sample:
\[ \text{CVar} = \frac{s}{\bar{X}} \cdot 100\% \]
2 For a population:
\[ \text{CVar} = \frac{\sigma}{\mu} \cdot 100\% \]


Example
The mean of the number of sales of cars over a 3-month period is 87, and
the standard deviation is 5. The mean of the commissions is $5225, and
the standard deviation is $773. Compare the variations of the two.

Example
The mean for the number of pages of a sample of women’s fitness
magazines is 132, with a variance of 23; the mean for the number of
advertisements of a sample of women’s fitness magazines is 182, with a
variance of 62. Compare the variations.
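
Both examples can be worked with a small helper. Note that the second example gives variances, so the standard deviation is obtained by taking a square root first. The function name `cvar` is hypothetical.

```python
import math

def cvar(std_dev, mean):
    """Coefficient of variation as a percentage."""
    return std_dev / mean * 100

# First example: standard deviations are given directly.
cv_sales = cvar(5, 87)
cv_commissions = cvar(773, 5225)

# Second example: variances are given, so take the square root first.
cv_pages = cvar(math.sqrt(23), 132)
cv_ads = cvar(math.sqrt(62), 182)

print(round(cv_sales, 1), round(cv_commissions, 1))  # 5.7 14.8
print(round(cv_pages, 1), round(cv_ads, 1))          # 3.6 4.3
```

The larger coefficient indicates the more variable quantity: commissions vary more than sales, and the number of advertisements varies more than the number of pages.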


Measures of position
These measures include standard scores, percentiles, deciles, and quartiles.
They are used to locate the relative position of a data value in the data set.
For example, if a value is located at the 80th percentile, it means that
80% of the values fall below it in the distribution and 20% of the values
fall above it.

Remark
The median is the value that corresponds to the 50th percentile, since
one-half of the values fall below it and one-half of the values fall above it.


Standard Score
A z score or standard score for a value is obtained by subtracting the mean
from the value and dividing the result by the standard deviation. The
symbol for a standard score is z. The formula is
1 for a sample:
\[ z = \frac{X - \bar{X}}{s} \]
2 for a population:
\[ z = \frac{X - \mu}{\sigma} \]


Example
A student scored 65 on a calculus test that had a mean of 50 and a
standard deviation of 10; she scored 30 on a history test with a mean of 25
and a standard deviation of 5. Compare her relative positions on the two
tests.
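
The comparison comes down to computing one z score per test. A minimal sketch, with a hypothetical helper name `z_score`:

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus, z_history)  # 1.5 1.0
```

Since 1.5 > 1.0, her relative position in the calculus class is higher than in the history class, even though both scores are above their means.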

Remark
Note that if the z score is positive, the score is above the mean. If the z
score is 0, the score is the same as the mean. And if the z score is
negative, the score is below the mean.


Percentiles
Percentiles divide the data set into hundredths or 100 equal parts.

Remark
A data set has 99 percentiles, denoted P1 , P2 , ..., P99 . The first percentile,
P1 , is the number that divides the bottom 1% of the data from the top
99%; the second percentile, P2 , is the number that divides the bottom 2%
of the data from the top 98%; and so on.

Remark
Note that the median is also the 50th percentile.


Percentiles
The percentile corresponding to a given value X is computed by using the
following formula:

\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100 \]

Example
A teacher gives a 20-point test to 10 students. The scores are shown here.
Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10

Example
Using the given data 18, 15, 12, 6, 8, 2, 3, 5, 20, 10, find the percentile
rank for a score of 6.
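
Both percentile-rank examples follow directly from the formula. A sketch, with a hypothetical function name:

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(percentile_rank(scores, 12))  # 65.0
print(percentile_rank(scores, 6))   # 35.0
```

Six scores lie below 12, giving (6 + 0.5) / 10 · 100 = 65, and three scores lie below 6, giving 35. So a score of 12 is at the 65th percentile and a score of 6 at the 35th.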


Finding a Data Value Corresponding to a Given Percentile


1 Arrange the data in order from lowest to highest.
2 Substitute into the formula
\[ c = \frac{n \cdot p}{100} \]
where n is the total number of values and p is the percentile.
3 If c is not a whole number, round up to the next whole number. The
value is the number that corresponds to the rounded-up value.
4 If c is a whole number, use the value halfway between the cth and
(c + 1)th values.

Example
Using the given scores 18, 15, 12, 6, 8, 2, 3, 5, 20, 10, find the value
corresponding to the 25th percentile.

Example
Using the given data 18, 15, 12, 6, 8, 2, 3, 5, 20, 10, find the value that
corresponds to the 60th percentile.
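
The four-step procedure for finding the value at a given percentile might be sketched as follows. The function name is hypothetical, and positions are counted from 1 as on the slide.

```python
import math

def value_at_percentile(values, p):
    """Data value at the p-th percentile, following the slide's procedure."""
    ordered = sorted(values)           # step 1: arrange from lowest to highest
    n = len(ordered)
    c = n * p / 100                    # step 2: c = n * p / 100
    if c != int(c):
        position = math.ceil(c)        # step 3: not whole, so round up
        return ordered[position - 1]   # positions are 1-based
    c = int(c)                         # step 4: whole, so go halfway between
    return (ordered[c - 1] + ordered[c]) / 2   # the c-th and (c + 1)-th values

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(value_at_percentile(scores, 25))  # 5
print(value_at_percentile(scores, 60))  # 11.0
```

For p = 25, c = 2.5 rounds up to 3 and the third ordered value is 5. For p = 60, c = 6 is whole, so the result is halfway between the 6th and 7th values (10 and 12), which is 11.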

Quartile
The quartiles divide a data set into quarters (4 equal parts).

Remark
A data set has three quartiles, which we denote Q1 , Q2 , and Q3 .
The first quartile, Q1, is the number that divides the bottom 25% of the
data from the top 75% and so on.

Quartiles
Arrange the data in increasing order and determine the median.
1 The first quartile is the median of the part of the entire data set that
lies at or below the median of the entire data set.
2 The second quartile is the median of the entire data set.
3 The third quartile is the median of the part of the entire data set that
lies at or above the median of the entire data set.

Remark
Quartiles are denoted by Q1 , Q2 , Q3 and they correspond to P25 , P50 , P75 .
The median is the same as P50 or Q2 .


Example
Find Q1 , Q2 , and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
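
A sketch of the quartile rule stated above (median of the part at or below the overall median, and of the part at or above it), applied to this data set. The function name is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the 'median of each half' rule from the slide."""
    ordered = sorted(values)
    q2 = median(ordered)
    lower = [v for v in ordered if v <= q2]   # part at or below the median
    upper = [v for v in ordered if v >= q2]   # part at or above the median
    return median(lower), q2, median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)

print(q1, q2, q3)  # 9.0 14.0 20.0
```

The ordered data are 5, 6, 12, 13, 15, 18, 22, 50, so Q2 = (13 + 15) / 2 = 14, Q1 is the median of 5, 6, 12, 13, and Q3 is the median of 15, 18, 22, 50.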

Example
The A. C. Nielsen Company publishes information on the TV-viewing
habits of Americans in Nielsen Report on Television. A sample of 20
people yielded the weekly viewing times, in hours, displayed in the given
table. Determine and interpret the quartiles for these data.

Measures of Skewness

Determining Normality
A normally shaped or bell-shaped distribution is only one of many shapes
that a distribution can assume. There are several ways statisticians check
for normality. The easiest way is to draw a histogram for the data and
check its shape.

Concept of Skewness
The skewness of a distribution is defined as the lack of symmetry. In a
symmetrical distribution, the Mean, Median and Mode are equal to each
other.
In a normal distribution, the ordinate at the mean divides the distribution
into two equal parts.


Measures of Skewness
Skewness can be checked by using the Pearson coefficient of skewness
(PC), also called Pearson’s index of skewness. The formula is

\[ PC = \frac{3(\bar{X} - \text{median})}{s} \]
where X̄ is the mean and s is the standard deviation.

Remark
If the index is greater than or equal to +1 or less than or equal to −1, it
can be concluded that the data are significantly skewed.


Example
A survey of 18 high-technology firms showed the number of days’
inventory they had on hand. Determine if the data are approximately
normally distributed.
5 29 34 44 45 63 68 74 74 81 88 91 97 98 113 118 151 158
Mean = 79.50, median = 77.50, s = 40.454

Example
The data shown consist of the number of games played each year in the
career of Baseball Hall of Famer Bill Mazeroski. Determine if the data are
approximately normally distributed.
81 148 152 135 151 152 159 142 34 162 130 162 163 143 67 112 70
Mean = 127.24, median = 143, s = 39.866
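
The Pearson coefficient for both data sets can be computed directly from the raw values. This sketch relies on Python's standard `statistics` module and only covers the skewness index; a histogram check of the shape would still be needed for a full normality assessment.

```python
from statistics import mean, median, stdev

def pearson_skewness(values):
    """Pearson coefficient of skewness: 3 * (mean - median) / s."""
    return 3 * (mean(values) - median(values)) / stdev(values)

inventory = [5, 29, 34, 44, 45, 63, 68, 74, 74, 81,
             88, 91, 97, 98, 113, 118, 151, 158]
games = [81, 148, 152, 135, 151, 152, 159, 142, 34,
         162, 130, 162, 163, 143, 67, 112, 70]

pc_inventory = pearson_skewness(inventory)
pc_games = pearson_skewness(games)

print(round(pc_inventory, 2), round(pc_games, 2))
```

For the inventory data, PC = 3(79.5 − 77.5) / 40.454 ≈ 0.15, which lies between −1 and +1, so the data are not significantly skewed. For the games data, PC = 3(127.24 − 143) / 39.866 ≈ −1.19, which is below −1, so those data are significantly skewed to the left.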

Regression and Correlation

Introduction
The purpose of this section is to answer these questions statistically:
1 Are two or more variables linearly related?
2 If so, what is the strength of the relationship?
3 What type of relationship exists?
4 What kind of predictions can be made from the relationship?

Correlation
Correlation is a statistical method used to determine whether a linear
relationship between variables exists.


Remark
To answer the first two questions, statisticians use a numerical measure to
determine whether two or more variables are linearly related and to
determine the strength of the relationship between or among the variables.
This measure is called a correlation coefficient.

Remark
To answer the third question, you must ascertain what type of relationship
exists. There are two types of relationships: simple and multiple.

Remark
The correlation coefficient computed from the sample data measures the
strength and direction of a linear relationship between two quantitative
variables.


Remark
The symbol for the sample correlation coefficient is r . The symbol for the
population correlation coefficient is ρ.

Remark
The range of the correlation coefficient is from −1 to +1.
1 If there is a strong positive linear relationship between the variables,
the value of r will be close to +1.
2 If there is a strong negative linear relationship between the variables,
the value of r will be close to −1.
3 If there is no linear relationship between the variables or only a weak
relationship, the value of r will be close to 0.


Formula for the Correlation Coefficient r


There are several ways to compute the value of the correlation coefficient.
One method is to use the formula shown here.
\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}} \]

where n is the number of data pairs.
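As a minimal sketch of this formula in code, the function below accumulates the five sums and combines them exactly as written. The function name `correlation_from_sums` is hypothetical, and the data are the car rental figures from the example in this section.

```python
import math


def correlation_from_sums(xs, ys):
    """Sample correlation coefficient r via the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Car rental data from the example in this section.
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

r_cars = correlation_from_sums(cars, revenue)
print(round(r_cars, 3))
```

For these six companies the result is r ≈ 0.982, which indicates a strong positive linear relationship between cars and revenue.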

Correlation Coefficient

Example
Compute the correlation coefficient for the given data.

Company Cars x Revenue y


A 63.0 7.0
B 29.0 3.9
C 20.8 2.1
D 19.1 2.8
E 13.4 1.4
F 8.5 1.5
Table: Car Rental Companies

Example
Compute the value of the correlation coefficient for the data obtained in
the study of the number of absences and the final grade of the seven
students in the statistics class.

Student Number of absences x Final grade y (%)


A 6 82
B 2 86
C 15 43
D 9 74
E 12 58
F 5 90
G 8 78
Table: Absences and Final Grades
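A worked sketch of this computation, using the sums formula for r with the seven data pairs in the table:

```latex
\[
n = 7,\quad \sum x = 57,\quad \sum y = 511,\quad \sum xy = 3745,\quad
\sum x^2 = 579,\quad \sum y^2 = 38993
\]
\[
r = \frac{7(3745) - (57)(511)}{\sqrt{[7(579) - 57^2][7(38993) - 511^2]}}
  = \frac{-2912}{\sqrt{(804)(11830)}} \approx -0.944
\]
```

The value is close to −1, so the data show a strong negative linear relationship: more absences go with lower final grades.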


Linear Correlation Coefficient


For a set of n data points, the linear correlation coefficient, r, is defined by
\[ r = \frac{\frac{1}{n-1}\sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y} \]

where sx and sy denote the sample standard deviations of the x-values and
y -values, respectively.
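The definition can be sketched directly in code. The function name `correlation_from_deviations` is hypothetical, and the data are the absences and final grades from the earlier example; the result should agree with the sums formula up to rounding.

```python
import statistics


def correlation_from_deviations(xs, ys):
    """Sample correlation coefficient r from the deviation-based definition."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    # Average product of deviations, using the n - 1 divisor.
    covariance = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)
    ) / (n - 1)
    return covariance / (s_x * s_y)


# Absences and final grades from the earlier example.
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

r_grades = correlation_from_deviations(absences, grades)
print(round(r_grades, 3))
```

This gives r ≈ −0.944. The two formulas are algebraically equivalent; the sums form is usually more convenient for hand calculation, while this form shows r as a standardized measure of joint variation.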

Regression

Remark
If the value of the correlation coefficient is significant, the next step is to
determine the equation of the regression line, which is the data’s line of
best fit.

Remark
When r is not significant, determining the regression line and using it to
make predictions is meaningless.

Line of Best Fit


Given a scatter plot, you must be able to draw the line of best fit. Best fit
means that the sum of the squares of the vertical distances from each
point to the line is at a minimum.


Figure: Line of best fit

Determination of the Regression Line Equation

Regression Line Equation


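In algebra, the equation of a line is usually given as y = mx + b, where m is
the slope of the line and b is the y intercept. In statistics, the regression
line is written as y = a + bx, where a is the y intercept and b is the slope.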

Remark
There are several methods for finding the equation of the regression line.
The formulas used here rely on the same sums that are used in computing the
value of the correlation coefficient.



Formulas for the Regression Line y = a + bx

\[ a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} \]
\[ b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} \]

where a is the y intercept and b is the slope of the line.
This method for finding the line of best fit is called the least-squares
method.
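To see where the formulas for a and b come from, the following standard derivation sketch may help. The symbol S(a, b) is notation introduced here for the sum of squared vertical distances; setting both partial derivatives to zero gives two linear equations (the normal equations) in a and b.

```latex
\[
S(a, b) = \sum (y_i - a - b x_i)^2
\]
\[
\frac{\partial S}{\partial a} = -2 \sum (y_i - a - b x_i) = 0
\;\Longrightarrow\; \sum y = n a + b \sum x
\]
\[
\frac{\partial S}{\partial b} = -2 \sum x_i (y_i - a - b x_i) = 0
\;\Longrightarrow\; \sum xy = a \sum x + b \sum x^2
\]
```

Solving these two equations simultaneously for a and b yields the intercept and slope formulas, both with the common denominator n(∑x²) − (∑x)².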


Example
Find the equation of the regression line for the data in the following table,
and graph the line on the scatter plot of the data.

Company Cars x Revenue y


A 63.0 7.0
B 29.0 3.9
C 20.8 2.1
D 19.1 2.8
E 13.4 1.4
F 8.5 1.5
Table: Car Rental Companies
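A minimal sketch of the intercept and slope formulas applied to this table follows. The function name `regression_line` is hypothetical, and plotting is left out to keep the example dependency-free.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2  # shared by both formulas
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # y intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator       # slope
    return a, b


# Car rental data from the table above.
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]

a, b = regression_line(cars, revenue)
print(f"y = {a:.3f} + {b:.3f}x")
```

The fitted line is approximately y = 0.396 + 0.106x. To graph it, evaluate the equation at two x-values within the range of the data, plot the two points on the scatter plot, and join them with a straight line.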


Example
Find the equation of the regression line for the data in the given table, and
graph the line on the scatter plot.

Student Number of absences x Final grade y (%)


A 6 82
B 2 86
C 15 43
D 9 74
E 12 58
F 5 90
G 8 78
Table: Absences and Final Grades
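A worked sketch for this table, substituting the sums of the seven data pairs into the formulas for a and b:

```latex
\[
n = 7,\quad \sum x = 57,\quad \sum y = 511,\quad \sum xy = 3745,\quad
\sum x^2 = 579
\]
\[
a = \frac{(511)(579) - (57)(3745)}{7(579) - 57^2}
  = \frac{82404}{804} \approx 102.493
\]
\[
b = \frac{7(3745) - (57)(511)}{7(579) - 57^2}
  = \frac{-2912}{804} \approx -3.622
\]
```

The regression line is therefore approximately y = 102.493 − 3.622x. For the graph, two convenient points are x = 5, which gives y ≈ 84.4, and x = 15, which gives y ≈ 48.2.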

Sums of Squares in Regression

Total sum of squares, SST


The total variation in the observed values of the response variable:
\[ SST = \sum (y_i - \bar{y})^2 \]

Regression sum of squares, SSR


The variation in the observed values of the response variable explained by
the regression:
\[ SSR = \sum (\hat{y}_i - \bar{y})^2 \]

Remark
ŷi is the predicted value of the response variable which is obtained from
the regression line.


Error sum of squares, SSE


The variation in the observed values of the response variable not explained
by the regression:
\[ SSE = \sum (y_i - \hat{y}_i)^2 \]

Coefficient of Determination
The coefficient of determination, $r^2$, is the proportion of variation in the
observed values of the response variable explained by the regression. Thus,

\[ r^2 = \frac{SSR}{SST}. \]


Figure: Decomposing the deviation of an observed y-value from the mean into the
deviations explained and not explained by the regression


The Regression Identity


The total sum of squares equals the regression sum of squares plus the
error sum of squares:

SST = SSR + SSE

Remark
Because of the regression identity, we can also express the coefficient of
determination in terms of the total sum of squares and the error sum of
squares:

\[ r^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} \]


Example
For the given tables and regression equations
a) compute the three sums of squares, SST, SSR, and SSE.
b) verify the regression identity, SST = SSR + SSE.
c) compute the coefficient of determination.

x 3 4 1 2
y 4 5 0 -1

x 0 2 2 5 6
y 4 2 0 -2 1

The regression equations are ŷ = −3 + 2x and ŷ = 2.875 − 0.625x,
respectively.
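One way to work parts a) to c) is sketched below. The function name `sums_of_squares` is hypothetical; it takes the data and the given regression coefficients and returns SST, SSR, and SSE from their definitions.

```python
def sums_of_squares(xs, ys, a, b):
    """Return (SST, SSR, SSE) for the line y_hat = a + b*x."""
    y_bar = sum(ys) / len(ys)
    y_hat = [a + b * x for x in xs]  # predicted values from the line
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return sst, ssr, sse


# The two tables from the example, each paired with its given regression line.
datasets = [
    ([3, 4, 1, 2], [4, 5, 0, -1], -3.0, 2.0),
    ([0, 2, 2, 5, 6], [4, 2, 0, -2, 1], 2.875, -0.625),
]

results = []
for xs, ys, a, b in datasets:
    sst, ssr, sse = sums_of_squares(xs, ys, a, b)
    r_squared = ssr / sst
    results.append((sst, ssr, sse, r_squared))
    print(f"SST = {sst}, SSR = {ssr}, SSE = {sse}, "
          f"SSR + SSE = {ssr + sse}, r^2 = {r_squared:.3f}")
```

For the first table this gives SST = 26, SSR = 20, SSE = 6, and r² ≈ 0.769; for the second, SST = 20, SSR = 9.375, SSE = 10.625, and r² ≈ 0.469. In both cases SSR + SSE equals SST, which verifies the regression identity.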
