Chapter 2 - Descriptive Statistics
Statistics
Statistics is a tool for converting data into information.
Remark
Where then does data come from? How is it gathered?
How do we ensure it is accurate? Is the data reliable?
Is it representative of the population from which it was drawn?
Author: Dr. Kamal Mamehrashi (UKH) Probability and Statistics Semester II 2024-25 33 / 195
Basics of Statistics
Remark
A survey solicits information from people, e.g. Gallup polls, pre-election
polls, and marketing surveys. Surveys may be administered in a variety of ways:
1 Personal Interview,
2 Telephone Interview,
3 Self-Administered Questionnaire.
Sampling
Sampling is selecting a sub-set of a whole population.
Remark
Sampling is often done for reasons of cost (e.g. sampling 1,000 television
viewers out of 100 million) and practicality (e.g. it is not feasible to
crash-test every automobile).
Remark
In any case, the sampled population and the target population should be
similar to one another.
Sampling Methods
A sampling plan is a method or procedure for specifying how a
sample will be taken from a population. Three common methods are:
1 Simple Random Sampling,
2 Stratified Random Sampling,
3 Cluster Sampling.
Example
Drawing three names from a hat containing the names of all the
students in the class is an example of a simple random sample:
any group of three names is as likely to be picked as any other group of
three names.
Remark
Stratified random sampling first divides the population into strata using
classifications such as age, socioeconomic division, nationality, religion,
or educational achievement, and then draws a simple random sample from
each stratum.
Cluster Sampling
A cluster sample is a simple random sample of groups or clusters of
elements (vs. a simple random sample of individual objects).
Remark
This method is useful when it is difficult or costly to develop a complete
list of the population members or when the population elements are widely
dispersed geographically.
Remark
Cluster sampling may increase sampling error due to similarities among
cluster members.
Variable and Data
Variable
A characteristic that varies from one person or thing to another is called a
variable.
Example
Examples of variables for humans are height, weight, number of siblings,
sex, marital status, and eye color.
Kinds of Variables
Variables can be classified as quantitative or qualitative.
1 Quantitative variables are numerical and can be ordered or ranked.
2 Qualitative (categorical) variables are variables that can be placed
into distinct categories, according to some characteristic or attribute.
Example
Height, weight, and number of siblings are quantitative variables; sex,
marital status, and eye color are qualitative variables.
Remark
Quantitative variables can be classified as either discrete or continuous.
1 Discrete variables assume values that can be counted.
2 Continuous variables can assume an infinite number of values
between any two specific values. They are obtained by measuring.
They often include fractions and decimals.
Example
Examples of Quantitative Variables:
1 High school Grade Point Average (e.g. 4.0, 3.2, 2.1).
2 Number of pets owned (e.g. 1, 2, 4).
3 Bank account balance (e.g. $100, $987, -$42).
4 Number of stars in a galaxy (e.g. 100, 2301, 1 trillion).
Example
Examples of Categorical Variables:
1 Class in college (e.g. freshman, sophomore, junior, senior).
2 Party affiliation (e.g. Republican, Democrat, Independent).
3 Type of pet owned (e.g. dog, cat, rodent, fish).
Frequency Distributions
Frequency Distribution of Qualitative Data
A frequency distribution of qualitative data is a listing of the distinct
values and their frequencies.
Remark
A frequency distribution provides a table of the values of the observations
and how often they occur.
Example
Professor Weiss asked his introductory statistics students to state their
political party affiliations as Democratic (D), Republican (R), or Other
(O). The responses of the 40 students in the class are given in the following
table. Determine a frequency distribution of these data.
Relative Frequency
$$\text{Relative Frequency} = \frac{\text{Frequency}}{\text{Number of observations}}$$
Remark
Relative-frequency distributions are better than frequency distributions for
comparing two data sets. Because relative frequencies always fall between
0 and 1, they provide a standard for comparison.
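The following is a minimal sketch of how a frequency and relative-frequency distribution could be tallied for qualitative data. The list `responses` is hypothetical (it is not the class data from the example), and the helper name `frequency_table` is an illustrative choice.

```python
from collections import Counter

def frequency_table(observations):
    """Return {value: (frequency, relative frequency)} for qualitative data."""
    counts = Counter(observations)
    n = len(observations)
    return {value: (f, f / n) for value, f in counts.items()}

# Hypothetical party affiliations: D = Democratic, R = Republican, O = Other
responses = ["D", "R", "O", "R", "R", "D", "O", "R", "D", "R"]
table = frequency_table(responses)

for value, (f, rel) in sorted(table.items()):
    print(f"{value}: frequency={f}, relative frequency={rel:.2f}")
```

The relative frequencies always add up to 1, which is what makes them comparable across data sets of different sizes.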
Organizing Data
Example
For the previous example, construct a relative frequency distribution.
Pie Chart
A pie chart is a disk divided into wedge-shaped pieces proportional to the
relative frequencies of the qualitative data.
Graphs
Bar Chart
A bar chart displays the distinct values of the qualitative data on a
horizontal axis and the relative frequencies (or frequencies or percents) of
those values on a vertical axis.
Example
Construct a pie chart of the political party affiliations of the students in
Professor Weiss’s introductory statistics class.
Example
Construct a bar chart of the political party affiliations of the students in
Professor Weiss’s introductory statistics class.
Organizing Quantitative Data
Remark
To organize quantitative data, we first group the observations into classes
(also known as categories or bins) and then treat the classes as the
distinct values of qualitative data.
Single-Value Grouping
In some cases, the most appropriate way to group quantitative data is to
use classes in which each class represents a single possible value. Such
classes are called single-value classes.
This method of grouping quantitative data is called single-value
grouping.
Example
The Television Bureau of Advertising publishes information on television
ownership in Trends in Television. Table 4 gives the number of TV sets per
household for 50 randomly selected households. Use single-value grouping
to organize these data into frequency and relative-frequency distributions.
Limit Grouping
Class Limit
A second way to group quantitative data is to use class limits. With this
method, each class consists of a range of values.
This method of grouping quantitative data is called limit grouping.
Example
Table 6 displays the number of days to maturity for 40 short-term
investments. The data are from BARRON’S magazine. Use limit
grouping, with grouping by 10s, to organize these data into frequency and
relative-frequency distributions.
Solution
Because we are grouping by 10s and the shortest maturity period is 36
days, our first class is 30-39.
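As a rough illustration of limit grouping by 10s, the sketch below assigns each whole-number observation to a class such as 30-39 or 40-49 and tallies frequencies and relative frequencies. The values in `days` are hypothetical and are not the data of Table 6; `limit_grouping` is an illustrative helper name.

```python
from collections import Counter

def limit_grouping(data, width=10):
    """Group whole-number data into classes like 30-39, 40-49, ..."""
    # The lower class limit of an observation x is x rounded down to a multiple of width.
    counts = Counter((x // width) * width for x in data)
    n = len(data)
    rows = []
    for lower in range(min(counts), max(counts) + width, width):
        f = counts.get(lower, 0)
        rows.append((f"{lower}-{lower + width - 1}", f, f / n))
    return rows

# Hypothetical days-to-maturity values
days = [36, 38, 41, 47, 51, 55, 56, 62, 64, 70]
rows = limit_grouping(days)
for label, f, rel in rows:
    print(label, f, round(rel, 2))
```

Classes with no observations are still listed with frequency 0, so the table has no gaps between the first and last class.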
Cutpoint Grouping
A third way to group quantitative data is to use class cutpoints. As with
limit grouping, each class consists of a range of values.
The method of grouping quantitative data by using cutpoints is called
cutpoint grouping.
Example
The U.S. National Center for Health Statistics publishes data on weights
and heights by age and sex in the document Vital and Health Statistics.
The weights shown in Table 8, given to the nearest tenth of a pound, were
obtained from a sample of 18- to 24-year-old males. Use cutpoint
grouping to organize these data into frequency and relative-frequency
distributions. Use a class width of 20 and a first cutpoint of 120.
Solution
Tallying the data in Table 8 gives us the frequencies in the second column
of Table 9. Dividing each such frequency by the total number of
observations, 37, we get the relative frequencies.
Remark
When the class width is not given, it can be determined as follows:
1 Range = Max value − Min value; the range is denoted by R.
2 The number of classes is K = 1 + 3.3 log n, where n is the number of
observations and log is the base-10 logarithm.
3 Width = Range / Number of classes = R / K.
Remark
If the result is not a whole number, round it up to the next whole number.
Remark
Rounding up is different from rounding off. A number is rounded up if
there is any decimal remainder when dividing.
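The three steps above can be sketched as a small function. The inputs (50 observations ranging from 100 to 134) are hypothetical, and leaving K unrounded before dividing is an assumption, since the remark only says to round the final width up.

```python
import math

def class_width(n, minimum, maximum):
    """Return (range R, number of classes K, class width rounded up)."""
    data_range = maximum - minimum        # step 1: R = Max - Min
    k = 1 + 3.3 * math.log10(n)           # step 2: K = 1 + 3.3 log n
    width = math.ceil(data_range / k)     # step 3: R / K, rounded up
    return data_range, k, width

# Hypothetical data set: 50 observations between 100 and 134
r, k, width = class_width(50, 100, 134)
print(r, round(k, 2), width)
```

Note that `math.ceil` rounds up whenever there is any decimal remainder, which differs from ordinary rounding off.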
Example
These data represent the record high temperatures in degrees Fahrenheit
(°F) for each of the 50 states. Construct a grouped frequency distribution
for the data.
Histogram
A histogram displays the classes of the quantitative data on a horizontal
axis and the frequencies (relative frequencies, percents) of those classes on
a vertical axis.
Remark
The frequency (relative frequency, percent) of each class is represented by
a vertical bar whose height is equal to the frequency (relative frequency,
percent) of that class.
Remark
For limit grouping or cutpoint grouping, we use the lower class limits (or,
equivalently, lower class cutpoints) to label the bars.
Example
Construct frequency histograms and relative-frequency histograms for the
data from the earlier examples on the number of televisions per household,
days to maturity for short-term investments, and weights of 18- to
24-year-old males.
Measures of central tendency
Measures of center
Descriptive measures that indicate where the center or most typical value
of a data set lies are called measures of central tendency or, more
simply, measures of center.
Measures of center are often called averages.
Mean
The mean is the sum of the values divided by the total number of values.
The symbol $\bar{X}$ represents the sample mean:
$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$
where n represents the total number of values in the sample.
For a population, the Greek letter µ (mu) is used for the mean:
$$\mu = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$$
where N represents the total number of values in the population.
Example
The data represent the number of days off per year for a sample of
individuals selected from nine different countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
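As a quick check of the example above, the sample mean can be computed directly from its definition. The variable and function names are illustrative.

```python
def mean(values):
    """Sample mean: sum of the values divided by the number of values."""
    return sum(values) / len(values)

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
x_bar = mean(days_off)
print(round(x_bar, 1))  # 276 / 9, about 30.7
```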
Example
1 The number of rooms in the seven hotels in downtown Pittsburgh is
713, 300, 618, 595, 311, 401, and 292. Find the median.
2 The number of tornadoes that have occurred in the United States
over an 8-year period follows. Find the median.
684, 764, 656, 702, 856, 1133, 1132, 1303.
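A minimal sketch of the median for the two examples above, assuming the usual definition: sort the data, then take the middle value when the number of observations is odd, or the mean of the two middle values when it is even.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

rooms = [713, 300, 618, 595, 311, 401, 292]
tornadoes = [684, 764, 656, 702, 856, 1133, 1132, 1303]

print(median(rooms))      # 401
print(median(tornadoes))  # 810.0
```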
Example
1 Find the mode of the signing bonuses of eight NFL players for a
specific year. The bonuses in millions of dollars are
18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10.
2 Find the mode for the number of branches that six banks have.
401, 344, 209, 201, 227, 353
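The two mode examples above can be checked with the sketch below. It assumes the common textbook convention that a data set in which every value occurs exactly once has no mode; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the value(s) with the greatest frequency, or [] if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumption: all values distinct means "no mode"
    return sorted(v for v, c in counts.items() if c == top)

bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
branches = [401, 344, 209, 201, 227, 353]

print(modes(bonuses))   # [10]
print(modes(branches))  # []
```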
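For grouped data, the mean is computed as
$$\bar{X} = \frac{\sum f \cdot X_m}{n}$$
where f is the frequency and $X_m$ is the midpoint of each class, and n is the
sum of the frequencies.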
Example
Using the frequency distribution for the given Table, find the mean. The
data represent the number of miles run during one week for a sample of 20
runners.
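The table for this example is not reproduced here, so the sketch below uses a small hypothetical frequency distribution to show how the grouped-data mean is obtained from class midpoints and frequencies. The tuple layout (lower boundary, upper boundary, frequency) is an illustrative choice.

```python
def grouped_mean(classes):
    """classes: list of (lower boundary, upper boundary, frequency)."""
    n = sum(f for _, _, f in classes)
    # Each class contributes its frequency times its midpoint.
    total = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total / n

# Hypothetical classes, not the runners' table
classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_mean(classes))  # (2*5 + 5*15 + 3*25) / 10 = 16.0
```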
Remark
1 A data set that has only one value that occurs with the greatest
frequency is said to be unimodal.
2 If a data set has two values that occur with the same greatest
frequency, both values are considered to be the mode and the data
set is said to be bimodal.
3 If a data set has more than two values that occur with the same
greatest frequency, each value is used as the mode, and the data set
is said to be multimodal.
Remark
The mode for grouped data is the modal class. The modal class is the
class with the largest frequency.
Example
Find the modal class for the frequency distribution of miles that 20
runners ran in one week, using the table below.
Median Class
The median class is the first class whose cumulative frequency is greater
than or equal to half the number of observations (n/2).
Median formula
$$\text{Median} = L_M + \frac{\frac{n}{2} - F_c}{f_M} \cdot w$$
where $L_M$ is the lower boundary of the median class, $F_c$ is the cumulative
frequency of the class before the median class, $f_M$ is the frequency of the
median class, and $w$ is the class width.
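A minimal sketch of the grouped-data median: the loop locates the median class (the first class whose cumulative frequency reaches n/2) and then applies the formula. The classes are hypothetical, and the tuple layout is an illustrative choice.

```python
def grouped_median(classes):
    """classes: ordered list of (lower boundary, upper boundary, frequency)."""
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, upper, f in classes:
        if cumulative + f >= n / 2:          # this is the median class
            width = upper - lower            # w
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f

# Hypothetical classes
classes = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
print(grouped_median(classes))  # 10 + (5 - 2) / 5 * 10 = 16.0
```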
Example
Find the median for the following grouped data.
Measures of Variation
Measures of variation
To describe quantitatively the difference between two data sets that have the
same measures of center, we use a descriptive measure that indicates the
amount of variation, or spread, in a data set. Such descriptive measures
are referred to as measures of variation or measures of spread.
The Range
The range of a data set is the difference between the maximum (largest)
and minimum (smallest) observations.
Example
Determine the sample standard deviation of the heights of the starting
players on a team, given by 6′, 6′1″, 6′4″, 6′4″, 6′6″ (feet and inches).
Example
Find the sample standard deviation of the heights for the five starting
players on a team as given by 67, 72, 76, 76, 84 (Inches) by using the
computing formula.
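The computing formula itself is not shown in this section; the sketch below assumes the standard shortcut form, s² = [nΣx² − (Σx)²] / [n(n − 1)], which is algebraically equivalent to dividing the sum of squared deviations by n − 1.

```python
import math

def sample_variance(values):
    """Sample variance via the shortcut (computing) formula."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

heights = [67, 72, 76, 76, 84]  # inches
s2 = sample_variance(heights)
s = math.sqrt(s2)
print(s2, round(s, 2))  # 39.0 and about 6.24
```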
Variance
The first step in computing a sample standard deviation is to take a kind of
average of the squared deviations. We do so by dividing the sum of squared
deviations by n − 1 (1 less than the sample size). The resulting quantity is
called a sample variance and is denoted $s_x^2$ or, when no confusion can
arise, $s^2$. In symbols,
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Remark
The symbol for population variance is $\sigma^2$. The formula for the
population variance is
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$
Remark
The corresponding formula for the population standard deviation is
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
Example
A testing lab wishes to test an experimental brand (a small population) of
outdoor paint to see how long it will last before fading. The testing lab
makes 6 gallons to test. The results (in months) are 35, 45, 30, 35, 40,
25. Find the variance and standard deviation for the brand.
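The paint example treats the six results as a population, so the divisor is N rather than N − 1. A minimal sketch of that calculation, with illustrative names:

```python
import math

def population_variance(values):
    """Population variance: mean of the squared deviations from mu."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

months = [35, 45, 30, 35, 40, 25]
variance = population_variance(months)
std_dev = math.sqrt(variance)
print(round(variance, 1), round(std_dev, 1))  # about 41.7 and 6.5
```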
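For grouped data, the sample variance is computed from the class frequencies f
and class midpoints $X_m$ as
$$s^2 = \frac{n\left(\sum f \cdot X_m^2\right) - \left(\sum f \cdot X_m\right)^2}{n(n - 1)}$$
Take the square root of the variance to get the standard deviation.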
Example
Find the variance and the standard deviation for the frequency distribution
of the data in the table below. The data represent the number of miles that
20 runners ran during one week.
Remark
A statistic that allows you to compare standard deviations when the units
are different is called the coefficient of variation.
Coefficient of variation
The coefficient of variation, denoted by CVar, is the standard deviation
divided by the mean. The result is expressed as a percentage:
1 For a sample: $\text{CVar} = \frac{s}{\bar{X}} \cdot 100\%$
2 For a population: $\text{CVar} = \frac{\sigma}{\mu} \cdot 100\%$
Example
The mean of the number of sales of cars over a 3-month period is 87, and
the standard deviation is 5. The mean of the commissions is $5225, and
the standard deviation is $773. Compare the variations of the two.
Example
The mean for the number of pages of a sample of women’s fitness
magazines is 132, with a variance of 23; the mean for the number of
advertisements of a sample of women’s fitness magazines is 182, with a
variance of 62. Compare the variations.
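Both examples above can be worked with a one-line function. In the second example the variance is given, so the standard deviation is its square root. The names are illustrative.

```python
import math

def cvar(std_dev, mean):
    """Coefficient of variation as a percentage."""
    return std_dev / mean * 100

# Example 1: car sales vs. commissions
sales = cvar(5, 87)
commissions = cvar(773, 5225)

# Example 2: variances are given, so take square roots first
pages = cvar(math.sqrt(23), 132)
ads = cvar(math.sqrt(62), 182)

print(round(sales, 1), round(commissions, 1))  # about 5.7 and 14.8
print(round(pages, 1), round(ads, 1))          # about 3.6 and 4.3
```

The larger coefficient indicates the more variable quantity: commissions in the first example and advertisements in the second.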
Measures of Position
Measures of position
These measures include standard scores, percentiles, deciles, and quartiles.
They are used to locate the relative position of a data value in the data set.
For example, if a value is located at the 80th percentile, it means that
80% of the values fall below it in the distribution and 20% of the values
fall above it.
Remark
The median is the value that corresponds to the 50th percentile, since
one-half of the values fall below it and one-half of the values fall above it.
Standard Score
A z score or standard score for a value is obtained by subtracting the mean
from the value and dividing the result by the standard deviation. The
symbol for a standard score is z. The formula is
1 for a sample: $z = \frac{X - \bar{X}}{s}$
2 for a population: $z = \frac{X - \mu}{\sigma}$
Example
A student scored 65 on a calculus test that had a mean of 50 and a
standard deviation of 10; she scored 30 on a history test with a mean of 25
and a standard deviation of 5. Compare her relative positions on the two
tests.
Remark
Note that if the z score is positive, the score is above the mean. If the z
score is 0, the score is the same as the mean. And if the z score is
negative, the score is below the mean.
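For the test-score example above, the two standard scores can be compared with a short sketch (names are illustrative):

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

calculus = z_score(65, 50, 10)
history = z_score(30, 25, 5)
print(calculus, history)  # 1.5 1.0
```

Since 1.5 is greater than 1.0, her relative position is higher on the calculus test.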
Percentiles
Percentiles divide the data set into hundredths or 100 equal parts.
Remark
A data set has 99 percentiles, denoted P1 , P2 , ..., P99 . The first percentile,
P1 , is the number that divides the bottom 1% of the data from the top
99%; the second percentile, P2 , is the number that divides the bottom 2%
of the data from the top 98%; and so on.
Remark
Note that the median is also the 50th percentile.
Percentiles
The percentile corresponding to a given value X is computed by using the
following formula:
Example
A teacher gives a 20-point test to 10 students. The scores are shown here.
Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
Example
Using the given data 18, 15, 12, 6, 8, 2, 3, 5, 20, 10, find the percentile
rank for a score of 6.
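The percentile formula is not reproduced in this section. The sketch below assumes one common textbook definition, percentile = (number of values below X + 0.5) / (total number of values) × 100; if the course uses a different convention, the results will differ.

```python
def percentile_rank(data, x):
    """Percentile rank of x, assuming the (below + 0.5) / n * 100 convention."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))  # 65.0
print(percentile_rank(scores, 6))   # 35.0
```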
Example
Using the given scores 18, 15, 12, 6, 8, 2, 3, 5, 20, 10, find the value
corresponding to the 25th percentile.
Example
Using the given data 18, 15, 12, 6, 8, 2, 3, 5, 20, 10, find the value that
corresponds to the 60th percentile.
Quartile
The quartiles divide a data set into quarters (4 equal parts).
Remark
A data set has three quartiles, which we denote Q1 , Q2 , and Q3 .
The first quartile, Q1, is the number that divides the bottom 25% of the
data from the top 75% and so on.
Quartiles
Arrange the data in increasing order and determine the median.
1 The first quartile is the median of the part of the entire data set that
lies at or below the median of the entire data set.
2 The second quartile is the median of the entire data set.
3 The third quartile is the median of the part of the entire data set that
lies at or above the median of the entire data set.
Remark
Quartiles are denoted by Q1 , Q2 , Q3 and they correspond to P25 , P50 , P75 .
The median is the same as P50 or Q2 .
Example
Find Q1 , Q2 , and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
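The quartile procedure described above (median of the part at or below the overall median, and of the part at or above it) can be sketched as follows for the first example. Other textbooks and software use slightly different quartile conventions, so this is one reading of the rule rather than a universal definition.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using medians of the lower and upper parts."""
    ordered = sorted(data)
    q2 = median(ordered)
    lower = [x for x in ordered if x <= q2]  # at or below the median
    upper = [x for x in ordered if x >= q2]  # at or above the median
    return median(lower), q2, median(upper)

data = [15, 13, 6, 5, 12, 50, 22, 18]
print(quartiles(data))  # (9.0, 14.0, 20.0)
```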
Example
The A. C. Nielsen Company publishes information on the TV-viewing
habits of Americans in Nielsen Report on Television. A sample of 20
people yielded the weekly viewing times, in hours, displayed in the given
table. Determine and interpret the quartiles for these data.
Measures of Skewness
Determining Normality
A normally shaped or bell-shaped distribution is only one of many shapes
that a distribution can assume. There are several ways statisticians check
for normality. The easiest way is to draw a histogram for the data and
check its shape.
Concept of Skewness
The skewness of a distribution is defined as the lack of symmetry. In a
symmetrical distribution, the Mean, Median and Mode are equal to each
other.
In a normal distribution, the ordinate at the mean divides the distribution
into two equal parts.
Measures of Skewness
Skewness can be checked by using the Pearson coefficient of skewness
(PC), also called Pearson’s index of skewness. The formula is
$$PC = \frac{3(\bar{X} - \text{median})}{s}$$
where $\bar{X}$ is the mean and s is the standard deviation.
Remark
If the index is greater than or equal to +1 or less than or equal to −1, it
can be concluded that the data are significantly skewed.
Example
A survey of 18 high-technology firms showed the number of days’
inventory they had on hand. Determine if the data are approximately
normally distributed.
5 29 34 44 45 63 68 74 74 81 88 91 97 98 113 118 151 158
Mean = 79.50, Median = 77.50, s = 40.454
Example
The data shown consist of the number of games played each year in the
career of Baseball Hall of Famer Bill Mazeroski. Determine if the data are
approximately normally distributed.
81 148 152 135 151 152 159 142 34 162 130 162 163 143 67 112 70
Mean = 127.24, Median = 143, s = 39.866
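Using the summary statistics quoted with each data set, Pearson's index can be evaluated with the sketch below. It covers only the skewness check; a histogram would still be worth inspecting. The function names are illustrative.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson coefficient of skewness: 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev

def significantly_skewed(pc):
    """Rule of thumb from the remark: PC >= +1 or PC <= -1."""
    return abs(pc) >= 1

inventory = pearson_skewness(79.50, 77.50, 40.454)
games = pearson_skewness(127.24, 143, 39.866)

print(round(inventory, 3), significantly_skewed(inventory))  # 0.148 False
print(round(games, 3), significantly_skewed(games))          # -1.186 True
```

By this index the inventory data are not significantly skewed, while the games-played data are skewed to the left.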
Regression and Correlation
Introduction
The purpose of this section is to answer these questions statistically:
1 Are two or more variables linearly related?
2 If so, what is the strength of the relationship?
3 What type of relationship exists?
4 What kind of predictions can be made from the relationship?
Correlation
Correlation is a statistical method used to determine whether a linear
relationship between variables exists.
Remark
To answer the first two questions, statisticians use a numerical measure to
determine whether two or more variables are linearly related and to
determine the strength of the relationship between or among the variables.
This measure is called a correlation coefficient.
Remark
To answer the third question, you must ascertain what type of relationship
exists. There are two types of relationships: simple and multiple.
Remark
The correlation coefficient computed from the sample data measures the
strength and direction of a linear relationship between two quantitative
variables.
Remark
The symbol for the sample correlation coefficient is r . The symbol for the
population correlation coefficient is ρ.
Remark
The range of the correlation coefficient is from −1 to +1.
1 If there is a strong positive linear relationship between the variables,
the value of r will be close to +1.
2 If there is a strong negative linear relationship between the variables,
the value of r will be close to −1.
3 If there is no linear relationship between the variables or only a weak
relationship, the value of r will be close to 0.
Correlation Coefficient
Example
Compute the correlation coefficient for the given data.
Example
Compute the value of the correlation coefficient for the data obtained in
the study of the number of absences and the final grade of the seven
students in the statistics class.
where $s_x$ and $s_y$ denote the sample standard deviations of the x-values and
y-values, respectively.
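The formula that this "where" clause belongs to is not reproduced in this section. The sketch below assumes the standard form r = [Σ(x − x̄)(y − ȳ) / (n − 1)] / (s_x · s_y), and uses the four (x, y) pairs from the sums-of-squares exercise later in this chapter as sample input.

```python
import math

def correlation(xs, ys):
    """Sample linear correlation coefficient r (assumed standard definition)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

xs = [3, 4, 1, 2]
ys = [4, 5, 0, -1]
r = correlation(xs, ys)
print(round(r, 3))  # about 0.877
```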
Regression
Remark
If the value of the correlation coefficient is significant, the next step is to
determine the equation of the regression line, which is the data’s line of
best fit.
Remark
Determining the regression line when r is not significant and then making
predictions using the regression line are meaningless.
Determination of the Regression Line Equation
Remark
There are several methods for finding the equation of the regression line.
These formulas use the same values that are used in computing the value
of the correlation coefficient.
Formulas for the Regression Line $y' = a + bx$
$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$
$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
where a is the $y'$ intercept and b is the slope of the line.
This method for finding the line of best fit is called the least squares
method.
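The two formulas translate directly into code. As sample input, the sketch uses the five (x, y) pairs from the second table of the sums-of-squares exercise later in this chapter; the function name is illustrative.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

xs = [0, 2, 2, 5, 6]
ys = [4, 2, 0, -2, 1]
a, b = regression_line(xs, ys)
print(a, b)  # 2.875 -0.625, i.e. y' = 2.875 - 0.625x
```

The shared denominator is zero when all x-values are equal, in which case no regression line of this form exists.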
Example
Find the equation of the regression line for the data in the following table,
and graph the line on the scatter plot of the data.
Example
Find the equation of the regression line for the data in given table, and
graph the line on the scatter plot.
Sums of Squares in Regression
Remark
ŷi is the predicted value of the response variable which is obtained from
the regression line.
Coefficient of Determination
The coefficient of determination, $r^2$, is the proportion of variation in the
observed values of the response variable explained by the regression. Thus,
$$r^2 = \frac{SSR}{SST}.$$
Figure: Decomposing the deviation of an observed y-value from the mean into the
deviations explained and not explained by the regression
Remark
Because of the regression identity, SST = SSR + SSE, we can also express the
coefficient of determination in terms of the total sum of squares and the
error sum of squares:
$$r^2 = 1 - \frac{SSE}{SST}.$$
Example
For the given tables and regression equations
a) compute the three sums of squares, SST, SSR, and SSE.
b) verify the regression identity, SST = SSR + SSE.
c) compute the coefficient of determination.
x 3 4 1 2
y 4 5 0 -1
x 0 2 2 5 6
y 4 2 0 -2 1
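The regression equations for the two tables are not reproduced here, so the sketch below first fits the least-squares line and then computes the three sums of squares. It assumes the standard definitions SST = Σ(y − ȳ)², SSR = Σ(ŷ − ȳ)², and SSE = Σ(y − ŷ)².

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line fitted to the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope and intercept of the least-squares line (deviation form)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    predicted = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((p - y_bar) ** 2 for p in predicted)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return sst, ssr, sse

tables = [
    ([3, 4, 1, 2], [4, 5, 0, -1]),
    ([0, 2, 2, 5, 6], [4, 2, 0, -2, 1]),
]
results = [sums_of_squares(xs, ys) for xs, ys in tables]
for sst, ssr, sse in results:
    print(sst, ssr, sse, "r^2 =", round(ssr / sst, 3))
```

For the first table this gives SST = 26, SSR = 20, SSE = 6, and for the second SST = 20, SSR = 9.375, SSE = 10.625, so the regression identity holds in both cases.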