
Chapter 3 Numerical Descriptive Measures

Chapter 3 covers numerical descriptive measures, focusing on central tendency (mean, median, mode), variation (range, variance, standard deviation), and the shape of data distributions. It also discusses the construction and interpretation of boxplots, covariance, and the coefficient of correlation, along with ethical considerations in data reporting. Key concepts include the importance of selecting appropriate measures based on data characteristics and the implications of outliers.

Uploaded by

Khaoula hn
Copyright
© All Rights Reserved

Chapter 3

Numerical Descriptive Measures


Learning Objectives
• In this chapter, you learn:
- To describe the properties of central tendency,
variation, and shape in numerical data
- To calculate descriptive summary measures
for a population
- To construct and interpret a boxplot
- To calculate the covariance and the coefficient
of correlation
Summary Definitions
• The central tendency is the extent to which all
the data values group around a typical or
central value.
• The variation is the amount of dispersion, or
scattering, of values.
• The shape is the pattern of the distribution of
values from the lowest value to the highest
value.
Measures of Central Tendency: The
Mean
• The arithmetic mean (often just called
“mean”) is the most common measure of
central tendency
• Mean = sum of values divided by the number of values: X̄ = (X1 + X2 + … + Xn) / n, where n is the sample size
• Affected by extreme values (outliers)
Measures of Central Tendency: The
Median
• In an ordered array, the median is the
“middle” number (50% above, 50% below)

• Not affected by extreme values


Measures of Central Tendency:
Locating the Median
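• The location of the median when the values are in numerical
order (smallest to largest): median position = (n+1)/2 ranked value
• If the number of values is odd, the median is the middle
number
• If the number of values is even, the median is the average of
the two middle numbers
• Note that (n+1)/2 is the position of the median in the ranked
data, not its value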
Measures of Central Tendency: The
Mode
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
Measures of Central Tendency: Review
Example
• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data =
$300,000
• Mode: most frequent value = $100,000
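The three results above can be reproduced with Python's standard `statistics` module. This is only a sketch: the five prices are an assumption, since the slide's data did not survive extraction. They were chosen to agree with the stated sum of $3,000,000 and with the stated mean, median, and mode.

```python
import statistics

# Hypothetical prices (assumed, not from the slide); chosen to match
# the stated sum of $3,000,000 and the stated summary values.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of values / number of values
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

With these assumed values, the single $2,000,000 price pulls the mean to twice the median, which is the outlier effect described for the mean.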
Measures of Central Tendency: Which
Measure to Choose?
• The mean is generally used, unless extreme
values (outliers) exist.
• The median is often used when outliers exist,
since it is not sensitive to extreme values. For
example, median home prices may be reported
for a region.
• In some situations it makes sense to report
both the mean and the median.
Measures of Central Tendency:
Summary
Measures of Variation

• Measures of variation give information on the spread, variability, or dispersion of
the data values.
Measures of Variation: The Range
• Simplest measure of variation
• Difference between the largest and the
smallest values: Range = Xlargest – Xsmallest
Measures of Variation: Why The Range
Can Be Misleading
• Ignores the way in which data are distributed

• Sensitive to outliers
Measures of Variation: The Variance
• Average (approximately) of squared deviations
of values from the mean
Measures of Variation: The Standard
Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data
Measures of Variation: The Standard
Deviation
• Steps for Computing Standard Deviation
1. Compute the difference between each value and
the mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample
variance.
5. Take the square root of the sample variance to
get the sample standard deviation.
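The five steps above can be sketched directly in Python. The function name `sample_std` and the data values are illustrative assumptions, not taken from the slides.

```python
import math

def sample_std(values):
    """Sample standard deviation, following the five steps one by one."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # step 1: value minus mean
    squared = [d ** 2 for d in deviations]    # step 2: square each difference
    total = sum(squared)                      # step 3: add the squared differences
    variance = total / (n - 1)                # step 4: divide by n - 1
    return math.sqrt(variance)                # step 5: take the square root

# Hypothetical sample (assumed for illustration).
data = [10, 12, 14, 15, 17, 18, 18, 24]
s = sample_std(data)
print(round(s, 4))
```

For this assumed sample the mean is 16 and the squared differences sum to 130, so the sample variance is 130/7 and S is about 4.31.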
Measures of Variation: Sample Standard
Deviation: Calculation Example
Measures of Variation: Comparing
Standard Deviations
Measures of Variation: Summary
Characteristics
• The more the data are spread out, the greater
the range, variance, and standard deviation.
• The more the data are concentrated, the
smaller the range, variance, and standard
deviation.
• If the values are all the same (no variation),
all these measures will be zero.
• None of these measures are ever negative
Measures of Variation: The Coefficient
of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare the variability of two
or more sets of data measured in different
units
Measures of Variation: Comparing
Coefficients of Variation
• Stock A:
- Average price last year = $50
- Standard deviation = $5

• Stock B:
- Average price last year = $100
- Standard deviation = $5
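A short sketch of the comparison, assuming the usual definition of the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative variation as a percentage: CV = (S / mean) * 100."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(5, 50)    # Stock A
cv_b = coefficient_of_variation(5, 100)   # Stock B

print(cv_a, cv_b)
```

Both stocks have the same standard deviation of $5, but Stock A's CV is 10% against 5% for Stock B, so Stock B is less variable relative to its price.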
Locating Extreme Outliers: Z-Score
• To compute the Z-score of a data value, subtract
the mean and divide by the standard deviation.
• The Z-score is the number of standard deviations
a data value is from the mean.
• A data value is considered an extreme outlier if
its Z-score is less than -3.0 or greater than +3.0.
• The larger the absolute value of the Z-score, the
farther the data value is from the mean.
Locating Extreme Outliers: Z-Score
• Z = (X – X̄) / S
• where X represents the data value, X̄ is the
sample mean, and S is the sample standard
deviation
Locating Extreme Outliers: Z-Score
• Suppose the mean math SAT score is 490, with a
standard deviation of 100.
• Compute the Z-score for a test score of 620: Z = (620 – 490) / 100 = 1.3
• A score of 620 is 1.3 standard deviations above
the mean and would not be considered an
outlier.
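The SAT calculation can be checked with a few lines of Python. The helper names are illustrative, and the ±3.0 cutoff is the rule stated earlier for extreme outliers.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev

def is_extreme_outlier(z, cutoff=3.0):
    """True if the Z-score is below -cutoff or above +cutoff."""
    return abs(z) > cutoff

z = z_score(620, 490, 100)
print(z, is_extreme_outlier(z))
```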
Shape of a Distribution
• Describes how data are distributed
• Measures of shape
• Symmetric or skewed
General Descriptive Stats Using
Microsoft Excel
Excel output
Numerical Descriptive Measures for a
Population
• Descriptive statistics discussed previously
described a sample, not the population.
• Summary measures describing a population,
called parameters, are denoted with Greek
letters.
• Important population parameters are the
population mean, variance, and standard
deviation.
Numerical Descriptive Measures for a
Population: The mean µ
• The population mean is the sum of the values in the
population divided by the population size, N: µ = (X1 + X2 + … + XN) / N
Numerical Descriptive Measures For A
Population: The Variance σ²
• Average of squared deviations of values from
the mean: σ² = Σ (Xi – µ)² / N
Numerical Descriptive Measures For A
Population: The Standard Deviation σ
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the population variance
• Has the same units as the original data
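As a sketch of the difference between population parameters and sample statistics, the standard `statistics` module offers both divisors: `pvariance` and `pstdev` divide by N, while `variance` divides by n – 1. The data values are assumed for illustration and are treated here as a complete population.

```python
import statistics

# Hypothetical values (assumed), treated as the entire population.
population = [10, 12, 14, 15, 17, 18, 18, 24]

mu = statistics.mean(population)           # population mean
sigma_sq = statistics.pvariance(population)  # divides by N
sigma = statistics.pstdev(population)        # square root of sigma_sq

# For comparison only: the sample formula divides by n - 1 instead.
s_sq = statistics.variance(population)

print(mu, sigma_sq, round(sigma, 4), round(s_sq, 4))
```

Because the population formula divides by N rather than n – 1, it gives the smaller of the two variances for the same values.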
Sample statistics versus population
parameters
The Empirical Rule
• The empirical rule approximates the variation of data
in a bell-shaped distribution
• Approximately 68% of the data in a bell-shaped
distribution is within 1 standard deviation of the
mean, or µ ± 1σ
The Empirical Rule
• Approximately 95% of the data in a bell-shaped distribution
lies within two standard deviations of the mean, or µ ± 2σ
• Approximately 99.7% of the data in a bell-shaped distribution
lies within three standard deviations of the mean, or µ ± 3σ
Using the Empirical Rule
• Suppose that the variable Math SAT scores is
bell-shaped with a mean of 500 and a standard
deviation of 90. Then,
• Approximately 68% of all test takers scored between 410 and
590 (500 ± 90).
• Approximately 95% of all test takers scored between 320 and
680 (500 ± 180).
• Approximately 99.7% of all test takers scored between 230
and 770 (500 ± 270).
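The three intervals follow mechanically from the mean and standard deviation. A minimal sketch, with an illustrative function name, that applies only when the distribution is approximately bell-shaped:

```python
def empirical_rule_intervals(mean, std_dev):
    """Map approximate coverage (%) to the interval mean ± k * std_dev."""
    coverage_by_k = {1: 68.0, 2: 95.0, 3: 99.7}
    return {
        pct: (mean - k * std_dev, mean + k * std_dev)
        for k, pct in coverage_by_k.items()
    }

intervals = empirical_rule_intervals(500, 90)
for pct, (low, high) in intervals.items():
    print(f"about {pct}% of scores fall between {low} and {high}")
```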
Quartile Measures
• Quartiles split the ranked data into 4 segments with an equal
number of values per segment

• The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
• Q2 is the same as the median (50% of the observations are smaller
and 50% are larger)
• Only 25% of the observations are greater than the third quartile, Q3
Quartile Measures: Locating Quartiles
• Find a quartile by determining the value in the
appropriate position in the ranked data, where
• First quartile position: Q1 = (n+1)/4 ranked value
• Second quartile position: Q2 = (n+1)/2 ranked value
• Third quartile position: Q3 = 3(n+1)/4 ranked value
• where n is the number of observed values
Quartile Measures: Calculation Rules
• When calculating the ranked position use the
following rules
- If the result is a whole number then it is the
ranked position to use
- If the result is a fractional half (e.g. 2.5, 7.5, 8.5,
etc.) then average the two corresponding data
values.
- If the result is not a whole number or a fractional
half then round the result to the nearest integer
to find the ranked position.
Quartile Measures: Locating Quartiles
Quartile Measures Calculating The
Quartiles: Example

• (n = 9)
• Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 =
(12+13)/2 = 12.5
• Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 =
median = 16
• Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 =
(18+21)/2 = 19.5
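The position formulas and the three calculation rules can be sketched as one small function. The nine values below are an assumption: the slide's data did not survive extraction, so they were chosen only to be consistent with the results stated above (12 and 13 around position 2.5, 16 in position 5, 18 and 21 around position 7.5).

```python
def quartile(values, k):
    """k-th quartile (k = 1, 2, 3) using the k(n+1)/4 position rules.

    Assumes at least three values so every position falls inside the data.
    """
    ranked = sorted(values)
    n = len(ranked)
    pos = k * (n + 1) / 4          # ranked position, counted from 1
    whole = int(pos)
    frac = pos - whole
    if frac == 0:                  # whole number: use that ranked value
        return ranked[whole - 1]
    if frac == 0.5:                # fractional half: average the two neighbours
        return (ranked[whole - 1] + ranked[whole]) / 2
    return ranked[round(pos) - 1]  # otherwise round to the nearest position

# Hypothetical ranked data (assumed; chosen to match the stated results).
data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
print(q1, q2, q3)
```

Many software packages interpolate between neighbouring values instead of rounding, so their quartiles can differ slightly from the ones these rules produce.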
Quartile Measures: The Interquartile
Range (IQR)
• The IQR is Q3 – Q1 and measures the spread
in the middle 50% of the data
• The IQR is also called the midspread because
it covers the middle 50% of the data
• The IQR is a measure of variability that is not
influenced by outliers or extreme values
• Measures like Q1, Q3, and IQR that are not
influenced by outliers are called resistant
measures
Calculating The Interquartile Range
The Five Number Summary
• The five numbers that help describe the
center, spread and shape of data are:
- Xsmallest
- First Quartile (Q1)
- Median (Q2)
- Third Quartile (Q3)
- Xlargest
Relationships among the five-number
summary and distribution shape
Five Number Summary and The
Boxplot
• The Boxplot: A Graphical display of the data
based on the five-number summary:
Five Number Summary: Shape of
Boxplots
• If data are symmetric around the median then the box and
central line are centered between the endpoints

• A Boxplot can be shown in either a vertical or horizontal
orientation
Distribution Shape and The Boxplot
Boxplot Example
• Below is a Boxplot for the following data:

• The data are right skewed, as the plot depicts


Boxplot example showing an outlier
• The boxplot below of the same data shows the
outlier value of 27 plotted separately
• A value is considered an outlier if it is more than 1.5
times the interquartile range below Q1 or above Q3
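The 1.5 × IQR rule can be sketched as follows. The data set is an assumption (the slide's values were lost in extraction); it was chosen to be right skewed and to contain the value 27 mentioned above. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which interpolates at the k(n+1)/4 positions and so can differ from the rounding rules given earlier when a position is not a whole number.

```python
import statistics

# Hypothetical right-skewed data (assumed), including the value 27.
data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 9, 27]

q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1                       # spread of the middle 50% of the data

lower_fence = q1 - 1.5 * iqr        # values below this are outliers
upper_fence = q3 + 1.5 * iqr        # values above this are outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With these assumed values the fences are -2.5 and 9.5, so 27 is flagged while 9 is not.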
The Covariance
• The covariance measures the strength of the
linear relationship between two numerical
variables (X & Y)
• The sample covariance: cov(X, Y) = Σ (Xi – X̄)(Yi – Ȳ) / (n – 1)
• Only concerned with the strength of the
relationship
• No causal effect is implied
Interpreting Covariance
• Covariance between two variables:
- cov(X,Y) > 0: X and Y tend to move in the same
direction
- cov(X,Y) < 0: X and Y tend to move in opposite
directions
- cov(X,Y) = 0: X and Y have no linear relationship
(independent variables have zero covariance, but zero
covariance alone does not prove independence)
• The covariance has a major flaw:
- It is not possible to determine the relative
strength of the relationship from the size of the
covariance
Coefficient of Correlation
• Measures the relative strength of the linear
relationship between two numerical variables
• Sample coefficient of correlation: r = cov(X, Y) / (SX · SY), where SX and SY
are the sample standard deviations of X and Y
Features of the Coefficient of
Correlation
• The population coefficient of correlation is referred to as
ρ.
• The sample coefficient of correlation is referred to as r.
• Both ρ and r have the following features:
- Unit free
- Range between –1 and 1
- The closer to –1, the stronger the negative linear
relationship
- The closer to 1, the stronger the positive linear
relationship
- The closer to 0, the weaker the linear relationship
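A minimal sketch of the sample covariance and the sample coefficient of correlation, using the standard n – 1 formulas. The function names and the data are illustrative assumptions.

```python
import math

def sample_covariance(xs, ys):
    """Sum of (x - mean_x)(y - mean_y), divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y); unit free, between -1 and 1."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical data (assumed): y rises exactly linearly with x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
y_rescaled = [100 * v for v in y]   # same relationship, different units

print(sample_covariance(x, y), sample_correlation(x, y))
print(sample_covariance(x, y_rescaled), sample_correlation(x, y_rescaled))
```

Rescaling y multiplies the covariance by 100 while r stays essentially at 1, which illustrates why the size of the covariance says nothing about relative strength and why r is preferred for that purpose.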
Scatter Plots of Sample Data with
Various Coefficients of Correlation
The Coefficient of Correlation Using
Microsoft Excel
Interpreting the Coefficient of
Correlation Using Microsoft Excel
Pitfalls in Numerical Descriptive
Measures
• Data analysis is objective
- Should report the summary measures that best
describe and communicate the important
aspects of the data set
• Data interpretation is subjective
- Should be done in fair, neutral and clear
manner
Ethical Considerations
• Numerical descriptive measures:
- Should document both good and bad results
- Should be presented in a fair, objective and
neutral manner
- Should not use inappropriate summary
measures to distort facts
Chapter Summary
• Described measures of central tendency
- Mean, median, mode
• Described measures of variation
- Range, interquartile range, variance and standard deviation,
coefficient of variation, Z-scores
• Illustrated shape of distribution
- Symmetric, skewed
• Described data using the 5-number summary
- Boxplots
• Discussed covariance and correlation coefficient
• Addressed pitfalls in numerical descriptive measures and ethical
considerations
