Class Test 1 Revision Note
Chapter 1 Descriptive Statistics
Types of Variables
- Categorical variables
- Ordinal variables
- Quantitative variables
Charts
- Histogram
- Histogram vs. Bar Chart
One implication of this distinction: it is always appropriate to talk about the
skewness of a histogram; that is, the tendency of observations to fall more
on the low end or the high end of the x-axis
With bar charts, however, the x-axis does not have a low end or a high end,
because the labels on the x-axis are categorical, not quantitative. As a
result, it is less appropriate to comment on the skewness of a bar chart.
- Pros and Cons of the Four Visual Displays for Quantitative Variables
Box plots, stem-and-leaf plots, dot plots, and histograms organize
quantitative data in ways that let us begin to find the information in a data
set.
As to the question of which type of display is the best, there is no unique
answer.
The answer depends on what feature of the data may be of interest and, to a
certain degree, on the sample size.
Box plot
Strength:
Gives a direct look at central location and spread, since it displays
the five-number summary.
Can identify outliers.
Side-by-side box plots are an excellent tool for comparing two or
more groups.
Weakness:
Not entirely useful for judging shape.
Cannot distinguish between bell-shaped and bimodal distributions.
Stem-and-Leaf plot
Strength:
Excellent for sorting data.
With a sufficient sample size, it can be used to judge shape.
Weakness:
With a large sample size, a stem-and-leaf plot may be too
cluttered because the display shows all individual data values.
More restricted in the choices for “intervals” when compared to
histograms.
Dot plot
Strength:
Can present all individual data values.
Easy to create.
Weakness:
With a large sample size, a dot plot may be too cluttered.
Histogram
Strength:
Excellent for judging the shape of a data set with moderate or
large sample sizes.
Flexible in choosing the number as well as the width of the intervals
for the display.
Between 6 and 15 intervals usually gives a good picture of the
shape.
Weakness:
With a small sample size, a histogram may not “fill in”
sufficiently well to show the shape of the data.
With either too few intervals or too many, we may not see the true
shape of the data.
- Misleading Graphs
Statistics can be misleading if not presented appropriately.
The same data can appear very different when graphed.
E.g. break in the vertical axis.
Frequency on the vertical axis should be continuous from zero.
When we put a break in the axis, we lose proportional relationship
among class interval frequencies.
- Shape of Frequency Distributions
J-shaped
Positively skewed
Negatively skewed
Rectangular
Bimodal
Bell-shaped
Numerical Summaries
- Measures of Central Location: Mean, Mode, Median
Mean as the Balance Point of a Distribution:
Unlike the median and the mode, the mean is responsive to the exact
position of each score in the distribution. It is the balance point of a
distribution.
Median in the Case with Outliers:
The median is less sensitive than the mean to the presence of a
few extreme scores (outliers)
Is it permissible to calculate the mean for tests in the behavioral
sciences? First of all, we have to ask ourselves a question: “Is the
measurement on this scale interval or ordinal?” Sometimes it may be
neither interval nor ordinal.
Measures of Variability: Standard Deviation, Range, Interquartile Range
The standard deviation, like the mean, is responsive to the exact
position of every score in the distribution, because it is calculated
from deviations about the mean. If a score is shifted to a position more
deviant from the mean, the standard deviation will increase; if the shift
is to a position closer to the mean, the standard deviation decreases.
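A minimal sketch (plain Python; the score values are invented for illustration) of both points above: the mean and standard deviation respond to every score, while the median shrugs off an outlier.

```python
# Sketch: sensitivity of mean, median, and s.d. (invented data).
import statistics

scores = [4, 5, 5, 6, 6, 7, 8]
with_outlier = scores + [40]              # one extreme score added

print(statistics.mean(scores), statistics.median(scores))              # 5.857..., 6
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 10.125, 6: mean shifts, median barely moves

# Shifting one score further from the mean raises the s.d.;
# shifting it closer to the mean lowers it.
base    = [2, 4, 6, 8, 10]                # s.d. about 2.83
further = [2, 4, 6, 8, 14]                # 10 -> 14, more deviant: s.d. about 4.12
closer  = [2, 4, 6, 8, 7]                 # 10 -> 7, closer to the mean: s.d. about 2.15
for d in (base, further, closer):
    print(round(statistics.pstdev(d), 3))
```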
Measures of Shape: Skewness, Kurtosis
Skewness is a measure of a data set’s deviation from symmetry
\[ \text{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad m_2 = \frac{\sum (x - \bar{x})^2}{n}, \qquad m_3 = \frac{\sum (x - \bar{x})^3}{n} \]
The value of this measure generally lies between -3 and +3. The
closer the value is to -3, the more the distribution is skewed to the
left, and vice versa. A value close to 0 indicates a symmetric
distribution. A normal distribution is symmetric and has a skewness of 0.
There are other measures of skewness:
1. Pearson mode skewness or first skewness coefficient
\[ \text{skewness} = \frac{\text{mean} - \text{mode}}{\text{s.d.}} \]
If mean < (>) mode, the distribution is negatively (positively) skewed.
2. Pearson median skewness or second skewness coefficient
\[ \text{skewness} = \frac{3(\text{mean} - \text{median})}{\text{s.d.}} \]
If mean < (>) median, the distribution is negatively (positively) skewed.
3. Bowley skewness or quartile skewness coefficient
\[ \text{skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 - 2Q_2 + Q_1}{Q_3 - Q_1} \]
Distribution        | Coefficient of Skewness | Measures of Central Location
Symmetrical         | 0                       | Mean = Median = Mode
Skewed to the right | > 0                     | Mean > Median > Mode
Skewed to the left  | < 0                     | Mean < Median < Mode
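The sketch below (Python, invented data) computes the moment-based, Pearson median, and Bowley skewness coefficients for one right-skewed sample; all three come out positive, consistent with the table above. Note that statistics.quantiles requires Python 3.8+ and that quartile conventions differ slightly between packages.

```python
# Sketch (invented data) of three skewness measures defined above.
import statistics

x = [1, 2, 2, 3, 3, 4, 5, 7, 12]            # a right-skewed sample
n = len(x)
mean = sum(x) / n
m2 = sum((v - mean) ** 2 for v in x) / n    # second central moment
m3 = sum((v - mean) ** 3 for v in x) / n    # third central moment
print(m3 / m2 ** 1.5)                       # moment skewness, > 0 here

# Pearson median (second) skewness coefficient
s = statistics.pstdev(x)                    # population s.d. = sqrt(m2)
print(3 * (mean - statistics.median(x)) / s)

# Bowley (quartile) skewness coefficient
q1, q2, q3 = statistics.quantiles(x, n=4)   # Python 3.8+
print((q3 - 2 * q2 + q1) / (q3 - q1))       # also > 0 for this sample
```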
Kurtosis is a measure of peakedness of a distribution.
\[ \text{kurtosis} = \frac{m_4}{m_2^2}, \qquad m_4 = \frac{\sum (x - \bar{x})^4}{n} \]
Excess kurtosis is defined as the kurtosis minus 3, i.e.
excess kurtosis = kurtosis – 3
A normal distribution has a kurtosis of 3 and hence an excess kurtosis of 0.
Generally, a distribution with greater excess kurtosis has a higher
peak and thicker tails than another distribution of the same kind.
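A quick check (Python, simulated data) of the fact above: for normal data the kurtosis m4/m2^2 is about 3, so the excess kurtosis is about 0.

```python
# Sketch: moment kurtosis of a large simulated normal sample.
import random

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100_000)]
n = len(x)
mean = sum(x) / n
m2 = sum((v - mean) ** 2 for v in x) / n
m4 = sum((v - mean) ** 4 for v in x) / n
kurt = m4 / m2 ** 2
print(kurt, kurt - 3)    # roughly 3 and 0 for normal data
```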
An outlier is a data point that is not consistent with the bulk of the data.
If an observation falls outside the range [Q1 - 1.5 IQR, Q3 + 1.5 IQR],
it is regarded as an outlier.
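A small sketch of the 1.5 x IQR rule above (Python 3.8+, invented data); how the quartiles themselves are computed varies slightly between packages.

```python
# Sketch: flagging outliers with the 1.5*IQR rule.
import statistics

data = [12, 13, 14, 14, 15, 16, 16, 17, 18, 35]
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low or v > high]
print(outliers)    # 35 falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
```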
Possible reasons for outliers and what to do about them:
The outlier is a legitimate data value and represents natural
variability for the group and variable(s) measured. Such values
may not be discarded; they provide important information
about location and spread.
A mistake was made while taking the measurement or entering it into
the computer. If verified, the value should be discarded or corrected.
The individual in question belongs to a group other than the
bulk of the individuals measured. Values may be discarded if a
summary is desired and reported for the majority group only.
Coefficient of Variation
The standard deviation measures the variation in a set of data. For
decision makers, the standard deviation indicates how spread out a
distribution is.
For distributions having the same mean, the distribution with the
largest standard deviation has the greatest relative spread.
When two or more distributions have different means, the relative
spread cannot be determined by merely comparing the standard
deviations.
The coefficient of variation (CV) is used to measure the relative
variation for distributions with different means.
\[ \text{Sample coefficient of variation} = \frac{s}{\bar{x}} \times 100\% \]
When the coefficients of variation for two or more distributions are
compared, the distribution with the largest CV is said to have the
greatest relative spread.
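The sketch below (invented price data) shows why the CV matters: sample B has the larger standard deviation, but sample A has the greater relative spread.

```python
# Sketch: comparing relative spread via the coefficient of variation.
import statistics

def cv(sample):
    """Sample coefficient of variation, (s / x-bar) * 100%."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100

prices_a = [10, 11, 12, 13, 14]        # small mean, s about 1.58
prices_b = [100, 102, 104, 106, 108]   # larger mean AND larger s (about 3.16)
print(cv(prices_a), cv(prices_b))      # about 13.2% vs 3.0%: A has the greater relative spread
```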
Normal Distribution
Percentile
- k-th percentile is a number that has k% of the data values at or below it and
(100-k)% of the data values at or above it. Lower quartile, median, upper quartile
are special cases of percentile. Lower quartile = 25th percentile, median = 50th
percentile, upper quartile = 75th percentile.
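A sketch of this definition in Python (invented data). Statistical packages use several slightly different interpolation rules for percentiles; this is the simplest "at least k% at or below" version.

```python
# Sketch: k-th percentile as the smallest value with at least k% of
# the data at or below it.
import math

def percentile(data, k):
    """Smallest data value with at least k% of the data at or below it."""
    xs = sorted(data)
    idx = math.ceil(k / 100 * len(xs)) - 1
    return xs[max(idx, 0)]

data = [2, 4, 4, 5, 6, 7, 8, 9, 9, 10]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```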
Value-at-Risk (VaR)
- One important application of percentile in risk management is VaR.
- VaR is defined as the worst loss over a target horizon that will not be exceeded
at a given confidence level. For instance, the VaR at the 95% confidence level
gives a loss value that will not be exceeded with probability of at least 95%.
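A sketch of historical VaR using the percentile idea above; the daily P&L figures are invented, and losses are recorded as positive numbers.

```python
# Sketch: historical 95% VaR as the 95th percentile of losses.
import math

daily_pnl = [1.2, -0.4, 0.8, -2.1, 0.3, -1.5, 2.0, -0.9, 0.5, -3.2,
             1.1, -0.2, 0.7, -1.8, 0.9, -0.6, 1.4, -2.7, 0.1, -1.0]
losses = sorted(-p for p in daily_pnl)             # losses as positive values
var_95 = losses[math.ceil(0.95 * len(losses)) - 1]
print(var_95)   # 2.7: losses stay at or below this on 95% of days
```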
Z-score
- z = (x - x̄)/s measures how many standard deviations a score lies from the mean
- x̄ ± 1s contains about 68% of the scores
- x̄ ± 2s contains about 95% of the scores
- x̄ ± 3s contains about 99.7% of the scores
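A quick empirical check of these three figures on simulated normal scores (Python):

```python
# Sketch: the empirical (68-95-99.7) rule on simulated normal data.
import random
import statistics

random.seed(2)
x = [random.gauss(50, 10) for _ in range(100_000)]
mean, sd = statistics.fmean(x), statistics.pstdev(x)
for k in (1, 2, 3):
    share = sum(abs(v - mean) <= k * sd for v in x) / len(x)
    print(k, round(share, 4))    # about 0.68, 0.95, 0.997
```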
Chapter 2 Correlation and Regression
Scatterplot
- Positive/negative association, linear relationship/nonlinear (curvilinear)
relationship
Correlation Coefficient r
- Strength
It is determined by the closeness of the points to a straight line.
- Direction
It is determined by whether one variable generally increases or generally
decreases when the other variable increases
- Linear
When the pattern is nonlinear, the correlation coefficient is not an
appropriate way to measure the strength of the relationship.
- The measure is also called Pearson product-moment correlation coefficient.
\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} \]
where
\[ S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2 = \sum x^2 - \frac{(\sum x)^2}{n} \]
\[ S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - n\bar{y}^2 = \sum y^2 - \frac{(\sum y)^2}{n} \]
\[ S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - n\bar{x}\bar{y} = \sum xy - \frac{(\sum x)(\sum y)}{n} \]
- r always lies between -1 and +1.
- Magnitude indicates the strength of the linear relationship.
- Sign indicates the direction of the association.
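A minimal sketch (invented data) computing r from the shortcut sums Sxx, Syy, and Sxy above:

```python
# Sketch: Pearson r via the shortcut formulas.
import math

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]
n = len(x)
sxx = sum(v * v for v in x) - sum(x) ** 2 / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n
sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
r = sxy / math.sqrt(sxx * syy)
print(r)   # about 0.90: strong positive linear association
```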
Rank Correlation Coefficient rs
- Since rankings are qualitative rather than quantitative data, even though they
are numerical, the sample correlation coefficient r cannot be used.
- Instead, we use the nonparametric counterpart of r, the rank correlation
coefficient rs, to perform correlation analysis on a form of qualitative data:
bivariate rankings.
- If we wish to assess the strength of the relation between the two sets of ranks, we
can compute the sample rank correlation coefficient rs.
- The Spearman correlation coefficient rs is defined as the Pearson correlation
coefficient between the ranks of the data.
\[ r_s = \frac{\sum (R_x - \bar{R}_x)(R_y - \bar{R}_y)}{\sqrt{\sum (R_x - \bar{R}_x)^2 \sum (R_y - \bar{R}_y)^2}} \]
where Rx and Ry are the ranks of the two variables of interest.
If there are no tied ranks in the data, then the following shortcut formula also
works (see the sketch after the notes below):
\[ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
where di = Rank(xi) - Rank(yi) = Rxi - Ryi (the difference between a pair of
ranks) and n = the number of pairs of ranks.
- When to use rs instead of r?
Situation 1: Data are given in the form of ranks.
Situation 2: Data are given in the form of scores, but what matters is that
one score is higher than another; how much higher is not really
important. Then translating scores to ranks is suitable.
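The sketch below applies the shortcut formula to two invented sets of rankings with no ties:

```python
# Sketch: Spearman rank correlation via the shortcut formula.
x_ranks = [1, 2, 3, 4, 5, 6, 7]       # e.g. judge A's rankings
y_ranks = [2, 1, 4, 3, 5, 7, 6]       # judge B's rankings of the same items
n = len(x_ranks)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(x_ranks, y_ranks))
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(rs)   # about 0.89: the two sets of ranks agree closely
```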
- Cautions in the use of correlation
Bear in mind the following five cautions in the use of correlation.
Correlation does not prove causation
If variation in X causes variation in Y, that causal connection will
appear in some degree of correlation between X and Y.
However, we cannot reason backward from a correlation to a
causal relationship.
We must always remember “correlation does not imply
causation”.
There are at least four possible explanations for an observed correlation.
Denote X as the explanatory variable, Y as the response variable.
(a) Causation – X is a cause of Y.
(b) Reverse of causation – Y is a cause of X.
(c) A third variable influences both X and Y.
(d) A complex of interrelated variables influences X and Y.
Note: Two or more of these situations may occur simultaneously.
For example, X and Y may influence each other. (a+b)
r and rs capture only linear relationships
When data for one or both variables are not linear, other measures
of association are better.
effect of variability
The correlation coefficient is sensitive to the variability
characterizing the measurements of the two variables.
For example, suppose we examine the relationship between total SAT
scores and, say, first-year grades at two universities: one with only
minimal entrance requirements and a more selective private university
that admits only students with SAT scores of 1200 or higher. The
correlation will be weaker in the latter case.
Therefore, restricting the range, whether in X, in Y, or in both,
results in a lower correlation coefficient (in magnitude).
effect of discontinuity
The correlation tends to be an overestimate in discontinuous
distributions.
Usually, discontinuity, whether in X, in Y, or in both, results in a
higher correlation coefficient.
correlation for combined data
the correlation coefficient may increase or decrease, depending on how
the groups are combined (see the Simpson's paradox sketch below).
- Examples of deceiving relationship
Outliers can substantially inflate or deflate correlations.
An outlier that is consistent with the trend of the rest of the data will
inflate the correlation.
An outlier that is not consistent with the rest of the data can
substantially decrease the correlation.
Groups combined inappropriately may mask relationships.
The missing link is a third variable.
Simpson’s Paradox
Two or more groups
Variables for each group may be strongly correlated
When the groups are combined into one, there is very little
correlation between the two variables.
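The sketch below constructs an invented two-group data set in which each group has r = +1, yet the pooled data have r = 0, illustrating how combining groups can mask a relationship:

```python
# Sketch: Simpson's paradox, where pooling two strongly correlated
# groups erases the correlation entirely (invented data).
import math

def pearson_r(x, y):
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    return sxy / math.sqrt(sxx * syy)

g1_x, g1_y = [1, 2, 3, 4], [1, 2, 3, 4]     # group 1: r = +1
g2_x, g2_y = [6, 7, 8, 9], [0, 1, 2, 3]     # group 2: r = +1, shifted
print(pearson_r(g1_x, g1_y))                # 1.0
print(pearson_r(g2_x, g2_y))                # 1.0
print(pearson_r(g1_x + g2_x, g1_y + g2_y))  # 0.0: the correlation vanishes
```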
Simple Linear Regression