DATA MANAGEMENT
Day 4 – PM
December 13, 2017
PREPARED BY JOSEPH G. TABAN
CONTENTS
Review: Descriptive Statistics
Normal Distribution
Hypothesis Testing
Regression and Correlation
Use of available technology related to
statistics
REVIEW ON BASIC CONCEPTS
IN STATISTICS
Preliminaries
Population and Sample
Types of Variable
Scales of Measurement
POPULATION AND SAMPLE
VARIABLE
This refers to some specific characteristic of a
subject that assumes one or more different
values.
Types of variable:
Quantitative Variable
Qualitative Variable
SCALES OF MEASUREMENT
Nominal Scales
Ordinal Scales
Interval Scales
Ratio Scales
EXERCISE:
Identify the type and scale of
measurement of the given
variables:
REPRESENTATION OF DATA
What is the appropriate graphical
representation of a given data?
Types of Graphs
Histogram
Bar Graphs
Line graphs
Pie charts
Area graphs
X-Y plots
REPRESENTATION OF DATA
Use of Excel for Graphical
Representation of Data
What can you say about
the following graphs?
WHAT’S WRONG?
MEASURE OF CENTRAL TENDENCY
The purpose of central tendency is to
determine the single value that best represents the
entire distribution of scores. The three
standard measures of central tendency are the
Mean
Median
Mode
Show how to compute for each measure.
MEASURE OF CENTRAL TENDENCY
Which measure of central tendency is
best used in answering each question?
What would a student sharing a house with
friends expect to pay in rent each month?
What are the eating preferences (based on a
list of foods) of freshman students?
At what age do people usually get married
for the first time?
MEASURE OF CENTRAL TENDENCY
Choosing a measure of central tendency depends on:
the level of measurement of the variable
concerned (nominal, ordinal, interval or ratio);
what is to be done with the figure obtained.
The mean is suitable only for ratio and interval data.
For ordinal variables, where the data can be ranked but
one cannot validly talk of 'equal differences' between
values, the median, which is based on ranking, may be
used. Where it is not even possible to rank the data, as in
the case of a nominal variable, the mode may be the
only measure available.
EXERCISE
Imagine that you received the following data on the
vocabulary test of a group of students:
22 23 23 23
23 23 24 25
29 30 30 30
30 30 31 32
33 33 34 35
36 36 37 37
• Compute the mean, median, and mode of the data and
decide which of the three best represents the central
tendency of the data.
• Use Excel to verify the computed values.
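The computed values can also be cross-checked outside Excel. The following is a minimal Python sketch, assuming the standard library `statistics` module; the variable names are illustrative and not part of the original exercise.

```python
from statistics import mean, median, multimode

# Vocabulary test scores from the exercise above
scores = [22, 23, 23, 23, 23, 23, 24, 25,
          29, 30, 30, 30, 30, 30, 31, 32,
          33, 33, 34, 35, 36, 36, 37, 37]

score_mean = mean(scores)        # sum of the scores divided by the number of scores
score_median = median(scores)    # middle value (average of the two middle values when n is even)
score_modes = multimode(scores)  # every value tied for the highest frequency

print(f"mean    = {score_mean:.2f}")
print(f"median  = {score_median}")
print(f"mode(s) = {score_modes}")
```

`multimode` is used instead of `mode` because a data set can have more than one most frequent value, which matters when deciding which measure best represents the data.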
MEASURES OF LOCATION
These are values which divide the
distribution into a given number of equal
parts.
Types:
Quartiles
Deciles
Percentiles
A percentile is a point in the
distribution below which a given
percent of cases lie.
If P70 of a 100-item test
is 80, what does it mean?
EXERCISE
Locate Q1, D3, and P40 from the following data:
18, 10, 12, 27, 25, 35, 12, 26,
24, 18, 15, 30, 34, 26, 21, 14
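Textbooks and software use slightly different interpolation rules for quartiles, deciles, and percentiles, so hand answers may differ a little from software output. The sketch below assumes Python's `statistics.quantiles` with its default "exclusive" method, which places the p-th cut point at position p(n + 1) in the sorted data; this is an assumption about the method, not necessarily the one used in class.

```python
from statistics import quantiles

data = [18, 10, 12, 27, 25, 35, 12, 26,
        24, 18, 15, 30, 34, 26, 21, 14]

# quantiles(data, n=k) sorts the data and returns the k - 1 cut points
# that divide the distribution into k equal parts.
q1 = quantiles(data, n=4)[0]      # first quartile
d3 = quantiles(data, n=10)[2]     # third decile
p40 = quantiles(data, n=100)[39]  # 40th percentile

print(f"Q1 = {q1}, D3 = {d3}, P40 = {p40}")
```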
MIND WORK
Imagine that you conducted an in-service course for
teachers. To receive university credit for the course, the
teachers must take examinations--in this case, a
midterm and a final. The midterm was a multiple-choice
test of 50 items and the final exam presented teachers
with 10 problem situations to solve. Sue, like most
teachers, was a whiz at taking multiple-choice exams,
but bombed out on the problem-solving final exam. She
received a 48 on the midterm and a 1 on the final.
Becky didn't do so well on the midterm. She kept
thinking of exceptions to answers on the multiple-
choice exam. Her score was 39. However, she really did
shine on the final, scoring a 10.
Since you expect students to do well on both
exams, you reason that Becky has done a
creditable job on each and Sue has not. Becky
gets the higher grade. Yet, if you add the points
together, Sue has 49 and Becky has 49. The
question is whether the points are really equal.
Should Sue also do this bit of arithmetic, she
might come to your office to complain of the
injustice of it all. How will you show her that the
value of each point on the two tests is different?
MEASURE OF VARIABILITY
This provides a quantitative
measure of the degree to which scores in
a distribution are spread out or clustered
together.
Range
Variance
Standard Deviation
MEASURE OF VARIABILITY
• Consider the following data as the scores of students in 8
quizzes in Math.
Group A Group B
11 20
8 10
10 1
9 8
8 0
12 30
10 13
11 6
• Compute the Range, Variance and Standard Deviation.
• What can you say about the two groups of students?
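A minimal Python sketch for checking the three measures on both groups. It assumes the sample formulas (dividing by n - 1); the slides do not say whether the sample or population version is intended, and `describe_spread` is an illustrative helper name.

```python
from statistics import stdev, variance

group_a = [11, 8, 10, 9, 8, 12, 10, 11]
group_b = [20, 10, 1, 8, 0, 30, 13, 6]

def describe_spread(scores):
    """Return (range, sample variance, sample standard deviation)."""
    return max(scores) - min(scores), variance(scores), stdev(scores)

range_a, var_a, sd_a = describe_spread(group_a)
range_b, var_b, sd_b = describe_spread(group_b)

print(f"Group A: range={range_a}, variance={var_a:.3f}, sd={sd_a:.3f}")
print(f"Group B: range={range_b}, variance={var_b:.3f}, sd={sd_b:.3f}")
```

Replacing `variance` and `stdev` with `pvariance` and `pstdev` gives the population versions (dividing by n).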
CHARACTERISTICS OF THE MEASURES
OF VARIABILITY
The larger the standard deviation,
the more widely the scores are spread
around the measure of central tendency.
Adding a constant to each score does not
change the standard deviation.
Multiplying each score by a constant
causes the standard deviation to be
multiplied by the absolute value of that constant.
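The last two characteristics can be illustrated numerically. This is a small sketch on made-up scores (not data from the slides), using the population standard deviation from Python's `statistics` module.

```python
from statistics import pstdev

# Illustrative scores only; their population standard deviation is 2
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sd_original = pstdev(scores)
sd_shifted = pstdev([x + 5 for x in scores])   # add a constant to each score
sd_scaled = pstdev([3 * x for x in scores])    # multiply each score by 3
sd_negated = pstdev([-3 * x for x in scores])  # multiply each score by -3

print(sd_original, sd_shifted, sd_scaled, sd_negated)
```

The shifted list has the same spread as the original, while both scaled lists have three times the spread, since a standard deviation is never negative.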
THE STANDARD NORMAL DISTRIBUTION
1. Properties of the normal curve
2. Mean and standard deviation of the
normal curve
3. Calculating z-scores
4. Area under the curve
5. Probability
NORMAL CURVE
The curve is symmetric about the mean.
Mean = Median = Mode
The tails are asymptotic to
the horizontal axis.
Each half represents 50% of
the total area.
The total area under the normal
curve is 1 or 100%.
Areas can be thought of as
probabilities. Areas can be written as percents. Areas cannot
be negative.
The area under the normal curve may be subdivided into standard
deviations, at least 3 units to the left and 3 units to the right of
the vertical line at the mean.
THE HISTOGRAM AND THE
NORMAL CURVE
NORMAL DISTRIBUTION
The 68-95-99.7 Rule
In the normal distribution with mean µ and standard deviation σ:
68% of the observations fall within σ of the mean µ.
95% of the observations fall within 2σ of the mean µ.
99.7% of the observations fall within 3σ of the mean µ.
[Figure: Normal Density Plot, f(x) against x from -20 to 20, with σ, 2σ and 3σ marked on each side of the mean]
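The three percentages in the rule are rounded areas under the normal curve. A minimal sketch that recovers them, assuming Python's `statistics.NormalDist`; `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.4f}")
```

The same proportions hold for any mean and standard deviation, because the distances are measured in standard deviation units.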
Z-SCORES
Are a way of determining the position of
a single score under the normal curve.
Measured in standard deviations relative
to the mean of the curve.
The Z-score can be used to determine an
area under the curve known as a
probability.
Formula: z = (x - x̄) / s
USING THE NORMAL CURVE: Z SCORES
Steps:
To find areas, first compute Z scores.
Substitute score of interest for x.
Use sample mean for x̄ and sample
standard deviation for s.
The formula changes a “raw” score (x) to
a standardized score (z).
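The steps above can be sketched as a small Python function. The function name and the numbers in the usage lines are hypothetical and serve only to illustrate the substitution.

```python
def z_score(x, sample_mean, sample_sd):
    """Convert a raw score x to a standardized score z = (x - mean) / sd."""
    if sample_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - sample_mean) / sample_sd

# Hypothetical example: a test with mean 70 and standard deviation 10
z_high = z_score(85, 70, 10)  # score above the mean gives a positive z
z_low = z_score(60, 70, 10)   # score below the mean gives a negative z
print(z_high, z_low)
```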
FINDING PROBABILITIES
Areas under the curve can also be
expressed as probabilities.
Probabilities are proportions and range
from 0.00 to 1.00.
The higher the value, the greater the
probability (the more likely the event).
For instance, a .95 probability of rain is
higher than a .05 probability that it will
rain!
THREE DIFFERENT AREA
CALCULATIONS:
1. FIND THE AREA TO THE LEFT OF Z
2. FIND THE AREA TO THE RIGHT OF Z
3. FIND THE AREA BETWEEN Z1 AND Z2
Obtaining Area under Standard Normal Curve

Approach: Find the area to the left of za
Graphically: Shade the area to the left of za, P(Z < a)
Solution: Use the Table to find the row and column that correspond to za.
The area is the value where the row and column intersect.

Approach: Find the area to the right of za
Graphically: Shade the area to the right of za, P(Z > a) or 1 – P(Z < a)
Solution: Use the Table to find the area to the left of za. The area to the
right of za is 1 – area to the left of za.

Approach: Find the area between za and zb
Graphically: Shade the area between za and zb, P(a < Z < b)
Solution: Use the Table to find the area to the left of za and to the left of
zb. The area between is area zb – area za.
EXAMPLE 1
Determine the area under the standard
normal curve that lies to the left of
A. Z = -3.49
= 0.0002
B. Z = -1.99
= 0.0233
C. Z = 0.92
= 0.8212
D. Z = 2.90
= 0.9981
EXAMPLE 2
Determine the area under the standard
normal curve that lies to the right of
a) Z = -3.49
= 0.9998
b) Z = -0.55
= 0.7088
c) Z = 2.23
= 0.0129
d) Z = 3.45
= 0.0003
EXAMPLE 3
Find the indicated probability of the
standard normal random variable Z
a) P(-2.55 < Z < 2.55)
= 0.9892
b) P(-0.55 < Z < 0)
= 0.2088
c) P(-1.04 < Z < 2.76)
= 0.8479
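The three kinds of area calculation in Examples 1 to 3 can be checked without a printed table. This is a minimal sketch assuming Python's `statistics.NormalDist` in place of the table lookup; the helper names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_left(z):
    """P(Z < z): the value a z-table gives directly."""
    return Z.cdf(z)

def area_right(z):
    """P(Z > z) = 1 - P(Z < z)."""
    return 1 - Z.cdf(z)

def area_between(z1, z2):
    """P(z1 < Z < z2) = area left of z2 minus area left of z1."""
    return Z.cdf(z2) - Z.cdf(z1)

print(f"{area_left(-1.99):.4f}")            # Example 1 B
print(f"{area_right(2.23):.4f}")            # Example 2 c
print(f"{area_between(-1.04, 2.76):.4f}")   # Example 3 c
```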
EXERCISE
Worksheet
SIMPLE TEST OF
HYPOTHESIS
Objectives:
1. Define a hypothesis
2. Differentiate between Null and
Alternative Hypothesis
3. State hypothesis for a particular
study/problem
4. Differentiate the types of hypothesis
testing
5. Follow the steps in hypothesis testing
6. Compare means by hypothesis
testing using different test statistics
WHAT IS A HYPOTHESIS?
It is an educated guess
It is a tentative generalization.
Statistical Hypothesis: a guess
or prediction made by a researcher
regarding the possible outcome of
the study.
TWO TYPES OF STATISTICAL
HYPOTHESIS:
A. Null Hypothesis (Ho)
It is the hypothesis to be tested which
one hopes to reject.
It states equality, or no significant
difference or relationship, between the variables.
B. Alternative Hypothesis (Ha)
It generally represents the idea which
the researcher wants to prove.
Exercise: Stating Ho and Ha.
2 TYPES OF HYPOTHESIS TESTING
1. One-tailed test. It is a directional test with
the region of rejection lying on either the left or the right tail of the
normal curve.
a. Right-Directional Test
(Ha uses comparatives such as greater than, more than, higher
than, better than, superior to, exceeds, etc.)
b. Left-Directional Test
(Ha uses comparatives such as smaller than, less than, lower
than, inferior to, below, etc.)
2. Two-tailed test. It is a non-directional test with the
region of rejection lying on both tails of the normal curve.
(Ha uses words such as not equal to, significantly different,
etc.)
STATISTICAL ERRORS
Type I Error. It is the error committed when the
null hypothesis is rejected when in fact it is true
and the alternative is false
Type II error. It is the error committed when the
null hypothesis is accepted when in fact it is
false and the alternative is true.
Facts          Decision
               Accept Ho        Reject Ho
Ho is True     Correct          Type I Error
Ho is False    Type II Error    Correct
LEVEL OF SIGNIFICANCE
Alpha (α): it is used to designate the
probability of committing a Type I error
Beta (β): it is used to designate the probability
of committing a Type II error
Note: Alpha is the size of the rejection region when Ho is true, while Beta is
the probability of landing in the acceptance region when Ho is false.
What does it mean when you set α= 0.05?
STEPS IN HYPOTHESIS TESTING
1. Formulate Ho and Ha.
2. Set the level of significance (α), then determine the type of
hypothesis testing and the tabular value
3. Set the criterion (when to reject Ho)
Determine and compute the test statistic
4. Make your decision
5. Formulate your conclusion.
CRITERIA FOR REJECTING HO
Using tabular value of z
1. One-tailed test (right-directional)
Reject Ho if Z computed is ≥ Z tabular
2. One-tailed test (left-directional)
Reject Ho if Z computed is ≤ Z tabular
3. Two-tailed test (Zc is positive)
Reject Ho if Z computed is ≥ Z tabular
4. Two-tailed test (Zc is negative)
Reject Ho if Z computed is ≤ Z tabular
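The rejection criteria can be sketched in Python. This assumes the tabular z value is obtained from the inverse of the standard normal distribution (`NormalDist.inv_cdf`) instead of a printed table; for left-tailed tests the tabular value is negative. The function name `reject_h0` and the `tail` labels are illustrative.

```python
from statistics import NormalDist

def reject_h0(z_computed, alpha, tail):
    """Return True when Ho should be rejected.

    tail is "right", "left", or "two" (illustrative labels).
    """
    std = NormalDist()
    if tail == "right":
        return z_computed >= std.inv_cdf(1 - alpha)   # e.g. 1.645 at alpha = 0.05
    if tail == "left":
        return z_computed <= std.inv_cdf(alpha)       # e.g. -1.645 at alpha = 0.05
    if tail == "two":
        # Covers both two-tailed cases: compare |Zc| with the positive tabular value
        return abs(z_computed) >= std.inv_cdf(1 - alpha / 2)  # e.g. 1.96 at alpha = 0.05
    raise ValueError("tail must be 'right', 'left', or 'two'")

print(reject_h0(1.70, 0.05, "right"))  # beyond 1.645, so reject
print(reject_h0(1.70, 0.05, "two"))    # inside ±1.96, so do not reject
```

The same computed z can lead to different decisions depending on whether the test is one-tailed or two-tailed, which is why the type of test is fixed before computing the statistic.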
TYPES OF TEST STATISTIC FOR HYPOTHESIS
TEST CONCERNING MEANS
A. Z-test (used when n is large, n ≥ 30)
1. Z- test for comparing hypothesized
and sample mean
2. Z-test for comparing 2 sample means
a. When the population standard
deviation is given
b. When the sample standard
deviation is given.
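The slides list the z-tests without their formulas. The sketch below assumes the usual textbook forms: z = (x̄ - µ) / (σ / √n) for one sample, and z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2) for two samples, with the sample standard deviation used in place of σ when only that is given and n is large. The function names and all numbers are hypothetical.

```python
from math import sqrt

def one_sample_z(sample_mean, hypothesized_mean, sd, n):
    """Z statistic for comparing a sample mean with a hypothesized mean."""
    return (sample_mean - hypothesized_mean) / (sd / sqrt(n))

def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """Z statistic for comparing two sample means."""
    return (mean1 - mean2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical inputs, for illustration only
z_one = one_sample_z(sample_mean=82, hypothesized_mean=80, sd=10, n=100)
z_two = two_sample_z(mean1=85, mean2=80, sd1=10, sd2=10, n1=50, n2=50)
print(z_one, z_two)
```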
CORRELATION
AND
REGRESSION
CORRELATION AND REGRESSION
Scatter plot is used to show a rough estimate of the
relationship between two variables
Correlation
Measures the strength of the association between
two variables (bivariate data)
Only concerned with strength of the relationship
No causal effect is implied
Bivariate data
Are data sets in which each subject has two
observations associated with it.
TYPES
POSITIVE CORRELATION – exists when high scores
in one variable are associated with high scores in the
second variable or low scores in one variable are
associated with low scores in the other
NEGATIVE CORRELATION – exists when high scores
in one variable are associated with low scores in the
second or vice versa.
ZERO CORRELATION – exists when the points on the
scatter diagram are spread in a random manner.
PERFECT CORRELATION – exists when all points lie on a straight
line
THE STRENGTH OR DEGREE OF THE
RELATIONSHIP IS BASED ON THE FOLLOWING
RANGES OF THE CORRELATION COEFFICIENT:
Ranges of r         Degree/strength of relationship
±1.00               perfect relationship
±0.90 to ±0.99      very strong/very high
±0.70 to ±0.89      strong/high
±0.40 to ±0.69      moderate/substantial
±0.20 to ±0.39      weak/small
±0.01 to ±0.19      almost negligible to slight
0                   no correlation
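The ranges above can be written as a small lookup function. This sketch assumes r is rounded to two decimal places before it is compared with the table, since the table itself is stated to two decimals; the function name is illustrative.

```python
def strength_of_relationship(r):
    """Map a correlation coefficient to the verbal label in the table."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = round(abs(r), 2)  # the sign gives direction, not strength
    if size == 1.00:
        return "perfect relationship"
    if size >= 0.90:
        return "very strong/very high"
    if size >= 0.70:
        return "strong/high"
    if size >= 0.40:
        return "moderate/substantial"
    if size >= 0.20:
        return "weak/small"
    if size >= 0.01:
        return "almost negligible to slight"
    return "no correlation"

print(strength_of_relationship(-0.95))
print(strength_of_relationship(0.45))
```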
SCATTER PLOT EXAMPLES
[Figure: scatter plots of y against x showing strong relationships, weak relationships, and no relationship]
CORRELATION COEFFICIENT
A descriptive measure usually
denoted by r, which ranges
from -1 to 1.
It measures the degree of
relationship between two
variables.
FEATURES OF R
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the
negative linear relationship
The closer to 1, the stronger the
positive linear relationship
The closer to 0, the weaker the
linear relationship
EXAMPLES OF APPROXIMATE
R VALUES
[Figure: five scatter plots of y against x with r = -1, r = -.6, r = 0, r = +.3, and r = +1]
CALCULATING THE
CORRELATION COEFFICIENT
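$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} $$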
where:
r = Sample correlation coefficient
n = Sample size
x = Value of the independent
variable
y = Value of the dependent variable
CALCULATION EXAMPLE
Tree Height   Trunk Diameter
y             x          xy         y²          x²
35            8          280        1225        64
49            9          441        2401        81
27            7          189        729         49
33            6          198        1089        36
60            13         780        3600        169
21            7          147        441         49
45            11         495        2025        121
51            12         612        2601        144
Σ = 321       Σ = 73     Σ = 3142   Σ = 14111   Σ = 713
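CALCULATION EXAMPLE (continued)
$$ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886 $$
[Figure: scatter plot of Tree Height, y, against Trunk Diameter, x]
r = 0.886 → relatively strong positive
linear association between x and y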
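The hand calculation can be reproduced with a short Python sketch that applies the same computational formula to the tree data. The function name `pearson_r` is illustrative.

```python
from math import sqrt

height = [35, 49, 27, 33, 60, 21, 45, 51]  # y, tree height
diameter = [8, 9, 7, 6, 13, 7, 11, 12]     # x, trunk diameter

def pearson_r(x, y):
    """Sample correlation coefficient from the raw sums, as in the formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(diameter, height)
print(round(r, 3))
```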
EXERCISE
Identify the type of correlation for each pair of variables:
Temperature and air conditioning cost
School attendance and achievement
Investment period and interest earned
Weight and IQ
Temperature and ice cream sales
Age and agility
Amount of exercise and body weight
Pearson product moment correlation
coefficient
Coefficient of determination = R squared
Indicates the proportion of the variance in
one variable that can be associated with
the variance in the other variable.
COEFFICIENT OF
DETERMINATION, R²
The coefficient of determination is
the portion of the total variation in
the dependent variable that is
explained by variation in the
independent variable
COEFFICIENT OF DETERMINATION, R²
Note: In the single independent variable case, the
coefficient of determination is
$$ R^2 = r^2 $$
where:
R² = Coefficient of determination
r = Simple correlation coefficient
INTRODUCTION TO REGRESSION
ANALYSIS
Regression analysis is used to:
Predict the value of a dependent variable
based on the value of at least one independent
variable
Explain the impact of changes in an
independent variable on the dependent
variable
Dependent variable: the variable we wish
to explain
Independent variable: the variable used to
explain the dependent variable
SIMPLE LINEAR REGRESSION MODEL
Only one independent variable, x
Relationship between x and y is
described by a linear function
Changes in y are assumed to be
caused by changes in x
TYPES OF REGRESSION MODELS
[Figure: four scatter plot panels titled Positive Linear Relationship, Relationship NOT Linear, Negative Linear Relationship, and No Relationship]
COEFFICIENT OF DETERMINATION, R²
(continued)
Coefficient of determination
$$ R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}} $$
Note: In the single independent variable case, the coefficient
of determination is
$$ R^2 = r^2 $$
where:
R² = Coefficient of determination
r = Simple correlation coefficient
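The two definitions can be compared numerically. The sketch below reuses the tree height and trunk diameter data from the correlation example, fits a least-squares line (an assumption, since the slides do not state the fitting method), and computes R² as SSR / SST.

```python
diameter = [8, 9, 7, 6, 13, 7, 11, 12]     # x, independent variable
height = [35, 49, 27, 33, 60, 21, 45, 51]  # y, dependent variable

n = len(diameter)
mean_x = sum(diameter) / n
mean_y = sum(height) / n

# Least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(diameter, height))
         / sum((x - mean_x) ** 2 for x in diameter))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * x for x in diameter]
ssr = sum((p - mean_y) ** 2 for p in predicted)  # explained by the regression
sst = sum((y - mean_y) ** 2 for y in height)     # total variation in y
r_squared = ssr / sst

print(round(r_squared, 3))
```

For this data the result agrees with squaring r = 0.886 from the calculation example, as the note for the single independent variable case states.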
EXAMPLES OF APPROXIMATE
R² VALUES
R² = 1
Perfect linear relationship
between x and y:
100% of the variation in y is
explained by variation in x
[Figure: two scatter plots of y against x, each with R² = 1]
EXAMPLES OF APPROXIMATE
R² VALUES
0 < R² < 1
Weaker linear relationship
between x and y:
Some but not all of the
variation in y is explained
by variation in x
[Figure: two scatter plots of y against x with 0 < R² < 1]
EXAMPLES OF APPROXIMATE
R² VALUES
R² = 0
No linear relationship
between x and y:
The value of y does not
depend on x. (None of the
variation in y is explained
by variation in x)
[Figure: scatter plot of y against x with R² = 0]
EXAMPLE
The relationship between the number of
sales calls and the number of units sold is
given by r = 0.759
The coefficient of determination is r
squared = 0.576
This means that 57.6% of the variation
in the number of units sold is explained, or
accounted for, by the variation in the number
of sales calls.
Correlation is a measure of the linear relationship between two
variables and does not mean there is a causal relationship
between them.
Examples (explain that there is no causal relationship between
the variables; other factors must have been the causes):
IQ level and starting menstrual period among females
Entrance test result and grades.
REGRESSION ANALYSIS
The process of developing an equation,
Preliminaries
Regression equation
How well a regression line fits the data
R² = 1: perfect fit
R² = 0
0 < R² < 0.5: not well fit
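A minimal sketch of developing a regression equation, assuming the least-squares line y = b0 + b1·x and reusing the tree data from the correlation example. The function name `fit_line` and the prediction at x = 10 are illustrative only.

```python
diameter = [8, 9, 7, 6, 13, 7, 11, 12]     # x, trunk diameter
height = [35, 49, 27, 33, 60, 21, 45, 51]  # y, tree height

def fit_line(x, y):
    """Return (intercept b0, slope b1) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

intercept, slope = fit_line(diameter, height)
predicted_height = intercept + slope * 10  # predicted y for a trunk diameter of 10

print(f"y = {intercept:.3f} + {slope:.3f}x")
print(f"predicted height at x = 10: {predicted_height:.1f}")
```

Predictions are most trustworthy for x values inside the range of the observed data (here, diameters from 6 to 13).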
Use of Excel
for the Analysis of Data
COMPUTER HANDS-ON
Open forum for
clarification and ideas
Design a plan or make a
project proposal.
Due on
December 15, 2017