
INFERENTIAL STATISTICS (IT258M)
The Crimean War

The Crimean War (1853-1856) was a bloody conflict between Russia and the British Alliance (Great Britain, France, the Ottoman Empire, and the Kingdom of Sardinia) that saw great casualties on both sides.

“Half a league, half a league,
Half a league onward,
All in the valley of Death
Rode the six hundred.
‘Forward, the Light Brigade!
Charge for the guns!’ he said:
Into the valley of Death
Rode the six hundred…”

Alfred, Lord Tennyson, “The Charge of the Light Brigade,” written to memorialize the Battle of Balaclava, Oct. 25, 1854.
Florence Nightingale, “Lady with the Lamp”

Florence Nightingale (1820-1910) observed the horrific conditions of the wounded and was instrumental in convincing the British government to make sweeping changes in the sanitary conditions of the makeshift “hospitals.” Her work to make conditions more sanitary caused the mortality rate to decline from 44 percent to 2 percent within 6 months.

“Lo! in that hour of misery
A lady with a lamp I see
Pass through the glimmering gloom,
And flit from room to room.”
Henry Wadsworth Longfellow’s 1857 poem “Santa Filomena”

Nightingale wanted to create a visual representation of her argument on sanitary conditions in her reports to the British government. She saw that by creating a circle denoting 100 percent of an event, and dividing that circle into segments, she could produce a simple graph that contained a lot of information… thus, Florence Nightingale created her famous polar area diagram, a close relative of the pie chart.
Knowledge, Data, Information, and Decisions…

[Diagram: a cycle of Data → Information → Knowledge → New Knowledge/Decisions]
The Scientific Method

• Scientific Method
• The way researchers go about using knowledge and evidence to reach objective conclusions about the real world.
• The analysis and interpretation of empirical evidence (facts from observation or experimentation) to confirm or disprove prior conceptions.
Characteristics of the scientific method
• Scientific Research is Public – Advances in science require freely available information (replication/peer scrutiny).
• Science is Objective – Science tries to rule out eccentricities of judgment by researchers and institutions. Wilhelm von Humboldt (1767-1835), founder of the University of Berlin (teaching, learning, research): “Lehrfreiheit,” “Lernfreiheit,” and “Freiheit der Wissenschaft.”
• Science is Empirical – Researchers are concerned with a world that is knowable and potentially measurable. Researchers must be able to perceive and classify what they study and reject metaphysical and nonsensical explanations of events.
Characteristics of the scientific method, cont.

• Science is Systematic and Cumulative – No single research study stands alone, nor does it rise or fall by itself. Research also follows a specific method.
• Theory – A set of related propositions that presents a systematic view of phenomena by specifying relationships among concepts.
• Law – A statement of fact meant to explain, in concise terms, an action or set of actions that is generally accepted to be true and universal.
• Science is Predictive – Science is concerned with relating the present to the future (making predictions).
• Science is Self-Correcting – Changes in thoughts, theories, or laws are appropriate when errors in previous research are uncovered.
Flow chart of the scientific method

Note: Diamond-shaped boxes indicate stages in the research process in which a choice of one or more techniques must be made. The dotted line indicates an alternative path that skips exploratory research.
Two basic types of research

• Qualitative Research (words) – is by definition exploratory, and it is used when we don’t know what to expect, to define the problem, or to develop an approach to the problem. It’s also used to go deeper into issues of interest and explore nuances related to the problem at hand. Common data collection methods used in qualitative research are focus groups, in-depth interviews, uninterrupted observation, bulletin boards, and ethnographic participation/observation.

• Quantitative Research (numbers) – is conclusive in its purpose as it tries to quantify the problem and understand how prevalent it is by looking for projectable results to a larger population. Here we collect data through surveys (online, phone, paper), audits, points of purchase (purchase transactions), and other trend data.
Stating a hypothesis or research question

• Research Question – A formally stated question intended to provide indications about some phenomenon; it is not limited to investigating relationships between variables. Used when the researcher is unsure about the nature of the problem under investigation.
• Hypothesis – a formal statement regarding the relationship between variables
and is tested directly. The predicted relationship between the variables is either
true or false.
• Independent Variable (Xi)– the variable that is systematically varied by the
researcher
• Dependent Variable (Yi) – the variable that is observed and whose value is
presumed to depend on independent variables
Hypothesis vs. research question

• Research Question: “Does television content enrich a child’s imaginative capacities by offering materials and ideas for make-believe play?”
• Hypothesis: The amount of time a child spends in make-believe play is directly related to the amount of time spent viewing make-believe play on television.
• Null Hypothesis: the denial or negation of a research hypothesis; the hypothesis
of no difference
• H0: “There is no significant difference between the amount of time children engage in make-believe play and the amount of time children watch make-believe play on television.”
Data analysis and interpretation
• Every research study must be carefully planned and performed according to specific guidelines.
• When the analysis is completed, the researcher must step back and consider what has been discovered.
• The researcher must ask two questions:
• Are the results internally and externally valid?
• Are the results reliable?

[Figure: four targets illustrating Neither Valid nor Reliable; Valid but not Reliable; Not Valid but Reliable; Both Valid and Reliable]
Internal validity

• If y = f(x), control over the research conditions is necessary to eliminate the possibility of finding that y = f(b), where b is an extraneous variable.

• Artifact – Any variable that creates a possible but incorrect explanation of results. Also referred to as a confounding variable.

• The presence of an artifact indicates issues of internal validity; that is, the study has failed to investigate its hypothesis.
What affects Internal validity?

• History – various events that occur during a study may affect the subject’s
attitudes, opinions, and behavior.
• Maturation – Subjects’ biological and psychological characteristics change during
the course of a study (mainly longitudinal).
• Testing – The act of testing may cause artifacts depending on the environment,
giving similar pre-tests/post-tests, and/or timing.
• Instrumentation – A situation where equipment malfunctions, observers become
tired/casual, and/or interviewers may make mistakes.
• Statistical regression – Subjects who achieve either very high or very low scores
on a test tend to regress to (move toward) the sample or population mean.
What affects internal validity, cont.
• Experimental Mortality – All research studies face the possibility that subjects
will drop out for one reason or another.
• Sample Selection – When groups are not selected randomly or when they are not
homogeneous
• Demand Characteristics – Subjects’ reactions to experimental situations. Subjects
who recognize the purpose of a study may produce only “good” data for
researchers (Hawthorne Effect).
• Experimenter Bias – Researcher becomes swayed by a client’s (or personal)
wishes for a project’s results (Blind vs. Double Blind).
• Evaluation Apprehension – Subjects are afraid of being measured or tested.
• Causal Time Order – An experiment’s results are due not to the stimulus
(independent) variable but rather to the effect of the dependent variable.
What affects internal validity, cont.

• Diffusion or Imitation of Treatments – Where respondents may have the opportunity to discuss the experiment/study with another respondent who hasn’t yet participated.
• Compensation – The researcher treats the control group differently because of the belief that the group has been “deprived.”
• Compensatory Rivalry – Subjects who know they are in the control group may work harder to perform differently or outperform the experimental group.
• Demoralization – Control group may feel demoralized or angry that they are not in the experimental group.
External validity

• How well the results of a study can be generalized across the population.

• Use random samples.

• Use heterogeneous (diverse) samples and replicate the study several times.

• Select a sample that is representative of the group to which the results will be
generalized.
Probability versus Nonprobability Sampling

• Probability Sampling
• A sampling technique in which every member of the population has a known,
nonzero probability of selection.

• Nonprobability Sampling
• A sampling technique in which units of the sample are selected on the basis
of personal judgment or convenience.
• The probability of any particular member of the population being chosen is
unknown.
Variables

• Concepts that are observable and measurable
• Have a dimension that can vary
• Narrow in meaning
• Examples:
• Color classification
• Loudness
• Level of satisfaction/agreement
• Amount of time spent
• Media choice
Types and forms of variables

• Variable Types:
• Independent – those that are systematically varied by the researcher
• Dependent – those that are observed. Their values are presumed to depend on
the effects of the independent variables

• Variable Forms:
• Discrete – only includes a finite set of values (yes/no; republican/democrat;
satisfied….not satisfied, etc.)
• Continuous – takes on any value on a continuous scale (height, weight,
length, time, etc.)
Scales: Concept
• A generalized idea about a class of objects, attributes, occurrences, or
processes

Example: Satisfaction
Scales: Operational Definition
• Specifies what the researcher must do to measure the concept under
investigation

Example: A 1-7 scale measuring the level of satisfaction; a measure of the number of hours watching TV.
Scales
• Represents a composite measure of a variable
• Series of items arranged according to value for the purpose of quantification
• Provides a range of values that correspond to different characteristics or amounts of a characteristic exhibited in observing a concept
• Scales come in four levels: Nominal, Ordinal, Interval, and Ratio
Nominal Scale

• Indicates a difference
Ordinal Scale

• Indicates a difference
• Indicates the direction of the distance (e.g. more than or less than)
Interval Scale

• Indicates a difference
• Indicates the direction of the distance (e.g. more than or less than)
• Indicates the amount of the difference (in equal intervals)
Ratio Scale

• Indicates a difference
• Indicates the direction of the distance (e.g. more than or less
than)
• Indicates the amount of the difference (in equal intervals)
• Indicates an absolute zero
Two sets of scores…

Group 1 Group 2
100, 100 91, 85
99, 98 81, 79
88, 77 78, 77
72, 68 73, 75
67, 52 72, 70
43, 42 65, 60

How can we analyze these numbers? Choosing one of the groups, we can start with descriptive statistics.
Frequency Distribution of Scores (Group 1, N = 12)

Score    Frequency
100      2
99       1
98       1
88       1
77       1
72       1
68       1
67       1
52       1
43       1
42       1

Frequency Distribution Grouped in Intervals (N = 12)

Interval   Frequency
40-59      3
60-79      4
80-100     5

[Pie chart: one segment each for the 40-59, 60-79, and 80-100 intervals]


Frequency Distribution with Columns for Percentage, Cumulative Frequency, and Cumulative Percentage (both groups combined, N = 24)

Score   Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
100     2           8.33%        2                      8.33%
99      1           4.17%        3                      12.50%
98      1           4.17%        4                      16.67%
91      1           4.17%        5                      20.83%
88      1           4.17%        6                      25.00%
85      1           4.17%        7                      29.17%
81      1           4.17%        8                      33.33%
79      1           4.17%        9                      37.50%
78      1           4.17%        10                     41.67%
77      2           8.33%        12                     50.00%
75      1           4.17%        13                     54.17%
73      1           4.17%        14                     58.33%
72      2           8.33%        16                     66.67%
70      1           4.17%        17                     70.83%
68      1           4.17%        18                     75.00%
67      1           4.17%        19                     79.17%
65      1           4.17%        20                     83.33%
60      1           4.17%        21                     87.50%
52      1           4.17%        22                     91.67%
43      1           4.17%        23                     95.83%
42      1           4.17%        24                     100.00%
N = 24              100.00%
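The table above can be reproduced in a few lines. Here is a minimal sketch in Python (standard library only); the score list is the 24 scores from the two groups combined:

```python
from collections import Counter

# The 24 scores from Groups 1 and 2 combined
scores = [100, 100, 99, 98, 88, 77, 72, 68, 67, 52, 43, 42,
          91, 85, 81, 79, 78, 77, 73, 75, 72, 70, 65, 60]

freq = Counter(scores)          # frequency of each distinct score
n = len(scores)

cumulative = 0
table = []
for score in sorted(freq, reverse=True):   # highest score first, as on the slide
    f = freq[score]
    cumulative += f
    table.append((score, f, round(100 * f / n, 2),
                  cumulative, round(100 * cumulative / n, 2)))

for row in table:
    print(row)   # (score, frequency, percentage, cum. frequency, cum. percentage)
```

Running this reproduces every row, including 77 at a cumulative percentage of exactly 50%.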
Creating a histogram (bar chart)

[Histogram of the 24 scores: x-axis shows scores from 42 to 100, y-axis shows frequency]
Creating a Frequency polygon

[Frequency polygon of the 24 scores: x-axis shows scores from 42 to 100, y-axis shows frequency]
Normal Distribution

[Figure: normal curve with 68% of scores within ±1 SD, 95% within ±2 SD, and 99% within ±3 SD]

The Bell Curve

[Figure: bell curve centered on Mean = 70; the extreme .01 tails on each side are marked Significant]
Central limit theorem

• In probability theory, the central limit theorem says that, under certain conditions, the sum of many independent, identically distributed random variables, when scaled appropriately, converges in distribution to a standard normal distribution.
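A quick simulation illustrates the theorem. This sketch (standard library only, with an assumed sample size of 50 draws per trial) sums uniform(0, 1) draws, standardizes each sum, and checks that the standardized sums behave like a standard normal variable:

```python
import random
from statistics import mean, stdev

random.seed(0)

# A uniform(0, 1) variable has mean 1/2 and variance 1/12,
# so the sum of n draws has mean n/2 and variance n/12.
n, trials = 50, 10_000
sums = [sum(random.random() for _ in range(n)) for _ in range(trials)]

# Standardize each sum; by the CLT these z values are approximately standard normal
z = [(s - n * 0.5) / (n / 12) ** 0.5 for s in sums]

print(round(mean(z), 2))   # close to 0
print(round(stdev(z), 2))  # close to 1

# Roughly 68% of the standardized sums fall within ±1 standard deviation
share = sum(-1 <= v <= 1 for v in z) / trials
print(round(share, 2))     # close to 0.68
```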
Central Tendency

• These statistics answer the question: What is a typical score?


• The statistics provide information about the grouping of the numbers
in a distribution by giving a single number that characterizes the
entire distribution.
• Exactly what constitutes a “typical” score depends on the level of
measurement and how the data will be used.
• For every distribution, three characteristic numbers can be identified:
• Mode
• Median
• Mean
Measures of Central Tendency

• Mean – arithmetic average (µ for a population; x̄ for a sample)
• Median – midpoint of the distribution
• Mode – the value that occurs most often
Mode Example
Find the score that occurs most frequently.

Scores: 98, 88, 81, 74, 72, 72, 70, 69, 65, 52

Mode = 72
Median Example
Arrange the scores in descending order and find the midpoint.

Odd number (N = 9): 98, 88, 81, 74, 72, 70, 69, 65, 52 → midpoint = 72

Even number (N = 10): 98, 88, 81, 74, 72, 71, 70, 69, 65, 52 → midpoint = (72 + 71) / 2 = 71.5
Different means
• Arithmetic Mean – the sum of all items in the list divided by the number of items in the list:

x̄ = (a₁ + a₂ + a₃ + … + aₙ) / n
Arithmetic Mean Example

Scores: 98, 88, 81, 74, 72, 72, 70, 69, 65, 52

Sum = 741

Mean = 741 / 10 = 74.1
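All three measures of central tendency can be checked with Python’s statistics module, using the ten scores from the examples above:

```python
from statistics import mean, median, mode

scores = [98, 88, 81, 74, 72, 72, 70, 69, 65, 52]  # the ten scores above

print(mode(scores))    # 72 (occurs twice)
print(median(scores))  # 72.0 (midpoint of the two middle scores, 72 and 72)
print(mean(scores))    # 74.1 (741 / 10)
```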
Frequency polygon of test score data

[Frequency polygon of the 24 test scores: x-axis shows scores from 42 to 100, y-axis shows frequency]
Skewness

• Refers to the concentration of scores around a particular point on the x-axis.
• If this concentration lies toward the low end of the scale, with the tail of the curve trailing off to the right, the curve is called a right skew.
• If the tail of the curve trails off to the left, it is a left skew.
Left-Skewed Distribution

[Figure: a left-skewed frequency distribution of the test scores, with the tail trailing off to the left]
Skewness
• Skewness can occur when the frequency of just one
score is clustered away from the mean.

[Frequency polygon showing a cluster of scores away from the mean]
Normal Distribution

[Figure: normal curve with 68% of scores within ±1 SD, 95% within ±2 SD, and 99% within ±3 SD]

In a normal distribution, Mode = Median = Mean.

When the distribution may not be normal: Salary Sample Data

[Histogram of annual salaries in thousands of dollars, from 25K to 175K]

Mode = 45K; Median = 56K; Average = 62K
Measures of Dispersion or Spread

• Range
• Variance
• Standard deviation
The Range as a Measure of Spread

• The range is the distance between the smallest and the largest value in the set.

• Range = largest value – smallest value


Group 1 Group 2
100, 100 91, 85
99, 98 81, 79
88, 77 78, 77
72, 68 73, 75
67, 52 72, 70
43, 42 65, 60
Range G1: 100 – 42 = 58 Range G2: 91 – 60 = 31
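The two ranges above take one line each to compute:

```python
group1 = [100, 100, 99, 98, 88, 77, 72, 68, 67, 52, 43, 42]
group2 = [91, 85, 81, 79, 78, 77, 73, 75, 72, 70, 65, 60]

def value_range(data):
    # Range = largest value minus smallest value
    return max(data) - min(data)

print(value_range(group1))  # 58 (100 - 42)
print(value_range(group2))  # 31 (91 - 60)
```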
Population Variance

σ² = Σ(Xᵢ − µ)² / N

Sample Variance

s² = Σ(Xᵢ − x̄)² / (n − 1)
Variance

• A method of describing variation in a set of scores
• The higher the variance, the greater the variability and/or spread of scores
Variance Example

X     X − Mean           (X − Mean)²
98    98 − 74.1 = 23.9   571.21
88    88 − 74.1 = 13.9   193.21
81    81 − 74.1 = 6.9    47.61
74    74 − 74.1 = −0.1   0.01
72    72 − 74.1 = −2.1   4.41
72    72 − 74.1 = −2.1   4.41
70    70 − 74.1 = −4.1   16.81
69    69 − 74.1 = −5.1   26.01
65    65 − 74.1 = −9.1   82.81
52    52 − 74.1 = −22.1  488.41
Mean = 74.1        Sum = 1,434.90

Population variance (N): 1,434.90 / 10 = 143.49
Sample variance (n − 1): 1,434.90 / 9 = 159.43
Uses of the variance

• The variance is used in many higher-order calculations including:
• T-test
• Analysis of Variance (ANOVA)
• Regression
• A variance value of zero indicates that all values within a set of numbers are
identical
• All variances that are non-zero will be positive numbers. A large variance
indicates that numbers in the set are far from the mean and each other, while a
small variance indicates the opposite.
Standard Deviation
• Another method of describing variation in a set of scores
• The higher the standard deviation, the greater the variability and/or
spread of scores
Sample Standard Deviation

s = √[ Σ(Xᵢ − x̄)² / (n − 1) ]
Standard Deviation Example

Using the squared deviations from the variance example (sum = 1,434.90, mean = 74.1):

Population SD: √(1,434.90 / 10) = √143.49 = 11.98
Sample SD: √(1,434.90 / 9) = √159.43 = 12.63
• A survey was given to UNA students to find out how many hours per week they
would listen to a student-run radio station. The sample responses were separated
by gender. Determine the mean, range, variance, and standard deviation of each
group.
Group A (Female) Group B (Male)

15 30
25 15
12 21
7 12
3 26
32 20
17 5
16 24
9 18
24 10
Group one (females)

X     Mean   X − Mean   (X − Mean)²
15    16     −1         1
25    16     9          81
12    16     −4         16
7     16     −9         81
3     16     −13        169
32    16     16         256
17    16     1          1
16    16     0          0
9     16     −7         49
24    16     8          64
                 Sum =  718

Range = 32 − 3 = 29
Variance = 718 / 9 = 79.78
SD = √79.78 = 8.93
Group Two (Males)

X     Mean   X − Mean   (X − Mean)²
30    18     12         144
15    18     −3         9
21    18     3          9
12    18     −6         36
26    18     8          64
20    18     2          4
5     18     −13        169
24    18     6          36
18    18     0          0
10    18     −8         64
                 Sum =  535

Range = 30 − 5 = 22
Variance = 535 / 9 = 59.44
SD = √59.44 = 7.71
(The exact mean is 181 / 10 = 18.1; it is rounded to 18 in the table.)
Results

Radio Listening Results

Group     Average   Range   Variance   S
Females   16        29      79.78      8.93
Males     18        22      59.44      7.71
Standard Deviation on the Bell Curve

What if S = 4 and the mean = 70? Then 58, 62, 66, 70, 74, 78, and 82 correspond to −3, −2, −1, 0, +1, +2, and +3 standard deviations, and scores in the extreme .01 tails are significant.

[Figure: bell curve centered on Mean = 70 with SD tick marks from 58 to 82]
How Variability and Standard Deviation Work…

Class A       Class B
100, 100      91, 85
99, 98        81, 79
88, 77        78, 77
72, 68        73, 75
67, 52        72, 70
43, 42        65, 60

Mean = 75.5   Mean = 75.5
STD = 21.93   STD = 8.42
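The point of the comparison, same mean, very different spread, is easy to confirm (the exact Class B sample SD rounds to 8.43; the slides show 8.42):

```python
from statistics import mean, stdev

class_a = [100, 100, 99, 98, 88, 77, 72, 68, 67, 52, 43, 42]
class_b = [91, 85, 81, 79, 78, 77, 73, 75, 72, 70, 65, 60]

# Identical means, very different standard deviations
print(mean(class_a), round(stdev(class_a), 2))  # 75.5 21.93
print(mean(class_b), round(stdev(class_b), 2))  # 75.5 8.43
```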
How Do We Use This Stuff?
• The type of data determines what kind of measures you can use
• Higher order data can be used with higher order statistics
When scores don’t compare
• A student takes the ACT test (11-36) and scores a 22…
• The same student takes the SAT (590-1,600) and scores a 750…
• The same student takes the TOEFL (0-120) and scores a 92…
• How can we tell if the student did better/worse on one score in
relation to the other scores?
• ANSWER: Standardize or Normalize the scores
• HOW: Z-Scores!
Z-Scores
• In statistics, the standard score is the (signed) number of standard deviations an
observation or datum is above or below the mean.
• A positive standard score represents a datum above the mean, while a negative
standard score represents a datum below the mean.
• It is a dimensionless quantity obtained by subtracting the population mean from
an individual raw score and then dividing the difference by the population
standard deviation. This conversion process is called standardizing or normalizing.
• Standard scores are also called z-values, z-scores, normal scores, and
standardized variables.
Z-score formula

z = (X − x̄) / S

Z-scores with positive values are above the mean, while z-scores with negative values are below the mean.
Z-scores, cont.
• It is a little awkward in discussing a score or observation to have to say that it is “2
standard deviations above the mean” or “1.5 standard deviations below the
mean.”
• To make it a little easier to pinpoint the location of a score in any distribution, the
z-score was developed.
• The z-score is simply a way of telling how far a score is from the mean in standard
deviation units.
Calculating the z-score
• If the observed value (individual score) = 9; the mean = 6;
and the standard deviation = 2.68:
Z-Scores, cont.
• A z-score may also be used to find the location of a score that is a normally distributed variable.
• Using an example of a population of IQ test scores where the individual score = 80, the population mean = 100, and the population standard deviation = 16:

z = (X − µ) / σ = (80 − 100) / 16 = −20 / 16 = −1.25
Comparing z-scores
• Z-scores allow the researcher to make comparisons between different distributions.

     Mathematics   Natural Science   English
µ    75            103               52
σ    6             14                4
X    78            115               57

Mathematics:     z = (78 − 75) / 6 = 3 / 6 = 0.5
Natural Science: z = (115 − 103) / 14 = 12 / 14 = 0.86
English:         z = (57 − 52) / 4 = 5 / 4 = 1.25
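The three comparisons reduce to one small function:

```python
def z_score(x, mu, sigma):
    """Standard score: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# The three exam scores compared above
print(round(z_score(78, 75, 6), 2))     # 0.5  (Mathematics)
print(round(z_score(115, 103, 14), 2))  # 0.86 (Natural Science)
print(round(z_score(57, 52, 4), 2))     # 1.25 (English)
```

Relative to each class, the English score is the strongest performance even though its raw value is the smallest.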
Interpretation
• Interpretation
• The process of drawing inferences from the analysis results.
• Inferences drawn from interpretations lead to managerial implications and
decisions.
• From a management perspective, the qualitative meaning of the data and
their managerial implications are an important aspect of the interpretation.
Inferential Statistics Provide Two Environments:
• Test for Difference – To test whether a significant difference exists between groups
• Test for Relationship – To test whether a significant relationship exists between a dependent (Y) and independent (X) variable(s)
• A relationship may also be predictive
Hypothesis Testing Using Basic Statistics

• Univariate Statistical Analysis
• Tests of hypotheses involving only one variable.
• Bivariate Statistical Analysis
• Tests of hypotheses involving two variables.
• Multivariate Statistical Analysis
• Statistical analysis involving three or more variables or sets of variables.
Hypothesis Testing Procedure
• Process
• The specifically stated hypothesis is derived from the research objectives.
• A sample is obtained and the relevant variable is measured.
• The measured sample value is compared to the value either stated explicitly
or implied in the hypothesis.
• If the value is consistent with the hypothesis, the hypothesis is supported.
• If the value is not consistent with the hypothesis, the hypothesis is not
supported.
Hypothesis Testing Procedure, Cont.
• H0 – Null Hypothesis
• “There is no significant difference/relationship between groups”
• Ha – Alternative Hypothesis
• “There is a significant difference/relationship between groups”
• Always state your Hypothesis/es in the Null form
• The object of the research is to either reject or accept the Null Hypothesis/es
Significance Levels and p-values

• Significance Level
• A critical probability associated with a statistical hypothesis test; it indicates how likely it is that an inference supporting a difference between an observed value and some statistical expectation is true.
• The acceptable level of Type I error.
• p-value
• Probability value, or the observed or computed significance level.
• p-values are compared to significance levels to test hypotheses.
Experimental Research: What happens?
A hypothesis (educated guess) is stated and then tested. Possible outcomes:

Prediction                   It Happens              It Does Not Happen
Something will happen        prediction supported    prediction not supported
Something will not happen    prediction not supported    prediction supported
Type I and Type II Errors
• Type I Error
• An error caused by rejecting the null hypothesis when it should be accepted
(false positive).
• Has a probability of alpha (α).
• Practically, a Type I error occurs when the researcher concludes that a
relationship or difference exists in the population when in reality it does not
exist.
Type I and Type II Errors (cont’d)
• Type II Error
• An error caused by failing to reject the null hypothesis when the hypothesis
should be rejected (false negative).
• Has a probability of beta (β).
• Practically, a Type II error occurs when a researcher concludes that no
relationship or difference exists when in fact one does exist.
Type I and II Errors and Fire Alarms?

With H0 = “there is no fire”:

            FIRE       NO FIRE
ALARM       NO ERROR   TYPE I
NO ALARM    TYPE II    NO ERROR

            H0 is True   H0 is False
ACCEPT H0   NO ERROR     TYPE II
REJECT H0   TYPE I       NO ERROR

Recapitulation of the Research Process
• Collect Data
• Run Descriptive Statistics
• Develop Null Hypothesis/es
• Determine the Type of Data
• Determine the Type of Test/s (based on type of data)
• If test produces a significant p-value, REJECT the Null Hypothesis. If the test does
not produce a significant p-value, ACCEPT the Null Hypothesis.
• Remember that, due to error, statistical tests only support hypotheses and can
NOT prove a phenomenon
Data Type vs. Statistics Used

Data Type   Statistics Used
Nominal     Frequency, percentages, modes
Ordinal     Frequency, percentages, modes, median, range, percentile, ranking
Interval    Frequency, percentages, modes, median, range, percentile, ranking, average, variance, SD, t-tests, ANOVAs, Pearson Rs, regression
Ratio       Frequency, percentages, modes, median, range, percentile, ranking, average, variance, SD, t-tests, ratios, ANOVAs, Pearson Rs, regression
Pearson R Correlation Coefficient

X    Y
1    4
3    6
5    10
5    12
1    13
2    3
4    3
6    8
Pearson R Correlation Coefficient
A measure of how well a linear equation describes the relation between two variables X and Y measured on the same object.

X       Y     x = X − x̄   y = Y − ȳ   xy    x²    y²
1       4     −3           −5           15    9     25
3       6     −1           −3           3     1     9
5       10    1            1            1     1     1
5       12    1            3            3     1     9
6       13    2            4            8     4     16
Total   20    45           0     0      30    16    60
Mean    4     9            0     0      6
Calculation of Pearson R

r = Σxy / √(Σx² · Σy²)

For the table above: r = 30 / √(16 × 60) = 30 / √960 ≈ 0.97

Alternative Formula (using raw scores)

r = [ΣXY − (ΣX)(ΣY)/N] / √[(ΣX² − (ΣX)²/N) · (ΣY² − (ΣY)²/N)]
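The deviation-score formula translates directly into code. This sketch uses the five (X, Y) pairs implied by the deviation and total columns of the worked table:

```python
from math import sqrt

X = [1, 3, 5, 5, 6]
Y = [4, 6, 10, 12, 13]

n = len(X)
mx, my = sum(X) / n, sum(Y) / n       # means: 4 and 9

# Sums of deviation products and squared deviations
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))   # 30
sxx = sum((x - mx) ** 2 for x in X)                    # 16
syy = sum((y - my) ** 2 for y in Y)                    # 60

r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.968 - a strong positive correlation
```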
How Can R’s Be Used?

[Scatterplots illustrating R = 1.00, R = .18, R = .85, and R = −.92]

R’s of 1.00 or −1.00 are perfect correlations. The closer R comes to ±1, the more related the X and Y scores are to each other. R-squared is an important statistic that indicates the proportion of the variance of Y that is attributable to the variance of X (for R = .18 and R = .85, R² ≈ .03 and .72).
Degrees of Freedom
• The number of values in a study that are free to vary.
• A data set contains a number of observations, say, n. They constitute n individual pieces
of information. These pieces of information can be used either to estimate parameters or
variability. In general, each item being estimated costs one degree of freedom. The
remaining degrees of freedom are used to estimate variability. All we have to do is count
properly.
• A single sample: There are n observations. There's one parameter (the mean) that needs
to be estimated. That leaves n-1 degrees of freedom for estimating variability.
• Two samples: There are n1+n2 observations. There are two means to be estimated. That
leaves n1+n2-2 degrees of freedom for estimating variability.
Testing for Significant Difference

• Testing for significant difference is a type of inferential statistic.
• One may test difference based on any type of data.
• Determining what type of test to use is based on what type of data are to be tested.
Testing Difference

• Testing difference of gender to favorite form of media
  • Gender: M or F
  • Media: Newspaper, Radio, TV, Internet
  • Data: Nominal
  • Test: Chi Square

• Testing difference of gender to answers on a Likert scale
  • Gender: M or F
  • Likert Scale: 1, 2, 3, 4, 5
  • Data: Interval
  • Test: t-test
What is a Null Hypothesis?

• A type of hypothesis used in statistics that proposes that no statistical significance exists in a set of given observations.
• The null hypothesis attempts to show that no variation exists between variables, or that a single variable is no different than zero.
• It is presumed to be true until statistical evidence nullifies it for an alternative hypothesis.
Examples
• Example 1: Three unrelated groups of people choose what they believe to be the
best color scheme for a given website.
• The null hypothesis is: There is no difference between color scheme choice and
type of group
• Example 2: Males and Females rate their level of satisfaction to a magazine using
a 1-5 scale
• The null hypothesis is: There is no difference between satisfaction level and
gender
Chi Square

A chi square (X²) statistic is used to investigate whether distributions of categorical (i.e., nominal/ordinal) variables differ from one another.

General Notation for a chi square 2x2 Contingency Table

Variable 2    Data Type 1   Data Type 2   Totals
Category 1    a             b             a + b
Category 2    c             d             c + d
Total         a + c         b + d         a + b + c + d

x² = (ad − bc)² (a + b + c + d) / [(a + b)(c + d)(b + d)(a + c)]
Chi square Steps

• Collect observed frequency data
• Calculate expected frequency data
• Determine degrees of freedom
• Calculate the chi square statistic
• If the chi square statistic exceeds the probability or table value (based upon a p-value of x and n degrees of freedom), the null hypothesis should be rejected.
Two questions from a questionnaire…
• Do you like the television program? (Yes or No)
• What is your gender? (Male or Female)
Gender and Choice Preference
H0: There is no difference between gender and choice

Actual Data
          Male   Female   Total
Like      36     14       50   (row total)
Dislike   30     25       55
Total     66     39       105  (grand total)

To find the expected frequencies, assume independence of the rows and columns: multiply the row total by the column total and divide by the grand total.

ef = (rt × ct) / gt, e.g., (50 × 66) / 105 = 31.43
Chi square
Expected Frequencies
          Male    Female   Total
Like      31.43   18.57    50
Dislike   34.57   20.43    55
Total     66      39       105

The number of degrees of freedom is calculated for an x-by-y table as (x − 1)(y − 1), so in this case (2 − 1)(2 − 1) = 1. The degrees of freedom is 1.
Chi square Calculations

O     E       O − E    (O − E)²/E
36    31.43   4.57     0.66
14    18.57   −4.57    1.12
30    34.57   −4.57    0.60
25    20.43   4.57     1.02

Chi square observed statistic = 3.42
Chi square

Probability Level (alpha)

Df 0.5 0.10 0.05 0.02 0.01 0.001


1 0.455 2.706 3.841 5.412 6.635 10.827
2 1.386 4.605 5.991 7.824 9.210 13.815
3 2.366 6.251 7.815 9.837 11.345 16.268
4 3.357 7.779 9.488 11.668 13.277 18.465
5 4.351 9.236 11.070 13.388 15.086 20.51

Chi Square (Observed statistic) = 3.44


Probability Level (df=1 and .05) = 3.841 (Table Value)
So, Chi Square statistic < Probability Level (Table Value)
Accept Null Hypothesis
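As a check, the 2x2 shortcut formula shown earlier can be evaluated directly. A small sketch (computed without rounding the expected frequencies, which is why it gives 3.42 rather than the slides' 3.44):

```python
# Cells of the 2x2 table: a = male/like, b = female/like,
# c = male/dislike, d = female/dislike
a, b, c, d = 36, 14, 30, 25
n = a + b + c + d
chi2 = (a * d - b * c) ** 2 * n / ((a + b) * (c + d) * (b + d) * (a + c))
print(round(chi2, 2))   # → 3.42
# 3.42 < 3.841 (df = 1, alpha = .05 table value), so we fail to reject H0
```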
Results of Chi square Test

There is no significant difference between product choice and gender.


Chi square Test for Independence

• Applies to contingency tables larger than 2x2
• Uses the same process as the 2x2 chi square test
• Indicates whether two categorical variables, each with two or more categories, are independent or dependent.
Two Questions…
• What is your favorite color scheme for the website? (Blue, Red, or
Green)
• There are three groups (Rock music, Country music, jazz music)
Chi Square
H0: Group is independent of color choice
Actual Data

                Blue   Red   Green   Row Total
Rock             11     6      4        21
Jazz             12     7      7        26
Country           7     7     14        28
Column Total     30    20     25        75 (Grand Total)

To find the expected frequencies, assume independence of the rows and columns: multiply the row total by the column total and divide by the grand total.

ef = (rt × ct) / gt    e.g.  (21 × 30) / 75 = 8.4
Chi Square

Expected Frequencies

Blue Red Green Total


Rock 8.4 5.6 7.0 21
Jazz 10.4 6.9 8.7 26
Country 11.2 7.5 9.3 28
Total 30 20 25 75

The number of degrees of freedom for an x-by-y table is (x−1)(y−1), so in this case (3−1)(3−1) = 2 × 2 = 4. The degrees of freedom is 4.
Chi Square Calculations

O     E      O−E      (O−E)²/E
11    8.4     2.6      .805
6     5.6      .4      .029
4     7.0    −3.0     1.286
12   10.4     1.6      .246
7     6.9      .1      .001
7     8.7    −1.7      .332
7    11.2    −4.2     1.575
7     7.5     −.5      .033
14    9.3     4.7     2.375

Chi Square observed statistic = 6.682


Chi Square Calculations, cont.
Probability Level (alpha)

Df 0.5 0.10 0.05 0.02 0.01 0.001


1 0.455 2.706 3.841 5.412 6.635 10.827
2 1.386 4.605 5.991 7.824 9.210 13.815
3 2.366 6.251 7.815 9.837 11.345 16.268
4 3.357 7.779 9.488 11.668 13.277 18.465
5 4.351 9.236 11.070 13.388 15.086 20.51

Chi Square (observed statistic) = 6.682
Table value (df = 4, α = .05) = 9.488

Since the observed chi square statistic < table value, we fail to reject the null hypothesis.
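The same arithmetic for the 3x3 table can be sketched in Python (expected frequencies kept at full precision, so the total differs slightly from the slides' rounded 6.682):

```python
observed = [[11,  6,  4],   # Rock:    Blue, Red, Green
            [12,  7,  7],   # Jazz
            [ 7,  7, 14]]   # Country
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
grand = sum(row_totals)

# Sum of (O - E)^2 / E over all nine cells
chi2 = sum((obs - rt * ct / grand) ** 2 / (rt * ct / grand)
           for rt, row in zip(row_totals, observed)
           for ct, obs in zip(col_totals, row))
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(round(chi2, 2), df)   # → 6.62 4
# 6.62 < 9.488 (df = 4, alpha = .05), so we fail to reject H0
```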
Chi square Test Results

There is no significant difference between group and choice; therefore, group and choice are independent of each other.
The t test

t = (x̄1 − x̄2) / S(x̄1 − x̄2)

where:
x̄1 = mean for group 1
x̄2 = mean for group 2
S(x̄1 − x̄2) = pooled, or combined, standard error of the difference between means

The pooled estimate of the standard error is a better estimate of the standard error than one based on either sample alone.
Uses of the t test
• Assesses whether the mean of a group of scores is statistically different from a known population mean (one-sample t test)
• Assesses whether the means of two groups of scores are statistically different from each other (two-sample t test)
• Cannot be used with more than two samples (use ANOVA instead)
Sample Data
Group 1: x̄1 = 16.5, S1 = 2.1, n1 = 21
Group 2: x̄2 = 12.2, S2 = 2.6, n2 = 14

Null Hypothesis

H0: μ1 = μ2        t = (x̄1 − x̄2) / S(x̄1 − x̄2)
Step 1: Pooled Estimate of the Standard Error

S(x̄1 − x̄2) = √{ [ (n1 − 1)S1² + (n2 − 1)S2² ] / (n1 + n2 − 2) × (1/n1 + 1/n2) }

where:
S1² = variance of group 1        Group 1: x̄1 = 16.5, S1 = 2.1, n1 = 21
S2² = variance of group 2        Group 2: x̄2 = 12.2, S2 = 2.6, n2 = 14
n1 = sample size of group 1
n2 = sample size of group 2

Calculating the Pooled Estimate of the Standard Error

S(x̄1 − x̄2) = √{ [ (20)(2.1)² + (13)(2.6)² ] / 33 × (1/21 + 1/14) } = 0.797
Step 2: Calculate the t-statistic

t = (x̄1 − x̄2) / S(x̄1 − x̄2) = (16.5 − 12.2) / 0.797 = 4.3 / 0.797 = 5.395
Step 3: Calculate Degrees of Freedom
• In a test of two means, the degrees of freedom are calculated as d.f. = n − k
• n = total for both groups 1 and 2 (35)
• k = number of groups (2)
• Therefore, d.f. = 33 (21 + 14 − 2)
• Consult the tabled values of the t-distribution to see whether the observed statistic of 5.395 exceeds the table value given 33 d.f. and a .05 significance level
Step 4: Compare Critical Value to Observed Value

Observed statistic = 5.395

Df      0.10    0.05    0.02    0.01
30      1.697   2.042   2.457   2.750
31      1.696   2.040   2.453   2.744
32      1.694   2.037   2.449   2.738
33      1.692   2.035   2.445   2.733
34      1.691   2.032   2.441   2.728

If the observed statistic exceeds the table value: Reject H0
So What Does Rejecting the Null Tell Us?
Group 1: x̄1 = 16.5, S1 = 2.1, n1 = 21
Group 2: x̄2 = 12.2, S2 = 2.6, n2 = 14

Based on the .05 level of statistical significance, Group 1 scored significantly higher than Group 2.
ANOVA Definition

• In statistics, analysis of variance (ANOVA) is a collection of statistical models, and their associated procedures, in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation.
• In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the t-test to more than two groups.
• Doing multiple two-sample t-tests would result in an increased chance of committing a Type I error. For this reason, ANOVA is useful for comparing two, three, or more means.
Variability is the Key to ANOVA

• Between group variability and within group variability are both components of
the total variability in the combined distributions
• When we compute between and within group variability we partition the total
variability into the two components.
• Therefore: Between variability + Within variability = Total variability
Visual of Between and Within Group Variability

Between group: variability among the means of Groups A, B, and C
Within group: variability among the scores inside each group

Group A   Group B   Group C
a1        b1        c1
a2        b2        c2
a3        b3        c3
a4        b4        c4
...       ...       ...
ax        bx        cx
ANOVA Hypothesis Testing
• Tests hypotheses that involve comparisons of two or more populations
• The overall ANOVA test will indicate if a difference exists between any of the groups
• However, the test will not specify which groups are different
• Therefore, the null hypothesis states that there is no significant difference between any of the groups

H0: μ1 = μ2 = μ3
ANOVA Assumptions

• Random sampling of the source population (cannot test)


• Independent measures within each sample, yielding uncorrelated
response residuals (cannot test)
• Homogeneous variance across all the sampled populations (can test)
• Compute the ratio of the largest to the smallest group variance (the F-max ratio)
• Compare the F-max ratio to the F-max table
• If the F-max ratio exceeds the table value, the variances are not equal
• Response residuals do not deviate from a normal distribution (can test)
• Run a normal test of data by group
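The homogeneity-of-variance check can be illustrated with the three-group data used later in these slides. A sketch (the critical value would still come from an F-max table):

```python
def sample_variance(xs):
    """Unbiased sample variance, dividing by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

groups = [[5, 2, 5, 4, 2], [3, 3, 0, 2, 2], [1, 0, 1, 2, 1]]
variances = [sample_variance(g) for g in groups]
f_max = max(variances) / min(variances)
print(round(f_max, 1))   # → 4.6
# Compare 4.6 to the F-max table value for k = 3 groups and df = n - 1 = 4;
# if it exceeds the table value, the variances are not homogeneous.
```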
ANOVA Computations Table

Source            SS             df      MS             F
Between (Model)   SS(B)          k−1     SS(B)/(k−1)    MS(B)/MS(W)
Within (Error)    SS(W)          N−k     SS(W)/(N−k)
Total             SS(B)+SS(W)    N−1
ANOVA Data

Group 1 Group 2 Group 3


5 3 1
2 3 0
5 0 1
4 2 2
2 2 1
Σx1 = 18     Σx2 = 10     Σx3 = 5
Σx1² = 74    Σx2² = 26    Σx3² = 7
Calculating Total Sum of Squares

SS_T = Σx_T² − (Σx_T)² / N_T

SS_T = 107 − (33)² / 15

SS_T = 107 − 1089/15 = 107 − 72.6 = 34.4
Calculating Sum of Squares Within

SS_W = [Σx1² − (Σx1)²/n1] + [Σx2² − (Σx2)²/n2] + [Σx3² − (Σx3)²/n3]

SS_W = (74 − 324/5) + (26 − 100/5) + (7 − 25/5)

SS_W = (74 − 64.8) + (26 − 20) + (7 − 5)

SS_W = 9.2 + 6 + 2 = 17.2
Calculating Sum of Squares Between

SS_B = (Σx1)²/n1 + (Σx2)²/n2 + (Σx3)²/n3 − (Σx_T)²/N_T

SS_B = (18)²/5 + (10)²/5 + (5)²/5 − (33)²/15

SS_B = 324/5 + 100/5 + 25/5 − 1089/15

SS_B = 64.8 + 20 + 5 − 72.6 = 17.2


Complete the ANOVA Table

Source            SS      df     MS      F
Between (Model)   17.2     2     8.6     6.0
Within (Error)    17.2    12     1.43
Total             34.4    14

If the F statistic is higher than the F table value, reject the null hypothesis.
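The sums of squares and the F ratio above can be verified with a short script. A sketch using the three groups of data:

```python
groups = [[5, 2, 5, 4, 2], [3, 3, 0, 2, 2], [1, 0, 1, 2, 1]]
all_x = [x for g in groups for x in g]
n_total, k = len(all_x), len(groups)

# Sums of squares, following the slide formulas
ss_t = sum(x**2 for x in all_x) - sum(all_x) ** 2 / n_total
ss_w = sum(sum(x**2 for x in g) - sum(g) ** 2 / len(g) for g in groups)
ss_b = ss_t - ss_w

ms_b = ss_b / (k - 1)           # 17.2 / 2  = 8.6
ms_w = ss_w / (n_total - k)     # 17.2 / 12 ≈ 1.43
f = ms_b / ms_w
print(round(ss_t, 1), round(ss_w, 1), round(ss_b, 1), round(f, 1))
# → 34.4 17.2 17.2 6.0
```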
You Are Not Done Yet!!!

• If the ANOVA test determines a difference exists, it will not indicate where the difference is located
• You must run a follow-up test to determine where the differences may be

G1 compared to G2
G1 compared to G3
G2 compared to G3
Running the Tukey Test

• The "Honestly Significantly Different" (HSD) test proposed by the statistician John Tukey is based on what is called the "studentized range distribution."
• To test all pairwise comparisons among means using the Tukey HSD, compute t for each pair of means using the formula:

t_s = (M_i − M_j) / √(MSE / n_h)

where M_i − M_j is the difference between the ith and jth means, MSE is the Mean Square Error, and n_h is the harmonic mean of the sample sizes of groups i and j.
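A sketch of the Tukey statistic for one pair of means, using the MSE from the worked ANOVA above (equal group sizes, so n_h = 5); the resulting t_s would then be compared against a critical value from the studentized range distribution:

```python
import math

mse = 17.2 / 12              # Mean Square Error (within) from the ANOVA table
n_h = 5                      # harmonic mean of the group sizes (all equal to 5)
m1, m3 = 18 / 5, 5 / 5       # means of group 1 (3.6) and group 3 (1.0)

t_s = (m1 - m3) / math.sqrt(mse / n_h)
print(round(t_s, 2))         # → 4.86
```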
Results of the ANOVA and Follow-Up Tests
• If the F-statistic is significant, then the ANOVA indicates a significant difference
• The follow-up test will indicate where the differences are
• You may now state that you reject the null hypothesis and indicate which groups
were significantly different from each other
Regression Analysis
• The description of the nature of the relationship between two or
more variables
• It is concerned with the problem of describing or estimating the value
of the dependent variable on the basis of one or more independent
variables.
Predictive Versus Explanatory Regression Analysis

• Prediction – to develop a model to predict future values of a response variable (Y)


based on its relationships with predictor variables (X’s)
• Explanatory Analysis – to develop an understanding of the relationships between
response variable and predictor variables
Problem Statement
• A regression model will be used to try to explain the relationship between
departmental budget allocations and those variables that could contribute to the
variance in these allocations.

Bud. Alloc. = f(x1, x2, x3, …, xi)


Simple Regression Model

y = a + bx

Slope (b) = [N ΣXY − (ΣX)(ΣY)] / [N ΣX² − (ΣX)²]

Intercept (a) = [ΣY − b(ΣX)] / N

Where:
y = Dependent variable
x = Independent variable
b = Slope of regression line
a = Intercept point of line
N = Number of values
X = First score
Y = Second score
ΣXY = Sum of the products of first and second scores
ΣX = Sum of first scores
ΣY = Sum of second scores
ΣX² = Sum of squared first scores
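The slope and intercept formulas can be applied to a small dataset. A sketch in which the X and Y values are illustrative, not from the slides:

```python
X = [1, 2, 3, 4, 5]          # hypothetical first scores
Y = [2, 4, 5, 4, 5]          # hypothetical second scores
N = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x**2 for x in X)

b = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x**2)   # slope
a = (sum_y - b * sum_x) / N                                  # intercept
print(round(b, 2), round(a, 2))   # → 0.6 2.2
# Predicted value at x = 3: y-hat = a + b*3 = 2.2 + 1.8 = 4.0
```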
Simple regression model
(Figure: scatter of actual values around the fitted line, with slope b, intercept a, and the predicted values lying on the line)

Residuals: r_i = Y_i − Ŷ_i (actual value minus predicted value)
Simple vs. Multiple Regression

Simple: Y = a + bx

Multiple: Y = a + b1X1 + b2 X2 + b3X3…+biXi


Multiple regression model
(Figure: several predictors X1, X2, … each contributing to the response Y)
