Dsbda Unit 2
Statistical Inference
-Ashwini Jarali
Computer Engineering
- The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth
of the data distribution. They are more commonly referred to as
quartiles.
- The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100 equal-sized
consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles.
• [Figure] A plot of the data distribution for some attribute X. The quantiles
plotted are quartiles; the three quartiles divide the distribution into four
equal-sized consecutive subsets, and the second quartile corresponds to the median.
• The distance between the first and third quartiles is a simple measure
of spread that gives the range covered by the middle half of the data.
This distance is called the interquartile range (IQR) and is defined as
IQR = Q3 - Q1
• 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
• The data in the example above contain 12 observations, already sorted in
increasing order.
• Thus, the quartiles for this data are the third, sixth, and ninth values,
respectively, in the sorted list.
• Therefore, Q1= $47,000 and Q3 is $63,000.
• Thus, the interquartile range is IQR= 63 – 47= $16,000.
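As a quick sketch, the same quartiles can be computed in Python. Note that numpy's default percentile method interpolates between values, so the "lower" method is chosen here to match the positional convention used above (the keyword is `method` in NumPy 1.22+, `interpolation` in older versions):

```python
import numpy as np

# The 12 sorted salaries (in $1000s) from the example above
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# The slide takes Q1 and Q3 as the 3rd and 9th sorted values; numpy's
# "lower" method reproduces that convention.
q1, q3 = np.percentile(salaries, [25, 75], method="lower")
print(q1, q3, q3 - q1)  # 47.0 63.0 16.0 -> IQR = $16,000
```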
• The prices (in dollars) for a sample of round-trip flights from Chicago,
Illinois to Cancun, Mexico are listed. What is the mean price of the flights?
• 872 432 397 427 388 782 397
• The sum of the flight prices is 872 + 432 + 397 + 427 + 388 + 782 + 397 = 3695. To find the mean price, divide this sum by the number of prices in the sample: 3695 / 7 ≈ $527.86.
Finding a Weighted Mean
You are taking a class in which your grade is determined from five sources:
50% from your test mean, 15% from your midterm, 20% from your final exam,
10% from your computer lab work, and 5% from your homework. Your scores
are 86 (test mean), 96 (midterm), 82 (final exam), 98 (computer lab), and 100
(homework). What is the weighted mean of your scores? If the minimum
average for an A is 90, did you get an A?
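A worked check of this example, as a minimal sketch:

```python
# Weighted mean: sum(weight * score) / sum(weights); the weights sum to 1 here
scores  = [86, 96, 82, 98, 100]           # test mean, midterm, final, lab, homework
weights = [0.50, 0.15, 0.20, 0.10, 0.05]

weighted_mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
print(weighted_mean)  # 88.6 -> below 90, so the grade is not an A
```

The weighted mean is 88.6, which falls short of the 90 required for an A.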
• A frequency distribution is symmetric when a vertical line can be drawn
through the middle of a graph of the distribution and the resulting halves
are approximately mirror images.
• A frequency distribution is uniform (or rectangular) when all entries, or
classes, in the distribution have equal or approximately equal frequencies.
• A uniform distribution is also symmetric.
• A frequency distribution is skewed if the “tail” of the graph elongates more
to one side than to the other. A distribution is skewed left (negatively
skewed) if its tail extends to the left. A distribution is skewed right
(positively skewed) if its tail extends to the right.
Finding the Range of a Data Set
Two corporations each hired 10 graduates. The starting salaries for each
graduate are shown. Find the range of the starting salaries for Corporation A.
• Variance and Standard Deviation
- Variance and standard deviation are measures of data
dispersion.
-They indicate how spread out a data distribution is.
- A low standard deviation means that the data observations tend
to be very close to the mean,
-while a high standard deviation indicates that the data are
spread out over a large range of values
- The variance of N observations, x1, x2, …, xN, for a numeric
attribute X is σ² = (1/N) Σ (xi − x̄)², where x̄ is the mean of the
observations; the standard deviation σ is the square root of the variance.
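As a quick sketch, using the salary data (in $1000s) from the earlier example:

```python
import numpy as np

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

variance = np.var(x)  # (1/N) * sum((x_i - mean)**2), the formula above
std_dev = np.std(x)   # square root of the variance
print(variance, std_dev)  # ≈ 379.2 and ≈ 19.5
```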
• You have to figure out what your “tests” and “events” are first.
For two events, A and B, Bayes’ theorem allows you to figure out
p(A|B) (the probability that event A happened, given that test B
was positive) from p(B|A) (the probability that test B happened,
given that event A happened).
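• In symbols, Bayes' theorem reads: P(A|B) = P(B|A) · P(A) / P(B).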
• Bayes’ Theorem Example #1
• You might be interested in finding out a patient’s probability of
having liver disease if they are an alcoholic. “Being an alcoholic”
is the test (kind of like a litmus test) for liver disease.
• A could mean the event “Patient has liver disease.” Past data tells
you that 10% of patients entering your clinic have liver disease.
P(A) = 0.10.
• B could mean the litmus test that “Patient is an alcoholic.” Five
percent of the clinic’s patients are alcoholics. P(B) = 0.05.
• You might also know that among those patients diagnosed with
liver disease, 7% are alcoholics. This is your B|A: the probability
that a patient is alcoholic, given that they have liver disease, is
7%.
• Bayes’ theorem tells you:
P(A|B) = (0.07 * 0.1)/0.05 = 0.14
• In other words, if the patient is an alcoholic, their chances of
having liver disease are 0.14 (14%). This is a large increase from
the 10% suggested by past data, but it is still unlikely that any
particular patient has liver disease.
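The same calculation as a short sketch:

```python
p_disease = 0.10     # P(A): patients with liver disease
p_alcoholic = 0.05   # P(B): patients who are alcoholics
p_b_given_a = 0.07   # P(B|A): alcoholics among liver-disease patients

p_a_given_b = p_b_given_a * p_disease / p_alcoholic
print(p_a_given_b)   # ≈ 0.14 (14%)
```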
• Bayesian Spam Filtering
• Although Bayes’ Theorem is used extensively in the medical
sciences, there are other applications. For example, it’s used
to filter spam. The event in this case is that the message is spam.
The test for spam is that the message contains some flagged
words (like “viagra” or “you have won”). Here’s the equation set
up (from Wikipedia), read as “The probability a message is spam
given that it contains certain flagged words”:
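• In the same form: P(spam | flagged words) = P(flagged words | spam) · P(spam) / P(flagged words).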
– Descriptive Analysis
“What is happening now based on incoming data.” It is
a method for quantitatively describing the main features of a collection
of data. Here are a few key points about descriptive analysis:
• Typically, it is the first kind of data analysis performed on a dataset.
• Usually it is applied to large volumes of data, such as census data.
• Description and interpretation processes are different steps.
– Diagnostic Analytics
Diagnostic analytics are used for discovery, or to determine why
something happened.
When done hands-on with a small dataset, this type of analytics is
also known as causal analysis, since it involves at least one cause
(usually more than one) and one effect.
• For example, for a social media marketing campaign, you can
use descriptive analytics to assess the number of posts,
mentions, followers, fans, page views, reviews, or pins, etc.
There can be thousands of online mentions that can be distilled
into a single view to see what worked and what did not work in
your past campaigns.
• There are various types of techniques available for diagnostic or
causal analytics. Among them, one of the most frequently used
is correlation.
• Predictive Analytics
– Predictive analytics has its roots in our ability to predict what
might happen. These analytics are about understanding the future using
the data and the trends we have seen in the past, as well as emerging new
contexts and processes.
– An example is trying to predict how people will spend their tax refunds
based on how consumers normally behave around a given time of the
year (past data and trends), and
how a new tax policy (new context) may affect people’s refunds.
– Predictive analytics provides companies with actionable insights based
on data. Such information includes estimates about the likelihood of a
future outcome. It is important to remember that no statistical algorithm
can “predict” the future with 100% certainty because the foundation of
predictive analytics is based on probabilities.
– Companies use these statistics to forecast what might happen.
– Some of the software most commonly used by data science
professionals for predictive analytics are SAS predictive analytics, IBM
predictive analytics, RapidMiner, and others.
• Let us assume that Salesforce kept campaign data for the last
eight quarters. This data comprises total sales generated by
newspaper, TV, and online ad campaigns and the associated
expenditures, as provided in the table.
With this data, we can predict the sales based on the expenditures of ad
campaigns in different media for Salesforce.
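Since the table itself is not reproduced here, the sketch below uses made-up quarterly figures to show how such a prediction could be set up; the column layout and all numbers are illustrative assumptions, not the actual campaign data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: newspaper, TV, online ad spend ($1000s); one row per quarter (hypothetical)
X = np.array([[30, 120, 50], [25, 150, 60], [35, 110, 80], [20, 140, 70],
              [40, 100, 90], [30, 130, 65], [28, 125, 75], [33, 115, 85]])
y = np.array([310, 360, 390, 340, 400, 370, 380, 410])  # total sales (hypothetical)

model = LinearRegression().fit(X, y)
print(model.predict([[30, 120, 70]]))  # predicted sales for a planned ad budget
```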
• Predictive analytics has a number of common applications.
• For example, many people turn to predictive analytics to
produce their credit scores.
Financial services use such numbers to determine the
probability that a customer will make their credit payments on
time.
Customer relationship management (CRM) is another common
area for predictive analytics. Here, the process
contributes to objectives such as marketing campaigns, sales,
and customer service.
• Predictive analytics applications are also used in the healthcare
field. They can determine which patients are at risk for
developing certain conditions such as diabetes, asthma, and
other chronic or serious illnesses.
• Prescriptive Analytics
– Prescriptive analytics is the area of business analytics dedicated to finding
the best course of action for a given situation. This may start by first
analyzing the situation (using descriptive analysis), but then moves
toward finding connections among various parameters/variables, and their
relation to each other to address a specific problem .
– A process-intensive task, the prescriptive approach analyzes potential
decisions, the interactions between decisions, the influences that bear
upon these decisions, and the bearing all of this has on an outcome to
ultimately prescribe an optimal course of action in real time.
– Prescriptive analytics can also suggest options for taking advantage of a
future opportunity or mitigating a future risk, and can illustrate the
implications of each.
– Specific techniques used in prescriptive analytics include optimization,
simulation, game theory, and decision-analysis methods.
Exploratory Analysis
-Exploratory analysis is an approach to analyzing datasets to find previously
unknown relationships. Often such analysis involves using various data
visualization approaches.
-Exploratory analysis consists of a range of techniques, and its application is
varied as well.
However, the most common application is looking for patterns in the data,
such as finding groups of similar genes from a collection of samples.
-Let us consider the US census data available from the US census website.
This data has dozens of variables. If you are looking for something specific
(e.g., which State has the highest population), you could go with descriptive
analysis. If you are trying to predict something (e.g., which city will have
the lowest influx of immigrant population), you could use prescriptive or
predictive analysis. But if someone gives you this data and asks you to find
interesting insights, then what do you do? You could still do descriptive or
prescriptive analysis, but given that there are lots of variables with massive
amounts of data, it may be futile to do all possible combinations of those
variables. So, you need to go exploring.
Mechanistic Analysis
-Mechanistic analysis involves understanding the exact changes in
variables that lead to changes in other variables for individual objects.
-For instance, we may want to know how the number of free doughnuts
per employee per day affects employee productivity. Perhaps by giving
them one extra doughnut we gain a 5% productivity boost, but two extra
doughnuts could end up making them lazy (and diabetic)
-More seriously, though, think about studying the effects of carbon
emissions on bringing about the Earth’s climate change. Here, we are
interested in seeing how the increased amount of CO2 in the atmosphere
is causing the overall temperature to change.
• Basics and need of hypothesis & hypothesis testing
– Hypothesis testing is a common statistical tool used in research and data
science to support the certainty of findings. The aim of testing is to
determine how likely it is that an apparent effect would arise by
chance in a random data sample.
• What is a hypothesis?
A hypothesis is often described as an “educated guess” about a specific
parameter or population. Once it is defined, one can collect data to determine
whether it provides enough evidence that the hypothesis is true.
Parameters and statistics
In statistics, a parameter is a description of a population,
while a statistic describes a small portion of a population (sample).
For example, if you ask everyone in your class (population) about their average
height, you receive a parameter, a true description about the population since
everyone was asked.
If you now want to guess the average height of people in your grade
(population) using the information you have from your class (sample), this
information turns into a statistic.
• A hypothesis is a calculated prediction or assumption about
a population parameter based on limited evidence. The whole
idea behind hypothesis formulation is testing—this means the
researcher subjects his or her calculated assumption to a series of
evaluations to know whether it is true or false.
• Typically, every research project starts with a hypothesis—the
investigator makes a claim and experiments to prove that this
claim is true or false. For instance, if you predict that students
who drink milk before class perform better than those who don't,
then this becomes a hypothesis that can be confirmed or refuted
using an experiment.
• Hypothesis testing is an assessment method that allows researchers to
determine the plausibility of a hypothesis. It involves testing an
assumption about a specific population parameter to know whether it's true
or false. These population parameters include variance, standard deviation,
and median.
• Typically, hypothesis testing starts with developing a null hypothesis and
then performing several tests that support or reject the null hypothesis. The
researcher uses test statistics to compare the association or relationship
between two or more variables.
• How Hypothesis Testing Works
• The basis of hypothesis testing is to examine and analyze the null hypothesis
and alternative hypothesis to know which one is the most plausible
assumption. Since the two hypotheses are mutually exclusive, only one can be
true: if the null hypothesis holds, the alternative cannot, and vice versa.
What are the Types of Hypotheses?
1. Simple Hypothesis
2. Complex Hypothesis
3. Null Hypothesis
4. Alternative Hypothesis
5. Logical Hypothesis
6. Empirical Hypothesis
7. Statistical Hypothesis
• Five-Step Procedure for Testing a Hypothesis
Step 1: State the Null Hypothesis (H0) and the Alternate
Hypothesis (H1):
• The first step is to state the hypothesis being tested. It is called
the null hypothesis, designated H0, and read “H sub zero.”
The capital letter H stands for hypothesis,and the subscript
zero implies “no difference.” There is usually a “not” or a “no”
term in the null hypothesis, meaning that there is “no
change.”
• For example, the null hypothesis is that the mean number of
miles driven on the steel-belted tire is not different from
60,000. The null hypothesis would be written H0: µ = 60,000.
• Generally speaking, the null hypothesis is developed for the
purpose of testing. We either reject or fail to reject the null
hypothesis. The null hypothesis is a statement that is not
rejected unless our sample data provide convincing evidence
that it is false.
• The alternate hypothesis describes what you will conclude if you reject the
null hypothesis. It is written H1 and is read “H sub one.” It is also referred
to as the research hypothesis. The alternate hypothesis is accepted if the
sample data provide us with enough statistical evidence that the null
hypothesis is false.
• The actual test begins by considering two hypotheses. They are
called the null hypothesis and the alternative hypothesis. These
hypotheses contain opposing viewpoints.
• H0: The null hypothesis: It is a statement of no difference
between sample means or proportions or no difference between
a sample mean or proportion and a population mean or
proportion. In other words, the difference equals 0.
• Ha: The alternative hypothesis: It is a claim about the
population that is contradictory to H0 and what we conclude
when we reject H0.
• The following example will help clarify what is meant by the null
hypothesis and the alternate hypothesis. A recent article indicated the
mean age of U.S. commercial aircraft is 15 years. To conduct a statistical
test regarding this statement, the first step is to determine the null and
the alternate hypotheses.
• The null hypothesis represents the current or reported condition. It is
written H0: µ=15.
• The alternate hypothesis is that the statement is not true, that is, H1: µ≠
15.
• It is important to remember that no matter how the problem is stated,
the null hypothesis will always contain the equal sign. The equal sign (=)
will never appear in the alternate hypothesis. Why? Because the null
hypothesis is the statement being tested, and we need a specific value to
include in our calculations. We turn to the alternate hypothesis only if the
data suggests the null hypothesis is untrue.
• Null & Alternative hypothesis
• The null and alternative hypotheses are the two mutually
exclusive statements about a parameter or population.
• The null hypothesis (often abbreviated as H0) claims that there
is no effect or no difference.
• The alternative hypothesis (often abbreviated as H1 or HA) is
what you want to prove. Using one of the examples from above:
• H0: There is no difference in the mean return from A and B, or
the difference between A and B is zero.
• H1: There is a difference in the mean return from A and B or
the difference between A and B > zero.
H0: equal (=)
Ha: not equal (≠), greater than (>), or less than (<)
● A type I error is the rejection of the null hypothesis when the null hypothesis
is TRUE. The probability of the type I error is denoted by the Greek letter α.
● A type II error is the acceptance of a null hypothesis when the null hypothesis
is FALSE. The probability of the type II error is denoted by the Greek letter
β.
Type I and Type II Error
• Step 3: Select the Test Statistic
• There are many test statistics. Here we use both z and t as test
statistics; later, we will use test statistics such as F and χ², called
chi-square.
[Figure: Left, an example of normally distributed data; right, an example of a non-normal data distribution.]
• Real-world examples
• Hypothesis 1: Average order value has increased since last financial year
-Parameter: Mean order value
-Test type: one-sample, parametric test (assuming the order value follows a
normal distribution)
• Hypothesis 2: Investing in A brings a higher return than investing in B
– Parameter: Difference in mean return
– Test type: two-sample, parametric test, also AB test (assuming the return
follows a normal distribution)
• Hypothesis 3: The new user interface converts more users into customers than
the expected 30%
– Parameter: none
– Test type: one-sample, non-parametric test (assuming number of customers
is not normally distributed)
• One-sample, two-sample, or more-sample test
• When testing hypotheses, we distinguish between one-sample,
two-sample, and more-sample tests.
• In a one-sample test, a sample (average order value this year) is
compared to a known value (average order value of last year).
• In a two-sample test, two samples (investment A and B) are
compared to each other.
• What exactly is a test statistic?
– A test statistic describes how closely the distribution of your data
matches the distribution predicted under the null hypothesis of the
statistical test you are using.
– The distribution of data is how often each observation occurs, and can
be described by its central tendency and variation around that central
tendency. Different statistical tests predict different types of
distributions, so it’s important to choose the right statistical test for your
hypothesis.
– The test statistic summarizes your observed data into a single
number using the central tendency, variation, sample size, and
number of predictor variables in your statistical model.
– Generally, the test statistic is calculated as the pattern in your data
(i.e. the correlation between variables or difference between
groups) divided by the variance in the data (i.e. the standard
deviation).
• What exactly is a test statistic?
For Example You are testing the relationship between temperature
and flowering date for a certain type of apple tree. You use a long-
term data set that tracks temperature and flowering dates from the
past 25 years by randomly sampling 100 trees every year in an
experimental field.
– Null hypothesis: There is no correlation between temperature and
flowering date.
– Alternate hypothesis: There is a correlation between temperature
and flowering date.
To test this hypothesis you perform a regression test, which generates
a t-value as its test statistic. The t-value compares the observed
correlation between these variables to the null hypothesis of zero
correlation.
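A sketch of this test on synthetic stand-in data (the 25-year tree dataset is not reproduced, so the numbers below are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
temperature = rng.normal(15, 2, 100)                           # simulated mean temperatures
flowering_day = 120 - 3 * temperature + rng.normal(0, 5, 100)  # simulated flowering dates

res = stats.linregress(temperature, flowering_day)
t_value = res.slope / res.stderr  # the regression's t statistic vs. zero slope
print(t_value, res.pvalue)        # a tiny p-value -> reject the null hypothesis
```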
• Types of test statistics
Test statistic | Null and alternative hypotheses | Statistical tests that use it
t-value | H0: the means of two groups are equal / Ha: the means differ | t-tests, regression tests
z-value | H0: the sample mean equals the population mean / Ha: they differ | z-tests
F-statistic | H0: all group means are equal / Ha: at least one differs | ANOVA
χ² | H0: the variables are independent / Ha: they are related | chi-squared tests
Our P-value is greater than 0.05 thus we fail to reject the null
hypothesis and don’t have enough evidence to support the
hypothesis that on average, girls score more than 600 in the exam.
• T-test formula
• The formula for the two-sample t-test (a.k.a. the Student’s t-
test) is shown below.
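A common (unpooled) form of the statistic is
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2),
where x̄i, si², and ni are each sample's mean, variance, and size; the classical Student's version instead pools the two sample variances.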
If the sample size is large enough, then the Z test and t-Test will conclude with the same
results. For a large sample size, Sample Variance will be a better estimate of
Population variance so even if population variance is unknown, we can use the Z test
using sample variance.
What is the Z Test?
Z tests are a statistical way of testing a hypothesis when either:
We know the population variance, or
We do not know the population variance but our sample size is large n
≥ 30
If we have a sample size of less than 30 and do not know the
population variance, then we must use a t-test.
One-Sample Z test
We perform the One-Sample Z test when we want to compare a
sample mean with the population mean.
• Here’s an Example to Understand a One Sample Z Test
• Let’s say we need to determine if girls on average score higher than
600 in the exam. We have the information that the standard
deviation for girls’ scores is 100. So, we collect the data of 20 girls
by using random samples and record their marks. Finally, we also
set our α value (significance level) to be 0.05.
In this example:
Mean Score for Girls is 641
The size of the sample is 20
The population mean is 600
Standard Deviation for Population is 100
Since the P-value is less than 0.05, we can reject the null
hypothesis and conclude based on our result that Girls on average scored
higher than 600.
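The arithmetic behind this conclusion, as a sketch:

```python
from math import sqrt
from scipy.stats import norm

x_bar, mu, sigma, n = 641, 600, 100, 20

z = (x_bar - mu) / (sigma / sqrt(n))  # ≈ 1.83
p = norm.sf(z)                        # one-tailed p-value ≈ 0.033
print(z, p)                           # p < 0.05 -> reject the null hypothesis
```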
• Two Sample Z Test
• We perform a Two Sample Z test when we want to compare the
mean of two samples. Here’s an Example to Understand a Two
Sample Z Test
• Here, let’s say we want to know if Girls on average score 10
marks more than the boys. We have the information that the
standard deviation for girls’ Score is 100 and for boys’ score is 90.
Then we collect the data of 20 girls and 20 boys by using random
samples and record their marks. Finally, we also set our α value
(significance level) to be 0.05.
In this example:
Mean Score for Girls (Sample Mean) is 641
Mean Score for Boys (Sample Mean) is 613.3
Standard Deviation for the Population of Girls’ is 100
Standard deviation for the Population of Boys’ is 90
Sample Size is 20 for both Girls and Boys
Difference between Mean of Population is 10
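A sketch of the corresponding calculation, with the hypothesized difference of 10 subtracted in the numerator:

```python
from math import sqrt
from scipy.stats import norm

x_girls, x_boys = 641, 613.3
sd_girls, sd_boys = 100, 90
n = 20
d0 = 10  # hypothesized difference under the null hypothesis

z = ((x_girls - x_boys) - d0) / sqrt(sd_girls**2 / n + sd_boys**2 / n)  # ≈ 0.59
p = norm.sf(z)  # one-tailed p-value ≈ 0.28
print(z, p)     # p > 0.05 -> fail to reject the null hypothesis
```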
• The subscript “c” is the degrees of freedom. “O” is your observed value and E is
your expected value. It’s very rare that you’ll want to actually use this formula to
find a critical chi-square value by hand. The summation symbol means that
you’ll have to perform a calculation for every single data item in your data set.
As you can probably imagine, the calculations can get very, very, lengthy and
tedious. Instead, you’ll probably want to use technology:
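For reference, the statistic being described is χ²_c = Σ (O_i − E_i)² / E_i, with the sum taken over every data item.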
• A chi-square statistic is one way to show a relationship between
two categorical variables. In statistics, there are two types of
variables: numerical (countable) variables and non-numerical (categorical)
variables. The chi-squared statistic is a single number that tells you how
much difference exists between your observed counts and the counts you
would expect if there were no relationship at all in the population.
• There are a few variations on the chi-square statistic. Which one you use
depends upon how you collected the data and which hypothesis is being
tested. However, all of the variations use the same idea, which is that you are
comparing your expected values with the values you actually collect. One of
the most common forms can be used for contingency tables:
• Where O is the observed value, E is the expected value and “i” is the “ith”
position in the contingency table.
• Chi Square P-Values.
– A chi square test will give you a p-value. The p-value will tell you if your test results
are significant or not. In order to perform a chi square test and get the p-value, you need
two pieces of information:
– Degrees of freedom. That’s just the number of categories minus 1.
– The alpha level(α). This is chosen by you, or the researcher. The usual alpha level is 0.05
(5%), but you could also have other levels like 0.01 or 0.10.
A chi-square test for independence
• Example: a scientist wants to know if education level and marital status are related
for all people in some country. He collects data on a simple random sample of n =
300 people, part of which are shown below.
• Chi-Square Test - Observed Frequencies
• A good first step for these data is inspecting the contingency table of marital status by
education. Such a table -shown below- displays the frequency distribution of marital
status for each education category separately. So let's take a look at it.
• Chi-Square Test - Column Percentages
• Although our contingency table is a great starting point, it doesn't really show
us if education level and marital status are related. This question is answered
more easily from a slightly different table as shown below.
• This table shows -for each education level separately- the percentages of
respondents that fall into each marital status category. Before reading on, take
a careful look at this table and tell me is marital status related to education
level and -if so- how?
Marital status is clearly associated with education level.The lower
someone’s education, the smaller the chance he’s married. That is:
education “says something” about marital status (and reversely) in
our sample. So what about the population?
• Chi-Square Test - Null Hypothesis
The null hypothesis for a chi-square independence test is that two categorical
variables are independent in some population.
Chi-Square Test - Statistical Independence
independence means that one variable doesn't “say anything” about another
variable.
A different way of saying the exact same thing is that
independence means that the relative frequencies of one variable are identical
over all levels of some other variable.
Expected Frequencies
• Expected frequencies are the frequencies we expect in our sample
if the null hypothesis holds.
• If education and marital status are independent in our population, then we
expect this in our sample too. This implies the contingency table -holding
expected frequencies- shown below.
These expected frequencies are calculated as
e_ij = (o_i × o_j) / N
where
e_ij is an expected frequency;
o_i is a marginal column frequency;
o_j is a marginal row frequency;
N is the total sample size.
So for our first cell, that'll be e = (39 × 90) / 300 = 11.7.
• Test Statistic
• The chi-square test statistic is calculated as
χ² = Σ (o_ij − e_ij)² / e_ij,
where the sum runs over every cell of the contingency table.
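As a sketch of how this is computed in practice (the marital-status table itself is not reproduced above, so the counts below are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies: rows = education levels,
# columns = marital status categories (made-up numbers)
observed = np.array([[60, 30, 10],
                     [40, 50, 60]])

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)  # reject independence if p < alpha
print(expected)      # each e_ij = (row total * column total) / N
```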
In simple and general terms, the ANOVA test is used to identify which process, among all the other
processes, is better. The fundamental concept behind the Analysis of Variance is the “Linear Model”.
Example of ANOVA
An example to understand this is prescribing medicines: suppose patients are given three different
medicines that have the same functionality, i.e., to cure fever.
To understand the effectiveness of each medicine and choose the best among them, the ANOVA test is used.
You may wonder whether a t-test could be used instead of the ANOVA test. It could, but since t-tests
compare only two things at a time, you would have to run multiple t-tests to come up with an outcome,
which is not the case with the ANOVA test.
That is why the ANOVA test is also reckoned as an extension of the t-test and z-test.
Terminologies in ANOVA Test
There are a few terms that we continuously encounter while performing the ANOVA
test. We have listed and explained them below:
1. Mean
As we know, a mean is defined as an arithmetic average of a given range of values. In the ANOVA test, there
are two types of mean that are calculated: the grand mean and the sample mean.
A sample mean (μn) represents the average value for a group, while the grand mean (μ) represents the average
of the sample means of the different groups, or the mean of all the observations combined.
2. F-Statistic
The statistic which measures the extent of difference between the means of different samples or how
significantly the means differ is called the F-statistic or F-Ratio. It gives us a ratio of the effect we are
measuring (in the numerator) and the variation associated with the effect (in the denominator).
Since we use variances to explain both the measure of the effect and the measure of the error, F is
more of a ratio of variances. The value of F can never be negative.
• When the value of F exceeds 1 it means that the variance due to the effect is larger than the
variance associated with sampling error; we can represent it as:
• When F>1, variation due to the effect > variation due to error
• If F<1, it means variation due to effect < variation due to error
• When F = 1 it means variation due to effect = variation due to error. This situation is not so
favorable
3. Sums of Squares
In statistics, the sum of squares is defined as a statistical technique that is used in regression
analysis to determine the dispersion of data points. In the ANOVA test, it is used while computing
the value of F.
As the sum of squares tells you about the deviation from the mean, it is also known as variation.
4. Degrees of Freedom
Degrees of freedom refers to the maximum number of logically independent values that are free to vary
in a data set.
5. Mean Squared Error (MSE)
The Mean Squared Error tells us about the average error in a data set. To find the mean squared
error, we just divide the sum of squares by the degrees of freedom.
6. Hypothesis
In the ANOVA test, we use a Null Hypothesis (H0) and an Alternate Hypothesis (H1). The Null Hypothesis
in ANOVA is valid when the sample means are equal or have no significant difference.
The Alternate Hypothesis is valid when at least one of the sample means is different from the others.
7. Group Variability (Within-group and Between-group)
To understand group variability, we should know about groups first. In the ANOVA test, a group is the
set of samples within the independent variable.
There are variations among the individual groups as well as within the group. This gives rise to the two
terms: Within-group variability and Between-group variability.
• When there is a big variation in the sample distributions of the individual groups, it is called
between-group variability.
• On the other hand, when there are variations in the sample distribution within an individual group,
it is called Within-group variability.
Types of ANOVA Test
The ANOVA test is generally done in three ways depending on the number of Independent Variables
(IVs) included in the test. Sometimes the test includes one IV, sometimes it has two IVs, and sometimes
the test may include multiple IVs.
1. One-Way ANOVA
2. Two-Way ANOVA
3. N-Way ANOVA (MANOVA)
One-Way ANOVA
One-way ANOVA is generally the most used method of performing the ANOVA test. It is also referred
to as one-factor ANOVA, between-subjects ANOVA, and an independent factor ANOVA. It is used to
compare the means of two or more independent groups using the F-distribution.
To carry out the one-way ANOVA test, you should have exactly one independent variable
with at least two levels. With only two levels, one-way ANOVA does not differ much from a t-test.
Example where one-way ANOVA is used: Suppose a teacher wants to know how well he has been
teaching the students. He can split the students of the class into different groups and assign
different projects related to the topics taught to them.
He can then use one-way ANOVA to compare the average score of each group and get a rough
understanding of which topics to teach again. However, he won't be able to identify the individual
students who did not understand the topic.
Two-way ANOVA
Two-way ANOVA is carried out when you have two independent variables. It is an extension of one-
way ANOVA. You can use the two-way ANOVA test when your experiment has a quantitative outcome
and there are two independent variables.
Two-way ANOVA with replication: It is performed when there are two groups and the
members of these groups are doing more than one thing. Our example in the beginning can be a good
example of two-way ANOVA with replication.
Two-way ANOVA without replication: This is used when you have only one group but you are
double-testing that group. For example, a patient is being observed before and after medication.
When we have more than two independent variables, we use MANOVA. The main purpose
of the MANOVA test is to find out the effect on dependent/response variables against a change in the IV.
• Does the change in the independent variable significantly affect the dependent variable?
• What are interactions among the dependent variables?
• What are interactions between independent variables?
The one way ANOVA test is used to determine whether there is any difference between the means of three or
more groups. A one way ANOVA will have only one independent variable. The hypothesis for a one way
ANOVA test can be set up as follows:
Null Hypothesis, H0: μ1 = μ2 = μ3 = ... = μk
Alternative Hypothesis, H1: The means are not equal
Decision Rule: If the test statistic > critical value, then reject the null hypothesis and conclude that the
means of at least two groups differ significantly.
The steps to perform the one way ANOVA test are given below:
○ Step 1: Calculate the mean for each group.
○ Step 2: Calculate the total mean. This is done by adding all the means and dividing it by the total number
of means.
○ Step 3: Calculate the SSB.
○ Step 4: Calculate the between groups degrees of freedom.
○ Step 5: Calculate the SSE.
○ Step 6: Calculate the degrees of freedom of errors.
○ Step 7: Determine the MSB and the MSE.
○ Step 8: Find the f test statistic.
○ Step 9: Using the f table for the specified level of significance, α, find the critical value. This is given by
F(α, df1, df2).
○ Step 10: If f > F then reject the null hypothesis.
● Examples on ANOVA Test
● Example 1: Three types of fertilizers are used on three groups of plants for 5 weeks. We want to
check if there is a difference in the mean growth of each group. Using the data given below apply a
one way ANOVA test at 0.05 significance level.
Fertilizer 1 Fertilizer 2 Fertilizer 3
6 8 13
8 12 9
4 9 11
5 11 8
3 6 7
4 8 12
Solution:
Group means: x̄1 = 30/6 = 5, x̄2 = 54/6 = 9, x̄3 = 60/6 = 10; grand mean = 144/18 = 8.
SSB = 6(5 − 8)² + 6(9 − 8)² + 6(10 − 8)² = 54 + 6 + 24 = 84, so df1 = k − 1 = 2.
Squared deviations of each observation from its group mean:
Fertilizer 1 (mean 5): 1, 9, 1, 0, 4, 1 → sum = 16
Fertilizer 2 (mean 9): 1, 9, 0, 4, 9, 1 → sum = 24
Fertilizer 3 (mean 10): 9, 1, 1, 4, 9, 4 → sum = 28
SSE = 16 + 24 + 28 = 68
N = 18
df2 = N - k = 18 - 3 = 15
MSB = SSB / df1 = 84 / 2 = 42
MSE = SSE / df2 = 68 / 15 = 4.53
ANOVA test statistic, f = MSB / MSE = 42 / 4.53 ≈ 9.26
Using the f table at α= 0.05 the critical value is given as F(0.05, 2, 15) = 3.68
As f > F, thus, the null hypothesis is rejected and it can be concluded that there is a
difference in the mean growth of the plants.
Answer: Reject the null hypothesis
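As a quick cross-check, scipy's one-way ANOVA reproduces these figures (a minimal sketch):

```python
from scipy.stats import f_oneway, f

fert1 = [6, 8, 4, 5, 3, 4]
fert2 = [8, 12, 9, 11, 6, 8]
fert3 = [13, 9, 11, 8, 7, 12]

stat, p = f_oneway(fert1, fert2, fert3)
print(stat)                        # ≈ 9.26, matching MSB / MSE above
print(f.ppf(0.95, dfn=2, dfd=15))  # critical value ≈ 3.68
print(p < 0.05)                    # True -> reject the null hypothesis
```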
• Pearson Correlation
• Pearson’s correlation coefficient is the test statistics that measures the
statistical relationship, or association, between two continuous variables. It is
known as the best method of measuring the association between variables of
interest because it is based on the method of covariance. It gives information
about the magnitude of the association, or correlation, as well as the direction
of the relationship.
• Questions Answered:
• Do test scores and hours spent studying have a statistically significant
relationship?
• Is there a statistical association between IQ scores and depression?
• Assumptions:
• Independent of case: Cases should be independent to each other.
• Linear relationship: Two variables should be linearly related to each other.
This can be assessed with a scatterplot: plot the value of variables on a scatter
diagram, and check if the plot yields a relatively straight line.
• Homoscedasticity: the residuals scatterplot should be roughly rectangular-
shaped.
• Properties:
• Limit: Coefficient values can range from +1 to −1, where +1 indicates a perfect
positive relationship, −1 indicates a perfect negative relationship, and 0 indicates
that no relationship exists.
• Pure number: It is independent of the unit of measurement. For example, if one
variable’s unit of measurement is in inches and the second variable is in quintals, even
then, Pearson’s correlation coefficient value does not change.
• Symmetric: The correlation coefficient between two variables is symmetric. This
means that between X and Y or Y and X, the coefficient value will remain the same.
• Degree of correlation:
• Perfect: If the value is near ±1, it is said to be a perfect correlation: as one variable
increases, the other variable tends to also increase (if positive) or decrease (if
negative).
• High degree: If the coefficient value lies between ± 0.50 and ± 1, then it is said to be a
strong correlation.
• Moderate degree: If the value lies between ± 0.30 and ± 0.49, then it is said to be a
medium correlation.
• Low degree: When the value lies below ±0.29, it is said to be a small correlation.
• No correlation: When the value is zero.
• Correlation coefficients are used to measure how strong a relationship is
between two variables. There are several types of correlation coefficient, but
the most popular is Pearson’s. Pearson’s correlation (also called
Pearson’s R) is a correlation coefficient commonly used in linear
regression. If you’re starting out in statistics, you’ll probably learn about
Pearson’s R first. In fact, when anyone refers to the correlation coefficient,
they are usually talking about Pearson’s.
• Correlation Coefficient Formula: Definition
• Correlation coefficient formulas are used to find how strong a relationship is
between data. The formulas return a value between -1 and 1, where:
– 1 indicates a strong positive relationship.
– -1 indicates a strong negative relationship.
– A result of zero indicates no relationship at all.
• Types of correlation coefficient formulas.
• There are several types of correlation coefficient formulas.
– One of the most commonly used formulas is Pearson’s correlation coefficient formula. If you’re
taking a basic stats class, this is the one you’ll probably use:
– Two other formulas are commonly used: the sample correlation coefficient and the population
correlation coefficient.
– Sx and sy are the sample standard deviations, and sxy is the sample covariance.
– The population correlation coefficient uses σx and σy as the population standard deviations, and
σxy as the population covariance.
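• Written out, the formulas referred to above are:
– Pearson's r (computational form): r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])
– Sample correlation coefficient: r = sxy / (sx · sy)
– Population correlation coefficient: ρ = σxy / (σx · σy)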
• What is Pearson Correlation?
• Correlation between sets of data is a measure of how well they are related.
The most common measure of correlation in stats is the Pearson
Correlation. The full name is the Pearson Product Moment Correlation
(PPMC). It shows the linear relationship between two sets of data. In
simple terms, it answers the question, Can I draw a line graph to represent
the data? Two letters are used to represent the Pearson correlation: Greek
letter rho (ρ) for a population and the letter “r” for a sample.
• Potential problems with Pearson correlation.
• The PPMC is not able to tell the difference between dependent
variables and independent variables. For example, if you are trying to find
the correlation between a high calorie diet and diabetes, you might find a
high correlation of .8. However, you could also get the same result with the
variables switched around. In other words, you could say that diabetes
causes a high calorie diet. That obviously makes no sense. Therefore, as a
researcher you have to be aware of the data you are plugging in. In
addition, the PPMC will not give you any information about the slope of
the line; it only tells you whether there is a relationship.
Example question: Find the value of the correlation coefficient from the
following table:
Subject | Age (X) | Glucose Level (Y) | XY | X² | Y²
• The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or 52.98%, which means
the variables have a moderate positive correlation.
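As a sketch: the table's rows are not shown above, but the widely circulated version of this exercise uses the six subjects below, which reproduce the quoted r ≈ 0.5298; treat the numbers as illustrative.

```python
from scipy.stats import pearsonr

# Age (X) and glucose level (Y) for six subjects (illustrative values)
age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r, p = pearsonr(age, glucose)
print(r)  # ≈ 0.5298: a moderate positive correlation
```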
More Problems
Type I error occurs when the null hypothesis is incorrectly rejected when it is true, often controlled by the significance level (α), while a type II error happens when the null hypothesis is not rejected when it is false, linked to the test's power (1−β). Researchers must balance these errors by selecting an appropriate significance level and ensuring sufficient sample size. Lowering the α value reduces type I error but increases type II error, whereas increasing sample size can reduce both types of errors, thereby improving the test's power and reliability.
Hypothesis testing involves a five-step procedure:
1. **State the Null and Alternative Hypotheses:** The null hypothesis (H0) represents a statement of no effect or no difference, usually containing an equality sign, while the alternative hypothesis (H1) represents what you aim to prove. This step is crucial as it defines the claim being tested and forms the foundation for the testing process.
2. **Select the Level of Significance (α):** This is the probability of making a Type I error, which occurs when the null hypothesis is rejected when it is true. The choice of α (commonly 0.05) influences the hypothesis test's sensitivity.
3. **Determine the Test Statistic and Sampling Distribution:** Depending on the type of data and the hypothesis, an appropriate test statistic (like t, z, or chi-square) is calculated. This step is important to ascertain the correct statistical methodology to apply.
4. **Calculate and Compare with Critical Values or Compute the p-value:** Using the collected data, calculate the test statistic and compare it to critical values to determine if it falls within the rejection region. Alternatively, the p-value approach can be used, where the null hypothesis is rejected if the p-value is less than α. This step is crucial for making a decision about the hypotheses.
5. **Make a Decision:** Based on the comparison, the null hypothesis is either rejected or not rejected. This decision helps in concluding whether there is enough evidence to support the alternative hypothesis.
Each step is vital as it ensures that the test is conducted rigorously, minimizing errors and ensuring valid and reliable conclusions.
The main purpose of hypothesis testing is to assess the plausibility of a hypothesis by determining whether there is sufficient statistical evidence to support or refute an assumption about a population parameter. Researchers conduct this process by formulating two mutually exclusive hypotheses: the null hypothesis (H0), indicating no effect or difference, and the alternative hypothesis (H1), which posits the presence of an effect or difference. They then perform a series of tests, using test statistics to analyze the data collected from samples. This involves selecting a level of significance (α) to control the risk of type I errors, computing test statistics, and comparing these statistics against critical values or calculating p-values to decide whether to reject or fail to reject the null hypothesis. This systematic approach enables researchers to make informed conclusions about the validity of their hypotheses.
The choice of significance level (alpha) in a hypothesis test determines the threshold for rejecting the null hypothesis and affects the likelihood of making a Type I error, which is rejecting the null hypothesis when it is true. A lower alpha level (e.g., 0.01) reduces the probability of a Type I error but increases the chance of a Type II error, where the null hypothesis is erroneously accepted when it is false. Conversely, a higher alpha level (e.g., 0.10) increases the risk of a Type I error but reduces the risk of a Type II error. The significance level also defines the critical value for the test, which separates the rejection and non-rejection regions: if the test statistic falls in the rejection region, the null hypothesis is rejected. Researchers typically choose alpha levels like 0.05, balancing the risks of these errors and impacting the robustness of their conclusions.
A one-tailed test evaluates a hypothesis where the effect is expected to be in one direction, either greater than or less than a certain value. It is used when the research question specifies a direction, such as testing if one population mean is greater than another. For instance, if you want to test whether the mean production from a new method is more than 200 units, a one-tailed test is appropriate. Conversely, a two-tailed test checks for any significant difference in either direction, meaning it tests if one value is either greater or less than another without specifying the direction. This test is suitable when you want to know if there is any difference at all, such as testing if two population means are not equal.
The critical value in hypothesis testing is a threshold that separates the rejection region from the nonrejection region of a test statistic's probability distribution. It is determined by the level of significance chosen by the researcher, which reflects the probability of making a Type I error, or rejecting the null hypothesis when it is actually true. If the computed test statistic exceeds the critical value, the null hypothesis is rejected, indicating that the observed data is unlikely under the null hypothesis. Conversely, if the test statistic is within the nonrejection region, the null hypothesis is not rejected, suggesting that any observed effect could be due to sampling error or chance. Thus, critical values play a central role in decision-making in hypothesis testing by providing a benchmark for determining statistical significance.
A paired t-test is preferred when comparing two related samples, such as measurements taken on the same subjects before and after a treatment. It is used when the samples come from a single population, like measuring changes before and after an experimental treatment. The assumptions for a paired t-test are: the differences between pairs are approximately normally distributed, the pairs are related (not independent), and each pair consists of one observation from each condition.
The t-test and z-test differ primarily in the assumptions they make about sample size and population variance. A t-test is used when comparing the means of two groups, typically with smaller sample sizes (n < 30) or when the population variance is unknown, relying instead on the sample standard deviation. Conversely, a z-test is applicable when the sample size is large (n ≥ 30) and the population variance is known, as it assumes the distribution of the test statistic can be approximated by a normal distribution. Additionally, the t-test is appropriate for data that meet assumptions of independence, normality, and homogeneity of variance, whereas the z-test is used when the sample variance can better estimate the population variance in large samples. The choice of test depends on these considerations of sample size, variance knowledge, and adherence to test assumptions.
The p-value approach in hypothesis testing involves calculating the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. The p-value acts as an observed level of significance. Researchers compare the p-value with the predetermined significance level α to decide whether to reject the null hypothesis. If the p-value is less than α, it indicates that the observed data is unlikely under the null hypothesis, leading to its rejection. This approach provides a measure of the strength of evidence against the null hypothesis, aiding in decision-making.
The significance level, denoted as alpha (α), is crucial in hypothesis testing as it represents the maximum probability of making a type I error, which is rejecting a true null hypothesis. It sets a threshold for determining whether the observed data are sufficient to reject the null hypothesis. The choice of α affects the size of the rejection region and the power of the test. Common values for α are 0.01, 0.05, and 0.10, with 0.05 being the most traditional choice, balancing the risk of type I error with test sensitivity.