Inferential Statistics and Data Analytics Guide
INFERENTIAL STATISTICS
Populations – samples – random sampling – Sampling distribution- standard error of the mean - Hypothesis testing –
z-test – z-test procedure –decision rule – calculations – decisions – interpretations - one-tailed and two-tailed tests –
Estimation – point estimate – confidence interval – level of confidence – effect of sample size.
Data Analytics:
● Analytics is defined as “the scientific process of transforming data into insights for making better
decisions”
● Analytics is the use of data, information technology, statistical analysis, quantitative methods, and
mathematical or computer-based models to help managers gain improved insight about their business
operations and make better, fact-based decisions – James Evans
Opportunity abounds for the use of analytics and big data across many business settings.
Based on the phase of workflow and the kind of analysis required, there are four major types of data analytics.
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
1. Descriptive Analytics:
● Descriptive Analytics is the conventional form of Business Intelligence and data analysis.
● It seeks to provide a depiction or “summary view” of facts and figures in an understandable format.
● It either informs directly or prepares the data for further analysis.
● Descriptive analysis or statistics can summarize raw data and convert it into a form that can be easily
understood by humans.
● It can describe in detail an event that has occurred in the past.
2. Diagnostic analytics:
● Diagnostic Analytics is a form of advanced analytics which examines data or content to answer the
question “Why did it happen?”.
● Diagnostic analytical tools aid an analyst to dig deeper into an issue so that they can arrive at the source of
a problem.
● In a structured business environment, tools for descriptive and diagnostic analytics are typically used in parallel.
3. Predictive analytics:
● Predictive analytics helps to forecast trends based on current events.
● The probability of an event happening in the future, or an estimate of when it will happen, can be
determined with the help of predictive analytical models.
● Many different but co-dependent variables are analyzed to predict a trend in this type of analysis.
4. Prescriptive analytics:
● Set of techniques to indicate the best course of action
● It tells what decision to make to optimize the outcome
● The goal of prescriptive analytics is to enable:
1. Quality improvements
2. Service enhancements
3. Cost reductions
4. Increasing productivity
Descriptive Statistics:
Descriptive statistics can be used to summarize and describe a single variable (see the sketch after the following list).
• Frequencies (counts) & Percentages
– Use with categorical (nominal) data
• Levels, types, groupings, yes/no, Drug A vs. Drug B
• Means & Standard Deviations
– Use with continuous (interval/ratio) data
• Height, weight, cholesterol, scores on a test
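As a rough illustration, the pandas sketch below (with made-up values and hypothetical column names) computes counts and percentages for a categorical variable and the mean and standard deviation for a continuous one.

```python
import pandas as pd

# Hypothetical data: 'drug' is categorical (nominal), 'cholesterol' is continuous (ratio).
df = pd.DataFrame({
    "drug": ["A", "A", "B", "B", "B", "A"],
    "cholesterol": [210, 195, 180, 220, 205, 190],
})

# Frequencies (counts) and percentages for the categorical variable.
counts = df["drug"].value_counts()
percentages = df["drug"].value_counts(normalize=True) * 100

# Mean and standard deviation for the continuous variable.
mean_chol = df["cholesterol"].mean()
sd_chol = df["cholesterol"].std()  # sample standard deviation (ddof=1)

print(counts)
print(percentages.round(1))
print(f"mean = {mean_chol:.1f}, sd = {sd_chol:.1f}")
```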
Inferential Statistics:
Inferential statistics can be used to test theories, determine associations between variables,
and determine whether findings are significant and whether we can generalize from our sample to the entire
population. In other words, inferential statistics are used to draw conclusions about a population by examining a
sample. The accuracy of the inference depends on how representative the sample is of the population.
Random selection, which gives every member an equal chance of being selected, makes the sample more representative.
1.Populations:
● Any complete set of observations (or potential observations) may be characterized as a population. A
population can also be defined as including all people or items with the characteristic one wishes to
understand.
Real Population:
● A real population is one in which all potential observations are accessible at the time of sampling.
Hypothetical Population:
● A hypothetical population is one in which all potential observations are not accessible at the time of
sampling.
Examples: All likely voters in the next election
All parts produced today
All sales receipts for November
2.Sample:
Any subset of observations from a population may be characterized as a sample. A sample is “a smaller
(but hopefully representative) collection of units from a population used to determine truths about that population”.
Examples: 1000 voters selected at random for interview
A few parts selected for destructive testing
Random receipts selected for audit
3.Random Sampling:
Random sampling is the selection process that guarantees all potential observations in the population have
an equal chance of being included in the sample.
It’s important to note that randomness describes the selection process—that is, the conditions under which
the sample is taken—and not the particular pattern of observations in the sample.
Types and examples of random sampling techniques.
There are four main types of random sampling techniques (a short code sketch follows this list) –
● Simple Random Sampling technique – In this technique, a sample is chosen using randomly
generated numbers. A sampling frame listing the members of the population is required; its size is
denoted by ‘N’. Using Excel, for example, one can generate a random number for each element and then
select the required elements.
● Systematic Random Sampling technique – This technique is very common and easy to use in statistics. In
this technique, every kth element is sampled: one element is chosen, and subsequent elements are taken
after skipping a pre-defined number of elements. In a sampling frame, divide the size of the
frame N by the sample size n to get the interval ‘k’, then pick every kth element to create your
sample.
● Cluster Random Sampling technique -In this technique, the population is divided into clusters or groups
in such a way that each cluster represents the population. After that, you can randomly select clusters to
sample.
● Stratified Random Sampling technique – In this technique, the population is divided into groups that
have similar characteristics. Then a random sample can be taken from each group to ensure that
different segments are represented equally within a population.
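The following Python sketch illustrates three of these techniques on a small, hypothetical sampling frame (the frame, column names, and sizes are invented for illustration); cluster sampling would follow the same pattern by randomly selecting whole groups instead of individual members.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical sampling frame of N = 20 members, each tagged with a stratum.
frame = pd.DataFrame({
    "member_id": range(1, 21),
    "stratum": ["urban"] * 12 + ["rural"] * 8,
})
N, n = len(frame), 5

# Simple random sampling: every member has an equal chance of selection.
simple = frame.sample(n=n, random_state=1)

# Systematic sampling: take every k-th member after a random start, with k = N // n.
k = N // n
start = int(rng.integers(0, k))
systematic = frame.iloc[start::k]

# Stratified sampling: draw the same fraction from each stratum so all segments are represented.
stratified = frame.groupby("stratum").sample(frac=n / N, random_state=1)

print(simple, systematic, stratified, sep="\n\n")
```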
Conditional Probability
To obtain the probability that two dependent events occur together, the probability of the second event must
be adjusted to reflect its dependency on the prior occurrence of the first event. This new probability is the
conditional probability of the second event, given the first event.
It can also be defined as the probability of one event, given the occurrence of another event.
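As a small worked illustration (not from the source text), consider drawing two cards from a standard deck without replacement; the probability of an ace on the second draw must be conditioned on the outcome of the first draw.

```python
from fractions import Fraction

# Two dependent events: drawing two cards from a standard deck without replacement.
# The probability for the second draw must be adjusted for the outcome of the first.
p_first_ace = Fraction(4, 52)                # P(ace on the first draw)
p_second_ace_given_first = Fraction(3, 51)   # P(ace on the second draw | ace on the first)

# Multiplication rule for dependent events: P(A and B) = P(A) * P(B | A)
p_both_aces = p_first_ace * p_second_ace_given_first
print(p_both_aces, float(p_both_aces))       # 1/221, roughly 0.0045
```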
4.SAMPLING DISTRIBUTION:
● A sampling distribution is the probability distribution of a statistic (such as the mean) computed from
samples of a given size drawn repeatedly from a large population.
● Its primary purpose is to establish representative results of small samples of a comparatively larger
population. Since the population is too large to analyze, the smaller group is selected and repeatedly
sampled or analyzed.
● The gathered data, or statistics, is used to calculate the likely occurrence, or probability, of an event.
● Using a sampling distribution simplifies the process of making inferences, or conclusions, about large
amounts of data.
The mean of the sampling distribution of the mean equals the population mean:
µx̄ = µ
where µx̄ represents the mean of the sampling distribution and µ represents the mean of the population.
The standard error of the mean equals the population standard deviation divided by the square root of the sample size:
σx̄ = σ/√n
where σx̄ represents the standard error of the mean, σ represents the standard deviation of the population, and n
represents the sample size.
Problem: Imagine a very simple population consisting of only four observations: 18, 20, 22, 24.
(a) List all possible samples of size two.
(b) Construct a relative frequency table showing the sampling distribution of the mean.
Solution: the list of samples and the relative frequency table can be constructed as in the sketch below.
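A minimal Python sketch of parts (a) and (b), assuming the four listed values and sampling with replacement (a common textbook convention for such mini-populations); it also verifies that the mean of the sampling distribution equals the population mean and that its standard deviation (the standard error) equals σ/√n.

```python
from itertools import product
from collections import Counter
from statistics import mean, pstdev

population = [18, 20, 22, 24]

# (a) All possible samples of size two, assuming sampling with replacement (4 x 4 = 16 samples).
samples = list(product(population, repeat=2))

# (b) Sampling distribution of the mean as a relative frequency table.
sample_means = [mean(s) for s in samples]
rel_freq = {m: c / len(sample_means) for m, c in sorted(Counter(sample_means).items())}
for m, f in rel_freq.items():
    print(f"mean = {m:>4}  relative frequency = {f:.4f}")

# Check: the mean of the sampling distribution equals the population mean (21),
# and its standard deviation (the standard error) equals sigma / sqrt(n).
print(mean(sample_means), mean(population))
print(pstdev(sample_means), pstdev(population) / 2 ** 0.5)
```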
6.Hypothesis testing:
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about
a population parameter or a population probability distribution. First, a tentative assumption is made about
the parameter or distribution. This assumption is called the null hypothesis and is denoted by H0. An alternative
hypothesis (denoted Ha), which is the opposite of what is stated in the null hypothesis, is then defined. The
hypothesis-testing procedure involves using sample data to determine whether or not H0 can be rejected. If H0 is
rejected, the statistical conclusion is that the alternative hypothesis Ha is true.
● Null Hypothesis: The null hypothesis is a statement that the value of a population parameter (such as
proportion, mean, or standard deviation) is equal to some claimed value. We either reject or fail to reject the null
hypothesis. Null Hypothesis is denoted by H0.
● Alternate Hypothesis: The alternative hypothesis is the statement that the parameter has a value that differs
from the claimed value. It is denoted by Ha.
Level of significance: This refers to the degree of significance at which we accept or reject the null hypothesis.
Since 100% accuracy in accepting or rejecting a hypothesis is not possible in most experiments, we therefore
select a level of significance. It is denoted by alpha (α).
For example, assume that a radio station selects the music it plays based on the assumption that the average age of
its listening audience is 30 years.
To determine whether this assumption is valid, a hypothesis test could be conducted with the null
hypothesis given as H0: μ = 30 and the alternative hypothesis given as Ha: μ ≠ 30.
Based on a sample of individuals from the listening audience, the sample mean age, x̄, can be computed
and used to determine whether there is sufficient statistical evidence to reject H0.
Conceptually, a value of the sample mean that is “close” to 30 is consistent with the null hypothesis, while
a value of the sample mean that is “not close” to 30 provides support for the alternative hypothesis. What is
considered “close” and “not close” is determined by using the sampling distribution of x̄.
As another example, consider SAT scores with a national average of 500. The null hypothesis that the population
mean for the freshman class equals 500 is tentatively assumed to be true. It is tested by determining whether the
one observed sample mean qualifies as a common outcome or a rare outcome in the hypothesized sampling distribution.
Common Outcomes
An observed sample mean qualifies as a common outcome if the difference between its value and that of the
hypothesized population mean is small enough to be viewed as a probable outcome under the null hypothesis.
A common outcome signifies a lack of evidence that, with respect to the null hypothesis, something
special is happening in the underlying population.
Rare Outcomes
An observed sample mean qualifies as a rare outcome if the difference between its value and the
hypothesized population mean is too large to be reasonably viewed as a probable outcome under the null
hypothesis.
A rare outcome signifies that, with respect to the null hypothesis, something special probably is happening
in the underlying population.
Boundaries for Common and Rare Outcomes
Superimposed on the hypothesized sampling distribution is one possible set of boundaries for
common and rare outcomes, expressed in values of x̄.
If the one observed sample mean is located between 478 and 522, it will qualify as a common
outcome (readily attributed to variability) under the null hypothesis, and the null hypothesis will be retained. If,
however, the one observed sample mean is greater than 522 or less than 478, it will qualify as a rare outcome (not
readily attributed to variability) under the null hypothesis, and the null hypothesis will be rejected. These
boundaries correspond to roughly two standard errors on either side of the hypothesized mean of 500: with a
standard error of about 11, 500 − 2(11) = 478 and 500 + 2(11) = 522.
7. Z-test:
A z-test is a statistical test in which the distribution of the test statistic under the null hypothesis can be
approximated by a normal distribution. It is used to determine whether a sample mean differs significantly from a
hypothesized population mean (or whether two sample means differ) when the population variance is known and the
sample size is large (n ≥ 30).
When to Use Z-test:
o The sample size should be greater than 30. Otherwise, we should use the t-test.
o Samples should be drawn at random from the population.
o The standard deviation of the population should be known.
o Samples that are drawn from the population should be independent of each other.
o The data should be normally distributed; however, for a large sample size, the sampling distribution of
the mean is assumed to be approximately normal.
8. Z-test procedure:
1. First, identify the null and alternate hypotheses.
2. Determine the level of significance (α).
3. Find the critical value of z from the z-table, based on α and on whether the test is one-tailed or two-tailed.
4. Calculate the z-test statistic using the formula
z = (x̄ − µ) / (σ/√n)
where x̄ is the mean of the sample, µ is the mean of the population, σ is the standard deviation of the
population, and n is the sample size.
5. Compare the calculated z statistic with the critical value and decide whether or not to reject the null hypothesis.
Types of Z-test
● Left-tailed Test: In this test, the region of rejection is located at the extreme left of the distribution.
Here the null hypothesis is that the population mean is greater than or equal to the claimed value.
● Right-tailed Test: In this test, the region of rejection is located at the extreme right of the distribution.
Here the null hypothesis is that the population mean is less than or equal to the claimed value.
● Two-tailed test: In this test, the region of rejection is located at both extremes of the distribution. Here
the null hypothesis is that the population mean is equal to the claimed value.
Problem: A school principal claims that the students in the school are more intelligent than the average. On calculating
the IQ scores of 50 students, the average turns out to be 110. The mean of the population IQ is 100 and the standard
deviation is 15. State whether the principal's claim is right or not at a 5% significance level.
1. First, we define the null hypothesis and the alternate hypothesis: H0: µ = 100
Ha: µ > 100
2. State the level of significance. Here, the level of significance given in the question is α = 0.05; if it is not
given, we take α = 0.05.
3. Now, we look up the z-table. For α = 0.05, the z-score for a right-tailed test is 1.645.
4. Now, we perform the z-test on the problem:
z = (x̄ − µ) / (σ/√n) = (110 − 100) / (15/√50) ≈ 4.71
where x̄ = 110 (sample mean), µ = 100 (population mean), σ = 15 (population standard deviation),
α = 0.05, and n = 50.
Since 4.71 > 1.645, we reject the null hypothesis. If the z-test statistic had been less than the critical z-score, we
would not have rejected the null hypothesis.
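A short Python sketch of steps 1–4 for this problem, using scipy for the right-tailed critical value; the numbers are those given above.

```python
from math import sqrt
from scipy.stats import norm

# Given values from the problem.
x_bar, mu, sigma, n, alpha = 110, 100, 15, 50, 0.05

# Step 3: right-tailed critical value for alpha = 0.05.
z_critical = norm.ppf(1 - alpha)              # about 1.645

# Step 4: z-test statistic.
z = (x_bar - mu) / (sigma / sqrt(n))          # about 4.71

# Decision: reject H0 if the observed z exceeds the critical value.
print(f"z = {z:.2f}, critical z = {z_critical:.3f}")
print("Reject H0" if z > z_critical else "Fail to reject H0")
```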
9. Decision rule:
A decision rule specifies precisely when the null hypothesis (H0) should be rejected: H0 should be rejected
if the observed z equals or is more positive than the upper critical z value, or equals or is more negative than the
lower critical z value.
Example:
Suppose the critical z value at the 0.05 significance level is 1.96. Then the null hypothesis is to be
rejected if the observed z equals or is more positive than 1.96, or equals or is more negative than −1.96. Conversely,
the null hypothesis should be retained if the observed z falls between −1.96 and +1.96.
Critical z-scores:
A critical z score separates common from rare outcomes and hence dictates whether H0 should be retained or rejected.
Because of their vital role in the decision about H0, these scores are referred to as critical z scores.
Level of Significance:
The proportion (.025 + .025 = .05) of the total area identified with rare outcomes is often referred to as the level
of significance of the statistical test, and it is symbolized by the Greek letter α (alpha). It is the degree of rarity
required of an observed outcome in order to reject the null hypothesis (H0). For instance, the .05 level of
significance indicates that H0 should be rejected if the observed z could have occurred just by chance with a
probability of only .05 (one chance out of twenty) or less.
10. Decisions:
The decision is either to retain or to reject H0, depending on the location of the observed z value relative to the
critical z values specified in the decision rule. According to the present rule, H0 should be rejected at the .05 level
of significance because the observed z of 3 exceeds the critical z of 1.96 and, therefore, qualifies as a rare outcome,
that is, an unlikely outcome from a population centered about the null hypothesis.
Retain or Reject H0?
If you are ever confused about whether to retain or reject H0, recall the logic behind the hypothesis test.
You want to reject H0 only if the observed value of z qualifies as a rare outcome because it deviates too far into the
tails of the sampling distribution. Therefore, you want to reject H0 only if the observed value of z equals or is more
positive than the upper critical z (1.96) or if it equals or is more negative than the lower critical z (–1.96).
Before deciding, you might find it helpful to sketch the hypothesized sampling distribution, along with its
critical z values and shaded rejection regions, and then use some mark, such as an arrow, to designate the location
of the observed value of z (3) along the z scale. If this mark is located in the shaded rejection region, or farther out
than this region, then H0 should be rejected.
11. Interpretation:
Finally, interpret the decision in terms of the original research problem. Although not a strict consequence
of the present test, a more specific conclusion is possible.
Example:
Suppose the research question is whether the mean SAT math score for the local freshman class differs from the
national average of 500, and the observed sample mean is 533.
Since the null hypothesis was rejected, and since the sample mean of 533 (or its equivalent z of 3) falls in the
upper rejection region of the hypothesized sampling distribution, it can be concluded that the population mean SAT
math score for all local freshmen probably exceeds the national average of 500.
By the same token, if the observed sample mean or its equivalent z had fallen in the lower rejection region
of the hypothesized sampling distribution, it could have been concluded that the population mean for all local
freshmen probably is below the national average.
If the observed sample mean or its equivalent z had fallen in the retention region of the hypothesized
sampling distribution, it would have been concluded that there is no evidence that the population mean for all
local freshmen differs from the national average of 500.
Two-Tailed Test:
Two-tailed hypothesis tests are also known as non-directional or two-sided tests because they can detect
effects in both directions. When you perform a two-tailed test, you split the significance level between
both tails of the distribution. For example, an alpha of 5% is split so that the distribution has two shaded regions of
2.5% each (2 × 2.5% = 5%).
When a test statistic falls in either critical region, the sample data are sufficiently incompatible with the null
hypothesis that you can reject it for the population.
In a two-tailed test, the generic null and alternative hypotheses take the form
H0: µ = µ0    H1: µ ≠ µ0
so the alternative hypothesis, H1, is the complement of the null hypothesis, H0. Under typical conditions,
the form of H1 resembles that shown for the SAT example, namely
H1: µ ≠ 500
This alternative hypothesis says that the null hypothesis should be rejected if the mean reading score for the
population of local freshmen differs in either direction from the national average of 500. An observed z will qualify
as a rare outcome if it deviates too far either below or above the national average.
The corresponding decision rule, with its pair of critical z scores of ±1.96, is referred to as a two-tailed or
nondirectional test.
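A small Python sketch of a two-tailed decision for the SAT example (observed z of 3, α = .05), using scipy for the ±1.96 critical values and the two-tailed p-value; this is an illustration, not part of the original worked example.

```python
from scipy.stats import norm

alpha = 0.05
z_observed = 3.0   # z for the sample mean of 533 in the SAT example

# Two-tailed critical values: split alpha between the two tails.
z_lower, z_upper = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)   # about -1.96 and +1.96

# Two-tailed p-value: probability of a result at least this extreme in either direction.
p_value = 2 * (1 - norm.cdf(abs(z_observed)))                     # about 0.0027

reject = z_observed <= z_lower or z_observed >= z_upper
print(f"critical z = ({z_lower:.2f}, {z_upper:.2f}), p = {p_value:.4f}, reject H0: {reject}")
```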
One-Tailed Test:
One-tailed (directional) tests place the entire rejection region in a single tail of the distribution. They take one of
two forms:
Type 1:
Null: The effect is less than or equal to zero. Alternative: The effect is greater than zero.
Type 2:
Null: The effect is greater than or equal to zero. Alternative: The effect is less than zero.
The disadvantage of one-tailed tests is that they have no statistical power to detect an effect in the other direction.
13.Estimation:
A point estimate for μ uses a single value to represent the unknown population mean. This is the most
straightforward type of estimate. If a random sample of 100 local freshmen reveals a sample mean SAT score of 533,
then 533 will be the point estimate of the unknown population mean for all local freshmen. The best single point
estimate for the unknown population mean is simply the observed value of the sample mean.
Drawbacks: Although straightforward, simple, and precise, point estimates suffer from a basic deficiency. They
tend to be inaccurate. Because of sampling variability, it’s unlikely that a single sample mean, such as 533, will
coincide with the population mean. Since point estimates convey no information about the degree of inaccuracy due
to sampling variability, statisticians supplement point estimates with another, more realistic type of estimate, known
as interval estimates or confidence intervals.
14.Confidence Interval:
A confidence interval for μ uses a range of values that, with a known degree of certainty, includes the unknown
population mean.
For instance, the SAT investigator might use a confidence interval to claim, with 95 percent confidence,
that the interval between 511.44 and 554.56 includes the population mean math score for all local freshmen. To be
95 percent confident signifies that if many of these intervals were constructed for a long series of samples,
approximately 95 percent would include the population mean for all local freshmen. In the long run, 95 percent of
these confidence intervals are true because they include the unknown population mean. The remaining 5 percent are
false because they fail to include the unknown population mean.
How It Works:
▪ The mean of the sampling distribution equals the unknown population mean for all local freshmen,
whatever its value, because the mean of this sampling distribution always equals the population mean.
▪ The standard error of the sampling distribution equals the value (11) obtained from dividing the population
standard deviation (110) by the square root of the sample size (100).
▪ The shape of the sampling distribution approximates a normal distribution because the sample size of 100
satisfies the requirements of the central limit theorem.
Only one sample mean is actually taken from this sampling distribution and used to construct a single 95
percent confidence interval. However, imagine taking not just one but a series of randomly selected sample
means from this sampling distribution. For each sample mean, construct a 95 percent confidence interval by
adding 1.96 standard errors to the sample mean and subtracting 1.96 standard errors from the sample mean; that
is, use the expression x̄ ± 1.96σx̄ to obtain a 95 percent confidence interval for each sample mean.
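A minimal Python sketch of the 95 percent confidence interval described above (x̄ = 533, σ = 110, n = 100), computed as x̄ ± 1.96 standard errors.

```python
from math import sqrt
from scipy.stats import norm

x_bar, sigma, n, confidence = 533, 110, 100, 0.95

standard_error = sigma / sqrt(n)            # 110 / 10 = 11
z = norm.ppf(1 - (1 - confidence) / 2)      # about 1.96

lower = x_bar - z * standard_error
upper = x_bar + z * standard_error
print(f"{confidence:.0%} CI: ({lower:.2f}, {upper:.2f})")   # about (511.44, 554.56)
```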