Statistics With Python 2025
I will not go too far into the details of the theory, as I would
risk boring or confusing the reader. Instead, I will show the
basic elements at the beginning of each section and in each
solution, along with the applicability of the approximations
made. This is to allow an easy recall of theoretical notions
without, however, entering a level of detail that would be
outside the practical purposes of this volume. The Python
code, on the other hand, will be complete and explained
step by step.
Let's start with some brief definitions. Probability is a measure of the uncertainty
associated with the occurrence of an event in a random experiment. An empirical method
to estimate the probability of an event is the principle of relative frequency. If an
experiment is repeated n times under the same conditions and the event A occurs k times,
then its probability can be estimated as:
P(A) ≈ k / n
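As a quick, purely illustrative sketch of the relative-frequency idea (not part of the original exercises; the event, the number of repetitions and the variable names are assumptions), one can simulate n tosses of a fair coin and estimate P(heads) as k/n:

import random

# Number of repetitions of the experiment and count of occurrences of A ("heads")
n = 100_000
k = sum(random.random() < 0.5 for _ in range(n))

# Relative-frequency estimate of P(A)
estimated_probability = k / n
print(f"Estimated P(heads) after {n} tosses: {estimated_probability:.4f}")

As n grows, the relative frequency tends to stabilize around the theoretical probability of 0.5.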
We can define the complementary event Ā of an event A as the set of outcomes that do not belong to A. For example, in the toss of a coin, if A represents heads, its complement Ā represents tails. Since P(Ā) = 1 - P(A), this property is useful for calculating probabilities without having to determine them directly.
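The following one-line sketch (a hypothetical example with a fair die, not taken from an exercise) shows the complement rule in code:

# Hypothetical example: A is "rolling a six" with a fair die
p_A = 1 / 6

# Probability of the complementary event ("not rolling a six")
p_not_A = 1 - p_A

print(f"P(A) = {p_A:.4f}, P(not A) = {p_not_A:.4f}")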
Finally, let’s introduce conditional probability. This is the probability of an event A given
that another event B has already occurred. It is defined as:
P(A|B) = P(A ∩ B) / P(B)
The quantity P(A ∩ B) represents the probability of the intersection between A and B, that is,
the probability that both events occur. Conditional probability is fundamental in many fields
of statistics and probability calculation, such as in Bayes' Theorem.
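To make the definition concrete, here is a minimal sketch in which the counts are invented for demonstration (they do not come from any exercise in this book); it estimates P(A|B) from joint and marginal counts:

# Hypothetical counts from a random experiment repeated 1000 times
n_total = 1000
n_B = 400          # times event B occurred
n_A_and_B = 100    # times A and B occurred together

# Empirical probabilities
p_B = n_B / n_total
p_A_and_B = n_A_and_B / n_total

# Conditional probability P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = p_A_and_B / p_B
print(f"P(A|B) = {p_A_given_B:.2f}")   # 0.25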
Solution
To solve this problem, we are calculating the conversion probability, a fundamental concept
in statistics related to probability analysis. The probability of an event is defined as the
number of favorable cases divided by the total number of possible cases.
In this case, the favorable event is 'a visit that results in a purchase.'
P(conversion) = Number of conversions / Total visits = 1200 / 50000 = 0.024
To express this probability in terms of percentage, we multiply the result by 100:
0.024 · 100 = 2.4%
Therefore, we can estimate that the probability of a randomly chosen visitor making a
purchase on this store's website is 2.4%. This information is useful for evaluating the
effectiveness of the site’s marketing strategies and user engagement.
# Data: conversions and total visits
number_of_conversions = 1200
number_of_visits = 50000

# Calculate the conversion probability
conversion_probability = number_of_conversions / number_of_visits

# Convert the probability to percentage
conversion_percentage = conversion_probability * 100

conversion_percentage

In this script, we are calculating the conversion probability from visits to purchases:
1. The conversions and total visits are set as variables, named number_of_conversions and number_of_visits.
2. Divide the number of conversions by the total number of visits to obtain the conversion probability. This is done using the usual proportion formula: conversion_probability = number_of_conversions / number_of_visits.
3. The conversion probability is converted into a percentage by multiplying by 100. This
final step is useful to express the probability in a more comprehensible and usable form
in business contexts.
4. Finally, the variable conversion_percentage contains the conversion probability expressed as a percentage, which is the desired output of our calculation.
Solution
The required probability can be calculated using the fundamental concept of classical
probability, which relies on the ratio of favorable cases to the total possible cases. In the
context of this exercise, the favorable cases are the days on which at least one failure
occurred, while the possible cases are all considered working days.
The probability is calculated as:

P(E) = n(E) / n(S)

Where n(E) is the number of days with failures (36 days) and n(S) is the total number of working days (240 days).

P(E) = 36 / 240 = 3/20 = 0.15
There is thus a probability of 15% that on a randomly chosen working day, at least one
failure will occur in the production process. This exercise uses the statistical concept of
empirical probability based on historical data to estimate the probability of a similar future
event.
# Data
days_with_failures = 36
total_working_days = 240

# Probability calculation
failure_probability = days_with_failures / total_working_days

# Print the result
print(f'The probability of at least one failure occurring is: {failure_probability}')

In this code, the main task is divided into three steps:
1. Data assignment:
o The total number of working days and the number of days with failures are
assigned to variables for later use.
2. Probability calculation:
o We calculate the ratio of days with failures to total working days.
3. Solution output:
o We print the calculated result as a fraction, representing the probability that on a
randomly chosen working day, at least one failure will occur in the production line.
Solution
The proposed problem is a classic example of probability, where we want to calculate the
probability of the complementary event (non-defective), and then determine the opposite.
The probability of a defective unit is P(D) = 300/10000 = 0.03. Consequently, the probability that a unit is NOT defective is P(ND) = 1 - P(D) = 0.97. Using the complementary probability for 500 units, the probability that ALL 500 units are NOT defective is:

P(500 non-defective) = 0.97⁵⁰⁰
Now, we calculate the probability of at least one defective gadget, which is the
complement:
P(at least one defective) = 1 - 0.97⁵⁰⁰
This calculation applies the concept of probability for mass production examples, helping
the company estimate the success of their process improvements without having to
completely test a vast quantity of products, thereby reducing the risk of defects in gadgets
distributed in the market.
# Problem parameters
n = 500                   # number of units in the batch
p_defect = 300 / 10000    # probability of a defective unit

# Calculate the probability of at least one defective gadget
p_non_defective_all = (1 - p_defect) ** n
p_at_least_one_defective = 1 - p_non_defective_all

# Print the result
print(f"Probability that at least one gadget is defective: {p_at_least_one_defective}")

In the present code:
• First, we define the problem parameters: n (number of units in the batch) and p_defect (probability of a defective unit). The calculated probability is based on the number of defective units relative to the total produced; here, p_defect is 300/10000.
• We use the complementary probability formula:
o We calculate the probability that all units are not defective: (1 - p_defect) ** n.
o The event that at least one is defective is complementary to the event that all are not, so we calculate: 1 - p_non_defective_all.
• Finally, we print the result, showing the probability that at least one gadget is
defective. This type of calculation helps the company understand production risks and
make predictions about the outcomes of changes to production processes.
Solution
To solve this problem, we use the concept of conditional probability. We are looking for the
probability that a customer purchases an item given that they have tried it on in the fitting
room. The probability is calculated as the number of customers who purchased the item
after trying it on, divided by the total number of customers who tried the item.
Calculation:
P(Purchase|Tried) = Number of customers who purchased / Total number of customers who tried = 600 / (600 + 1800) = 600 / 2400 = 0.25
Therefore, the probability that a customer purchases an item after trying it on is 25% or
0.25.
# Data
number_of_purchases = 600
number_of_no_purchases = 1800

# Calculation of the conditional probability
conditional_purchase_probability = number_of_purchases / (number_of_purchases + number_of_no_purchases)

# Result rounded to two decimal places
conditional_purchase_probability = round(conditional_purchase_probability, 2)

conditional_purchase_probability

In the provided code, we are calculating the probability that a customer purchases an item after trying it on:
1. Definition of data:
o We have defined two variables: number_of_purchases, representing the number of customers who purchased after trying; and number_of_no_purchases, indicating those who did not purchase.
2. Calculation of conditional probability:
o We calculate the probability that a customer purchases after trying it on, by
dividing the number of people who purchased by the total number of people who
tried it on.
3. Approximation
o Finally, we round the result to two decimal places, as required.
This simple code provides us with a clear and direct result of the desired probability, which
is 0.25 or 25%.
Solution
To solve this problem, we must consider the probability of the intersection set. We are
looking for the probability that a user not only converts but also purchases at least 2
products. Let A denote the event 'a user converts' and B the event 'a user who has
converted buys at least 2 products'. We know that:
• P(A) = 0.10,
• P(B|A) = 0.30 (probability of B given A).
The sought probability is P(A ∩ B), which follows from the definition of conditional probability (the multiplication rule): P(A ∩ B) = P(A) · P(B|A), thus
P(A ∩ B) = P(A) · P(B|A) = 0.10 · 0.30 = 0.03
Thus, the probability that a user who visits the site converts and simultaneously buys at least 2 products is 3%. This exercise illustrates the probability of the intersection of events in the context of customer behavior in an e-commerce setting.
# Probability of conversion
p_conversion = 0.10

# Rate of purchasing at least 2 products among converters
p_at_least_2_products_given_conversion = 0.30

# Probability of the intersection of the two events
p_conversion_and_at_least_2 = p_conversion * p_at_least_2_products_given_conversion

# Format the result as a percentage and print it
print(f"{p_conversion_and_at_least_2:.0%}")
1. Definition of probabilities:
o p_conversion represents the probability that a user converts (10%).
o p_at_least_2_products_given_conversion is the probability that a purchaser who has already converted buys at least 2 products (30%).
2. The probability we are looking for is the intersection of two events: that a user converts
(A) and buys at least 2 products (B). This is calculated as the product of P(A) and
P(B\A), resulting in 0.03 or 3% in percentage form.
3. Finally, we format the final result as a percentage and print it using the print() function.
In the e-commerce context, this type of analysis helps to better understand user behavior
and to plan personalized marketing strategies.
Solution
To estimate the probability that an employee participates in at least one of the two types of
events, we can use the probability rule for the union of sets. Let A be the event 'attending
a technical workshop’ and B the event 'attending a personal development course.’
The probability that an employee attends a technical workshop is P(A) = 0.25. The probability that an employee attends a personal development course is P(B) = 0.15. The probability that an employee attends both is P(A ∩ B) = 0.05.

Applying the union formula:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B) = 0.25 + 0.15 - 0.05 = 0.35

Thus, the probability that an employee participates in at least one of these events is 35%.
Solution with Python
# Define the probabilities
P_A = 0.25             # Probability of attending technical workshops
P_B = 0.15             # Probability of attending personal development courses
P_A_intersec_B = 0.05  # Probability of attending both

# Calculate the probability of attending at least one of the events
P_A_union_B = P_A + P_B - P_A_intersec_B

P_A_union_B
The core of the code is the calculation of probability using the union probability formula: P(A ∪ B) = P(A) + P(B) - P(A ∩ B). This provides the probability that an employee participates in at least one of the two types of events.
Once the individual probabilities are defined, the rigorous application of this formula
provides the required result.
Solution
To solve this problem, we need to apply the concept of the probability of the union of sets.
That is, we need to determine the probability that a customer is satisfied with the service
or the after-sales support or both.
The formula for the probability of the union of two events A and B is:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Where:
• P(A) is the probability that a customer is satisfied with the service, equal to 40% or
0.40
• P(B) is the probability that a customer is satisfied with the after-sales support, equal to
25% or 0.25
• P(A ∩ B) is the probability that a customer is satisfied with both the service and the after-sales support, equal to 10% or 0.10

Applying the formula:

P(A ∪ B) = 0.40 + 0.25 - 0.10 = 0.55

Thus, the probability that a random customer is satisfied with at least one of the two aspects is 55%.
# Individual probabilities
p_service = 0.40
p_support = 0.25
p_both = 0.10

# Calculating the probability that a customer is satisfied with at least one of the two aspects
p_union = p_service + p_support - p_both

p_union

In this code, we use the concept of the probability of the union of two events A and B:
1. We define three variables representing the individual probabilities: p_service for service satisfaction, p_support for after-sales support satisfaction, and p_both for satisfaction with both.
2. We use the formula to calculate the probability that a customer is satisfied with at least one of the two aspects. The formula is P(A) + P(B) - P(A ∩ B), where we subtract the intersection probability to avoid double-counting customers satisfied with both aspects.
3. The calculation provides the overall satisfaction probability for at least one of the two
aspects for a randomly chosen customer, equal to 0.55 or 55%.
Solution
To solve this problem, we use the concept of the probability of the union of sets:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
Where:
• P(A) is the probability that a potential customer is reached by the social media
campaign (20% = 0.20),
• P(B) is the probability that a potential customer is reached by the television campaign (35% = 0.35),
• P(A ∩ B) is the probability that a potential customer is reached by both campaigns (10% = 0.10).

Applying the formula:

P(A ∪ B) = 0.20 + 0.35 - 0.10 = 0.45

Therefore, the probability that a potential customer is reached by at least one of the two advertising campaigns is 45%.
def probability_of_either_campaign(p_a, p_b, p_ab):
    # Calculate the probability that a potential customer is
    # reached by at least one of the two campaigns
    return p_a + p_b - p_ab

# Probabilities from the problem statement
p_a = 0.20   # reached by the social media campaign
p_b = 0.35   # reached by the television campaign
p_ab = 0.10  # reached by both campaigns

# Calculate the probability of being reached by at least one of the campaigns
probability = probability_of_either_campaign(p_a, p_b, p_ab)

print(f"The probability that a potential customer is reached by at least one of the two campaigns is: {probability:.2%}")

In this code, we wrap the union formula in a small function and apply it to the given probabilities.
The executed calculation returns the probability that a potential customer is reached by at
least one of the two advertising campaigns, which is 45%.
Solution
To solve this problem, we use the concept of the probability of the union of sets. The
probability that a customer has purchased products in at least one of the two categories is
given by the sum of the individual probabilities of purchasing in each category, minus the
probability that they have purchased in both categories.
The probability that a customer has purchased in at least one of the two categories is:
P(E ∪ A) = P(E) + P(A) - P(E ∩ A) = 0.45 + 0.55 - 0.20 = 0.80.
So, 80% of customers have purchased products in at least one of the categories of
electronics or clothing.
# Purchase probabilities
P_electronics = 0.45
P_clothing = 0.55
P_e_and_c = 0.20

# Calculate the probability of purchasing in at least one of the two categories
P_union = P_electronics + P_clothing - P_e_and_c

P_union

In this code, we calculate the probability that a customer has made purchases in at least one of the two categories. The individual probabilities are added together and then the intersection is subtracted.
Solution
To solve this problem, we need to apply the concept of the probability of independent
events. The probability that salesperson A does not conclude a sale is 1 - 0.35 = 0.65.
Similarly, the probability that salesperson B does not conclude a sale is 1 - 0.5 = 0.5. Since
the two events are independent, the probability that neither of the salespeople concludes
the sale is the product of their individual probabilities of failure:
P(no sale) = 0.65 • 0.5 = 0.325
The probability that at least one sale is concluded is complementary to the fact that no
salesperson concludes the sale. Therefore,
P(at least one sale) = 1 - P(no sale) = 1 - 0.325 = 0.675
The probability that at least one sale is concluded during these calls is 0.675,
demonstrating the application of probabilities of independent events to establish the
potential success of the team.
# Success probabilities of the two salespeople
p_A = 0.35
p_B = 0.5

# Probability that each salesperson does not conclude a sale
no_sale_A = 1 - p_A
no_sale_B = 1 - p_B

# Probability that neither salesperson concludes a sale
no_sale_both = no_sale_A * no_sale_B

# Probability that at least one sale is concluded
at_least_one_sale = 1 - no_sale_both

print(at_least_one_sale)
1. Start by defining the probability of success for each salesperson, p_A for salesperson A and p_B for salesperson B.
2. Calculate the probability that each salesperson does not conclude a sale, no_sale_A and no_sale_B. This is obtained by subtracting the success probability of each salesperson from 1.
3. Since these are independent events, the probability that both salespeople do not close
a sale is obtained by multiplying their probabilities of failure.
4. The probability that at least one salesperson concludes a sale is the complement of the
probability that neither concludes a sale. This is calculated by subtracting the
combined probability of failure from 1.
5. Finally, print the result, which is the probability that at least one sale is concluded
during the calls.
Solution
To solve this problem, we apply the statistical concept of the probability of independent
events. Events are considered independent when the outcome of one does not influence
the outcome of the other. The formula to find the probability of both independent events
occurring is given by the product of their individual probabilities.
P(A ∩ B) = P(A) · P(B) = 0.6 · 0.7 = 0.42

Therefore, the probability that both campaigns are successful is 0.42, or 42%.
# Success probabilities of the two campaigns
p_a = 0.6
p_b = 0.7

# Calculate the probability that both campaigns are successful
p_success_both = p_a * p_b

# Output the probability
p_success_both

In this code, we calculate the probability that both campaigns are successful:
1. We start by defining the variables p_a and p_b, which represent the probabilities that Team Alpha's and Team Beta's campaigns are successful, respectively. These are provided in the problem statement.
2. We use the multiplication operator * to calculate the product of the probabilities, which gives us the probability that both events are successful.
3. The result is stored in p_success_both, which is then simply returned (printed in a broader context).
Solution
To calculate the probability that a customer passes both assessments, we use the concept
of independent events, where the probability of both events occurring is the product of
their individual probabilities.
If P(A) = 0.8 is the probability of passing the creditworthiness assessment, and P(B) = 0.75 is the probability of passing the claims history assessment, then the probability that both conditions are satisfied is:

P(A and B) = P(A) · P(B) = 0.8 · 0.75 = 0.6
prob_A = 0.8
prob_B = 0.75

# Probability that both independent events occur
prob_A_and_B = prob_A * prob_B

prob_A_and_B
In the code above, we calculate the probability that a customer passes both risk
assessments based on the premise that the events are independent.
• prob_A and prob_B represent the probabilities that a customer passes the creditworthiness assessment and the claims history assessment, respectively.
• Since the events are independent, the probability that both events occur is the product of the individual probabilities (prob_A * prob_B). This is a fundamental principle of probability theory for independent events.
The final result, prob_A_and_B, gives us the probability that a customer passes both assessments, resulting in a value of 0.6 or 60%.
Solution
To solve this problem, we use the concept of the probability of independent events.
Given the independence of events A and B, the probability that the union of two events
occurs (at least one of them occurs) is given by:
P(A ∪ B) = P(A) + P(B) - P(A) · P(B)
Let's calculate:
P(A ∪ B) = 0.25 + 0.35 - (0.25 · 0.35) = 0.25 + 0.35 - 0.0875 = 0.5125
The probability that a customer neither buys an appliance nor a furniture item is the
complement of P(A u B):
P((A ∪ B)ᶜ) = 1 - P(A ∪ B) = 1 - 0.5125 = 0.4875
In this statistical context of independent events, the department store can expect that the
probability of a customer making no purchases in the analyzed categories is 48.75%.
# Probabilities of the two independent events
p_appliance = 0.25
p_furniture = 0.35

# Probability that a customer buys at least one of the two items
union_probability = p_appliance + p_furniture - p_appliance * p_furniture

# Probability that a customer neither buys an appliance nor a furniture item
complement_probability = 1 - union_probability

# Output the calculated probability
complement_probability

This code calculates the probability that a customer makes no purchase in either of the two categories.
To calculate the combined probability of two independent events, we used the formula P(A) + P(B) - P(A) · P(B), which represents the probability that at least one of the two events occurs.
After obtaining P(A ∪ B), we calculated the complement (the probability that neither event occurs) by subtracting P(A ∪ B) from 1.
Finally, the code returns the calculated probability of making no purchases in the specified
categories, which is 0.4875, or 48.75%.
Exercise 14. Analysis of Advertising Sales A digital marketing company analyzed the
response data to its email advertising campaigns in the last quarter. Out of a total of
25,000 emails sent, it recorded that 5,250 of these led recipients to visit the promotional
website. Simultaneously, out of the emails sent, 1,200 resulted in a direct sale on the site.
Based on this data, calculate the estimated probability that a randomly selected email
recipient visits the promotional website and simultaneously makes a purchase. Present the
result as a decimal value rounded to four decimal places.
Solution
The exercise analyzes the probability of combining multiple events: visit and purchase.
Below, we report the various data we have to work with.
Probability that a recipient visits the site:
P(Visit) = Number of visits / Total emails sent = 5250 / 25000 = 0.2100
Probability that a recipient makes a purchase given that they visited the site:
P(Purchase|Visit) = Number of sales / Number of visits = 1200 / 5250 = 0.2286
Total probability that a recipient visits the site and makes a purchase:
P(Visit and Purchase) = P(Visit) · P(Purchase|Visit) = 0.2100 · 0.2286 ≈ 0.0480
In this exercise, we used the concept of conditional probability to estimate the combined probability of two dependent events: visit and purchase. The estimated probability that a randomly selected recipient visits the site and makes a purchase, rounded to four decimal places, is 0.0480.
# Data
emails_sent = 25000
visits = 5250
sales = 1200

# Calculation of the probability of a visit
p_visit = visits / emails_sent

# Calculation of the conditional probability of a sale given that there has been a visit
p_sale_visit = sales / visits

# Calculation of the combined probability of visit and sale
p_visit_and_sale = p_visit * p_sale_visit

# Result rounded to four decimal places
p_visit_and_sale_approx = round(p_visit_and_sale, 4)

p_visit_and_sale_approx

In this code, we use basic concepts of conditional probability to calculate the combined probability of a visit and a sale:
1. We assign the known values to variables to facilitate their use in the code. We have the
total number of emails sent, the number of site visits, and the number of sales made.
2. The probability that a recipient visits the site is calculated by dividing the number of
site visits by the total number of emails sent. This operation provides P(Visit).
3. The probability of a sale given that there has been a visit is calculated by dividing the
number of sales by the number of visits. This is the conditional probability
P(Purchase|Visit).
4. We use the theorem of conditional probability to calculate the combined probability of
visit and sale, by multiplying P(Visit) by P(Purchase|Visit).
5. Finally, we use the round() function to approximate the result to four decimal places, as required.
Solution
To determine the required probability, we consider the number of surveys that express a
positive evaluation and the total number of completed surveys. This probability can be
calculated as:
P(positive evaluation) = number of positive surveys / total number of completed surveys = 600 / 800 = 0.75
The concept of probability is applied here to quantify the employees' satisfaction with the
professional development program. A probability of 75% indicates predominantly positive
feedback among employees who participated in the survey.
# Survey data
number_of_completed_surveys = 800
number_of_positive_surveys = 600

# Calculate the estimated probability
positive_probability = number_of_positive_surveys / number_of_completed_surveys

# Print the result
print(f"Estimated probability of a positive evaluation: {positive_probability}")

In this code, we calculate the estimated probability of a positive evaluation:
1. We define two variables corresponding to the total number of completed surveys and
the number of surveys that reported positive feedback.
2. We calculate the probability of a single positive survey. This is given by the ratio of
positively evaluated surveys to the total completed ones.
3. Finally, we print the results.
Chapter 2
Descriptive Statistics
In this chapter, we will explore a series of exercises in
descriptive statistics, a fundamental branch of statistics that
allows us to effectively summarize and interpret data. We
will start with the basic concepts, such as calculating the
most common observables, including mean, median, mode,
and variance, and then move on to the most widely used
correlation indices, such as Pearson's and Spearman's
correlation coefficients. These tools are essential for
analyzing data in a structured manner, identifying patterns
and trends, and supporting decision-making based on
numerical evidence.
The mean value (or sample mean) is a measure of central tendency that represents the average of the data observed in a sample. Given a series of n observations x₁, x₂, ..., xₙ, the sample mean x̄ is defined as:

x̄ = (1/n) · Σᵢ₌₁ⁿ xᵢ
This formula indicates that the mean is obtained by summing all the observations and
dividing by the total number of elements in the sample.
The sample mean is one of the fundamental tools of statistical inference, used to estimate
the population mean and compare groups of data.
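As a minimal sketch of the definition (the data below are illustrative, not taken from an exercise), the sample mean computed directly from the formula coincides with the value returned by numpy:

import numpy as np

# Illustrative sample of n = 5 observations
x = [4, 7, 5, 9, 10]

# Sample mean from the definition: sum of the observations divided by n
manual_mean = sum(x) / len(x)

# Same result with numpy
numpy_mean = np.mean(x)

print(manual_mean, numpy_mean)   # 7.0 7.0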
Solution
To find the average monthly revenue, all monthly revenue values need to be summed and
then divided by the number of months considered. This is the standard formula for
calculating the arithmetic mean in statistics.
Calculation:
Average = (50 + 75 + 60 + 85 + 90 + 80 + 95 + 100 + 70 + 85 + 110 + 95) / 12 = 995 / 12 ≈ 82.92
The average monthly revenue for the past year is 82.92 thousand euros. This figure
provides a central measure of monthly revenue and helps understand the overall market
performance trend of the company on a monthly basis.
import numpy as np

# Monthly revenue data (in thousands of euros)
monthly_revenue = [50, 75, 60, 85, 90, 80, 95, 100, 70, 85, 110, 95]

# Calculate the average using the numpy library
average_revenue = np.mean(monthly_revenue)

# Print the result
print(f"The average monthly revenue is: {average_revenue:.2f} thousand euros")

In this code, we use the numpy library to compute the mean.
Let's look at the various steps:
1. We create a list called monthly_revenue containing the monthly revenue data expressed in thousands of euros for each month of the year.
2. We use the mean function from numpy to calculate the average of the values in the monthly_revenue list. The mean is a statistical indicator that provides a central value of the data distribution.
3. We use the print function to display the calculated average value, formatting it to two
decimal places for better readability.
This code allows us to quickly obtain an essential measure of the company's average
monthly performance, helping us evaluate last year’s revenue performance.
Exercise 17. Analysis of a Clothing Store's Sales A retail clothing store recorded daily
sales for two consecutive weeks. The collected data are as follows: [1500, 1700, 1600,
1800, 2100, 1900, 2000, 1800, 1600, 2300, 2100, 1900, 2000, 2200] (in euros). The store
manager wants to better understand the average daily sales to optimize inventory and
improve marketing strategy.
Calculate the average daily sales for the analyzed period using the provided data.
Solution
The solution to the problem is based on calculating the average value, also known as the
arithmetic mean. To find the average daily sales, sum all the daily sales and divide the
result by the total number of days considered.
import numpy as np

# Daily sales data (in euros)
daily_sales = [1500, 1700, 1600, 1800, 2100, 1900, 2000, 1800, 1600, 2300, 2100, 1900, 2000, 2200]

# Calculating the average daily sales using numpy
average_daily_sales = np.mean(daily_sales)

# Printing the average daily sales
print(f"Average daily sales: {average_daily_sales:.2f} euros")

In this code, we use numpy to compute the arithmetic mean:
1. The data provided in the problem is stored in a list called daily_sales.
2. We use np.mean(daily_sales) to calculate the mean of the values in the list. This function adds up all the numbers and divides the total by the number of elements.
3. We use the print function to display the result of the average calculation. The formatted string allows inserting Python variables directly into the output; :.2f limits the result to two decimal places for a more readable representation of the value.
Using the numpy library simplifies our approach compared to manual calculations, helping
reduce the risk of arithmetic errors and optimizing the average calculation process.
The financial director wants to get an overall view of the economic efficiency of the
campaigns to optimize the budget and plan future investments. Calculate the average
monthly cost of each campaign.
Solution
To solve this problem, we need to calculate the average value of the monthly cost for each
advertising campaign.
The average value is a fundamental concept in statistics obtained by summing all the
values of the data in a dataset and then dividing by the total number of values in that
dataset.
import numpy as np

# Monthly costs for each campaign, expressed in thousands of euros
# (fill in with the campaign data from the exercise statement)
campagne = {
    # 'Campaign name': [monthly costs, ...],
}

# Calculate the average cost for each campaign
average_cost_campaigns = {name: np.mean(costs) for name, costs in campagne.items()}

print(average_cost_campaigns)

The provided code block allows for determining the average monthly cost of each campaign:
1. The monthly costs for each campaign are stored in a dictionary campagne, where the
keys represent the campaign names and the values are lists of the monthly costs
expressed in thousands of euros.
2. We use dictionary comprehension to iterate through each campaign and its respective
costs. For each campaign, numpy's mean function is used to calculate the average cost.
3. The resulting dictionary average_cost_campaigns associates the name of each campaign with its average monthly cost. This is then printed, providing a clear view of the average costs for each campaign.
This simple example illustrates how fundamental statistical calculations, like arithmetic
averages, can be easily performed using the scientific libraries available in Python.
2.2 Standard Deviation
The standard deviation is a measure of the dispersion of data relative to their mean. It
indicates how much, on average, the values of a sample deviate from the sample mean.
Given a series of n observations x₁, x₂, ..., xₙ, the sample standard deviation s is defined as:

s = √( Σᵢ₌₁ⁿ (xᵢ - x̄)² / (n - 1) )

where x̄ is the sample mean.
Here are some of its properties:
The standard deviation is often used along with the mean to describe the distribution of the
data and to compare variability between different samples.
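The following minimal sketch (with illustrative data, not from an exercise) computes the sample standard deviation directly from the formula and compares it with np.std(..., ddof=1), the form used throughout this chapter:

import math
import numpy as np

# Illustrative sample
x = [10, 12, 9, 11, 13]
n = len(x)

# Sample mean
x_bar = sum(x) / n

# Sample variance: sum of squared deviations divided by (n - 1)
sample_variance = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Sample standard deviation
s_manual = math.sqrt(sample_variance)

# numpy equivalent: ddof=1 gives the sample (not population) standard deviation
s_numpy = np.std(x, ddof=1)

print(s_manual, s_numpy)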
Exercise 19. Assessment of Product Stability Two companies, Alfa S.p.A. and Beta
S.r.l., sell the same type of product with an average monthly revenue of 50,000 euros.
However, the financial manager wants to identify which of the two products exhibits a more
stable sales trend over time. Here are the data from the past six months: Alfa S.p.A.:
[48000, 52000, 47500, 50500, 49500, 51500]
Determine which product shows greater sales stability and explain why.
Solution
To solve this problem, it is necessary to calculate the sample standard deviation for each
set of provided data. The standard deviation is a measure of data dispersion relative to the
mean, where a lower standard deviation indicates greater stability in the data.
For Alfa S.p.A., we first calculate the sample variance (dividing by the number of values minus 1 to have an unbiased estimate):

s²_Alfa = ((48000 - 49833.33)² + ... + (51500 - 49833.33)²) / (6 - 1) ≈ 3,366,666.7, so s_Alfa = √3,366,666.7 ≈ 1834.84.
Comparing sample standard deviations, we see that Alfa S.p.A. has a standard deviation of
1834.84, while Beta S.r.l. has a standard deviation of 2607.68. Consequently, the sales of
Alfa S.p.A. are more stable compared to those of Beta S.r.l., owing to the lower standard
deviation.
import numpy as np

# Sales data for the two companies
vendite_alfa = [48000, 52000, 47500, 50500, 49500, 51500]
vendite_beta = [...]  # Beta S.r.l. monthly sales from the exercise statement

# Calculation of sample standard deviation
std_alfa = np.std(vendite_alfa, ddof=1)
std_beta = np.std(vendite_beta, ddof=1)

std_alfa, std_beta

In this code, we use the numpy library to manage data arrays and calculate the sample standard deviations.
The approach of using numpy simplifies the calculation of sample standard deviations,
avoiding manual steps in computing the mean and variance, and is especially useful with
larger data sets or more complex statistical operations.
Exercise 20. Analysis of Daily Work Performance A consulting firm wants to assess
the variation in daily productivity of its consultants. Management has collected data
regarding the actual working hours of a consultant over a period of two working weeks (10
working days). The data collected are as follows: [7,8,7.5,8.5,7,9,6.5,8,7.5,8]
Use this data to determine how the daily productivity of the consultant varies relative to
the average. What conclusion can you reach regarding the consistency of his daily
productivity?
Solution
To determine the daily variation in the consultant's productivity, we need to calculate the
sample standard deviation of the data. The first step is to calculate the daily average of
working hours.
Daily average, x̄:

x̄ = (7 + 8 + ... + 8) / 10 = 7.7 hours
The next step is to determine the deviation of each value from the mean, square each
deviation, and sum them.
Σᵢ (xᵢ - x̄)² = (7 - 7.7)² + ... + (8 - 7.7)² = 5.10
Now, to obtain the sample standard deviation, divide this sum by n -1 where n is the
number of data points (in our case 10), and then calculate the square root.
s = √(5.10 / 9) ≈ 0.75
The sample standard deviation of approximately 0.75 hours indicates that the consultant's
daily working hours show moderate variation around the daily average of 7.7 hours. This
statistical calculation allows us to understand that, in general, productivity is fairly
consistent, even though there are some daily variations in hours worked.
import numpy as np

# Working hours over two working weeks
work_hours_data = [7, 8, 7.5, 8.5, 7, 9, 6.5, 8, 7.5, 8]

# Calculate the daily average
daily_average = np.mean(work_hours_data)

# Calculate the sample standard deviation using numpy
s = np.std(work_hours_data, ddof=1)

(daily_average, s)

In this code, we use the numpy library, often used for numerical and statistical computations.
In the final result, daily_average provides the daily average of work hours, while s provides the sample standard deviation, indicating the consistency of the consultant's productivity relative to the calculated daily average.
The management team wants to identify which developer shows the most stable code
production over time. Analyze the data to determine who has the least variation in weekly
lines of code production.
Solution
To understand which developer has the most stable production, we need to calculate the
sample standard deviation for each dataset. The standard deviation allows us to measure
the dispersion of a sample around its mean.
1. For Developer 1:
√( ((340 - 355)² + ... + (350 - 355)²) / (4 - 1) ) = 12.91
2. For Developer 2:
√( ((400 - 395)² + ... + (410 - 395)²) / (4 - 1) ) = 12.91
3. For Developer 3:
√( ((310 - 312.5)² + ... + (315 - 312.5)²) / (4 - 1) ) = 6.45
4. For Developer 4:
√( ((450 - 453.25)² + ... + (448 - 453.25)²) / (4 - 1) ) = 5.38
5. For Developer 5:
√( ((500 - 508.75)² + ... + (505 - 508.75)²) / (4 - 1) ) = 8.54
Developer 4 has the lowest standard deviation at 5.38, indicating greater stability in code
production compared to other developers. The mathematical concept used here is the
sample standard deviation, which measures data deviation from the mean.
Solution with Python
import numpy as np

# Weekly lines of code for each developer
code = {
    'Developer 1': [340, 360, 370, 350],
    'Developer 2': [400, 380, 390, 410],
    'Developer 3': [310, 320, 305, 315],
    'Developer 4': [450, ...],   # remaining values from the exercise statement
    'Developer 5': [500, ...],   # remaining values from the exercise statement
}

stability = {}
for developer, lines in code.items():
    std_dev = np.std(lines, ddof=1)  # Sample standard deviation
    stability[developer] = std_dev

# Find the developer with the most stable production
most_stable_developer = min(stability, key=stability.get)

print(f"The developer with the most stable code production is {most_stable_developer} "
      f"with a standard deviation of {stability[most_stable_developer]:.2f}")
1. We import numpy with import numpy as np. This library is essential for numerical
operations and to calculate the standard deviation of data.
2. We use np.std(), specifying ddof=1 to calculate the sample standard deviation, which is necessary when working with a sample instead of the entire population.
3. We insert the weekly lines of code data for each developer into a dictionary.
4. We iterate through each developer and calculate the standard deviation of their weekly productions. We use np.std() with ddof=1 to obtain the correct sample standard deviation.
5. We determine which developer has the lowest standard deviation, indicative of more stable production over time.
6. Finally, we print which developer has the most stable code production, showing the
associated standard deviation.
This strategy allows us to identify which developer is the most consistent in weekly
recorded lines of code productivity.
Solution
To solve this problem, we need to calculate the standard deviation of the sum of two
random variables, in this case, the productions of components A and B. The standard
deviation of the sum of two random variables X and Y with standard deviations σ_X and σ_Y and covariance cov(X, Y) is given by the formula:

σ_X+Y = √(σ_X² + σ_Y² + 2 · cov(X, Y))
Applying the values from the problem, we get:
σ_A+B = √(100² + 150² + 2 · 1200) = √(10000 + 22500 + 2400) = √34900 ≈ 186.81
The standard deviation of the weekly sum of the components produced is approximately
186.81 units. This value provides an indication of the total variability of the assembled
production, taking into account the interaction between the two production lines through
covariance.
import numpy as np

# Known parameters (the production means mu_A and mu_B are given in the
# exercise but are not needed for this calculation)
sigma_A = 100   # standard deviation of component A
sigma_B = 150   # standard deviation of component B
cov_AB = 1200   # covariance between A and B

# Standard deviation of the sum of the two weekly productions
sigma_total = np.sqrt(sigma_A**2 + sigma_B**2 + 2 * cov_AB)

sigma_total

The code implements the calculation of the standard deviation of the total weekly production.
We use the numerical computation library numpy to perform the necessary mathematical
operations.
1. The library numpy is imported with the alias np. numpy is essential for scientific computing
in Python and provides efficient functionality for performing complex mathematical
operations.
2. We define the variables sigma_A, sigma_B, and cov_AB, which represent the standard deviations of the normal distributions of A and B and the covariance between A and B, respectively. These are the known values provided by the problem. In this specific case, it's not necessary to use the mean values (mu_A and mu_B), as only the standard deviations and the covariance enter the formula.
3. We use the formula provided by the theory of combined normal distributions: σ_A+B = √(σ_A² + σ_B² + 2 · cov(A, B)).
4. np.sqrt() is the NumPy function used to calculate the square root, while ** is the exponentiation operator in Python.
5. The formula considers the variances of the two components (i.e., the squares of their
standard deviations) and includes an additional term that accounts for the covariance
between A and B.
6. The result, sigma_total, represents the overall standard deviation of the sum of components A and B produced weekly and is printed as the output of the code.
Exercise 23. Corporate Financial Risk Analysis A company in the financial sector
manages two investment funds, Fund X and Fund Y. These funds invest in uncorrelated
assets, so their returns can be considered independent.
The annual volatility of Fund X's return is σ_X = 0.08, while that of Fund Y is σ_Y = 0.10. In a
year, an investor wants to know how risky the combined total of returns from both funds is.
Determine the overall standard deviation of the investor's final return.
Solution
To solve this problem, we need to calculate the standard deviation of the sum of returns
from Fund X and Fund Y. Since the problem states that there is no correlation between the
two variables (no covariance), we can use the formula:
Var(X + Y) = σ_X² + σ_Y²

σ_X+Y = √(σ_X² + σ_Y²) = √(0.08² + 0.10²) = √0.0164 ≈ 0.128

Therefore, the overall volatility of the sum of the funds is about 0.128, or 12.8% annually.
This calculation maps the statistical concept of standard deviation of the sum of two
independent variables to a financial context, thus comparing the combined risk of two
different investments.
from math import sqrt

# Annual volatilities of the two funds
sigma_X = 0.08
sigma_Y = 0.10

# Calculate the overall standard deviation without covariance
sigma_combined = sqrt(sigma_X**2 + sigma_Y**2)

# Print the result
print(f"The aggregated standard deviation of the two funds is: {sigma_combined}")
In this code, we calculate the aggregated standard deviation of the returns of two
investment funds, Funds X and Y, using the math library to calculate the square root.
1. from math import sqrt: This imports the sqrt function from Python's math module,
allowing us to calculate the square root necessary to obtain the overall standard
deviation.
2. We define the volatilities of Fund X and Fund Y as sigma_X and sigma_Y, respectively. It is given that sigma_X is 0.08 and sigma_Y is 0.10.
3. We use the proposed formula to calculate the standard deviation of the sum of two independent variables, i.e., sqrt(sigma_X**2 + sigma_Y**2). This formula assumes no correlation between the two return series (absence of covariance).
4. Finally, we print the result using an f-string statement, which allows for easy and
readable string formatting in Python by directly integrating variables within the string
using braces {}.
Using the square root is necessary for the final calculation of the standard deviation
because, in statistics, we work with variance (which is the square of the standard
deviation) while we want to obtain a measure of dispersion that is on the same scale as the
mean. This method determines the overall risk, or volatility, of two combined investments,
provided they are not correlated with each other.
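As a small optional check of this rule (not part of the original solution; the sample size and seed below are arbitrary assumptions), one can simulate two independent return series with the given volatilities and verify that the standard deviation of their sum is close to √(σ_X² + σ_Y²):

import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

sigma_X, sigma_Y = 0.08, 0.10

# Simulate many independent returns for the two funds
x = rng.normal(0.0, sigma_X, size=1_000_000)
y = rng.normal(0.0, sigma_Y, size=1_000_000)

# Empirical standard deviation of the sum vs. the theoretical value
empirical = np.std(x + y)
theoretical = np.sqrt(sigma_X**2 + sigma_Y**2)

print(f"empirical: {empirical:.4f}, theoretical: {theoretical:.4f}")

With a large number of simulated returns, the empirical value should agree with the theoretical 0.128 to within a small sampling error.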
Solution
To solve this problem, we need to determine the standard deviation of the sum of the two
random variables, which represent the productions of Components X and Y. The key
concept to apply here is how to calculate the standard deviation of a sum of random
variables.
For two random variables X and Y, the variance of their sum is given by:
Var(X + Y) = Var(X) + Var(Y) + 2 · cov(X, Y)
The standard deviation of the sum is thus the square root of the variance:
σ_X+Y = √133600 ≈ 365.5
The standard deviation of the total monthly production of Components X and Y is therefore
about 365.5 units. This value represents the variability in the total monthly productions of
the components, providing an indication of the risk associated with the overall production
capacity. By effectively managing this variability, the company can optimize its supply
chain operations.
import numpy as np

# Standard deviations and covariance of the two monthly productions
# (numerical values as given in the exercise statement)
sigma_X = ...
sigma_Y = ...
cov_XY = ...

# Variances of the two components
var_X = sigma_X ** 2
var_Y = sigma_Y ** 2

# Variance and standard deviation of the sum
var_X_plus_Y = var_X + var_Y + 2 * cov_XY
sigma_X_plus_Y = np.sqrt(var_X_plus_Y)

sigma_X_plus_Y
In this code, the numpy library is used to perform numerical calculations, including the
square root. Although we only use basic functions of numpy, it is a fundamental library for
scientific computing in Python.
The result, sigma_X_plus_Y, represents the overall variability of the monthly production of components X and Y.
Based on these calculations, briefly discuss whether the investment can be considered
stable.
Solution
To solve this exercise, we analyze the relationship between the average and the standard
deviation of the monthly returns of the fund.
1. Calculate the mean:
Mean = (2.5 + 3.2 + ... + 3.0) / 12 = 35.4 / 12 = 2.95
2. Calculate the standard deviation:
Variance = Σᵢ₌₁¹² (xᵢ - Mean)² / 12 ≈ 0.069
Standard Deviation = √Variance ≈ 0.263
import numpy as np

# Monthly returns data (in percent)
returns = [2.5, 3.2, 3.0, 2.8, 3.4, 2.9, 2.7, 3.1, 2.6, 3.3, 2.9, 3.0]

# Calculate the mean of the returns
mean_returns = np.mean(returns)

# Calculate the population standard deviation (ddof=0)
standard_deviation = np.std(returns, ddof=0)

# Calculate the ratio of mean to standard deviation
stability_ratio = mean_returns / standard_deviation

(mean_returns, standard_deviation, stability_ratio)

In this code, we primarily use numpy, a useful library for numerical computation:
1. We create a list of monthly returns called returns that contains the given percentage
data.
2. We use np.mean() to calculate the mean of the monthly returns. This function calculates the average value of all elements in the array.
3. We use np.std() to calculate the standard deviation. Here we pass ddof=0 (numpy's default), which divides by n and therefore treats the data as an entire population rather than a sample.
4. Finally, we divide the average returns by the standard deviation to obtain the stability
ratio. The higher this ratio, the more stable the returns are considered.
Using these procedures, we can judge the stability of the investment fund based on the
calculated values.
Exercise 26. Sales Variables Analysis An e-commerce company wants to analyze the
sales performance of its products to assess the consistency of monthly sales. The sales
data in euros for the first six months of the year are as follows: [5700, 6200, 6500, 5900,
5800, 6400].
The company believes that a good assessment of sales consistency can be obtained by
calculating an index that relates the average value of sales to their standard deviation. A
higher value of this index indicates greater consistency in sales. Evaluate whether the sales
can be considered consistent based on this index.
Solution
To tackle this issue, the main approach is to analyze the ratio between the mean and the standard deviation, which is related to the concept of the coefficient of variation. With the given data, the mean of the sales is about 6083.33 euros and the population standard deviation is about 302.31 euros, so the consistency index (mean divided by standard deviation) is roughly 20.
In conclusion, the company can consider its monthly sales as consistent, since the consistency index shows a relatively low variation compared to the average sales. This is the main objective of the exercise: the ratio between mean and standard deviation offers a quantitative assessment of the stability of business variables.
import numpy as np

# Sales data for the first six months (in euros)
sales = np.array([5700, 6200, 6500, 5900, 5800, 6400])

# Calculate the mean
mean_sales = np.mean(sales)

# Calculate the standard deviation (ddof=0 for the entire population)
standard_deviation = np.std(sales, ddof=0)

# Calculate the consistency index
consistency_index = mean_sales / standard_deviation

# Output mean, standard deviation, consistency index
(mean_sales, standard_deviation, consistency_index)

The code uses the NumPy library to perform statistical calculations:
1. Sales data is defined as a NumPy array. The variable sales contains the sales for the
first six months.
2. Using np.mean(), we calculate the average sales value, which represents the monthly mean.
3. With np.std(), we calculate the standard deviation. The parameter ddof=0 indicates that we are calculating the population standard deviation (using division by n).
4. The consistency index is calculated by dividing the mean by the standard deviation.
This index measures sales consistency: the higher it is, the more stable the sales are.
Finally, the code outputs the mean, standard deviation, and consistency index. This allows
the company to evaluate if their sales are consistent in the analyzed period.
The company wishes to understand how stable the production process is. Managers believe
that a numerical comparison can provide a clear indication: less variation from the mean
suggests greater stability of the production process.
Solution
From the monthly production data, the mean μ and the population standard deviation σ are:

μ = 1260.83,  σ = 31.68

The stability index is given by the ratio R of the mean to the standard deviation:

R = μ / σ = 1260.83 / 31.68 ≈ 39.80
In statistical context, the ratio between the mean and the standard deviation (which
represents stability) is an important indicator to evaluate the level of fluctuation relative to
the mean value. A higher ratio signifies less instability. The obtained index (R = 39.80)
suggests good stability of the production process as the variations from the mean are
contained in comparison to the mean value itself. This type of analysis is crucial for a
company aiming for efficiency and consistency in its production operations.
import numpy as np

# Monthly production data
production = np.array([1200, 1300, 1270, 1250, 1285, 1260])

# 1. Calculate the monthly mean
monthly_mean = np.mean(production)

# 2. Calculate the monthly standard deviation
#    Using ddof=0 to get the population standard deviation
standard_deviation = np.std(production, ddof=0)

# 3. Calculate the stability index
stability_index = monthly_mean / standard_deviation

# Results
print(f"Monthly production mean: {monthly_mean:.2f}")
print(f"Monthly standard deviation: {standard_deviation:.2f}")
print(f"Stability index: {stability_index:.2f}")
First, we created a numpy array called production to store the monthly production data. We used numpy for the calculation of the monthly mean using the statement np.mean(production), which provides the sum of the values divided by the total number of elements.
For the standard deviation calculation, we used the statement np.std(production, ddof=0). Here we specify ddof=0 to get the population standard deviation, which is appropriate since we are analyzing all six months of available data.
Finally, we calculated the stability index as the ratio between the monthly mean and the
standard deviation. This index provides a measure of production consistency: a higher
value indicates a more stable production relative to the mean.
2.3 Median Value
The median value (or median) is a measure of central tendency that represents the central
value of an ordered set of data. Unlike the mean, the median is less sensitive to extreme
values (outliers) and provides a more robust measure of the central position of a sample.
Given a series of n observations x₁, x₂, ..., xₙ ordered in increasing order (x₁ ≤ x₂ ≤ ... ≤ xₙ), the median M is defined as:
• the value in position (n + 1)/2, if n is odd;
• the average of the two central values, (x_(n/2) + x_(n/2+1)) / 2, if n is even.
Let's look at some main properties:
The median is particularly useful for describing the central position of data with
asymmetric distributions or in the presence of extreme values.
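A minimal sketch (with illustrative data, not from an exercise) of how the median behaves with an odd and an even number of observations, using np.median:

import numpy as np

# Odd number of observations: the median is the central value
odd_sample = [3, 1, 7, 5, 9]      # sorted: 1, 3, 5, 7, 9 -> median 5
# Even number of observations: the median is the mean of the two central values
even_sample = [3, 1, 7, 5]        # sorted: 1, 3, 5, 7 -> median (3 + 5) / 2 = 4.0

print(np.median(odd_sample), np.median(even_sample))   # 5.0 4.0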
The HR manager wants to understand what the central value that best represents the
salary situation of the department is.
Calculate this value and discuss how it might reflect the salary distribution more accurately
compared to other statistical summaries, such as the simple average.
Solution
To solve the problem, we identify what the median of the listed salaries is. The median is
the value that divides the sample into two equal parts, namely the number that is in the
middle of the ordered series.
Given the list of ordered salaries: 40, 45, 50, 55, 60, 65, 70
We can see that the series has seven values. Therefore, the central value (fourth in the
ordered series) is 55, which represents the median.
The median 55 provides us with an idea of the "middle point" of the salary distribution in
the department, which can be particularly useful if there are outliers or asymmetric
distributions that might skew the average. In this context, comparing the median with the
mean might show if there are extremely low or high salaries that influence the mean but
not the median.
import numpy as np
from scipy import stats

# Salary list
salaries = [40, 45, 50, 55, 60, 65, 70]

# Calculate the median using numpy
median_salary = np.median(salaries)

# Calculate the median using scipy
median_salary_scipy = stats.scoreatpercentile(salaries, 50)

# Calculate the mean for comparison
mean_salary = np.mean(salaries)

# Output results
print(f"Median calculated with numpy: {median_salary}")
print(f"Median calculated with scipy: {median_salary_scipy}")
print(f"Mean: {mean_salary}")
Additionally, we utilized the scipy library, which offers advanced scientific functions. We
used the stats.scoreatpercentile command to calculate the median, passing the 50th
percentile. While providing the same result as the np.median() function, scipy offers more
detailed options for handling statistical distributions and can be useful in more complex
contexts.
Finally, we also calculate the arithmetic mean with np.mean(). By comparing the mean and
median, we can infer information about any asymmetries in the salary distribution. A
significant discrepancy between the mean and median can indicate the presence of
outliers. However, in this specific case, since the data is symmetric and uniformly
distributed, the mean and median are very similar. This confirms that there are no large
disparities in the salaries of this department.
The management wants to identify the central value of the delivery times to evaluate the
overall efficiency of their logistics system. Determine this central value and discuss how it
can provide a more accurate picture of the management system compared to other
statistical summaries, considering the presence of outliers or exceptionally long times.
Solution
To identify the central value in the delivery times, we need to calculate the median. The
median is the value that divides an ordered data set into two equal halves.
In this case, the median is preferable to the mean because it is less affected by outlier
values (such as 10, 12, and 15 days). This provides a measure of the "typical" delivery time
that is more useful for understanding the general performance of the system without being
distorted by a few exceptionally slow deliveries.
from scipy import stats

# Ordered delivery times for September, in days (values as given in the exercise statement)
deliveries = [...]

# Calculate the median as the 50th percentile
median = stats.scoreatpercentile(deliveries, 50)

print('The median delivery time is:', median)

In this code, we use the scipy library, a widely used Python library for scientific computing:
1. We start by importing the stats module from scipy, which offers various tools for
statistical analysis.
2. We create a list called deliveries that contains the ordered delivery times for the month
of September. These represent the data to be analyzed.
3. We use the scoreatpercentile function from the stats module to calculate the median of the deliveries list.
4. We print the result using print, which communicates to the user what the median
delivery time is. The choice of the median allows for a more robust measure of the
"typical" delivery time compared to the arithmetic mean, as it is not significantly
influenced by extremely high or low values (outliers).
Using scipy makes the process of calculating the median simple and efficient, automating
the handling of ordered data and providing reliable results regardless of the presence of
outliers.
Management wants to understand what the typical wait time is to identify areas for
immediate improvement and is seeking to establish a benchmark that is not influenced by
exceptionally long waits due to rare unforeseen events.
What is the value that best represents this time series, and how can it be used to guide the
strategic decisions of the call center?
Solution
To determine the central value of the wait times, we use the median. The median
represents the value that lies in the middle of the data when they are ordered. Let's order
the wait times: 1, 2, 2, 3, 3, 3, 4, 5, 7, 10, 12, 15, 20
Since the total number of observations is odd (13 data points), the median is the seventh
value in the ordered series. Thus, the median is 4 minutes.
The median is a useful statistical measure in this context because it is not affected by
significantly longer waits (outliers) that might skew the arithmetic mean. This can serve as
a benchmark for the typical wait time and guide the strategic decisions of the call center
towards reducing queues or making operational improvements for longer time frames.
from scipy import stats

# Recorded wait times (in minutes)
times = [1, 2, 2, 3, 3, 3, 4, 5, 7, 10, 12, 15, 20]

# Calculate the median as the 50th percentile
median = stats.scoreatpercentile(times, 50)

print('The median of the wait times is:', median)

This code uses the scipy library, a Python library for scientific computing.
A list times is then defined, which contains the wait times, i.e., the recorded data from the
call center.
To calculate the median, we can use scoreatpercentile(), which takes the list of data as its first argument and the percentile as the second argument (in this case, 50) and returns the median as output.
Using the median instead of the arithmetic mean is strategic because it is unaffected by
the presence of exceptionally high or low values (outliers), thus providing a value that
better represents the central tendency of the data distribution. This information can be
useful for business decisions, as it provides a realistic benchmark of the typical wait time in
the call center.
2.4 Percentiles
Percentiles are positional measures that divide an ordered set of data into 100 equal parts.
The percentile of order p indicates the value below which p% of the observations lie.
Given an ordered sample of n observations x₁, x₂, ..., xₙ, the percentile of order p (with 0 < p < 100) is the value P_p such that:

(number of values ≤ P_p) = (p / 100) · n
Depending on whether the position (p/100) · n is an integer, the percentile can be determined in the following ways:
• If (p/100) · n is an integer, the percentile is the average of the two adjacent values.
• If (p/100) · n is not an integer, the percentile corresponds to the value of the next observation (the position is rounded up).
Some percentiles have specific names and are widely used in statistics:
Thus, percentiles are essential tools for describing the distribution of a dataset and
analyzing its variability.
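As a brief sketch (with illustrative data, not from an exercise) of how percentiles are obtained in practice with numpy; note that np.percentile interpolates linearly by default, so its results can differ slightly from the manual methods described above:

import numpy as np

# Illustrative data
data = [12, 7, 3, 9, 15, 10, 5, 8, 11, 6]

# 25th, 50th (the median) and 90th percentiles
p25, p50, p90 = np.percentile(data, [25, 50, 90])

print(p25, p50, p90)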
The objective is to understand up to what value the response times extend within which
95% of calls fall, to better plan personnel and ensure high service standards.
Solution
To solve this problem, we calculate the 95th percentile of the response time sample. This
type of calculation allows us to understand up to what point the 95% of the sample data
lies, indicating a high response time risk above this value during peak hours. We sort the
sample in ascending order: 34, 35, 36, 36, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41, 42, 42,
43, 44, 45.
The position of the 95th percentile is given by: P = 0.95 · (N + 1) = 0.95 · 21 = 19.95. The 95th percentile is therefore a weighted average between the 19th and the 20th observation, with most of the weight on the 20th position. In the ordered dataset, the 19th value is 44 and the 20th is 45, so the result is essentially 45.
Therefore, in the context of personnel and resource planning, the response time that covers 95% of the calls is approximately 45 seconds.
import numpy as np

# Sample of response times in seconds
response_times = [38, 42, 35, 39, 45, 40, 37, 44, 41, 36, 38, 39, 43, 34, 37, 42, 38, 40, 36, 37]

# Calculate the 95th percentile
percentile_95 = np.percentile(response_times, 95)

percentile_95
The Python code uses the NumPy library, a fundamental library for scientific computing
with Python. It provides support for multidimensional arrays and high-level matrix objects.
In this case, we use the np.percentile function to calculate the desired percentile of a
dataset. The function np.percentile(array, percentile) will return the value below which the
specified percentage of the dataset falls.
1. response_times is an array that contains the sample data of response times in seconds.
2. Using the statement np.percentile(response_times, 95) we calculate the 95th percentile.
This function analyzes the ordered vector of data and provides the value below which
95% of the observations fall. This allows us to identify a value that represents a
benchmark against which to plan the contact center staff.
3. Finally, percentile_95 contains the value of the 95th percentile of the response times, which with numpy's default linear interpolation is approximately 44 seconds (44.05). This is slightly below the 45 obtained with the (N + 1) convention used in the manual solution; the small difference is only due to the interpolation rule.
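For reference, the following sketch completes the calculation with the given data and also reproduces the (N + 1) convention used in the manual solution, which is what yields the value close to 45 rather than the ≈ 44 returned by np.percentile's default interpolation.

import numpy as np

response_times = [38, 42, 35, 39, 45, 40, 37, 44, 41, 36,
                  38, 39, 43, 34, 37, 42, 38, 40, 36, 37]

# numpy's default convention (linear interpolation on the ordered data)
percentile_95 = np.percentile(response_times, 95)
print(f"95th percentile (numpy default): {percentile_95}")   # about 44.05

# The (N + 1) convention used in the manual solution
sorted_times = sorted(response_times)
pos = 0.95 * (len(sorted_times) + 1)          # 19.95
lower = sorted_times[int(pos) - 1]            # 19th value = 44
upper = sorted_times[int(pos)]                # 20th value = 45
percentile_95_n_plus_1 = lower + (pos - int(pos)) * (upper - lower)
print(f"95th percentile ((N + 1) convention): {percentile_95_n_plus_1}")   # 44.95, i.e. about 45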
The management intends to determine the level of weekly sales that is exceeded in only 25% of the weeks, to recognize strengths and weaknesses in the product's production and
distribution chain.
Solution
To solve this problem, we need to find the 75th percentile of the weekly sales. Percentiles
are a statistical concept that indicates a value below which a given percentage of data
falls.
1. Arrange the data in ascending order: 120, 135, 140, 150, 155, 160, 165, 170, 175, 180,
185, 190, 195, 200.
2. Determine the position of the percentile using the formula: P = (n - 1) • (percentile/100)
+ 1, where n is the number of data points and percentile is the target.
o In this case: P = (14 - 1) • (0.75) + 1 = 10.75
3. Since P is not an integer, take a linear combination of positions 10 and 11 of the sorted
data:
o Sale at position 10: 180, Sale at position 11: 185.
o Final calculation: Sale(75th) = 180 + 0.75 · (185 − 180) = 183.75
Round up or down at the company's discretion based on the units sold, thus obtaining
approximately 184 weekly sales.
The technical jacket will have weekly sales equal to or exceeding 184 units in 25% of the
weeks. This value allows the company to better understand sales behavior and plan
production and distribution more effectively.
import numpy as np
# Weekly sales data
sales = np.array([120, 135, 140, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200])

percentile_75 = np.percentile(sales, 75)
print(f"The number of sales at the 75th percentile is: {percentile_75}")
In this code, we are using the numpy library to calculate the 75th percentile of the weekly sales.
This operation allows us to quickly and accurately obtain percentiles of the sales data,
which can be extremely useful for data analysis, helping the company understand sales
behavior and make informed decisions about production and distribution of the product.
The management of the chain wants to identify the score below which 25% of the ratings
fall, to pinpoint the hotels that require immediate improvement in the services offered.
Solution
To solve this problem, the goal is to calculate the 25th percentile of the satisfaction score
data. First, the data sample is ordered in ascending order: 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8,
8, 8, 9, 9, 9, 9, 9, 9, 10, 10
The satisfaction score below which 25% of the ratings fall is 6. This calculation is crucial for
the hotel chain management to identify the facilities whose satisfaction score is in the
lowest quartile, requiring priority corrective actions.
p25 = stats.scoreatpercentile(satisfaction_scores, 25)
print(f"The 25th percentile is {p25}")
In this code, we use the scipy library as follows:
1. The scoreatpercentile function of scipy.stats calculates the value below which a certain
percentage of the underlying data falls.
2. Our dataset satisfaction scores is passed to the function along with the percentile we
are interested in (25 in this case).
3. We print the 25th percentile. This indicates the score below which 25% of the data lies
in the context of satisfaction scores, useful for identifying hotels that need immediate
improvements.
By using scipy, the code is not only readable and concise but also efficient, making it
suitable for more complex statistical analyses.
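A complete version of the script might look as follows; the variable name satisfaction_scores follows the garbled listing above, and the data are the values transcribed in the solution. Note that the exact value returned depends on scipy's interpolation rule for this sample, so it may come out slightly above the rounded score of 6 quoted above.

from scipy import stats

# Satisfaction scores as listed in the solution
satisfaction_scores = [5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8,
                       8, 8, 8, 9, 9, 9, 9, 9, 9, 10, 10]

# Value below which 25% of the ratings fall
p25 = stats.scoreatpercentile(satisfaction_scores, 25)
print(f"The 25th percentile is {p25}")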
2.5 Chebyshev's Inequality
Let X be a random variable with mean μ and variance σ². Then, for any k > 1,

P(|X − μ| ≥ kσ) ≤ 1/k²

This inequality states that the probability that X deviates from its mean by at least k times the standard deviation is at most 1/k².
Since it is valid for any distribution with a finite variance, it is a fundamental tool in
statistics and probability theory, especially in business contexts where it is necessary to
manage the possibility of extreme events.
The board of directors is interested in knowing how frequently significant fluctuations from
the average can occur since high variations in sales can influence supply decisions and
marketing strategies. In particular, they want to estimate the absolute minimum and
maximum sales threshold that can be expected in 2024, such that 90% of the months fall
within this range.
Knowing that no particular assumptions can be made about the data distribution, what
estimate might the statistics team of ABC S.p.A. provide?
Solution
To address the board of directors' question, we can use a principle that holds whatever the shape of the data distribution: Chebyshev's inequality.
Chebyshev's inequality states that for any data distribution with mean μ and standard deviation σ, the proportion of observations falling within k standard deviations of the mean is at least 1 − 1/k². Requiring this proportion to be at least 90% gives:

1 − 1/k² ≥ 0.90  ⟹  1/k² ≤ 0.10  ⟹  k² ≥ 10  ⟹  k ≥ √10 ≈ 3.162
This means that, to ensure at least 90% of the months fall within this range, we must
consider an interval of 3.162 standard deviations from the mean.
Therefore, the statistics team can tell that for 2024, in order to align with the board's
expectations, it can be expected with at least 90% certainty that the monthly sales of this
product will fluctuate approximately between 2,095 and 17,905 units.
import numpy as np

# Historical monthly sales data (values consistent with the 2,095-17,905 interval stated in the solution)
mean = 10000
standard_deviation = 2500
target_percentage = 0.90
k = np.sqrt(1 / (1 - target_percentage))

# Calculation of the confidence interval
lower_interval = mean - k * standard_deviation
upper_interval = mean + k * standard_deviation

# Result
(lower_interval, upper_interval)
In the Python code above, we are mainly using the library numpy to compute the Chebyshev interval.
Details:
1. We define mean and standard deviation, which represent respectively the historical
average sales and standard deviation.
2. The key formula from Chebyshev's inequality is k = √(1 / (1 − p)), where p is the target percentage. The target
is set to 0.9 because we want to cover at least 90% of the fluctuations.
3. Using the mean and the calculated value of k, the lower and upper bounds of the
interval are calculated as mean − k · standard_deviation and mean + k · standard_deviation.
This represents the range within which we expect the monthly sales to fall with 90%
certainty.
This statistical approach adapts well even when the distribution is not normal, making it
versatile for various practical applications.
They want to identify a monthly inventory safety range such that at least 85% of the time,
the inventory levels remain within this range. Using historical data, what inventory range
should TechWare consider?
Solution
To answer this question, we apply a technique that allows us to estimate how many
observations lie within a certain number of standard deviations from the mean, regardless
of the actual distribution of the data.
For TechWare, the mean μ of the monthly inventory is 5000 units and the standard
deviation σ is 1200 units. We are interested in a range that covers at least 85% of the
inventory levels.
We require that:
1 − 1/k² = 0.85
1/k² = 0.15
k² = 1 / 0.15 ≈ 6.67
k ≈ 2.58
This means that at least 85% of the data will be within 2.58 standard deviations from the
mean.
Therefore, TechWare should predict that the inventory will be between 1904 and 8096 units
for 85% of the months, based on historical data and without assuming a specific
distribution.
mean = 5000
sigma = 1200
probability = 0.85
k = (1 / (1 - probability)) ** 0.5

# Calculate the lower and upper limits
lower_limit = mean - k * sigma
upper_limit = mean + k * sigma

# Display the results
(lower_limit, upper_limit)
Let's look at the details of this script:
1. We define the mean and standard deviation of inventories provided by the problem,
along with the required probability of 85%.
2. We use the relation 1 − 1/k² = 0.85 to calculate k, whose value is k = (1/(1 − probability))**0.5. Applying this formula may lead to results with more decimal places than reported in the solution, resulting in slight discrepancies.
3. With the k value, we calculate the lower and upper bounds of the safety range using
the formulas:
o Lower = μ − kσ
o Upper = μ + kσ
4. Finally, the code displays the results, corresponding to the range within which the
inventory will fall 85% of the time.
This approach does not assume that the data is normally distributed, making it more
general and applicable even when the data distribution is unknown.
Exercise 36. Management of Profit Margins The ItalFood company, specialized in the
production and distribution of food products, wants to analyze the variability of monthly
profit margins on its main products. From the analysis of historical data of the last 3 years,
it emerges that the average profit margin is 15000 euros, with a standard deviation of
3000 euros.
The management is concerned about the impacts that overly variable profit margins can
have on investment decisions and wants to establish a safety margin. Specifically, they
want to know what the range of monthly profit margins could be such that it can be stated
with confidence that at least 95% of the months of the next year could have a profit margin
within this interval. Provide an estimate based on historical data.
Solution
To solve this problem, we apply a fundamental statistical principle to estimate the safety
margin: Chebyshev's inequality, which guarantees that at least a proportion 1 − 1/k² of the observations lies within k standard deviations of the mean.
We know that the average profit margin is μ = 15000 euros and the standard deviation is σ
= 3000 euros. We want to estimate an interval such that at least 95% of the profit margins
fall within it.
Setting 1 − 1/k² = 0.95 and solving for k, we obtain 1/k² = 0.05, hence k² = 20 and
k = √20 ≈ 4.47.
Therefore, it can be stated with at least 95% confidence that the monthly profit margin for
ItalFood will be between 1590 euros and 28410 euros.
import numpy as np

# Problem data
mu = 15000     # Average profit margin
sigma = 3000   # Standard deviation
k = np.sqrt(1 / (1 - 0.95))

# Calculate the range of profit margins
lower_bound = mu - k * sigma
upper_bound = mu + k * sigma

# Results
range_interval = (lower_bound, upper_bound)
range_interval
In this code, we are calculating a safety range for the monthly profit margins using Chebyshev's inequality.
The calculated interval thus provides a range of profit margins that, with at least 95%
confidence, will contain most of the future monthly profit data. Again, the use of more
decimal places might lead to slight discrepancies in the result compared to what was
provided in the original solution.
2.6 Identification of Outliers with IQR Method
Outliers are anomalous values that significantly deviate from the majority of the data in a
set. A common method for detecting them uses the interquartile range (IQR):

IQR = Q3 − Q1

where:
• Q1 (first quartile) is the value below which 25% of the data lies.
• Q3 (third quartile) is the value below which 75% of the data lies.
The IQR measures the central dispersion of the dataset, ignoring the extremes.
The IQR method is therefore an effective tool for detecting anomalous values and
improving the quality of statistical analysis.
Year   Quarter   Product A   Product B   Product C
1      Q1        23          45          12
1      Q2        26          47          15
1      Q3        28          49          11
1      Q4        29          52          14
2      Q1        31          53          37
2      Q2        34          54          16
2      Q3        30          55          18
2      Q4        35          75          17
Table 2.1: Product Sales.
The team is tasked with identifying which numbers provided can be considered significant
enough to cause concern or require further investigation by management.
Use the provided data to identify any potential anomalies and indicate the calculation
method and the values that are considered out of the ordinary for each product.
Solution
For each product, we calculate the first quartile (Q1) and the third quartile (Q3) to determine what we should consider as an anomaly. We then compute the interquartile range (IQR = Q3 − Q1) and set limits to identify outliers (i.e., Q1 − 1.5·IQR and Q3 + 1.5·IQR). Values outside these limits can be classified as outliers.
Product A: ordered data 23, 26, 28, 29, 30, 31, 34, 35; Q1 = 27.5, Q3 = 31.75, IQR = 4.25; limits 21.125 and 38.125 → no outliers.
Product B: ordered data 45, 47, 49, 52, 53, 54, 55, 75; Q1 = 48.5, Q3 = 54.25, IQR = 5.75; limits 39.875 and 62.875 → 75 is an outlier.
Product C: ordered data 11, 12, 14, 15, 16, 17, 18, 37; Q1 = 13.5, Q3 = 17.25, IQR = 3.75; limits 7.875 and 22.875 → 37 is an outlier.
For Product B and C, the sales of 75 in the fourth quarter and 37 in the first quarter of the
second year respectively represent statistical anomalies and warrant further analysis to
understand the underlying causes. This procedure is based on the analysis of limits
calculated using the interquartile range method for identifying outliers.
sales = {
    'Product A': [23, 26, 28, 29, 31, 34, 30, 35],
    'Product B': [45, 47, 49, 52, 53, 54, 55, 75],
    'Product C': [12, 15, 11, 14, 37, 16, 18, 17]
}

# Function to calculate and identify outliers
def find_outliers(data, product_name):
    sorted_data = sorted(data)
    Q1 = np.percentile(sorted_data, 25)
    Q3 = np.percentile(sorted_data, 75)
    # ... IQR, limits and outlier check ...
    print(f"{product_name}:")
    print(f" - Ordered data: {sorted_data}")
    print(f" - Q1 = {Q1}")
    print(f" - Q3 = {Q3}")
    # ...

for product, data in sales.items():
    find_outliers(data, product)
The provided Python code identifies outliers in the quarterly sales of each product.
• The scipy library is used, particularly the stats module for directly calculating the IQR
(interquartile range), numpy is used to calculate the percentiles representing the
necessary quartiles (QI and Q3).
• Quarterly data for each product is represented in a dictionary. Each product is a key
with a list of quarterly sales as values.
• There is a function that takes the data for a product and calculates the quartiles QI
and Q3 using np.percentile. The IQR is calculated using scipy.stats.iqr. The limits for
identifying outliers are defined as [Q1-1.5-/QR,Q3+ 1.5 • IQR]. The outliers are
identified by comparing the data against these limits.
• The data is ordered to facilitate the correct calculation of percentiles, although it is not
strictly necessary. Each intermediate step, from the ordered data to the identified
limits, is printed for transparency and manual verification.
The final result includes informative output on the presence of any outliers for each
product.
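A complete, runnable version of the procedure described above might look like this; the function and variable names are assumptions where the original listing is garbled.

import numpy as np
from scipy import stats

sales = {
    'Product A': [23, 26, 28, 29, 31, 34, 30, 35],
    'Product B': [45, 47, 49, 52, 53, 54, 55, 75],
    'Product C': [12, 15, 11, 14, 37, 16, 18, 17],
}

def find_outliers(data, product_name):
    """Print the quartiles, the IQR limits and any outliers for one product."""
    sorted_data = sorted(data)
    Q1 = np.percentile(sorted_data, 25)
    Q3 = np.percentile(sorted_data, 75)
    IQR = stats.iqr(sorted_data)
    lower_limit = Q1 - 1.5 * IQR
    upper_limit = Q3 + 1.5 * IQR
    outliers = [x for x in sorted_data if x < lower_limit or x > upper_limit]

    print(f"{product_name}:")
    print(f" - Ordered data: {sorted_data}")
    print(f" - Q1 = {Q1}, Q3 = {Q3}, IQR = {IQR}")
    print(f" - Limits: [{lower_limit}, {upper_limit}]")
    print(f" - Outliers: {outliers if outliers else 'none'}")

for product, data in sales.items():
    find_outliers(data, product)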
Employee   Jan   Feb   Mar   Apr   May   Jun
1 5 8 10 3 2 12
2 7 9 11 6 3 15
3 6 7 13 8 4 2
4 9 11 14 9 6 0
5 10 13 12 7 5 0
6 4 10 12 3 2 25
7 8 8 6 9 4 8
8 10 5 8 12 1 10
9 3 2 1 4 4 3
10 7 9 8 6 5 2
Table 2.2 : Employee Overtime
The task is to identify which employees are working disproportionate amounts of overtime
in any given month, and if any anomalies might require preventive measures to avoid work
overload.
Solution
To solve this problem, we can explore the distributions of monthly overtime hours to
identify outliers. We will use a method involving the calculation of the interquartile range
(IQR).
1. For each month, order the overtime hours and calculate the first quartile (Q1), the third
quartile (Q3), and then the interquartile range (IQR) as follows:
IQR = Q3 − Q1
2. Any value below Q1 − 1.5 · IQR or above Q3 + 1.5 · IQR is considered an outlier.
3. Take January for example: Overtime hours: [5, 7, 6, 9, 10, 4, 8, 10, 3, 7]
o Ordered: [3, 4, 5, 6, 7, 7, 8, 9, 10, 10]
o Q1 = 5.25, Q3 = 8.75
o IQR = Q3 − Q1 = 3.5
o Outlier limits: Q1 − 1.5·IQR = 0 and Q3 + 1.5·IQR = 14
No employee worked a significant number of hours outside this range, so there are no
significant anomalies for January.
4. Use the same approach for each month.
overtime_hours = {
    'Jan': [5, 7, 6, 9, 10, 4, 8, 10, 3, 7],
    'Feb': [8, 9, 7, 11, 13, 10, 8, 5, 2, 9],
    'Mar': [10, 11, 13, 14, 12, 12, 6, 8, 1, 8],
    # ... remaining months from Table 2.2 ...
}

# Function to identify outliers
def identify_outliers(monthly_hours):
    Q1 = np.percentile(monthly_hours, 25)
    Q3 = np.percentile(monthly_hours, 75)
    # ... IQR, lower_bound and upper_bound ...
    outliers = [(index + 1, hours) for index, hours in enumerate(monthly_hours)
                if hours < lower_bound or hours > upper_bound]
    return outliers

# Apply the function to each month
outliers_per_month = {month: identify_outliers(hours) for month, hours in overtime_hours.items()}

# Print the results
for month, outliers in outliers_per_month.items():
    if outliers:
        print(f"Outliers in the month of {month}: {outliers}")
• The function identify outliers takes a list of hours for each month, calculates the
quartiles, the IQR, then determines the lower and upper limits, thus identifying values
that significantly deviate from the norm.
• Outliers are identified if they are below Q1 − 1.5 · IQR or above Q3 + 1.5 · IQR. This
method is effective because it identifies statistically significant deviations from the
'average' of other data.
• For each month, the code prints employees with anomalous overtime or states the
absence of such anomalies.
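For reference, the following sketch assembles the full script from Table 2.2 (all six months). The names overtime_hours and identify_outliers follow the garbled listing above, and the March value for employee 10 is transcribed from the table where the original listing is cut off.

import numpy as np

# Monthly overtime hours for employees 1-10, from Table 2.2
overtime_hours = {
    'Jan': [5, 7, 6, 9, 10, 4, 8, 10, 3, 7],
    'Feb': [8, 9, 7, 11, 13, 10, 8, 5, 2, 9],
    'Mar': [10, 11, 13, 14, 12, 12, 6, 8, 1, 8],
    'Apr': [3, 6, 8, 9, 7, 3, 9, 12, 4, 6],
    'May': [2, 3, 4, 6, 5, 2, 4, 1, 4, 5],
    'Jun': [12, 15, 2, 0, 0, 25, 8, 10, 3, 2],
}

def identify_outliers(monthly_hours):
    """Return (employee number, hours) pairs outside the IQR limits."""
    Q1 = np.percentile(monthly_hours, 25)
    Q3 = np.percentile(monthly_hours, 75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return [(index + 1, hours) for index, hours in enumerate(monthly_hours)
            if hours < lower_bound or hours > upper_bound]

outliers_per_month = {month: identify_outliers(hours)
                      for month, hours in overtime_hours.items()}

for month, outliers in outliers_per_month.items():
    if outliers:
        print(f"Outliers in the month of {month}: {outliers}")
    else:
        print(f"No significant anomalies in {month}")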
Exercise 39. Company Energy Efficiency Analysis An energy company is assessing the
energy consumption efficiency of different sections of their production facilities. Corporate
auditors have collected weekly data, expressed in thousands of kWh, related to
consumption for four different sections over the past two months.
The team is tasked with determining which data among these are significantly different
from the others, suggesting a need for more in-depth analysis. Identify the standout values
and describe how to interpret them.
Solution
To identify outliers in the energy consumption of the sections, we apply the interquartile
range (IQR) method.
1. Identify the first quartile Q1 and the third quartile Q3 for the consumption of each section.
2. Calculate the IQR: IQR = Q3 − Q1.
3. A value is considered an outlier if it is less than Q1 − 1.5 · IQR or greater than Q3 + 1.5 · IQR.
Any value that falls outside the identified IQR range may indicate unusual energy
consumption and would need investigation to understand the causes, which could be
technical or operational. This approach helps maintain energy efficiency and optimization
within the company.
# Determine bounds for outliers
lower_bound = Q1 - 1.5 * IQR_value
upper_bound = Q3 + 1.5 * IQR_value

# Identification of outliers
section_outliers = consumptions_df[(consumptions_df[section] < lower_bound) | (consumptions_df[section] > upper_bound)]

# Results
print(outliers)
In this script, we import the necessary libraries: numpy for numerical operations, pandas for the data table, and scipy for the interquartile range.
The DataFrame is constructed from the weekly consumption table. For each section, we calculate the first quartile Q1 and the third quartile Q3 using numpy.percentile, while the interquartile range is calculated with scipy.stats.iqr.
We create bounds for the outliers: any value below Q1 − 1.5 · IQR or above Q3 + 1.5 · IQR is considered an outlier. We then use pandas conditional indexing to
extract these values and store them in a dictionary called outliers, which is then printed at
the end of the script.
This analysis allows identifying unusual energy consumption in each section for further
investigations.
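Since the auditors' consumption table is not reproduced here, the following sketch uses purely hypothetical weekly figures just to illustrate the pandas-based filtering described above; replace the numbers in consumptions_df with the real data.

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical weekly consumption (thousands of kWh) for four sections over 8 weeks
consumptions_df = pd.DataFrame({
    'Section A': [52, 54, 51, 53, 55, 52, 80, 54],
    'Section B': [40, 42, 41, 39, 43, 41, 40, 42],
    'Section C': [65, 66, 64, 30, 67, 65, 66, 64],
    'Section D': [75, 77, 76, 74, 78, 75, 76, 77],
})

outliers = {}
for section in consumptions_df.columns:
    Q1 = np.percentile(consumptions_df[section], 25)
    Q3 = np.percentile(consumptions_df[section], 75)
    IQR_value = stats.iqr(consumptions_df[section])

    # Determine bounds for outliers
    lower_bound = Q1 - 1.5 * IQR_value
    upper_bound = Q3 + 1.5 * IQR_value

    # Identification of outliers via conditional indexing
    section_outliers = consumptions_df[(consumptions_df[section] < lower_bound) |
                                       (consumptions_df[section] > upper_bound)][section]
    outliers[section] = section_outliers.tolist()

print(outliers)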
2.7 Pearson Correlation Coefficient
The Pearson correlation coefficient (or Pearson's linear correlation coefficient) is a measure
of the linear relationship between two numerical variables. It indicates how strongly and in
what direction two variables are correlated.
Given a set of n pairs of data (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the Pearson correlation coefficient, denoted as r, is defined as:

r = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ(xᵢ − x̄)² · Σᵢ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of the x and y values, respectively.
The Pearson correlation coefficient takes values in the range [−1, 1], which are interpreted as follows: values close to +1 indicate a strong positive linear relationship, values close to −1 a strong negative linear relationship, and values near 0 the absence of a linear relationship.
It's important to remember that correlation does not imply causation: a high value of r does
not mean that one variable necessarily causes the other.
Week   Advertising Expenses   Sales
1      20                     150
2      25                     160
3      30                     220
4      35                     240
5      40                     280
6      25                     170
7      30                     200
8      40                     260
9      45                     300
10     50                     320
Calculate the value of the statistical measure that quantifies the strength of the linear
relationship between the advertising expenses and sales. Interpret the result obtained.
Solution
To solve this problem, we use Pearson's correlation coefficient, which measures the
strength and direction of the linear relationship between two quantitative variables.
After performing the calculations, we obtain r ≈ 0.99. This value indicates a strong
positive correlation between advertising expenses and sales, suggesting that as
advertising expenses increase, sales tend to increase proportionally. This implies that the
company's advertising activities have a significant influence on sales.
# Data
advertising_expenses = np.array([20, 25, 30, 35, 40, 25, 30, 40, 45, 50])
sales = np.array([150, 160, 220, 240, 280, 170, 200, 260, 300, 320])

# Output
print(f"Pearson correlation coefficient: {pearson_corr}")
In this code, we use the scipy library, which provides the pearsonr function for this calculation.
1. We import numpy for easily working with vectors and matrices, and pearsonr from the
scipy.stats library to calculate the Pearson correlation coefficient.
2. Initially, we define two numpy arrays: one for the weekly advertising expenses and one
for the sales generated.
3. We use the pearsonr function provided by scipy.stats, which calculates the Pearson
correlation coefficient. This function returns two values: the first is the desired
correlation coefficient, the second is the p-value (which in this case we are not
interested in) of a hypothesis test whose null hypothesis is the absence of a correlation.
4. Finally, we print the calculated value of the Pearson correlation coefficient, which
indicates the strength and direction of the linear relationship between advertising
expenses and sales.
The use of scipy greatly simplifies the calculation thanks to the pearsonr function, which
allows for quickly obtaining the correlation coefficient without having to manually
implement the mathematical formula.
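A complete version of the calculation, using the data from the table above, might look like this.

import numpy as np
from scipy.stats import pearsonr

# Weekly advertising expenses and sales
advertising_expenses = np.array([20, 25, 30, 35, 40, 25, 30, 40, 45, 50])
sales = np.array([150, 160, 220, 240, 280, 170, 200, 260, 300, 320])

# pearsonr returns the correlation coefficient and the p-value of the
# no-correlation test; here we only use the coefficient
pearson_corr, p_value = pearsonr(advertising_expenses, sales)

print(f"Pearson correlation coefficient: {pearson_corr:.2f}")   # about 0.99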
Department   Average Training Hours   Average Productivity
A            10                       200
B            15                       220
C            22                       240
D            25                       260
E            31                       280
F            35                       300
G            40                       320
Table 2.5: Productivity and Training Hours.
Determine the statistical measure that quantifies the strength of the relationship between
training hours and productivity. Provide an interpretation of the results.
Solution
In this exercise, we use Pearson's correlation coefficient to examine the linear relationship
between the average number of training hours and average productivity per employee.
Pearson's coefficient, denoted by r, is calculated using the formula:

r = Σᵢ(Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σᵢ(Xᵢ − X̄)² · Σᵢ(Yᵢ − Ȳ)² )

where Xᵢ and Yᵢ are our data on training hours and productivity, respectively, and X̄ and Ȳ are the means of these data.
Calculating this coefficient for the given data, we obtain r = 0.997. This value, close to 1,
suggests that there is a strong positive correlation between training hours and productivity.
This implies that, generally, a greater amount of training hours is associated with an
increase in productivity in the analyzed departments. However, it's important to note that
this coefficient only measures correlation, not causation, and other factors could influence
the observed relationship.
import numpy as np
from scipy.stats import pearsonr

# Data
training_hours = np.array([10, 15, 22, 25, 31, 35, 40])
productivity = np.array([200, 220, 240, 260, 280, 300, 320])

# Pearson correlation coefficient and p-value
r, p_value = pearsonr(training_hours, productivity)

# Result
r
In the provided code, we use the scipy library, particularly the stats module, to calculate Pearson's correlation coefficient.
1. We import numpy for convenient numerical array operations and pearsonr from
scipy.stats to calculate Pearson’s correlation coefficient.
2. The data for training hours and productivity are stored in two numpy arrays. These
arrays represent the average training hours and the average output for each
department, respectively.
3. We use the function pearsonr from scipy.stats, which takes two arrays and returns
Pearson's correlation coefficient and the associated p-value. In this context, we are
primarily interested in the correlation coefficient r, which quantifies the linear
relationship.
4. The value of r is printed on the screen, showing the strength of the correlation.
In this case, Pearson's correlation coefficient is calculated to show how strong the
relationship is between the two analyzed variables. The result is close to 1, indicating a
strong positive correlation between training hours and average productivity. However, it's
important to keep in mind that correlation does not imply causation.
Customer   Satisfaction   Renewal (%)
1 4 40
2 5 50
3 7 70
4 8 80
5 6 60
6 3 30
7 9 90
8 10 100
Calculate the measure that allows understanding the relationship between satisfaction and
contract renewal. Interpret the significance of the obtained result.
Solution
To analyze the relationship between customer satisfaction levels and contract renewal
rates, we use the Pearson correlation coefficient. This coefficient quantifies the strength
and direction of the linear relationship between two numerical variables.
We calculate the mean of satisfaction (x̄) and the mean of renewal (ȳ):

x̄ = (4 + 5 + 7 + 8 + 6 + 3 + 9 + 10) / 8 = 6.5
ȳ = (40 + 50 + 70 + 80 + 60 + 30 + 90 + 100) / 8 = 65

Next, we calculate the sum of the products of the deviations of the individual observations from their means:

Σᵢ(xᵢ − x̄)(yᵢ − ȳ) = (4 − 6.5)(40 − 65) + ... + (10 − 6.5)(100 − 65) = 420
# Collected data
satisfaction = np.array([4, 5, 7, 8, 6, 3, 9, 10])
renewal = np.array([40, 50, 70, 80, 60, 30, 90, 100])

r
In this code, the libraries numpy and scipy are used to calculate Pearson's correlation coefficient.
• We import numpy and scipy.stats.pearsonr. numpy is used to work with arrays, which are
essential for managing and calculating numerical data. The pearsonr function from the
scipy library allows for directly calculating the Pearson correlation coefficient without
the need for manual implementation.
• The satisfaction data and renewal rate are stored in two numpy arrays: satisfaction and
renewal.
• The pearsonr function is applied to the two arrays to calculate the Pearson correlation
coefficient r and the p-value (ignored in this exercise). The coefficient r quantifies the
strength and direction of the linear relationship between the two arrays.
• By returning r, we obtain the measure of correlation between customer satisfaction and
the renewal rate. In this case, the calculated value is 1, indicating a perfect positive
correlation.
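For reference, the following sketch repeats the calculation in full, checking the manual deviation-based formula of the solution against scipy's pearsonr; both give r = 1 for these data.

import numpy as np
from scipy.stats import pearsonr

satisfaction = np.array([4, 5, 7, 8, 6, 3, 9, 10])
renewal = np.array([40, 50, 70, 80, 60, 30, 90, 100])

# Manual computation, mirroring the steps of the solution
x_bar, y_bar = satisfaction.mean(), renewal.mean()                   # 6.5 and 65
sum_products = np.sum((satisfaction - x_bar) * (renewal - y_bar))    # 420
r_manual = sum_products / np.sqrt(np.sum((satisfaction - x_bar) ** 2) *
                                  np.sum((renewal - y_bar) ** 2))

# Direct computation with scipy
r, p_value = pearsonr(satisfaction, renewal)

print(f"Manual r: {r_manual:.2f}   scipy r: {r:.2f}")   # both 1.00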
2.8 Spearman's Rank Correlation Coefficient
Spearman's rank correlation coefficient (or Spearman's rho) is a measure of the strength
and direction of the monotonic relationship between two variables. Unlike Pearson's
coefficient, which measures linear relationships, Spearman's coefficient is more suitable for
data that do not follow a linear distribution but may still have a monotonic relationship (i.e.,
always increasing or always decreasing).
Spearman's rank correlation coefficient, denoted by ρₛ (or simply rₛ), is defined as:

rₛ = 1 − 6 Σᵢ dᵢ² / ( n(n² − 1) )

where:
• dᵢ = R(xᵢ) − R(yᵢ) is the difference between the ranks of each data pair (xᵢ, yᵢ). The rank is a sequential number that follows the order of values (the lowest value has rank 1, the next rank 2, and so on).
• R(xᵢ) and R(yᵢ) are the ranks of the values xᵢ and yᵢ, respectively.
• n is the number of observations in the dataset.
• n is the number of observations in the dataset.
Spearman's rank correlation coefficient takes values in the range [−1, 1]: values close to +1 indicate a strong increasing monotonic relationship, values close to −1 a strong decreasing one, and values near 0 the absence of a monotonic relationship. Some of its properties:
• ρₛ only measures the monotonic relationship, making it less sensitive to non-linear
deviations compared to Pearson’s coefficient.
• It is dimensionless, meaning it does not depend on the units of measurement of the
variables.
• It is symmetric, meaning the correlation between x and y is the same as between y and
x.
• It is less affected by outliers compared to Pearson's coefficient, as it is based on ranks
rather than absolute values.
Exercise 43. Analysis of Sales and Product Reviews A new startup is launching a
range of electronic products and wants to understand the relationship between weekly
sales and customer reviews left on the website. Data has been collected for ten
consecutive weeks on sales (in thousands of units) and the average review score (from 1 to
5). The data is as follows:
Week   Sales (thousands)   Average Review Score
1      30                  4.3
2      45                  3.8
3      18                  2.9
4      25                  4.0
5      35                  4.5
6      50                  4.7
7      40                  4.1
8      28                  3.1
9      33                  3.6
10     38                  4.2
The startup wants to know if there is a correlation between sales and the review score.
Solution
To solve this problem, we use the Spearman's rank correlation coefficient. This coefficient is
useful when you want to measure the monotonic dependence between two variables (not
necessarily linear).
Rank table:
Week   Sales Rank   Review Rank
1      4            8
2      9            4
3      1            1
4      2            5
5      6            9
6      10           10
7      8            6
8      3            2
9      5            3
10     7            7
rₛ = 1 − 6 Σᵢ dᵢ² / ( n(n² − 1) )
Where d, is the difference between the ranks of x and y, and n is the number of
observations.
We calculate the differences dᵢ, which are [4 − 8, 9 − 4, ..., 7 − 7], i.e., [−4, 5, 0, −3, −3, 0, 2, 1, 2, 0].
Thus, Σ dᵢ² = 68 and rₛ = 1 − (6 · 68) / (10 · (10² − 1)) ≈ 0.588.
sales = [30, 45, 18, 25, 35, 50, 40, 28, 33, 38]
reviews = [4.3, 3.8, 2.9, 4.0, 4.5, 4.7, 4.1, 3.1, 3.6, 4.2]

spearman_corr
In this code, using the scipy library, we calculate the Spearman's rank correlation coefficient between sales and reviews.
1. The spearmanr function is part of the scipy.stats module and is used to calculate
Spearman's rank correlation coefficient.
2. The weekly sales and review data are represented as Python lists: sales and reviews.
3. The spearmanr function returns both the Spearman's rank correlation coefficient and the
p-value. In this example, we only use the correlation coefficient (assigned to
spearman corr).
4. A Spearman coefficient value approximately equal to 0.5878 indicates a moderately
strong positive correlation, suggesting that an increase in review scores is moderately
associated with an increase in sales.
In summary, this simple implementation highlights the usefulness of the scipy library for
performing complex statistical analyses in just a few lines of code.
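A complete runnable version of this calculation might look as follows.

from scipy.stats import spearmanr

# Weekly sales (thousands of units) and average review scores
sales = [30, 45, 18, 25, 35, 50, 40, 28, 33, 38]
reviews = [4.3, 3.8, 2.9, 4.0, 4.5, 4.7, 4.1, 3.1, 3.6, 4.2]

# spearmanr returns the rank correlation coefficient and its p-value
spearman_corr, p_value = spearmanr(sales, reviews)

print(f"Spearman's rank correlation coefficient: {spearman_corr:.4f}")   # about 0.5879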
Campaign   Budget   Efficiency Score
1          120      8
2          100      6
3          150      9
4          80       5
5          200      8
6          50       4
7          170      8
8          130      7
The company's goal is to understand the correlation between the invested budget in the
campaigns and the efficiency score perceived by the experts.
Solution
To determine whether there is a relationship between the advertising campaign budget and
the perceived efficiency score, we use Spearman's rank correlation coefficient. This method
evaluates the correlation between two ordinal ranks when the data is not necessarily
normally distributed or does not have a linear relationship.
1. Sort the data based on their respective budget values (x) and score (y).
2. Calculate the ranks for each value.
3. Determine the difference between the ranks of the two variables for each campaign.
4. Calculate the coefficient using Spearman's formula:
ρ = 1 − 6 Σᵢ dᵢ² / ( n(n² − 1) )
where dᵢ is the difference between the ranks and n is the number of observations.
After performing the calculation, a coefficient close to −1 indicates a strong negative correlation, a coefficient close to +1 a strong positive correlation, and a value around 0 suggests the absence of correlation. Here we obtain ρ ≈ 0.83, which
would indicate a significant positive correlation between the campaign budgets and the
perceived performance score. This result suggests that an increase in campaign budget is
generally associated with an improvement in the performance score perceived by the
experts.
import numpy as np
from scipy.stats import spearmanr

# Campaign data
budgets = np.array([120, 100, 150, 80, 200, 50, 170, 130])
scores = np.array([8, 6, 9, 5, 8, 4, 8, 7])

# Spearman's rank correlation coefficient
rho, p_value = spearmanr(budgets, scores)

# Display the result
def correction_interpretation(rho):
    if rho > 0.5:
        return "The result indicates a significant positive correlation between budget and score."
    elif rho < -0.5:
        return "The result indicates a significant negative correlation between budget and score."
    else:
        return "The result indicates no significant correlation between budget and score."

# Print the result
eval_result = correction_interpretation(rho)
print(f"Spearman's correlation coefficient is: {rho:.2f}. {eval_result}")
1. The spearmanr function from the scipy.stats library calculates Spearman's correlation
coefficient between two data arrays. It returns the correlation coefficient, p, and the p-
value of the test, which is not used in this specific context.
2. The correction_interpretation function takes the value of ρ and provides a textual
interpretation of the result. This is done through a simple if -elif -else condition that
checks whether the coefficient shows a positive, negative, or no significant correlation.
3. Finally, the code prints the value of p rounded to two decimal places along with the
respective textual interpretation. The coefficient p quantifies the degree of correlation
between budget and score, helping the company understand the relationship between
investment and the perception of efficiency.
Week   Social Engagement   Web Traffic
1      10                  8
2      15                  13
3      8                   5
4      12                  10
5      20                  16
6      7                   3
7      18                  14
The management wants to determine how closely social media interaction is correlated
with web traffic to the company's site.
Solution
To analyze the relationship between social media engagement and web traffic to the
company’s site, we use Spearman's Rank Correlation Coefficient. This method is ideal for
verifying the strength and direction of a monotonic association between two variables,
especially when the data do not necessarily follow a normal distribution.
1. Rank Calculation:
o Assign a rank to each social engagement and web traffic value, ordering them
separately in ascending order.
2. Calculation of Rank Differences (d):
o For each data pair, subtract the web traffic rank from the social engagement rank.
3. Calculation of d2:
o Square the rank differences obtained (d2).
4. Spearman's Formula (ρ):
o Apply the formula ρ = 1 − 6 Σᵢ dᵢ² / ( n(n² − 1) ).
import numpy as np
from scipy.stats import spearmanr

# Social media engagement and web traffic data
engagement_social = [10, 15, 8, 12, 20, 7, 18]
web_traffic = [8, 13, 5, 10, 16, 3, 14]

# Calculation of Spearman's correlation coefficient
spearman_coefficient, p_value = spearmanr(engagement_social, web_traffic)

print(f"Spearman's Correlation Coefficient: {spearman_coefficient:.2f}")
print(f"P-value: {p_value:.4f}")
The code uses the scipy library to calculate Spearman's rank correlation coefficient and the associated p-value.
1. numpy is imported as np for potential numerical operations, while spearmanr from the
scipy.stats library is used to compute Spearman's coefficient directly.
2. The arrays engagement social and web traffic contain the data on social media
engagement and web traffic, respectively, collected over seven consecutive weeks.
3. The spearmanr() function takes the two data sets as input and returns the Spearman
correlation coefficient and the p-value, indicating the statistical significance of the
calculated coefficient.
4. The results are formatted and printed. The value of Spearman's correlation coefficient
indicates the strength and direction of the correlation. In this context, a value close to 1
would suggest a strong positive correlation, indicating that social engagement is
strongly correlated with an increase in web traffic. A low p-value (typically < 0.05)
would suggest that the observed correlation is statistically significant.
In summary, this code allows verifying the relationship between social media engagement
and web traffic generated, providing an indication of the effectiveness of the social media
marketing strategy. Spearman's correlation calculation is particularly useful when the data
are not normally distributed, making it ideal for many practical business scenarios.
Chapter 3
Regression Analysis
In this chapter, we will tackle a series of exercises dedicated
to regression problems, with a particular focus on linear and
exponential regression. Regression is an essential statistical
tool for modeling relationships between variables and is
widely used in various fields, such as financial analysis,
marketing, sales forecasting, and scientific research.
The simple linear regression model has the form Y = β₀ + β₁X + ε, where Y is the dependent variable, X is the independent variable, β₀ is the intercept, β₁ is the slope, and ε is the error term.
The most common method for estimating the coefficients β₀ and β₁ is the method of least squares, which minimizes the sum of the squared errors:

SSE = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²
This technique is widely used in various fields, including economics, engineering, and social
sciences, for making predictions and analyzing data.
Exercise 46. Demand Analysis and Forecast for a New Product A technology
company has recently launched a new electronic device onto the market. To optimize
production and distribution, the data analysis department wants to examine how the
product’s price affects the weekly demand.
Week   Price (euros)   Demand (units)
1      250             400
2      220             450
3      210             470
4      240             420
5      260             370
6      230             440
7      225             460
8      235             430
9      245             410
10     255             390
11     200             490
12     215             480
The goal is to determine the relationship between price and demand and to forecast the
demand when the price is set at 230 euros.
Solution
To solve this problem, we will apply the concept of linear regression to determine the
relationship between the product's price and its demand. Linear regression involves finding
an equation in the form:
y = mx + b
where y is the predicted demand, x is the price, m is the regression coefficient (slope), and
b is the intercept.
The slope and intercept are obtained with the least squares formulas:

m = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)²
b = ȳ − m x̄

where x̄ and ȳ are the averages of the prices and sales, respectively (x̄ ≈ 232 and ȳ ≈ 434).
Calculate m: m ≈ −1.96.
Calculate b: b = 434 − (−1.96 · 232) ≈ 889
Therefore, the predicted demand when the price is 230 euros is approximately 438 units.
Through linear regression, we have determined that, for a new price, the equation allows
us to predict potential demand, thereby helping the company to adequately plan stocks
and marketing strategies.
# Data
prices = np.array([250, 220, 210, 240, 260, 230, 225, 235, 245, 255, 200, 215])
sales = np.array([400, 450, 470, 420, 370, 440, 460, 430, 410, 390, 490, 480])

# Use of scipy's linregress function
slope, intercept, r_value, p_value, std_err = stats.linregress(prices, sales)

regression_equation, predicted_demand
In this code, we use the numpy library to handle the price and sales data and scipy to fit the regression line.
1. Import numpy for handling numerical data and stats from scipy to perform linear
regression.
2. Create two arrays prices and sales containing the data from the first 12 weeks.
3. The function stats.linregress(prices, sales) automatically calculates the slope slope (m)
and the intercept intercept (b) of the regression line equation.
4. The function also returns r value, p value, and std err, which are respectively the
correlation coefficient, p-value, and the standard error of the estimate, although in this
exercise we are primarily interested in slope and intercept.
5. We use the line equation, given by y = mx + b, with x equal to 230: predicted demand =
slope • new price + intercept.
At the end, the code provides both the regression equation and the estimated demand for
a price of 230 euros. This solution allows the company to make forecasts to optimize
production and distribution based on adopted pricing strategies.
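A complete version of the script, assembled from the fragments above, might look like this; new_price and predicted_demand are assumed names.

import numpy as np
from scipy import stats

# Weekly prices (euros) and demand (units) from the table above
prices = np.array([250, 220, 210, 240, 260, 230, 225, 235, 245, 255, 200, 215])
sales = np.array([400, 450, 470, 420, 370, 440, 460, 430, 410, 390, 490, 480])

# Least-squares fit of demand = slope * price + intercept
slope, intercept, r_value, p_value, std_err = stats.linregress(prices, sales)

new_price = 230
predicted_demand = slope * new_price + intercept

print(f"Regression equation: demand = {slope:.2f} * price + {intercept:.2f}")
print(f"Predicted demand at {new_price} euros: {predicted_demand:.0f} units")   # about 438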
Day   Working Time (minutes)   Garments Delivered
1     350                      150
2     300                      180
3     400                      130
4     320                      170
5     360                      140
6     310                      175
7     330                      160
8     345                      155
9     290                      185
10    380                      135
Management wants to know how working time affects the number of garments delivered
and desires an estimate of the number of garments that can be produced if the working
time is 325 minutes.
Solution
To determine the relationship between working time and the number of garments
delivered, we can apply the method of linear regression.
This approach allows us to understand the linear relationship between working time and
production yield, useful for making informed decisions about production resources. By
using linear regression, management can accurately estimate the expected output with a
given input of time.
import numpy as np
from scipy import stats

# Data
working_time = np.array([350, 300, 400, 320, 360, 310, 330, 345, 290, 380])
garments_delivered = np.array([150, 180, 130, 170, 140, 175, 160, 155, 185, 135])

# Calculate the regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(working_time, garments_delivered)

# Estimate for 325 minutes of working time
estimated_garments = intercept + slope * 325

estimated_garments
In the described code, we use the scipy library, specifically stats.linregress, to relate working time and garments delivered.
1. Two numpy arrays contain the working times and delivered garments for each
monitored day.
2. We use stats.linregress, which provides several parameters including:
o slope: the slope of the regression line, indicating how much the delivered garments
vary for each additional minute of working.
o intercept: the intercept, representing the estimated number of garments delivered
when the working time is zero.
o r value: the correlation coefficient, useful for understanding the strength of the
relationship.
o p value and std err: provide additional information about the significance of the
model.
3. The line is given by garments delivered = intercept + slope * working time.
4. By substituting 325 minutes for the array of times, we get an estimate of how many
garments can be delivered.
This method leverages scipy's capabilities for performing advanced statistical calculations
with simple function calls, making the data analysis process highly effective.
Initiatives Launched   New Customers
1                      120
3                      330
4                      480
6                      620
8                      790
Use these data to estimate how many new customers the company could acquire in the
next quarter if it plans to launch 5 new initiatives.
Solution
To solve this problem, we apply the concept of linear regression. Linear regression helps us
model the relationship between a dependent variable (new customers) and an independent
variable (initiatives launched) through a linear equation, namely:
y = mx + c
where y is the dependent variable, x is the independent variable, m is the slope of the line,
and c is the intercept.
From the sample data, we can calculate the slope m and the intercept c using the formulas:

m = [ n Σ(xᵢyᵢ) − (Σxᵢ)(Σyᵢ) ] / [ n Σxᵢ² − (Σxᵢ)² ]
c = [ (Σyᵢ)(Σxᵢ²) − (Σxᵢ)(Σxᵢyᵢ) ] / [ n Σxᵢ² − (Σxᵢ)² ]

Inserting the values, we get m ≈ 95 and c ≈ 50.
Therefore, it is estimated that the company will attract about 525 new customers if it
launches 5 new initiatives in the next quarter.
# Linear regression calculation
slope, intercept, r_value, p_value, std_err = linregress(initiatives, new_customers)

# Prediction for 5 new initiatives
y_pred = slope * 5 + intercept

# Result
predicted_customers = round(y_pred)
predicted_customers
In the Python code above, we used the scipy library to fit the regression line and make the prediction.
1. initiatives and new customers represent the data of initiatives launched and new
customers acquired each quarter, respectively.
2. The linregress function is used with two lists, initiatives and new_customers, returning
the slope, intercept, correlation coefficient (r value), p-value, and standard error.
3. slope represents the slope of the line, which indicates the expected change in new
customers for each additional initiative.
4. intercept is the point where the regression line intersects the y-axis.
5. We use the slope and intercept values to predict how many new customers will be
acquired for 5 new initiatives using the linear equation:
y = slope ■ 5 + intercept
6. Finally, we round the result to the nearest integer since it makes sense to treat the
number of customers as an integer quantity.
7. The final result is assigned to predicted customers, which solves the proposed problem by
providing an estimate of the number of new customers expected.
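For reference, a minimal runnable version of the calculation is shown below.

from scipy.stats import linregress

# Initiatives launched and new customers acquired per quarter
initiatives = [1, 3, 4, 6, 8]
new_customers = [120, 330, 480, 620, 790]

# Linear regression calculation
slope, intercept, r_value, p_value, std_err = linregress(initiatives, new_customers)

# Prediction for 5 new initiatives
y_pred = slope * 5 + intercept
predicted_customers = round(y_pred)

print(f"Estimated new customers for 5 initiatives: {predicted_customers}")   # about 525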
Years of Experience   Performance Score
1                     50
3                     67
5                     78
6                     82
8                     90
The company wants to predict the performance score of an employee with 7 years of
experience. Use the provided data to estimate this score.
Solution
To tackle this business prediction problem, we can apply the extrapolation technique using
linear regression.
1. The form of the linear regression equation is y = mx + c, where m is the slope and c is
the intercept.
o Calculate the means of x (experience) and y (performance): x̄ = (1 + 3 + 5 + 6 + 8)/5 = 4.6 and ȳ = (50 + 67 + 78 + 82 + 90)/5 = 73.4.
o Compute the slope m = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)² ≈ 5.51 and the intercept c = ȳ − m·x̄ ≈ 48.1.
o The predicted score for 7 years of experience is therefore y ≈ 5.51 · 7 + 48.1 ≈ 86.6.
import numpy as np
from scipy import stats

# Provided data
experience = np.array([1, 3, 5, 6, 8])
performance = np.array([50, 67, 78, 82, 90])

# Calculate linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(experience, performance)

# Function to predict performance based on years of experience
predict_performance = lambda x: slope * x + intercept

performance_predict = predict_performance(7)
performance_predict
In this code, we primarily used the numpy library for handling numerical data and scipy.stats for the regression.
1. The first part of the code defines two numpy arrays that represent the years of
experience and the performance scores of the employees.
2. We use scipy.stats.linregress, a convenient method to calculate linear regression
between two data series. This method returns several statistically relevant values such
as the slope (slope), the intercept (intercept), and the correlation coefficient (r value),
among others.
3. slope represents the slope, and the intercept represents the intercept, which are used
to create the prediction function y = mx + c in the form of a lambda function to simplify
the calculation of the prediction.
4. Finally, the code calculates and prints the estimated performance score for an
employee with 7 years of experience using the previously defined prediction function.
3.2 Exponential Regression
The model has the form y = a · e^(βx), where a is the initial value (it must satisfy a > 0) and β is the growth rate. Taking the natural logarithm of both sides gives ln(y) = ln(a) + βx, where ln is the natural logarithm. In this way, linear regression can be applied to the transformed data, estimating ln a and β using the least squares method.
Once these values are obtained, a is calculated as:

a = e^(ln a)
Exponential regression is commonly used in fields such as population growth, the spread of
diseases, financial analysis, and radioactive decay.
Exercise 50. Growth of a Tech Startup A technology startup is analyzing its monthly
growth in active users. In the first month, the startup had 100 active users. In the following
five months, the number of active users grew to 150, 225, 337, 506, and 759 respectively.
The CEO wants to establish a model that represents the growth rate of the user base over
time and plans to use this model to project the number of users in the coming year. What is
the most suitable mathematical model to describe the user growth? Calculate the
projections for the seventh month and discuss whether the growth rate will remain
sustainable in the long term.
Solution
The growth of users follows an exponential pattern typical in the context of startups with
viral or rapid initial growth experiences. In this case, we are using exponential regression to
model the data because the number of users seems to be growing at increasing rates over
time.
To establish the model, we seek an equation of the type Users(t) = a · e^(b·t), where a
represents the initial number of users and b represents the growth rate. The given data
indicates that there are 100 users in the first month, and we aim to find a good exponential
fit.
We use a logarithmic transformation on the data to linearize the relationship and then
apply linear regression on the logarithmic values to determine b. From the straight line
log(y) = log(a) + b • t, we can then solve for a and b.
Once the parameters a and b are estimated, we can make a projection for the seventh
month. Assuming a ≈ 67 and that b has been determined as, for example, b = 0.4, we have:

Users(7) = 67 · e^(0.4 · 7) ≈ 1101
This implies very rapid growth. However, as the user base increases, practical limits such
as market saturation or infrastructure costs may arise. It is important to continuously
monitor the growth rate, adapt strategies, and consider geographical expansion or product
innovations to sustain such rates. In summary, exponential regression offers a powerful
tool to model and predict the initial growth of startups, but prudence is necessary when
projecting the future.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# User data per month
months = np.array([1, 2, 3, 4, 5, 6])              # months
users = np.array([100, 150, 225, 337, 506, 759])   # active users

# Define the exponential function: Users(t) = a * exp(b * t)
def exponential_model(t, a, b):
    return a * np.exp(b * t)

# Plot the fitted model for the known data
extended_months = np.arange(1, 13, 1)   # including projection up to 12 months
predicted_users = exponential_model(extended_months, a, b)

plt.scatter(months, users, color='red', label='Real Data')
plt.plot(extended_months, predicted_users, label='Exponential Model')
1. We use numpy for numerical operations, scipy for fitting and matplotlib for data
visualization in the form of a graph.
2. We provide two arrays, months and users, representing the months and the number of
active users, respectively.
3. The function exponential_model represents the equation we want to fit to the data:
Users(t) = a · e^(b·t).
4. We use curve_fit to fit the exponential function to our data, providing an initial estimate
of the parameters p0. This function optimizes a and b for the best data fit.
5. The values for a and b are obtained from the optimized parameters.
6. We use the refined parameters to calculate the expected number of users for the
seventh month, with possible extended projections up to a year.
7. We use matplotlib to plot the graph representing the actual data and the curve fitted by
the model. This helps visualize the adequacy of the exponential growth model to the
observed data.
This approach allows for simple modeling of user growth and providing projections, aware
of the variables not considered like market limits or other external constraints.
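A complete version of the fitting step might look like this; the initial guess p0=[100, 0.4] is an assumption that follows the discussion above. Because the fit estimates b ≈ ln 1.5 ≈ 0.405 rather than the rounded 0.4 used in the text, the month-7 projection comes out slightly above the ≈ 1101 quoted earlier.

import numpy as np
from scipy.optimize import curve_fit

months = np.array([1, 2, 3, 4, 5, 6])
users = np.array([100, 150, 225, 337, 506, 759])

# Exponential growth model: Users(t) = a * exp(b * t)
def exponential_model(t, a, b):
    return a * np.exp(b * t)

# Fit the model; p0 is a rough initial guess for (a, b)
params, covariance = curve_fit(exponential_model, months, users, p0=[100, 0.4])
a, b = params

projection_month_7 = exponential_model(7, a, b)
print(f"a = {a:.1f}, b = {b:.3f}")
print(f"Projected users for month 7: {projection_month_7:.0f}")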
Exercise 51. Forecasting Sales of a New Mobile App A company developing mobile
applications has recently launched a new fitness app. In the eight months following the
launch, the monthly sales in thousands of units were: 200, 290, 421, 612, 889, 1291, 1874,
2716.
The development team, impressed by the growth in sales, wants to create a forecasting
model to estimate expected sales over the next three months and to plan future updates
and marketing strategies.
Solution
To model the sales growth of an application over time, we can use an exponential
regression model, since the growth rate seems to accelerate as months go by. The general
form of an exponential growth model is:
V(t) = V₀ · e^(kt)

where V₀ is the initial sales level and k is the monthly growth rate.
Using the data provided, we can perform an exponential regression to estimate the
parameters V 0 and k. After calculating these parameters, we can use the model to predict
sales in the months following the eight provided, for example to estimate sales at the tenth
month.
For the tenth month, using the newly constructed model: if we find, for example, V₀ = 200
and k = 0.3, the sales for the tenth month would be:

V(10) = 200 · e^(0.3 · 10) ≈ 4017 (thousands of units)
A sustainability analysis must consider that exponential growth is typically not sustainable
in the long term due to factors such as market saturation and increased competition. This
implies that, although the model may provide good short-term predictions, long-term
estimates should be treated with caution. It is also important to constantly monitor market
conditions and adapt the model if anomalies are observed in growth data.
# Sales data
months = np.array([1, 2, 3, 4, 5, 6, 7, 8])
sales = np.array([200, 290, 421, 612, 889, 1291, 1874, 2716])

# Perform data fitting
params, covariance = curve_fit(exponential_growth, months, sales, p0=[200, 0.3])

predicted_sales = exponential_growth(month_to_predict, V0, k)
print(f"Predicted sales for month {month_to_predict}: {predicted_sales:.0f}")
1. The monthly sales data is stored in two numpy arrays, months and sales, which contain
the month numbers and the associated sales in thousands of units, respectively.
2. The function exponential_growth(t, V0, k) defines the exponential model, where V0 and k
are the parameters to be estimated through regression.
3. The function curve_fit from scipy.optimize is used to estimate the parameters V0 and k
of the exponential equation that best fits the observed data. The p0 parameter provides
an initial guess for these parameters, improving the convergence of the fitting
algorithm.
4. After fitting, the estimated values for V0 and k are printed.
5. Finally, the exponential growth function is used to estimate sales in the tenth month
using the calculated parameters, giving an idea of expected growth.
This model allows for estimating future sales growth but requires monitoring its validity
over the long term to account for potential market saturation phenomena.
Year   Revenue (millions of euros)
2018   10
2019   11.5
2020   13.2
2021   15.1
2022   17.25
The management wants to forecast the revenue for the years 2023, 2024, and 2025. Using
the available data, determine a possible growth model using an extrapolation method
based on an appropriate regression model and calculate projections for the coming years.
Solution
For this exercise, we assume that the growth model for TechGrowth Inc. follows an
exponential trend. This type of model is common for companies in technologically
innovative sectors where growth can quickly accelerate due to increased demand or
significant technological advances.
The model has the form y = a · bˣ, where:
• y is the value of the dependent variable (in this case, the revenue);
• a is the initial revenue,
• b is the base of the exponential growth, indicating what the annual growth rate will be;
• x is the time measured in years.
1. Estimate the annual growth factor b from the year-over-year ratios (11.5/10 = 1.150, 13.2/11.5 ≈ 1.148, 15.1/13.2 ≈ 1.144, 17.25/15.1 ≈ 1.142), giving b ≈ 1.15; the fitted value is closer to 1.146. The initial revenue gives a ≈ 10.
2. The revenue growth equation is therefore y = 10 · 1.146ˣ, with x counted in years from 2018.
3. Calculate the forecasted revenues for 2023, 2024, and 2025:
o For 2023: y = 10 · 1.146⁵ ≈ 19.79 million;
o For 2024: y = 10 · 1.146⁶ ≈ 22.67 million;
o For 2025: y = 10 · 1.146⁷ ≈ 25.97 million.
This exploration approach using exponential regression provides insights into how the
company might grow if the current expansion rate is maintained. Such projection assists
strategic management in planning resources, investments, and market strategies for the
future.
# Historical revenue data
past_years = np.array([2018, 2019, 2020, 2021, 2022])
past_revenue = np.array([10, 11.5, 13.2, 15.1, 17.25])

# Calculate the parameters a and b using curve_fit (x counted from 2018)
popt, pcov = curve_fit(exponential_model, past_years - 2018, past_revenue)
a, b = popt

# Print results
for year, revenue in zip(future_years, predicted_revenue):
    print(f"Predicted revenue for {year}: euro {revenue:.2f} million")
1. The function exponential_model is defined, representing the growth model y = a · bˣ. The
model is based on two parameters to be estimated: a, representing the initial revenue,
and b, the exponential growth rate.
2. Using curve fit, the best-fit values of a and b are calculated to fit the exponential
model to the historical data. The time x is calculated as the difference from the first
year (2018) to maintain manageable numbers.
3. Using the estimated values of a and b, the predicted revenue is computed using the
exponential model function.
4. The predicted revenue for each future year is printed, showing annual projections
rounded to two decimal places.
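For reference, a complete runnable version of the script described above might look like this; exponential_model and the year offset are reconstructed from the explanation, and the initial guess p0 is an assumption.

import numpy as np
from scipy.optimize import curve_fit

# Exponential growth model: y = a * b**x, with x counted in years from 2018
def exponential_model(x, a, b):
    return a * b ** x

past_years = np.array([2018, 2019, 2020, 2021, 2022])
past_revenue = np.array([10, 11.5, 13.2, 15.1, 17.25])

# Fit a and b on the offset years
popt, pcov = curve_fit(exponential_model, past_years - 2018, past_revenue, p0=[10, 1.15])
a, b = popt

future_years = np.array([2023, 2024, 2025])
predicted_revenue = exponential_model(future_years - 2018, a, b)

for year, revenue in zip(future_years, predicted_revenue):
    print(f"Predicted revenue for {year}: euro {revenue:.2f} million")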
Exercise 53. Projection of Solar Energy Demand for a Company The company
SolarWave Energy specializes in providing solar panels for private homes. Over the past 6
years, it has observed a trend in the number of annual installations that provides a solid
basis for planning the future. The installation data (number of units) is as follows:
Year   Installations
2016   150
2017   180
2018   220
2019   270
2020   330
2021   400
The management wants to estimate the number of projected installations for the years
2022, 2023, and 2024 to plan the production and procurement of necessary materials. Use
the available data to construct an appropriate projection model and calculate future
estimates.
Solution
To address the problem of projecting the demand for solar panels for SolarWave Energy, an
exponential regression model is used, given the nature of the data, which indicates a
growth rate that increases more than proportionally. This choice allows capturing potential
exponential growth in the number of installations.
The model has the form y = a · bˣ, that is, log(y) = log(a) + x · log(b), where a is the initial number of installations and b is the annual growth factor.
Using the series of historical data, we perform a linear fit to calculate log(a) and log(b),
from which we can derive a and b.
Once the parameter values are obtained, we calculate the projections for the subsequent
years (2022, 2023, 2024) by substituting the respective x values into the exponential
formula.
This process allows SolarWave Energy to estimate future market demand with reasonable
accuracy, thus adopting the necessary operational decisions to align production capacity
and inventory management.
# Historical data
years = np.array([2016, 2017, 2018, 2019, 2020, 2021])
installations = np.array([150, 180, 220, 270, 330, 400])

# Fit the model to the historical data
params, _ = curve_fit(exponential_model, x_data, installations)
a, b = params

years_to_predict = np.array([2022, 2023, 2024])
x_to_predict = years_to_predict - base_year

# Results
for year, prediction in zip(years_to_predict, predictions):
    print(f"Prediction for the year {year}: {int(prediction)}")
1. The curve fit function from the scipy.optimize module is central to estimating the
parameters of our exponential model, curve fit searches for the best values for the
model parameters defined in the exponential model function, fitting them to the provided
data.
2. We transform the years into a base-zero format to facilitate model fitting. This means
the first year becomes 'O', the second '1', and so on.
3. We define a function exponential_model that represents the general formula of
exponential growth: y = a · bˣ.
4. We use curve fit to determine the parameters a and b that best fit our model to the
historical data.
5. After identifying the parameters, we proceed to calculate the forecasts for the
subsequent years 2022, 2023, and 2024.
6. Finally, we print the calculated forecast results, providing SolarWave Energy with an
estimate of future installations.
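A complete runnable sketch of the projection, with the base-year transformation described above, might look like this; the initial guess p0 is an assumption.

import numpy as np
from scipy.optimize import curve_fit

# Historical data
years = np.array([2016, 2017, 2018, 2019, 2020, 2021])
installations = np.array([150, 180, 220, 270, 330, 400])

# Work with years counted from the first one, so x = 0, 1, 2, ...
base_year = years[0]
x_data = years - base_year

# Exponential growth model: y = a * b**x
def exponential_model(x, a, b):
    return a * b ** x

params, _ = curve_fit(exponential_model, x_data, installations, p0=[150, 1.2])
a, b = params

years_to_predict = np.array([2022, 2023, 2024])
x_to_predict = years_to_predict - base_year
predictions = exponential_model(x_to_predict, a, b)

for year, prediction in zip(years_to_predict, predictions):
    print(f"Prediction for the year {year}: {int(prediction)}")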
P(A|B) = P(A ∩ B) / P(B),  with P(B) > 0.

This formula expresses the fraction of the probability of B that is shared with A. If P(A|B) = P(A), then events A and B are independent, meaning that the knowledge of B does not
affect the probability of A.
Management wants to know what is the probability that a sale supported by a social media
advertising campaign was correctly forecasted.
Solution
To solve this problem, we use the concept of conditional probability and Bayes' theorem.
Bayes' theorem is defined as:
P(C|S) = P(S|C) · P(C) / P(S)
Where:
• P(C|S) is the probability that a forecast is correct given that it is supported by a social
media campaign.
• P(S|C) is the probability that a sale happens through a social media campaign given
that the forecast was correct (20% or 0.2).
• P(C) is the probability that a forecast is correct (70% or 0.7).
• P(S) is the probability that a sale happens through a social media campaign (30% or
0.3).
# Data provided in the problem
# Probability that a sale happens via social given the forecast is correct
P_S_given_C = 0.2

print('The probability that a sale supported by social was correctly forecasted is approximately:', P_C_given_S)
The code uses Bayes' theorem with the probabilities provided in the problem.
• Variables:
o P_S_given_C, P_C, and P_S represent the probabilities as described in the problem.
• Calculation using Bayes' theorem:
o We use Bayes' theorem to calculate P_C_given_S, the probability that a forecast was
correct given that the sale was supported by a social campaign.
In summary, the core of the problem lies in understanding and applying Bayes’ theorem.
Management wants to determine the probability that a visitor who made a purchase saw
the retargeting ads.
Solution
In the given business context, we want to calculate the probability P(V |A), which is the
probability that a visitor who made a purchase saw the retargeting ads.
P(V|A) = P(A|V) · P(V) / P(A)
Where:
• P(A|V) = 0.25 (probability that a visitor makes a purchase after seeing an ad).
• P(V) = 0.60 (probability that a visitor sees the retargeting ads).
• P(A) = 0.20 (overall probability that a visitor makes a purchase).
Substituting the values into the formula: P(V|A) = (0.25 · 0.60) / 0.20 = 0.75
Therefore, the probability that a visitor who made a purchase saw the retargeting ads is
75%. This means that the retargeting campaign was quite effective, as a large majority of
buyers saw the retargeting ads before making a purchase.
# Probabilities from the problem
P_A_given_V = 0.25  # purchase given that the visitor saw the retargeting ads
P_V = 0.60          # visitor sees the retargeting ads
P_A = 0.20          # overall purchase probability

# Bayes' theorem: P(V|A) = P(A|V) * P(V) / P(A)
P_V_given_A = (P_A_given_V * P_V) / P_A
print(P_V_given_A)
In this exercise, we use Bayes' theorem to calculate the probability P(V|A), which represents the probability that a buyer had previously seen the retargeting ads.
We then calculate P(V|A) by substituting the values into Bayes' theorem. The calculated
value P_V_given_A allows us to deduce the effectiveness of the retargeting campaign. A high
value, in this case 0.75 or 75%, indicates that a large majority of buyers saw the
retargeting ads before making the purchase, suggesting that the campaign was quite
effective.
Exercise 56. Corporate Risk Analysis in the Insurance Sector An insurance company
wants to analyze the risk of claims for customers who have auto insurance. After analyzing
historical data, the company gathers the following information:
The company wants to know the probability that a customer who has had an accident
belongs to the high-risk category.
Solution
In the context of this problem, we are examining the concept of conditional probability and
using Bayes' theorem to determine the probability of interest.
To calculate the required conditional probability P(H|S), which is the probability that a
customer is high-risk given that they have had an accident, we use Bayes' theorem:
P(H|S) = P(S|H) · P(H) / P(S)
Where:
• P(S|H) = 0.30 is the probability of having an accident given that the customer is high-
risk,
• P(H) = 0.05 is the probability of being a high-risk customer,
• P(S) = 0.10 is the probability of having an accident.
Substituting the values: P(H|S) = (0.30 · 0.05) / 0.10 = 0.15, that is, 15%.
This result provides useful insight for the company regarding customer segmentation and
profiling, suggesting that, while only a small portion of customers belong to the high-risk
category, a considerable proportion of claims come from this group.
# Problem data
P_S_given_H = 0.30  # accident given high-risk customer
P_H = 0.05          # high-risk customer
P_S = 0.10          # accident

# Bayes' theorem: P(H|S) = P(S|H) * P(H) / P(S)
P_H_given_S = (P_S_given_H * P_H) / P_S

# Result
print(P_H_given_S)
In the above code, we use Bayes’ theorem to calculate the conditional probability that a
customer belongs to the high-risk category given that they have had an accident. Bayes’
theorem is a fundamental part of statistics and allows us to update probabilities in light of
new evidence or data.
Details:
The formula is directly implemented to obtain P_H_given_S, which represents the conditional
probability we are seeking.
Exercise 57. Production Analysis of a Manufacturing Plant A factory produces
electronic components, including printed circuits, and wants to analyze the production
process. The management has decided to focus on the product quality associated with a
particular supplier of materials.
The management aims to determine the probability that a component produced with
materials from the supplier is defective.
Solution
To solve this problem, we apply the conditional probability formula to determine the
probability that a component produced with materials from the supplier is defective, P(D|F):
P(D|F) = P(F|D) · P(D) / P(F)
Inserting the values:
P(D|F) = (0.20 · 0.12) / 0.08 = 0.30
Therefore, the probability that a component produced with materials from the supplier is
defective is 30%.
Using the supplier's materials seems to be associated with a higher probability of defects
compared to the overall plant average (30% versus 12%). This indicates that sourcing from
this supplier might increase the risk of defects in the components, suggesting the need for
a review of the materials or processes used by this supplier to reduce defects.
# Probability that a component is defective
P_D = 0.12
# Probability that a component is produced with materials from the supplier F
P_F = 0.08
# Probability that a defective component is produced with materials from the supplier F
P_F_given_D = 0.20

# Calculate the conditional probability P(D|F) using Bayes' theorem
P_D_given_F = (P_F_given_D * P_D) / P_F

print(f"The probability that a component produced with materials from the supplier is defective is: {P_D_given_F:.2f}")
In the code:
1. The known probabilities are set as variables:
o P_D is the probability that a component is defective (0.12 or 12%).
o P_F is the probability that a component is produced using materials from supplier F
(0.08 or 8%).
o P_F_given_D is the probability that a defective component is produced with materials
from supplier F (0.20 or 20%).
2. We use Bayes' theorem to calculate the conditional probability P_D_given_F, which is the
probability that a component produced with materials from the supplier is defective:
P(D|F) = P(F|D) · P(D) / P(F)
3. Finally, the calculated probability is printed in a format that shows it with two decimal
places.
The management intends to determine the probability that a customer who made a
purchase over 100 euros is enrolled in the loyalty program.
Definition of events:
• 'I' represents the event 'the customer is enrolled in the loyalty program'.
• 'A' represents the event 'the customer makes a purchase over 100 euros'.
Calculate this probability and analyze what this data indicates about the effectiveness of
the program.
Solution
To solve this problem, we use the concept of conditional probability and Bayes’ theorem.
We have:
• P(I) = 0.40 (probability that a customer is enrolled in the loyalty program),
• P(A|I) = 0.25 (probability that an enrolled customer makes a purchase over 100 euros),
• P(A) = 0.15 (probability that a customer makes a purchase over 100 euros).
We want to calculate P(I|A), the probability that a customer who made a purchase over 100
euros is enrolled in the program:
P(I|A) = P(A|I) · P(I) / P(A) = (0.25 · 0.40) / 0.15 ≈ 0.6667
Therefore, the probability that a customer who made a purchase over 100 euros is enrolled
in the loyalty program is approximately 66.67%.
This analysis shows that the loyalty program is effective in generating high-value
purchases, as a significant proportion of such purchases are made by customers enrolled in
the program.
def calculate_probability():
    # Definition of initial probabilities
    # Probability that a customer is enrolled in the loyalty program
    P_I = 0.40
    # Probability that an enrolled customer makes a purchase over 100 euros
    P_A_given_I = 0.25
    # Probability that a customer makes a purchase over 100 euros
    P_A = 0.15
    # Calculate the conditional probability using Bayes' theorem
    P_I_given_A = (P_A_given_I * P_I) / P_A
    return P_I_given_A

probability = calculate_probability()
print(f"The probability that a customer who made a purchase over 100 euros is enrolled in the loyalty program is approximately: {probability:.4f}")
This code highlights the effectiveness of the loyalty program in generating significant
purchases among enrolled customers, as a good portion of the purchases over 100 euros
are made by loyal customers.
Chapter 5
Probability Distributions
In this chapter, we will present some practical exercises on
the most common probability distributions, with a focus on
their application to concrete scenarios. The goal is to place
these distributions in realistic business contexts, providing
the reader with useful tools for analysis and solving practical
problems.
5.1 Binomial Distribution
The binomial distribution models the number of successes in n independent trials, each with success probability p:
P(X = k) = C(n, k) · p^k · (1 - p)^(n-k),   k = 0, 1, ..., n
where C(n, k) = n! / (k!(n - k)!) is the binomial coefficient, which counts the number of
ways to obtain k successes out of n trials.
• Mean: E[X] = np
• Variance: Var(X) = np(1 - p)
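As a quick, self-contained illustration of these formulas (the values n = 10 and p = 0.3 below are arbitrary and not tied to any exercise), scipy.stats.binom can evaluate the probability mass function and confirm the mean and variance:

from scipy.stats import binom

n, p = 10, 0.3  # illustrative values only

# P(X = 3) for a Binomial(n, p) variable
print(binom.pmf(3, n, p))

# Mean and variance, which match n*p and n*p*(1 - p)
print(binom.mean(n, p), n * p)
print(binom.var(n, p), n * p * (1 - p))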
Solution
To solve this problem, we can model the number of purchases made
as a discrete random variable, specifically using a binomial
distribution. Given N total visitors and p as the probability of purchase
for each visitor, the probability of having n successful events
(purchasing users) is defined by the binomial distribution:
P(X = n) = C(N, n) · p^n · (1 - p)^(N-n)
1. Initially calculate P(X ≤ 200) using the cumulative property of the
binomial distribution: this can be done using dedicated functions
in statistical software or binomial tables (cf. subsequent
solutions).
2. Then, use the complement of the cumulative probability: P(X >
200) = 1 - P(X ≤ 200).
3. With the parameters of the problem, the probability of having
fewer than 200 purchases is very high, so the complementary
probability of having more than 200 purchases will be very low.
from scipy.stats import binom

# Problem parameters
n = 1000   # Number of visitors
p = 0.15   # Purchase probability per visitor

# P(X > 200) = 1 - P(X <= 200)
p_greater_200 = 1 - binom.cdf(200, n, p)

# Result
print(p_greater_200)
In this code, we use the binomial distribution to calculate the
probability that more than 200 out of 1000 visitors make a purchase.
We use the binom.cdf function from the scipy.stats library to calculate
the cumulative probability up to 200 purchases. The cdf method
(which stands for Cumulative Distribution Function) gives us the
probability that the random variable X (number of purchases) takes
values less than or equal to 200.
Solution
By calculating the values, we obtain P(X < 10) = P(X ≤ 9). Knowing its amount,
we determine the complementary probability P(X ≥ 10), thus solving the problem of fraud analysis in
the reimbursements submitted that month.
In the code, this complementary probability is stored in the variable prob_at_least_10.
In this code, we use the scipy library, specifically the binom module,
which manages the binomial distribution. This allows us to calculate
the cumulative probability related to discrete events in a series of
trials.
Parameters: binom.cdf(k, n, p) takes the threshold value k, the number of trials n, and the success probability p of a single trial; a sketch of the full calculation is shown below.
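Since the numerical data of this exercise are not reproduced here, the following is only a sketch of the calculation, with placeholder values for the number of reimbursements and the fraud probability (n = 200 and p = 0.03 are assumptions, not the exercise's figures):

from scipy.stats import binom

# Placeholder parameters (assumed for illustration)
n = 200    # number of reimbursements examined
p = 0.03   # probability that a single reimbursement is fraudulent

# P(X >= 10) = 1 - P(X <= 9)
prob_at_least_10 = 1 - binom.cdf(9, n, p)
print(prob_at_least_10)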
5.2 Poisson Distribution
The Poisson distribution models the number of events that occur in a given interval of time
or space, assuming that the events are independent and occur at a constant average rate
λ (lambda). The probability function of having k events is:
P(X = k) = λ^k · e^(-λ) / k!,   k = 0, 1, 2, ...
The main properties are:
• Mean: E[X] = λ
• Variance: Var(X) = λ
You want to calculate the probability that more than 30 customers will enter a particular
retail outlet within an hour.
Solution
The described situation can be represented with a Poisson distribution, which is suitable for
modeling the number of events occurring in a fixed interval of time, given a certain
average rate of occurrence and the independence between events.
To calculate the probability that more than 30 customers enter in an hour, we first calculate
the probability that exactly 0, 1, ..., 30 customers enter, and then subtract the sum of
these probabilities from 1.
P(X = k) = λ^k · e^(-λ) / k!
where λ is the average rate, i.e., 25 customers per hour, and k is the number of customers.
We then calculate:
P(X > 30) = 1 - Σ_{k=0}^{30} 25^k · e^(-25) / k!
Using a calculation software or a scientific calculator, we find:
P(X > 30) = 0.137
from scipy.stats import poisson

# Parameters
lambda_value = 25  # average number of customers per hour
threshold = 30

# Calculate cumulative probability P(X <= 30)
probability_less_than_equal_30 = poisson.cdf(threshold, lambda_value)

# P(X > 30) = 1 - P(X <= 30)
probability_greater_than_30 = 1 - probability_less_than_equal_30

# Print the probability
print(f"Probability that more than 30 customers enter in an hour: {probability_greater_than_30:.3f}")
In our case, we want to calculate P(X > 30), the probability that more than 30 customers
enter the store in an hour. This is 1 minus the probability that at most 30 customers enter,
or P(X ≤ 30).
1. Define the parameters: lambda_value representing the average customer arrival rate
(25) and threshold as the maximum customer threshold (30).
2. Calculate the cumulative probability: We use the SciPy function poisson.cdf(threshold,
lambda_value) to find the cumulative probability up to 30 customers.
3. Find the desired probability: The sought probability is P(X > 30) = 1 - P(X ≤ 30).
4. Display the result: We use formatted output to show the calculated probability with
three decimal places, which is about 0.137, i.e., 13.7%.
This procedure efficiently expresses the problem’s solution using a programming language
and the use of an advanced statistical library.
Calculate the probability that the number of fragile orders received in one day exceeds the
manageable number of 5 orders.
Solution
To solve this problem, the Poisson distribution can be used, which models the probability of
a given number of events occurring in a fixed interval of time when the events are
independent and occur with a known average rate.
The average arrival rate of fragile orders is λ = 3 orders per day. We need to find the
probability that more than 5 orders are placed in one day.
The probability of observing k events in a time interval with a mean of λ is given by the
Poisson distribution formula:
P(X = k) = λ^k · e^(-λ) / k!
Where e is the base of the natural logarithm, approximately equal to 2.71828.
We calculate the probability of receiving at most 5 orders, P(X ≤ 5), and subtract it from 1
to get the complementary probability:
P(X > 5) = 1 - Σ_{k=0}^{5} 3^k · e^(-3) / k!
Carrying out the calculations, we find:
P(X ≤ 5) ≈ 0.9161
Thus,
P(X > 5) = 1 - 0.9161 = 0.0839
The probability that the number of fragile orders received in one day is more than 5 is
approximately 0.0839, or 8.39%. This value indicates that the risk of daily overload is
relatively low but not negligible, suggesting that warehouse management should anticipate
potential peak periods.
from scipy.stats import poisson

# Average number of fragile product orders per day
lambd = 3

# Cumulative probability of receiving at most 5 orders, P(X <= 5)
prob_X_leq_5 = poisson.cdf(5, lambd)

# The probability that more than 5 orders are placed in one day is P(X > 5)
prob_X_gt_5 = 1 - prob_X_leq_5

print(prob_X_gt_5)
In this code, we use the scipy library, specifically the stats module, which offers a wide
variety of statistical distributions, including the Poisson distribution. The function
poisson.cdf calculates the cumulative distribution function of the Poisson distribution, which
is the probability of obtaining a value less than or equal to a certain number, given a mean
value (lambda). In our case, lambda is equal to 3, representing the average number of
fragile product orders per day.
We calculate the cumulative probability up to 5 orders using poisson.cdf (5, lambd), and then
obtain the complementary probability of having more than 5 orders by subtracting this
value from 1.
The scipy.stats.poisson is particularly useful for calculating the Poisson distribution without
needing to manually write the mathematical formulas, reducing the risk of errors.
This code is a basic example of how to simplify probability calculations for events
distributed over time that are rare and independent.
5.3 Exponential Distribution
The exponential distribution models the time between two successive events in a Poisson
process, where events occur independently at a constant average rate λ. The probability
density function (PDF) is:
f(x) = λ · e^(-λx),   x ≥ 0, λ > 0.
• Mean: E[X] = 1/λ
• Variance: Var(X) = 1/λ²
• Modeling waiting times in service systems (e.g., time between customer arrivals).
• Reliability analysis (lifetime before failure in electronic devices).
• Finance (time between significant changes in stock prices).
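As a small, self-contained check of these properties (the rate λ = 0.5 used below is arbitrary), scipy.stats.expon can be used with scale equal to 1/λ:

from scipy.stats import expon

lam = 0.5            # illustrative rate (events per unit time)
scale = 1 / lam      # scipy parameterizes the exponential by its mean

# Mean and variance should equal 1/lambda and 1/lambda**2
print(expon.mean(scale=scale), 1 / lam)
print(expon.var(scale=scale), 1 / lam**2)

# P(X < 3), i.e. the CDF 1 - exp(-lambda * x) evaluated at x = 3
print(expon.cdf(3, scale=scale))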
Exercise 63. Managing Customer Flow in a Retail Store An electronics store has
observed that the average waiting time between the arrival of one customer and the next
is 5 minutes. The store manager wants to improve staff efficiency and optimize customer
service. Calculate the probability that, in any given period, two customers will arrive less
than 10 minutes apart.
Solution
To solve this problem, we use the concept of the exponential distribution, which is suitable
for modeling the time between independent events occurring at a constant average rate.
In our case, the average time between the arrival of two customers is 5 minutes. We want
to find the probability that the time X between two arrivals is less than 10 minutes. The
cumulative distribution function (CDF) for an exponentially distributed random variable is
given by:
P(X < t) = 1 - e^(-λt)
In this context, λ = 1/5 for the exponential distribution, because the inverse of the average
of 5 minutes determines the parameter of the exponential distribution. We want to
calculate the probability that the time between two customers is less than 10 minutes.
Therefore:
P(X < 10) = 1 - e^(-(1/5)·10) = 1 - e^(-2) ≈ 1 - 0.1353 = 0.8647
Thus, the probability that two customers arrive less than 10 minutes apart is approximately
86.47%.
The problem involves the exponential distribution, which models the time between events
in a continuous Poisson process. In this context, we are modeling the time between the
arrival of two customers in the store.
Using scipy, the calculation and manipulation of statistical distributions become simple and
efficient, as the library provides ready-to-use functions for many common statistical
distributions.
Solution
In this context, the response time of the operators follows an exponential distribution. The
main characteristic of this distribution is that it describes the time between events that
occur with a certain frequency (in our case, the average response time of 2 minutes). The
cumulative distribution function (CDF), which expresses the probability that an exponential
random variable is less than a certain value, can be represented as F(t) = 1 - e^(-λt), where λ
is the inverse of the average response time (2 minutes in our exercise). Therefore, the
probability of responding after 3 minutes is the complement of the CDF for t = 3.
The probability sought is P(T > 3) = 1 - F(3) = e^(-3/2) = e^(-1.5) ≈ 0.2231. Thus, the probability that
a customer has to wait more than 3 minutes is approximately 22.31%. The result indicates
that almost one in four customers might hang up due to excessive waiting. The company
might consider increasing staff or optimizing processes to reduce wait times, improving the
customer experience.
import numpy as np

# Average response time
lambda_ = 1.0 / 2  # 2 minutes

wait_time = 3  # minutes

def probability_exceeds_wait(t, lambda_):
    # Calculate the complement of the CDF to get P(T > t)
    return np.exp(-lambda_ * t)

# Calculate the probability that the wait exceeds 3 minutes
probability = probability_exceeds_wait(wait_time, lambda_)

print(f"The probability that a customer waits more than {wait_time} minutes is about {probability:.4f} (i.e., {probability * 100:.2f}%)")
• numpy library: It is mainly used here to compute the exponential function e^(-1.5). The
imported function np.exp() calculates the exponential of a given input number, which is
necessary for evaluating the exponential distribution.
• lambda_ represents the inverse of the average response time, which is 2 minutes in this
case. wait_time is set to 3 minutes, the time after which the customer tends to hang up.
• The function probability_exceeds_wait(t, lambda_) computes the complementary
probability of the CDF (Cumulative Distribution Function) using the formula P(T > t) = e^(-λt),
which represents the probability that a customer waits more than t minutes.
• Finally, a print statement formats and displays the calculated probability as a well-
readable percentage.
5.4 Uniform Distribution
The uniform distribution describes a random variable that has the same probability of
taking any value within an interval [a, b]. There are two main types:
• Discrete uniform: each discrete value in a finite set has an equal probability.
• Continuous uniform: the probability density is constant over an interval [a, b].
For a continuous uniform distribution, the probability density function (PDF) is:
f(x) = 1 / (b - a),   a ≤ x ≤ b
• Mean: E[X] = (a + b) / 2
• Variance: Var(X) = (b - a)² / 12
Here are some possible uses: modeling quantities that are equally likely to fall anywhere within a known range, such as the daily sales volumes and delivery times examined in the exercises below.
Exercise 65. Sales Analysis in a Clothing Store A clothing store receives daily supplies
of a particular item of clothing. The supply manager, to better plan orders, has determined
that the number of items sold daily ranges from a minimum of 10 to a maximum of 50
items. With this in mind, calculate the average number of items sold per day. Assume that
sales are equally likely in this range.
Solution
In this exercise, we are using the concept of a continuous uniform distribution. When we
have a uniform distribution, the mean (or expected value) of a random variable X that has
minimum a and maximum b is given by the formula: μ = (a + b) / 2.
In our problem, a = 10 and b = 50. Applying the formula for the mean of the uniform
distribution, we get:
μ = (10 + 50) / 2 = 30
Therefore, the average number of items sold daily in the store is 30.
from scipy.stats import uniform

a = 10  # minimum number of items sold
b = 50  # maximum number of items sold

# Calculate the mean of the uniform distribution
average_items_sold = uniform.mean(loc=a, scale=(b - a))

print(f"The average number of items sold daily is: {average_items_sold}")
In this Python code, we calculate the average as follows:
• We use scipy.stats, a Python library for statistics that offers useful functions for working
with distributions.
• We define a as the minimum number of items sold and b as the maximum number of
items sold, which are 10 and 50, respectively.
• We use uniform.mean(loc=a, scale=(b-a)) to calculate the mean of the uniform
distribution. Here, loc represents the parameter a, while scale represents the length of
the interval (b - a).
• Finally, we print the result, which should return the value 30, which is the mean of the
specified range, so the average number of items sold daily in the uniform distribution is
30.
Solution
To calculate the probability that the delivery time is less than 4 days, we use a = 2 days
and b = 5 days. Therefore:
P(X < 4) = (4 - 2) / (5 - 2) = 2/3 ≈ 0.6667
Thus, there is a 66.67% probability that the delivery time is less than 4 days.
For the average delivery time, we use the formula for the expected value of a uniform
distribution:
E[X] = (a + b) / 2 = (2 + 5) / 2 = 3.5 days
These calculations suggest that the average delivery time is 3.5 days. With this
information, the logistics manager can decide to maintain sufficient stock to handle
occasional delays by setting the safety stock level to cover at least those rare cases where
the delivery time reaches the maximum limit of 5 days. Using the uniform distribution
allows for estimating these waiting times directly and easily implementing them in
management strategies.
Python Solution
from scipy.stats import uniform

min_days = 2
max_days = 5

# Calculate probability that delivery time is less than 4 days
probability_less_than_4 = uniform.cdf(4, loc=min_days, scale=max_days - min_days)

# Calculate average delivery time
average_delivery_time = uniform.mean(loc=min_days, scale=max_days - min_days)

# Output results
print(f"Probability that delivery time is less than 4 days: {probability_less_than_4:.4f}")
print(f"Average delivery time: {average_delivery_time} days")
• We define min days and max days as the endpoints of the interval.
• The function uniform.cdf(x, loc, scale) calculates the cumulative distribution function
(CDF) for the uniform distribution. In this case, loc is the minimum value (2 days) and
scale is the difference between the maximum and minimum (5 - 2 days). cdf returns the
probability that a random variable is less than x, which in this case is 4 days.
• The function uniform.mean(loc, scale) returns the mean or expected value of the random
variable, which for a uniform distribution is the average of its endpoints.
Finally, we print the results showing the calculated probability and the average delivery
time. This information helps the logistics manager consider maintaining an adequate level
of safety stock to mitigate the impact of possible delays.
5.5 Triangular Distribution
The triangular distribution is a continuous probability distribution that derives its name
from the triangular shape of its probability density function. It is characterized by three
main parameters: the minimum, the maximum, and the mode (modal value). The
probability density function is defined over an interval [a,b], with c representing the mode,
which is the point where the probability is highest.
The probability density function for a random variable X following a triangular distribution
is defined as:
f(x) = 2(x - a) / ((b - a)(c - a))   if a ≤ x ≤ c,
f(x) = 2(b - x) / ((b - a)(b - c))   if c < x ≤ b,
f(x) = 0   otherwise.
The expected value E[X] and the variance Var(X) for a triangular distribution are given by:
E[X] = (a + b + c) / 3,   Var(X) = (a² + b² + c² - ab - ac - bc) / 18
The triangular distribution is useful in situations where the extreme values (minimum and
maximum) are known, while the mode represents a plausible estimate based on
experience or previous data. It is often used in simulation models, risk analysis, and
forecasting, particularly in project management contexts.
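As an illustration (the values a = 2, c = 5, b = 10 below are arbitrary and not taken from any exercise), scipy.stats.triang reproduces these moments once the mode is rescaled to the [0, 1] shape parameter it expects:

from scipy.stats import triang

a, c, b = 2, 5, 10            # minimum, mode, maximum (illustrative values)
loc, scale = a, b - a
c_shape = (c - a) / scale     # mode rescaled to [0, 1]

# Compare scipy's moments with the closed-form expressions
print(triang.mean(c_shape, loc=loc, scale=scale), (a + b + c) / 3)
print(triang.var(c_shape, loc=loc, scale=scale),
      (a**2 + b**2 + c**2 - a*b - a*c - b*c) / 18)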
Solution
To solve this problem, we use a triangular distribution, suitable for modeling phenomena
where we know a minimum, a maximum, and a most likely value within an interval. In our
case, the minimum (a) is 5 days, the maximum (b) is 15 days, and the most likely value (c)
is 10 days. The cumulative distribution function (CDF) for a triangular distribution can be
used to calculate the probability that the time taken exceeds a given value.
For a triangular distribution, the CDF up to a certain value x on the increasing branch (a ≤ x ≤ c) is calculated as follows:
F(x) = (x - a)² / ((b - a)(c - a))
Since we want to calculate the probability that the duration exceeds 10 days, we use:
P(X > 10) = 1 - P(X < 10)
Implementing, we obtain:
P(X ≤ 10) = (10 - 5)² / ((15 - 5)(10 - 5)) = 25/50 = 0.5
Therefore:
P(X > 10) = 1 - 0.5 = 0.5
Thus, the probability that the testing phase takes more than 10 days is 50%. This allows
the marketing team to know that there is a significant possibility that the launch might be
delayed and to prepare accordingly.
from scipy.stats import triang

# minimum
a = 5
# most likely value
c = 10
# maximum
b = 15

# Calculate the scale and position for the triangular distribution
# loc is the start of the interval
loc = a
# scale is the difference between the maximum and the minimum
scale = b - a
c_relative = (c - a) / scale

# Create the triangular distribution object
triang_dist = triang(c_relative, loc=loc, scale=scale)

# Calculate the probability that X > 10, that is 1 - P(X <= 10)
x = 10
probability_more_than_x = 1 - triang_dist.cdf(x)

# Result
print(probability_more_than_x)
In this solution, we used the scipy.stats library, which provides many ready-to-use probability distributions, including the triangular one (triang).
The triang function uses these parameters to create a triangular distribution object. Then,
using the cdf method provided by the distribution, we calculate P(X <= x) (the cumulative
probability up to 10 days). The probability that the time exceeds 10 days is 1 - P(X <= 10),
and this calculation gives us the probability that the marketing team requires for their
planning.
Exercise 68. Inventory Management and Delivery Time Estimation An online store
selling technology products is optimizing the inventory management of its warehouses. The
company wants to predict the time needed to receive a new batch of products from an
overseas supplier. Based on previous experiences, the delivery time can vary between a
minimum of 7 days and a maximum of 21 days. However, a delivery time of 14 days is
considered the most probable according to the commercial agreements with the supplier.
The logistics manager would like to know the probability that an order will take less than 12
days to arrive, in order to improve the planning of sales promotions and warehouse
management. Calculate this probability.
Solution
To solve this exercise, we use the concept of continuous triangular distribution. The
triangular distribution is defined by three parameters: the minimum value (a), the
maximum value (b), and the most probable value (c). In this case, we have a = 7, b = 21,
and c = 14.
The cumulative distribution function (CDF) for a triangular distribution is given by two
expressions, one for the increasing interval (from a to c) and one for the decreasing
interval (from c to b). To calculate the probability that the delivery times are less than 12
days, we use the part of the CDF for the increasing branch:
F(x) = (x - a)² / ((b - a)(c - a))   for a ≤ x ≤ c
Substituting x = 12, a = 7, b = 21, and c = 14:
F(12) = (12 - 7)² / ((21 - 7)(14 - 7)) = 25 / 98 ≈ 0.2551
Therefore, the probability that the delivery time is less than 12 days is approximately
25.51%.
from scipy.stats import triang

# Parameters of the triangular distribution
a = 7    # minimum value
b = 21   # maximum value
c = 14   # most probable value

# Calculate the loc and scale parameters for scipy
loc = a
scale = b - a
c_param = (c - a) / scale

# Create the triangular distribution object
triang_distribution = triang(c_param, loc=loc, scale=scale)

# Calculate the probability that the delivery time is less than 12 days
prob_less_than_12 = triang_distribution.cdf(12)

print(f"The probability that the order takes less than 12 days is approximately {prob_less_than_12 * 100:.2f}%")
In this code, we define the three parameters of the triangular distribution (a, b, c) and convert them into the loc, scale, and shape parameters expected by scipy.
We create a triang_distribution object using these parameters. The cdf function
(cumulative distribution function) of this object allows us to calculate the probability that a
random variable is less than or equal to a certain value. In this case, we calculated
triang distribution.cdf (12), which gives us the probability that the delivery time is less than
12 days. The result is then multiplied by 100 and formatted to be expressed as a
percentage.
5.6 Normal Distribution
The normal distribution, also known as the Gaussian distribution, is one of the most
common and important probability distributions in statistics. It is a continuous distribution
characterized by two main parameters: the mean (μ) and the standard deviation (σ).
Its probability density function (PDF) has the shape of a symmetric bell curve around the
mean value μ, and is defined as:
f(x) = 1 / (σ√(2π)) · exp(-(x - μ)² / (2σ²))
where:
• μ (mu) is the mean of the distribution, indicating the central point of the curve,
• σ (sigma) is the standard deviation, which measures the dispersion of the values
around the mean,
• exp denotes the exponential function.
• The normal distribution is symmetric with respect to its mean value μ, which implies
that the probability of observing a value greater than μ is the same as observing one
that is smaller.
• The normal distribution has the expected value E[X] = μ and the variance Var(X) = σ².
• The curve of the normal distribution is bell-shaped, with most values concentrated
around the mean μ.
• The probability density function tends to zero as it moves away from the mean. This
means that the probability of observing extreme values (far from the mean) is
extremely low.
• Approximately 68% of the values lie within one standard deviation from the mean
(μ ± σ), about 95% lie within two standard deviations (μ ± 2σ), and about 99.7% lie
within three standard deviations (μ ± 3σ).
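These coverage percentages can be verified numerically with scipy.stats.norm; the short sketch below uses the standard normal (μ = 0, σ = 1), which is sufficient because the rule is expressed in units of σ:

from scipy.stats import norm

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {coverage:.4f}")
# Expected output: about 0.6827, 0.9545, 0.9973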
Exercise 69. Delivery Time Analysis A logistics company has recorded the delivery
times of its packages in various cities. Delivery times are influenced by numerous factors
such as traffic, weather conditions, and distance. Management wants to estimate the
average delivery time in a specific city to optimize customer service.
A study is conducted where 45 days are randomly selected, and delivery times for each
package are recorded. The result is a dataset with an average of 50 minutes and a
standard deviation of 10 minutes.
Determine the probability that the average delivery time exceeds 52 minutes based on the
sample taken, assuming the distribution of delivery times on any given day can vary.
Solution
The problem requires determining the probability that the average delivery time, based on
a sample of 45 days, is greater than 52 minutes.
From the exercise, we are given the sample mean x = 50 minutes and the sample standard
deviation s = 10 minutes, with a sample size n = 45.
According to the central limit theorem, if n is sufficiently large (typically n > 30), the
distribution of the sample mean x̄ approaches a normal distribution centered on the
population mean μ with a standard deviation of s/√n. This allows us to calculate the
probability using the standardized statistic:
z = (x̄ - μ) / (s/√n)
Assuming that μ = 50 (the unknown population mean, estimated from the sample mean),
we have:
z = (52 - 50) / (10/√45) = 2 / 1.49 ≈ 1.34
Using the standard normal distribution table, we find that the probability P(Z > 1.34) is
approximately 0.09.
Thus, there is a 9% probability that the average delivery time exceeds 52 minutes.
import numpy as np
from scipy.stats import norm

# Problem data
mean_sample = 50          # sample mean
std_sample = 10           # sample standard deviation
n = 45                    # sample size
mean_hypothesized = 52    # value of interest

# Calculation of standard error of the sample mean
std_error = std_sample / np.sqrt(n)

# z score and probability that the sample mean exceeds 52
z_score = (mean_hypothesized - mean_sample) / std_error
probability = 1 - norm.cdf(z_score)

print(probability)
To solve this problem, we use the scipy library, particularly the stats module that provides the functions of the standard normal distribution.
We start by defining the data: the sample mean, the sample standard deviation, and the
sample size. We then assume a hypothesized mean of 52 minutes to calculate the
probability of exceeding this value.
We calculate the standard error of the sample mean (also known as the standard error)
using std sample / np.sqrt(n). This step leverages the central limit theorem property that
allows us to make the calculation considering the sample sufficiently large.
Next, we calculate the z_score, which represents how many standard deviations the value
52 (the hypothesized mean) lies above the sample mean, using the formula
(mean_hypothesized - mean_sample) / std_error.
In the end, we obtain the desired probability, which in this case returns an approximate
value of 9%. This indicates there is a 0.09 probability that the average delivery time
exceeds 52 minutes.
Exercise 70. Customer Satisfaction Analysis A restaurant chain wants to evaluate the
average customer satisfaction to improve service quality. Management decides to collect
data from satisfaction surveys filled in by customers over 60 different days. The
satisfaction score is a numerical measure ranging from 1 to 100, and the collected data
show an average of 80 with a standard deviation of 12.
The company wants to know the probability that the average satisfaction calculated over
these 60 days is less than 78, considering that satisfaction on any given day can vary
depending on several factors such as waiting time, service quality, and the menu of the
day.
Assume the validity of the normal distribution of satisfaction scores calculated over the
samples.
Solution
To solve this problem, we use a fundamental concept in statistics: the distribution of the
sample mean. In this case, we have a number of samples equal to 60, which is large
enough to apply the central limit theorem, suggesting that the distribution of the sample
means will be approximately normal, regardless of the original data distribution.
The theorem tells us that the mean of the sample means is equal to the population mean
and that the standard deviation of the sample means is the population standard deviation
divided by the square root of the sample size.
Therefore:
Mean of the sample means = 80
Standard deviation of the sample means = 12 / √60 ≈ 1.549
We calculate the z value for a sample mean score of 78:
z = (78 - 80) / 1.549 ≈ -1.29
Using a table or statistical software, we find that the probability that the sample mean is
less than 78 is approximately 0.0985, or 9.85%.
This probability represents the likelihood that the average customer satisfaction is
significantly lower than the observed average and could indicate areas for improvement on
days when satisfaction is particularly low.
import math
from scipy.stats import norm

# Problem data
population_mean = 80
population_std_dev = 12
sample_size = 60
desired_sample_mean = 78

# Calculation of the standard deviation of the sample mean
sigma_sample_mean = population_std_dev / math.sqrt(sample_size)

# Calculation of the z value
z = (desired_sample_mean - population_mean) / sigma_sample_mean

# Calculation of the probability using the standard normal distribution
probability = norm.cdf(z)

# Final result
print(f"The probability that the sample mean is less than {desired_sample_mean} is approximately {probability:.4f} ({probability * 100:.2f}%)")
• Import norm from scipy.stats to work with the standard normal distribution, and math for
basic mathematical calculations.
• Define the problem variables, such as the population mean and standard deviation, the
number of samples, and the desired sample mean value.
• Use the formula to obtain the standard deviation of the sample means, which is the
population standard deviation divided by the square root of the sample size.
• Calculate the z value that determines the position of the desired sample mean relative
to the population mean on the standard normal distribution scale.
• Using norm.cdf, we calculate the probability that the sample mean is below the desired
value. The function cdf calculates the cumulative distribution function, useful for finding
the accumulated probability up to a certain point, as in this case.
• Finally, print the calculated probability in both decimal and percentage forms.
This approach allows us to statistically verify the possibility that the average satisfaction
over a given period deviates significantly from the historical average, thanks to the power
of the normal distribution.
Chapter 6
Hypothesis Testing
In this chapter, we will analyze various examples of
statistical hypothesis testing, fundamental tools for making
data-based decisions and evaluating the significance of a
result. Each example will start with the formulation of the
null hypothesis and the alternative hypothesis, then proceed
with the calculation of the p-value or the critical value of a
test statistic. Both methods will be shown depending on the
circumstances, allowing the reader to choose the most
suitable strategy depending on the problem at hand.
6.1 Student's t-test for the Mean of a Single Sample
The one-sample Student's t-test is a statistical test used to determine if the mean of a
sample comes from a population with a specified mean. This test is particularly useful
when the sample size is relatively small (generally less than 30 values) and the population
variance is unknown. It is a test that assumes the data are normally distributed.
The one-sample Student's t-test is used in various business contexts to test hypotheses
regarding the mean of a single group. Some examples include the following cases:
• A company might want to check if the mean weight of a batch of products conforms to
a specified target value (for example, if the average weight of a package is equal to
500 grams).
• A company may want to test if the average annual earnings of a division are equal to a
target figure, such as a 10% growth.
• A company might use this test to determine if the mean of customer satisfaction scores
from a survey equals a predefined mean (for example, 8 out of 10).
To use the one-sample Student's t-test, the hypotheses are formulated as follows:
• Null hypothesis (H0): The null hypothesis posits that the sample mean is equal to the
specified population mean. Mathematically:
H0 : μ = μ0,
where μ is the population mean and μ0 is the hypothesized mean value.
• Alternative hypothesis (H1): The alternative hypothesis posits that the sample mean is
not equal to the specified population mean. It can be two-tailed:
H1 : μ ≠ μ0
or one-tailed:
H1 : μ > μ0   or   H1 : μ < μ0
The test statistic for the one-sample Student's t-test is given by:
t = (x̄ - μ0) / (s / √n)
where x̄ is the sample mean, μ0 is the hypothesized mean, s is the sample standard
deviation, and n is the sample size.
The t statistic follows a Student’s t distribution with n - 1 degrees of freedom. The p-value
of the test determines the rejection of the null hypothesis if it is less than the required
significance level (typically 5%).
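As a compact illustration of this procedure (the summary statistics below, x̄ = 505, μ0 = 500, s = 12, n = 20, are invented for the example and not taken from any exercise), the statistic and its two-tailed p-value can be computed directly:

import math
from scipy import stats

# Invented summary statistics for illustration
x_bar, mu0, s, n = 505, 500, 12, 20

t = (x_bar - mu0) / (s / math.sqrt(n))
df = n - 1

# Two-tailed p-value from the Student's t distribution
p_value = 2 * stats.t.sf(abs(t), df)
print(t, p_value)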
Solution
1. Definition of hypotheses:
o Null hypothesis (H0): The average time to find a product with the new search
engine is still 2.5 seconds (μ = 2.5).
o Alternative hypothesis (H1): The average time to find a product is different from 2.5
seconds (μ ≠ 2.5).
2. Perform the test:
The formula for the t-value is:
t = (x̄ - μ0) / (s / √n)
where:
o x̄ = 2.7 (sample mean)
o μ0 = 2.5 (theoretical mean under H0)
o s = 0.6 (sample standard deviation)
o n = 30 (sample size)
Calculating:
t = (2.7 - 2.5) / (0.6 / √30) ≈ 1.8257
3. Compare with the critical value With a significance level of 5%, we look for the critical
value for a two-tailed t-test and 29 degrees of freedom. Using a t-table, we find that
the critical t-value for 29 degrees of freedom is approximately ±2.045. Since the
calculated t-value (1.8257) falls within the interval from -2.045 to 2.045, we do not
reject the null hypothesis.
In conclusion, there is not enough evidence to conclude that the new search engine has
significantly altered the average time necessary to find a product at the 5% significance
level.
import math
from scipy import stats

# Problem data
sample_mean = 2.7
theoretical_mean = 2.5
standard_deviation = 0.6
sample_size = 30
significance_level = 0.05

# t statistic and degrees of freedom
t_stat = (sample_mean - theoretical_mean) / (standard_deviation / math.sqrt(sample_size))
df = sample_size - 1

# Critical value for the two-tailed test
critical_value = stats.t.ppf(1 - significance_level / 2, df)
test_significant = abs(t_stat) > critical_value

# Solution output
print({'t_stat': t_stat, 'critical_value': critical_value, 'test_significant': test_significant})
The code uses the scipy.stats library to perform a one-sample Student's t-test. This library
is powerful for performing statistical tests and associated calculations.
• Import scipy.stats as stats and math, scipy.stats is useful for statistical functions, while
math simplifies the use of square roots.
• Define variables representing the sample mean (2.7), theoretical mean (2.5), standard
deviation (0.6), and sample size (30).
• Calculate the t-value using the formula t = (x̄ - μ0) / (s / √n), where x̄ is the sample mean, μ0 is
the mean under the null hypothesis, s is the sample standard deviation, and n is the
sample size.
• Determine the degrees of freedom with df = sample_size - 1.
• Use stats.t.ppf to obtain the critical value of the two-tailed t-test, considering the
significance level set at 5%.
• Determine if the test is significant by comparing t_stat with critical_value. If the
absolute value of t_stat exceeds critical_value, the null hypothesis is rejected.
• Finally, the code returns t_stat, critical_value, and a boolean test_significant indicating
if the test result is significant.
Alternatively, if we had the entire sample data instead of the mean and standard deviation,
we could use the ttest_1samp function from scipy to calculate the p-value directly and reject
the null hypothesis if it were smaller than the desired significance level.
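For completeness, here is a small sketch of that alternative; the sample of search times below is invented purely to show the call, since the exercise only provides summary statistics:

from scipy import stats

# Invented raw sample of search times (seconds), for illustration only
sample = [2.9, 2.4, 3.1, 2.6, 2.8, 2.2, 3.0, 2.7, 2.5, 2.9]

# One-sample t-test against the hypothesized mean of 2.5 seconds
t_stat, p_value = stats.ttest_1samp(sample, popmean=2.5)
print(t_stat, p_value)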
Given this scenario and a 5% significance level, the company wants to understand whether
the maintenance intervention has improved the average production time.
Solution
To tackle this exercise, we use a one-sample Student's t-test, specifying that it is a left
tailed test because we are checking if the average production time has decreased.
1. Definition of hypotheses:
o Null hypothesis (H0): The average production time is greater than or equal to 1.2
minutes (μ ≥ 1.2 minutes).
o Alternative hypothesis (H1): The average production time is less than 1.2 minutes
(μ < 1.2 minutes).
2. Calculation of the t-test statistic:
t = (x̄ - μ0) / (s / √n) = (1.1 - 1.2) / (0.3 / √25) = -0.1 / 0.06 ≈ -1.67
3. Conclusion:
With a significance level of 5% and 24 degrees of freedom (n - 1), the critical value of t
for a one-tailed test is approximately -1.711. Since the calculated value (-1.67) is not
less than the critical value (-1.711), we cannot reject the null hypothesis.
In conclusion, with the data available to us, there is not enough evidence to claim that the
maintenance intervention has significantly improved the average production time at the
5% level.
from scipy import stats

# Problem data
historical_mean = 1.2      # minutes
sample_mean = 1.1          # minutes
standard_deviation = 0.3   # minutes
n = 25                     # sample size

# Calculation of the t test statistic
t_stat = (sample_mean - historical_mean) / (standard_deviation / (n ** 0.5))

# Settings for the t-test
alpha = 0.05   # significance level
dof = n - 1    # degrees of freedom

# Critical value for the left-tailed test
t_critical = stats.t.ppf(alpha, dof)

# Test result
print(t_stat, t_critical, t_stat < t_critical)
In this example, we use the scipy.stats library, which provides the Student's t distribution needed for the critical value.
1. We define the data related to the exercise, including the historical mean (p0), the
sample mean (x), the standard deviation, and the sample size.
2. We use the formula for the t-test for one sample:
t = (x̄ - μ0) / (s / √n)
This calculates how much the sample mean differs from the historical mean in units of
standard deviation.
3. We establish the significance level (a) at 5% and calculate the degrees of freedom (n -
1) for the test.
4. Using stats.t.ppf, we obtain the critical value for the one-tailed test. This represents
the value that our t must be lower than to reject the null hypothesis.
5. We compare the test statistic with the critical value to determine if there is a
statistically significant improvement in production time. If the calculated t value is less
than the critical value, we reject the null hypothesis in favor of the alternative
hypothesis.
Also, in this case, if we had the entire sample available, the use of the function ttest_1samp
would give us the p-value of the test.
6.2 Student's t-test for the Means of Two Samples
The two-sample Student's t-test is a statistical test used to compare the means of two
independent samples and determine if there are any significant differences between them.
This test is particularly useful when you want to verify if two groups have similar or
different means, and it is used when the variances of the two populations are unknown.
The two-sample Student's t-test is often used in various business contexts to compare two
groups. Some examples include the following cases:
• A company might want to compare the average sales performance between two
divisions. The test can determine if the differences in average sales are statistically
significant.
• A company may want to test if two groups of consumers (e.g., consumers from two
different cities) have a similar average product satisfaction.
• A company could want to compare the average production times between two
production plants or two production lines to determine if there are significant
differences in their efficiency.
To correctly apply the two-sample Student's t-test, the following requirements must be
satisfied:
• The two samples must be independent of each other, meaning the observations in one
sample must not influence those in the other sample.
• The data in the samples should be normally distributed. However, with sufficiently
large sample sizes (generally n > 30), this condition can be approximately met even if
the data do not perfectly follow a normal distribution, thanks to the central limit
theorem.
• Although the two-sample Student's t-test does not require equal variances, a more
common version of the test, called the Student's t-test with equal variance assumption,
assumes that the variances of the two samples are equal. If this condition is not met, a
t-test with different variances can be used.
• The null hypothesis Ho asserts that the means of the two samples are equal, meaning
there is no significant difference between the two populations. Mathematically:
H0 : μ1 = μ2,
where μ1 and μ2 are the means of the two populations.
• The alternative hypothesis claims that the means of the two samples are different,
meaning there is a significant difference between the two populations. It can be two-
tailed:
H1 : μ1 ≠ μ2
or one-tailed:
H1 : μ1 > μ2   or   H1 : μ1 < μ2
The test statistic for the two-sample Student's t-test with unequal variances is calculated as:
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)
where x̄1 and x̄2 are the sample means, s1² and s2² are the sample variances, and n1 and n2
are the sample sizes.
The t statistic follows a Student's t-distribution with degrees of freedom calculated with the
Welch–Satterthwaite formula:
df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)² / (n1 - 1) + (s2²/n2)² / (n2 - 1) ]
Again, the null hypothesis is rejected if the p-value of the test is less than the desired
critical threshold.
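A minimal sketch of these formulas (the two small samples below are invented for illustration) compares the manually computed Welch statistic and degrees of freedom with scipy's ttest_ind:

import numpy as np
from scipy import stats

# Invented samples for illustration
x1 = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3])
x2 = np.array([11.2, 11.5, 10.9, 11.4, 11.1, 11.6, 11.0])

m1, m2 = x1.mean(), x2.mean()
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)
n1, n2 = len(x1), len(x2)

# Welch's t statistic
t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# Welch-Satterthwaite degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

# Two-tailed p-value, and the same test via scipy
p_value = 2 * stats.t.sf(abs(t), df)
print(t, df, p_value)
print(stats.ttest_ind(x1, x2, equal_var=False))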
Exercise 73. Productivity Analysis in Two Plants A manufacturing company has two
plants, one located in Milan and the other in Rome. Management wants to compare the
average productivity of the two plants to determine if one is significantly more productive.
For this reason, production hours per shift are collected in both plants. The sample data are
as follows:
Suppose the variances of the two populations are different. Verify, using a significance level
of 5%, whether there is a significant difference between the two productivity means.
Solution
To analyze the problem, we use the Welch's test, a two-sample Student's t-test used when
the variances of the two populations are assumed to be different.
The null hypothesis (Ho) is that the Milan plant is as productive as the Rome plant. The
alternative hypothesis is that it is not.
2. Calculation of the sample means and sample variances; for Rome, for example:
s2² = Σ(xi - x̄2)² / (n2 - 1) ≈ 0.1839
3. Calculation of the Welch's test statistic:
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2) ≈ 5.7
4. Determination of the degrees of freedom with the Welch–Satterthwaite formula, which
gives approximately 17.
import numpy as np
from scipy import stats

# Data (the Rome list is truncated in the source; only the visible values are kept)
milano_data = np.array([8, 7.5, 8, 7, 8.5, 9, 7.8, 8.2, 8, 7.5])
roma_data = np.array([6, 6.5, 7, 6, 6.8, 7.2, 7, 6.5])

# Welch's test
t_stat, p_value = stats.ttest_ind(milano_data, roma_data, equal_var=False)

# Comparison
test_result = "Reject H0" if p_value < 0.05 else "Do not reject H0"
print(t_stat, p_value, test_result)
The code implements a Welch's test to verify if there is a significant difference between the
productivity means of two production plants located in Milan and Rome. The main libraries
used are numpy and scipy, which are essential for performing statistical and mathematical
calculations in Python.
The main part of the code that executes the statistical test is done with the function
stats.ttest_ind, which calculates the t-test for independent samples without assuming
equal variances (option equal_var=False). This returns the test statistic value (t_stat) and the
associated p-value.
Finally, the code compares the p-value with the 5% threshold to decide whether to accept
or reject the null hypothesis: if the p-value is less than 5%, the null hypothesis is rejected,
indicating that the productivity difference between the two plants is statistically significant.
Alternatively, we could have calculated the critical t-value with the expression
np.abs(stats.t.ppf((1 - 0.95) / 2, 17)), which would indeed give us about 2.10. With this critical value
and a t-value of 5.7, we reject the null hypothesis.
• Instagram: [85, 90, 88, 92, 89, 91, 87, 86, 95, 94]
• Facebook: [80, 78, 85, 82, 81, 83, 79, 77, 84, 82]
The management wants to determine if the difference in the average number of purchases
prompted by the two platforms is significant. A significance level of 5% is used.
Solution
To address this question, it is necessary to compare the means of the two samples and
determine if there is a statistically significant difference between them. This is a typical
case for applying the Welch's t-test, used when the variances of the two populations are
presumed to be unequal.
1. Formulation of hypotheses:
o Null hypothesis (Ho): The mean number of purchases on Instagram is equal to the
mean number of purchases on Facebook.
o Alternative hypothesis (Hr): The mean number of purchases on Instagram is
different from the mean number of purchases on Facebook.
2. Calculation of sample statistics:
o Mean for Instagram: x̄1 = (85 + 90 + 88 + 92 + 89 + 91 + 87 + 86 + 95 + 94) / 10 = 89.7
o Mean for Facebook: x̄2 = (80 + 78 + 85 + 82 + 81 + 83 + 79 + 77 + 84 + 82) / 10 = 81.1
o Variance for Instagram: s1² = Σ(xi - x̄1)² / (n - 1) = 11.12
o Variance for Facebook: s2² = Σ(xi - x̄2)² / (n - 1) = 6.77
3. Calculation of the Welch's statistic:
t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2) = (89.7 - 81.1) / √(11.12/10 + 6.77/10) ≈ 6.43,
a value far beyond the 5% critical threshold, so the corresponding p-value is much smaller
than 0.05.
In conclusion, by using Welch's t-test to compare the two means, we conclude that the
advertising campaigns on Instagram and Facebook produce significantly different results in
terms of purchases.
import numpy as np
from scipy import stats

# Collected data
instagram_purchases = np.array([85, 90, 88, 92, 89, 91, 87, 86, 95, 94])
facebook_purchases = np.array([80, 78, 85, 82, 81, 83, 79, 77, 84, 82])

# Mean and variance for each sample
instagram_mean = np.mean(instagram_purchases)
facebook_mean = np.mean(facebook_purchases)
instagram_variance = np.var(instagram_purchases, ddof=1)
facebook_variance = np.var(facebook_purchases, ddof=1)

# Welch's test
t_statistic, p_value = stats.ttest_ind(instagram_purchases, facebook_purchases, equal_var=False)

# Results
alpha = 0.05
print(f"Instagram Mean: {instagram_mean}")
print(f"Facebook Mean: {facebook_mean}")
print(f"Instagram Variance: {instagram_variance}")
print(f"Facebook Variance: {facebook_variance}")
print(f"t statistic: {t_statistic}, p-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis: there is a significant difference between the two platforms.")
else:
    print("Do not reject the null hypothesis: no significant difference between the two platforms.")
1. Library Imports
o numpy is used to facilitate the calculation of means and variances.
o scipy.stats provides the ttest_ind method to perform the Welch's t-test, allowing
comparison of the sample means.
2. Calculation of Basic Statistics
o Means for Instagram and Facebook are calculated with np.mean.
o Sample variances are calculated with np.var using the ddof=1 parameter to obtain
the sample (corrected) variance.
3. Execution of Welch's Test
o stats.ttest_ind performs the test, utilizing equal_var=False to specify unequal
variances between samples. It returns the t-statistic and the p-value.
4. Conclusion of the Test
o Determine whether to reject the null hypothesis by comparing the p-value to the
significance level alpha, set at 5%.
The ultimate goal of the code is to determine whether the observed differences in average
purchases prompted by Instagram and Facebook are statistically significant, meaning they
are not due to chance.
6.3 Z-test on Proportions
The z-test for proportions is a statistical test used to compare a sample proportion with a
theoretical proportion or to compare two sample proportions. It is used when you want to
verify if a proportion of a population or two populations is significantly different from a
reference value or another proportion. This test is based on the normal distribution and is
applied when the sample is large enough.
The z-test for proportions is commonly used in various business contexts to make
inferences about proportions or percentages, for example to check whether a conversion
rate differs from a historical benchmark or whether a defect rate matches a quality
standard. The hypotheses of the test are formulated as follows:
• The null hypothesis H0 states that the sample proportion is equal to the reference
proportion. In mathematical terms:
H0 : p = p0
where p is the sample proportion and p0 is the theoretical or reference proportion.
• The alternative hypothesis H1 states that the sample proportion is different from the
reference proportion. It can be two-tailed:
H1 : p ≠ p0
or one-tailed:
H1 : p > p0   or   H1 : p < p0
The test statistic for the z-test for proportions is calculated as:
p-po
Z~ /po(l—po)'
V n
where:
• Pis the observed proportion in the sample (i.e., the number of successes divided by the
total number of observations),
• p0 is the reference proportion,
• n is the sample size.
The z statistic follows a standard normal distribution with a mean of 0 and a standard
deviation of 1.
If the p-value is less than the critical threshold (typically 5%), we reject the null hypothesis.
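As a generic sketch of this computation (the helper function name and the example values p̂ = 0.46, p0 = 0.5, n = 400 are illustrative, not taken from an exercise), the statistic and its two-tailed p-value can be obtained with scipy.stats.norm:

import math
from scipy.stats import norm

def z_test_proportion(p_hat, p0, n):
    """Two-tailed z-test for a single proportion (illustrative helper)."""
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Illustrative values only
z, p_value = z_test_proportion(p_hat=0.46, p0=0.5, n=400)
print(z, p_value)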
Solution
1. Problem data:
o Historical proportion (p0) = 0.04
o Total number of customers (n) = 1200
o Number of successes (converted customers) = 60
o Proportion of successes in the sample (p) = 60/1200 = 0.05
2. Formulation of hypotheses:
o Null hypothesis (H0): the proportion of success is equal to the historical proportion
(p = p0).
o Alternative hypothesis (H1): the proportion of success is different from the historical
proportion (p ≠ p0).
3. Calculation of the z-test statistic: The formula for the z-test for a single proportion is:
z = (p̂ - p0) / √(p0(1 - p0) / n) = (0.05 - 0.04) / √(0.04 · 0.96 / 1200) ≈ 1.77
4. Determination of the p-value: Comparing the calculated z-value with the standard
normal distribution at the 0.05 significance level for two tails gives a p-value of about
0.077, which is greater than 0.05.
We do not have sufficient statistical evidence to state that the marketing campaign
induced a significant change in the conversion rate compared to the historical rate of 4%.
Solution with Python
import math
from scipy.stats import norm

# Problem data
p0 = 0.04              # historical proportion
n = 1200               # total number of customers
successes = 60         # number of converted customers
p_hat = successes / n  # observed proportion

# H0: p = p0
# H1: p != p0

# z statistic for a single proportion
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-tailed p-value
p_value = 2 * (1 - norm.cdf(abs(z)))

result = {
    "z": z,
    "p_value": p_value,
    "significant at 5%": p_value < 0.05,
}
print(result)
The code implements a z-test for a single proportion using the data provided in the problem.
First, the problem data is defined with variables such as p0, which represents the historical
proportion, n for the total number of customers, successes for the converted customers, and
p_hat for the observed proportion.
The null and alternative hypotheses are explained in the form of comments in the code,
indicating the statistical context of the test.
The calculation of the z statistic is performed using the formula derived for a single
proportion:
z = (p̂ - p0) / √(p0(1 - p0) / n)
The p-value is calculated using the norm.cdf function, which helps find the cumulative
probability of a standard normal distribution. The use of 2 * (1 - norm.cdf(abs(z))) reflects
the consideration of a two-tailed test.
Finally, the result is printed in a structured form to show the value of z, the p-value, and an
indication of whether the result is statistically significant relative to the 5% significance
level.
Solution
In this exercise, a two-tailed z-test for proportions is applied. The aim is to determine if the
new strategy has led to a significant difference compared to the historical defect
percentage.
1. Hypotheses:
o Null hypothesis (H0): The defect percentage is equal to the historical standard: p =
0.08.
o Alternative hypothesis (H1): The defect percentage is different from the historical
standard: p ≠ 0.08.
2. Calculation of the sample proportion:
p̂ = 30 / 500 = 0.06
3. Calculation of the z-score:
z = (p̂ - p0) / √(p0(1 - p0) / n) = (0.06 - 0.08) / √(0.08 · 0.92 / 500) ≈ -1.648
4. With a 5% significance level in a two-tailed test, the critical values are approximately
±1.96.
5. With z = -1.648, we do not fall into the critical range of ±1.96. Therefore, there is not
enough evidence to reject the null hypothesis.
There is not sufficient statistical evidence to assert that the new quality control strategy
has led to a significant change in the percentage of defective parts compared to the
historical standard of 8%. The two-tailed z-test for proportions suggests that the observed
difference in the sample could be due to random variability.
import math
from scipy.stats import norm

# Problem parameters
p_0 = 0.08   # historical proportion of defective parts
n = 500      # sample size
x = 30       # number of defective parts in the sample

p_hat = x / n
alpha = 0.05

# z statistic
z = (p_hat - p_0) / math.sqrt(p_0 * (1 - p_0) / n)

# Two-tailed critical value
z_critical = norm.ppf(1 - alpha / 2)
reject_null = abs(z) > z_critical

# Results
print(reject_null, z, z_critical)
In this code, we perform a two-tailed z-test for proportions to determine whether the observed defect rate differs significantly from the historical standard of 8%.
6.4 One-Way Analysis of Variance (ANOVA)
The one-way analysis of variance (ANOVA) is a statistical technique used to compare the
means of three or more independent groups in order to determine if at least one of them
significantly differs from the others. One-way ANOVA extends the concept of the t-test to
compare more than two groups.
• A company may use ANOVA to compare the average sales performances among
different departments and determine if the observed differences are statistically
significant.
• ANOVA can be used to compare the average productivity of different production lines
and check for significant differences.
• If a company introduces different models of a product, ANOVA can be used to compare
the average quality ratings among the different models.
To apply one-way ANOVA correctly, the following requirements must be met:
• The groups must be independent from each other. Each observation in a group must be
independent of those in other groups.
• The data in each group should follow a normal distribution. If the sample size is large,
the central limit theorem allows us to approximate normality.
• The variances of the groups should be similar, a condition known as homoscedasticity.
This can be verified using Levene's test or other techniques.
• The null hypothesis states that all group means are equal. In other words, there are no
significant differences between the groups:
H_0: \mu_1 = \mu_2 = \cdots = \mu_k,
where \mu_1, \mu_2, \ldots, \mu_k are the group means.
• The alternative hypothesis asserts that at least one group's mean differs from the
others:
H_1: At least one mean differs from the others.
The test statistic for one-way ANOVA is the F-statistic, which measures the ratio between
the variance between the groups and the variance within the groups:
F = \frac{\text{Variance between the groups}}{\text{Variance within the groups}},
where:
• Variance between the groups measures how much the group means differ from the
overall mean,
• Variance within the groups measures the variability of the observations within each
group.
If the variance between the groups is significantly greater than the variance within the
groups, the F value will be high, suggesting that the group means differ significantly.
The p-value indicates the probability of obtaining observed results, or more extreme
results, if the null hypothesis were true. In other words, the p-value measures the evidence
against the null hypothesis.
If the p-value is less than the significance level a (for example, 0.05), we reject the null
hypothesis, suggesting that at least one group's mean is significantly different from the
others. If the p-value is greater than the significance level a, we do not reject the null
hypothesis, suggesting that there is not sufficient evidence to claim that the group means
are different.
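To make the F ratio concrete, here is a minimal sketch, using three small made-up groups of scores (purely illustrative, not taken from any exercise in this book), that computes the between-group and within-group sums of squares by hand and checks the resulting F value against scipy.stats.f_oneway.
import numpy as np
from scipy.stats import f_oneway

# Hypothetical scores for three groups (illustrative values only)
g1 = [23, 25, 21, 24, 22]
g2 = [27, 29, 26, 30, 28]
g3 = [22, 24, 23, 25, 21]
groups = [np.array(g) for g in (g1, g2, g3)]

# Manual F: ratio of between-group to within-group variance
grand_mean = np.mean(np.concatenate(groups))
k = len(groups)                                  # number of groups
N = sum(len(g) for g in groups)                  # total number of observations
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
F_manual = (ss_between / (k - 1)) / (ss_within / (N - k))

# The same statistic (plus the p-value) from scipy
F_scipy, p_value = f_oneway(g1, g2, g3)
print(F_manual, F_scipy, p_value)
The two F values should coincide up to rounding; the p-value then drives the decision exactly as described above.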
Data:
Determine if there are significant differences in the average response times among the
three regions. Assume that the distributions are normal with equal variance.
Solution
To solve this problem, we use statistical analysis to compare the means of multiple groups.
The appropriate test for this scenario is a one-way Analysis of Variance (ANOVA), which
determines if there are significant differences between the means of three or more
independent groups.
This analysis suggests that the company should further investigate and manage customer
service resources to reduce differences in response times among different regions. By
identifying specific causes of the differences, such as potential inefficiencies in regional
operations, the company can improve customer service and the overall experience.
import numpy as np
from scipy.stats import f_oneway

# Response time data for each region
north = [10, 12, 15, 14, 13]
center = [...]   # values given in the problem data
south = [...]    # values given in the problem data

# One-way ANOVA test
stat, p_value = f_oneway(north, center, south)
print("F =", stat, "p-value =", p_value)

# Determine the result
alpha = 0.05  # significance level
if p_value < alpha:
    print("There are significant differences in average response times among the regions.")
else:
    print("There are no significant differences in average response times among the regions.")
In the provided code, we used the f_oneway function from the scipy.stats module, which is
designed to perform one-way Analysis of Variance (ANOVA). This statistical test allows us to
compare the means of three or more groups and determine if at least one of these means
is significantly different from the others.
1. Import numpy for general mathematical operations and f oneway from scipy.stats to
perform the ANOVA test.
2. The response times for the North, Center, and South regions are provided in Python
lists.
3. Use f oneway by passing the three data lists to obtain the F-value (stat) and the p-value
(p value). These values help us determine the significance of the observed differences.
4. Print the F-value and the p-value for an immediate view of the ANOVA results.
5. Compare the p-value with a significance level (alpha), here set at 0.05, to decide
whether to reject the null hypothesis. If the p-value is less than alpha, we conclude that
there are significant differences among the group means.
This approach is efficient for identifying statistically significant differences in the presence
of multiple groups and provides valuable insights to support strategic business decisions.
Collected data:
• Urban: 7, 8, 7, 6, 9
• Suburban: 5, 6, 5, 7, 6
• Rural: 8, 7, 9, 9, 8
Solution
To tackle this problem, we use a statistical method to compare means between more than
two independent groups. In this case, a statistical analysis suitable for comparing the
means of more than two groups is adopted, without presuming which one might have a
higher mean.
1. Formulation of Hypotheses:
o Null hypothesis (Ho): There are no significant differences in the average customer
satisfaction scores between branches.
o Alternative hypothesis (H_1): At least one of the branch satisfaction means is
significantly different from the others.
2. Calculate the necessary statistic using dedicated software. At this stage, we calculate
the overall variability between groups and within the groups themselves.
3. Compare the calculated statistic with the critical value referenced from the statistical
software or statistical table for the specific number of groups and samples.
4. If the calculated statistic exceeds the critical value or if the p-value is less than the
level of significance (usually 0.05), reject the null hypothesis, concluding that there are
significant differences in satisfaction scores. Otherwise, do not reject the null
hypothesis.
After applying this method, we find that the statistic allows us to reject the null hypothesis:
there are significant differences between the average satisfaction scores of the various
branches, suggesting that the area may influence the perceived customer satisfaction.
from scipy import stats

# Customer satisfaction data for urban, suburban, and rural branches
satisfaction_urban = [7, 8, 7, 6, 9]
satisfaction_suburban = [5, 6, 5, 7, 6]
satisfaction_rural = [8, 7, 9, 9, 8]

# Perform a one-way ANOVA test
f_statistic, p_value = stats.f_oneway(satisfaction_urban, satisfaction_suburban, satisfaction_rural)

alpha = 0.05
if p_value < alpha:
    print("There are significant differences between the average satisfaction scores of the branches.")
else:
    print("There are no significant differences between the average satisfaction scores of the branches.")
• Importing stats from scipy: We use scipy.stats to access the f oneway function, which
performs the one-way ANOVA test.
• Customer satisfaction scores for the three branches (urban, suburban, and rural) are
stored in separate lists.
• The f oneway function takes several groups as input and returns two values: the F-
statistic and the p-value. The F-statistic measures the proportion of variance between
groups relative to the variance within the groups. The p-value allows us to evaluate the
evidence against the null hypothesis.
• We compare the p-value with the level of significance (set here at 0.05). If the p-value
is lower than this value, we reject the null hypothesis, concluding that there are
significant differences between the average satisfaction scores of the branches.
Otherwise, we do not reject the null hypothesis.
The use of ANOVA is appropriate in this context because we are analyzing three
independent groups, and the goal is to understand if at least one group significantly differs
from the others in terms of average scores.
6.5 Kruskal-Wallis Test on the Median of Multiple Groups
The Kruskal-Wallis test is a non-parametric test used to compare the medians of more than
two independent groups. It is a non-parametric version of the one-way ANOVA and is used
when the assumptions of normality or homogeneity of variances required by ANOVA cannot
be met. The Kruskal-Wallis test is useful when the data are ordinal or when quantitative
data do not follow a normal distribution.
The Kruskal-Wallis test is applied in various business contexts to compare data across
multiple groups when reliance on parametric tests is not possible. Some examples include:
• If a company wants to compare the performance of different work teams (for example,
based on scores or ordinal ratings), the Kruskal-Wallis test can be used to verify if there
are significant differences between the teams.
• When customers express preferences for different variants of a product on an ordinal
scale, the Kruskal-Wallis test can be used to compare the medians of the ratings among
the different product variants.
• A company that produces goods in different plants might use the Kruskal-Wallis test to
compare the quality of products across different plants, based on ordinal measures or
scores that are not normally distributed.
To correctly apply the Kruskal-Wallis test, the following requirements must be met:
• The groups must be independent of each other. Each observation must belong to only
one group and must be independent from the others.
• The data must be at least ordinal, meaning they must have a natural order.
• The null hypothesis states that all group medians are identical.
• The alternative hypothesis suggests that at least one of the group medians differs from
the others.
The Kruskal-Wallis test statistic is based on the rank of data within each group. The main
steps to calculate the statistic are:
• Order all observations from all groups together and assign them ranks.
• Calculate the sum of ranks for each group.
• Calculate the H statistic as follows (a short Python sketch after this list illustrates the calculation):
H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1),
where:
o N is the total number of observations (the sum of the sizes of the groups),
o k is the number of groups,
o R_i is the sum of ranks for group i,
o n_i is the size of group i.
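As a quick illustration of the formula, the sketch below uses invented ordinal ratings for three hypothetical teams: it ranks all observations together, computes H by hand, and compares the result with scipy.stats.kruskal. Note that scipy applies a correction for ties, so the two values can differ slightly when tied observations are present.
import numpy as np
from scipy.stats import rankdata, kruskal

# Hypothetical ordinal ratings for three teams (illustrative values only)
a = [7, 8, 6, 9, 7]
b = [5, 6, 5, 7, 6]
c = [8, 9, 9, 7, 8]
groups = [a, b, c]

# Rank all observations together (ties receive average ranks)
all_values = np.concatenate(groups)
ranks = rankdata(all_values)
sizes = [len(g) for g in groups]
N = len(all_values)

# Sum of ranks for each group
rank_sums = [r.sum() for r in np.split(ranks, np.cumsum(sizes)[:-1])]

# H statistic from the formula above (without tie correction)
H = 12 / (N * (N + 1)) * sum(R ** 2 / n for R, n in zip(rank_sums, sizes)) - 3 * (N + 1)

# scipy's version, which also returns the p-value
H_scipy, p_value = kruskal(a, b, c)
print(H, H_scipy, p_value)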
The company wants to determine if there are significant differences in the sales
performance of the three teams. Assuming that the data might not follow a normal
distribution, how should the company proceed?
Solution
To tackle this problem, the company should use the Kruskal-Wallis test, a non-parametric
test used to determine if there are significant differences between the medians of three or
more independent groups, especially when the data do not follow a normal distribution.
The null hypothesis Ho is that the medians are equal.
To perform the Kruskal-Wallis test, it is advisable to use specific software that calculates the
p-value or the test statistic H. If the p-value is less than 5%, the null hypothesis is rejected.
Similarly, if H is more extreme than the critical value for this dataset, the null hypothesis is
rejected.
If the calculations result in a significant statistic, management could further investigate the
differences between the teams and consider regionally tailored improvement strategies.
from scipy.stats import kruskal

# Weekly sales for the three teams (values given in the problem data)
north = [...]
central = [...]
south = [...]

# Kruskal-Wallis test
h_statistic, p_value = kruskal(north, central, south)

# Output results
result = {
    "H statistic": h_statistic,
    "p-value": p_value,
    "significant at 5%": p_value < 0.05
}
print(result)
The provided code uses the scipy.stats library to perform the Kruskal-Wallis Test, a non
parametric statistical test that does not assume the data follows a normal distribution. This
makes it ideal for comparing the medians of independent groups such as the weekly sales
of the North, Central, and South teams presented in the problem.
Let's look at the various steps in the code: the weekly sales of the three teams are stored in separate lists, the kruskal function from scipy.stats returns the H statistic and the p-value, and the results are collected in a dictionary and compared with the 5% significance level.
The company wants to determine if there are significant differences in the customer
satisfaction levels among the three departments. Considering that the feedback scores
may not follow a normal distribution, how should the company proceed to reach a
significant conclusion?
Solution
To address the company's problem and evaluate the significance of the differences in
satisfaction levels among the three departments, the Kruskal-Wallis test can be referred to.
This non-parametric test is suitable for comparing three or more independent groups,
especially when the normal distribution of the data cannot be assumed.
1. Initial Hypothesis:
o Null hypothesis (Ho): There are no significant differences in the satisfaction levels
among the Software, Hardware, and Networks departments.
o Alternative hypothesis (H_1): There are significant differences in the satisfaction
levels among the departments.
2. Test Calculation:
o Using specific statistical software, the p-value of the test is calculated.
3. Interpretation:
o If the calculated p-value is less than or equal to 5%, the null hypothesis is rejected.
Using the Kruskal-Wallis test, the company can determine if the observed differences in
rank means are large enough to be considered statistically significant.
from scipy.stats import kruskal

# Scores for each department
software_scores = [85, 90, 78, 88, 92]
hardware_scores = [...]   # values given in the problem data
networks_scores = [...]   # values given in the problem data

# Kruskal-Wallis test
h_statistic, p_value = kruskal(software_scores, hardware_scores, networks_scores)

# Print the H statistic and p-value
print(h_statistic, p_value)
The provided code uses the scipy.stats library to perform the Kruskal-Wallis test.
The choice to use scipy is motivated by its wide set of tools for statistical analysis, making
it particularly useful for performing tests like the Kruskal-Wallis quickly and efficiently.
6.6 Levene’s Test for Equality of Variances Across
Multiple Groups
Levene’s test is a statistical test used to check the homogeneity of variances across two or
more groups. Homogeneity of variances, meaning the condition that variances among
groups are similar, is a fundamental assumption for the application of various parametric
tests, such as ANOVA.
Levene’s test is used in business settings to verify if data groups in an analysis of variance
(ANOVA) or other statistical tests show similar variances. Some examples of its application
are:
• In an analysis of the performance of different work teams, Levene’s test can be used to
check that the variances among team scores are similar before applying a test like
ANOVA.
• In a company that produces products at different facilities, Levene’s test can be used
to verify whether the quality variations among products are homogeneous across
different facilities.
• When a company tests different variants of a product, Levene’s test can be used to
ensure that the variances in customer evaluations are similar among the variants.
Levene’s test does not have the normality restriction that characterizes other tests like
ANOVA. However, for correct application, the independence of observations among groups
must be assured.
• Null hypothesis (Ho): the variances of the groups are equal. In other words, there are no
significant differences in variances among the groups:
H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2,
where \sigma_1^2, \ldots, \sigma_k^2 are the variances of the groups.
• Alternative hypothesis (H_1): at least one of the variances differs from the others.
Levene’s test statistic is based on the calculation of the difference between each
observation and the median (or alternatively the mean) of the group. The test is calculated
as follows:
• For each group, calculate the absolute deviation between each observed value and the
group's median (or mean).
• Perform an analysis of variance on the absolute values of the deviations obtained.
• Levene's test statistic W is given by:
W = \frac{N - k}{k - 1} \cdot \frac{\sum_{i=1}^{k} n_i (\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot})^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (Z_{ij} - \bar{Z}_{i\cdot})^2},
where Z_{ij} is the absolute deviation of observation j in group i from the group's median (or mean), \bar{Z}_{i\cdot} is the mean of these deviations in group i, and \bar{Z}_{\cdot\cdot} is their overall mean.
The statistic W follows an F-distribution with k - 1 and N - k degrees of freedom if the null hypothesis is true. This allows us to calculate the p-value of the test. If the p-value is less than the significance level α (for example, 0.05), we reject the null hypothesis, suggesting that at least one of the group variances is significantly different from the others.
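To see the mechanics of the test, here is a small sketch with invented measurements for two hypothetical production lines: it computes the absolute deviations of each observation from its group median and runs a one-way ANOVA on those deviations, which is exactly how W is built, and then compares the result with scipy.stats.levene (whose default center='median' corresponds to this variant).
import numpy as np
from scipy.stats import f_oneway, levene

# Hypothetical measurements for two production lines (illustrative values only)
line1 = [10.1, 9.8, 10.4, 10.0, 9.7, 10.3]
line2 = [9.2, 10.9, 9.0, 11.1, 8.8, 11.3]

# Absolute deviations from each group's median
z1 = np.abs(np.array(line1) - np.median(line1))
z2 = np.abs(np.array(line2) - np.median(line2))

# ANOVA on the deviations reproduces Levene's W
W_manual, p_manual = f_oneway(z1, z2)

# scipy's implementation of the same test
W_scipy, p_scipy = levene(line1, line2)
print(W_manual, p_manual, W_scipy, p_scipy)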
• Western Europe (in millions of euros): [20, 22, 23, 25, 21, 24, 26, 23, 24, 22, 25, 23, 24,
22, 25, 27, 20, 22, 23, 24]
. Eastern Asia (in millions of euros): [18, 20, 21, 19, 20, 22, 23, 21, 19, 20, 21, 20, 18,
21, 23, 22, 20, 19, 21, 20]
Solution
To perform the test, we use Levene’s test, which is designed to check for the equality of
variances between two or more groups. This test is more robust than other variance
homogeneity tests when data are not necessarily normally distributed.
Levene's test requires specific software to be carried out and, like all tests, returns a p-
value. The null hypothesis Ho is that the variances of the groups are equal. If the p-value is
less than the significance level we have set (e.g., 5%), we reject the null hypothesis.
Should this occur, managers might need to consider this different variability in sales for a
more efficient allocation of resources and to obtain a more accurate forecast of future sales
between the different regions.
import numpy as np
from scipy.stats import levene

# Sales data
sales_western_europe = [20, 22, 23, 25, 21, 24, 26, 23, 24, 22, 25, 23, 24, 22, 25, 27, 20, 22, 23, 24]
sales_eastern_asia = [18, 20, 21, 19, 20, 22, 23, 21, 19, 20, 21, 20, 18, 21, 23, 22, 20, 19, 21, 20]

# Variance calculation
variance_western_europe = np.var(sales_western_europe, ddof=1)
variance_eastern_asia = np.var(sales_eastern_asia, ddof=1)

# Levene's test
stat, p_value = levene(sales_western_europe, sales_eastern_asia)

# Results
print(f"Variance of Western Europe: {variance_western_europe}")
print(f"Variance of Eastern Asia: {variance_eastern_asia}")
print(f"Levene statistic: {stat}, p-value: {p_value}")

alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in the variability of quarterly sales between the two regions.")
else:
    print("There is no significant difference in the variability of quarterly sales between the two regions.")
The np.var() function from NumPy is used to calculate the variance of a data series, and the parameter ddof=1 specifies that we are calculating the sample variance.
SciPy provides functions for advanced computation and statistical testing. In this context, we use the levene function available in the scipy.stats module to perform Levene's test, which is ideal for checking the equality of variances between two groups.
In the code, we initially load the quarterly sales data for the two regions. Then we calculate
the variance for each group. We then use Levene's test to determine if the variances
between the two regions differ significantly. The resulting pvalue variable allows us to make
a decision: if it is lower than the significance level (alpha) of 0.05, we reject the null
hypothesis, indicating that the variances are significantly different.
The code output includes the calculated variances, the results of Levene's test (statistic and p-value), and a conclusion about the hypothesis tested.
Exercise 82. Analysis of Turnover Variability A major consulting firm has divided its
workforce into three regional teams: North America, Europe, and Asia-Pacific. Each team
has worked on similar projects and now the human resources department wants to check if
there is a significant difference in the quarterly staff turnover variability among these three
regions. The recent data related to turnover rates for the last six quarters are:
Determine if there is a significant difference in turnover variability among the three groups
using an appropriate statistical test.
Solution
To solve this problem, we use Levene's test, which is designed to assess the difference in
variability across multiple groups without assumptions about the probability distributions
involved.
Using specific software, it’s possible to calculate the p-value from the test. If the p-value is
less than a chosen significance level (typically 0.05), we can conclude that there is a
significant difference in turnover variability between at least two of the groups. If the p-
value is higher, we cannot reject the null hypothesis of homogeneity of variances.
from scipy.stats import levene

# Turnover rate data for the three regions
north_america = [5.2, 4.9, 5.5, 5.0, 5.3, 5.1]
europe = [...]        # values given in the problem data
asia_pacific = [...]  # values given in the problem data

# Use Levene's test to check for variance homogeneity
stat, p_value = levene(north_america, europe, asia_pacific)

significance = 0.05
if p_value < significance:
    print("There is a significant difference in turnover variability among the groups.")
else:
    print("There is no significant difference in turnover variability among the groups.")
We clearly identified the turnover rates for the regions of North America, Europe, and Asia-
Pacific as separate Python lists. These lists contain the raw data to be examined.
The command levene(north_america, europe, asia_pacific) calculates Levene's test statistic and returns a statistical value and a p-value. The test statistic measures evidence against the null hypothesis of equal variances.
We compare the obtained p-value with a predetermined significance level, in this case,
0.05. If the p-value is below the significance level, we can reject the null hypothesis of
equal variances, indicating that there are significant differences in turnover variability
among the regions. If it is higher, we cannot reject the null hypothesis.
Finally, the code provides an interpretation of the result, indicating whether or not there is
evidence of a significant difference in turnover variability among the three regions.
6.7 One-Sample Kolmogorov-Smirnov Test
The one-sample Kolmogorov-Smirnov test is a non-parametric test used to verify whether a sample of data is compatible with a specified theoretical distribution, by comparing the empirical distribution of the sample with the theoretical one. Some examples of its use in business contexts include:
• A company might use this test to check if monthly sales data follow a normal
distribution or another theoretical distribution, before applying statistical analysis
techniques that assume a specific distribution.
• If a company has a quality specification for a product, the test can be used to verify if
the variability of quality data follows a normal distribution or another expected
distribution.
• A company managing a queue of customers may use the Kolmogorov-Smirnov test to
verify if the waiting times follow an exponential distribution, which is a common
distribution in waiting time problems.
The one-sample Kolmogorov-Smirnov test does not need to assume a specific distribution
for the data, but it requires that the observations are independent. Additionally, it is crucial
that the theoretical distribution against which the data are compared is clearly defined.
• Null hypothesis (Ho): the observed data follow the theoretical distribution. In other
words, there are no significant differences between the empirical distribution of the
data and the theoretical distribution.
• Alternative hypothesis (H_1): the observed data do not follow the theoretical distribution,
meaning there is a significant difference between the empirical distribution of the data
and the theoretical one.
The Kolmogorov-Smirnov test statistic measures the maximum absolute deviation between
the empirical cumulative distribution of the data (F„(x)) and the theoretical cumulative
distribution (F(x)):
D = \sup_x |F_n(x) - F(x)|,
where F_n(x) is the empirical cumulative distribution function of the sample and F(x) is the theoretical cumulative distribution function.
The statistic D measures the maximum vertical distance between the two curves (empirical
and theoretical). It is a random variable that follows a specific probability distribution
characteristic of this test, allowing us to calculate the p-value. If the p-value is lower than
the significance level a (e.g., 0.05), we reject the null hypothesis, suggesting that the data
do not follow the theoretical distribution. If the p-value is greater than the significance level
a, we do not reject the null hypothesis, suggesting that there is not enough evidence to
claim that the data do not follow the theoretical distribution.
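Before moving to the exercises, a short sketch may help. With an invented sample and invented parameters mu and sigma, it computes D directly from the definition, checking the ECDF on both sides of each jump, and compares the result with scipy.stats.kstest.
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical sample to be compared with N(mu, sigma) (illustrative values only)
sample = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 4.7, 5.2])
mu, sigma = 5.0, 0.3

# Manual D: largest gap between the ECDF and the theoretical CDF
x = np.sort(sample)
n = len(x)
cdf = norm.cdf(x, loc=mu, scale=sigma)
d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # ECDF just after each jump
d_minus = np.max(cdf - np.arange(0, n) / n)      # ECDF just before each jump
D_manual = max(d_plus, d_minus)

# The same statistic, plus the p-value, from scipy
D_scipy, p_value = kstest(sample, 'norm', args=(mu, sigma))
print(D_manual, D_scipy, p_value)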
Exercise 83. Analysis of Product Demand An electronics company wants to understand
if the weekly distribution of sales for a new smartphone model follows a certain predictive
normal distribution. This predictive distribution has been built based on weekly sales of
similar models over the past five years.
The company collected a 10-week sales sample for the new smartphone model, with the
following sales data (in units): 110, 112, 107, 115, 108, 111, 116, 109, 113, 114.
The predictive distribution can be described as a normal with mean μ = 112 and standard deviation σ = 3.
It is required to verify if the sales sample belongs to the predictive distribution. Use a
significance level of 5%.
Solution
To solve this problem, we use the Kolmogorov-Smirnov (K-S) test, which allows us to
compare an empirical sample with a theoretical distribution.
First of all, we calculate the empirical cumulative distribution function (ECDF) for our
sample. Ordering the data: 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, the ECDF
assumes the following values:
. F(107) = 0.1
• F(108) = 0.2
• F(109) = 0.3
• ...
• F(116) = 1.0
The cumulative distribution function of the predictive normal distribution (CDF) must be calculated for each value of the sample. For example, for x = 107: CDF(107) = \Phi\left(\frac{107 - 112}{3}\right) = \Phi(-1.67) \approx 0.048.
Now we calculate the maximum absolute deviation between the ECDF and the theoretical CDF, denoted as D = max |ECDF(x) - CDF(x)|.
We compare the value of D with the Kolmogorov-Smirnov critical value for the 5%
significance level and a sample size of n = 10.
If D is greater than the critical value, we reject the null hypothesis that the sample belongs
to the predictive distribution. Otherwise, we do not have sufficient evidence to reject the
null hypothesis. Alternatively, we calculate the p-value of the test and reject the null
hypothesis if it is less than 5%.
from scipy.stats import kstest

# Empirical distribution data
weekly_sales = [110, 112, 107, 115, 108, 111, 116, 109, 113, 114]

# Parameters of the predictive normal distribution
mu = 112
sigma = 3

# Calculate the D value using kstest
d, p_value = kstest(weekly_sales, 'norm', args=(mu, sigma))

# Interpretation of the result
if p_value < 0.05:
    print("We reject the null hypothesis: the distribution of sales does not follow the predictive distribution.")
else:
    print("We do not reject the null hypothesis: the sales are compatible with the predictive distribution.")
1. We have a list weekly_sales that represents the weekly smartphone sales. A predictive normal distribution is defined with mean mu=112 and standard deviation sigma=3.
2. We use the kstest function to verify the compatibility of the sample with a normal
distribution. The 'norm' parameter specifies the theoretical distribution to be used and
args=(mu, sigma) passes the distribution parameters.
3. We compare the p-value with the 5% significance level.
4. If it is lower, we reject the null hypothesis. If it is higher, there is no sufficient evidence
to reject the hypothesis.
The company decides to test the quality by taking a sample of 15 bottles filled at the new
plant, with the following measured volumes (in ml): 498, 502, 501, 499, 500, 503, 497,
499, 501, 500, 500, 498, 504, 505, 496.
The company wants to statistically determine whether the volume filled by the new
machines can be considered consistent with their expected normal distribution, using a
significance level of 5%.
Solution
We use the Kolmogorov-Smirnov test to compare the sample with the expected normal
distribution.
1. Input Data:
o Sample volumes: 498, 502, 501, 499, 500, 503, 497, 499, 501, 500, 500, 498, 504,
505, 496.
o Expected distribution: Normal with μ = 500 ml, σ = 5 ml.
o Significance level: 0.05.
2. Calculation of the Test Statistic:
o Sort the sample data in ascending order.
o Calculate the empirical cumulative distribution function (ECDF) for the sample
data.
o Calculate the theoretical cumulative distribution function (CDF) using the expected
normal distribution.
o Determine the maximum difference between the ECDF and the CDF.
3. Calculation of Dmax.
o Dmax = max |ECDF(x) - CDF(x)|.
4. Determination of the Critical Value:
o For n = 15 samples and a significance level of 5%, the critical value of D is
approximately 0.351.
5. Comparison of Dmax with the Critical Value:
o If Dmax > 0.351, we reject the null hypothesis that the sample follows the expected
normal distribution.
o If Dmax < 0.351, we do not reject the null hypothesis.
6. Alternatively, we can calculate the p-value of the test and reject the null hypothesis if
the p-value is less than 5%.
Assuming a calculation that leads to Dmax < 0.351, we do not have sufficient evidence to
reject the null hypothesis. Therefore, the sample of filling volumes of the new bottles from
our plant can be considered consistent with the expected normal distribution. Otherwise, if
Dmax were greater, there would be reason to review the bottling process to improve filling
accuracy according to the expected standards.
from scipy.stats import kstest

# Input data
volumes = [498, 502, 501, 499, 500, 503, 497, 499, 501, 500, 500, 498, 504, 505, 496]
mu = 500     # expected mean (ml)
sigma = 5    # expected standard deviation (ml)

# Kolmogorov-Smirnov test against the expected normal distribution
statistic, p_value = kstest(volumes, 'norm', args=(mu, sigma))

# Determination of the result
result = "We reject the null hypothesis" if p_value < 0.05 else "We do not reject the null hypothesis"
print(statistic, p_value, result)
In the Python code above, we use the scipy.stats library to perform a Kolmogorov-Smirnov
test, which allows us to compare the empirical distribution of the sample data with an
expected theoretical distribution, in this case, a normal distribution.
1. Import Libraries:
o scipy.stats offers the function kstest, useful for performing the Kolmogorov-Smirnov
test.
2. Input Data:
o Sample volumes are defined in a list.
o The expected mean (mu) and standard deviation (sigma) are set as variables.
3. Execution of the Kolmogorov-Smirnov test:
o kstest is used to calculate the test statistic and the p-value. Here, we pass the
sample data, the name of the tested distribution ('norm' for normal), and the
distribution parameters (expected mean and standard deviation).
4. Comparison and Conclusion:
o The variable result is used to determine whether to reject the null hypothesis by
comparing the p-value with the required significance.
o In the final result, the test indicates whether the sample data can be considered
consistent with the expected normal distribution.
6.8 Two-Sample Kolmogorov-Smirnov Test
The two-sample Kolmogorov-Smirnov test does not require making assumptions about the
form of the data distribution, but it requires that the samples are independent of each
other. It is a non-parametric test, so it can be used with data that do not follow a normal
distribution.
• Null hypothesis (Ho): the distributions of the two samples are identical.
• Alternative hypothesis (H_1): the distributions of the two samples are different.
The test statistic of the two-sample Kolmogorov-Smirnov test measures the maximum
absolute deviation between the empirical distributions of the two samples. The statistic D
is defined as:
D = \sup_x |F_1(x) - F_2(x)|,
where F_1(x) and F_2(x) are the empirical cumulative distribution functions of the first and second sample, respectively.
The statistic D measures the maximum distance between the two empirical distribution
curves. The p-value obtained from calculating D with respect to the specific distribution of
this test allows us to reject the null hypothesis if it is low (for example, less than 5%) or not
reject it if it is high.
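As an illustration, the following sketch uses two small invented samples: it evaluates both empirical CDFs on the pooled data, takes the largest absolute gap as D, and checks the result against scipy.stats.ks_2samp.
import numpy as np
from scipy.stats import ks_2samp

# Two hypothetical samples (illustrative values only)
x = np.array([1.2, 2.4, 1.9, 3.1, 2.2])
y = np.array([2.0, 2.8, 3.5, 3.0, 2.6])

# Evaluate both ECDFs on the pooled, sorted data and take the largest gap
pooled = np.sort(np.concatenate([x, y]))
ecdf_x = np.searchsorted(np.sort(x), pooled, side='right') / len(x)
ecdf_y = np.searchsorted(np.sort(y), pooled, side='right') / len(y)
D_manual = np.max(np.abs(ecdf_x - ecdf_y))

# The same statistic, plus the p-value, from scipy
D_scipy, p_value = ks_2samp(x, y)
print(D_manual, D_scipy, p_value)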
Group A made the following purchases (in euros): 89, 95, 102, 110, 74, 93, 87, 105, 99, 91.
Group B, however, made the following purchases (in euros): 77, 85, 112, 120, 108, 92, 95,
85, 100, 94.
Solution
To solve this problem, we can use a non-parametric test such as the Kolmogorov-Smirnov
test, which allows us to compare two samples without assumptions about the distribution.
1. Calculate the empirical cumulative distribution functions for both groups. This involves
ordering the data and calculating the proportion of observations below each value in
the ordered list.
2. Identify the maximum difference between the two calculated CDFs:
D = \max_x |F_A(x) - F_B(x)|
This measures where the cumulative distribution functions of the two groups differ the
most.
3. Determine if this maximum difference is statistically significant by comparing it to the
critical value derived from the Kolmogorov-Smirnov table. For a significance of 0.05
with n = m = 10 (the samples), consult the corresponding critical value.
4. If D is greater than the critical value, we can reject the null hypothesis that the two
distributions are the same, concluding there is a significant difference in purchasing
behavior between the two groups.
5. Alternatively, compare the test p-value with the significance level of 0.05. If it is less
than 0.05, we reject the null hypothesis.
from scipy import stats

# Purchase data for the two groups
group_A = [89, 95, 102, 110, 74, 93, 87, 105, 99, 91]
group_B = [77, 85, 112, 120, 108, 92, 95, 85, 100, 94]

# Two-sample Kolmogorov-Smirnov test
statistic, p_value = stats.ks_2samp(group_A, group_B)

# Determine the significance of the result
alpha = 0.05
if p_value < alpha:
    result = "There is a significant difference in purchasing behavior between the two groups."
else:
    result = "There is no significant difference in purchasing behavior between the two groups."
print(statistic, p_value, result)
The code primarily uses the scipy library, specifically the stats module, which provides
useful tools for statistical analysis. In this case, we use the ks 2samp function to perform the
two-sample Kolmogorov-Smirnov test, which allows us to determine the difference between
the cumulative distributions of two independent samples.
Campaign 1 recorded the following daily sales: [12, 15, 14, 10, 13, 17, 20, 15, 16, 14].
Campaign 2 recorded the following daily sales instead: [18, 14, 17, 15, 20, 22, 19, 17, 16,
21].
The management requires a statistical analysis to determine if the sales distributions differ
significantly. A standard significance level of a = 0.05 is assumed.
Solution
1. Definition of hypotheses:
o Null hypothesis (H_0): The two distributions (sales from Campaign 1 and Campaign
2) do not differ significantly.
o Alternative hypothesis (H_1): The two distributions differ significantly.
2. Calculate the empirical cumulative distributions for both samples.
3. Compute the maximum distance between the two empirical cumulative distributions.
4. Compare the calculated test value with the critical value for a = 0.05.
5. If the calculated value exceeds the critical value, we reject the null hypothesis.
If the null hypothesis is rejected, this suggests that the two advertising campaigns have
had different effectiveness in terms of daily sales; managers should then delve into which
aspects of the two campaigns might have caused this difference. Otherwise, there is
insufficient evidence to claim a significant difference in sales distribution behavior.
The analysis thus allows for mapping differences in marketing strategies and optimizing
future campaigns.
import numpy as np
from scipy.stats import kstest

# Daily sales for the two campaigns
sales_campaign1 = [12, 15, 14, 10, 13, 17, 20, 15, 16, 14]
sales_campaign2 = [18, 14, 17, 15, 20, 22, 19, 17, 16, 21]

# Two-sample Kolmogorov-Smirnov test (kstest accepts a second sample in place of a theoretical CDF)
statistic, p_value = kstest(sales_campaign1, sales_campaign2)

# Decision
alpha = 0.05
if p_value < alpha:
    print("Reject H0: there is a significant difference between the distributions of sales from the two campaigns.")
else:
    print("Do not reject H0: there is no significant difference between the distributions of sales from the two campaigns.")
numpy is imported as np to handle numerical data efficiently. However, in this specific case,
we don't need to use numpy, but it is commonly seen when working with numerical arrays.
The Kolmogorov-Smirnov test is computed with the kstest function which, when given two data arrays (as in recent SciPy versions), performs the two-sample comparison, equivalent to ks_2samp. We pass the data for both campaigns to the function, which returns two values: statistic and p_value.
The significance level alpha is set at 0.05, which is the standard value for many statistical
analyses.
6.9 Shapiro-Wilk Test for Normality
The Shapiro-Wilk test is a statistical test used to determine whether a sample comes from a
normal distribution. This test is widely used in inferential statistics to verify the assumption
of normality in data, which is a fundamental premise for many statistical techniques, such
as analysis of variance (ANOVA) and the Student's t-test.
The Shapiro-Wilk test statistic is based on the comparison between an estimate of the
sample variance and a theoretical variance expected under the normality assumption. The
test calculates a statistic W, which is defined as:
W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
where x_{(i)} are the ordered sample values, a_i are constants derived from the expected values and covariances of the order statistics of a standard normal distribution, and \bar{x} is the sample mean.
The statistic W measures how closely the observed distribution approaches the normal
distribution. If the value of W is close to 1, the data are consistent with a normal
distribution. As always, the statistic can be used to derive the p-value through the specific
probability distribution of the test. If the p-value is less than the significance level α (e.g., 0.05), we reject the null hypothesis, suggesting that the data do not follow a normal distribution. If the p-value is greater than the significance level α, we do not reject the null
hypothesis, suggesting that there is not enough evidence to claim that the data do not
follow a normal distribution.
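A minimal sketch of the test in practice, using two synthetic samples (one drawn from a normal distribution, one deliberately skewed; both invented for illustration): scipy.stats.shapiro returns W and the p-value, and the decision follows from comparing the p-value with 0.05.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Synthetic samples: one approximately normal, one clearly skewed
normal_sample = rng.normal(loc=50, scale=5, size=40)
skewed_sample = rng.exponential(scale=5, size=40)

for name, data in [("normal", normal_sample), ("skewed", skewed_sample)]:
    W, p_value = shapiro(data)
    verdict = "do not reject normality" if p_value > 0.05 else "reject normality"
    print(f"{name}: W={W:.3f}, p={p_value:.3f} -> {verdict}")
The normal sample typically yields W close to 1 and a large p-value, while the skewed sample typically yields a much smaller p-value.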
The manager suspects that variability in delivery times might affect the logistics process.
Analyze the collected data to confirm the hypothesis that they follow a normal distribution
and provide statistically founded conclusions.
Solution
To verify if the delivery times follow a normal distribution, we can use the Shapiro-Wilk Test.
Suppose, through calculation, we obtained a p-value of 0.12. Since 0.12 > 0.05, we do not
have sufficient evidence to reject the null hypothesis.
import numpy as np
from scipy import stats

# Delivery time data, in days (remaining values omitted in the source excerpt)
delivery_times = np.array([2.3, 2.5, 3.1, 2.8, 3.8, 2.7, 2.9, 2.6, 3.4, 3.2, 3.0, 2.5, 3.5, 3.8, 2.8, 2.6])

# Shapiro-Wilk test
statistic, p_value = stats.shapiro(delivery_times)

alpha = 0.05
if p_value > alpha:
    print("Do not reject the null hypothesis. Delivery times follow a normal distribution.")
else:
    print("Reject the null hypothesis. Delivery times do not follow a normal distribution.")
1. We use numpy to handle the data as an array and scipy.stats to access the shapiro
function, which performs the Shapiro-Wilk test.
2. The provided data, which are delivery times expressed in days, are organized into a
NumPy array.
3. We use stats.shapiro(), passing the data array. This returns two values: the W statistic and the p-value.
4. Commonly chosen as 0.05, this value helps us decide the acceptance or rejection of
the null hypothesis.
5. If the p-value is greater than alpha, do not reject the null hypothesis. This means that
the data can be considered normally distributed. If the p-value is less than or equal to
alpha, reject the null hypothesis, indicating that the data does not follow a normal
distribution.
The manager intends to determine if the preparation times can be considered normal to
more accurately predict staff needs. Evaluate the data to confirm whether they follow a
suitable distribution to make reliable predictions and provide a conclusion based on the
analysis results.
Solution
To determine if the preparation times follow a normal distribution, we apply the Shapiro-
Wilk test. This test provides a methodology for verifying the normality of the data available.
A significant result (low p-value) indicates a deviation from normality.
After performing the test with the provided data, let's assume we obtain a p-value of 0.25.
Since the obtained p-value is higher than the common significance level of 0.05, we do not
have sufficient evidence to reject the null hypothesis that the data follow a normal
distribution. Therefore, the operations manager can proceed with the assumption that the
preparation times are normally distributed, thus facilitating better kitchen resource
planning using appropriate predictive models.
from scipy import stats

# Preparation time data, in minutes (remaining values omitted in the source excerpt)
preparation_times = [12.2, 10.8, 11.5, 13.6, 12.0, 11.7, 12.4, 13.1, 10.9, 11.3, 12.6, 13.0, 12.8, 11.4, 12.1, 13.3, 12.9, 11.6]

# Shapiro-Wilk test
shapiro_test = stats.shapiro(preparation_times)
p_value = shapiro_test.pvalue

# Interpret the result
if p_value > 0.05:
    print("There is not enough evidence to reject the hypothesis of normality.")
else:
    print("The data do not appear to follow a normal distribution.")
First, the preparation time data is stored in a list called preparation times. This list represents
the observations collected from the restaurant.
Next, the Shapiro-Wilk test is performed using the stats.shapiro() function, which accepts the data list as its argument. This function returns an object containing both the test statistic and the associated p-value. In our case, the p-value is stored in shapiro_test.pvalue.
The p-value is then interpreted to determine the normality of the data set. If the p-value is
greater than the commonly used significance level (0.05), there is not enough evidence to
reject the null hypothesis that the data follows a normal distribution. This supports the idea
that the data is normally distributed, which can facilitate the use of statistical predictive
models.
Conversely, a p-value less than 0.05 would indicate that the data probably does not follow
a normal distribution.
The scipy library is highly versatile and widely used in Python to perform various statistical
tests, including the Shapiro-Wilk test. This module provides a convenient interface for
working with many statistical algorithms and data analysis tools, making its use
particularly popular among researchers and data analysts.
6.10 Chi-Square Test on Contingency Tables
The chi-square test on contingency tables is a statistical test used to determine if there is a
significant relationship between two categorical variables. The contingency table is a table
that shows the observed frequencies of combinations of categories of two variables. The
chi-square test compares the observed frequencies with the expected frequencies,
calculating a statistic that measures the discrepancy between the two.
• A company can use the chi-square test to examine if there is a relationship between
gender and product preference, checking if preferences are evenly distributed among
different gender groups.
• A company might want to understand if product sales are independent of the
distribution channel (e.g., online vs. physical stores). The chi-square test helps
determine if there is a significant relationship between these factors.
• The test can be used to examine if there are significant differences between employee
groups based on their geographic area and the level of job satisfaction.
The chi-square test has some fundamental requirements for its correct application: the observations must be independent, each observation must fall into exactly one cell of the table, and the expected frequencies should not be too small (a common rule of thumb is at least 5 per cell). The hypotheses of the test are:
• Null hypothesis (H_0): the two variables are independent, meaning there is no relationship between them.
• Alternative hypothesis (H_1): the two variables are dependent, meaning there is a relationship between them.
The chi-square test statistic (χ²) measures the discrepancy between the observed and expected frequencies and is calculated as:
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
where:
• O_{ij} is the observed frequency in the cell of the i-th row and j-th column,
• E_{ij} is the expected frequency in the cell of the i-th row and j-th column,
• r is the number of rows in the contingency table,
• c is the number of columns in the contingency table.
The expected frequencies are obtained from the marginal totals as
E_{ij} = \frac{(\text{total of row } i) \cdot (\text{total of column } j)}{N},
where N is the grand total of observations.
The p-value of the chi-square test represents the probability of obtaining a statistic
greater than or equal to the one observed, assuming the null hypothesis is true. If the p-
value is less than the significance level a (e.g., 0.05), we reject the null hypothesis,
suggesting that the variables are dependent. If the p-value is greater than the significance
level a, we do not reject the null hypothesis, suggesting that there is not enough evidence
to claim that the variables are dependent.
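The sketch below, based on an invented 2x3 contingency table (purely illustrative), shows how the expected frequencies and the χ² statistic can be computed by hand and then reproduced with scipy.stats.chi2_contingency.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table: rows = sales channels, columns = product lines
observed = np.array([[30, 45, 25],
                     [20, 35, 45]])

# Expected frequencies under independence: E_ij = row_total_i * col_total_j / N
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
N = observed.sum()
expected = row_totals @ col_totals / N

# Chi-square statistic and degrees of freedom by hand
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# scipy reproduces the same values and adds the p-value
# (Yates' continuity correction is applied only to 2x2 tables, so the statistics match here)
chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(observed)
print(chi2_manual, chi2_scipy, p_value, dof_manual, dof_scipy)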
Age Group    Fresh    Canned    Frozen
18-30        150      60        90
Analyze the data and determine if there is a significant association between the age groups
of customers and their product type preferences. Use a significance level of 5%.
Solution
To solve the problem, we need to test the null hypothesis that there is no association
between age groups and product preferences. We will use the chi-square statistical test for
this analysis.
1. Calculate Totals
o Calculate the row and column totals from the table:
• Totals for age groups: 300, 400, 490.
• Totals for product categories: 600, 260, 330.
• Overall total: 1190.
2. Calculate Expected Frequencies (E_ij)
o Using the formula: E_ij = (Row total · Column total) / Grand total
o Calculate, for example, for cell (18-30, Fresh): (300 · 600)/1190 ≈ 151.26
3. Calculate the Value of χ²
o χ² = Σ (O_ij - E_ij)² / E_ij for all cells.
o Perform this calculation for all combinations in the table.
4. Determine the Degrees of Freedom (df)
o df = (number of rows - 1)(number of columns - 1) = (3 - 1)(3 - 1) = 4
5. Compare with the Critical Value
o We use a significance level α = 0.05 and df = 4.
o The critical value is approximately 9.49 (by consulting the chi-square distribution
table or using software tools).
6. Test Conclusion
o If χ² > 9.49, we reject the null hypothesis, indicating that there is a significant association.
o If χ² ≤ 9.49, we cannot reject the null hypothesis.
In this exercise, the manual calculation of χ² will indicate whether these age groups influence product preferences based on the provided dataset.
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = age groups, columns = Fresh, Canned, Frozen
# Only the 18-30 row is reported in the text; the other rows come from the problem data
supermarket_data = np.array([
    [150, 60, 90],
    [...],
    [...],
])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(supermarket_data)
print(chi2, p, dof, expected)

# Test conclusion
alpha = 0.05
if p < alpha:
    print("We reject the null hypothesis: there is a significant association.")
else:
    print("We do not reject the null hypothesis: there is no evidence of a significant association.")
• The contingency table is a NumPy array, supermarket data, representing the observations
of different age groups and their preferences for product categories.
• The function chi2 contingency returns several values: the chi-square statistic, the
associated p-value, the degrees of freedom, and a matrix of expected frequencies. The
p-value allows us to determine if there is enough statistical evidence to reject the null
hypothesis.
• The chosen significance level is 0.05. If the p-value is lower than this, we reject the null
hypothesis and conclude that there is a significant association between age groups and
product preferences. Otherwise, there isn't enough evidence to state that.
• Finally, the code prints the results of the chi-square statistic, the p-value, the degrees
of freedom, and the calculated expected frequencies. The test conclusion indicates if
there is a significant association between the analyzed variables.
Junior    100    150    50
Middle    120    130    70
Senior    80     90     110
Analyze the data to determine if there is a significant association between the job position
of employees and their preference for transportation modes. Use a significance level of 5%.
Solution
First, calculate the expected frequencies for each cell of the contingency table using the formula (Row total · Column total) / Grand total. The marginal totals are 300 (Junior), 320 (Middle), and 280 (Senior) for the rows and 300, 370, and 230 for the columns, with a grand total of 900. For the Junior category and Car, the expected frequency is: (300 · 300)/900 = 100. Similarly, calculate all the expected frequencies.
With 4 degrees of freedom (df = (r - 1)(c - 1) = 2 · 2 = 4), compare the calculated chi-square value with the critical chi-square value for a significance level of 5%. The threshold value is 9.488.
Since χ² ≈ 46 > 9.488, we reject the null hypothesis. Therefore, there is a significant
relationship between job position and the choice of transportation mode among
employees.
import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([[100, 150, 50], [120, 130, 70], [80, 90, 110]])

# Calculation of the chi-square test of independence
chi2, p, dof, expected = chi2_contingency(observed)

result = {
    'chi2 calculated': chi2,
    'p value': p,
    'expected frequencies': expected.tolist()
}
print(result)
In this exercise, we use the Python library scipy, which is extremely useful for statistical analysis.
• Declare a variable observed as a numpy array, which encapsulates the data collected
from the example. This array represents the observed frequencies for each
combination of job position and transportation mode.
• Use chi2 contingency to calculate the chi-square value. The function will return several
values:
o chi2: the calculated chi-square test value.
o p: the p-value, which indicates the probability of obtaining a result at least as
extreme as the one observed, assuming the null hypothesis is true.
o dof: the degrees of freedom.
o expected: an array containing the expected frequencies.
• The results are stored in result as a dictionary. Here, expected frequencies is converted
into a list of lists for easier interpretation.
• Print the results, which include the chi-square value, the p-value, and the expected
frequencies, helping us verify whether there is a significant association between the
variables in our original observations.
In summary, the code analyzes whether there is a significant relationship between job
position and choice of transportation mode, by comparing observed frequencies with
expected ones and providing a chi-square value that allows us to accept or reject the null
hypothesis.
6.11 Fisher's Exact Test on 2x2 Tables
The Fisher's exact test is a statistical test used to assess the association between two
categorical variables in a 2x2 contingency table. This test is particularly useful when the
observed frequencies in table cells are small, as it does not require large sample conditions
like the chi-square test does.
Fisher's exact test is often used in various business contexts, especially when data are sparse or sample sizes are small.
Fisher's exact test has no restrictions on sample size and is particularly useful when the observed frequencies in some cells of the 2x2 contingency table are low. It does require, however, that the observations be independent and that each observation fall into exactly one cell of the table. The hypotheses are:
• Null hypothesis (H_0): the two variables are independent, meaning there is no association between them.
• Alternative hypothesis (H_1): the two variables are dependent, meaning there is a relationship between them.
Fisher's exact test does not rely on a continuous test statistic like the chi-square but
calculates the exact probability of obtaining the observed data distribution under the null
hypothesis of independence. Specifically, the test calculates the probability of obtaining a
frequency distribution as observed, or more extreme, assuming the two variables are
independent.
The exact probability calculation is based on the hypergeometric distribution. For a 2x2 table with cells a, b in the first row and c, d in the second row, the probability P of observing that specific frequency configuration is given by:
P = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!},
where:
• a, b, c, d are the values in the cells of the 2x2 contingency table,
• \binom{n}{x} is the binomial coefficient, equal to \frac{n!}{x!(n-x)!},
• x! is the factorial, i.e., the product of the first x integers starting from 1,
• n is the total number of observations.
If the p-value is less than a (for example, 5%), the null hypothesis is rejected, suggesting
that the variables are dependent. If the p-value is greater than the significance level a, the
null hypothesis is not rejected, suggesting that the variables are independent.
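As a small illustration with an invented 2x2 table, the sketch below computes the hypergeometric probability of the observed configuration directly from the formula and then calls scipy.stats.fisher_exact, which sums the probabilities of this table and of all tables at least as extreme.
from math import comb
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: [[a, b], [c, d]] (illustrative values only)
a, b, c, d = 8, 2, 1, 5
n = a + b + c + d

# Exact probability of this particular configuration under independence
p_table = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Fisher's exact test: the p-value accumulates this and all more extreme tables
odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
print(p_table, odds_ratio, p_value)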
The company collected the following data one month after the campaign:
                          Purchase    No Purchase
Active Advertisement      45          55
Inactive Advertisement    30          70
Evaluate if there is a significant difference in purchasing behavior between the two groups
(with active advertisement and without advertisement).
Solution
To compare the proportions of two groups on a 2x2 contingency table, Fisher's exact test is
an appropriate tool because it provides exact results without making assumptions about
the samples, making it ideal when we have low expected frequencies in a 2x2 matrix. It is
therefore preferred over the chi-square test in such cases.
Then, we apply Fisher's exact test on the table using appropriate software.
Calculating, we obtain a p-value, let’s assume p < 0.05, which indicates that we can reject
the null hypothesis at the 5% significance level.
The advertising campaign had a significant impact on purchasing behavior. This suggests
that maintaining or further adapting the social media advertising strategy might be
advantageous for increasing sales.
Solution with Python
from scipy.stats import fisher_exact

# Creation of the data table
# [[Purchases with Advertisement, No Purchases with Advertisement],
#  [Purchases without Advertisement, No Purchases without Advertisement]]
data = [[45, 55], [30, 70]]

# Calculation of the p-value using Fisher's exact test
oddsratio, p_value = fisher_exact(data)

# Interpretation of the result
def interpret_result(p_value, alpha=0.05):
    if p_value < alpha:
        return "The advertising campaign had a significant impact on purchasing behavior."
    return "There is no evidence of a significant impact of the advertising campaign."

# Output of the results
print(f"Odds ratio: {oddsratio}")
print(f"P-value: {p_value}")
print(interpret_result(p_value))
In the provided code:
1. The data from the two experimental conditions (active advertisement and inactive
advertisement) are stored in a list of lists data. Each inner list represents a row
corresponding to the conditions (purchases and no purchases) for the two situations.
2. The function fisher exact calculates both the odds ratio and the p-value for the
provided table. The latter allows us to determine if the result is statistically significant.
Here, oddsratio is not used in the decision, but it is printed for information.
3. We define a function interpret result that checks if the p-value is lower than a certain
level of significance (alpha), typically 0.05. If so, it suggests that the advertising
campaign has a significant impact.
4. Finally, the odds ratio, p-value, and the interpretation of the result are printed. The user
can see if the advertising campaign was statistically significant in modifying purchasing
behavior.
The company collected the following data on the sales of brand A product:
Evaluate if there is a significant difference in sales of brand A product between stores with
and without promotion.
Solution
The chosen test is Fisher’s exact test, since the data are contained in a 2x2 contingency
table and verify the independence of classifications of two criteria. This test is more
appropriate than the chi-square test for comparing the proportions of two groups and
evaluating if there is a statistically significant difference, but can only be applied with 2x2
contingency tables.
The null hypothesis Ho is that there is no relation between the variables involved.
To calculate the p-value, we can use statistical software or an online calculator that allows
for entering the data of the 2x2 table. The resulting p-value is p = 0.012.
Since the p-value (0.012) is below the common significance level of 0.05, we reject the null
hypothesis Ho. Therefore, we can conclude that there is a significant difference in the sales
of brand A product between stores with and without promotion. This suggests that the
promotional strategy had a positive impact on sales.
from scipy.stats import fisher_exact

# Definition of the contingency table
contingency_table = [[80, 40], [60, 60]]

# Fisher's exact test
oddsratio, p_value = fisher_exact(contingency_table)
print(f"P-value: {p_value}")

significance_level = 0.05
if p_value < significance_level:
    print("The null hypothesis is rejected: there is a significant difference in sales.")
else:
    print("The null hypothesis is not rejected: there is no significant difference in sales.")
1. The code starts by importing fisher exact from the scipy.stats module. This function is
designed to calculate Fisher's exact test on a 2x2 contingency table.
2. A nested list contingency table is created representing the sales data collected between
stores with and without promotion.
3. The fisher exact function is called passing the contingency table as an argument. This
returns two values: oddsratio and p value. In this context, we are more interested in the
p value, which will help to determine statistical significance.
4. The p-value is printed as output to provide the result of the statistical test.
5. The code compares the p-value with a preset significance level (0.05). If the p-value is
less than 0.05, the null hypothesis is rejected, indicating that there is a significant
difference in sales. Conversely, if the p-value were higher, there would be insufficient
evidence to claim the sales difference is statistically significant.
Using Fisher's exact test is useful when sample sizes are small or the data do not meet some of the assumptions required by other parametric tests.
Chapter 7
Confidence Intervals
In this chapter, we will delve into the use of confidence
intervals, which are fundamental tools in statistics for
estimating an unknown parameter of a population based on
a data sample. Confidence intervals provide a range of
values within which the true parameter of the population is
located with a certain probability, usually expressed as a
confidence level (e.g., 95% or 99%). These intervals are
used to infer, with some confidence, the value of a mean or
a proportion from a sample.
Confidence intervals are a statistical tool that allows us to estimate a range within which it
is likely that a population parameter, such as the mean, is found. In the case of the mean,
the confidence interval provides an estimate of the range in which the true population
mean is found, based on a sample drawn from the population itself.
To calculate a confidence interval for the mean, the following formula is used:
CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},
where \bar{x} is the sample mean, \sigma is the population standard deviation, n is the sample size, and z_{\alpha/2} is the critical value of the standard normal distribution for the chosen confidence level.
In the case where the population standard deviation is not known, Student's t-distribution is used and the formula becomes:
CI = \bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}},
where s is the sample standard deviation and t_{\alpha/2,n-1} is the critical value of the t-distribution with n - 1 degrees of freedom.
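As a quick illustration of both formulas, here is a minimal sketch with an invented sample of monthly costs: it builds the t-based interval from the sample statistics and, purely for comparison, a z-based interval under the assumption of a known population standard deviation (the value sigma_known is made up).
import numpy as np
from scipy import stats

# Hypothetical sample of monthly costs, in thousands of euros (illustrative values only)
sample = np.array([41.2, 39.8, 42.5, 40.1, 43.0, 38.7, 41.9, 40.6])
n = len(sample)
mean = sample.mean()
s = sample.std(ddof=1)          # sample standard deviation

# 95% t-based interval (population standard deviation unknown)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_t = (mean - t_crit * s / np.sqrt(n), mean + t_crit * s / np.sqrt(n))

# 95% z-based interval, assuming a known population sigma
sigma_known = 1.5
z_crit = stats.norm.ppf(0.975)
ci_z = (mean - z_crit * sigma_known / np.sqrt(n), mean + z_crit * sigma_known / np.sqrt(n))

print(ci_t, ci_z)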
Confidence intervals for the mean find various applications in the business environment.
Here are some examples of usage:
• A company can compare the confidence intervals of the mean of two samples (e.g.,
sales of two products in different periods) to determine if there is a significant
difference between the means. If the confidence intervals overlap, no significant
difference between the means can be concluded, whereas if they do not overlap, the
difference is likely significant.
• A company can compare the confidence interval of the mean of a business variable
(e.g., return on investment) with a predefined benchmark value (e.g., an annual
target). If the confidence interval does not include the benchmark, the company may
consider the performance unsatisfactory.
• Use confidence intervals to estimate future trends of business variables like sales,
costs, or customer satisfaction. This aids in planning and risk management, allowing
decisions to be made based on more accurate forecasts.
• In a quality control context, the confidence interval of the mean of a measurement
(e.g., the size of a product) can be used to verify if the production process is under
control. If the confidence interval falls outside of specified tolerances, this could
indicate a problem in the production process.
Exercise 93. Analysis of Monthly Sales in Two Different Regions A company wants
to compare the sales efficiency between two regions: Region A and Region B. Over the past
year, they have collected data on the monthly sales for each region.
The management wants an accurate estimate of the average monthly sales for each region
and wants to know if there is a significant difference between the two regions.
For both regions, the collected data are as follows (in thousands of euros): Region A: [65,
70, 62, 68, 74, 69, 72, 71, 66, 73, 75, 68]
Region B: [60, 62, 58, 66, 63, 65, 67, 64, 61, 66, 62, 67]
Calculate the 95% confidence interval for the average sales of each region and compare
the two intervals to determine if the sales efficiency significantly differs between the two
regions.
Solution
For Region A:
1. Calculate the sample mean (\bar{x}_A) and the sample standard deviation (s_A).
2. Calculate the 95% confidence interval:
o Degrees of freedom = n_A - 1 = 11
o Use the Student's t-distribution with 11 degrees of freedom to find the critical value (t_{0.025}).
o Confidence interval = \bar{x}_A \pm t_{0.025} \cdot s_A/\sqrt{n_A}
For Region B:
1. Calculate the sample mean (\bar{x}_B) and the sample standard deviation (s_B).
2. Calculate the 95% confidence interval:
o Degrees of freedom = n_B - 1 = 11
o Use the Student's t-distribution with 11 degrees of freedom to find the critical value (t_{0.025}).
o Confidence interval = \bar{x}_B \pm t_{0.025} \cdot s_B/\sqrt{n_B}
• If the intervals do not overlap, we can conclude that there is a significant difference in
average sales between the two regions.
• If the intervals overlap, we cannot conclude that there is a significant difference.
import numpy as np
from scipy import stats

# Sales data in thousands of euros
sales_region_A = np.array([65, 70, 62, 68, 74, 69, 72, 71, 66, 73, 75, 68])
sales_region_B = np.array([60, 62, 58, 66, 63, 65, 67, 64, 61, 66, 62, 67])

# Mean and sample standard deviation for each region
mean_A, std_dev_A = np.mean(sales_region_A), np.std(sales_region_A, ddof=1)
mean_B, std_dev_B = np.mean(sales_region_B), np.std(sales_region_B, ddof=1)

# Sample sizes and degrees of freedom
n_A, n_B = len(sales_region_A), len(sales_region_B)
degrees_freedom_A, degrees_freedom_B = n_A - 1, n_B - 1

# Critical t values for 95% confidence intervals
t_score_A = stats.t.ppf(1 - 0.025, degrees_freedom_A)
t_score_B = stats.t.ppf(1 - 0.025, degrees_freedom_B)

# Confidence intervals for both regions
ci_A_lower = mean_A - t_score_A * (std_dev_A / np.sqrt(n_A))
ci_A_upper = mean_A + t_score_A * (std_dev_A / np.sqrt(n_A))
ci_B_lower = mean_B - t_score_B * (std_dev_B / np.sqrt(n_B))
ci_B_upper = mean_B + t_score_B * (std_dev_B / np.sqrt(n_B))

# Results
print(f"95% Confidence interval for Region A: [{ci_A_lower:.2f}, {ci_A_upper:.2f}]")
print(f"95% Confidence interval for Region B: [{ci_B_lower:.2f}, {ci_B_upper:.2f}]")

# Comparison
if ci_A_upper < ci_B_lower or ci_B_upper < ci_A_lower:
    print("The confidence intervals do NOT overlap: there is a significant difference between the two regions.")
else:
    print("The confidence intervals overlap: no significant difference can be concluded.")
1. Monthly sales for Regions A and B are given as lists. We convert these lists into numpy
arrays to allow vectorized mathematical operations.
2. Using np.mean() and np.std(ddof=1), we calculate the mean and the sample standard
deviation for each region respectively. ddof=1 specifies that we are calculating the
sample standard deviation.
3. To calculate the confidence intervals, we use scipy’s stats.t.ppf() function, which
returns the critical t value given a confidence level and degrees of freedom (number of
data points in the sample minus one). These critical values are used to determine the
lower and upper limits of the confidence intervals using the formulas:
o Lower limit: x̄ - t_{0.025} · s/√n
o Upper limit: x̄ + t_{0.025} · s/√n
4. After printing the confidence intervals for both datasets, we compare them to check if
they overlap. If they do not overlap, this suggests a significant difference in average
sales between the two regions. We print the conclusion based on this comparison.
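As a side note on step 3 above (a small check, not part of the original listing), the critical value returned by stats.t.ppf for 11 degrees of freedom is about 2.201, noticeably larger than the 1.96 that the normal distribution would give for a large sample:
from scipy import stats

t_crit = stats.t.ppf(1 - 0.025, 11)   # about 2.201 with 11 degrees of freedom
z_crit = stats.norm.ppf(1 - 0.025)    # about 1.960
print(t_crit, z_crit)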
For both teams, the collected data are as follows (projects completed monthly):
Calculate the 95% confidence interval for the mean number of projects completed by each
team and compare the two intervals to determine if the performance differs significantly
between the two teams.
Solution
Begin with calculating the confidence intervals for the two teams.
Team X: CI_X = x̄_X ± t_{0.025,11} · s_X/√n_X
Team Y: CI_Y = x̄_Y ± t_{0.025,11} · s_Y/√n_Y
The resulting 95% confidence intervals are:
• Team X: [8.82, 11.18]
• Team Y: [7.03, 8.64]
import numpy as np
from scipy import stats

# Data
team_x = [8, 9, 7, 10, 12, 11, 13, 8, 9, 11, 10, 12]
team_y = []  # monthly projects for Team Y, as given in the exercise text

# Function to calculate a confidence interval
def calculate_confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)
    margin = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - margin, mean + margin

# Calculate confidence intervals for both teams
interval_x = calculate_confidence_interval(team_x)
interval_y = calculate_confidence_interval(team_y)
print("Team X:", interval_x)
print("Team Y:", interval_y)
The code begins by defining the data: the number of projects completed monthly by Team
X and Team Y. It then defines a function calculate_confidence_interval that takes a team's
data and a confidence level as input, with the confidence level defaulting to 95%.
The confidence intervals for each team are calculated by calling the function with the
team's data and are subsequently displayed on screen using the print function.
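As an alternative (a sketch, not part of the original solution), the same interval can be obtained in a single call with scipy's stats.t.interval, passing the confidence level, the degrees of freedom, the sample mean, and the standard error of the mean:
import numpy as np
from scipy import stats

team_x = [8, 9, 7, 10, 12, 11, 13, 8, 9, 11, 10, 12]
interval_x_alt = stats.t.interval(0.95, len(team_x) - 1,
                                  loc=np.mean(team_x),
                                  scale=stats.sem(team_x))
print(interval_x_alt)  # should match the interval computed by calculate_confidence_interval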
Data related to the processing time for a random sample of 50 orders were collected, and
an average of 28 minutes with a standard deviation of 5 minutes was calculated.
1. Calculate and interpret a 95% confidence interval for the average order processing
time after the implementation of the new system.
2. Based on the calculated confidence interval, can the company conclude that the new
system is more efficient than the benchmark of 30 minutes? Justify your answer.
Solution
1. We calculate the 95% confidence interval for the average processing time.
The confidence interval is given by the formula:
CI = x̄ ± z · (s/√n)
where:
o x = 28 is the sample mean
o s = 5 is the sample standard deviation
o n = 50 is the sample size
o z = 1.96 is the critical value for a 95% confidence interval using the normal
distribution (an acceptable approximation given the sample size)
Calculate the margin of error:
ME = 1.96 · (5/√50) ≈ 1.386
The 95% confidence interval for the average order processing time ranges from 26.614
to 29.386 minutes.
2. To compare the sample mean with the benchmark of 30 minutes, we observe that our
confidence interval [26.614, 29.386] is entirely below 30 minutes. This suggests that,
with a 95% confidence level, the new warehouse management system indeed reduces
the order processing time compared to the company benchmark of 30 minutes.
Therefore, the company can conclude that the new system is more efficient.
import numpy as np
from scipy import stats

# Sample data
sample_mean = 28
sample_std_dev = 5
n = 50
benchmark = 30

# Critical z value for a two-tailed 95% confidence interval
z = stats.norm.ppf(0.975)

# Margin of error and confidence interval limits
margin_of_error = z * sample_std_dev / np.sqrt(n)
confidence_interval_lower = sample_mean - margin_of_error
confidence_interval_upper = sample_mean + margin_of_error
print("95% confidence interval:", (confidence_interval_lower, confidence_interval_upper))

# Check whether the whole interval lies below the benchmark
is_more_efficient = confidence_interval_upper < benchmark
print("Is the new system more efficient than the benchmark?:", is_more_efficient)
1. We set up the provided sample data, which includes the sample mean, sample
standard deviation, and sample size.
2. We use the stats module of the scipy library to calculate the critical value z for a two-
tailed 95% confidence interval. The argument 0.975 is the cumulative probability that
leaves 2.5% in the upper tail, which corresponds to the 95% range.
3. We compute the margin of error using the margin of error formula for a normal
distribution. The margin of error depends on the sample standard deviation and the
sample size.
4. We calculate the lower and upper limits of the confidence interval using the sample
mean and the margin of error.
5. We determine whether the computed confidence interval is entirely below the
benchmark of 30 minutes. If it is, we can conclude that the new system is more
efficient.
The final printed result confirms that the confidence interval is below 30 minutes, and that,
with a 95% confidence level, the new warehouse management system is more efficient
compared to the previous company standard.
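The solution relies on the normal approximation for n = 50. As a quick sanity check (not part of the original solution), one can compare the z critical value with the exact t critical value for 49 degrees of freedom; the two are close, which supports the approximation:
from scipy import stats

z_crit = stats.norm.ppf(0.975)    # about 1.960
t_crit = stats.t.ppf(0.975, 49)   # about 2.010
print(z_crit, t_crit)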
1. Calculate and interpret a 95% confidence interval for the average waiting time in line
with the new payment system.
2. Based on the calculated confidence interval, can the company consider the new
system more efficient than the previous benchmark of 10 minutes? Justify your answer.
Solution
Let's look at the solution to the various questions. To calculate the 95% confidence interval
for the mean, we use the following formula:
CI = x̄ ± z · (σ/√n)
Where:
• x̄ is the sample mean of the waiting times,
• σ is the standard deviation,
• n is the sample size,
• z is the critical value of the standard normal distribution (1.96 for a 95% confidence level).
The calculated confidence interval (7.995, 9.005) minutes does not include the 10-minute
benchmark. This means that, at a 95% confidence level, we can state that the average
waiting time in line with the new system is significantly less than 10 minutes. Therefore,
the company can consider the new system to be more efficient compared to the previous
benchmark.
import numpy as np
from scipy import stats

# Sample statistics (values consistent with the interval reported in the text)
mean = 8.5
std_dev = 2
n = 60
confidence_level = 0.95

# Calculate the z score for the required confidence level
z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the standard error of the mean (SE)
standard_error = std_dev / np.sqrt(n)

# Calculate the confidence interval
margin_of_error = z_score * standard_error
confidence_interval = (mean - margin_of_error, mean + margin_of_error)
print(confidence_interval)
In this code, we use the scipy.stats library to calculate the z value for the required confidence level.
Subsequently, we calculate the standard error of the mean (SE) using numpy for the square
root of the sample size. The standard error is the standard deviation divided by the square
root of the number of observations in the sample and represents the quantity used to
calculate the margin of error.
The margin of error is then determined by multiplying the obtained z value by the standard
error. The final confidence interval is calculated by adding and subtracting the margin of
error from the sample mean, thus obtaining the lower and upper limits of the confidence
interval. This information is then presented as (7.995, 9.005), indicating with 95%
confidence that the average waiting time with the new system is within this range, which
does not include the 10-minute benchmark, and therefore we can conclude that the new
system is more efficient.
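As a further shortcut (an alternative sketch, not part of the original code), scipy's stats.norm.interval returns the same limits in one call; the mean and standard deviation below are the assumed sample statistics used above:
import numpy as np
from scipy import stats

mean = 8.5     # sample mean (assumed, consistent with the interval reported above)
std_dev = 2    # sample standard deviation (assumed, consistent with that interval)
n = 60

ci = stats.norm.interval(0.95, loc=mean, scale=std_dev / np.sqrt(n))
print(ci)  # approximately (7.99, 9.01)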
7.2 Confidence Intervals for Proportions
Confidence intervals for proportions are used to estimate the range within which the
proportion of a certain event falls in a population, based on a drawn sample. These
intervals provide a range of plausible values for the population proportion, based on the
proportion of successes observed in the sample.
To calculate a confidence interval for the proportion, the following formula is used:
CI = p̂ ± z_{α/2} · √(p̂(1 - p̂)/n)
where:
• p̂ is the sample proportion (the number of successes divided by the total number of
observations in the sample),
• n is the sample size,
• z_{α/2} is the critical value from the standard normal distribution for the desired confidence
level (as with the mean, a 95% confidence level is associated with the value 1.96).
In cases where the sample size is small or the proportion of successes is very close to 0 or
1, a continuity correction or alternative methods like the Wilson method are used to ensure
more accurate confidence intervals.
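As an illustration of the alternative just mentioned (a minimal sketch; the statsmodels package and the counts below are illustrative and are not used elsewhere in this chapter), the Wilson interval can be obtained with statsmodels and compared with the normal-approximation interval:
from statsmodels.stats.proportion import proportion_confint

# 12 successes out of 40 observations: a small sample where Wilson behaves better
wilson_low, wilson_high = proportion_confint(count=12, nobs=40, alpha=0.05, method='wilson')
normal_low, normal_high = proportion_confint(count=12, nobs=40, alpha=0.05, method='normal')

print("Wilson interval:", (wilson_low, wilson_high))
print("Normal-approximation interval:", (normal_low, normal_high))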
Confidence intervals for proportions are very useful in various business contexts, such as
market research, quality control, and customer satisfaction analysis.
Calculate the 95% confidence interval for the conversion rate of both campaigns. Compare
the two conversion rates and determine if the difference is statistically significant at the
5% significance level.
Assuming that the number of views is sufficient to allow a normal approximation, your task
is to help the company understand which campaign was more effective.
Solution
To solve this problem, we need to calculate the conversion rate for each campaign and
then construct the confidence intervals for these rates. Subsequently, we compare these
intervals to see if there is a significant difference.
Campaign A: p̂_A = 300/1200 = 0.25
Campaign B: p̂_B = 360/1500 = 0.24
Next, we move on to calculating the 95% confidence intervals. We use the formula for the
confidence interval for a proportion:
CI = p̂ ± z · √(p̂(1 - p̂)/n)
where p̂ is the observed proportion, z is the critical Z value for a 95% confidence interval
(1.96, according to the normal distribution tables), and n is the total number of
observations.
• For Campaign A: CI_A = 0.25 ± 1.96 · √(0.25 · 0.75/1200) ≈ 0.25 ± 0.0245 = [0.2255, 0.2745]
• For Campaign B: CI_B = 0.24 ± 1.96 · √(0.24 · 0.76/1500) ≈ 0.24 ± 0.0216 = [0.2184, 0.2616]
The confidence intervals for Campaign A and Campaign B overlap, which implies that we
cannot conclude that there is a statistically significant difference in the performance of the
two campaigns at the 5% level.
Both confidence intervals indicate that the campaigns might have similar conversion rates.
Therefore, the company might need to consider other factors, like cost per view, to decide
which campaign is more advantageous.
import numpy as np

# Data
views_A = 1200
conversions_A = 300
views_B = 1500
conversions_B = 360

# Conversion rates
p_A = conversions_A / views_A
p_B = conversions_B / views_B

# Function to calculate the confidence interval
def confidence_interval(p, n, z=1.96):
    se = np.sqrt((p * (1 - p)) / n)
    return p - z * se, p + z * se

# Calculating the 95% confidence intervals
ci_A = confidence_interval(p_A, views_A)
ci_B = confidence_interval(p_B, views_B)

# Output of the results
print("Conversion rate for Campaign A:", p_A)
print("Confidence interval for Campaign A:", ci_A)
print("Conversion rate for Campaign B:", p_B)
print("Confidence interval for Campaign B:", ci_B)

# Comparison between confidence intervals
if ci_A[1] < ci_B[0] or ci_A[0] > ci_B[1]:
    print("The difference in the conversion rates is statistically significant.")
else:
    print("The confidence intervals overlap: the difference is not statistically significant.")
1. Importing libraries:
o numpy is a widely used library for mathematical calculations in Python. In this
example, it is used for calculating the standard deviation of a proportion.
o scipy.stats is a part of the scipy library, which provides statistical functions,
including z values for the normal distribution.
2. Initial Data:
o We have the views and conversions for two campaigns, A and B.
3. Calculation of conversion rates:
o Conversion rates are calculated by dividing conversions by the total number of
views for each campaign.
4. confidence interval function:
o This function calculates the confidence interval for a given proportion using the
critical value z of the normal distribution (1.96 for 95% confidence).
o The standard deviation of the proportion is calculated and then used to determine
the interval.
5. Calculation and output of confidence intervals:
o Confidence intervals for both campaigns are calculated and printed.
6. Comparison of confidence intervals:
o We check if the confidence intervals overlap. If they do not overlap, the difference
in the conversion rate is considered statistically significant.
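Checking whether the two intervals overlap is a simple but conservative criterion. As a complementary check (not part of the original solution), a two-proportion z-test can be run directly, for example with statsmodels; the counts below are the campaign data used above:
from statsmodels.stats.proportion import proportions_ztest

counts = [300, 360]    # conversions for Campaigns A and B
nobs = [1200, 1500]    # views for Campaigns A and B

stat, p_value = proportions_ztest(count=counts, nobs=nobs)
print(stat, p_value)   # a p-value above 0.05 agrees with the overlap-based conclusion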
• Strategy 1: 800 total customers during the week, 240 purchased Product X.
• Strategy 2: 1000 total customers during the week, 310 purchased Product X.
Calculate the 95% confidence interval for the percentage of customers who purchased
Product X for each of the two strategies. Compare the two percentages and determine
whether the difference is statistically significant.
Assume that the number of customers is large enough to allow for the normal
approximation. Assist the supermarket in deciding which strategy had a greater impact on
Product X sales.
Solution
To calculate the 95% confidence interval of a proportion p, we use the following formula:
CI = p̂ ± z_{α/2} · √(p̂(1 - p̂)/n)
where p̂ is the proportion of successes (purchases), z_{α/2} is the critical value of the standard
normal distribution (approximately 1.96 for a 95% interval), and n is the sample size.
Strategy 1:
• p̂_1 = 240/800 = 0.3
• The confidence interval is:
0.3 ± 1.96 · √(0.3(1 - 0.3)/800)
= 0.3 ± 1.96 · 0.0162
= 0.3 ± 0.0317
= [0.2683, 0.3317]
Strategy 2:
• p̂_2 = 310/1000 = 0.31
• The confidence interval is:
0.31 ± 1.96 · √(0.31(1 - 0.31)/1000)
= 0.31 ± 1.96 · 0.0146
= 0.31 ± 0.0287
= [0.2813, 0.3387]
Since the intervals overlap, there is no statistically significant difference between the two
strategies at the 5% level of significance. Therefore, both strategies had a similar impact
on the sales of Product X.
import math
from scipy import stats

# Data
n1 = 800
x1 = 240
n2 = 1000
x2 = 310

# Sample proportions
p1_hat = x1 / n1
p2_hat = x2 / n2

# Function to calculate the confidence interval of a proportion
def confidence_interval(p_hat, n, confidence=0.95):
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Calculate confidence intervals for the two strategies
ci1 = confidence_interval(p1_hat, n1)
ci2 = confidence_interval(p2_hat, n2)
print("Strategy 1:", ci1)
print("Strategy 2:", ci2)
1. The proportions of customers who purchased Product X under the two strategies are
calculated as the ratio of the number of customers who purchased to the total number
of customers for each strategy.
2. The confidence interval function uses the normal distribution to calculate the 95%
confidence interval of a proportion. The math library is used to compute the square root
necessary in the calculation of the margin of error.
The overlap between the confidence intervals means that we cannot conclude a statistically
significant difference between the two strategies.
1. Calculate the 95% confidence interval for the proportion of satisfied customers of the
company.
2. Determine, with a 5% level of significance, if the customer satisfaction of the company
is in line with the industry benchmark.
Solution
The sample proportion of satisfied customers is p̂ = 132/150 = 0.88.
The 95% confidence interval for a proportion is given by:
CI = p̂ ± z · √(p̂(1 - p̂)/n)
Where z for a confidence level of 95% is approximately 1.96. Thus:
0.88 ± 1.96 · √(0.88 · 0.12/150)
= 0.88 ± 0.052
To test the company's satisfaction against the benchmark p_0 = 0.88, we compute the test statistic:
z = (0.88 - 0.88) / √(0.88 · 0.12/150) = 0
Since z = 0 is less than 1.96 (critical value for a two-tailed test with a = 0.05), we
cannot reject the null hypothesis.
The data does not show a significant difference between the company’s customer
satisfaction proportion and the benchmark. The satisfaction appears to be in line with the
industry benchmark.
import math
from scipy import stats as st

# Data
satisfied_customers = 132
total_customers = 150
p_hat = satisfied_customers / total_customers
p0 = 0.88  # industry benchmark

# Calculate the 95% confidence interval
z_value = st.norm.ppf(0.975)  # 0.975 because the interval is two-tailed: 1 - 0.05/2
error_margin = z_value * math.sqrt((p_hat * (1 - p_hat)) / total_customers)
confidence_interval = (p_hat - error_margin, p_hat + error_margin)

# Two-tailed z test against the benchmark
z_test = (p_hat - p0) / math.sqrt((p0 * (1 - p0)) / total_customers)
p_value = 2 * (1 - st.norm.cdf(abs(z_test)))

# Results
print('95% confidence interval:', confidence_interval)
print('z-value:', z_test)
print('p-value:', p_value)
The code follows these steps:
1. Start by defining the problem data, which is the number of satisfied customers and the
total number of customers surveyed. This allows us to calculate the sample proportion
(P).
2. The sample proportion P is simply the ratio of satisfied customers to the total number
of customers.
3. Calculation of the confidence interval:
o Calculate the critical z value for a 95% confidence level using st.norm.ppf, which is
the quantile function of the normal distribution (the inverse of the cumulative distribution function).
o Calculate the margin of error by multiplying the critical z value by the standard error
of the sample proportion, √(p̂(1 - p̂)/n).
o Determine the confidence interval by adding and subtracting the error margin from
P.
4. Hypothesis testing against the benchmark:
o Calculate the z value (z-test) to check whether the difference between the sample
proportion and the benchmark is statistically significant.
o The p value is calculated as a two-tailed probability (helpful to check if we can
reject the null hypothesis).
5. The final prints communicate the confidence interval and the result of the comparison
with the benchmark.
The use of math.sqrt is for calculating the square root, while scipy.stats is essential for
obtaining statistical distribution values.
1. Calculate the 95% confidence interval for the proportion of customers satisfied with call
quality.
2. Determine, with a significance level of 5%, if the company's call quality is in line with
the sector benchmark.
Solution
To calculate the 95% confidence interval for the proportion of satisfied customers, we use
the formula for the confidence interval of a proportion:
CI = p̂ ± z_{α/2} · √(p̂(1 - p̂)/n)
Where:
• p̂ = 0.875 is the sample proportion of satisfied customers,
• n = 120 is the sample size,
• z_{α/2} = 1.96 is the critical value for a 95% confidence level.
The test statistic is Z = (0.875 - 0.90) / √(0.90 · 0.10/120) ≈ -0.9129.
Compare Z with the critical value Z_{α/2} = ±1.96. Since -0.9129 is within the acceptance
range, we do not reject the null hypothesis.
The 95% confidence interval for the proportion of satisfied customers is [0.8158, 0.9342].
With a significance level of 5%, there is not enough evidence to assert that call quality
differs from the sector benchmark of 90%. Therefore, we conclude that call quality could be
in line with the benchmark.
import math
from scipy.stats import norm

# Data
n = 120
p_hat = 0.875       # sample proportion of satisfied customers
p_benchmark = 0.90  # sector benchmark
alpha = 0.05

# Calculate the 95% confidence interval for the sample proportion
z_alpha_half = norm.ppf(1 - alpha / 2)
SE = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z_alpha_half * SE
confidence_interval = (p_hat - margin_of_error, p_hat + margin_of_error)

# Calculate the test statistic
SE_benchmark = math.sqrt(p_benchmark * (1 - p_benchmark) / n)
Z = (p_hat - p_benchmark) / SE_benchmark

# Results
results = {
    'interval': confidence_interval,
    'statistic': Z,
    'decision': "Do not reject the null hypothesis" if abs(Z) < z_alpha_half else "Reject the null hypothesis"
}
print(results)
The code uses the scipy library to obtain the critical Z value for a given confidence level,
facilitating statistical calculation compared to manual use of statistical tables.
1. First, gather the data: sample size, sample proportion, and sector benchmark.
2. To determine the 95% confidence interval, calculate the margin of error using the
critical Z value (obtained using norm.ppf from scipy.stats) and the standard error of the
sample proportion.
3. Calculate the confidence interval by adding and subtracting the margin of error from
the sample proportion.
4. For the hypothesis test, calculate the test statistic Z by comparing the difference
between the sample proportion and the benchmark with the standard error of the
benchmark.
5. Compare the absolute value of Z with the critical value; if it is smaller, there is no evidence to reject
the null hypothesis, suggesting call quality might be in line with the benchmark.
The Author
Gianluca Malato was born in 1986 and is an Italian data
scientist, entrepreneur, and author. In 2010, he graduated
with honors in Theoretical Physics of Disordered Systems at
the University of Rome "La Sapienza" (thesis advisors:
Giorgio Parisi and Tommaso Rizzo). He worked for years as a
data architect, project manager, data analyst, and data
scientist for a large Italian company. Currently, he holds the
position of Head of Education at profession.ai academy.
Website: https://www.yourdatateacher.com
Books: https://www.yourdatateacher.com/en/my-books/