
Statistics with Python

100 solved exercises



for Data Analysis


Table of Contents
Statistics with Python. 100 solved exercises for Data Analysis
Introduction
Chapter 1 Probability
Chapter 2 Descriptive Statistics
2.1 Mean Value
2.2 Standard Deviation
2.3 Median Value
2.4 Percentiles
2.5 Chebyshev's Inequality
2.6 Identification of Outliers with IQR Method
2.7 Pearson Correlation Coefficient
2.8 Spearman's Rank Correlation Coefficient
Chapter 3 Regression Analysis
3.1 Linear Regression
3.2 Exponential Regression
Chapter 4 Conditional Probability
Chapter 5 Probability Distributions
5.1 Binomial Distribution
5.2 Poisson Distribution
5.3 Exponential Distribution
5.4 Uniform Distribution
5.5 Triangular Distribution
5.6 Normal Distribution
Chapter 6 Hypothesis Testing
6.1 Student's t-test for a Single Sample Mean
6.2 Student's t-test for the Means of Two Samples
6.3 Z-test on Proportions
6.4 One-Way ANOVA on the Mean of Multiple Groups
6.5 Kruskal-Wallis Test on the Median of Multiple Groups
6.6 Levene's Test for Equality of Variances Across Multiple
Groups
6.7 One-Sample Kolmogorov-Smirnov Test
6.8 Two-Sample Kolmogorov-Smirnov Test
6.9 Shapiro-Wilk Normality Test
6.10 Chi-Square Test on Contingency Tables
6.11 Fisher's Exact Test on 2x2 Tables
Chapter 7 Confidence Intervals
7.1 Confidence Intervals for the Mean
7.2 Confidence Intervals for Proportions
The Author
Statistics with Python. 100 solved exercises for Data Analysis
Gianluca Malato

Date of the first English edition: April 28, 2025

Publisher: Gianluca Malato
Copyright © 2025 Gianluca Malato


All rights reserved.

No part of this publication may be reproduced, distributed, or
transmitted in any form or by any means, including
photocopying, recording, or other electronic or mechanical
methods, without the prior written permission of the author,
except in the case of brief quotations embodied in reviews
and other non-commercial uses permitted by copyright law.
Introduction
I decided to write this book after years spent teaching data
analysis to many students, both with face-to-face lessons
and recorded video courses. My experience as a data
analyst and data scientist in the banking sector and my
scientific background as a theoretical physicist have always
led me to place particular emphasis on statistics in my
teaching.

I have always believed that although statistics is a complex
and multifaceted science, the elements that a professional
needs in the workplace are reduced to a few specific tools
necessary to analyze data effectively and functionally to
provide valuable insights. I have always concentrated all my
teaching on these tools.

What has always fascinated my students during statistics
lessons has been the concrete references to the business
world, real use cases where statistics are applied in
day-to-day work. This connection with the real world has
always made every concept more vivid, which would
otherwise remain an obscure and difficult-to-apply
mathematical formula.

In the spirit of a pragmatic and practice-based teaching
approach, I have decided to write this book of statistics
exercises with solutions in Python. I chose Python because it
is the most used programming language in data science and
data analysis, having now become a true industrial
standard.

The purpose of this volume is, therefore, to allow aspiring
data analysts to practice with the most common statistical
tools at work, but above all, to stimulate a problem-solving
approach that allows translating a business problem into a
statistical question, identifying the most appropriate tools to
use, and learning to communicate the results correctly.
Based on my experience, this is the most important
component of this profession. This exercise book thus aims
to also train such non-technical and more methodological
skills.

This is achieved through the 100 exercises proposed here,
which are small realistic business use cases that require
statistical tools to be solved. Each exercise is accompanied
by a theoretical solution and a Python language solution,
both explained and commented on to understand both the
theory and practice.

I will not go too far into the details of the theory, as I would
risk boring or confusing the reader. Instead, I will show the
basic elements at the beginning of each section and in each
solution, along with the applicability of the approximations
made. This is to allow an easy recall of theoretical notions
without, however, entering a level of detail that would be
outside the practical purposes of this volume. The Python
code, on the other hand, will be complete and explained
step by step.

The exercises will cover the most basic content of
probability and descriptive statistics, gradually moving
towards more advanced topics such as hypothesis testing
and inferential statistics.

With this book, I hope to succeed in creating a bridge
between business reality and statistical theory that
allows data analysts to perform their jobs to the best of
their ability and deliver the utmost value.
For any feedback, doubts, or reflections, I leave you my
email address: [email protected]. I will be happy to
receive your suggestions and comments. For those
interested in exploring data science and machine learning
further, I provide the link to my website, where you can find
my articles and information on the other books I have
written on these topics: www.yourdatateacher.com. In particular,
you can find the list of my books on this page:
https://www.yourdatateacher.com/en/my-books/

Finally, for those who wish to become more competitive in
the labor market in the field of data analysis and artificial
intelligence, I recommend the professional master's
programs of profession.ai, the Italian academy of which I
have the honor of being the educational manager. The
purpose of our master's programs is to train highly qualified
professionals, ready to pursue a career in data science, data
analysis, and data engineering.
Chapter 1
Probability
In this chapter, we will explore some basic probability exercises, which are fundamental to
understanding the essential concepts of statistics and data analysis. These exercises are
designed to be solved without necessarily resorting to the Python language, as they are
based on relatively simple mathematical formulas that can be applied manually or with any
other calculation tool, such as a spreadsheet or a scientific calculator.

The objective of these exercises is to provide a practical review of the fundamental
principles of probability, which every data analyst should know to tackle real-world
problems in their work.

Let's start with some brief definitions. Probability is a measure of the uncertainty
associated with the occurrence of an event in a random experiment. An empirical method
to estimate the probability of an event is the principle of relative frequency. If an
experiment is repeated n times under the same conditions and the event A occurs k times,
then its probability can be estimated as:

P(A) ≈ k / n

This definition is the basis of the frequentist interpretation of probability, according to
which the probability of an event is the limit of the relative frequency for an increasing
number of trials (law of large numbers).

We can define the complementary event Ā of an event A as the set of outcomes that do not
belong to A. For example, in the toss of a coin, if A represents heads, its complement Ā
represents tails.

The probability of A and that of its complement satisfy the relationship:

P(Ā) = 1 - P(A)

This property is useful for calculating the probability of an event without having to determine it
directly, whenever the probability of its complement is easier to obtain.

Finally, let's introduce conditional probability. This is the probability of an event A given
that another event B has already occurred. It is defined as:

P(A|B) = P(A ∩ B) / P(B)

The quantity P(A ∩ B) represents the probability of the intersection between A and B, that is,
the probability that both events occur. Conditional probability is fundamental in many fields
of statistics and probability calculation, such as in Bayes' Theorem.
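As a quick, hands-on recap of these three formulas, here is a minimal Python sketch; the counts (n_trials, k_event_a, k_event_b, k_a_and_b) are invented for illustration and are not taken from the exercises that follow.

# Illustrative sketch of the formulas above (invented counts)
n_trials = 1000        # repetitions of the experiment
k_event_a = 240        # times event A occurred
k_event_b = 300        # times event B occurred
k_a_and_b = 60         # times A and B occurred together

# Relative-frequency estimate: P(A) ≈ k / n
p_a = k_event_a / n_trials          # 0.24

# Complement rule: P(not A) = 1 - P(A)
p_not_a = 1 - p_a                   # 0.76

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_b = k_event_b / n_trials          # 0.30
p_a_and_b = k_a_and_b / n_trials    # 0.06
p_a_given_b = p_a_and_b / p_b       # 0.20

print(p_a, p_not_a, p_a_given_b)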

Exercise 1. Analysis of a Retail Website's Visits

An online store recorded a total of 50,000 visits to its website in January. During the same period, 1,200 of these visits
resulted in purchases. Based on these historical data, what is the estimated probability that
a visitor randomly chosen from this store's website in the future completes a purchase?
Provide your answer in percentage.

Solution

To solve this problem, we are calculating the conversion probability, a fundamental concept
in statistics related to probability analysis. The probability of an event is defined as the
number of favorable cases divided by the total number of possible cases.

In this case, the favorable event is 'a visit that results in a purchase.'
P(conversion) = Number of conversions / Total visits = 1200 / 50000 = 0.024
To express this probability in terms of percentage, we multiply the result by 100:
0.024 • 100 = 2.4

Therefore, we can estimate that the probability of a randomly chosen visitor making a
purchase on this store's website is 2.4%. This information is useful for evaluating the
effectiveness of the site’s marketing strategies and user engagement.

Solution with Python

# Data
number_of_conversions = 1200
number_of_visits = 50000

# Calculation of conversion probability
conversion_probability = number_of_conversions / number_of_visits

# Convert the probability to percentage
conversion_percentage = conversion_probability * 100

conversion_percentage

In this script, we are calculating the conversion probability from visits to purchases.

Here are the main steps of the code:

1. The conversions and total visits are set as variables named number_of_conversions and
number_of_visits.
2. Divide the number of conversions by the total number of visits to obtain the conversion
probability. This is done using the usual proportion formula: conversion_probability =
number_of_conversions / number_of_visits.
3. The conversion probability is converted into a percentage by multiplying by 100. This
final step is useful to express the probability in a more comprehensible and usable form
in business contexts.
4. Finally, the variable conversion_percentage contains the conversion probability as a
percentage, which is the desired output of our calculation.

Exercise 2. Fault Analysis in a Production Line

An electronics component factory recorded a total of 240 working days over the last 12 months. During this period, there
were 36 days in which at least one failure was found in the production process. Based on
this historical data, what is the probability that on a randomly chosen working day in the
coming months, at least one failure will be recorded in the production line? Present your
answer as a simple fraction.

Solution
The required probability can be calculated using the fundamental concept of classical
probability, which relies on the ratio of favorable cases to the total possible cases. In the
context of this exercise, the favorable cases are the days on which at least one failure
occurred, while the possible cases are all considered working days.

The formula for probability (P) in its simplest form is:

P(E) = n(E) / n(S)

Where n(E) is the number of days with failures (36 days) and n(S) is the total number of
working days (240 days).

Substituting the values, we get:

P(E) = 36 / 240 = 3 / 20 = 0.15
There is thus a probability of 15% that on a randomly chosen working day, at least one
failure will occur in the production process. This exercise uses the statistical concept of
empirical probability based on historical data to estimate the probability of a similar future
event.

Solution with Python

# Problem data
total_working_days = 240
days_with_failures = 36

# Probability calculation
failure_probability = days_with_failures / total_working_days

print(f'The probability of at least one failure occurring is: {failure_probability}')

In this code, the main task is computing the ratio of days with failures to total working days.

Let's see the steps of the code:

1. Data assignment:
o The total number of working days and the number of days with failures are
assigned to variables for later use.
2. Probability calculation:
o We calculate the ratio of days with failures to total working days.
3. Solution output:
o We print the calculated result (0.15, equivalent to the fraction 3/20), representing the probability that on a
randomly chosen working day, at least one failure will occur in the production line.

Exercise 3. Analysis of a Production Process

In a manufacturing company that produces electronic gadgets, it has been observed that out of 10,000 units produced in a
month, 300 units are defective. The company decides to improve its production processes
to reduce the number of defects and wants to evaluate the probability of achieving better
results in the coming month. Assuming the defect rate remains unchanged, what is the
probability that at least one gadget chosen at random from a batch of 500 units is
defective?

Solution

The proposed problem is a classic example of probability, where we want to calculate the
probability of the complementary event (non-defective), and then determine the opposite.
The probability of a defective unit is P(D) = 300/10000 = 0.03. Consequently, the probability
that a unit is NOT defective is P(ND) = 1 - P(D) = 0.97. Using the complementary
probability for 500 units, the probability that ALL 500 units are NOT defective is:

P(500 non-defective) = 0.97^500

Now, we calculate the probability of at least one defective gadget, which is the
complement:

P(at least one defective) = 1 - 0.97^500 ≈ 0.9999998

This calculation applies the concept of probability for mass production examples, helping
the company estimate the success of their process improvements without having to
completely test a vast quantity of products, thereby reducing the risk of defects in gadgets
distributed in the market.

Solution with Python

# Problem parameters
n = 500  # number of units in the batch
p_defect = 300 / 10000  # probability of a defective unit

# Calculate the probability of at least one defective gadget
p_non_defective_all = (1 - p_defect) ** n
p_at_least_one_defective = 1 - p_non_defective_all

# Print the result
print(f"Probability that at least one gadget is defective: {p_at_least_one_defective}")

In the present code, we apply the complementary probability approach described above.

Let's examine in detail:

• First, we define the problem parameters: n (number of units in the batch) and p_defect
(probability of a defective unit). The calculated probability is based on the number of
defective units relative to the total produced; here, p_defect is 300/10000.
• We use the complementary probability formula:
o We calculate the probability that all units are not defective: (1 - p_defect) ** n.
o The event that at least one is defective is complementary to the event that all are
not, so we calculate: 1 - p_non_defective_all.
• Finally, we print the result, showing the probability that at least one gadget is
defective. This type of calculation helps the company understand production risks and
make predictions about the outcomes of changes to production processes.

Exercise 4. Customer Preference Analysis in a Clothing Store

A clothing company wants to analyze the behavior of its customers. In the past six months, they noticed that
600 customers purchased a product immediately after trying on a specific item in the
fitting room, while 1800 customers tried it on but did not purchase it. Based on these data,
the company wants to know the probability that a customer, after trying on an item in the
fitting room, decides to purchase it. Present your answer as a decimal value rounded to two
decimal places.

Solution

To solve this problem, we use the concept of conditional probability. We are looking for the
probability that a customer purchases an item given that they have tried it on in the fitting
room. The probability is calculated as the number of customers who purchased the item
after trying it on, divided by the total number of customers who tried the item.

Calculation:

P(Purchase|Tried) = Number of customers who purchased / Total number of customers who tried =
= 600 / (600 + 1800) = 600 / 2400 = 0.25
Therefore, the probability that a customer purchases an item after trying it on is 25% or
0.25.

Solution with Python

# Problem data
number_of_purchases = 600
number_of_no_purchases = 1800

# Calculation of probability
conditional_purchase_probability = number_of_purchases / (number_of_purchases + number_of_no_purchases)

# Result rounded to two decimal places
conditional_purchase_probability = round(conditional_purchase_probability, 2)

conditional_purchase_probability

In the provided code, we are calculating the probability that a customer purchases an item after trying it on.

Let's look at the various details of the code:

1. Definition of data:
o We have defined two variables: number_of_purchases, representing the number of
customers who purchased after trying on the item; and number_of_no_purchases, indicating those
who did not purchase.
2. Calculation of conditional probability:
o We calculate the probability that a customer purchases after trying it on, by
dividing the number of people who purchased by the total number of people who
tried it on.
3. Approximation
o Finally, we round the result to two decimal places, as required.

This simple code provides us with a clear and direct result of the desired probability, which
is 0.25 or 25%.

Exercise 5. Analysis of Conversions in an E-commerce

In an online store, a company's analyst is monitoring user behavior on the platform. The conversion rate of a user visiting
the site and completing a purchase is 10%. Additionally, among those who convert, 30% of
these users purchase at least 2 products in the same order. If a user visits the site, what is
the probability that they convert and also buy at least 2 products?

Solution

To solve this problem, we must consider the probability of the intersection set. We are
looking for the probability that a user not only converts but also purchases at least 2
products. Let A denote the event 'a user converts' and B the event 'a user who has
converted buys at least 2 products'. We know that:

• P(A) = 0.10,
• P(B|A) = 0.30 (probability of B given A).

The sought probability is P(A ∩ B), which is calculated with the multiplication rule for
conditional probability, P(A ∩ B) = P(A) · P(B|A), thus

P(A ∩ B) = P(A) · P(B|A) = 0.10 · 0.30 = 0.03

Thus, the probability that a user who visits the site converts and simultaneously buys at
least 2 products is 3%. This exercise illustrates the concept of the probability of the
intersection of events in the context of customer behavior in an e-commerce.

Solution with Python

# Conversion rate
p_conversion = 0.10

# Rate of purchasing at least 2 products among converters
p_at_least_2_products_given_conversion = 0.30

# Calculation of the probability
probability = p_conversion * p_at_least_2_products_given_conversion

# Output the result
print(f"The probability that a user converts and buys at least 2 products is {probability:.2%}")

In this code, we compute the probability of the intersection of the two events.

Let's see the various steps of the code:

1. Definition of probabilities:
o p_conversion represents the probability that a user converts (10%).
o p_at_least_2_products_given_conversion is the probability that a user who has
already converted buys at least 2 products (30%).
2. The probability we are looking for is the intersection of two events: that a user converts
(A) and buys at least 2 products (B). This is calculated as the product of P(A) and
P(B|A), resulting in 0.03, or 3% in percentage form.
3. Finally, we format the final result as a percentage and print it using the print()
function.

In the e-commerce context, this type of analysis helps to better understand user behavior
and to plan personalized marketing strategies.

Exercise 6. Evaluation of the Probability of Participation in Company Events

A company is analyzing employee behavior concerning participation in internal training
events. 25% of employees voluntarily attend technical workshops, and 15% attend
personal development courses offered by the company. It has been observed that 5% of
employees participate in both types of events. What is the probability that a randomly
chosen employee participates in at least one of these events?

Solution

To estimate the probability that an employee participates in at least one of the two types of
events, we can use the probability rule for the union of sets. Let A be the event 'attending
a technical workshop’ and B the event 'attending a personal development course.’

The probability that an employee attends a technical event is P(A) = 0.25. The probability
that an employee attends a personal development course is P(B) = 0.15. The probability
that an employee attends both is P(A ∩ B) = 0.05.

Using the union probability formula:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Substituting the values:

P(A ∪ B) = 0.25 + 0.15 - 0.05 = 0.35

Thus, the probability that an employee participates in at least one of these events is 35%.
Solution with Python

# Define the probabilities
P_A = 0.25  # Probability of attending technical workshops
P_B = 0.15  # Probability of attending personal development courses
P_A_intersec_B = 0.05  # Probability of attending both types of events

# Calculate the probability of attending at least one of the events
P_A_union_B = P_A + P_B - P_A_intersec_B

# Output the result
P_A_union_B

The core of the code is the calculation of probability using the union probability formula:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B). This provides the probability that an employee
participates in at least one of the two types of events.

Once the individual probabilities are defined, the rigorous application of this formula
provides the required result.

Exercise 7. Analysis of Customer Satisfaction in a Service Company

A service company is analyzing the satisfaction of its customers through an annual survey. The data
collected shows that 40% of the customers surveyed are satisfied with the service
received, while 25% are satisfied with the after-sales support. Furthermore, it was noted
that 10% of the customers are satisfied with both the service and the after-sales support.
Calculate the probability that a random customer is satisfied with at least one of the two
aspects (service or after-sales support).

Solution

To solve this problem, we need to apply the concept of the probability of the union of sets.
That is, we need to determine the probability that a customer is satisfied with the service
or the after-sales support or both.

The formula for the probability of the union of two events A and B is:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B)

Where:

• P(A) is the probability that a customer is satisfied with the service, equal to 40% or
0.40
• P(B) is the probability that a customer is satisfied with the after-sales support, equal to
25% or 0.25
• P(A ∩ B) is the probability that a customer is satisfied with both the service and the
after-sales support, equal to 10% or 0.10

Plugging the values into the formula, we obtain:

P(A ∪ B) = 0.40 + 0.25 - 0.10 = 0.55

Thus, the probability that a random customer is satisfied with at least one of the two
aspects is 55%.

Solution with Python

# Probability of satisfaction with the service
P_service = 0.40

# Probability of satisfaction with the after-sales support
P_support = 0.25

# Probability of satisfaction with both
P_both = 0.10

# Calculating the probability that a customer is satisfied with at least one of the two aspects
P_union = P_service + P_support - P_both

P_union

In this code, we use the concept of the probability of the union of two events A and B to
compute the overall satisfaction probability.

Let's see the various steps:

1. We define three variables representing the individual probabilities: P_service for service
satisfaction, P_support for after-sales support satisfaction, and P_both for satisfaction
with both.
2. We use the formula to calculate the probability that a customer is satisfied with at least
one of the two aspects. The formula is P(A) + P(B) - P(A ∩ B), where we subtract the
intersection probability to avoid double-counting customers satisfied with both aspects.
3. The calculation provides the overall satisfaction probability for at least one of the two
aspects for a randomly chosen customer, equal to 0.55 or 55%.

Exercise 8. Analysis of Marketing Campaigns

A company is analyzing the effectiveness of two different advertising campaigns. The first campaign, conducted on
social media, reaches 20% of potential customers, while the second campaign, on a
television platform, covers 35% of the target. Market analysis shows that 10% of potential
customers are reached by both campaigns. What is the probability that a potential
customer is reached by at least one of the two advertising campaigns?

Solution

To solve this problem, we use the concept of the probability of the union of sets:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B).

Where:

• P(A) is the probability that a potential customer is reached by the social media
campaign (20% = 0.20),
• P(B) is the probability that a potential customer is reached by the television campaign
(35% = 0.35),
• P(A ∩ B) is the probability that a potential customer is reached by both campaigns
(10% = 0.10).

Inserting the values:

P(A ∪ B) = 0.20 + 0.35 - 0.10 = 0.45.

Therefore, the probability that a potential customer is reached by at least one of the two
advertising campaigns is 45%.

Solution with Python

def probability_of_either_campaign(p_a, p_b, p_ab):
    # Calculate the probability that a potential customer is
    # reached by at least one of the two campaigns
    return p_a + p_b - p_ab

# Probability of a customer being reached by the social media campaign
p_a = 0.20

# Probability of a customer being reached by the television campaign
p_b = 0.35

# Probability of a customer being reached by both campaigns
p_ab = 0.10

# Calculate the probability of being reached by at least one of the campaigns
probability = probability_of_either_campaign(p_a, p_b, p_ab)

print(f"The probability that a potential customer is reached by at least one of the two campaigns is: {probability:.2%}")

In this code, we proceed as follows:

1. Define a function probability_of_either_campaign that implements the aforementioned
formula to calculate the combined probability.
2. Set the variables p_a, p_b, and p_ab with the probability values specified in the problem.
3. Calculate the resulting probability using our function.
4. Print the result formatted as a percentage.

The executed calculation returns the probability that a potential customer is reached by at
least one of the two advertising campaigns, which is 45%.

Exercise 9. Customer Preferences Analysis of a Company

An e-commerce company wants to analyze the purchasing preferences of its customers for two product categories:
electronics and clothing. From the database, it is observed that 45% of customers have
made at least one purchase in the electronics category, while 55% of customers have
bought at least one product in the clothing category. Furthermore, 20% of customers have
purchased at least one product in both categories. Calculate the probability that a
randomly selected customer has made purchases in at least one of the two categories.

Solution

To solve this problem, we use the concept of the probability of the union of sets. The
probability that a customer has purchased products in at least one of the two categories is
given by the sum of the individual probabilities of purchasing in each category, minus the
probability that they have purchased in both categories.

In mathematical terms, if we define:

• P(E) = probability that a customer purchases electronics = 0.45,
• P(A) = probability that a customer purchases clothing = 0.55,
• P(E ∩ A) = probability that a customer purchases in both categories = 0.20,

The probability that a customer has purchased in at least one of the two categories is:

P(E ∪ A) = P(E) + P(A) - P(E ∩ A) = 0.45 + 0.55 - 0.20 = 0.80.

So, 80% of customers have purchased products in at least one of the categories of
electronics or clothing.

Solution with Python

# Probability of purchasing in each category
P_electronics = 0.45
P_clothing = 0.55

# Probability of purchasing in both categories
P_e_and_a = 0.20

# Calculate the probability of purchasing in at least one of the two categories
P_union = P_electronics + P_clothing - P_e_and_a

P_union

In this code, we calculate the probability that a customer has made purchases in at least
one of the two categories:

• P_electronics represents the probability of purchasing electronic products,
• P_clothing represents the probability of purchasing clothing products,
• P_e_and_a represents the probability of purchasing both electronic and clothing
products.

These probabilities are added together and then the intersection is subtracted.

Exercise 10. Sales Performance Analysis in a Company Commission

A company relies on a team of salespeople to launch a new product on the market. Each salesperson
has a distinctive approach to closing sales. The probability that salesperson A concludes a
sale in a call is 0.35, while for salesperson B it is 0.5. Since both are working on separate
calls, calculate the probability that at least one sale is concluded, if each makes one call.
Consider the strategies of the salespeople to be distinct and independent.

Solution

To solve this problem, we need to apply the concept of the probability of independent
events. The probability that salesperson A does not conclude a sale is 1 - 0.35 = 0.65.
Similarly, the probability that salesperson B does not conclude a sale is 1 - 0.5 = 0.5. Since
the two events are independent, the probability that neither of the salespeople concludes
the sale is the product of their individual probabilities of failure:
P(no sale) = 0.65 • 0.5 = 0.325

The probability that at least one sale is concluded is complementary to the fact that no
salesperson concludes the sale. Therefore,
P(at least one sale) = 1 - P(no sale) = 1 - 0.325 = 0.675

The probability that at least one sale is concluded during these calls is 0.675,
demonstrating the application of probabilities of independent events to establish the
potential success of the team.

Solution with Python

# Probability of success for each salesperson
p_A = 0.35
p_B = 0.5

# Calculation of failure probabilities
no_sale_A = 1 - p_A
no_sale_B = 1 - p_B

# Probability that neither salesperson concludes a sale
no_sale_both = no_sale_A * no_sale_B

# Probability that at least one sale is concluded
at_least_one_sale = 1 - no_sale_both

print(at_least_one_sale)

Let's see the various steps:

1. Start by defining the probability of success for each salesperson, p_A for salesperson A
and p_B for salesperson B.
2. Calculate the probability that each salesperson does not conclude a sale with no_sale_A
and no_sale_B. This is obtained by subtracting the success probability of each
salesperson from 1.
3. Since these are independent events, the probability that both salespeople do not close
a sale is obtained by multiplying their probabilities of failure.
4. The probability that at least one salesperson concludes a sale is the complement of the
probability that neither concludes a sale. This is calculated by subtracting the
combined probability of failure from 1.
5. Finally, print the result, which is the probability that at least one sale is concluded
during the calls.

Exercise 11. Simulation of Advertising Campaigns' Success

In a small marketing agency, two teams are working separately on advertising campaigns for different clients.
Team Alpha is conducting an email marketing campaign, while Team Beta is focusing on a
social media campaign. According to past analyses, the probability that Team Alpha's
campaign meets its goal is P(A) = 0.6, while the probability for Team Beta’s campaign is
P(B) = 0.7. Since the two campaigns are designed for different clients, they can be
considered independent. Calculate the probability that both campaigns are successful.

Solution

To solve this problem, we apply the statistical concept of the probability of independent
events. Events are considered independent when the outcome of one does not influence
the outcome of the other. The formula to find the probability of both independent events
occurring is given by the product of their individual probabilities.
P(A ∩ B) = P(A) · P(B)

In this specific case, substituting the values given in the problem:

P(A ∩ B) = 0.6 · 0.7 = 0.42

Therefore, the probability that both campaigns are successful is 0.42, or 42%.

Solution with Python

# Probability of success for each team
p_a = 0.6
p_b = 0.7

# Calculate the probability that both campaigns are successful
p_success_both = p_a * p_b

# Output the probability
p_success_both

In this code, we calculate the probability that both campaigns are successful.

Let's look at the various steps:

1. We start by defining the variables p_a and p_b, which represent the probabilities that
Team Alpha's and Team Beta's campaigns are successful, respectively. These are
provided in the problem statement.
2. We use the multiplication operator * to calculate the product of the probabilities, which
gives us the probability that both events are successful.
3. The result is stored in p_success_both, which is then simply returned (printed in a broader
context).

Exercise 12. Risk Analysis in an Insurance Company

In an insurance company, there are two independent risk assessments conducted to determine the coverage of a policy.
The first assessment is based on the customer’s creditworthiness, with a probability of
passing of 0.8. The second assessment examines the claims history, with a probability of
passing of 0.75. Calculate the probability that a customer passes both assessments and
can therefore obtain the requested insurance coverage.

Solution

To calculate the probability that a customer passes both assessments, we use the concept
of independent events, where the probability of both events occurring is the product of
their individual probabilities.

If P(A) = 0.8 is the probability of passing the creditworthiness assessment, and P(B) = 0.75
is the probability of passing the claims history assessment, then the probability that both
conditions are satisfied is:

P(A and B) = P(A) · P(B) = 0.8 · 0.75 = 0.6

Thus, the probability that a customer passes both assessments is 0.6, or 60%.


Solution with Python

# Probabilities of events A and B
prob_A = 0.8
prob_B = 0.75

# Probability that both independent events occur
prob_A_and_B = prob_A * prob_B

prob_A_and_B

In the code above, we calculate the probability that a customer passes both risk
assessments based on the premise that the events are independent.

Here are the various steps:

• prob_A and prob_B represent the probabilities that a customer passes the
creditworthiness assessment and the claims history assessment, respectively.
• Since the events are independent, the probability that both events occur is the product
of the individual probabilities (prob_A * prob_B). This is a fundamental principle of
probability theory for independent events.

The final result, prob_A_and_B, gives us the probability that a customer passes both
assessments, resulting in a value of 0.6 or 60%.

Exercise 13. Analysis of Cross-Selling Operations in a Department Store

In a department store, a section is analyzing customer purchase habits to improve cross-selling
strategies. It is known that the probability that a customer buys an appliance is P(A) =
0.25, while the probability that they buy a furniture item is P(B) = 0.35. These two events
are considered independent. Calculate the probability that a customer neither buys an
appliance nor a furniture item during their visit.

Solution

To solve this problem, we use the concept of the probability of independent events.

Given the independence of events A and B, the probability that the union of two events
occurs (at least one of them occurs) is given by:
P(A ∪ B) = P(A) + P(B) - P(A) · P(B)

Let's calculate:

P(A ∪ B) = 0.25 + 0.35 - (0.25 · 0.35) = 0.25 + 0.35 - 0.0875 = 0.5125

The probability that a customer neither buys an appliance nor a furniture item is the
complement of P(A ∪ B):

P((A ∪ B)ᶜ) = 1 - P(A ∪ B) = 1 - 0.5125 = 0.4875

In this statistical context of independent events, the department store can expect that the
probability of a customer making no purchases in the analyzed categories is 48.75%.

Solution with Python

# Probability that a customer buys an appliance
P_A = 0.25

# Probability that a customer buys a furniture item
P_B = 0.35

# Calculate P(A union B) using the formula for independent events
union_probability = P_A + P_B - (P_A * P_B)

# Probability that a customer neither buys an appliance nor a furniture item
complement_probability = 1 - union_probability

# Output the calculated probability
complement_probability

This code calculates the probability that a customer makes no purchase in either of the two categories.

To calculate the combined probability of two independent events, we used the formula P(A)
+ P(B) - P(A) · P(B), which represents the probability that at least one of the two events
occurs.

After obtaining P(A ∪ B), we calculated the complement (the probability that neither event
occurs) by subtracting P(A ∪ B) from 1.

Finally, the code returns the calculated probability of making no purchases in the specified
categories, which is 0.4875, or 48.75%.

Exercise 14. Analysis of Advertising Sales

A digital marketing company analyzed the
response data to its email advertising campaigns in the last quarter. Out of a total of
25,000 emails sent, it recorded that 5,250 of these led recipients to visit the promotional
website. Simultaneously, out of the emails sent, 1,200 resulted in a direct sale on the site.
Based on this data, calculate the estimated probability that a randomly selected email
recipient visits the promotional website and simultaneously makes a purchase. Present the
result as a decimal value rounded to four decimal places.

Solution

The exercise analyzes the probability of combining multiple events: visit and purchase.
Below, we report the various data we have to work with.

Probability that a recipient visits the site:

P(Visit) = Number of visits / Total emails = 5250 / 25000 = 0.2100

Probability that a recipient makes a purchase given that they visited the site:

P(Purchase|Visit) = Number of sales / Number of visits = 1200 / 5250 = 0.2286

Total probability that a recipient visits the site and makes a purchase:

P(Visit and Purchase) = P(Visit) · P(Purchase|Visit) = 0.2100 · 0.2286 = 0.0480

In this exercise, we used the concept of conditional probability to estimate the combined
probability of two dependent events: visit and purchase. The estimated probability that an
email recipient visits the site and makes a purchase, rounded to four decimal places, is equal to
0.0480.

Solution with Python

# Initial data
email_total = 25000
visits = 5250
sales = 1200

# Calculation of visit probability
p_visit = visits / email_total

# Calculation of the conditional probability of a sale given that there has been a visit
p_sale_visit = sales / visits

# Calculation of the combined probability of visit and sale
p_visit_and_sale = p_visit * p_sale_visit

# Result rounded to four decimal places
p_visit_and_sale_approx = round(p_visit_and_sale, 4)

p_visit_and_sale_approx

In this code, we use basic concepts of conditional probability to calculate the combined probability of a visit and a sale.

Let’s see the various details:

1. We assign the known values to variables to facilitate their use in the code. We have the
total number of emails sent, the number of site visits, and the number of sales made.
2. The probability that a recipient visits the site is calculated by dividing the number of
site visits by the total number of emails sent. This operation provides P(Visit).
3. The probability of a sale given that there has been a visit is calculated by dividing the
number of sales by the number of visits. This is the conditional probability
P(Purchase|Visit).
4. We use the theorem of conditional probability to calculate the combined probability of
visit and sale, by multiplying P(Visit) by P(Purchase|Visit).
5. Finally, we use the round() function to approximate the result to four decimal places, as
required.

Exercise 15. Employee Performance Evaluation through Internal Surveys

A consulting firm manages a professional development program for its employees. In order to
improve the program, anonymous satisfaction surveys are conducted among employees at
the end of each quarter. In the last year, out of 800 distributed surveys, 600 were
completed and returned by employees. Furthermore, 450 of these completed surveys
express a positive evaluation of the program. Determine the estimated probability that a
randomly chosen survey among the completed ones expresses a positive evaluation.

Solution

To determine the required probability, we consider the number of surveys that express a
positive evaluation and the total number of completed surveys. This probability can be
calculated as:
P(positive evaluation) = number of positive surveys / total number of completed surveys =
= 450 / 600 = 0.75
The concept of probability is applied here to quantify the employees' satisfaction with the
professional development program. A probability of 75% indicates predominantly positive
feedback among employees who participated in the survey.

Solution with Python

# Define the data
number_of_completed_surveys = 600
number_of_positive_surveys = 450

# Calculate the estimated probability
positive_probability = number_of_positive_surveys / number_of_completed_surveys

# Print the result
print(f"Estimated probability of a positive evaluation: {positive_probability}")

In this code, we calculate the probability of a positive evaluation from the survey counts.

Let's see the details:

1. We define two variables corresponding to the total number of completed surveys and
the number of surveys that reported positive feedback.
2. We calculate the probability of a single positive survey. This is given by the ratio of
positively evaluated surveys to the total completed ones.
3. Finally, we print the results.
Chapter 2
Descriptive Statistics
In this chapter, we will explore a series of exercises in
descriptive statistics, a fundamental branch of statistics that
allows us to effectively summarize and interpret data. We
will start with the basic concepts, such as calculating the
most common observables, including mean, median, mode,
and variance, and then move on to the most widely used
correlation indices, such as Pearson's and Spearman's
correlation coefficients. These tools are essential for
analyzing data in a structured manner, identifying patterns
and trends, and supporting decision-making based on
numerical evidence.

Additionally, exercises will be proposed concerning
Chebyshev's inequality, a statistical theorem that provides a
bound on the probability that a value lies at a certain
distance from the mean. This concept is particularly useful
in risk management contexts as it allows us to estimate the
distribution of data with a certain degree of confidence even
when the exact distribution is unknown. In business
applications, for example, Chebyshev's inequality can be
used to assess the probability of extreme events, such as
financial losses or significant changes in performance
metrics.
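As a preview of how such a bound is used, here is a minimal sketch with made-up figures (mean_value, std_value, and threshold are invented for illustration, not taken from the exercises): it computes the Chebyshev bound 1/k² on the share of values lying at least k standard deviations from the mean.

# Chebyshev's inequality: P(|X - mean| >= k*std) <= 1 / k**2
# Made-up figures: a metric with mean 100 and standard deviation 20
mean_value = 100
std_value = 20
threshold = 160                           # a value 60 units away from the mean

k = (threshold - mean_value) / std_value  # number of standard deviations (k = 3)
chebyshev_bound = 1 / k**2                # at most 1/9 ≈ 0.111

print(f"At most {chebyshev_bound:.1%} of values lie {k:.0f} or more standard deviations from the mean")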

Another important topic covered in this chapter is the
analysis of outliers, which involves identifying anomalous
values within a dataset. The interquartile range (IQR)
method will be explored, a widely used approach that helps
in identifying data that deviates significantly from the rest of the distribution.
This method is widely employed in various fields, such as
data quality in financial analyses, fraud detection in the
banking sector, and data cleaning in data science projects.
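As a first taste of the IQR method, which is treated in detail in Section 2.6, the following sketch uses an invented list of daily amounts (not taken from the exercises) and flags values falling outside the usual [Q1 − 1.5·IQR, Q3 + 1.5·IQR] fences.

import numpy as np

# Invented data: one clearly anomalous value (250) among ordinary daily amounts
data = [52, 48, 50, 55, 47, 51, 49, 53, 250, 50]

q1, q3 = np.percentile(data, [25, 75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)                          # [250]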
2.1 Mean Value

The mean value (or sample mean) is a measure of central tendency that represents the
average of the data observed in a sample. Given a series of n observations x₁, x₂, ..., xₙ, the
sample mean x̄ is defined as:

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
This formula indicates that the mean is obtained by summing all the observations and
dividing by the total number of elements in the sample.

Let's examine some properties:

• It is an unbiased estimator of the population mean, meaning its expected value
coincides with the population mean.
• If the sample is large, according to the central limit theorem, the distribution of the
sample mean tends to a normal distribution, irrespective of the population distribution.
• It is sensitive to extreme values (outliers), which can significantly influence the result.

The sample mean is one of the fundamental tools of statistical inference, used to estimate
the population mean and compare groups of data.
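To make the last property above concrete, here is a small sketch with invented numbers showing how a single extreme value shifts the sample mean.

import numpy as np

# Invented sample: ten similar observations, then the same sample plus one extreme value
values = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10]
values_with_outlier = values + [100]

print(np.mean(values))               # 10.3
print(np.mean(values_with_outlier))  # about 18.5: a single outlier pulls the mean up sharply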

Exercise 16. Market Performance Analysis

A technology company recorded its monthly revenue for an entire year, reporting the following data in thousands of euros: [50,
75, 60, 85, 90, 80, 95, 100, 70, 85, 110, 95]. The company's manager wants to know the
average monthly revenue to evaluate last year's performance.

Calculate the average monthly revenue using the provided data.

Solution

To find the average monthly revenue, all monthly revenue values need to be summed and
then divided by the number of months considered. This is the standard formula for
calculating the arithmetic mean in statistics.

Calculation:

Average = (50 + 75 + ... + 95) / 12 = 995 / 12 = 82.92
The average monthly revenue for the past year is 82.92 thousand euros. This figure
provides a central measure of monthly revenue and helps understand the overall market
performance trend of the company on a monthly basis.

Solution with Python

import numpy as np

# Monthly revenue data
monthly_revenue = [50, 75, 60, 85, 90, 80, 95, 100, 70, 85, 110, 95]

# Calculate the average using the numpy library
average_revenue = np.mean(monthly_revenue)

# Print the result
print(f"The average monthly revenue is: {average_revenue:.2f} thousand euros")

In this code, we use numpy to compute the arithmetic mean of the monthly revenues.
Let's look at the various steps:

1. We create a list called monthly_revenue containing the monthly revenue data expressed
in thousands of euros for each month of the year.
2. We use the mean function from numpy to calculate the average of the values in the
monthly_revenue list. The mean is a statistical indicator that provides a central value of
the data distribution.
3. We use the print function to display the calculated average value, formatting it to two
decimal places for better readability.

This code allows us to quickly obtain an essential measure of the company's average
monthly performance, helping us evaluate last year’s revenue performance.

Exercise 17. Analysis of a Clothing Store's Sales

A retail clothing store recorded daily
sales for two consecutive weeks. The collected data are as follows: [1500, 1700, 1600,
1800, 2100, 1900, 2000, 1800, 1600, 2300, 2100, 1900, 2000, 2200] (in euros). The store
manager wants to better understand the average daily sales to optimize inventory and
improve marketing strategy.

Calculate the average daily sales for the analyzed period using the provided data.

Solution

The solution to the problem is based on calculating the average value, also known as the
arithmetic mean. To find the average daily sales, sum all the daily sales and divide the
result by the total number of days considered.

Here is the detailed calculation:

Average daily sales = (1500 + ... + 2200) / 14 = 26500 / 14 = 1892.86 euros
Thus, the average daily sales during the two-week period is 1892.86 euros. This calculation
allows the store manager to gain a general understanding of sales and plan future
strategies based on identified trends.

Solution with Python

import numpy as np

# Daily sales data
daily_sales = [1500, 1700, 1600, 1800, 2100, 1900, 2000, 1800, 1600, 2300, 2100, 1900, 2000, 2200]

# Calculating the average daily sales using numpy
average_daily_sales = np.mean(daily_sales)

# Printing the average daily sales
print(f"Average daily sales: {average_daily_sales:.2f} euros")

In this code, we use numpy to compute the mean of the daily sales figures.

Let's see the various details of the code:

1. The data provided in the problem is stored in a list called daily_sales.
2. We use np.mean(daily_sales) to calculate the mean of the values in the list. This function
adds up all the numbers and divides the total by the number of elements.
3. We use the print function to display the result of the average calculation. The
formatted string allows inserting Python variables directly into a string to format the
output. :.2f is used to limit the result to two decimal places for a more precise
representation of the value.

Using the numpy library simplifies our approach compared to manual calculations, helping
reduce the risk of arithmetic errors and optimizing the average calculation process.

Exercise 18. Analysis of the Average Cost of Advertising Campaigns

A digital marketing company launched several advertising campaigns for a range of products over
the past quarter. The monthly costs for each campaign, expressed in thousands of euros,
are recorded as follows:

• Campaign 1: [4.5, 5.0, 4.7]


• Campaign 2: [3.2, 3.8, 4.0]
• Campaign 3: [5.5, 6.0, 5.8]
• Campaign 4: [4.0, 3.9, 4.1]

The financial director wants to get an overall view of the economic efficiency of the
campaigns to optimize the budget and plan future investments. Calculate the average
monthly cost of each campaign.

Solution

To solve this problem, we need to calculate the average value of the monthly cost for each
advertising campaign.

The average value is a fundamental concept in statistics obtained by summing all the
values of the data in a dataset and then dividing by the total number of values in that
dataset.

Here's how we calculate it (in thousands of euros) for each campaign:

Average Cost Campaign 1 = (4.5 + 5.0 + 4.7) / 3 = 14.2 / 3 = 4.73
Average Cost Campaign 2 = (3.2 + 3.8 + 4.0) / 3 = 11.0 / 3 = 3.67
Average Cost Campaign 3 = (5.5 + 6.0 + 5.8) / 3 = 17.3 / 3 = 5.77
Average Cost Campaign 4 = (4.0 + 3.9 + 4.1) / 3 = 12.0 / 3 = 4.00
This result helps the financial director understand which campaigns are more costly and
plan accordingly.

Solution with Python

import numpy as np

# Monthly costs of advertising campaigns
campagne = {
    'Campaign 1': [4.5, 5.0, 4.7],
    'Campaign 2': [3.2, 3.8, 4.0],
    'Campaign 3': [5.5, 6.0, 5.8],
    'Campaign 4': [4.0, 3.9, 4.1]
}

# Calculate the average cost for each campaign
average_cost_campaigns = {name: np.mean(costs) for name, costs in campagne.items()}

print(average_cost_campaigns)

The provided code block allows for determining the average monthly cost of each campaign:

1. The monthly costs for each campaign are stored in a dictionary campagne, where the
keys represent the campaign names and the values are lists of the monthly costs
expressed in thousands of euros.
2. We use dictionary comprehension to iterate through each campaign and its respective
costs. For each campaign, numpy's mean function is used to calculate the average cost.
3. The resulting dictionary average_cost_campaigns associates the name of each campaign
with its average monthly cost. This is then printed, providing a clear view of the
average costs for each campaign.

This simple example illustrates how fundamental statistical calculations, like arithmetic
averages, can be easily performed using the scientific libraries available in Python.
2.2 Standard Deviation

The standard deviation is a measure of the dispersion of data relative to their mean. It
indicates how much, on average, the values of a sample deviate from the sample mean.

Given a series of n observations x₁, x₂, ..., xₙ, the sample standard deviation s is defined as:

s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )

where x̄ is the sample mean:

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ
Here are some of its properties:

• The standard deviation is always positive or zero (s ≥ 0).
• It measures the variability of the data: large values of s indicate greater dispersion,
while small values indicate that the data are closer to the mean.
• The denominator n−1 (instead of n) is used in the formula so that the squared value s²
is an unbiased estimate of the population variance.

The standard deviation is often used along with the mean to describe the distribution of the
data and to compare variability between different samples.
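As a quick illustration of the role of the n−1 denominator, this sketch (using an arbitrary small sample invented for the example) compares the value numpy returns when dividing by n (ddof=0) with the sample estimate obtained by dividing by n−1 (ddof=1).

import numpy as np

# Arbitrary small sample
sample = [4, 8, 6, 5, 3, 7]

# Divide by n (population-style formula) versus divide by n - 1 (sample formula)
std_n = np.std(sample, ddof=0)
std_n_minus_1 = np.std(sample, ddof=1)

print(std_n, std_n_minus_1)   # the ddof=1 value is slightly larger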

Exercise 19. Assessment of Product Stability

Two companies, Alfa S.p.A. and Beta
S.r.l., sell the same type of product with an average monthly revenue of 50,000 euros.
However, the financial manager wants to identify which of the two products exhibits a more
stable sales trend over time. Here are the data from the past six months: Alfa S.p.A.:
[48000, 52000, 47500, 50500, 49500, 51500]

Beta S.r.l.: [49000, 51000, 46000, 54000, 50000, 50000]

Determine which product shows greater sales stability and explain why.

Solution

To solve this problem, it is necessary to calculate the sample standard deviation for each
set of provided data. The standard deviation is a measure of data dispersion relative to the
mean, where a lower standard deviation indicates greater stability in the data.

For Alfa S.p.A., we first calculate the sample variance (divide by the number of values
minus 1 to have an unbiased estimate):

1. Calculate the mean: (48000 + ... + 51500)/6 = 49833.33


2. Calculate the sample variance: ((48000 − 49833.33)² + ... + (51500 − 49833.33)²) / 5 ≈ 3366666.67

3. Calculate the sample standard deviation: √3366666.67 ≈ 1834.85


For Beta S.r.l.:

1. Calculate the mean: (49000 + ... + 50000)/6 = 50000


2. Calculate the sample variance: ((49000 − 50000)² + ... + (50000 − 50000)²) / 5 = 6800000

3. Calculate the sample standard deviation: √6800000 ≈ 2607.68

Comparing sample standard deviations, we see that Alfa S.p.A. has a standard deviation of
about 1834.85, while Beta S.r.l. has a standard deviation of about 2607.68. Consequently, the sales of
Alfa S.p.A. are more stable compared to those of Beta S.r.l., owing to the lower standard
deviation.

Solution with Python

import numpy as np

# Sales data for the two companies
vendite_alfa = [48000, 52000, 47500, 50500, 49500, 51500]
vendite_beta = [49000, 51000, 46000, 54000, 50000, 50000]

# Calculation of sample standard deviation
std_alfa = np.std(vendite_alfa, ddof=1)
std_beta = np.std(vendite_beta, ddof=1)

std_alfa, std_beta

In this code, we use the numpy library to manage the data arrays and calculate the sample standard deviations.

Here are the various details:

1. We distribute the sales data of the two companies in lists.


2. We use the std function provided by numpy to calculate the sample standard deviation
for each data set (Alfa and Beta). The argument ddof=1 allows us to use, in the variance
denominator, the number of values minus 1 to calculate an unbiased estimate, since
we do not know the mean of the distribution from which the data are drawn and must
estimate it with the sample mean.
3. We populate the variables std_alfa and std_beta with the calculated standard
deviations. This allows us to deduce which company has more sales stability based on
the lower standard deviation.

The approach of using numpy simplifies the calculation of sample standard deviations,
avoiding manual steps in computing the mean and variance, and is especially useful with
larger data sets or more complex statistical operations.

Exercise 20. Analysis of Daily Work Performance

A consulting firm wants to assess
the variation in daily productivity of its consultants. Management has collected data
regarding the actual working hours of a consultant over a period of two working weeks (10
working days). The data collected are as follows: [7,8,7.5,8.5,7,9,6.5,8,7.5,8]

Use this data to determine how the daily productivity of the consultant varies relative to
the average. What conclusion can you reach regarding the consistency of his daily
productivity?

Solution

To determine the daily variation in the consultant's productivity, we need to calculate the
sample standard deviation of the data. The first step is to calculate the daily average of
working hours.

Daily average, x̄:

x̄ = (7 + ... + 8)/10 = 7.7 hours

The next step is to determine the deviation of each value from the mean, square each
deviation, and sum them:

Σ(x_i - x̄)² = (7 - 7.7)² + ... + (8 - 7.7)² = 0.49 + ... + 0.09 = 5.1

Now, to obtain the sample standard deviation, divide this sum by n - 1, where n is the
number of data points (in our case 10), and then calculate the square root:

s = √(5.1/9) ≈ 0.75

The sample standard deviation of approximately 0.75 hours indicates that the consultant's
daily working hours show moderate variation around the daily average of 7.7 hours. This
statistical calculation allows us to understand that, in general, productivity is fairly
consistent, even though there are some daily variations in hours worked.

Solution with Python

import numpy as np

work_hours_data = [7, 8, 7.5, 8.5, 7, 9, 6.5, 8, 7.5, 8]

# Calculate the daily average
daily_average = np.mean(work_hours_data)

# Calculate the sample standard deviation
s = np.std(work_hours_data, ddof=1)

(daily_average, s)

In this code, we use the numpy library, often used for numerical and statistical computations.

Let's look at the code details:

• We use numpy.mean to calculate the arithmetic mean of daily work hours.
np.mean(work_hours_data) simply calculates the sum of all the elements in the list
work_hours_data and divides it by the total number of elements (10 working days).
• np.std calculates the sample standard deviation, which means it accounts for the
variance of a sample (with the argument ddof=1, the variance is calculated by dividing
by n - 1 instead of n, with n equal to the number of observations in the dataset). This
function is particularly useful because it encompasses all the steps we would otherwise
have to perform manually (subtracting the mean from each value, squaring it, summing
the squares, dividing by n - 1, and taking the square root).

In the final result, daily_average provides the daily average of work hours, while s provides
the sample standard deviation indicating the consistency of the consultant's productivity
relative to the calculated daily average.

Exercise 21. Employee Performance Analysis In a tech company, the software


development department is monitoring the number of lines of code produced per
developer on a weekly basis for an entire month. This data has been collected for five
developers: Developer 1: [340, 360, 370, 350]

Developer 2: [400, 380, 390, 410]


Developer 3: [310, 320, 305, 315]

Developer 4: [450, 455, 460, 448]

Developer 5: [500, 520, 510, 505]

The management team wants to identify which developer shows the most stable code
production over time. Analyze the data to determine who has the least variation in weekly
lines of code production.

Solution

To understand which developer has the most stable production, we need to calculate the
sample standard deviation for each dataset. The standard deviation allows us to measure
the dispersion of a sample around its mean.

Calculate the mean for each developer:

1. Developer 1 Mean:(340 + ... + 350)/4 = 355


2. Developer 2 Mean:(400 + ... + 410)/4 = 395
3. Developer 3 Mean:(310 + ... + 315)/4 = 312.5
4. Developer 4 Mean:(450 + ... + 448)/4 = 453.25
5. Developer 5 Mean:(500 + ... + 505)/4 = 508.75

Calculate the sample standard deviation for each set:

1. For Developer 1:

√[((340 - 355)² + ... + (350 - 355)²)/(4 - 1)] = 12.91

2. For Developer 2:

√[((400 - 395)² + ... + (410 - 395)²)/(4 - 1)] = 12.91

3. For Developer 3:

√[((310 - 312.5)² + ... + (315 - 312.5)²)/(4 - 1)] = 6.45

4. For Developer 4:

√[((450 - 453.25)² + ... + (448 - 453.25)²)/(4 - 1)] = 5.38

5. For Developer 5:

√[((500 - 508.75)² + ... + (505 - 508.75)²)/(4 - 1)] = 8.54

Developer 4 has the lowest standard deviation at 5.38, indicating greater stability in code
production compared to other developers. The mathematical concept used here is the
sample standard deviation, which measures data deviation from the mean.
Solution with Python

import numpy as np

# Lines of code data for each developer
code = {
    'Developer 1': [340, 360, 370, 350],
    'Developer 2': [400, 380, 390, 410],
    'Developer 3': [310, 320, 305, 315],
    'Developer 4': [450, 455, 460, 448],
    'Developer 5': [500, 520, 510, 505]
}

# Calculate the sample standard deviation for each developer
stability = {}
for developer, lines in code.items():
    std_dev = np.std(lines, ddof=1)  # Sample standard deviation
    stability[developer] = std_dev

# Find the developer with the most stable production
most_stable_developer = min(stability, key=stability.get)

print(f"The developer with the most stable code production is {most_stable_developer} "
      f"with a standard deviation of {stability[most_stable_developer]:.2f}")

Let’s see it in detail:

1. We import numpy with import numpy as np. This library is essential for numerical
operations and to calculate the standard deviation of data.
2. We use np.std(), specifying ddof=1, to calculate the sample standard deviation, which is
necessary when working with a sample instead of the entire population.
3. We insert the weekly lines of code data for each developer into a dictionary.
4. We iterate through each developer and calculate the standard deviation of their weekly
production. We use np.std() with ddof=1 to obtain the correct standard deviation.
5. We determine which developer has the lowest standard deviation, indicative of more
stable production over time.
6. Finally, we print which developer has the most stable code production, showing the
associated standard deviation.

This strategy allows us to identify which developer is the most consistent in weekly
recorded lines of code productivity.

Exercise 22. Industrial Production Analysis A manufacturing company produces two


types of components, A and B, which are then assembled into final products. The weekly
production of A and B follows normal distributions with known means and standard
deviations: μ_A = 1500 units, σ_A = 100 units, μ_B = 1000 units, and σ_B = 150 units.
Additionally, it’s known that the covariance between the productions of A and B is cov(A,B)
= 1200.

Calculate the standard deviation of the total components produced weekly.

Solution

To solve this problem, we need to calculate the standard deviation of the sum of two
random variables, in this case, the productions of components A and B. The standard
deviation of the sum of two random variables X and Y with standard deviations σ_X and σ_Y
and covariance cov(X,Y) is given by the formula:

σ_{X+Y} = √(σ_X² + σ_Y² + 2·cov(X,Y))

Applying the values from the problem, we get:

σ_{A+B} = √(100² + 150² + 2·1200) = √(10000 + 22500 + 2400) = √34900 ≈ 186.81
The standard deviation of the weekly sum of the components produced is approximately
186.81 units. This value provides an indication of the total variability of the assembled
production, taking into account the interaction between the two production lines through
covariance.

Solution with Python

import numpy as np

# Problem data
mu_A = 1500
sigma_A = 100
mu_B = 1000
sigma_B = 150
cov_AB = 1200

# Calculate the standard deviation of the total
sigma_total = np.sqrt(sigma_A**2 + sigma_B**2 + 2 * cov_AB)

sigma_total

The code implements the calculation of the standard deviation of the total weekly production.

We use the numerical computation library numpy to perform the necessary mathematical
operations.

Let’s look at the code in detail:

1. The library numpy is imported with the alias np. numpy is essential for scientific computing
in Python and provides efficient functionality for performing complex mathematical
operations.
2. We define the variables mu a, sigma a, mu b, sigma b, cov ab, which represent the means,
the standard deviations of the normal distributions of A and B, and the covariance
between A and B, respectively. These are the known values provided by the problem. In
this specific case, it’s not necessary to use the mean values, as the standard deviations
are already provided.
3. We use the formula provided by the theory of combined normal distributions:
σ_{A+B} = √(σ_A² + σ_B² + 2·cov(A,B)).
4. np.sqrt() is the NumPy function used to calculate the square root, while ** is the
exponentiation operator in Python.
5. The formula considers the variances of the two components (i.e., the squares of their
standard deviations) and includes an additional term that accounts for the covariance
between A and B.
6. The result, sigma total, represents the overall standard deviation of the sum of
components A and B produced weekly and is printed as the output of the code.

Exercise 23. Corporate Financial Risk Analysis A company in the financial sector
manages two investment funds, Fund X and Fund Y. These funds invest in uncorrelated
assets, so their returns can be considered independent.

The annual volatility of Fund X's return is σ_X = 0.08, while that of Fund Y is σ_Y = 0.10. In a
year, an investor wants to know how risky the combined total of returns from both funds is.
Determine the overall standard deviation of the investor's final return.

Solution
To solve this problem, we need to calculate the standard deviation of the sum of returns
from Fund X and Fund Y. Since the problem states that there is no correlation between the
two variables (no covariance), we can use the formula:
σ_{X+Y}² = σ_X² + σ_Y²

Applying the provided data:

σ_{X+Y}² = (0.08)² + (0.10)² = 0.0064 + 0.01 = 0.0164

The standard deviation of the total return will therefore be:

σ_{X+Y} = √0.0164 ≈ 0.128

Therefore, the overall volatility of the sum of the funds is about 0.128, or 12.8% annually.
This calculation maps the statistical concept of standard deviation of the sum of two
independent variables to a financial context, thus comparing the combined risk of two
different investments.

Solution with Python

from math import sqrt

# Volatility of Fund X and Fund Y
sigma_X = 0.08
sigma_Y = 0.10

# Calculate the overall standard deviation without covariance
sigma_combined = sqrt(sigma_X**2 + sigma_Y**2)

# Print the result
print(f"The aggregated standard deviation of the two funds is: {sigma_combined}")

In this code, we calculate the aggregated standard deviation of the returns of two
investment funds, Funds X and Y, using the math library to calculate the square root.

Details of the code:

1. from math import sqrt: This imports the sqrt function from Python's math module,
allowing us to calculate the square root necessary to obtain the overall standard
deviation.
2. We define the volatilities of Fund X and Fund Y as sigma_X and sigma_Y respectively. It is
given that sigma_X is 0.08 and sigma_Y is 0.10.
3. We use the proposed formula to calculate the standard deviation of the sum of two
independent variables, i.e., sqrt(sigma_X**2 + sigma_Y**2). This formula assumes no
correlation between the two return series (absence of covariance).
4. Finally, we print the result using an f-string statement, which allows for easy and
readable string formatting in Python by directly integrating variables within the string
using braces {}.

Using the square root is necessary for the final calculation of the standard deviation
because, in statistics, we work with variance (which is the square of the standard
deviation) while we want to obtain a measure of dispersion that is on the same scale as the
mean. This method determines the overall risk, or volatility, of two combined investments,
provided they are not correlated with each other.

Exercise 24. Analysis of Monthly Production Capacity Fluctuations A tech company


produces two types of electronic components monthly, Components X and Y, which are
then integrated into the devices sold. The monthly production of X and Y may vary
depending on market demand and follows normal distributions with known means and
standard deviations: = 2000 units and ax = 300 units for X, and /jr = 1500 units and oY
= 200 units for Y. Due to various factors such as raw material suppliers and seasonality, the
productions of X and Y are correlated with a covariance cov(X,Y ) = 1800. Calculate the
overall variability of the total monthly production of Components X and Y to maximize the
effectiveness of the supply chain.

Solution

To solve this problem, we need to determine the standard deviation of the sum of the two
random variables, which represent the productions of Components X and Y. The key
concept to apply here is how to calculate the standard deviation of a sum of random
variables.

For two random variables X and Y, the variance of their sum is given by:
Var(X + Y) = Var(X) + Var(Y) + 2·cov(X,Y)

Where Var(X) = σ_X² and Var(Y) = σ_Y².

Substituting the given values:

Var(X + Y) = 300² + 200² + 2·1800

Var(X + Y) = 90000 + 40000 + 3600 = 133600

The standard deviation of the sum is thus the square root of the variance:

σ_{X+Y} = √133600 ≈ 365.5

The standard deviation of the total monthly production of Components X and Y is therefore
about 365.5 units. This value represents the variability in the total monthly productions of
the components, providing an indication of the risk associated with the overall production
capacity. By effectively managing this variability, the company can optimize its supply
chain operations.

Solution with Python

import numpy as np

# Provided data
mu_X = 2000      # Mean of X
sigma_X = 300    # Standard deviation of X
mu_Y = 1500      # Mean of Y
sigma_Y = 200    # Standard deviation of Y
cov_XY = 1800    # Covariance between X and Y

# Calculation of the variance of the sum
var_X = sigma_X**2
var_Y = sigma_Y**2
var_X_plus_Y = var_X + var_Y + 2 * cov_XY

# Calculation of the standard deviation of the sum
sigma_X_plus_Y = np.sqrt(var_X_plus_Y)

# Result
sigma_X_plus_Y

In this code, the numpy library is used to perform numerical calculations, including the
square root. Although we only use basic functions of numpy, it is a fundamental library for
scientific computing in Python.

Let's see the various steps of the code:

1. numpy is imported for basic numerical calculations.


2. We use the provided means and standard deviations for the two components X and Y.
These data represent the average monthly production and its variability. While we will
not use the mean values explicitly, in a more complex analysis, it may be useful to
have them readily available in appropriate variables.
3. We use the provided standard deviations to calculate the variances of X and Y using
the formulas Var(X) = σ_X² and Var(Y) = σ_Y².
4. We use the formula Var(X + Y) = Var(X) + Var(Y) + 2·cov(X,Y) to find the variance of
the sum of X and Y.
5. The standard deviation of the sum is the square root of the previously calculated
variance.

The result, sigma_X_plus_Y, represents the overall variability of the monthly production of
components X and Y.

Exercise 25. Evaluate the Stability of an Investment Portfolio A company decides to


evaluate the performance of a new investment fund. The monthly data from the last two
years show the following monthly percentage returns for twelve consecutive months: [2.5,
3.2, 3.0, 2.8, 3.4, 2.9, 2.7, 3.1, 2.6, 3.3, 2.9, 3.0]. The company wants to examine the
average of the monthly returns and the standard deviation to determine the stability of the
returns. The smaller the standard deviation relative to the mean, the greater the stability.

Based on these calculations, briefly discuss whether the investment can be considered
stable.

Solution

To solve this exercise, we analyze the relationship between the average and the standard
deviation of the monthly returns of the fund.

Let's look at the steps:

1. Calculate the mean:

Mean = (2.5 + ... + 3.0)/12 = 2.95

2. Calculate the standard deviation:

Variance = Σ(x_i - Mean)²/12

Standard Deviation = √Variance ≈ 0.263

3. Ratio of mean to standard deviation:

Ratio = Mean/Standard Deviation = 2.95/0.263 ≈ 11.22
The higher this ratio, the more stable the returns. In the example, the value of about 11.22
indicates that compared to the mean, the variation in returns is rather contained,
suggesting relative stability. This examination of the ratio between the mean and standard
deviation is fundamental in statistics for measuring variability relative to the mean, a key
concept for assessing the risk and stability of an investment. In this case, the investment
can be considered relatively stable, considering the expected variability in the context of
corporate investments.

Solution with Python

import numpy as np

# Monthly returns data
returns = [2.5, 3.2, 3.0, 2.8, 3.4, 2.9, 2.7, 3.1, 2.6, 3.3, 2.9, 3.0]

# Calculate the mean
mean_returns = np.mean(returns)

# Calculate the standard deviation
standard_deviation = np.std(returns, ddof=0)

# Calculate the ratio of mean to standard deviation
stability_ratio = mean_returns / standard_deviation

(mean_returns, standard_deviation, stability_ratio)

In this code, we primarily use numpy, a useful library for numerical and statistical computations.

Let's look at the steps:

1. We create a list of monthly returns called returns that contains the given percentage
data.
2. We use np.mean() to calculate the mean of the monthly returns. This function calculates
the average value of all elements in the array.
3. We use np.std() to calculate the standard deviation. Here, unlike the previous exercises,
we pass ddof=0, which uses n in the variance denominator, treating the data as a whole
population rather than a sample (this is numpy's default).
4. Finally, we divide the average returns by the standard deviation to obtain the stability
ratio. The higher this ratio, the more stable the returns are considered.

Using these procedures, we can judge the stability of the investment fund based on the
calculated values.

Exercise 26. Sales Variables Analysis An e-commerce company wants to analyze the
sales performance of its products to assess the consistency of monthly sales. The sales
data in euros for the first six months of the year are as follows: [5700, 6200, 6500, 5900,
5800, 6400].

The company believes that a good assessment of sales consistency can be obtained by
calculating an index that relates the average value of sales to their standard deviation. A
higher value of this index indicates greater consistency in sales. Evaluate whether the sales
can be considered consistent based on this index.

Solution

To tackle this issue, the main approach is to analyze the ratio between the mean and the
standard deviation, which is related to the concept of the coefficient of variation.

Let's see the various steps:

1. Calculate the average monthly sales:

Mean = (5700 + ... + 6400)/6 = 36500/6 = 6083.33

2. Calculate the standard deviation of monthly sales:

Variance = ((5700 - 6083.33)² + ... + (6400 - 6083.33)²)/6

Standard Deviation = √Variance ≈ 302.30

3. Calculate the sales consistency index:

Consistency Index = 6083.33/302.30 ≈ 20.12
A high index indicates good sales consistency, suggesting that sales are relatively stable
compared to the average. In this example, an index of 20.12 suggests good stability.

In conclusion, the company can consider its monthly sales as consistent since the
consistency index shows a relatively low variation compared to the average sales, which is
the main objective of the exercise: the ratio between mean and standard deviation offers a
quantitative assessment of the stability of business variables.

Solution with Python

import numpy as np

# Sales data
sales = np.array([5700, 6200, 6500, 5900, 5800, 6400])

# Calculate the mean
mean = np.mean(sales)

# Calculate the standard deviation
standard_deviation = np.std(sales, ddof=0)  # ddof=0 for the entire population

# Calculate the consistency index
consistency_index = mean / standard_deviation

# Output
mean, standard_deviation, consistency_index

The code uses the NumPy library to perform statistical calculations.

1. Sales data is defined as a NumPy array. The variable sales contains the sales for the
first six months.
2. Using np.mean(), we calculate the average sales value, which represents the monthly
mean.
3. With np.std(), we calculate the standard deviation. The parameter ddof=0 indicates that
we are calculating the population standard deviation (using division by n).
4. The consistency index is calculated by dividing the mean by the standard deviation.
This index measures sales consistency: the higher it is, the more stable the sales are.

Finally, the code outputs the mean, standard deviation, and consistency index. This allows
the company to evaluate if their sales are consistent in the analyzed period.

Exercise 27. Analysis of Company Production Performance A manufacturing


company wants to analyze the efficiency of its electric motor production line. In the last six
months, the monthly production in units has been: 1200, 1300, 1270, 1250, 1285, 1260.

The company wishes to understand how stable the production process is. Managers believe
that a numerical comparison can provide a clear indication: less variation from the mean
suggests greater stability of the production process.

Evaluate the stability of the production process based on this index.

Solution

The monthly production mean, μ, is calculated as:

μ = Σx_i/n

where x_i is the production in month i and n is the number of months.

μ = (1200 + ... + 1260)/6 = 7565/6 ≈ 1260.83

The standard deviation, σ, is calculated as:

σ = √(Σ(x_i - μ)²/n)

σ = √(((1200 - 1260.83)² + ... + (1260 - 1260.83)²)/6) ≈ 31.68

The stability index is given by the ratio R of the mean to the standard deviation:

R = μ/σ = 1260.83/31.68 ≈ 39.80
In statistical context, the ratio between the mean and the standard deviation (which
represents stability) is an important indicator to evaluate the level of fluctuation relative to
the mean value. A higher ratio signifies less instability. The obtained index (R = 39.80)
suggests good stability of the production process as the variations from the mean are
contained in comparison to the mean value itself. This type of analysis is crucial for a
company aiming for efficiency and consistency in its production operations.

Solution with Python

import numpy as np

# Monthly production data
production = np.array([1200, 1300, 1270, 1250, 1285, 1260])

# 1. Calculate the monthly mean
monthly_mean = np.mean(production)

# 2. Calculate the monthly standard deviation
# Using ddof=0 to get the population standard deviation
standard_deviation = np.std(production, ddof=0)

# 3. Calculate the stability index
stability_index = monthly_mean / standard_deviation

# Results
print(f"Monthly production mean: {monthly_mean:.2f}")
print(f"Monthly standard deviation: {standard_deviation:.2f}")
print(f"Stability index: {stability_index:.2f}")

First, we created a numpy array called production to store the monthly production data. We
used numpy for the calculation of the monthly mean using the statement np.mean(production),
which provides the sum of the values divided by the total number of elements.

For the standard deviation calculation, we used the statement np.std(production, ddof=0).
Here we specify ddof=0 to get the population standard deviation, which is appropriate since
we are analyzing all six months of available data.

Finally, we calculated the stability index as the ratio between the monthly mean and the
standard deviation. This index provides a measure of production consistency: a higher
value indicates a more stable production relative to the mean.
2.3 Median Value

The median value (or median) is a measure of central tendency that represents the central
value of an ordered set of data. Unlike the mean, the median is less sensitive to extreme
values (outliers) and provides a more robust measure of the central position of a sample.

Given a series of n observations x_1, x_2, ..., x_n ordered in increasing order (x_1 ≤ x_2 ≤ ... ≤ x_n),
the median M is defined as:

• If n is odd, the median is the central value:

M = x_{(n+1)/2}

• If n is even, the median is the mean of the two central values:

M = (x_{n/2} + x_{n/2+1})/2
Let's look at some main properties:

• The median is insensitive to outliers, unlike the arithmetic mean.


• It is a suitable measure of central tendency for asymmetric data.
• It divides the sample into two groups of equal size: 50% of the data is less than or
equal to the median and 50% is greater than or equal to the median.

The median is particularly useful for describing the central position of data with
asymmetric distributions or in the presence of extreme values.
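
As a small illustration (with invented data), the following sketch shows both cases of the definition, for an odd-sized and an even-sized sample, using numpy's np.median:

import numpy as np

odd_sample = [7, 3, 9, 1, 5]         # n = 5 (odd): the central value of the ordered data
even_sample = [7, 3, 9, 1, 5, 11]    # n = 6 (even): the mean of the two central values

print(np.median(odd_sample))    # ordered data: 1, 3, 5, 7, 9 -> median 5
print(np.median(even_sample))   # ordered data: 1, 3, 5, 7, 9, 11 -> (5 + 7) / 2 = 6.0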

Exercise 28. Analysis of Company Compensation In a major consulting firm, the


human resources team is evaluating employee salaries to determine if there are significant
disparities between various departments. The following data on the annual salary (in
thousands of euros) of employees in a particular department has been collected: [40, 45,
50, 55, 60, 65, 70]

The HR manager wants to understand what the central value that best represents the
salary situation of the department is.

Calculate this value and discuss how it might reflect the salary distribution more accurately
compared to other statistical summaries, such as the simple average.

Solution

To solve the problem, we identify what the median of the listed salaries is. The median is
the value that divides the sample into two equal parts, namely the number that is in the
middle of the ordered series.

Given the list of ordered salaries: 40, 45, 50, 55, 60, 65, 70

We can see that the series has seven values. Therefore, the central value (fourth in the
ordered series) is 55, which represents the median.

The median 55 provides us with an idea of the "middle point" of the salary distribution in
the department, which can be particularly useful if there are outliers or asymmetric
distributions that might skew the average. In this context, comparing the median with the
mean might show if there are extremely low or high salaries that influence the mean but
not the median.

Solution with Python

import numpy as np
from scipy import stats

# Salary list
salaries = [40, 45, 50, 55, 60, 65, 70]

# Calculate the median
median_salary = np.median(salaries)

# Calculate the mean
mean_salary = np.mean(salaries)

# Calculate the median using scipy
median_salary_scipy = stats.scoreatpercentile(salaries, 50)

# Output results
print(f"Median calculated with numpy: {median_salary}")
print(f"Median calculated with scipy: {median_salary_scipy}")
print(f"Mean salary: {mean_salary}")

In this code, we use the numpy library to calculate both the median and the mean of the salaries.

Additionally, we utilized the scipy library, which offers advanced scientific functions. We
used the stats.scoreatpercentile command to calculate the median, passing the 50th
percentile. While providing the same result as the np.median() function, scipy offers more
detailed options for handling statistical distributions and can be useful in more complex
contexts.

Finally, we also calculate the arithmetic mean with np.mean(). By comparing the mean and
median, we can infer information about any asymmetries in the salary distribution. A
significant discrepancy between the mean and median can indicate the presence of
outliers. However, in this specific case, since the data is symmetric and uniformly
distributed, the mean and median are very similar. This confirms that there are no large
disparities in the salaries of this department.

Exercise 29. Analysis of Delivery Times in an Online Stationery Store An online


stationery store has recorded the delivery times (in days) of its orders in one state for the
month of September. Here is the collected data: [2, 3, 3, 4, 4, 5, 7, 8, 10, 12, 15]

The management wants to identify the central value of the delivery times to evaluate the
overall efficiency of their logistics system. Determine this central value and discuss how it
can provide a more accurate picture of the management system compared to other
statistical summaries, considering the presence of outliers or exceptionally long times.

Solution

To identify the central value in the delivery times, we need to calculate the median. The
median is the value that divides an ordered data set into two equal halves.

1. Sort the data (already sorted in this example): 2,3,3,4,4,5,7,8,10,12,15


2. Count the number of observations: there are 11 values.
3. Identify the central position: for 11 observations, the median is the 6th value.
4. Observe the 6th value of the ordered sequence: 5

The median delivery time is thus 5 days.

In this case, the median is preferable to the mean because it is less affected by outlier
values (such as 10, 12, and 15 days). This provides a measure of the "typical" delivery time
that is more useful for understanding the general performance of the system without being
distorted by a few exceptionally slow deliveries.

Solution with Python


from scipy import stats

# Delivery time data
deliveries = [2, 3, 3, 4, 4, 5, 7, 8, 10, 12, 15]

# Calculate the median
median = stats.scoreatpercentile(deliveries, 50)

print('The median delivery time is:', median)

In this code, we use the scipy library, a widely used Python library for scientific and statistical computing.

Let's look at the various steps:

1. We start by importing the stats module from scipy, which offers various tools for
statistical analysis.
2. We create a list called deliveries that contains the ordered delivery times for the month
of September. These represent the data to be analyzed.
3. We use the scoreatpercentile function from the stats module to calculate the median of
the deliveries list.
4. We print the result using print, which communicates to the user what the median
delivery time is. The choice of the median allows for a more robust measure of the
"typical" delivery time compared to the arithmetic mean, as it is not significantly
influenced by extremely high or low values (outliers).

Using scipy makes the process of calculating the median simple and efficient, automating
the handling of ordered data and providing reliable results regardless of the presence of
outliers.

Exercise 30. Analysis of Customer Service Times in a Call Center An international


call center is analyzing response times to calls to improve customer service. Here are the
recorded wait times (in minutes) during a peak period: [1, 2, 2, 3, 3, 3, 4, 5, 7, 10, 12, 15,
20]

Management wants to understand what the typical wait time is to identify areas for
immediate improvement and is seeking to establish a benchmark that is not influenced by
exceptionally long waits due to rare unforeseen events.

What is the value that best represents this time series, and how can it be used to guide the
strategic decisions of the call center?

Solution

To determine the central value of the wait times, we use the median. The median
represents the value that lies in the middle of the data when they are ordered. Let's order
the wait times: 1, 2, 2, 3, 3, 3, 4, 5, 7, 10, 12, 15, 20

Since the total number of observations is odd (13 data points), the median is the seventh
value in the ordered series. Thus, the median is 4 minutes.

The median is a useful statistical measure in this context because it is not affected by
significantly longer waits (outliers) that might skew the arithmetic mean. This can serve as
a benchmark for the typical wait time and guide the strategic decisions of the call center
towards reducing queues or making operational improvements for longer time frames.

Solution with Python

import scipy.stats as stats

times = [1, 2, 2, 3, 3, 3, 4, 5, 7, 10, 12, 15, 20]

# Calculate the median using scipy
median = stats.scoreatpercentile(times, 50)

print('The median of the wait times is:', median)

This code uses the scipy library, a Python library for scientific and statistical computing.

A list times is then defined, which contains the wait times, i.e., the recorded data from the
call center.

To calculate the median, we can use scoreatpercentile(), which takes the list of data as its
first argument and the percentile as the second argument (in this case, 50) and returns the
median as output.

The median is then printed to the screen with a simple print().

Using the median instead of the arithmetic mean is strategic because it is unaffected by
the presence of exceptionally high or low values (outliers), thus providing a value that
better represents the central tendency of the data distribution. This information can be
useful for business decisions, as it provides a realistic benchmark of the typical wait time in
the call center.
2.4 Percentiles

Percentiles are positional measures that divide an ordered set of data into 100 equal parts.
The percentile of order p indicates the value below which p% of the observations lie.

Given an ordered sample of n observations x_1, x_2, ..., x_n, the percentile of order p (with 0 < p
< 100) is the value P_p such that at least (p/100)·n of the observations are less than or equal
to P_p.

Depending on the number of observations, the percentile can be determined in the
following ways:

• If (p/100)·n is an integer, the percentile is the average of the two adjacent values (those in
positions (p/100)·n and (p/100)·n + 1).
• If (p/100)·n is not an integer, the percentile corresponds to the value of the next
observation.

Some percentiles have specific names and are widely used in statistics:

• 25th percentile (P25): called the first quartile (Q1).


• 50th percentile (P50): corresponds to the median (M).
• 75th percentile (P75): called the third quartile (Q3).

Let’s consider some properties:

• Percentiles provide information about the data distribution, helping to understand


dispersion and skewness.
• They are particularly useful for identifying extreme data and creating intervals, such as
the interquartile range (Q3 - Q1).
• They are often used in standardized test analyses, medical studies, and performance
analysis.

Thus, percentiles are essential tools for describing the distribution of a dataset and
analyzing its variability.
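
As a brief illustration (the data below is invented), the following sketch computes the quartiles P25, P50 and P75 with np.percentile and derives the interquartile range from them:

import numpy as np

# Invented data
data = [4, 8, 15, 16, 23, 42, 50, 61]

# First quartile, median and third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range
iqr = q3 - q1

print(q1, median, q3, iqr)

Note that software libraries may use slightly different interpolation rules between adjacent observations, so results can differ marginally from those obtained with the manual formulas above.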

Exercise 31. Analysis of Response Times in a Contact Center An insurance


company's contact center is evaluating the effectiveness of its operations to improve
customer satisfaction. In particular, management wants to understand the risk associated
with long response times during peak hours. A sample of response times (in seconds) was
collected for calls made during the past month. The collected data are as follows: 38, 42,
35, 39, 45, 40, 37, 44, 41, 36, 38, 39, 43, 34, 37, 42, 38, 40, 36, 37

The objective is to understand up to what value the response times extend within which
95% of calls fall, to better plan personnel and ensure high service standards.

Solution

To solve this problem, we calculate the 95th percentile of the response time sample. This
type of calculation allows us to understand up to what point the 95% of the sample data
lies, indicating a high response time risk above this value during peak hours. We sort the
sample in ascending order: 34, 35, 36, 36, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 41, 42, 42,
43, 44, 45.

The position of the 95th percentile is given by: P = 0.95 · (N + 1) = 0.95 · 21 = 19.95. The
95th percentile is therefore a weighted average of the 19th and the 20th observations: since
19.95 is very close to position 20, the result is pulled towards the 20th value. In the ordered
dataset, the 19th value is 44 and the 20th is 45, so the weighted value is 44 + 0.95 · (45 - 44)
= 44.95, which rounds to 45.

Therefore, in the context of personnel and resource planning, the response time that
covers 95% of the calls is 45 seconds.

Solution with Python

import numpy as np

response_times = [38, 42, 35, 39, 45, 40, 37, 44, 41, 36, 38, 39, 43, 34, 37, 42, 38, 40, 36, 37]

percentile_95 = np.percentile(response_times, 95)
percentile_95

The Python code uses the NumPy library, a fundamental library for scientific computing
with Python. It provides support for multidimensional arrays and high-level matrix objects.
In this case, we use the np.percentile function to calculate the desired percentile of a
dataset. The function np.percentile(array, percentile) will return the value below which the
specified percentage of the dataset falls.

The code proceeds as follows:

1. response_times is a list that contains the sample data of response times in seconds.
2. Using the statement np.percentile(response_times, 95) we calculate the 95th percentile.
This function analyzes the ordered vector of data and provides the value below which
95% of the observations fall. This allows us to identify a value that represents a
benchmark against which to plan the contact center staff.
3. Finally, percentile_95 contains the value of the 95th percentile of the response times,
which is about 44 seconds: numpy's default linear interpolation positions the percentile
slightly differently from the (N + 1) method used in the manual solution, hence the small
difference from the 45 seconds obtained above.
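
For readers who want to reproduce both figures, the following sketch (assuming NumPy 1.22 or later, where np.percentile accepts the method argument) compares the default linear interpolation with the 'weibull' method, which uses the p · (N + 1) positioning of the manual solution:

import numpy as np

response_times = [38, 42, 35, 39, 45, 40, 37, 44, 41, 36,
                  38, 39, 43, 34, 37, 42, 38, 40, 36, 37]

# Default linear interpolation (as in the solution code above): about 44.05
print(np.percentile(response_times, 95))

# 'weibull' method, based on the p * (N + 1) position: about 44.95
print(np.percentile(response_times, 95, method='weibull'))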

Exercise 32. Sales Performance Analysis of a Product A sportswear company is


analyzing the sales performance of one of its flagship products, a technical jacket for
outdoor activities, to understand how production might be optimized in the next season.
During the last quarter, weekly sales data (in units) were collected at the chain's physical
retail locations. The recorded weekly sales data are as follows: 120, 135, 140, 150, 155,
160, 165, 170, 175, 180, 185, 190, 195, 200.

The management intends to determine the number of weekly sales above which only 25%
of the weeks exist, to recognize strengths and weaknesses in the product’s production and
distribution chain.

Solution

To solve this problem, we need to find the 75th percentile of the weekly sales. Percentiles
are a statistical concept that indicates a value below which a given percentage of data
falls.

Here are the steps to follow:

1. Arrange the data in ascending order: 120, 135, 140, 150, 155, 160, 165, 170, 175, 180,
185, 190, 195, 200.
2. Determine the position of the percentile using the formula: P = (n - 1) • (percentile/100)
+ 1, where n is the number of data points and percentile is the target.
o In this case: P = (14 - 1) • (0.75) + 1 = 10.75
3. Since P is not an integer, take a linear combination of positions 10 and 11 of the sorted
data:
o Sale at position 10: 180, Sale at position 11: 185.
o Final calculation: Sale_75 = 180 + 0.75 · (185 - 180) = 183.75

Round up or down at the company's discretion based on the units sold, thus obtaining
approximately 184 weekly sales.

The technical jacket will have weekly sales equal to or exceeding 184 units in 25% of the
weeks. This value allows the company to better understand sales behavior and plan
production and distribution more effectively.

Solution with Python

import numpy as np

# Weekly sales data
sales = np.array([120, 135, 140, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200])

# Calculate the 75th percentile
percentile_75 = np.percentile(sales, 75)

print(f"The number of sales at the 75th percentile is: {percentile_75}")

In this code, we are using the numpy library.

Here are the steps:

• The sales vector contains the given weekly sales data.


• The np.percentile function from numpy is used to calculate the 75th percentile of the
sales vector. By specifying the parameter 75, we indicate that we want to find the value
below which 75% of the weekly sales data falls.
• Finally, we print the calculated value.

This operation allows us to quickly and accurately obtain percentiles of the sales data,
which can be extremely useful for data analysis, helping the company understand sales
behavior and make informed decisions about production and distribution of the product.

Exercise 33. Customer Satisfaction Analysis in the Hospitality Industry A major


hotel chain is looking to improve customer satisfaction and has collected data on customer
satisfaction scores (on a scale of 1 to 10) for all its facilities in the last month. The recorded
scores are as follows: 6,7,9,8,7,6,5,8,9,10,7,8,6,5,7,9,10,8,7,6,8,9,9,7,6

The management of the chain wants to identify the score below which 25% of the ratings
fall, to pinpoint the hotels that require immediate improvement in the services offered.

Solution

To solve this problem, the goal is to calculate the 25th percentile of the satisfaction score
data. First, the data sample is ordered in ascending order: 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7,
8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10

The percentile formula is P = (n + 1) · p/100, where n is the total number of observations

and p is the desired percentile. For the 25th percentile, we have: n = 25 (total number of
scores) and p = 25 (percentage). Substituting the values into the percentile formula, we get:
P = (25 + 1) · 25/100 = 6.5
Thus, the 25th percentile lies between the 6th and 7th values in the ordered data set, which
are both equal to 6, so the 25th percentile is 6.

The satisfaction score below which 25% of the ratings fall is 6. This calculation is crucial for
the hotel chain management to identify the facilities whose satisfaction score is in the
lowest quartile, requiring priority corrective actions.

Solution with Python

from scipy import stats

satisfaction_scores = [6, 7, 9, 8, 7, 6, 5, 8, 9, 10, 7, 8, 6, 5, 7, 9, 10, 8, 7, 6, 8, 9, 9, 7, 6]

p25 = stats.scoreatpercentile(satisfaction_scores, 25)
print(f"The 25th percentile is {p25}")

In this code, we use the scipy library.

Let's see the various steps:

1. The scoreatpercentile function of scipy.stats calculates the value below which a certain
percentage of the underlying data falls.
2. Our dataset satisfaction_scores is passed to the function along with the percentile we
are interested in (25 in this case).
3. We print the 25th percentile. This indicates the score below which 25% of the data lies
in the context of satisfaction scores, useful for identifying hotels that need immediate
improvements.

By using scipy, the code is not only readable and concise but also efficient, making it
suitable for more complex statistical analyses.
2.5 Chebyshev's Inequality

Chebyshev's inequality is an important result in the theory of probability that provides a


bound on the probability that a random variable lies within a certain number of standard
deviations from its mean. It is valid for any probability distribution with a finite variance.

Let X be a random variable with mean μ and variance σ². Then, for any k > 1,

P(|X - μ| ≥ kσ) ≤ 1/k²

which is equivalent to writing:

P(|X - μ| < kσ) ≥ 1 - 1/k²

This inequality states that the probability that X deviates from its mean by at least k times
the standard deviation is at most 1/k².

Let's look at some characteristics:

• Chebyshev's inequality is general, as it applies to any distribution, unlike the empirical


rule (which only applies to normal distributions).
• It guarantees that at least a fraction 1 - 1/k² of the data falls within the interval [μ -
kσ, μ + kσ].
• For example:
o For k = 2, at least 75% of the values lie between μ - 2σ and μ + 2σ.
o For k = 3, at least 88.89% of the values lie between μ - 3σ and μ + 3σ.

Chebyshev's inequality is used for:

• Establishing concentration bounds for unknown distributions.


• Analyzing variability in data without making assumptions about their distribution.

Since it is valid for any distribution with a finite variance, it is a fundamental tool in
statistics and probability theory, especially in business contexts where it is necessary to
manage the possibility of extreme events.
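
As a quick empirical check (with simulated data chosen arbitrarily, not taken from any exercise), the following sketch draws values from a markedly non-normal distribution and verifies that the observed proportions within k standard deviations respect Chebyshev's bound:

import numpy as np

# Simulated data from an exponential distribution (an arbitrary, non-normal choice)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)

mu = x.mean()
sigma = x.std()

for k in (2, 3):
    observed = np.mean(np.abs(x - mu) < k * sigma)
    guaranteed = 1 - 1 / k**2
    print(f"k={k}: observed fraction {observed:.3f}, guaranteed at least {guaranteed:.3f}")

The observed fractions are well above the guaranteed minimums, as expected: Chebyshev's inequality is a conservative bound that holds for any distribution.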

Exercise 34. Analysis of Sales Fluctuations of Mass Consumer Products The


company ABC S.p.A., active in the mass consumer goods sector, wants to monitor the
consistency of the monthly sales of its main products. From the sales history of the last 5
years, analysts have calculated that on average, 10,000 units of product X are sold each
month, with a standard deviation of 2,500 units.

The board of directors is interested in knowing how frequently significant fluctuations from
the average can occur since high variations in sales can influence supply decisions and
marketing strategies. In particular, they want to estimate the absolute minimum and
maximum sales threshold that can be expected in 2024, such that 90% of the months fall
within this range.

Knowing that no particular assumptions can be made about the data distribution, what
estimate might the statistics team of ABC S.p.A. provide?
Solution

To address the board of directors' question, we can use a useful principle when the data
distribution is not known: the Chebyshev's inequality.

Chebyshev's inequality states that for any data distribution with mean μ and standard
deviation σ, the proportion of observations falling within k standard deviations from the
mean is at least 1 - 1/k².

In our case, we have:

• μ = 10,000 units


• σ = 2,500 units
• We want at least 90% (p = 0.9) of the months to fall within the desired range.

We need to find k such that:


1 - 1/k² ≥ 0.9

This resolves to:

1/k² ≤ 0.1

k² ≥ 10

k ≥ √10 ≈ 3.162

This means that, to ensure at least 90% of the months fall within this range, we must
consider an interval of 3.162 standard deviations from the mean.

Now, let’s calculate the range:


μ ± k·σ = 10,000 ± 3.162 · 2,500

= 10,000 ± 7,905 = [2,095, 17,905]

Therefore, the statistics team can tell that for 2024, in order to align with the board's
expectations, it can be expected with at least 90% certainty that the monthly sales of this
product will fluctuate approximately between 2,095 and 17,905 units.

Solution with Python

import numpy as np

# Known parameters
standard_deviation = 2500
mean = 10000

# Calculation of k
target_percentage = 0.9
k = np.sqrt(1 / (1 - target_percentage))  # Chebyshev's formula

# Calculation of the confidence interval
lower_interval = mean - k * standard_deviation
upper_interval = mean + k * standard_deviation

# Result
(lower_interval, upper_interval)

In the Python code above, we are mainly using the library numpy for the square root calculation.

Details:
1. We define mean and standard deviation, which represent respectively the historical
average sales and standard deviation.
2. The key formula derived from Chebyshev's inequality is k = √(1/(1 - p)). The target
percentage is set to 0.9 because we want to cover at least 90% of the fluctuations.
3. Using the mean and the calculated value of k, the lower and upper bounds of the
interval are calculated as mean - k * standard_deviation and mean + k * standard_deviation.
This represents the range within which we expect the monthly sales to fall with 90%
certainty.

This statistical approach adapts well even when the distribution is not normal, making it
versatile for various practical applications.

Exercise 35. Monitoring Inventory Fluctuations A tech company, TechWare, is looking


to manage fluctuations in their inventory of graphics cards to meet the demand in the
highly dynamic computer hardware market. Analysts have collected data over the past few
years showing an average monthly stock level of 5000 units, with a standard deviation of
1200 units. The operations managers are concerned about the impact of inventory
variations on production capacity and storage costs.

They want to identify a monthly inventory safety range such that at least 85% of the time,
the inventory levels remain within this range. Using historical data, what inventory range
should TechWare consider?

Solution

To answer this question, we apply a technique that allows us to estimate how many
observations lie within a certain number of standard deviations from the mean, regardless
of the actual distribution of the data.

For TechWare, the mean μ of the monthly inventory is 5000 units and the standard
deviation σ is 1200 units. We are interested in a range that covers at least 85% of the
inventory levels.

We apply the definition:

P(|X - μ| < kσ) ≥ 1 - 1/k²

where k is the number of standard deviations from the mean.

We require that:

1 - 1/k² = 0.85

1/k² = 0.15

k² = 1/0.15 ≈ 6.67

k ≈ 2.58
This means that at least 85% of the data will be within 2.58 standard deviations from the
mean.

We calculate the range:


Lower = μ - kσ = 5000 - 2.58 · 1200 = 1904

Upper = μ + kσ = 5000 + 2.58 · 1200 = 8096

Therefore, TechWare should predict that the inventory will be between 1904 and 8096 units
for 85% of the months, based on historical data and without assuming a specific
distribution.

Solution with Python

# Given data
mean = 5000
sigma = 1200
probability = 0.85

# Calculate the k value for which 1 - 1/k**2 = probability
k = (1 / (1 - probability)) ** 0.5

# Calculate the lower and upper limits
lower_limit = mean - k * sigma
upper_limit = mean + k * sigma

# Display the results
(lower_limit, upper_limit)

Let's look at the details of this script:

1. We define the mean and standard deviation of inventories provided by the problem,
along with the required probability of 85%.
2. We use the derived formula 1 - 1/k² = 0.85 to calculate k, whose value is
k = (1 / (1 - probability)) ** 0.5. Applying this formula might lead to results with more
decimal points than reported in the solution, resulting in slight discrepancies.
3. With the k value, we calculate the lower and upper bounds of the safety range using
the formulas:
o Lower = μ - kσ
o Upper = μ + kσ
4. Finally, the code displays the results, corresponding to the range within which the
inventory will fall 85% of the time.

This approach does not assume that the data is normally distributed, making it more
general and applicable even when the data distribution is unknown.

Exercise 36. Management of Profit Margins The ItalFood company, specialized in the
production and distribution of food products, wants to analyze the variability of monthly
profit margins on its main products. From the analysis of historical data of the last 3 years,
it emerges that the average profit margin is 15000 euros, with a standard deviation of
3000 euros.

The management is concerned about the impacts that overly variable profit margins can
have on investment decisions and wants to establish a safety margin. Specifically, they
want to know what the range of monthly profit margins could be such that it can be stated
with confidence that at least 95% of the months of the next year could have a profit margin
within this interval. Provide an estimate based on historical data.

Solution
To solve this problem, we apply a fundamental statistical principle to estimate the safety
margin: Chebyshev's inequality, P(|X - μ| < kσ) ≥ 1 - 1/k², where k is the number of standard
deviations from the mean.

We know that the average profit margin is μ = 15000 euros and the standard deviation is σ
= 3000 euros. We want to estimate an interval such that at least 95% of the profit margins
fall within it.

The condition 1 - 1/k² = 0.95 indicates how many standard deviations σ we need to consider.
Solving the equation for k, we obtain 1/k² = 0.05 and therefore k² = 20. Consequently,
k = √20 ≈ 4.47.

The desired interval will therefore be:

μ - kσ = 15000 - 4.47 · 3000 ≈ 1590 euros

μ + kσ = 15000 + 4.47 · 3000 ≈ 28410 euros

Therefore, it can be stated with at least 95% confidence that the monthly profit margin for
ItalFood will be between 1590 euros and 28410 euros.

Solution with Python

import numpy as np

# Problem data
mu = 15000    # Average profit margin
sigma = 3000  # Standard deviation

# Calculation of the number of standard deviations (k)
k = np.sqrt(1 / (1 - 0.95))

# Calculate the range of profit margins
lower_bound = mu - k * sigma
upper_bound = mu + k * sigma

# Results
range_interval = (lower_bound, upper_bound)
range_interval

In this code, we are calculating a safety interval for the monthly profit margins.

Here are the various steps:

• Import numpy for numerical operations.


• Define the mean μ and the standard deviation σ based on the provided data.
• Determine the number k of standard deviations necessary to cover 95% of the data
using the formula derived from Chebyshev's theorem.
• Calculate the profit margin range using μ - kσ for the lower bound and μ + kσ for the
upper bound.

The calculated interval thus provides a range of profit margins that, with at least 95%
confidence, will contain most of the future monthly profit data. Again, the use of more
decimal places might lead to slight discrepancies in the result compared to what was
provided in the original solution.
2.6 Identification of Outliers with IQR Method

Outliers are anomalous values that significantly deviate from the majority of the data in a
set. A common method for detecting them is the use of the interquartile range (IQR).

The interquartile range is defined as:


IQR = Q3 - Q1

where:

• Q1 (first quartile) is the value that separates the first 25% of the data.
• Q3 (third quartile) is the value that separates the first 75% of the data.

The IQR measures the central dispersion of the dataset, ignoring the extremes.

A value x is considered an outlier if it lies outside the range:

[Q1 - 1.5 · IQR, Q3 + 1.5 · IQR]

The advantages of the IQR method are several.

• It does not rely on assumptions about the data distribution.


• It is more robust than the standard deviation for data with skewed distributions.
• It is easy to interpret and implement in exploratory data analysis.

The IQR method is therefore an effective tool for detecting anomalous values and
improving the quality of statistical analysis.
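
As a minimal illustration (with made-up data), the following sketch applies the IQR rule to flag values outside [Q1 - 1.5 · IQR, Q3 + 1.5 · IQR]:

import numpy as np

# Made-up data with one clearly anomalous value
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # only the value 102 falls outside the interval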

Exercise 37. Quarterly Sales Analysis to Identify Anomalies In a company that


produces construction materials, the data analysis team is reviewing the sales data of the
last two years for each quarter. The quarterly data, in thousands of units, for various
products are as follows:

Quarter   Product A   Product B   Product C
Q1        23          45          12
Q2        26          47          15
Q3        28          49          11
Q4        29          52          14
Q1        31          53          37
Q2        34          54          16
Q3        30          55          18
Q4        35          75          17

Table 2.1: Product Sales.

The team is tasked with identifying which numbers provided can be considered significant
enough to cause concern or require further investigation by management.

Use the provided data to identify any potential anomalies and indicate the calculation
method and the values that are considered out of the ordinary for each product.

Solution

For each product, we calculate the first quartile (Q1) and the third quartile (Q3) to
determine what we should consider as an anomaly. We proceed with the calculation of the
interquartile range (IQR = Q3 - Q1) and set limits to identify outliers (i.e., Q1 - 1.5·IQR and
Q3 + 1.5·IQR). Values outside these limits can indeed be defined as outliers.

Product A:

• Ordered data: 23, 26, 28, 29, 30, 31, 34, 35
• Q1 = 27.5
• Q3 = 31.75
• IQR = Q3 - Q1 = 4.25
• Limits: [Q1 - 1.5·IQR, Q3 + 1.5·IQR] = [21.125, 38.125]
• There are no values outside this interval.

Product B:

• Ordered data: 45, 47, 49, 52, 53, 54, 55, 75
• Q1 = 48.5
• Q3 = 54.25
• IQR = Q3 - Q1 = 5.75
• Limits: [Q1 - 1.5·IQR, Q3 + 1.5·IQR] = [39.875, 62.875]
• Identified outlier: 75

Product C:

• Ordered data: 11, 12, 14, 15, 16, 17, 18, 37
• Q1 = 13.5
• Q3 = 17.25
• IQR = Q3 - Q1 = 3.75
• Limits: [Q1 - 1.5·IQR, Q3 + 1.5·IQR] = [7.875, 22.875]
• Identified outlier: 37

For Products B and C, the sales of 75 in the fourth quarter and 37 in the first quarter of the
second year respectively represent statistical anomalies and warrant further analysis to
understand the underlying causes. This procedure is based on the analysis of limits
calculated using the interquartile range method for identifying outliers.

Solution with Python

import numpy as np
from scipy.stats import iqr

# Quarterly sales data
sales = {
    'Product A': [23, 26, 28, 29, 31, 34, 30, 35],
    'Product B': [45, 47, 49, 52, 53, 54, 55, 75],
    'Product C': [12, 15, 11, 14, 37, 16, 18, 17]
}

# Function to calculate and identify outliers
def find_outliers(data, product_name):
    sorted_data = sorted(data)
    Q1 = np.percentile(sorted_data, 25)
    Q3 = np.percentile(sorted_data, 75)
    IQR = iqr(sorted_data)
    limits = [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    outliers = [x for x in data if x < limits[0] or x > limits[1]]
    print(f"{product_name}:")
    print(f" - Ordered data: {sorted_data}")
    print(f" - Q1 = {Q1}")
    print(f" - Q3 = {Q3}")
    print(f" - IQR = {IQR}")
    print(f" - Limits: {limits}")
    print(f" - Outliers: {outliers}")

for product, data in sales.items():
    find_outliers(data, product)

The provided Python code identifies outliers in the quarterly sales of each product.

• The scipy library is used, particularly the stats module, for directly calculating the IQR
(interquartile range); numpy is used to calculate the percentiles representing the
necessary quartiles (Q1 and Q3).
• Quarterly data for each product is represented in a dictionary. Each product is a key
with a list of quarterly sales as values.
• There is a function that takes the data for a product and calculates the quartiles Q1
and Q3 using np.percentile. The IQR is calculated using scipy.stats.iqr. The limits for
identifying outliers are defined as [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. The outliers are
identified by comparing the data against these limits.
• The data is ordered to facilitate the correct calculation of percentiles, although it is not
strictly necessary. Each intermediate step, from the ordered data to the identified
limits, is printed for transparency and manual verification.

The final result includes informative output on the presence of any outliers for each
product.

Exercise 38. Analysis of Employee Overtime Hours In an IT service company, the


human resources department is analyzing the overtime hours worked by employees over
the last six months to determine if there are any significant irregularities. The monthly
overtime hours for ten employees are as follows:

Employee Jan Feb Mar Apr May Jun

1 5 8 10 3 2 12

2 7 9 11 6 3 15

3 6 7 13 8 4 2

4 9 11 14 9 6 0

5 10 13 12 7 5 0

6 4 10 12 3 2 25

7 8 8 6 9 4 8

8 10 5 8 12 1 10

9 3 2 1 4 4 3

10 7 9 8 6 5 2
Table 2.2 : Employee Overtime

The task is to identify which employees are working disproportionate amounts of overtime
in any given month, and if any anomalies might require preventive measures to avoid work
overload.

Solution

To solve this problem, we can explore the distributions of monthly overtime hours to
identify outliers. We will use a method involving the calculation of the interquartile range
(IQR).

1. For each month, order the overtime hours and calculate the first quartile (Q1), the third
quartile (Q3), and then the interquartile range (IQR) as follows:
IQR = Q3 - Q1
2. Any value below Q1 - 1.5·IQR or above Q3 + 1.5·IQR is considered an outlier.
3. Take January for example: Overtime hours: [5, 7, 6, 9, 10, 4, 8, 10, 3, 7]
o Ordered: [3, 4, 5, 6, 7, 7, 8, 9, 10, 10]
o Q1 = 5.25, Q3 = 8.75
o IQR = Q3 - Q1 = 3.5
o Outlier limits: [Q1 - 1.5·IQR, Q3 + 1.5·IQR] = [0, 14]
No employee worked a significant number of hours outside this range, so there are no
significant anomalies for January.
4. Use the same approach for each month.

Solution with Python

import numpy as np
from scipy.stats import iqr

# Monthly overtime hours for ten employees
overtime_hours = {
    'Jan': [5, 7, 6, 9, 10, 4, 8, 10, 3, 7],
    'Feb': [8, 9, 7, 11, 13, 10, 8, 5, 2, 9],
    'Mar': [10, 11, 13, 14, 12, 12, 6, 8, 1, 8],
    'Apr': [3, 6, 8, 9, 7, 3, 9, 12, 4, 6],
    'May': [2, 3, 4, 6, 5, 2, 4, 1, 4, 5],
    'Jun': [12, 15, 2, 0, 0, 25, 8, 10, 3, 2]
}

# Function to identify outliers in a list of monthly hours
def identify_outliers(monthly_hours):
    Q1 = np.percentile(monthly_hours, 25)
    Q3 = np.percentile(monthly_hours, 75)
    IQR_value = iqr(monthly_hours)
    lower_bound = Q1 - 1.5 * IQR_value
    upper_bound = Q3 + 1.5 * IQR_value
    # Pair each anomalous value with the employee's number (1-based index)
    outliers = [(index + 1, hours) for index, hours in enumerate(monthly_hours)
                if hours < lower_bound or hours > upper_bound]
    return outliers

# Apply the function to each month
outliers_per_month = {month: identify_outliers(hours) for month, hours in overtime_hours.items()}

# Print the results
for month, outliers in outliers_per_month.items():
    if outliers:
        print(f"Outliers in the month of {month}: {outliers}")
    else:
        print(f"No outliers in the month of {month}")

Let's look at the code in detail:

• The function identify_outliers takes a list of hours for each month, calculates the
quartiles and the IQR, then determines the lower and upper limits, thus identifying values
that significantly deviate from the norm.
• Outliers are identified if they are below Q1 - 1.5·IQR or above Q3 + 1.5·IQR. This
method is effective because it identifies statistically significant deviations from the
'average' of other data.
• For each month, the code prints employees with anomalous overtime or states the
absence of such anomalies.
Exercise 39. Company Energy Efficiency Analysis An energy company is assessing the
energy consumption efficiency of different sections of their production facilities. Corporate
auditors have collected weekly data, expressed in thousands of kWh, related to
consumption for four different sections over the past two months.

Week Section 1 Section 2 Section 3 Section 4

1 105 98 115 102

2 110 99 120 104

3 108 105 122 106

4 111 101 117 108

5 109 110 121 103

6 113 120 129 107

7 115 104 124 110

8 116 125 127 105

Table 2.3: Energy Consumption

The team is tasked with determining which data among these are significantly different
from the others, suggesting a need for more in-depth analysis. Identify the standout values
and describe how to interpret them.

Solution

To identify outliers in the energy consumption of the sections, we apply the interquartile
range (IQR) method.

Here are the steps:

1. Identify the first quartile Q1 and the third quartile Q3 for the consumption of each
section.
2. Calculate the IQR: IQR = Q3 - Q1.
3. A value is considered an outlier if it is less than Q1 - 1.5·IQR or greater than Q3 + 1.5·
IQR.

Any value that falls outside the identified IQR range may indicate unusual energy
consumption and would need investigation to understand the causes, which could be
technical or operational. This approach helps maintain energy efficiency and optimization
within the company.

Solution with Python

import numpy as np
import pandas as pd
from scipy.stats import iqr

# Weekly consumption data for each section
data = {
    'Week': [1, 2, 3, 4, 5, 6, 7, 8],
    'Section 1': [105, 110, 108, 111, 109, 113, 115, 116],
    'Section 2': [98, 99, 105, 101, 110, 120, 104, 125],
    'Section 3': [115, 120, 122, 117, 121, 129, 124, 127],
    'Section 4': [102, 104, 106, 108, 103, 107, 110, 105]
}

# Create a DataFrame from the data
consumptions_df = pd.DataFrame(data)
outliers = {}

# Calculate and identify outliers for each section using the IQR method
for section in data.keys():
    if section != 'Week':
        Q1 = np.percentile(consumptions_df[section], 25)
        Q3 = np.percentile(consumptions_df[section], 75)
        IQR_value = iqr(consumptions_df[section])

        # Determine bounds for outliers
        lower_bound = Q1 - 1.5 * IQR_value
        upper_bound = Q3 + 1.5 * IQR_value

        # Identification of outliers
        section_outliers = consumptions_df[
            (consumptions_df[section] < lower_bound) | (consumptions_df[section] > upper_bound)
        ][section]
        outliers[section] = section_outliers.tolist()

# Results
print(outliers)

In this script, we import the necessary libraries: numpy for numerical operations, pandas for handling the tabular data, and scipy.stats for the interquartile range.

The DataFrame is constructed from the weekly consumption table. For each section, we
calculate the first quartile Q1 and the third quartile Q3 using numpy.percentile, while the
interquartile range is calculated with scipy.stats.iqr.

We create bounds to consider the values as outliers: any value below Q1 - 1.5·IQR or
above Q3 + 1.5·IQR is considered an outlier. We then use pandas conditional indexing to
extract these values and store them in a dictionary called outliers, which is then printed at
the end of the script.

This analysis allows identifying unusual energy consumption in each section for further
investigations.
2.7 Pearson Correlation Coefficient

The Pearson correlation coefficient (or Pearson's linear correlation coefficient) is a measure
of the linear relationship between two numerical variables. It indicates how strongly and in
what direction two variables are correlated.

Given a set of n pairs of data (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the Pearson correlation coefficient,
denoted as r, is defined as:

r = Σᵢ(xᵢ - x̄)(yᵢ - ȳ) / √( Σᵢ(xᵢ - x̄)² · Σᵢ(yᵢ - ȳ)² )

where:

• x̄ and ȳ are the means of the variables x and y.


• The numerator represents the covariance between x and y.
• The denominator is the product of the standard deviations of x and y.

The Pearson correlation coefficient takes values in the range [-1, 1], which are interpreted
as follows:

• r = 1 → Perfectly positive correlation: as x increases, y increases proportionally.


• r = -1 → Perfectly negative correlation: as x increases, y decreases proportionally.
• r = 0 → No linear correlation: there is no linear relationship between x and y.
• 0 < r < 1 → Positive correlation: as x increases, y tends to increase.
• -1 < r < 0 → Negative correlation: as x increases, y tends to decrease.

Here are some main properties of the Pearson correlation coefficient:

• r is dimensionless, meaning it does not depend on the unit of measurement of the


variables.
• It is symmetric: the correlation between x and y is the same as that between y and x.
• It only measures linear correlation: if the relationship between the variables is non-
linear, r can be close to zero even if there is a relationship between x and y.

The Pearson index is widely used in:

• Statistics and data science to analyze relationships between variables.


• Economics and finance to study the relationship between economic indicators.
• Social sciences to evaluate correlations between observed phenomena.
• Medicine and biology to check associations between clinical variables.

It's important to remember that correlation does not imply causation: a high value of r does
not mean that one variable necessarily causes the other.
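
As a quick, hedged illustration of the definition above, the sketch below computes r both directly from the formula with numpy and with scipy.stats.pearsonr; the small dataset is invented purely for this example, and the two results coincide.

import numpy as np
from scipy.stats import pearsonr

# Small illustrative dataset (invented for this example)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson's r from the definition: sum of products of deviations over
# the square root of the product of the sums of squared deviations
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

# The same quantity computed with scipy
r_scipy, _ = pearsonr(x, y)

print(r_manual, r_scipy)  # identical up to floating-point precision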

Exercise 40. Analysis of the relationship between sales and advertising An e-


commerce company wants to understand the relationship between weekly advertising
expenses and the sales generated through its online campaigns. Data was collected over
10 weeks, concerning advertising expenses in thousands of euros and sales in thousands of
units generated. The collected data is as follows:
Week Advertising Expenses (euros) Sales (units)

1 20 150

2 25 160

3 30 220

4 35 240

5 40 280

6 25 170

7 30 200

8 40 260

9 45 300

10 50 320

Table 2.4: Expenses and sales.

Calculate the value of the statistical measure that quantifies the strength of the linear
relationship between the advertising expenses and sales. Interpret the result obtained.

Solution

To solve this problem, we use Pearson's correlation coefficient, which measures the
strength and direction of the linear relationship between two quantitative variables.

Let's see the various steps:

1. Calculate the covariance between the two variables:


Cov(Advertising, Sales) = Σ((Expensesᵢ - Mean Expenses) · (Salesᵢ - Mean Sales)) / n

2. Calculate the standard deviation of both variables:
o Standard deviation of advertising expenses
o Standard deviation of sales
3. Calculate the Pearson correlation coefficient (r):

r = Cov(Advertising, Sales) / (Standard deviation Advertising · Standard deviation Sales)
4. Interpret the result. If r is close to 1, it signifies a strong positive relationship, while a
value close to 0 indicates a weak relationship.

After performing the calculations, let's assume r = 0.98. This value indicates a strong
positive correlation between advertising expenses and sales, suggesting that as
advertising expenses increase, sales tend to increase proportionally. This implies that the
company's advertising activities have a significant influence on sales.

Solution with Python

import numpy as np
from scipy.stats import pearsonr

# Data
advertising_expenses = np.array([20, 25, 30, 35, 40, 25, 30, 40, 45, 50])
sales = np.array([150, 160, 220, 240, 280, 170, 200, 260, 300, 320])

# Calculate Pearson correlation coefficient
pearson_corr, _ = pearsonr(advertising_expenses, sales)

# Output
print(f"Pearson correlation coefficient: {pearson_corr}")

In this code, we use the scipy library, which provides the pearsonr function for calculating the coefficient directly.

Let's see the various steps:

1. We import numpy for easily working with vectors and matrices, and pearsonr from the
scipy.stats library to calculate the Pearson correlation coefficient.
2. Initially, we define two numpy arrays: one for the weekly advertising expenses and one
for the sales generated.
3. We use the pearsonr function provided by scipy.stats, which calculates the Pearson
correlation coefficient. This function returns two values: the first is the desired
correlation coefficient, the second is the p-value (which in this case we are not
interested in) of a hypothesis test whose null hypothesis is the absence of a correlation.
4. Finally, we print the calculated value of the Pearson correlation coefficient, which
indicates the strength and direction of the linear relationship between advertising
expenses and sales.

The use of scipy greatly simplifies the calculation thanks to the pearsonr function, which
allows for quickly obtaining the correlation coefficient without having to manually
implement the mathematical formula.

Exercise 41. Relationship between Training Hours and Employee Productivity A


consulting company recently implemented a training program for its employees and wants
to understand the impact of this program on productivity. Data were collected from seven
departments of the company regarding the average number of monthly training hours per
employee and average monthly productivity per employee, measured in output units. The
collected data are as follows:

Department Training Hours (average) Productivity

A 10 200

B 15 220

C 22 240

D 25 260

E 31 280

F 35 300

G 40 320
Table 2.5: Productivity and Training Hours.

Determine the statistical measure that quantifies the strength of the relationship between
training hours and productivity. Provide an interpretation of the results.

Solution

In this exercise, we use Pearson's correlation coefficient to examine the linear relationship
between the average number of training hours and average productivity per employee.
Pearson's coefficient, denoted by r, is calculated using the formula:
r = Σ(Xᵢ - X̄)(Yᵢ - Ȳ) / √( Σ(Xᵢ - X̄)² · Σ(Yᵢ - Ȳ)² )

Where Xᵢ and Yᵢ are our data on training hours and productivity, respectively, and X̄ and Ȳ
are the means of these data.

Calculating this coefficient for the given data, we obtain r = 0.997. This value, close to 1,
suggests that there is a strong positive correlation between training hours and productivity.
This implies that, generally, a greater amount of training hours is associated with an
increase in productivity in the analyzed departments. However, it's important to note that
this coefficient only measures correlation, not causation, and other factors could influence
the observed relationship.

Solution with Python

import numpy as np
from scipy.stats import pearsonr

# Data
training_hours = np.array([10, 15, 22, 25, 31, 35, 40])
productivity = np.array([200, 220, 240, 260, 280, 300, 320])

# Calculating Pearson's correlation
r, _ = pearsonr(training_hours, productivity)

# Result
r

In the provided code, we use the scipy library, particularly the stats module, to calculate the coefficient.

Let's explore the steps of the code:

1. We import numpy for convenient numerical array operations and pearsonr from
scipy.stats to calculate Pearson’s correlation coefficient.
2. The data for training hours and productivity are stored in two numpy arrays. These
arrays represent the average training hours and the average output for each
department, respectively.
3. We use the function pearsonr from scipy.stats, which takes two arrays and returns
Pearson's correlation coefficient and the associated p-value. In this context, we are
primarily interested in the correlation coefficient r, which quantifies the linear
relationship.
4. The value of r is printed on the screen, showing the strength of the correlation.

In this case, Pearson's correlation coefficient is calculated to show how strong the
relationship is between the two analyzed variables. The result is close to 1, indicating a
strong positive correlation between training hours and average productivity. However, it's
important to keep in mind that correlation does not imply causation.

Exercise 42. Relationship between Customer Satisfaction and Contract Renewal A


mobile phone service company wants to analyze the link between customer satisfaction
levels regarding the services offered and the annual contract renewal rate. Data was
collected from 8 customers through a survey, where satisfaction is rated on a scale from 1
to 10, and contract renewal information is expressed as a percentage. The collected data is
as follows:

Customer Satisfaction Renewal (%)

1 4 40

2 5 50

3 7 70

4 8 80

5 6 60

6 3 30

7 9 90

8 10 100

Table 2.6: Satisfaction Level and Renewal Rate.

Calculate the measure that allows understanding the relationship between satisfaction and
contract renewal. Interpret the significance of the obtained result.

Solution

To analyze the relationship between customer satisfaction levels and contract renewal
rates, we use the Pearson correlation coefficient. This coefficient quantifies the strength
and direction of the linear relationship between two numerical variables.

We calculate the mean of satisfaction (x̄) and the mean of renewal (ȳ):

x̄ = (1/8)(4 + 5 + 7 + 8 + 6 + 3 + 9 + 10) = 6.5

ȳ = (1/8)(40 + 50 + 70 + 80 + 60 + 30 + 90 + 100) = 65

Next, we calculate the sum of the products of the deviations of the individual observations
from their mean:

Σᵢ₌₁⁸ (xᵢ - x̄)(yᵢ - ȳ) = (4 - 6.5)(40 - 65) + ... + (10 - 6.5)(100 - 65) = 420

And now we calculate the squared deviations for each variable:


Σᵢ₌₁⁸ (xᵢ - x̄)² = (4 - 6.5)² + ... + (10 - 6.5)² = 42
Σᵢ₌₁⁸ (yᵢ - ȳ)² = (40 - 65)² + ... + (100 - 65)² = 4200

Finally, the Pearson correlation coefficient r is calculated as follows:


r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² ) = 420 / √(42 · 4200) = 420 / 420 = 1
A value of r = 1 indicates a perfect positive correlation. This means there is a direct and
strong linear relationship between customer satisfaction and contract renewal rate: as
satisfaction increases, contract renewal increases proportionally.

Solution with Python

import numpy as np
from scipy.stats import pearsonr

# Collected data
satisfaction = np.array([4, 5, 7, 8, 6, 3, 9, 10])
renewal = np.array([40, 50, 70, 80, 60, 30, 90, 100])

# Calculation of Pearson's correlation
r, _ = pearsonr(satisfaction, renewal)

r

In this code, the libraries numpy and scipy are used to calculate Pearson's correlation coefficient.

Let's look at the details:

• We import numpy and scipy.stats.pearsonr. numpy is used to work with arrays, which are
essential for managing and calculating numerical data. The pearsonr function from the
scipy library allows for directly calculating the Pearson correlation coefficient without
the need for manual implementation.
• The satisfaction data and renewal rate are stored in two numpy arrays: satisfaction and
renewal.
• The pearsonr function is applied to the two arrays to calculate the Pearson correlation
coefficient r and the p-value (ignored in this exercise). The coefficient r quantifies the
strength and direction of the linear relationship between the two arrays.
• By returning r, we obtain the measure of correlation between customer satisfaction and
the renewal rate. In this case, the calculated value is 1, indicating a perfect positive
correlation.
2.8 Spearman's Rank Correlation Coefficient

Spearman's rank correlation coefficient (or Spearman's rho) is a measure of the strength
and direction of the monotonic relationship between two variables. Unlike Pearson's
coefficient, which measures linear relationships, Spearman's coefficient is more suitable for
data that do not follow a linear distribution but may still have a monotonic relationship (i.e.,
always increasing or always decreasing).

Spearman's rank correlation coefficient, denoted by ρₛ (or simply rₛ), is defined as:

ρₛ = 1 - (6 Σ dᵢ²) / (n(n² - 1))

where:

• dᵢ = R(xᵢ) - R(yᵢ) is the difference between the ranks of each data pair (xᵢ, yᵢ). The rank is
a sequential number that follows the order of values (the lowest value will have rank 1,
the next will have rank 2, and so on).
• R(xᵢ) and R(yᵢ) are the ranks of the values xᵢ and yᵢ, respectively.
• n is the number of observations in the dataset.

Spearman's rank correlation coefficient takes values in the range [-1, 1]:

• ρₛ = 1 → Perfect positive correlation: the ranks of x and y are in perfect ascending


order.
• ρₛ = -1 → Perfect negative correlation: the ranks of x and y are in perfect reverse order.
• ρₛ = 0 → No monotonic correlation: there is no monotonic relationship between x and y.
• 0 < ρₛ < 1 → Positive monotonic correlation: as x increases, y also tends to increase.
• -1 < ρₛ < 0 → Negative monotonic correlation: as x increases, y tends to decrease.

Here are some of its properties:

• ρₛ only measures the monotonic relationship, making it less sensitive to non-linear
deviations compared to Pearson's coefficient.
• It is dimensionless, meaning it does not depend on the units of measurement of the
variables.
• It is symmetric, meaning the correlation between x and y is the same as between y and
x.
• It is less affected by outliers compared to Pearson's coefficient, as it is based on ranks
rather than absolute values.

Spearman's coefficient is useful in contexts where:

• The relationship between variables is monotonic but not necessarily linear.


• The data contain extreme values (outliers) that might influence Pearson’s coefficient.
• One wants to analyze ordinal variables or data that do not follow a normal distribution.
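
As a minimal sketch of the definition given above, the code below (on a small invented dataset with no tied values) computes the coefficient once from the ranks, using the formula based on dᵢ, and once with scipy.stats.spearmanr.

import numpy as np
from scipy.stats import rankdata, spearmanr

# Small illustrative dataset with no ties (invented for this example)
x = np.array([10, 20, 30, 40, 50])
y = np.array([1.2, 0.9, 3.5, 3.9, 10.0])

# Differences between the ranks of each pair
d = rankdata(x) - rankdata(y)
n = len(x)

# Spearman's rho from the formula (valid when there are no ties)
rho_manual = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# The same quantity computed with scipy
rho_scipy, _ = spearmanr(x, y)

print(rho_manual, rho_scipy)  # both equal 0.9 for this dataset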

Exercise 43. Analysis of Sales and Product Reviews A new startup is launching a
range of electronic products and wants to understand the relationship between weekly
sales and customer reviews left on the website. Data has been collected for ten
consecutive weeks on sales (in thousands of units) and the average review score (from 1 to
5). The data is as follows:

Week Sales (x) Reviews (y)

1 30 4.3

2 45 3.8

3 18 2.9

4 25 4.0

5 35 4.5

6 50 4.7

7 40 4.1

8 28 3.1

9 33 3.6

10 38 4.2

Table 2.7: Sales and Reviews.



The startup wants to know if there is a correlation between sales and the review score.

Solution

To solve this problem, we use the Spearman's rank correlation coefficient. This coefficient is
useful when you want to measure the monotonic dependence between two variables (not
necessarily linear).

To calculate the Spearman's rank correlation coefficient, we first transform the


observations into ranks, which are progressive numbers following the order of the values.

Rank table:

Week Sales Rank (x) Review Rank (y)

1 4 8

2 9 4

3 1 1
4 2 5

5 6 9

6 10 10

7 8 6

8 3 2

9 5 3

10 7 7

Table 2.8: Rank table.

Then, we calculate the Spearman coefficient:

rₛ = 1 - (6 Σ dᵢ²) / (n(n² - 1))

Where dᵢ is the difference between the ranks of x and y, and n is the number of
observations.

We calculate the various dᵢ, which are: [4 - 8, 9 - 4, ..., 7 - 7], i.e., [-4, 5, 0, -3, -3, 0, 2, 1, 2, 0].
Thus, Σ dᵢ² = 68.

Now, we can calculate rₛ:


rₛ = 1 - (6 · 68) / (10(10² - 1)) ≈ 0.5878

The result rₛ = 0.5878 indicates a fairly strong positive correlation.

Solution with Python

import scipy.stats as stats

sales = [30, 45, 18, 25, 35, 50, 40, 28, 33, 38]

reviews = [4.3, 3.8, 2.9, 4.0, 4.5, 4.7, 4.1, 3.1, 3.6, 4.2]

# Calculate Spearman's rank correlation coefficient
spearman_corr, _ = stats.spearmanr(sales, reviews)

spearman_corr

In this code, using the scipy library, we calculate the Spearman's rank correlation coefficient.

Let's examine the various steps:

1. The spearmanr function is part of the scipy.stats module and is used to calculate
Spearman's rank correlation coefficient.
2. The weekly sales and review data are represented as Python lists: sales and reviews.
3. The spearmanr function returns both the Spearman's rank correlation coefficient and the
p-value. In this example, we only use the correlation coefficient (assigned to
spearman_corr).
4. A Spearman coefficient value approximately equal to 0.5878 indicates a moderately
strong positive correlation, suggesting that an increase in review scores is moderately
associated with an increase in sales.

In summary, this simple implementation highlights the usefulness of the scipy library for
performing complex statistical analyses in just a few lines of code.

Exercise 44. Performance Analysis of Marketing Campaigns A multinational


company in the food sector is evaluating the effectiveness of its digital advertising
campaigns. The company has collected data on the performance of 8 recent campaigns,
measuring their budget (in thousands of euros) and perceived efficiency score. Each
campaign was evaluated by a panel of experts who assigned a score from 1 to 10 based on
public impact and return on investment. The data collected is as follows:

Campaign Budget (x) Score (y)

1 120 8

2 100 6

3 150 9

4 80 5

5 200 8

6 50 4

7 170 8

8 130 7

Table 2.9: Budgets and scores.

The company's goal is to understand the correlation between the invested budget in the
campaigns and the efficiency score perceived by the experts.

Solution

To determine whether there is a relationship between the advertising campaign budget and
the perceived efficiency score, we use Spearman's rank correlation coefficient. This method
evaluates the correlation between two ordinal ranks when the data is not necessarily
normally distributed or does not have a linear relationship.

The steps are:

1. Sort the data based on their respective budget values (x) and score (y).
2. Calculate the ranks for each value.
3. Determine the difference between the ranks of the two variables for each campaign.
4. Calculate the coefficient using Spearman's formula:

ρ = 1 - (6 Σ dᵢ²) / (n(n² - 1))

where dᵢ is the difference between the ranks and n is the number of observations.

After performing the calculation, if the obtained coefficient is close to -1, it indicates a
perfect negative correlation, if it's close to 1 it indicates a perfect positive correlation, and
if it's around 0, it suggests the absence of correlation. Suppose we obtain ρ = 0.83, which
would indicate a significant positive correlation between the campaign budgets and the
perceived performance score. This result suggests that an increase in campaign budget is
generally associated with an improvement in the performance score perceived by the
experts.

Solution with Python

import numpy as np
from scipy.stats import spearmanr

# Campaign data
budgets = np.array([120, 100, 150, 80, 200, 50, 170, 130])
scores = np.array([8, 6, 9, 5, 8, 4, 8, 7])

# Calculate Spearman's correlation coefficient
rho, _ = spearmanr(budgets, scores)

# Function to interpret the result
def correlation_interpretation(rho):
    if rho > 0.5:
        return "The result indicates a significant positive correlation between budget and score."
    elif rho < -0.5:
        return "The result indicates a significant negative correlation between budget and score."
    else:
        return "The result indicates no significant correlation between budget and score."

# Print the result
eval_result = correlation_interpretation(rho)
print(f"Spearman's correlation coefficient is: {rho:.2f}. {eval_result}")

Let's look at the various steps:

1. The spearmanr function from the scipy.stats library calculates Spearman's correlation
coefficient between two data arrays. It returns the correlation coefficient, ρ, and the p-
value of the test, which is not used in this specific context.
2. The correlation_interpretation function takes the value of ρ and provides a textual
interpretation of the result. This is done through a simple if-elif-else condition that
checks whether the coefficient shows a positive, negative, or no significant correlation.
3. Finally, the code prints the value of ρ rounded to two decimal places along with the
respective textual interpretation. The coefficient ρ quantifies the degree of correlation
between budget and score, helping the company understand the relationship between
investment and the perception of efficiency.

Exercise 45. Assessment of the Relationship Between Social Media Engagement


and Web Traffic An e-commerce company wants to explore the effectiveness of its social
media presence in generating traffic to its website. The analytics team collected data over
seven consecutive weeks, measuring engagement on social posts (in hundreds of
interactions such as likes, comments, and shares) and the corresponding web traffic
generated (in thousands of average daily visits). The collected data are:

Week Social Engagement (x) Web Traffic (y)

1 10 8

2 15 13
3 8 5

4 12 10

5 20 16

6 7 3

7 18 14

Table 2.10: Engagement and Traffic.

The management wants to determine how closely social media interaction is correlated
with web traffic to the company's site.

Solution

To analyze the relationship between social media engagement and web traffic to the
company’s site, we use Spearman's Rank Correlation Coefficient. This method is ideal for
verifying the strength and direction of a monotonic association between two variables,
especially when the data do not necessarily follow a normal distribution.

Let's go through the steps:

1. Rank Calculation:
o Assign a rank to each social engagement and web traffic value, ordering them
separately in ascending order.
2. Calculation of Rank Differences (dᵢ):
o For each data pair, subtract the web traffic rank from the social engagement rank.
3. Calculation of dᵢ²:
o Square the rank differences obtained.
4. Spearman's Formula (ρ):
o Apply the formula:

ρ = 1 - (6 Σ dᵢ²) / (n(n² - 1))

where n is the number of observations (7 in this case).


5. Interpretation:
o A value of p close to +1 indicates a strong positive correlation, while a value close
to -1 indicates a negative correlation. A value close to 0 suggests no correlation.

Performing these calculations, assume we obtain a value of ρ of about 1, suggesting a strong


positive correlation. This would indicate that high social engagement tends to be
associated with an increase in site traffic, reflecting an effective digital marketing strategy
for this company.

Solution with Python

import numpy as np
from scipy.stats import spearmanr

# Social media engagement and web traffic data
engagement_social = [10, 15, 8, 12, 20, 7, 18]
web_traffic = [8, 13, 5, 10, 16, 3, 14]

# Calculation of Spearman's correlation coefficient
spearman_coefficient, p_value = spearmanr(engagement_social, web_traffic)

print(f"Spearman's Correlation Coefficient: {spearman_coefficient:.2f}")
print(f"P-value: {p_value:.4f}")

The code uses the scipy library to compute the coefficient and its p-value directly.

Let's look at the steps of the code:

1. numpy is imported as np for potential numerical operations, while spearmanr from the
scipy.stats library is used to compute Spearman's coefficient directly.
2. The lists engagement_social and web_traffic contain the data on social media
engagement and web traffic, respectively, collected over seven consecutive weeks.
3. The spearmanr() function takes the two data sets as input and returns the Spearman
correlation coefficient and the p-value, indicating the statistical significance of the
calculated coefficient.
4. The results are formatted and printed. The value of Spearman's correlation coefficient
indicates the strength and direction of the correlation. In this context, a value close to 1
would suggest a strong positive correlation, indicating that social engagement is
strongly correlated with an increase in web traffic. A low p-value (typically < 0.05)
would suggest that the observed correlation is statistically significant.

In summary, this code allows verifying the relationship between social media engagement
and web traffic generated, providing an indication of the effectiveness of the social media
marketing strategy. Spearman's correlation calculation is particularly useful when the data
are not normally distributed, making it ideal for many practical business scenarios.
Chapter 3
Regression Analysis
In this chapter, we will tackle a series of exercises dedicated
to regression problems, with a particular focus on linear and
exponential regression. Regression is an essential statistical
tool for modeling relationships between variables and is
widely used in various fields, such as financial analysis,
marketing, sales forecasting, and scientific research.

We will start with linear regression, which allows us to


identify the relationship between an independent variable
and a dependent variable through a linear function. We will
learn to calculate its coefficients using different approaches.

Subsequently, we will analyze exponential regression, used


when the data follows a pattern that increases or decreases
exponentially. This regression is particularly useful for
modeling phenomena such as population growth, the sales
trend of a product over time, and the growth of a company.

In addition to calculating the coefficients of the regressions,


we will also focus on how to perform extrapolations and
forecasts based on the obtained models. We will thus
understand how to estimate future values based on
historical data.

The solutions will be presented from both a theoretical and


practical perspective, using mathematical formulas and
various notations. Furthermore, we will explore practical
implementation through different functions to allow the
reader to become familiar with the various conventions used
in statistics and data science. This approach will enable
mastering different tools and gaining greater confidence in
interpreting and applying regression methods in real
contexts.
3.1 Linear Regression

Linear regression is a statistical technique used to model the relationship between a


dependent variable Y and one or more independent variables X. In the simplest case,
simple linear regression, the relationship between the two variables is described by the
equation:
Y = β₀ + β₁X + ε

where:

• β₀ is the intercept (or constant term)


• β₁ is the regression coefficient (or slope)
• ε represents the error term

The most common method for estimating the coefficients β₀ and β₁ is the method of least
squares, which minimizes the sum of the squared errors:
SSE = Σᵢ₌₁ⁿ (Yᵢ - Ŷᵢ)²

where Ŷᵢ = β₀ + β₁Xᵢ represents the predicted values from the model.

This technique is widely used in various fields, including economics, engineering, and social
sciences, for making predictions and analyzing data.
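
As a minimal sketch of the least squares method described above (the data below is invented for illustration only), the coefficients can be obtained both from the closed-form formulas and with scipy.stats.linregress, which is the function used throughout this chapter.

import numpy as np
from scipy import stats

# Invented data: Y roughly follows 3 + 2*X plus a little noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([5.1, 6.8, 9.2, 11.1, 12.9, 15.2])

# Closed-form least squares estimates of the slope and intercept
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

# The same estimates with scipy
slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)

print(beta0, beta1)      # manual estimates
print(intercept, slope)  # scipy estimates (identical)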

Exercise 46. Demand Analysis and Forecast for a New Product A technology
company has recently launched a new electronic device onto the market. To optimize
production and distribution, the data analysis department wants to examine how the
product’s price affects the weekly demand.

Here are the data for the first 12 weeks of sales:

Week Price (euro) Sales (units)

1 250 400

2 220 450

3 210 470

4 240 420

5 260 370

6 230 440

7 225 460

8 235 430
9 245 410

10 255 390

11 200 490

12 215 480

Table 3.1: Price and sales.

The goal is to determine the relationship between price and demand and to forecast the
demand when the price is set at 230 euros.

Solution

To solve this problem, we will apply the concept of linear regression to determine the
relationship between the product's price and its demand. Linear regression involves finding
an equation in the form:
y = mx + b

where y is the predicted demand, x is the price, m is the regression coefficient (slope), and
b is the intercept.

To calculate m and b, we use the following formulas:


m = Σᵢ(xᵢ - x̄)(yᵢ - ȳ) / Σᵢ(xᵢ - x̄)²

b = ȳ - m·x̄

Where x̄ and ȳ are the averages of the prices and sales, respectively.

First, we calculate x̄ and ȳ:


x̄ = (250 + 220 + ... + 215) / 12 = 232

ȳ = (400 + 450 + ... + 480) / 12 = 434

Calculate m:

m = [(250 - 232)(400 - 434) + ... + (215 - 232)(480 - 434)] / [(250 - 232)² + ... + (215 - 232)²] ≈ -1.96

Calculate b:
b = 434 - (-1.96 • 232) = 889

Thus, the regression equation is:


y = -1.96x + 889

For a price of 230 euros:


y = -1.96 • (230) + 889 = 438

Therefore, the predicted demand when the price is 230 euros is approximately 438 units.

Through linear regression, we have determined that, for a new price, the equation allows
us to predict potential demand, thereby helping the company to adequately plan stocks
and marketing strategies.

Solution with Python

import numpy as np
from scipy import stats

# Data
prices = np.array([250, 220, 210, 240, 260, 230, 225, 235, 245, 255, 200, 215])
sales = np.array([400, 450, 470, 420, 370, 440, 460, 430, 410, 390, 490, 480])

# Use of scipy's linregress function
slope, intercept, r_value, p_value, std_err = stats.linregress(prices, sales)

# Regression equation: y = mx + b
regression_equation = f"y = {slope:.2f}x + {intercept:.2f}"

# Demand estimation at 230 euros
new_price = 230
predicted_demand = slope * new_price + intercept

regression_equation, predicted_demand

In this code, we use the numpy library to handle the price and sales data and scipy to fit the regression line.

Here are the various steps of the Python code:

1. Import numpy for handling numerical data and stats from scipy to perform linear
regression.
2. Create two arrays prices and sales containing the data from the first 12 weeks.
3. The function stats.linregress(prices, sales) automatically calculates the slope slope (m)
and the intercept intercept (b) of the regression line equation.
4. The function also returns r_value, p_value, and std_err, which are respectively the
correlation coefficient, p-value, and the standard error of the estimate, although in this
exercise we are primarily interested in slope and intercept.
5. We use the line equation, given by y = mx + b, with x equal to 230: predicted_demand =
slope · new_price + intercept.

At the end, the code provides both the regression equation and the estimated demand for
a price of 230 euros. This solution allows the company to make forecasts to optimize
production and distribution based on adopted pricing strategies.

Exercise 47. Analysis of Production Time to Optimize the Output A clothing


manufacturing company is aiming to improve efficiency on its production lines. Analysts
have collected data on the total working time (in minutes) for each batch produced and the
number of garments delivered in that timeframe, obtaining the following information:

Day Working Time (minutes) Garments Delivered

1 350 150

2 300 180

3 400 130

4 320 170
5 360 140

6 310 175

7 330 160

8 345 155

9 290 185

10 380 135

Table 3.2: Working Time and Garments Delivered.

Management wants to know how working time affects the number of garments delivered
and desires an estimate of the number of garments that can be produced if the working
time is 325 minutes.

Solution

To determine the relationship between working time and the number of garments
delivered, we can apply the method of linear regression.

Let's go through the steps:

1. Calculate Covariance and Variance:


o Calculate the covariance between working time (X) and garments delivered (Y).
o Calculate the variance of the working time (X).
2. Determine the slope (b₁) and intercept (b₀):
o Use the formula to obtain b₁:

b₁ = Cov(X, Y) / Var(X)

o Calculate b₀ using the means of X and Y:


b₀ = Ȳ - b₁·X̄
3. Find the regression line equation:
o The equation is: Y = b₀ + b₁·X
o By inserting the data, we will obtain a specific equation.
4. Estimate the garments produced for 325 minutes:
o Substitute 325 in place of X in the equation found to get the estimated value of Y.

This approach allows us to understand the linear relationship between working time and
production yield, useful for making informed decisions about production resources. By
using linear regression, management can accurately estimate the expected output with a
given input of time.

Solution with Python

import numpy as np
from scipy import stats

# Data
working_time = np.array([350, 300, 400, 320, 360, 310, 330, 345, 290, 380])
garments_delivered = np.array([150, 180, 130, 170, 140, 175, 160, 155, 185, 135])

# Calculate the regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(working_time, garments_delivered)

# Line equation: garments_delivered = intercept + slope * working_time

# Estimate for 325 minutes
estimated_working_time = 325
estimated_garments = intercept + slope * estimated_working_time

estimated_garments

In the described code, we use the scipy library, specifically stats.linregress, to fit the regression line to the data.

Let's go through the steps:

1. Two numpy arrays contain the working times and delivered garments for each
monitored day.
2. We use stats.linregress, which provides several parameters including:
o slope: the slope of the regression line, indicating how much the delivered garments
vary for each additional minute of working.
o intercept: the intercept, representing the estimated number of garments delivered
when the working time is zero.
o r value: the correlation coefficient, useful for understanding the strength of the
relationship.
o p value and std err: provide additional information about the significance of the
model.
3. The line is given by garments delivered = intercept + slope * working time.
4. By substituting 325 minutes for the array of times, we get an estimate of how many
garments can be delivered.

This method leverages scipy's capabilities for performing advanced statistical calculations
with simple function calls, making the data analysis process highly effective.

Exercise 48. Forecasting Future Sales Based on Past Expansion Trends An e-


commerce company wants to estimate the number of new customers it will attract in the
coming months thanks to a series of expansion initiatives already implemented. Over the
past two years, the company has tracked the number of initiatives launched and the new
customers acquired each quarter. The data is reported in the table below:

Initiatives (number) New Customers (units)

1 120

3 330

4 480

6 620

8 790

Table 3.3: Initiatives and new customers.

Use these data to estimate how many new customers the company could acquire in the
next quarter if it plans to launch 5 new initiatives.
Solution

To solve this problem, we apply the concept of linear regression. Linear regression helps us
model the relationship between a dependent variable (new customers) and an independent
variable (initiatives launched) through a linear equation, namely:
y = mx + c

where y is the dependent variable, x is the independent variable, m is the slope of the line,
and c is the intercept.

From the sample data, we can calculate the slope m and the intercept c using the formulas:

m = [n Σ(xᵢyᵢ) - Σxᵢ Σyᵢ] / [n Σxᵢ² - (Σxᵢ)²]

c = [Σyᵢ Σxᵢ² - Σxᵢ Σ(xᵢyᵢ)] / [n Σxᵢ² - (Σxᵢ)²]

Inserting the values, we get:

m ≈ 95, c ≈ 50

The linear regression equation found is then:


y = 95 • x + 50

To estimate the new customers with the launch of 5 new initiatives:


y = 95 • 5 + 50 = 525

Therefore, it is estimated that the company will attract about 525 new customers if it
launches 5 new initiatives in the next quarter.

Solution with Python

from scipy.stats import linregress

# Data
initiatives = [1, 3, 4, 6, 8]
new_customers = [120, 330, 480, 620, 790]

# Linear regression calculation
slope, intercept, r_value, p_value, std_err = linregress(initiatives, new_customers)

# Prediction for 5 new initiatives
y_pred = slope * 5 + intercept

# Result
predicted_customers = round(y_pred)
predicted_customers

In the Python code above, we used the scipy library to fit the regression line.

Let’s look at the details:

1. initiatives and new customers represent the data of initiatives launched and new
customers acquired each quarter, respectively.
2. The linregress function is used with two lists, initiatives and new_customers, returning
the slope, intercept, correlation coefficient, p-value, and standard error.
3. slope: represents the slope of the line, which indicates the expected change in new
customers for each additional initiative.
4. intercept: is the point where the regression line intersects the y-axis.
5. We use the slope and intercept values to predict how many new customers will be
acquired for 5 new initiatives using the linear equation:
y = slope · 5 + intercept
6. Finally, we round the result to the nearest integer since it makes sense to treat the
number of customers as an integer quantity.
7. The final result is assigned to predicted customers, which solves the proposed problem by
providing an estimate of the number of new customers expected.

Exercise 49. Employee Performance Prediction A technology company aims to discern


how an employee's work experience influences their internally assessed performance. The
company has collected data on employees that includes their job tenure and their annual
performance score. The collected data is:

Experience (years) Performance (score)

1 50

3 67

5 78

6 82

8 90

Table 3.4: Experience and performance.

The company wants to predict the performance score of an employee with 7 years of
experience. Use the provided data to estimate this score.

Solution

To tackle this business prediction problem, we can apply the extrapolation technique using
linear regression.

Let's go through the steps:

1. The form of the linear regression equation is y = mx + c, where m is the slope and c is
the intercept.
o Calculate the mean of x (experience) and y (performance): x̄ = (1 + 3 + 5 + 6 + 8)/5 = 4.6,
ȳ = (50 + 67 + 78 + 82 + 90)/5 = 73.4

o Calculate the slope m: m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² ≈ 5.64


o Calculate the intercept c: c = ȳ - m·x̄ ≈ 47.44
The regression equation becomes: y = 5.64x + 47.44
2. Extrapolate the performance for an employee with 7 years of experience, obtaining y ≈
86.94

The estimated performance for an employee with 7 years of experience is approximately


87 points. This exercise demonstrated how extrapolation using linear regression can be a
valid statistical tool for making business predictions based on historical data.
Solution with Python

import numpy as np
from scipy import stats

# Provided data
experience = np.array([1, 3, 5, 6, 8])
performance = np.array([50, 67, 78, 82, 90])

# Calculate linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(experience, performance)

# Function to predict performance based on years of experience
predict_performance = lambda x: slope * x + intercept

# Predict performance for an employee with 7 years of experience
experience_predict = 7
performance_predict = predict_performance(experience_predict)

performance_predict

In this code, we primarily used the numpy library for handling numerical data and the scipy library for the regression.

Let's see the details:

1. The first part of the code defines two numpy arrays that represent the years of
experience and the performance scores of the employees.
2. We use scipy.stats.linregress, a convenient method to calculate linear regression
between two data series. This method returns several statistically relevant values such
as the slope (slope), the intercept (intercept), and the correlation coefficient (r_value),
among others.
3. slope represents the slope, and the intercept represents the intercept, which are used
to create the prediction function y = mx + c in the form of a lambda function to simplify
the calculation of the prediction.
4. Finally, the code calculates and prints the estimated performance score for an
employee with 7 years of experience using the previously defined prediction function.
3.2 Exponential Regression

Exponential regression is a statistical model used to describe the relationship between a


dependent variable Y and an independent variable X when the growth or decay of the data
follows an exponential pattern. The equation for the model is typically expressed in the
form:
Y = αe^(βX) + ε

where:

• α is a coefficient representing the initial value of Y


• β is the growth (if positive) or decay (if negative) coefficient
• e is the base of the natural logarithm
• ε is the error term

To estimate the parameters α and β, a common approach is to apply the logarithmic


transformation of the data, obtaining a linear model:
ln Y = ln α + βX + ε

where ln(x) is the natural logarithm and it requires α > 0. In this way, linear regression can
be applied to the transformed data, estimating ln α and β using the least squares method.
Once these values are obtained, α is calculated as:
α = e^(ln α)

Exponential regression is commonly used in fields such as population growth, the spread of
diseases, financial analysis, and radioactive decay.
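
As a minimal sketch of the logarithmic transformation described above (the data below is invented, generated approximately from Y = 2e^(0.5X)), ln Y is regressed on X with scipy.stats.linregress and α is then recovered as the exponential of the estimated intercept.

import numpy as np
from scipy import stats

# Invented data, roughly following Y = 2 * exp(0.5 * X)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 3.4, 5.3, 9.1, 14.6, 24.7])

# Linearized model: ln Y = ln(alpha) + beta * X
slope, intercept, r_value, p_value, std_err = stats.linregress(X, np.log(Y))

beta = slope               # growth coefficient
alpha = np.exp(intercept)  # initial value, recovered from ln(alpha)

print(alpha, beta)  # estimates close to 2 and 0.5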

Exercise 50. Growth of a Tech Startup A technology startup is analyzing its monthly
growth in active users. In the first month, the startup had 100 active users. In the following
five months, the number of active users grew to 150, 225, 337, 506, and 759 respectively.
The CEO wants to establish a model that represents the growth rate of the user base over
time and plans to use this model to project the number of users in the coming year. What is
the most suitable mathematical model to describe the user growth? Calculate the
projections for the seventh month and discuss whether the growth rate will remain
sustainable in the long term.

Solution

The growth of users follows an exponential pattern typical in the context of startups with
viral or rapid initial growth experiences. In this case, we are using exponential regression to
model the data because the number of users seems to be growing at increasing rates over
time.

To establish the model, we seek an equation of the type Users(t) = a·e^(bt), where a
represents the initial number of users and b represents the growth rate. The given data
indicates that there are 100 users in the first month, and we aim to find a good exponential
fit.

We use a logarithmic transformation on the data to linearize the relationship and then
apply linear regression on the logarithmic values to determine b. From the straight line
log(y) = log(a) + b • t, we can then solve for a and b.
Once the parameters a and b are estimated, we can make a projection for the seventh
month. Assuming a ~ 67 and that b has been determined as, for example, b = 0.4, we
have:
Users(7) = 67 · e^(0.4·7) ≈ 1101

This implies very rapid growth. However, as the user base increases, practical limits such
as market saturation or infrastructure costs may arise. It is important to continuously
monitor the growth rate, adapt strategies, and consider geographical expansion or product
innovations to sustain such rates. In summary, exponential regression offers a powerful
tool to model and predict the initial growth of startups, but prudence is necessary when
projecting the future.

Solution with Python

import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

# User data per month
months = np.array([1, 2, 3, 4, 5, 6])            # months
users = np.array([100, 150, 225, 337, 506, 759]) # active users

# Define the exponential function: Users(t) = a * exp(b * t)
def exponential_model(t, a, b):
    return a * np.exp(b * t)

# Use curve_fit to estimate parameters 'a' and 'b'
popt_parameters, _ = curve_fit(exponential_model, months, users, p0=(100, 0.4))

# Extract parameters
param_a, param_b = popt_parameters

# Projection for the seventh month
projection_month = 7
projected_users = exponential_model(projection_month, param_a, param_b)

print(f"Projected users for month {projection_month}: {projected_users:.2f}")

# Plot the fitted model together with the known data
extended_months = np.arange(1, 13, 1)  # including projection up to 12 months
predicted_users = exponential_model(extended_months, param_a, param_b)

plt.scatter(months, users, color='red', label='Real Data')
plt.plot(extended_months, predicted_users, label='Exponential Model')
plt.legend()
plt.show()

Let's go through the details:

1. We use numpy for numerical operations, scipy for fitting and matplotlib for data
visualization in the form of a graph.
2. We provide two arrays, months and users, representing the months and the number of
active users, respectively.
3. The function exponential_model represents the equation we want to fit to the data:
Users(t) = a · e^(bt).
4. We use curve_fit to fit the exponential function to our data. We give an initial estimate
of the parameters p0. This function optimizes a and b for the best data fit.
5. The values for a and b are obtained from the optimized parameters.
6. We use the refined parameters to calculate the expected number of users for the
seventh month, with possible extended projections up to a year.
7. We use matplotlib to plot the graph representing the actual data and the curve fitted by
the model. This helps visualize the adequacy of the exponential growth model to the
observed data.

This approach allows for simple modeling of user growth and providing projections, aware
of the variables not considered like market limits or other external constraints.

Exercise 51. Forecasting Sales of a New Mobile App A company developing mobile
applications has recently launched a new fitness app. In the eight months following the
launch, the monthly sales in thousands of units were: 200, 290, 421, 612, 889, 1291, 1874,
2716.
The development team, impressed by the growth in sales, wants to create a forecasting
model to estimate expected sales over the next three months and to plan future updates
and marketing strategies.

Solution

To model the sales growth of an application over time, we can use an exponential
regression model, since the growth rate seems to accelerate as months go by. The general
form of an exponential growth model is:
V(t) = V₀ · e^(kt)

Where:

• V(t) represents the sales at month t.


• V₀ is the initial sales value.
• k is the exponential growth constant.
• t is the time considered in months.

Using the data provided, we can perform an exponential regression to estimate the
parameters V₀ and k. After calculating these parameters, we can use the model to predict
sales in the months following the eight provided, for example to estimate sales at the tenth
month.

For the tenth month, using the newly constructed model: if we find, for example, V₀ = 200
and k = 0.3, the sales for the tenth month would be:
V(10) = 200 · e^(0.3·10) ≈ 4017 units

A sustainability analysis must consider that exponential growth is typically not sustainable
in the long term due to factors such as market saturation and increased competition. This
implies that, although the model may provide good short-term predictions, long-term
estimates should be treated with caution. It is also important to constantly monitor market
conditions and adapt the model if anomalies are observed in growth data.

Solution with Python

import numpy as np
from scipy.optimize import curve_fit

# Sales data
months = np.array([1, 2, 3, 4, 5, 6, 7, 8])
sales = np.array([200, 290, 421, 612, 889, 1291, 1874, 2716])

# Define exponential function
def exponential_growth(t, V0, k):
    return V0 * np.exp(k * t)

# Perform data fitting
params, covariance = curve_fit(exponential_growth, months, sales, p0=[200, 0.3])

# Obtain fitted parameters
V0, k = params
print(f"V0: {V0}, k: {k}")

# Predict sales for the tenth month
month_to_predict = 10
predicted_sales = exponential_growth(month_to_predict, V0, k)
print(f"Predicted sales for month {month_to_predict}: {predicted_sales:.2f}")

Let's take a detailed look:

1. The monthly sales data is stored in two numpy arrays, months and sales, which contain
the month numbers and the associated sales in thousands of units, respectively.
2. The function exponential_growth(t, V0, k) defines the exponential model, where V0 and k
are the parameters to be estimated through regression.
3. The function curve_fit from scipy.optimize is used to estimate the parameters V0 and k
of the exponential equation that best fits the observed data. The p0 parameter provides
an initial guess for these parameters, improving the convergence of the fitting
algorithm.
4. After fitting, the estimated values for V0 and k are printed.
5. Finally, the exponential growth function is used to estimate sales in the tenth month
using the calculated parameters, giving an idea of expected growth.

This model allows for estimating future sales growth but requires monitoring its validity
over the long term to account for potential market saturation phenomena.

Exercise 52. Analyzing Market Growth in an Innovative Sector A technology


company called TechGrowth Inc. is analyzing the growth of its business volume over the
past 5 years to plan its expansion strategy for the next 3 years. The historical revenue
data, in millions of euros, is shown in the following table:

Year Revenue (millions of euros)

2018 10

2019 11.5

2020 13.2

2021 15.1

2022 17.25

Table 3.5: Annual revenue.

The management wants to forecast the revenue for the years 2023, 2024, and 2025. Using
the available data, determine a possible growth model using an extrapolation method
based on an appropriate regression model and calculate projections for the coming years.

Solution

For this exercise, we assume that the growth model for TechGrowth Inc. follows an
exponential trend. This type of model is common for companies in technologically
innovative sectors where growth can quickly accelerate due to increased demand or
significant technological advances.

The function of an exponential model is generally expressed as y = a·bˣ, where:

• y is the value of the dependent variable (in this case, the revenue);
• a is the initial revenue,
• b is the base of the exponential growth, indicating what the annual growth rate will be;
• x is the time measured in years.

To fit this model to the provided data, we follow these steps:

1. Calculate the average annual growth rate using the data:


o From 2018 to 2022, the revenue grows from 10 to 17.25 million.
o The average annual growth can be calculated as b = (17.25/10)^(1/4), resulting in a rate b
≈ 1.15.
2. Predetermine the revenue growth equation: y = 10 · 1.15ˣ
3. Calculate the forecasted revenues for 2023, 2024, and 2025:
o For 2023: y = 10 · 1.15⁵ ≈ 19.79 million;
o For 2024: y = 10 · 1.15⁶ ≈ 22.67 million;
o For 2025: y = 10 · 1.15⁷ ≈ 25.97 million.

This exploration approach using exponential regression provides insights into how the
company might grow if the current expansion rate is maintained. Such projection assists
strategic management in planning resources, investments, and market strategies for the
future.

Solution with Python

import numpy as np
from scipy.optimize import curve_fit

# Historical revenue data
past_years = np.array([2018, 2019, 2020, 2021, 2022])
past_revenue = np.array([10, 11.5, 13.2, 15.1, 17.25])

# Define the exponential function: y = a * b^x
def exponential_model(x, a, b):
    return a * b ** x

# Calculate the parameters a and b using curve_fit
popt, pcov = curve_fit(exponential_model, past_years - 2018, past_revenue)

a, b = popt

# Future data for prediction
future_years = np.array([2023, 2024, 2025])

# Compute predicted revenues
predicted_revenue = exponential_model(future_years - 2018, a, b)

# Print results
for year, revenue in zip(future_years, predicted_revenue):
    print(f"Predicted revenue for {year}: euro {revenue:.2f} million")

Let's tackle the details:

1. The function exponential_model is defined, representing the growth model y = a · bˣ. The
model is based on two parameters to be estimated: a, representing the initial revenue,
and b, the exponential growth rate.
2. Using curve fit, the best-fit values of a and b are calculated to fit the exponential
model to the historical data. The time x is calculated as the difference from the first
year (2018) to maintain manageable numbers.
3. Using the estimated values of a and b, the predicted revenue is computed using the
exponential model function.
4. The predicted revenue for each future year is printed, showing annual projections
rounded to two decimal places.

Exercise 53. Projection of Solar Energy Demand for a Company The company
SolarWave Energy specializes in providing solar panels for private homes. Over the past 6
years, it has observed a trend in the number of annual installations that provides a solid
basis for planning the future. The installation data (number of units) is as follows:

Year  Number of Installations
2016  150
2017  180
2018  220
2019  270
2020  330
2021  400

Table 3.6: Trend of the Number of Installations.

The management wants to estimate the number of projected installations for the years
2022, 2023, and 2024 to plan the production and procurement of necessary materials. Use
the available data to construct an appropriate projection model and calculate future
estimates.

Solution

To address the problem of projecting the demand for solar panels for SolarWave Energy, an
exponential regression model is used, given the nature of the data, which indicates a
growth rate that increases more than proportionally. This choice allows capturing potential
exponential growth in the number of installations.

The general form of an exponential regression model can be described as:


y = a · b^x

Where:

• y represents the projected number of installations.


• a and b are the model parameters determined using the historical data.
• x represents the year starting from the first year considered as base 0 (e.g., 2016 = 0).

Using a logarithmic transformation of the historical data, we can first determine the parameters a and b using a linear model applied to the logarithm of the installations:
log(y) = log(a) + x log(b)

Using the series of historical data, we perform a linear fit to calculate log(a) and log(b),
from which we can derive a and b.

Once the parameter values are obtained, we calculate the projections for the subsequent
years (2022, 2023, 2024) by substituting the respective x values into the exponential
formula.

This process allows SolarWave Energy to estimate future market demand with reasonable
accuracy, thus adopting the necessary operational decisions to align production capacity
and inventory management.
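The log-linear fit described above can be sketched with np.polyfit, as an alternative to the curve_fit approach shown next; this is a minimal sketch and the variable names are illustrative.

import numpy as np

years = np.array([2016, 2017, 2018, 2019, 2020, 2021])
installations = np.array([150, 180, 220, 270, 330, 400])

# Linear fit on the logarithm: log(y) = log(a) + x * log(b)
x = years - years[0]
log_b, log_a = np.polyfit(x, np.log(installations), 1)  # slope, intercept
a, b = np.exp(log_a), np.exp(log_b)

# Projections for 2022-2024
for year in (2022, 2023, 2024):
    print(year, round(a * b ** (year - years[0])))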

Solution with Python

import numpy as np
from scipy.optimize import curve_fit

# Historical data
years = np.array([2016, 2017, 2018, 2019, 2020, 2021])
installations = np.array([150, 180, 220, 270, 330, 400])

# Transform years to base zero
base_year = years[0]
x_data = years - base_year

# Exponential regression function
def exponential_model(x, a, b):
    return a * np.power(b, x)

# Fit the model to the historical data
params, _ = curve_fit(exponential_model, x_data, installations)
a, b = params

# Predictions for the years 2022, 2023, 2024
years_to_predict = np.array([2022, 2023, 2024])
x_to_predict = years_to_predict - base_year
predictions = exponential_model(x_to_predict, a, b)

# Results
for year, prediction in zip(years_to_predict, predictions):
    print(f"Prediction for the year {year}: {int(prediction)}")

Let's see the various details:

1. The curve_fit function from the scipy.optimize module is central to estimating the parameters of our exponential model. curve_fit searches for the best values of the model parameters defined in the exponential_model function, fitting them to the provided data.
2. We transform the years into a base-zero format to facilitate model fitting. This means the first year becomes 0, the second 1, and so on.
3. We define a function exponential_model that represents the general formula of exponential growth: y = a · b^x.
4. We use curve_fit to determine the parameters a and b that best fit our model to the historical data.
5. After identifying the parameters, we calculate the forecasts for the subsequent years 2022, 2023, and 2024.
6. Finally, we print the calculated forecast results, providing SolarWave Energy with an estimate of future installations.

This approach effectively addresses the variability in demand by projecting non-linear growth over time.
Chapter 4
Conditional Probability
In this chapter, we will tackle a series of exercises on conditional probability, a fundamental
concept that extends what we have already seen in the first chapter on basic probability.
Conditional probability refers to the probability of an event occurring given that another
event has already occurred.

It is denoted by P(A|B) and is defined as:

P(A|B) = P(A ∩ B) / P(B), with P(B) > 0.

This formula expresses the fraction of the probability of B that is shared with A. If P(A|B) = P(A), then events A and B are independent, meaning that the knowledge of B does not affect the probability of A.

A fundamental application of conditional probability is Bayes' theorem:


P(A|B) = P(B|A) · P(A) / P(B)
Conditional probability is widely used in statistics, artificial intelligence, social sciences, and
medicine. It plays a crucial role in the work of a data analyst, as it allows for refining
predictions and supporting data-driven decision-making. When well-structured and
supplemented by appropriate storytelling, it can become a powerful tool for corporate
management, facilitating more informed and strategic decision-making. For example, in a
marketing context, it can be used to predict customer behavior based on past purchase
data, while in the financial sector, it can help assess the risk of certain investments.
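As a minimal illustration of the two formulas above (the function and the numbers are purely illustrative, not taken from any exercise), Bayes' theorem can be coded in a few lines:

# Illustrative sketch: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Example with made-up values: P(B|A) = 0.8, P(A) = 0.1, P(B) = 0.2
print(bayes(0.8, 0.1, 0.2))  # 0.4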

Exercise 54. Analysis of a Product's Sales Forecasts A consumer electronics


company is analyzing the performance of a new smartphone model launched in the
market. The company wants to understand the effectiveness of its sales forecasts and the
role of a specific advertising channel. After an analysis, the company collects the following
data:

• 70% of the sales forecasts turned out to be correct.


• 20% of the correct forecasts were supported by social media advertising campaigns.
• Overall, 30% of the actual sales are attributable to social media advertising campaigns.

Management wants to know what is the probability that a sale supported by a social media
advertising campaign was correctly forecasted.

Solution

To solve this problem, we use the concept of conditional probability and Bayes' theorem.
Bayes' theorem is defined as:
P(C|S) = P(S|C) · P(C) / P(S)
Where:

• P(C|S) is the probability that a forecast is correct given that it is supported by a social
media campaign.
• P(S|C) is the probability that a sale happens through a social media campaign given
that the forecast was correct (20% or 0.2).
• P(C) is the probability that a forecast is correct (70% or 0.7).
• P(S) is the probability that a sale happens through a social media campaign (30% or
0.3).

By inserting the values into the formula, we obtain:


P(C|S) = (0.2 · 0.7) / 0.3 = 0.14 / 0.3 ≈ 0.467
Therefore, the probability that a sale supported by a social advertising campaign was
correctly forecasted is approximately 46.7%. This use of conditional probability and Bayes'
theorem allows the company to evaluate the effectiveness of its forecasts in combination
with advertising campaigns on specific channels.

Solution with Python

# Data provided in the problem
# Probability that a sale happens via social given the forecast is correct
P_S_given_C = 0.2

# Probability that a forecast is correct
P_C = 0.7

# Probability that a sale happens via social
P_S = 0.3

# Calculation of the conditional probability using Bayes' theorem
P_C_given_S = (P_S_given_C * P_C) / P_S

print('The probability that a sale supported by social was correctly forecasted is approximately:', P_C_given_S)

The code uses Bayes' theorem to obtain the result directly from the given probabilities.

Let's see the various details:

• Variables:
  o P_S_given_C, P_C, and P_S represent the probabilities as described in the problem.
• Calculation using Bayes' theorem:
  o We use Bayes' theorem to calculate P_C_given_S, the probability of a correct forecast given a sale supported by a social campaign.

In summary, the core of the problem lies in understanding and applying Bayes’ theorem.

Exercise 55. Analysis of the Effectiveness of a Retargeting Strategy An e-


commerce company is evaluating the effectiveness of its retargeting campaign. The
marketing managers want to understand how effective it is in converting site visitors into
buyers. After collecting the data, the company observed that:

• 60% of site visitors viewed retargeting ads.


• Among those who saw the retargeting ads, 25% made a purchase.
• Overall, 20% of site visitors made a purchase.

Management wants to determine the probability that a visitor who made a purchase saw
the retargeting ads.

Solution
In the given business context, we want to calculate the probability P(V |A), which is the
probability that a visitor who made a purchase saw the retargeting ads.

We use Bayes' theorem:

P(V|A) = P(A|V) · P(V) / P(A)
Where:

• P(A|V) = 0.25 (probability that a visitor makes a purchase after seeing an ad).
• P(V) = 0.60 (probability that a visitor sees the retargeting ads).
• P(A) = 0.20 (overall probability that a visitor makes a purchase).

Substituting the values into the formula: P(V|A) = (0.25 · 0.60) / 0.20 = 0.75

Therefore, the probability that a visitor who made a purchase saw the retargeting ads is
75%. This means that the retargeting campaign was quite effective, as a large majority of
buyers saw the retargeting ads before making a purchase.

Solution with Python

# Provided data
# Probability of seeing a retargeting ad
P_V = 0.60

# Probability of making a purchase after seeing the ad
P_A_given_V = 0.25

# Overall probability of making a purchase
P_A = 0.20

# Calculate P(V | A) using Bayes' theorem
P_V_given_A = (P_A_given_V * P_V) / P_A

P_V_given_A

In this exercise, we use Bayes' theorem to calculate the probability P(V|A), which represents the probability that a visitor who made a purchase saw the retargeting ads.

To do this, we start by defining the probabilities provided in the exercise:

• P_V: the probability that a visitor saw the retargeting ads.
• P_A_given_V: the probability that a visitor makes a purchase given they saw a retargeting ad.
• P_A: the overall probability that a visitor makes a purchase.

We then calculate P(V|A) by substituting the values into Bayes' theorem. The calculated value P_V_given_A allows us to assess the effectiveness of the retargeting campaign. A high value, in this case 0.75 or 75%, indicates that a large majority of buyers saw the retargeting ads before making the purchase, suggesting that the campaign was quite effective.

Exercise 56. Corporate Risk Analysis in the Insurance Sector An insurance company
wants to analyze the risk of claims for customers who have auto insurance. After analyzing
historical data, the company gathers the following information:

• 10% of customers are involved in an accident annually.


• 5% of customers are classified as high-risk.
• 30% of high-risk customers had at least one accident in the past year.

The company wants to know the probability that a customer who has had an accident
belongs to the high-risk category.
Solution

In the context of this problem, we are examining the concept of conditional probability and
using Bayes' theorem to determine the probability of interest.

To calculate the required conditional probability P(H|S), which is the probability that a customer is high-risk given that they have had an accident, we use Bayes' theorem:

P(H|S) = P(S|H) · P(H) / P(S)

Where:

• P(S|H) = 0.30 is the probability of having an accident given that the customer is high-risk,
• P(H) = 0.05 is the probability of being a high-risk customer,
• P(S) = 0.10 is the probability of having an accident.

Substituting the values into the formula:


P(H|S) = (0.30 · 0.05) / 0.10 = 0.15
Therefore, the probability that a customer who has had an accident belongs to the high-risk
category is 15%.

This result provides useful insight for the company regarding customer segmentation and
profiling, suggesting that, while only a small portion of customers belong to the high-risk
category, a considerable proportion of claims come from this group.

Solution with Python

# Define the probabilities
# Probability of accident given high-risk
P_S_given_H = 0.30

# Probability of being high-risk
P_H = 0.05

# Probability of having an accident
P_S = 0.10

# Calculate the conditional probability using Bayes' theorem
P_H_given_S = (P_S_given_H * P_H) / P_S

# Result
P_H_given_S

In the above code, we use Bayes’ theorem to calculate the conditional probability that a
customer belongs to the high-risk category given that they have had an accident. Bayes’
theorem is a fundamental part of statistics and allows us to update probabilities in light of
new evidence or data.

Details:

• P_S_given_H represents the probability that a high-risk customer has an accident.
• P_H is the probability that a customer is high-risk.
• P_S is the probability that a customer has an accident.

The formula is directly implemented to obtain P_H_given_S, which represents the conditional probability we are seeking.
Exercise 57. Production Analysis of a Manufacturing Plant A factory produces
electronic components, including printed circuits, and wants to analyze the production
process. The management has decided to focus on the product quality associated with a
particular supplier of materials.

• 12% of the components produced are defective.


• 8% of the components were produced using materials from a specific supplier.
• 20% of the defective components were produced with materials from the specified
supplier.

The management aims to determine the probability that a component produced with
materials from the supplier is defective.

Solution

To solve this problem, we apply the conditional probability formula to determine the
probability that a component produced with materials from the supplier is defective: P(D|F).

With the given data:

• P(D) = 0.12 (probability that a component is defective)


• P(F) = 0.08 (probability that a component is produced with materials from the supplier)
• P(F|D) = 0.20 (probability that a defective component is produced with materials from
the supplier)

Using Bayes' theorem, we can calculate:

P(D|F) = P(F|D) · P(D) / P(F)

Inserting the values:

P(D|F) = (0.20 · 0.12) / 0.08 = 0.30
Therefore, the probability that a component produced with materials from the supplier is
defective is 30%.

Using the supplier's materials seems to be associated with a higher probability of defects
compared to the overall plant average (30% versus 12%). This indicates that sourcing from
this supplier might increase the risk of defects in the components, suggesting the need for
a review of the materials or processes used by this supplier to reduce defects.

Solution with Python

# Provided probabilities
# Probability that a component is defective
P_D = 0.12

# Probability that a component is produced with materials from the supplier
P_F = 0.08

# Probability that a defective component is produced with materials from the supplier
P_F_given_D = 0.20

# Calculate the conditional probability P(D|F) using Bayes' theorem
P_D_given_F = (P_F_given_D * P_D) / P_F

print(f"The probability that a component produced with materials from the supplier is defective is: {P_D_given_F:.2f}")

In the code above:
1. The known probabilities are set as variables:
  o P_D is the probability that a component is defective (0.12 or 12%).
  o P_F is the probability that a component is produced using materials from supplier F (0.08 or 8%).
  o P_F_given_D is the probability that a defective component is produced with materials from supplier F (0.20 or 20%).
2. We use Bayes' theorem to calculate the conditional probability P_D_given_F, which is the probability that a component produced with materials from the supplier is defective:

P(D|F) = P(F|D) · P(D) / P(F)

3. Finally, the calculated probability is printed in a format that shows it with two decimal places.

Exercise 58. Performance Analysis of a Loyalty Program A retail company has


implemented a loyalty program to increase sales. The management wants to analyze the
effectiveness of the program by observing the relationship between participation in the
program and the number of purchases over 100 euros.

• 40% of customers are enrolled in the loyalty program.


• Among the enrolled customers, 25% make purchases over 100 euros.
• Among all customers, 15% make purchases over 100 euros.

The management intends to determine the probability that a customer who made a
purchase over 100 euros is enrolled in the loyalty program.

Definition of events:

• 'I' represents the event 'the customer is enrolled in the loyalty program'.
• 'A' represents the event 'the customer makes a purchase over 100 euros'.

Calculate this probability and analyze what this data indicates about the effectiveness of
the program.

Solution

To solve this problem, we use the concept of conditional probability and Bayes’ theorem.

We have:

• P(I) = 0.40 (probability that a customer is enrolled in the loyalty program)
• P(A|I) = 0.25 (probability that an enrolled customer makes a purchase over 100 euros)
• P(A) = 0.15 (probability that any customer makes a purchase over 100 euros)

We want to calculate P(I|A), the probability that a customer who made a purchase over 100
euros is enrolled in the program.

Using Bayes' theorem:

P(I|A) = P(A|I) · P(I) / P(A)

Substituting the values:

P(I|A) = (0.25 · 0.40) / 0.15 = 0.10 / 0.15 ≈ 0.6667
Therefore, the probability that a customer who made a purchase over 100 euros is enrolled
in the loyalty program is approximately 66.67%.

This analysis shows that the loyalty program is effective in generating high-value
purchases, as a significant proportion of such purchases are made by customers enrolled in
the program.

Solution with Python

def calculate_probability():
    # Definition of initial probabilities
    # Probability that a customer is enrolled in the loyalty program
    P_I = 0.40

    # Probability that an enrolled customer makes a purchase over 100 euros
    P_A_given_I = 0.25

    # Probability that a customer makes a purchase over 100 euros
    P_A = 0.15

    # Calculate the conditional probability using Bayes' theorem
    P_I_given_A = (P_A_given_I * P_I) / P_A
    return P_I_given_A

probability = calculate_probability()
print(f"The probability that a customer who made a purchase over 100 euros is enrolled in the loyalty program is {probability:.2%}")

1. P_I is defined as the probability that a customer is enrolled in the loyalty program, P_A_given_I as the probability of making a purchase over 100 euros given that the customer is enrolled, and P_A as the probability of such purchases for any customer.
2. Using Bayes' theorem, we calculate P_I_given_A, which is the probability that a customer who made a purchase over 100 euros is enrolled in the program. Simply multiply P_A_given_I by P_I and divide by P_A.
3. Finally, the calculated value P_I_given_A is formatted and printed to show the probability that a customer who made a purchase over 100 euros is enrolled in the loyalty program, expressed as a percentage.

This code highlights the effectiveness of the loyalty program in generating significant
purchases among enrolled customers, as a good portion of the purchases over 100 euros
are made by loyal customers.
Chapter 5
Probability Distributions
In this chapter, we will present some practical exercises on
the most common probability distributions, with a focus on
their application to concrete scenarios. The goal is to place
these distributions in realistic business contexts, providing
the reader with useful tools for analysis and solving practical
problems.

We will closely examine the main probability distributions,


including the binomial distribution, which is used to model
the number of successes in a series of independent trials;
the Poisson distribution, useful for describing the number of
rare events occurring over a given time interval; the
exponential distribution, employed to model the time
between successive events in a Poisson process; the
uniform distribution, which represents situations where all
outcomes are equally likely; the triangular distribution, often
used in decision-making and management contexts due to
its simplicity; and finally, the normal distribution,
fundamental in statistics due to its frequent occurrence in
natural and business phenomena.

In addition to understanding the shape and characteristics


of each distribution, the proposed exercises will guide the
reader in identifying the most suitable model depending on
the type of problem.

The purpose of this chapter is thus twofold: to provide a


practical understanding of the use of probability
distributions and to develop the ability to interpret the
results obtained to make informed and data-driven
decisions.
5.1 Binomial Distribution

The binomial distribution describes the number of successes in a sequence of n independent trials, each with two possible outcomes (success or failure) and success probability p. The probability function is:

P(X = k) = C(n, k) · p^k · (1 - p)^(n-k),  k = 0, 1, ..., n

where C(n, k) = n! / (k!(n - k)!) is the binomial coefficient, which counts the number of ways to obtain k successes out of n trials.

The main quantities of the distribution are:

• Mean: E[X] = np
• Variance: Var(X) = np(1 - p)

It can be used in these business areas:

• Sales analysis (probability of closing k contracts out of n contacted customers).
• Quality control (number of defective products in a batch).
• Marketing (probability of a positive response to an advertising campaign).
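As a minimal sketch with illustrative parameters (not tied to any exercise), these quantities can be computed with scipy.stats.binom:

from scipy.stats import binom

n, p = 20, 0.3  # illustrative: 20 independent trials, 30% success probability

print(binom.pmf(5, n, p))   # P(X = 5)
print(binom.cdf(5, n, p))   # P(X <= 5)
print(binom.mean(n, p))     # n * p = 6.0
print(binom.var(n, p))      # n * p * (1 - p) = 4.2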

Exercise 59. Sales Forecasting on an Ecommerce Site An


ecommerce company has launched a new advertising campaign and
wishes to predict the effectiveness of this initiative. For a sample of
1000 site visitors, the analyst has determined that the probability of a
single visitor making a purchase is, on average, 15%. Calculate the
probability that more than 200 of these visitors will make a purchase.
Assume that the purchasing decisions of different visitors are
independent of each other.

Solution
To solve this problem, we can model the number of purchases made
as a discrete random variable, specifically using a binomial
distribution. Given N total visitors and p as the probability of purchase
for each visitor, the probability of having n successful events
(purchasing users) is defined by the binomial distribution:

P(n) = C(N, n) · p^n · (1 - p)^(N-n)

where the binomial coefficient C(N, n) is defined as:

C(N, n) = N! / (n!(N - n)!)

The notation n! indicates the factorial of n, which is n · (n - 1) · (n - 2) · ... · 1.

The steps to follow are:

1. Initially calculate P(X ≤ 200) using the cumulative distribution of the binomial: this can be done using dedicated functions in statistical software or binomial tables (cf. subsequent solutions).
2. Then, use the complement of the cumulative probability: P(X > 200) = 1 - P(X ≤ 200).
3. With the parameters of the problem, the probability of having
fewer than 200 purchases is very high, so the complementary
probability of having more than 200 purchases will be very low.

Solution with Python

from scipy.stats import binom

# Problem parameters
n = 1000  # Number of visitors
p = 0.15  # Purchase probability per visitor

# Calculate P(X <= 200)
p_cum_200 = binom.cdf(200, n, p)

# Use the complement to get P(X > 200)
p_greater_200 = 1 - p_cum_200

# Result
p_greater_200
In this code, we use the binomial distribution to calculate the
probability that more than 200 out of 1000 visitors make a purchase.
We use the binom.cdf function from the scipy.stats library to calculate
the cumulative probability up to 200 purchases. The cdf method
(which stands for Cumulative Distribution Function) gives us the
probability that the random variable X (number of purchases) takes
values less than or equal to 200.

The parameter n represents the total number of visits, while p is the


probability of purchase for a single visitor. The binom.cdf function
returns the sum of probabilities from the base (0 purchases) up to 200
purchases.

To find the probability of having more than 200 purchases, we calculate the complement of P(X ≤ 200) by subtracting this probability from 1, since the events P(X ≤ 200) and P(X > 200) together cover all possible outcomes.

In summary, the code uses the scipy.stats library to perform


calculations on the binomial distribution accurately and efficiently,
relieving the need to perform complex manual calculations on large
datasets, such as in the case of a sample of 1000 visitors.

Exercise 60. Analysis of Fraudulent Claims A financial company


conducts an analysis on the reimbursement claims received from its
clients, noting that historically 2% of the claims turn out to be
fraudulent. In a specific month, 300 claims are submitted to the
department. Calculate the probability that at least 10 of these claims
are fraudulent. Assume that each claim is independent of the others.

Solution

To solve this problem, we use an approach based on discrete binomial


distribution calculations and focus on the theoretical model of
repeated independent events. Here, the total number of trials is 300
(the number of claims) and the probability of success for a trial is 2%
(percentage of fraudulent claims).

Define the random variable X as the number of fraudulent


reimbursements out of 300 claims. The variable X thus follows a
binomial distribution for which we analyze the required probability.
The probability of having at least 10 fraudulent claims is given by:

P(X ≥ 10) = 1 - P(X < 10)

To calculate P(X < 10), we sum the probabilities of getting from 0 to 9 successes:

P(X < 10) = Σ_{k=0}^{9} P(X = k)

Each term P(X = k) is calculated as:

P(X = k) = C(300, k) · (0.02)^k · (0.98)^(300-k)

By calculating the values, we obtain P(X < 10). Knowing this amount, we determine P(X ≥ 10), thus solving the problem of fraud analysis in the reimbursements submitted that month.

Solution with Python

from scipy.stats import binom

# Parameters
n = 300   # claims
p = 0.02  # probability that a claim is fraudulent

# Probability of having less than 10 fraudulent claims
prob_less_than_10 = binom.cdf(9, n, p)

# Probability of having at least 10 fraudulent claims
prob_at_least_10 = 1 - prob_less_than_10

prob_at_least_10

In this code, we use the scipy library, specifically the binom module,
which manages the binomial distribution. This allows us to calculate
the cumulative probability related to discrete events in a series of
trials.

Parameters:

• n: Represents the total number of claims, i.e., 300 in this case.


• p: Is the probability of success for a single claim, i.e., the
probability that a claim is fraudulent (2% or 0.02).

For probability calculation, we use the function binom.cdf(k, n, p)


which calculates the cumulative probability up to k successes (in this
case, from 0 to 9 successes). The function binom.cdf returns the
probability that the number of successes is less than or equal to k.
prob_less_than_10 calculates the cumulative probability of having less than 10 fraudulent claims out of 300.

Finally, we subtract prob_less_than_10 from 1 to determine the probability of having at least 10 fraudulent claims (prob_at_least_10). This uses the relationship:

P(X ≥ 10) = 1 - P(X < 10)

The result provides a measure of the desired probability, thus solving


the problem.
5.2 Poisson Distribution

The Poisson distribution models the number of events that occur in a given interval of time
or space, assuming that the events are independent and occur at a constant average rate
λ (lambda). The probability function of having k events is:

P(X = k) = (λ^k · e^(-λ)) / k!,  k = 0, 1, 2, ...
The main properties are:

• Mean: E[X] = λ
• Variance: Var(X) = λ

Here are some possible applications:

• Call center management (number of calls per hour).
• Fault analysis in production lines (number of defects per unit of time).
• Customer flows in a store (number of arrivals in a time interval).
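As a minimal sketch with an illustrative rate, the corresponding quantities can be computed with scipy.stats.poisson:

from scipy.stats import poisson

lam = 4  # illustrative average number of events per interval

print(poisson.pmf(2, lam))   # P(X = 2)
print(poisson.cdf(6, lam))   # P(X <= 6)
print(poisson.mean(lam))     # equals lambda
print(poisson.var(lam))      # equals lambda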

Exercise 61. Customer Flow Management in a Retail Store In a shopping mall, an


electronics store has observed that the average number of customers entering one of its
branches in an hour is 25. The staff management has identified 30 customers as the
maximum manageable number of people at the same time without compromising the
quality of service.

You want to calculate the probability that more than 30 customers will enter a particular
retail outlet within an hour.

Solution

The described situation can be represented with a Poisson distribution, which is suitable for
modeling the number of events occurring in a fixed interval of time, given a certain
average rate of occurrence and the independence between events.

To calculate the probability that more than 30 customers enter in an hour, we first calculate
the probability that exactly 0, 1, ..., 30 customers enter, and then subtract the sum of
these probabilities from 1.

The formula for the Poisson distribution is:

P(X = k) = (λ^k · e^(-λ)) / k!

where λ is the average rate, i.e., 25 customers per hour, and k is the number of customers.

We then calculate:

P(X > 30) = 1 - Σ_{k=0}^{30} (25^k · e^(-25)) / k!

Using calculation software or a scientific calculator, we find:

P(X > 30) ≈ 0.137

Hence, there is approximately a 13.7% probability that more than 30 customers enter within an hour.

Solution with Python

from scipy.stats import poisson

# Poisson distribution parameters
lambda_value = 25
threshold = 30

# Calculate cumulative probability P(X <= 30)
probability_less_than_equal_30 = poisson.cdf(threshold, lambda_value)

# Calculate the probability that X > 30
probability_greater_than_30 = 1 - probability_less_than_equal_30

# Print the probability
print(f"Probability that more than 30 customers enter in an hour: {probability_greater_than_30:.3f}")

In our case, we want to calculate P(X > 30), the probability that more than 30 customers enter the store in an hour. This is 1 minus the probability that at most 30 customers enter, or P(X ≤ 30).

The key steps in the code are:

1. Define the parameters: lambda_value representing the average customer arrival rate (25) and threshold as the maximum customer threshold (30).
2. Calculate the cumulative probability: we use the SciPy function poisson.cdf(threshold, lambda_value) to find the cumulative probability up to 30 customers.
3. Find the desired probability: the sought probability is P(X > 30) = 1 - P(X ≤ 30).
4. Display the result: we use formatted output to show the calculated probability with three decimal places, which is about 0.137, i.e., 13.7%.

This procedure efficiently expresses the problem’s solution using a programming language
and the use of an advanced statistical library.

Exercise 62. Warehouse Management Problem A small e-commerce company


receives an average of 3 fragile product orders per day. The warehouse capacity allows
careful handling of up to 5 fragile product orders daily without incurring overload damage.

Calculate the probability that the number of fragile orders received in one day exceeds the
manageable number of 5 orders.

Solution

To solve this problem, the Poisson distribution can be used, which models the probability of
a given number of events occurring in a fixed interval of time when the events are
independent and occur with a known average rate.

The average arrival rate of fragile orders is A = 3 orders per day. We need to find the
probability that more than 5 orders are placed in one day.

The probability of observing k events in a time interval with a mean of λ is given by the Poisson distribution formula:

P(X = k) = (λ^k · e^(-λ)) / k!

where e is the base of the natural logarithm, approximately equal to 2.71828.

We calculate the probability of receiving at most 5 orders, P(X ≤ 5), and subtract it from 1 to get the complementary probability:

P(X > 5) = 1 - Σ_{k=0}^{5} (3^k · e^(-3)) / k!

Carrying out the calculations, we find:

P(X ≤ 5) ≈ 0.9161

Thus,

P(X > 5) = 1 - 0.9161 = 0.0839

The probability that the number of fragile orders received in one day is more than 5 is
approximately 0.0839, or 8.39%. This value indicates that the risk of daily overload is
relatively low but not negligible, suggesting that warehouse management should anticipate
potential peak periods.

Solution with Python

from scipy.stats import poisson

# Lambda parameter for the Poisson distribution
lambd = 3

# Calculate the probability P(X <= 5)
prob_X_leq_5 = poisson.cdf(5, lambd)

# The probability that more than 5 orders are placed in one day is P(X > 5)
prob_X_gt_5 = 1 - prob_X_leq_5

# Output the probability
prob_X_gt_5

In this code, we use the scipy library, specifically the stats module, which offers a wide
variety of statistical distributions, including the Poisson distribution. The function
poisson.cdf calculates the cumulative distribution function of the Poisson distribution, which
is the probability of obtaining a value less than or equal to a certain number, given a mean
value (lambda). In our case, lambda is equal to 3, representing the average number of
fragile product orders per day.

We calculate the cumulative probability up to 5 orders using poisson.cdf(5, lambd), and then obtain the complementary probability of having more than 5 orders by subtracting this value from 1.

The scipy.stats.poisson object is particularly useful for calculating the Poisson distribution without needing to write the mathematical formulas manually, reducing the risk of errors.

This code is a basic example of how to simplify probability calculations for events
distributed over time that are rare and independent.
5.3 Exponential Distribution

The exponential distribution models the time between two successive events in a Poisson
process, where events occur independently at a constant average rate λ. The probability density function (PDF) is:

f(x) = λ · e^(-λx),  x > 0, λ > 0.

The main properties are:

• Mean: E[X] = 1/λ
• Variance: Var(X) = 1/λ²

It can be used for the following purposes:

• Modeling waiting times in service systems (e.g., time between customer arrivals).
• Reliability analysis (lifetime before failure in electronic devices).
• Finance (time between significant changes in stock prices).
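As a minimal sketch with an illustrative rate, these quantities can be computed with scipy.stats.expon, which is parameterized by scale = 1/λ:

from scipy.stats import expon

lam = 0.5          # illustrative rate: one event every 2 time units on average
scale = 1 / lam    # scipy uses scale = 1/lambda for the exponential

print(expon.cdf(3, scale=scale))   # P(X <= 3) = 1 - e^(-lambda*3)
print(expon.sf(3, scale=scale))    # P(X > 3), the survival function
print(expon.mean(scale=scale))     # 1/lambda = 2.0
print(expon.var(scale=scale))      # 1/lambda^2 = 4.0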

Exercise 63. Managing Customer Flow in a Retail Store An electronics store has
observed that the average waiting time between the arrival of one customer and the next
is 5 minutes. The store manager wants to improve staff efficiency and optimize customer
service. Calculate the probability that, in any given period, two customers will arrive less
than 10 minutes apart.

Solution

To solve this problem, we use the concept of the exponential distribution, which is suitable
for modeling the time between independent events occurring at a constant average rate.

In our case, the average time between the arrival of two customers is 5 minutes. We want
to find the probability that the time X between two arrivals is less than 10 minutes. The
cumulative distribution function (CDF) for an exponentially distributed random variable is
given by:
P(X ≤ t) = 1 - e^(-λt)

In this context, λ = 1/5 for the exponential distribution, because the inverse of the average of 5 minutes determines the parameter of the exponential distribution. We want to calculate the probability that the time between two customers is less than 10 minutes.

Therefore:

P(X < 10) = 1 - e^(-(1/5)·10) = 1 - e^(-2) ≈ 1 - 0.1353 = 0.8647

Thus, the probability that two customers arrive less than 10 minutes apart is approximately
86.47%

Solution with Python

from scipy.stats import expon

# Parameter lambda of the exponential distribution
lambda_param = 1/5

# Calculate P(X <= 10)
probability = expon.cdf(10, scale=1/lambda_param)

# Print the probability
print("The probability that two customers arrive less than 10 minutes apart is:", probability)

The problem involves the exponential distribution, which models the time between events
in a continuous Poisson process. In this context, we are modeling the time between the
arrival of two customers in the store.

Steps in the code:

1. Define the lambda parameter:
   o lambda_param = 1/5 represents the rate of the exponential distribution, calculated as the inverse of the average time (5 minutes).
2. Calculate the probability with the CDF function:
   o expon.cdf(10, scale=1/lambda_param) calculates the cumulative probability that the exponential random variable is less than or equal to 10 minutes. The cdf function returns this probability directly.
   o Here, scale=1/lambda_param is used because scipy parameterizes the exponential distribution by its scale (the mean) rather than by the rate.
3. Display the result:
   o print(...) is used to show the calculated probability to the user, indicating that there is an 86.47% chance that two customers will arrive less than 10 minutes apart.

Using scipy, the calculation and manipulation of statistical distributions become simple and
efficient, as the library provides ready-to-use functions for many common statistical
distributions.

Exercise 64. Efficiency of a Call Center's Response In a telemarketing service


company, the team manager is evaluating the effectiveness of its customer response
system. Currently, the average response time of the operators, from the moment a call
arrives until its actual handling, is 2 minutes. We want to determine the probability that a
customer will have to wait more than 3 minutes before receiving a response, as they
typically hang up at that point. Evaluating this probability will help the company to verify
the adequacy of its resources and implement any improvements to avoid losing contact
with potential customers.

Solution

In this context, the response time of the operators follows an exponential distribution. The
main characteristic of this distribution is that it describes the time between events that
occur with a certain frequency (in our case, the average response time of 2 minutes). The
cumulative distribution function (CDF), which expresses the probability that an exponential random variable is less than a certain value, can be represented as F(t) = 1 - e^(-λt), where λ is the inverse of the average response time (2 minutes in our exercise). Therefore, the probability of responding after 3 minutes is the complement of the CDF for t = 3.

The probability sought is P(T > 3) = 1 - F(3) = e^(-3/2) = e^(-1.5) ≈ 0.2231. Thus, the probability that
a customer has to wait more than 3 minutes is approximately 22.31%. The result indicates
that almost one in four customers might hang up due to excessive waiting. The company
might consider increasing staff or optimizing processes to reduce wait times, improving the
customer experience.

Solution with Python

import numpy as np

# Average response rate (inverse of the 2-minute average response time)
lambda_ = 1.0 / 2

wait_time = 3  # minutes

def probability_exceeds_wait(t, lambda_):
    # Calculate the complement of the CDF to get P(T > t)
    return np.exp(-lambda_ * t)

# Calculate the probability that the wait exceeds 3 minutes
probability = probability_exceeds_wait(wait_time, lambda_)

print(f"The probability that a customer waits more than {wait_time} minutes is about {probability:.4f} (i.e., {probability * 100:.2f}%)")

Let's see it in detail:

• numpy library: it is mainly used here to compute the exponential function e^(-1.5). The imported function np.exp() calculates the exponential of a given input number, which is necessary for evaluating the exponential distribution.
• lambda_ represents the inverse of the average response time, which is 2 minutes in this case; wait_time is set to 3 minutes, the time after which the customer tends to hang up.
• The function probability_exceeds_wait(t, lambda_) computes the complementary probability of the CDF (Cumulative Distribution Function) using the formula P(T > t) = e^(-λt), which represents the probability that a customer waits more than t minutes.
• Finally, a print statement formats and displays the calculated probability as a readable percentage.

This calculation helps understand resource adequacy in a customer service situation, to determine whether it is necessary to improve processes or increase staff to reduce wait times.
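The same result can also be obtained with scipy.stats.expon, as a cross-check of the numpy version above (an alternative, not the code used in the exercise):

from scipy.stats import expon

# P(T > 3) for an exponential with mean 2 minutes (scale = mean)
print(expon.sf(3, scale=2))  # survival function, approximately 0.2231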
5.4 Uniform Distribution

The uniform distribution describes a random variable that has the same probability of
taking any value within an interval [a, b]. There are two main types:

• Discrete uniform: each discrete value in a finite set has an equal probability.
• Continuous uniform: the probability density is constant over an interval [a, b].

For a continuous uniform distribution, the probability density function (PDF) is:

f(x) = 1 / (b - a),  a ≤ x ≤ b.

The main properties are:

• Mean: E[X] = (a + b) / 2
• Variance: Var(X) = (b - a)² / 12
Here are some possible uses:

• Simulation of scenarios where every outcome is equally likely.
• Generation of random numbers in computational models.
• Production planning when demand is uncertain but uniformly distributed over an interval.
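As a minimal sketch with illustrative endpoints, these quantities can be computed with scipy.stats.uniform, where loc = a and scale = b - a:

from scipy.stats import uniform

a, b = 2, 8  # illustrative interval endpoints

print(uniform.mean(loc=a, scale=b - a))    # (a + b) / 2 = 5.0
print(uniform.var(loc=a, scale=b - a))     # (b - a)^2 / 12 = 3.0
print(uniform.cdf(4, loc=a, scale=b - a))  # P(X <= 4) = (4 - a) / (b - a)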

Exercise 65. Sales Analysis in a Clothing Store A clothing store receives daily supplies
of a particular item of clothing. The supply manager, to better plan orders, has determined
that the number of items sold daily ranges from a minimum of 10 to a maximum of 50
items. With this in mind, calculate the average number of items sold per day. Assume that
sales are equally likely in this range.

Solution

In this exercise, we are using the concept of a continuous uniform distribution. When we
have a uniform distribution, the mean (or expected value) of a random variable X that has
minimum a and maximum b is given by the formula: μ = (a + b) / 2

In our problem, a = 10 and b = 50. Applying the formula for the mean of the uniform distribution, we get:

μ = (10 + 50) / 2 = 60 / 2 = 30

Solution with Python

from scipy.stats import uniform

a = 10  # minimum number of items sold
b = 50  # maximum number of items sold

# Calculate the mean of the uniform distribution
average_items_sold = uniform.mean(loc=a, scale=(b - a))

print(f"The average number of items sold daily is: {average_items_sold}")

In this Python code, we calculate the average number of items sold using the mean of a uniform distribution.

Let's look at the code in detail:

• We use scipy.stats, a Python library for statistics that offers useful functions for working
with distributions.
• We define a as the minimum number of items sold and b as the maximum number of
items sold, which are 10 and 50, respectively.
• We use uniform.mean(loc=a, scale=(b-a)) to calculate the mean of the uniform distribution. Here, loc represents the parameter a, while scale represents the length of the interval (b - a).
• Finally, we print the result, which returns the value 30, the mean of the specified range; so the average number of items sold daily under the uniform distribution is 30.

Exercise 66. Optimization of Logistics in a Warehouse An e-commerce company


wants to optimize warehouse management. They know that the delivery time of a supplier
for orders varies from 2 to 5 days. The logistics manager wants to determine the
probability that the delivery time for an order is less than 4 days. Also, calculate the
average delivery time and discuss how this information could influence the management of
safety stock in the warehouse.

Solution

The exercise is based on a uniform distribution. In the case of a continuous uniform distribution, given an interval [a, b], we can say that:

P(X < x) = (x - a) / (b - a)

To calculate the probability that the delivery time is less than 4 days, we use a = 2 days and b = 5 days. Therefore:

P(X < 4) = (4 - 2) / (5 - 2) = 2/3 ≈ 0.6667
Thus, there is a 66.67% probability that the delivery time is less than 4 days.

For the average delivery time, we use the formula for the expected value of a uniform distribution:

E[X] = (a + b) / 2 = (2 + 5) / 2 = 3.5

These calculations suggest that the average delivery time is 3.5 days. With this
information, the logistics manager can decide to maintain sufficient stock to handle
occasional delays by setting the safety stock level to cover at least those rare cases where
the delivery time reaches the maximum limit of 5 days. Using the uniform distribution
allows for estimating these waiting times directly and easily implementing them in
management strategies.

Python Solution

from scipy.stats import uniform

# Delivery days interval
min_days = 2
max_days = 5

# Calculate probability that delivery time is less than 4 days
probability_less_than_4 = uniform.cdf(4, loc=min_days, scale=max_days - min_days)

# Calculate average delivery time
average_delivery_time = uniform.mean(loc=min_days, scale=max_days - min_days)

# Output results
print(f"Probability that delivery time is less than 4 days: {probability_less_than_4:.4f}")
print(f"Average delivery time: {average_delivery_time} days")

Here's a detailed look at the code:

• We define min_days and max_days as the endpoints of the interval.
• The function uniform.cdf(x, loc, scale) calculates the cumulative distribution function (CDF) for the uniform distribution. In this case, loc is the minimum value (2 days) and scale is the difference between the maximum and minimum (5 - 2 days). cdf returns the probability that the random variable is less than x, which in this case is 4 days.
• The function uniform.mean(loc, scale) returns the mean or expected value of the random variable, which for a uniform distribution is the average of its endpoints.

Finally, we print the results showing the calculated probability and the average delivery
time. This information helps the logistics manager consider maintaining an adequate level
of safety stock to mitigate the impact of possible delays.
5.5 Triangular Distribution

The triangular distribution is a continuous probability distribution that derives its name
from the triangular shape of its probability density function. It is characterized by three
main parameters: the minimum, the maximum, and the mode (modal value). The
probability density function is defined over an interval [a,b], with c representing the mode,
which is the point where the probability is highest.

The probability density function for a random variable X following a triangular distribution
is defined as:

f(x) = 2(x - a) / ((b - a)(c - a))   if a ≤ x ≤ c,
f(x) = 2(b - x) / ((b - a)(b - c))   if c < x ≤ b,
f(x) = 0   otherwise.

The expected value E[X] and the variance Var(X) for a triangular distribution are given by:

E[X] = (a + b + c) / 3,   Var(X) = (a² + b² + c² - ab - ac - bc) / 18
The triangular distribution is useful in situations where the extreme values (minimum and
maximum) are known, while the mode represents a plausible estimate based on
experience or previous data. It is often used in simulation models, risk analysis, and
forecasting, particularly in project management contexts.
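As a minimal sketch with illustrative parameters, scipy.stats.triang implements this distribution; it takes the mode as a fraction of the interval, with loc = a and scale = b - a:

from scipy.stats import triang

a, b, c = 2, 10, 6  # illustrative minimum, maximum, and mode

c_rel = (c - a) / (b - a)  # mode expressed as a fraction of the interval
dist = triang(c_rel, loc=a, scale=b - a)

print(dist.mean())   # (a + b + c) / 3 = 6.0
print(dist.var())    # (a^2 + b^2 + c^2 - ab - ac - bc) / 18 ≈ 2.67
print(dist.cdf(6))   # P(X <= 6) = 0.5 here, since the mode sits at mid-interval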

Exercise 67. Optimization of the Product Launch Timeline In a software development


company, the project team is planning the launch of a new product. It is estimated that the
time necessary to complete the final testing phase varies between a minimum of 5 days
and a maximum of 15 days. Historical information and team experience suggest that the
most likely time to complete this activity is around 10 days. However, the marketing team
needs to know the probability that this phase will take more than 10 days in order to plan
any corrective actions. Determine this probability considering that the time taken follows a
specific continuous distribution between days 5 and 15.

Solution

To solve this problem, we use a triangular distribution, suitable for modeling phenomena
where we know a minimum, a maximum, and a most likely value within an interval. In our
case, the minimum (a) is 5 days, the maximum (b) is 15 days, and the most likely value (c)
is 10 days. The cumulative distribution function (CDF) for a triangular distribution can be
used to calculate the probability that the time taken exceeds a given value.

For a triangular distribution, the CDF up to a certain value x is calculated as follows:

1. If x ≤ c, then the probability is given by:

P(X ≤ x) = (x - a)² / ((b - a)(c - a))

2. If x > c, then:

P(X ≤ x) = 1 - (b - x)² / ((b - a)(b - c))

Since we want to calculate the probability that the duration exceeds 10 days, we use:

P(X > 10) = 1 - P(X ≤ 10)

Substituting, we obtain:

P(X ≤ 10) = (10 - 5)² / ((15 - 5)(10 - 5)) = 25 / 50 = 0.5

Therefore:

P(X > 10) = 1 - 0.5 = 0.5

Thus, the probability that the testing phase takes more than 10 days is 50%. This allows
the marketing team to know that there is a significant possibility that the launch might be
delayed and to prepare accordingly.

Solution with Python

from scipy.stats import triang

# Parameters of the triangular distribution
# minimum
a = 5
# maximum
b = 15
# most likely value (mode)
c = 10

# Calculate the scale and position for the triangular distribution
# loc is the start of the interval
loc = a
# scale is the difference between the maximum and the minimum
scale = b - a
# c as a fraction between 0 and 1
c_relative = (c - a) / scale

# Create the triangular distribution object
triang_dist = triang(c_relative, loc=loc, scale=scale)

# Calculate the probability that X > 10, that is 1 - P(X <= 10)
x = 10
probability_more_than_x = 1 - triang_dist.cdf(x)

# Result
probability_more_than_x

In this solution, we used the scipy.stats library, which provides many statistical distributions, including the triangular distribution.

We defined a triangular distribution with known parameters: minimum (a) equal to 5, maximum (b) equal to 15, and the mode (c), which is the most likely value, equal to 10. The triangular distribution is defined with three parameters: loc, scale, and a shape parameter c_relative, which is the relative position of the mode within the interval from loc to loc + scale.

The triang function uses these parameters to create a triangular distribution object. Then, using the cdf method provided by the distribution, we calculate P(X <= x) (the cumulative probability up to 10 days). The probability that the time exceeds 10 days is 1 - P(X <= 10), and this calculation gives us the probability that the marketing team requires for their planning.

Exercise 68. Inventory Management and Delivery Time Estimation An online store
selling technology products is optimizing the inventory management of its warehouses. The
company wants to predict the time needed to receive a new batch of products from an
overseas supplier. Based on previous experiences, the delivery time can vary between a
minimum of 7 days and a maximum of 21 days. However, a delivery time of 14 days is
considered the most probable according to the commercial agreements with the supplier.
The logistics manager would like to know the probability that an order will take less than 12
days to arrive, in order to improve the planning of sales promotions and warehouse
management. Calculate this probability.

Solution

To solve this exercise, we use the concept of continuous triangular distribution. The
triangular distribution is defined by three parameters: the minimum value (a), the
maximum value (b), and the most probable value (c). In this case, we have a = 7, b = 21,
and c = 14.

The cumulative distribution function (CDF) for a triangular distribution is given by two
expressions, one for the increasing interval (from a to c) and one for the decreasing
interval (from c to b). To calculate the probability that the delivery times are less than 12
days, we use the part of the CDF for the increasing branch:
F(x) = (x - a)² / ((b - a)(c - a))   for a ≤ x ≤ c

Substituting the values, for x = 12 we obtain:

F(12) = (12 - 7)² / ((21 - 7)(14 - 7)) = 25 / 98 ≈ 0.2551

Therefore, the probability that the delivery time is less than 12 days is approximately
25.51%.

Solution with Python

from scipy.stats import triang

# Parameters of the triangular distribution
a = 7   # minimum value
b = 21  # maximum value
c = 14  # most probable value

# Calculate the loc and scale parameters for scipy
loc = a
scale = b - a
c_param = (c - a) / scale

# Create the triangular distribution object
triang_distribution = triang(c_param, loc=loc, scale=scale)

# Calculate the probability that the delivery time is less than 12 days
prob_less_than_12 = triang_distribution.cdf(12)

print(f"The probability that the order takes less than 12 days is approximately {prob_less_than_12 * 100:.2f}%")

In this code, we use the triangular distribution provided by scipy.stats.

To create an instance of this distribution, the parameters to specify are:

• c_param: the shape parameter of the distribution, calculated as (c - a) / (b - a), where c is the most probable value, a is the minimum value, and b is the maximum.
• loc: the location parameter, which in this case is equal to the minimum value a.
• scale: the scale parameter of the distribution, calculated as b - a.

We create a triang_distribution object using these parameters. The cdf function (cumulative distribution function) of this object allows us to calculate the probability that a random variable is less than or equal to a certain value. In this case, we calculated triang_distribution.cdf(12), which gives us the probability that the delivery time is less than 12 days. The result is then multiplied by 100 and formatted to be expressed as a percentage.
5.6 Normal Distribution

The normal distribution, also known as the Gaussian distribution, is one of the most
common and important probability distributions in statistics. It is a continuous distribution
characterized by two main parameters: the mean (μ) and the standard deviation (σ).

Its probability density function (PDF) has the shape of a symmetric bell curve around the mean value μ, and is defined as:

f(x) = (1 / (σ · √(2π))) · exp(-(x - μ)² / (2σ²))

where:

• μ (mu) is the mean of the distribution, indicating the central point of the curve,
• σ (sigma) is the standard deviation, which measures the dispersion of the values around the mean,
• exp denotes the exponential function.

Here are some properties of the normal distribution:

• The normal distribution is symmetric with respect to its mean value μ, which implies that the probability of observing a value greater than μ is the same as observing one that is smaller.
• The normal distribution has the expected value E[X] = μ and the variance Var(X) = σ².
• The curve of the normal distribution is bell-shaped, with most values concentrated around the mean μ.
• The probability density function tends to zero as it moves away from the mean. This means that the probability of observing extreme values (far from the mean) is extremely low.
• Approximately 68% of the values lie within one standard deviation from the mean (μ ± σ), about 95% lie within two standard deviations (μ ± 2σ), and about 99.7% lie within three standard deviations (μ ± 3σ).
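As a minimal sketch with illustrative parameters, these properties can be checked numerically with scipy.stats.norm:

from scipy.stats import norm

mu, sigma = 100, 15  # illustrative mean and standard deviation

# Fraction of values within one and two standard deviations of the mean
print(norm.cdf(mu + sigma, loc=mu, scale=sigma)
      - norm.cdf(mu - sigma, loc=mu, scale=sigma))        # about 0.68
print(norm.cdf(mu + 2 * sigma, loc=mu, scale=sigma)
      - norm.cdf(mu - 2 * sigma, loc=mu, scale=sigma))    # about 0.95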

Exercise 69. Delivery Time Analysis A logistics company has recorded the delivery
times of its packages in various cities. Delivery times are influenced by numerous factors
such as traffic, weather conditions, and distance. Management wants to estimate the
average delivery time in a specific city to optimize customer service.

A study is conducted where 45 days are randomly selected, and delivery times for each
package are recorded. The result is a dataset with an average of 50 minutes and a
standard deviation of 10 minutes.

Determine the probability that the average delivery time exceeds 52 minutes based on the
sample taken, assuming the distribution of delivery times on any given day can vary.

Assume the normality of the samples in the calculations.

Solution

The problem requires determining the probability that the average delivery time, based on
a sample of 45 days, is greater than 52 minutes.
From the exercise, we are given the sample mean x = 50 minutes and the sample standard
deviation s = 10 minutes, with a sample size n = 45.

According to the central limit theorem, if n is sufficiently large (typically n > 30), the distribution of the sample mean x approaches a normal distribution centered on the population mean μ with a standard deviation of s/√n. This allows us to calculate the probability using:

z = (x - μ) / (s/√n)

Assuming that μ = 50 (the unknown population mean, estimated from the sample mean), we have:

z = (52 - 50) / (10/√45) = 2 / 1.49 ≈ 1.34
Using the standard normal distribution table, we find that the probability P(Z > 1.34) is
approximately 0.09.

Thus, there is a 9% probability that the average delivery time exceeds 52 minutes.

Solution with Python

from scipy.stats import norm
import numpy as np

# Problem data
mean_sample = 50          # sample mean
std_sample = 10           # sample standard deviation
n = 45                    # sample size
mean_hypothesized = 52    # threshold of interest

# Calculation of standard error of the sample mean
std_error = std_sample / np.sqrt(n)

# Calculation of Z-score
z_score = (mean_hypothesized - mean_sample) / std_error

# Probability calculation
probability = 1 - norm.cdf(z_score)

probability

To solve this problem, we use the scipy library, particularly the stats module that provides the standard normal distribution.

We start by defining the data: the sample mean, the sample standard deviation, and the
sample size. We then assume a hypothesized mean of 52 minutes to calculate the
probability of exceeding this value.

We calculate the standard error of the sample mean using std_sample / np.sqrt(n). This step leverages the central limit theorem, which applies because the sample is sufficiently large.

Next, we calculate the z_score, which represents how many standard deviations the value 52 (the hypothesized mean) lies above the sample mean, using the formula (mean_hypothesized - mean_sample) / std_error.

Once z_score is obtained, we calculate the cumulative probability using the norm.cdf(z_score) function. The cdf function returns the value of the cumulative distribution function, so to obtain the probability that Z is greater than z_score, we calculate 1 - norm.cdf(z_score).

In the end, we obtain the desired probability, which in this case returns an approximate
value of 9%. This indicates there is a 0.09 probability that the average delivery time
exceeds 52 minutes.
Exercise 70. Customer Satisfaction Analysis A restaurant chain wants to evaluate the
average customer satisfaction to improve service quality. Management decides to collect
data from satisfaction surveys filled in by customers over 60 different days. The
satisfaction score is a numerical measure ranging from 1 to 100, and the collected data
show an average of 80 with a standard deviation of 12.

The company wants to know the probability that the average satisfaction calculated over
these 60 days is less than 78, considering that satisfaction on any given day can vary
depending on several factors such as waiting time, service quality, and the menu of the
day.

Assume the validity of the normal distribution of satisfaction scores calculated over the
samples.

Solution

To solve this problem, we use a fundamental concept in statistics: the distribution of the
sample mean. In this case, we have a number of samples equal to 60, which is large
enough to apply the central limit theorem, suggesting that the distribution of the sample
means will be approximately normal, regardless of the original data distribution.

The theorem tells us that the mean of the sample means is equal to the population mean
and that the standard deviation of the sample means is the population standard deviation
divided by the square root of the sample size.

Therefore:

Mean of the sample means = 80

Standard deviation of the sample means = 12/√60 ≈ 1.549

We calculate the z value for a sample mean score of 78:

z = (78 - 80) / 1.549 ≈ -1.29

Using a table or statistical software, we find that the probability that the sample mean is
less than 78 is approximately 0.0985, or 9.85%.

This probability represents the likelihood that the average customer satisfaction is
significantly lower than the observed average and could indicate areas for improvement on
days when satisfaction is particularly low.

Solution with Python

from scipy.stats import norm
import math

# Problem data
population_mean = 80
population_std_dev = 12
sample_size = 60
desired_sample_mean = 78

# Calculation of the standard deviation of the sample mean
sigma_sample_mean = population_std_dev / math.sqrt(sample_size)

# Calculation of the z value
z = (desired_sample_mean - population_mean) / sigma_sample_mean

# Calculation of the probability using the standard normal distribution
probability = norm.cdf(z)

# Final result
print(f"The probability that the sample mean is less than {desired_sample_mean} "
      f"is approximately {probability:.4f}")

Here's a breakdown of the steps:

• Import norm from scipy.stats to work with the standard normal distribution, and math for
basic mathematical calculations.
• Define the problem variables, such as the population mean and standard deviation, the
number of samples, and the desired sample mean value.
• Use the formula to obtain the standard deviation of the sample means, which is the
population standard deviation divided by the square root of the sample size.
• Calculate the z value that determines the position of the desired sample mean relative
to the population mean on the standard normal distribution scale.
• Using norm.cdf, we calculate the probability that the sample mean is below the desired
value. The cdf function calculates the cumulative distribution function, useful for finding
the accumulated probability up to a certain point, as in this case.
• Finally, print the calculated probability in both decimal and percentage forms.

This approach allows us to statistically verify the possibility that the average satisfaction
over a given period deviates significantly from the historical average, thanks to the power
of the normal distribution.
Chapter 6
Hypothesis Testing
In this chapter, we will analyze various examples of
statistical hypothesis testing, fundamental tools for making
data-based decisions and evaluating the significance of a
result. Each example will start with the formulation of the
null hypothesis and the alternative hypothesis, then proceed
with the calculation of the p-value or the critical value of a
test statistic. Both methods will be shown depending on the
circumstances, allowing the reader to choose the most
suitable strategy depending on the problem at hand.

The main objective of hypothesis testing is to determine


whether the observed results in the data are statistically
significant or due to chance. This is particularly useful in
fields such as scientific research, quality control, marketing,
and finance, where making decisions based on reliable data
is essential.

Throughout the chapter, we will address the following


statistical tests:

• Student's t-test (one-sample and two-sample): used to


compare the mean of a sample with a benchmark or the
means of two samples and to verify if there are
statistically significant differences between them.
• Z-test for proportions: a test that allows comparing
multiple proportions with each other or with a reference
benchmark, useful for relating success rates to each
other or to a reference value.
• Levene's test for variance: essential for comparing
variability between two or more groups and verifying
the hypothesis of homoscedasticity (equality of
variances), a condition often necessary for other
statistical tests.
• One-way ANOVA: an extension of the Student's t-test
that allows comparing the means of more than two
groups simultaneously, determining if at least one group
has a mean that significantly differs from the others. It
is widely used in experimental and business settings.
• Kruskal-Wallis test: a test that allows verifying the
significance of the differences between the medians of
multiple samples, useful when data are not normally
distributed.
• Kolmogorov-Smirnov test: employed to compare an
empirical distribution with a theoretical distribution or to
compare two empirical distributions with each other,
useful for checking if two samples come from the same
population.
• Shapiro-Wilk test: used to verify the normality of a
dataset, an essential condition for many parametric
tests.
• Chi-square test: one of the most common tests for
verifying the independence between two categorical
variables.
• Fisher's exact test: particularly useful for analyzing
contingency tables with small sample sizes.

Each test will be accompanied by practical examples and


detailed interpretations, making the context of application
and the meaning of the results obtained clear. Moreover, the
limits and validity conditions of each test will be discussed,
allowing the reader to gain a critical understanding of
statistical tools and avoid common errors in data analysis.
6.1 Student's t-test for a Single Sample Mean

The one-sample Student's t-test is a statistical test used to determine if the mean of a
sample comes from a population with a specified mean. This test is particularly useful
when the sample size is relatively small (generally less than 30 values) and the population
variance is unknown. It is a test that assumes the data are normally distributed.

The one-sample Student's t-test is used in various business contexts to test hypotheses
regarding the mean of a single group. Some examples include the following cases:

• A company might want to check if the mean weight of a batch of products conforms to
a specified target value (for example, if the average weight of a package is equal to
500 grams).
• A company may want to test if the average annual earnings of a division are equal to a
target figure, such as a 10% growth.
• A company might use this test to determine if the mean of customer satisfaction scores
from a survey equals a predefined mean (for example, 8 out of 10).

To correctly use the one-sample Student's t-test, the following requirements must be met:

• The observations in the sample must be independent of each other.


• The test is based on the assumption that the sample data follow a normal distribution.
If the sample is large enough, this condition can be approximately met even for
distributions that are not perfectly normal, thanks to the central limit theorem.

The one-sample Student's t-test is based on the formulation of two hypotheses:

• Null hypothesis (H₀): The null hypothesis posits that the sample mean is equal to the
specified population mean. Mathematically:
H₀ : μ = μ₀,
where μ is the population mean and μ₀ is the hypothesized mean value.
• Alternative hypothesis (H₁): The alternative hypothesis posits that the sample mean is
not equal to the specified population mean. It can be two-tailed:
H₁ : μ ≠ μ₀
or one-tailed
H₁ : μ > μ₀ or H₁ : μ < μ₀

The test statistic for the one-sample Student's t-test is given by:

t = (x̄ - μ₀) / (s/√n)

where:

• x̄ is the sample mean,
• μ₀ is the specified population mean (value under test),
• s is the sample standard deviation,
• n is the sample size.

The t statistic follows a Student’s t distribution with n - 1 degrees of freedom. The p-value
of the test determines the rejection of the null hypothesis if it is less than the required
significance level (typically 5%).

Exercise 71. Performance Analysis of a New Company Search Engine A large e-


commerce company recently implemented a new search engine on its website. Before the
change, the average time to find a product was 2.5 seconds. Management wants to verify if
the new search engine has indeed changed the average time required to find a product. For
this reason, a random sample of 30 recent searches was selected, showing an average of
2.7 seconds and a standard deviation of 0.6 seconds. Assuming that the distribution of
search times follows a normal distribution, determine if there is a significant difference in
the average search time with a significance level of 5%.

Solution

To tackle this problem, a two-tailed one-sample Student's t-test is used.

Here are the steps to follow:

1. Definition of hypotheses:
o Null hypothesis (H₀): The average time to find a product with the new search
engine is still 2.5 seconds (μ = 2.5).
o Alternative hypothesis (H₁): The average time to find a product is different from 2.5
seconds (μ ≠ 2.5).
2. Perform the test:
The formula for the t-value is:
t = (x̄ - μ₀) / (s/√n)
Where:
o x̄ = 2.7 (sample mean)
o μ₀ = 2.5 (theoretical mean under H₀)
o s = 0.6 (sample standard deviation)
o n = 30 (sample size)
Calculating:
t = (2.7 - 2.5) / (0.6/√30) ≈ 1.8257
3. Compare with the critical value With a significance level of 5%, we look for the critical
value for a two-tailed t-test and 29 degrees of freedom. Using a t-table, we find that
the critical t-value for 29 degrees of freedom is approximately ±2.045. Since the
calculated t-value (1.8257) falls within the interval from -2.045 to 2.045, we do not
reject the null hypothesis.

In conclusion, there is not enough evidence to conclude that the new search engine has
significantly altered the average time necessary to find a product at the 5% significance
level.

Solution with Python

import scipy.stats as stats
import math

# Problem parameters
sample_mean = 2.7
theoretical_mean = 2.5
standard_deviation = 0.6
sample_size = 30
significance_level = 0.05

# Test statistic and degrees of freedom
t_stat = (sample_mean - theoretical_mean) / (standard_deviation / math.sqrt(sample_size))
df = sample_size - 1

# Critical value for the two-tailed test
critical_value = stats.t.ppf(1 - significance_level / 2, df)

# Decision
test_significant = abs(t_stat) > critical_value

# Solution output
{'t_stat': t_stat, 'critical_value': critical_value, 'test_significant': test_significant}

The code uses the scipy.stats library to perform a one-sample Student's t-test. This library
is powerful for performing statistical tests and associated calculations.

Let's see the various steps:

• Import scipy.stats as stats and math: scipy.stats is useful for statistical functions, while
math simplifies the use of square roots.
• Define variables representing the sample mean (2.7), theoretical mean (2.5), standard
deviation (0.6), and sample size (30).
• Calculate the t-value using the formula t = (x̄ - μ₀) / (s/√n), where x̄ is the sample mean, μ₀ is
the mean under the null hypothesis, s is the sample standard deviation, and n is the
sample size.
• Determine the degrees of freedom with df = sample_size - 1.
• Use stats.t.ppf to obtain the critical value of the two-tailed t-test, considering the
significance level set at 5%.
• Determine if the test is significant by comparing t_stat with critical_value. If the
absolute value of t_stat exceeds critical_value, the null hypothesis is rejected.
• Finally, the code returns t_stat, critical_value, and a boolean test_significant indicating
if the test result is significant.

Alternatively, if we had the entire sample data instead of the mean and standard deviation,
we could use the ttest_1samp function from scipy to calculate the p-value directly and reject
the null hypothesis if it were smaller than the desired significance level, as sketched below.
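
As a sketch of that alternative (the raw search times below are invented for illustration, since the exercise only provides summary statistics), the call would look like this:

import numpy as np
from scipy import stats

# Hypothetical raw search times in seconds (illustrative values only)
search_times = np.array([2.9, 2.1, 3.3, 2.6, 2.4, 3.1, 2.8, 2.2, 3.0, 2.7,
                         2.5, 3.2, 2.9, 2.3, 2.6, 3.4, 2.8, 2.0, 2.7, 3.1,
                         2.5, 2.9, 2.6, 3.0, 2.4, 2.8, 3.2, 2.3, 2.7, 2.9])

# Two-tailed one-sample t-test against the historical mean of 2.5 seconds
t_stat, p_value = stats.ttest_1samp(search_times, popmean=2.5)

# Reject H0 at the 5% level only if the p-value is below 0.05
print(t_stat, p_value, p_value < 0.05)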

Exercise 72. Performance Analysis of Production Machines A manufacturing


company is evaluating the efficiency of its production machines after a recent major
maintenance intervention. Historically, the average time to produce one unit was 1.2
minutes. After the intervention, to determine if there was an improvement in production
performance, an engineer collected a random sample of 25 units produced, recording an
average time of 1.1 minutes and a standard deviation of 0.3 minutes.

Given this scenario and a 5% significance level, the company wants to understand whether
the maintenance intervention has improved the average production time.

Solution

To tackle this exercise, we use a one-sample Student's t-test, specifying that it is a
left-tailed test because we are checking if the average production time has decreased.

Here are the steps:

1. Definition of hypotheses:
o Null hypothesis (H₀): The average production time is greater than or equal to 1.2
minutes (μ ≥ 1.2 minutes).
o Alternative hypothesis (H₁): The average production time is less than 1.2 minutes
(μ < 1.2 minutes).
2. Calculation of the t-test statistic:
t = (x̄ - μ₀) / (s/√n) = (1.1 - 1.2) / (0.3/√25) = -0.1 / 0.06 ≈ -1.67
3. Conclusion:
With a significance level of 5% and 24 degrees of freedom (n - 1), the critical value of t
for a one-tailed test is approximately -1.711. Since the calculated value (-1.67) is not
less than the critical value (-1.711), we cannot reject the null hypothesis.

In conclusion, with the data available to us, there is not enough evidence to claim that the
maintenance intervention has significantly improved the average production time at the
5% level.

Solution with Python

import scipy.stats as stats

# Problem data
historical_mean = 1.2      # minutes
sample_mean = 1.1          # minutes
standard_deviation = 0.3   # minutes
n = 25                     # sample size

# Calculation of the t test statistic
t_stat = (sample_mean - historical_mean) / (standard_deviation / (n ** 0.5))

# Settings for the t-test
alpha = 0.05   # significance level
gdl = n - 1    # degrees of freedom

# Calculation of the critical value (left-tailed test)
t_critical = stats.t.ppf(alpha, gdl)

# Test result
t_stat, t_critical, t_stat < t_critical

In this example, we use the scipy.stats library, which provides the statistical functions
needed for the test.

1. We define the data related to the exercise, including the historical mean (μ₀), the
sample mean (x̄), the standard deviation, and the sample size.
2. We use the formula for the t-test for one sample:

t = (x̄ - μ₀) / (s/√n)

This calculates how much the sample mean differs from the historical mean in units of
standard deviation.
3. We establish the significance level (α) at 5% and calculate the degrees of freedom (n -
1) for the test.
4. Using stats.t.ppf, we obtain the critical value for the one-tailed test. This represents
the value that our t must be lower than to reject the null hypothesis.
5. We compare the test statistic with the critical value to determine if there is a
statistically significant improvement in production time. If the calculated t value is less
than the critical value, we reject the null hypothesis in favor of the alternative
hypothesis.

Also, in this case, if we had the entire sample available, the use of the function ttest_1samp
would give us the p-value of the test directly, as in the sketch below.
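
As an illustration of that approach (the sample values are invented; the alternative='less' option requires a reasonably recent version of scipy):

import numpy as np
from scipy import stats

# Hypothetical production times in minutes for 25 units (illustrative values only)
production_times = np.array([1.0, 1.2, 0.9, 1.3, 1.1, 1.0, 1.4, 1.2, 0.8, 1.1,
                             1.0, 1.3, 1.2, 0.9, 1.1, 1.0, 1.2, 1.4, 1.0, 1.1,
                             0.9, 1.3, 1.1, 1.0, 1.2])

# Left-tailed test: H1 states that the mean is less than the historical 1.2 minutes
t_stat, p_value = stats.ttest_1samp(production_times, popmean=1.2, alternative='less')

# Reject H0 at the 5% level only if the p-value is below 0.05
print(t_stat, p_value, p_value < 0.05)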
6.2 Student's t-test for the Means of Two Samples

The two-sample Student's t-test is a statistical test used to compare the means of two
independent samples and determine if there are any significant differences between them.
This test is particularly useful when you want to verify if two groups have similar or
different means, and it is used when the variances of the two populations are unknown.

The two-sample Student's t-test is often used in various business contexts to compare two
groups. Some examples include the following cases:

• A company might want to compare the average sales performance between two
divisions. The test can determine if the differences in average sales are statistically
significant.
• A company may want to test if two groups of consumers (e.g., consumers from two
different cities) have a similar average product satisfaction.
• A company could want to compare the average production times between two
production plants or two production lines to determine if there are significant
differences in their efficiency.

To correctly apply the two-sample Student's t-test, the following requirements must be
satisfied:

• The two samples must be independent of each other, meaning the observations in one
sample must not influence those in the other sample.
• The data in the samples should be normally distributed. However, with sufficiently
large sample sizes (generally n > 30), this condition can be approximately met even if
the data do not perfectly follow a normal distribution, thanks to the central limit
theorem.
• Although the two-sample Student's t-test does not require equal variances, a more
common version of the test, called the Student's t-test with equal variance assumption,
assumes that the variances of the two samples are equal. If this condition is not met, a
t-test with different variances can be used.

The two-sample Student's t-test is based on the formulation of two hypotheses:

• The null hypothesis H₀ asserts that the means of the two samples are equal, meaning
there is no significant difference between the two populations. Mathematically:
H₀ : μ₁ = μ₂,
where μ₁ and μ₂ are the means of the two populations.
• The alternative hypothesis claims that the means of the two samples are different,
meaning there is a significant difference between the two populations. It can be two-
tailed:
H₁ : μ₁ ≠ μ₂
or one-tailed:
H₁ : μ₁ > μ₂ or H₁ : μ₁ < μ₂

The test statistic for the two-sample Student's t-test is calculated as:

t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where:

• x̄₁ and x̄₂ are the means of the two samples,
• s₁² and s₂² are the variances of the two samples,
• n₁ and n₂ are the sizes of the two samples.

The t statistic follows a Student's t-distribution with degrees of freedom calculated as:

df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ - 1) + (s₂²/n₂)²/(n₂ - 1) ]

Again, the null hypothesis is rejected if the p-value of the test is less than the desired
critical threshold.
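
The following minimal sketch (with two small invented samples) shows how the statistic and the degrees of freedom above can be computed directly and cross-checked against scipy's unequal-variance t-test:

import numpy as np
from scipy import stats

# Two small illustrative samples (invented values)
x1 = np.array([12.1, 11.8, 12.5, 12.0, 11.9, 12.3])
x2 = np.array([11.2, 11.5, 11.0, 11.4, 11.3, 11.6])

m1, m2 = x1.mean(), x2.mean()
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)   # sample variances
n1, n2 = len(x1), len(x2)

# t statistic with unequal variances (Welch's form)
t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# Welch-Satterthwaite degrees of freedom
df = (v1 / n1 + v2 / n2) ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

# Cross-check: scipy computes the same statistic with equal_var=False
t_scipy, p_value = stats.ttest_ind(x1, x2, equal_var=False)
print(t, df, t_scipy, p_value)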

Exercise 73. Productivity Analysis in Two Plants A manufacturing company has two
plants, one located in Milan and the other in Rome. Management wants to compare the
average productivity of the two plants to determine if one is significantly more productive.
For this reason, production hours per shift are collected in both plants. The sample data are
as follows:

. Milan: [8, 7.5, 8, 7, 8.5, 9, 7.8, 8.2, 8, 7.5]


. Rome: [6, 6.5, 7, 6, 6.8, 7.2, 7, 6.5, 6.7, 7.1]

Suppose the variances of the two populations are different. Verify, using a significance level
of 5%, whether there is a significant difference between the two productivity means.

Solution

To analyze the problem, we use the Welch's test, a two-sample Student's t-test used when
the variances of the two populations are assumed to be different.

The null hypothesis (Ho) is that the Milan plant is as productive as the Rome plant. The
alternative hypothesis is that it is not.

Let's go through the steps:

1. Calculation of sample means:
o Milan: x̄₁ = (8 + 7.5 + ... + 7.5)/10 = 7.95
o Rome: x̄₂ = (6 + 6.5 + ... + 7.1)/10 = 6.68
2. Calculation of sample variances:
o Milan: s₁² = Σ(xᵢ - x̄₁)²/(n₁ - 1) = 0.3116
o Rome: s₂² = Σ(xᵢ - x̄₂)²/(n₂ - 1) = 0.1839
3. Calculation of the Welch's test statistic:
t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂) ≈ 5.7
4. Determination of degrees of freedom:
df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ - 1) + (s₂²/n₂)²/(n₂ - 1) ]
The degrees of freedom are rounded to the nearest integer: 17.
5. Determination of the critical value:
o Using a Student's t-distribution with 17 degrees of freedom and a 5% significance
level, the two-tailed critical value is approximately 2.10.
6. Since the calculated test statistic (t = 5.7) is greater than the critical value (2.10), we
reject the null hypothesis H₀. This suggests that there is a significant difference in the
average productivity between the two plants.

Solution with Python

import numpy as np
from scipy import stats

# Data
milano_data = np.array([8, 7.5, 8, 7, 8.5, 9, 7.8, 8.2, 8, 7.5])
roma_data = np.array([6, 6.5, 7, 6, 6.8, 7.2, 7, 6.5, 6.7, 7.1])

# Welch's test
t_stat, p_value = stats.ttest_ind(milano_data, roma_data, equal_var=False)

# Comparison
test_result = "Reject H0" if p_value < 0.05 else "Do not reject H0"

The code implements a Welch's test to verify if there is a significant difference between the
productivity means of two production plants located in Milan and Rome. The main libraries
used are numpy and scipy, which are essential for performing statistical and mathematical
calculations in Python.

The main part of the code that executes the statistical test is done with the function
stats.ttest_ind(), which calculates the t-test for independent samples without assuming
equal variances (option equal_var=False). This returns the test statistic value (t_stat) and the
associated p-value.

Finally, the code compares the p-value with the 5% threshold to decide whether to accept
or reject the null hypothesis: if the p-value is less than 5%, the null hypothesis is rejected,
indicating that the productivity difference between the two plants is statistically significant.

Alternatively, we could have calculated the critical t-value with the expression
np.abs(stats.t.ppf((1 - 0.95)/2, 17)), which would indeed give us 2.10. With this critical value
and a t-value of 5.7, we reject the null hypothesis.

Exercise 74. Comparison between Two New Marketing Campaigns A company


specializing in digital marketing has launched two new online advertising campaigns to
promote a beauty product. The campaigns were managed on two different social media
platforms: the first on Instagram and the second on Facebook. After one month, data was
collected on purchases made by users who interacted with the ads on both platforms:

• Instagram: [85, 90, 88, 92, 89, 91, 87, 86, 95, 94]
• Facebook: [80, 78, 85, 82, 81, 83, 79, 77, 84, 82]

The management wants to determine if the difference in the average number of purchases
prompted by the two platforms is significant. A significance level of 5% is used.
Solution

To address this question, it is necessary to compare the means of the two samples and
determine if there is a statistically significant difference between them. This is a typical
case for applying the Welch's t-test, used when the variances of the two populations are
presumed to be unequal.

Let's go through the steps:

1. Formulation of hypotheses:
o Null hypothesis (H₀): The mean number of purchases on Instagram is equal to the
mean number of purchases on Facebook.
o Alternative hypothesis (H₁): The mean number of purchases on Instagram is
different from the mean number of purchases on Facebook.
2. Calculation of sample statistics:
o Mean for Instagram: x̄₁ = (85 + 90 + 88 + 92 + 89 + 91 + 87 + 86 + 95 + 94)/10 = 89.7
o Mean for Facebook: x̄₂ = (80 + 78 + 85 + 82 + 81 + 83 + 79 + 77 + 84 + 82)/10 = 81.1
o Variance for Instagram: s₁² = Σ(xᵢ - x̄₁)²/(n - 1) = 11.12
o Variance for Facebook: s₂² = Σ(xᵢ - x̄₂)²/(n - 1) = 6.77
3. Calculation of the Welch's statistic:
o The Welch's statistic is defined as t = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂) ≈ 6.43
o The degrees of freedom are calculated by rounding to the nearest whole number
the result of the expression (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ - 1) + (s₂²/n₂)²/(n₂ - 1) ],
which gives approximately 17.
4. Determination of the critical value and comparison:


o Compare the calculated statistic value with the critical value obtained from the t
distribution with a significance level of 5%. Alternatively, calculate the p-value
directly using a Python function (refer to Python solution).
5. Conclusions:
o If the calculated value exceeds the critical value, we reject the null hypothesis,
indicating a significant difference between the average number of purchases on
the two platforms. Alternatively, reject the null hypothesis if the p-value is less than
5%.

In conclusion, by using Welch's t-test to compare the two means, we can determine if the
advertising campaigns on Instagram and Facebook produce significantly different results in
terms of purchases.

Solution with Python

import numpy as np
from scipy import stats

# Collected data
instagram_purchases = np.array([85, 90, 88, 92, 89, 91, 87, 86, 95, 94])
facebook_purchases = np.array([80, 78, 85, 82, 81, 83, 79, 77, 84, 82])

# Mean and variance for each sample
instagram_mean = np.mean(instagram_purchases)
facebook_mean = np.mean(facebook_purchases)
instagram_variance = np.var(instagram_purchases, ddof=1)
facebook_variance = np.var(facebook_purchases, ddof=1)

# Welch's test
t_statistic, p_value = stats.ttest_ind(instagram_purchases, facebook_purchases, equal_var=False)

# Results
print(f"Instagram Mean: {instagram_mean}")
print(f"Facebook Mean: {facebook_mean}")
print(f"Instagram Variance: {instagram_variance}")
print(f"Facebook Variance: {facebook_variance}")

# Conclusion based on the p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: there is a significant difference between the two platforms.")
else:
    print("Do not reject the null hypothesis: there is no significant difference between the two platforms.")

Let's go through the steps:

1. Library Imports
o numpy is used to facilitate the calculation of means and variances.
o scipy.stats provides the ttest_ind method to perform the Welch's t-test, allowing
comparison of the sample means.
2. Calculation of Basic Statistics
o Means for Instagram and Facebook are calculated with np.mean.
o Sample variances are calculated with np.var using the ddof=1 parameter to obtain
the sample (corrected) variance.
3. Execution of Welch's Test
o stats.ttest_ind performs the test, utilizing equal_var=False to specify unequal
variances between samples. It returns the t-statistic and the p-value.
4. Conclusion of the Test
o Determine whether to reject the null hypothesis by comparing the p-value to the
significance level alpha, set at 5%.

The ultimate goal of the code is to determine whether the observed differences in average
purchases prompted by Instagram and Facebook are statistically significant, meaning they
are not due to chance.
6.3 Z-test on Proportions

The z-test for proportions is a statistical test used to compare a sample proportion with a
theoretical proportion or to compare two sample proportions. It is used when you want to
verify if a proportion of a population or two populations is significantly different from a
reference value or another proportion. This test is based on the normal distribution and is
applied when the sample is large enough.

The z-test for proportions is commonly used in various business contexts to make
inferences about proportions or percentages. Some examples include:

• A company might want to test if the percentage of satisfied customers exceeds a
predefined threshold (e.g., 75%).
• A company might want to check whether the proportion of people responding positively
to an advertising campaign is greater than a target proportion, such as 10%.
• If a company has a quality target, such as a defect rate of no more than 5%, the test
can be used to verify whether the observed proportion of defects complies with it.

To correctly apply the z-test for proportions, the following requirements must be met:

• The values in the samples must be independent of each other.


• The z-test for proportions applies when the sample size is sufficiently large, i.e., when
both of the following conditions are met:
n • p0 > 5 and n • (1 - p0) > 5,
where n is the sample size and p0 is the theoretical or reference proportion of success.

The z-test for proportions is based on formulating two hypotheses:

• The null hypothesis H₀ states that the sample proportion is equal to the reference
proportion. In mathematical terms:
H₀ : p = p₀
where p is the sample proportion and p₀ is the theoretical or reference proportion.
• The alternative hypothesis H₁ states that the sample proportion is different from the
reference proportion. It can be two-tailed:
H₁ : p ≠ p₀
or one-tailed
H₁ : p > p₀ or H₁ : p < p₀

The test statistic for the z-test for proportions is calculated as:

z = (p̂ - p₀) / √(p₀(1 - p₀)/n)

where:

• p̂ is the observed proportion in the sample (i.e., the number of successes divided by the
total number of observations),
• p₀ is the reference proportion,
• n is the sample size.
The z statistic follows a standard normal distribution with a mean of 0 and a standard
deviation of 1.

If the p-value is less than the critical threshold (typically 5%), we reject the null hypothesis.
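
The exercises that follow apply this formula; as a compact reference, here is a minimal helper sketch (the function name and its arguments are our own, not a library API) that wraps the calculation for the two-tailed and one-tailed cases:

import math
from scipy.stats import norm

def z_test_proportion(successes, n, p0, alternative='two-sided'):
    """Minimal z-test for a single proportion, following the formula above."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    if alternative == 'two-sided':
        p_value = 2 * (1 - norm.cdf(abs(z)))
    elif alternative == 'greater':
        p_value = 1 - norm.cdf(z)
    else:  # 'less'
        p_value = norm.cdf(z)
    return z, p_value

# Illustrative call: 60 successes out of 1200 trials against a reference proportion of 4%
print(z_test_proportion(60, 1200, 0.04))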

Exercise 75. Exploratory Analysis of a Marketing Campaign A software company


launched a new SaaS product on the market and initiated a marketing campaign to
promote it. Historically, the average conversion rate of potential customers to paying
subscribers for similar products has been 4%. After the first month of the campaign, the
marketing team collected data and found that out of 1200 potential customers, 60 became
paying subscribers. Using a significance level of 5%, the team intends to verify if the
campaign has led to a significant difference in the conversion rate compared to the
historical average. Determine if the campaign had an impact on the conversion rate.

Solution

To solve this problem, we will apply a z-test for a single proportion.

1. Problem data:
o Historical proportion (p₀) = 0.04
o Total number of customers (n) = 1200
o Number of successes (converted customers) = 60
o Proportion of successes in the sample (p̂) = 60/1200 = 0.05
2. Formulation of hypotheses:
o Null hypothesis (H₀): the proportion of success is equal to the historical proportion
(p = p₀).
o Alternative hypothesis (H₁): the proportion of success is different from the historical
proportion (p ≠ p₀).
3. Calculation of the z-test statistic: The formula for the z-test for a single proportion is:

z = (p̂ - p₀) / √(p₀(1 - p₀)/n)

Inserting the values:

z = (0.05 - 0.04) / √(0.04 · (1 - 0.04) / 1200) ≈ 1.77

4. Determination of the p-value: Using the standard normal distribution, we find the
two-tailed p-value associated with the calculated z-value and compare it with the 0.05
significance level.

In a z table, a value of z = 1.77 corresponds to a p-value of approximately 0.0771


(considering a symmetric standard normal distribution).
5. Conclusion:
Since the p-value of 0.0771 is greater than the significance level of 5% (0.05), we do
not reject the null hypothesis.

We do not have sufficient statistical evidence to state that the marketing campaign
induced a significant change in the conversion rate compared to the historical rate of 4%.
Solution with Python

from scipy.stats import norm
import math

# Problem data
p0 = 0.04            # historical proportion
n = 1200             # total number of customers
successes = 60       # number of converted customers
p = successes / n    # observed proportion

# Formulation of hypotheses
# H0: p = p0
# H1: p != p0

# Calculation of the z statistic
z = (p - p0) / math.sqrt(p0 * (1 - p0) / n)

# Determination of the p-value (two tails)
p_value = 2 * (1 - norm.cdf(abs(z)))

# Print results
result = {
    'z_value': z,
    'p_value': p_value,
    'significant': p_value < 0.05
}
print(result)

The code implements a z-test for a single proportion using the data provided in the problem.

First, the problem data is defined with variables such as p0, which represents the historical
proportion, n for the total number of customers, successes for the converted customers, and
p for the observed proportion.

The null and alternative hypotheses are explained in the form of comments in the code,
indicating the statistical context of the test.

The calculation of the z statistic is performed using the formula derived for a single
proportion:

z = (p̂ - p₀) / √(p₀(1 - p₀)/n)

The p-value is calculated using the norm.cdf function, which helps find the cumulative
probability of a standard normal distribution. The use of 2 * (1 - norm.cdf(abs(z))) reflects
the consideration of a two-tailed test.

Finally, the result is printed in a structured form to show the value of z, the p-value, and an
indication of whether the result is statistically significant relative to the 5% significance
level.

Exercise 76. Verification of the Effectiveness of a Quality Improvement Strategy


A manufacturing company wishes to improve the quality of products in one of its
automotive production lines. Historically, the percentage of defective parts has been 8%.
After implementing a new quality control strategy, a sample of 500 parts was inspected,
and it was found that 30 of these were defective. Management wants to know if the new
strategy has led to a significant change in the percentage of defective parts. Using a 5%
significance level, determine if there is a significant difference in the defect percentage.

Solution

In this exercise, a two-tailed z-test for proportions is applied. The aim is to determine if the
new strategy has led to a significant difference compared to the historical defect
percentage.

1. Hypotheses:
o Null hypothesis (H₀): The defect percentage is equal to the historical standard: p =
0.08.
o Alternative hypothesis (H₁): The defect percentage is different from the historical
standard: p ≠ 0.08.
2. Calculation of the sample proportion:
p̂ = 30 / 500 = 0.06
3. Calculation of the z-score:
z = (p̂ - p₀) / √(p₀(1 - p₀)/n) = (0.06 - 0.08) / √(0.08 · 0.92 / 500) ≈ -1.648
4. With a 5% significance level in a two-tailed test, the critical values are approximately
±1.96.
5. With z = -1.648, we do not fall into the critical range of ±1.96. Therefore, there is not
enough evidence to reject the null hypothesis.

There is not sufficient statistical evidence to assert that the new quality control strategy
has led to a significant change in the percentage of defective parts compared to the
historical standard of 8%. The two-tailed z-test for proportions suggests that the observed
difference in the sample could be due to random variability.

Solution with Python

import numpy as np
from scipy.stats import norm

# Problem parameters
p_0 = 0.08   # historical proportion of defective parts
n = 500      # sample size
x = 30       # number of defective parts in the sample

# Calculation of the sample proportion
p_hat = x / n

# Calculation of the z-score
z = (p_hat - p_0) / np.sqrt(p_0 * (1 - p_0) / n)

# Since this is a two-tailed test, find the critical value for 5%
alpha = 0.05
z_critical = norm.ppf(1 - alpha / 2)

# Decision
reject_null = abs(z) > z_critical

# Results
(reject_null, z, z_critical)

In this code, we perform a two-tailed z-test for proportions to determine whether the new
quality control strategy has significantly changed the defect rate.

Let's examine the various steps in the code:

1. Definition of the Parameters:
o p_0 represents the historical proportion of defects (8%).
o n is the sample size (500 parts).
o x is the number of defective parts in the sample (30 parts).
2. Calculation of the Sample Proportion:
o Calculated as p_hat = x / n, resulting in 0.06 or 6%.
3. Calculation of the z-Score:
o Formula used: z = (p_hat - p_0) / sqrt(p_0 * (1 - p_0) / n). This quantifies the
deviation of the sample proportion from the hypothesized value in terms of
standard deviation units.
4. Determination of the critical value:
o For a two-tailed test with a 5% significance level, we use norm.ppf to find the critical
z-score value.
5. Statistical Decision:
o If the absolute value of the calculated z-score is greater than z_critical, we reject
the null hypothesis. Otherwise, there is insufficient evidence to reject the null
hypothesis.
6. Results:
o The variable reject_null is False, indicating that there is not sufficient statistical
evidence to conclude that the new strategy has led to a significant change
compared to the historical standard.
6.4 One-Way ANOVA on the Mean of Multiple Groups

The one-way analysis of variance (ANOVA) is a statistical technique used to compare the
means of three or more independent groups in order to determine if at least one of them
significantly differs from the others. One-way ANOVA extends the concept of the t-test to
compare more than two groups.

One-way ANOVA is commonly used in business to compare various groups on a continuous


variable. Some examples include:

• A company may use ANOVA to compare the average sales performances among
different departments and determine if the observed differences are statistically
significant.
• ANOVA can be used to compare the average productivity of different production lines
and check for significant differences.
• If a company introduces different models of a product, ANOVA can be used to compare
the average quality ratings among the different models.

To correctly apply one-way ANOVA, the following requirements must be satisfied:

• The groups must be independent from each other. Each observation in a group must be
independent of those in other groups.
• The data in each group should follow a normal distribution. If the sample size is large,
the central limit theorem allows us to approximate normality.
• The variances of the groups should be similar, a condition known as homoscedasticity.
This can be verified using Levene's test or other techniques.

One-way ANOVA is based on the formulation of two hypotheses:

• The null hypothesis states that all group means are equal. In other words, there are no
significant differences between the groups:
H₀ : μ₁ = μ₂ = ... = μₖ,
where μ₁, μ₂, ..., μₖ are the group means.
• The alternative hypothesis asserts that at least one group's mean differs from the
others:
H₁ : At least one mean differs from the others.

The test statistic for one-way ANOVA is the F-statistic, which measures the ratio between
the variance between the groups and the variance within the groups:

F = (Variance between the groups) / (Variance within the groups)
where:

• Variance between the groups measures how much the group means differ from the
overall mean,
• Variance within the groups measures the variability of the observations within each
group.

If the variance between the groups is significantly greater than the variance within the
groups, the F value will be high, suggesting that the group means differ significantly.
The p-value indicates the probability of obtaining observed results, or more extreme
results, if the null hypothesis were true. In other words, the p-value measures the evidence
against the null hypothesis.

If the p-value is less than the significance level a (for example, 0.05), we reject the null
hypothesis, suggesting that at least one group's mean is significantly different from the
others. If the p-value is greater than the significance level a, we do not reject the null
hypothesis, suggesting that there is not sufficient evidence to claim that the group means
are different.
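
To make this ratio concrete, here is a minimal sketch (with three small invented groups) that computes the between-group and within-group mean squares by hand and cross-checks the F-value against scipy's f_oneway:

import numpy as np
from scipy.stats import f_oneway, f

# Three small illustrative groups (invented values)
groups = [np.array([5.0, 6.0, 7.0, 5.5, 6.5]),
          np.array([8.0, 9.0, 7.5, 8.5, 9.5]),
          np.array([6.0, 7.0, 6.5, 5.5, 7.5])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)
N = len(all_values)

# Variance between the groups (mean square between)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Variance within the groups (mean square within)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (N - k)

F = ms_between / ms_within
p_value = f.sf(F, k - 1, N - k)

# Cross-check against scipy's one-way ANOVA (should give the same F and p-value)
print(F, p_value, f_oneway(*groups))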

Exercise 77. Analysis of Differences in Customer Service Response Times Across


Different Regions A major e-commerce company wants to evaluate whether there are
significant differences in customer service response times across different regions. The
collected data includes average response times (in minutes) for three different regions:
North, Center, and South. A statistical analysis is requested to determine if the average
response time differences between the regions are significant.

Data:

. North: 10, 12, 15, 14, 13


. Center: 9, 8, 10, 11, 10
. South: 14, 16, 15, 17, 15

Determine if there are significant differences in the average response times among the
three regions. Assume that the distributions are normal with equal variance.

Solution

To solve this problem, we use statistical analysis to compare the means of multiple groups.
The appropriate test for this scenario is a one-way Analysis of Variance (ANOVA), which
determines if there are significant differences between the means of three or more
independent groups.

Here are the steps of the analysis:

1. Formulate the hypotheses:


o Null hypothesis (H₀): There are no significant differences between the group means.
o Alternative hypothesis (H₁): At least one group mean is different from the others.
2. Calculate the F-ratio:
o Determine the ratio between the variability explained by the model and the
residual variability. This calculation requires the use of specific statistical software.
3. Compare the calculated F-value with the critical F-value from the ANOVA tables or use
statistical software for the final test.
4. Test result:
o If the calculated F-value is greater than the critical F-value, we reject the null
hypothesis of no difference.
o In our case, suppose that the calculated F-value is greater than the critical F-value,
so we reject Ho, concluding that there are significant differences in the average
response times of customer service among the North, Center, and South regions.

This analysis suggests that the company should further investigate and manage customer
service resources to reduce differences in response times among different regions. By
identifying specific causes of the differences, such as potential inefficiencies in regional
operations, the company can improve customer service and the overall experience.

Solution with Python

import numpy as np
from scipy.stats import f_oneway

# Response time data for each region
north = [10, 12, 15, 14, 13]
center = [9, 8, 10, 11, 10]
south = [14, 16, 15, 17, 15]

# Calculate one-way ANOVA
stat, p_value = f_oneway(north, center, south)

# Print the results
print(f"F-value: {stat}")
print(f"p-value: {p_value}")

# Determine the result
alpha = 0.05  # significance level
if p_value < alpha:
    print("There are significant differences in average response times among the regions.")
else:
    print("There are no significant differences in average response times among the regions.")

In the provided code, we used the f_oneway function from the scipy.stats module, which is
designed to perform one-way Analysis of Variance (ANOVA). This statistical test allows us to
compare the means of three or more groups and determine if at least one of these means
is significantly different from the others.

Steps of the code:

1. Import numpy for general mathematical operations and f_oneway from scipy.stats to
perform the ANOVA test.
2. The response times for the North, Center, and South regions are provided in Python
lists.
3. Use f_oneway by passing the three data lists to obtain the F-value (stat) and the p-value
(p_value). These values help us determine the significance of the observed differences.
4. Print the F-value and the p-value for an immediate view of the ANOVA results.
5. Compare the p-value with a significance level (alpha), here set at 0.05, to decide
whether to reject the null hypothesis. If the p-value is less than alpha, we conclude that
there are significant differences among the group means.

This approach is efficient for identifying statistically significant differences in the presence
of multiple groups and provides valuable insights to support strategic business decisions.

Exercise 78. Customer Satisfaction Analysis in Different Branches A chain of retail


stores wants to understand if there are significant differences in customer satisfaction
between different branches located in distinct geographical areas. The collected data
includes the average customer satisfaction score (on a scale from 1 to 10) for three
branches located in different areas: Urban, Suburban, and Rural.

Collected data:

. Urban: 7, 8, 7, 6, 9
• Suburban: 5, 6, 5, 7, 6
. Rural: 8, 7, 9, 9, 8

Determine whether the differences in average customer satisfaction scores between


branches are statistically significant. Assume the distributions are normal and have the
same variance.

Solution
To tackle this problem, we use a statistical method to compare means between more than
two independent groups. In this case, a statistical analysis suitable for comparing the
means of more than two groups is adopted, without presuming which one might have a
higher mean.

Apply the following steps:

1. Formulation of Hypotheses:
o Null hypothesis (H₀): There are no significant differences in the average customer
satisfaction scores between branches.
o Alternative hypothesis (H₁): At least one of the branch satisfaction means is
significantly different from the others.
2. Calculate the necessary statistic using dedicated software. At this stage, we calculate
the overall variability between groups and within the groups themselves.
3. Compare the calculated statistic with the critical value referenced from the statistical
software or statistical table for the specific number of groups and samples.
4. If the calculated statistic exceeds the critical value or if the p-value is less than the
level of significance (usually 0.05), reject the null hypothesis, concluding that there are
significant differences in satisfaction scores. Otherwise, do not reject the null
hypothesis.

After applying this method, we find that the statistic allows us to reject the null hypothesis:
there are significant differences between the average satisfaction scores of the various
branches, suggesting that the area may influence the perceived customer satisfaction.

Solution with Python

import scipy.stats as stats

# Customer satisfaction data for urban, suburban, and rural branches
satisfaction_urban = [7, 8, 7, 6, 9]
satisfaction_suburban = [5, 6, 5, 7, 6]
satisfaction_rural = [8, 7, 9, 9, 8]

# Perform a one-way ANOVA test
f_statistic, p_value = stats.f_oneway(satisfaction_urban, satisfaction_suburban, satisfaction_rural)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Conclusion based on the p-value
alpha = 0.05
if p_value < alpha:
    print("There are significant differences between the average satisfaction scores of the branches.")
else:
    print("There are no significant differences between the average satisfaction scores of the branches.")

Let’s see the various steps of the code:

• Importing stats from scipy: We use scipy.stats to access the f_oneway function, which
performs the one-way ANOVA test.
• Customer satisfaction scores for the three branches (urban, suburban, and rural) are
stored in separate lists.
• The f_oneway function takes several groups as input and returns two values: the F-
statistic and the p-value. The F-statistic measures the proportion of variance between
groups relative to the variance within the groups. The p-value allows us to evaluate the
evidence against the null hypothesis.
• We compare the p-value with the level of significance (set here at 0.05). If the p-value
is lower than this value, we reject the null hypothesis, concluding that there are
significant differences between the average satisfaction scores of the branches.
Otherwise, we do not reject the null hypothesis.
The use of ANOVA is appropriate in this context because we are analyzing three
independent groups, and the goal is to understand if at least one group significantly differs
from the others in terms of average scores.
6.5 Kruskal-Wallis Test on the Median of Multiple Groups

The Kruskal-Wallis test is a non-parametric test used to compare the medians of more than
two independent groups. It is a non-parametric version of the one-way ANOVA and is used
when the assumptions of normality or homogeneity of variances required by ANOVA cannot
be met. The Kruskal-Wallis test is useful when the data are ordinal or when quantitative
data do not follow a normal distribution.

The Kruskal-Wallis test is applied in various business contexts to compare data across
multiple groups when reliance on parametric tests is not possible. Some examples include:

• If a company wants to compare the performance of different work teams (for example,
based on scores or ordinal ratings), the Kruskal-Wallis test can be used to verify if there
are significant differences between the teams.
• When customers express preferences for different variants of a product on an ordinal
scale, the Kruskal-Wallis test can be used to compare the medians of the ratings among
the different product variants.
• A company that produces goods in different plants might use the Kruskal-Wallis test to
compare the quality of products across different plants, based on ordinal measures or
scores that are not normally distributed.

To correctly apply the Kruskal-Wallis test, the following requirements must be met:

• The groups must be independent of each other. Each observation must belong to only
one group and must be independent from the others.
• The data must be at least ordinal, meaning they must have a natural order.

The Kruskal-Wallis test is based on the formulation of the following hypotheses:

• The null hypothesis states that all group medians are identical.
• The alternative hypothesis suggests that at least one of the group medians differs from
the others.

The Kruskal-Wallis test statistic is based on the rank of data within each group. The main
steps to calculate the statistic are:

• Order all observations from all groups together and assign them ranks.
• Calculate the sum of ranks for each group.
• Calculate the H statistic as follows:

H = [12 / (N(N + 1))] · Σᵢ₌₁ᵏ (Rᵢ²/nᵢ) - 3(N + 1),

where:
o N is the total number of observations (the sum of the sizes of the groups),
o k is the number of groups,
o Rᵢ is the sum of ranks for group i,
o nᵢ is the size of group i.

The H statistic follows an approximately chi-square distribution with k -1 degrees of


freedom when the null hypothesis is true. This distribution allows for the calculation of the
p-value of the test, which can be interpreted as follows:
• If the p-value is less than the significance level a (e.g., 0.05), we reject the null
hypothesis, suggesting that at least one of the group medians is significantly different
from the others.
• If the p-value is greater than the significance level a, we do not reject the null
hypothesis, suggesting there is not enough evidence to claim that the group medians
differ.
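
As an illustration of the H formula above, here is a minimal sketch (three small invented groups chosen without tied values, since ties require an additional correction) that computes H by hand and compares it with scipy's kruskal:

import numpy as np
from scipy.stats import rankdata, kruskal, chi2

# Three small illustrative groups with no tied values (invented data)
g1 = [12, 15, 19, 22]
g2 = [11, 14, 17, 20]
g3 = [13, 16, 18, 21]

all_values = np.concatenate([g1, g2, g3])
ranks = rankdata(all_values)          # ranks of all observations pooled together
sizes = [len(g1), len(g2), len(g3)]
N = len(all_values)

# Sum of ranks for each group
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])
rank_sums = [r.sum() for r in rank_groups]

# H statistic following the formula above
H = 12 / (N * (N + 1)) * sum(R ** 2 / n for R, n in zip(rank_sums, sizes)) - 3 * (N + 1)
p_value = chi2.sf(H, df=len(sizes) - 1)

# Cross-check against scipy (identical here because there are no ties)
print(H, p_value, kruskal(g1, g2, g3))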

Exercise 79. Performance Analysis of Sales Teams A clothing company wants to


compare the sales performance of three different teams located in Northern, Central, and
Southern Italy. Each team provided weekly revenue for a period of one month. The
collected data is as follows:

. North: [1500, 1600, 1650, 1580]


. Central: [1400, 1490, 1520, 1470]
. South: [1550, 1450, 1500, 1480]

The company wants to determine if there are significant differences in the sales
performance of the three teams. Assuming that the data might not follow a normal
distribution, how should the company proceed?

Solution

To tackle this problem, the company should use the Kruskal-Wallis test, a non-parametric
test used to determine if there are significant differences between the medians of three or
more independent groups, especially when the data do not follow a normal distribution.
The null hypothesis Ho is that the medians are equal.

To perform the Kruskal-Wallis test, it is advisable to use specific software that calculates the
p-value or the test statistic H. If the p-value is less than 5%, the null hypothesis is rejected.
Similarly, if H is more extreme than the critical value for this dataset, the null hypothesis is
rejected.

If the calculations result in a significant statistic, management could further investigate the
differences between the teams and consider regionally tailored improvement strategies.

Solution with Python

import scipy.stats as stats

# Given data
north = [1500, 1600, 1650, 1580]
center = [1400, 1490, 1520, 1470]
south = [1550, 1450, 1500, 1480]

data = [north, center, south]

# Performing the Kruskal-Wallis test
H_statistic, p_value = stats.kruskal(*data)

# Output results
{
    'H_statistic': H_statistic,
    'p_value': p_value
}

The provided code uses the scipy.stats library to perform the Kruskal-Wallis test, a
non-parametric statistical test that does not assume the data follows a normal distribution. This
makes it ideal for comparing the medians of independent groups such as the weekly sales
of the North, Central, and South teams presented in the problem.
Let's look at the various steps in the code:

1. The scipy.stats module, which contains the kruskal method, is imported.


2. The weekly sales data for the three teams, North, Central, and South, are stored in
three lists.
3. All the data is combined into one list of lists to facilitate passing the data to the
statistical test.
4. The function stats.kruskal() is called with the data of the three groups as input. This
computes the H statistic and the corresponding p-value.
5. The code returns a dictionary containing the value of the statistic H and the p-value,
indicating whether the differences between the groups are statistically significant. A p-
value less than 0.05 would suggest significant differences.

Using scipy.stats.kruskal, we can efficiently determine if there is significant variability in


sales performance among the different regional teams of a company.

Exercise 80. Customer Satisfaction Comparison Between Departments A


technology company wants to evaluate the level of satisfaction of the customers managed
by three support departments: Software, Hardware, and Networks. For each department,
feedback scores have been collected from various customers as follows:

• Software: [85, 90, 78, 88, 92]


• Hardware: [80, 82, 78, 76, 85]
. Networks: [88, 85, 84, 87, 90]

The company wants to determine if there are significant differences in the customer
satisfaction levels among the three departments. Considering that the feedback scores
may not follow a normal distribution, how should the company proceed to reach a
significant conclusion?

Solution

To address the company's problem and evaluate the significance of the differences in
satisfaction levels among the three departments, the Kruskal-Wallis test can be referred to.
This non-parametric test is suitable for comparing three or more independent groups,
especially when the normal distribution of the data cannot be assumed.

Let's see the various steps:

1. Initial Hypothesis:
o Null hypothesis (H₀): There are no significant differences in the satisfaction levels
among the Software, Hardware, and Networks departments.
o Alternative hypothesis (H₁): There are significant differences in the satisfaction
levels among the departments.
2. Test Calculation:
o Using specific statistical software, the p-value of the test is calculated.
3. Interpretation:
o If the calculated p-value is less than or equal to 5%, the null hypothesis is rejected.

Using the Kruskal-Wallis test, the company can determine if the observed differences in
rank means are large enough to be considered statistically significant.

Solution with Python


import scipy.stats as stats

# Scores for each department
software_scores = [85, 90, 78, 88, 92]
hardware_scores = [80, 82, 78, 76, 85]
network_scores = [88, 85, 84, 87, 90]

# Kruskal-Wallis test
h_statistic, p_value = stats.kruskal(software_scores, hardware_scores, network_scores)

# Print the H statistic and p-value
h_statistic, p_value

The provided code uses the scipy.stats library to perform the Kruskal-Wallis test.

Let's see the various steps:

1. Three lists of feedback scores corresponding to the three departments: Software,


Hardware, and Networks are defined.
2. Using stats.kruskal(), the code calculates the Kruskal-Wallis test value (H), which is a
measure of the differences between the ranks of the groups. The function returns both
the H value and the associated p-value.
3. The code prints the statistical value and the p-value. These values can be used to
determine if there are significant differences in the satisfaction levels among the
departments. A small p-value (typically less than 0.05) would indicate that there are
statistically significant differences in the feedback scores between at least a pair of
departments.

The choice to use scipy is motivated by its wide set of tools for statistical analysis, making
it particularly useful for performing tests like the Kruskal-Wallis quickly and efficiently.
6.6 Levene’s Test for Equality of Variances Across
Multiple Groups

Levene’s test is a statistical test used to check the homogeneity of variances across two or
more groups. Homogeneity of variances, meaning the condition that variances among
groups are similar, is a fundamental assumption for the application of various parametric
tests, such as ANOVA.

Levene’s test is used in business settings to verify if data groups in an analysis of variance
(ANOVA) or other statistical tests show similar variances. Some examples of its application
are:

• In an analysis of the performance of different work teams, Levene’s test can be used to
check that the variances among team scores are similar before applying a test like
ANOVA.
• In a company that produces products at different facilities, Levene’s test can be used
to verify whether the quality variations among products are homogeneous across
different facilities.
• When a company tests different variants of a product, Levene’s test can be used to
ensure that the variances in customer evaluations are similar among the variants.

Levene’s test does not have the normality restriction that characterizes other tests like
ANOVA. However, for correct application, the independence of observations among groups
must be assured.

• Each observation must be independent of others, implying that an observation in one


group should not influence that of another group.
• Levene’s test is applicable with continuous or ordinal scale data.

The hypotheses tested in Levene’s test are as follows:

• Null hypothesis (H₀): the variances of the groups are equal. In other words, there are no
significant differences in variances among the groups:
H₀: σ₁² = σ₂² = ... = σₖ²
where σ₁², ..., σₖ² are the variances of the groups.
• Alternative hypothesis (H₁): at least one of the variances differs from the others.

Levene’s test statistic is based on the calculation of the difference between each
observation and the median (or alternatively the mean) of the group. The test is calculated
as follows:

• For each group, calculate the absolute deviation between each observed value and the
group's median (or mean).
• Perform an analysis of variance on the absolute values of the deviations obtained.
• Levene’s test statistic W is given by:

W = [(N − k) / (k − 1)] · [Σᵢ nᵢ (Z̄ᵢ − Z̄)²] / [Σᵢ Σⱼ (Zᵢⱼ − Z̄ᵢ)²]

where the sums over i run over the groups i = 1, ..., k, the sum over j runs over the
observations j = 1, ..., nᵢ of each group, and:
o N is the total number of observations,
o k is the number of groups,
o nᵢ is the size of group i,
o Zᵢⱼ is the absolute deviation of the j-th observation of group i from the group median (or mean),
o Z̄ᵢ is the mean of the absolute deviations in group i,
o Z̄ is the overall mean of all the absolute deviations.

The statistic W follows an F-distribution with k − 1 and N − k degrees of freedom if the null
hypothesis is true. This allows us to calculate the p-value of the test. If the p-value is less
than the significance level α (for example, 0.05), we reject the null hypothesis, suggesting
that at least one of the group variances is significantly different from the others.
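
As a rough illustration of how W is assembled, the following sketch (with purely hypothetical example groups) computes the statistic from median-based absolute deviations and compares it with scipy.stats.levene, which by default also centres the deviations on the group medians.

import numpy as np
from scipy.stats import levene, f

# Hypothetical example groups (illustration only)
groups = [
    np.array([12.0, 14.5, 11.8, 13.2, 15.1]),
    np.array([10.2, 16.8, 9.5, 17.3, 11.1]),
    np.array([13.0, 13.4, 12.7, 13.9, 12.5]),
]

k = len(groups)                              # number of groups
N = sum(len(g) for g in groups)              # total number of observations

# Absolute deviations from each group's median
Z = [np.abs(g - np.median(g)) for g in groups]
Z_bar_i = np.array([z.mean() for z in Z])    # mean deviation within each group
Z_bar = np.concatenate(Z).mean()             # overall mean of all deviations

numerator = (N - k) * sum(len(g) * (zi - Z_bar) ** 2 for g, zi in zip(groups, Z_bar_i))
denominator = (k - 1) * sum(((z - zi) ** 2).sum() for z, zi in zip(Z, Z_bar_i))
W = numerator / denominator

# p-value from the F distribution with (k - 1, N - k) degrees of freedom
p_manual = f.sf(W, k - 1, N - k)

# Cross-check with scipy's implementation (default center='median')
W_scipy, p_scipy = levene(*groups)
print(W, p_manual)
print(W_scipy, p_scipy)
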

Exercise 81. Comparison of Quarterly Sales Variances Between Two Regions An


international company produces and distributes industrial machinery worldwide. The
company's marketing division wants to determine if the variability of quarterly sales differs
significantly between two regions: Western Europe and Eastern Asia. Analysts have
collected data on quarterly sales, recording the value for each quarter over the past five
years for both regions. The data are as follows:

• Western Europe (in millions of euros): [20, 22, 23, 25, 21, 24, 26, 23, 24, 22, 25, 23, 24,
22, 25, 27, 20, 22, 23, 24]
• Eastern Asia (in millions of euros): [18, 20, 21, 19, 20, 22, 23, 21, 19, 20, 21, 20, 18,
21, 23, 22, 20, 19, 21, 20]

It is required to determine if there is a significant difference in the variability of quarterly


sales between the two regions. Calculate and interpret the results of the appropriate
statistical test.

Solution

To perform the test, we use Levene’s test, which is designed to check for the equality of
variances between two or more groups. This test is more robust than other variance
homogeneity tests when data are not necessarily normally distributed.

Levene's test requires specific software to be carried out and, like all tests, returns a p-
value. The null hypothesis Ho is that the variances of the groups are equal. If the p-value is
less than the significance level we have set (e.g., 5%), we reject the null hypothesis.

Should this occur, managers might need to consider this different variability in sales for a
more efficient allocation of resources and to obtain a more accurate forecast of future sales
between the different regions.

Solution with Python

import numpy as np
from scipy.stats import levene

# Sales data
sales_western_europe = [20, 22, 23, 25, 21, 24, 26, 23, 24, 22, 25, 23, 24, 22, 25, 27, 20, 22, 23, 24]
sales_eastern_asia = [18, 20, 21, 19, 20, 22, 23, 21, 19, 20, 21, 20, 18, 21, 23, 22, 20, 19, 21, 20]

# Variance calculation
variance_western_europe = np.var(sales_western_europe, ddof=1)
variance_eastern_asia = np.var(sales_eastern_asia, ddof=1)

# Levene's test
stat, p_value = levene(sales_western_europe, sales_eastern_asia)

# Results
print(f"Variance of Western Europe: {variance_western_europe}")
print(f"Variance of Eastern Asia: {variance_eastern_asia}")
print(f"Levene's test statistic: {stat}")
print(f"p-value: {p_value}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in the variability of quarterly sales between the two regions.")
else:
    print("There is no significant difference in the variability of quarterly sales between the two regions.")
The np.var() function from NumPy is used to calculate the variance of a data series, and
the parameter ddof=1 specifies that we are calculating the sample variance.

SciPy provides functions for advanced computation and statistical testing. In this context,
we use the levene function available in the scipy.stats module to perform Levene's test,
which is ideal for checking the equality of variances between two groups.

In the code, we initially load the quarterly sales data for the two regions. Then we calculate
the variance for each group. We then use Levene's test to determine if the variances
between the two regions differ significantly. The resulting pvalue variable allows us to make
a decision: if it is lower than the significance level (alpha) of 0.05, we reject the null
hypothesis, indicating that the variances are significantly different.

The code output includes the calculated variances, the results of Levene’s test (statistics
and p-value), and a conclusion about the hypothesis tested.

Exercise 82. Analysis of Turnover Variability A major consulting firm has divided its
workforce into three regional teams: North America, Europe, and Asia-Pacific. Each team
has worked on similar projects and now the human resources department wants to check if
there is a significant difference in the quarterly staff turnover variability among these three
regions. The recent data related to turnover rates for the last six quarters are:

• North America: [5.2, 4.9, 5.5, 5.0, 5.3, 5.1]


• Europe: [5.7, 5.5, 5.8, 5.6, 5.9, 5.7]
• Asia-Pacific: [4.8, 4.7, 4.9, 4.8, 4.6, 4.7]

Determine if there is a significant difference in turnover variability among the three groups
using an appropriate statistical test.

Solution

To solve this problem, we use Levene's test, which is designed to assess the difference in
variability across multiple groups without assumptions about the probability distributions
involved.

Using specific software, it’s possible to calculate the p-value from the test. If the p-value is
less than a chosen significance level (typically 0.05), we can conclude that there is a
significant difference in turnover variability between at least two of the groups. If the p-
value is higher, we cannot reject the null hypothesis of homogeneity of variances.

Assuming we obtained a p-value of 0.26, we cannot assert that there is a significant


difference in turnover variability among the regions of North America, Europe, and Asia-
Pacific.

Solution with Python

from scipy.stats import levene

# Turnover rate data for the three regions
north_america = [5.2, 4.9, 5.5, 5.0, 5.3, 5.1]
europe = [5.7, 5.5, 5.8, 5.6, 5.9, 5.7]
asia_pacific = [4.8, 4.7, 4.9, 4.8, 4.6, 4.7]

# Use Levene's test to check for variance homogeneity
stat, p_value = levene(north_america, europe, asia_pacific)

# Output the results
print(f"Levene's Test Statistic: {stat}")
print(f"p-value: {p_value}")

# Interpret the result
significance = 0.05
if p_value < significance:
    print("There is a significant difference in turnover variability among the groups.")
else:
    print("There is no significant difference in turnover variability among the groups.")

We clearly identified the turnover rates for the regions of North America, Europe, and Asia-
Pacific as separate Python lists. These lists contain the raw data to be examined.

The command levene(north_america, europe, asia_pacific) calculates Levene's test statistic
and returns a statistical value and a p-value. The test statistic measures evidence against
the null hypothesis of equal variances.

We compare the obtained p-value with a predetermined significance level, in this case,
0.05. If the p-value is below the significance level, we can reject the null hypothesis of
equal variances, indicating that there are significant differences in turnover variability
among the regions. If it is higher, we cannot reject the null hypothesis.

Finally, the code provides an interpretation of the result, indicating whether or not there is
evidence of a significant difference in turnover variability among the three regions.
6.7 One-Sample Kolmogorov-Smirnov Test

The one-sample Kolmogorov-Smirnov (KS) test is a non-parametric statistical test used to


compare the empirical distribution of a sample with a predefined theoretical distribution.
This test is particularly useful when you want to check if a sample comes from a specific
distribution, such as the normal, uniform, or other known distributions. It is widely used in
inferential statistics to test the fit of data to a chosen distribution.

The one-sample Kolmogorov-Smirnov test can be used in various business contexts to


verify if observed data follow a theoretical distribution. Some examples include:

• A company might use this test to check if monthly sales data follow a normal
distribution or another theoretical distribution, before applying statistical analysis
techniques that assume a specific distribution.
• If a company has a quality specification for a product, the test can be used to verify if
the variability of quality data follows a normal distribution or another expected
distribution.
• A company managing a queue of customers may use the Kolmogorov-Smirnov test to
verify if the waiting times follow an exponential distribution, which is a common
distribution in waiting time problems.

The one-sample Kolmogorov-Smirnov test does not need to assume a specific distribution
for the data, but it requires that the observations are independent. Additionally, it is crucial
that the theoretical distribution against which the data are compared is clearly defined.

The one-sample Kolmogorov-Smirnov test is based on the following hypotheses:

• Null hypothesis (Ho): the observed data follow the theoretical distribution. In other
words, there are no significant differences between the empirical distribution of the
data and the theoretical distribution.
• Alternative hypothesis (H₁): the observed data do not follow the theoretical distribution,
meaning there is a significant difference between the empirical distribution of the data
and the theoretical one.

The Kolmogorov-Smirnov test statistic measures the maximum absolute deviation between
the empirical cumulative distribution of the data (Fₙ(x)) and the theoretical cumulative
distribution (F(x)):
D = supₓ |Fₙ(x) − F(x)|,

where:

• Fₙ(x) is the empirical cumulative distribution function, calculated as the proportion of


observations less than or equal to x,
• F(x) is the theoretical cumulative distribution function of the reference distribution.

The statistic D measures the maximum vertical distance between the two curves (empirical
and theoretical). It is a random variable that follows a specific probability distribution
characteristic of this test, allowing us to calculate the p-value. If the p-value is lower than
the significance level a (e.g., 0.05), we reject the null hypothesis, suggesting that the data
do not follow the theoretical distribution. If the p-value is greater than the significance level
a, we do not reject the null hypothesis, suggesting that there is not enough evidence to
claim that the data do not follow the theoretical distribution.
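
As a minimal sketch of the statistic itself (with a small, purely hypothetical sample), the following code evaluates the empirical CDF just before and just after each observation, takes the largest gap from the theoretical normal CDF, and cross-checks the result against scipy.stats.kstest.

import numpy as np
from scipy.stats import norm, kstest

# Hypothetical sample and reference normal distribution (illustration only)
sample = np.array([4.8, 5.3, 5.1, 4.6, 5.0, 5.4, 4.9, 5.2])
mu, sigma = 5.0, 0.3

x = np.sort(sample)
n = len(x)
cdf = norm.cdf(x, loc=mu, scale=sigma)          # theoretical CDF at each data point

# The ECDF is a step function: check the gap just after and just before each step
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
D = max(d_plus, d_minus)

# scipy computes the same statistic, together with the p-value
D_scipy, p_value = kstest(sample, 'norm', args=(mu, sigma))
print(D, D_scipy, p_value)
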
Exercise 83. Analysis of Product Demand An electronics company wants to understand
if the weekly distribution of sales for a new smartphone model follows a certain predictive
normal distribution. This predictive distribution has been built based on weekly sales of
similar models over the past five years.

The company collected a 10-week sales sample for the new smartphone model, with the
following sales data (in units): 110, 112, 107, 115, 108, 111, 116, 109, 113, 114.

The predictive distribution can be described as a normal with mean μ = 112 and standard
deviation σ = 3.

It is required to verify if the sales sample belongs to the predictive distribution. Use a
significance level of 5%.

Solution

To solve this problem, we use the Kolmogorov-Smirnov (K-S) test, which allows us to
compare an empirical sample with a theoretical distribution.

First of all, we calculate the empirical cumulative distribution function (ECDF) for our
sample. Ordering the data: 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, the ECDF
assumes the following values:

• F(107) = 0.1
• F(108) = 0.2
• F(109) = 0.3
• ...
• F(116) = 1.0

The cumulative distribution function of the predictive normal distribution (CDF) must be
calculated for each value of the sample. For example:

• CDF(107; μ = 112, σ = 3) = P(Z ≤ (107 − 112)/3) ≈ 0.048


• CDF(108; μ = 112, σ = 3) = P(Z ≤ (108 − 112)/3) ≈ 0.091
• And so on (the snippet below reproduces these values)...
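
As a quick check, the two theoretical CDF values above can be obtained directly with scipy.stats.norm:

from scipy.stats import norm

# Theoretical CDF of the predictive normal (mu = 112, sigma = 3) at the first sample values
print(round(norm.cdf(107, loc=112, scale=3), 3))   # about 0.048
print(round(norm.cdf(108, loc=112, scale=3), 3))   # about 0.091
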

Now we calculate the maximum absolute deviation between the ECDF and the theoretical CDF,
denoted as D = max |ECDF(x) − CDF(x)|.

We compare the value of D with the Kolmogorov-Smirnov critical value for the 5%
significance level and a sample size of n = 10.

If D is greater than the critical value, we reject the null hypothesis that the sample belongs
to the predictive distribution. Otherwise, we do not have sufficient evidence to reject the
null hypothesis. Alternatively, we calculate the p-value of the test and reject the null
hypothesis if it is less than 5%.

Solution with Python

import numpy as np
from scipy.stats import kstest, norm

# Empirical distribution data
weekly_sales = [110, 112, 107, 115, 108, 111, 116, 109, 113, 114]

# Parameters of the predictive normal distribution
mu = 112
sigma = 3

# Calculate the D-value and p-value using kstest
d, p_value = kstest(weekly_sales, 'norm', args=(mu, sigma))

# Interpretation of the result
if p_value < 0.05:
    print("We reject the null hypothesis: the distribution of sales does not follow the predictive distribution.")
else:
    print("We do not reject the null hypothesis: the sample is compatible with the predictive distribution.")

Below are the various details:

1. We have a list weekly_sales that represents smartphone weekly sales. A predictive
normal distribution is defined with mean mu=112 and standard deviation sigma=3.
2. We use the kstest function to verify the compatibility of the sample with a normal
distribution. The 'norm' parameter specifies the theoretical distribution to be used and
args=(mu, sigma) passes the distribution parameters.
3. We compare the p-value with the 5% significance level.
4. If it is lower, we reject the null hypothesis. If it is higher, there is no sufficient evidence
to reject the hypothesis.

Exercise 84. Analysis of Production Quality at a Bottling Plant A bottled beverage


company has recently upgraded one of its bottling plants and wants to evaluate if the
production quality in terms of bottle fill volume follows the expected distribution based on
historical performance. A previous study established that the liquid volume in bottles
should follow a normal distribution with a mean of μ = 500 ml and a standard deviation of
σ = 5 ml.

The company decides to test the quality by taking a sample of 15 bottles filled at the new
plant, with the following measured volumes (in ml): 498, 502, 501, 499, 500, 503, 497,
499, 501, 500, 500, 498, 504, 505, 496.

The company wants to statistically determine whether the volume filled by the new
machines can be considered consistent with their expected normal distribution, using a
significance level of 5%.

Solution

We use the Kolmogorov-Smirnov test to compare the sample with the expected normal
distribution.

Here are the steps:

1. Input Data:
o Sample volumes: 498, 502, 501, 499, 500, 503, 497, 499, 501, 500, 500, 498, 504,
505, 496.
o Expected distribution: Normal with μ = 500 ml, σ = 5 ml.
o Significance level: 0.05.
2. Calculation of the Test Statistic:
o Sort the sample data in ascending order.
o Calculate the empirical cumulative distribution function (ECDF) for the sample
data.
o Calculate the theoretical cumulative distribution function (CDF) using the expected
normal distribution.
o Determine the maximum difference between the ECDF and the CDF.
3. Calculation of Dmax:
o Dmax = max |ECDF(x) - CDF(x)|.
4. Determination of the Critical Value:
o For n = 15 samples and a significance level of 5%, the critical value of D is
approximately 0.351.
5. Comparison of Dmax with the Critical Value:
o If Dmax > 0.351, we reject the null hypothesis that the sample follows the expected
normal distribution.
o If Dmax < 0.351, we do not reject the null hypothesis.
6. Alternatively, we can calculate the p-value of the test and reject the null hypothesis if
the p-value is less than 5%.

Assuming a calculation that leads to Dmax < 0.351, we do not have sufficient evidence to
reject the null hypothesis. Therefore, the sample of filling volumes of the new bottles from
our plant can be considered consistent with the expected normal distribution. Otherwise, if
Dmax were greater, there would be reason to review the bottling process to improve filling
accuracy according to the expected standards.
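
As a quick check before the full Python solution, the critical value of about 0.351 quoted above can be reproduced with the common large-sample approximation D_crit ≈ 1.36/√n for a 5% significance level (exact small-sample tables give a slightly smaller value); this is a sketch under that assumption:

import math

# Large-sample approximation of the two-sided KS critical value at alpha = 0.05
n = 15
d_critical = 1.36 / math.sqrt(n)
print(round(d_critical, 3))   # about 0.351
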

Solution with Python

from scipy.stats import kstest, norm

# Input data
volumes = [498, 502, 501, 499, 500, 503, 497, 499, 501, 500, 500, 498, 504, 505, 496]
mu = 500     # Expected mean
sigma = 5    # Expected standard deviation

# Using the Kolmogorov-Smirnov test
statistic, p_value = kstest(volumes, 'norm', args=(mu, sigma))

# Determination of the result
result = "We reject the null hypothesis" if p_value < 0.05 else "We do not reject the null hypothesis"

{'Dmax': statistic, 'p_value': p_value, 'Conclusion': result}

In the Python code above, we use the scipy.stats library to perform a Kolmogorov-Smirnov
test, which allows us to compare the empirical distribution of the sample data with an
expected theoretical distribution, in this case, a normal distribution.

Here are the steps of the code:

1. Import Libraries:
o scipy.stats offers the function kstest, useful for performing the Kolmogorov-Smirnov
test.
2. Input Data:
o Sample volumes are defined in a list.
o The expected mean (mu) and standard deviation (sigma) are set as variables.
3. Execution of the Kolmogorov-Smirnov test:
o kstest is used to calculate the test statistic and the p-value. Here, we pass the
sample data, the name of the tested distribution ('norm' for normal), and the
distribution parameters (expected mean and standard deviation).
4. Comparison and Conclusion:
o The variable result is used to determine whether to reject the null hypothesis by
comparing the p-value with the required significance.
o In the final result, the test indicates whether the sample data can be considered
consistent with the expected normal distribution.
6.8 Two-Sample Kolmogorov-Smirnov Test

The two-sample Kolmogorov-Smirnov (KS) test is a non-parametric statistical test used to


compare the distributions of two independent samples. Unlike the one-sample test, which
compares a sample with a predefined theoretical distribution, the two-sample test
evaluates whether two samples come from the same continuous distribution without
making assumptions about the shape of that distribution.

The two-sample Kolmogorov-Smirnov test can be used in various business contexts to


compare the distributions of two groups. Some examples include:

• A company may use the Kolmogorov-Smirnov test to compare whether the


performance distributions of two work groups are similar or different, without making
assumptions about the shape of the distributions.
• If two plants use different production methods, the test can be employed to verify if
production times follow the same distribution, without assuming a specific form for the
distribution.
• A company can use this test to compare the response times of two sales channels,
verifying whether the distributions of times are similar or different.

The two-sample Kolmogorov-Smirnov test does not require making assumptions about the
form of the data distribution, but it requires that the samples are independent of each
other. It is a non-parametric test, so it can be used with data that do not follow a normal
distribution.

The two-sample Kolmogorov-Smirnov test evaluates the following hypotheses:

• Null hypothesis (Ho): the distributions of the two samples are identical.
• Alternative hypothesis (H₁): the distributions of the two samples are different.

The test statistic of the two-sample Kolmogorov-Smirnov test measures the maximum
absolute deviation between the empirical distributions of the two samples. The statistic D
is defined as:
D = supₓ |F₁(x) − F₂(x)|,

where:

• F₁(x) is the empirical cumulative distribution function of the first sample,


• F₂(x) is the empirical cumulative distribution function of the second sample.

The statistic D measures the maximum distance between the two empirical distribution
curves. The p-value obtained from calculating D with respect to the specific distribution of
this test allows us to reject the null hypothesis if it is low (for example, less than 5%) or not
reject it if it is high.
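
To make the definition of D concrete, the following sketch (with two small hypothetical samples) evaluates both empirical CDFs on the pooled observations, takes the largest vertical gap, and cross-checks the value against scipy.stats.ks_2samp.

import numpy as np
from scipy.stats import ks_2samp

# Two hypothetical samples (illustration only)
sample1 = np.array([3.1, 2.8, 3.5, 3.0, 2.9, 3.3])
sample2 = np.array([3.6, 3.9, 3.2, 4.0, 3.7, 3.4])

def ecdf(data, points):
    # Empirical CDF of `data` evaluated at `points`
    data = np.sort(data)
    return np.searchsorted(data, points, side='right') / len(data)

# Evaluate both ECDFs on the pooled values; D is the largest vertical gap
pooled = np.sort(np.concatenate([sample1, sample2]))
D_manual = np.max(np.abs(ecdf(sample1, pooled) - ecdf(sample2, pooled)))

# Cross-check with scipy (which also returns the p-value)
D_scipy, p_value = ks_2samp(sample1, sample2)
print(D_manual, D_scipy, p_value)
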

Exercise 85. Analysis of Sales Distribution An e-commerce company wants to


understand if purchasing habits between two user groups, originating from two different
advertising campaigns, are statistically different. Data on transaction values made by each
group during an entire month have been collected.

Group A made the following purchases (in euros): 89, 95, 102, 110, 74, 93, 87, 105, 99, 91.
Group B, however, made the following purchases (in euros): 77, 85, 112, 120, 108, 92, 95,
85, 100, 94.

The company wants to know if there is a significant difference in purchasing behavior


between the two groups.

Assume the standard significance level a is 0.05.

Solution

To solve this problem, we can use a non-parametric test such as the Kolmogorov-Smirnov
test, which allows us to compare two samples without assumptions about the distribution.

Let us go through the steps:

1. Calculate the empirical cumulative distribution functions for both groups. This involves
ordering the data and calculating the proportion of observations below each value in
the ordered list.
2. Identify the maximum difference between the two calculated CDFs:
D = max |F_A(x) − F_B(x)|
This measures where the cumulative distribution functions of the two groups differ the
most.
3. Determine if this maximum difference is statistically significant by comparing it to the
critical value derived from the Kolmogorov-Smirnov table. For a significance of 0.05
with n = m = 10 (the samples), consult the corresponding critical value.
4. If D is greater than the critical value, we can reject the null hypothesis that the two
distributions are the same, concluding there is a significant difference in purchasing
behavior between the two groups.
5. Alternatively, compare the test p-value with the significance level of 0.05. If it is less
than 0.05, we reject the null hypothesis.

Using an appropriate calculator or statistical software, it can be determined whether the


difference D between the two samples is significant, indicating whether the advertising
campaigns led to distinct purchasing behaviors among users.

Solution with Python

from scipy.stats import ks_2samp

# Purchase data for the two groups
group_A = [89, 95, 102, 110, 74, 93, 87, 105, 99, 91]
group_B = [77, 85, 112, 120, 108, 92, 95, 85, 100, 94]

# Kolmogorov-Smirnov test
ks_statistic, p_value = ks_2samp(group_A, group_B)

# Significance level
alpha = 0.05

# Determine the significance of the result
if p_value < alpha:
    result = "There is a significant difference in purchasing behavior."
else:
    result = "There is no significant difference in purchasing behavior."

{'ks_statistic': ks_statistic, 'p_value': p_value, 'result': result}

The code primarily uses the scipy library, specifically the stats module, which provides
useful tools for statistical analysis. In this case, we use the ks_2samp function to perform the
two-sample Kolmogorov-Smirnov test, which allows us to determine the difference between
the cumulative distributions of two independent samples.

Let's look at the details:


1. Groups A and B are represented as Python lists containing the transaction values.
2. The ks_2samp function calculates the test statistic and the p-value, which are used to
determine statistical significance.
3. A significance level of 0.05 is set, so if the p-value obtained from the test is lower than
this level, the null hypothesis (that the two groups have similar distributions) is
rejected, indicating a significant difference.
4. The results include the test statistic, the p-value, and a textual conclusion about the
presence of a significant difference in purchasing behavior. This is presented in a
Python dictionary for ease of access and interpretation.

Exercise 86. Differences in Purchasing Behavior Between Advertising Campaigns


A fashion company launched two distinct advertising campaigns to promote their new
autumn collection. After a month, the company's marketing team wants to analyze
whether there are significant differences in sales generated by each campaign. Sales data
(in number of items sold) were collected for ten consecutive days for each target group.

Campaign 1 recorded the following daily sales: [12, 15, 14, 10, 13, 17, 20, 15, 16, 14].

Campaign 2 recorded the following daily sales instead: [18, 14, 17, 15, 20, 22, 19, 17, 16,
21].

The management requires a statistical analysis to determine if the sales distributions differ
significantly. A standard significance level of a = 0.05 is assumed.

Accompany your response with the statistical reasoning of the case.

Solution

To analyze whether there is a significant difference between the distributions of sales


generated by the two campaigns, we use the Kolmogorov-Smirnov (K-S) statistical test,
which compares two samples to determine if they come from the same distribution.

Here are the steps:

1. Definition of hypotheses:
o Null hypothesis (H₀): The two distributions (sales from Campaign 1 and Campaign
2) do not differ significantly.
o Alternative hypothesis (H₁): The two distributions differ significantly.
2. Calculate the empirical cumulative distributions for both samples.
3. Compute the maximum distance between the two empirical cumulative distributions.
4. Compare the calculated test value with the critical value for a = 0.05.
5. If the calculated value exceeds the critical value, we reject the null hypothesis.

If the null hypothesis is rejected, this suggests that the two advertising campaigns have
had different effectiveness in terms of daily sales; managers should then delve into which
aspects of the two campaigns might have caused this difference. Otherwise, there is
insufficient evidence to claim a significant difference in sales distribution behavior.

The analysis thus allows for mapping differences in marketing strategies and optimizing
future campaigns.

Solution with Python

import numpy as np
from scipy.stats import kstest

# Daily sales data for each campaign
sales_campaign1 = [12, 15, 14, 10, 13, 17, 20, 15, 16, 14]
sales_campaign2 = [18, 14, 17, 15, 20, 22, 19, 17, 16, 21]

# Kolmogorov-Smirnov test calculation
statistic, p_value = kstest(sales_campaign1, sales_campaign2)

# Significance level
alpha = 0.05

# Decision
if p_value < alpha:
    print('Reject H0: there is a significant difference between the distributions of sales from the two campaigns.')
else:
    print('Do not reject H0: there is no significant difference between the distributions of sales from the two campaigns.')

numpy is imported as np to handle numerical data efficiently. However, in this specific case,
we don't need to use numpy, but it is commonly seen when working with numerical arrays.

The Kolmogorov-Smirnov test is computed with the kstest function, which compares two
independent samples. We pass the data for both campaigns to the function, which returns
two values: statistic and p value.

• statistic is the K-S test statistic.


• p value indicates the significance level of the result: if it is less than 0.05 (alpha), we
reject the null hypothesis.

The significance level alpha is set at 0.05, which is the standard value for many statistical
analyses.

Finally, we compare p value with alpha to determine if there is a significant difference


between the sales distributions from the two campaigns, printing the appropriate result.
6.9 Shapiro-Wilk Normality Test

The Shapiro-Wilk test is a statistical test used to determine whether a sample comes from a
normal distribution. This test is widely used in inferential statistics to verify the assumption
of normality in data, which is a fundamental premise for many statistical techniques, such
as analysis of variance (ANOVA) and the Student's t-test.

The main requirements for the test are:

• Each observation in the sample must be independent of the others.


• The Shapiro-Wilk test is used with continuous data and is not suitable for discrete or
categorical data.
• The Shapiro-Wilk test is particularly effective for samples ranging in size from about 3
to 5000 observations. For larger samples, the test may become too sensitive to small
deviations from normality.

The Shapiro-Wilk test is based on the following hypotheses:

• Null hypothesis (Ho): the data come from a normal distribution.


• Alternative hypothesis (H₁): the data do not come from a normal distribution.

The Shapiro-Wilk test statistic is based on the comparison between an estimate of the
sample variance and a theoretical variance expected under the normality assumption. The
test calculates a statistic W, which is defined as:

W = (Σᵢ aᵢ x(i))² / Σᵢ (xᵢ − x̄)²

where:

• x(i) is the i-th smallest observation in the sample (the i-th order statistic), while xᵢ is the i-th observation,


• x̄ is the sample mean,
• aᵢ are coefficients calculated from the ordered observations x(1) ≤ x(2) ≤ ... ≤ x(n).

The statistic W measures how closely the observed distribution approaches the normal
distribution. If the value of W is close to 1, the data are consistent with a normal
distribution. As always, the statistic can be used to derive the p-value through the specific
probability distribution of the test. If the p-value is less than the significance level α (e.g.,
0.05), we reject the null hypothesis, suggesting that the data do not follow a normal
distribution. If the p-value is greater than the significance level α, we do not reject the null
hypothesis, suggesting that there is not enough evidence to claim that the data do not
follow a normal distribution.
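
As a small illustration of how W and the p-value behave, the sketch below (with simulated, purely hypothetical data) applies scipy.stats.shapiro to a roughly normal sample and to a strongly skewed one; the first typically yields W close to 1 and a large p-value, the second a much smaller p-value.

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)

# Simulated data (illustration only): one roughly normal sample, one strongly skewed sample
normal_sample = rng.normal(loc=50, scale=5, size=100)
skewed_sample = rng.exponential(scale=5, size=100)

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w_stat, p_value = shapiro(sample)
    print(f"{name}: W = {w_stat:.4f}, p-value = {p_value:.4f}")
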

Exercise 87. Verification of Normality in Delivery Times In a logistics company, the


quality manager wants to analyze delivery times to evaluate if the data follows a normal
distribution. The times were collected for 30 recent deliveries (expressed in days): [2.3,
2.5, 3.1, 2.8, 3.0, 2.7, 2.9, 2.6, 3.4, 3.2, 3.0, 2.5, 3.5, 3.0, 2.8, 2.6, 2.7, 3.3, 3.1, 2.9, 2.4,
3.5, 2.8, 2.6, 2.9, 3.0, 2.7, 3.1, 3.0, 2.8].

The manager suspects that variability in delivery times might affect the logistics process.
Analyze the collected data to confirm the hypothesis that they follow a normal distribution
and provide statistically founded conclusions.

Solution

To verify if the delivery times follow a normal distribution, we can use the Shapiro-Wilk Test.

Let's see the steps:

1. Formulate the hypotheses:


o Ho (null hypothesis): Delivery times follow a normal distribution.
o H1 (alternative hypothesis): Delivery times do not follow a normal distribution.
2. Using a statistical tool or software (such as R, Python with scipy, etc.), calculate the
test value and the p-value.
3. If the p-value is greater than a significance level (e.g., 0.05), do not reject the null
hypothesis and conclude that the data follows a normal distribution. If the p-value is
less than or equal to the significance level, reject the null hypothesis, indicating that
the data does not follow a normal distribution.

Suppose, through calculation, we obtained a p-value of 0.12. Since 0.12 > 0.05, we do not
have sufficient evidence to reject the null hypothesis.

Solution with Python

import numpy as np
from scipy import stats

# Delivery time data
delivery_times = np.array([2.3, 2.5, 3.1, 2.8, 3.0, 2.7, 2.9, 2.6, 3.4, 3.2, 3.0, 2.5, 3.5, 3.0, 2.8,
                           2.6, 2.7, 3.3, 3.1, 2.9, 2.4, 3.5, 2.8, 2.6, 2.9, 3.0, 2.7, 3.1, 3.0, 2.8])

# Perform Shapiro-Wilk test
stat, p_value = stats.shapiro(delivery_times)

# Set significance level
alpha = 0.05

# Results
print(f"W-statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value > alpha:
    print("Do not reject the null hypothesis. Delivery times follow a normal distribution.")
else:
    print("Reject the null hypothesis. Delivery times do not follow a normal distribution.")

1. We use numpy to handle the data as an array and scipy.stats to access the shapiro
function, which performs the Shapiro-Wilk test.
2. The provided data, which are delivery times expressed in days, are organized into a
NumPy array.
3. We use stats.shapiro(), passing the data array. This returns two values: the W-statistic
and the p-value.
4. We set the significance level, commonly chosen as 0.05, which helps us decide on the
acceptance or rejection of the null hypothesis.
5. If the p-value is greater than alpha, do not reject the null hypothesis. This means that
the data can be considered normally distributed. If the p-value is less than or equal to
alpha, reject the null hypothesis, indicating that the data does not follow a normal
distribution.

Exercise 88. Analysis of Order Preparation Time in a Restaurant In a chain


restaurant, the operations manager wants to evaluate whether the preparation times for
main courses follow a statistical model intended to optimize kitchen resources. The
preparation times for 30 main dishes (in minutes) were collected during a typical lunch
shift: [12.2, 10.8, 11.5, 13.6, 12.0, 11.7, 12.4, 13.1, 10.9, 11.3, 12.6, 13.0, 12.8, 11.4, 12.1,
13.3, 12.9, 11.6, 13.5, 12.3, 11.0, 10.8, 13.7, 12.5, 11.9, 13.4, 12.7, 11.2, 11.8, 13.2].

The manager intends to determine if the preparation times can be considered normal to
more accurately predict staff needs. Evaluate the data to confirm whether they follow a
suitable distribution to make reliable predictions and provide a conclusion based on the
analysis results.

Solution

To determine if the preparation times follow a normal distribution, we apply the Shapiro-
Wilk test. This test provides a methodology for verifying the normality of the data available.
A significant result (low p-value) indicates a deviation from normality.

After performing the test with the provided data, let's assume we obtain a p-value of 0.25.

Since the obtained p-value is higher than the common significance level of 0.05, we do not
have sufficient evidence to reject the null hypothesis that the data follow a normal
distribution. Therefore, the operations manager can proceed with the assumption that the
preparation times are normally distributed, thus facilitating better kitchen resource
planning using appropriate predictive models.

Solution with Python

import scipy.stats as stats

# Preparation time data (in minutes)
preparation_times = [12.2, 10.8, 11.5, 13.6, 12.0, 11.7, 12.4, 13.1, 10.9, 11.3, 12.6, 13.0, 12.8, 11.4, 12.1,
                     13.3, 12.9, 11.6, 13.5, 12.3, 11.0, 10.8, 13.7, 12.5, 11.9, 13.4, 12.7, 11.2, 11.8, 13.2]

# Perform the Shapiro-Wilk test
shapiro_test = stats.shapiro(preparation_times)

# Extract the p-value
p_value = shapiro_test.pvalue

print(f"Shapiro-Wilk test result: statistic={shapiro_test.statistic}, p-value={p_value}")

# Interpret the result
if p_value > 0.05:
    print("There is not enough evidence to reject the hypothesis of normality.")
else:
    print("The data probably does not follow a normal distribution.")

First, the preparation time data is stored in a list called preparation_times. This list represents
the observations collected from the restaurant.

Next, the Shapiro-Wilk test is performed using the stats.shapiro() function, which accepts
the data list as its argument. This function returns an object containing both the test
statistic and the associated p-value. In our case, the p-value is stored in shapiro_test.pvalue.

The p-value is then interpreted to determine the normality of the data set. If the p-value is
greater than the commonly used significance level (0.05), there is not enough evidence to
reject the null hypothesis that the data follows a normal distribution. This supports the idea
that the data is normally distributed, which can facilitate the use of statistical predictive
models.

Conversely, a p-value less than 0.05 would indicate that the data probably does not follow
a normal distribution.

The scipy library is highly versatile and widely used in Python to perform various statistical
tests, including the Shapiro-Wilk test. This module provides a convenient interface for
working with many statistical algorithms and data analysis tools, making its use
particularly popular among researchers and data analysts.
6.10 Chi-Square Test on Contingency Tables

The chi-square test on contingency tables is a statistical test used to determine if there is a
significant relationship between two categorical variables. The contingency table is a table
that shows the observed frequencies of combinations of categories of two variables. The
chi-square test compares the observed frequencies with the expected frequencies,
calculating a statistic that measures the discrepancy between the two.

Some examples include:

• A company can use the chi-square test to examine if there is a relationship between
gender and product preference, checking if preferences are evenly distributed among
different gender groups.
• A company might want to understand if product sales are independent of the
distribution channel (e.g., online vs. physical stores). The chi-square test helps
determine if there is a significant relationship between these factors.
• The test can be used to examine if there are significant differences between employee
groups based on their geographic area and the level of job satisfaction.

The chi-square test has some fundamental requirements for its correct application:

• The observations must be independent of each other, meaning an individual cannot be


included in more than one cell of the contingency table.
• The variables under analysis must be categorical (nominal or ordinal).
• Each cell of the contingency table should have a sufficiently high observed frequency.
Generally, it is preferable for each cell to have at least 5 observations. If the
frequencies are too low, it is possible to aggregate the categories or use an alternative
test like Fisher's exact test.

The chi-square test on contingency tables is based on the following hypotheses:

• Null hypothesis (H₀): the two variables are independent, meaning there is no
relationship between them.
• Alternative hypothesis (H₁): the two variables are dependent, meaning there is a
relationship between them.

The chi-square test statistic (χ²) measures the discrepancy between the observed and
expected frequencies and is calculated as:

χ² = Σᵢ Σⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

with the sum running over all r rows and c columns, where:

• Oᵢⱼ is the observed frequency in the cell of the i-th row and j-th column,
• Eᵢⱼ is the expected frequency in the cell of the i-th row and j-th column,
• r is the number of rows in the contingency table,
• c is the number of columns in the contingency table.

The expected frequency Eᵢⱼ for each cell is calculated as:

Eᵢⱼ = (Rᵢ · Cⱼ) / N

where:

• Rᵢ is the sum of observations in the i-th row,


• Cⱼ is the sum of observations in the j-th column,
• N is the total number of observations.

The p-value of the chi-square test represents the probability of obtaining a statistic
greater than or equal to the one observed, assuming the null hypothesis is true. If the p-
value is less than the significance level a (e.g., 0.05), we reject the null hypothesis,
suggesting that the variables are dependent. If the p-value is greater than the significance
level a, we do not reject the null hypothesis, suggesting that there is not enough evidence
to claim that the variables are dependent.
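
As a sketch of how the pieces fit together (using a small hypothetical 2x3 table), the following code builds the expected frequencies from the marginal totals, sums (O − E)²/E over the cells, derives the p-value from the chi-square distribution, and cross-checks everything with scipy.stats.chi2_contingency.

import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x3 contingency table of observed frequencies (illustration only)
observed = np.array([[30, 45, 25],
                     [20, 35, 45]])

# Expected frequencies: E_ij = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
N = observed.sum()
expected = row_totals @ col_totals / N

# Chi-square statistic, degrees of freedom and p-value
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = chi2.sf(chi2_stat, dof)

# Cross-check with scipy (no continuity correction is applied here, since the table is larger than 2x2)
chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)
print(chi2_stat, p_manual)
print(chi2_scipy, p_scipy)
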

Exercise 89. Analysis of Purchasing Patterns in a Supermarket A supermarket wants


to understand if there is a significant relationship between the age groups of customers
and their preference for certain types of products. They have divided the clientele into
three age groups: 18-30, 31-50, and over 50. The products have been classified into three
categories: Fresh, Canned, and Frozen. Observations on a thousand buyers have been
recorded in the following table.

Age Group   Fresh   Canned   Frozen
18-30       150     60       90
31-50       200     90       110
Over 50     250     110      130

Table 6.1: Age groups and product categories.

Analyze the data and determine if there is a significant association between the age groups
of customers and their product type preferences. Use a significance level of 5%.

Solution

To solve the problem, we need to test the null hypothesis that there is no association
between age groups and product preferences. We will use the chi-square statistical test for
this analysis.

1. Calculate Totals
o Calculate the row and column totals from the table:
• Totals for age groups: 300, 400, 490.
• Totals for product categories: 600, 260, 330.
• Overall total: 1190.
2. Calculate Expected Frequencies (Eᵢⱼ)
o Using the formula: Eᵢⱼ = (Row total · Column total)/Grand total
o Calculate, for example, for cell (18-30, Fresh): (300 · 600)/1190 ≈ 151.26
3. Calculate the Value of χ²
o χ² = Σ((Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ) over all cells.
o Perform this calculation for all combinations in the table.
4. Determine the Degrees of Freedom (df)
o df = (number of rows − 1)(number of columns − 1) = (3 − 1)(3 − 1) = 4
5. Compare with the Critical Value
o We use a significance level α = 0.05 and df = 4.
o The critical value is approximately 9.49 (by consulting the chi-square distribution
table or using software tools; see the snippet below).
6. Test Conclusion
o If χ² > 9.49, we reject the null hypothesis, indicating that there is a significant
association.
o If χ² ≤ 9.49, we cannot reject the null hypothesis.

In this exercise, the manual calculation of χ² will indicate whether the age groups influence
their product preferences based on the provided dataset.
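
As a quick check before the full Python solution, the critical value of approximately 9.49 used in step 5 can be reproduced from the chi-square quantile function:

from scipy.stats import chi2

# 95th percentile of the chi-square distribution with 4 degrees of freedom
critical_value = chi2.ppf(0.95, df=4)
print(round(critical_value, 2))   # about 9.49
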

Solution with Python

import numpy as np
from scipy.stats import chi2_contingency

# Data of the contingency table
supermarket_data = np.array([
    [150, 60, 90],    # 18-30
    [200, 90, 110],   # 31-50
    [250, 110, 130]   # Over 50
])

# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(supermarket_data)

# Significance level
alpha = 0.05

# Print the results
print(f"Chi-square statistic: {chi2}")
print(f"p-value: {p}")
print(f"Degrees of freedom: {dof}")
print(f"Expected frequencies: {expected}")

# Test conclusion
if p < alpha:
    print("We reject the null hypothesis: there is a significant association.")
else:
    print("We cannot reject the null hypothesis: there is no evidence of a significant association.")

Here are the details:

• The contingency table is a NumPy array, supermarket_data, representing the observations
of different age groups and their preferences for product categories.
• The function chi2_contingency returns several values: the chi-square statistic, the
associated p-value, the degrees of freedom, and a matrix of expected frequencies. The
p-value allows us to determine if there is enough statistical evidence to reject the null
hypothesis.
• The chosen significance level is 0.05. If the p-value is lower than this, we reject the null
hypothesis and conclude that there is a significant association between age groups and
product preferences. Otherwise, there isn't enough evidence to state that.
• Finally, the code prints the results of the chi-square statistic, the p-value, the degrees
of freedom, and the calculated expected frequencies. The test conclusion indicates if
there is a significant association between the analyzed variables.

Exercise 90. Analysis of Corporate Transportation Preferences Based on Job


Position A multinational company wishes to understand if there is a significant
relationship between the job position of its employees and their choice of transportation
method for commuting. The positions are divided into three categories: Junior, Middle, and
Senior. The chosen modes of transport are: Car, Public Transport, and Bicycle. Data from a
sample of 900 employees was collected in the following table.
Position   Car   Public Transport   Bicycle
Junior     100   150                50
Middle     120   130                70
Senior     80    90                 110

Table 6.2: Position and type of transport.

Analyze the data to determine if there is a significant association between the job position
of employees and their preference for transportation modes. Use a significance level of 5%.

Solution

First, calculate the expected frequencies for each cell of the contingency table using the
formula (Row total · Column total)/Grand total. The row totals are 300, 320, and 280, the
column totals are 300, 370, and 230, and the grand total is 900. For the Junior category
and Car, the expected frequency is: (300 · 300)/900 = 100. Similarly, calculate all the
expected frequencies:

Position   Car      Public Transport   Bicycle
Junior     100      123.33             76.67
Middle     106.67   131.55             81.77
Senior     93.33    115.11             71.55

Table 6.3: Expected frequencies.

Then, calculate the chi-square value:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed frequency and Eᵢ is the expected frequency. Summing across all
categories gives χ² ≈ 46.

With 4 degrees of freedom (df = (r − 1)(c − 1) = 2 · 2 = 4), compare the calculated chi-
square value with the critical chi-square value for a significance level of 5%. The threshold
value is 9.488.

Since χ² ≈ 46 > 9.488, we reject the null hypothesis. Therefore, there is a significant
relationship between job position and the choice of transportation mode among
employees.
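
As a check on the manual calculation above, the following snippet rebuilds the expected frequencies from the observed table and sums (O − E)²/E; the statistic comes out at roughly 46, with a p-value far below 0.05.

import numpy as np
from scipy.stats import chi2

# Observed frequencies from the position/transport table
observed = np.array([[100, 150, 50],
                     [120, 130, 70],
                     [ 80,  90, 110]])

# Expected frequencies from the marginal totals
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
p_value = chi2.sf(chi2_stat, df=4)
print(round(chi2_stat, 1), p_value)   # roughly 46, p-value well below 0.05
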

Solution with Python


import numpy as np
from scipy.stats import chi2_contingency

# Observed data
observed = np.array([[100, 150, 50], [120, 130, 70], [80, 90, 110]])

# Calculation of the chi-square test of independence
chi2, p, dof, expected = chi2_contingency(observed)

# Results
chi2_calculated = chi2
p_value = p
expected_frequencies = expected

result = {
    'chi2_calculated': chi2_calculated,
    'p_value': p_value,
    'expected_frequencies': expected_frequencies.tolist()
}

print(result)

In this exercise, we use the Python library scipy, which is extremely useful for statistical analysis.

Let's see the different steps of the code:

• Declare a variable observed as a numpy array, which encapsulates the data collected
from the example. This array represents the observed frequencies for each
combination of job position and transportation mode.
• Use chi2_contingency to calculate the chi-square value. The function will return several
values:
o chi2: the calculated chi-square test value.
o p: the p-value, which indicates the probability of obtaining a result at least as
extreme as the one observed, assuming the null hypothesis is true.
o dof: the degrees of freedom.
o expected: an array containing the expected frequencies.
• The results are stored in result as a dictionary. Here, expected_frequencies is converted
into a list of lists for easier interpretation.
• Print the results, which include the chi-square value, the p-value, and the expected
frequencies, helping us verify whether there is a significant association between the
variables in our original observations.

In summary, the code analyzes whether there is a significant relationship between job
position and choice of transportation mode, by comparing observed frequencies with
expected ones and providing a chi-square value that allows us to accept or reject the null
hypothesis.
6.11 Fisher's Exact Test on 2x2 Tables

The Fisher's exact test is a statistical test used to assess the association between two
categorical variables in a 2x2 contingency table. This test is particularly useful when the
observed frequencies in table cells are small, as it does not require large sample conditions
like the chi-square test does.

Fisher's exact test is often used in various business contexts, especially when data is
infrequent or sample sizes are small. Some examples of usage include:

• If a company wants to know if there is a relationship between the success of an


advertising campaign (success/failure) and the gender of the consumer (male/female),
it can use Fisher's exact test, especially if the observed frequencies are low.
• In a clinical study, a pharmaceutical company might want to analyze if the response
rate to a treatment (positive/negative response) is independent of the treatment group
(treatment A/treatment B). If the frequencies are low, Fisher's exact test is an
appropriate choice.
• If quality control is performed on a small quantity of products and it is desired to test if
a defect is evenly distributed among different product groups (e.g., defect/no defect
with respect to a production line), Fisher’s exact test is useful.

Fisher's exact test has no restrictions on sample size and is particularly useful when the
observed frequencies in some cells of the 2x2 contingency table are low. However, there
are certain requirements:

• The variables being analyzed must be categorical (nominal).


• Fisher's exact test is specifically designed for 2x2 contingency tables.
• Observations must be independent of each other.

Fisher's exact test is based on the following hypotheses:

• Null hypothesis (H₀): the two variables are independent, meaning there is no
association between them.
• Alternative hypothesis (H₁): the two variables are dependent, meaning there is a
relationship between them.

Fisher's exact test does not rely on a continuous test statistic like the chi-square but
calculates the exact probability of obtaining the observed data distribution under the null
hypothesis of independence. Specifically, the test calculates the probability of obtaining a
frequency distribution as observed, or more extreme, assuming the two variables are
independent.

The exact probability calculation is based on the hypergeometric distribution, and the
formula to calculate the probability P of a specific frequency configuration (a, b, c, d) in a
2x2 table is given by:

P = [C(a + b, a) · C(c + d, c)] / C(n, a + c)

where:

• a, b, c, d are the values in the cells of the 2x2 contingency table,
• C(n, k) = n!/(k!(n − k)!) is the binomial coefficient,
• x! is the factorial, i.e., the product of the first x integers starting from 1,
• n = a + b + c + d is the total number of observations.

If the p-value is less than α (for example, 5%), the null hypothesis is rejected, suggesting
that the variables are dependent. If the p-value is greater than the significance level α, the
null hypothesis is not rejected, suggesting that the variables are independent.
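
As a sketch with a small hypothetical 2x2 table, the following code evaluates the hypergeometric probability of one specific configuration using the formula above and compares it with the two-sided p-value returned by scipy.stats.fisher_exact, which sums the probabilities of all tables at least as extreme as the observed one.

from math import comb
from scipy.stats import fisher_exact

# Hypothetical 2x2 table (illustration only):
#            success  failure
# group 1      a=8      b=2
# group 2      c=1      d=5
a, b, c, d = 8, 2, 1, 5
n = a + b + c + d

# Exact probability of this particular configuration under independence
p_config = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Two-sided Fisher's exact test on the same table
odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
print(p_config, p_value)
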

Exercise 91. Analysis of the Impact of an Advertising Campaign An e-commerce


company wants to determine the effectiveness of a new advertising campaign. The
company has launched a campaign on two different social media platforms and wants to
understand if the campaign has significantly influenced purchasing behavior.

The company collected the following data one month after the campaign:

                         Purchase   No Purchase
Active Advertisement     45         55
Inactive Advertisement   30         70

Table 6.4: Impact of the campaign.

Evaluate if there is a significant difference in purchasing behavior between the two groups
(with active advertisement and without advertisement).

Solution

To compare the proportions of two groups on a 2x2 contingency table, Fisher's exact test is
an appropriate tool because it provides exact results without making assumptions about
the samples, making it ideal when we have low expected frequencies in a 2x2 matrix. It is
therefore preferred over the chi-square test in such cases.

First, we formulate the hypotheses:

• Null Hypothesis (Ho): There is no significant difference in purchasing behavior between


customers exposed to the advertisement and those not exposed.
• Alternative Hypothesis (H₁): There is a significant difference in purchasing behavior
between customers exposed to the advertisement and those not exposed.

Then, we apply Fisher's exact test on the table using appropriate software.

Calculating, we obtain a p-value, let’s assume p < 0.05, which indicates that we can reject
the null hypothesis at the 5% significance level.

The advertising campaign had a significant impact on purchasing behavior. This suggests
that maintaining or further adapting the social media advertising strategy might be
advantageous for increasing sales.
Solution with Python

from scipy.stats import fisher_exact

# Creation of the data table
# [[Purchases with Advertisement, No Purchases with Advertisement],
#  [Purchases without Advertisement, No Purchases without Advertisement]]
data = [[45, 55], [30, 70]]

# Calculation of the p-value using Fisher's exact test
oddsratio, p_value = fisher_exact(data)

# Interpret the result
def interpret_result(p_value, alpha=0.05):
    if p_value < alpha:
        return "The advertising campaign had a significant impact on purchasing behavior."
    else:
        return "There is no significant difference in purchasing behavior."

interpretation = interpret_result(p_value)

# Output of the results
print(f"Odds ratio: {oddsratio}")
print(f"P-value: {p_value}")
print(interpretation)

In the provided code, Fisher's exact test from scipy.stats is applied to the campaign data.

Let's see the various steps:

1. The data from the two experimental conditions (active advertisement and inactive
advertisement) are stored in a list of lists data. Each inner list represents a row
corresponding to the conditions (purchases and no purchases) for the two situations.
2. The function fisher_exact calculates both the odds ratio and the p-value for the
provided table. The latter allows us to determine if the result is statistically significant.
Here, oddsratio is ignored but printed for information.
3. We define a function interpret_result that checks if the p-value is lower than a certain
level of significance (alpha), typically 0.05. If so, it suggests that the advertising
campaign has a significant impact.
4. Finally, the odds ratio, p-value, and the interpretation of the result are printed. The user
can see if the advertising campaign was statistically significant in modifying purchasing
behavior.

Exercise 92. Analysis of the effectiveness of a promotional strategy A supermarket


chain has implemented a new promotional strategy to increase sales of a certain brand A
product. The promotion was conducted in some selected stores, while no promotional
strategy was implemented in other stores. After a month, the company wants to
understand if the promotional strategy had a significant impact on the product's sales.

The company collected the following data on the sales of brand A product:

                           High Sales   Low Sales
Stores with promotion      80           40
Stores without promotion   60           60

Table 6.5: Impact of the promotional strategy.

Evaluate if there is a significant difference in sales of brand A product between stores with
and without promotion.

Solution
The chosen test is Fisher's exact test, since the data are contained in a 2x2 contingency
table and we want to verify the independence of the two classification criteria. This test is
more appropriate than the chi-square test for comparing the proportions of two groups and
evaluating whether there is a statistically significant difference, but it can only be applied
to 2x2 contingency tables.

The null hypothesis Ho is that there is no relation between the variables involved.

To calculate the p-value, we can use statistical software or an online calculator that allows
for entering the data of the 2x2 table. The resulting p-value is p = 0.012.

Since the p-value (0.012) is below the common significance level of 0.05, we reject the null
hypothesis Ho. Therefore, we can conclude that there is a significant difference in the sales
of brand A product between stores with and without promotion. This suggests that the
promotional strategy had a positive impact on sales.

Solution with Python

from scipy.stats import fisher_exact

# Definition of the contingency table
contingency_table = [[80, 40], [60, 60]]

# Performing Fisher's exact test
oddsratio, p_value = fisher_exact(contingency_table)

# Printing the result
print(f"The resulting p-value is: {p_value}")

# Interpretation of the results
significance_level = 0.05

if p_value < significance_level:
    print("The null hypothesis is rejected: there is a significant difference in sales.")
else:
    print("The null hypothesis is not rejected: there is no evidence of a significant difference in sales.")

Let's see the various steps:

1. The code starts by importing fisher_exact from the scipy.stats module. This function is
designed to calculate Fisher's exact test on a 2x2 contingency table.
2. A nested list contingency_table is created representing the sales data collected between
stores with and without promotion.
3. The fisher_exact function is called passing the contingency table as an argument. This
returns two values: oddsratio and p_value. In this context, we are more interested in the
p_value, which will help to determine statistical significance.
4. The p-value is printed as output to provide the result of the statistical test.
5. The code compares the p-value with a preset significance level (0.05). If the p-value is
less than 0.05, the null hypothesis is rejected, indicating that there is a significant
difference in sales. Conversely, if the p-value were higher, there would be insufficient
evidence to claim the sales difference is statistically significant.

Using Fisher's exact test is useful when sample sizes are small or the distributed data do
not meet some of the necessary assumptions for other parametric tests.
Chapter 7
Confidence Intervals
In this chapter, we will delve into the use of confidence
intervals, which are fundamental tools in statistics for
estimating an unknown parameter of a population based on
a data sample. Confidence intervals provide a range of
values within which the true parameter of the population is
located with a certain probability, usually expressed as a
confidence level (e.g., 95% or 99%). These intervals are
used to infer, with some confidence, the value of a mean or
a proportion from a sample.

Confidence intervals will be calculated using the quantiles of the Student's t-distribution or the normal distribution
depending on the sample size. In the case of small sample
sizes and when the population variance is unknown, the
Student's t-distribution is used. This approach is particularly
useful when working with small datasets, where the
estimation of the standard deviation is less precise. For
large samples, the normal distribution is used thanks to the
central limit theorem, which ensures that the distribution of
the sample mean follows a normal distribution regardless of
the original data distribution when the sample size is
sufficiently large.

Moreover, we will see how to calculate confidence intervals for proportions. For proportions, the normal approximation is
generally used, which is appropriate when the proportion is
not too close to 0 or 1 and when the number of successes
and failures is sufficiently high, usually greater than 10. In
situations where these requirements are not met,
calculating confidence intervals can be more complex. In
these cases, more advanced methods such as the Clopper-
Pearson exact method or the Wilson method might be
considered, which provide more accurate estimates in the
presence of small sample sizes or extreme proportions,
although they will not be discussed in detail in this chapter.
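To make the distinction between the two distributions concrete, here is a minimal sketch (added for illustration, not one of the book's exercises) that prints the 95% two-tailed critical values of the Student's t-distribution for a few sample sizes next to the corresponding value of the normal distribution, showing how the two converge as the sample size grows:

from scipy import stats

# 95% two-tailed critical value of the standard normal distribution
z_crit = stats.norm.ppf(0.975)
print(f"Normal distribution: z = {z_crit:.3f}")

# Critical values of the Student's t-distribution for increasing sample sizes
for n in [5, 10, 30, 100, 1000]:
    t_crit = stats.t.ppf(0.975, n - 1)
    print(f"n = {n:5d}: t = {t_crit:.3f}")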
7.1 Confidence Intervals for the Mean

Confidence intervals are a statistical tool that allows us to estimate a range within which it
is likely that a population parameter, such as the mean, is found. In the case of the mean,
the confidence interval provides an estimate of the range in which the true population
mean is found, based on a sample drawn from the population itself.

To calculate a confidence interval for the mean, the following formula is used:

$$CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$

where:

• $\bar{x}$ is the sample mean,
• $\sigma$ is the population standard deviation,
• $n$ is the sample size,
• $z_{\alpha/2}$ is the critical value of the standard normal distribution for the desired confidence level (for a 95% confidence, the most common, the value is 1.96).

In the case where the population standard deviation is not known, Student's t-distribution is used and the formula becomes:

$$CI = \bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}$$

where $t_{\alpha/2,\,n-1}$ is the critical value of the t-distribution with $n-1$ degrees of freedom and $s$ is the sample standard deviation.

Confidence intervals for the mean find various applications in the business environment.
Here are some examples of usage:

• A company can compare the confidence intervals of the mean of two samples (e.g.,
sales of two products in different periods) to determine if there is a significant
difference between the means. If the confidence intervals overlap, no significant
difference between the means can be concluded, whereas if they do not overlap, the
difference is likely significant.
• A company can compare the confidence interval of the mean of a business variable
(e.g., return on investment) with a predefined benchmark value (e.g., an annual
target). If the confidence interval does not include the benchmark, the company may
consider the performance unsatisfactory.
• A company can use confidence intervals to estimate future trends of business variables like sales, costs, or customer satisfaction. This aids in planning and risk management, allowing decisions to be made based on more accurate forecasts.
• In a quality control context, the confidence interval of the mean of a measurement
(e.g., the size of a product) can be used to verify if the production process is under
control. If the confidence interval falls outside of specified tolerances, this could
indicate a problem in the production process.
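Before moving on to the exercises, here is a minimal sketch of how a confidence interval for the mean can be computed in Python; the data values are invented purely for illustration. It uses scipy's stats.t.interval, which applies the t-distribution formula shown above given the sample mean and the standard error:

import numpy as np
from scipy import stats

# Example data (invented for illustration)
data = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])

n = len(data)
mean = np.mean(data)
sem = stats.sem(data)  # standard error of the mean: s / sqrt(n)

# 95% confidence interval based on the t-distribution with n - 1 degrees of freedom
ci_lower, ci_upper = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print(f"Sample mean: {mean:.2f}")
print(f"95% confidence interval: [{ci_lower:.2f}, {ci_upper:.2f}]")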

Exercise 93. Analysis of Monthly Sales in Two Different Regions

A company wants to compare the sales efficiency between two regions: Region A and Region B. Over the past year, they have collected data on the monthly sales for each region. The management wants an accurate estimate of the average monthly sales for each region and wants to know if there is a significant difference between the two regions.

For both regions, the collected data are as follows (in thousands of euros):

• Region A: [65, 70, 62, 68, 74, 69, 72, 71, 66, 73, 75, 68]
• Region B: [60, 62, 58, 66, 63, 65, 67, 64, 61, 66, 62, 67]

Calculate the 95% confidence interval for the average sales of each region and compare
the two intervals to determine if the sales efficiency significantly differs between the two
regions.

Solution

Let's see the steps to solve the exercise.

Calculation of confidence intervals for Region A:

1. Calculate the sample mean ($\bar{x}_A$) and the sample standard deviation ($s_A$).
2. Calculate the 95% confidence interval:
o Degrees of freedom = $n_A - 1 = 11$
o Use the Student's t-distribution with 11 degrees of freedom to find the critical value $t_{0.025}$.
o Confidence interval = $\bar{x}_A \pm t_{0.025} \cdot \frac{s_A}{\sqrt{n_A}}$

Calculation of confidence intervals for Region B:

1. Calculate the sample mean ($\bar{x}_B$) and the sample standard deviation ($s_B$).
2. Calculate the 95% confidence interval:
o Degrees of freedom = $n_B - 1 = 11$
o Use the Student's t-distribution with 11 degrees of freedom to find the critical value $t_{0.025}$.
o Confidence interval = $\bar{x}_B \pm t_{0.025} \cdot \frac{s_B}{\sqrt{n_B}}$

Let's compare the confidence intervals of the two regions:

• If the intervals do not overlap, we can conclude that there is a significant difference in
average sales between the two regions.
• If the intervals overlap, we cannot conclude that there is a significant difference.

Solution with Python

import numpy as np
from scipy import stats

# Sales data in thousands of euros
sales_region_A = np.array([65, 70, 62, 68, 74, 69, 72, 71, 66, 73, 75, 68])
sales_region_B = np.array([60, 62, 58, 66, 63, 65, 67, 64, 61, 66, 62, 67])

# Calculate the mean and standard deviation for Region A
mean_A = np.mean(sales_region_A)
std_dev_A = np.std(sales_region_A, ddof=1)

# Calculate the mean and standard deviation for Region B
mean_B = np.mean(sales_region_B)
std_dev_B = np.std(sales_region_B, ddof=1)

# Number of data points for each region
n_A = len(sales_region_A)
n_B = len(sales_region_B)

degrees_freedom_A = n_A - 1
degrees_freedom_B = n_B - 1

# Calculate the critical t value for 95% confidence intervals
t_score_A = stats.t.ppf(1 - 0.025, degrees_freedom_A)
t_score_B = stats.t.ppf(1 - 0.025, degrees_freedom_B)

# Confidence interval for Region A
ci_A_lower = mean_A - t_score_A * (std_dev_A / np.sqrt(n_A))
ci_A_upper = mean_A + t_score_A * (std_dev_A / np.sqrt(n_A))

# Confidence interval for Region B
ci_B_lower = mean_B - t_score_B * (std_dev_B / np.sqrt(n_B))
ci_B_upper = mean_B + t_score_B * (std_dev_B / np.sqrt(n_B))

# Results
print(f"95% Confidence interval for Region A: [{ci_A_lower:.2f}, {ci_A_upper:.2f}]")
print(f"95% Confidence interval for Region B: [{ci_B_lower:.2f}, {ci_B_upper:.2f}]")

# Comparison
if ci_A_upper < ci_B_lower or ci_B_upper < ci_A_lower:
    print("The confidence intervals do NOT overlap: there is a significant difference in average sales.")
else:
    print("The confidence intervals overlap: we cannot conclude that there is a significant difference.")

Here are the steps:

1. Monthly sales for Regions A and B are given as lists. We convert these lists into numpy arrays to allow vectorized mathematical operations.
2. Using np.mean() and np.std(ddof=1), we calculate the mean and the sample standard deviation for each region respectively. ddof=1 specifies that we are calculating the sample standard deviation.
3. To calculate the confidence intervals, we use scipy's stats.t.ppf() function, which returns the critical t value given a confidence level and degrees of freedom (number of data points in the sample minus one). These critical values are used to determine the lower and upper limits of the confidence intervals using the formulas:
o Lower limit: $\bar{x} - t_{0.025} \cdot \frac{s}{\sqrt{n}}$
o Upper limit: $\bar{x} + t_{0.025} \cdot \frac{s}{\sqrt{n}}$
4. After printing the confidence intervals for both datasets, we compare them to check if they overlap. If they do not overlap, this suggests a significant difference in average sales between the two regions. We print the conclusion based on this comparison.

Exercise 94. Analysis of the performance of two project teams

A technology company has two software development teams: Team X and Team Y. Over the past year, the company has tracked the number of projects completed monthly by each team.

The company management desires an accurate estimate of the average number of projects completed monthly for each team and wants to know if there is a significant difference in performance between the two teams.

For both teams, the collected data are as follows (projects completed monthly):

• Team X: [8, 9, 7, 10, 12, 11, 13, 8, 9, 11, 10, 12]
• Team Y: [7, 8, 6, 9, 8, 10, 7, 9, 6, 8, 7, 9]

Calculate the 95% confidence interval for the mean number of projects completed by each
team and compare the two intervals to determine if the performance differs significantly
between the two teams.

Solution

Begin with calculating the confidence intervals for the two teams.

Team X:

1. Calculate the mean:
$$\bar{x}_X = \frac{8 + 9 + \ldots + 12}{12} = \frac{120}{12} = 10$$
2. Calculate the sample standard deviation ($s_X$):
$$s_X = \sqrt{\frac{\sum (x_i - \bar{x}_X)^2}{n - 1}} = \sqrt{\frac{(8 - 10)^2 + \ldots + (12 - 10)^2}{12 - 1}} \approx 1.86$$
3. Determine the 95% confidence interval:
$$CI_{95} = \bar{x}_X \pm t_{0.025} \cdot \frac{s_X}{\sqrt{n}}$$
where $t_{0.025} \approx 2.201$ for 11 degrees of freedom.
$$CI_{95} = [8.82, 11.18]$$

Team Y:

1. Calculate the mean:
$$\bar{y} = \frac{7 + 8 + \ldots + 9}{12} = \frac{94}{12} \approx 7.83$$
2. Calculate the sample standard deviation ($s_Y$):
$$s_Y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{(7 - 7.83)^2 + \ldots + (9 - 7.83)^2}{12 - 1}} \approx 1.27$$
3. Calculate the 95% confidence interval:
$$CI_{95} = \bar{y} \pm t_{0.025} \cdot \frac{s_Y}{\sqrt{n}}$$
$$CI_{95} = [7.03, 8.64]$$

The confidence intervals do not overlap:

• Team X: [8.82, 11.18]
• Team Y: [7.03, 8.64]

This suggests that there is a significant difference in average monthly performance between the two teams, with Team X completing more projects on average than Team Y.

Solution with Python

import numpy as np
from scipy import stats

# Data
team_x = [8, 9, 7, 10, 12, 11, 13, 8, 9, 11, 10, 12]
team_y = [7, 8, 6, 9, 8, 10, 7, 9, 6, 8, 7, 9]

# Function to calculate the confidence interval
def calculate_confidence_interval(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_dev = np.std(data, ddof=1)
    t_score = stats.t.ppf(1 - (1 - confidence) / 2, n - 1)
    margin_of_error = t_score * std_dev / np.sqrt(n)
    return mean - margin_of_error, mean + margin_of_error

# Calculate confidence intervals for both teams
interval_x = calculate_confidence_interval(team_x)
interval_y = calculate_confidence_interval(team_y)

# Output confidence intervals
print('95% Confidence interval Team X:', interval_x)
print('95% Confidence interval Team Y:', interval_y)

The code begins by defining the data: the number of projects completed monthly by Team X and Team Y. It then defines a function calculate_confidence_interval that takes a team's data and a confidence level as input, defaulting to 95%.

Within the function, we calculate several parameters:

• The sample mean using np.mean.
• The sample standard deviation with np.std using ddof=1 to indicate that we want the sample standard deviation and not the population's.
• The critical t value using stats.t.ppf, which computes the quantile of the Student's t distribution.
• Finally, the margin of error is calculated by multiplying the critical t value by the sample standard deviation divided by the square root of the number of observations (n).

The confidence intervals for each team are calculated by calling the function with the
team's data and are subsequently displayed on screen using the print function.

Exercise 95. Operational Efficiency Analysis of a Company

A technology company recently implemented a new warehouse management system with the aim of improving operational efficiency. After one month of implementation, the company management wants to verify whether the average order processing time, calculated in minutes, has improved compared to the previous company benchmark of 30 minutes.

Data related to the processing time for a random sample of 50 orders were collected, and
an average of 28 minutes with a standard deviation of 5 minutes was calculated.

1. Calculate and interpret a 95% confidence interval for the average order processing
time after the implementation of the new system.
2. Based on the calculated confidence interval, can the company conclude that the new
system is more efficient than the benchmark of 30 minutes? Justify your answer.

Solution

Let's go through the different points of the exercise:

1. We calculate the 95% confidence interval for the average processing time.
The confidence interval is given by the formula:
$$CI = \bar{x} \pm z \cdot \frac{s}{\sqrt{n}}$$
where:
o $\bar{x} = 28$ is the sample mean
o $s = 5$ is the sample standard deviation
o $n = 50$ is the sample size
o $z = 1.96$ is the critical value for a 95% confidence interval using the normal distribution (an acceptable approximation given the sample size)
Calculate the margin of error:
$$ME = 1.96 \cdot \frac{5}{\sqrt{50}} \approx 1.386$$
Therefore, the confidence interval is:
$$28 \pm 1.386 = [26.61, 29.39]$$
The 95% confidence interval for the average order processing time ranges from approximately 26.61 to 29.39 minutes.
2. To compare the sample mean with the benchmark of 30 minutes, we observe that our confidence interval [26.61, 29.39] is entirely below 30 minutes. This suggests that, with a 95% confidence level, the new warehouse management system indeed reduces the order processing time compared to the company benchmark of 30 minutes. Therefore, the company can conclude that the new system is more efficient.

Solution with Python

import scipy.stats as stats
import math

# Sample data
sample_mean = 28
sample_std_dev = 5
n = 50

# Critical value for a 95% confidence interval
z = stats.norm.ppf(0.975)

# Calculate the margin of error
margin_of_error = z * (sample_std_dev / math.sqrt(n))

# Calculate the confidence interval
confidence_interval_lower = sample_mean - margin_of_error
confidence_interval_upper = sample_mean + margin_of_error
print("95% confidence interval:", (confidence_interval_lower, confidence_interval_upper))

# Check if the confidence interval is below the benchmark of 30 minutes
benchmark = 30

is_more_efficient = confidence_interval_upper < benchmark
print("Is the new system more efficient than the benchmark?:", is_more_efficient)

Here are the main steps of the code implementation:

1. We set up the provided sample data, which includes the sample mean, sample
standard deviation, and sample size.
2. We use the stats module of the scipy library to calculate the critical value z for a two-tailed 95% confidence interval. The argument 0.975 is the cumulative probability up to the upper critical value: for a 95% interval, 2.5% of the probability is left in each tail of the normal distribution.
3. We compute the margin of error using the margin of error formula for a normal
distribution. The margin of error depends on the sample standard deviation and the
sample size.
4. We calculate the lower and upper limits of the confidence interval using the sample
mean and the margin of error.
5. We determine whether the computed confidence interval is entirely below the
benchmark of 30 minutes. If it is, we can conclude that the new system is more
efficient.

The final printed result confirms that the confidence interval is below 30 minutes, and that,
with a 95% confidence level, the new warehouse management system is more efficient
compared to the previous company standard.

Exercise 96. Operational Efficiency Analysis and Benchmark Comparison

A supermarket chain recently introduced a new payment system aiming to reduce the average waiting time in line. Historically, the company's benchmark for the average waiting time is 10 minutes. To evaluate the effectiveness of the new system, a study was conducted on a random sample of 60 receipts, recording an average waiting time of 8.5 minutes with a standard deviation of 2 minutes.

1. Calculate and interpret a 95% confidence interval for the average waiting time in line
with the new payment system.
2. Based on the calculated confidence interval, can the company consider the new
system more efficient than the previous benchmark of 10 minutes? Justify your answer.

Solution
Let's look at the solution to the various questions. To calculate the 95% confidence interval for the mean, we use the following formula:

$$CI = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$

where:

• $\bar{x} = 8.5$ minutes (sample mean)
• $\sigma = 2$ minutes (standard deviation)
• $n = 60$ (sample size)
• $z \approx 1.96$ for the 95% confidence level

Calculate the standard error (SE):

$$SE = \frac{2}{\sqrt{60}} \approx 0.258$$

The confidence interval is, therefore:

$$8.5 \pm 1.96 \cdot 0.258 = 8.5 \pm 0.505$$

Thus, the 95% confidence interval is approximately (7.995, 9.005) minutes.

The calculated confidence interval (7.995, 9.005) minutes does not include the 10-minute
benchmark. This means that, at a 95% confidence level, we can state that the average
waiting time in line with the new system is significantly less than 10 minutes. Therefore,
the company can consider the new system to be more efficient compared to the previous
benchmark.

Solution with Python

import scipy.stats as stats
import numpy as np

# Problem data
# Sample mean
mean = 8.5

# Sample standard deviation
std_dev = 2

# Sample size
n = 60

confidence_level = 0.95

# Calculate the z-score for the required confidence level
z_score = stats.norm.ppf(1 - (1 - confidence_level) / 2)

# Calculate the standard error of the mean (SE)
standard_error = std_dev / np.sqrt(n)

# Calculate the confidence interval
margin_of_error = z_score * standard_error
confidence_interval = (mean - margin_of_error, mean + margin_of_error)

print(confidence_interval)

In this code, we use the scipy.stats library to calculate the Z value for a 95% confidence level.

Subsequently, we calculate the standard error of the mean (SE) using numpy for the square
root of the sample size. The standard error is the standard deviation divided by the square
root of the number of observations in the sample and represents the quantity used to
calculate the margin of error.

The margin of error is then determined by multiplying the obtained z value by the standard error. The final confidence interval is calculated by adding and subtracting the margin of error from the sample mean, thus obtaining the lower and upper limits of the confidence interval. The result is approximately (7.995, 9.005), indicating with 95% confidence that the average waiting time with the new system is within this range, which does not include the 10-minute benchmark; therefore we can conclude that the new system is more efficient.
7.2 Confidence Intervals for Proportions

Confidence intervals for proportions are used to estimate the range within which the
proportion of a certain event falls in a population, based on a drawn sample. These
intervals provide an estimate of the likelihood that the sample proportion of successes
represents the population proportion.

To calculate a confidence interval for the proportion, the following formula is used:

$$CI = \left[\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

where:

• $\hat{p}$ is the sample proportion (the number of successes divided by the total number of observations in the sample),
• $n$ is the sample size,
• $z_{\alpha/2}$ is the critical value from the standard normal distribution for the desired confidence level (as with the mean, a 95% confidence level is associated with the value 1.96).

In cases where the sample size is small or the proportion of successes is very close to 0 or
1, a continuity correction or alternative methods like the Wilson method are used to ensure
more accurate confidence intervals.
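As a brief aside (assuming the statsmodels package is available; it is not used in the solutions of this chapter), the proportion_confint function implements both the normal approximation and the Wilson method, so the two intervals can be compared on a small sample:

from statsmodels.stats.proportion import proportion_confint

# Small illustrative sample: 8 successes out of 12 trials
successes = 8
trials = 12

# Normal approximation (the formula shown above)
ci_normal = proportion_confint(successes, trials, alpha=0.05, method='normal')

# Wilson method: more reliable for small samples or extreme proportions
ci_wilson = proportion_confint(successes, trials, alpha=0.05, method='wilson')

print("Normal approximation:", ci_normal)
print("Wilson method:", ci_wilson)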

Confidence intervals for proportions are very useful in various business contexts, such as
market research, quality control, and customer satisfaction analysis. Here are some
examples of their use:

• Companies use confidence intervals to estimate the proportion of customers who prefer a certain product. For example, if a company surveys 1000 customers and 60% express a preference for a new product, the confidence interval can be used to estimate the proportion of the entire customer population that would prefer the product.
• A company can compare two sample proportions (for example, the proportion of
customers purchasing a product in two different stores) and determine whether the
difference between the two proportions is significant. If the confidence intervals of the
two proportions do not overlap, the difference is likely significant.
• In quality control, the confidence interval for the proportion of defective products in a
sample can be used to monitor production quality. If the confidence interval indicates a
proportion of defective products that is too high, the company may decide to improve
the production process.
• Confidence intervals for proportions can also be used to estimate the percentage of
customers who will make purchases in a specific marketing campaign. For example, if
30% of surveyed customers declare their intention to purchase a product, the
confidence interval can help make more accurate forecasts on the campaign's success.
• Companies can use confidence intervals to estimate the proportion of customers
satisfied with a service or product. This is useful for making informed decisions
regarding the modification of products or services based on customer preferences.

Exercise 97. Analysis of the Return Rate of Advertising Campaigns

An e-commerce company wants to evaluate the effectiveness of two different online advertising campaigns, called Campaign A and Campaign B, launched in the same period. For both Campaign A and Campaign B, it recorded the total number of ad views and the number of conversions (purchases made).

The data are as follows:

• Campaign A: 1200 views, 300 conversions
• Campaign B: 1500 views, 360 conversions

Calculate the 95% confidence interval for the conversion rate of both campaigns. Compare
the two conversion rates and determine if the difference is statistically significant at the
5% significance level.

Assuming that the number of views is sufficient to allow a normal approximation, your task
is to help the company understand which campaign was more effective.

Solution

To solve this problem, we need to calculate the conversion rate for each campaign and
then construct the confidence intervals for these rates. Subsequently, we compare these
intervals to see if there is a significant difference.

Let's start with calculating the conversion rates:

• Campaign A: $\hat{p}_A = \frac{300}{1200} = 0.25$
• Campaign B: $\hat{p}_B = \frac{360}{1500} = 0.24$

Next, we move on to calculating the 95% confidence intervals. We use the formula for the confidence interval for a proportion:

$$CI = \hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $\hat{p}$ is the observed proportion, $z$ is the critical value for a 95% confidence interval (1.96, according to the normal distribution tables), and $n$ is the total number of observations.

• For Campaign A:
$$CI_A = 0.25 \pm 1.96\sqrt{\frac{0.25 \cdot 0.75}{1200}} \approx [0.225, 0.274]$$
• For Campaign B:
$$CI_B = 0.24 \pm 1.96\sqrt{\frac{0.24 \cdot 0.76}{1500}} \approx [0.218, 0.261]$$

The confidence intervals for Campaign A and Campaign B overlap, which implies that we
cannot conclude that there is a statistically significant difference in the performance of the
two campaigns at the 5% level.

Both confidence intervals indicate that the campaigns might have similar conversion rates.
Therefore, the company might need to consider other factors, like cost per view, to decide
which campaign is more advantageous.

Solution with Python

import numpy as np
from scipy.stats import norm

# Campaign data
views_A = 1200
conversions_A = 300
views_B = 1500
conversions_B = 360

# Calculating conversion rates
p_A = conversions_A / views_A
p_B = conversions_B / views_B

# Function to calculate the confidence interval
def confidence_interval(p, n, z=1.96):
    se = np.sqrt((p * (1 - p)) / n)
    return p - z * se, p + z * se

# Calculating the 95% confidence intervals
ci_A = confidence_interval(p_A, views_A)
ci_B = confidence_interval(p_B, views_B)

# Output of the results
print("Conversion rate for Campaign A: ", p_A)
print("Confidence interval for Campaign A: ", ci_A)
print("Conversion rate for Campaign B: ", p_B)
print("Confidence interval for Campaign B: ", ci_B)

# Comparison between confidence intervals
if ci_A[1] < ci_B[0] or ci_A[0] > ci_B[1]:
    print("The difference in the conversion rates is statistically significant.")
else:
    print("The difference in the conversion rates is NOT statistically significant.")

Let's see the various steps:

1. Importing libraries:
o numpy is a widely used library for mathematical calculations in Python. In this
example, it is used for calculating the standard deviation of a proportion.
o scipy.stats is a part of the scipy library, which provides statistical functions,
including z values for the normal distribution.
2. Initial Data:
o We have the views and conversions for two campaigns, A and B.
3. Calculation of conversion rates:
o Conversion rates are calculated by dividing conversions by the total number of
views for each campaign.
4. The confidence_interval function:
o This function calculates the confidence interval for a given proportion using the
critical value z of the normal distribution (1.96 for 95% confidence).
o The standard deviation of the proportion is calculated and then used to determine
the interval.
5. Calculation and output of confidence intervals:
o Confidence intervals for both campaigns are calculated and printed.
6. Comparison of confidence intervals:
o We check if the confidence intervals overlap. If they do not overlap, the difference
in the conversion rate is considered statistically significant.

This process provides a quantitative analysis to determine if one advertising campaign is more effective than another, with a certain level of statistical significance.

Exercise 98. Analysis of the effectiveness of two discount strategies

A supermarket wants to evaluate the effectiveness of two different discount strategies applied to a particular product, referred to as Product X, during a specific week. Strategy 1 offered a 10% discount, while Strategy 2 provided a voucher of 5 euros on the total purchase for customers who bought at least one unit of Product X. The following data were collected:

• Strategy 1: 800 total customers during the week, 240 purchased Product X.
• Strategy 2: 1000 total customers during the week, 310 purchased Product X.
Calculate the 95% confidence interval for the percentage of customers who purchased
Product X for each of the two strategies. Compare the two percentages and determine
whether the difference is statistically significant.

Assume that the number of customers is large enough to allow for the normal
approximation. Assist the supermarket in deciding which strategy had a greater impact on
Product X sales.

Solution

Let’s go through the steps to arrive at the solution.

To calculate the 95% confidence interval of a proportion $p$, we use the following formula:

$$\text{Confidence Interval} = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $\hat{p}$ is the proportion of successes (purchases), $z_{\alpha/2}$ is the critical value of the standard normal distribution (approximately 1.96 for a 95% interval), and $n$ is the sample size.

Strategy 1:

• $\hat{p}_1 = \frac{240}{800} = 0.3$
• The confidence interval is:
$$0.3 \pm 1.96\sqrt{\frac{0.3(1 - 0.3)}{800}} = 0.3 \pm 1.96 \cdot 0.0162 = 0.3 \pm 0.0317 = [0.2683, 0.3317]$$

Strategy 2:

• $\hat{p}_2 = \frac{310}{1000} = 0.31$
• The confidence interval is:
$$0.31 \pm 1.96\sqrt{\frac{0.31(1 - 0.31)}{1000}} = 0.31 \pm 1.96 \cdot 0.0146 = 0.31 \pm 0.0287 = [0.2813, 0.3387]$$

Since the intervals overlap, there is no statistically significant difference between the two
strategies at the 5% level of significance. Therefore, both strategies had a similar impact
on the sales of Product X.

Solution with Python


import scipy.stats as stats
import math

# Data
n1 = 800
x1 = 240
n2 = 1000
x2 = 310

# Calculate proportions
p1_hat = x1 / n1
p2_hat = x2 / n2

# Function to calculate the confidence interval of a proportion
def confidence_interval(p_hat, n, confidence=0.95):
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Calculate confidence intervals for the two strategies
ci1 = confidence_interval(p1_hat, n1)
ci2 = confidence_interval(p2_hat, n2)

print("95% confidence interval for Strategy 1:", ci1)
print("95% confidence interval for Strategy 2:", ci2)

Here's the breakdown:

1. The proportions of customers who purchased Product X under the two strategies are
calculated as the ratio of the number of customers who purchased to the total number
of customers for each strategy.
2. The confidence_interval function uses the normal distribution to calculate the 95% confidence interval of a proportion. The math library is used to compute the square root needed in the calculation of the margin of error.

The overlap between the confidence intervals allows us to say that the two strategies are
statistically similar.

Exercise 99. Market Analysis: Customer Satisfaction Evaluation

An e-commerce company wants to evaluate its customers' satisfaction with its delivery services. In the past, industry research showed a benchmark satisfaction level of 0.88 (88% satisfied customers). The company recently conducted a survey on a random sample of 150 customers, discovering that 132 customers are satisfied.

Based on these results, you need to:

1. Calculate the 95% confidence interval for the proportion of satisfied customers of the
company.
2. Determine, with a 5% level of significance, if the customer satisfaction of the company
is in line with the industry benchmark.

Solve the above points, explaining each step.

Solution

Let's go through the various steps to solve the exercise.

1. Calculation of the 95% confidence interval for the proportion.
The sample proportion $\hat{p}$ is given by:
$$\hat{p} = \frac{132}{150} = 0.88$$
The 95% confidence interval for a proportion is given by:
$$\hat{p} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $z$ for a confidence level of 95% is approximately 1.96. Thus:
$$0.88 \pm 1.96\sqrt{\frac{0.88 \cdot 0.12}{150}} = 0.88 \pm 1.96 \cdot 0.027 = 0.88 \pm 0.0529$$
Hence, the confidence interval is approximately [0.827, 0.933].
2. Comparison between the sample proportion and the benchmark
o Null hypothesis ($H_0$): The customer satisfaction proportion of the company is equal to the benchmark, $p = 0.88$.
o Alternative hypothesis ($H_1$): The customer satisfaction proportion of the company is different from the benchmark, $p \neq 0.88$.
Let's calculate the z value:
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}} = \frac{0.88 - 0.88}{\sqrt{\frac{0.88 \cdot 0.12}{150}}} = 0$$
Since $|z| = 0$ is less than 1.96 (critical value for a two-tailed test with $\alpha = 0.05$), we cannot reject the null hypothesis.

The data does not show a significant difference between the company’s customer
satisfaction proportion and the benchmark. The satisfaction appears to be in line with the
industry benchmark.

Solution with Python

import scipy.stats as st
import math

# Problem data
satisfied_customers = 132
total_customers = 150

# Calculate the sample proportion
p_hat = satisfied_customers / total_customers

# Calculate the 95% confidence interval
z_value = st.norm.ppf(0.975)  # 0.975 because the interval is two-tailed: 1 - 0.05/2

error_margin = z_value * math.sqrt((p_hat * (1 - p_hat)) / total_customers)
confidence_interval = (p_hat - error_margin, p_hat + error_margin)

# Verification with the benchmark
p0 = 0.88

z_test = (p_hat - p0) / math.sqrt((p0 * (1 - p0)) / total_customers)
p_value = 2 * (1 - st.norm.cdf(abs(z_test)))  # Two-tailed p-value

# Results
print('95% confidence interval:', confidence_interval)
print('z-value:', z_test)
print('p-value:', p_value)

Let's go through the steps:

1. Start by defining the problem data, which is the number of satisfied customers and the
total number of customers surveyed. This allows us to calculate the sample proportion
(P).
2. The sample proportion P is simply the ratio of satisfied customers to the total number
of customers.
3. Calculation of the confidence interval:
o Calculate the critical z value for a 95% confidence level using st.norm.ppf, which is the quantile function (the inverse of the cumulative distribution function).
o Calculate the error margin by multiplying the critical z value by the standard error of the sample proportion, i.e. the square root of $\hat{p}(1-\hat{p})/n$.
o Determine the confidence interval by adding and subtracting the error margin from $\hat{p}$.
4. Hypothesis testing against the benchmark:
o Calculate the z value (z-test) to check if the difference between the sample
proportion and the benchmark is sufficiently small to be insignificant.
o The p value is calculated as a two-tailed probability (helpful to check if we can
reject the null hypothesis).
5. The final prints communicate the confidence interval and the result of the comparison
with the benchmark.

The use of math.sqrt is for calculating the square root, while scipy.stats is essential for
obtaining statistical distribution values.

Exercise 100. Analysis of Telephone Customer Service Quality

A telephone service company has received numerous reports about call quality. According to a sector survey conducted last year, the quality benchmark for similar services was 0.90 (90% of calls without disturbances). The company recently conducted a survey on a random sample of 120 customers and found that 105 of them confirmed calls without issues.

Based on these results, it is required to:

1. Calculate the 95% confidence interval for the proportion of customers satisfied with call
quality.
2. Determine, with a significance level of 5%, if the company's call quality is in line with
the sector benchmark.

Solution

To calculate the 95% confidence interval for the proportion of satisfied customers, we use the formula for the confidence interval of a proportion:

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where:

• $\hat{p} = \frac{105}{120} = 0.875$ is the sample proportion.
• $n = 120$ is the sample size.
• $z_{\alpha/2} = 1.96$ is the critical value for a 95% confidence level.

Calculate the standard error (SE):

$$SE = \sqrt{\frac{0.875 \cdot (1 - 0.875)}{120}} \approx 0.0302$$

Calculate the confidence interval:

$$0.875 \pm 1.96 \cdot 0.0302 \approx 0.875 \pm 0.0592$$

Thus, the 95% confidence interval is [0.8158, 0.9342].

Now, let's formulate the hypotheses for the test:

• $H_0$: $p = 0.90$ (call quality is in line with the benchmark)
• $H_1$: $p \neq 0.90$ (call quality is not in line with the benchmark)

Calculate the test statistic z:

$$z = \frac{0.875 - 0.90}{\sqrt{\frac{0.90 \cdot (1 - 0.90)}{120}}} \approx -0.9129$$

Compare $z$ with the critical value $z_{\alpha/2} = \pm 1.96$. Since -0.9129 is within the acceptance range, we do not reject the null hypothesis.

The 95% confidence interval for the proportion of satisfied customers is [0.8158, 0.9342]. With a significance level of 5%, there is not enough evidence to assert that call quality differs from the sector benchmark of 90%. Therefore, we conclude that call quality could be in line with the benchmark.

Solution with Python

import math
from scipy.stats import norm

# Problem data
n = 120
p_hat = 105 / 120  # sample proportion
p_benchmark = 0.90
alpha = 0.05

# Calculate the 95% confidence interval for the sample proportion
z_alpha_half = norm.ppf(1 - alpha / 2)
SE = math.sqrt(p_hat * (1 - p_hat) / n)

# Calculate the margin of error
e_margin = z_alpha_half * SE

# Confidence interval
confidence_interval = (p_hat - e_margin, p_hat + e_margin)

# Calculate the test statistic
SE_benchmark = math.sqrt(p_benchmark * (1 - p_benchmark) / n)
Z = (p_hat - p_benchmark) / SE_benchmark

# Results
results = {
    'interval': confidence_interval,
    'statistics': Z,
    'decision': "Do not reject the null hypothesis" if abs(Z) < z_alpha_half else "Reject the null hypothesis"
}
print(results)

The code uses the scipy library to obtain the critical Z value for a given confidence level,
facilitating statistical calculation compared to manual use of statistical tables.

Let's look at the various steps:

1. First, gather the data: sample size, sample proportion, and sector benchmark.
2. To determine the 95% confidence interval, calculate the margin of error using the critical Z value (obtained using norm.ppf from scipy.stats) and the standard error of the sample proportion.
3. Calculate the confidence interval by adding and subtracting the margin of error from
the sample proportion.
4. For the hypothesis test, calculate the test statistic Z by comparing the difference
between the sample proportion and the benchmark with the standard error of the
benchmark.
5. Compare Z with the critical value; if it’s less, it implies there is no evidence to reject
the null hypothesis, suggesting call quality might be in line with the benchmark.
The Author
Gianluca Malato was born in 1986 and is an Italian data
scientist, entrepreneur, and author. In 2010, he graduated
with honors in Theoretical Physics of Disordered Systems at
the University of Rome "La Sapienza" (thesis advisors:
Giorgio Parisi and Tommaso Rizzo). He worked for years as a
data architect, project manager, data analyst, and data
scientist for a large Italian company. Currently, he holds the
position of Head of Education at profession.ai academy.

He has published several books and articles about Data Science on his blog yourdatateacher.com and the online publication Towards Data Science (towardsdatascience.com). He received the "Top Writer" mention on Medium.com in the "Artificial Intelligence" category for his articles.

He has also written several fiction books in Italian, focusing on the horror, thriller, and fantasy genres.

Website: https://www.yourdatateacher.com

Books: https://www.yourdatateacher.com/en/my-books/

Email Address: gianluca@yourdatateacher.com

LinkedIn Profile: https://linkedin.com/in/gianlucamalato/


Contents
1 Probability

2 Descriptive Statistics

2.1 Mean Value

2.2 Standard Deviation

2.3 Median Value

2.4 Percentiles

2.5 Chebyshev's Inequality

2.6 Identification of Outliers with IQR Method

2.7 Pearson Correlation Coefficient

2.8 Spearman's Rank Correlation Coefficient

3 Regression Analysis

3.1 Linear Regression

3.2 Exponential Regression

4 Conditional Probability

5 Probability Distributions

5.1 Binomial Distribution

5.2 Poisson Distribution

5.3 Exponential Distribution


5.4 Uniform Distribution

5.5 Triangular Distribution

5.6 Normal Distribution

6 Hypothesis Testing

6.1 Student's t-test for a Single Sample Mean

6.2 Student's t-test for the Means of Two Samples

6.3 Z-test on Proportions

6.4 One-Way ANOVA on the Mean of Multiple


Groups

6.5 Kruskal-Wallis Test on the Median of Multiple


Groups

6.6 Levene's Test for Equality of Variances Across


Multiple Groups

6.7 One-Sample Kolmogorov-Smirnov Test

6.8 Two-Sample Kolmogorov-Smirnov Test

6.9 Shapiro-Wilk Normality Test

6.10 Chi-Square Test on Contingency Tables

6.11 Fisher's Exact Test on 2x2 Tables

7 Confidence Intervals

7.1 Confidence Intervals for the Mean

7.2 Confidence Intervals for Proportions
