MSBD 5001

Foundations of Data Analytics

Fall 2025

Topic 2: Basic Statistical Analysis

Cecia Chan
Department of Computer Science and Engineering
The Hong Kong University of
Science and Technology

MSBD5001 Fall 2025 1

Basic Statistical Analysis

The lecture notes are prepared based on various sources on the Internet.


Statistical Analysis
• Statistical analysis is the science of collecting, exploring and presenting large
amounts of data (a.k.a. dataset) to discover underlying patterns and trends.
• It is used extensively in science, from physics to social sciences.
• The following are the major tasks in statistical analysis:
• Describing and summarizing the data
• Identifying the relationship between variables
• Forecasting the outcomes


Why is statistical analysis important?
• Describe the data
• “What will my salary be when I graduate?”
• “The average salary of our graduates is $80,000.”

Why is statistical analysis important?
Decide the proper method

New way?
Old way?


Part I
Describing and Summarizing the Data

Data Sets
• Data sets can be thought of as a bunch of numbers or a list of things.
• Examples:
• Suppose we ask twenty students their weights and then record them as:
122 146 65 162 148 155 136 151 151 153
201 156 235 157 160 171 178 197 142 131
• This is a data set of 20 observations.
• Note: Number of items in a sample is called sample size, denoted as n
• Suppose we ask the students their hair color and get the responses:
Red Blond Blond Brown Brown Red Blond Blond Brown Black
Blond Red Red Brown Black Brown Red Black Brown Blond
• Data come in two types:
• Discrete (Example: Hair color data set)
• Continuous (Example: Weight data set)


Describing and Summarizing Data
• There are many ways to describe and summarize our data. We discuss a few
below.
1. Table


Describing and Summarizing Data
2. Bar chart 3. Pie chart


Describing and Summarizing Data
4. Stem-and-leaf plot
• Assume we have data on the maximum ozone reading (in parts per billion, ppb) taken on 80 summer days in a large city.
• A stem-and-leaf plot can be constructed using
• the first digit of the two-digit numbers and the first two digits of the three-digit numbers as the stem, and
• the remaining digit as the leaf.

Advantages of Using a Stem-and-Leaf Plot
• The plot can be constructed quickly using pencil and paper.
• The values of each individual data point can be recovered from the plot.
• The data is arranged compactly since the stem is not repeated in multiple data
points.
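
As a rough illustration of the construction described above, the following Python sketch builds a stem-and-leaf plot. The ozone readings are not reproduced in these notes, so the sketch assumes the twenty student weights from the Data Sets slide instead; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

# Weights of the twenty students from the "Data Sets" slide
# (used here only because the ozone data is not available).
weights = [122, 146, 65, 162, 148, 155, 136, 151, 151, 153,
           201, 156, 235, 157, 160, 171, 178, 197, 142, 131]


def stem_and_leaf(values):
    """Group the last digit (leaf) of each value under its leading digits (stem)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)


plot = stem_and_leaf(weights)

# Print every stem between the smallest and largest, including empty ones,
# so that gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Each printed row can be read back into the original values, for example stem 15 with leaves 113567 stands for 151, 151, 153, 155, 156 and 157.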


Describing and Summarizing Data
• Apart from describing data graphically, data can also be described using numerical measures.
• The following are common numerical descriptive measures of data.
• Describing central tendency
• Mean (arithmetic mean)
• Min
• Max
• Median
• Mode
• Describing variability
• Range
• pth percentile
• Interquartile range (IQR)
• Variance
• Standard deviation

• Before elaborating all the numerical description measures above, we will first
define a few basic concepts of statistics.
• They are population, sample, and sampling error.

Population, Sample and Sampling Error
• Population: The collection of all individuals or items under consideration in a statistical study.
• Sample: Part of the population from which information is collected.
• Sampling error: Reflects the fact that the result we get from our sample is not going to be exactly
equal to the result we would have got if we had been able to measure the entire population.

Population 1 2 3 4 5 6 7 8 9 10 11 12

Sample (every 3rd)

2 5 8 11

A sampling method is a procedure for selecting sample elements from a population.

Example
• A school takes a poll to find out what students want to eat at lunch.
• 70 students are randomly chosen to answer the poll questions.
• What are the population and the sample of this study?

Answer
• Population: All students at the school.
• Sample: The 70 students polled.


Describing and Summarizing Data

• Suppose we have a sample of size n, denoted as $x_1, x_2, \ldots, x_i, \ldots, x_n$. The following are the formal definitions of the descriptive measures mentioned earlier.
• Mean of a sample: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Minimum of a sample: minimum of $\{x_1, x_2, \ldots, x_i, \ldots, x_n\}$
• Maximum of a sample: maximum of $\{x_1, x_2, \ldots, x_i, \ldots, x_n\}$
• Median of a sample: the middle number of the sorted list of $\{x_1, x_2, \ldots, x_i, \ldots, x_n\}$. If n is even, the median is the simple average of the middle two numbers.
• Mode of a sample: the value that appears most often in $\{x_1, x_2, \ldots, x_i, \ldots, x_n\}$
• Range of a sample: minimum to maximum, usually reported as maximum − minimum


Describing and Summarizing Data
• pth percentile of a sample: the value such that roughly p% of the sample is smaller and (100 − p)% of the sample is larger
• Interquartile range (IQR) of a sample: third quartile − first quartile
• First quartile: median of the first half of the sorted data
• Third quartile: median of the second half of the sorted data
• Variance of a sample: $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2$
• Standard deviation of a sample: $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2}$


Practice
• Suppose we ask twenty students their weights and record them as:
65 122 131 136 142 146 148 151 151 153
155 156 157 160 162 171 178 197 201 235
• Mean of a sample:
• Minimum of a sample:
• Maximum of a sample:
• Median of a sample:
• Mode of a sample:
• Range of a sample:
• Interquartile range (IQR) of a sample:
• Third quartile:
• First quartile:
• Variance of a sample:
• Standard deviation of a sample:
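
One way to check answers to this practice is the short Python sketch below, which uses the standard `statistics` module. The quartiles follow the definition given on the earlier slide (median of each half of the sorted data); leaving the middle value out of both halves when n is odd is an assumption, since the slides do not cover that case.

```python
import statistics

weights = [65, 122, 131, 136, 142, 146, 148, 151, 151, 153,
           155, 156, 157, 160, 162, 171, 178, 197, 201, 235]

data = sorted(weights)
n = len(data)

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
minimum, maximum = data[0], data[-1]
value_range = maximum - minimum

# Quartiles as medians of the lower and upper halves of the sorted data.
lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2:]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
iqr = q3 - q1

# Both functions divide by n - 1, matching the sample formulas above.
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"min={minimum}, max={maximum}, range={value_range}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")
```

Library helpers such as `statistics.quantiles` interpolate between data points, so they may report slightly different quartiles than the half-median rule used here.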


More about Graphic Displays of Basic
Statistical Descriptions
• Boxplot
• graphic display of five-number summary
• Histogram
• x-axis represents values, y-axis represents frequencies
• Quantile plot
• each value xi is paired with fi, indicating that approximately 100 fi % of the data are ≤ xi
• Quantile-quantile (q-q) plot
• graphs the quantiles of one univariate distribution against the corresponding quantiles of another
• Scatter plot
• each pair of values is a pair of coordinates and plotted as points in the plane


Measuring the Dispersion of Data:
Quartiles & Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: Data is represented with a box
• Q1, Q3, IQR: The ends of the box are at the first and third quartiles, i.e., the height of
the box is IQR
• Median (Q2) is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
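
The five-number summary and the 1.5 × IQR outlier rule can be sketched in Python as follows. The data is the twenty student weights from the earlier practice slide, and the quartiles are taken as medians of the lower and upper halves, as defined earlier; plotting libraries may use a different quartile convention.

```python
import statistics

weights = [65, 122, 131, 136, 142, 146, 148, 151, 151, 153,
           155, 156, 157, 160, 162, 171, 178, 197, 201, 235]

data = sorted(weights)
n = len(data)

q1 = statistics.median(data[: n // 2])
q3 = statistics.median(data[(n + 1) // 2:])
iqr = q3 - q1

five_number_summary = (data[0], q1, statistics.median(data), q3, data[-1])

# Fences for the usual 1.5 x IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", five_number_summary)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

With these weights the fences are 110.25 and 200.25, so 65, 201 and 235 would be drawn as individual outlier points.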

Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• Differences between histograms and bar charts
• Histograms are used to show distributions of variables while bar charts are used to compare variables
• Histograms plot binned quantitative data while bar charts plot categorical data
• Bars can be reordered in bar charts but not in histograms
• A histogram differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
[Figure: a histogram with x-axis values from 10000 to 90000 and frequencies on the y-axis, shown next to a bar chart]

Histograms Often Tell More than Boxplots
• The two histograms shown on the left may have the same boxplot representation
• The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
• Plots quantile information
• Let xi for i = 1 to N, be the data sorted in increasing order.
• xi is paired with fi, which indicates that approximately 100 fi% of the data are
below or equal to the value xi


Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each quantile.
Unit prices of items sold at Branch 1 tend to be lower than those at Branch 2


Scatterplot
• It uses Cartesian coordinates to display values for two variables for a set of data.
• The following shows the height and weight of 57 baseball players.
• (height, weight)

Things to think about when looking at scatterplots.

Form: Does it have a shape?
Direction: Does the data have a direction?
Strength: Are the points close together or
scattered?

Part II
Identifying Relationship between Variables

Relationship between Variables
• We often collect data from several different variables on a subject.
• A simple example is a form, such as an application form, which is collected from a group of people.
• Each item on the form corresponds to a variable.
• Example: Suppose the form is one that students fill out at a university. Items might include the GPA, major, weight, height, gender, etc.
• We may describe each variable separately using the descriptive statistics, but
often we also want to investigate the relationship between the variables.
• Example: weight and height of students, denoted as Y and X, respectively.


Relationship between Variables
• Plot data with Y (weight) on the vertical axis and X (height) on the horizontal axis of a scatterplot.

Observations
• On the basis of the plot, a linear model is certainly worthy of a first try.


Relationship between Variables
• Linear model
𝑌 = 𝑎 + 𝑏𝑋 + 𝑒

where a is the y-intercept, b is the slope, and e is the random error (i.e., if there were no error, Y would be a deterministic linear function of X).

• Note: X is called the independent variable and Y is called the dependent variable.

Relationship between Variables
1. Eyeball Fit – Pick two points on the plot so that the line passing through them gives a “fairly” good fit.
• To estimate the slope, take two points, say $(X_1, Y_1)$ and $(X_2, Y_2)$, then
$$\hat{b} = \frac{Y_2 - Y_1}{X_2 - X_1}$$
For the student data, we chose the points (69, 160) and (78, 225). Hence the estimate of the slope is
$$\hat{b} = \frac{225 - 160}{78 - 69} = \frac{65}{9} \approx 7.2$$
• To estimate the y-intercept, simply take one of the points, say $(X_1, Y_1)$, then estimate the intercept by solving the linear equation for a, i.e., $\hat{a} = Y_1 - \hat{b}X_1$.
For the student data, we chose the point (69, 160):
$$\hat{a} = 160 - 7.2(69) = -336.8$$
• Thus, the predicted equation is
$$Y = -336.8 + 7.2X$$
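
The eyeball-fit arithmetic can be reproduced with a few lines of Python. This is only a sketch; the helper name `line_through` is made up for the example.

```python
def line_through(point_1, point_2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = point_1, point_2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# The two points chosen by eye for the student data.
slope, intercept = line_through((69, 160), (78, 225))
print(f"unrounded: Y = {intercept:.1f} + {slope:.4f} X")

# The slides round the slope to 7.2 before computing the intercept.
rounded_slope = round(slope, 1)
rounded_intercept = 160 - rounded_slope * 69
print(f"rounded:   Y = {rounded_intercept:.1f} + {rounded_slope} X")
```

Rounding the slope first gives the intercept −336.8 used in the notes; keeping the unrounded slope 65/9 gives an intercept of about −338.3, so the two lines differ slightly.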
Eyeball Fit


Relationship between Variables
2. Least Square Fit
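• Fit a line $Y = a + bX$ such that it minimizes the error S, the sum of squared vertical distances between the data points $(x_i, y_i)$ and the line:
$$S = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$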

Relationship between Variables
• Least Square Fit (Cont'd)
• Alternatively, a and b can be found using the following:

• For the student data, we got
b = 5.530918
a = -215.861
Y = -215.861 + 5.530918X

Can you prove the above?
Hint: $n\bar{X} = \sum_{i=1}^{n} x_i$ and $n\bar{Y} = \sum_{i=1}^{n} y_i$
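
The closed-form expressions for a and b did not survive in these notes, so the sketch below assumes the standard least-squares solution, b = Σ(xᵢ − X̄)(yᵢ − Ȳ) / Σ(xᵢ − X̄)² and a = Ȳ − bX̄. The student data set is not reproduced either, so the five (height, weight) pairs are hypothetical and chosen to lie exactly on a line.

```python
def least_squares_fit(xs, ys):
    """Return (a, b) for the line Y = a + bX that minimizes the squared error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def squared_error(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only (not the student data).
heights = [60, 62, 64, 66, 68]
weights = [110, 120, 130, 140, 150]

a, b = least_squares_fit(heights, weights)
print(f"Y = {a} + {b} X")
print("squared error of the fit:", squared_error(a, b, heights, weights))
```

Because the hypothetical points are perfectly linear, the fitted line passes through all of them and the squared error is zero; with real data the error is positive but no other line gives a smaller value.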


Least Square Fit


Correlation Coefficient
• Correlation coefficient, denoted as r, measures the degree to which two variables' movements are associated.
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{X})^2 \sum_{i=1}^{n}(y_i - \bar{Y})^2}}$$
• where -1 ≤ r ≤ 1.
• r = 1 means a perfect positive linear relationship, i.e., for every increase in one variable, there is a proportional increase in the other.
• r = -1 means a perfect negative linear relationship, i.e., for every increase in one variable, there is a proportional decrease in the other.
• r close to zero indicates little or no linear relationship, i.e., an increase in one variable is not consistently accompanied by an increase or a decrease in the other.
Correlation Coefficient

• For the student data, r = 0.704583.
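
A minimal Python sketch of the correlation coefficient follows, assuming the usual Pearson definition (the covariance term divided by the square root of the product of the two sums of squares). The small data sets are hypothetical, since the student data is not reproduced in these notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
r_perfect_positive = pearson_r(xs, [2, 4, 6, 8, 10])
r_perfect_negative = pearson_r(xs, [10, 8, 6, 4, 2])
r_moderate = pearson_r(xs, [1, 3, 2, 5, 4])

print(r_perfect_positive, r_perfect_negative, r_moderate)
```

The first two calls show the extreme values of r for exactly linear data, while the third gives 0.8, a fairly strong but imperfect positive relationship.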


Part III
Forecasting Outcomes


Experiment, Sample Space, Event
• An experiment is an action where the result is uncertain.
• A sample space is all the possible outcomes of an experiment, denoted as S.
Examples:
1. Flip a coin: S = {H, T}
2. Roll a six-sided die: S = {1, 2, 3, 4, 5, 6}
3. Roll a pair of six-sided dice: S = {(1, 1), (1, 2), (1, 3), …, (6, 6)}.
S consists of 36 pairs of integers.
• An event is a subset of S, denoted by A, B, C, etc.
Examples:
1. Flip a coin: A = {H}
2. Roll a six-sided die: B = {1, 2}
3. Roll a pair of six-sided dice: A = sum of up-faces 7 or 11


Probabilities
• Probability is the measure of how likely an event is to occur out of the number of
possible outcomes.
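• In other words, it is a ratio that compares the number of ways an outcome can occur with the number of all possible outcomes, i.e., when all outcomes are equally likely,
$$P(A) = \frac{\text{number of outcomes in } A}{\text{total number of possible outcomes}}$$
• The following are some facts about probability.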

1. The probability of an event A is a number between 0 and 1.
2. The probability of the sample space is 1.
3. If two events cannot occur at the same time, the probability that one or the
other occurs is the sum of the probabilities of the individual events.
• We denote the probability of event A by P(A).
Examples
• Suppose we roll a six-sided die
• The sample space is S = {1, 2, 3, 4, 5, 6}.
• Let A be the event A = {1, 2}.
• The following are different probabilities on S and
the resulting probability of A = {1, 2}.
• p1 = p2 = p3 = p4 = p5 = p6 = 1/6,
P(A) = 2/6 = 1/3
• p1 = p2 = 0.25, p3 = p4 = 0.15, p5 = p6 = 0.1,
P(A) = 0.5
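
The two probability assignments above can be checked with a short Python sketch. It simply adds the probabilities of the outcomes in A, which is the third fact listed earlier applied to single outcomes; the names `fair_die`, `loaded_die` and `probability` are made up for the example.

```python
from fractions import Fraction

event_a = {1, 2}

# First model: every face is equally likely.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Second model: the unequal probabilities given on the slide.
loaded_die = {
    1: Fraction(25, 100), 2: Fraction(25, 100),
    3: Fraction(15, 100), 4: Fraction(15, 100),
    5: Fraction(10, 100), 6: Fraction(10, 100),
}


def probability(event, model):
    """Add up the probabilities of the outcomes that make up the event."""
    return sum(model[outcome] for outcome in event)


print(probability(event_a, fair_die))    # 1/3
print(probability(event_a, loaded_die))  # 1/2
```

Using `Fraction` keeps the arithmetic exact, so both models can also be checked to sum to exactly 1 over the sample space.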


Determination of Probabilities: Tree
Diagrams
• Calculating probabilities can be hard. Sometimes we add them, sometimes we
multiply them, and often it is hard to figure out what to do.
• To remedy this, we often construct a tree diagram.
• Here is a tree diagram for the toss of a coin:

• The probability of each branch is written on the branch.

• The outcome is written at the end of the branch.


Determination of Probabilities: Tree
Diagrams
• We can extend the tree diagram to two tosses of a coin:

Determination of Probabilities: Tree Diagrams
• How to calculate the overall probabilities?
• Multiply probabilities along the branches.
• Add probabilities down columns.

• Results:
• The probability of “Head, Head” is 0.5 x 0.5 = 0.25.
• All probabilities add to 1.0.
• The probability of getting at least one Head from two tosses is 0.25 + 0.25 + 0.25 = 0.75.


Example
Problem:
• Suppose we have an urn with 30 blue balls and 50 red balls in it and that these balls are identical
except for color.
• Suppose further the balls are well mixed and that we draw 3 balls, without replacement.
• Determine the probability that the balls are all of the same color.
Solution:
• Step 1: Trace a branch up for blue, putting the probability of the first ball being blue, 30/80, on it and a “B” at the end. Likewise, trace a branch down for red with 50/80 on it and an “R” at the end.
• Step 2: The second ball is either blue or red, i.e., at the “B", draw one branch up for second ball
blue with the probability of 29/79 on it, and end it with a “B". Next draw one branch down for
second ball red with the probability 50/79 and end it with an “R”.
• Step 3: The ball can be blue or red so there will be two branches at the end of the four second
step branches.


Determination of Probabilities: Tree
Diagrams


Determination of Probabilities: Tree
Diagrams
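• The probability of three blue balls is
$$\frac{30}{80} \times \frac{29}{79} \times \frac{28}{78} \approx 0.0494$$
• The probability of three red balls is
$$\frac{50}{80} \times \frac{49}{79} \times \frac{48}{78} \approx 0.2386$$
• Finally, the probability that the balls are all the same color is the probability of either 3 reds or 3 blues, which is 0.0494 + 0.2386 = 0.2880.
• (Note: These events cannot happen at the same time.)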
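
The urn calculation can be checked by multiplying the probabilities along each single-color path of the tree, as in the Python sketch below. The function name `prob_all_one_color` is made up for the example.

```python
from fractions import Fraction


def prob_all_one_color(count, total, draws):
    """Multiply the branch probabilities for drawing only one color,
    without replacement, from `total` balls of which `count` have that color."""
    probability = Fraction(1)
    for k in range(draws):
        probability *= Fraction(count - k, total - k)
    return probability


blue, red, draws = 30, 50, 3
total = blue + red

p_three_blue = prob_all_one_color(blue, total, draws)
p_three_red = prob_all_one_color(red, total, draws)

# The two events cannot happen together, so their probabilities add.
p_same_color = p_three_blue + p_three_red

print(round(float(p_three_blue), 4))  # 0.0494
print(round(float(p_three_red), 4))   # 0.2386
print(round(float(p_same_color), 4))  # 0.288
```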


Independence
• Referring to the final tree diagram of the last example, the probabilities on the
branches are called conditional probabilities.
• Example
• Let B2 denote the event that the second ball is blue and
• Let B1 denote the event that the first ball is blue
• Then the probability on the upward branch leaving the first “B” is the probability that B2 occurs given that B1 has occurred, i.e., 29/79.
This is called the conditional probability of B2 given B1 and we will denote it by P(B2 | B1). The bar is pronounced “given”.
• In general, for two events A and B, if P(B|A) = P(B),
i.e., knowledge of A did not change the prediction of B,
then we say that A and B are independent events.


Independence
Question
What is P(B2)? Look at all the end nodes for which the second ball is
blue in the final tree diagram.

• Answer:
$$P(B_2) = \frac{30}{80} \times \frac{29}{79} + \frac{50}{80} \times \frac{30}{79} = \frac{2370}{6320} = \frac{30}{80}$$

Observation
P(B2 | B1) = 29/79 and P(B2) = 30/80, which means B1 and B2 are not independent events. In fact, they are dependent events.

Conditional Probability
• Let A and B be arbitrary events and we want to determine P(B|A).
• Formula of conditional probability
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Assume we repeat the experiment many, many times.

How to compute P(B|A)?
Count how many times A occurs and, among those times, count how many times B has also occurred.

• If A and B are independent events, we get
$$P(A \cap B) = P(A)\,P(B)$$

Example
• Problem:
• A jet airplane has 3 engines which function independently of one another. The probability that an engine fails in flight is 0.0001. Furthermore, the plane can fly if at least one engine is functioning. Determine the probability that the airplane has a successful flight.
• Solution:
• The event we want to consider is A = at least one engine operates throughout the flight.
• Consider the complement of A, $A^c$, which is the event that all three engines fail.
• Let B1 be the event that engine one fails, B2 be the event that engine two fails, and B3 be the event that engine three fails. Hence, $A^c$ is the event that B1 and B2 and B3 all occur. Thus
$$P(A^c) = P(B_1 \cap B_2 \cap B_3)$$

Example (Cont’d)
• Solution (Cont'd):
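• As the engines function independently of one another, B1, B2 and B3 are independent events. So,
$$P(B_1 \cap B_2 \cap B_3) = P(B_1)\,P(B_2)\,P(B_3) = (0.0001)^3 = 10^{-12}$$
• Therefore,
$$P(A^c) = 10^{-12}$$
• Hence,
$$P(A) = 1 - P(A^c) = 1 - 10^{-12}$$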

Random Variables
• In many problems there are only a few events of interest and, furthermore, they can often be characterized in terms of a variable.
• In statistics, a random variable, usually written X, is a variable whose possible
values are numerical outcomes of an experiment.
• Example:
• Rolling a pair of dice, the events of interest are: the sum of up-faces is 2, or 3, or 4, ..., or 12.
Hence, there are only 11 events of interest.
• If we let X = the sum of the up-faces, then the events of interest can be expressed as: X = 2, X
= 3, ..., or X = 12. Hence X characterizes the events of interest. We call X a random variable.
• Random variables come in two types
• Discrete random variables
• Continuous random variables


Discrete Probability Models
• Using the example on the last page, assume that the dice are fair.
• P(X = 3) means the probability that a (1, 2) or a (2, 1) comes up, which is 1/36 + 1/36 = 2/36.
• Using the same reasoning for the other values in the range, we obtain the probability model for X:

x      2     3     4     5     6     7     8     9     10    11    12
p(x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

• For a discrete random variable X, let p(x) denote the probability that X assumes the value x; p(x) is called the probability mass function (distribution).
• p(7) = 6/36
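
The probability model for the sum of two fair dice can be derived by enumerating all 36 equally likely outcomes, as in this Python sketch.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling a pair of six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes give each sum, then divide by 36.
counts = Counter(first + second for first, second in outcomes)
pmf = {total: Fraction(count, len(outcomes))
       for total, count in sorted(counts.items())}

for total, probability in pmf.items():
    print(f"p({total}) = {counts[total]}/36")

# The event "sum of up-faces is 7 or 11" from the earlier slide.
p_seven_or_eleven = pmf[7] + pmf[11]
print("P(7 or 11) =", p_seven_or_eleven)
```

The probabilities rise from 1/36 at X = 2 to 6/36 at X = 7 and fall back to 1/36 at X = 12, and they add up to 1.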


Parameters
• A sample can be generated by a probability model, where parameters are characteristics of the model.
• Suppose 100 values are drawn from the probability model of a number spinner numbered from 1 to 3.
22112212111111112313
13112233212323312121
32133122213311111111
12132212223232333311
11221132321132133112
• The frequency and relative frequency of each number are shown in the table below.

Number   Frequency   Relative frequency
1        43          0.43
2        31          0.31
3        26          0.26

• This sample distribution is an estimate of the probability model for X, i.e., 0.43 is our estimate of p(1), 0.31 is our estimate of p(2), and 0.26 is our estimate of p(3).

Parameters – Mean, Expected Value, or
Expectation
• The mean, expected value, or expectation of a random variable X is written as
E(X) or µ.
• If we observe n random values of X, i.e., x1, x2, … , xn, then the mean of the n values will be approximately equal to E(X) for large n, where E(X) is defined as follows:
$$E(X) = \sum_{x} x\, f(x)$$
• where f(x) is the probability function of X.

Parameters – Mean, Expected Value, or
Expectation
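• Referring to the spinner example, the sample mean is $\bar{x}$ = 183 / 100 = 1.83, which can be calculated in one of the following ways:
$$\bar{x} = \frac{183}{100}$$
$$\bar{x} = \frac{1 \times 43 + 2 \times 31 + 3 \times 26}{100}$$
$$\bar{x} = 1(0.43) + 2(0.31) + 3(0.26)$$
• From the last line, $\bar{x}$ is estimating
$$\mu = 1\,p(1) + 2\,p(2) + 3\,p(3)$$
• where µ is called the mean (or the parameter) of the probability model.
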
Parameters – Variance
• Variance is another parameter of the probability model.
• The variance of a random variable X is written as Var(X) or $\sigma^2$.
• It is a measure of how spread out the distribution of X is.
• Are the values of X clustered tightly around their mean?
• The variance measures how far the values of X are from their mean, on average.
• Variance of X is
$$\sigma^2 = Var(X) = E[(X - \mu)^2] = \sum_{x} (x - \mu)^2 f(x)$$

Variance
• Variance of the spinner sample is calculated by


Binomial Probability Model
• A binomial model is characterized by trials (called Bernoulli trials) which either
end in success or failure.
• Suppose we have n Bernoulli trials and p is the probability of success on a trial.
Then this is a binomial model if
• The Bernoulli trials are independent of one another.
• The probability of success, p, remains the same from trial to trial.
• The binomial random variable, X, is the number of successes in the n trials.
• Over the n trials, there could be no successes, one success, two successes, etc., up to n successes.
• So, the range of X is the set {0, 1, 2, …, n}.
• The probability of observing x successes out of n trials is given by
$$P(X = x) = C^{n}_{x}\, p^{x} (1 - p)^{n - x}$$
where x = 0, 1, …, n.
• If the probabilities of X are distributed in this way, we write X ~ bin(n, p).

Example
• Suppose we want the probability of getting 7 heads in ten flips of an unfair coin
for which the probability of getting a head is 2/3 and the probability of a tail is
1/3.
• In other words, X is bin(10, 2/3) and we want to compute P(X = 7).
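• One possible way of obtaining 7 heads is to observe the pattern HHHHHHHTTT, and the probability of obtaining this pattern is
$$\left(\frac{2}{3}\right)^{7} \left(\frac{1}{3}\right)^{3}$$
• There are $C^{10}_{7} = 120$ patterns that contain exactly 7 heads.
• So, P(X = 7) can be computed by
$$P(X = 7) = C^{10}_{7} \left(\frac{2}{3}\right)^{7} \left(\frac{1}{3}\right)^{3} \approx 0.2601$$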
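
The binomial probability for this example can be evaluated with Python's `math.comb`, assuming the usual binomial formula (number of patterns times the probability of one pattern). The function name `binomial_pmf` is made up for the example.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 2 / 3

patterns_with_7_heads = comb(n, 7)
p_one_pattern = p ** 7 * (1 - p) ** 3
p_seven_heads = binomial_pmf(7, n, p)

print(patterns_with_7_heads)      # 120
print(round(p_seven_heads, 4))    # 0.2601
```

Summing `binomial_pmf(x, n, p)` over x = 0, …, 10 gives 1 (up to floating-point rounding), which is a quick sanity check on the formula.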


Mean and Standard Deviation of Binomial
Probability Model
• Suppose we have an unfair coin for which the probability of getting a head is 2/3,
and the probability of a tail is 1/3.
• Consider tossing the coin five times in a row and counting the number of times we
observe a head.
• We denote this number as X = No. of heads in 5-coin tosses, where 0 ≤ X ≤ 5.
• Consider the example of the Binomial distribution below

• The mean value of the distribution can be calculated as


Mean and Standard Deviation of Binomial
Probability Model
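• In general, there is a formula for the mean of a binomial distribution, µ. There is also a formula for the standard deviation, $\sigma$.
$$\mu = np, \qquad \sigma = \sqrt{np(1 - p)}$$
• In the example above, X is bin(5, 2/3) and so the mean and standard deviation are given by
$$\mu = np = 5 \times \frac{2}{3} = 3.3333$$
$$\sigma = \sqrt{np(1 - p)} = \sqrt{5 \times \frac{2}{3} \times \left(1 - \frac{2}{3}\right)} = \sqrt{1.1111} \approx 1.0541$$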
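
As a check on the bin(5, 2/3) example, the Python sketch below computes the mean and standard deviation directly from the probability mass function and compares them with np and the square root of np(1 − p). Note that np(1 − p) = 1.1111 is the variance; the standard deviation is its square root, about 1.0541.

```python
from math import comb, sqrt

n, p = 5, 2 / 3

# Probability of x heads for x = 0, 1, ..., 5.
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

# Mean and variance straight from the definitions for discrete variables.
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
std_dev = sqrt(variance)

for x, px in enumerate(pmf):
    print(f"P(X = {x}) = {px:.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}, std dev = {std_dev:.4f}")
```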


Shape of Binomial Distribution
• Different values of n and p lead to different distributions with different shapes.

Observations
• In general, the probabilities of a binomial will increase until np and then decrease.
• The probability distribution will be symmetric if p = 1/2, skewed right if p < 1/2, and
skewed left if p > 1/2.
Poisson Probability Model
• A Poisson distribution is a discrete probability distribution for the counts of
events that occur randomly in a given interval of time (or space).
• Let X be the number of events in a given interval. If the mean number of events per interval is λ, the probability of observing x events in a given interval is given by
$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
• where x = 0, 1, 2, 3, 4, …, and e ≈ 2.718282.

• If the probabilities of X are distributed in this way, we write X ~ Po(λ).

Example
• Births in a hospital occur randomly at an average rate of 1.8 births per hour. What
is the probability of observing 4 births in a given hour at the hospital?
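• Let X = No. of births in a given hour. The probability of observing exactly 4 births in a given hour can be calculated as
$$P(X = 4) = \frac{e^{-1.8} (1.8)^{4}}{4!} \approx 0.0723$$
• What about the probability of observing more than or equal to 2 births in a given hour at the hospital?
• We want P(X ≥ 2), i.e.,
$$P(X \ge 2) = 1 - P(X = 0) - P(X = 1) = 1 - e^{-1.8} - 1.8\,e^{-1.8} \approx 0.5372$$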
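
Both birth probabilities can be computed with a few lines of Python, assuming the standard Poisson formula e^(−λ) λ^x / x!. The function name `poisson_pmf` is made up for the example.

```python
from math import exp, factorial


def poisson_pmf(x, rate):
    """Probability of exactly x events when the mean count per interval is `rate`."""
    return exp(-rate) * rate ** x / factorial(x)


births_per_hour = 1.8

p_exactly_4 = poisson_pmf(4, births_per_hour)

# "At least 2" is the complement of "0 or 1".
p_at_least_2 = 1 - poisson_pmf(0, births_per_hour) - poisson_pmf(1, births_per_hour)

print(round(p_exactly_4, 4))   # 0.0723
print(round(p_at_least_2, 4))  # 0.5372
```

Working with the complement avoids summing the infinitely many terms P(X = 2) + P(X = 3) + …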


Mean and Standard Deviation of Poisson
Probability Model
• In general, there is a formula for the mean of a Poisson distribution, µ. There is also a formula for the standard deviation, $\sigma$.
$$\mu = \lambda, \qquad \sigma = \sqrt{\lambda}$$

Shape of Poisson Distribution

Observations
• Unimodal
• Skewed right, becoming more symmetric as λ increases
• Centred roughly on λ
• The variance (spread) increases as λ increases

Probability Models for Continuous Data
• So far, we have considered discrete data and discrete probability distributions.
• In practice, many data that we collect from experiments consist of continuous
measurements.
• So, we need to study probability models for continuous data.
• For continuous data, we do not have equally spaced discrete values so instead we
use a curve or function that describes the probability density over the range of
the distribution.
• The curve is chosen so that the area under the curve is equal to 1.
• If we observe a sample of data from such a distribution, we should see that most of the values occur in regions where the density is highest.

Expectation and Variance of Continuous
Random Variables
• The expectation is defined differently for continuous and discrete random
variables.
• Let X be a continuous random variable with probability density function fX(x).
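• The expected value of X is
$$E(X) = \int_{-\infty}^{\infty} x\, f_X(x)\, dx$$
• Similarly, variance is also defined differently.
$$Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\, dx, \quad \text{where } \mu = E(X)$$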

Normal Probability Model
(a.k.a Gaussian or Gauss or Laplace-Gauss
Distribution)
• There will be many possible probability density functions over a continuous range
of values.
• The normal distribution describes a special class of such distributions that are
symmetric and can be described by two parameters.
• µ = The mean of the distribution.
• 𝜎 = The standard deviation of the distribution.
• Changing the values of µ and 𝜎 alters the positions and shapes of the
distributions.


Normal Probability Model
(a.k.a Gaussian or Gauss or Laplace-Gauss
Distribution)


Normal Probability Model
(a.k.a Gaussian or Gauss or Laplace-Gauss Distribution)
• If X is normally distributed with mean µ and standard deviation $\sigma$, we write
$$X \sim N(\mu, \sigma^2)$$
• The probability density function of the normal distribution is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Standard Normal Distribution
• The standard normal distribution has a mean of zero and a variance of one.
• The following shows the graph of the standard normal distribution, which has probability density function
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$
• If the behavior of a continuous random variable X is described by the distribution $N(\mu, \sigma^2)$,
• then the behavior of the random variable $Z = (X - \mu)/\sigma$ is described by the standard normal distribution N(0, 1).
• We call Z the standardized normal variable.

Example
1. If the random variable X is described by the distribution
N(45, 0.000625) then what is the transformation to obtain the standardized
normal variable?
• Given µ = 45 and σ² = 0.000625, so that σ = 0.025, Z = (X - 45)/0.025 is the required transformation.
2. When the random variable X takes values between 44.95 and 45.05, between which values does the random variable Z lie?
• When X = 45.05, Z = (45.05 - 45) / 0.025 = 2.
• When X = 44.95, Z = (44.95 - 45) / 0.025 = -2.
• Hence Z lies between -2 and 2.

Probabilities and the Standard Normal
Distribution
• Because the standard normal distribution is used so frequently, a table has been produced to help calculate probabilities (see the next page).
• It is based upon the following diagram:

• Since the total area under the curve is equal to 1, it follows from the symmetry of the curve that the area under the curve in the region z > 0 is equal to 0.5.
• The shaded area in the diagram above is the probability that Z takes values between 0 and z1.
• When we look up a value in the table, we obtain the value of the shaded area.

Examples
What is the probability that Z takes values between 0 and 1.9?
• The second column headed ‘0’ is the one to choose and
its entry in the row beginning ‘1.9’ is 4713.
• This is to be read as 0.4713 (we omitted the 0 in each entry for clarity).
• So, the probability that Z takes values between 0 and 1.9 is 0.4713.


Probabilities and the Standard Normal
Distribution
• Now, let's see how to calculate probabilities represented by areas other than those we have shown earlier.
• The following shows what we do if both Z values are positive.
• We can compute the probability that Z takes values between z1 and z2 by subtracting the area between 0 and z1 from the area between 0 and z2.

Example: Find the probability that Z takes values between 1 and 2.
P(0 < Z < z2), i.e. P(0 < Z < 2), is 0.4772.
P(0 < Z < z1), i.e. P(0 < Z < 1), is 0.3413.
Hence,
P(1 < Z < 2) = 0.4772 – 0.3413 = 0.1359.
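
The table look-ups above can be reproduced without a printed table. The sketch below assumes the standard identity P(0 < Z < z) = 0.5 · erf(z / √2) for z ≥ 0, using `math.erf` from the Python standard library; the function name `area_from_zero` is made up for the example.

```python
from math import erf, sqrt


def area_from_zero(z):
    """P(0 < Z < z) for a standard normal Z and z >= 0,
    i.e. the shaded area that the table reports."""
    return 0.5 * erf(z / sqrt(2))


p_0_to_1_9 = area_from_zero(1.9)

# Area between two positive z values: difference of the two table entries.
p_1_to_2 = area_from_zero(2) - area_from_zero(1)

print(round(p_0_to_1_9, 4))  # 0.4713
print(round(p_1_to_2, 4))    # 0.1359
```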
Confidence Intervals
• Taking a random sample from a population and computing a statistic, such as the mean of the data, is a way to approximate the corresponding value for the population, such as the population mean.
• But how well the sample statistic estimates the underlying population value is always an issue.
• A confidence interval addresses this issue because it provides a range of values
which is likely to contain the population parameter of interest.
• Confidence intervals are constructed at a confidence level, such as 95% selected
by the user.
• What does it mean? It means that if the same population is sampled on
numerous occasions and interval estimates are made on each occasion, the
resulting intervals would bracket the true population parameter in approximately
95% of the cases.


Confidence Intervals

• The shaded area is 95% of the total area. If we look at the entry in the table
shown earlier corresponding to z = 1.96, we see that the value is 4750, which
means the probability of Z taking a value between 0 and 1.96 is 0.475. By
symmetry, the probability Z takes a value between -1.96 and 0 is also 0.475.
Combining these results, we see that
P(-1.96 < Z < 1.96) = 0.95 or 95%
• We say that the confidence interval for Z (about its mean of 0) is (-1.96, 1.96). It
follows that there is a 5% chance that Z lies outside this interval.


Example
• Suppose we measure the heights of 40 randomly chosen men, and get
• Mean height of 175cm
• Standard deviation of 20cm
• For a 95% confidence level, the interval for Z is -1.96 to 1.96.
• Use Z = 1.96 in the following formula for the Confidence Interval:
$$\bar{X} \pm Z \frac{s}{\sqrt{n}} = 175 \pm 1.96 \times \frac{20}{\sqrt{40}} \approx 175 \pm 6.2$$
• where $\bar{X}$ is the sample mean, Z is the chosen Z-value, s is the standard deviation, and n is the number of samples.

• In other words, the true mean of ALL men (if we could measure their heights) is likely to be between 168.8cm and 181.2cm.
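
The interval for the height example can be reproduced with the Python sketch below, assuming the usual formula mean ± Z · (standard deviation / √n). The function name `z_confidence_interval` is made up for the example.

```python
from math import sqrt


def z_confidence_interval(sample_mean, std_dev, n, z=1.96):
    """Return (low, high) for the interval mean ± z * std_dev / sqrt(n).
    The default z = 1.96 corresponds to a 95% confidence level."""
    margin = z * std_dev / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = z_confidence_interval(sample_mean=175, std_dev=20, n=40)
print(f"95% confidence interval: {low:.1f}cm to {high:.1f}cm")
```

The margin shrinks with the square root of n, so quadrupling the sample size to 160 men would halve the width of the interval.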


Uniform Distribution
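• A random variable X has a uniform distribution if it is restricted to a finite interval [a, b]
and its density fX(x) is constant over the interval. We write
$$ X \sim U(a, b) $$
• The probability density function of the uniform distribution is given by
$$ f_X(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases} $$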

Mean and Variance of a Uniform Distribution
• Mean of a uniform distribution
$$ E(X) = \frac{a+b}{2} $$
• Variance of a uniform distribution
$$ \mathrm{Var}(X) = \frac{(b-a)^2}{12} $$
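
As a quick sanity check, the standard results E(X) = (a + b)/2 and Var(X) = (b − a)²/12 for a uniform distribution on [a, b] can be compared with simulated draws. This is a sketch; the interval [2, 8] is a hypothetical choice made only for illustration, and the function names are made up here.

```python
import random
from statistics import fmean, pvariance

def uniform_mean(a, b):
    """Mean of a uniform distribution on [a, b]."""
    return (a + b) / 2

def uniform_variance(a, b):
    """Variance of a uniform distribution on [a, b]."""
    return (b - a) ** 2 / 12

a, b = 2.0, 8.0  # hypothetical interval, chosen only for illustration
rng = random.Random(1)
draws = [rng.uniform(a, b) for _ in range(100_000)]

# Closed-form values next to their simulated counterparts.
print(uniform_mean(a, b), fmean(draws))
print(uniform_variance(a, b), pvariance(draws))
```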


Example
• Consider the random variable X, which is uniformly distributed with probability
density function

Find E(X) and Var (X).

• Solution:


Exponential Distribution
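• The exponential distribution of random variable X is written as
$$ X \sim \mathrm{Exp}(\lambda) $$
• where the probability density function is given by
$$ f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases} $$
• and λ > 0 is called the rate of the distribution.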


Mean and Variance of an Exponential
Distribution
• Mean of an exponential distribution
$$ E(X) = \frac{1}{\lambda} $$
• Variance of an exponential distribution
$$ \mathrm{Var}(X) = \frac{1}{\lambda^2} $$


Example
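• Consider the random variable X, which is exponentially distributed with rate
λ = 5:
$$ X \sim \mathrm{Exp}(5) $$
Find E(X) and Var (X).

• Solution:
$$ E(X) = \frac{1}{\lambda} = \frac{1}{5} = 0.2, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2} = \frac{1}{25} = 0.04 $$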
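
The values for rate 5 follow from the standard results E(X) = 1/λ and Var(X) = 1/λ², and they can be compared with simulated draws. This is a minimal sketch; the function names are made up for this illustration, and `random.expovariate` takes the rate λ as its argument.

```python
import random
from statistics import fmean, pvariance

def exponential_mean(rate):
    """Mean of an exponential distribution with the given rate."""
    return 1 / rate

def exponential_variance(rate):
    """Variance of an exponential distribution with the given rate."""
    return 1 / rate ** 2

rate = 5  # rate of the distribution from the example
rng = random.Random(2)
draws = [rng.expovariate(rate) for _ in range(100_000)]

# Closed-form values next to their simulated counterparts.
print(exponential_mean(rate), fmean(draws))
print(exponential_variance(rate), pvariance(draws))
```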


Cumulative Distribution Function
• The cumulative distribution function of random variable X is defined as
$$ F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(u)\,du $$
• where fX(u) is the probability density function of random variable X.

• Graphical interpretations
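
Because the cumulative distribution function is the integral of the density up to x, it can be approximated numerically. The sketch below does this for an exponential density with a hypothetical rate of 5 and compares the result with the known closed form 1 − e^(−λx). The function names and the midpoint rule are choices made for this illustration only.

```python
from math import exp

def exponential_pdf(u, rate):
    """Density of an exponential distribution; zero for negative u."""
    return rate * exp(-rate * u) if u >= 0 else 0.0

def cdf_by_integration(pdf, x, lower, steps=10_000):
    """Approximate the integral of pdf from lower to x with the midpoint rule."""
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    return h * sum(pdf(lower + (i + 0.5) * h) for i in range(steps))

rate = 5   # hypothetical rate, chosen only for illustration
x = 0.3
# The density is zero below 0, so integration can start at 0 instead of -infinity.
numeric = cdf_by_integration(lambda u: exponential_pdf(u, rate), x, lower=0.0)
closed_form = 1 - exp(-rate * x)
print(numeric, closed_form)
```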


Joint Probability Mass Function
• Discrete case
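• Suppose X and Y are two discrete random variables, where X takes values in {x1, x2, …, xm}
and Y takes values in {y1, y2, …, yn}
• We define the joint probability mass function of X and Y by
$$ P(X = x, Y = y) = f(x, y) $$
where f(x, y) ≥ 0 and
$$ \sum_{x} \sum_{y} f(x, y) = 1 $$
• The probability of the event that X = xi and Y = yj is given by
$$ P(X = x_i, Y = y_j) = f(x_i, y_j) $$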


Joint Probability Mass Function
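• A joint probability mass function for X and Y can be represented by a joint
probability table, with one row for each value xi and one column for each value yj.

• The probability that X = xi is obtained by adding all entries in the row corresponding to xi and
is given by
$$ P(X = x_i) = f_X(x_i) = \sum_{j=1}^{n} f(x_i, y_j) $$
• Similarly, the probability that Y = yj is obtained by adding all entries in the column corresponding to yj and is
given by
$$ P(Y = y_j) = f_Y(y_j) = \sum_{i=1}^{m} f(x_i, y_j) $$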


Example
• The joint probability function of two discrete variables X and Y is given by
f(x, y) = c(2x + y),
where x and y can assume all integers such that 0 ≤ x ≤ 2, 0 ≤ y ≤ 3, and f(x, y) = 0 otherwise.
• Find the value of the constant c.
• The sample points (x, y) and the probabilities associated with them, given by
c(2x + y), are tabulated below together with the row and column totals.

            y = 0   y = 1   y = 2   y = 3   Total
   x = 0      0       c      2c      3c      6c
   x = 1     2c      3c      4c      5c     14c
   x = 2     4c      5c      6c      7c     22c
   Total     6c      9c     12c     15c     42c

• Since the grand total 42c must equal 1, we have c = 1/42.


Example (Cont’d)
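• Find P(X = 2, Y = 1).
• From the table, f(2, 1) = c(2·2 + 1) = 5c, so
$$ P(X = 2, Y = 1) = 5c = \frac{5}{42} $$
• Find P(X ≥ 1, Y ≤ 2).
• Adding the entries with x ≥ 1 and y ≤ 2 gives
$$ P(X \ge 1, Y \le 2) = (2c + 3c + 4c) + (4c + 5c + 6c) = 24c = \frac{24}{42} = \frac{4}{7} $$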
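
The whole example can be worked through with exact fractions. This is a sketch: the variable names are made up for this illustration, and the marginal distributions are computed as row and column sums of the joint table.

```python
from fractions import Fraction

xs, ys = range(0, 3), range(0, 4)  # integers with 0 <= x <= 2 and 0 <= y <= 3

# Unnormalised weights 2x + y; c is the constant that makes them sum to 1.
weights = {(x, y): 2 * x + y for x in xs for y in ys}
c = Fraction(1, sum(weights.values()))
f = {point: c * w for point, w in weights.items()}

# Single cell and a rectangular region of the joint table.
p_2_1 = f[(2, 1)]
p_region = sum(p for (x, y), p in f.items() if x >= 1 and y <= 2)

# Marginals: row sums give P(X = x), column sums give P(Y = y).
marginal_x = {x: sum(f[(x, y)] for y in ys) for x in xs}
marginal_y = {y: sum(f[(x, y)] for x in xs) for y in ys}

print(c, p_2_1, p_region)
```

Using `Fraction` keeps every probability exact, so the totals can be compared with 1 without rounding concerns.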


Joint Probability Density Function
• Continuous case
• Suppose X and Y are two continuous random variables
• We define the joint probability density function of X and Y by f(x, y),
where f(x, y) ≥ 0 and
$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1 $$


Example
• The joint probability density function of two continuous random variables X and Y is

• Find the value of the constant c.

• We must have the total probability equal to 1, i.e.,

• Using the definition of f(x, y), the integral has the value

• Then c = 1/96.
Example (Cont’d)
• Find P(1 < X < 2, 2 < Y < 3).
• Using the value c found, we have
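
The normalisation step and the rectangle probability can be sketched in code. Note the assumption: purely for illustration, the density is taken to be f(x, y) = cxy on 0 < x < 4, 1 < y < 5 and zero elsewhere, a hypothetical form that is consistent with c = 1/96. The function name is also made up here.

```python
from fractions import Fraction

def rectangle_integral_xy(x_lo, x_hi, y_lo, y_hi):
    """Exact integral of x*y over a rectangle.

    The double integral factorises into
    (x_hi^2 - x_lo^2)/2 * (y_hi^2 - y_lo^2)/2.
    """
    return Fraction(x_hi**2 - x_lo**2, 2) * Fraction(y_hi**2 - y_lo**2, 2)

# ASSUMPTION: f(x, y) = c*x*y on 0 < x < 4, 1 < y < 5, zero elsewhere.
c = 1 / rectangle_integral_xy(0, 4, 1, 5)   # total probability must equal 1
p = c * rectangle_integral_xy(1, 2, 2, 3)   # P(1 < X < 2, 2 < Y < 3)
print(c, p)
```

Under this assumed density the constant comes out as 1/96, matching the value stated above, and the requested probability follows by integrating over the smaller rectangle.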


Independent Random Variables
• Suppose that X and Y are discrete random variables.
• If the events X = x and Y = y are independent events for all x and y, then we say
that X and Y are independent random variables. In that case
P(X = x, Y = y) = P(X = x)P(Y = y)
• or equivalently
f(x, y) = fX(x) fY(y)
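
The factorisation condition can be checked mechanically for any finite joint table. This is a sketch: `is_independent` is a made-up helper, and both tables below are hypothetical examples, one built as a product of marginals and one that is not.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint, xs, ys):
    """True if f(x, y) == fX(x) * fY(y) for every pair (x, y)."""
    fx = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    fy = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    return all(joint[(x, y)] == fx[x] * fy[y] for x, y in product(xs, ys))

xs, ys = [0, 1], [0, 1, 2]

# Hypothetical table built as a product of marginals, so it is independent.
px = {0: Fraction(1, 4), 1: Fraction(3, 4)}
py = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
independent_table = {(x, y): px[x] * py[y] for x, y in product(xs, ys)}

# Hypothetical table with weights x + y + 1 (total 15), which does not factorise.
dependent_table = {(x, y): Fraction(x + y + 1, 15) for x, y in product(xs, ys)}

print(is_independent(independent_table, xs, ys))
print(is_independent(dependent_table, xs, ys))
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.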


Multivariate Distributions
• All the results derived for the univariate case can be generalized to k random
variables.
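• The joint probability distribution function of X1, X2, …, Xk will have the form
• when the random variables are discrete:
$$ f(x_1, x_2, \dots, x_k) = P(X_1 = x_1, X_2 = x_2, \dots, X_k = x_k) $$
• when the random variables are continuous:
$$ P\big((X_1, \dots, X_k) \in A\big) = \int \cdots \int_{A} f(x_1, x_2, \dots, x_k)\,dx_1 \cdots dx_k $$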


Multivariate Normal Distribution
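• Recall the univariate normal distribution
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
• The k-variate normal distribution is given by
$$ f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right) $$
• where $\mathbf{x} = (x_1, \dots, x_k)^{T}$, $\boldsymbol{\mu}$ is the k × 1 mean vector and $\Sigma$ is the k × k covariance matrix.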
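
For k = 2 the multivariate normal density, (2π)^(−k/2) |Σ|^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ)), can be evaluated with the standard library alone. This is a sketch limited to the bivariate case: the function name is made up for this illustration, and the 2 × 2 inverse is written out by hand rather than taken from a linear algebra library.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def bivariate_normal_pdf(x, mean, cov):
    """Density of a 2-variate normal; cov is [[s11, s12], [s21, s22]]."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d0, d1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), using the 2x2 inverse.
    quad = (s22 * d0 * d0 - (s12 + s21) * d0 * d1 + s11 * d1 * d1) / det
    return exp(-0.5 * quad) / (2 * pi * sqrt(det))

# With a diagonal covariance the joint density is the product of two
# univariate normal densities (hypothetical parameters for illustration).
joint = bivariate_normal_pdf((0.5, 3.0), (1.0, 2.0), [[4.0, 0.0], [0.0, 9.0]])
separate = NormalDist(1.0, 2.0).pdf(0.5) * NormalDist(2.0, 3.0).pdf(3.0)
print(joint, separate)
```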


Bayes Theorem
• In many situations, you will know one conditional distribution P(x|y), along with P(x) and P(y), but
you are really interested in the other conditional distribution P(y|x).
$$ P(y|x) = \frac{P(x|y)P(y)}{P(x)} $$
• Let A1, A2, …, An be a set of mutually exclusive events that together form the
sample space S.
Let B be any event from the same sample space, such that P(B) > 0. Then
$$ P(A_k|B) = \frac{P(B|A_k)P(A_k)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} $$


Example
• For a magazine, the probability that a reader is male, given that the reader is at least 35 years
old, is 0.3. The probability that a reader is male, given that the reader is under 35, is 0.65. If 75% of
the readers are under 35, what is the probability that a randomly chosen reader is
a) Male
b) Female
c) Under 35, given that the reader is female


Example
• Solution:
• (a) Let A1 be the event that the reader is at least 35 years old,
A2 the event that the reader is under 35 years old,
M the event that the reader is male, and
F the event that the reader is female.
$$ P(A_2) = 0.75, \quad P(A_1) = 1 - 0.75 = 0.25 $$
$$ P(M|A_1) = 0.3, \quad P(M|A_2) = 0.65 $$
$$ P(F|A_1) = 0.7, \quad P(F|A_2) = 0.35 $$
$$ P(M) = P(A_1, M) + P(A_2, M) = P(A_1)P(M|A_1) + P(A_2)P(M|A_2) = 0.25 \times 0.3 + 0.75 \times 0.65 = 0.5625 $$
• (b) $P(F) = 1 - P(M) = 1 - 0.5625 = 0.4375$
• (c)
$$ P(A_2|F) = \frac{P(F|A_2)P(A_2)}{P(F|A_1)P(A_1) + P(F|A_2)P(A_2)} = \frac{0.35 \times 0.75}{0.7 \times 0.25 + 0.35 \times 0.75} = 0.6 $$
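
The three answers can be reproduced with a small helper that applies the law of total probability and Bayes' theorem. This is a sketch; the function name `posterior` and the dictionary keys are made up for this illustration.

```python
def posterior(priors, likelihoods):
    """Return ({k: P(A_k | B)}, P(B)) from P(A_k) and P(B | A_k)."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    total = sum(joint.values())  # P(B), by the law of total probability
    return {k: joint[k] / total for k in joint}, total

priors = {"35+": 0.25, "under35": 0.75}            # P(A1), P(A2)
p_male_given = {"35+": 0.3, "under35": 0.65}       # P(M | A1), P(M | A2)
p_female_given = {k: 1 - v for k, v in p_male_given.items()}

_, p_male = posterior(priors, p_male_given)                  # part (a)
post_female, p_female = posterior(priors, p_female_given)    # parts (b), (c)

print(round(p_male, 4), round(p_female, 4), round(post_female["under35"], 4))
```

The same helper works for any number of mutually exclusive events A1, …, An, as long as the priors sum to 1.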
