MATHEMATICS FOR COMPUTER SCIENCE
Presentation Material
Department of Computer Science & Engineering
Course Code: 23CSE5101    Semester: M.Tech 1st Sem
Course Title: MCS    Year: 2025-2026
Faculty Name: Dr Meenakshi Malhotra
MODULE 3
Probability and Information Theory
Probability and Information Theory
Probability theory is the branch of mathematics that deals with measuring the likelihood of events occurring.
It provides tools to analyze situations involving uncertainty and helps in determining how likely certain outcomes are.
The theory uses the concepts of random variables, sample spaces, probability distributions, and more to reason about the possible outcomes of an experiment.
Random Variables
Random experiments are experiments whose outcomes cannot be predicted in advance.
A random variable is a variable that can take on different values randomly.
On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states is.
Random variables may be discrete or continuous.
A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily integers; they can also just be named states that are not considered to have any numerical value.
A continuous random variable is associated with a real value.
Example 1: Suppose we are interested in the number X of heads obtained in three tosses of a coin. Let us see what the random variable is in this experiment.
For that we first find the set of possible outcomes of the experiment, i.e., the sample space S.
The outcomes are
a1 = HHH, a2 = HHT, a3 = HTH, a4 = THH, a5 = HTT, a6 = THT, a7 = TTH, a8 = TTT.
Then S = {a1, a2, ..., a8}.
Let us denote by X(aj) the number of heads obtained when the outcome of our experiment is aj, where j = 1, 2, ..., 8. You can easily check that
X(a1) = 3, X(a2) = X(a3) = X(a4) = 2,
X(a5) = X(a6) = X(a7) = 1, X(a8) = 0.
From here, using the laws of probability, you can calculate the probabilities as follows:
P[X = 0] = P(a8) = 1/8,
P[X = 1] = P({a5, a6, a7}) = P(a5) + P(a6) + P(a7) = 3/8,
P[X = 2] = 3/8 and P[X = 3] = 1/8,
where we read P[X = j] as "the probability that X equals j."
A random variable X is a function defined on the set of outcomes of a random experiment, and it takes on a numerical value corresponding to each outcome of the experiment.
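This PMF can be verified by direct enumeration. Below is a minimal Python sketch (names are illustrative) that lists all eight outcomes and tabulates P[X = k]:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Sample space of three coin tosses: 2^3 = 8 equally likely outcomes.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# X(a_j) = number of heads in outcome a_j.
heads = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome has probability 1/8, so P[X = k] = (# outcomes with k heads) / 8.
for k in sorted(heads):
    print(k, Fraction(heads[k], len(outcomes)))
# Prints: 0 1/8, 1 3/8, 2 3/8, 3 1/8
```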
Example 2: You are sitting in a plane waiting for its take-off. The pilot announces a delay until some incoming planes land. Suppose you want to find the following:
i) How long will it be before take-off?
ii) How many incoming planes are there?
Let us discuss the random variables for (i) and (ii) of the above example.
The above examples show that random variables can be of different types.
There are mainly two types of random variables:
1) discrete random variables 2) continuous random variables
The random variable in (ii) of Example 2 is discrete and that in (i) is continuous.
Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.
The way we describe a distribution depends on whether the variables are discrete or continuous.
Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables is described using a probability mass function (PMF).
A real-valued random variable X is said to be discrete if the range of X is finite or countably infinite.
The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state, written P(x = x).
Probability mass functions can act on many variables at the same time.
Such a probability distribution over many variables is known as a joint
probability distribution.
Consider a single discrete random variable x with k different states. We can place a uniform distribution on x—that is, make each of its states equally likely—by setting its probability mass function to
P(x = x_i) = 1/k for all i.
This is a valid PMF since Σ_i P(x = x_i) = Σ_i 1/k = k/k = 1.
Random Variables
Definition 1: A random variable X is said to be discrete if the number of values that X can take is finite or countably infinite.
Discrete Variables and Probability Mass Functions
In a quiz contest with 'Yes' or 'No' answers, what is the probability of guessing at least 8 answers correctly out of 10 questions asked?
Also find the probability of the same if each question instead has 4 options, only one of which is correct.
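The number of correct guesses X follows a binomial distribution with n = 10 and success probability p (p = 1/2 for Yes/No, p = 1/4 with four options). A short Python sketch of the computation:

```python
from math import comb

def prob_at_least(n, p, k):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(10, 0.50, 8))  # 56/1024 ~= 0.0547 (Yes/No guessing)
print(prob_at_least(10, 0.25, 8))  # ~= 0.0004 (four options per question)
```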
Multinoulli Distribution
The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.
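As a quick illustration, here is a Python sketch that samples from a categorical distribution with k = 3 named states (the states and probabilities are made up for the example):

```python
import numpy as np

# A multinoulli/categorical distribution over k = 3 named states.
states = ["red", "green", "blue"]
probs = [0.2, 0.5, 0.3]          # must be non-negative and sum to 1

rng = np.random.default_rng(0)
samples = rng.choice(states, size=10, p=probs)
print(samples)
```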
Continuous Variables and Probability Density Functions
A probability distribution over continuous variables is described using a probability density function (PDF). To be a probability density function, a function p must satisfy the following properties:
• The domain of p must be the set of possible states of x.
• For all x, p(x) ≥ 0 (we do not require p(x) ≤ 1).
• ∫ p(x) dx = 1.
A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We integrate the density function to find the actual probability mass of a set of points.
Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set.
In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_[a,b] p(x) dx.
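For instance, the probability mass of the standard normal density over [a, b] can be computed numerically; a sketch using scipy (the interval [−1, 1] is arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def p(x, mu=0.0, sigma=1.0):
    """Standard normal probability density function."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

a, b = -1.0, 1.0
mass, err = quad(p, a, b)   # integral of p(x) over [a, b]
print(mass)                  # ~= 0.6827 for one standard deviation
```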
For probability mass functions (PMFs) of discrete random variables, the integral above is replaced with a sum.
Geometrically, the normalization condition means that the area bounded by the curve f(x) and the x-axis is equal to unity.
Exponential Distribution
The exponential distribution is widely used in queueing systems, reliability analysis, and Poisson processes. Its density is f(x) = λe^(−λx) for x ≥ 0 and f(x) = 0 otherwise, where λ > 0 is the rate parameter; the corresponding CDF is P(X ≤ x) = 1 − e^(−λx), and the mean is 1/λ.
Common examples:
• Time until the next customer arrives at a shop
• Time before a machine fails
• Time between phone calls at a call center
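A brief Python sketch, assuming (for illustration) that calls arrive at rate λ = 2 per minute, computing the probability that the next call arrives within 30 seconds:

```python
import math

lam = 2.0   # rate: 2 arrivals per minute (illustrative value)
t = 0.5     # 30 seconds = 0.5 minutes

# CDF of the exponential distribution: P(X <= t) = 1 - exp(-lam * t)
prob = 1 - math.exp(-lam * t)
print(prob)      # ~= 0.632

print(1 / lam)   # mean waiting time = 1/lam = 0.5 minutes
```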
Continuous Probability Distributions: Normal
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
N(x; μ, σ²) = √(1/(2πσ²)) exp(−(x − μ)²/(2σ²))
Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model.
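A numpy sketch that draws samples from N(0, 1) and checks the sample mean, standard deviation, and the fraction of mass within one standard deviation:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0

samples = rng.normal(loc=mu, scale=sigma, size=100_000)

print(samples.mean())   # ~= 0.0
print(samples.std())    # ~= 1.0
# Fraction of samples within one standard deviation of the mean (~0.683):
print(np.mean(np.abs(samples - mu) <= sigma))
```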
Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable.
It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the variable being equal to 1:
P(x = 1) = φ and P(x = 0) = 1 − φ, so E[x] = φ and Var(x) = φ(1 − φ).
Exponential and Laplace Distributions
The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x: p(x; λ) = λ 1_{x≥0} e^(−λx).
A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point μ is the Laplace distribution:
Laplace(x; μ, γ) = (1/(2γ)) exp(−|x − μ|/γ)
The Dirac Distribution and Empirical Distribution
In some cases, all of the mass in a probability distribution clusters around a single point.
This can be accomplished by defining a PDF using the Dirac delta function, δ(x): p(x) = δ(x − μ).
The Dirac delta function can be thought of as the limit point of a series of functions that put less and less mass on all points other than zero.
By defining p(x) to be δ shifted by −μ, we obtain an infinitely narrow and infinitely high peak of probability mass where x = μ.
A common use of the Dirac delta distribution is as a component of an empirical distribution, which places mass 1/m on each of m data points.
Marginal Probability
Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them.
The probability distribution over the subset is known as the marginal probability distribution.
For discrete variables it is found with the sum rule, P(x = x) = Σ_y P(x = x, y = y); for continuous variables, we use integration instead of summation: p(x) = ∫ p(x, y) dy.
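A small numpy sketch of marginalizing a joint table over one variable (the joint values here are made up):

```python
import numpy as np

# Hypothetical joint PMF P(x, y): rows index x (2 states), columns index y (3 states).
joint = np.array([[0.10, 0.05, 0.15],
                  [0.20, 0.30, 0.20]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)   # marginal P(x): sum over y
p_y = joint.sum(axis=0)   # marginal P(y): sum over x
print(p_x)  # [0.3 0.7]
print(p_y)  # [0.3 0.35 0.35]
```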
Conditional Probability
In many cases we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability, written
P(y = y | x = x) = P(y = y, x = x) / P(x = x).
The conditional probability is only defined when P(x = x) > 0.
Conditional probability should not be confused with computing what would happen if some action were undertaken: computing the consequences of an action is called making an intervention query, and intervention queries are the domain of causal modeling.
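Continuing the joint-table sketch above, a conditional distribution is just a row (or column) of the joint renormalized by the corresponding marginal:

```python
import numpy as np

joint = np.array([[0.10, 0.05, 0.15],
                  [0.20, 0.30, 0.20]])   # same hypothetical P(x, y) as above

p_x = joint.sum(axis=1)

# P(y | x = 0): divide the x = 0 row of the joint by P(x = 0).
p_y_given_x0 = joint[0] / p_x[0]
print(p_y_given_x0)         # [0.3333 0.1667 0.5]
print(p_y_given_x0.sum())   # 1.0
```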
Example: Calculate the marginal distribution of pet preference among 22 surveyed men and women:
• People who prefer cats: 7/22 ≈ 0.32
• People who prefer fish: 7/22 ≈ 0.32
• People who prefer dogs: 8/22 ≈ 0.36
Exercise: Consider two random variables X and Y with joint PMF given in a table; find the marginal PMFs of X and Y by summing the joint over the other variable.
Independence and Conditional Independence
Two random variables x and y are independent if their joint distribution factorizes: p(x = x, y = y) = p(x = x) p(y = y) for all x and y.
They are conditionally independent given a variable z if p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z) for all x, y, and z.
The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable each:
P(x^(1), ..., x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), ..., x^(i−1))
This observation is known as the chain rule or product rule of probability. For example, P(a, b, c) = P(a) P(b | a) P(c | a, b).
Example: In a factory there are 100 units of a certain product, 5 of which are defective. We pick three units from the 100 units at random. What is the probability that none of them is defective?
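A worked solution via the chain rule, with Ai = "the i-th picked unit is not defective", so the answer is P(A1) P(A2 | A1) P(A3 | A1, A2):

```python
from fractions import Fraction

# P(A1) = 95/100, P(A2 | A1) = 94/99, P(A3 | A1, A2) = 93/98
p = Fraction(95, 100) * Fraction(94, 99) * Fraction(93, 98)
print(p, float(p))   # 27683/32340 ~= 0.856
```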
Example: In a factory, Event A = "a machine is selected" with P(A) = 0.6, Event B = "the machine is found to be faulty" with P(B | A) = 0.5, and Event C = "the fault is repairable" with P(C | A, B) = 0.4. By the chain rule,
P(A, B, C) = P(A) · P(B | A) · P(C | A, B) = 0.6 × 0.5 × 0.4 = 0.12
Expectation, Variance and Covariance
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For a discrete variable, E[f(x)] = Σ_x P(x) f(x).
Example: A machine produces defective items according to the following probability distribution for the number of defects per hour:
Defects (X):    0     1     2     3
Probability:  0.1   0.2   0.3   0.4
Calculate the expected number of defects per hour.
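Worked solution as a short Python check:

```python
xs = [0, 1, 2, 3]
ps = [0.1, 0.2, 0.3, 0.4]

# E[X] = sum over x of x * P(x)
expected = sum(x * p for x, p in zip(xs, ps))
print(expected)   # 0*0.1 + 1*0.2 + 2*0.3 + 3*0.4 = 2.0
```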
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution: Var(f(x)) = E[(f(x) − E[f(x)])²].
When the variance is low, the values of f(x) cluster near their expected value.
The square root of the variance is known as the standard deviation.
Example: Consider the random variable X from Problem 3 (customers arriving at a café), with probability distribution:
X:    1     2     3
P:  0.2   0.5   0.3
Compute the variance of the number of customers arriving per minute.
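Worked solution using Var(X) = E[X²] − (E[X])²:

```python
xs = [1, 2, 3]
ps = [0.2, 0.5, 0.3]

mean = sum(x * p for x, p in zip(xs, ps))               # E[X]   = 2.1
second_moment = sum(x**2 * p for x, p in zip(xs, ps))   # E[X^2] = 4.9
variance = second_moment - mean**2                      # Var(X) = E[X^2] - E[X]^2
print(mean, variance)   # 2.1, ~= 0.49 (standard deviation = 0.7)
```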
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])]
Example: Two discrete random variables X and Y represent the number of products sold from two different counters (Counter A and Counter B) in a shop within the same hour. Their joint probability distribution is:
X   Y   P(X, Y)
1   1   0.1
1   2   0.2
2   1   0.3
2   2   0.4
Calculate the covariance between X and Y, and interpret whether the counters show a positive or negative relationship.
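Worked solution, using Cov(X, Y) = E[XY] − E[X]E[Y]:

```python
# Joint PMF as (x, y): probability
joint = {(1, 1): 0.1, (1, 2): 0.2, (2, 1): 0.3, (2, 2): 0.4}

ex  = sum(x * p for (x, y), p in joint.items())        # E[X]  = 1.7
ey  = sum(y * p for (x, y), p in joint.items())        # E[Y]  = 1.6
exy = sum(x * y * p for (x, y), p in joint.items())    # E[XY] = 2.7

cov = exy - ex * ey
print(cov)   # ~= -0.02: a slight negative relationship between the counters
```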
Bayesian Learning
Bayesian learning methods are relevant to machine learning for two different reasons.
First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
Second, they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
Role of Bayesian learning methods:
1. Provides practical learning algorithms:
   • Naive Bayes learning
   • Bayesian belief network learning
   • Combine prior knowledge (prior probabilities) with observed data
   • Requires prior probabilities
2. Provides a useful conceptual framework:
   • Provides a "gold standard" for evaluating other learning algorithms
   • Additional insight into Occam's razor
Features of Bayesian learning methods:
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Difficulties in applying Bayesian methods:
• They require initial knowledge of many probabilities.
• There is a significant computational cost required to determine the Bayes optimal hypothesis in the general case.
When these probabilities are not known in advance, they are often estimated based on background knowledge, previously available data, and assumptions about the form of the underlying distributions.
Bayes Theorem
In machine learning we are often interested in determining the best hypothesis from some space H, given the observed training data D.
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
P(h|D) = P(D|h) P(h) / P(D)
P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.
Notation:
• P(h) denotes the prior probability that hypothesis h holds, before we have observed the training data.
• P(D) denotes the prior probability that training data D will be observed.
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.
Maximum a posteriori hypothesis, h_MAP
The goal is to find the most probable hypothesis h ∈ H given the observed data D (or at least one of the maximally probable if there are several). Any maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis:
h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)
In the final step we dropped the term P(D) because it is a constant independent of h.
Maximum likelihood hypothesis, h_ML
• Assume that every hypothesis in H is equally probable a priori (P(hi) = P(hj) for all hi and hj in H). Then h_MAP simplifies to
h_ML = argmax_{h∈H} P(D|h)
• P(D|h) is often called the likelihood of the data D given h.
Summary of basic probability formulas:
• Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)
• Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_i P(Ai) = 1, then P(B) = Σ_i P(B|Ai) P(Ai)
An Example: Does a patient have cancer or not?
There are two alternative hypotheses:
1) that the patient has a particular form of cancer, and
2) that the patient does not.
The available data is from a particular laboratory test with two possible outcomes: ⊕ (positive) and ⊖ (negative).
Prior knowledge: over the entire population of people, only 0.008 have this disease.
Confusion matrix: the test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present.
The situation is summarized by the following probabilities:
P(cancer) = 0.008        P(¬cancer) = 0.992
P(⊕ | cancer) = 0.98     P(⊖ | cancer) = 0.02
P(⊕ | ¬cancer) = 0.03    P(⊖ | ¬cancer) = 0.97
The sensitivity of the test is P(⊕ | cancer) and its specificity is P(⊖ | ¬cancer).
Suppose we now observe a new patient for whom the lab test returns a positive result. The maximum a posteriori hypothesis can be found using:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
Thus h_MAP = ¬cancer.
To calculate the probability of having cancer given that the test is positive, i.e., P(cancer | ⊕):
P(cancer | ⊕) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
This step is warranted because Bayes theorem states that the posterior probabilities are just the above quantities divided by the probability of the data, P(⊕) = 0.0078 + 0.0298 = 0.0376.
Notice: while the posterior probability of cancer is significantly higher than its prior probability, the most probable hypothesis is still that the patient does not have cancer.
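The same arithmetic as a short Python check:

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98      # sensitivity
p_pos_given_no_cancer = 0.03   # false-positive rate (1 - specificity)

num = p_pos_given_cancer * p_cancer                   # 0.0078
den = num + p_pos_given_no_cancer * (1 - p_cancer)    # P(+) = 0.0376
print(num / den)                                      # ~= 0.21
```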
Relationship between Bayes theorem and the problem of concept learning:
Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data.
A brute-force MAP learning algorithm calculates the probability for each possible hypothesis, then outputs the most probable.
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the various p(di|h).
Since the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ², centered around the true target value rather than zero.
Hence
h_ML = argmax_{h∈H} Π_{i=1}^{m} p(di|h) = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) exp(−(di − h(xi))²/(2σ²))
Taking the natural logarithm, which is monotonic and so does not change the argmax, and dropping terms independent of h:
h_ML = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²/(2σ²) = argmax_{h∈H} −Σ_{i=1}^{m} (di − h(xi))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity, which gives the least-squared error hypothesis:
h_ML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
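As an illustrative check (a sketch, assuming a constant hypothesis h(x) = c and synthetic noisy targets), the value of c minimizing the squared error coincides with the maximum likelihood estimate under Gaussian noise, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
d = true_value + rng.normal(scale=0.5, size=50)   # noisy observations di = f(xi) + ei

# Sum of squared errors over a grid of constant hypotheses h(x) = c:
cs = np.linspace(0, 6, 601)
sse = ((d[:, None] - cs[None, :]) ** 2).sum(axis=0)
c_ls = cs[np.argmin(sse)]

print(c_ls)      # ~= d.mean(): the ML hypothesis under Gaussian noise
print(d.mean())
```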