MATHEMATICS FOR COMPUTER SCIENCE
Presentation Material
Department of Computer Science & Engineering
Course Code: 23CSE5101    Semester: M.Tech 1st Sem
Course Title: MCS    Year: 2025-2026
Faculty Name: Dr Meenakshi Malhotra
MODULE 3
Probability and Information Theory
Probability and Information Theory
Probability theory is the branch of mathematics that deals with measuring the likelihood of events occurring.
It provides tools to analyze situations involving uncertainty and helps in determining how likely certain outcomes are.
The theory uses the concepts of random variables, sample spaces, probability distributions, and more to reason about the possible outcomes of an experiment.
Random Variables
Random experiments are experiments whose outcomes cannot be predicted in advance.
A random variable is a variable that can take on different values randomly.
On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states is.
Random variables may be discrete or continuous.
A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily integers; they can also just be named states that are not considered to have any numerical value.
A continuous random variable is associated with a real value.
Example 1: Suppose we are interested in the number X of heads obtained in three tosses of a coin. Let us see what the random variable is in this experiment.
For that we first find the set of possible outcomes of the experiment, i.e., the sample space S.
The outcomes are
a1 = HHH, a2 = HHT, a3 = HTH, a4 = THH, a5 = HTT, a6 = THT, a7 = TTH, a8 = TTT.
Then S = {a1, a2, ..., a8}.
Let us denote by X(aj) the number of heads obtained when the outcome of our experiment is aj, where j = 1, 2, ..., 8. You can easily check that
X(a1) = 3, X(a2) = X(a3) = X(a4) = 2,
X(a5) = X(a6) = X(a7) = 1, X(a8) = 0.
From here, using the laws of probability, you can calculate the probabilities as follows:
P[X = 0] = P(a8) = 1/8,
P[X = 1] = P({a5, a6, a7}) = P(a5) + P(a6) + P(a7) = 3/8,
P[X = 2] = 3/8 and P[X = 3] = 1/8,
where we read P[X = j] as "the probability that X equals j."
A random variable X is a function defined on the set of outcomes of a random experiment, and it takes on a numerical value corresponding to each outcome of the experiment.
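This PMF can be verified by direct enumeration. Below is a minimal Python sketch (names are illustrative) that lists all eight outcomes and tabulates P[X = k]:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Sample space of three coin tosses: 2^3 = 8 equally likely outcomes.
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# X(a_j) = number of heads in outcome a_j.
heads = Counter(outcome.count("H") for outcome in outcomes)

# Each outcome has probability 1/8, so P[X = k] = (# outcomes with k heads) / 8.
for k in sorted(heads):
    print(k, Fraction(heads[k], len(outcomes)))
# Prints: 0 1/8, 1 3/8, 2 3/8, 3 1/8
```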
Example 2: You are sitting in a plane waiting for its take-off. The pilot announces a delay until some incoming planes land. Suppose you want to find the following:
i) How long will it be before take-off?
ii) How many incoming planes are there?
Let us discuss the random variables for (i) and (ii) of the above example.
The above examples show that random variables can be of different types.
There are mainly two types of random variables:
1) discrete random variables 2) continuous random variables
The random variable in (ii) of Example 2 is discrete and that in (i) is continuous.
Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.
The way we describe a distribution depends on whether the variables are discrete or continuous.
Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables is described using a probability mass function (PMF).
A real-valued random variable X is said to be discrete if the range of X is finite or countably infinite.
The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state, written P(x = x).
Probability mass functions can act on many variables at the same time.
Such a probability distribution over many variables is known as a joint
probability distribution.
Consider a single discrete random variable x with k different states. We can place a uniform distribution on x—that is, make each of its states equally likely—by setting its probability mass function to
P(x = x_i) = 1/k for all i.
This is a valid PMF since Σ_i P(x = x_i) = Σ_i 1/k = k/k = 1.
Random Variables
Definition 1: A random variable X is said to be discrete if the number of values that X can take is finite or countably infinite.
Discrete Variables and Probability Mass Functions
In a quiz contest with 'Yes' or 'No' answers, what is the probability of guessing at least 8 answers correctly out of 10 questions asked?
Also find the probability of the same if each question instead has 4 options, only one of which is correct.
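The number of correct guesses X follows a binomial distribution with n = 10 and success probability p (p = 1/2 for Yes/No, p = 1/4 with four options). A short Python sketch of the computation:

```python
from math import comb

def prob_at_least(n, p, k):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(10, 0.50, 8))  # 56/1024 ~= 0.0547 (Yes/No guessing)
print(prob_at_least(10, 0.25, 8))  # ~= 0.0004 (four options per question)
```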
Multinoulli Distribution
The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.
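As a quick illustration, here is a Python sketch that samples from a categorical distribution with k = 3 named states (the states and probabilities are made up for the example):

```python
import numpy as np

# A multinoulli/categorical distribution over k = 3 named states.
states = ["red", "green", "blue"]
probs = [0.2, 0.5, 0.3]          # must be non-negative and sum to 1

rng = np.random.default_rng(0)
samples = rng.choice(states, size=10, p=probs)
print(samples)
```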
Continuous Variables and Probability Density Functions
A probability distribution over continuous variables is described using a probability density function (PDF). To be a probability density function, a function p must satisfy the following properties:
• The domain of p must be the set of possible states of x.
• For all x, p(x) ≥ 0 (we do not require p(x) ≤ 1).
• ∫ p(x) dx = 1.
A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We integrate the density function to find the actual probability mass of a set of points.
Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set.
In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_[a,b] p(x) dx.
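For instance, the probability mass of the standard normal density over [a, b] can be computed numerically; a sketch using scipy (the interval [−1, 1] is arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def p(x, mu=0.0, sigma=1.0):
    """Standard normal probability density function."""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

a, b = -1.0, 1.0
mass, err = quad(p, a, b)   # integral of p(x) over [a, b]
print(mass)                  # ~= 0.6827 for one standard deviation
```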
For probability mass functions (PMFs) of discrete random variables, the integral above is replaced with a sum.
Geometrically, the normalization condition means that the area bounded by the curve f(x) and the x-axis is equal to unity.
Exponential Distribution
The exponential distribution is widely used in queueing systems, reliability analysis, and Poisson processes. Its density is f(x) = λe^(−λx) for x ≥ 0 and f(x) = 0 otherwise, where λ > 0 is the rate parameter; the corresponding CDF is P(X ≤ x) = 1 − e^(−λx), and the mean is 1/λ.
Common examples:
• Time until the next customer arrives at a shop
• Time before a machine fails
• Time between phone calls at a call center
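A brief Python sketch, assuming (for illustration) that calls arrive at rate λ = 2 per minute, computing the probability that the next call arrives within 30 seconds:

```python
import math

lam = 2.0   # rate: 2 arrivals per minute (illustrative value)
t = 0.5     # 30 seconds = 0.5 minutes

# CDF of the exponential distribution: P(X <= t) = 1 - exp(-lam * t)
prob = 1 - math.exp(-lam * t)
print(prob)      # ~= 0.632

print(1 / lam)   # mean waiting time = 1/lam = 0.5 minutes
```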
Continuous Probability Distributions: Normal
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
N(x; μ, σ²) = √(1/(2πσ²)) exp(−(x − μ)²/(2σ²))
Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model.
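A numpy sketch that draws samples from N(0, 1) and checks the sample mean, standard deviation, and the fraction of mass within one standard deviation:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0

samples = rng.normal(loc=mu, scale=sigma, size=100_000)

print(samples.mean())   # ~= 0.0
print(samples.std())    # ~= 1.0
# Fraction of samples within one standard deviation of the mean (~0.683):
print(np.mean(np.abs(samples - mu) <= sigma))
```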
Bernoulli Distribution
The Bernoulli distribution is a distribution over a single binary random variable.
It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the variable being equal to 1:
P(x = 1) = φ and P(x = 0) = 1 − φ, so E[x] = φ and Var(x) = φ(1 − φ).
Exponential and Laplace Distributions
The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x: p(x; λ) = λ 1_{x≥0} e^(−λx).
A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point μ is the Laplace distribution:
Laplace(x; μ, γ) = (1/(2γ)) exp(−|x − μ|/γ)
The Dirac Distribution and Empirical Distribution
In some cases, all of the mass in a probability distribution clusters around a single point.
This can be accomplished by defining a PDF using the Dirac delta function, δ(x): p(x) = δ(x − μ).
The Dirac delta function can be thought of as the limit point of a series of functions that put less and less mass on all points other than zero.
By defining p(x) to be δ shifted by −μ, we obtain an infinitely narrow and infinitely high peak of probability mass where x = μ.
A common use of the Dirac delta distribution is as a component of an empirical distribution, which places mass 1/m on each of m data points.
Marginal Probability
Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them.
The probability distribution over the subset is known as the marginal probability distribution.
For discrete variables it is found with the sum rule, P(x = x) = Σ_y P(x = x, y = y); for continuous variables, we use integration instead of summation: p(x) = ∫ p(x, y) dy.
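A small numpy sketch of marginalizing a joint table over one variable (the joint values here are made up):

```python
import numpy as np

# Hypothetical joint PMF P(x, y): rows index x (2 states), columns index y (3 states).
joint = np.array([[0.10, 0.05, 0.15],
                  [0.20, 0.30, 0.20]])
assert np.isclose(joint.sum(), 1.0)

p_x = joint.sum(axis=1)   # marginal P(x): sum over y
p_y = joint.sum(axis=0)   # marginal P(y): sum over x
print(p_x)  # [0.3 0.7]
print(p_y)  # [0.3 0.35 0.35]
```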
Conditional Probability
In many cases we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability, written
P(y = y | x = x) = P(y = y, x = x) / P(x = x).
The conditional probability is only defined when P(x = x) > 0.
Conditional probability should not be confused with computing what would happen if some action were undertaken: computing the consequences of an action is called making an intervention query, and intervention queries are the domain of causal modeling.
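Continuing the joint-table sketch above, a conditional distribution is just a row (or column) of the joint renormalized by the corresponding marginal:

```python
import numpy as np

joint = np.array([[0.10, 0.05, 0.15],
                  [0.20, 0.30, 0.20]])   # same hypothetical P(x, y) as above

p_x = joint.sum(axis=1)

# P(y | x = 0): divide the x = 0 row of the joint by P(x = 0).
p_y_given_x0 = joint[0] / p_x[0]
print(p_y_given_x0)         # [0.3333 0.1667 0.5]
print(p_y_given_x0.sum())   # 1.0
```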
Example: Calculate the marginal distribution of pet preference among 22 surveyed men and women:
• People who prefer cats: 7/22 ≈ 0.32
• People who prefer fish: 7/22 ≈ 0.32
• People who prefer dogs: 8/22 ≈ 0.36
Exercise: Consider two random variables X and Y with joint PMF given in a table; find the marginal PMFs of X and Y by summing the joint over the other variable.
Independence and Conditional Independence
Two random variables x and y are independent if their joint distribution factorizes: p(x = x, y = y) = p(x = x) p(y = y) for all x and y.
They are conditionally independent given a variable z if p(x = x, y = y | z = z) = p(x = x | z = z) p(y = y | z = z) for all x, y, and z.
The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable each:
P(x^(1), ..., x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), ..., x^(i−1))
This observation is known as the chain rule or product rule of probability. For example, P(a, b, c) = P(a) P(b | a) P(c | a, b).
Example: In a factory there are 100 units of a certain product, 5 of which are defective. We pick three units from the 100 units at random. What is the probability that none of them is defective?
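A worked solution via the chain rule, with Ai = "the i-th picked unit is not defective", so the answer is P(A1) P(A2 | A1) P(A3 | A1, A2):

```python
from fractions import Fraction

# P(A1) = 95/100, P(A2 | A1) = 94/99, P(A3 | A1, A2) = 93/98
p = Fraction(95, 100) * Fraction(94, 99) * Fraction(93, 98)
print(p, float(p))   # 27683/32340 ~= 0.856
```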
Example: In a factory, Event A = "a machine is selected" with P(A) = 0.6, Event B = "the machine is found to be faulty" with P(B | A) = 0.5, and Event C = "the fault is repairable" with P(C | A, B) = 0.4. By the chain rule,
P(A, B, C) = P(A) · P(B | A) · P(C | A, B) = 0.6 × 0.5 × 0.4 = 0.12
Expectation, Variance and Covariance
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For a discrete variable, E[f(x)] = Σ_x P(x) f(x).
Example: A machine produces defective items according to the following probability distribution for the number of defects per hour:
Defects (X):    0     1     2     3
Probability:  0.1   0.2   0.3   0.4
Calculate the expected number of defects per hour.
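Worked solution as a short Python check:

```python
xs = [0, 1, 2, 3]
ps = [0.1, 0.2, 0.3, 0.4]

# E[X] = sum over x of x * P(x)
expected = sum(x * p for x, p in zip(xs, ps))
print(expected)   # 0*0.1 + 1*0.2 + 2*0.3 + 3*0.4 = 2.0
```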
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution: Var(f(x)) = E[(f(x) − E[f(x)])²].
When the variance is low, the values of f(x) cluster near their expected value.
The square root of the variance is known as the standard deviation.
Example: Consider the random variable X from Problem 3 (customers arriving at a café), with probability distribution:
X:    1     2     3
P:  0.2   0.5   0.3
Compute the variance of the number of customers arriving per minute.
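Worked solution using Var(X) = E[X²] − (E[X])²:

```python
xs = [1, 2, 3]
ps = [0.2, 0.5, 0.3]

mean = sum(x * p for x, p in zip(xs, ps))               # E[X]   = 2.1
second_moment = sum(x**2 * p for x, p in zip(xs, ps))   # E[X^2] = 4.9
variance = second_moment - mean**2                      # Var(X) = E[X^2] - E[X]^2
print(mean, variance)   # 2.1, ~= 0.49 (standard deviation = 0.7)
```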
The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
Cov(f(x), g(y)) = E[(f(x) − E[f(x)]) (g(y) − E[g(y)])]
Example: Two discrete random variables X and Y represent the number of products sold from two different counters (Counter A and Counter B) in a shop within the same hour. Their joint probability distribution is:
X   Y   P(X, Y)
1   1   0.1
1   2   0.2
2   1   0.3
2   2   0.4
Calculate the covariance between X and Y, and interpret whether the counters show a positive or negative relationship.
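Worked solution, using Cov(X, Y) = E[XY] − E[X]E[Y]:

```python
# Joint PMF as (x, y): probability
joint = {(1, 1): 0.1, (1, 2): 0.2, (2, 1): 0.3, (2, 2): 0.4}

ex  = sum(x * p for (x, y), p in joint.items())        # E[X]  = 1.7
ey  = sum(y * p for (x, y), p in joint.items())        # E[Y]  = 1.6
exy = sum(x * y * p for (x, y), p in joint.items())    # E[XY] = 2.7

cov = exy - ex * ey
print(cov)   # ~= -0.02: a slight negative relationship between the counters
```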
Bayesian Learning
Bayesian learning methods are relevant to machine learning for two different reasons.
First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems.
Second, they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
Role of Bayesian learning methods:
1. Provides practical learning algorithms:
   • Naive Bayes learning
   • Bayesian belief network learning
   • Combine prior knowledge (prior probabilities) with observed data
   • Requires prior probabilities
2. Provides a useful conceptual framework:
   • Provides a "gold standard" for evaluating other learning algorithms
   • Additional insight into Occam's razor
Features of Bayesian learning methods:
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Difficulties in applying Bayesian methods:
• They require initial knowledge of many probabilities.
• There is a significant computational cost required to determine the Bayes optimal hypothesis in the general case.
When these probabilities are not known in advance, they are often estimated based on background knowledge, previously available data, and assumptions about the form of the underlying distributions.
Bayes Theorem
In machine learning we are often interested in determining the best hypothesis from some space H, given the observed training data D.
Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
P(h|D) = P(D|h) P(h) / P(D)
P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.
Notation:
• P(h) denotes the prior probability that hypothesis h holds, before we have observed the training data.
• P(D) denotes the prior probability that training data D will be observed.
• P(D|h) denotes the probability of observing data D given some world in which hypothesis h holds.
• P(h|D) is called the posterior probability of h, because it reflects our confidence that h holds after we have seen the training data D.
Maximum a posteriori hypothesis, h_MAP
The goal is to find the most probable hypothesis h ∈ H given the observed data D (or at least one of the maximally probable if there are several). Any maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis:
h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)
In the final step we dropped the term P(D) because it is a constant independent of h.
Maximum likelihood hypothesis, h_ML
• Assume that every hypothesis in H is equally probable a priori (P(hi) = P(hj) for all hi and hj in H). Then h_MAP simplifies to
h_ML = argmax_{h∈H} P(D|h)
• P(D|h) is often called the likelihood of the data D given h.
Summary of basic probability formulas:
• Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)
• Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_i P(Ai) = 1, then P(B) = Σ_i P(B|Ai) P(Ai)
An Example: Does a patient have cancer or not?
There are two alternative hypotheses:
1) that the patient has a particular form of cancer, and
2) that the patient does not.
The available data is from a particular laboratory test with two possible outcomes: ⊕ (positive) and ⊖ (negative).
Prior knowledge: over the entire population of people, only 0.008 have this disease.
Confusion matrix: the test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present.
The situation is summarized by the following probabilities:
P(cancer) = 0.008        P(¬cancer) = 0.992
P(⊕ | cancer) = 0.98     P(⊖ | cancer) = 0.02
P(⊕ | ¬cancer) = 0.03    P(⊖ | ¬cancer) = 0.97
The sensitivity of the test is P(⊕ | cancer) and its specificity is P(⊖ | ¬cancer).
Suppose we now observe a new patient for whom the lab test returns a positive result. The maximum a posteriori hypothesis can be found using:
P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
Thus h_MAP = ¬cancer.
To calculate the probability of having cancer given that the test is positive, i.e., P(cancer | ⊕):
P(cancer | ⊕) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21
This step is warranted because Bayes theorem states that the posterior probabilities are just the above quantities divided by the probability of the data, P(⊕) = 0.0078 + 0.0298 = 0.0376.
Notice: while the posterior probability of cancer is significantly higher than its prior probability, the most probable hypothesis is still that the patient does not have cancer.
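The same arithmetic as a short Python check:

```python
p_cancer = 0.008
p_pos_given_cancer = 0.98      # sensitivity
p_pos_given_no_cancer = 0.03   # false-positive rate (1 - specificity)

num = p_pos_given_cancer * p_cancer                   # 0.0078
den = num + p_pos_given_no_cancer * (1 - p_cancer)    # P(+) = 0.0376
print(num / den)                                      # ~= 0.21
```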
Relationship between Bayes theorem and the problem of concept learning:
Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data.
A brute-force MAP learning algorithm calculates the probability for each possible hypothesis, then outputs the most probable.
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR HYPOTHESES
Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the various p(di|h).
Since the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ², centered around the true target value rather than zero.
Hence
h_ML = argmax_{h∈H} Π_{i=1}^{m} p(di|h) = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) exp(−(di − h(xi))²/(2σ²))
Taking the natural logarithm, which is monotonic and so does not change the argmax, and dropping terms independent of h:
h_ML = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²/(2σ²) = argmax_{h∈H} −Σ_{i=1}^{m} (di − h(xi))²
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity, which gives the least-squared error hypothesis:
h_ML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
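As an illustrative check (a sketch, assuming a constant hypothesis h(x) = c and synthetic noisy targets), the value of c minimizing the squared error coincides with the maximum likelihood estimate under Gaussian noise, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 3.0
d = true_value + rng.normal(scale=0.5, size=50)   # noisy observations di = f(xi) + ei

# Sum of squared errors over a grid of constant hypotheses h(x) = c:
cs = np.linspace(0, 6, 601)
sse = ((d[:, None] - cs[None, :]) ** 2).sum(axis=0)
c_ls = cs[np.argmin(sse)]

print(c_ls)      # ~= d.mean(): the ML hypothesis under Gaussian noise
print(d.mean())
```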