DAT405 – Part 2: Statistical methods in Data Science and AI
DAT405/DIT407, LP4 2022-2023, Module 4
[Overview map – Statistical methods in data science and AI: descriptive statistics (tendency/dispersion, histograms/boxplots); estimation (least squares, maximum likelihood, Bayesian estimation, EM algorithm, confidence intervals, jackknife, bootstrap, MCMC); data analysis and exploration (parametric, non-parametric, sampling, hypothesis testing); classification, clustering and regression (Bayesian, LDA, MCMC, HMMs, principal component analysis (PCA), mixture models, factor analysis, linear/ridge/logistic regression, SVMs, kNN, Bayesian networks, neural networks, Markov networks, Gaussian processes).]
Syllabus – part 2
• Module 4: Bayesian statistics and graphical models
• Lecture 7: Bayesian statistics
• Lecture 8: Graphical models
• Module 5: Markov models, kernel methods and decision trees
• Lecture 9: Markov models, reinforcement learning
• Lecture 10: Kernel methods and decision trees

Module 4: Bayesian statistics
Probability theory and statistics – a quick refresher

Probability versus statistics

Probability – predict the likelihood of a future event: given the model, predict the data.

Statistics – estimate the frequency of a past event: given the data, predict the model.
Sample space, events and random experiments
• A random experiment is a process that produces random outcomes.
• The sample space is the set of all possible outcomes in an experiment.
• An event is an outcome, or a subset of possible outcomes, of an experiment.
Example: roll a die
• Sample space: S = {1, 2, …, 6}, i.e. 6 outcomes
• Events:
  • "At least 3" = {3, 4, 5, 6}
  • "Six" = {6}
  • "Odd" = {1, 3, 5}
• Probabilities:
  P(at least 3) = 4/6
  P(six) = 1/6
  P(odd) = 3/6
Venn diagrams of set operations

Union: A ∪ B        Intersection: A ∩ B        Mutually exclusive: A ∩ B = ∅

[Three Venn diagrams over the sample space S, one for each operation.]
Combining events
• Now assume we want to combine two events:
  A = "at least 3", B = "odd number"
• Union
  A ∪ B = {3, 4, 5, 6} ∪ {1, 3, 5} = {1, 3, 4, 5, 6}
  P(A ∪ B) = 5/6
• Intersection
  A ∩ B = {3, 4, 5, 6} ∩ {1, 3, 5} = {3, 5}
  P(A ∩ B) = 2/6
Conditional probability
• The conditional probability of an event A, given the knowledge that event B occurred, is
  P(A|B) = P(A ∩ B) / P(B) = P(A, B) / P(B)
• Note also
  P(A, B) = P(A|B) P(B) = P(B|A) P(A)
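As a quick sanity check of these definitions, here is a minimal Python sketch (assuming a fair die, so all outcomes in S are equally likely) that reproduces the union, intersection and conditional probabilities above with plain set operations:

```python
from fractions import Fraction

# Fair six-sided die: every outcome in the sample space S is equally likely.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """P(event) under a uniform distribution on S."""
    return Fraction(len(event & S), len(S))

A = {3, 4, 5, 6}   # "at least 3"
B = {1, 3, 5}      # "odd number"

print(prob(A | B))            # union: 5/6
print(prob(A & B))            # intersection: 2/6 = 1/3
print(prob(A & B) / prob(B))  # conditional P(A|B): 2/3
```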
Thomas Bayes
(1701 – 1761)

• Developed the idea of using probability to represent uncertainty about beliefs
• Most importantly: gave a method for updating beliefs given new evidence

Bayes’ rule
• Bayes’ rule
  P(A|B) = P(B|A) P(A) / P(B)
Proof:
• A ∩ B = B ∩ A ⇒ P(A ∩ B) = P(B ∩ A)
• P(A ∩ B) = P(A|B) P(B)
• P(B ∩ A) = P(B|A) P(A)
⇒ P(A|B) P(B) = P(B|A) P(A), then divide by P(B)
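As a concrete check, take the die events from before: A = "at least 3" and B = "odd". Then P(B|A) = (2/6)/(4/6) = 2/4, P(A) = 4/6 and P(B) = 3/6, so Bayes’ rule gives P(A|B) = (2/4)(4/6)/(3/6) = 2/3, which agrees with computing P(A ∩ B)/P(B) = (2/6)/(3/6) directly.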
Bayes’ rule interpretation

  P(A|B) = P(B|A) P(A) / P(B)

  posterior = likelihood × prior / normalizer

We have prior information P(A) about event A, and then update the posterior probability P(A|B) as more information/data B is obtained.
Bayes’ rule interpretation

Prior:
Before making the observation, you think the probability of your hypothesis is P(A).

Posterior:
After making the observation B, you think the probability of your hypothesis is P(A|B).
Example: spam or ham?
• Suppose I get an email with the word ”invest”. Is it more likely to be spam or ham?
• Hard to estimate P(spam|”invest”). Bayes’ rule will help!
• Estimated proportions of emails sent to me that are spam or ham:
  • P(spam) = 0.4
  • P(ham) = 0.6
• Proportions of emails containing the word ”invest” (easier to estimate!):
  • P(”invest”|spam) = 0.05
  • P(”invest”|ham) = 0.01
Example (cont.)

P(spam|”invest”) = P(”invest”|spam) P(spam) / P(”invest”) = 0.05 ⋅ 0.4 / P(”invest”) = 0.02 / P(”invest”)

P(ham|”invest”) = P(”invest”|ham) P(ham) / P(”invest”) = 0.01 ⋅ 0.6 / P(”invest”) = 0.006 / P(”invest”)

⇒ P(spam|”invest”) > P(ham|”invest”)

We didn’t have to estimate P(”invest”)!
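The comparison is easy to reproduce in code. Below is a minimal sketch (the dictionary layout is just one way to organize it): since P("invest") is the same in both expressions, comparing the numerators is enough, and normalizing them afterwards recovers the full posteriors.

```python
# Spam-or-ham with the numbers from the slides.
priors = {"spam": 0.4, "ham": 0.6}          # P(class)
likelihoods = {"spam": 0.05, "ham": 0.01}   # P("invest" | class)

# Unnormalized numerators P("invest"|class) * P(class): enough for the comparison.
scores = {c: likelihoods[c] * priors[c] for c in priors}
print(scores)                       # {'spam': 0.02, 'ham': 0.006}
print(max(scores, key=scores.get))  # 'spam'

# Normalizing by their sum (= P("invest")) gives the actual posteriors.
evidence = sum(scores.values())
print({c: round(s / evidence, 3) for c, s in scores.items()})  # spam ~0.769, ham ~0.231
```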
Mutually exclusive and exhaustive events
Events E₁, E₂, …, Eₙ are
• mutually exclusive (also called pairwise disjoint) if they cannot occur simultaneously:
  Eᵢ ∩ Eⱼ = ∅ for i ≠ j
• exhaustive if they cover the sample space S:
  E₁ ∪ E₂ ∪ ⋯ ∪ Eₙ = S
A collection of events that is both mutually exclusive and exhaustive is also called a partition of the sample space.
Law of total probability
• For mutually exclusive and exhaustive events E₁, E₂, …, Eₙ we get, for any other event B,
  P(B) = Σᵢ P(B ∩ Eᵢ) = Σᵢ P(B|Eᵢ) P(Eᵢ)   (sum over i = 1, …, n)

Example: P(apple) = P(green apple) + P(red apple) + P(other apples)
Bayes’ rule – extended
• Bayes’ rule
  P(A|B) = P(B|A) P(A) / P(B)
• For mutually exclusive and exhaustive events E₁, E₂, …, Eₙ we get
  P(A|B) = P(B|A) P(A) / P(B) = P(B|A) P(A) / Σᵢ P(B|Eᵢ) P(Eᵢ)
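In code, the extended rule amounts to normalizing prior-times-likelihood over the partition. A small sketch (the function name and dictionary representation are illustrative, not from the lecture):

```python
def posteriors(priors, likelihoods):
    """Posterior P(E_i | B) for a partition E_1, ..., E_n.

    priors:      {event: P(E_i)}, summing to 1
    likelihoods: {event: P(B | E_i)}
    The denominator sum_i P(B|E_i) P(E_i) equals P(B) by the law of total probability.
    """
    evidence = sum(likelihoods[e] * priors[e] for e in priors)
    return {e: likelihoods[e] * priors[e] / evidence for e in priors}

# Reusing the spam/ham numbers, now properly normalized:
print(posteriors({"spam": 0.4, "ham": 0.6}, {"spam": 0.05, "ham": 0.01}))
```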
Example: applying Bayes’ rule
• Assume that 15 out of 10,000 individuals in a population have a certain disease D.
• The test is not perfect: when testing for the disease
  • an ill person always tests positive
  • a healthy person tests positive with probability 0.002
• Given that you tested positive, what is the probability that you have the disease?
Example (cont.)
Bayes’ rule:  P(ill|positive) = P(positive|ill) P(ill) / P(positive)
• We have
  • P(ill) = 0.0015 and P(healthy) = 1 − 0.0015 = 0.9985
  • P(positive|ill) = 1,  P(positive|healthy) = 0.002
  • P(positive) = P(positive|ill) P(ill) + P(positive|healthy) P(healthy)
Hence
  P(ill|positive) = P(positive|ill) P(ill) / P(positive) = 1 ⋅ 0.0015 / (1 ⋅ 0.0015 + 0.002 ⋅ 0.9985) ≈ 0.43

Would you call this test good?
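A short script reproducing the calculation (numbers taken from the slide):

```python
# Disease-test example: posterior probability of being ill given a positive test.
p_ill = 15 / 10_000                  # prior P(ill) = 0.0015
p_healthy = 1 - p_ill                # 0.9985
p_pos_given_ill = 1.0                # an ill person always tests positive
p_pos_given_healthy = 0.002          # false-positive rate

# Law of total probability for the denominator P(positive).
p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * p_healthy

print(round(p_pos_given_ill * p_ill / p_pos, 2))   # P(ill | positive) ~ 0.43
```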
Is Steve a librarian or a farmer?

Two options:
• Steve is a librarian, or
• Steve is a farmer

What do you think?

Description of Steve

Study by Kahneman and Tversky

What people in the study answered

Background fact
There were about 20 times as many farmers as librarians in the US at that time.

Did you consider the librarian vs farmer ratio? Most people in the study didn’t.
Background fact
There were about 20 times as many farmers as librarians in Sweden in 2017.

[Bar chart comparing the numbers of librarians, crop farmers, animal farmers and mixed farmers; data source linked on the slide.]
Is Steve a librarian or a farmer?
• Let
• D = description (of Steve)
• L = librarian
• F = farmer
• We would like to know
P(L|D) = ?

Is Steve a librarian or a farmer?
• Bayes’ rule
  P(L|D) = P(L) P(D|L) / P(D)
Estimate the prior P(L) – without using the information/data D

[Icon array visualizing the entire population of librarians and farmers.]

P(L) ≈ 1/21 ≈ 5%
Add the info D – our data

Mark the individuals that match the description D.
Estimate the likelihoods P(D|L) and P(D|F)

Estimates:
P(D|L) = 40%
P(D|F) = 10%
Use Bayes’ rule to compute the posteriors

P(L|D) = 4/(4 + 20) ≈ 17%
P(F|D) = 20/(4 + 20) ≈ 83%

So it is about 5 times more likely that Steve is a farmer.
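The same numbers in code, as a minimal sketch of the full Bayes computation (prior from the 20:1 ratio, likelihoods from the estimates above):

```python
# Librarian-or-farmer with the estimates from the slides.
p_L, p_F = 1 / 21, 20 / 21                 # prior: about 20 farmers per librarian
p_D_given_L, p_D_given_F = 0.40, 0.10      # estimated likelihoods of the description D

evidence = p_D_given_L * p_L + p_D_given_F * p_F          # P(D)
p_L_given_D = p_D_given_L * p_L / evidence
print(round(p_L_given_D, 2), round(1 - p_L_given_D, 2))   # ~0.17 and ~0.83
```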
What we did

[Three panels: the proportions of L and F in the population; restricting to the individuals satisfying D; the proportion of L among those satisfying D.]
General case in one slide

Random variables and
probability distributions

Random variables and probability distributions
• A random variable is a function of the outcomes in a random experiment: X: S → ℝ.
• It assumes values according to a probability distribution: P(a ≤ X ≤ b) = ?
• Discrete r.v.: takes a finite or countable number of values; P(X = a) > 0.
• Continuous r.v.: takes all real values in given intervals; P(X = a) = 0 and
  P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx,  where f is the probability density.
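To make the discrete/continuous distinction concrete, here is a short sketch using scipy.stats (assuming SciPy is installed; the particular distributions are just examples):

```python
from scipy import stats

# Discrete: X ~ Bin(n=10, p=0.5). Individual values have positive probability.
X = stats.binom(n=10, p=0.5)
print(X.pmf(3))               # P(X = 3) > 0
print(X.cdf(6) - X.cdf(2))    # P(3 <= X <= 6), a finite sum of pmf values

# Continuous: Y ~ N(0, 1). Single points have probability zero; intervals do not.
Y = stats.norm(loc=0, scale=1)
print(Y.cdf(1) - Y.cdf(-1))   # P(-1 <= Y <= 1) = integral of the density, ~0.68
print(Y.pdf(1))               # density at a point, not a probability; P(Y = 1) = 0
```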
Probability distributions
• Typically depend on one or more parameters
• Common discrete distributions
• Uniform: U(a, b)
• Binomial: Bin(n, p)
• Geometric: Geo(p)
• Hypergeometric: HGeo(N, K, n)
• Poisson: Poi(λ)
• Negative binomial: NB(r, p)
Probability distributions
• Common continuous distributions
• Uniform: U[a, b]
• Normal (Gaussian): N(μ, σ²)
• Student’s t: tₙ₋₁
• Exponential: Exp(λ)
• Chi-square: χ²ₙ₋₁
• Beta: Beta(α, β)
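These named families map directly onto scipy.stats objects, though SciPy's parameterizations do not always match the textbook ones (e.g. Exp(λ) uses scale = 1/λ). A small sketch, assuming SciPy is available:

```python
from scipy import stats

# A few of the distributions above, sampled with fixed seeds for reproducibility.
samples = {
    "Bin(10, 0.3)": stats.binom(n=10, p=0.3).rvs(size=10_000, random_state=0),
    "Poi(2)":       stats.poisson(mu=2).rvs(size=10_000, random_state=0),
    "N(5, 2^2)":    stats.norm(loc=5, scale=2).rvs(size=10_000, random_state=0),
    "Exp(0.5)":     stats.expon(scale=1 / 0.5).rvs(size=10_000, random_state=0),
    "Beta(2, 5)":   stats.beta(a=2, b=5).rvs(size=10_000, random_state=0),
}
for name, x in samples.items():
    print(f"{name:12s} sample mean = {x.mean():.2f}")   # close to each theoretical mean
```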
Expected Value and Variance
Two key characteristics of a random variable

• Expected value:
• Mean value of random variable
• Variance:
• Measure of how far, on average, the random variable is from its mean.
Expected Value and Variance
• For a discrete random variable X the expected value is the weighted average of the possible outcomes:
  μ = E[X] = Σᵢ xᵢ · P(X = xᵢ)
• Intuitively, it measures the value you can expect to get on average in some random experiment.
  • E.g. rolling a fair die once: 1/6 + 2/6 + 3/6 + 4/6 + 5/6 + 6/6 = 21/6 = 3.5.
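A one-line check of the weighted-average formula for the fair die (exact arithmetic via fractions):

```python
from fractions import Fraction

# E[X] = sum_i x_i * P(X = x_i) for a fair die, where each P(X = x_i) = 1/6.
E = sum(x * Fraction(1, 6) for x in range(1, 7))
print(E, float(E))   # 7/2 3.5
```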
Expected Value and Variance
• For a discrete random variable X the variance is the weighted average of the squared distance to the mean:
  Var[X] = Σᵢ (xᵢ − μ)² · P(X = xᵢ)
• Intuitively, it measures how spread out the values of the random variable are.
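And the same kind of check for the variance of the fair die, plus a quick Monte Carlo comparison (the sample size is arbitrary):

```python
from fractions import Fraction
import random

# Var[X] = sum_i (x_i - mu)^2 * P(X = x_i) for a fair die.
mu = Fraction(7, 2)
var = sum((x - mu) ** 2 * Fraction(1, 6) for x in range(1, 7))
print(var, float(var))                                   # 35/12 ~ 2.92

# Monte Carlo sanity check: the empirical variance of many rolls is close to 35/12.
rolls = [random.randint(1, 6) for _ in range(100_000)]
m = sum(rolls) / len(rolls)
print(sum((r - m) ** 2 for r in rolls) / len(rolls))     # ~ 2.92
```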
Statistical inference
Estimation and analysis of these parameters in random samples to draw conclusions about the underlying population.

Two main paradigms:
• Frequentism
• Bayesianism
Classical or frequentist probability theory:
• Probabilities are relative frequencies of the event in a large number of trials.

Bayesian probability theory:
• Probabilities are reasonable expectations of an event, quantifying personal beliefs and prior information, including the degree of certainty in those beliefs.
Frequentism versus Bayesianism

Frequentism
+ Objective
+ Trade-off between errors
+ Design controls bias
+ Long, prosperous history
− p-value depends on design
− Ad-hoc notions of ”data more extreme”
− Fully specified designs needed ahead of time

Bayesianism
+ More natural
+ Logically rigorous
+ Can explore different priors
+ Data can be added
− Prior is subjective
− Assigning probabilities to hypotheses
