CSE303
Lecture 4: Different Data Distributions
DATA DISTRIBUTION
NORMAL DISTRIBUTION
• In statistics, a normal distribution or Gaussian distribution is a type of continuous
probability distribution for a real-valued random variable. The general form of
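its probability density function is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$, where μ is the mean and σ is the standard deviation.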
NORMAL DISTRIBUTION
• In probability theory, the normal (or Gaussian, Gauss, or Laplace-Gauss) distribution is a very common continuous probability distribution
• The probability density of the normal distribution is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^{2}/(2\sigma^{2})}$
The Normal Distribution has:
• mean = median = mode
• symmetry about the center
• 50% of values less than the mean and 50% greater than the mean
PROPERTIES OF NORMAL DISTRIBUTION
EXAMPLE 1
• 95% of students at school are between 1.1 m and 1.7 m tall. Assuming this data is
normally distributed, can you calculate the mean and standard deviation?
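One way to work Example 1 is sketched below. It assumes the usual rule of thumb that about 95% of a normal distribution lies within 2 standard deviations of the mean (the exact multiplier is closer to 1.96); the variable names are illustrative only.

```python
# Assumption: ~95% of values lie within 2 standard deviations of the mean,
# so the interval [1.1, 1.7] spans 4 standard deviations in total.
low, high = 1.1, 1.7  # heights in metres, taken from the example

mean = (low + high) / 2      # the interval is symmetric about the mean
std_dev = (high - low) / 4   # 4 standard deviations cover the interval

print(f"mean = {mean:.2f} m, standard deviation = {std_dev:.2f} m")
```

Under that assumption the mean comes out at about 1.4 m and the standard deviation at about 0.15 m.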
STANDARD SCORE OR “Z-SCORE”
• The number of standard deviations from the mean is also called the "standard
score", "sigma", or "z-score"
• Example 2: In that same school, one of your friends is 1.85 m tall. Find his z-score.
• z-score (for one sample) = (x – μ) / σ = (1.85 – 1.4) / 0.15 = 3.0
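The z-score calculation in Example 2 can be sketched in a few lines. The helper name `z_score` is hypothetical; the mean (1.4 m) and standard deviation (0.15 m) are the values used in the example.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev

# Friend's height from Example 2, with the school's mean and standard deviation.
z = z_score(1.85, mean=1.4, std_dev=0.15)
print(f"z = {z:.1f}")
```

Note the parentheses around `x - mean`: the subtraction must happen before the division.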
WHY DO WE NEED Z-SCORE?
• Example 4: Professor Willoughby is marking a test. Here are the students' results (out of 60
points):
20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17
Most students didn't even get 30 out of 60, and most will fail.
• The professor decides to standardize all the scores and fail only students who are more than 1 standard
deviation below the mean.
• The Mean is 23, and the Standard Deviation is 6.6, and these are the Standard Scores:
-0.45, -1.21, 0.45, 1.36, -0.76, 0.76, 1.82, -1.36, 0.45, -0.15, -0.91
• Now only 2 students will fail (the ones who scored 15 and 14 on the test)
• Much fairer!
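The standardization in this example can be reproduced with the sketch below. It assumes the population standard deviation (dividing by n), which is what matches the 6.6 quoted above. Because the code divides by the unrounded standard deviation, an individual z-score may differ in the last digit from the listed values.

```python
from statistics import mean, pstdev

scores = [20, 15, 26, 32, 18, 28, 35, 14, 26, 22, 17]

mu = mean(scores)       # 23
sigma = pstdev(scores)  # population standard deviation, about 6.6

# Standard score of each student: distance from the mean in units of sigma.
z_scores = [(s - mu) / sigma for s in scores]

# Fail only the students more than 1 standard deviation below the mean.
failing = [s for s, z in zip(scores, z_scores) if z < -1]

print(f"mean = {mu}, standard deviation = {sigma:.1f}")
print("z-scores:", [round(z, 2) for z in z_scores])
print("failing scores:", failing)
```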
STANDARD NORMAL DISTRIBUTION
ANOTHER EXAMPLE
• Your score in a recent test was 0.5 standard deviations above the average. What
percentage of people scored lower than you did?
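Assuming the test scores are normally distributed, the answer is the area under the standard normal curve to the left of z = 0.5. A minimal sketch using Python's standard library (`statistics.NormalDist`, available from Python 3.8) in place of a printed z-table:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Cumulative probability up to z = 0.5, i.e. the fraction scoring lower.
fraction_below = standard_normal.cdf(0.5)

print(f"{fraction_below:.2%} of people scored lower")
```

This gives roughly 69%: the 50% below the mean plus the share between the mean and 0.5 standard deviations above it.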
NORMAL DISTRIBUTIONS
SKEWNESS
• Skewness is the degree of distortion from the symmetrical bell curve of the normal
distribution. It measures the lack of symmetry in a data distribution.
• It compares the extreme values in one tail with those in the other. A symmetrical
distribution will have a skewness of 0.
KURTOSIS
• Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It
describes how heavy the tails are, that is, how much of the data lies at extreme values in
either tail. It is effectively a measure of the outliers present in the distribution.
FORMULA FOR SKEWNESS AND KURTOSIS
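The formulas on this slide did not survive extraction. The sketch below therefore assumes one common moment-based definition: skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where mk is the k-th central moment computed with a divisor of n. The slide may have used a different variant, for example sample-corrected or excess kurtosis. The function names are illustrative.

```python
def central_moment(data, k):
    """k-th central moment of the data, dividing by n (population form)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** k for x in data) / n

def skewness(data):
    """Moment-based skewness: third central moment over variance^1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Moment-based (non-excess) kurtosis: fourth central moment over variance^2."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

symmetric = [1, 2, 3, 4, 5]      # symmetric about 3
right_tailed = [1, 2, 3, 4, 15]  # one extreme value in the right tail

print("skewness:", skewness(symmetric), skewness(right_tailed))
print("kurtosis:", kurtosis(symmetric), kurtosis(right_tailed))
```

With this definition a symmetric data set has skewness 0, a long right tail gives positive skewness, and a normal distribution has kurtosis 3.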
BINOMIAL DISTRIBUTION
• A binomial distribution can be thought of as the probability of a given number of SUCCESS
outcomes in an experiment or survey that is repeated multiple times.
• Each trial has exactly two possible outcomes (the prefix “bi” means two, or twice). For
example, a coin toss has only two possible outcomes, heads or tails, and taking a test
could have two possible outcomes, pass or fail.
• Binomial Distribution Function: $b(x; n, p) = \binom{n}{x}\, p^{x} (1 - p)^{n - x}$
• Mean = n * p
• Variance = n * p * (1 – p)
PRACTICE PROBLEMS
• A coin is tossed 10 times. What is the probability of getting exactly 6 heads?
• 60% of people who purchase sports cars are men. If 10 sports car owners are
randomly selected, find the probability that exactly 7 are men.
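Both practice problems apply the binomial distribution function directly. A minimal sketch, assuming a fair coin (p = 0.5) for the first problem; the helper name `binomial_pmf` is hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Problem 1: exactly 6 heads in 10 tosses (assumes a fair coin, p = 0.5).
p_heads = binomial_pmf(6, 10, 0.5)

# Problem 2: exactly 7 men among 10 sports car owners, with p = 0.6.
p_men = binomial_pmf(7, 10, 0.6)

print(f"P(6 heads in 10 tosses) = {p_heads:.4f}")
print(f"P(7 men in 10 owners)   = {p_men:.4f}")
```

The two results are roughly 0.205 and 0.215 respectively.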
POISSON DISTRIBUTION
• A Poisson distribution is a tool for predicting the probability of events occurring
when you know how often the event occurs on average. It gives us the
probability of a given number of events happening in a fixed interval of time.
• Poisson Distribution Function: $P(x; \mu) = \frac{e^{-\mu}\, \mu^{x}}{x!}$, where μ is the average number of events per interval
PRACTICE PROBLEMS
• The average number of major storms in your city is 2 per year. What is the probability
that exactly 3 storms will hit your city next year?
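The storm problem plugs μ = 2 and x = 3 into the Poisson distribution function. A short sketch, with `poisson_pmf` as a hypothetical helper name:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x events when the average per interval is mu."""
    return exp(-mu) * mu**x / factorial(x)

# Average of 2 major storms per year; probability of exactly 3 next year.
p_three = poisson_pmf(3, 2)

print(f"P(exactly 3 storms) = {p_three:.4f}")
```

The result is about 0.18, so there is roughly an 18% chance of exactly 3 storms.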
CORRELATION ANALYSIS (NOMINAL DATA)
• χ² (chi-square) test
$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
• The number of hospitals and the number of car thefts in a city are correlated
• Both are causally linked to a third variable: population
CHI-SQUARE CALCULATION: AN EXAMPLE
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500
• χ² (chi-square) calculation (numbers in parentheses are
expected counts calculated based on the data distribution
in the two categories)
$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$
• It shows that like_science_fiction and play_chess are
correlated in the group
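The expected counts and the χ² value for this example can be checked with a short sketch. It assumes each expected count is (row sum × column sum) / grand total, which reproduces the bracketed values in the table; the variable names are illustrative.

```python
# Observed counts: rows = like / not like science fiction,
# columns = play chess / not play chess.
observed = [[250, 200],
            [50, 1000]]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected count for each cell under independence of the two attributes.
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Sum of (observed - expected)^2 / expected over all cells.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print("expected counts:", expected)
print(f"chi-square = {chi2:.2f}")
```

The computed statistic should land very close to the 507.93 quoted in the example; any difference in the last digit comes from rounding.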
PRACTICE PROBLEM
• Let's say you want to know if gender has anything to do with political party preference. You poll
440 voters in a simple random sample to find out which political party they prefer. The results
of the survey are shown in the table below:
CHI-SQUARE TABLE
THANK YOU