
Week 4


CSE315: Introduction to Data Science
WEEK-4
Data Spread
• Variation - Variation is a measure of how far the data in a data set lie from
their average value. The range shows the extent of the data, but it is
only the difference between the maximum and minimum values, so the range
alone does not give the whole picture of the data. How far from or near to
the average value the data lie is measured by the variance.
• Mathematical equations for determining variance in terms of population
and sample data
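The two variance formulas can be written as below. This is a sketch using common notation that the slides do not define: x_i for the observations, n for their number, \mu for the population mean and \bar{x} for the sample mean.

```latex
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
\qquad\text{(population variance)}

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\qquad\text{(sample variance)}
```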
Variance
• Suppose the height of 5 dogs of different breeds is 60 cm, 47 cm, 17 cm,
43 cm and 30 cm respectively, so the average height of dogs is 39.4 cm.

• Determining the population variance of the height of different breeds of
dogs…

• The population variance is 217.04 cm², which is the average squared
deviation of the heights from their mean. Because it is expressed in squared
units, it is not a distance in cm.
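The calculation for the dog heights can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative and not from the slides, and the sample variance is shown only for comparison.

```python
import statistics

# Heights of the five dogs from the example, in centimetres.
heights_cm = [60, 47, 17, 43, 30]

n = len(heights_cm)
mean_cm = sum(heights_cm) / n

# Squared deviation of each height from the mean.
squared_devs = [(h - mean_cm) ** 2 for h in heights_cm]

# Population variance divides by n; sample variance divides by n - 1.
pop_variance = sum(squared_devs) / n
sample_variance = sum(squared_devs) / (n - 1)

print(round(mean_cm, 2))          # 39.4
print(round(pop_variance, 2))     # 217.04
print(round(sample_variance, 2))  # 271.3

# Cross-check against the standard library.
print(round(statistics.pvariance(heights_cm), 2))  # 217.04
```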
Standard Deviation
• The value of the variance depends on the unit of measurement: it would be
much smaller if the height of the dogs were measured in feet. It is also
expressed in squared units, so it is often difficult to interpret directly.
That is why we use another measure called the standard deviation.
• Standard Deviation - The standard deviation is the square root of the
variance.
What do I understand by standard deviation?
• We saw earlier that the average height of the 5 different breeds of dogs was
39.4 cm; their standard deviation is √217.04 ≈ 14.73 cm. This means that a
typical dog's height lies within 39.4 cm ± 14.73 cm, i.e. between 24.67 cm
and 54.13 cm. In this data set, 3 of the 5 dogs (47 cm, 43 cm and 30 cm)
fall in that range.

• If the standard deviation of a data set is small, most of the values are near
the center, i.e. the variation is low. A large standard deviation means more
variation in the data.
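The interval of one standard deviation around the mean can be checked for the dog heights as follows. This is a small sketch using the population standard deviation; the variable names are illustrative.

```python
import math

# Heights of the five dogs from the example, in centimetres.
heights_cm = [60, 47, 17, 43, 30]

n = len(heights_cm)
mean_cm = sum(heights_cm) / n
pop_variance = sum((h - mean_cm) ** 2 for h in heights_cm) / n

# Standard deviation is the square root of the variance (back in cm).
pop_std = math.sqrt(pop_variance)

lower = mean_cm - pop_std
upper = mean_cm + pop_std

# Heights that lie within one standard deviation of the mean.
within_one_std = [h for h in heights_cm if lower <= h <= upper]

print(round(pop_std, 2))                 # 14.73
print(round(lower, 2), round(upper, 2))  # 24.67 54.13
print(within_one_std)                    # [47, 43, 30]
```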
Meaning from curve
• Suppose the curve with the lower standard deviation is the distribution of
the runs scored by batsman "A" and the curve with the higher standard
deviation is the distribution of the runs scored by batsman "B".
• From this it is clear that batsman "A" is more reliable: his score is close
to his average in most matches. Batsman "B", on the other hand, is
unpredictable: in some matches he has scored a lot of runs and in
some matches he has scored very few.
Covariance and Correlation
• Suppose you want to know whether ice cream sales have any
relation to the daily temperature.
• To find out, you collect a few days' worth of ice cream sales data
from a store next to your home, together with the temperature on
those days.
• A scatter plot of ice cream sales against temperature clearly
shows that as the temperature rises, so do ice cream sales.
When the temperature goes down again, ice cream sales
also go down.
Ice Cream Sales Data
Covariance
• Covariance - Covariance is a measure of the simultaneous change of
two variables. It shows how much two variables change together, that is,
how far a change in one variable is accompanied by a change in the other.
• To determine the population covariance, subtract each variable's mean
from its values, multiply the paired differences, add up the products and
divide by the total number of observations n. For the sample covariance,
divide by n-1 instead.
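In symbols, the two covariance formulas look like this. It is a sketch in assumed notation that the slides do not define: x_i and y_i are the paired observations, \mu_x and \mu_y the population means, and \bar{x} and \bar{y} the sample means.

```latex
\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)
\qquad\text{(population covariance)}

s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
\qquad\text{(sample covariance)}
```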
Covariance
• We will apply the formula of sample covariance to determine the
covariance of temperature and quantity of ice cream sales in our ice
cream sales sample data.
• Substituting the values into the formula, we find that the covariance of
ice cream sales volume and temperature is 484.09.
• Covariance has no fixed scale: its value changes with the units and
magnitude of the variables.
• So it is often difficult to draw a conclusion from covariance alone.
• So we'll get to know another measurement, that is correlation.
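The dependence of covariance on the scale of the data can be seen in a short sketch. The numbers below are hypothetical and are NOT the ice cream table from the slides, so the result differs from 484.09; the function name `sample_covariance` is also just illustrative.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of paired deviation products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data (not from the slides): temperature in Celsius, units sold.
temps_c = [20, 22, 25, 28, 30]
sales_units = [210, 240, 290, 370, 390]

cov_units = sample_covariance(temps_c, sales_units)

# The same sales expressed in hundreds of units: same relationship,
# but the covariance shrinks by a factor of 100.
sales_hundreds = [s / 100 for s in sales_units]
cov_hundreds = sample_covariance(temps_c, sales_hundreds)

print(cov_units)               # 322.5
print(round(cov_hundreds, 3))  # 3.225
```

The relationship between the two variables is identical in both cases, yet the covariance differs by a factor of 100, which is why the raw value is hard to interpret.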
Correlation
• Correlation - Correlation is a measure of the relationship between two
variables. That is, when the value of one variable increases or decreases,
correlation expresses how consistently the value of the other variable
increases or decreases with it.
• The value of correlation is scaled, ranging from -1 to 1. The closer the
value is to 1 or -1, the stronger the relationship between the two variables;
the closer it is to 0, the weaker the relationship. A positive value means
the variables move in the same direction, a negative value means they move
in opposite directions.

• The Pearson correlation coefficient, expressed by r, is the covariance of the
two variables divided by the product of their standard deviations.
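The Pearson correlation coefficient r can be written out as below. This is a sketch in assumed notation that the slides do not define: x_i and y_i are the paired observations and \bar{x}, \bar{y} their means. The factors 1/(n-1) from the covariance and the standard deviations cancel.

```latex
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
```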


Pearson correlation
• According to Pearson's method, the correlation coefficient between ice
cream sales and temperature is 0.95, which indicates a very strong positive
correlation.
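A short sketch shows how Pearson's r can be computed and that, unlike covariance, it does not change when the units change. The data are hypothetical and NOT the ice cream table from the slides, so r here is not 0.95; the function name `pearson_r` is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation sum over the root of the squared-deviation sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical data (not from the slides): temperature in Celsius, units sold.
temps_c = [20, 22, 25, 28, 30]
sales_units = [210, 240, 290, 370, 390]

r_celsius = pearson_r(temps_c, sales_units)

# Convert the temperatures to Fahrenheit: the units change, r does not.
temps_f = [c * 9 / 5 + 32 for c in temps_c]
r_fahrenheit = pearson_r(temps_f, sales_units)

print(round(r_celsius, 2))
print(round(r_fahrenheit, 2))
```

Both printed values are the same and close to 1, reflecting a strong positive relationship in this made-up data.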
Normal Distribution

• Real-life data can take many shapes. However, the shape of the data
often looks a lot like a bell.
• If the plot of a data distribution looks like a bell, the distribution is
called a normal distribution.
• In a perfect normal distribution, the mean, mode and median are
exactly in the middle and are all equal. The distribution has 50% of the
data on each side of the center, i.e. the data distribution is symmetric.
Examples of Normal Distribution
Some real-life quantities that are often approximately normally distributed:
• The weight of newborn babies
• Income distribution (only roughly; real income data usually has a long right tail)
• Human height
• IQ
• Blood pressure
• Exam results
• The amount of error of any measurement
Empirical law of distributions
A useful property of the normal distribution is the empirical rule (also
called the 68-95-99.7 rule):
• About 68% of the data in a normal distribution lie within 1 standard deviation of the mean,
• about 95% of the data lie within 2 standard deviations, and
• about 99.7% of the data lie within 3 standard deviations.
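The three percentages can be checked numerically with the standard library's `statistics.NormalDist` (Python 3.8+). This is a minimal sketch; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```

The result is the same for any mean and standard deviation, because the rule is stated in units of standard deviations.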
Skewness
• Normally distributed data is symmetrical, meaning that there is an equal
amount of data on both sides of the center. When the bulk of a curve shifts
to the right or left it is no longer symmetric; such a curve is called
asymmetric.
• Skewness is a measure of the degree of asymmetry of a distribution curve.
If the bulk of the curve sits to the right and the long tail extends to the
left, it is called a negatively skewed curve, whereas if the bulk sits to the
left and the long tail extends to the right, it is called a positively skewed
curve.

Skewness        Type
Skewness < 0    Negatively skewed
Skewness = 0    Symmetric (no skew)
Skewness > 0    Positively skewed
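The sign convention in the table can be illustrated with a small sketch. It assumes the moment-based definition of skewness (third central moment divided by the standard deviation cubed), which the slides do not state explicitly. The two tailed samples are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Faces of a fair die: perfectly symmetric around 3.5.
die_faces = [1, 2, 3, 4, 5, 6]

# Hypothetical sample with one large value: long tail to the right.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]

# Mirror image of the same sample: long tail to the left.
left_tailed = [-v for v in right_tailed]

print(skewness(die_faces))         # 0.0
print(skewness(right_tailed) > 0)  # True
print(skewness(left_tailed) < 0)   # True
```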
Skewness Example
Negative Skewness (Long Tail to the Left):
• Test Scores: In a test that most students find manageable, scores cluster
toward the high end, and a smaller number of students score much lower. This
creates a negative skew, with a tail extending towards lower scores.
• Number of Customers Entering a Store: On a typical day, a store might see most
of its customers during peak hours and fewer throughout the rest of the day. If
the peak falls late in the day, the distribution of arrival times is negatively
skewed, with a tail towards the earlier hours.
Skewness Example
No Skew (Symmetrical Distribution):
• Dice Rolls: When you roll a fair die, each number (1 to 6) has an equal probability
of landing. This results in a perfectly symmetrical distribution with no skew.
• Heights of Adults: While individual heights can vary, the overall distribution of
adult heights in a population often approximates a normal curve with no
significant skew.
Skewness Example
Positive Skewness (Long Tail to the Right):
• Household Income: This is a classic example of positive skewness. Most people
earn a modest income, while a few individuals have significantly higher incomes
(millionaires and billionaires). This creates a long tail on the right side of the
distribution.
• Waiting Times at a Coffee Shop: The time it takes to get your coffee at a busy
shop can also be positively skewed. While most customers wait a few minutes,
some might encounter longer wait times due to complex orders, unexpected
delays, etc.
Kurtosis
• Even if a distribution is symmetrical, it can be sharply peaked or
rather flat.
• Kurtosis is a measure of how pointed or blunt the shape of a
distribution curve is, and of how heavy its tails are.
• If a symmetric distribution curve is pointed, it is called
leptokurtic, and if the curve is blunt in shape, it is called
platykurtic. A curve with the shape of the normal distribution is called
mesokurtic.

Excess kurtosis (kurtosis - 3)    Type
Excess kurtosis < 0               Platykurtic
Excess kurtosis = 0               Mesokurtic
Excess kurtosis > 0               Leptokurtic
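The three types can be told apart with a short sketch. It assumes the moment-based definition (fourth central moment divided by the squared variance) and subtracts 3 so that a normal distribution scores 0. The heavy-tailed sample is hypothetical, and the near-normal sample is built from evenly spaced quantiles of a standard normal distribution, so its result is only approximately 0.

```python
from statistics import NormalDist

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives about 0."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Faces of a fair die: flat shape, no tails (platykurtic).
die_faces = [1, 2, 3, 4, 5, 6]

# Hypothetical sample: most values at the center, a few extremes (leptokurtic).
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]

# Deterministic near-normal sample from evenly spaced quantiles (mesokurtic).
n_points = 10_000
near_normal = [NormalDist().inv_cdf((i + 0.5) / n_points) for i in range(n_points)]

print(round(excess_kurtosis(die_faces), 2))     # -1.27
print(round(excess_kurtosis(heavy_tailed), 2))  # 1.9
print(abs(excess_kurtosis(near_normal)) < 0.1)  # True
```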
Kurtosis Example
• Normal Distribution (Mesokurtic): Imagine a perfect bell curve, like the one
representing heights of adults. This is a mesokurtic distribution, with a kurtosis
value of 3 (an excess kurtosis of 0).

• The tails gradually taper off without any extreme outliers, and most of the data
points cluster around the central peak.
Kurtosis Example
Leptokurtic Distribution (Heavy Tails): Think of exam scores in a really tough class.
This could be a leptokurtic distribution, with kurtosis exceeding 3 (excess
kurtosis above 0).

The tails are much thicker and extend further out compared to a normal
distribution. This means there are more extreme values (both very high and very
low scores) than we'd expect in a normal curve.
Kurtosis Example
Platykurtic Distribution (Light Tails): Picture the waiting times at a bank with
multiple tellers. This might be a platykurtic distribution, with kurtosis less than 3
(excess kurtosis below 0).

The tails are thinner than in a normal distribution and the peak is flatter. This
indicates fewer outliers, meaning most customers wait for a moderate amount of
time and very long or very short waits are rare.
