
INDEX

S. No Topic Page No
Week 1
1 Introduction to data analytics 1
2 Python Fundamentals - I 33
3 Python Fundamentals - II 54
4 Central Tendency and Dispersion - I 83
5 Central Tendency and Dispersion - II 108
Week 2
6 Introduction to Probability- I 127
7 Introduction to Probability- II 155
8 Probability Distributions - I 177
9 Probability Distributions - II 198
10 Probability Distributions - III 225
Week 3
11 Python Demo for Distributions 246
12 Sampling and Sampling Distribution 256
13 Distribution of Sample Means, Population, and Variance 287
14 Confidence interval estimation: Single population - I 304
15 Confidence interval estimation: Single population - II 324
Week 4
16 Hypothesis Testing- I 342
17 Hypothesis Testing- II 364
18 Hypothesis Testing- III 380
19 Errors in Hypothesis Testing 394
20 Hypothesis Testing: Two sample test- I 422
Week 5
21 Hypothesis Testing: Two sample test- II 442
22 Hypothesis Testing: Two sample test- III 464
23 ANOVA - I 480
24 ANOVA - II 494
25 Post Hoc Analysis (Tukey’s Test) 513
Week 6
26 Randomized Block Design (RBD) 542
27 Two Way ANOVA 563
28 Linear Regression - I 583
29 Linear Regression - II 601
30 Linear Regression - III 614
Week 7
31 Estimation, Prediction of Regression Model Residual Analysis 634
32 Estimation, Prediction of Regression Model Residual Analysis - II 652
33 Multiple Regression Model - I 674
34 Multiple Regression Model-II 695
35 Categorical variable regression 714
Week 8
36 Maximum Likelihood Estimation- I 744
37 Maximum Likelihood Estimation-II 761
38 Logistic Regression- I 785
39 Logistic Regression-II 802
40 Linear Regression Model Vs Logistic Regression Model 818
Week 9
41 Confusion matrix and ROC- I 838
42 Confusion Matrix and ROC-II 860
43 Performance of Logistic Model-III 883
44 Regression Analysis Model Building - I 895
45 Regression Analysis Model Building (Interaction)- II 910
Week 10
46 Chi - Square Test of Independence - I 928
47 Chi-Square Test of Independence - II 949
48 Chi-Square Goodness of Fit Test 971
49 Cluster analysis: Introduction- I 990
50 Clustering analysis: part II 1009
Week 11
51 Clustering analysis: Part III 1026
52 Cluster analysis: Part IV 1046
53 Cluster analysis: Part V 1068
54 K-Means Clustering 1083
55 Hierarchical method of clustering -I 1109
Week 12
56 Hierarchical method of clustering- II 1134
57 Classification and Regression Trees (CART : I) 1162
58 Measures of attribute selection 1187
59 Attribute selection Measures in CART : II 1206
60 Classification and Regression Trees (CART) - III 1224
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee

Lecture No 1
Introduction to Data Analytics

Welcome, students, to this course on data analytics with Python. Today is the introductory class; this lecture is an introduction to data analytics.
(Refer Slide Time: 00:34)

The objective of this course is to build conceptual understanding using simple and practical examples, rather than a repetitive, point-and-click mentality. Most students who use software for data analytics just want to click and get the result; they do not bother about what exactly is happening inside the software. This course should make you comfortable using analytics in your career and your life.

You will learn how to work with real data. You may have learnt many different methodologies, but choosing the right methodology is what matters; this course will help you choose the right data-analytical tools.
(Refer Slide Time: 01:17)

The objective of the course can be seen in this picture: a person is using a ladder without knowing how to use it for the purpose it is intended. In the same way, the danger in using quantitative methods does not generally lie in the inability to perform the calculations, because computer technology has developed so far.

Many packages are available for doing data analytics. The real threat is a lack of fundamental understanding of why to use a particular technique or procedure, how to use it correctly, and how to interpret the result correctly. This course will focus on how to choose the right technique, how to use it correctly, and how to interpret the result.
(Refer Slide Time: 02:01)

So what are the learning objectives of this class? After completing this lecture, you will be able to define what data is and its importance; define what data analytics is and its types; and explain why analytics is so important in today's business environment. Then we will see how statistics, analytics, and data science are interrelated; there seems to be some overlap among them, and we will clarify what the differences are and how they are interrelated.

In this course we are going to use a package called Python. I will explain how and why we use Python in this course. At the end of this session we will explain the four important levels of data: nominal, ordinal, interval, and ratio. Now we will go to the content.
(Refer Slide Time: 02:54)

We will define data and its importance. There are three terms: variable, measurement, and data. Next we will see what is generating so much data, then how data adds value to the business, and then why data is important.
(Refer Slide Time: 03:11)

Variable, measurement, and data are terms we are going to use frequently in this course. So what is a variable? A variable is a characteristic of any entity being studied that is capable of taking on different values. For example, X is a variable: it can take any value; it may be 1, it may be 2, it may be 0, and so on. A measurement occurs when a standard process is used to assign numbers to particular attributes or characteristics of a variable.

For that X, you want to substitute some value; to get that value, you have to measure the characteristic of the variable. That is nothing but measurement. Then what is data? Data are recorded measurements. There is a variable, you measure the phenomenon, and after measuring you substitute some value for the variable; the particular value the variable takes is nothing but your data.

So X is the variable and, for example, the number 5 is the data. How you arrived at that 5 is the measurement. Then, what is generating so much data?
(Refer Slide Time: 04:33)

Data can be generated in different ways: by humans, by machines, and by human–machine combinations. For example, nowadays everybody has a Facebook account, a LinkedIn account, and accounts on various social networking sites. The availability of data is no longer the problem. Data can be generated anywhere information is generated, and stored in structured or unstructured format.
(Refer Slide Time: 05:06)

So how does data add value to the business? Assume that the data, after being collected from various sources, is stored in the form of a data warehouse. From the data warehouse, the data can be used for the development of a data product. Here we are using the term data product; in the coming slides I will explain exactly what a data product is, with some examples. The same data, if you look at the right-hand side of the slide, can also be used to get more insights from the data.

What do we mean by a data product? For example, algorithmic solutions in production, marketing, and sales. A recommendation engine is one example of a data product. Suppose you go to Flipkart or Amazon to buy a particular product; the software itself will recommend to you the next possible product that you could buy. That is nothing but a recommendation engine.

Even if you watch YouTube videos on a particular topic, YouTube itself will suggest what other relevant videos are available. That is a recommendation engine, one example of a data product. With the help of data, you can build a data product or you can get insight from the data; either way, that adds business value for you.
(Refer Slide Time: 06:27)

This is an example of a data product: the driverless car, the Google car. The whole concept of the Google car works with the help of data; it detects everything else required for driving the car. The next example is the recommendation engine: as I told you previously, when you buy any product, the system suggests other products that can be purchased along with it.

Another very common example of a data product comes from Google. Google has a lot of applications, and one example of a data product among them is Google Maps. Google Maps helps you find the right route, which road has traffic, and which road has a toll booth; this kind of information we can get from Google Maps. So Google Maps is one example of a data product.
(Refer Slide Time: 07:20)

Now, why is data important? Data helps in making better decisions. Data helps in solving problems by finding the reasons for underperformance: suppose a company is not performing properly; by collecting data we can identify the reason for this underperformance. Data helps one evaluate current performance, and data can also be used for benchmarking the performance of your business organization.

After benchmarking, data helps one improve performance as well. Data can also help one understand consumers and markets, especially in the marketing context: you can understand who the right consumers are and what kind of preferences they have in the market.
(Refer Slide Time: 08:16)

Next we will define what data analytics is and its types. In the coming two or three slides we will define data analytics, see why analytics is important, define data analysis, see how data analytics differs from data analysis, and at the end see the types of data analytics.
(Refer Slide Time: 08:40)

We define data analytics as the scientific process of transforming data into insights for making better decisions. Even without data, even without doing analytics, you can make decisions, but you cannot make better decisions. By virtue of your experience and intuition you can take decisions, and sometimes they may even be correct.

But if you make the decision with the help of data, that will enable you to make the better decision. Professor James Evans has defined data analytics this way: it is the use of data, information technology, statistical analysis, quantitative methods, and mathematical or computer-based models to help managers gain improved insight about their business operations and make better, fact-based decisions.

You see that many terms appear here: one is IT, the next is statistical analysis, the next is quantitative methods, then mathematical and computer-based models. We will see how these are interrelated in coming slides. Generally, among students, there is confusion about whether analysis and analytics are the same or different.
(Refer Slide Time: 10:13)

Why is analytics important? Opportunity abounds for the use of analytics and big data, such as: determining credit risk; developing new medicines, especially in healthcare, where healthcare analytics is an emerging field that helps identify the correct medicines; and finding more efficient ways to deliver products and services. For example, in the banking context, data analytics is used for preventing fraud and uncovering cyber threats.

With the help of data analytics you can find possible cyber crimes, detect them, and prevent them. Data analytics is also important for retaining the most valuable customers: we can identify who the valuable and non-valuable customers are, so we can focus more on our valuable customers.

(Refer Slide Time: 11:08)

Now, what is data analysis? It is the process of examining, transforming, and arranging raw data in a specific way to generate useful information from it. Data analysis allows for the evaluation of data through analytical and logical reasoning, leading to some sort of outcome or conclusion in some context. Data analysis is a multi-faceted process that involves a number of steps, approaches, and diverse techniques. We will see that in a coming lecture.
(Refer Slide Time: 11:41)

Now we will see what data analysis is versus data analytics. Data analysis is about what has happened in the past: we explain what has happened, how it has happened, and why it has happened. Data analysis is, in effect, a kind of post-mortem study of what has happened in the past.
(Refer Slide Time: 12:13)

On the contrary, analytics is the study of what will happen in the future. With the help of analytics, we can predict and explore potential future events.
(Refer Slide Time: 12:25)

Analytics may be qualitative or quantitative. In qualitative analytics, decisions are based mostly on intuition. In quantitative analytics, we make decisions with the help of formulas and algorithms.
(Refer Slide Time: 12:44)

In data analysis, too, we can go qualitative or quantitative. Qualitatively, we can explain how and why a story ended the way it did. Quantitatively, we can say, for example, how much sales decreased last summer. As I keep repeating, analysis is the study of what has happened in the past.
(Refer Slide Time 13:12)

So analysis is not exactly equal to analytics. Similarly, data analysis is different from data analytics, and business analysis is different from business analytics. Analytics is nothing but the study of future events with the help of past data.
(Refer Slide Time 13:34)

Next we go to the classification of data analytics. Based on the phase of the workflow and the kind of analysis required, there are four major types of data analytics: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics. We will see these four types in detail in coming classes.
(Refer Slide Time 13:57)

Looking at the difficulty of, and the value obtained from, the different types of analytics, this picture shows that descriptive analytics answers "What happened?"; diagnostic analytics answers "Why did it happen?"; predictive analytics answers "What will happen?"; and prescriptive analytics answers "How can we make it happen?". In this context, looking at the level of difficulty, descriptive analytics has a very low level of difficulty.

On the contrary, prescriptive analytics has a higher level of difficulty, and its value, in the sense of the business value it adds to you, is also higher. So where there is more difficulty, there is more value.
(Refer Slide Time 14:54)

Then, what is descriptive analytics? Descriptive analytics is the conventional form of business intelligence or data analysis. It seeks to provide a depiction or summary view of facts and figures in an understandable format, which either informs or prepares the data for further analysis. Descriptive analysis, or, put another way, descriptive statistics, can summarize raw data and convert it into a form that can be easily understood by humans. It can describe in detail an event that has occurred in the past.
(Refer Slide Time 15:40)

Common examples of descriptive analytics are company reports that simply provide a historic review: data queries, reports, descriptive statistics, data visualization, and data dashboards.
(Refer Slide Time 16:00)

Next we go to diagnostic analytics. Diagnostic analytics is a form of advanced analytics which examines data or content to answer the question "Why did it happen?". We are diagnosing: suppose we are meeting a doctor for a consultation; the doctor will try to understand why this has happened. That kind of analytics is nothing but diagnostic analytics. Diagnostic analytical tools aid an analyst in digging deeper into an issue,

so that they can arrive at the source of the problem. A doctor likewise identifies, when somebody has some disease, what the source of the problem was. Similarly, if something has happened, for example a company is not performing well, diagnostic analytics will help you identify the core reason for it. In a structured business environment, tools for descriptive and diagnostic analytics go in parallel.

Whether it is descriptive or diagnostic analytics, the analytical tools being used can be the same; only the purpose may be different.
(Refer Slide Time 17:09)

Examples are data discovery, data mining, and correlations. These tools can be used for diagnostic analytics.
(Refer Slide Time 17:20)

Now we go to predictive analytics. Predictive analytics helps to forecast trends based on current events. When you say "predicting", obviously you are discussing what will happen in the future. Predicting the probability of an event happening in the future, or estimating the accurate time at which it will happen, can all be determined with the help of predictive analytical models. Many different but co-dependent variables are analysed to predict a trend in this type of analysis.

One of the tools in predictive analytics is regression analysis. There may be some independent variables and some dependent variables, sometimes more than one dependent variable, and we study how these variables are interrelated. That kind of study is nothing but predictive analytics.
(Refer Slide Time 18:11)

When you look at this picture, you see that from historical data, using different predictive algorithms, you can arrive at a model. Once the model is developed, new data can be fed into it, and we can get predictions about future events.
(Refer Slide Time 18:35)

Examples are linear regression, time series analysis and forecasting, and data mining. These are techniques for predictive analytics.
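As a minimal sketch of this fit-then-predict workflow (assuming scikit-learn is installed; the variables and numbers here are invented purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: advertising spend (independent) vs. sales (dependent)
X_hist = np.array([[10], [20], [30], [40], [50]])   # spend (made-up values)
y_hist = np.array([25, 41, 58, 80, 97])             # sales (made-up values)

model = LinearRegression().fit(X_hist, y_hist)      # model built from past data

X_new = np.array([[60]])                            # new data fed into the fitted model
print(model.predict(X_new))                         # prediction about a future event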
(Refer Slide Time 18:46)

The last one is prescriptive analytics: a set of techniques to indicate the best course of action. It tells you what decision to make to optimize the outcome. The goal of prescriptive analytics is to enable quality improvements, service enhancements, cost reductions, and increased productivity.
(Refer Slide Time 19:13)

In prescriptive analytics, some of the tools we can use are optimization models, simulation models, and decision analysis. These are the tools under prescriptive analytics.
(Refer Slide Time 19:27)

Next we are going to see why analytics is so important. In this section we will look at what is happening with the demand for data analytics, and at the different elements of data analytics.
(Refer Slide Time 19:44)

This picture shows Google Trends, up to 2017. The blue line represents "data scientist"; the orange represents "statistician" and "operations researcher". You see the trend is increasing, which means people are searching for the term data scientist in the Google search engine more and more; the search count is increasing. That means there is demand for that particular job.
(Refer Slide Time 20:19)

This is a newspaper clipping from the Times of India. A lot of news items are appearing about data scientists and the future requirement for data scientists; data scientists are earning more than CAs and engineers. You can look at this link for further details.
(Refer Slide Time 20:37)

You can also see the demand for data analytics in this newspaper clipping: with companies across industries striving to bring their research and analysis departments up to speed, the demand for qualified data scientists is rising. It is an emerging field, and many companies are looking for qualified data scientists. If you take this course and complete it, you may be better qualified to get into these companies.
(Refer Slide Time 21:07)

Many times students have different understandings of what data analytics, statistics, data mining, and optimization are. When we say data analytics, there are different elements: one is statistics; the next is business intelligence and information systems; then modelling and optimization; then simulation and risk, that is, the ability to do what-if (sensitivity) analysis; then visualization and data mining. These are the components of data analytics, and these different domains are interrelated.
(Refer Slide Time 21:47)

Next we will see what kind of skill set is required to become a data analyst, and then the small difference between a data analyst and a data scientist.
(Refer Slide Time 21:59)

To become a data analyst, the fundamental requirement is knowledge of mathematics. Next, you need knowledge of technology, that is, hacking skills. Hacking, looked at in a positive way, means: given the data, how do you use it to extract more information? The next skill is business and strategy acumen: you should have knowledge of the domain and knowledge of the business and its strategy.

These three skills are required for a good data scientist. It is very difficult for one person to have all three skills; that is why finding good data analysts is becoming very difficult. Somebody may be very good at mathematics but may not have good knowledge of business; some people may be very good at technology, in the sense of information technology, but may not have good business knowledge.

So we need the combination of all three skills; otherwise, a group of people, some from the mathematics area, some from computer science, and some with domain knowledge, have to work together to form a good data science team. Together, these form data science.
(Refer Slide Time 23:31)

Now, what is the difference between a data analyst and a data scientist? The difference lies in the role each plays. The role of the data analyst is tied to a business domain: if he is good at doing analytics in the area of marketing, he can be called a marketing analyst; if the person is from the finance area, he can be called a finance analyst. So he is an analyst, a data analyst.

But the role of the data scientist is a little bigger, because the data scientist needs knowledge of advanced algorithms and machine learning, and must be able to come out with a data product, which I told you about previously. The data scientist can come out with a data product.
(Refer Slide Time 24:30)

In this course we are going to use Python. In the next lecture I will give you a basic introduction to Python; here we will see why we are going to use it. Python is very simple and easy to learn. Most importantly, it is free and open-source software. It is interpreted, not compiled: the difference between a compiler and an interpreter is that a compiler needs the whole program, whereas an interpreter does not;

it can execute even a single sentence, one line of the program, at a time. Python is dynamically typed: in some other languages you have to declare every variable, stating its nature, whether it is an integer or a float, but here you need not; it takes the type dynamically. It is extensible: code written in some other language can be extended with the help of Python.

And it can be embedded: a program made in Python can be embedded in some other platform. It also has an extensive library.
(Refer Slide Time 25:45)

As for the usability of Python: it can be used for desktop and web applications, database applications, and networking applications; most importantly, it can be used for data analysis, data science, and machine learning; it can be used for IoT (Internet of Things) and artificial intelligence applications; and it can be used for games.
(Refer Slide Time 26:05)

Another reason for choosing Python is that many companies use Python as a language in their work, for example Google, Facebook, NASA, Yahoo, and eBay.
(Refer Slide Time 26:23)

With Python we are also going to use the Jupyter notebook, which I will explain in the next class. It is a client-server application in which you edit code in a web browser. It makes documentation easy, makes demonstration easy, and has a user-friendly interface.
(Refer Slide Time 26:39)

In this last session of the lecture, we will explain four different levels of data: the types of variables and the levels of data measurement. We will compare four levels of data, namely nominal, ordinal, interval, and ratio, and we will see why it is useful to know these different levels of data.
(Refer Slide Time 27:03)

One way of classifying data is into categorical data and numerical data. Categorical data includes marital status, political party, and eye colour. Numerical data can be discrete or continuous. Discrete data might be the number of children, or defects per hour. Continuous data might be weight, or voltage.

What is the difference between discrete and continuous? For the number of children, you may say two children or three children; 2.5 children is not possible. But for a continuous variable, between 0 and 1 there are infinitely many values; the numbers are continuous, so it is a continuous variable.
(Refer Slide Time 27:56)

Next we will see the different levels of data measurement. We have just seen one classification of data, into categorical and numerical. Another way of classifying is into nominal data, ordinal data, interval data, and ratio data.
(Refer Slide Time 28:14)

We will look at what nominal data is. A nominal scale classifies data into distinct categories in which no ranking is implied. Examples of nominal data are gender and marital status. For example, suppose you are conducting a questionnaire and you capture gender as male = 0, female = 1. These 0s and 1s just represent the gender; you cannot do any arithmetic operations with them.

For example, you cannot meaningfully find the average: the software will give you some number, but it has no meaning. Similarly, marital status, married or unmarried, is an example of nominal data.
(Refer Slide Time 29:01)

The next level of data is the ordinal scale. It classifies data into distinct categories in which a ranking is implied; here the numbers are ranks. For example, you may ask customers to rank their level of satisfaction: satisfied, neutral, unsatisfied. Or faculty ranks: professor, associate professor, assistant professor.

You see that a rank order is followed: 1 professor, 2 associate professor, 3 assistant professor. Student grades A, B, C, D, E, F are also ordinal, because the numbers 1, 2, 3 represent rank.
(Refer Slide Time 29:45)

The next level of data is the interval scale. The interval scale is an ordered scale in which the difference between measurements is a meaningful quantity, but the measurements do not have a true zero point. An example of an interval scale is the year. This year is 2019; you can add and subtract: add another five years and it is 2024, or subtract another nine years and it is 2010.

But you cannot multiply: if you multiply, say, 2019 and 2020, you end up with a big number that has no meaning, because zero on this scale has no meaning. Another example of an interval scale is Fahrenheit temperature. On the Fahrenheit scale, zero is just a point on the scale; it is not the absence of heat. At the same time, on the Kelvin scale, 0 K (that is, minus 273 degrees Celsius) is the absence of heat. So Kelvin is a different kind of scale, which we will see next.
(Refer Slide Time 30:52)

The ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point. Weight, age, salary, and Kelvin temperature come under the ratio scale, because 0 Kelvin means the absence of heat. On the ratio scale, you can do all kinds of arithmetic operations. With nominal data you cannot do any arithmetic operation; with ordinal data you also cannot do arithmetic operations.

With interval data you can add and subtract, but you cannot multiply. But with ratio data you can do all kinds of arithmetic operations: you can add, subtract, multiply, and divide.
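As a minimal sketch of how these four levels can be represented in pandas (the values here are hypothetical, chosen only to illustrate which operations make sense at each level):

import pandas as pd

# Nominal: distinct categories, no ranking -- arithmetic on the codes is meaningless
gender = pd.Categorical(["male", "female", "female", "male"])

# Ordinal: distinct categories with an implied ranking
grades = pd.Categorical(["B", "A", "C", "A"],
                        categories=["F", "E", "D", "C", "B", "A"], ordered=True)
print(grades.min())                 # rank comparisons are allowed: prints 'C'

# Interval: differences are meaningful, but there is no true zero
years = pd.Series([2010, 2019, 2024])
print(years.max() - years.min())    # subtraction is fine: 14

# Ratio: true zero point, so all arithmetic operations are meaningful
weight = pd.Series([60.0, 75.0, 90.0])
print(weight / weight.sum())        # ratios and proportions make sense here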
(Refer Slide Time 31:35)

Now consider the usage potential of the various levels of data. The usage potential of nominal data is the lowest; next comes ordinal, then interval, then ratio. Ratio data has the highest usage potential, and nominal data the least.
(Refer Slide Time 31:56)

This is more important: why do we still need to know the different types of data? Because the type of data helps us choose the right analytical tool for the analysis. If the data is nominal, you can do only nonparametric tests. If the data is ordinal, here also you can do only nonparametric tests. But if the data is interval, you can do parametric tests; with interval data you can do all of the above plus addition and subtraction.

With ratio data you can do all of the above plus multiplication and division, and in terms of statistical methods you can go for parametric methods. So the purpose of classifying data into nominal, ordinal, interval, and ratio is to choose the right analytical tool, whether parametric or nonparametric. Another reason is that nonparametric analysis is what is appropriate for nominal data;

sometimes students have nominal data but go for a parametric test, and that should not be done. That is the purpose of knowing the nature of the data. So in this class we have seen an introduction to data analytics, the importance of data analytics, the classification of data analytics, the difference between analytics and analysis, and the different types of data.

In the next class we will learn what Python is, how to install it, and what kind of descriptive analysis we can do with its help. We will meet in the next class with another lecture. Thank you very much.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee

Lecture No 2
Python Fundamentals - I

Good morning, students. In the last class, which was the introductory class, we saw the importance of data analytics and some classification of data analytics. This is my second lecture, on Python fundamentals, because we are going to use Python. In this lecture I have three objectives.
(Refer Slide Time: 00:50)

First, I will tell you how to install Python; second, we will see some fundamentals of Python; third, some data visualization. For data visualization I am going to give only the theory in this class. In the next class we will take some sample data and visualize it using the Python software.
(Refer Slide Time: 01:13)

As I told you, the first topic is how to install Python. There are 5 steps, which we will see in detail in the coming slides. In step 1 we visit the website www.anaconda.com from the address bar of the web browser; in step 2 we click on the Download button; in step 3 we download the Python 3.7 version for the Windows operating system; in step 4 we double-click the downloaded file to run the application;

and in step 5 we follow the instructions until the installation process completes. I took screenshots of all 5 steps while installing on my laptop, and I am going to show each step as a screenshot.
(Refer Slide Time: 02:03)

Step 1: type www.anaconda.com in the address bar of the web browser.
(Refer Slide Time: 02:13)

Step 2: once you have typed it, you see this screen. At the location I have circled there is a Download option; on the left side, which I have also circled, there is a Download option. Click it to download.
(Refer Slide Time: 02:36)

In step 3, there are two versions of Python: Python 3.7 and Python 2.7. In this course we are going to use the latest version, Python 3.7.
(Refer Slide Time: 02:46)

In step 4, double-click the downloaded file to run the application. For example, I have stored this Anaconda installer in the F drive.
(Refer Slide Time: 02:59)

Step 5 is just to keep clicking Next.


(Refer Slide Time: 03:02)

You have to agree to the terms and conditions of the license agreement.
(Refer Slide Time: 03:06)

Next, select "Just Me (recommended)" and click Next.


(Refer Slide Time: 03:10)

Then it is installed in the C drive; click Next.
(Refer Slide Time: 03:15)

Then click Install; the installation process starts, and then the installation completes.


(Refer Slide Time: 03:25)

Again click Next.
(Refer Slide Time: 03:29)

Then click Finish.


(Refer Slide Time: 03:34)

Now we have installed Anaconda. I will explain how to open the Jupyter notebook; let me switch the screen.
(Refer Slide Time: 03:46)

This is the screen. Initially you see a box shown in blue; sometimes it is shown in green, which I will show you later. This is what the Jupyter notebook looks like.
(Refer Slide Time: 04:04)

There are several interfaces for using Python: there is Spyder, and there is Jupyter. We prefer Jupyter for a few reasons: you edit code in a web browser, documentation is easy, demonstration is easy, and the interface is user-friendly. That is why we are using Jupyter; it is not compulsory, and if you are already comfortable with some other interface you can continue with that.
(Refer Slide Time: 04:34)

Anaconda consists of two pieces of software: Python, on the left-hand side, and the Jupyter application, on the right-hand side. These are combined together and kept in the Anaconda software package.
(Refer Slide Time: 04:49)

From the Start menu, when you type Jupyter, you get this screen.
(Refer Slide Time: 04:58)

Then, when you click Launch, you get this. Now, from the start, I am going to explain how to launch the Python Jupyter notebook.
(Video Starts: 05:08)
You have to type Jupyter, then Jupyter Notebook. When you click it, you get this. Suppose you want a new notebook: go to New > Python 3. It comes up as "Untitled2"; there you can change the name. Give it the name "Introduction to Python".
(Video Ends: 05:40)

(Refer Slide Time: 05:40)

You see a box appearing; this is called a cell (I have marked it in red). The cell can be entered using the Enter key.
(Refer Slide Time: 05:51)

Sometimes that box appears green: green indicates that it is in edit mode. Sometimes the box appears blue.
(Refer Slide Time: 06:01)

Blue indicates command mode. In the toolbar below Help there is a dropdown called Markdown. If you type something and then select Markdown, the cell is used for documentation: it contains documentation text that is not executed as code; it is only for our understanding.
(Refer Slide Time: 06:22)

Now, more about the Jupyter notebook. Command mode allows editing the notebook as a whole; to leave edit mode, press the Escape key. Execution can be done in three ways. You can press Ctrl+Enter, which runs the cell and stays on it; another way is to press Shift+Enter, which runs the cell and moves to the next one. The third way is the Run button on the Jupyter interface,

which you can click directly, and your code gets executed. A comment line is written preceded by the # symbol. When you want to add some explanation to your program, use the # symbol; that line will not be executed.
(Refer Slide Time: 07:17)

It is only for your understanding. There are also important Jupyter notebook shortcut keys. Pressing A creates a cell above; pressing B creates a cell below; pressing D twice (D, D) deletes a cell; pressing M makes the cell a Markdown cell; pressing Y makes it a code cell.
(Refer Slide Time: 07:46)

For example, when I press B, a new cell is created below.


(Refer Slide Time: 07:54)

We go to the next topic, fundamentals of Python, starting with loading data. In the coming slides we will see: loading a simple delimited data file; counting how many rows and columns were loaded; and determining which type of data was loaded. Then we look at different parts of the data by subsetting rows and columns. These activities are important because, once we load a dataset, it may have any number of rows and columns.

Sometimes we need to do some operation using only a few rows or a few cells. You should know how, from a big data file, to use only a particular row or a particular column. Sometimes we take a collection of rows, or a collection of columns, for our specific operations.
(Refer Slide Time: 08:49)

This is the reference book I am following for this course, especially for this lecture: Pandas for Everyone, authored by Daniel Y. Chen.
(Refer Slide Time: 09:04)

Now we are going to learn how to load a simple delimited data file. This is fundamental because, before doing data analysis, the first step is to load the data into Python. For that purpose we are going to import some basic libraries: pandas, numpy, and matplotlib.pyplot as plt. First we import these three basic libraries, then we load the data. The dataset is taken from www.github.com/gennybc/gapminder.

I have already downloaded this dataset, and I am going to show you how to load it into Python. Before that, I am going to open the file in Excel and show you what the columns and rows are.
(Video Starts 10:07)
Looking at this, let me read the columns: there is country, year, population, continent, life expectancy (given the short name lifeExp), and then GDP per capita. How many rows are there? Scrolling down in this CSV file, there are 1705 rows (including the header row).

The last row is Zimbabwe, year 2007, with its population, continent, life expectancy, and per capita income. Now I am going to import this CSV file into Python. I am going to call this dataset df.

We write df = pd.read_csv(...), where pd is the short form of pandas; pandas is nothing but "panel data".

Why read_csv? Because it is a CSV file I am going to read. For the location of the file, you can directly copy the path, but note one thing: when you copy the path on Windows, you generally get backslashes (\), and you have to change them to forward slashes (/). So I changed the path to C:/users/ET cell/desktop/gapminder-five year data.csv.

Note that the path should be inside quotes in the code. Now I read the data into df. Once I read it, you see that the rows start from 0; that is very important. Pandas uses 0-indexing: 0, 1, 2, 3, 4, 5, 6, 7, 8. I can see whatever I saw in the CSV file a few minutes ago: country, year, population, continent, life expectancy, and GDP per capita. And how did I read it? With pd.read_csv.

Suppose I have loaded the data and want to see the headings of the file, meaning the columns and the first few values. For that, in Python, you type print(df.head()). When you execute this, you get the first 5 rows, that is, rows 0, 1, 2, 3, 4 of the dataset.

I will go to the next command. Suppose I want to know the size of the file, that is, how many rows and how many columns there are. For that there is an attribute called shape. So print(df.shape); df is the variable in which we stored the loaded CSV file. When you type print(df.shape), we come to know how many rows and how many columns there are.

So I am typing print(df.shape). One more thing: you should not type parentheses after shape, because shape is an attribute, not a method. I remove the parentheses and run it again. Yes, it shows how many rows and how many columns. Next, I want to know the column names; for that I type print(df.columns).

Here too, note that there are no parentheses after columns. This is the output I copied: country, year, pop, continent, lifeExp, gdpPercap, with dtype "object". I will show you how this command runs: type print(df.columns), and you get this output. Students, what you should do while watching the video is open your laptop, type these commands,

and verify the answers yourself. The next command gets the data type of each column: you type print(df.dtypes). That gives you a summary of the dataset and the nature of each column. We will see how it appears: I type print(df.dtypes), and now we will see the data type of each column. For that, you use the dtypes attribute.

This is the output you get from print(df.dtypes). Looking at it: country is an object; year is an integer; the population variable, pop, is a float, meaning it has decimals; continent is an object, meaning character data; life expectancy is a float, meaning you get that value in decimals.

Similarly, gdpPercap also comes in decimals. Now we go to Jupyter and run this command: you see, in cell number 8, print(df.dtypes) gives exactly what I showed on the slide: country object, year integer, population float, and so on.
(Video Ends: 17:10)
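Putting the commands from this demonstration together, a minimal sketch looks like this (the file path is the one used in the demo and will differ on your machine):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# path as used in the demo; adjust it to wherever you saved the file
df = pd.read_csv('C:/users/ET cell/desktop/gapminder-five year data.csv')

print(df.head())     # first 5 rows; the index starts from 0
print(df.shape)      # (rows, columns) -- shape is an attribute, no parentheses
print(df.columns)    # the column names
print(df.dtypes)     # the data type of each column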
(Refer Slide Time: 17:11)

This is the classification of data types from the perspective of pandas and Python. "object" (string) is the most common data type; it holds characters. "int64" is a whole number, an integer. "float64" is a number with decimals. "datetime64" represents dates and times; the corresponding Python datetime type is not loaded by default and needs to be imported. We will see it whenever the requirement arises.
(Refer Slide Time: 17:38)

One more command gets more information about the data: type df.info(), and you get full details about each column. We will do that now.
(Video starts: 17:53)
Look at this: when I run df.info(), I get the data columns. There are 6 columns. country has 1704 rows, non-null object, which means all the data is there, with no missing values. Similarly, year has 1704 rows, non-null integer; non-null means all the values are filled in, with no missing cells. Then population is float, continent is object, lifeExp is float, gdpPercap is float, and the memory usage is this much.

Suppose we have a big data file and we want to see specific rows or specific columns. How do we do that? Let us get the country column and save it to its own variable. If you look at the data I showed initially, country is one of the columns. I want to pick up only that country column and save it, giving it the name country_df: country_df = df['country']. You see that you open a square bracket and put the column name in quotes inside it.
Suppose in the country column I want to see the first 5 rows. You type print(country_df.head()); that shows the first 5 rows, and you see that from the full data we have fetched only the country column, rows 0 through 4. Suppose there is a requirement to see the last five observations.

You type print(country_df.tail()); then, from the bottom, you can see the last 5 rows of country_df.
(Video ends: 21:15)
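A minimal sketch of what was just demonstrated (the path is assumed; adjust it to your machine):

import pandas as pd
df = pd.read_csv('gapminder-five year data.csv')   # adjust the path to your machine

df.info()                    # columns, non-null counts, dtypes, memory usage

country_df = df['country']   # save one column to its own variable (a Series)
print(country_df.head())     # first 5 rows of the country column
print(country_df.tail())     # last 5 rows of the country column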

(Refer Slide Time: 21:15)

There may be a requirement to see more than one column at a time. I am going to save them under another name, called subset: subset = df[['country', 'continent', 'year']]. You see there are double square brackets; I want to fetch the country, continent, and year columns. Then I will look at the head, meaning the first 5 rows of this subset. Let us go to Python.
(Video starts: 17:46)
I am going to call it subset. Suppose I want to see the first 5 rows of this dataset called subset: you see that I am able to fetch 3 columns at a time, namely country, continent, and year. In the same way, from the subset I want to see the last 5 rows, so print(subset.tail()). Let us see what we get; yes, you see there are 3 columns,

and these are the last 5 rows from the bottom. So far we have been looking at columns; now we want to subset rows by index label, using the loc attribute. First we look at the initial file with print(df.head()). Next, I want to locate the 0th row; for that purpose, print(df.loc[0]), with square brackets. If we want the first row, we enter 0, because Python counts from 0. So print(df.loc[0]) shows the first row.

You see the 0th row: country Afghanistan, year 1952, population as shown, continent Asia. You can verify that the 0th row is country Afghanistan, year 1952. This is the way to access a particular row. Dear students, whatever commands I am typing will be given to you when you take this course, so you can practice them yourself.

Do not worry if you cannot follow everything at this stage; all the commands and code will be given to you, and you can practice on your own. Suppose I want to get the 100th row from the file df: you type print(df.loc[99]), and you can access exactly what is in the 100th row. The 100th row is country Bangladesh, year 1967, with its population.

This is the way to access different rows for calculation purposes. So far we have seen how to load a CSV file into Python and some basic commands.
(Video Ends: 25:21)
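A minimal sketch of subsetting several columns and locating rows by label (path assumed as before):

import pandas as pd
df = pd.read_csv('gapminder-five year data.csv')   # adjust the path to your machine

subset = df[['country', 'continent', 'year']]   # note the double square brackets
print(subset.head())    # first 5 rows of the three chosen columns
print(subset.tail())    # last 5 rows

print(df.loc[0])        # first row -- Python counts from 0
print(df.loc[99])       # 100th row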

We have seen how to know the size of the file, how to access a particular row, and how to subset smaller data files from a given big file, which can be used for further analysis. In the next class we will see how to access different columns; that will continue in the next lecture. Thank you.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee

Lecture No 3
Python Fundamentals - II

We will continue our lecture on how to access different rows and columns, because it has very important applications.
(Refer Slide Time 00:33)

When the data file is very big, sometimes you need to access only some rows or some columns for your calculations. We will learn how to access particular rows or columns, looking at columns, rows, and cells. When I use the command print(df.head()), the index column shows 0, 1, 2, 3, 4, and the columns are country, year, population, continent, lifeExp, and gdpPercap.
(Refer Slide Time: 01:04)

Suppose I want to get the first row. As we know, Python counts from 0, so if you want the first row you type print(df.loc[0]), loc for location, with 0 in square brackets. Do that and you get the details in the first row.
(Refer Slide Time: 01:26)

If I want the hundredth row, I type print(df.loc[99]). We know that Python counts from zero, so for the 100th row you have to type 99, in square brackets, and you see the details of the 100th row.
(Refer Slide Time: 01:42)

Suppose we want to know the last row in the dataset: type print(df.tail(n=1)) and you get the last row. (Typing -1 with loc will not work; we will see why.) If you want the last row, simply type df.tail(n=1). Let us see that.
(Video Starts: 02:01)
Now we use this command to see the details of the last row. We can also subset multiple rows at a time. For example, there may be a requirement to select the 1st, 100th, and 1000th rows. For that purpose you type print(df.loc[[0, 99, 999]]); note that there are two square brackets. Running it, we are able to see the 1st, 100th, and 1000th rows.

There is another way to subset rows, by row number, using iloc. Previously we used loc; now we use iloc. Suppose I want the 2nd row: if I type print(df.iloc[1]), I get the details of the 2nd row. Suppose I want the 100th row using iloc; running it, those are the details of the 100th row. And if I want to access the last row using iloc,

I can directly type print(df.iloc[-1]), which gives the details of the last row. You can open the Excel file and verify the last row, and so on.
(Video Ends: 04:27)

(Refer Slide Time: 04:27)

An important note here: with iloc we can pass -1 to get the last row, but we cannot do the same with loc. That is the difference between the loc and iloc commands.
(Refer Slide Time: 04:42)

Suppose we want to get the 1st, 100th, and 1000th rows using iloc. We type print(df.iloc[[0, 99, 999]]). Let us see what answer we get.
(Video Starts: 04:58)
Yes, we get the 1st, 100th, and 1000th rows.
(Video Ends: 05:21)
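A minimal sketch of the row-subsetting commands demonstrated so far, loc by label and iloc by position (path assumed as before):

import pandas as pd
df = pd.read_csv('gapminder-five year data.csv')   # adjust the path to your machine

print(df.tail(n=1))            # last row via tail
print(df.loc[[0, 99, 999]])    # 1st, 100th and 1000th rows by label
print(df.iloc[1])              # 2nd row by integer position
print(df.iloc[-1])             # last row -- -1 works with iloc but not with loc
print(df.iloc[[0, 99, 999]])   # the same three rows by position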

(Refer Slide Time: 05:21)

So far we have been subsetting rows. Now we will see subsetting columns. The Python slicing syntax uses a colon; a colon on its own refers to everything, and here it represents all the rows. So if you want to get columns using loc or iloc syntax, you write something like df.loc[:, columns], naming which columns to subset.
(Refer Slide Time: 05:49)

In the next slide we subset columns with loc. Note the position of the colon: it is used to select all rows.
(Refer Slide Time: 05:58)

You see: subset = df.loc[:, ['year', 'pop']]. I want to see only two columns, year and population. When you type it this way, you get all the rows but only the details of those two columns. Then, when you type print(subset.head()), you get the first 5 rows. Let us see how it appears.
So you will see how it appearing.
(Video Starts: 06:26)
df is the initial object, which has all the details. Now I am going to fetch only a few columns from the df object and save them under the name subset; subset is a new object. I take all the rows but only the year column and the population column, and then I look at the first 5 rows. You see that I am able to see the first 5 rows of only 2 columns, year and population. This is the way to get just 2 columns from a big file.
(Video Ends: 07:21)
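A minimal sketch of this column subsetting with loc (in this dataset the population column is named 'pop'; path assumed as before):

import pandas as pd
df = pd.read_csv('gapminder-five year data.csv')   # adjust the path to your machine

subset = df.loc[:, ['year', 'pop']]   # ':' selects every row; two columns kept
print(subset.head())                  # first 5 rows, year and population only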
(Refer Slide Time: 07:21)

Here is another example: subsetting columns with iloc. iloc allows us to use integers, and -1 selects the last column. As before, subset = df.iloc[:, [2, 4, -1]], where the colon represents all the rows. Then we can see the first 5 rows using print(subset.head()).
(Video Starts: 07:46)
See that we are able to see the last column and the population column, life expectancy column.
You can open our Excel sheet you can verify whether we are getting the right answer or not.
(Video Ends: 08:26)
(Refer Slide Time: 08:26)

60
Sometime there is another way for subsetting columns by using the command called range. First
will make range of numbers we are going to save that range of a number in object called small _
range, so small _ range equal to list range 5. Print small range will get 0, 1, 2, 3, and 4. Now this
small _ range, object can be used to access the corresponding columns.
(Refer Slide Time: 08:57)

So if I type a subset equal to df.iloc:, small _ range I can get.


(Refer Slide Time: 09:04)

1st column, 2nd column, 3rd column, 4th column and 5th column, so we will try this.
(Video Starts: 09:09)

61
small_range is an object, we are going to create a range. Suppose we want to see what
small_range is. So it is up to 0 to 4, that means 1 to 5. Now we are going to subset using that
object called small _ range using ilocation command. df.ilocation:small_ range we see that here
we are able to see 5 column that is a country, year, population, continent and life expectancy.
(Video Ends: 10:21)
(Refer Slide Time: 10:22)

So far we have seen subsetting only rows and columns. Now we are going to subset rows and
columns simultaneously. For example; using loc command so if you type print df.loc 42
countries. We can check in the 42 label in country columns. What is the cell name, there cell
name is Angola. Will try this.
(Video Starts: 10:47)
Going to see in that file in 42nd label in country column what value is there so that is an Angola,
Yes?
(Video Ends: 11:09)
(Refer Slide Time: 11:09)

62
Yes, we can see what is in the using the same ilocation we can see in 42nd label in 0th column.
Now we can represent column also with 0 columns, what value it is, you will see that. You can
verify you have to get to the answer. You can open the Excel file. You can verify we are
correctly accessing the cell or not.
(Video Starts: 11:29)
Print df.iloc in 42nd label 0th column what is the value it is Angola.
(Video Ends: 11:46)
(Refer Slide Time: 11:46)

Next we can subset multiple rows and columns. For example; get the 1st, 100th and 100th rows
from the 1st, 4th and 6th column. So now we are going to, simultaneously we are going to fetch

63
rows and columns and corresponding cells. So print to df.iloc 0, 99, 999. Similarly column labels
is 0, 3, 5. Let us see what answer.
(Video Starts: 12:13)
This accessing rows and columns are very important functions because nowadays data file comes
with a lot of rows and lot of columns. We need not use all the columns, all the rows for further
analysis. Sometimes we need only specific rows or specific columns. So these basic commands
will help you, how to access a particular rows and columns, that will be very useful when we do
further analysis using Python. Yeah? This is the value so that means 1st row, 100th row 1000 th
row, 1st column and soon.
(Video Starts: 13:08)
(Refer Slide Time: 13:08)

And there is another way if you use the column names directly it makes the code a bit easier to
read. In terms of number and so you see number column. If you use for representing column, if
you use column name we can see what is there, so simply type the column name. So we use this
command, print df.loc 0, 99, 999. Then directly will type the column name country, life
expectancy, gdpPercap you see there is a square bracket here.
(Video Starts: 13:36)
That you have to do as the same that Life capital Exp, Yes? This is because country, life
expectation this is the easy way to because we cannot remember column name.
(Video Ends 14:48)

64
(Refer Slide Time 14:49)

This was not only that instead of see suppose if you put a 10 column 13 that corresponding rows
will be displayed. So print df.loc 10 to 13, the 10th row 11th, row 12th, row 13th, row will be
shown and in columns country and life expectancy and gdpPercap so we will try this command.
(Video Starts: 15:11)
That means we can see the range of rows at a time. You are able to see the 10th row, 11th, 12th
and 13th.
(Video Ends: 16:17)
(Refer Slide Time: 16:17)

Okay? Next see print df. head we can see we can able to see 1st, 10 rows.

65
(Refer Slide Time: 16:23)

The 10th row some time for each year in our data what was the average life expectancy. To
answer this question we need to split our data into parts per year and then we can get the life
expectancy column and calculate the mean.
(Refer Slide Time: 16:38)

So what is happening there is a command which I go to use called groupby, we look at the data it
is not grouped. So when you use this command print df.groupby year,and life expectancy and
corresponding mean. The mean of the on the in the year 1952, the mean of the life expectancy
variable is 49.05. In 57, 51.09. We look at the data; it is not in this order. So the groupby by year

66
this command is grouping all the values, with respect to year. So we will see what is the answer
for this, we will verify this.
(Video Starts: 17:15)
When you open that Excel file you will see that the Excel file will be in some other form it is not
grouped by year, different years are appearing at different places. So this command that is a
group by will help you to group the data in year wise. Yes, you see that you are able to get 1952
the life expectancy was 49 years you see that when you look at this data. When year increases
the life expectancy year also increases due to advancement of medical facility available and the
standard of life is also increasing.
(Video Ends: 18:42)
(Refer Slide Time: 18:42)

Now, we can form a stacked table. Stacked table is using the group by command. So you type
this multi _ group _ variable = df . \ . See the \ represents to breaking the command we can use \.
Otherwise you can write straightaway also no problem. df.group by year, continent, life
expectancy,gdp per capita, then we can find the mean. Then we will get this output for that
means in 1952, in Africa, the life expectancies 39, in America 53, in Asia 46 in Europe 64 will
try this command.
(Video Starts: 19:28)
When we takes these command you will get an output, that is a stacked table. That is very useful
for interpreting the whole dataset, is kind of a way of summarizing the data in the form of table.

67
Multi_group. You see that now year wise. It is very, very useful command it is year by 1952,
some country Africa. What was the average year 1957 Africa. We see that if you look at only the
Africa data. 52 to 39 in 57 41, in 62 43, in 67 45, see that we can interpret this way, by looking at
the, this table. Suppose you have to flatten this.
(Video Ends 21:24)
(Refer Slide Time: 21:24)

If, you need to flatten the data frame. You can use this reset underscore index method, just to
type flat = multi _ group _ var . reset _ index. Then you see now the data is again. Now it is
flattened. The same data set, which was it in the table form now it in the simple learned form. So
we will try this comment.
(Video Starts 21:48)
This is what you are doing the data manipulation, because from the big data file, we have to learn
this kind of fundamental data manipulation methods that will be very useful, in coming classes.
So able to use reset _ index command to flatten the, that stacked table. See that now we can see
first 15 rows. Now it is data is flattened into the normal form.
(Video Ends 22:41)
(Refer Slide Time: 22:41)

68
The next one is grouped frequency counts. By using nunique command, we can get a count of
unique values on the panda series. So when you type print df. groupby continent, country.
nunique, you can get unique values that means frequency. Okay, will try this command.
(Video Starts: 23:04)
Print, See Africa 52, America is 25, Asia 33. When you look at the data, again, you go to
excel,Excel data you can interpret what is the 52 means, what is the America 25 and soon.
(Video Ends: 23:49)
(Refer Slide Time: 23:49)

Now, some basic plot a way to construct two things one is year and life expectancy. So we are
going to create a new object that is called Global _ yearly_ life _expectancy. By grouping year

69
and life expectancy, with respect to its mean. Then we are going to print it. So you are going to
get two values one is year. Next one is life expectancy. That is a mean life expectancy, you will
see this.
(Video Starts: 24:17)
There is a new object. The object name is called Global _ yearly _ life expectancy. Yes, see that
year, and supposed we want to plot it. We will see we are going to plot this data, how we are
going to plot it.
(Video Ends: 25:28)
(Refer Slide Time: 25:28)

Simply, just that object name. plot. That automatically takes this was output, which I got is in x
axis in a year, in y axis, average life expectancy. We will run this.
(Video Starts 25:40)
So, what this data says that, when the year 1950 - 2000 you see when the year increases, the life
expectancy also increases.
(Video Ends: 26:07)
(Refer Slide Time 26:07)

70
Just we have seen only the simple plot, in coming classes, we will see some of the visual
representation of the data. We are going to see a histogram, frequency polygon, ogive curves,pie
chart, stem and leaf plot and pareto chart and scatter plot .
(Refer Slide Time: 26:21)

Suppose, this is the data, see what is there in East, west, north. In column first quarter, second
quarter, third quarter, fourth quarter.
(Refer Slide Time: 26:30)

71
Suppose the very easiest way is the graph. By using this is called bar graph, bar chart. Bar chart
is different regions are labeled as different colors. This is a method of visual representation of the
data. If you look at this, the eastern side in third quarter, there are more sales. Okay.
(Refer Slide Time: 26:53)

The another way to represent visually, the data is pie chat, is the first quarter, third quarter. You
look at this, third quarter, which is in blue in color. There are more sales. And most importantly
the pie chart, we can get pie chart only for categorical variable. The variable is continuous, you
cannot use bar chart, you cannot use pie chart. So the pie chart is used only for categorical
variable. That is for only count data.
(Refer Slide Time: 27:31)

72
The another one is the Multiple bar chart. This is another way to represent the data visually.
(Refer Slide Time: 27:39)

Another one is a simple pictogram.


(Refer Slide Time: 27:43)

73
See, this is the frequency table.
(Refer Slide Time: 27:25)

See, next one is frequency polygon. This figure is drawn from the previous table, which was
shown in the previous slide. So below 20 around 13,14. This represents frequency polygon.
When you connect the midpoint, you see that this is the. This is called frequency polygon. Then
the, this one is the cumulative frequency. It is not always, you cannot connect the midpoint, you
have to be very careful with the data is continuous, then only you can connect one this bar. The
data is not continuous, you cannot connect it.
(Refer Slide Time: 28:24)

74
Next one is a histogram .The histogram was constructed from the given table. You see.
(Refer Slide Time: 28:30)

The lower limit of the table values is going to in x axis. The frequency is shown in the y axis.
You see that this is data in continuous data. Okay, that was histogram. The purpose of histogram
is, the histogram will give you a rough idea what is the nature of the data whether, what kind of
distribution it follows. Whether it is following bell shaped curve, whether the data is skewed
right or skewed left.
(Refer Slide Time: 29:03)

75
Next one is the frequency polygon which I have shown you. If, the midpoint of histogram are
connected then there is called frequency polygon. Because, the frequency polygon is used to
know the trend.
(Refer Slide Time: 29:20)

Trend of the data. The next one is ogive curve. This is cumulative frequency curve .So what is
happening in the, for example 20- under 30, the upper limit of the interval is taken the x axis, the
cumulative frequency is taken in the y axis. For example, the first interval.20 - 30. So 30 the
upper interval is 6. For 40, upper interval is to 24, that is marked.

76
Because the advantage of this ogive curve is, supposed if we want to know below 16, how many
numbers are there, that can be read directly from the ogive curve. That is the purpose of ogive
curve.
(Refer Slide Time: 29:56)

Next one is the relative frequency curve. Exactly similar to that now actual frequency that
relative frequency was taken.
(Refer Slide Time: 30:08)

Okay. The next way to represent the data using pareto chart. The Pareto chart is having some
applications in quality control also. This is to identify which is more important, important
variable. Assume that, if you look at this Pareto chart. There are 3 axes one is frequency. In x

77
axis, different name is given poor wiring, short in coil, defective plug, other. You see there is
one more variable in terms of percentage.

For example, I am a quality control engineer, suppose my motor is failing so often. I want to
know there are different reason for failing of the motor. I want to know what are the main
reasons, due to which the motor fails. So what I have done. First I have go to frequency table,
that is due to poor wiring, the motor was falling for failing 40 times, frequencies 40. Due to short
in coil, the motor was failed 30 times.

Due to defective in plug, the motor was failed 25 times. Due to some other reasons the motor
was failed by say below 10 times. So the first technique is for drawing this one, we have to
arrange in the descending order of their frequency. So in x axis that values are taken. Then the
cumulative frequencies plotted on the, this axis. For example, how to interpret this table is. You
see, here this value corresponding this only 70.

So 70 % of the failure is due to only two reasons, that is poor wiring and short circuiting. So
what is the meaning of this one is, if you are able to address these 2 problems, 70% of the
failures can be eliminated. So the purpose of a Pareto chart is, to identify which is critical for us.
Generally it is called 80-20 principle. This is called the Pareto principle .That is 80% of the
problems are due to 20% of the reasons.

So similarly here, when you look at this, the cell, here need not always 80, see the 70% the
failures, only due to 2 factors that is due to poor wiring and short coil. So this is the pareto
chart.
(Refer Slide Time: 32:33)

78
The next one is scatter plot. The scatter plot is so far what ever seen only for one variable, the
scatter plot is used for two variable. In x axis registered vehicle, y axis the gasoline sales. So this
says the scatter plot says, when the number of registered vehicle is increasing the gasoline sales
is also increasing. So the scatter plot is used to know the trend out the data.
(Refer Slide Time: 32:59)

Some of the basic principle for excellent graph. One is the graph should not distort the data. The
graph should be very simple. It should not contain unnecessary adornments. So, so much
decoration in the graph is not required, the scale on the vertical axis should begin at 0. All axes
should be properly labeled. Weather should be x axis or y axis, it has to be properly labeled. The

79
graph should contain a title. The simplest possible graph should be used for given set of data.
These are the basic principle of excellent graph.
(Refer Slide Time: 33:39)

See when you look at this one. The left hand side it is a bad representation of the graph. What is
happening lot of animations, unnecessary pictures. The right hand side, it is a simple graph x axis
is taken as year, in y axis it has taken the wage. So it is showing some trend. But when you look
on the left hand side it is not giving any idea. What is happening year with respect to wage.
(Refer Slide Time: 34:04)

Another one you look at the left side picture and right side picture. Both are the same data. But
what is happening. When here in the left side picture the scale is 0 to 100, here it is 0 to 25 just

80
by changing the scale, we are able to get different interpretation. You see that when the when the
scale is increased. It looked like flat. If you are drawing in smaller scale. You see that look like
there's a lot of variations. So what is the learning is that we are to use proper scale to draw the
picture.
(Refer Slide Time: 34:40)

The next one is the graphical error, no 0 point on the vertical axis. When you look at the left side
of the figure January, February, March, April, May, June, the month is given in x axis. Monthly
sales is given y axis. But the problem on the left hand side is it did not start from 0. The right
side is you see that the small Brake is given. So, that, even though, 0 to 36 there is no data, you
have to make a small break like this. So that, we can come to know it start from 0.

So this is the right hand side is the right way of drawing the graph. This is the basic requirement.
In this lecture, what you have seen, how to access particular rows and columns by using basic
commands. Then we have seen the different visualization techniques, different theories of the
visualization technique. The next class will take in some sample data. By using the sample data
with the help of the sample data will try to visualize the data.

By having different tools like a pie chart, bar chart, pictogram, Pareto chartor simple graph.
Thank you, we will see you next class.

81
82
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee

Lecture No 4
Central Tendency and Dispersion

Good morning students, today we are going to the lecture 4. In this lecture we are going to
talk about central tendency and how to measure the dispersions. The lecture objectives we
talk about different types of central tendency.
(Refer Slide Time: 00:42)

Then different types of dispersions


(Refer Slide Time: 00:44)

83
What is measure of central tendency? Measure of central tendency yield information about
particular places or locations in a group of numbers. Suppose there are a group of number is
there that number group of numbers has to be replaced by a single number that single number
we can call it as central tendency. That is a single number to describe the characteristics of a
set of data.
(Refer Slide Time: 01:08)

Some of the central tendency which we are going to see in this lecture is arithmetic mean,
weighted mean, median, and percentile. In the dispersions we are going to talk about
skewness, kurtosis, range, interquartile range, variance, standard score and coefficient of
variation.
(Refer Slide Time: 01:25)

84
First we will see the first central tendency arithmetic mean. Commonly it is called as the
mean it is the average of a group of numbers; it is applicable for interval and ratio data. This
point is very important it is not applicable for nominal and ordinal data. It is affected by each
value in the data set including extreme values, one of the problem of the with the mean is that
it is affected by the extreme values computed by summing all values in the data set and
dividing the sum by the number of values in the data set.
(Refer Slide Time: 02:01)

See here I have used a notation µ, µ means capital letters µ represents, mean for the

population. The formula µ =


= (X 1 + X 2 + X 3 +..+XN )/ N
here N is the number of elements. For example; the values are 24, 13, 19, 26, 11 add these
numbers and divided by 5 because there are 5 elements. So 93 / 5, 18.6 is the mean of these 5
numbers. So now the 18.6 can be replaced by these set 5 numbers. Okay?

Suppose in your class if you see the average mark is 60. So the whole marks of all the
students can be represented to be a single number that is 60, 60 will give an idea about the
performance of the whole class.
(Refer Slide Time: 02:55)

85
Next what is the sample mean? Make sure that the difference here the X bar. Previously for
the population mean we have used µ for the sample mean we are using X bar.

X bar =
= X 1 + X 2 + X3 / n.
For example; 6 element is there 57, 86, 42, 38, 19, 66 so divided by 6, the mean is 63.167
(Refer Slide Time: 03:19)

Now how to find out the mean of a grouped data? The mean of your grouped data is nothing
but weighted average of class midpoints, class frequencies are the weight. For the formula is

µ=
So Sigma f is nothing but

86
= (f1 m1 + f2 m2 + f3 m3 and so on + fi mi )/ sum of all f.
That is nothing but your N. We will see you an example;
(Refer Slide Time: 03:58)

See this is the grouped data. What is given class interval is given frequency is given class
midpoint is given and multiplied value of frequency and midpoint also we can find out. For
example; see here 20 to 30 there are 6 numbers is their frequency 6. Suppose if you say the
marks of here if you say this this is an example of here marks obtained by in your class. So
between 20 and 30 there are 6 students is there. Between 30 under 40 there are 18 students is
there.

Suppose for this the data is in this format that is in grouped format how to find out the mean?
Okay? First what do you have to do first you have to find out the class midpoint. See 20 to 30
that is a class interval the midpoint is 25, for 30 and 40. The class midpoint is the middle
value 35 like this 45, 55, 65, and 75. Next one you have two multiplied by frequency and
class midpoint so 6 into 25 is 150, 18 into 35 is 630, 11 into 45 is 495 and so on.

What the formula says it is last column the sum value is 2150, 2150/50
Sigma f is some of the frequency so for this kind of grouped data the mean is 43.
(Refer Slide Time: 05:29)

87
Now we will go to the next central tendency that is the weighted average. Some time if you
look at the previous values, the each value is given equal weightage. Suppose it is not always
the case there may be some marks there some values where there may be higher weightage.
So for that case we have to go for weighted average. Some time you see this we will list two
average numbers but we want to assign more importance or weight to some of the numbers.
The average you need is the weighted average.
(Refer Slide Time: 05:58)

So the weighted average is sum of the product of weightage and that value/sum of values.
Where x is the data value and w is the weight assigned to that data value. The sum is taken
over all data values.
(Refer Slide Time: 06:20)

88
We will see one application of this weighted average. Suppose your midterm test score is 83
and your final exam is score is 95 using weights of 40% is for the midterm and 60% is for the
final exam compute the weighted average of your scores if the minimum average for an A
grade is 90 will you earn an A grade. So first we find out the weighted average so the mark is
83 weights age is 40% for midterm for interim your mark is 95 weightage is 60 %.

So multiply that then divided by some of the weight that = 0.4 + 0. 6 = 1, so 90.2. So if you
are above 90 you will get A, because you are crossing 90 obviously you will get the A grade.
(Refer Slide Time: 07:12)

Now we will go to the next central tendency Median, the middle value in ordered array of
number is called Median. It is applicable for ordinal interval and ratio data. You see
previously the mean is applicable only for interval and ratio data but the median is applicable

89
for ordinal data. There is a point has to be remembered and it is not applicable for nominal
data and one advantage of median is it is unaffected by extremely large and extremely small
values.
(Refer Slide Time: 07:46)

Next we will see how to compute median there are 2 procedures, first procedure is arrange
the observations in an ordered array. If there is an odd number of a term the median is the
middle term of the ordered array. If there is the even number of terms the median is the
average of middle two terms. Another procedure is the medians position of an ordered array
is given by n + 1 / 2, n is the number of data set.
(Refer Slide Time: 08:15)

We will see this example; I have taken some exam some numbers that is arranged in an order
that is an ascending order 3, 4, 5, 7 up to 22. There are 17 terms in the ordered array the

90
position of the median is, with respect to previous let n + 1 / 2. So n + 1 / 2 = (17 +1)/2 = 18 /
2 = 9. So the median is the 9th term, 9th term here is 15. If see the 22 which is the highest
number is replaced by 100 still the median is 15.

See if the 3 is replaced by -103 still the median is 15. So there is the advantage of this median
over mean is median is not disturbed by extreme values.
(Refer Slide Time: 08:59)

Previously the number of items are odd now let us see the another situation; there are 16
terms in the ordered array there is an even number, the position of the median is n + 1 / 2 that
is 16 + 1 / 2 is 8.5. So we have to look at the term where it is the position of 8.5. That is the
median is between 8th and the 9th term here the 8th term is 14, 9 th term is 15 so average of
that one is 14.5. Again, if the 21 is replaced by 100 the median is same 14.5, if the 3 is
replaced by - 88 still the median is 14.5.
(Refer Slide Time: 09:42)

91
Now let us see how to find out the median of your grouped data but it will be grouped data,
here if the data is given in the form of a frequency table. This case the formula to find out the

median of a group data is median = . Where, L is the lower limit of the


median class before using this formula from the given table you have to find out what is the
median class.

Then cfp = cumulative frequency of the class preceding the median class f median, the
frequency of the median class, W is the width of the median class; N is the total number of
frequencies.
(Refer Slide Time: 10:26)

92
See this is an example; as I told you before using this formula first order to find out the
median class. What is the median class is when you add the frequency it is a 50. 6 + 18 + 11
+ 11 + 3 + 1 is 50. So divide this 50 / 2 it is a 25. In the community frequency column in the
last column look at where that 25 is lying it is not between 30 and 40; it is going to lie on
between 40 and 50 because 24 for the next term is 35.

So the median class for this given group data is 40 and 50. So as usual L, is the lower limit of
the median class that is a 40 + N is 50. You see the cumulative frequency of the preceding
interval is 24. So,
Md = 40+ ((50/2) – 24) x10 /11
because the width interval is 10. When you simplify you would get 40.909, so this is the way
to find out the median of your grouped data.
(Refer Slide Time: 11:45)

Now mode the most frequently occurring value in a data set is mode applicable to all level of
data, measurement nominal, ordinal, interval and ratio. Sometimes there is a possibility the
data set may be bimodal. Bimodal means data sets that have two modes. That means two
numbers are repeated same number of time multimodal data sets that contain more than two
modes.
(Refer Slide Time: 12:12)

93
See this one sample data as it is given for this data set the mode is 44, because the 44 is
appearing more number of time. How many number of time 1, 2, 3, 4, 5. Okay? So the mode
is 44. That is there are more 44s than any other values.
(Refer Slide Time: 12:37)

That is the formula for finding mode of a grouped data. Here first we have to find out the
mode class. For that look at the frequency column there 18 is the highest frequency. So
corresponding the n class interval is called mode interval. Okay? The mode interval L Mo is
the lower limit of that mode interval is = 30 + (12/(12+7))x10

And d2 is difference between 18 and 11.

94
See 30 + see d1 is nothing but 18 is the mode interval and then the previous frequency is 6, so
18 - 6 is 12 / d1 is 12 + d2 is the difference between your 18 and 11 that is 7. So 12 + 7
multiplied by width is 10, so 36.31 is the mode of your grouped data. Yes? We have studied
mean, median, mode for group data and ungrouped data. Now the question is when to use
mean? When to use median? When to use mode? Okay?

Many time even though we study mean, median, mode we are not exactly told how to use or
when to use mean or when to use median or when to use mode?
(Refer Slide Time: 14:00)

For example look at this data set, this is left skewed data because the tail is on the left hand
side. The example for this is suppose, say the exam is very easy question paper and the x axis
is the marks and y axis is frequency. So there is more number of students who got higher
marks .Where the question paper is easy situation this is an example of left skewed data. So
what will happen here, here will be mean here will be median this will be mode.

You see another example; where the question paper is very tough. So this is called right
skewed data. You know here what is happening how we are saying that since the question
paper is very tough. There are more number of students who got the lesser marks that is why
the skewness on this side. So here there will be mean here will be median this will be mode
Okay? There will be another situation it is symmetric it is a bell looking at bill shaped curve
in this situation.

95
Now after looking at this hypothetical problem now the question arises when to use mean,
when to use median, mode look at the location of the median. The median is always in the
middle. Whether the data is left skewed or right skewed the median is always the middle. So
whenever the data is skewed you should go for median as a central tendency. If your data is
following a bell-shaped curve then you can use mean, median, mode.

There is no problem at all the clue for that choosing the correct central is first you have to
plot that curve go to plot the data outer plotting the data you have to get an idea of the
skewness of the data set. How it is distributed? Whether it is right skewed or left skewed or it
is bell shaped curve. If it is skewed data you go for median as the center tendency. If it is
following a bell-shaped curve you go for mean or median or mode as a central tendency.
(Refer Slide Time: 16:39)

Now you go to next one is a Percentile, mainly this you might have seen some of the cat
examination scores or gate examination scores their performance is expressed in terms of
percentile not the percentage because percentile is having some advantage over percentage
because percentage is absolute term but the percentile is the relative term the measure of
central tendency that divide a group of data into 100 parts it is called percentile.

For example; somebody say 90th percentile my score is 90th percentile indicates that at most
90% of the data lie below it and at least 10% the data lie above it. Okay? The median and the
50th percentile have the same value. It is applicable for ordinal, interval and ratio data it is
not applicable for nominal data.
(Refer Slide Time: 17:44)

96
Okay we will see an example how to compute a percentile the first step is organize the data
into an ascending ordered array calculate the pth percentile location. Suppose if I want to
know 30th percent location for that you have to find out the value i, i = (P / 100) multiplied
by n, n is the number of data set, the i is nothing but the percentiles location we got to find
out the i value.

If i is a whole number the percentage is the average of the values at the i and i + 1 positions.
If i is not a whole number the percentile is the i + 1 position in the ordered array.
(Refer Slide Time: 18:35)

Look at this example the raw data is given 14, 12, and 19 up to 17. I have arranged in the
ascending order the lowest value is 5, the highest value is 28. Suppose I want to know 30th
percentile for knowing the 30th percentile, first I have to find out i that is a (30 / 100 )

97
multiplied by 8 = 2.4. The i is nothing but location index as I explained the previous slide, i
is not the whole number. So you have to add i + 1, so 2. 4 + 1 = 3. 4.

In the 3.4 the whole number portion is 3 rights? So the 30th percentile is at the 3rd location of
an array. When you look at the 3rd location is 13, that means a person who scored 13 marks
his corresponding percentile is 30.
(Refer Slide Time: 19:26)

So far we talked about these different central tendencies will go for differing. Now we are
going for measuring dispersion measures of variability describes the spread or the dispersion
of the set of the data. The reliability of measure of central tendency is the dispersion because
many times, the central tendency will mislead the people. So the reliability of that central
tendency is calculated by or identified by its corresponding dispersion.

It is used to compare dispersion of various samples that is why whenever you plot the data
you not only show the mean you have to show the central tendency also because the
reliability of mean is explained by dispersion.
(Refer Slide Time: 20:14)

98
You look at this data, when you see the first two rows is no variability in cash flow mean is
same. The second one is variability in cash flow see there is a lot of variability in the second
one but the mean is same. If you look at only the mean it look like same when you look at
only the mean the mean value same but when you look at see the left hand side the second
dataset is having more variability. The quality of the mean is explained by its variability that
is nothing but dispersion.
(Refer Slide Time: 20:51)

There are different measures to measure the variability one is the range, inter-quartile range,
mean, absolute deviation, variance, standard deviation, z scores and coefficient of variations.
We will see one by one.
(Refer Slide Time: 21:07)

99
Suppose there is ungroup of data is there see this one you have to find out the range. The
range is nothing but the difference between the largest and the smallest value in a set of data.
It is very simple to compute. The problem here is it ignores all data points except the two
extremes. So the range is the largest value is 48 in this data set the smallest value is 35. 48-
35 = 13 you see that only the two values are taken care in between the values is not taken into
consideration for finding the range.
(Refer Slide Time: 21:46)

It is a quick estimate to measure the dispersion of a set of data. I will go for a quartile;
quartile measures the central tendency that divided group of data into 4 subgroups. We say Q
1, Q 2, Q 3. Q 1 is nothing but 25 % of the data set is below the first quartile. Q2, 50 % of the
data set is below the second quartile. Q3, 75 % the data set is below the third quartile. So we
can say Q 1 is the 25th percentile Q 2 is the 50th the percentile nothing but the median.

100
This is a very important point; Q2 is nothing but the median. Q3 is the 75th percentile and
another point is the quartile values are not necessarily members of the data set.
(Refer Slide Time: 22:34)

You see this lets say Q1, Q2, Q3. So Q1 see first 25 % the data set, Q2 first 50 % of the data
set Q3, 75 % of the data set Okay? It is nothing but the quartile is used to divide the whole
data set into 4 groups first 25, second 25, third 25 and last 25.
(Refer Slide Time: 22:59)

Suppose an example for finding the quartile, suppose the data is given 106, 109 and so on.
Okay? We have arranged it in the ascending order. First we got to find out the Q 1, Q 1 as I
told you but the 25th percentile so the location of the 25th percentile. First you have to find
out the location index i for the 25/ 100 x 8 = 2. Since the 2 is the even number. As I explained

101
previously if it is the location takes it 2 you have to find out that position plus the next
position and its average.

So in the second positions data set is 109 + 114 / 2 = 111.5. So the Q1 is nothing but here
111.5, Q 2 is 50th percentile 50 / 100 x 8 = 4, again the 4 is the even number. So the 4th
location is 1, 2, 3, 4th location is 116 and 5th the location is 121 so, 116+121 / 2 =118.5. So
the Q2 our median is 118.5. Then Q3, 75 /100 x 8 = 6, 6 is the even number the average of
6th and 7th values are 122 + 125 / 2. = 123.5.
(Refer Slide Time: 24:28)

This is the way to calculate Q 1, Q 2, and Q 3. Now the next term is interquartile range .So
the dispersions in the data set is measured with help of interquartile range by using this
formula Q3 - Q1. As we know Q 3 is 75th percentile Q1 is the 25th percentile so range of
values between the first and third quartile is called interquartile range. It is a range of middle
of .Why we are using quartile range because it is the less influenced by extreme values.

Because when we collect the data set we are not going to consider at very low values at the
same time very high values. So the middle values which is not affected by extremes that is
taken for further calculation .For that purpose we are using interquartile range.
(Refer Slide Time: 25:15)

102
There is a Q3, now we will go for deviation from the mean so dataset is the given 5, 9, 16, 17,
and 18. To find the deviation from the mean first to find the mean, mean is 13. Suppose there
is a graph is there so this is the 13 Okay? See the first value 5 the difference is 5 - 13 = - 8. So
this distance is your first deviation the second data is 9. So 9 - 13 = - 4 this is - 4. So this
deviation is expressed by these lines.

Look at that there is a negative deviation there is a positive deviation. Suppose if we want to
add the deviation general it will become 0. That is why we should go for mean absolute
deviation.
(Refer Slide Time: 26:12)

You see this X is given here there are 2 values are negative deviation 3 are three values are
positive deviation. When you add this it is becoming 0 so it seems we are getting 0 we cannot

103
measure the dispersion. One way is we have to remove this negatives you take only positive
value. When you take positive values 24 so = 24 / 5 there are 5 data set,
= 4.8 is called mean absolute deviation. It is the average of absolute deviations from the
mean.
(Refer Slide Time: 26:46)

There was a problem in the mean absolute deviation I will tell you what is the problem there,
see the next we will see population variance it is not the average of the squared deviation
from the arithmetic mean. Okay? So the X is there, mean is there so when you add the
absolute even digit is 0, one way the previously the mean absolute deviation you take an only
positive value. Now we are going to square it, the squaring of the deviation having some
advantage.

One advantage is we can remove the negative sign second one is but the deviation is less
when you square it. For example; - 4 square is 16 see - 8 squared is 64. So what is happening

more the deviation more this squared value. Okay? So


= 130/5
here we are squaring the purpose why we are squaring for there are two reason one is to
remove the negative sign, the second reason is giving higher penalty for higher deviation
values

104
The next one is the population standard deviation because already there is a variance but
variance is a squared number that we cannot compare. Suppose the two numbers are given
say 12 and 13 that is easy intuitively we can say which is higher which is smaller. Suppose
124, 169 is given notice squared number. We cannot compare intuitively and not only that it
is in the square root of squared term.

We want to have it in the actual term so for comparison purpose for that purpose we are
taking square root of that.
(Refer Slide Time: 28:25)

So 5.1 is the standard deviation next we will go to the sample variance the formula is same
but only thing is it is divided by n - 1.
(Refer Slide Time: 28:37)

105
Why we are dividing by n -1, the reason is that to make the variance as the unbiased
estimator. This is due to degrees of freedom since we already know the value of the mean
will last one degrees of freedom. That we are dividing by n – 1 so it is very important
whenever you find the sample variance so the in the denominator there should be a n - 1. So
here the variance is 221,288.67.
(Refer Slide Time: 29:06)

This is another sample standard deviation; just to take the square root of that it is a 470.41 so
a square root of the variance is nothing but standard deviation.
(Refer Slide Time: 29:17)

Now the purpose is why we have to study the standard deviation because the standard
deviation is giving an indicator of financial risk. Higher the standard deviation is more risk
lesser the standard deviation less at risk. In quality control context generally when we

106
manufacture something suppose here plant A and plant B or shift A and shift B whenever the
variances in lesser then the quality of the product is high.

The process capability also they should have the lesser variance means in the process
capabilities high. Then I suppose therefore comparing the populations household income of 2
cities, employee absenteeism in 2 plants for these purposes, it is for comparing the population
that means wherever there is a lesser standard deviation so that is having higher
homogeneous data set.
(Refer Slide Time: 30:12)

You see look at this one µ and σ, see this is a financial security A and B. See the return rate is
15, 15 it is both are giving equal return but look at this the σ standard deviation because in
financial context it is, it is measured as the risk. So the first one is 3% second with 7% so the
security B having a higher risk, so always we will go for where there is a lesser standard
deviation because mean is same.

We are the same time the risk all should be same.


So far we have seen different central tendencies, different dispersions. In the coming class
will use Python will take some sample data set. I will explain you how to find out central
tendency and the dispersion of the given data set. Thank you very much.

107
Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology, Roorkee

Lecture No 5
Central Tendency and Dispersion

Good morning students; today we are going to the lecture number 5, Central tendency and
dispersion will continue what have stopped, from the previous lecture, what we are going to see
today is, one important property of a normal distribution. And second, we will see various
kourtosis. Then second we see the box and whisker plots that has a different way of measuring
the dispersion.
(Refer Slide Time 00:51)

See this empirical rule, if the histogram is Bell Shaped. Look at this normal distribution, Bell
shaped curve. This yellow line says that, from the mean if you are traveling either side on 1
sigma distance. You can cover 68 % of all observations. Come to the second one, from the
means 0. If, you travelling 2 sigma distance on the either side. You can cover 95% of all
observations.

The third one, if you are travel 3 sigma distance on either side from the mean of a normal
distribution. You can cover 99.7 % of all observations. This is very important empirical rule.

108
Even through you study in detail about the normal distribution and it is the properties in the
coming lectures, I wanted to say this idea may be very useful in coming lectures, because we can
say the normal distribution is the father of all the distributions.

Because if you have any doubt on nature of the distributions. If you are not really sure about
what distribution the data follow, you can assume normal distribution. But there is a limitation of
this Empirical Rule. It is applicable only for the Bell shaped curves. There may be a situation, the
actual phenomena need not follow bell shaped curve. At that time this formula that is the
empirical rule will not work. So we will go to another rule, in the next slides, the same thing.
(Refer Slide Time 02:24)

What I given the previous slide, see µ ± σ. You can cover 68% of the all observations. µ ± 2σ,
you can cover 95 % of the observations. µ ± 3σ, you can cover 99.7 % of all observations.
Actually this 1, 2, 3 is nothing but Z. I will tell you incoming classes, what is Z.
(Refer Slide Time 02:57)

109
The previously we have seen that the properties of normal distribution, that is a bell shaped
curve. Sometimes certain phenomenon need not follow the bell shaped curve. That time, you
cannot use that property which he studied previously; you had to go for another formula for to
find out how much observations are covering under 1σ, 2 σ and 3 σ distance. This idea was given
by Chebysheff’s. It is called Chebysheff’s theorem.

Yeah, more general interpretation of the standard deviation is derived from Chebysheff’s
theorem, which applies to all shape of the histogram, not just to bell shape. Previously we see,
what is totally for bell shaped, but here it is apply it is applicable to all shape of the histogram,
even the distribution can follow any shape. So the proportion of observation in any sample that
lie within k standard deviation of the mean is at least: 1- (1/ k2 ).

So, how to read this formula is, suppose a phenomenon which is not following normal
distribution. If you want to know the 2 σ distances, how much percentage of observation can be
covered? So when you substitute here 1- (1/ k 2 )= 1-1/4 = 3/4. So, 3/4 means 75 %. So if you are
travel 2 σ distance on the either side. For a distribution which is not following normal
distribution. You can cover 75 % of all observations.

You see the previously; It is 95 %. So you see that that is a given. For k equal to 2, the theorem
states that, at least 3/4th of all observations lie within two standard deviation of the mean. This is

110
lower bound compared to empirical rule approximation 95 %. In case the previous slide, if it is 2.
We can cover 95% of all observations, but here we can cover only 75% of all observations.
Sometime we can use Chebysheff’s theorem also; the data is not following normal distribution.
(Refer Slide Time 05:14)

These two properties in coming classes many times we will refer that 1 sigma, 2 sigma, 3 sigma.
Just like that I want to give an idea about the normal distribution, but we will go in detail later.
The next way to measure the dispersion is coefficient of variation. The ratio of standard
deviation to the mean, express as a percentage. So coefficient of variation is your sigma by mu, it
is the measurement of relative dispersion. Already there is a standard deviation is there.
What is the purpose of this coefficient of variations that will see the next slide?
(Refer Slide Time 05:53)

111
Look at this, you see, stock A, stock B or stock 1,stock 2 the µ1 = 29, σ1 = 4.6. Another one, µ2
= 84, σ 2 = 10, supposed to choose which is better. If you compare only the mean, 29 verses 84,
the second stock is better. Suppose if you compare the standard deviation, 4.6 and 10. The lower
the standard deviation better it is. So the stock 1 is better. Now there is a contradiction, with
respect to mean option B is better with represent standard deviation option A is better.

Now there is a contradiction to need to have the trade off. In this situation we have to go for this
coefficient of variation, coefficient of variation = σ /µ. For example, for this case; = (4.6 / 29)x
100 = 15.86, for second case = σ2 /µ2 = (10 /84 )x 100 = 11.90. Lower the coefficient of
variation, but have the option is. So, if the variance is smaller, to be able to choose that group, or
that stock.
(Refer Slide Time 07:18)

112
Now we will see how to find the variance and standard deviation of the grouped data. Already
we have seen the standard deviation of the raw data also named as ungrouped data. We are
seeing the standard deviation for population, standard deviation for sample. Similarly, we will

find the variance for a grouped data. So the formula is:


Here, f is the frequency, M is the midpoint of that interval, mu is the mean of the interval, N is
some of all frequencies.

For the sample variants, look at the formula here it is also n-1,

(Refer Slide Time 08:03)

113
So we will see this example now look at this, there with the grouped data is given; but the
problem is given in the form of table. So, see these are the interval. These are frequencies. So 20
to 30, there are six values are there. Between 30 and 40 there are 18 values is there, 40 and
50,11 is there 50 and 60, 11 is there. 60 and 70, 3 is there, and so on. So first 1 we are to find out
the midpoint of the interval between 20 and 30, the midpoint is 25, 35, Next interval 45, 55, 65,
75. Now, you multiply this f and the midpoint of the interval. So, 150,630, and so on.

We need this then you can find M- µ, M is 25- 43- 18, 35- 43-8 and so on. Then you

square it, then the squared value you multiply by f, you are getting this . When you

add it, that is going to be 7200, . There is nothing but 50. So 144 is the population
variants of this grouped data. If you want to know the standard deviation of this group of data,
just to take the square root of the variance, that is 12.

(Refer Slide Time 09:30)

114
The next measure is shape of the, we can say a set of data. That is shape or distribution, what
distribution follows. We can see the skewness, kurtosis, box and whisker plots there are the three
method. So we will see what is a skewness, skewness is the absence of symmetry. As I told you
it may be this is left skewed data. This is a right skewed data. This is symmetric data. So this
absence of symmetry, this and this can be done with helps skewness.

So the other one the application of skewness is to find out what is the nature of this shape,
whether it is skewed or symmetric. Next one is a kurtosis; it is the peakedness of a distribution.
There are three layers, there are Leptokurtic, Mesokurtic, Platykurtic. Leptokurtic means high
and thin, Mesokurtic is little flat in this way and Platykurtic very very very very flat this way flat
and spread out.

Then we can see, box and whisker plots. It is a graphical display of distribution. It reveals
skewness. The application of boxer whisker plot is to check whether the data, follow a symmetry
or what is the nature of the skewness of the distribution.
(Refer Slide Time 10:58)

115
See the skewness left one which is an orange color it is the negatively skewed. As I told you the
previous lecture skewness is how it is named is looking at the tail. the tail is on the left hand side
so it is a left skewed or negatively skewed. Come to the blue one, the extreme right. It is this tail
is on the right hand side. So it is a right skewed or positively skewed one, the middle one there is
no skewness, so it is symmetric.
(Refer Slide Time 11:29)

The skewness of a distribution is measured by comparing the relative position of the mean,
median, and mode. If the distribution is symmetrical, we can say mean equal to median equal to
mode. The distribution is skewed right, the median lies between mode and mean, and the mode

116
is less than mean. The distribution skewed right means this way. So this will be our mean, this
will be our median, this will be mode.

Look at this, Median lies between mode and mean. The mode is less than the mean because
mode is less than the mean. The same thing the distribution is skewed left. This is the case. So
the mean position of mean will be here. Median position of the mode, the median is lies between
mode and mean. And the mode is greater than mean. As I told you, whenever, if you want to
know the central tendency of your distribution you to check the nature of the distribution.

If it is skewed right or left, you should go for median, because the median always middle of the
distribution irrespective of your skewness.
(Refer Slide Time 12:58)

The same thing, what are explained the previous one. Mean, see negatively skewed the position
of mean is here, median, mode. Positively skewed to the position of means here, median is here,
mode is here. There is no skewness in middle one.
(Refer Slide Time 13:11)

117
How to find coefficient of Skewness? The summary measure for Skewness can be measured as:

If S is negative, it is negatively skewed. If S equal to 0. It is symmetric. If S is greater than 0, the


distribution is positively skewed.
(Refer Slide Time 13:35)

You will see an example. µ1 is 23, median1 is 26, σ1 is 12.3, and you apply this formula, = 3 x
(23- 26 )/ 12.3 we are getting negative, so it is a negatively skewed. Go to the middle one µ2
equal to 26, median2 equal to 26, so 26 - 26 = 0. So S2 equal to 0. For this distribution the

118
skewness is 0 or it is symmetric. The right one µ3 equal to 29, median is 26, σ3 is 12.3, and you
substitute here we are getting positive value for S3. So the skewness is positive.
(Refer Slide Time 14:20)

The kurtosis, as I told you, it is a peakedness of a distribution, when they say Leptokurtic,
leptokurtic, this one. So it is high and thin, if that means highly homogeneous distribution, the
things are very close. This is second one is the Mesokurtic, it is normal shape. The last one is
Platykurtic, flat and spread out.
(Refer Slide Time 14:48)

119
The next we will go to box and whisker plot. There are five positions in the box and whisker
plot. One is median Q2, first quartile Q1, third quarter Q3. The next word is the minimum value
in the data set, maximum value in the data set.
(Refer Slide Time 15:05)

We will see this one here, this one. So, this point is would minimum value in this box is a Q1 is
the quartile one, quartile two, quartile three maximum, why its called box and whisker plot. The
whisker is look like a whisker of a cat. So it is a box and whisker plot.
(Refer Slide Time 15:25)

You see how the skewness can be measured or identified with the help of box and whisker plot.
So by looking at the position of this middle this line, we can identify the distribution. What is

120
that, if it is on the right side of this box it is left skewed data. If it is a left side it is the right
skewed data. If it is exactly on the middle, it is symmetric which follow normal distribution. So
far, we have given some kind of theories about this various central tendencies and dispersion.

Now, I'm going to switch over to Python. So whatever we have done. The theory portion
whatever we are taught here, so I am going to use Python. I am going to explain how to use
Python to get central tendencies, skewness, box and whisker plot and various dispersion
techniques with the help of Python. So we will go to the Python mode.
(Video Starts: 16:26)
Okay, now we will come to the Python environment, the first, as I told you we are to import
pandas as pd. We can do this pd is it is for only our convenient fantasies and library. Then they
are to the import Numpy numerical Python, as the np. Okay, so the first one is to import the
required libraries. The second one is going to import the data set. So the data set, already I know
the part of the data set is the otherwise the name of the data set is IBM underscore 313
marks.xlsx.

So, I am going to save the object called Table. Table equal to pd., this is the command. This
read_excel is the command for the reading the Excel file. The path is this, ‘IBM-313’ otherwise;
simply you can type it there. Now print table, let us see what is the data. See, look at this, serial
number is there, MTE that is a midterm examination marks, mini project, total, end term
examination marks and total marks.

Okay, this total is out of 100, this total is out of 50. You see it is starting from 0,1,2,3. Okay, now
I want to find out the mean of the total that is in the end term examinations. So, x = table, the
object either be a square bracket. There you have to write the column name. So that means I am
going to take only the column name, total, and I go to store that value into the variable x. Now
the x is nothing but by the last column, if you want to mean of that one. So, np.mean.

Otherwise, if you want to know np. If you press tab you will get various options to that np. tab.
See here, there are so many options there in that have there are maximum, minimum, the mean,
median, you need not remember also you can check it one by one. So now we will go to the

121
np.mean. Then, np.mean, then we will call that variable x executed, shift enter. So we are getting
this value 46.90 is the average marks.

Next is the median: np.median(x) gives a median of 45. Then we will go for the mode. For the mode you have to import SciPy: from scipy import stats; stats is another library. So stats.mode(x) is called on the variable x. What is the mode? The mode is the value with the highest frequency. Suppose five students got the same mark of 30; then the mode will be 30. Okay, we will come back to this later.
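
Putting these steps together, here is a minimal sketch; the file name and the column name 'Total' are assumptions for illustration, so substitute your actual path and column (and note that the exact fields of stats.mode's result vary across SciPy versions):

import pandas as pd
import numpy as np
from scipy import stats

# hypothetical file name and column name - use your actual path and column
table = pd.read_excel('IBM_313_marks.xlsx')
x = table['Total']

print(np.mean(x))     # average mark, 46.90 in the lecture's data
print(np.median(x))   # middle value, 45 here
print(stats.mode(x))  # most frequent mark and how often it occurs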

So, next we will go to the percentile. For the percentile, suppose I have taken an array; here we are introducing np.array: a = np.array([1, 2, 3, 4, 5]). Suppose we say p = np.percentile(a, 50). What do we want to know? The 50th percentile, that is, which value in this array is the 50th percentile. Execute this and print p: 3 is the 50th percentile, which is nothing but the median. This array is very small, just for illustration; you can take a larger array and run the same thing.
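
As a small runnable sketch of the percentile call, using the same toy array:

import numpy as np

a = np.array([1, 2, 3, 4, 5])
p = np.percentile(a, 50)   # value below which 50% of the data lies
print(p)                   # 3.0, the median of this array
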
Now we will go to another construct in Python, the for loop. Suppose I have taken a variable k holding three values: one is 'Ram', a string in quotes, then 65 and 2.5. Suppose I print k; what will happen? You see that it prints Ram, 65, 2.5 all together. But there is a requirement that I have to print them one by one.

First I have to print Ram, then 65, and then 2.5. Right now I am getting all the values at a time, but I want to print them one by one. For that purpose, see: for i in k. This is the syntax; there should be a colon, then print(i). What will happen? k is the collection: first, i will take the value Ram, second, i will take the value 65, and third, i will take the value 2.5.

Now if we execute this print(i), see: one by one I am getting the output. So this is an example of the for loop: for i in k, where k is the collection over which the variable i changes, and we print i. First it prints Ram, then 65, then 2.5. I am showing this because we are going to use the for loop in the coming examples, so I want to give an idea of how to use a for loop in Python.
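
A runnable sketch of this for loop, with the three values from the walkthrough:

k = ['Ram', 65, 2.5]   # one string, one integer, one float
print(k)               # prints the whole list at once: ['Ram', 65, 2.5]

for i in k:            # i takes each element of k in turn
    print(i)           # Ram, then 65, then 2.5, one per line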

Now we will go to range. So, for i in range(10, 20, 2): this is the range function. The first argument is the starting value, the second is the ending value, and 2 is the increment; then print(i). If we print that, you see what is happening: 10, 12, 14, 16, 18, incremented by 2, ending with 20 excluded. Now it is printed one value per line, but I want to print i with a comma, like 10, 12, 14. For that purpose, in the same print command I use end=',': the values should be separated by a comma. If I run this, what is happening? See 10, 12, 14, 16, 18 on one line; the end=',' is what gives the output in a horizontal way.
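
Both printing variants side by side, as a sketch:

for i in range(10, 20, 2):   # start 10, stop before 20, step 2
    print(i)                 # 10, 12, 14, 16, 18 on separate lines

for i in range(10, 20, 2):
    print(i, end=',')        # 10,12,14,16,18, all on one line
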
Now we will go to the next topic, functions in Python. Functions are very useful and appear in Python applications many times.

There are some built-in functions: for example, print is a built-in function, max is a built-in function, and min is a built-in function. You can also create your own functions and then call them wherever required. Suppose def greet(): that is the syntax, def, the name, open and close parentheses, ending with a colon, then print('Hi') and print('Good evening'). This is the way of defining your own function.

Okay, then, after defining the function, you have to call it. Suppose I call greet() and execute it; what happens? The function gets executed, so again Hi and Good evening are printed. Another example of a function: suppose I want to add two numbers. So def add(p, q):, the function name is add, with parameters p and q in parentheses (the names can be anything), and the colon is important; otherwise it will show a syntax error. Then c = p + q and print(c). That is my function. If I call the function as add(6, 4), I get 10; with other numbers, add(10, 4), I am getting 14. We have seen how to create a function.
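
Both functions from this walkthrough as a minimal sketch:

def greet():
    print('Hi')
    print('Good evening')

def add(p, q):
    c = p + q
    print(c)

greet()      # Hi / Good evening
add(6, 4)    # 10
add(10, 4)   # 14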

Now let us find the minimum and maximum values in a data set. Suppose I create a new array: data = [1, 3, 4, 463, 2, 3, 6], values I have taken randomly. Suppose I want to see the minimum value and the maximum value in this array.

So what is happening? The minimum value is 1 and the maximum value is 463; the commands for that are min(data) and max(data). Now, for this minimum and maximum, I can create my own function and then call it, so that I need not type min(data) and max(data) every time. Even though min and max are built-in functions, we can wrap them in our own function. So I have taken the same data: 1, 3, 4, ..., 463.

I am defining a function min_and_max(data): inside it, minimum_value = min(data) and maximum_value = max(data). Then I return them, because I want to get the output. Indentation is very important here: the body lines, starting with minimum_value, should all have the same indentation; generally we give a Tab, which stands for four spaces. So return minimum_value, maximum_value, and we run it and see what happens.

So I am getting the result because I called the function: min_and_max(data), since my function name is min_and_max, and it returns (1, 463). Functions are very useful in Python because, when you are writing a large program, you need not repeat routine steps yourself every time; you can call the function whenever it is required. It will save a lot of your time and energy.
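
The same wrapper function written out as a runnable sketch:

data = [1, 3, 4, 463, 2, 3, 6]

def min_and_max(data):
    minimum_value = min(data)   # smallest element
    maximum_value = max(data)   # largest element
    return minimum_value, maximum_value

print(min_and_max(data))        # (1, 463)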

So now, suppose I want to know the range of the data; the range is nothing but the maximum value minus the minimum value. For that, I am going to define a function named rangef (you can give any name), taking the data as its argument. Inside it I am finding:

def rangef(data):
    minimum_value = min(data)
    maximum_value = max(data)
    return (maximum_value - minimum_value)

So if I call that function, rangef(data), what happens? I am getting 462, which is nothing but 463 - 1.
Now we will go to quartiles. We have seen the quartiles already: Q1, Q2, Q3. Q1 is the 25th percentile, and there is an inbuilt function in NumPy for it. So I am creating an array, a = np.array([1, 2, 3, 4, 5]), and Q1 = np.percentile(a, 25); np.percentile gives you the percentile. Suppose you want the 25th percentile: I can get which value in this array is the 25th percentile. If I execute this, the value 2 is the 25th percentile. The same thing with np.percentile(a, 50), which I am going to call Q2: it is 3, in other words the median. Look at this: because the array has an odd number of elements, the middle value 3 is obviously the 50th percentile. I will go for the third one: Q3 = 4, which is the 75th percentile.

Next, spread is measured in terms of the interquartile range (IQR). As we know already, it is Q3 - Q1. So if we compute IQR = Q3 - Q1 here, we get 2 as the interquartile range.
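
The quartile and IQR computations gathered into one sketch:

import numpy as np

a = np.array([1, 2, 3, 4, 5])
q1 = np.percentile(a, 25)   # first quartile: 2.0
q2 = np.percentile(a, 50)   # median: 3.0
q3 = np.percentile(a, 75)   # third quartile: 4.0
print(q1, q2, q3, q3 - q1)  # IQR = Q3 - Q1 = 2.0
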
Now we will see how to find the variance. There are two types of variance: the variance of a sample and the variance of the population. In NumPy, np.var(x) gives the population variance. So what is x? x is the total column.

That total column we saved under the object, or variable, called x. So we see the variance is 262.781. There is another library we can import, statistics. So statistics.pstdev(x), that is, the population standard deviation of x, gives 16.2105; that is the standard deviation. If you want the standard deviation of the sample, use statistics.stdev(x).

The only thing is, if you want the population standard deviation in the statistics library, you should write the p there, as in pstdev; otherwise, by default, statistics.stdev gives you only the sample standard deviation. pstdev is the standard deviation for the population; stdev is the standard deviation for the sample. Note that np.std, like np.var, gives the population version by default.
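
A sketch of these variance and standard deviation calls; here x is assumed to be the total-marks column read earlier:

import numpy as np
import statistics

# x is the total-marks column read earlier, e.g. x = table['Total']
print(np.var(x))              # population variance, 262.781 in the lecture
print(statistics.pstdev(x))   # population standard deviation, 16.2105
print(statistics.stdev(x))    # sample standard deviation (divides by n - 1)
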
Next we go to the skewness. For that you have to import from SciPy: from scipy.stats import skew, then take skew(x). The skewness here is a positive value, so the data is right skewed.
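
As a sketch, again assuming x is the column from earlier:

from scipy.stats import skew

# x is the total-marks column read earlier
print(skew(x))   # > 0 means right skewed, < 0 means left skewed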

Next we will go to the box and whisker plot. For drawing plots you have to use Matplotlib: import matplotlib.pyplot as plt. Then plt.boxplot(x, sym='*'), an inbuilt function where the symbol is a star (*), followed by plt.show(); execute this and we get the box and whisker plot. You see that in the box and whisker plot some star symbols appear; they imply that those data points are outliers. An outlier is a value which goes beyond the maximum whisker or below the minimum whisker.
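
A sketch of the plotting call, assuming x as before:

import matplotlib.pyplot as plt

# x is the total-marks column read earlier
plt.boxplot(x, sym='*')   # '*' marks any outliers beyond the whiskers
plt.show()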

The position of the middle line will help you identify the nature of the distribution. If it is on the left side, the data is right skewed. Look at this: the skewness was positive, and the line sits a little to the left side, so the data is right skewed. With that we are stopping the discussion of central tendency. So far we have seen the various measures of central tendency and the different ways of measuring dispersion.
(Video Ends: 31:40)
Whatever theory part you have learned, we ran it in Python and got the answers. There are so many sources available on the internet, different courses and different videos; you can also refer to them for this class. Thank you.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee

Lecture – 06
Introduction to Probability - I

Good morning students. Today we move to lecture number 6, an introduction to probability. The concept of probability is fundamental in any field, whether you call it statistics, analytics or data science, because if you look at the titles of books on statistics or analytics, they often read "probability and statistics". The concepts of probability and statistics cannot be separated; they always go together, because in statistics we take samples and predict about the population.

So, whatever we say with the help of a sample, we always have to attach some probability to it, because we cannot be 100% sure that what we say with the help of the sample will hold exactly; where there is prediction, there has to be an attached probability. Today's lecture is an introduction to probability; I am not going to teach it in full detail.
(Refer Slide Time: 01:21)

I will teach only the ideas that are important for us. The lecture objectives are: to comprehend the different ways of assigning probability; to understand and apply marginal, union, joint and conditional probabilities; to solve problems using the laws of probability, including the laws of addition, multiplication and conditional probability; and to use a very important theorem, Bayes' rule, to revise probabilities. These are my lecture objectives.
(Refer Slide Time: 01:51)

We will go to the definition of probability. Probability is the numerical measure of the likelihood that an event will occur; the probability of any event A must be between 0 and 1, inclusively. The sum of the probabilities of all mutually exclusive and collectively exhaustive events is 1; later I will explain what mutually exclusive and collectively exhaustive events are. So the summation of the probabilities will always equal 1: for example, P(A) + P(B) + P(C) = 1, where A, B and C are mutually exclusive and collectively exhaustive events.
(Refer Slide Time: 02:32)

This is the range of probability. You see that if it is an impossible event, the probability value is 0. If it is a certain event, the probability is 1. If there is a 50% chance, the probability is 0.5. The point here is that probability lies between 0 and 1.
(Refer Slide Time: 02:50)

There are 3 methods of assigning probability: the classical method (assigning probability by rules and laws), the relative frequency of occurrence (cumulative historical data), and subjective probability (personal intuition or reasoning). Using these 3 methods, let us see how to find the probability.
(Refer Slide Time: 03:11)

First of all is classical probability: the number of outcomes leading to an event divided by the total number of possible outcomes. Each outcome is equally likely; there is an equal chance of getting each different outcome. It is determined a priori, that is, before performing the experiment we know what outcomes can occur. Suppose we toss a coin: there are 2 possibilities, head or tail, and in advance we know the possible outcomes.

It is applicable to games of chance. The point is that everyone correctly using this method assigns an identical probability, because with classical probability everyone will get the same answer for a problem: we know in advance what the possible outcomes are.
(Refer Slide Time: 04:00)

So, mathematically, P(E) = ne / N, where N is the total number of possible outcomes and ne is the number of outcomes in E.
(Refer Slide Time: 04:15)

Relative frequency probability is based on historical data; that is why another name for it is the relative frequency of occurrence. It is computed after performing the experiment: the number of times the event occurred divided by the number of trials, which is nothing but a frequency divided by the sum of frequencies. Here also, everyone correctly using this method assigns an identical probability, because everything is already known.
(Refer Slide Time: 04:42)

So, the formula is the same: P(E) = ne / N, where ne is the number of times the event occurred and N is the total number of trials.
(Refer Slide Time: 04:50)

Then subjective probability: it comes from a person's intuition or reasoning. Subjective means different individuals may, quite legitimately, assign different numerical probabilities to the same event; it is a degree of belief. Sometimes subjective probability is useful: for example, if you introduce a new product and want to know its probability of success, you can ask an expert what the probability of success is. The same applies to a new movie or a new project: what is the probability of success?

It is based on intuition, or on the experience of the person, who can give some probability of success or failure. Examples are site selection decisions and sporting events: in cricket, what is the chance of one team winning? These are intuitive probabilities.
(Refer Slide Time: 05:42)

In this course there is certain terminology with respect to probability that you have to know. It is very fundamental; you might have studied it in previous classes, so this is just to recollect it: experiment, event, elementary event, sample space, union and intersection, mutually exclusive events, independent events, collectively exhaustive events, complementary events. These are some terms we will revise.
(Refer Slide Time: 06:07)

First, what are an experiment, a trial, an elementary event and an event? An experiment is a process that produces outcomes; there may be more than one possible outcome, but only one outcome per trial. What is a trial? One repetition of the process. What is an elementary event? An event that cannot be decomposed or broken down into other events. And what is an event? An outcome of an experiment; it may be an elementary event or an aggregate of elementary events, and it is usually represented by an uppercase letter such as A or E.
(Refer Slide Time: 06:48)

Look at this example. In the table, the population of a small town is given: there are 4 families, A, B, C and D. We asked 2 questions. First, are there children in the household? Family A said yes. Then we asked the number of automobiles: A has 3; B has children and 2 automobiles. With the help of this table we will try to understand what an experiment is: for example, randomly select, without replacement, 2 families from the residents of the town.

An elementary event, for example: the sample includes families A and C, selected randomly. An event, for example: each family in the sample has children in the household, or the sampled families own a total of 4 automobiles; these are particular events. For the event "each family in the sample has children in the household", A is one such outcome and D is another. For the event "the sampled families own a total of 4 automobiles": A and C together have 4 automobiles, B and D together have 4 automobiles, while A and D together have more than 4 automobiles.
(Refer Slide Time: 08:07)

That was the example of an event. Then, what is a sample space? The set of all elementary events for an experiment is called the sample space. Suppose you roll a die: you can get 1, 2, 3, 4, 5, 6; that is the sample space. There are different methods for describing a sample space: listing, a tree diagram, set builder notation and a Venn diagram. We will see what these are.
(Refer Slide Time: 08:30)

See this listing. The experiment: randomly select, without replacement, 2 families from the residents of the town. Each ordered pair in the sample space is an elementary event, for example (D, C). What are the different possibilities? Look at this table: AB, AC, AD, BA, BC, BD, CA, CB, CD, and so on. This is listing the sample space for selecting 2 families from the residents.
(Refer Slide Time: 08:58)

Here we do it without replacement. Without replacement means that once A is taken we do not select A again, so pairs like AA or BB do not occur. Another way to express the sample space is with the help of a tree diagram; a tree diagram is very useful and easy to understand. For example, with the 4 families A, B, C, D we can have the combinations AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC. These are the different sample points, and the tree diagram makes them easy to see.
(Refer Slide Time: 09:55)

Now the set notation for a random sample of 2 families: S = {(X, Y) : X is the family selected on the first draw, and Y is the family selected on the second draw}. It is a concise description of larger sample spaces; in mathematics this kind of notation is used.
(Refer Slide Time: 10:15)

The sample space can also be shown in terms of a Venn diagram: this is the list of sample points, and each dot expresses a different sample point.
(Refer Slide Time: 10:26)

Then we go to another concept, the union of sets. The union of 2 sets contains an instance of each element of the 2 sets. For example, X = {1, 4, 7, 9} is one set and Y = {2, 3, 4, 5, 6} is another. If you want X ∪ Y, we just combine them: {1, 2, 3, 4, 5, 6, 7, 9}. Similarly, look at the Venn diagram: to get the union we combine both events. Another example: C = {IBM, DEC, Apple} is one set and F = {Apple, Grape, Lime} is another. If we want the union of C and F, we take IBM, DEC, Apple (Apple appears in both sets, so we take it only once), Grape, Lime. This is the union.
(Refer Slide Time: 10:15)

We go to intersection. Suppose X = {1, 4, 7, 9} and Y = {2, 3, 4, 5, 6}, and you want to know what is common. The intersection of 2 sets contains only those elements common to both; here 4 is the common element of X and Y, so X ∩ Y = {4}. For example, with C = {IBM, DEC, Apple} and F = {Apple, Grape, Lime}, C ∩ F, the common element of sets C and F, is Apple. In the Venn diagram, this overlapping portion is the intersection.
(Refer Slide Time: 11:54)

Then we go to mutually exclusive events. Events with no common outcomes are called mutually exclusive events: the occurrence of one event precludes the occurrence of the other. For example, C = {IBM, DEC, Apple} and F = {Grape, Lime}: C ∩ F has no common element, so it is the null set. Similarly, X = {1, 7, 9} and Y = {2, 3, 4, 5, 6}: X ∩ Y has no common element. Look at the Venn diagram: the 2 sets do not overlap, so P(X ∩ Y) = 0; these are mutually exclusive events. Another example: when we toss a coin there are 2 possible outcomes, head or tail; we cannot have both, which is why the two events are mutually exclusive.
(Refer Slide Time: 12:46)

Then independent events: when the occurrence of one event does not affect the occurrence or non-occurrence of the other event, the events are independent. The conditional probability of X given Y equals the marginal probability of X, and the conditional probability of Y given X equals the marginal probability of Y. One way to test for independence: if P(X|Y) = P(X) and P(Y|X) = P(Y), then the events X and Y are called independent. We will go into detail after some time with the help of an example.
(Refer Slide Time: 13:28)

Collectively exhaustive events contain all the elementary events for an experiment: for example, E1, E2, E3 may form a sample space with 3 collectively exhaustive events. Suppose you roll a die: the possible outcomes 1, 2, 3, 4, 5, 6 together are collectively exhaustive.

(Refer Slide Time: 13:52)

(Refer Slide Time: 14:11)

Then complementary events: the set of all elementary events not in A is A', its complementary event. You see that there is P(A), and whatever is not in A is A', the complement, so P(A') = 1 - P(A). Then, counting the possibilities: in probability, many different combinations may arise, and these rules are very useful for counting the possibilities. One rule is the mn rule; the second is sampling from a population with replacement; the third is sampling from a population without replacement.
(Refer Slide Time: 14:28)

We go to the mn rule: if one operation can be done in m ways and a second operation can be done in n ways, then there are mn ways for the 2 operations to occur in order. The rule easily extends to k stages, with the number of ways equal to n1 × n2 × ... × nk; we simply multiply. For example, toss 2 coins: the total number of sample events is 2 × 2 = 4, because the first coin has 2 possibilities and the second coin has another 2 possibilities, so in total there are 4 possibilities.
(Refer Slide Time: 15:07)

Now see another example, sampling from a population with replacement. A tray contains 1000 individual tax returns; if 3 returns are randomly selected with replacement from the tray, how many possible samples are there? You are making 3 draws: trial 1, trial 2, trial 3. In each trial there are 1000 possibilities, because you can choose one out of 1000. So the first trial has 1000, the second 1000, the third 1000; when you multiply, there are 1000 × 1000 × 1000 = 1000 million possible samples with replacement.
(Refer Slide Time: 15:45)

In case you go without replacement, the count decreases, because without replacement the pool shrinks with every draw. A tray contains 1000 individual tax returns; if 3 returns are randomly selected without replacement from the tray, how many possible samples are there? That is

NCn = N! / (n!(N - n)!) = 1000! / (3!(1000 - 3)!) = 166,167,000

You see that in the previous slide, with replacement, it was 1000 million; now it is only about 166 million, because we are sampling without replacement.
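
A quick sketch checking both counts with Python's built-in math module:

import math

with_replacement = 1000 ** 3              # ordered draws, each return put back
without_replacement = math.comb(1000, 3)  # distinct samples of 3 returns

print(with_replacement)      # 1000000000, i.e. 1000 million
print(without_replacement)   # 166167000
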
(Refer Slide Time: 16:29)

There are different types of probability: marginal probability, union probability, joint probability and conditional probability. What is the notation? The marginal probability is the simple one, P(X). How are they expressed in terms of the Venn diagram? See this: the marginal probability covers a single event; the union probability is P(X ∪ Y), the probability of X or Y occurring; and the joint probability, or common probability, the probability of X and Y occurring together, is the middle overlapping portion.

Then conditional probability: the probability of X occurring given that Y has occurred. Here there are 2 events, and the probability of the outcome of X depends on the outcome of Y; we read it as the probability of X given that Y has occurred. This is the Venn diagram notation for expressing the conditional probability.
(Refer Slide Time: 17:23)

Then we will go to the general law of addition: P(X ∪ Y) = P(X) + P(Y) - P(X ∩ Y).
(Refer Slide Time: 17:35)

We will take a small example, and from that example understand the concept of probability. A company wants to improve the productivity of a particular unit and is considering a new office design. One design option will reduce the noise; the second design option will give more storage space. So we are going to ask the employees which of these 2 designs will improve productivity.
(Refer Slide Time: 18:13)

You see the problem: a company conducted a survey for the American Society of Interior Design in which workers were asked which changes in office design would increase productivity. There are 2 designs: one design will reduce the noise, the other design will increase the storage space. The respondents were allowed to answer more than one type of design change. The table shows the outcome: 70% of the respondents said that reducing noise would increase productivity, and 67% said that more storage space would increase productivity. So we are asking the respondents which design will improve productivity.
(Refer Slide Time: 19:04)

Suppose one of the survey respondents is randomly selected and asked what office design change would increase worker productivity. What is the probability that this person would select reducing noise or the design which provides more storage space, out of the 2 options?
(Refer Slide Time: 19:29)

So let N represent the event "reducing noise", meaning the respondent chose that design, and S represent the event "more storage space", the other option. The probability of a person responding N or S can be symbolized statistically as a union probability, using the law of addition: P(N ∪ S).
(Refer Slide Time: 19:56)

So, P(N ∪ S) is found from the formula P(N ∪ S) = P(N) + P(S) - P(N ∩ S). Here P(N) is 70%, P(S) is 67%, and those who said yes to both designs give P(N ∩ S) = 0.56. When you substitute these values in the formula you get 0.81; that is, there is an 81% probability that a randomly selected respondent chose at least one of the two designs.
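
A one-line check of this addition-law calculation:

p_n, p_s, p_n_and_s = 0.70, 0.67, 0.56
print(p_n + p_s - p_n_and_s)   # law of addition: 0.81
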
(Refer Slide Time: 20:33)

What we solved with the help of the Venn diagram in the previous slide can also be solved with the help of a contingency table. The contingency table is very helpful; we just have to make a table. For example, in the rows I have taken noise reduction: 70% of the people said yes, so the remaining 30% must have said no, giving 0.70 and 0.30. In the columns I have taken the storage space design: in total 67% said that increasing storage space would increase productivity.

So the remaining 33% must have said no. The 0.56 is the intersection: people who said yes to both designs, noise reduction and increased storage space. Once you know this 0.56, the remaining cells follow by simple subtraction: 0.70 - 0.56 = 0.14, those who said yes to noise reduction but no to storage space; 0.67 - 0.56 = 0.11; and 0.33 - 0.14 = 0.19. From this table we can read a lot of information.
(Refer Slide Time: 21:49)

As notation: with event A having categories A1, A2 in the rows and event B having categories B1, B2 in the columns, whatever is inside the cells, such as P(A1 ∩ B1), is called a joint probability, and whatever is on the margins of the table is called a marginal probability. That is the notation traditionally followed: the inside of a cell is the combination of both events, the joint probability, while the extreme row and column totals of the table are the marginal probabilities.
(Refer Slide Time: 22:26)

The same thing: suppose you want the same answer with the help of this contingency table. We want to know the probability that a person agreed to design N or design S: P(N) + P(S) - P(N ∩ S). These values can be read directly from the table: P(N) is 0.70, P(S) is 0.67, P(N ∩ S) is 0.56, so we get 0.81.
(Refer Slide Time: 22:57)

Then we go to conditional probability. P(N) is 0.70, and the probability of agreeing to both designs is 0.56. Suppose you want P(S|N) = P(N ∩ S) / P(N); I will explain conditional probability in detail later, but for now take this: P(N ∩ S) from the previous table is 0.56 (you can look at the Venn diagram also), P(N) is 0.70, and substituting gives 0.80.
(Refer Slide Time: 23:37)

In the same office design problem there is another conditional probability, P(N'|S): those who said no to noise reduction but agreed to the storage design. As per the definition,

P(N'|S) = P(N' ∩ S) / P(S) = 0.11 / 0.67 = 0.164.
(Refer Slide Time: 24:14)

We will take another small problem to explain the concept of probability. A company's data revealed that 155 employees worked in 4 types of positions. The table shown in the next slide is a raw-values matrix, also called a contingency table, with the frequency counts of each category plus subtotals and totals, containing a breakdown of these employees by type of position and by sex.
(Refer Slide Time: 24:46)

Look at this table: the rows give the type of position they hold in the organization, whether they work in a managerial, professional, technical or clerical position; the columns give the sex, male or female. You see the intersection of managerial and male is 8; that represents the count of those who work in a managerial position and at the same time are male. These cell entries are joint counts, while at the extreme right and the bottom of the table the totals are given.
(Refer Slide Time: 25:23)

Now, if an employee of the company is selected randomly, what is the probability that the employee is female or a professional worker? What do we have to do? We find P(F) + P(P) - P(F ∩ P). Going back to the previous table: P(F) = 55/155 = 0.355, P(P) = 44/155 = 0.284, and P(F ∩ P) = 13/155 = 0.084, so

P(F ∪ P) = 0.355 + 0.284 - 0.084 = 0.555.
There is another problem.
(Refer Slide Time: 26:15)

Shown here are the raw-values matrix and the corresponding probability matrix of the results of a national survey of 200 executives who were asked to identify the geographic location of their company and their company's industry type. The executives were allowed to select only one location and one industry type, because the same person cannot work in different locations and different industries; one person works in only one type of industry. We will conclude with that in this session.

We have seen the different types of probability, how to assign probability, and different counting rules: the mn rule, with replacement and without replacement. We have also seen different terms which we are going to use frequently in this course: event, joint probability, marginal probability and so on. Then we took one sample problem, and with its help we saw how to find the union of 2 events, P(A ∪ B), and the intersection, P(A ∩ B).

Then how to find the marginal probability and the conditional probability. With that we will close, and continue in the next lecture. Thank you very much.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Management Studies
Indian Institute of Technology – Roorkee

Lecture – 07
Introduction to Probability - II

Dear students, we will continue with the concept of probability in lecture number 7. We will take one example and try to understand the concepts of marginal probability, joint probability and conditional probability through this problem. The problem: a company's data revealed that 155 employees worked in 4 types of positions. Shown here again is the raw-values matrix, also called the contingency table, with the frequency count for each category, plus subtotals and totals, containing a breakdown of these employees by type of position and by sex. Look at this contingency table: the rows give what kind of position they hold, whether managerial, professional, technical or clerical.
(Refer Slide Time: 01:20)

In the columns is their sex. Suppose an employee of the company is selected randomly: what is the probability that the employee is female? In the contingency table the rows represent the type of position and the columns represent sex. The positions they may hold are managerial, professional, technical and clerical, and we recorded their sex in the data set as well. So the question is: if an employee of the company is selected randomly,

what is the probability that the employee is female or a professional worker, that is, what is P(F ∪ P)? F is female, P is professional. As per the law of addition of probability, P(F ∪ P) = P(F) + P(P) - P(F ∩ P). P(F) we can find by going to the previous slide.
(Refer Slide Time: 02:22)

So, for P(F): there are 55 females out of 155 in total, and 55/155 gives 0.355. Then P(P), the probability of professionals: there are 44 professionals, and 44/155 gives 0.284. Then we subtract P(F ∩ P), those who are female and at the same time work in a professional position; that is 13, and 13/155 gives 0.084. So P(F ∪ P) = 0.355 + 0.284 - 0.084 = 0.555.
(Refer Slide Time: 03:10)

Shown here are the raw-values matrix and the corresponding probability matrix of the results of a national survey of 200 executives who were asked to identify the geographic location of their company and their company's industry type. Two questions were asked: the company's geographic location, and the kind of industry they work in. The executives were allowed to select only one location and one industry, because they can work in only one industry at one location.
(Refer Slide Time: 03:45)

This table shows the data. The industry type may be finance, manufacturing or communications, which we call A, B and C. The geographic location may be Northeast, Southeast, Midwest or West. For example, in finance (A) there are 56 people working, in manufacturing (B) 70 people, and in communications (C) 74. In the Northeast location there are 82 people, in the Southeast 34, in the Midwest 42 and in the West 42.
(Refer Slide Time: 04:21)

The first question: what is the probability that the respondent is from the Midwest? We can read this answer directly from the table. The second question: what is the probability that the respondent is from the communications industry or from the Northeast? Here it is the union of the 2 events C and D. The third question: what is the probability that the respondent is from the Southeast or from the finance industry?
(Refer Slide Time: 04:55)

From the given table we first find the probabilities: we divide each cell entry by the grand total. After dividing, we get values like 0.12, 0.05, 0.04 and so on. Now this is the probability matrix; from it we can pick up whatever value we need to answer each question.
(Refer Slide Time: 05:20)

For example, going back: what is the probability that the respondent is from the Midwest? Going to the next slide, Midwest is F, so it is 0.21. Going back: what is the probability that the respondent is from the communications industry or the Northeast? You have to compute P(C ∪ D) = P(C) + P(D) - P(C ∩ D). Then, what is the probability that the respondent is from the Southeast or from finance? The same thing: P(E ∪ A) = P(E) + P(A) - P(E ∩ A). These values you can directly pick up from the table. Now we go to mutually exclusive events. Suppose, in the employee-position table, you want to find P(T ∪ C), where T is those in technical positions and C is those in clerical positions. In general P(T ∪ C) would be P(T) + P(C) - P(T ∩ C), but a joint occurrence is not possible, because a person cannot hold 2 types of position at a time.

So these are mutually exclusive events, and for mutually exclusive events the intersection becomes 0: P(T ∪ C) = P(T) + P(C), with the term P(T ∩ C) equal to 0. So P(T) is 69/155, P(C) is 31/155, and when you simplify, the sum is 0.645. This is an example of mutually exclusive events.

(Refer Slide Time: 07:00)

Similarly, P(P ∪ C) = P(P) + P(C); there is no intersection. As in the formula I mentioned two slides earlier, the intersection component becomes 0 because a person cannot hold 2 positions; this is another example of mutually exclusive events.
(Refer Slide Time: 07:18)

In the law of multiplication, P(X ∩ Y) = P(X) × P(Y|X) = P(Y) × P(X|Y). What happens if the events X and Y are independent? Then we can simply multiply: P(X ∩ Y) = P(X) × P(Y); that you will see later.
(Refer Slide Time: 07:42)

We will see another problem. A company has 140 employees, of which 30 are supervisors; 80 of the employees are married, and 20% of the married employees are supervisors. If a company employee is randomly selected, what is the probability that the employee is married and is a supervisor? Whenever this kind of problem comes, if you are able to construct the contingency table, whatever question is asked, you can pick the answer directly from there. So, from the given data:
(Refer Slide Time: 08:15)

First we construct the contingency table. In the table you see there are 140 employees, out of which 30 are supervisors; out of the 140, 80 people are married. For those who are married and at the same time supervisors, what do you do? You multiply: 80 × 0.2 = 16, and 16/140 gives the answer.

(Refer Slide Time: 08:43)

For example, see how we got the 0.1143 from before. We know the probability of married employees: 80/140. It is given that, of the married employees, 20% are supervisors, that is, P(S|M) = 0.20. If we want those who are married and at the same time supervisors, we use the multiplication law: P(M ∩ S) = P(M) × P(S|M) = (80/140) × 0.20 = 0.1143.
(Refer Slide Time: 09:32)

You see that once you know one cell in the contingency table, filling the remaining cells is easy, and whatever value you want you can pick up. For example, I have filled in the first cell, 0.1143; I know the row total and the column total, so by subtracting I can get the remaining cells. That is the application of the contingency table. Suppose P(S) = 1 - P(S'); we know P(S'), that is 0.7857.

I am referring to this location in the table: 0.7857. If you want P(M' ∩ S'), that means those who are not married and at the same time not supervisors; that is nothing but this cell, 0.3286: not married and at the same time not supervisors. Then P(M ∩ S') is those who are married but not supervisors.

That is P(M) - P(M ∩ S) = 0.5714 - 0.1143 = 0.4571, which gives that cell. In the contingency table, if you know one cell, the remaining cells can be found out.
(Refer Slide Time: 10:54)

The special law of multiplication for independent events: the general law is P(X ∩ Y) = P(X) × P(Y|X) = P(Y) × P(X|Y). The special law applies if X and Y are independent. Then P(X) = P(X|Y), because when the events X and Y are independent, the outcome of X does not depend on the outcome of Y, so P(X|Y) becomes P(X) itself; similarly P(Y|X) becomes P(Y) itself. Substituting there, P(X ∩ Y) = P(X) × P(Y), but only when both events are independent.
(Refer Slide Time: 11:45)

This is the law of conditional probability, which we have also seen previously: the conditional probability of X given Y is the joint probability of X and Y divided by the marginal probability of Y. The joint probability is the intersection, so P(X|Y) = P(X ∩ Y) / P(Y), and by rearranging this can be written as P(X|Y) = P(Y|X) × P(X) / P(Y).
(Refer Slide Time: 12:14)

A little more detailed explanation of conditional probability: a conditional probability is the probability of one event given that another event has occurred. Suppose I write P(A|B): first you find the intersection P(A ∩ B), then divide by P(B); that is the conditional probability of A given that B has occurred. Suppose you want P(B|A), the probability of B given that A has occurred: that is P(A ∩ B) / P(A), where P(A ∩ B) is the joint probability of A and B, P(A) is the marginal probability of event A, and P(B) is the marginal probability of event B.
(Refer Slide Time: 12:54)

We will take an example of how to find the conditional probability. Of the cars on a used car lot, 70% have air conditioning (AC) and 40% have a CD player; 20% of the cars have both. What is the probability that a car has a CD player, given that it has AC? That means we want P(CD|AC).
(Refer Slide Time: 13:19)

As I told you, just draw the contingency table, because all the values are given. Going back: 70% of the cars have AC, so that is one marginal value; 40% of the cars have a CD player, that is another, and subtracting each from 1 gives the complements. Another piece of information is given: 20% of the cars have both, so 0.2 goes in the intersection cell. Once you know these values, the other cells can be found out.

So if you want P(CD|AC), as per the definition it is P(CD ∩ AC) divided by P(AC): 0.2 divided by 0.7 = 0.2857. Given AC, we consider only the top row: 70% of the cars have AC and 20% have both, and 20% out of 70% is 28.57%. So we are getting the conditional probability.
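
The same conditional probability as a two-line sketch:

p_ac = 0.70           # P(AC)
p_both = 0.20         # P(CD and AC)
print(p_both / p_ac)  # P(CD | AC) = 0.2857...
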
(Refer Slide Time: 14:25)

This conditional probability can also be explained with the help of a tree diagram, because a tree diagram is easy to visualize. The first split is having AC or not having AC; each branch then splits into having a CD player or not. The joint probabilities 0.7, 0.2, 0.5, 0.2 and 0.1 sit on the branches. If you want P(CD|AC), you divide 0.2 by 0.7; for the other arc, not having a CD player given AC, it is 0.5 divided by 0.7, and so on. A tree diagram is very easy to understand.
(Refer Slide Time: 15:12)

Then we see the definition of independent events: if X and Y are independent events, the occurrence of Y does not affect the probability of X occurring, and similarly the occurrence of X does not affect the probability of Y; X and Y are not connected. You see that if X and Y are independent, P(X|Y) = P(X) and P(Y|X) = P(Y); this we have seen previously as well.
(Refer Slide Time: 15:44)

Here is the condition again: two events are independent when P(A|B) = P(A). This condition is the test for independence; events A and B are independent when the probability of one event is not affected by the other event.
(Refer Slide Time: 16:05)

We will take one example to check the practical application of this concept of independent events, using the data given previously: we asked what kind of industry the executives work in, whether finance, manufacturing or communications, and their geographic location. The question is to test the matrix for the 200 executive respondents to determine whether industry type is independent of geographic location; that means we have to find out whether there is any dependency between the geographic location and the kind of industry.

For example, in India, if you look, most of the software companies are in the south. So, is there any connection between the geographic location and the kind of industry located there? We will take the example of finance and the West region.
(Refer Slide Time: 17:05)

If you want to know P(A|G) = P(A ∩ G)/P(G): P(A ∩ G) we can directly read from the table as 0.07, and P(G) is 0.21, also directly from the table. Dividing, the value is 0.33. But when you look at P(A), it is 0.28. So what is happening? P(A|G) is not equal to P(A). If they were equal, the two events would be independent; since they are not equal, there is a kind of dependency between the geographic location and the type of industry located there. This is one way to test independence.
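
A small sketch of this independence check, with the probabilities read from the table:

p_a_and_g = 0.07    # P(A and G): finance companies in the West
p_g = 0.21          # P(G): marginal probability of the West
p_a = 0.28          # P(A): marginal probability of finance

print(p_a_and_g / p_g)   # P(A|G) = 0.333..., not equal to P(A) = 0.28,
                         # so location and industry are not independent
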
(Refer Slide Time: 18:06)

Take another example, A given D. Here the actual counts are given; you can work either way. P(A ∩ D) is 8, divided by the count for D, which is 34, giving 8/34 = 0.235. And P(A) is 20 divided by 85, which is also 0.235. So P(A|D) and P(A) are the same; these are independent events. If P(A|D) equals P(A), both events are independent; this is the way to test independence.
(Refer Slide Time: 18:55)

Next we go to an important application, Bayes' rule or Bayes' theorem. It is used to revise probabilities and has a lot of applications in higher-level probability theory. It is an extension of the conditional law of probability that enables revision of an original probability with new information: P(X|Y) = P(Y|X) × P(X) divided by a summation which I will explain in the next slides.
(Refer Slide Time: 19:29)

For example, see: P(X|Y) can be written as P(X ∩ Y)/P(Y), and P(Y|X) can be written as P(X ∩ Y)/P(X); look at this, P(X ∩ Y) appears in both. So we can write P(X|Y) × P(Y) = P(Y|X) × P(X), since both sides equal P(X ∩ Y). Now suppose I know P(X|Y) and I want the reverse, P(Y|X): from this equality, P(Y|X) is nothing but P(X|Y) × P(Y) / P(X). Here the denominator is simply P(X) because there are only 2 outcomes; if there are more outcomes, a summation of P(X) terms comes in the denominator, and that summation is nothing but the different combinations of joint probabilities. We will see this with the help of an example.
(Refer Slide Time: 21:12)

This is a very typical example. Machines A, B and C all produce the same 2 parts, X and Y. Of all the parts produced, machine A produces 60%, machine B produces 30% and machine C produces 10%. In addition, 40% of the parts made by machine A are part X, 50% of the parts made by machine B are part X, and 70% of the parts made by machine C are part X. A part produced by this company is randomly sampled and is determined to be an X part. With the knowledge that it is an X part, revise the probabilities that the part came from machine A, B or C. First we will solve this with the help of a tabular format.
(Refer Slide Time: 22:01)

For example, there are 3 machines, A, B and C: 60% of the parts were produced by machine A, 30% by machine B and 10% by C. Previously we saw how the formula for conditional probability comes about; now I will show you an application of Bayes' theorem.
(Refer Slide Time: 22:29)

Suppose there are 2 suppliers, supplier A and supplier B. I know that 40% of the product is supplied by supplier A and the remaining 60% by supplier B. From my past experience I know that 2% of what supplier A supplies will be defective, and, for supplier B, I know from my past experience that he used to supply 3% defective products out of his 60%.

Using their products I have assembled a new machine, and now the machine is not working. I want to know: given that the machine is not working, what is the probability that the faulty product was supplied by supplier A? And what is the probability that it was supplied by supplier B? This is the application of Bayes' theorem; we will see it with the help of an example.
(Refer Slide Time: 23:36)

Here is the problem. A particular type of printer ribbon is produced by only 2 companies, the Alamo Ribbon Company and South Jersey Products. Alamo produces 65% of the ribbons and South Jersey produces 35% of the ribbons. From our experience, 8% of the ribbons produced by Alamo are defective and 12% of the South Jersey ribbons are defective. A customer purchases a new ribbon.

Given that the ribbon is found defective, what is the probability that Alamo produced it? Alternatively, what is the probability that South Jersey produced it? It is just like the previous example: the machine is not working, and we ask what is the probability that the part was supplied by supplier A or by supplier B.
(Refer Slide Time: 24:39)

Now, first we write down the marginal and conditional probabilities. P(Alamo) = 0.65, since 65% of the product is supplied by Alamo, and P(South Jersey) = 0.35. From past experience I know the defective rate of Alamo, P(D|Alamo) = 0.08, and similarly the defective rate of South Jersey, P(D|South Jersey) = 0.12. You see that these two conditional probabilities need not sum to 1, while the marginal probabilities do sum to 1: the 8% refers to defectives out of Alamo's 65% share.

Now, as per the formula, we reverse the conditioning: we know the ribbon is defective, and we want the probability that it was supplied by Alamo. Look at this: the numerator is P(D|Alamo) × P(Alamo), and the denominator is the sum over all possibilities, P(D|Alamo) × P(Alamo) + P(D|South Jersey) × P(South Jersey). Here the 0.08 and the 0.65 are given.

So the first term is the combination of 0.08 and 0.65, and the second is the combination of 0.12 and 0.35; we have to add these two for the denominator. Dividing, we get 0.553: if the ribbon is defective, there is roughly a 55% chance it was supplied by Alamo. Similarly, given the product is defective, what is the probability it was supplied by South Jersey? The same way: 0.12 multiplied by 0.35 in the numerator, with the same denominator, gives 0.447; that is, there is a 44.7% chance the defective ribbon was supplied by South Jersey.

(Refer Slide Time: 27:01)

If you look at it in tabular form, it is very easy. The events are Alamo and South Jersey: one supplies 65%, the other 35%; those are the priors. Then the conditional probabilities: we know Alamo supplies 8% defectives and South Jersey 12%. First we find the joint probabilities, 0.65 × 0.08 = 0.052 and 0.35 × 0.12 = 0.042, and add them to get 0.094. Then each joint probability is divided by 0.094. You see that here we knew P(D given the supplier), and we are finding the reverse, P(supplier given D); that is the advantage of Bayes' theorem. The revised probabilities are 0.553 for Alamo and 0.447 for South Jersey.
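
The whole revision written as a sketch:

priors = {'Alamo': 0.65, 'South Jersey': 0.35}    # market shares
defect = {'Alamo': 0.08, 'South Jersey': 0.12}    # defective rates

joint = {s: priors[s] * defect[s] for s in priors}   # P(D and supplier)
p_d = sum(joint.values())                            # P(D) = 0.094

posterior = {s: joint[s] / p_d for s in joint}       # Bayes' revision
print(posterior)   # {'Alamo': 0.553..., 'South Jersey': 0.446...}
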
(Refer Slide Time: 27:54)

This can also be shown in pictorial form: a tree with branches Alamo and South Jersey, each splitting into defective and not defective. Multiplying along the branches gives 0.052 and 0.042; adding them, we get 0.094. In this lecture I explained mutually exclusive events, the law of multiplication, independent events, and the concept of Bayes' theorem, and with the help of a problem I showed the application of Bayes' theorem. With that we conclude this lecture. Thank you very much.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee

Lecture – 08
Probability Distributions

Good morning students, we are entering the 8th lecture of this course, Data Analytics with Python. Today the topic is probability distributions, a very interesting topic. We are going to see empirical distributions and their properties; then discrete distributions, where we will cover the binomial, Poisson and hypergeometric distributions; and continuous distributions, where we will cover the uniform, exponential and normal distributions.
(Refer Slide Time: 01:00)

First of all, what is a distribution, and what is the purpose of studying distributions? A distribution describes the shape of a batch of numbers; that is the meaning of distribution. Suppose you have a set of numbers and want to show what shape it follows: if it is bell shaped, we call it a normal distribution; if it forms a rectangular shape, we call it a uniform distribution. Like this, a distribution describes the shape of a batch of numbers.

The characteristics of a distribution can often be summarized by a small number of numerical descriptors called parameters. Each distribution's characteristics are expressed with the help of its parameters. For example, the normal distribution has 2 parameters, the mean and the variance; with their help you can draw the distribution. That is what a parameter is.
(Refer Slide Time: 01:53)

Why distributions? They can serve as a basis for standardized comparison of empirical distributions: if you compare a phenomenon with well-known standard distributions, you can come to know which distribution it follows. They help you estimate confidence intervals for inferential statistics (we will see the meaning of a confidence interval in coming classes), and they form a basis for more advanced statistical methods.

For example, a good fit between an observed distribution and a certain theoretical distribution is an assumption of many statistical procedures. Why do we have to study distributions? Suppose we are doing a simulation and, for example, the arrival pattern follows a Poisson distribution. If, for certain collected data, you can show that the arrivals follow a Poisson distribution, then the mean, variance and other population parameters are already well defined.

If you can match a natural phenomenon to a standard distribution whose parameters are well defined, you can use those parameters as they are; that is the purpose of studying distributions.
(Refer Slide Time: 03:01)

Then, what is a random variable? We want to construct a distribution, and a distribution is the relation between X and its corresponding probability p(x). Here X is the random variable: a variable which contains the outcome of a chance experiment. In a way, you are quantifying the outcome: suppose we toss a coin and set X = 1 for getting a head and X = 0 for getting a tail; then X is the random variable.

So X is the random variable that can take the values 1 and 0: if X is 1 it is a head, if X is 0 it is a tail. A random variable is a variable that can take on different values in the population according to some random mechanism; the values 1 and 0 follow that mechanism. A random variable can be discrete, with distinct, countable values: for example, a year is a discrete random variable. Or it can be continuous: for example, mass is a continuous random variable.
(Refer Slide Time: 04:02)

Now, probability distributions. The probability distribution function — or probability density function (PDF) — of a random variable X gives the values taken by the random variable and their associated probabilities. If you relate X to its corresponding probabilities p(x) (or f(x)) and plot those points, the plot forms the distribution. The PDF of a discrete random variable is also known as the PMF, the probability mass function.

Example: let the random variable X be the number of heads obtained in 2 tosses of a coin. With 2 tosses there are four possibilities: head-head, head-tail, tail-head, and tail-tail; these form the sample space.
(Refer Slide Time: 04:53)

Probability density function of a discrete random variable: suppose we toss a coin 2 times. The probability of 0 heads is 1/4, the probability of exactly one head is 1/2, and the probability of 2 heads is 1/4; the sum must be 1. On the X axis we mark the random variable, and on the Y axis the corresponding probabilities. Plotting the random variable against its probability in this way is what we call the distribution.
(Refer Slide Time: 05:23)

Now let us do a small numerical problem. A probability distribution for a discrete random variable X is given: each value of X with its corresponding probability. This is an empirical distribution. Suppose you want to know P(X ≤ 0). What do you do? You add the probabilities for every value of X that is less than or equal to 0.

For example, 0.20 + 0.17 + 0.15 + 0.13 = 0.65. Similarly, if you want the probability for the random variable between −3 and 1, that is P(−3 ≤ X ≤ 1), you add the probabilities from −3 up to 1: 0.15 + 0.17 + 0.20 + 0.15 = 0.67.
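A minimal Python sketch of this kind of lookup follows. The x values and probabilities below are assumed for illustration (the slide's full table is not reproduced here); they are chosen so the two sums come out to 0.65 and 0.67 as in the example.

# Summing p(x) over a range of a discrete random variable.
x_vals = [-3, -2, -1, 0, 1, 2]
p_vals = [0.15, 0.17, 0.20, 0.13, 0.02, 0.33]  # assumed; sums to 1

def prob_between(a, b):
    # P(a <= X <= b) for the discrete distribution above
    return sum(p for x, p in zip(x_vals, p_vals) if a <= x <= b)

print(prob_between(float("-inf"), 0))  # 0.65
print(prob_between(-3, 1))             # 0.67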
(Refer Slide Time: 06:12)

How do we plot a discrete distribution? Take the number of crises as the random variable, with the probability of each count given: for example, the probability of 0 crises is 0.37, of 1 crisis 0.31, and so on. Mark the random variable on the X axis and plot the probability on the Y axis.

Note that these points cannot be connected by a line. The distribution is discrete: there are values at x = 1 and x = 2, but no value at x = 1.5. Because the points cannot be connected, we call it a discrete distribution.
(Refer Slide Time: 07:17)

The requirements for a discrete probability density function: each probability lies between 0 and 1 inclusive, and the total of all probabilities equals 1, as we have already seen.
(Refer Slide Time: 07:30)

The next term is the cumulative distribution function. The cumulative distribution function of a random variable X, written F(x), associates each value in the range of possible values with P(X ≤ x); it is obtained simply by adding up the probabilities. The CDF always lies between 0 and 1, that is, 0 ≤ F(x) ≤ 1.
(Refer Slide Time: 08:04)

Then there is a very important property: the expected value of X. Let X be a discrete random variable with set of possible values D and pmf p(x). The expected value, or mean value, of X is denoted E(X) or µx and defined as

E(X) = µx = Σ x·p(x).
(Refer Slide Time: 08:32)

What do the mean and variance of a discrete random variable mean? Look at pictures (a) and (b): the mean is the same for both distributions, but look at the variance — one figure shows a lot of variance, the other much less. A probability distribution can be viewed as a loading, with the mean as the balance point: the mean is the point at which the distribution balances. Parts (a) and (b) illustrate equal means, but part (a) illustrates a larger variance.

(Refer Slide Time: 09:12)

In the second case, the probability distributions illustrated in parts (a) and (b) differ in shape even though they have equal means and equal variances.

(Refer Slide Time: 09:25)

Now let us find an expected value. Use the data below to find the expected number of credit cards that a customer of a retail outlet will carry. X is the random variable — the number of credit cards a customer holds — and p(x) is the corresponding probability. For example, p(0) = 0.08 means the probability that a person holds 0 credit cards is 8%; the probability that a person holds 6 credit cards is 1%.

To find the expected value, multiply each x by its corresponding probability and sum: 0(0.08) + 1(0.28) + 2(0.38) + ... + 6(0.01) = 1.97, which we can round to 2. That means a randomly chosen customer holds, on average, about 2 credit cards. This is an example of what the mean really means.
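A one-line Python computation of this expected value might look like the following sketch. The probabilities for x = 3, 4, and 5 are assumed here so that the total is 1 and the mean works out to 1.97 as in the example.

# Expected value E(X) = sum of x * p(x) for the credit-card example.
x = [0, 1, 2, 3, 4, 5, 6]
p = [0.08, 0.28, 0.38, 0.16, 0.06, 0.03, 0.01]  # x = 3..5 assumed

mean = sum(xi * pi for xi, pi in zip(x, p))
print(mean)  # 1.97 -> roughly 2 credit cards per customer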
(Refer Slide Time: 10:31)

Now let us see how to find the variance and standard deviation of an empirical distribution. Previously we saw µx = Σ x·p(x); now we find the variance. Let X have pmf p(x) and expected value µ. The variance of X, denoted V(X), σx², or σ², is

V(X) = Σ (x − µ)²·p(x) = E[(X − µ)²].

The standard deviation is the square root of the variance.
(Refer Slide Time: 11:14)

Let us see an example. The quiz scores for a particular student are given: 20, 25, and so on. Find the variance and standard deviation. Before finding the standard deviation, you first need the mean: adding the scores and dividing by the number of elements gives 21. First we construct a frequency distribution: 12 appears 1 time, 18 appears 2 times, 20 appears 4 times, 25 appears 3 times, and so on.

Then we find the probabilities. The probability here is nothing but the relative frequency — as I told you, one definition of probability is relative frequency. First find the total frequency from the cumulative frequencies: 1, 1 + 2 = 3, 3 + 4 = 7, then 8, 10, 13; so the total is 13. Each probability is its frequency divided by this total: 1/13, 2/13, and so on.

The mean can also be found another way. Just as we used ΣfM/Σf with relative frequencies, here the expected value is µ = Σ x·p(x) = 12(0.08) + 18(0.15) + 20(0.31) + ... = 21. So one way is to add all the values and divide by the number of elements; the other is to read x and p(x) from the empirical distribution and sum x·p(x). Now we will find the variance.
(Refer Slide Time: 13:40)

The variance is

σ² = Σ (x − µ)²·p(x) = 0.08(12 − 21)² + 0.15(18 − 21)² + 0.31(20 − 21)² + ...

Simplifying, the variance is 13.25 and the standard deviation is √13.25 = 3.64. So in this problem, given the raw data, we first constructed the empirical distribution and then used the formulas to find its mean and variance.
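The same computation can be scripted. Here is a small sketch that builds the relative-frequency table and the moments from raw scores; the score list is assumed, since the slide's full data set is not shown, so the results may differ slightly from 13.25 and 3.64.

# Mean and variance of an empirical distribution built from raw data.
from collections import Counter

scores = [12, 18, 18, 20, 20, 20, 20, 22, 24, 24, 25, 25, 25]  # assumed data
freq = Counter(scores)
n = len(scores)

p = {x: f / n for x, f in freq.items()}               # relative frequencies
mu = sum(x * px for x, px in p.items())               # mean = sum x * p(x)
var = sum((x - mu) ** 2 * px for x, px in p.items())  # sum (x - mu)^2 * p(x)
sd = var ** 0.5
print(mu, var, sd)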

(Refer Slide Time: 14:13)

There is also a shortcut formula for the variance. Starting from E[(X − µ)²], expand the square and simplify to get

V(X) = [Σ x²·p(x)] − µ² = E(X²) − [E(X)]².
(Refer Slide Time: 14:48)

Now let us find the mean of a discrete distribution. The formula is µ = E(X) = Σ x·p(x). With X and p(x) given, multiply each x by p(x); summing the products gives 1, so the mean of this empirical distribution is 1.
(Refer Slide Time: 15:07)

Next, find the variance and standard deviation of this same discrete distribution. We know σ² = Σ (x − µ)²·p(x). With X and p(x) given, first compute (x − µ), then (x − µ)², then multiply (x − µ)² by p(x) and sum; we get 1.2. So the variance is 1.2, and taking the square root, the standard deviation is 1.10.
(Refer Slide Time: 15.37)

Take another distribution with X and p(x) given, and compute x·p(x). When you plot it, you see that the mean need not be exactly 1, 2, or 3 — it may lie between 1 and 2. The mean need not be discrete; only the random variable is discrete here.

(Refer Slide Time: 16:00)

Some very important properties of expected values. The expected value of a constant is that constant. For the sum of two random variables, E(X + Y) = E(X) + E(Y). Note that E(X | Y) is not division; it is a conditional expectation, analogous to conditional probability, so E(X | Y) is not E(X)/E(Y). Similarly, E(XY) ≠ E(X)·E(Y) unless X and Y are independent; if they are independent, then E(XY) = E(X)·E(Y), but otherwise we cannot write that.

If a random variable comes multiplied by a constant, the constant can be taken outside the expectation: E(aX) = a·E(X), where a is a constant. Likewise, E(aX + b) = a·E(X) + b, where a and b are constants, since the expectation of the constant b is b itself.
(Refer Slide Time: 17:11)

Next, properties of variances. The variance of a constant is 0. If X and Y are two independent random variables, then Var(X + Y) = Var(X) + Var(Y), and also Var(X − Y) = Var(X) + Var(Y) — be very careful here: if there are two groups, group 1 and group 2, and you want the variance of the difference, you still add their variances. If b is a constant, then Var(b + X) = Var(X), because the variance of b is 0.

If a is a constant, then Var(aX) = a²·Var(X): the variance involves a squared term, so when a is brought outside the bracket it becomes a². It follows that if a and b are constants, Var(aX + b) = a²·Var(X), since the variance of b is 0. If X and Y are two independent random variables and a and b are constants, then Var(aX + bY) = a²·Var(X) + b²·Var(Y).
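These properties are easy to verify numerically; here is a minimal sketch using simulated data, so the equalities hold only up to sampling noise.

# Checking E(aX + b) = a*E(X) + b and Var(aX + b) = a^2 * Var(X) by simulation.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=10, scale=2, size=100_000)  # any distribution would do
a, b = 3, 5

print((a * x + b).mean(), a * x.mean() + b)  # nearly equal
print((a * x + b).var(), a ** 2 * x.var())   # nearly equal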
(Refer Slide Time: 18:24)

Then covariance. For two discrete random variables X and Y with E(X) = µx and E(Y) = µy, the covariance between X and Y is defined as

Cov(X, Y) = σxy = E[(X − µx)(Y − µy)],

which simplifies to E(XY) − µx·µy. That is the covariance.

(Refer Slide Time: 18:54)

In general, the covariance between two random variables can be positive or negative. If the random variables move in the same direction, the covariance is positive; if they move in opposite directions, it is negative. Properties of covariance: if X and Y are independent random variables, their covariance is 0, since E(XY) = E(X)·E(Y) under independence. Cov(X, X) is simply Var(X); similarly, Cov(Y, Y) is Var(Y).
(Refer Slide Time: 19:31)

Then the correlation coefficient. The covariance tells you the sign — whether the variables are positively or negatively related — but not the magnitude of how strongly they are related. The correlation coefficient provides that measure of strength: the covariance gives only the direction, while the correlation gives the magnitude. For two random variables X and Y with E(X) = µx and E(Y) = µy, the correlation coefficient is defined as Cov(X, Y)/(σx·σy).
(Refer Slide Time: 20:06)

Dear students, now we are going to study some special distributions, in both the discrete and the continuous categories. In the discrete category we will study the binomial, Poisson, and hypergeometric distributions; in the continuous category, the uniform, exponential, and normal. In this class I will explain the theory and the corresponding parameters; at the end, in the practical class, we will use Python to find the mean, variance, and probabilities of these distributions.
(Refer Slide Time: 20:46)

The first one is the binomial distribution. Let us consider an example to explain the concept. Consider the purchase decisions of the next 3 customers who enter a store. On the basis of past experience, the store manager estimates that the probability that any one customer will make a purchase is 0.30. What is the probability that 2 of the next 3 customers will make a purchase?
(Refer Slide Time: 21:18)

Now look at the tree diagram. For the first customer there are 2 possibilities: S for purchase (success) and F for no purchase (failure); X is the number of customers making a purchase. The first customer may purchase or not; the second customer likewise has both possibilities, and so does the third. Listing the experimental outcomes: success-success-success, success-success-failure, success-failure-success, success-failure-failure, failure-success-success, failure-success-failure, failure-failure-success, failure-failure-failure. We have written all the possibilities.

Now, what does SSS mean? All 3 customers purchased, so x = 3. In the second case, two customers purchased and the third did not, so x = 2, because X counts the number of customers making a purchase. Across the outcomes, x takes the values 3, 2, 2, 1, 2, 1, 1, 0. The question is: what is the probability that 2 out of the 3 customers make a purchase? You can see there are several outcomes with x = 2.
(Refer Slide Time: 23:00)

Consider the outcomes with exactly 2 successes: SSF, SFS, and FSS. The probability of a success is p = 0.3 and of a failure is 1 − p = 0.7. For SSF the probability is p·p·(1 − p) = (0.3)²(0.7) = 0.063. For SFS it is p·(1 − p)·p = p²(1 − p) = (0.3)²(0.7) = 0.063. For FSS it is (1 − p)·p·p = p²(1 − p) = 0.063. So each of the three outcomes has probability 0.063.

Why three outcomes? Because the question asks, out of 3 customers, what is the probability that 2 will buy — and the number of ways to choose which 2 succeed is ³C₂ = 3. Going back to the tree, that is exactly why there are 3 such paths; that is the meaning of the value ³C₂.
(Refer Slide Time: 24:20)

Similarly, we can find the probability that x = 0, which means nobody buys.
(Refer Slide Time: 24:33)

Students, we have studied variance, covariance, and the correlation coefficient so far; let us now relate them. We know the variance formula: Var = Σ(x − x̄)²/(n − 1). Variance is for one variable; for two variables, the analogous quantity is called the covariance.

So the covariance is Cov(x, y) = Σ(x − x̄)(y − ȳ)/(n − 1) — the same structure as the variance, with (x − x̄)² replaced by (x − x̄)(y − ȳ), again divided by n − 1. Next, the correlation coefficient is Cov(x, y) divided by (standard deviation of x × standard deviation of y). So: variance, covariance, and then the correlation coefficient — dividing the covariance by the corresponding standard deviations gives the correlation coefficient.

Next is m, the slope of a regression equation: it is nothing but Cov(x, y) divided by Var(x). So variance, covariance, correlation coefficient, and the slope of the regression equation are all related. The variance is for one variable: what is the meaning of variance? How far each value is from its mean — take the deviations, square them, sum them, and take the mean of that sum; that gives the variance.

For covariance there are two variables: it measures how each variable moves away from its own mean, Σ(x − x̄)(y − ȳ)/(n − 1). For the correlation coefficient, divide the covariance by the corresponding standard deviations.

For the slope of the regression equation, divide Cov(x, y) by the corresponding variance of x; that gives m, the slope. So all four quantities are interrelated — a small Python sketch of these relationships follows.
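A minimal sketch, using assumed illustrative data, that computes all four quantities on the same sample:

# Variance, covariance, correlation, and regression slope on the same data.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # assumed data
y = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

cov_xy = np.cov(x, y)[0, 1]                   # sum((x-xbar)(y-ybar))/(n-1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))  # correlation coefficient
m = cov_xy / x.var(ddof=1)                    # slope of the regression line
print(cov_xy, r, m)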
Dear students, so far we have seen the need for studying distributions, how to construct a discrete probability distribution, and, after constructing it, how to find the mean and variance of a discrete distribution. We have also seen the properties of the expected value and of the variance, and how the mean, variance, and covariance are interrelated. In the next class we will continue with some discrete distributions and some continuous distributions in detail. Thank you.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee

Lecture - 09
Probability Distribution – II

Dear students, in this lecture we are going to see some discrete and continuous distributions. In the discrete category we will study the binomial, Poisson, and hypergeometric distributions; in the continuous category, the uniform, exponential, and normal. First, let us see an application of the binomial distribution.
(Refer Slide Time: 00:48)

In other words, what is the need for the binomial distribution? We will take one practical example that I will first solve manually using the basic concept of probability, and then show how the binomial distribution solves the same problem much more quickly. Consider the purchase decisions of the next 3 customers who enter a store, based on past experience.

The store manager estimates that the probability that any one customer will make a purchase is 0.30. Now, what is the probability that 2 of the next 3 customers will make a purchase?
(Refer Slide Time: 01:29)

The problem is drawn as a tree diagram: first customer, second customer, third customer. When the first customer comes, there are 2 possibilities, purchase or no purchase — purchase is marked S for success, no purchase F for failure. The second and third customers likewise may purchase or not. Look at the first path: the first, second, and third customers all purchase, so it is S, S, S.

Let X equal the number of customers making a purchase (see the bottom of the slide). For the SSS path, all 3 customers buy. For the second path, success-success-failure, 2 of the 3 customers buy. In this way all the possibilities are displayed. Now, the problem asks about 2 purchases out of 3 customers: how many paths have exactly 2 purchases? There are 3 — which is exactly ³C₂, the number of ways to choose which 2 of the 3 customers buy. So ³C₂ = 3.
(Refer Slide Time: 02:57)

Now look at the paths with exactly 2 purchases. First: purchase, purchase, no purchase — success, success, failure. A success has probability p and a failure 1 − p, so the path probability is p·p·(1 − p) = (0.30)²(0.70) = 0.063. For the second path the first customer buys, the second does not, and the third buys — success, failure, success — giving p·(1 − p)·p = p²(1 − p) = 0.063. For the third path — no purchase, purchase, purchase, i.e. FSS — we get (1 − p)·p·p = p²(1 − p) = 0.063.
(Refer Slide Time: 03:48)

Plotting all the possibilities against x: when x = 0, what is the chance that none of the 3 customers buys? That is the path 0.7 × 0.7 × 0.7. When x = 1, what is the probability that exactly 1 customer buys? When x = 2, that exactly 2 buy? When x = 3, that all 3 buy? So we can work out everything manually with the concept of probability.

The question asked was: what is the probability that 2 customers out of 3 will buy? The full table is the probability distribution. We can build it manually like this, but it takes a lot of time. With the binomial distribution you get each probability directly:

P(X = 2) = ⁿCₓ·pˣ·qⁿ⁻ˣ, with n = 3 and x = 2,

where p is the probability of success and q = 1 − p the probability of failure. The ⁿCₓ term counts the different combinations in which the event can happen, and pˣ·qⁿ⁻ˣ — here p²·qⁿ⁻² — is the probability of each such combination. That is the purpose of the binomial distribution.
(Refer Slide Time: 05:35)

What are the properties of the binomial distribution? The experiment involves n identical trials. Each trial has exactly 2 possible outcomes, success or failure. Each trial is independent of the previous trials: in our example, whether the second customer intends to buy does not depend on what the first customer did.

p is the probability of success on any one trial and q is the probability of failure on any one trial; p and q are constant throughout the experiment — this is an important assumption. X is the number of successes in the n trials; in the previous case we looked at x = 2.
(Refer Slide Time: 06:23)

So the probability function, as I told you, can be written as

P(x) = ⁿCₓ·pˣ·qⁿ⁻ˣ, where ⁿCₓ = n!/(x!(n − x)!).

The mean of this distribution is µ = n·p — a result we will use directly; you can derive it from the expected value by computing Σ x·P(x) and simplifying. Similarly, we will use the results directly for the variance, σ² = npq, and the standard deviation, √(npq).
(Refer Slide Time: 07:03)

Sometimes a ready-made table is available for binomial probability values: for example, with n = 10 trials and x = 3, you look up the column for the given p and read the probability directly (0.150 in the slide's example). The tables are provided so that you do not have to use a calculator, which would take more time.
(Refer Slide Time: 07:34)

Now let us find the mean and variance. Suppose the clothing store from the previous problem forecasts that 1000 customers will enter the store next month. What is the expected number of customers who will make a purchase out of 1000? The answer is µ = np = 1000 × 0.3 = 300: out of 1000 customers, about 300 are expected to make a purchase.

For those 1000 customers, the variance of the number who make a purchase is npq = 1000 × 0.3 × 0.7 = 210, and the standard deviation is √210 = 14.49.
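The same numbers can be reproduced with scipy.stats — a minimal sketch of the store example:

# Binomial pmf, mean, and variance for the store example.
from scipy.stats import binom

n, p = 3, 0.30
print(binom.pmf(2, n, p))         # P(X = 2) = 3C2 * 0.3^2 * 0.7 = 0.189

mean, var = binom.stats(1000, p)  # for the 1000-customer forecast
print(mean, var, var ** 0.5)      # 300.0, 210.0, about 14.49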
(Refer Slide Time: 08:24)

The next distribution is the Poisson, which describes discrete occurrences over a continuous interval. Generally the Poisson distribution is for rare events; it is a discrete distribution, so X can take only discrete values. Each occurrence is independent of any other occurrence, the number of occurrences in an interval can vary from 0 to infinity, and the expected number of occurrences must hold constant throughout the experiment. These are the assumptions of the Poisson distribution.
(Refer Slide Time: 09:09)

Some examples of applications of the Poisson distribution: arrivals at a queuing system follow a Poisson distribution — at an airport, people, airplanes, automobiles, and baggage arrive; at banks, people, automobiles, and loan applications arrive; in computer file servers, read and write requests arrive. All of these arrival patterns follow a Poisson distribution.

Defects in manufactured goods are another example — say, the number of defects per 1000 feet of extruded copper wire. Notice that n is very large and the probability of success very low. Further examples are the number of blemishes (a kind of paint defect) per square foot of painted surface and the number of errors per typed page.
(Refer Slide Time: 10:15)

The probability function for the Poisson distribution is

P(x) = (λˣ·e^(−λ))/x!,

where x takes only discrete values, λ is the mean (the long-run average), and e ≈ 2.71. The mean of the Poisson distribution is λ, the variance is also λ, and the standard deviation is √λ. It is a one-parameter distribution: a special property of the Poisson distribution is that its mean and variance are the same, both equal to λ.
(Refer Slide Time: 10:55)

A caution while using the Poisson distribution: the units of λ and x must be the same. For example, suppose λ = 3.2 customers per 4 minutes and we want the probability of 10 customers in 8 minutes. You must first adjust λ: multiplying by 2 gives λ = 6.4 per 8 minutes. Now the units of x and λ match and you can use P(x) = (λˣ·e^(−λ))/x! with x = 10, which gives about 0.053.

Similarly, for the probability of 6 customers in 8 minutes with λ = 3.2 per 4 minutes, rescale λ to 6.4 per 8 minutes and then evaluate P(x). The point is that the units of λ and x must agree.
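A sketch of this unit adjustment with scipy.stats:

# Rescale lambda so its interval matches x before using the Poisson pmf.
from scipy.stats import poisson

lam_per_4min = 3.2
lam_per_8min = 2 * lam_per_4min       # 6.4 arrivals per 8 minutes

print(poisson.pmf(10, lam_per_8min))  # P(10 customers in 8 minutes), about 0.053
print(poisson.pmf(6, lam_per_8min))   # P(6 customers in 8 minutes), about 0.16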
(Refer Slide Time: 11:49)

Here too a Poisson probability table is available. The columns show µ; for µ = 10 and x = 5 you read the entry directly, 0.0378, so you need not use a calculator. In the practical class we will compute these probabilities with the help of Python.
(Refer Slide Time: 12:23)

Next, the hypergeometric distribution. The binomial distribution is applicable when selecting from a finite population with replacement, or from an infinite population without replacement. Whenever selection is made from a finite population without replacement, we have to think of the hypergeometric distribution.
(Refer Slide Time: 12:52)

Properties of the hypergeometric distribution: sampling is without replacement from a finite population. The number of objects in the population is denoted N. Each trial has exactly 2 possible outcomes, success or failure, similar to the binomial distribution. However, the trials are not independent — this is one property that differs from the binomial. In the binomial case, sampling is effectively with replacement, so the trials are independent; here we sample without replacement.

Because the trials are dependent, the probability of success p is not fixed — it changes from draw to draw. X is the number of successes in the n trials. The binomial is an acceptable approximation when N/10 ≥ n, that is, when the sample is small relative to the population; otherwise it is not.
(Refer Slide Time: 13:53)

For discrete values we use the probability mass function (for continuous variables we would use the PDF, the probability density function). The pmf is

P(x) = [C(A, x) · C(N − A, n − x)] / C(N, n),

where the capital letters describe the population — N is the population size, A the number of successes in the population — and the small letters describe the sample: n is the sample size and x the number of successes in the sample. The mean is µ = A·n/N, the variance is A(N − A)·n·(N − n) / (N²(N − 1)), and the standard deviation is the square root of the variance.
(Refer Slide Time: 14:57)

Let us see one problem. Computers are checked in a department: 4 out of 10 computers have illegal software loaded. What is the probability that 2 of 3 selected computers have illegal software loaded? Looking at the problem, the population is finite — and whenever the population is finite and sampling is without replacement, we should think of the hypergeometric distribution.

Here N = 10 is the population size, and A = 4, because we know 4 of the 10 computers in the population have illegal software installed. For the sample, n = 3 and x = 2: what is the probability that 2 of the 3 selected computers have illegal software? Substituting everything into the pmf gives 0.30. The interpretation of this 0.30: the probability that 2 of the 3 selected computers have illegal software loaded is 30%.
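The same example with scipy.stats.hypergeom, which names its parameters M = population size, n = successes in the population, and N = sample size:

# P(2 of 3 sampled computers have illegal software), without replacement.
from scipy.stats import hypergeom

M, n_succ, N = 10, 4, 3
print(hypergeom.pmf(2, M, n_succ, N))  # C(4,2)*C(6,1)/C(10,3) = 0.30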
(Refer Slide Time: 16:38)

Now we move to continuous probability distributions. A continuous random variable can assume any value in a continuum — an uncountable number of values. The thickness of an item, the time required to complete a task, the temperature of a solution, a person's height: these are examples of continuous random variables. Such a variable can potentially take any value, limited only by our ability to measure precisely and accurately.

First, the uniform distribution. The uniform distribution is a probability distribution with equal probabilities for all possible outcomes of the random variable — this is the distribution of random numbers: the chance of choosing any number is the same. Because of its shape it is also called the rectangular distribution.
(Refer Slide Time: 17:31)

Look at the uniform distribution: it is defined on an interval a ≤ x ≤ b. The probability density function is f(x) = 1/(b − a), where b is the upper limit and a the lower limit; outside this interval, f(x) = 0. The total area under the curve is 1.
(Refer Slide Time: 17:51)

The mean of a uniform distribution is µ = (a + b)/2, and the standard deviation is σ = (b − a)/√12. These are standard results; for the derivations, you can refer to a textbook.
(Refer Slide Time: 18:07)

Now a problem using the uniform distribution. Suppose a uniform probability distribution is defined over the range 2 ≤ x ≤ 6. Then f(x) = 1/(b − a) = 1/(6 − 2) = 0.25: the density is 0.25 at every point between 2 and 6, whether you pick 3, 4, or 5. The mean is (a + b)/2 = (2 + 6)/2 = 8/2 = 4, and the standard deviation is (b − a)/√12 = 4/√12 ≈ 1.15.
(Refer Slide Time: 18:56)

Another example: suppose a uniform random variable is defined between 41 and 47. First find the probability density function: 1/(b − a) = 1/(47 − 41) = 1/6. So the height of the rectangular distribution is 1/6, with lower limit 41 and upper limit 47.
(Refer Slide Time: 19:21)

The mean of the distribution is (41 + 47)/2 = 88/2 = 44, and the standard deviation is (b − a)/√12 = (47 − 41)/3.464 = 1.732.
(Refer Slide Time: 19:37)

Now, with the uniform distribution defined on the interval 41 to 47, suppose we want the probability between 42 and 45. That is

P(x₁ ≤ X ≤ x₂) = (x₂ − x₁)/(b − a) = (45 − 42)/6 = 3/6 = 1/2,

so this area is 0.5.
(Refer Slide Time: 20:04)

Here is another example of the uniform distribution. Consider a random variable x representing the flight time of an airplane travelling from Delhi to Mumbai, where the flight time can take any value in the interval from 120 to 140 minutes. Because x can assume any value in the interval, x is a continuous rather than a discrete random variable.
(Refer Slide Time: 20:30)

Assume that sufficient actual flight data are available to conclude that the probability of a flight time within any one-minute interval is the same as within any other one-minute interval contained in the larger interval from 120 to 140. That is exactly the defining property of the uniform distribution: every small interval of the same width has the same probability.

With every one-minute interval equally likely, the random variable x is said to have a uniform probability distribution: f(x) = 1/(140 − 120), with upper limit 140 and lower limit 120.
(Refer Slide Time: 21:20)

So the height of this rectangular distribution is 1/20, with a = 120 and b = 140. Suppose you are asked for the probability that the flight time is between 120 and 130 minutes. You compute (130 − 120)/(140 − 120), which simplifies to 0.50: the probability that the flight takes between 120 and 130 minutes is 50%.
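A sketch of this flight-time example with scipy.stats.uniform, which parameterizes the interval as loc = a and scale = b − a:

# Uniform(120, 140): P(120 <= X <= 130), mean, and standard deviation.
from scipy.stats import uniform

a, b = 120, 140
dist = uniform(loc=a, scale=b - a)

print(dist.cdf(130) - dist.cdf(120))  # 0.5
print(dist.mean(), dist.std())        # 130.0, (b - a)/sqrt(12), about 5.77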
(Refer Slide Time: 22:05)

Next we go to the exponential probability distribution. The exponential distribution is useful for describing the time it takes to complete a task. An exponential random variable can describe the time between vehicle arrivals at a toll booth, the time required to complete a questionnaire, or the distance between major defects in a highway. Whenever the word "between" appears — time between arrivals, distance between defects — it is usually appropriate to use the exponential distribution.
(Refer Slide Time: 22:43)

The density function for the exponential probability distribution is f(x) = (1/µ)·e^(−x/µ), where µ is the mean.
(Refer Slide Time: 22:52)

Let us see how to construct the exponential distribution. Suppose x represents the loading time for a truck at a loading dock and follows such a distribution. If the mean (average) loading time is 15 minutes, µ = 15, then the appropriate probability density function is f(x) = (1/15)·e^(−x/15).
(Refer Slide Time: 23:13)

What does this exponential distribution mean? The probability that the loading time is less than 6 is one area under the curve; the probability that the loading time is between 6 and 18 is another — the shaded portion represents the probability that the loading time falls between 6 and 18.
(Refer Slide Time: 22:39)

In many applications of the exponential distribution we use the cumulative distribution function. For an exponential distribution,

P(x ≤ x₀) = 1 − e^(−x₀/µ),

where x₀ is a specific value of x. This is nothing but the result of integrating the density over the interval from 0 to x₀ and simplifying.
(Refer Slide Time: 24:08)

An example of the exponential probability distribution: the time between arrivals of cars at a petrol pump follows an exponential distribution with a mean time between arrivals of 3 minutes — note that it is the mean time between arrivals, so µ = 3. The petrol pump owner would like to know the probability that the time between 2 successive arrivals is 2 minutes or less, that is, x ≤ 2.
(Refer Slide Time: 24:43)

To find P(x ≤ 2), substitute into the cumulative distribution function: 1 − e^(−2/3) = 0.4866 — that is the shaded area. If instead you ask for the probability that the time between 2 successive arrivals is less than 7 minutes, the probability increases: as x increases, there is more chance that the gap between two successive arrivals falls within it. That is the way to interpret the exponential distribution.
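A sketch of the petrol-pump example with scipy.stats.expon, where the scale parameter is the mean µ:

# P(time between arrivals <= x0) = 1 - exp(-x0/mu) for mu = 3 minutes.
from scipy.stats import expon

mu = 3
print(expon.cdf(2, scale=mu))  # 1 - e^(-2/3), about 0.4866
print(expon.cdf(7, scale=mu))  # about 0.9030 -- wider interval, higher probability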
(Refer Slide Time: 25:23)

Now, there is a very important relationship between the Poisson and exponential distributions. The Poisson distribution describes the number of occurrences per interval: given an interval, how many occurrences happen in it. The exponential distribution describes the length of the interval between occurrences.

So the gap from one occurrence to the next is described by the exponential distribution, while the number of occurrences within an interval is described by the Poisson distribution. That is the relation between the Poisson and the exponential.
(Refer Slide Time: 26:14)

What is the relationship between the mean of the Poisson and the mean of the exponential? Suppose the average number of arrivals is 10 cars per hour: in that interval, 10 cars arrive on average. This 10 is taken as the mean of the Poisson distribution, while 1/10 (of an hour) is taken as the mean of the exponential distribution — the mean time between arrivals. Generally, if µ is the mean of the Poisson distribution, then 1/µ is the mean of the corresponding exponential distribution. That is the link between the two.
(Refer Slide Time: 27:05)

Next, we are entering a very important distribution, the normal distribution — we can call it the father of all distributions, because if some phenomenon is happening and you are not aware of its nature, you can assume it follows a normal distribution. The normal distribution is bell shaped and symmetrical, and its mean, median, and mode are equal. Its location is characterized by µ and its spread by σ. The random variable has an infinite theoretical range, from minus infinity to plus infinity.
(Refer Slide Time: 27:50)

The density function of the normal distribution is

f(x) = (1/(σ√(2π))) · e^(−(1/2)·((x − µ)/σ)²),

where e ≈ 2.71 and π ≈ 3.14 are mathematical constants, µ is the population mean, σ is the population standard deviation, and x is any value of the continuous variable.
(Refer Slide Time: 28:18)

By varying the parameters µ and σ we obtain normal distributions of different shapes. Dear students, so far we have seen some discrete and continuous distributions: in the discrete category we talked about the binomial and Poisson distributions, and in the continuous category the exponential and uniform. The very important normal distribution will be covered in the next class. Thank you very much.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology Roorkee

Lecture – 10
Probability Distributions - III

Welcome back, students. Now we are going to discuss another important continuous distribution: the normal distribution. It can be called the mother of all distributions because, for any phenomenon where you are not aware of the nature of the distribution, you can assume it follows a normal distribution. Most of the statistical tests and analytical tools we are going to use in this course rest on the assumption that the data follow a normal distribution.

So knowing the properties, behaviour, and assumptions of the normal distribution is very important for this course. Some of the properties: the normal distribution is a bell-shaped curve.
(Refer Slide Time: 01:16)

The curve forms a bell shape and is symmetrical: if you fold it, both sides match. Another important property: the mean, median, and mode are equal. The location is characterized by the mean µ and the spread by the standard deviation. The random variable has an infinite theoretical range, from minus infinity to plus infinity.
(Refer Slide Time: 01:48)

The formula for the normal probability density function is

f(x) = (1/(σ√(2π))) · e^(−(1/2)·((x − µ)/σ)²),

where e ≈ 2.71828 and π ≈ 3.14 are mathematical constants, µ is the population mean, σ is the population standard deviation, and x is any value of the continuous variable.
(Refer Slide Time: 02:18)

The shape of the normal distribution changes with its spread: by varying the parameters µ and σ we obtain different normal distributions — for example, one where σ is very small, one where σ is moderate, and one where σ is very large.
(Refer Slide Time: 02:39)

Changing µ shifts the distribution left or right: increasing µ moves it one way, decreasing it the other. Changing σ increases or decreases the spread: when you decrease σ the spread decreases, and when you increase σ the spread increases.
(Refer Slide Time: 03:01)

There is another normal distribution, the standardized normal distribution. Any normal distribution, whatever its mean and standard deviation, can be transformed into the standardized normal distribution. All you have to do is transform X units into Z units via Z = (X − µ)/σ. The standardized normal distribution has mean 0 and variance (and standard deviation) 1.

(Refer Slide Time: 03:36)

The translation from X to the standardized normal (Z) distribution is done by subtracting the mean of X and dividing by the standard deviation:

Z = (X − µ)/σ,

where X is the random variable, µ is the mean of the population, and σ is the standard deviation of the population.
(Refer Slide Time: 04:05)

Substituting Z = (X − µ)/σ into the previous equation, the standardized normal probability density function becomes

f(z) = (1/√(2π)) · e^(−z²/2),

where π is the mathematical constant and z is any value of the standardized normal distribution.
(Refer Slide Time: 04:26)

What does the standardized normal distribution, also known as the Z distribution, look like? Its mean is 0 and its standard deviation is 1. Values above the mean have positive Z values; values below the mean have negative Z values.
(Refer Slide Time: 04:44)

Let us see how to do the conversion from a normal distribution to the standardized normal distribution. If X is distributed normally with a mean of 100 and a standard deviation of 50, then for X = 200 the corresponding Z value is Z = (X − µ)/σ = (200 − 100)/50 = 2.0. This says that X = 200 is two standard deviations above the mean of 100 — 2 increments of 50 units. The Z value is nothing but the number of standard deviations from the mean; here it is 2 increments of 50, which is why Z = 2.
(Refer Slide Time: 05:41)

Look at the conversion — this will make it clear. The red curve shows the original normal distribution and the black one the standardized normal distribution. We were asked for the Z value corresponding to X = 200. The mean µ of the original distribution becomes 0 on the standardized scale, and X = 200 becomes Z = 2, with µ = 0 and σ = 1. Note that the distribution is the same; only the scale has changed. We can express the problem in original units or in standardized units.

Why convert to the standardized normal distribution at all? Because to find an area you would otherwise have to integrate the density every time, which is a very cumbersome process. Converting to the standardized normal lets you read areas directly from the Z statistical table, which greatly simplifies the task.

(Refer Slide Time: 07:13)

In a continuous distribution, probability is measured by area under the curve, and it must always be expressed between two points A and B. The probability exactly at A or exactly at B encloses no area, so it is 0. In the context of a continuous distribution, probability means area under the curve; in a discrete probability distribution, by contrast, the probability can be read directly from X and its corresponding p(x).
(Refer Slide Time: 07:53)

The total area under the curve is 1, and the curve is symmetric, so half the area is above the mean and half below: P(−∞ < X ≤ µ) = 0.5 and P(µ ≤ X < +∞) = 0.5.
(Refer Slide Time: 08:16)

Suppose you want to know the area for Z less than 2.00: the area to the left of Z = 2.00 is 0.9772.
(Refer Slide Time: 08:30)

One way is to read it directly from the Z table: the rows give the Z value to the first decimal point and the columns give the second decimal point. To find Z = 2.00, look at the row for 2.0 and the column for 0.00; the entry there is the area. The value within the table gives the probability from minus infinity up to the desired Z value.

When you look at a statistical table, especially a Z table, be very careful about how the area is given. There are 2 possibilities: sometimes the area is given from minus infinity up to the Z value; sometimes only the area from 0 to positive Z is given. Because the distribution is symmetric, the table may list only positive Z values; for a negative Z, you read the positive value and carry it to the negative side.
(Refer Slide Time: 09:53)

The procedure for finding a normal probability P(a < X < b) when X is distributed normally: first, draw the normal curve for the problem in terms of X — whenever you are going to find an area, it is good to draw the distribution so you can read it intuitively from the picture. Then translate the X values to Z values, and finally use the standardized normal table to get the area.
(Refer Slide Time: 10:34)

Let X represent the time it takes to download an image file from the internet, and suppose X is normal with mean 8 and standard deviation 5. We want P(X < 8.6), the probability that the download time is below 8.6. First mark the mean, then locate X = 8.6; since we want "less than 8.6", we need the area to its left. You could integrate the normal density from minus infinity to 8.6 and get the area — there is no problem with that — but it is a very time-consuming process.
(Refer Slide Time: 11:26)

The easy way is to convert the normal distribution to the standard normal distribution: convert the X value to the Z scale and then use the table to find the area for the corresponding Z value. With X normal with mean 8 and standard deviation 5, for X = 8.6 use Z = (X − µ)/σ = (8.6 − 8)/5 = 0.12. Now, with Z = 0.12, you can read the probability directly from the normal table.
(Refer Slide Time: 12:06)

Reading the table at Z = 0.12, the area is 0.5478.
(Refer Slide Time: 12:15)

Now suppose we want P(X > 8.6), the area on the right side. Convert to the Z scale as before; since the total area is 1,

P(X > 8.6) = 1 − P(Z < 0.12) = 1 − 0.5478 = 0.4522.
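A sketch of these download-time calculations with scipy.stats.norm:

# P(X < 8.6) and P(X > 8.6) for X ~ Normal(mean=8, sd=5).
from scipy.stats import norm

mu, sigma = 8, 5
z = (8.6 - mu) / sigma                                     # 0.12 on the Z scale

print(norm.cdf(8.6, loc=mu, scale=sigma))                  # about 0.5478
print(norm.sf(8.6, loc=mu, scale=sigma))                   # 1 - cdf, about 0.4522
print(norm.cdf(8.6, mu, sigma) - norm.cdf(8, mu, sigma))   # about 0.0478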
(Refer Slide Time: 12:56)

Suppose X is normal with mean 8 and standard deviation 5, and we want P(8 < X < 8.6). Both X values have to be converted: X = 8 gives Z = 0 and X = 8.6 gives Z = 0.12, so we need the area between Z = 0 and Z = 0.12.
(Refer Slide Time: 13:28)

From the table, first find the area from minus infinity to Z = 0.12, which is 0.5478; then subtract the area to the left of Z = 0, which we know is 0.5. The remainder is 0.0478.
(Refer Slide Time: 13:50)

Now the reverse problem: the probability is given and you have to find the X value. Again let X represent the time it takes to download an image file from the internet, with X normal with mean 8 and standard deviation 5. Find X such that 20% of download times are less than X. There are 2 elements here: "less than X" and "20%". So the left-hand area equals 0.2 — what is the corresponding X value? First find the Z value, and from Z find X.
(Refer Slide Time: 14:35)

Now look at the table: for a left-tail area of 0.2, the corresponding Z value is −0.84.
(Refer Slide Time: 14:45)

We know Z = −0.84, and the formula Z = (X − µ)/σ lets us recover X. One more caution: when reading the Z value, be careful what kind of table you are using. If the table gives the area only from 0 to positive Z and you are measuring an area on the left-hand side, you will read a positive Z from the table but must attach a negative sign to it.

So with µ = 8.0: X = µ + Zσ = 8.0 + (−0.84)(5) = 3.80. Thus 20% of the download times from a distribution with mean 8 and standard deviation 5 are less than 3.8 seconds.
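This inverse lookup — from a left-tail probability back to Z and X — is the ppf (percent-point function) in scipy; a minimal sketch:

# Find x such that P(X < x) = 0.20 for X ~ Normal(8, 5).
from scipy.stats import norm

mu, sigma = 8, 5
print(norm.ppf(0.20))                       # about -0.84, the Z value
print(norm.ppf(0.20, loc=mu, scale=sigma))  # about 3.80 seconds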
(Refer Slide Time: 15:47)

Another important topic is assessing normality, because the normality assumption underlies many inferential statistics. Why is it important? We will later study a concept called the Central Limit Theorem, where the distribution of sample means follows a normal distribution, and many analytical and statistical tools assume that the data follow a normal distribution.

So as soon as you collect data, the first step is cleaning it, and part of that process is verifying whether the data follow a normal distribution; otherwise you may end up choosing the wrong statistical or analytical technique.
(Refer Slide Time: 16:35)

It is important to evaluate how well a data set is approximated by the normal distribution. Normally distributed data should approximate the theoretical normal distribution: the distribution is bell shaped, the mean equals the median, the empirical rule applies, and the interquartile range is about 1.33 standard deviations. These give us ways to test normality.
(Refer Slide Time: 17:04)

Another way to assess normality is to construct charts or graphs and look at the shape of the distribution. For a small or moderate-sized data set, do a stem-and-leaf display and a box-and-whisker plot and check whether they look symmetrical. As I told you at the beginning of the lectures, the stem-and-leaf plot should show a bell-like shape for us to say the data follow a normal distribution; in a box-and-whisker plot, the middle line — the median line — should sit in the middle of the box. For a large data set, check whether the histogram or polygon appears bell shaped.

You can also compute descriptive summary measures: check whether the mean, median, and mode have similar values, whether the interquartile range is approximately 1.33σ, and whether the range is approximately 6σ. Finally, you can compute the skewness: when the skewness is 0, we can say the data follow a normal distribution.
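These descriptive checks are easy to automate; here is a minimal sketch on simulated data (the data set is assumed purely for illustration):

# Quick numerical normality checks: skewness, IQR/sigma ratio, empirical rule.
import numpy as np
from scipy import stats

data = np.random.default_rng(0).normal(loc=50, scale=10, size=500)  # assumed

print(stats.skew(data))                                   # near 0 if normal
iqr = np.percentile(data, 75) - np.percentile(data, 25)
print(iqr / data.std(ddof=1))                             # about 1.33 if normal
within_1sd = np.mean(np.abs(data - data.mean()) <= data.std(ddof=1))
print(within_1sd)                                         # about 0.68 if normal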
(Refer Slide Time: 18:31)

Some more checks of normality based on the observed distribution of the data set: do approximately 2/3 of the observations lie within ±1 standard deviation of the mean? Do approximately 80% lie within ±1.28 standard deviations? Do approximately 95% lie within ±2 standard deviations of the mean? If so, the data are consistent with a normal distribution. Next, the Z table.
(Refer Slide Time: 19:02)

Notice that this Z table starts from 0, not from minus infinity: the columns give the second decimal, and for Z = 0 the tabulated area is 0, because only the area from 0 to Z is given — only one side is shown. So if you need the area from minus infinity, you must add 0.5 to the table entry. There is another important point I want to share with you.
(Refer Slide Time: 19:39)

See, between Z = 0 and Z = 1 the area is 0.3413. Suppose we want to know the area from minus infinity to 1: you have to add 0.5 to that, and we will get the value. There is another point about the normal distribution; I will come back to that.
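The relationship between the 0-to-Z table and the cumulative area is easy to confirm with scipy's norm, which always measures from minus infinity:

```python
from scipy.stats import norm

# norm.cdf measures from minus infinity, unlike a 0-to-Z table
area_0_to_1 = norm.cdf(1) - norm.cdf(0)  # 0.3413, the tabulated value
print(area_0_to_1, 0.5 + area_0_to_1)    # adding 0.5 gives P(Z <= 1) = 0.8413
```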
(Refer Slide Time: 20:04)

If X is normally distributed with mu = 485 and sigma = 105, consider the interval 485 to 600. When X is 485, converting to the Z scale gives 0; when X = 600, the corresponding Z value is 1.1. So for Z between 0 and 1.1, the area under the curve is 0.3643. Dear students, we have seen so far the properties of the normal distribution and the standardized normal distribution, how the two are interrelated, and how to find the area with the help of the table.
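The same area can be verified numerically; a small sketch using scipy's norm with the stated parameters:

```python
from scipy.stats import norm

# P(485 <= X <= 600) for X ~ N(485, 105)
p = norm.cdf(600, loc=485, scale=105) - norm.cdf(485, loc=485, scale=105)
print(round(p, 4))  # ~0.3633 (the table rounds Z to 1.1, giving 0.3643)
```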
(Refer Slide Time: 20:51)

Look at the normal distribution: its shape is like this, with x on the horizontal axis and the probability density on the vertical axis, and the curve does not touch the x axis. You may have this doubt: why does the normal distribution not touch the x axis? Suppose I am plotting the ages of the students in a class, and the ages follow a normal distribution with an average of, say, 19. If I were to close the curve at some point, there is still a possibility that somebody's age is around 30, and somebody's age is around 10.

Since this normal distribution was drawn with the help of a sample, I cannot know exactly whether such rare values of X, say X = 30 or X = 10, will occur. So why do we not close the curve, why does the normal distribution not touch the x axis? Because we give provision for rare events: X may take a very high value or a very low value, but we are not sure about that. That is why the normal distribution does not touch the x axis. Another doubt may arise when you look at the Z table.

When you look at the Z table, the maximum value of Z is about 3.5; let me go back and show you. The question may come: why does the value of Z go at most to 4 or 5 in statistical tables? Remember, at the beginning of the class I said that if you travel one sigma distance on either side of the mean you capture 68% of the area; if you travel 2 sigma distance from the mean on both sides, that is, +2 sigma and −2 sigma, you capture 95% of the area of the normal distribution. If you travel 3 sigma distance on either side, you cover 99.7% of the data. So why does the Z table not go much beyond 3? The probability of a Z value beyond ±3 is only 0.3%; equivalently, the probability of an X value being extremely high or extremely low is only 0.3%.

What is the meaning of that? There is only a 0.3% chance that the magnitude of Z will exceed 3; that is why statistical tables go only up to about 3.5 or 4 and not beyond. This is also another reason why we do not close the curve at the x axis: the probability of such extreme events is only 0.3%, provided the data follow a normal distribution. Now I will summarize, students: so far we have seen different types of probability distributions.

In the previous class we saw some continuous distributions; in this class we have seen an important one, the normal distribution. We have learned the properties of the normal distribution and the standard normal distribution, how to convert a normal distribution to a standard normal distribution, and how to refer to the Z table to find the area. In the next class, with the help of Python, we will see how to find the area under the curve and how to find the mean and standard deviation of different distributions. Thank you very much.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee

Lecture – 11
Python Demo for Distribution

Dear students, in the last lecture we saw probability distributions; in this lecture, with the help of Python, we will solve some problems on probability distributions.
(Refer Slide Time: 00:44)

The problems are taken from a book written by Ken Black; the title of the book is Applied Statistics. We will see the problems now. I am importing scipy, importing numpy as np, and from scipy.stats importing binom; binom is for doing binomial operations. One more thing: you can import a picture as well. For example, the empirical-distribution picture shown here I have taken from a source, using the Markdown image syntax: an exclamation symbol, square brackets, and then the link; when you execute that, you get the picture directly. I am executing this.

Yes, now we will see the problem: a survey found that 65% of all financial consumers were very satisfied with their primary financial institution. Suppose that 25 financial consumers are sampled, and the survey result still holds true today: what is the probability that exactly 19 are very satisfied with their primary financial institution? By looking at the problem we have to

see what kind of distribution we are going to use; here there are 2 possibilities, satisfied or not satisfied.

So there are 2 possibilities, and we can go for the binomial distribution: print binom.pmf with k = 19 (the x value in the current context), n = the number of samples, and p = the probability of success. Now we can see that the answer is 0.09, so there is a 0.09 probability that exactly 19 are very satisfied with their primary financial institution.
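A minimal reproduction of this computation (the exact notebook cell is not shown; this assumes scipy's binom):

```python
from scipy.stats import binom

# P(X = 19) successes out of n = 25 trials with success probability p = 0.65
print(binom.pmf(k=19, n=25, p=0.65))  # ~0.0908
```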
(Refer Slide Time: 02:43)

We go to the next problem; this problem is also taken from that book. According to the US Census Bureau, approximately 6 percent of all workers in Jackson, Mississippi are unemployed. In conducting a random telephone survey in Jackson, what is the probability of getting 2 or fewer unemployed workers in a sample of 20? Here we want 2 or fewer, that is, the probability of 0 plus the probability of 1 plus the probability of 2.

So we have to use the cumulative distribution function. For doing that you type binom.cdf(2, 20, 0.06): the 2 represents "less than or equal to 2", the 20 represents the sample size, and 0.06 represents the probability. When you run this you get a cumulative probability of 88.5%.

We will take another problem: solve for the binomial probability with sample size n = 20, p = 0.4, and x = 10; using binom.pmf you get the answer 0.117.
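Both of these binomial calls can be sketched as follows, assuming the same inputs as above:

```python
from scipy.stats import binom

print(binom.cdf(2, 20, 0.06))  # P(X <= 2), n = 20, p = 0.06: ~0.885
print(binom.pmf(10, 20, 0.4))  # P(X = 10), n = 20, p = 0.4: ~0.117
```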

We will go to the next distribution, the Poisson distribution. For doing Poisson calculations you have to import the library: from scipy.stats import poisson. First we find the Poisson probability mass function: poisson.pmf(3, 2), where 3 represents x and 2 represents the mean. We will see another problem: suppose bank customers arrive randomly on any weekday afternoon at an average of 3.2 customers every 4 minutes. What is the probability of exactly 5 customers arriving in a 4-minute interval on a weekday afternoon?

By looking at the problem you see that the arrival pattern follows a Poisson distribution, and you have to be very careful about the units of the mean and of x. Here both are per 4 minutes, so there is no problem: simply compute poisson.pmf(5, 3.2), where 5 is your x value and 3.2 is the arrival rate; that gives 11.39%. We will see one more problem: bank customers arrive randomly on a weekday afternoon at an average of 3.2 customers every 4 minutes. What is the probability of having more than 7 customers in a 4-minute interval on a weekday afternoon?

Here we have to find the probability of x greater than 7. First we find the cumulative probability up to 7 and save it in an object called prob: prob = poisson.cdf(7, 3.2), with lambda = 3.2. When you subtract this from 1, we get the probability of more than 7: I am finding up to 7, and 1 minus that gives more than 7. We will see another problem on the Poisson distribution.

A bank has an average random arrival rate of 3.2 customers every 4 minutes; what is the probability of getting exactly 10 customers during an 8-minute interval? You should be very careful here: the units of x and of lambda are different, because the rate is per 4 minutes while the interval is 8 minutes, so you have to convert them to the same units. Multiply 3.2 by 2 to get 6.4, so lambda = 6.4, and poisson.pmf(10, 6.4) gives the answer, approximately 0.0528.
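The three Poisson calculations above can be reproduced with a short sketch:

```python
from scipy.stats import poisson

print(poisson.pmf(5, 3.2))      # P(X = 5) in a 4-minute interval: ~0.1140
print(1 - poisson.cdf(7, 3.2))  # P(X > 7) in a 4-minute interval: ~0.0168
print(poisson.pmf(10, 6.4))     # P(X = 10) in 8 minutes (lambda doubled): ~0.0528
```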
(Refer Slide Time: 06:47)

We will go to the uniform distribution next. Suppose the amount of time it takes to assemble a plastic module ranges from 27 to 39 seconds, and the assembly times are uniformly distributed. Describe the distribution: what is the probability that a given assembly will take between 30 and 35 seconds? First we develop the array: u = np.arange(...). There are 2 functions, range and arange: arange gives a NumPy array, while plain range gives a list-like sequence.

So 27 is the starting value; we write 40 because the stop value is exclusive (the last element is stop − 1), so 39 will be the last value in the array, with an increment of 1. We got this; now this is our uniform distribution. Then, from scipy.stats import uniform, and we find the mean of this distribution: uniform.mean with loc as the starting point, 27, and scale as the width, 12, since 27 + 12 = 39. That is the syntax, and the mean is 33. Otherwise, for the uniform distribution, finding the mean is not a complicated formula: you simply compute (a + b) / 2.

Then we do the cumulative distribution function, cdf: uniform.cdf(np.arange(30, 36), loc=27, scale=12). The question asked for 30 to 35, so in np.arange you have to go up to 36 (the stop is exclusive) with increment 1; the starting point is 27 and the scale is 12. This gives the cumulative probability at each point from 30 to 35: when you run it, the value at 30 is 0.25, at 31 it is 0.33, at 32 it is 0.41, and so on. We want the probability between 30 and 35: at 30 the cumulative probability is 0.25 and at 35 it is 0.66, so subtracting, 0.66 − 0.25 ≈ 0.42 is the probability that a given assembly will take between 30 and 35 seconds.
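Putting the uniform-distribution steps together, a minimal sketch:

```python
import numpy as np
from scipy.stats import uniform

u = np.arange(27, 40)                  # the support 27, 28, ..., 39
print(uniform.mean(loc=27, scale=12))  # 33.0, same as (27 + 39) / 2
cdf = uniform.cdf(np.arange(30, 36), loc=27, scale=12)
print(cdf)                             # [0.25, 0.333, 0.417, 0.5, 0.583, 0.667]
print(cdf[-1] - cdf[0])                # P(30 <= X <= 35) ~0.4167
```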
(Refer Slide Time: 09:33)

We will see one more problem: according to the National Association of Insurance Commissioners, the average annual cost of automobile insurance in the United States in a recent year was 691 dollars. Suppose automobile insurance costs are uniformly distributed in the United States from 200 dollars to 1182 dollars; what is the standard deviation of this uniform distribution? We have to find the standard deviation of this distribution. Before that we check the mean; the mean is given as 691 dollars.

So we verify it: uniform.mean with loc (the starting point) 200 and scale 982, that is, 1182 minus 200. This gives exactly 691 dollars. If you want the standard deviation of the uniform distribution (its formula is different, it is not the simple standard deviation), use uniform.std with loc 200 and scale 982, and you get a standard deviation of 283.47.
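A two-line check of the mean and standard deviation, assuming the same loc/scale parameterization:

```python
from scipy.stats import uniform

# Uniform on [200, 1182]: loc = 200, scale = 1182 - 200 = 982
print(uniform.mean(loc=200, scale=982))  # 691.0
print(uniform.std(loc=200, scale=982))   # ~283.48, i.e. (b - a) / sqrt(12)
```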

Next we move to the normal distribution. Here also I have inserted a picture of the normal distribution, taken from a link with the same exclamation-and-square-bracket syntax; when you execute this you get the picture of the probability distribution. First we have to import norm from scipy: from scipy.stats import norm. The inputs are value, mean, and standard deviation: 68, 65.5, 2.5. Suppose x = 68, the mean of the normal distribution is 65.5, and the standard deviation is 2.5: what is the probability? We run that (you have to run the import first); the probability is 0.8413.

norm.cdf gives the probability of x less than the given value. Suppose you want the cumulative probability of x greater than a value: you have to subtract from 1. Say we want the value 68 and above: we already know the area up to 68, and since the total area under the normal distribution is 1, 1 minus that value gives the right-side area. Suppose you want to know the probability between x1 and x2.

For example, for value1 ≤ x ≤ value2, it is very simple: print norm.cdf at the upper limit minus norm.cdf at the lower limit; you simply type the limits, because the mean and standard deviation are already declared. For instance, between the x values 60 and 63, if we want to know the area, the same simple approach works and saves a lot of our time. Suppose: what is the probability of obtaining a score greater than 700 on a GMAT test that has mean 494 and standard deviation 100? Assume GMAT scores are normally distributed.

This is another example: what is the probability of x greater than 700 when the mean is 494 and the standard deviation is 100? Because we want x ≥ 700, we find the cumulative probability up to 700 and then subtract from 1: print(1 - norm.cdf(700, 494, 100)) gives the answer. Next, what is the probability of randomly drawing a score of 550 or less? We need x ≤ 550, so norm.cdf(550, 494, 100) is the answer.

What is the probability of randomly obtaining a score between 300 and 600 on the GMAT examination? (This problem is taken from Statistics for Management by Levin and Rubin.) You see that the upper limit is 600 and the lower limit is 300; for the probability between them, print norm.cdf(600, 494, 100), the upper limit, minus norm.cdf(300, 494, 100), the lower limit.

What is the probability of getting a score between 350 and 450 on the same GMAT exam? With 450 and 350, this is another example similar to the previous one. Now we are going to do the reverse: so far we were finding the cdf, the cumulative probability. Now suppose the area is given: if the area is given, we want to know the x value, or, if it is a standard normal distribution, the z value, because the default function is the standard normal distribution with mean 0 and standard deviation 1.

So for a left-tail area of 0.95 the corresponding z value is 1.645; you can read this value from the table the same way I explained in my theory lecture. For the reverse direction, the important function here is norm.ppf, the percent point function (the inverse cdf). So norm.ppf of 1 minus a given right-side area gives the z value on the left-hand side; here, for example, the z value comes out to about −0.459.
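The normal-distribution examples in this passage can be collected into one sketch (values as stated in the problems):

```python
from scipy.stats import norm

value, mean, sd = 68, 65.5, 2.5
print(norm.cdf(value, mean, sd))                          # P(X <= 68) = 0.8413
print(1 - norm.cdf(700, 494, 100))                        # P(GMAT > 700) ~0.0197
print(norm.cdf(550, 494, 100))                            # P(GMAT <= 550) ~0.7123
print(norm.cdf(600, 494, 100) - norm.cdf(300, 494, 100))  # P(300..600) ~0.829
print(norm.ppf(0.95))                                     # z for left area 0.95: ~1.645
```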
(Refer Slide Time: 15:31)

Now we will see an example of the hypergeometric distribution. The example says: suppose 18 major computer companies operate in the United States, and 12 are located in California's Silicon Valley. If 3 computer companies are selected randomly from the entire list, what is the probability that one or more of the selected companies are located in Silicon Valley? What you have to notice here is "1 or more": we have to find the probability of getting 1 or more selected companies.

So from scipy.stats import hypergeom, and p_value = hypergeom.sf(...), where sf means survival function. Since it is "one or more", we evaluate the survival function at 1 − 1 = 0. Here 18 represents the population size, 3 is the number of samples we are choosing, and 12 is the number of successes in the population (the capital A, the same notation we used in the theory). The p value is 0.9754.
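Note that scipy's argument order for hypergeom is (k, M, n, N): k successes observed, M population size, n successes in the population, N sample size. A sketch of the call:

```python
from scipy.stats import hypergeom

# P(X >= 1): M = 18 companies, n = 12 in Silicon Valley, N = 3 drawn
p_value = hypergeom.sf(0, 18, 12, 3)  # survival function at 0 gives P(X >= 1)
print(p_value)  # ~0.9755
```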
(Refer Slide Time: 16:45)

We will see another example: a western city has 18 police officers eligible for promotion, and 11 of the 18 are Hispanic. Suppose only 5 of the police officers are chosen for promotion. If the officers chosen for promotion had been selected by chance alone, what is the probability that one or fewer of the 5 promoted officers would have been Hispanic? What we need here is "1 or fewer", so we have to find the cumulative probability; the function for finding the cumulative probability of a hypergeometric distribution is the cdf.

I am going to save it under the name p_value = hypergeom.cdf(1, ...), where 18 represents the population size, 5 the number chosen, and 11 the number of successes in the population. When you run this you get 0.04738.
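The corresponding cdf call, in the same (k, M, n, N) order:

```python
from scipy.stats import hypergeom

# P(X <= 1) Hispanic officers among N = 5 chosen from M = 18, of whom n = 11 are Hispanic
p_value = hypergeom.cdf(1, 18, 11, 5)
print(p_value)  # ~0.0474
```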
(Refer Slide Time: 17:43)

Now we go to the next example, on the exponential distribution; we will take a sample problem. A manufacturing firm has been involved in statistical quality control for several years. As part of the production process, parts are randomly selected and tested; from the records of these tests it has been established that defective parts occur in a pattern that is Poisson distributed, at an average of 1.38 defects every 20 minutes during production runs. Use this information to determine the probability that less than 15 minutes will elapse between any 2 defects.

There are 2 things to look at in this problem: one, the mean of the Poisson distribution is given, and second, we are asked about the time between any 2 defects. As I told you in the theory, whenever it is about the time between 2 events you have to go for the exponential distribution. First we find the mean of the exponential distribution: it is 1 divided by the mean of the Poisson distribution. Here the Poisson mean is 1.38, so we call the exponential mean mu1: mu1 = 1 / 1.38.

So that value is about 0.725. What was asked is the probability of less than 15 minutes. From scipy we have to import the exponential function, and we find the cumulative probability with expon.cdf. The 0.75 arises because we get 0.75 from 15 divided by 20: the Poisson mean was per 20 minutes, while the exponential question asks about 15 minutes.

So we divide 15 by 20 so that the units match. We need the cumulative function of the exponential distribution: the lower limit of x is 0 and the upper limit is 0.75, so expon.cdf with the upper limit 0.75, the lower limit, and the lambda value gives 0.644.
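A sketch of the exponential calculation; scipy's expon takes scale = 1/lambda, so the 1.38 rate enters through the scale argument:

```python
from scipy.stats import expon

# P(time between defects < 15 min), with time in 20-minute units: x = 15/20 = 0.75
p = expon.cdf(0.75, loc=0, scale=1/1.38)
print(p)  # ~0.6448
```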
Students, so far we have seen the binomial distribution and how to use Python for it; then we have seen the Poisson distribution and the uniform distribution.
Then you have seen Poisson distribution, we have seen uniform distribution.

We have also seen the normal distribution, the exponential distribution, and the hypergeometric distribution. We will continue in the next class with a new topic, sampling and sampling distributions. Thank you.

Data Analytics with Python
Prof. Ramesh Anbanandam
Department of Computer Science and Engineering
Indian Institute of Technology – Roorkee

Lecture – 12
Sampling and Sampling Distribution

Dear students, we now move to the next lecture, on sampling and sampling distributions. The objectives of this lecture are to describe a simple random sample and explain why sampling is important.
(Refer Slide Time: 00:40)

We will explain the difference between descriptive and inferential statistics and define the concept of a sampling distribution; determine the mean and standard deviation of the sampling distribution of the sample mean; see a very important theorem, the central limit theorem, and its importance; determine the mean and standard deviation of the sampling distribution of the sample proportion; and at the end we will see the sampling distribution of sample variances.
(Refer Slide Time: 01:17)

The whole of statistics can be classified into 2 categories: descriptive statistics and inferential statistics. Descriptive statistics is only for collecting, presenting, and describing the data as it is; it is very low-level statistics. Inferential statistics is drawing conclusions or making decisions concerning a population based on sample data: with the help of sample data we infer something about the population. So when we say population, you should know: what is the population and what is the sample?
(Refer Slide Time: 01:58)

A population is the set of all items or individuals of interest, for example, all likely voters in the next election, all parts produced today, or all sales receipts for November. A sample is a subset of the population, like 1000 voters selected at random for interview, a few parts selected for destructive testing, or random receipts selected for audit. These are examples of samples.
(Refer Slide Time: 02:33)

When you look at the left-hand side there is a bigger circle: that is the population. From it, some members are picked, and the collection of those picked values is called a sample.
(Refer Slide Time: 02:35)

The question may come: why do we sample? It is less time-consuming than a census and less costly to administer than a census. It is possible to obtain statistical results of sufficiently high precision based on samples. Because the testing process is sometimes destructive, sampling can save the product. And if accessing the whole population is impossible, sampling is the only option. Sometimes you have to go for a census, where we examine each and every item in the population.
(Refer Slide Time: 03:17)

Suppose you need higher accuracy and you are not comfortable with sample data; then you go for a census. The reasons for taking a census: a census eliminates the possibility that a random sample is not representative of the population (many times there is a chance that the sample you have taken does not represent the population), or the person authorizing the study is uncomfortable with sample information; then you go for a census.
(Refer Slide Time: 03:40)

We will see what sampling is. Sampling is selecting some items from the population. It can be classified in two ways: random sampling and non-random sampling. In random sampling the concept of randomness is taken care of; in non-random sampling it is not. Sometimes we may go for non-random sampling even though it is not well suited for many statistical analyses.

But in random sampling, the outcomes or generalizations you provide with the help of random samples are highly robust. So what is random sampling? Every unit of the population has the same probability of being included in the sample; that is the concept of randomness. A chance mechanism is used in the selection process: you can use a random number table or your calculator to choose someone randomly, which eliminates bias in the selection. It is also known as probability sampling.

Now for non-random sampling: every unit of the population does not have the same probability of being included in the sample. It is open to selection bias, and it is not an appropriate data collection method for most statistical methods; so it is not a good method for statistical analysis. It is also known as non-probability sampling.
(Refer Slide Time: 05:18)

For random sampling techniques, there are 4 ways of selecting randomly: one is the simple random sample; the second is the stratified random sample (proportionate or disproportionate); the third is the systematic random sample; the fourth is cluster or area sampling. In a simple random sample, every object in the population has an equal chance of being selected, and objects are selected independently.
(Refer Slide Time: 05:48)

Samples can be obtained from a table of random numbers or from computer random number generators. The simple random sample is the ideal against which other sampling methods are compared; it is the best method.
(Refer Slide Time: 05:59)

Suppose there are 20 states and I want to choose some states randomly for a study. The first task is to give each state a 2-digit serial number: 01, 02, and so on. This is only for illustration; in reality the number of states is more than 20.
(Refer Slide Time: 06:22)

Next I use the random number table to choose the states randomly. You can see this is a random table; you can read off any 2-digit numbers, because a random table can be read in any direction. Suppose I am reading left to right: 99, 43, 78, 76, 61, 45, and so on, then 53, then 16. Since 16 is a valid serial number, I choose serial number 16 and the corresponding state; going back, 16 is Tamil Nadu. So one state is chosen. The next usable random number is 18, and 18 is Kerala, so the next state is chosen.

Next is 50: there is no state numbered 50. 65: none. 60: none. But 01 is there: 01 is Andhra Pradesh. Then 27 is not there, 68 not there, 36 not there, 76 not there, 68 not there, 82 not there, but 08 is there: 08 is Haryana. This is the way to use a random number table to choose from the population. Here the population is the set of states; if I want to choose some states randomly for my study, I can use this random number table.

So here capital N is 20 and n is 4: capital N represents the population size and n represents the sample size.
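The same selection can be done in Python with a random number generator instead of a printed random table; a minimal sketch with illustrative serial numbers and seed:

```python
import numpy as np

# Simple random sample: draw n = 4 of the N = 20 state serial numbers, no replacement
rng = np.random.default_rng(42)  # illustrative seed
states = np.arange(1, 21)        # serial numbers 01..20
print(rng.choice(states, size=4, replace=False))
```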

(Refer Slide Time: 08:08)

Then we go to stratified sampling: the population is divided into non-overlapping subpopulations called strata, and a random sample is selected from each stratum; this has potential for reducing sampling error. We can go for proportionate stratified sampling, where the percentage of the sample taken from each stratum is proportional to the percentage that the stratum forms within the population. We can also go for disproportionate stratified sampling, where the proportions of the strata within the sample differ from the proportions of the strata within the population.
(Refer Slide Time: 08:41)

For example, consider a stratified random sample of a population of FM radio listeners. The whole population is divided into 3 strata by age: 20 to 30, 30 to 40, and 40 to 50. Each stratum is homogeneous within itself; between the strata there may be differences, maybe different variances, but within a stratum the data show similar behavior. Why does it reduce sampling error? If you choose members from the 20-to-30 stratum, all of them have similar characteristics.

If you choose some members from 40 to 50, those sampled also have similar characteristics. So between the strata it is heterogeneous; within a stratum it is homogeneous.
(Refer Slide Time: 09:39)

The next method is systematic sampling, which is convenient and relatively easy to administer. The population elements are ordered in sequence, the first sample element is selected randomly from the first k population elements, and thereafter the sample elements are selected at a constant interval k from the ordered sequence of the frame. Here k is the population size divided by the sample size; k represents the size of the selection interval. We will see an example.
(Refer Slide Time: 10:12)

Suppose the purchase orders for the previous fiscal year are serialized 1 to 10,000, so capital N is 10,000, and a sample of n = 50 purchase orders needs to be selected for an audit. Here k is 10,000 divided by 50, that is, 200; k is the interval. The first sample element is randomly selected from the first 200 purchase orders. Assuming we have chosen the 45th purchase order, from 45 you keep adding 200: 45, 245, 445, 645, and so on.
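A small sketch of systematic selection under these numbers (the random start is illustrative):

```python
import numpy as np

# Systematic sample: N = 10000 purchase orders, n = 50, so k = N // n = 200
N, n = 10_000, 50
k = N // n
rng = np.random.default_rng(7)
start = int(rng.integers(1, k + 1))   # random start within the first k orders
indices = np.arange(start, N + 1, k)  # e.g. 45, 245, 445, 645, ...
print(len(indices), indices[:4])
```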
(Refer Slide Time: 10:12)

Then we go to cluster sampling. Here the population is divided into non-overlapping clusters or areas, and each cluster is a miniature of the population. A subset of clusters is selected randomly for the sample; if the number of elements in the selected clusters is larger than the desired value of n, these clusters may be subdivided to form a new set of clusters and subjected

to a further random selection process. Each cluster behaves like the population, so you may ask about the difference between stratified sampling and cluster sampling.

In stratified sampling, the items within each stratum are homogeneous; in cluster sampling, each cluster is highly heterogeneous and acts like the population itself. For example, the apparel cluster in Ludhiana or the apparel cluster in Tirupur: these are examples of clusters, because each cluster reproduces the population's mix of characteristics, though clusters differ in their variances.
(Refer Slide Time: 12:04)

The advantages of cluster sampling: it is more convenient for geographically dispersed populations; it reduces travel cost to contact the sample elements; and it simplifies the administration of the survey, because the cluster itself acts as a population. When the unavailability of a sampling frame prohibits other random sampling methods, we can go for cluster sampling. The disadvantage: it is statistically less efficient when the cluster elements are similar,

because then the results cannot be generalized well, and the cost and difficulty of statistical analysis are greater than for simple random sampling.
(Refer Slide Time: 12:40)

The next kind of sampling technique is non-random sampling. The first is convenience sampling: the sample is selected based on the convenience of the researcher. Next is judgment sampling: sample elements are selected by the judgment of the researcher. For example, suppose you are administering a questionnaire that can be understood only by a manager; then you have to look only for managers. The researcher judges who should fill in the questionnaire, hence judgment sampling.

Then quota sampling: sample elements are selected until quota controls are satisfied. Suppose in Uttarakhand there are some districts, and in each district I have to collect some samples; I may have a quota, for example how many samples have to be collected in Haridwar district and how many in another district. That is quota sampling. Snowball sampling is very familiar: survey subjects are selected based on referrals from other survey respondents.

Suppose you approach one respondent; after the survey is over you can ask him to refer his friends. That is snowball sampling, a very common method in research.
(Refer Slide Time: 13:56)

Then there are some errors when we go for sampling. Data from non-random samples are not appropriate for analysis by inferential statistical methods; that is a very important drawback, because you cannot generalize when there is no randomness. Sampling error occurs when the sample is not representative of the population; if the sample does not represent the population, then whatever analysis you do becomes futile.

Then there are non-sampling errors: apart from the sampling procedure, sometimes there may be missing data, problems in recording, problems with data entry, or analysis errors. Poorly conceived concepts, unclear definitions, and defective questionnaires also lead to error. Response error occurs when people have not understood the questionnaire.

Suppose there are options like "do not know" or "will not say"; sometimes respondents may overstate their answers. These are the possible errors when you go for sampling. There is one more kind of error, Type I and Type II error, which we will see in the coming classes.
(Refer Slide Time: 15:19)

Now we go to the sampling distribution of the mean; here x-bar represents the sample mean. Proper analysis and interpretation of a sample statistic requires knowledge of its distribution, that is, its sampling distribution. We start from the population, say with population mean mu, select a random sample, and from the sample compute a sample statistic. Note that it is "statistic", not "statistics", with no s: whatever we compute from the sample is called a statistic (the t statistic, the F statistic, x-bar); since we calculate them from the sample, we call each one a statistic.

With the help of the sample mean you can estimate the population mean; this is the process of inferential statistics. What happens is: we assume something about the population (once we assume something about the population, that is generally called a hypothesis), then we take a sample randomly, compute a sample statistic, and with the help of the sample statistic we estimate the population mean or the population variance. In the present context we are estimating the population mean.
(Refer Slide Time: 16:37)

This picture shows inferential statistics: the bigger circle is the population. The population parameter is unknown but can be estimated from sample evidence; the red portion shows the sample statistic. Inferential statistics is making statements about a population by examining sample results.
(Refer Slide Time: 17:04)

See another example of inferential statistics: drawing conclusions or making decisions concerning a population based on sample results. You see the different red items: these 7 form the sample, while the whole collection is the population. Inferential statistics is used for estimation, for example estimating the population mean weight using the sample mean weight: if you want to know the mean weight of the population, it can be estimated with the help of the sample mean weight. Another application of inferential statistics is hypothesis testing.

We can use sample evidence to test the claim that the population mean weight is, for example, 120 pounds or not. We will go into detail about hypothesis testing in coming lectures.
(Refer Slide Time: 17:57)

Now we enter the sampling distribution: a sampling distribution is the distribution of all of the possible values of a statistic, for a given sample size, selected from the population.
(Refer Slide Time: 18:14)

There are several types of sampling distributions: we can construct the sampling distribution of the sample mean, the sampling distribution of the sample proportion, and the sampling distribution of the sample variance.
(Refer Slide Time: 18:30)

First we will see the sampling distribution of the sample mean.


(Refer Slide Time: 18:34)

Suppose there is a population of 4 people, and the random variable x is the age of the individuals. The values of x are 18, 20, 22, and 24; this is the population.
(Refer Slide Time: 18:54)

First we find the population mean: µ = ƩXi / N. Generally, whenever you see a capital letter it refers to the population; lowercase is for the sample. So (18 + 20 + 22 + 24) / 4 = 21, and similarly the population standard deviation is 2.236. There are 4 elements, so the probability of selecting each element, 18, 20, and so on, is 1/4, that is, 0.25 each; this follows a uniform distribution.

If we choose only one observation at a time and plot it, the chance of selecting each person from the population is 0.25.
(Refer Slide Time: 19:49)

Now consider all possible samples of size n = 2, selected with replacement: the first observation may be 18, 20, 22, or 24, and the second observation may be 18, 20, 22, or 24, so the possibilities are 18 18, 18 20, 18 22, 18 24, 20 18, 20 20, 20 22, and so on. There are 16 possible samples; because we are sampling with replacement, pairs like 20 20, 22 22, and 24 24 also appear.

If we find the mean of each sample, the right-side picture shows those means: the mean of 18 18 is 18, the mean of 18 20 is 19, and so on. When you plot these sample means, their distribution starts to look like a normal distribution. Previously, when we took only one observation at a time and plotted it, we got a uniform distribution; when you increase the sample size from 1 to 2, the distribution of the sample means is no longer uniform.
(Refer Slide Time: 21:00)

Now the summary measures of this sampling distribution, where we selected 2 with replacement: going back, there are 16 elements (4 × 4 = 16). The mean, the expected value of x-bar, is (18 + 19 + ... + 24) / 16 = 21, so µ = 21. Then the standard deviation of this sampling distribution is σx̄ = √( Ʃ(x̄i − µ)² / N ): with µ = 21, compute (18 − 21)² + (19 − 21)² + ... + (24 − 21)², divide by 16, and take the square root, which gives 1.58.

Please look back at the population values: the population mean is 21 and the population standard deviation is 2.236. When we select 2 with replacement, the mean of the sampling distribution is still 21, but the standard deviation of the sampling distribution is 1.58.
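These summary measures can be verified by enumerating all 16 samples; a short sketch:

```python
import itertools
import numpy as np

# All 16 samples of size 2, with replacement, from the population {18, 20, 22, 24}
population = [18, 20, 22, 24]
means = [np.mean(s) for s in itertools.product(population, repeat=2)]

print(np.mean(means))  # 21.0, equal to the population mean
print(np.std(means))   # ~1.58, i.e. sigma / sqrt(n) = 2.236 / sqrt(2)
```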
(Refer Slide Time: 22:16)

Next, we could select larger samples at a time and construct the same kind of table we constructed previously; after constructing it, when you find the mean it will again be 21. So we have found the summary measures for the sampling distribution: the mean of the sampling distribution is 21 and the standard deviation of the sampling distribution (for n = 2) is 1.58. Now we compare the population against the sampling distribution.

For the population there are 4 elements; in each sample there are 2 elements. The mean of the population is 21 and the mean of the sampling distribution is also 21, but the standard deviation of the population is 2.236 while the standard deviation of the sampling distribution is 1.58.
(Refer Slide Time: 23:08)

We will go to another example: there is a population that follows an exponential distribution. From this exponential distribution we are going to select 2 at a time, with replacement. When you select two at a time, find the mean of each sample, construct the frequency distribution, and plot it, then for n = 2 we get this kind of distribution. You see that the parent distribution is exponential, but when the sample size is 2, the plot of the sample means already takes a different, less skewed shape.
(Refer Slide Time: 23:50)

If I increase the sample size to 5, the shape changes again, and when n = 30 it looks like a normal distribution. So, whatever the nature of the population, if you select samples from it and plot the sample means, for large enough samples they will follow a normal distribution. As another example, take a population that follows a uniform distribution.
(Refer Slide Time: 24:21)

You select 2 at a time and plot the sample means; they follow a triangular kind of distribution. Increase the sample size to 5 and it approaches a normal distribution; when n = 30 it looks like a normal distribution. Initially it was a uniform distribution, but as the sample size increases, the distribution of the sample means behaves like a normal distribution.
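This behavior, the central limit theorem, is easy to see by simulation; a sketch with an illustrative uniform population:

```python
import numpy as np

# Means of repeated samples from a uniform population look more normal as n grows
rng = np.random.default_rng(0)

for n in (2, 5, 30):
    means = rng.uniform(0, 10, size=(10_000, n)).mean(axis=1)
    print(n, round(means.mean(), 2), round(means.std(), 3))  # std shrinks ~ sigma/sqrt(n)
```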
(Refer Slide Time: 24:43)

So, expected value of sample mean let X1