Data Mining Notes
Lecture - 01
Introduction
Welcome to the course Business Analytics and Data Mining Modeling Using R. This is the very first lecture and we are going to cover the introduction. First we need to understand why we need to study this course. If you follow industry news and related reports, you would see that there is a strong requirement for data scientists, data engineers, business analysts and other relevant positions which require expertise in domains like business analytics, data mining, R and other related areas.
So, we are going to cover these things in this course. So, let us start. So, what is business analytics?
The primary purpose of business analytics is to assist, aid and drive the decision-making activities of a business organization. Now, if you look at Gartner's definition, they define business analytics as comprised of solutions used to build analysis models and simulations to create scenarios, understand realities and predict future states.
The domains which are actually part of business analytics are data mining, predictive analytics, applied analytics and statistics, and generally the solution is delivered in an application format which is suitable for business users. Now, if we look at analytics as such, it can be classified into three categories: descriptive analytics, predictive analytics and prescriptive analytics.
Now, first let us understand descriptive analytics. Descriptive analytics mainly revolves around gathering, organizing, tabulating, presenting and depicting data, and describing the characteristics of what you are studying. Descriptive analytics is mainly about what is happening: we try to answer the question of what is happening in a particular context by looking at the data.
Now, this is also called reporting in managerial lingo, the reason being that we generally look at sales numbers, cost numbers, revenue numbers, etcetera, and different ratios. So, we try to understand how our business is performing, how the company or organization is performing, or how the industry or the country's economy is growing in an overall sense. So, what is happening is covered in descriptive analytics.
This is also the first phase of analytics; this is where you actually start. You try to gather a sense of what is happening and then you look for the other categories of analytics. Now, descriptive analytics is useful for informing us about what is happening, but it does not inform us why the results are happening or what can happen in the future. To answer these questions we need the next phase of analytics, which is predictive.
So, predictive analytics can be defined as using the past to predict the future. Now, the main idea is to identify associations among different variables and predict the likelihood of a phenomenon occurring on the basis of those relationships.
Now, at this point we need to understand the concept of correlation versus causation. Correlation is enough for prediction: if x is correlated with y, that is sufficient for us to predict y from x.
But if we are looking for what we should be doing about it, that is, if we know what is going to happen in the future (which is the part covered by predictive analytics), then what we are going to do about it becomes part of prescriptive analytics. Now, prescriptive analytics is where the cause-and-effect relationship comes into the picture, and the main idea of prescriptive analytics is to suggest a course of action. So, generally, prescriptive analytics is about recommending decisions, and it generally entails mathematical and computational models: you run lots of simulations and optimizations to find out what can be done about a future scenario of the business or the relevant topic.
So, prescriptive analytics is also described as the final phase of analytics. Now, methods from disciplines like statistics, forecasting, data mining and experimental design are used in business analytics. That brings us to the next part, data mining; the core of this particular course is data mining. So, let us understand what data mining is. A brief definition of data mining could be: extracting useful information from large data sets. Now, if you look at how Gartner defines data mining, they define it as the process of discovering meaningful correlations, patterns and trends by sifting through large amounts of data stored in repositories.
(Refer Slide Time: 06:11)
If you look at some of the examples given here, the first one is related to the medical field. So, data mining can actually help us in predicting the response of a drug or a medical treatment on a patient suffering from a serious disease or illness. Another example could be in the security domain, where data mining can help us predict whether an intercepted communication is about a potential terror attack. Another application of data mining could be in the computer or network security field, where it can help us predict whether a packet of network data poses a cyber security threat.
Now, our main interest is in the business domain. So, let us look at some of the examples where data mining can help in a business context. A common business question where data mining can help could be: which customers are most likely to respond to a marketing or promotional offer?
(Refer Slide Time: 07:12)
Another example could be: which customers are most likely to default on a loan? So, banking and financial institutions might be worried about some of their customers defaulting on loans. So, they would like to identify such customers and then take appropriate actions.
Now, another question could be: which customers are most likely to subscribe to a magazine? For a magazine, if they are running an advertising or marketing promotional offer, it would be important for them to understand which customers are likely to subscribe to the kind of content that they are publishing or selling through their magazine. So, these give some flavour of the kind of questions where data mining can actually help us.
(Refer Slide Time: 08:05)
Now, let us look at the origins of data mining. Data mining is mainly an interdisciplinary field of computer science, and it originates from the fields of machine learning and statistics. Now, some researchers have also described data mining as statistics at scale and speed. Some people also describe it as statistics at scale, speed and simplicity, the main reason being that in data mining we generally do not use the concepts of confidence intervals and the logic of inference; rather, we rely on partitioning and using different samples to test our models. So, that makes the whole process simpler, and that is where the simplicity comes from.
Now, let us compare the classical statistical setting and the data mining paradigm. The classical statistical setting is mainly characterized by data scarcity and computational difficulty.
So, generally in a statistical setting you are dealing with a statistical question where you are looking for primary data, and that is of course costly to collect. So, you are always going to face a situation where data is not enough or it is very difficult to get the data. Statistics is also an old discipline; at the time statistical studies originated, researchers used to face a lot of computational difficulty, because most of the mathematical computations had to be performed manually. So, that was the time when the statistical setting originated, and therefore the classical statistical setting generally faces the data scarcity problem and the computational problem. When we look at the data mining paradigm, it is a relatively newer area, and it has mainly developed or evolved because of the availability of large data sets and ever-improving computing power. So, in the data mining paradigm we are not faced with the problem of scarce data sets or with computational problems.
Now, let us look at another point. In the classical statistical setting, the same sample is used to compute an estimate and to check its reliability, while in the data mining paradigm, because we do not have any problem in terms of data points or data sets, we can fit a model with one sample and then evaluate the performance of the model with another sample. So, that is another difference. The third one is the logic of inference. We need confidence intervals and hypothesis tests because, when we use the same sample to compute an estimate and then check its reliability, we need to eliminate the possibility that a pattern or relationship we see is there because of chance. So, chance variation has to be ruled out. Therefore, the logic of inference is important in the statistical setting, and much stricter conditions are actually placed on statistical modeling.
When we look at data mining modeling, we are not faced with the problems of inference and related issues, the reason being that different samples are used. Therefore, the reliability or robustness of the model is automatically taken care of, because the model is built on one sample and it is evaluated on a different sample, a different partition. If we look at machine learning techniques such as trees and neural networks, they are also less structured but more computationally intensive in comparison to statistical techniques. So, they might require more running time, more computation time, and they are less structured, while statistical techniques such as regression, logistic regression and discriminant analysis are highly structured techniques.
Now, another important aspect of the emergence of this field of business analytics, data mining, big data and related fields is the rapid growth of data.
Our economy has also been growing, and there has been growth in internet infrastructure; that has led to many people using the internet and digital technologies, which in turn has led to growth of data. Automatic data capture mechanisms, for example bar codes, POS devices, click-stream data and GPS data, have also led to the rapid growth of data.
Now, consider operational databases: whenever you visit a retail store, whatever items you buy can be considered transactions between the business and the customer, and all those transactions are recorded in the operational database. Now, for data analytics or business analytics purposes, these operational databases have to be brought into a data warehouse, because you cannot actually do any kind of meaningful analysis on operational databases. Therefore, the data has to be brought into data warehouses and data marts, and from there the data can then be sampled out for the analysis.
Now, another reason for the rapid growth of data is the constantly declining cost of data storage and improving processing capabilities, so that even smaller organizations can nowadays invest in the related IT infrastructure and develop analytical capabilities to improve their business. Now, the core of this course focuses on predictive analytics; mainly we focus on three tasks: prediction, classification and association rules.
Prediction is mainly when we are trying to predict the value of a variable; we will discuss the different terminology and concepts in detail later in this lecture. Classification is a task wherein we are trying to predict the type or class of a particular variable. Association rules is where we are trying to find out the associations between different items in transactions.
Now, another important thing related to the data mining process is that we generally try several different methods for a particular goal, and then a useful method is finally selected and used in the production systems.
(Refer Slide Time: 15:46)
Now, how do we define the usefulness of a method? The methods that we select to perform a particular task have, of course, to be relevant with respect to the goal of the analysis. The underlying assumptions of the method also have to hold; they should meet the requirements of the goal and the problem. Then there is the size of the data set: different methods and algorithms are going to impose their own restrictions on the number of variables and the number of records that can be used for analysis. So, the size of the data set is also going to determine the usefulness of a method.
Then there are the types of patterns in the data set. Different methods or algorithms are suitable for finding out or understanding different types of patterns. So, the type of pattern in a data set is also going to determine the usefulness of a method with respect to a particular goal.
(Refer Slide Time: 17:24)
Now, let us understand the data set format that is typically used in data mining and business analytics. Generally, the data that we use is in a tabular or matrix format: variables are generally in columns and observations are in rows.
Now, another important thing is that each row represents a household; this is the unit of analysis. For example, in this sedan car data set, the unit of analysis is the household. So, all the variables that are there in the data set are about this particular household, the household being the unit of analysis.
Now, the statistical and data mining software that we are going to use in this course is R and RStudio. So, what is R? R is a programming language and software environment for statistical computing and graphics, widely used by statisticians across the world and also by data miners. There are many packages available for different kinds of functionality, related to statistical techniques and also to data mining techniques.
RStudio is the most commonly used integrated development environment for R. It might be difficult for some users to directly start working with R because they might not be very comfortable with R's interface; RStudio bridges this gap and provides a much better interface to perform your data mining modeling or statistical modeling using R.
(Refer Slide Time: 19:04)
So, let us look at this sedan car example. This is the RStudio environment. Here you have four parts: this first part is for the R script, where the code is actually written, and from here we run the commands that are given in the script. Then in this part you have the environment, where all your data and variables are loaded; if a particular data set or variable is loaded into RStudio, it would be shown here. In this part you have the plots and the help pages, which we will look at in detail later.
(Refer Slide Time: 20:36)
So, this is the data that we were talking about. You can see that this is in the tabular
format or matrix format.
So, you have these three variables. Annual income and household area are your predictor variables (we will discuss the terminology later in the lecture), and the outcome variable is ownership. So, based on these two variables, we want to predict the class of a household: whether they own a sedan car or not. If we look at the whole data set, we have 20 observations and each observation is about a household; it records their annual income in lakhs of rupees and the household area in hundreds of square feet.
Now, let us go back. If we run the command head(df), we are going to get the first 6 observations from the data set. So, if you do not want to go and actually look at the full data set, which might be a large file, you might run this command head(df) and then look at the first 6 observations of the data set.
The summary command gives you basic statistics about the data set variables; for example, for annual income we can see the min, max, mean and other statistics, and similarly for household area and ownership. You can see from the ownership data that it is a categorical variable; we will discuss later in this lecture what a categorical variable is. We can see 10 observations belong to the non-owner category and 10 observations belong to the owner class. We can also look at the other things.
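As a quick illustration of these two commands, here is a minimal sketch in R; the data frame df below is a small assumed example shaped like the sedan car data, not the actual file used in the lecture.

# a small assumed data frame shaped like the sedan car data set
df <- data.frame(
  Annual_Income  = c(5.2, 6.0, 7.6, 8.1, 9.5, 4.8),   # in lakhs of rupees
  Household_Area = c(18, 20, 22, 25, 27, 16),          # in hundreds of square feet
  Ownership      = factor(c("Non-owner", "Non-owner", "Owner",
                            "Owner", "Owner", "Non-owner"))
)
head(df)      # first 6 observations of the data set
summary(df)   # min, max, mean, quartiles for numeric variables; class counts for factors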
Now, let us plot a graph between household area and annual income, with annual income on the x-axis and household area on the y-axis. If you look at this plot, the observations belonging to the non-owner class are mainly in one part, and the observations belonging to the owner class are mainly in the other half. So, our goal is to classify households by their ownership of a sedan car.
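A hedged sketch of such a plot in R, reusing the assumed data frame df from the earlier snippet, could look like this:

plot(df$Annual_Income, df$Household_Area,
     col  = ifelse(df$Ownership == "Owner", "blue", "red"),   # colour points by class
     pch  = 19,
     xlab = "Annual income (lakhs of rupees)",
     ylab = "Household area (hundreds of square feet)",
     main = "Sedan car ownership")
legend("topleft", legend = c("Owner", "Non-owner"),
       col = c("blue", "red"), pch = 19)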
Now, as we talked about, different methods could be tried out in data mining modeling and then the best one is generally selected. One method in this case could be a set of horizontal and vertical lines. So, we can look at this data and create a set of horizontal and vertical lines; this could be a hypothetical method one, which could then be used to classify these observations. Another method could be a single diagonal line. So, we could draw a single diagonal line somewhere here and it could also be used to classify these observations.
Now, if you look at this data, if you are able to draw a line somewhere here, most of the observations belonging to the owner class will be in the upper rectangular region, and most of the observations belonging to the non-owner class will be in the lower rectangular region. So, let us do that.
If we look at these two points, they are at 7.6 and 6. Looking at these two points, if a horizontal line can be drawn here, then two partitions, two rectangular regions, can actually be created.
If you similarly keep on drawing horizontal and vertical lines to keep separating these observations, you can see that in this region only 3 observations belong to the non-owner class and the others are owner observations, and in the lower rectangular region most of the observations belong to the non-owner class and only 3 belong to the owner class.
(Refer Slide Time: 25:48)
Generally we can keep on creating similar lines to classify further. So, finally, we will end up with a graphic where each rectangular region is homogeneous; that means it contains observations belonging to only one class, either owner or non-owner. So, this set of horizontal and vertical lines could be a method, a model, to classify these observations.
Similarly, as we talked about, method two could be about finding a single diagonal line to separate these observations. If we look at the plot again, a diagonal line could go from somewhere here to somewhere there, and then we will have homogeneous partitions, homogeneous regions. So, again, a similar process can be adopted to find that particular line; you would see a line being drawn there, and we can extend this line and create partitions. This could be another method, method two, to actually classify the observations.
(Refer Slide Time: 27:09)
Now, as we discussed, we have these two methods: method one, a set of horizontal and vertical lines, and method two, a single diagonal line. Now we need to find out which is the most useful method, which is the best method; that can be done using different assessment metrics that we are going to cover in later lectures.
Now, let us look at the key terms related to this course. As we go through other lectures we will come across many more terms and we will discuss them as needed; for now we need to discuss the key terms at this point.
So, the first one is algorithm. We have been using this particular term quite often. An algorithm can be defined as a specific sequence of actions or set of rules that has to be followed to perform a task. Algorithms are generally used to implement data mining techniques like trees and neural networks; for example, for neural networks we use the back propagation algorithm, which is part of the neural network technique. So, there are a number of algorithms that are actually required to implement these techniques.
The next term is model. So, what do we mean by model? Here, by model we mean a data mining model. How can we define a model in a data mining context? A data mining model is an application of a data mining technique to a data set. So, when we apply some of these techniques, like trees and neural networks, which we are going to cover in later lectures, to a data set, then we get a model.
Our next term is variable. A variable can be defined as the operationalized way of representing a characteristic of an object, event or phenomenon. A variable can take different values in different situations. Now, there are generally two types of variables that we are going to deal with here. One is the input variable, also sometimes called the independent variable, feature, field, attribute or predictor. Essentially, an input variable is an input to the model.
(Refer Slide Time: 29:37)
The other type of variable is the output variable; other names for this variable are outcome variable, dependent variable, target variable or response. So, the output variable is the output of the model.
Another term that we come across is record, observation, case or row. As we talked about for the tabular or matrix data set, each row represents a record, an observation or a case. How do we define it? An observation is the unit of analysis on which the variable measurements in the columns are taken, such as a customer, a household, an organization or an industry. For example, in our sedan car case, the unit of analysis was the household, and we had variables like annual income and household area, which actually measure something related to the household. Similarly, a customer, organization or industry could also be the unit of analysis, with the variables measured on these.
(Refer Slide Time: 30:53)
Now, let us talk about variables in detail. The two types we talked about earlier, the input variable and the output or outcome variable, were in the modeling sense; here we are talking more in the data sense. Two types of variables are used: categorical and continuous. Categorical variables can be further classified as nominal and ordinal, and continuous variables can be further classified into two categories, interval and ratio variables. So, let us understand these variables.
Now, before we go into the details of what these 4 types of variable mean, why do we need to understand the types of variables in a data set? As we discussed before, in a data mining process we generally use many methods and then select the most useful one or the best one. Therefore, it is important for us to identify an appropriate statistical or data mining technique, and the understanding of variable types is part of this process.
Now, proper interpretation of the data analysis results also depends on the kind of technique that you are using and the kind of data that was actually analyzed. Another important thing is that data of these variable types are either quantitative or qualitative in nature. Quantitative data measure numeric values and are expressed as numbers; qualitative data measure types and are expressed as labels or numeric codes.
Now, if we look at these 4 types, nominal, ordinal, interval and ratio, the structure of these variable types increases from nominal to ratio in a hierarchical fashion. So, nominal is the least structured variable type, followed by ordinal, followed by interval, and then ratio is the most structured variable type. Now, let us understand nominal variables.
Nominal values indicate distinct types, for example gender. Gender values could be male or female, so they indicate distinct types. Similarly for nationality: the values could be Indian, Pakistani, Bangladeshi, Sri Lankan and so on, and all these values indicate distinct types. Similarly, religion could be another nominal variable, so the values will again indicate distinct types, for example Hindu, Muslim or Christian. Similarly, pin code could be another example of a nominal variable, where each pin code actually indicates a distinct location. Employee ID could be a nominal variable because each employee ID would actually indicate a different person.
Now, the two operations equal to and not equal to are supported. Because these are distinct types, you cannot say male is greater than female; greater than, less than, multiplication, division, all these operations are not supported. Only equal to and not equal to are supported.
Let us look at ordinal variables. Here the values indicate a natural order or sequence, for example academic grades: you have grades like A, B, C, D, E, F. All these grade labels indicate an order, but essentially they are still distinct from each other.
Similarly, Likert scales and the quality of a food item are also examples of ordinal variables. You come across Likert scales whenever you are trying to fill in a survey questionnaire: when reaching respondents and trying to get responses, you get the responses on a scale from strongly disagree to strongly agree (or strongly agree to strongly disagree). So, all those points are actually part of a Likert scale and can be defined as an ordinal variable.
Similarly, the quality of a food item could be good, it could be average, it could be poor, so that can also be an ordinal variable. Now, 4 additional operations are supported, because the values indicate a natural order. Therefore, the less than, less than or equal to, greater than, and greater than or equal to operations are also supported.
If we look at the continuous variable types, the first one is the interval variable. Here, apart from what we discussed for nominal and ordinal, the difference between two values is also meaningful. Another important thing about interval variables is that the values may be in reference to a somewhat arbitrary zero point, for example Celsius temperature or Fahrenheit temperature. Celsius temperature is generally used in India, where we talk about the temperature being 35 degrees Celsius, 40 degrees Celsius and so on. Now, when we talk about 0 degrees Celsius, it is not actually an absolute zero point; rather, it is what can be called an arbitrary zero point. Similarly, location-specific variables, for example distance from landmarks or geographical coordinates, are all examples of interval variables.
Next, for interval variables, two additional operations are supported apart from what we discussed for nominal and ordinal: plus and minus, that is addition and subtraction. Now, the last variable type is the ratio variable. In ratio variables, the ratio of two values is also meaningful, and the values are in reference to an absolute zero point, for example Kelvin temperature. On the Kelvin scale you have 0 Kelvin, and when we say 0 Kelvin we actually mean zero temperature, a zero that is absolute in the real sense. When we talked about Celsius or Fahrenheit temperature, 0 degrees centigrade or 0 degrees Fahrenheit do not actually mean an absolute zero, a real zero in the absolute sense.
Length, weight, height and income are also examples of ratio variables. Apart from the operations that we discussed for nominal, ordinal and interval, two additional operations, division and multiplication, are supported for ratio variables.
Now, another important thing related to variable types is conversion from one variable type to another. As we discussed, these variable types have a hierarchy of structure. Therefore, a higher-structure variable type can always be converted into a lower-structure variable type, but we cannot convert a lower-structure variable into a higher-structure variable. For example, a ratio variable such as age can be converted into an ordinal variable such as age group.
Age group could have values like young, adult, middle age and old, while age itself could be a specific number like 20, 21, 25, 40 or 45, which makes it a ratio variable. Now, based on these actual numbers, you can convert this variable into an ordinal variable where you say that less than 20 is young, 20 to 40 is adult, and then later on middle age and then old age. So, that kind of grouping can actually be done. Therefore, a higher-structure variable can always be converted into a lower-structure variable type.
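For example, a minimal sketch of this conversion in R, with cut points assumed only for illustration, would be:

age <- c(18, 20, 21, 25, 40, 45, 62, 70)             # a ratio variable
age_group <- cut(age,
                 breaks = c(0, 20, 40, 60, Inf),      # assumed group boundaries
                 labels = c("young", "adult", "middle age", "old"),
                 ordered_result = TRUE)               # ordered factor, i.e. an ordinal variable
age_group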
Now, let us discuss the road map of this particular course. Module one, which we started in this lecture, is a general overview of data mining and its components; it covers the introductory part and the data mining process. Then module two is about data preparation and exploration; here we talk about the different steps that are required to prepare the data, to explore the data, to visualize the data, and other techniques like dimension reduction etcetera.
Now, the third module is about performance metrics and assessment, where we will try to understand the metrics that are actually used to assess the performance of different models for tasks like classification and prediction. The next module, the fourth module, is about supervised learning methods; there we are going to cover data mining and statistical techniques like regression and logistic regression, neural networks and trees. Those techniques and many others are going to be covered in that particular module. Module five is about unsupervised learning methods; there we are going to mainly cover clustering and association rules mining. Then the next module is time series forecasting; there we are going to cover time series handling, regression-based forecasting and smoothing methods. Finally, in the last module we will have some final discussion and concluding remarks.
Now, apart from this road map we will also have two supplementary lectures, one on introduction to R and the other one on basic statistical methods. It is highly recommended that you go through these two lectures before proceeding further to the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 02
Data Mining Process
Welcome to lecture 2 of the course Business Analytics and Data Mining Modeling Using R. In this particular lecture we are going to cover the data mining process.
So, there are different phases in a typical data mining effort. The first phase is named discovery. The important activities that are part of this phase revolve around framing the business problem. First we need to understand the needs of the organization and the issues they are facing, and the kind of resources that are available: the kind of team that is available, the kind of data repositories and other resources that are available. We try to understand the main business problem and then try to identify the analytics component, the analytics challenge, that is part of that particular business problem. Then we start to develop our understanding of the relevant phenomena, the relevant concepts, constructs and variables, and we try to develop or formulate our initial hypotheses, which are going to be later on converted into a data mining problem.
So, these are some of the activities that we generally have to do in discovery, the first phase. The next phase is data preparation. In this phase we have already understood the problem, especially the analytics problem that we have to deal with; therefore, based on our initial hypothesis formulation, we will also have an understanding of the kind of variables that would be required to perform the analysis. So, we can look for the relevant data from internal and external sources, and then we can compile the data set that can be used for the analytics.
Another activity that can actually be performed at this stage is data consistency checks. Data can come from a variety of sources; therefore, we need to check the definitions of fields, whether they are consistent or not, we need to look at the units of measurement, and we also need to check the data format, whether that is consistent or not. For example, if you have a variable gender in your data set, and in one particular source it is recorded in full form as male and female, while from another source it is recorded as capital M or capital F, then you have to make sure that the data is consistent when the whole data set is compiled; this consistency has to be checked.
Then there are time periods. The data could belong to a particular number of years, so we also need to check that, because the analysis or results could actually be limited by the time period as well. Then there is sampling: you do not always need all the records that we prepare in our data set; normally we take a sample of it, because a smaller sample is generally good enough to build accurate models.
Now, the next phase in a typical data mining effort is about data exploration and conditioning.
(Refer Slide Time: 04:05)
So, in this phase we generally do activities like missing data handling. We also check for range reasonability, whether the ranges of different variables are as per expectations or not. We also look for outliers; they can come due to some human error or measurement error etcetera. Graphical or visual analysis is another activity: we generally plot a number of graphics, for example histograms and scatter plots, which we are going to discuss in a later lecture. Other activities that we can do are transformation of variables, creation of new variables and normalization.
Why we might need to do some of these activities will become clearer as we go further in the coming lectures. As for transformation and creation of new variables, sometimes we might require, for example, that sales figures be expressed in a more-than-10-million or less-than-10-million kind of format, and then we might have to transform our variable. Regarding creation of new variables, if we do some of these transformations we will end up with some new variables, and sometimes we might use both forms in our analysis. As for normalization, sometimes the scale of a variable could pose a problem for a particular data mining algorithm or statistical technique; therefore, normalization might be required. So, these are some of the activities that we have to do.
Training is generally the largest partition, and all the models are built and evaluated on the training data set; then fine tuning or selection of a particular algorithm or technique happens on the validation data set, and the test data set is used for a re-evaluation of the final method.
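A minimal sketch of such a three-way partition in R, assuming a data frame df and an illustrative 50-30-20 split (the lecture does not prescribe exact proportions), is shown below.

set.seed(123)                       # for reproducibility
n   <- nrow(df)
idx <- sample(1:n)                  # shuffle the row indices
train_df <- df[idx[1:round(0.5 * n)], ]                      # models are fitted here
valid_df <- df[idx[(round(0.5 * n) + 1):round(0.8 * n)], ]   # models are compared / tuned here
test_df  <- df[idx[(round(0.8 * n) + 1):n], ]                # final re-evaluation of the chosen model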
So, in this particular phase we need to determine the task that we have to perform, whether it is a prediction task or a classification task. We also need to select the appropriate method for that particular task; it could be regression, neural networks, clustering, discriminant analysis or many other methods that we are going to discuss in coming lectures.
Now, the next phase is model building: building different candidate models, using the techniques selected in the previous steps and their variants. They are run using the training data, then we refine and select the final model using the validation data, and then the evaluation of the final model is done on the test data. So, in this phase we mainly try out different candidate models, assess their performance, fine tune them and finally select a particular model.
(Refer Slide Time: 07:24)
Now, the next phase is results interpretation. Once model evaluation happens using performance metrics, we can go on and interpret the results. One part is exploratory and another could be the prediction part; all that interpretation can actually be done in this particular phase. Then comes model deployment: once we are satisfied with the model that we have, we can run a pilot project so that we are able to integrate and run the model on operational systems. So, that is the last phase.
Now, this is the typical data mining process that one has to follow. Similar data mining methodologies were developed by SAS and for IBM Modeler, which was previously known as SPSS Clementine. These are commercial software packages for statistical modeling and data mining modeling, and they also follow a similar kind of data mining methodology. The SAS methodology is called SEMMA, and the IBM Modeler methodology is called CRISP-DM.
(Refer Slide Time: 08:29)
Now, at this point we need to understand an important classification related to data mining techniques. Data mining techniques can typically be divided into supervised learning methods and unsupervised learning methods.
Now, what is supervised learning? In supervised learning, algorithms are used to learn a function f that can map input variables, generally denoted by X, into output variables, generally denoted by Y. So, algorithms are used to learn a mapping function, a function which can actually map input variables X into output variables Y; you can also write this as Y = f(X). Now, the main idea behind this is that we want to approximate the mapping function f such that new data on the input variables can actually be used to predict the output variable Y with minimum possible error. So, that is the main idea. The whole model development that we do, the learning of this function, is actually performed so that on new data we are able to predict the outcome variable with minimum possible error.
(Refer Slide Time: 09:48)
Now, supervised learning problems can be further grouped into prediction and classification problems, which we have discussed before. Next is unsupervised learning. In unsupervised learning, algorithms are used to learn the underlying structure or patterns hidden in the data. Unsupervised learning problems can be grouped into clustering and association rule learning problems. Now, there are some other key concepts that we need to discuss before we move further.
Some of these concepts are related to sampling. We do not need to go into detail on all the sampling-related concepts; we will only need what is mainly applicable and relevant to a data mining process. First we need to understand the target population. How do we define a target population? The target population is the subset of the population under study; for example, in our sedan car example in the previous lecture, it is the households that were the target population. We wanted to study households and whether they own a sedan car or not. Whenever we study a particular target population, it is generally understood that the results are going to be generalized to the same target population. Now, when we do an analysis we do not gather all the data coming from the target population; we take a sample of it. The reasons are, as we have discussed or indicated before, cost related: we cannot actually go about collecting data from the whole population because it is going to be a very costly process.
Therefore, the purpose of business analytics, data mining modeling and other related disciplines is to reduce this cost and still obtain some useful insights or solve some of the analytics problems. That is why we require a sample. We generally take a subset of the target population and then analyze the data that we have in that sample. Within our data mining scope we might be mainly limited to simple random sampling, which is a sampling method wherein each observation has an equal chance of being selected.
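A minimal sketch of simple random sampling in R, assuming a large data frame df and an illustrative sample size of 1000, would be:

set.seed(42)
sample_idx <- sample(nrow(df), size = 1000)   # each row has an equal chance of selection
df_sample  <- df[sample_idx, ]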
Now, what is random sampling? It is a sampling method wherein each observation does not necessarily have an equal chance of being selected. Generally, simple random sampling is used, where each observation has an equal chance of being selected. The problem with simple random sampling could be that sometimes the sampled observations are not a proper representation of the target population, precisely because of this equal probability of each observation being selected; this could be one problem.
Now, whenever we do sampling, it is going to result in fewer observations than the total number of observations present in the data set. There are some other issues that could further bring down the number of observations or variables in your sample. For example, data mining algorithms: different data mining algorithms could have varying limitations on the number of observations or variables. They might not be able to handle more than a certain number of observations or more than a certain number of variables, so that could be one limitation that could again limit your sample size.
Now, limitations can also be due to computing power and storage capacity. The available computing power and storage that you have for your analysis can either lower the speed or limit the number of observations or variables that can be handled. Similarly, limitations can also be due to statistical software: different software packages have different versions, and generally, if you are using a free version of a commercial software package, that can actually limit the number of variables or observations that can be studied. So, those limitations can further bring down your sample size.
Now, while we are discussing limitations related to the number of observations, we need to understand how many observations are actually required to build accurate models, because the whole idea is to build accurate models, good models, so that we have good enough results which can be used on production systems later on. Especially since we understand that we cannot get the data from the whole target population and have to take a sample, cost is an important factor. Therefore, it is always better for us to understand the number of observations, the sample size, that would be sufficient for us to build accurate, robust models.
(Refer Slide Time: 16:32)
Now, there are many other concepts related to the data mining process that we need to understand at this point; we will come back to sample size again. The next concept is the rare event. Typically, when you are dealing with a particular data set, if it is a classification problem you have an outcome variable such as the ownership variable that we talked about: whether a household owns a car or not, owner or non-owner.
Typically, the split between observations belonging to the owner class and observations belonging to the non-owner class would be 60-40 or around 50-50, that kind of ratio. But it might so happen that, out of let us say 1000 households, only 10 or 20 households actually own a sedan car; in that case the ownership of a sedan car could actually be a rare event in that particular target population. So, in that case, how do we do our modeling?
Another example would be the low response rate in advertising by traditional mail or email. You might be receiving different promotional or marketing advertising offers, coming to you through traditional mail, post, and email as well. Not everyone is going to respond to these offers, so again this can be a rare event. So, how do we do our modeling? What are the issues that we face in this kind of situation? If you have a particular class which has very few observations belonging to it, then any kind of modeling that you do using that particular data set might not give you a good enough model; it might become very difficult for you to build a model that will satisfy your main analytics goal.
For example, if you have 100 observations in your data set, 95 belong to the non-owner class and 5 belong to the owner class, and the main objective of your business problem is to identify people who are owners, then even if you do not build a model and simply classify every household as a non-owner, you will still get 95 percent accuracy. So, in this case, modeling for the success cases becomes an issue. How do we solve this problem? We do oversampling, that is oversampling of success cases. That means we duplicate many of the data points which belong to the success class, or we change the ratio: we take the observations belonging to the owner class and the observations belonging to the non-owner class in a 50-50 or similar ratio.
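As a hedged sketch, oversampling of the rare owner class in R (the variable and level names are assumed to match the earlier sedan car example) could be done like this:

set.seed(1)
owners     <- df[df$Ownership == "Owner", ]       # rare success cases
non_owners <- df[df$Ownership == "Non-owner", ]   # abundant failure cases
# duplicate success cases (sampling with replacement) up to the size of the failure class
owners_over <- owners[sample(nrow(owners), nrow(non_owners), replace = TRUE), ]
df_balanced <- rbind(non_owners, owners_over)     # roughly a 50-50 class ratio now
table(df_balanced$Ownership)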
This kind of problem arises mainly in classification tasks. Another concept related to this problem is the cost of misclassification. When we say that the success class is more important for us, we are dealing with asymmetric costs, because identifying the success class matters more; it is more important for us to understand which customer is going to respond to our email, promotional or marketing offer. So, we are essentially dealing with asymmetric costs.
Generally, the cost of failing to identify success cases is going to be more than the cost of a detailed review of all cases, the reason being that if you are able to identify a particular customer as one who is going to respond to your offer, then the profit that you can earn by selling the product or service to that customer would actually be more than the cost of a detailed review of cases.
Therefore, generally the benefit of identifying success cases is higher, and that is why this kind of modeling is at a premium. Now, another important aspect of rare events is that, when the success class is important for you, prediction of success cases is always going to come at the cost of misclassifying some failure cases as success cases. If out of 100 observations you have only 5 success cases, you would like to identify all those 5 so that you are able to make a profit from your offerings. Therefore, the model that you build might end up identifying many failure cases also as success cases. So, the error is going to be higher than usual: if you had a 50-50 split of success cases and failure cases, your error could be on the lower side, but in this case the error would actually increase. The purpose, however, is to identify success cases to increase the profit.
Now, another important aspect of the data mining process is dummy coding for categorical variables.
There could be some statistical software which cannot use categorical variables directly in label format. For example, if you have a variable like gender, where male and female, or M and F, are the labels in the data set, many statistical software packages might not be able to use this particular variable directly. Therefore, dummy coding might be required for these variables.
When we say dummy coding, we actually create dummy binary variables. These take the values 0 and 1, 0 indicating the absence of a particular type and 1 indicating the presence of a particular type. For example, suppose we have data on the activity status of individuals, and it has 4 mutually exclusive and jointly exhaustive classes: student, unemployed, employed and retired. In that case we can create different dummy variables: when a particular observation's activity status is student, the student dummy will have the value 1, and if it does not, it will have the value 0; similarly for the other observations and the other classes.
Now, since these classes are jointly exhaustive, we do not need to create all four dummy variables. There are 4 types, but we do not need 4 dummy variables, because, them being jointly exhaustive, once we know 3 of them the fourth one is already known. Therefore, we need to create only three dummy variables.
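A minimal sketch of this dummy coding in R for the assumed activity-status example follows; model.matrix() is shown as an alternative that drops one reference level automatically.

activity <- factor(c("student", "employed", "unemployed", "retired", "employed"))
d_student    <- as.numeric(activity == "student")      # 1 if student, 0 otherwise
d_unemployed <- as.numeric(activity == "unemployed")
d_employed   <- as.numeric(activity == "employed")
# "retired" is the implied reference class: all three dummies are 0 for it
model.matrix(~ activity)[, -1]    # alternative: automatic dummies, minus one level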
Now, another problem with a larger number of variables is that it is going to increase your sample size requirement, because you always want to compute reliable estimates, and if you have more variables then your sample size requirement will increase to achieve that higher reliability.
Now, another related problem is overfitting. What is overfitting? Overfitting generally arises when a model is built using a complex function that fits the data perfectly. If your model is fitting the data perfectly, then probably it is overfitting the data. What actually happens in overfitting is that your model ends up fitting the noise and explaining chance variation. Ideally, you would want to avoid explaining chance variation, because you are looking to understand the relationships which can then be used to predict future values. Therefore, overfitting is something that is not desirable. Overfitting can also arise due to a larger number of iterations: more iterations can result in excessive learning of the data, which can also lead to overfitting. If you have more variables in the model, some of those variables might have a spurious relationship with your outcome variable; that can also lead to overfitting.
Now, the next concept is related to sample size. How many observations should actually be good enough to build an accurate model? Domain knowledge is important here for the sample size choice, because as you do more and more modeling, more and more analytics, you will be able to understand different phenomena, constructs, concepts and variables, and you will have better hunches, better rules of thumb, to judge how many observations will actually be needed to build a model for a particular analytics problem. So, domain knowledge is always going to be the crucial part.
We also have rules of thumb. For example, if you have p predictors, then 10 times p, that is 10 observations per predictor, can actually be a good enough rule of thumb to determine the sample size. Similarly, for classification tasks many researchers have suggested rules of thumb, for example 6 times m times p observations, where m is the number of classes of the outcome variable and p is the number of predictors. So, that can actually help you determine the sample size.
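A small illustration of these two rules of thumb in R, with an assumed number of predictors and classes, is given below.

p <- 8        # assumed number of predictors
m <- 2        # assumed number of classes of the outcome variable
10 * p        # prediction task: roughly 80 observations needed
6 * m * p     # classification task: roughly 96 observations needed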
Now, what is an outlier? An outlier can briefly be defined as a distant data point. The important thing for us to understand is whether this distant data point is a valid point or an erroneous point. If a particular data point is distant from the majority of the values, or very distant from the mean, more than 3 standard deviations away from the mean, then it could be due to human error or measurement error. So, we need to find out whether a particular data point is there because of a human error or measurement error; for example, a value of 100 or 150 for a room temperature, or for the temperature in a city, could be a human error or a measurement error. Sometimes there would be errors due to decimal points and related typing errors and so on. So, we need to identify whether an outlier is a valid point or an erroneous value.
So, how do we do that? Generally you can do some manual inspection: you can sort your values and find out if anything looks out of place. You can also look at the minimum and maximum values and from there try to identify whether they are outliers, whether they are errors. Clustering can also help: by clustering you would be able to see whether a particular point is an outlier or not. Domain knowledge is also going to help with this.
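A hedged sketch of such screening in R on an assumed temperature variable (the value 150 is planted as a suspicious entry) would look like this:

set.seed(7)
temperature <- c(round(rnorm(30, mean = 25, sd = 1), 1), 150)   # 150 looks out of place
sort(temperature)        # manual inspection: anything out of place?
range(temperature)       # minimum and maximum values
z <- (temperature - mean(temperature)) / sd(temperature)
temperature[abs(z) > 3]  # values more than 3 standard deviations from the mean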
Now, the next related concept is missing values. You might come across a data set which is very important for your particular business analytics problem, but a few records have missing values. If those records are few in number, then probably you can remove those records and go ahead with your analysis; but if the number of such records is large, then dropping them can end up eliminating most of your observations. Therefore, you need to handle those missing values. Imputation is one way: you impute those missing values with the average value of that particular variable, so that could be one solution. If that is also not desirable, then another option, when you have many missing values, is to identify the variables where the missing values are. If those variables are not very important for your analysis, then probably you can think about dropping them. If those variables are important for your analysis, then probably you can look for a proxy variable which has fewer missing values and replace that particular variable with the appropriate proxy variable.
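A minimal sketch of these options in R, using an assumed income vector with missing values, is shown below.

income <- c(5.2, 6.0, NA, 8.1, NA, 4.8)
# option 1: drop records with missing values (fine when they are few)
income_complete <- income[!is.na(income)]
# option 2: impute missing values with the mean of the observed values
income_imputed <- income
income_imputed[is.na(income_imputed)] <- mean(income, na.rm = TRUE)
income_imputed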
(Refer Slide Time: 31:15)
So, in many scenarios normalization is a desired step in the data mining process. There are two popular ways of normalization. One is standardization using the z-score, where you subtract the mean from each value and divide by the standard deviation. The other is min-max normalization, where you subtract the minimum value from each value and then divide by the difference of the max and the min.
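A minimal sketch of both approaches in R, on an assumed numeric vector x, is given below.

x <- c(10, 20, 30, 40, 50)
z  <- (x - mean(x)) / sd(x)              # z-score standardization (equivalently scale(x))
mm <- (x - min(x)) / (max(x) - min(x))   # min-max normalization to the [0, 1] range
z
mm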
(Refer Slide Time: 32:35)
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 03
Introduction to R
So, in this lecture we are going to cover a basic introduction; we are going to cover the basics of R.
So, before we start let us understand the installation steps, this is a specifically
for windows pc or laptops that is the expectation that many of the students they
would be having windows pc or laptop and these instructions are for the same.
So, first you need to install R, so the a link is given here and depending on the
your operating system whether it is 32 bit or 64 bit you can download the
appropriate file; installation file and the after once you are done with installing
R, then you can go ahead and install this R studio desktop version, this is the
GUI for R. So, the link is, download link is already given here, again depending
on settings all configuration of your desktop or laptop, you can download the
49
appropriate installation file 32 bit or 64 bit and then go ahead with your
installation.
These are some of the packages that I have mentioned here. install.packages is the function that is used for installing R packages on your system. Some of the packages that we are going to use in this particular course are mentioned here. You can see install.packages together with c, the combine function, which combines the names of the packages as strings so that all these packages are installed in one go. You will also see that I have set another argument, dependencies, to TRUE, so if there are any dependencies for these packages they will be installed as well. Once you are done with your installation of R, RStudio and Java, you can go ahead and use this particular function, this particular code, to install these packages.
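A sketch of what that installation line looks like; the full course package list is not reproduced here, so the names below are only examples (xlsx and matrixcalc are the two packages used later in this lecture):

# install a set of packages in one go, along with their dependencies
install.packages(c("xlsx", "matrixcalc"), dependencies = TRUE)

# afterwards a package is loaded into the session with library(), e.g.
library(xlsx)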
Now, let us understand R's GUI. R's graphical user interface is mainly a command line interface; it is quite similar to the bash shell that we have in Linux, or to the interactive mode of the scripting language Python. Many people use Python as well for data mining modeling and statistical modeling, and the interactive version of that language is also based on a command line interface. RStudio is one of the popular GUIs for R and it has been used for this particular course; most of the R scripts have been written using RStudio.
(Refer Slide Time: 04:07)
Now, as we have seen before, RStudio has 4 main window sections; we can see them again. The top left section is where we generally write and save our R code; you can see that the instructions for installing packages are written in this section and have been saved as a file called installation steps.R. Most R scripts have this extension, .R, so this particular file is saved as installation steps.R.
The bottom left section is for the actual execution of R code and for displaying the output; it is also called the console section. So the top left can be called the script section and the bottom left the console section, and it is here that we actually see the execution of the R code and the related output.
The 3rd window section is the top right section, which can be called the data section or environment section; in this particular section we manage our data sets and variables. As we will see later in this lecture, the variables that we initialize or assign values to, and the data sets we load, become visible here once they are loaded into the R environment.
The 4th window section is the bottom right section, where we display plots and seek help for R functions; it can be called the plot and help section. Right now the plot sub-section is active, and therefore any plots that we generate will be displayed here. You also have a help sub-section, so if you are looking for help related to a particular function in R, you can type the name of that function and the help will be displayed there. For example, if we look something up here, the documentation shows how the function can actually be used, along with details of its arguments, notes, references and examples.
(Refer Slide Time: 06:59)
Now, data set import: for this course, data sets are mainly available in Excel files, so we will either be importing data sets from Excel files or creating them in RStudio itself; any hypothetical data set we will create directly in RStudio. With this information, let us start with the R basics and open RStudio.
One important thing we need before we start is to understand the different packages and load them into the R environment. Because the data sets in this course will generally be imported from Excel files, the first line of this particular R script is library(xlsx); xlsx is the package which can actually be used to import a data set from an Excel file.
You can see here in the console section that this particular package has been loaded, and the required packages have also been loaded. Now, suppose we would like to import a data set from an Excel file, for example the sedan car data set that we used in lecture 1; we can actually import it. You will see in the environment section, or data section, that a particular file has been imported.
If you click on this particular entry, a new tab opens in the script section and you can see the data set there. We have 3 variables in this data set: annual income, household area and ownership, and if you scroll down you can see 20 observations. The same is shown in the data section as well: 20 observations and 3 variables.
Now, this is one way of importing data from an Excel file; in this case I have used the function file.choose(), which allows us to browse for the file in our Windows directories. Another way to import an Excel file into the R environment is to use the function read.xlsx and give the complete path to the particular file.
One important difference to note here is that R uses a forward slash in the complete path instead of the backward slash used in Windows systems, so do not forget to change the backslashes to forward slashes. Yet another way to import the data set is to set your working directory; that command is shown next. First, though, let us see how a data set can be imported using the full path name.
So, let us see how the data set from the Excel file can be imported using the full path name of the file. You can see it is the same data set, but imported a second time: you can see df1 here, with 20 observations and 3 variables.
Another way of importing the data set is to set your working directory using the command setwd, giving the full path of your working directory. Just execute this command; now, because your working directory has been changed to where the Excel file is located, you can simply write the name of the Excel file and import the data set again. Execute this code and you will see another file has been imported; it is again the same data set, a 3rd instance of it, with the same 20 observations and 3 variables. A consolidated sketch of these three import approaches is given below.
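As a quick recap, here is a sketch of the three import routes just described; the file name sedan_car.xlsx and the directory C:/data are hypothetical placeholders for whatever your actual file and folder are:

library(xlsx)

# 1. browse for the file interactively
df  <- read.xlsx(file.choose(), sheetIndex = 1)

# 2. give the complete path (note the forward slashes)
df1 <- read.xlsx("C:/data/sedan_car.xlsx", sheetIndex = 1)

# 3. set the working directory first, then use just the file name
setwd("C:/data")
df2 <- read.xlsx("sedan_car.xlsx", sheetIndex = 1)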
Once you have imported your data set, you can use it to start your modeling and to execute the different steps of the modeling process. Because this is an introduction to R, we are covering R's basics, so let us first understand the numeric, character and logical data types: how they are created in R and how they are accessed. How is a numeric variable created in R? You just type a name, say i, which will be a numeric variable in your code, assign a value and execute. You will see in the data section that a variable i has been created and that it holds the value 1. If we want to create a character variable, again you type the name of the variable, in this case country, and assign a value; here we have initialized this variable with India as our country.
Execute this code and you will see in the data section that country has been created with the value India. Similarly, a logical variable can be created; in this case we have a logical variable named flag and the value given is TRUE. TRUE and FALSE are the two options for logical variables. Once we execute this code, you will again see that flag is created in the data section and the value is TRUE.
Now, how do we find out the characteristics of these variables? These 2 functions are useful: the class function can be used to find the abstract class of a particular variable, and the typeof function can be used to find its storage type. For example, for the variable i that we have just created, we can find the abstract class, which is numeric, and the typeof, which is double; so this variable is numeric and is stored as a double in memory. Similarly we can check country: the class of country is character and its storage type is character as well. And similarly we can check flag.
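A minimal sketch of what was just described:

i <- 1              # numeric variable
country <- "India"  # character variable
flag <- TRUE        # logical variable

class(i);       typeof(i)        # "numeric",   "double"
class(country); typeof(country)  # "character", "character"
class(flag);    typeof(flag)     # "logical",   "logical"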
(Refer Slide Time: 15:54)
So, the class of the logical variable flag is logical, and its storage type is logical as well. Now, whether a particular variable is an integer or some other data type can also be examined in R, and coercion from one data type to another can be done as well.
For example, if you want to find out whether the variable i is an integer or not, you can use the function is.integer. If we execute this code we get the answer FALSE, because i was created as a numeric variable and not as an integer variable.
So, that can be checked using the function is.integer. It is also possible for us to coerce this numeric variable into an integer variable. How can that be done? Let us create another variable j and assign the value 1.5 to it; you will see in the data section that j is created with the value 1.5. Now let us check whether this variable is an integer or not; because we created it as a numeric variable it should come out as FALSE, which is the case here.
Now, let us coerce this variable into the integer data type. That can be done using the as.integer function: we pass the same variable as an argument, store the return value back in j, and then display the value of j.
Let us execute this line; you will see 1. So the value 1.5, which was stored in j when it was created as a numeric variable, has now been changed to 1 because the variable has been coerced from the numeric data type into the integer data type. If we now check again whether j is an integer, we get the answer TRUE.
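The corresponding sketch:

j <- 1.5
is.integer(j)       # FALSE: j was created as numeric
j <- as.integer(j)  # coerce to integer; the fractional part is dropped
j                   # 1
is.integer(j)       # TRUE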
Now, another important function is length. The length function helps us find the length of a particular variable: for example, the length of i is 1, and the lengths of country and flag are 1 as well. The next discussion point is vectors.
Vectors are one of the basic building blocks for data in R; the simple variables that we just created, such as i and flag, are actually vectors. Vectors take values from the same class. Let us check whether the variables we just created, i, country and flag, are vectors or not; we can use the function is.vector, which tells us whether these 3 variables are R vectors.
You will see that all 3 variables are vectors. Now, for the creation and manipulation of vectors, there is the combine function c, and the colon operator can also be used. So the c function and the colon operator can be used to create and manipulate vectors. For example, suppose we want to create a character vector of these 3 values: cricket, badminton and football.
(Refer Slide Time: 19:56)
So, this vector v can be created using the c function, passing these 3 strings as arguments; the vector v is created and it has 3 values: cricket, badminton and football. If we want to access individual values we can do so using brackets: v[1], v[2] and v[3] access the first, second and third values respectively.
You can see that v[1] gives cricket in the output. The colon operator can also be used to create a vector, for example v1 <- 1:5; this vector is created with the values 1, 2, 3, 4 and 5, as you can see here. With the colon operator you mention the starting and ending values, and the values in between are filled in automatically.
We can sum these values using the sum function; sum(v1) gives the output 15. Similarly, multiplication with a constant can be done, and if we want to access a particular value of the resulting vector, that can also be done using brackets: v2[3] accesses the 3rd value, and we see the output 6.
If we want to add 2 vectors, that can also be done, v1 plus v2, but here it is important that both have the same number of values. Similarly, suppose we want to find out which values in a particular vector are greater than 8; in this case there are several such values, and we want to identify them.
So, we can execute the line v3 > 8, and the answer is FALSE for the first 2 values, because they are less than 8, and TRUE for the remaining 3 values, which are greater than 8. Similarly, if we want to access the values which are greater than 8, we can write v3 and, within brackets, v3 > 8.
This returns the positions of all the values greater than 8, and those values are then accessed using the brackets. Similarly, if we want to access the values which are either less than 5 or greater than 8, a similar thing can be done; you can see the values 3, 9, 12 and 15, which are all less than 5 or greater than 8.
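A compact sketch of these vector operations; the constant 2 used for v2 is inferred from the outputs quoted above:

v  <- c("cricket", "badminton", "football")
v[1]                    # "cricket"

v1 <- 1:5               # 1 2 3 4 5
sum(v1)                 # 15
v2 <- v1 * 2            # 2 4 6 8 10
v2[3]                   # 6
v3 <- v1 + v2           # 3 6 9 12 15

v3 > 8                  # FALSE FALSE TRUE TRUE TRUE
v3[v3 > 8]              # 9 12 15
v3[v3 < 5 | v3 > 8]     # 3 9 12 15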
Now, sometimes we may need to initialize a vector first and populate it later. Until now, we have been creating variables and vectors and populating them in the same statement; sometimes we might want to do this in 2 steps, first initialize and then populate.
For that we can use the vector function, which initializes a vector of a given length; in this case the vector function is being used to create a vector of length 4. By default the vector function creates a logical vector; if you want a numeric vector, you have to specify the mode argument as numeric. If you then want to reassign one of the values, for example the 3rd value, that can also be done; you can see the value 1.4 has just been assigned there. Similarly, an integer vector can be created by setting the mode argument to integer. The vector v6 here was created with length 0, and if we check its length we indeed get 0; the length function can again be used for that.
Now, from what we have done so far it might look like vectors are one-dimensional arrays; the examples we have gone through certainly give that impression. But in R, vectors are actually defined as dimensionless, and that can be checked using these 2 functions: if we check the length of the vector v5 we get the answer 4, but if we check its dimension using the dim function we get NULL, i.e. undefined.
So, vectors are not actually one-dimensional arrays; they are dimensionless.
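A small sketch of the initialize-then-populate pattern and the dimension check:

v4 <- vector(length = 4)                    # logical vector by default
v5 <- vector(mode = "numeric", length = 4)  # numeric vector of zeros
v5[3] <- 1.4                                # populate the 3rd value
v6 <- vector(mode = "integer", length = 0)  # empty integer vector
length(v6)   # 0

length(v5)   # 4
dim(v5)      # NULL: vectors are dimensionless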
That brings us to the next discussion point, arrays and matrices. R also has these building blocks. There is an array function which can be used to restructure a vector into an array; how that is done we can see through this example.
In the array function an initialization value of 0 is given, and we are creating an array with the dimensions 4 states, 4 quarters and 3 years. This array is created with those 3 dimensions, as we can see here. I then assigned one of the values as 1000, and we can check that here as well.
You can see the 3-dimensional array: this matrix-looking structure in the array is for the first year, this one is for the second year, and this one for the third year. Since we created the array for 3 years, we see 3 matrix structures back to back.
matrix structures back to back matrix lecture have been created. Now, that brings
us to the matrix so a 2 D array is a matrix. So, array can be thought of as a series
of matrix and a 2 D array being a matrix. So, we can use matrix function to
initialize an array, you can see the matrix initialization with the value of 0 for all
63
the elements and you can define a number of row as 3 and number of column as
3 in this case so and row and column argument can be used to define.
Now, with a different initialization we can create the matrix M1, as you can see here. Matrix multiplication is done using the operator %*% (a percent sign, an asterisk and another percent sign); this operator is used for matrix multiplication in R, and this is the result of M1 multiplied by M1 itself.
If we want to find the inverse of a matrix, there is the matrix.inverse function for that, but to be able to use this function we first need to load the library matrixcalc. So we load the matrixcalc library, and once it is loaded we can call matrix.inverse on any (invertible) matrix and get its inverse.
You can see the inverse matrix here. If we want to transpose a matrix, there is the t function, which returns the transpose of a matrix. A short sketch of these matrix operations is given below.
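A sketch of the array and matrix operations just described; the dimension interpretation (states x quarters x years) follows the lecture, but the specific values of M1 are placeholders, since the slide contents are not reproduced here:

# a 3-dimensional array: 4 states x 4 quarters x 3 years, initialized to 0
sales <- array(0, dim = c(4, 4, 3))
sales[1, 1, 1] <- 1000           # assign one value

M  <- matrix(0, nrow = 3, ncol = 3)       # 3 x 3 matrix of zeros
M1 <- matrix(c(2, 0, 1,
               1, 3, 0,
               0, 1, 4), nrow = 3, byrow = TRUE)

M1 %*% M1                         # matrix multiplication

library(matrixcalc)
matrix.inverse(M1)                # inverse (base R's solve(M1) gives the same result)
t(M1)                             # transpose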
Now, the next discussion point is data frames. Data frames provide a structure for storing variables of different data types. Everything we have discussed so far, vectors, arrays and matrices, stores values of a single data type; data frames can store variables of different data types.
Because they provide this flexibility to handle many data types, data frames have become the preferred input format for modeling in R. They can also be understood as a list of variables of the same length. For example, the data set that we imported earlier, the sedan car data set, has 3 variables of the same length: annual income, household area and ownership. Two of the variables, annual income and household area, are of the same type, numeric, but ownership is different.
We can check whether a particular object is a data frame or not using the function is.data.frame. If we check this for df, the answer is TRUE, because the files that we imported at the start of this script were stored as data frames. Another important aspect related to data frames is the dollar notation: using the dollar notation you can access any of the variables stored in a data frame, for example annual income. We can access it with df$ followed by the variable name, and we get the values stored in that particular variable.
We can also check the length of individual variables stored in a data frame using the same notation, passing the variable as an argument to the length function; you can see the answer is still 20. Similarly, as we discussed before, these variables are actually vectors in R, so we can check this with is.vector for annual income, household area and ownership, and we will see that all of them are vectors.
So, data frames can also be viewed as a collection of vectors of the same length. Now, if we look at one particular variable, ownership, it is actually a categorical variable, which is called a factor in R. If we want to check whether this variable is a factor, we can do so using the function is.factor, and the answer is TRUE.
The structure of these variables can be displayed using str, the structure command. The next important operator is the subsetting operator: brackets can also be used to subset a data frame. For example, if you want to access just the 3rd column of a data frame, instead of using the dollar notation you can use the bracket subsetting operator, where the first position is for rows and the second for columns. Since I have mentioned 3 in the column position, we will be accessing the 3rd column.
You can see the 3rd column, containing the ownership values. If you want to access the first 5 rows, you can mention 1:5 using the colon operator in the row position; when nothing is mentioned in the column position, all columns are displayed. If you want to access 2 columns in one go, you can use the combine function within the brackets; and if you do not remember whether a variable was the first or the third column, you can use the combine function with the actual names of those variables, and the same data is accessed, as you can see here.
(Refer Slide Time: 35:37)
Now, if you want to retrieve a few records using comparison operators, for example households with annual income greater than 8 lpa, that can be done with this command: all the rows where the annual income is greater than 8 are displayed, and you can see the values range from 8.1 and 8.2 up to 10.8.
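A sketch of the data frame operations above; the column names Annual_Income, Household_Area and Ownership are assumptions, since the actual names in the Excel file are not reproduced here:

is.data.frame(df)            # TRUE
df$Annual_Income             # dollar notation
length(df$Annual_Income)     # 20
is.vector(df$Annual_Income)  # TRUE
is.factor(df$Ownership)      # TRUE
str(df)                      # structure of all variables

df[, 3]                      # 3rd column only
df[1:5, ]                    # first 5 rows, all columns
df[, c(1, 3)]                # columns 1 and 3
df[, c("Annual_Income", "Ownership")]  # same columns by name
df[df$Annual_Income > 8, ]   # rows with annual income greater than 8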
Now, we can also check the class of df and the typeof of df. You can see the class of df is data.frame and the type is list; so data frames are basically lists. What are lists, then? Lists are a collection of objects of various types, including other lists. The list function can be used to create a list, and we will also come across the double bracket notation; we will see what that is.
A list can be generated, for example, from the different objects we have already created: i, j, v, M, plus one more string that we are adding for this list initialization. Using the list function we create the list, and you can see here that it has been created. Lists can store objects of various types, and those objects can be of different lengths.
So, one big difference is that the objects in a list can be of various types and of different lengths, while in a data frame the variable types can differ but they all have to be of the same length. For example, you can see the double bracket notation in the output: the first list element is the string Roorkee, the second list element is i with value 1, the 3rd element is j with value 1, the 4th element is the character vector cricket, badminton and football, and the 5th element is the matrix.
If we check the class and length of a particular element of the list, we get the following: to pick out a particular element, say the character vector, we have to use double brackets, l followed by the index within double brackets, and you will see the character vector we just looked at in the output. Its length is also displayed as 3. The structure command str can be used with lists as well as with data frames, which are themselves lists.
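A sketch of the list operations; the element values mirror the objects created earlier in the lecture:

l <- list("Roorkee", i, j, v, M1)  # objects of different types and lengths

l[[4]]          # 4th element: the character vector
class(l[[4]])   # "character"
length(l[[4]])  # 3
str(l)          # structure of all five elements

class(df)       # "data.frame"
typeof(df)      # "list": data frames are lists underneath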
(Refer Slide Time: 39:14)
You can see the 5 elements of the list, and the structure of each of them is displayed: Roorkee is a character, then there is a numeric vector, an integer vector, the character vector, and finally a numeric matrix. Our next discussion point is factors. As we discussed before, categorical variables are called factors in R. They can be ordered or unordered; as we discussed in the previous lecture, categorical variables can be either nominal or ordinal, and in R ordinal variables are generally represented as ordered factors while nominal variables are represented as unordered factors.
We already have the ownership variable in our data frame; let us check the class of that variable and find out whether it is an ordered or unordered factor. As you can see, the class of this variable is factor, which means it is categorical; let us then check whether it is ordered or not. You get the answer FALSE, because no order was specified there. There is another function, head, which can be used to display the first 6 values, and if the variable is a factor the levels are also displayed. Let us run this, and you see the levels non-owner and owner displayed for this factor variable. For an integer or numeric variable we would have seen just the first 6 values; for a factor variable R knows it is a factor, so the output also displays the levels used in that variable.
Now, how can we go about creating factor variables? For example, we have one particular variable in our data set, annual income, and we want to group the households into a factor (categorical) variable, where some households belong to the lower middle class, some to the middle class and some to the upper middle class. How can that be done?
Let us first create a vector called income group; its mode is going to be character, because we will be storing the strings lower middle class, middle class and upper middle class, and its length will be the same as that of annual income. Let us create this vector, and then any record with annual income less than 6 can be called lower middle class.
Let us assign this value. Any household with income greater than or equal to 6 and less than 9 can be called middle class, so let us do that, and any household with income of 9 or more can be called upper middle class. Now let us check the values: you will see all of them have been assigned lower middle class, middle class or upper middle class.
(Refer Slide Time: 42:40)
Now, how do we create a factor variable from this? Let us create another variable, income class, which is going to be a factor variable. We use the income group character vector that we just created, and because factor variables have levels, we need to assign some levels. In this case we already know them: lower middle, middle and upper middle, and you can see here that ordered is TRUE. So we are ordering these levels as lower middle, then middle, then upper middle; we give the names of the levels and also pass the argument ordered = TRUE, and this ordered factor is created.
Now we can combine this variable into our data frame df. The cbind command can be used to combine variables column-wise, so we use cbind, and income class is now combined into the data frame. We can check this using the structure command str: you can see income class is an ordered factor with 3 levels.
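A sketch of the whole factor-creation step; the column name Annual_Income is again an assumption, and the income cut-offs (6 and 9 lpa) are the ones described above:

income_group <- vector(mode = "character", length = length(df$Annual_Income))
income_group[df$Annual_Income < 6]  <- "Lower Middle"
income_group[df$Annual_Income >= 6 & df$Annual_Income < 9] <- "Middle"
income_group[df$Annual_Income >= 9] <- "Upper Middle"

income_class <- factor(income_group,
                       levels = c("Lower Middle", "Middle", "Upper Middle"),
                       ordered = TRUE)

df <- cbind(df, Income_Class = income_class)  # add the ordered factor to df
str(df)
head(df$Income_Class)   # values plus levels: Lower Middle < Middle < Upper Middle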
(Refer Slide Time: 43:56)
We can run the head command again to see these levels and their order clearly. You can see that the values are shown along with the levels, and in the levels you can see that lower middle is less than middle, which in turn is less than upper middle; so there is an order in this variable. This particular factor has been created as an ordered variable.
The next discussion point is the contingency table. In this course, especially for classification tasks, we will be using contingency tables to understand the results of a particular classification technique or algorithm, so first we need to understand these tables. table is the command that can be used to create them; a contingency table is generally used to store counts across factors. Let us see this through an example: in df we have the ownership variable and the income class variable that we just created.
Let us see, for the owners, what their numbers are across the different income classes, and similarly for the non-owners. We create a contingency table using the table command; we just need to pass these 2 factor variables and it is done.
(Refer Slide Time: 45:22)
You will see that the first row is for non-owners and the second row for owners, with 3 columns: lower middle, middle and upper middle. In the non-owner row there are 6 lower middle class non-owners, 4 middle class non-owners and 0 upper middle class non-owners. If we look at the owner row, there are 0 lower middle class owners, 8 middle class owners and 2 upper middle class owners. This kind of count we can always get using the table command.
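The corresponding sketch, continuing with the assumed column names:

tbl <- table(df$Ownership, df$Income_Class)  # counts of ownership by income class
tbl

class(tbl)   # "table"
typeof(tbl)  # "integer"
dim(tbl)     # 2 3: two ownership levels by three income classes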
This command is going to be useful for us when we do our classification tasks. If we want to find the class, type and dimension of this table, we can execute these lines: you can see table, integer, and the dimensions 2 and 3. This ends the introduction to R; in the next supplementary lecture we are going to cover basic statistical techniques.
Thank you.
Business Analytics and Data Mining Modeling Using R
Prof. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture - 04
Basic Statistics Part-1
Welcome to the course on Business Analytics and Data Mining Modeling Using R. This is our supplementary lecture number two, on basic statistics using R. Let us start. As we have discussed, there are three types of analytics: descriptive, predictive and prescriptive. We are going to cover the descriptive part, and we are going to learn some basic statistics using R.
(Refer Slide Time: 00:54)
So, let us open RStudio. As in the previous lecture, we first need to load this particular library. Why do we need it? Because we want to import the data set from an Excel file. The data set we want to import is the same one, the sedan car data set. Let us execute this line; you can see in the data section that the data set has been imported, with 20 observations and 3 variables.
Let us have another look at the first six rows of this data set: you can see annual income, household area and ownership, the same variables as before. Now, let us start our descriptive analysis. One of the first functions that is popular and used quite often is the summary function. The summary function in R helps you get an idea of the magnitude and range of the data, and it also provides several descriptive statistics like the mean, the median and counts, as we will see in the output. Let us execute summary(df); in df we have three variables: annual income, household area and ownership.
Looking at the output, let us start with annual income. You can see the values range from a minimum of 4.3 to a maximum of 10.8; the mean lies at about 6.8 and the median at about 6.5. You can also see the first and third quartiles, at 5.75 and 8.15; these quartiles give you an idea of where the majority of the values lie.
Now let us look at the second numerical variable, household area. Here all the values lie between the minimum of 14 and the maximum of 24, the majority of the values lie between the first quartile of 17 and the third quartile of 21, and the mean is 18.8 with the median at 18.75. These statistics apply mainly to numerical variables; for the categorical (factor) variable that we have, ownership, only the counts are displayed, since the statistics for numerical variables are not applicable.
Now, let us move on to other basic statistical methods. The first one is correlation: how do we compute the correlation between two variables? Correlation is applicable between two numerical variables. If we want to find out how a particular variable is correlated with another variable, that can be done using the cor function; we pass the two arguments, annual income and household area, and find the correlation between these two variables. The correlation value comes out to be 0.33 for annual income and household area.
Correlation gives you an idea about the relationship between variables; the correlation value lies between minus 1 and 1. In this case it is plus 0.33. Values closer to 1 or minus 1 signify a high degree of correlation, and values closer to 0 indicate a low degree of correlation between the variables. We will discuss correlation further in coming lectures.
The next important statistic is covariance, for which we have the cov function in R. Again we pass two numerical variables; in this example, annual income and household area. Let us execute this line, and you can see the covariance computed between these two variables. Covariance again relates to the spread of values: how much common spread, or overlap, there is between the two variables is indicated by the covariance value.
Now, there are other simple statistics that we can compute using simple R functions, such as the mean. The mean was already part of the summary function, but if we are interested in just computing the mean of a particular variable, that can also be done using the mean command. Let us execute this for annual income; you can see the value, the same as displayed by the summary function. Similarly, the median can be computed: we have the median function in R for that.
If you are interested in a few more statistics, for example the interquartile range, which is the difference between the first and third quartiles: as we discussed, the summary function gives the first and third quartiles, and this is another way of getting the same information. Let us execute this line with the IQR function, annual income passed in as the argument, and you get the value.
If we are just interested in the minimum and maximum values, there is a direct function called range with which we can find them, so we do not need to depend on the summary function and can use this standalone function to get the specific estimate. For the standard deviation, the function sd is available in R; we can always compute the standard deviation of any variable, and for annual income it comes out to be about 1.7.
Similarly, if you want to compute the variance of a particular variable, meaning the spread of values of that variable, it can be computed using the var function in R, as you can see. So the summary function covers some key statistics for all the variables in your data frame in one go, and if you are interested in one specific statistic you can use the corresponding standalone function and compute the same; a short sketch of these functions follows.
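A sketch of the descriptive statistics just covered, again with assumed column names:

summary(df)                               # overview of all variables

cor(df$Annual_Income, df$Household_Area)  # correlation, about 0.33
cov(df$Annual_Income, df$Household_Area)  # covariance

mean(df$Annual_Income)
median(df$Annual_Income)
IQR(df$Annual_Income)                     # third quartile minus first quartile
range(df$Annual_Income)                   # minimum and maximum
sd(df$Annual_Income)
var(df$Annual_Income)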
Now, there are some important functions available in R which we may need to use from time to time, sometimes to transform a particular variable, sometimes to do a specific task which is repetitive in nature; when a ready-made function is available in R, we can simply use it. One such function is apply. In coming lectures we will keep learning about many useful functions from R. The apply function can be used if you want to apply a function to several variables in a data frame; for example, if you want to compute the standard deviation of all the numerical variables in a data frame, that can be done in one go using this function.
The first argument is the data frame (or the subset of variables) on which you want to apply the function. Since variables are generally recorded in columns, the MARGIN argument indicates this: a MARGIN value of 2 means that the function is to be applied column-wise. The third argument is FUN, the function to apply; in this example we want to compute standard deviation values, so sd, the function we have seen before, is passed in. If you execute this line for the two variables, annual income and household area, you can see that the standard deviation values have been computed.
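The corresponding one-liner, assuming the two numeric variables are the first two columns of df:

apply(df[, 1:2], MARGIN = 2, FUN = sd)   # column-wise standard deviations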
Sometimes you may have to write your own functions, generally called user-defined functions: a suitable predefined function might not be available in R, and you may be required to write your own. Here is a very simple example, just to give you an idea of how you can write your own function and use it in your modeling, data preparation and transformation steps. This function provides the difference between the maximum and minimum values for all the variables in a data frame. First you need to come up with a name for your function; because we want to compute the difference between the max and min values, the name I have given is mmdiff (mm for max-min, plus diff for difference). Then you use the keyword function to define it and mention the argument it accepts, in this case a data frame. Within this function I have used the built-in function apply, which again takes the data frame as its first argument and a MARGIN of 2 for columns; and within apply I am defining one more, anonymous, user-defined function: function(x) max(x) - min(x). Let us execute this code, so that the function becomes available for us to use later.
Now you will see that a functions section has been created in the data section, with mmdiff as the function name; it can now be called any number of times in your code. Let us call mmdiff, passing the data frame restricted to the first and second columns as the argument. Execute this code and you will see that the difference between the max and min values for the two variables, annual income and household area, has been computed, as you can see here. If you want to verify that the function you have written is working correctly, you can do so.
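A sketch of the user-defined function and its call:

# difference between max and min for every column of a data frame
mmdiff <- function(d) {
  apply(d, MARGIN = 2, FUN = function(x) max(x) - min(x))
}

mmdiff(df[, 1:2])   # annual income: 10.8 - 4.3 = 6.5; household area: 24 - 14 = 10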
(Refer Slide Time: 12:39)
Let us run the summary command and see whether our user-defined function has produced the correct output. In the summary you can see that for annual income the difference between the max and min values is 10.8 minus 4.3, which is 6.5, so that is correct. For household area the max value is 24 and the min value is 14, the difference being 10. So your user-defined function is giving the correct output.
Now, let us move on to our next part, initial data exploration. Beyond the basic statistics we have just discussed, we sometimes need to understand a bit more, for example whether there is a potential linear relationship between variables, or what the distribution of the data looks like. For that, some level of visualization is required, so we are now going to discuss some techniques related to visualization; these are things that should be done before starting the formal analysis or formal modeling.
One of the most important visual analyses can be done using a scatterplot. For this, I am going to generate some hypothetical data. The function rnorm can be used to randomly generate data that follows a normal distribution. The first argument I am passing here is 100, which means I want to generate 100 observations or values. Let us generate the values and store them in x; you will see in the data section that x has been created as a numeric vector of 100 values, generated randomly and following a normal distribution.
Now we can generate another variable, y; let us generate it as x plus rnorm, in this case specifying the mean as 0 and the standard deviation as 0.6. Execute this line and y is generated; you can see another vector has been created with the same number of observations, 100. If we want, we can combine these two variables into a data frame; that can be done using the data.frame command, which combines the two variables and creates the data frame. Let us execute this line. Now let us see what the data looks like: in the first six observations you can see that the x and y data points have been randomly generated. Let us also look at the summary of this data frame, which is now available to us.
Now, the scatter plot. The plot command is a generic command available in R and can be used to generate many kinds of plots; in this particular case we are generating a scatter plot. In the plot command, the first argument is the variable to be plotted on the x-axis and the second argument is the variable to be plotted on the y-axis. Then las is another argument, available mainly for visual appeal (it controls the orientation of the axis labels); you can seek help to get more information on las. Another important argument is main, which gives the title of the plot; in this case we have given the title as scatterplot of x and y.
It is also important to label the x-axis and y-axis, because when we use a data frame and the dollar notation, those expressions would otherwise be taken as the default names for your axes; in this case we have named the x-axis x and the y-axis y. You can also specify limits for your x-axis and y-axis, because sometimes only a small portion of the plot area displays your data and the plot does not look good; if you restrict the limits appropriately, your data points will cover the majority of the plot area.
As you can see from the summary command that we just ran, let us look at the min and max values: the min value is minus 2.13 for x and the max value is 2.77 for x, so we can say that all the values are going to lie within the range minus 3 to 3, and therefore we have given the x limit as minus 3 to 3. Similarly for y, the minimum value is minus 2.79 and the max value is 3.37, so all these values lie within the range minus 4 to 4, and therefore we have given the y limit as minus 4 to 4. Let us execute this code, and you will see a plot has been created.
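A sketch of the data generation and the scatterplot call; the exact values will differ from the lecture's because the data are random (set.seed is added here only to make the sketch reproducible):

set.seed(123)
x <- rnorm(100)
y <- x + rnorm(100, mean = 0, sd = 0.6)
d <- data.frame(x, y)

head(d)
summary(d)

plot(d$x, d$y, las = 1,
     main = "Scatterplot of X and Y",
     xlab = "x", ylab = "y",
     xlim = c(-3, 3), ylim = c(-4, 4))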
(Refer Slide Time: 18:32)
You can see all the data points, and from this plot you can also see that a line could be drawn through them which would closely fit the data; so there seems to be a linear relationship between x and y. Why is this kind of relationship visible here? Mainly because of the way we generated x and y: x was randomly generated, and y was x plus some additional randomly generated noise, so the linear relationship comes from there.
Now let us start our discussion on hypothesis testing. What is hypothesis testing about? It is one of the most common statistical techniques in use. Whenever we formulate a business problem, one part of it is going to be data mining, analytics or statistics related, and in statistical modeling the first step is generally the formulation of a hypothesis, so we are going to learn this technique here as well. Hypothesis testing is generally about comparing populations, for example comparing the exam performance of students from two different class sections: we want to understand how class A students have performed in their exams and whether their performance is significantly different from class B or essentially the same. These kinds of comparisons can be performed using hypothesis testing.
Essentially what we are doing is testing the difference of means between two data samples, say class A and class B: we compare the means of these two samples and statistically determine whether there is a difference in performance or not. The common technique we use is to assess this difference and its statistical significance. The idea, as we discussed, is to formulate an assertion and then test it using data.
Now, what are some of the common starting assumptions in hypothesis testing? Generally we start with the assumption that there is no difference between the two samples. In the example we just discussed, we assume that the performance of students in class A and the performance of students in class B is similar, i.e. there is no difference; that is our starting point in hypothesis testing. This starting point is referred to as the null hypothesis, denoted H0: the null hypothesis states that there is no difference between the two samples.
The alternative hypothesis comes in when we have some reason to believe that the performance of class A is superior to that of class B, or vice versa; we state that using the alternative hypothesis, denoted Ha, in which we assert that there is a difference between the two samples.
Now let us look at a few more examples of hypotheses and how we can formulate the null and alternative hypotheses. Here is one more example, the one we already discussed: students from classes A and B have the same performance in the examination is the null hypothesis, and students from class A perform better than students from class B is the alternative hypothesis.
Some more examples are given on this slide. For instance, the new data mining model does not predict better than the existing model could be the null hypothesis, and the alternative hypothesis could be that the new data mining model predicts better than the existing model. What happens after we do this hypothesis testing? Either the test results lead to rejection of the null hypothesis in favour of the alternative, or to acceptance of the null hypothesis.
(Refer Slide Time: 23:32)
Let us look at another example, this one more related to regression analysis; when we discuss regression analysis in coming lectures this will seem more relevant to you. An important null hypothesis in a regression analysis setting is that a regression coefficient is zero, i.e., the variable has no impact on the outcome; the alternative hypothesis is that the regression coefficient is nonzero, meaning the variable has an impact on the outcome. These are some examples of hypothesis formulation and of how we can state our null and alternative hypotheses.
Now, by the central limit theorem, whenever we gather more than about 30 observations, the distribution of the sample mean starts to follow a normal distribution. Therefore the normal distribution is a commonly occurring property that we generally encounter across different samples and populations, and it is easier for us to use this characteristic of the normal distribution for our hypothesis testing.
As we discussed, hypothesis testing is generally about the difference of means, so we usually look to test a difference of means. The idea is to draw inferences about two populations, say P1 and P2; how we can draw inferences from these populations is the main question, and generally this is done by comparing means. For population one and population two, let the population means be mu1 and mu2.
We can then state our null hypothesis as mu1 equal to mu2, meaning both populations have the same mean and are therefore taken to be the same, and the alternative hypothesis as mu1 not equal to mu2, meaning there is a difference between the two population means and therefore between the two populations. How do we do this? Since it is generally difficult to get information about the whole population, we draw random samples from these populations.
So our basic approach is to draw random samples from the two populations and then compare the observed sample means: mu1 and mu2 are unknown, so we take samples from the populations and compute the observed sample means, denoted x1 bar and x2 bar; x1 bar for the sample from population P1 and x2 bar for the sample from population P2.
Then we can go about performing a hypothesis test. Two popular tests are Student's t-test and Welch's t-test; we will go through them one by one. Let us first discuss Student's t-test. One of its basic assumptions concerns the two population distributions P1 and P2: we assume that they have equal variance. We do not know the variances of these two populations, but we assume them to be equal; only then can Student's t-test be performed. Now, say we have two samples of n1 and n2 observations respectively from the two populations P1 and P2, drawn randomly and independently. These are some of the assumptions underlying Student's t-test.
Another assumption concerns how the t-statistic is actually computed. We assume that P1 and P2 are normally distributed, which is generally reasonable because of the central limit theorem. If P1 and P2 are normally distributed with the same mean and variance, then the t-statistic follows a t-distribution with n1 + n2 − 2 degrees of freedom, and it is computed as

t = (x̄1 − x̄2) / (Sp × √(1/n1 + 1/n2)),

that is, the difference of the observed sample means divided by the pooled sample standard deviation times the factor √(1/n1 + 1/n2). The pooled sample variance is defined as

Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2),

where S1² is the variance of the sample drawn from population one and S2² is the variance of the sample drawn from population two; you can see that a kind of weighted average is taken to compute the pooled sample variance. This is the statistic that is computed, under the assumption that the null hypothesis is true. We need to keep in mind that this requires P1 and P2 to be normally distributed with the same mean and variance; assuming the null hypothesis is true, we can then go ahead and compute this t-statistic.
As we said, Sp is the pooled standard deviation and S1 and S2 are the sample standard deviations; Sp² is the pooled variance and S1² and S2² are the sample variances. Another point concerns the shape of the t-distribution: it is generally similar to the normal distribution, and becomes more so when the degrees of freedom reach 30 or more. As we have more and more observations in our samples, the t-distribution looks more and more like the normal distribution. Both the normal distribution and the t-distribution are also called bell curves, because their shape looks like a bell.
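As a sketch of this computation in R, on two simulated samples (the numbers here are illustrative, not from the lecture):

set.seed(42)
a <- rnorm(30, mean = 72, sd = 8)   # e.g. exam scores for class A
b <- rnorm(30, mean = 76, sd = 8)   # e.g. exam scores for class B
n1 <- length(a); n2 <- length(b)

sp2   <- ((n1 - 1) * var(a) + (n2 - 1) * var(b)) / (n1 + n2 - 2)  # pooled variance
tstat <- (mean(a) - mean(b)) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))    # t-statistic
tstat

# the built-in equal-variance (Student's) t-test gives the same statistic
t.test(a, b, var.equal = TRUE)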
Let us try to understand this t-statistic. Its numerator is x1 bar minus x2 bar, where x1 bar and x2 bar are the observed sample means. If x1 bar and x2 bar are quite close to each other, the observed t value is also going to be close to 0; when it is 0, the sample results agree exactly with the null hypothesis, so the null hypothesis is supported and in that case we accept it. In general, if the observed sample means x1 bar and x2 bar are close to each other, i.e. t is close to 0, the null hypothesis is going to be accepted.
Now let us understand the next point. If the observed t value is far enough from 0, and the t-distribution indicates a low enough probability for such a value, then this leads to rejection of the null hypothesis. In other words, if one of the observed sample means is much larger than the other, leading to a t value far from 0 whose probability under the t-distribution is low, that can lead to rejection of the null hypothesis. A t value falling in the corresponding tail areas of the curve should occur less than 5 percent of the time; in that case the null hypothesis would be rejected. We will stop here, and continue our discussion on basic statistics using R in the next part.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology Roorkee
Lecture - 05
Basic Statistics- Part II
Welcome to the course Business Analytics and Data Mining Modelling Using R. This is the fourth supplementary lecture on basic statistics. In the previous lecture we stopped at our discussion of Student's t-test, so let us pick up from there. As we discussed in the previous lecture, the most common type of hypothesis testing that we actually do is about the difference of means.
So, let us take this example: if P1 and P2 are normally distributed with the same mean and variance, then the t-statistic follows a t-distribution with n1 + n2 − 2 degrees of freedom. We talked about this particular formula of the t-statistic in the previous lecture as well:

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{1/n_1 + 1/n_2}}$

How is the pooled sample variance computed? You can also see that here; it is a kind of weighted average of the sample variances of the samples drawn from population 1 and population 2:

$S_p^2 = \dfrac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
Now, as far as the shape of the t-distribution is concerned, it is quite similar to the normal distribution, and as the number of observations (the degrees of freedom) reaches 30 or more, it closely resembles the normal distribution.

So, both the t-distribution and the normal distribution are bell-shaped curves. Now, some specific discussion points on the t-statistic formula: you can see that the numerator of the t-statistic is the difference of the sample means. From there you can understand, for the testing of our null and alternate hypotheses, that if the observed t value comes out to be 0, that would indicate the sample results are exactly what the null hypothesis states, that is, the means are equal.
Similarly, if the observed t value is far enough from 0 and the t-distribution indicates a low enough probability, let us say less than 0.05, then in that case our null hypothesis H0 would actually be rejected. To get a better understanding, let us look at the curve.
(Refer Slide Time: 03:18)
So, for a t value which is far enough from 0, the probability of that particular t value falling in the tails of this curve is quite low, let us say 0.05 or less; in that case the null hypothesis would be rejected, because our t value is falling in one of these two tail areas rather than in the main central area, and under the null hypothesis there is only a small chance, 0.05, of a sample result landing in these two smaller regions. So, therefore, in this case the null hypothesis would actually be rejected. The same discussion can also help us in understanding the confidence interval, which we will cover later in this lecture. You might have come across terms like 95 percent confidence interval, 90 percent confidence interval, 99 percent confidence interval. In this example, where we talked about a small probability of 0.05 of the t-statistic falling in the tail region, that corresponds to a 95 percent confidence interval. More discussion on confidence intervals will come later during this lecture.
So, another way to understand this is that the t value falls in the corresponding tail areas of the curve less than 5 percent of the time. The low probability that we talked about, 0.05, is generally denoted by alpha and is also known as the significance level of the test.
Now, how do we find out whether the null hypothesis is going to be rejected or not? We generally compute a critical value t*, which is determined in such a way that the probability of the magnitude of the observed t value being greater than or equal to t* is exactly alpha. Once that t* value is determined, we compare it with the observed t value, and if the magnitude of the observed t value is greater than or equal to t*, then the null hypothesis is rejected.
You can see t* is determined such that the probability of |t| being greater than or equal to t* equals alpha, which is 0.05 in this example; and the third point is that the null hypothesis is rejected if the observed value of t is such that |t| is greater than or equal to t*.

Now, the significance level of a statistical test is the probability of rejecting the null hypothesis when it is in fact true. For example, in this particular case we assume that alpha is 0.05; so if the null hypothesis is true and alpha is 0.05, then the observed magnitude of t would exceed t* only 5 percent of the time.
Now, another term that you might have come across is the p-value. The p-value is the sum of the probability of t being less than or equal to minus the absolute value of the observed t, and the probability of t being greater than or equal to the absolute value of the observed t; the sum of these two numbers is the p-value. Now, let us open RStudio and go through an example related to Student's t-test.
So, let us first create some hypothetical data; we have these two variables x1 and y1. We are going to use the rnorm command that we discussed in the previous lecture. For x1 we want 20 observations with mean 50 and standard deviation 5, and y1, corresponding to the second population, has 30 observations with mean 60 and, again, standard deviation 5.

So, you can see that, because of the assumptions related to Student's t-test, we have kept the standard deviation value the same while creating these two samples. Let us compute this: you would see that x1 has been created with 20 randomly generated values following a normal distribution. Now let us execute y1 to get the second sample; you can see in the environment section that y1 has 30 observations, again randomly generated and again following a normal distribution.
Now, let us come to Student's t-test. We have the t.test function that is available in R, and it can be used to run our Student's t-test. In this case we pass the t.test function x1, the first sample, and y1, the second sample, and you would see there is another argument called var.equal; this relates to the variances, because, as we understand, in Student's t-test the variances of the two populations are supposed to be equal. So, in this case var.equal is assigned as TRUE.
So, once you write this particular code we can execute it; let us run the t-test here.

You would see in the result that it is a two-sample t-test, because we are trying to compare the means of the two samples x1 and y1, and the data are mentioned as x1 and y1. You also get the t-statistic, t = −7.1424, and the degrees of freedom are mentioned as 48: the number of observations in the x1 sample was 20 and in the y1 sample was 30, so the total is 50, and then we subtract 2 for the two estimated means, which gives 48.
Now, you can also see that a p-value has been computed; how this computation is actually done, we discussed in the slide. You would also see the alternative hypothesis given in the result: the true difference in means is not equal to 0. So, the alternative hypothesis is favoured and the null hypothesis has been rejected in this particular case. You would also see that a 95 percent confidence interval is mentioned there, between roughly minus 14 and minus 8.289; we will have a discussion on confidence intervals as well, later in this particular lecture.
Now, you would see that the mean of x and the mean of y are also given there. Now, if you want to compare the results of Student's t-test with the critical t-value corresponding to the 0.05 significance level, especially for a two-sided, two-sample hypothesis test, we can do this using the qt function. The qt function can give us this value: in this case the significance level is 0.05, and it is divided by 2 because the distribution is symmetric, so there are going to be two tail regions and this probability has to be split equally between them. You also give the degrees of freedom as 48, that is 20 plus 30 minus 2, and set lower.tail to FALSE. With this we can find the critical t-value for the 0.05 significance level.
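A sketch of that call:

qt(0.05 / 2, df = 48, lower.tail = FALSE)   # critical t-value, roughly 2.01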
So, let us execute this code; you would find that the critical t-value for this significance level of 0.05 and 48 degrees of freedom comes out to be about 2.01. Now, if you compare this with the observed t-value that we just saw, −7.1424, its magnitude is much larger than 2.01. So, therefore, the null hypothesis is actually rejected.
Let us go back to our discussion. Now, another test that can be performed for hypothesis testing related to the difference of means is Welch's t-test. When do we use Welch's t-test? When the population variances are not equal; when the assumption of equal population variances is not reasonable and cannot be met, then we can use Welch's t-test for our hypothesis testing. The formula for the Welch t-statistic is given here: it is again the difference between the two sample means, divided by a combined standard error built from the two sample variances,

$t_W = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$
So, this is the formula for computing the t-statistic. Now, as far as the interpretation is concerned, as we discussed for Student's t-test, here also the difference of sample means is in the numerator. So, the same points are applicable: if the numerator is 0, then probably the null hypothesis is true, and if the numerator is far from 0 and the associated probability is low, like 0.05 or less, then the null hypothesis would actually be rejected. In that sense the interpretation of the results is going to be the same; the one important difference is that the population variances cannot be assumed to be equal.
Now, another assumption that was applicable in Student's t-test, that the random samples are drawn from normally distributed populations, is still applicable. Again, this Welch t-statistic also follows a t-distribution, which, as we discussed, is very similar to the normal distribution and becomes almost normal as the degrees of freedom reach 30 or more.
So, let us do a small example for Welch's t-test; let us open RStudio. We can use the same data to perform Welch's t-test as well. You can see that the same t.test function can again be used to perform this particular test; the only difference is that var.equal in this case is assigned as FALSE. So, you can see the third argument var.equal is FALSE, and we are passing on the same two samples x1 and y1 and doing the test on the same samples.
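A sketch of the call (var.equal = FALSE is in fact the default in R's t.test):

t.test(x1, y1, var.equal = FALSE)   # Welch's two-sample t-test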
So, let us execute this particular line. You can see here that the name has now changed to Welch two-sample t-test, the two samples being x1 and y1. You can also see the Welch t-statistic, which comes out to be −6.6412, and the degrees of freedom come out to be 30.926; the degrees-of-freedom computation in Welch's t-test is slightly different from Student's t-test, and we will not go into the detail of that. The p-value again has the same interpretation and meaning; here also you get a p-value, and again this p-value is less than the low probability value we talked about, 0.05. The alternative hypothesis is that the true difference in means is not equal to 0, so in this case also the null hypothesis is rejected. You can also see in the results that the 95 percent confidence interval values are mentioned there, about minus 15 and minus 7.99; more discussion on confidence intervals will come in a while in this lecture.
Now, the means of these two samples, the sample estimates, are also given: the mean of x and the mean of y. Now, if we go back to the earlier computation of the critical t-value corresponding to the 0.05 significance level and the given degrees of freedom, we can do that computation again, and we will find that the magnitude of the observed t-statistic, −6.6412, exceeds the critical value; therefore, the null hypothesis has to be rejected.
So, let us go back; the next discussion point is the confidence interval. A confidence interval provides an interval estimate of a population parameter using sample data. Till now what we were looking at was actually the point estimate, but using a confidence interval we can also provide an interval estimate of a population parameter. This confidence interval in a way also tells us about the uncertainty that is associated with the point estimate: the point estimate might not be accurate, and the confidence interval in a way is expressing this uncertainty.
Now, another way of understanding the confidence interval is as a statement of how close x bar is to mu, because our x bar is computed from a sample randomly drawn from the normally distributed population. The confidence interval gives us some sense of, and reduces some of the uncertainty about, the sample estimate that we have; it tells us a range in which, with some stated confidence, we can say the population parameter is going to lie.
For example, a 95 percent confidence interval estimate for a population mean straddles the true unknown mean 95 percent of the time. What we actually mean is that if we compute interval estimates at the 95 percent confidence level, then 95 percent of the time the interval is going to contain the population mean. The same thing can be expressed in this form: $\mu \in \bar{x} \pm 2\sigma/\sqrt{n}$, approximately, for 95 percent confidence. Using this particular formula the range can be computed, and the population mean will be straddled by this range 95 percent of the time. The same holds for any other confidence level: if we are talking about a 99 percent confidence interval, then 99 percent of the time the interval will straddle the mean, and similarly for 90 percent.
percent. Now, at this point we can discuss 2 important resource related to errors type 1
and type 2 errors. So, is this particular classification table that is displayed in this
particular slide, you can see when type 1 error and type 2 error can actually occur. So, for
example, if null hypothesis we look at type 1 error if null hypothesis is rejected, while
102
null hypothesis is being true, so that is called as type 1 error, while the null hypothesis
actually true, but it has being rejected using our statistical test or hypothesis test.
A type 2 error occurs when the null hypothesis is false, but using our hypothesis test or statistical test we actually accept it. So, these are the two situations in which type 1 and type 2 errors happen; the other two cells are the correct outcomes, when our hypothesis test accepts a null hypothesis that is true, or rejects a null hypothesis that is false. Now, how do we overcome the problems related to type 1 and type 2 errors? For a type 1 error we can look at the significance level, which is denoted by alpha.
So, we can manage this particular error using an appropriate significance level: if we reduce alpha, there is less chance of committing a type 1 error. Therefore, many times you would see that researchers prefer a 99 percent confidence interval over 95 percent, because they do not want to commit a type 1 error, so they reduce alpha depending on their acceptance level. In some research streams or research domains even a 90 percent confidence interval is accepted, but in that case there is a higher risk of committing a type 1 error.
Now, the type 2 error is generally denoted by beta, and it can generally be managed using an appropriate sample size: if you keep increasing your sample size until some sort of saturation is reached, then there is less chance of committing a type 2 error. Now, another point related to hypothesis testing is the power of a test. So, what is the power of a test? The power of a test is about correctly rejecting the null hypothesis.
If you go back to the table that we had and look at the row where the null hypothesis is rejected, the only problem is when the null hypothesis should actually be rejected and it is not, which is the type 2 error. So, you would see that the power of a test is computed as 1 minus beta, because whenever there is a type 2 error it reduces the test's ability to correctly reject the null hypothesis; therefore 1 minus beta is called the power of a test.
This power of a test is also used to determine the sample size, because, as we talked about, one way to manage or handle the type 2 error is selecting an appropriate sample size. So, if we want to find out what the appropriate sample size would be, the power of a test can be a good indicator.
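A hedged sketch of this in R, using the built-in power.t.test function with purely illustrative numbers (detecting a difference of 5 units between group means when the standard deviation is 5, at alpha = 0.05 and power = 0.8):

power.t.test(delta = 5, sd = 5, sig.level = 0.05, power = 0.8,
             type = "two.sample", alternative = "two.sided")
# reports n, the required number of observations per group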
Now, the next important statistical test is ANOVA. Till now, what we have been talking about was mainly about two populations; what happens with hypothesis testing if you are dealing with more than two populations? In that case ANOVA is used: it is used for more than two populations or groups instead of performing multiple t-tests. If we have more than two populations, one alternative is to perform multiple pairwise t-tests for the different groupings. That is one solution, but it can be cumbersome, the interpretation can be cognitively very difficult for us, and the probability of committing a type 1 error would also increase, because with multiple t-tests you have to interpret every pair, each result is influenced by the other tests, and it becomes very difficult to interpret the results and to manage the probability of committing a type 1 error.
So, therefore, ANOVA is preferred when more than two populations are involved. Another important point is that ANOVA is in a way a generalization of the hypothesis test used for the difference of two group means. If we were to perform multiple t-tests for n groups, we would actually have to do n times (n minus 1) divided by 2 tests to reach any conclusion.
Now, the typical null and alternative hypotheses in ANOVA's case are these: in the null hypothesis we assume that all the population means are equal, which is quite similar to what we do in the difference-of-means test, and the alternative hypothesis is that at least one pair of population means is not equal.

Again, the assumptions are quite similar: each population is normally distributed with the same variance, and the testing is mainly about whether the observations in the different populations are tightly grouped around their own means or spread across the populations; this is what we are trying to find out.
(Refer Slide Time: 28:18)
Now, there are two important statistics that we compute in the ANOVA process. One is the between-groups mean sum of squares, $S_B^2$, which is an estimate of the between-groups variance. It is computed as

$S_B^2 = \dfrac{1}{k-1}\sum_{i=1}^{k} n_i\,(\bar{x}_i - \bar{x})^2$

where k is the number of groups, $n_i$ is the number of observations in the ith group, $\bar{x}_i$ is the mean of the ith group, and $\bar{x}$ is the mean over all the groups.

The second estimate required is the within-group mean sum of squares, $S_W^2$, which is an estimate of the within-group variance; here we are trying to capture the homogeneity within a group relative to its heterogeneity with respect to the other groups. The within-group variance is computed as

$S_W^2 = \dfrac{1}{n-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$
(Refer Slide Time: 29:32)
So, once these two computations are done, the between-groups mean sum of squares and the within-group mean sum of squares, we compare these two statistics: if $S_B^2$ is sufficiently greater than $S_W^2$, then we can say that some of the population means are different, and therefore the null hypothesis would actually be rejected in this case.

This is done using the F-test statistic, $F = S_B^2 / S_W^2$, and the F statistic is then used to find out whether the null hypothesis is accepted or rejected. So, let us go through a small example for ANOVA; let us open RStudio. Again, in this case we have created hypothetical data.
(Refer Slide Time: 30:39)
So, we are talking about ads: there are three options, AD1, AD2 or NOAD (no ad at all), and the purchase that can be associated with each of these. For these three situations, AD1, AD2 and NOAD, we are creating samples randomly, again using the rnorm function, as you can see here. So, let us first create this particular variable, ads; you will see a character vector ads has been created, and the sample size is 100.

Then we generate the purchase values: for the set corresponding to AD1 the mean is 500, for AD2 the mean is 600 with standard deviation 80, and for the NOAD case the mean is 200 with the same standard deviation. You can see that the standard deviation is kept the same because that is part of the assumption that the variances should be equal. So, let us execute this particular code; you would see that purchase has been created. Now we can create a data frame of these two variables, ads and purchase, so let us do this. This is how the first six observations of the data look: the Ads column shows values like NOAD, NOAD and so on for some of the records, and the corresponding purchase values are also given.
(Refer Slide Time: 32:14)
Now, if you are interested in a summary, you can see how AD1, AD2 and NOAD are distributed: 27, 32 and 41, which is the split of the 100 observations. Similarly, if you are interested in the statistics related to AD1, especially the purchase part, you can see that the mean purchase is about 493 and the min and max values are also there; we can do the same exercise for AD2, finding the purchase statistics with respect to AD2, and similarly for the NOAD situation.
Now, we have the aov function to perform the ANOVA test. In this case you can see the first argument is actually a formula: purchase is being tested with respect to ads, and the data argument is df2, the data frame we have just created. So, let us run this particular command; now, let us look at the results.
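A sketch of the whole example as described; the exact way the lecture generates ads and purchase is not shown, so the sample() call and the set.seed value below are assumptions, with the group means (500, 600, 200) and common standard deviation (80) taken from the description above.

set.seed(123)                                                  # assumed seed, only for reproducibility
ads <- sample(c("AD1", "AD2", "NOAD"), 100, replace = TRUE)    # 100 records across 3 ad conditions
grp_means <- c(AD1 = 500, AD2 = 600, NOAD = 200)
purchase <- rnorm(100, mean = grp_means[ads], sd = 80)         # equal sd across groups
df2 <- data.frame(ads = factor(ads), purchase = purchase)
head(df2); summary(df2)

fit <- aov(purchase ~ ads, data = df2)   # one-way ANOVA: purchase tested with respect to ads
summary(fit)                             # F value and its p-value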
You can see in the output that the F value is there and the probability value is there; because the F value is large and the associated probability value is very small, the null hypothesis is rejected. You would also see the other numbers here, the sums of squares and the mean squares. So, this is how we can actually perform an ANOVA test, and with ANOVA we have covered the basic statistics using R for this particular course.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 06
Partitioning
Welcome to the course Business Analytics and Data Mining Modelling Using R. We are now into the second specific topic, the data mining process. Last time we stopped our discussion at partitioning, so let us pick up from there: in the data mining process, another specific step is partitioning.

As we discussed, in statistical modelling we generally use the same sample to build the model and then to check its validity. In the data mining process we instead generally do partitioning, wherein we split the data set into two, three or even more partitions; one of them, usually the largest partition, is used for model building, and the other partitions are used either for fine-tuning the selected model or for model evaluation.
Now, another important point we need to understand is how, among several candidate models, we select our best model. A model can come out on top for two main reasons. The first one is the acceptable reason, the one that we want: genuine superiority of the final model over the other candidate models. It might so happen that the final, selected model is genuinely giving superior performance in comparison to the other candidate models.
The second reason is the problematic part that we want to minimise or remove: a chance occurrence leading to a better match between the final model and the data. It might so happen that you have three or four candidate models m1, m2, m3, m4, and due to some chance occurrence model m3 happens to match the data better and therefore gives superior performance. We need to manage this particular situation, and partitioning is one way to do that. Mainly, data-driven techniques lack structure: they do not impose any specific structure on the data during modelling, so, being data driven, they might end up producing this latter situation of chance occurrence, and their focus on the data might lead to overfitting.
Now, as we said, partitioning the data set into two or three parts can actually solve this particular problem. Typically three partitions are created: the training set, the validation set and the test set.
Again, these partitions are created following predetermined proportions. Typically the partitions follow a 60-20-20 rule: 60 percent of the observations go into the training set, 20 percent go into the validation set and the remaining 20 percent go into the test set. That is the typical proportion that is used.
You can of course change this, but the proportion has to be predetermined, and this predetermined proportion is then used to create the partitions; however, the records are randomly assigned to the different partitions. So, the proportion is predetermined, but the records are randomly assigned. Sometimes the situation might require that the records are assigned based on some relevant variable; in those cases that variable decides which record will go into which particular partition.
Now, let us discuss the role of each of these partitions. The first one is the training partition. Usually this is the largest partition, and this is the sample used to build the candidate models; the different models that you can think of to tackle your classification or prediction task are built on it. The second partition is the validation partition, which is used to evaluate the candidate models, or sometimes also to fine-tune and improve the selected model.

In those situations where we use the validation partition to fine-tune or improve our model, the validation partition also becomes part of model building. Therefore, it might create a bias in the model evaluation if this particular partition is also used for evaluation purposes; in those cases a test partition becomes mandatory to evaluate the final model, and that is the role of the test partition: to evaluate the final model. Now, at this point we need to discuss different types of data sets.
Till now, the partitioning-related discussion we just had is mainly applicable to cross-sectional data; so what different types of data sets are generally used in statistical modelling or data mining modelling? Let us discuss. The first one is cross-sectional data.

Cross-sectional data are observations on variables related to many subjects. The variables could be related to individuals, firms, industries, countries or regions; there can be many variables and many subjects. They are observed at the same point of time, so a kind of snapshot is taken.
(Refer Slide Time: 06:22)
So, let us imagine our data set as a cylindrical pipe. The observations on variables related to many subjects are taken at one cross-section of it: all the observations on the different variables v1, v2, v3, v4 for the different subjects are taken at the same point in time.

This is called cross-sectional data. Now, generally, when we do a cross-sectional analysis, a unit of analysis is also specified. Even though the variables might be about different subjects, individuals or firms, there has to be a unit of analysis, because each of the different observations that we record is going to represent a distinct subject. For example, suppose we have a sample with the different variables along the columns and the different observations along the rows; each observation represents a distinct subject, so if the unit of analysis is the individual, each observation will represent an individual.
If the unit of analysis is the firm, then each observation will represent a firm, even though the variables v1, v2, v3, v4 could be about different aspects of it. Now, the main idea when we do cross-sectional analysis is to compare differences among the subjects. So, when our unit of analysis is the individual, we are trying to compare differences arising among those individuals, and if our unit of analysis is the firm, then we are trying to study and compare differences among firms.
Now, the second type of data is time series data. In time series data we have observations on a variable related to one subject; we do not deal with many subjects, there is just one subject, and observations on a variable related to that subject are taken. The variable is observed over successive, equally spaced points in time, so each observation represents a distinct time period. So, in a time series we have the same subject, let us say one variable related to subject one, and at equally spaced times the observations are made.

So, observations on the one subject are made over successive, equally spaced points in time, each observation representing a distinct time period. Here again the unit of time could differ: it could be days, weeks, months or years, and the observations are recorded at those equally spaced points in time.
Now, the main idea in time series analysis is to examine changes in the subject over time. Another type of data set that you may come across is panel data, sometimes also called longitudinal data. Panel data combines features of cross-sectional data and time series data: observations on variables related to the same subjects are taken over successive, equally spaced points in time.
Now, the main idea in panel data analysis is to compare differences among the subjects and to examine changes in the subjects over time. In a way, panel data can also be understood as cross-sections with a time order. If we go back to our cylindrical tube picture of a data set, we can take one cross-section, then another cross-section, and so on.

All these successive, equally spaced cross-sections can then be used for panel data analysis: you have the same variables, say v1 to v4, in all the cross-sections, and the cross-sections have a definite time order.
So, we are studying the same subjects, with the same variables, across different cross-sections. Now, another type of data set is called pooled cross-sectional data: observations on variables related to subjects at different time periods. You take observations on subjects, but the subjects need not be the same, and the observations are made at different time periods. What is the main idea? The main idea is to examine the impact on the subjects of environmental changes caused by some policy intervention or some event.
For example, a population census is one example of pooled cross-sectional data. In India the population census generally happens every 10 years, say the 2001 census and the 2011 census. The subjects might change between censuses, but the census is happening at different time periods, and with the passage of time we are looking at different characteristics, features or variables related to the population.
So, pooled cross-sectional data can be understood as independent cross-sections from different time periods. For pooled cross-sectional data, one cross-section could be the 2001 data on subjects and another the 2011 data on subjects; the subjects need not be the same, and the two cross-sections are independent. In panel data the cross-sections had a time order; in pooled cross-sectional data they do not, the cross-sections are independent. Now let us discuss the next phase of the data mining process, which is model building.
(Refer Slide Time: 13:47)
So, we will go through this particular phase using an example with linear regression; let us open RStudio.

In the previous lecture we talked about overfitting, so let us revisit the same concept through an example. This is again hypothetical data, about predicting future sales using spending on marketing promotions. I am going to create some hypothetical data: you can see this code is about creating a data frame, with promotions as one variable containing different numbers.
(Refer Slide Time: 14:44)
These numbers are supposed to be in rupees crores, and then we have sales, again in crores. So, we are going to create this particular hypothetical data; let us create it, and you can see the promotions and sales numbers. We have observations on these two variables, so let us look at the summary.
These are some of the statistics on the two variables, promotions and sales. Now let us plot: I am going to plot promotions on the x axis and sales on the y axis; you can see the x-axis label and y-axis label given there, and then the limits on the x axis and y axis. As we discussed in the previous lecture, the limits can be set using the results from the summary command: the minimum of promotions is 2 and the maximum is 9, so most of the values will lie in the range 0 to 10, and that is why the x limit is set as 0 to 10, and similarly for the y limit.
So, we can plot this. In the plot there is a particular symbol that you might notice: it was supposed to be the symbol for rupees, but maybe that symbol is not supported on this system, so it is coming out garbled. But let us look at the data: this is sales versus promotions.
Looking at this data, we can try to fit different models which can help us understand the relationship between sales and promotions. But because we are doing a business analytics course, our idea is not just to understand the relationship, but to understand it and use it in a fashion that improves our predictions; we want to be able to make predictions.
(Refer Slide Time: 16:41)
So, suppose we try to fit a complex model, let us say a cubic curve; you can see in the plot the cubic curve that we have tried to fit here. If you look at this curve, this complex function fitted over the data is leading to a perfect, 100 percent match, and in a way it is causing overfitting of the model. You can see how hard it is to imagine that when you increase spending on promotions from 4 crores to 5 crores the sales would actually drop; that kind of relationship is difficult to imagine. This is just an example of how a complex model fitted to data can lead to overfitting. A better model could be this straight line: you would see most of the observations lie close to this line, and this line could be the better model for these observations and this sample.
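A hedged sketch of this illustration with made-up promotions and sales figures (in Rs. crores); the exact numbers used in the lecture are not shown, so these values are only illustrative:

promotions <- c(2, 3, 4, 5, 6, 7, 8, 9)
sales <- c(20, 26, 35, 30, 42, 48, 55, 60)
plot(promotions, sales, xlab = "Promotions (Rs. crores)", ylab = "Sales (Rs. crores)",
     xlim = c(0, 10), ylim = c(0, 70))
cubic_fit <- lm(sales ~ poly(promotions, 3))     # flexible cubic fit chases the points
lines(promotions, predict(cubic_fit), lty = 2)
linear_fit <- lm(sales ~ promotions)             # a simple line captures the overall trend
abline(linear_fit)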
Now, let us go through our model-building example. We are going to use a hypothetical used-car data set; it is loosely based on online postings related to used-car sales, but it is mainly hypothetical data. So, let us load the required library and this particular file.
(Refer Slide Time: 18:45)
Once the used-car Excel file is read, you would see that a data frame df has been created with 79 observations, and right now it is showing 12 variables.

But two of those are actually deleted columns in the Excel sheet that have been picked up by R as variables, even though there is no data in them. For that we have this particular line of code, which will remove these deleted columns. Once it runs, you can see in the environment section that we have 79 observations with 10 variables; the 2 removed variables were the deleted columns in the Excel file.
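A minimal sketch of these two steps; the lecture does not name the package or the file, so the readxl package and the file name "usedcars.xlsx" here are assumptions:

library(readxl)                          # assumed package for reading the Excel file
df <- read_excel("usedcars.xlsx")        # 79 rows; empty deleted columns may be picked up
df <- df[, colSums(!is.na(df)) > 0]      # drop all-NA columns left over from Excel
dim(df)                                  # should now show 79 rows and 10 variables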
(Refer Slide Time: 19:41)
Let us look at the variable names. These are the variables: the brand of the car, the model of the car, the manufacturing year of the car, and the fuel type, whether it is petrol, diesel or CNG. Then we have SR price, which is the showroom price of the car, and the kilometres accumulated, another attribute of the car; then the price, that is, the offered price for the used car; then another variable on transmission, whether it is automatic or manual; then a variable on owners, the number of owners who have had ownership of the car over its lifetime; and then airbags, the number of airbags in the car.

You would see that many of these variables are directly relevant for predicting the offered price of a used car. So, the task we are trying to perform here is prediction of the offered price for used cars using these variables: the accumulated kilometres, the age of the car (which can be computed from the manufacturing year), the transmission type, the number of owners, and so on. These are some of the variables which could be relevant for our prediction task related to the offered price.
(Refer Slide Time: 21:25)
So, let us look at the first 9 records. You can see different used cars from different brands, Hyundai, Mahindra, Maruti Suzuki, Honda; the model names are also available, then the manufacturing year, then the fuel type. Let us go back to our Excel data set and look at some of the dummy coding that has been done there.
(Refer Slide Time: 22:00)
You would see that fuel type has been coded as 0 for petrol, 1 for diesel and 2 for CNG; then, for transmission, 0 means manual and 1 means automatic. Let us go back.
Again, during our discussion on outliers we talked about how sorting can also help in outlier detection: if there is some value of some variable, some measurement, which looks out of place and does not seem to be real, it can be found by sorting. So, let us do that; we have picked these three important variables, kilometres, showroom price and manufacturing year. In this case everything looks fine, but sometimes some value might look out of place, which could be due to a typing error or something else.
(Refer Slide Time: 23:16)
Such a value can be easily identified that way and can then be handled. Then, as we discussed, age could be another important variable in terms of predicting the value of a used car.

So, let us compute this particular variable: if the current year is, say, 2017, then the age is 2017 minus the manufacturing year.

Once we have computed the age and added it to the data frame, and since we are no longer interested in the first three variables, the brand name, the model name and the manufacturing year, we can create another data frame by removing these three columns. So, these are the variables of interest: fuel type, SR price (showroom price), kilometres, price, transmission, owners, airbags and age.
Now, another important concept related to model building is seeding. Generally, many analysts prefer to do seeding because it provides some control over the randomisation: if you want to use the same partitions in your second or third run, seeding helps you duplicate the same random partitions. In R we have the function set.seed for this; in this case we have given set.seed the value 12345.

This could be any number that you like, and a seed would be set; next time, when you want to create the same partitions, this seed will help you duplicate them. So, let us execute this.
Now, let us move to partitioning. Here you would see that we are using the sample function available in R. This sample function can be used to randomly draw different observations and create an index which can then be used to create the different partitions. In this particular case we want to create just two partitions of 50 percent each, a training partition and a test partition.

So, in this case it is 50-50. You would see that in the first argument we have given the range of values we want to sample from, one to the number of rows, which can be computed using the nrow command with the data frame passed as the argument. The second argument is 0.5 multiplied by that same size, meaning we want 50 percent of the observations in one sample and the rest in the second sample, and you would see that the third argument, replace, is assigned as FALSE.

So, this is sampling without replacement, because we are doing partitioning: we do not want the same observation to appear again in the validation or test partition. We want some of the observations randomly picked and assigned to the training partition, and the other observations assigned to the other partitions, depending on the number of partitions.
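A minimal sketch of the seeding and partitioning steps described above (df2 is the cleaned used-car data frame from the earlier sketch; floor() is used here so that the 50 percent size is a whole number):

set.seed(12345)                                               # makes the random split reproducible
part_idx <- sample(1:nrow(df2), floor(0.5 * nrow(df2)), replace = FALSE)
df1_train <- df2[part_idx, ]     # about 39 observations for model building
df1_test  <- df2[-part_idx, ]    # remaining about 40 observations for evaluation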
So, let us execute this particular command and look at the data section.

You would see that an integer vector has been created; these values are actually indices of different observations. These indices can now be used to assign different observations to different partitions. For example, we can use this particular index on the data frame to assign the observations: you would see that the integer vector part idx has 39 entries, so the training partition ends up with 39 observations, and df1 train has been created with 39 observations and 8 variables. The remaining observations go to the test partition, and you would see that 40 observations have been assigned to the test partition.
Now, once these partitions have been created, we can do our modelling. lm is the function that is used for linear regression in R, and its first argument is the formula. In the formula you first put your outcome variable, the output or dependent variable which you want to predict; in this case it is price, the offered price of the car. Then we use the tilde, which is how the formula is written, and then a dot, which means all the other variables present in the data frame will be picked up as the input variables, the independent variables, and used as predictors for model building.

You would see that generally the dollar notation is used with a data frame, but in this case, because of the way lm is implemented, you can mention the name of the data frame in another argument and just the names of the variables in the formula; the rest is taken care of within the lm function. So, let us execute this line and build our linear regression model.
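A sketch of the call; the outcome column is assumed to be named Price here:

mod <- lm(Price ~ ., data = df1_train)   # Price regressed on all other variables in the training partition
summary(mod)                             # coefficients, significance codes, R-squared, F-statistic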
You would see again, in the values part of the environment section, that mod has been created, a list of 12.
(Refer Slide Time: 29:23)
Let us look at the summary. In the summary you get the results of your regression analysis: you can see the formula, and you can see some statistics related to the residuals, the min and max values, the median and so on.

Now, let us focus on the coefficients part. You would see the different predictors, fuel type, SR price, KM, transmission, with their estimates, and you would also see that p-values are given; you would also see in the results that significance codes are given.
For example, three stars are used for significance at better than the 0.1 percent level (more than 99.9 percent confidence), two stars for the 1 percent level (99 percent confidence), one star for the 5 percent level (95 percent confidence), and a dot for the 10 percent level; these are the notations. In this case you would see stars against the constant term, the fuel type and the showroom price, and a dot against age, so these four variables stand out. So, if we look at the main variables, excluding the constant term, fuel type, SR price and the age of the car are the main variables which are helping us in determining the offered price.
Other statistics related to this regression model are also given; for example, the multiple R-squared, which seems to be close to 61 percent, which is good enough. You would also see from the F-statistic that this particular model is significant, so we can go ahead and interpret the results.
(Refer Slide Time: 31:28)
So, now let us look at how this model is going to perform on the other partition. First, the fitted values: mod$fitted.values is one of the components returned by the lm function, and we can compute the residuals using these fitted values. More discussion on regression analysis will come in a later lecture; in this particular example we are just going through the data mining modelling process. So, let us compute the residuals and look at the actual value, the predicted value and the error.
You can now see these numbers: the actual value, the predicted value, and the error, which is the difference of the two.

Now, a similar thing can be done on the test partition. We have the predict function, which can help us in scoring the test partition: in the predict function we first pass the model, which is mod in this case, and then the test partition as the second argument, so that we can do the scoring of the test partition. So, let us execute this line.
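A sketch of these two steps (the Price column name is again an assumption):

train_pred <- mod$fitted.values
train_err  <- df1_train$Price - train_pred               # residuals on the training partition
head(data.frame(actual = df1_train$Price, predicted = train_pred, error = train_err))

test_pred <- predict(mod, df1_test)                      # score the test partition
test_err  <- df1_test$Price - test_pred
head(data.frame(actual = df1_test$Price, predicted = test_pred, error = test_err))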
We get the numbers; let us again compute the residuals for the test partition and look at them, and we see a similar kind of output. Now, there is another library that we need to load to see some of the metrics for evaluating the performance of the model: rminer is that particular library. So, let us install and load the rminer library to be able to use some of these metrics.
(Refer Slide Time: 34:43)
So, let us load the rminer library, so that we have access to some of the metrics for our performance evaluation.
(Refer Slide Time: 36:16)
Now, this is the mmetric function from rminer. The first argument we pass is the price, the actual value; the second argument is the fitted (predicted) value from the model on the training partition; and then we ask it to compute these three metrics, SSE, RMSE and ME. More discussion on metrics will come in a later lecture. So, let us execute this particular code: these are the numbers, and you can see the SSE, RMSE and ME values there. Similarly, we can compute them for the test partition, and again you would see the numbers there.
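A sketch of those calls, continuing the earlier assumptions about variable names:

library(rminer)
mmetric(df1_train$Price, train_pred, c("SSE", "RMSE", "ME"))   # metrics on the training partition
mmetric(df1_test$Price,  test_pred,  c("SSE", "RMSE", "ME"))   # metrics on the test partition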
So, this is how the numbers for the training partition and the test partition can be compared and the performance of the model assessed, to see how well it is doing. We will do more discussion on this when we come to our regression analysis lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 07
Visualization Techniques- Part I
Welcome to the course Business Analytics and Data Mining Modelling Using R. We completed our first module, which was a general overview of data mining. Now we are moving into our second module, which is about data exploration and conditioning, and data preparation. The first lecture is going to be on visualization techniques, so let us start. You might have come across the popular proverb that a picture is worth a thousand words.

It is also a well-known fact that the visual processing capacity of the human brain is much higher than its numerical or mathematical processing capacity. That is the main underlying reason for the importance of visualization techniques in any modelling process, including data mining modelling and statistical modelling. Therefore, if we as humans, as domain knowledge experts, as analysts or as data scientists get to look at the data, to understand some graphs, to see some plots, we are able to exploit our domain knowledge and our expertise in a much improved fashion. That being the basis, we are going to start our discussion on visualization techniques.
Generally, visualization techniques have their primary role in the data mining process during the data exploration and conditioning phase; we have already talked about the different phases in the previous lecture on the data mining process. The primary role of visualization techniques in this phase can be summed up in these points. First, we try to understand the structure of the data that is available; that is one goal pursued using visualization techniques.
The second one is identifying gaps or erroneous values: sometimes there might be a few duplicate rows, some of the values might be missing, or some of the values might look out of place or erroneous; some cells might not have any values at all. Identifying those gaps is also part of what visualization techniques do.
Identifying outliers: some of the values may be far away from the mean or median values, where the major chunk of the values lies. Identifying those values, and determining whether they are valid points or erroneous values, also needs to be done so that we can move ahead with further analysis.
Then, finding patterns: as we said, the visual processing of the human brain is much better, so if we get to see the data, the plots and the graphs, we can easily spot some patterns, which in turn can help us identify appropriate data mining or statistical techniques and then use them in our modelling process. So, these are some of the roles where visualization techniques can be useful.
Now, building on those points, a few other things: missing values, which we already talked about, and identifying duplicate rows and columns. That is also important: sometimes some rows or columns could be duplicated, and we would like to avoid that, because many statistical techniques assume that cases should be independent, in which case duplicate rows could be a problem; similarly, duplicate columns would be a problem in statistical modelling techniques where multicollinearity could be an issue.

These are specific terms that I just mentioned; we will discuss them in more detail when we come to statistical techniques like regression and logistic regression. Another important role played by visualization techniques is in variable selection, transformation and derivation. Sometimes, when we apply visualization techniques to a dataset, we are able to identify some of the variables which could be useful for the data mining task, some of the variables which could be transformed to suit our goal in a much better fashion, and we also get some ideas and directions about new variable derivations.
All these kinds of things are possible through visualization techniques. Some examples
are given here: for instance, choosing appropriate bin sizes for converting a continuous
variable into a categorical variable. When we look at the data through some of the graphs
we are going to cover in this lecture, we will get an idea of what the bin sizes should be
for a continuous variable to be converted into a categorical variable.
Another example is combining categories. Sometimes a categorical variable has many
categories, and not all of them may be useful for the specific task at hand. It may then be
required, or even demanded by the data, that some of the groups be combined, so the
number of categories is reduced and only the meaningful groups are kept for the task,
mainly classification. Another important role is assessing the usefulness of variables and
metrics. While exploring the data using visualization techniques, we will also be able to
understand which variables are important and which metrics are going to be used for
performance evaluation and so on.
Now, this data exploration and conditioning phase is considered a required first step
before formal analysis. By formal analysis we mean the data mining and statistical
techniques such as regression, trees, artificial neural networks and discriminant analysis.
Before we go ahead with that formal analysis, this particular step is more or less
mandatory: we apply some of the visualization techniques to the data and do some
preliminary processing and preliminary analysis.
Now, let us understand the role of visual analysis a bit more. It can be considered
free-form data exploration. Regression analysis is a very structured kind of analysis, but
in visual analysis we are mainly exploring, and that too in a free form. We try many plots
and graphs, which we are going to cover later in the lecture, and try to learn something
about the data which is going to help us in our further analysis.
(Refer Slide Time: 07:23)
As mentioned in the second point, the main idea is to support the data mining goal and
the subsequent formal analysis that is going to take place. Techniques in visual analysis
range from basic plots, which will cover line graphs, bar plots and scatter plots, to
interactive visualizations. Interactive visualizations address the multivariate nature of
data sets; we will discuss later that the kind of modelling typically required is
multivariate in nature.
Therefore, some of the advanced plots or interactive visualizations can be really helpful
for formal analysis. The usage of visualization techniques also depends on the kind of
task we have: some charts and plots are more suitable for classification, some for
prediction and some for clustering. So, the data mining task will also drive the kind of
visualization techniques we are going to apply.
Different data mining techniques, such as CART, which is classification and regression
tree modelling, and HAC, which is hierarchical agglomerative clustering, also have their
own specific visualization techniques, their own charts and graphs. That is also important
to understand here: we are not going to apply everything that we learn to every technique
that we follow in the subsequent formal analysis. It is going to be task specific, whether
classification, prediction or clustering, and sometimes it is also going to be specific to a
particular technique.
Now, let us start our discussion with the basic charts. As I said, we are going to discuss
three important charts or graphs: the first being line charts or line graphs, the second bar
charts, and the third scatter plots. So, let us have a basic discussion on these charts.
(Refer Slide Time: 10:17)
Generally, these basic charts display one or two variables at a time. They are
two-dimensional graphics: we pick two variables, one goes on the x axis and the other on
the y axis. The main idea is to understand the structure of the data, the variable types and
the missing values in the data set. These are the points where basic charts are going to be
useful.
For supervised learning methods, the main focus of basic charts is generally the outcome
variable, which is typically plotted on the y axis. Basic charts can also be used for
unsupervised learning methods; we will see that through examples using R. So, let us
move to our next discussion on line charts.
(Refer Slide Time: 11:38)
Line charts are used mainly to display time series data. We try to see the overall level
and the changes that happen in the data over time. Let us learn line charts through an
example; let us open R Studio. First, let us load the library xlsx. The data set that we are
going to use is bicycle ridership; let us understand this particular data set.
If you look at the actual data in this Excel file, you would see that it starts from January
2004 and goes up to March 2017, giving 159 data points. The second column, the second
variable, is riders, which is the number of individuals riding bicycles. This is mainly
meant to reflect bicycle ridership on the IIT Roorkee campus, but it is mainly
hypothetical data.
I have created this hypothetical data for demonstration purposes. Let us import this
particular data set as we have been doing in previous lectures. You can see in the
environment section that the data set has been imported; there are 159 observations and 2
variables. If you want to see the data in the R Studio environment, you can see that
month-year is the first variable and riders the second. Because this is time series data, the
data is mainly about the riders: it displays the number of riders in a particular month.
The second line of code deals with the issue we talked about in the previous lecture: if
there are any deleted columns in the Excel file, they would still be picked up in the R
environment as empty columns, so we want to get rid of them. This apply function call is
going to help with that. Now, let us have a look at the first 6 observations; for the
different months, Jan, Feb, March and so on, we can see the number of riders.
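A minimal sketch of these import steps is given below; the file name, sheet index and the column name Riders are assumptions for illustration.

library(xlsx)
df <- read.xlsx("bicycle_ridership.xlsx", sheetIndex = 1)   # import the Excel sheet
df <- df[, !apply(is.na(df), 2, all)]                        # drop empty columns left over from deleted Excel columns
head(df)                                                     # first 6 observations: month-year and riders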
Before generating a line graph, we need to create a time series vector here. ts is the
function for that. If you are interested in understanding more about a particular function
in R, you can do so using the help section; search for ts and you can see that ts is the
function for time series objects, and you would get detailed usage of its different
arguments there.
Now, let us go back to our code. In the ts function the first argument is the data; we are
passing the riders variable from the data frame, because we want to create a time series
object out of the number of riders for every month. The start is 2004 and 1, where 1 is for
the first month of the year, January, and the end of the time series is 2017 and 3, that is
March, the third month. The frequency is 12, mainly because the data is monthly.
In a year we have 12 data points, so the frequency has been mentioned as 12. Let us
create this time series vector; it has been created, and you can see in the environment
that tsv now holds a time series of 159 values. Now, if we plot this time series, that is
going to be our line graph. plot is the command we used previously as well. In this case
we pass just one variable, tsv; riders will go on the y axis, as you can see from the y label
that has been given, and the x axis is mainly going to be used for the time scale.
In this particular code, the time scale is determined by the plot function by default; the
default settings are applied for the time scale. You would also see another argument, las,
which controls the styling of the axis labels, that is, how the axis labels are going to be
displayed. More information on this argument can be found in the help; you can type
plot in the help section, and some of the arguments are documented under the par
command, which sets different parameters for graphical settings. You would see las
there, with the different styles of axis labels. I have set las equal to 2, which means I
want my axis labels to always be perpendicular to the axis.
So, the labelling is always going to be perpendicular to the axis. Let us see how it is
displayed.
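A minimal sketch of the time series creation and the first line plot, assuming the riders column is named Riders:

tsv <- ts(df$Riders, start = c(2004, 1), end = c(2017, 3), frequency = 12)  # monthly series, Jan 2004 to Mar 2017
plot(tsv, ylab = "Riders", las = 2)                                         # line graph; las = 2 rotates the axis labels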
(Refer Slide Time: 18:30)
This is the plot. You would see that years are displayed on the x axis, with tick marks for
the different years picked up by the plotting function, while riders was defined by us as
part of the time series vector that we created. You would also see that the labels for these
axes are perpendicular to the axes themselves.
Looking at the plot, you would see this is a line graph: the ridership values for the
different points in time are there, and these points have been connected, which is what
makes it a line graph. This can help us understand the overall level of the data, that is,
the level of values being taken. A line passing somewhere through the middle of these
points can be considered the main level of this graphic, and then you can also see the
changes over time. This particular graph looks polynomial, so the changes appear
polynomial in nature.
(Refer Slide Time: 19:57)
Now, if you want to improve this particular graph, that is also possible. The first thing
you need to do is create a sequence that can be used for the labels and tick marks of the
axis, specifically the x axis, which is for time. What I am doing here is creating this
particular vector using the seq function.
The seq function will create values, in this case a date sequence starting from 2004 and
going up to 2017 with equally spaced points, the difference being 2 years. Let us create
it. Then another function that can be used for formatting a vector is format. For example,
the sequence I have just created and stored in at1 can be formatted; the format function
allows you to retain the particular pieces of information you want in your customized
format. In this case I am using %b and %Y, which stand for month and year.
So, I am leaving out the day information and keeping the month and year information
using this format function. Let us execute this code and you would see labels1 being
created. In the environment section you would see labels like Jan 2004, Jan 2006, Jan
2008, and so on. Now, if we want only the year part, we can again use the format function
and extract only the year-related information from the at1 vector; that is also possible, so
let us do that.
Another important aspect of creating plots in R is the margins of the plot. There is the
par command, which stands for parameters, that we can use for this. One of the
parameters available in par is mar, for margin, which specifies the margins on the four
sides of the plot: the bottom, the left side, the top and the right side. For all these four
sides, the margin can be defined using the par function.
Let us execute this line. You would see the default setting for the margins is 5.1, 4.1, 4.1,
2.1; these numbers represent the number of text lines of margin by default when we
create a plot. We want to change this because we want to change the axis, and a lot of
space is taken by the way we are labelling the axis, perpendicular to it. A lot of space is
required in this case, so a larger margin is required.
So, we need to change the margin. The first margin value is for the bottom side; you
would see it is the highest, 8, because we want more margin there. Then we want 4 on
the left side, 4 on the top side and just 2 on the right side, with an additional 0.1 added to
each. Once this margin has been set, we can go ahead and generate our new plot. This
new plot uses the same time series vector that we created, but the graphic is going to
look slightly different and much better.
In this case we want to create a new plot with new axes. You have to use the parameters
xaxt and yaxt in the plot command; I have assigned them the value "n", which means the
x axis and y axis will not be plotted in the graph. They disappear, and I have also kept
the labels empty. So, there are not going to be any labels; just the line graph is displayed
without any x axis, y axis or axis labels, and you would see a box with the graph inside it
displayed here.
Now, we have to create the axes ourselves. axis is the function available in R that can be
used for this. axis with first argument 1 creates the x axis, and the next line, axis with
first argument 2, creates the y axis. You would see that the at argument gives the tick
mark locations, for which we use at2, and the labels argument gives the labels, for which
we use labels1; at those tick marks the labels in labels1 are displayed, and their styling is
again las equal to 2, so they are perpendicular to the axis.
Let us do this, and you would see an x axis has been created which is slightly different
from the previous plot. In the previous plot only the year was displayed; now both month
and year are displayed, for example Jan 2004. Why are we doing this? The main reason
is the kind of data we have: it is monthly ridership data, so it is more appropriate to
generate a plot which also shows the month, not just the year. So, month and year
information is now depicted in the plot. Similarly, we can create the y axis.
We did not particularly want any changes in the y axis, so it has been displayed as is.
Now, if you want to label the axes, there is another command, mtext, that can be used to
label the x axis and y axis. In mtext, the first thing you have to select is the axis: side
equal to 1 means the x axis, and if you want the y axis it has to be side 2. Then there is
the text argument, where you give the label of the axis, and a line argument, which tells
how many lines below or away from the axis the label is to be placed. Let us execute this
code, and you would see that the month-year label has been created, and similarly for the
y axis the riders label has been created.
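The improved plot can be sketched roughly as below; the margin values follow what was described above, while the exact label text and sequence endpoints are assumptions.

at1 <- seq(as.Date("2004-01-01"), as.Date("2017-01-01"), by = "2 years")  # tick dates, 2 years apart
labels1 <- format(at1, "%b %Y")                   # "Jan 2004", "Jan 2006", ...
at2 <- as.numeric(format(at1, "%Y"))              # tick positions on the time axis (in years)
par(mar = c(8.1, 4.1, 4.1, 2.1))                  # larger bottom margin for the rotated labels
plot(tsv, xaxt = "n", yaxt = "n", xlab = "", ylab = "")   # line graph drawn without axes
axis(1, at = at2, labels = labels1, las = 2)      # custom x axis with month-year labels
axis(2, las = 2)                                  # y axis as is, labels perpendicular
mtext(side = 1, text = "Month-Year", line = 6)    # x axis title below the rotated labels
mtext(side = 2, text = "Riders", line = 3)        # y axis title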
(Refer Slide Time: 28:00)
In the zoomed version of this graph, you can see the month-year and riders labels. This is
a much better plot than the previous one: there we just had the year information on the
time scale, now we have both month and year information.
The next function we can learn about is graphics.off. If you call this function, all the
plots will disappear from your R Studio environment. There is another command,
dev.off; if you run that, only the current plot, the one being displayed, is deleted or
erased. You can achieve the same effect using these two buttons here in the plot section.
We want to get rid of all these plots, so let us run this, and you would see everything
disappears. Now, you can check the margins again: once all the devices are closed, the
par settings, and many settings related to that function, are reset. We can check this for
the margins, because we had changed them, and you can see that the default numbers are
set again.
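In code, assuming the steps above, this is roughly:

graphics.off()   # close all graphics devices; all plots disappear
# dev.off()      # alternatively, close only the current device / current plot
par("mar")       # back to the default margins c(5.1, 4.1, 4.1, 2.1)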
This was our discussion on line charts and line graphs. More discussion on line graphs
and how to use them further will be covered in the time series forecasting lectures, where
we will learn how the line chart can be used before the formal analysis of time series
forecasting happens. Let us move to our next basic chart, the bar chart. The main
usefulness of the bar chart is in comparing groups using a single statistic; we will see
how that is done. Generally, the x axis is reserved for a categorical variable, and we try
to understand more about that variable using the y axis. Let us do this through an
example.
To understand the bar chart and the other charts, we are going to use a data set we are
already familiar with, the used cars data set. Let us load this file; you can see it has been
loaded here, with 79 observations and 11 variables. Let us run the command to get rid of
any deleted columns; in this case there were none.
Let us look at the data as well and go back to the original Excel file. We are already
familiar with this data set, but let us have another look.
(Refer Slide Time: 31:21)
The data set is about used cars, so for each used car we have information such as brand,
model, manufacturing year, fuel type, showroom price, kilometres accumulated, the
offered price, the transmission (whether it is manual or automatic), the number of
owners, the number of air bags, and then another variable, C_Price, which has been
created manually.
This variable is 0 if the offered sale price is less than 4 lakhs and 1 otherwise. So, these
are the variables in this particular data set. From this data set you can see that one of the
variables is the manufacturing year.
From the manufacturing year, taking the current year as 2017, we can calculate the age of
the vehicle. We can do that with this code: we subtract the manufacturing year from 2017
to create age, and once this is done we can use the cbind command to add this column to
the data frame.
Now, let us have another look at the data set. You would see that age has been added
there, and you would also see that some of the variables, like brand, model and
manufacturing year, might not be required now, so we will get rid of them.
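A minimal sketch of these steps, assuming the data frame is df and the columns are named Mfg_Year, Brand and Model:

Age <- 2017 - df$Mfg_Year                                        # vehicle age from manufacturing year
df <- cbind(df, Age)                                             # append the new Age column
df1 <- df[, !(names(df) %in% c("Brand", "Model", "Mfg_Year"))]   # drop variables no longer needed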
(Refer Slide Time: 33:04)
Now, these are the variables that we are interested in. Let us stop here; we will continue
our discussion in the next part.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 08
Visualization Techniques- Part II
Welcome to the course Business Analytics & Data Mining Modelling Using R. In the
previous lecture we talked about visualization techniques. We will continue our
discussion from the point where we left off in that lecture. We were discussing bar plots,
and in the code we had imported the used cars data set; it remains imported in the data
frame df1.
We had also created the new variable age, using the manufacturing year available in the
data set, and appended it to the data frame. Then we had eliminated some of the variables
which were not useful for our purpose.
After that we were left with this data frame. Now we have 9 variables: you can see fuel
type, showroom price, kilometres, price, transmission, owners, air bags, C_Price and
age.
Another useful function available in R is str; we have discussed this function before as
well. It helps us understand the structure of the data set and its different variables. You
can see that fuel type is a factor variable with 3 levels, the levels being CNG, diesel and
petrol.
Another important point I would like to highlight here is about factor or categorical
variables created in R: the levels are kept in alphabetical order, which is why CNG is
displayed first. This has an impact on many functions available in R, wherein CNG
would be taken as the default or reference category.
When we start one of the formal analyses, say regression, we will come across some of
these important peculiarities of R and of different R functions, specifically with respect
to factors.
For the other variables, you can see that showroom price, kilometres, price, transmission,
owners and airbags have all been shown as numerical variables. But look closely at the
variables transmission and C_Price.
They are actually categorical in nature, because transmission can take only 2 values: 0
for manual and 1 for automatic. This is important, so we need to convert this numeric
variable into a factor variable.
Similarly, C_Price was created by us manually, as discussed in the previous lecture: 1
was assigned for a price equal to or more than 4 lakhs, and 0 was assigned for cars
having a price less than 4 lakhs.
Therefore, only 2 values are possible, 0 and 1. This variable is also categorical, so we
need to convert C_Price into a factor variable as well. Let us do this. As we discussed in
the supplementary lectures, as.factor is the function that can be used to coerce a numeric
variable into a factor or categorical variable. Let us execute this line, first for
transmission, and then let us also run it for C_Price.
These 2 variables have been converted into factor variables. Let us look at the output of
the str function again. Now you would see that transmission is a factor with 2 levels, 0
and 1; the variable has been converted into a factor variable. You would also see that
C_Price has been converted into a factor variable with 2 levels, 0 and 1.
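A rough sketch of this conversion, assuming the data frame and column names used above:

df1$Transmission <- as.factor(df1$Transmission)   # coerce the 0/1 numeric codes into a factor
df1$C_Price <- as.factor(df1$C_Price)
str(df1)       # both now show as factors with 2 levels, "0" and "1"
summary(df1)   # factors are summarised as counts per level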
With this, most of the variables in our imported used cars data set are now stored in their
suitable variable types.
Now, let us look at the summary results. In the summary you can see that for a
categorical variable such as fuel type, counts have been displayed for the different
categories: there are only 3 records having CNG as the fuel type, 52 records having
diesel, and 24 records having petrol.
Similarly, for transmission, 63 records have transmission 0, that is manual, and 16
records have transmission 1, that is automatic.
(Refer Slide Time: 05:29)
Similarly, for the C_Price variable, now also a factor or categorical variable, there are 48
cars having a price value less than 4 lakh rupees (category 0), and 31 used cars having a
price value equal to or more than 4 lakh rupees (category 1).
The other variables are numerical in nature, so the usual descriptive statistics have been
displayed for them, for example mean, median and max, which we already understand.
So, let us move on to some of the basic plots.
The 2 basic plots that we want to cover are the bar chart and the scatter plot. Let us first
start with the scatter plot. Let us go back to our slides and understand some key things
about scatter plots.
Generally, scatter plots are mainly useful for prediction tasks. When we say prediction
task, the way a scatter plot is useful is by focusing on finding meaningful relationships
between numerical variables.
A scatter plot is mainly for numerical variables; both axes, x and y, are used for
numerical variables, and for a prediction task the focus is on identifying meaningful
relationships from the plot. For unsupervised learning tasks such as clustering, the focus
is on finding information overlap.
Why this is useful we will learn more about when we come to the unsupervised learning
lectures and start our discussion on clustering, but the focus there is on finding
information overlap between different variables, and the scatter plot can be useful for
that.
As said, in the scatter plot both axes are used for numeric variables; in the bar chart the x
axis is used for a categorical variable.
Because the variable on the x axis is categorical, it defines different groups. There are
going to be different categories, and for those groups a statistic can be displayed on the y
axis, which can help us understand the differences between groups and compare them.
So, let us go back to R Studio and start with the scatter plot.
(Refer Slide Time: 08:10)
The first plot that we are going to generate is between kilometres (KM) and price. Before
plotting, we need to specify limits on the x axis and y axis, so that our plot looks clearer
and we get a better picture of the data. Let us first look at the range of these variables;
the range values of kilometres and price are going to help us in determining the limits.
In the plot function you can see that for kilometres the range is between 19 and 167, and
the x limit that I have specified is 18 to 180.
All the values are going to lie within this range. Similarly, on the y axis the limit that I
have specified is 1 to 75, and the price values range from 1.15 to 72, so these values are
also going to lie within this range.
Kilometres is on the x axis and price is on the y axis; as we discussed before, price is the
outcome variable of interest in this data set and is therefore displayed on the y axis.
Labels for the x axis and y axis have been given appropriately using the xlab and ylab
arguments. We can run this code and you would see the graph has been drawn.
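A minimal sketch of this scatter plot, assuming the columns are named KM and Price:

range(df1$KM); range(df1$Price)          # check the ranges before fixing the limits
plot(df1$KM, df1$Price, xlim = c(18, 180), ylim = c(1, 75),
     xlab = "KM", ylab = "Price")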
Now, if we zoom into this graph, you would see that there is one extreme outlier. For this
point the kilometres accumulated are far less than 25,000, but the offered price is much
higher, more than 70 lakhs.
If you look at the other values, the majority are lying within the 0 to 20 lakhs range; this
is the only outlier. From this we can understand that most of the used cars in the data set
lie in a smaller price range.
Therefore, it would not be appropriate to study this extreme outlier along with these
points; we have to restrict our analysis to that smaller range. We can eliminate this
particular point and focus on the major chunk of points, mainly lying between 0 and 20
in terms of price.
You would also see in this plot some points lying closer to the x axis but far away from
the 0 value and from the majority of the values. They have more kilometres accumulated,
but the price offered for them is also in the same 0 to 20 range.
Let us go back. First we are trying to identify that particular outlier point; you would see
I have used the condition price greater than 70, because in the graph the price looks like
more than 70 lakhs. Let us run this code.
You would see this is observation number 23, having fuel type diesel, showroom price
116 lakhs and offered price 72 lakhs. These are high numbers in comparison to the
majority of the other observations.
We can get rid of this observation because it can be considered a very distinct group on
its own. Let us take a backup of the previous data frame and then eliminate this point.
This is how we can eliminate the point: in the data frame we use the square brackets and
specify the row index as minus 23; the minus gives the instruction that this row is to be
removed, and the result is stored back in the same data frame.
Let us execute this; that observation is now gone. Let us again have a look at the range,
the max and min values, for both variables, kilometres and price, and let us re-plot the
graphic.
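A rough sketch of these steps, with the same assumed column names:

which(df1$Price > 70)        # index of the extreme point; observation 23 in this data
df1.bak <- df1               # keep a backup before dropping the row
df1 <- df1[-23, ]            # remove the outlier row
range(df1$KM); range(df1$Price)
plot(df1$KM, df1$Price, xlim = c(18, 180), ylim = c(1, 15),
     xlab = "KM", ylab = "Price")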
You would see that for kilometres there is not much change, but for price there is: earlier
the values were ranging from 1.15 to 72.
(Refer Slide Time: 13:26)
Now it is ranging from 1.15 to 13.55, so even less than 15. The price is now less than 15
lakhs. For kilometres, the new range is 27.5 to 167; earlier the minimum was 19, so the
kilometre range has changed specifically in the minimum value. Now let us plot this; the
x limit and y limit values have been modified appropriately, you would see 18 to 180 and
1 to 15 instead of the 75 in the earlier case. Let us plot this; this is the new plot that we
have.
Now, you would see that the graphic covers most of the points in a clear fashion. If we
try to understand the relationship between these 2 numerical variables, kilometres and
price, you would see not much of a trend: we could draw a roughly constant line at a
price value of around 4 lakhs.
It seems that if we fit these data points into a linear model, kilometres would not be an
important factor; the price being offered is largely irrespective of the kilometres. That is
the sense we get from the data, which could be represented by a horizontal line.
Therefore, kilometres is not such a crucial variable in our analysis, specifically for the
prediction task related to price.
These are some of the insights that we can get from basic plots. For example, from the
relationship between price and kilometres we can see that kilometres (KM) might not be
such a useful indicator of the offered price; at least, this is what we gather from this
particular data set.
Now let us move to the next basic chart, the bar chart. The bar chart we want to plot is
between price and transmission. Transmission is the factor variable we have already
created, and we can compute the average price for different groups. We get 2 groups
based on the transmission value: 0, which is manual, and 1, which is automatic.
The average price for these 2 groups, manual and automatic, can be computed using this
line of code. You can see that I am computing the mean of a subset of values. which is
another function; more information on it can be found in the help section, but to give
you a sense, it finds the indices of the observations where the transmission value is equal
to 0.
Those indices are returned, and the corresponding observations are retrieved, that is, a
subset is created, which is then passed to the mean function to calculate its mean. The
dollar notation indicates that the mean is to be computed only for one variable, price.
Similarly, the same thing can be done for the other group. Let us compute the average
price; you would see that a numerical variable for average price has been created with
just 2 values, corresponding to the 2 groups: group 0, that is manual, and group 1, that is
automatic.
These 2 mean values have been computed. Another variable that we want to create
before generating the bar plot is Trans; this is just for labelling purposes. The labels that
we are going to use in our plot are 0 and 1, the names of the 2 groups.
We could have used manual and automatic as labels here as well; let us go with 0 and 1
for now. This bar plot is between average price, the variable we have just created, and
Trans. Let us look at the range of average price: it is between 3.74 and 5.48. In the
barplot function that I have written here, the y limit is 0 to 6, so this range will be
covered.
Average price is the first argument, so it goes on the y axis, and the names.arg argument
goes on the x axis as the group labels we have just created from transmission. The x axis
label is transmission and the y axis label is average price. Let us execute this line.
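A minimal sketch of this bar plot, with the same assumed column names:

avg_price <- c(mean(df1$Price[which(df1$Transmission == 0)]),    # mean price of manual cars
               mean(df1$Price[which(df1$Transmission == 1)]))    # mean price of automatic cars
Trans <- c("0", "1")                                             # group labels
barplot(avg_price, names.arg = Trans, ylim = c(0, 6),
        xlab = "Transmission", ylab = "Average Price")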
(Refer Slide Time: 18:48)
You can see that for group 0, the used cars with manual transmission, the average price is
somewhere between 3.5 and 4. For group 1, the used cars with automatic transmission,
the average price is somewhere between 5 and 5.5.
From this kind of information, it seems that the cars with automatic transmission carry
more value. Now let us create another bar plot. This time we are going to use only one
variable. In the previous plot we had one numerical variable on the y axis and one
categorical variable on the x axis.
Now let us focus on just one variable, and it has to be categorical, so it is going to be on
the x axis again. This variable is again transmission. What we are trying to find out is the
percentage of all records that are manual and the percentage that are automatic.
This is the code that we can use to compute this. You can see length: I am finding the
length of the vector returned by the which command, and which returns the indices
where transmission is 0.
All those indices are counted using the length function, giving the number of records in
that group. This is then divided by the total number of records in the transmission vector,
again computed using the length function. That gives the ratio, which is multiplied by
100 to create a percentage value. Similarly, we can do this for group 1, the automatic
cars. Let us execute these lines.
You can see the variable pAll has been created with 2 values: 80.8 percent of the records
belong to group 0, that is manual, and 19.2 percent belong to group 1, that is automatic.
Let us generate a bar plot; pAll is the variable, with Trans as the labels and the y limit 0
to 100 because we are using percentages, the standard range. Let us create this plot.
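A rough sketch of the percentage computation and the bar plot, reusing the Trans labels from the earlier sketch:

pAll <- c(length(which(df1$Transmission == 0)) / length(df1$Transmission),
          length(which(df1$Transmission == 1)) / length(df1$Transmission)) * 100
barplot(pAll, names.arg = Trans, ylim = c(0, 100),
        xlab = "Transmission", ylab = "% of all records")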
You can see transmission on the x axis with 2 groups, 0 and 1, and percentage of all
records on the y axis; group 0 is close to 80 and group 1 is closer to 20. So, using just one
categorical variable, transmission in this case, we can also create this kind of plot.
This again helps us understand the structure of the data: from this we can see that the
majority of the cars, around 80 percent, are manual and only about 20 percent, a much
smaller proportion, are automatic.
This gives us an idea about the structure; for example, if automatic cars had been less
than 5 percent, that category could have been considered a rare category or rare class,
and that might have affected our formal analysis. These kinds of graphics help us gain
insights about the data and help us later in the formal analysis.
Now, let us go back to our slides and discuss the next set of plots, which are distribution
plots.
There are mainly 2 distribution plots that we are going to cover in this lecture and in this
course: the first is the histogram and the second is the box plot. As the name says, these
are distribution plots and they help us understand the distribution of the data. Because
they show a distribution, they are generally applicable to numerical variables.
The distribution can help us understand, for example, whether the data follows a normal
distribution or not, and if it does not, what can be done about it.
Some of these plots are going to help us in that way: sometimes we might be required to
transform a variable so that we are able to achieve a roughly normal distribution.
Sometimes, if we want to convert a numeric variable into a categorical variable, the
entire distribution displayed by the histogram and the box plot can help us in binning that
variable, that is, in deciding how the bins or groups are to be created. Those insights
again help us in creating new variables.
As shown in this slide, the histogram and box plot are about the distribution of a
numerical variable; we get directions for new variable derivation, as we discussed, and
also directions for binning a numerical variable. They are useful in supervised learning,
specifically in prediction tasks, because they mainly apply to numerical variables, and
the prediction task is therefore the important setting for this kind of plot, where we can
get help with, for example, variable transformation in case of a skewed distribution.
We will learn more about skew in coming lectures and in this lecture as well; there could
be a right-skewed or a left-skewed distribution, and the transformations that can be done
to reduce some of this skewness in the data can be identified with the help of the
histogram and the box plot.
They also help in the selection of an appropriate data mining method. For example, if the
data set is not able to meet some of the assumptions of a statistical technique, then
probably we cannot apply it and we have to go with one of the data mining techniques
instead, because in those techniques some of these assumptions are relaxed and they can
more generally be applied.
So, selection of an appropriate method or technique can also be guided by these plots.
Now, some further discussion on the box plot. Box plots display the entire distribution.
Until now, the bar charts that we plotted focused on one or two groups or categories of
the categorical variable, with a numerical summary for each reflected on the y axis. That
is the kind of information we could get from the bar plot, but in the box plot we get the
entire distribution, the whole range of values, so we can have a better look at the full
data.
Another thing that can be done is side-by-side box plots. We can create side-by-side box
plots, which again help us compare and understand the differences between groups;
something we did using bar plots can also be done using box plots, and in a much better
fashion.
This can be useful in classification tasks, where we want to understand the importance of
numerical predictors. In a classification task we use some numerical predictors, and
side-by-side box plots can help us find out how these numerical predictors can best be
utilised and how important they are. Another usage of the box plot could be in time
series kinds of analysis, where we can have a series of box plots and look at changes in
the distribution over time.
That can also be done. Let us open R Studio and go through examples for the box plot
and the histogram. But first, let us also cover the histogram on the slides before we go
through examples of both kinds of plots together.
Histograms generally display frequencies covering all the values. In bar plots only a few
values are actually covered; in histograms all the values are covered and vertical bars are
used. We will learn more through R Studio.
Again we are going to use the same used cars data. You can see that hist is the function
used to plot a histogram; we are interested in the variable price, our outcome variable.
Let us see its range; the range is the same as before, 1.15 to 13.55.
(Refer Slide Time: 29:27)
You can see that an x limit of minus 5 to 20 has been used; why this slightly wider limit
has been chosen we will see after the plot is created. The y limit refers to the frequency
of the different bins; we will see that too once the plot is created. Let us execute this line,
and you can see the plot.
For better visibility of this histogram, we have started the x axis from minus 5 and also
given some extra range on the right extreme, so that we are able to visualize the whole
histogram in one go in a better fashion; that is why the wider x limits were given. You
can see the frequencies for the different bins there. Because the histogram covers all the
values of a numerical variable, we can get a sense of whether the distribution follows a
normal distribution or not.
In this case we can see that this seems to be a right-skewed distribution; there is a slightly
longer tail on the right side. So, this particular distribution does not seem to follow a
normal distribution, and therefore it is going to be slightly difficult to apply some of the
statistical techniques, for example linear regression. We would be required to do a bit of
transformation to make it closer to a normal distribution.
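A minimal sketch of this histogram; the y limit is an assumption and can be adjusted once the bin frequencies are known:

range(df1$Price)
hist(df1$Price, xlim = c(-5, 20), ylim = c(0, 30),    # ylim assumed; set it to cover the tallest bin
     xlab = "Price", main = "Histogram of Price")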
Now, let us move to the box plot. For the box plot we are interested in the 2 variables
price and transmission. Let us look at the range, which is again going to be the same, and
boxplot is the function that is used. In this case it is price versus transmission:
transmission goes on the x axis and price on the y axis. The different categories of
transmission are displayed on the x axis, and for each of those groups the distribution of
price values is displayed on the y axis. The limit for the y axis, as you can see, is 0 to 15,
and the labelling you can also understand.
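In code, roughly:

boxplot(df1$Price ~ df1$Transmission, ylim = c(0, 15),
        xlab = "Transmission", ylab = "Price")   # side-by-side box plots, one per group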
So, this is the box plot; let us have a look.
In the box plot you would see that about 50 percent of the values generally remain inside
the box. This black line in the middle is the median value; the lower edge of the box is
the first quartile and the upper edge is the third quartile.
All the values inside the box lie between the first quartile and the third quartile, so the
box covers 50 percent of the values, with the median displayed inside it. The majority of
the values lie within the whisker limits, and some of the values displayed as separate
square points beyond them can be called outliers.
This was for group 0; for group 1 it is again the same: this line is the median, the first
and third quartiles form the box, the other elements remain the same, and this value is the
outlier.
Now, comparing group 0 and group 1, you can see that the median price value of group 0
is much lower than that of group 1; the whole box plot for group 1 sits much higher than
the box plot for group 0.
A clear separation between these 2 groups can be seen there. If we want to look at the
mean value for those 2 groups, that can also be done; we will have to compute the means
for the 2 groups. This can be done using the by command available in R: the first
argument is the variable for which we have to compute the mean, the second argument is
the categorical variable defining the groups, and the third is the mean function itself. Let
us execute this code.
The means are created, and once they have been created we can plot them using the
points command. points is the command used to add points to an existing graph. Here
pch is the plotting character; the plotting character defined by the value 3 is going to be
displayed in the plot, which is nothing but a plus sign.
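A rough sketch of adding the group means on top of the box plot:

means <- by(df1$Price, df1$Transmission, mean)   # mean Price for each Transmission group
points(means, pch = 3)                           # pch = 3 draws a "+" at each group position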
Let us execute this line; more information on the points command can be found in the
help section. You can see the plus signs visible there. Looking at the plot, the plus sign is
lying almost exactly on the median value for group 0, while for group 1 the plus sign is
much higher than the median value. So, the skewness that we saw earlier may be coming
from group 1, that is, the automatic cars.
Let us stop here; we will continue from this point. In the next lecture we will create a
few more box plots. Basic charts and these distribution plots are mainly 2D graphics; we
will go more into multivariate or multidimensional graphics in the next lecture.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 09
Visualization Techniques- Part III Heatmaps
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the
previous lectures we had started covering visualization techniques. Let us start from
where we stopped in R Studio, where we were going through some of the examples. Let
us go back and complete some of them, and then we will come back and start our
discussion on the next plot, heat maps. So, let us go back.
Again we will have to do some of the loading and importing of the data set; we will have
to reload the library and everything. So, let us load the library xlsx. Once it is loaded, we
can proceed.
(Refer Slide Time: 01:07)
In this particular lecture we will mainly be using the used cars xlsx file. Let us import
that data set.
We will rerun the same lines that we ran in the previous lecture. We can see that there are
79 observations and 11 variables in the environment section. Then let us recreate the age
variable as discussed in the previous lecture, append it to the data frame, subset the data
frame, and also convert the factor variables. You might remember that in the last session
we had eliminated one of the observations, which was an outlier; let us perform the same
operation again. This was the observation; let us take a backup and then eliminate it
again.
Now, let us have a look at the data set. You can see that df1 now has 78 observations and
9 variables.
Let us go back to the point where we stopped in the previous lecture. We were going
through some examples of box plots, and as I remember we had completed one box plot.
Let us discuss another one: this box plot is between kilometres and the categorical price.
Let us look at the range of kilometres, because we have to specify it in the y limit since
this variable is going to be on the y axis. You can see the range, and you can see that the
y limit we have specified in this line covers the range for kilometres. Now, let us create
the box plot.
You can see that, like the last lecture, this is the box plot we have. The interpretation of
the box plot remains the same as in the last session. If you want to display the mean as
well, that can also be done, but for that we have to compute the means first. This is the
code we discussed in the previous section as well. The means are computed; you can see
the means1 variable has been created with 2 values.
Now, let us plot these 2 points, and you can see the plot. Here also you can compare how
the kilometre (KM) variable is distributed for the 2 groups, group 0 and group 1. In
comparison to the previous example, both these boxes are closer to each other, but for
group 0 the distribution is slightly on the lower side, so there is some difference between
box 0 and box 1. As we discussed, this kind of box plot can help us understand the
differences between groups, and it can also help us decide whether we need to include
any interaction variable if we see a significant difference in the distribution of the data
across the 2 groups. We will have more discussion on interaction and other related
concepts in coming lectures.
Let us plot another one. This one is between age and the categorical price. Let us look at
the range, because age is again going to be plotted on the y axis; we can see the range is
2 to 10.
You can see the limit is specified appropriately, with the other settings remaining similar.
We get the plot, and we can also calculate the means and plot them for these 2 box plots.
Let us look at the graphic.
In this case you would see that the 2 boxes are in the same range, but the median
coincides with the first quartile in box 0, while in box 1 it is separate. You can also see
that the means are at roughly the same value. So, there is very little difference between
the distributions of these 2 groups with respect to age.
Now let us do another example, this one being between showroom price and the
categorical price. Let us check the range; you can see it is specified appropriately in the
box plot line of code, as sketched below.
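A rough sketch of these side-by-side box plots, assuming the column names KM, Age and SR_Price:

boxplot(df1$KM ~ df1$C_Price, xlab = "C_Price", ylab = "KM")
points(by(df1$KM, df1$C_Price, mean), pch = 3)
boxplot(df1$Age ~ df1$C_Price, ylim = c(0, 12), xlab = "C_Price", ylab = "Age (years)")
points(by(df1$Age, df1$C_Price, mean), pch = 3)
boxplot(df1$SR_Price ~ df1$C_Price, xlab = "C_Price", ylab = "Showroom Price")
points(by(df1$SR_Price, df1$C_Price, mean), pch = 3)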
(Refer Slide Time: 07:02)
We can plot the box plots and then the means; let us plot them and look at this graph.
This one looks much more interesting. You can see a much bigger difference between the
2 boxes: for group 0 the showroom price distribution is on the lower side, and for group
1 the showroom price is on the higher side. That is nothing unusual; it is actually because
of the way the categorical price was created, and the showroom price follows it,
indicating the same difference, because both are related to the pricing of the cars.
Therefore, this separation is very clearly depicted in the box plot, because both these
variables are related to price.
Now, let us come to another plot; let us come back to our slides.
(Refer Slide Time: 08:09)
Heat maps are our next discussion. Heat maps can be combined with the basic plots and
distribution plots; they display numeric variables using graphics based on 2D tables, and
we will see how that is possible. We can also use colour schemes to indicate values:
different colours and different shades of a colour can be used to indicate different ranges
of values, say for a value lying between 0 and 0.1.
One particular shade could be used for values between 0 and 0.1, a darker shade for
values between 0.1 and 0.2, and a slightly darker colour again for values between 0.3
and 0.4. In that fashion, using an ordered colour scheme, the intensity of the shade can
help indicate whether the value is on the higher or lower side. Any kind of data that we
can put in a 2D table format can be displayed using a heat map, and the colour coding
can help us understand the data and develop some relevant insights.
Now, as we talked about in the previous lecture as well, our human brains are capable of a much higher degree of visual processing. Therefore, heat maps can be really helpful, especially when we are dealing with a large amount of data. When we have a large number of values it might be difficult to spot different insights about the data, so the colour coding of a heat map can help us build our visual perception, and that visual perception can then be carried forward into the subsequent, more formal analysis.
Now, as you can see in the slide, the second point about heat maps is that they are useful to visualize correlations and missing values. As we talked about, different colour shades are going to be used, so in the correlation matrix, if there is a high degree of correlation between two particular numerical variables, that can be shown with a darker shade, and if the correlation coefficient is low, a lighter shade can be selected. Therefore, the different shades and their intensity can help us find out which variables are highly correlated and which variables have low correlation values.
Similarly, missing values can also be spotted. As we talked about in the starting lectures, data is generally displayed in a matrix or tabular format, so that data can be shown as a heat map, and if there are any missing values they can be represented using white colour while the cells where values are present can be represented in a darker colour or black. That makes it easy to spot missing values, and heat maps can thus help us understand the missingness in a particular data set, for example whether there are too many missing values. Duplicate rows and columns can also be spotted: if the colour pattern is very similar for two particular rows or columns, or for multiple rows or columns, we can do a manual check to find out whether it is indeed a duplicate row or column. So, heat maps can help us find these problems.
So, let us go back to RStudio; first we will cover the correlation matrix. A heat map can be used to create a correlation table heat map, and for that we first need to compute the correlations. So, let us have a relook at the data frame that we have.
(Refer Slide Time: 13:08)
So, you can see that columns 1, 5 and 8 have been left out in the correlation function, the reason being obvious: these are factor or categorical variables, and correlation requires numerical variables. So, let us compute the correlation values among the remaining numerical variables.
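A minimal sketch of this computation, assuming the data frame is named df1 and that columns 1, 5 and 8 hold the factor variables (the positions are taken from the discussion above, not verified against the actual file):

num_vars <- df1[, -c(1, 5, 8)]        # drop the factor/categorical columns
cor_mat <- cor(num_vars)              # correlation matrix of the numeric variables
round(cor_mat, 2)                     # display it with two decimal places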
You can see a correlation matrix has been displayed there. This matrix is symmetric, so the upper half mirrors the lower half, and along the diagonal all the values are 1. This value of 1 is the correlation of a variable with itself, and a variable is always one hundred percent correlated with itself; the other cells show the corresponding correlation coefficients.
Now we can have a different kind of table for the same data.
This is the function symnum. You can see a different kind of depiction here: the variable names appear in the rows as well as in the columns, and 1 represents 100 percent correlation. Some notation is given at the bottom of this particular output: a blank is used for values lying between 0 and 0.3, a dot for values between 0.3 and 0.6, a comma for values between 0.6 and 0.8, a plus for values between 0.8 and 0.9, an asterisk for values between 0.9 and 0.95, and B for values between 0.95 and 1. So, you can see there is one comma here and then several dots; the comma value lies somewhere between 0.6 and 0.8 and the dots somewhere between 0.3 and 0.6. This is quite similar to what we were saying about heat maps, which show the same thing using colours; in this case the symnum function is displaying the different values using different symbols.
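A minimal sketch of this symbolic view, continuing with the assumed cor_mat object from above:

symnum(cor_mat)   # replaces each coefficient with a symbol; the legend is printed below the table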
(Refer Slide Time: 15:51)
Now, because there is symmetry in the matrix, let us get rid of one of the triangles; in this case we just want to keep the lower triangle, so the upper triangular values have been assigned NA. Now let us create the correlation table heat map. In this code you can see that the first argument of the heatmap function is the matrix itself, and then there are some other arguments: symmetry is set in this case, and for colour we have specified a grey palette. We will understand more about colour schemes in R later in this lecture; this particular function, gray.colors, can be used to create a number of grey shades. For example, here we create 1000 shades ranging from a start value of 0.8 to an end value of 0.2. The values can lie anywhere in the 0 to 1 range, but we are restricting ourselves to 0.8 to 0.2. We are not scaling the data, because these are already correlation values and hence already standardised, and we have also specified the margins.
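A hedged sketch of what this call might look like, again assuming cor_mat from above; the 1000 shades and the 0.8 to 0.2 range follow the description in the lecture, while everything else is an assumption:

cor_low <- cor_mat
cor_low[upper.tri(cor_low)] <- NA            # keep only the lower triangle
heatmap(cor_low, Rowv = NA, Colv = NA,       # no dendrograms or reordering of rows/columns
        symm = TRUE, scale = "none",         # matrix is symmetric; values already standardised
        col = gray.colors(1000, start = 0.8, end = 0.2),
        margins = c(8, 8))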
(Refer Slide Time: 17:18)
So, let us execute this particular code; you can see the output, this is the graphic that we have. In this graphic you can clearly see that the diagonal values are in the darkest shade, because each variable is 100 percent correlated with itself, so those cells represent perfect correlation. For the other cells slightly lighter shades of grey have been used, and the intensity of the shade indicates the value: higher intensity indicates a higher value and lower intensity a lower value. The whitish, light grey squares show variable pairs whose correlation values are on the lower side, and the slightly darker ones, for example the one between Price and SR Price, show higher values; we can understand that showroom price is going to be highly correlated with the price of the car.
Therefore, that correlation value is on the higher side and correspondingly the colour has a higher intensity, a darker shade of grey. This can really help us visually find out which particular pairs of variables are highly correlated. So, we can say Price and SR Price, we can also say SR Price and Airbag, Kilometres and Age, and similarly Price and Airbag: this particular set of variable pairs seems to be highly correlated, with SR Price and Price being very highly correlated.
(Refer Slide Time: 19:18)
Similarly, the data matrix or missing value heat map can be created using the heatmap function. In a missing value heat map, a cell where the value is present is generally shown in a darker shade and a cell where the value is absent in a lighter shade; typically black is used for a value being present and white for a value being absent. In this case we are not doing exactly that, because in the data set that we have all the values are present. Instead, just to give you the feel of a data matrix or missing value heat map, first for the first 6 records and later for all the records, we are depicting different shades of grey for the different actual values, so depending on the value a different colour shade would be shown.
So, for the first 6 records we are going to run this code. You can see head is the function that has been used to subset the data frame to its first 6 records, and gray.colors is again supplying the colour scheme, on a slightly different scale. Now we want to standardise the scale column-wise, so column-wise scaling is going to be done, and the margins are also mentioned there. So, let us run this code.
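A rough sketch of this step, assuming df1 again and keeping only its numeric columns so the matrix conversion works; the column-wise scaling and grey palette follow the description, the rest is assumed:

num_mat <- as.matrix(df1[, -c(1, 5, 8)])     # heatmap needs a numeric matrix
heatmap(head(num_mat, 6), Rowv = NA, Colv = NA,
        scale = "column",                    # standardise each column before colouring
        col = gray.colors(100, start = 0.9, end = 0.1),
        margins = c(8, 4))
# for a true missing-value map one would plot is.na(num_mat) in black and white instead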
(Refer Slide Time: 20:46)
You can see, for the first 6 observations, the different columns, the different variables and their values. For example, the Airbag column looks predominantly white: most of the values in Airbag were actually 0, so the whiter shade of grey has been used. Similarly, in the 5th row many cells are in a darker shade, so higher values are there in that particular row.
Likewise, the KM column is slightly on the darker side, which means higher values are there for the KM variable. In the same way we can create the heat map for all the rows, for all the records.
(Refer Slide Time: 21:49)
So, this is the heat map; you can see the row indexing now runs from 1 to 79, because we had 79 observations. Again the shade has been selected based on the values themselves. Had it been an actual missing value heat map, we would see only black or white: white in the places where the value is absent and black where the value is present. If we did that for our data set, the whole table would look black because there is no missing value. That brings us to our next discussion.
So, let us come back to our slides; the next discussion is on multidimensional visualization. Most of the visualization techniques or plots that we talked about were mainly 2D, two dimensional. Now, we can also add some features to the 2D plots that we have gone through till now, and in a way they would become multidimensional because of the features mentioned in this particular slide.
(Refer Slide Time: 23:21)
These features are going to give that multidimensional feel, so our visual perception can become multidimensional using these features on 2D plots. The first is multiple panels: if you use just one scatter plot, only two variables can be visualized, but if we have multiple panels we can have pair-wise scatter plots for many variables, and in one go we can look at different variables, their relationships, the information overlap and many other things.
So, multiple panels can give us a multidimensional look using 2D plots. Similarly colour: colour coding can be done for the different groups of a categorical variable, and that will also add a dimension and help us build our visual perception. Then size and shape: different sizes and shapes can be used for the points depicted in the graphic that is being generated, and that too can give a multidimensional feel from a 2D plot. Animation can be used, which helps us visualize changes over time, and some operations like aggregation of data, rescaling and interactive visualization can also be performed to get that multidimensional feel.
Now, when we create a truly multidimensional visualization such as a 3D plot, the visual perception is not that clear; it is difficult for us to learn something from 3D plots. Because of the way we have been trained over the years, our learning from 2D plots is much better than from higher dimensional plots. Again, the main idea behind these features and operations is to help build a visual perception that will support the subsequent analysis.
So, let us go back to RStudio and go through some of the plots, the first one being colour coded scatter plots. Before we create some colour coded scatter plots, let us understand the colour schemes in R. There is this function palette which can show us the default colour scheme in R.
If we run this function you can see the different colours depicted here: black, red, green3, blue, and so on. So, whenever in any function we use the colour argument, for example the col argument in plot, these particular colours are picked up in that order: for the first colour black is picked, for the second red, for the third green3, and so on. This colour scheme is going to be used to produce the different colours in your plots.
If you want to change this colour scheme you can do that; for example, rainbow(6) is one function call that can change your palette. You can pass it to the palette function and then check the values by rerunning palette.
You can see the colour scheme has now changed to rainbow(6): red, yellow, green and so on. For now we will stick to the default scheme, so let us reset it; if you recheck, you can see it is back to black, red, green3 and the rest.
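A small sketch of these palette calls, all standard base-R functions:

palette()              # show the current (default) colour scheme
palette(rainbow(6))    # switch to a six-colour rainbow palette
palette()              # confirm the change
palette("default")     # reset to the default scheme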
Now, let us create a colour coded scatter plot. This plot is going to be created between the variables age and kilometres (km), and the colour feature we are using is based on the categorical price variable. This categorical variable has two groups, 0 and 1, so for those two groups two different colours are going to be used for the points plotted between age and km.
So, let us run these two lines. The range is 2 to 10, appropriately specified in the plot function; let us run it and you can see the plot.
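A hedged sketch of such a colour coded scatter plot, with Age, KM and CatPrice as assumed variable names in df1:

grp <- as.factor(df1$CatPrice)                 # the two price groups, 0 and 1
plot(df1$Age, df1$KM, col = as.integer(grp),   # groups map to palette colours 1 and 2 (black, red)
     pch = 19, xlab = "Age", ylab = "KM")
legend("topleft", legend = levels(grp), col = 1:2, pch = 19)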
(Refer Slide Time: 28:30)
You can see two colours, black and red; as we have already seen, black and red are the first and second options in the palette. So, black has been used for group 0 and red for group 1. In this plot we get that three dimensional feel: we have km on the y axis and age on the x axis, and we can see the relationship between km and age in this scatter plot. As the age of a vehicle increases, the number of kilometres accumulated is of course going to be on the higher side, but you can also see that the red points are slightly on the higher side, with only a few red points on the lower side. Therefore, we can understand that the cars assigned categorical price 1, that is, those having a price equal to or more than 4,00,000, have accumulated more kilometres, so those cars are being used more often. That third dimension is being depicted using colour in this case.
Now, another kind of multidimensional visualization that we can create is multiple panels: a separate panel for each group. So, let us go through one example.
(Refer Slide Time: 30:19)
This particular example uses three variables. Essentially we are trying to create a bar plot, mainly between price and age, and different panels are going to be used depending on the transmission: one panel for transmission 0 and another panel for transmission 1. The main bar plot is between price and age, where age is being used on the x axis, so it has to be categorical.
Therefore, we need to convert age into a categorical variable, so let us start with that. Age group is the categorical variable that we are going to create out of the age variable. You can see as.factor being applied; age may already behave like a factor, but to be safe this function has been used. We are then extracting the labels with a function that can be used on a categorical variable to retrieve the different labels present in it. So, age was a numerical variable, as.factor has been used to convert it into a factor variable, and then the labels function can be used to retrieve those labels.
(Refer Slide Time: 31:55)
So, let us do that. The age groups variable has been created, and you can see the different labels 2, 3, 4, 5 and so on; cars of different ages have now been clubbed into different groups. Again, we need this variable for the computation coming up: we need to run a loop later, as you can see a for loop is there, and the age groups will help us iterate through all the different age groups.
Then we are going to create an average price for each transmission group, transmission 0 and transmission 1. For that we have created two variables, average price 1 and average price 2; let us initialize them. Once the initialization is run, for each age group we are going to run this loop and fill these two variables: for transmission 0 and for each age group we compute an average price, and similarly for transmission 1 and each age group we compute the average price. Let us run this loop.
(Refer Slide Time: 33:21)
Once this loop is done, there could be some groups where the average price cannot be computed because that combination did not occur: for transmission 1 some of the age groups had no records, and similarly for transmission 0 there might be some age groups with no records. In those cases NaN is automatically produced in R, so we need to convert those entries to 0.
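A rough sketch of this computation under the same assumed names (df1, Age, Price, Transmission); the loop mirrors the description above rather than the exact code used in the lecture:

age_groups <- levels(as.factor(df1$Age))          # the distinct age labels
avg_price_0 <- avg_price_1 <- numeric(length(age_groups))
for (i in seq_along(age_groups)) {
  g <- as.character(df1$Age) == age_groups[i]
  avg_price_0[i] <- mean(df1$Price[g & df1$Transmission == 0])  # manual transmission
  avg_price_1[i] <- mean(df1$Price[g & df1$Transmission == 1])  # automatic transmission
}
avg_price_0[is.nan(avg_price_0)] <- 0             # empty combinations produce NaN
avg_price_1[is.nan(avg_price_1)] <- 0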
Once this is done, we want to create two panels, and par is the command that can be used; you can see mfrow is the argument. We want 2 rows and 1 column, because the x axis is going to be the same for both panels, so there is 1 column and the 2 panels are stacked along the y axis. cex is again 0.6, which applies to the labelling and to all the numbers that are going to be depicted; the default is 1, so with 0.6 we are scaling down the size of all the points, numbers and text. The margin and the outer margin, which you already know about, are also specified here. So, let us run this command.
Let us have a look at the range, because we are going to require it in the bar plot. From these two vectors we can see that the values lie between 0 and 9, so a 0 to 9 limit would cover everything.
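A hedged sketch of the two-panel bar plot described here, continuing with the avg_price_0 and avg_price_1 vectors from the previous sketch; the legend text and margins are illustrative:

par(mfrow = c(2, 1), cex = 0.6, mar = c(4, 4, 1, 1), oma = c(1, 1, 1, 1))
barplot(avg_price_0, names.arg = age_groups, ylim = c(0, 9),
        ylab = "Average price", legend.text = "Trans 0")
barplot(avg_price_1, names.arg = age_groups, ylim = c(0, 9),
        xlab = "Age group", ylab = "Average price", legend.text = "Trans 1")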
(Refer Slide Time: 35:15)
So, let us plot this: the first bar plot with a legend box for trans 0 and the name of the y axis, and then the second bar plot with the name of the x axis and its legend. Now you can see the plot has been created. These are the two panels; the scale of the x axis is the same because the same variable is used on the x axis, and on the y axis the variable is also the same, although the average price for the two groups could differ. Still, we have used the same range, so these two panels can actually be compared value by value. You can see that for most of the age groups with transmission 0 the average price is around 4,00,000; for some age groups it is slightly lower, and as we move further the average price goes down until age group 8, and then for age groups 9 and 10 it increases again, maybe because those cars had a higher showroom price. If you look at transmission 1, these are automatic cars, and the average price for these cars is slightly on the higher side, more than 5 or closer to 6.
Automatic cars are of course going to be costlier, so the used-car prices are also going to be on the higher side, and that is reflected in this particular graphic. As the age of a car increases you can see a slight decrease in average price; there are some exceptions of course, but that is the general sense. So, we will stop here today and continue our discussion on some more visualization techniques in the next section.
Thank you.
Business Analytics & Data Mining Modeling Using R
Dr. Gaurav Dixit
Department of Management Studies
Indian Institute of Technology, Roorkee
Lecture – 10
Visualization Techniques- Part IV
Multiple Panel Plotting
Welcome to the course Business Analytics and Data Mining Modelling Using R. In the previous lecture, we were discussing visualization techniques, and in particular we were discussing multiple panels. So, let us restart our discussion from the same point; let us go back to RStudio.
At the end of that lecture we were covering separate panels for each group, which I think we were able to complete, so today let us move to the scatter plot matrix. A scatter plot matrix can be really useful in situations where you have many numerical variables and you are trying to understand the relationships between different pairs of variables; that is going to be useful for prediction and classification tasks in supervised learning methods.
And in the case of unsupervised learning methods it could be useful in understanding the information overlap between two variables. If we get to see the relationships between different variables in one go, our visual perception can be much better, especially in some situations.
The data set that we are going to use is the same one, the used cars data set. Because we are starting afresh, we need to import this particular data set again, so let us reload the library and import the data set.
You can see the data set has been imported, 79 observations of 11 variables, which is visible in the environment section.
Now, let us also compute the age variable, and let us take a full backup of this data frame, which we are going to require later on in this lecture. As we understood in the previous lectures, the first 3 observations were not important for some of the initial visualization techniques that we discussed, so they are dropped once this is done.
In the previous sessions, we also identified one observation that we wanted to get rid of.
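A hedged sketch of this setup; the file name, the Excel-reading library, and the way age is derived are all assumptions, since the lecture does not spell them out:

library(readxl)                                      # assumed; any Excel reader would do
df1 <- as.data.frame(read_excel("used_cars.xlsx"))   # hypothetical file name
df1$Age <- 2017 - df1$Mfg_Year                       # assumed derivation of the age variable
df_full <- df1                                       # full backup for later use in this lecture
df1 <- df1[-(1:3), ]                                 # drop the first 3 observations as before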
(Refer Slide Time: 03:07)
So, let us do the same today as well: take the backup and eliminate that observation. Now we are ready to start with the scatter plot matrix. As discussed, a scatter plot matrix can be useful to understand relationships and information overlap.
Now you can see we have selected 4 key numerical (continuous) variables for the scatter plot matrix. As we understood in the previous lecture, for scatter plots both the variables on the x axis and the y axis are supposed to be numerical, so SR Price, kilometres, Price and age, these 4 numerical variables, can be used to generate a scatter plot matrix. Let us execute this code.
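The function used here is pairs with a formula interface; a minimal sketch under the same assumed variable names:

pairs(~ SR_Price + KM + Price + Age, data = df1)   # pair-wise scatter plots of the four numeric variables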
You can see the graphic has been created; let us zoom in.
(Refer Slide Time: 04:21)
Now, here you can see in the diagonal boxes the names of the variables that have been used to create this particular scatter plot matrix: SR Price, KM, Price and Age. The function we have used to generate this matrix is pairs. In this pairs function we have to pass a formula, and in the formula you mention the names of the variables that are going to be used to create the scatter plots, followed by the data.
So, let us go back. For example, if you are interested in this particular graphic, then the y axis is going to be SR Price, which is in the same row, and the x axis is going to be represented by KM, which is in the same column; so the x axis is KM and the y axis is SR Price. Similarly, if you are interested in another graph, the Age variable, which is in the same row, is going to be represented on the y axis and SR Price, in the same column, represents the x axis. That is how you can work out which variables are on the x axis and y axis.
Now, you can look at the different plots and try to understand the relationships. For example, this particular plot is between SR Price and Price, and you can see a linear kind of relationship is visible there: if you pass a line through the majority of the points it is going to be roughly a straight line. You can understand from the variables themselves that SR Price and Price are both based on prices, so there is supposed to be a linear relationship, and that is very visible in the data itself.
Now, if you are interested in kilometres versus Price you can see this particular plot. For KM and Price most of the points are clubbed together in one group, so there does not seem to be much effect of kilometres on Price. We can also look at the graph where Price is on the y axis and KM is on the x axis; because Price is our outcome variable of interest, this particular plot is of more interest to us. Here the data could be represented by a horizontal line, signifying that KM does not have much influence in determining Price. Similarly, there are other plots and different kinds of relationships can be seen there.
Now, if we are interested in a few other plots, for example Price and Age, in this particular graph you would see that because age is represented by only a few distinct numbers, cars of different prices are stacked at each age value, so it looks like a bar chart kind of plot.
So, for each age value, cars of different prices are shown by different data points. This kind of scatter plot matrix can really be useful for understanding many relationships: it can help us find new variables, understand interaction terms if required, group some of the categories, and also subset the model, that is, run a model on a subset of the full data set. Those kinds of things can be identified using these plots.
Now, let us move to our next point; let us go back to the slide.
(Refer Slide Time: 09:14)
Next we are going to discuss the operations aggregation, rescaling and interactivity. These operations can sometimes be really useful for the same purposes we have been talking about. Let us start with rescaling, and let us go back to RStudio again. Rescaling can be really helpful if there is crowding of points near an axis, whether the x axis or the y axis: if many points are crowded near those axes, we can rescale the x axis and y axis and get a better look at the data.
Let us see how through an example. We are now going to create 4 back to back plots, so we divide our plotting region into 2 rows and 2 columns; 4 plots are going to be created, and we have changed the other settings accordingly, like the margin, the outer margin, and the size of the text and numbers.
So, let us run this. The first rescaling example is a scatter plot between kilometres and price, so let us again have a look at the range.
(Refer Slide Time: 10:55)
As you already know, the range can be used to specify the x limit and the y limit in the plot function, and you can see the limits have been specified appropriately.
Now, let us run this particular plot. As I said, we wanted to generate 4 plots in the same plot area, so you can see one fourth of the area has been taken by the first plot. Now let us create the axis labels; we can see Price versus kilometres, and you can also see the points. We have talked about this plot many times, noting that there is not much influence of KM on price, but if you want to have a much closer look then rescaling can be really useful.
Let us see how. We will change the scale of the x axis and y axis to a log scale. When we talk about log scaling we are essentially changing the spacing of points on the x axis and y axis: the points are not going to be equally spaced, they are going to follow logarithmic scaling, and the spacing changes accordingly. How can that be done? In the plot function in R there is an argument log which can be used to perform different kinds of scaling. For example, if you just want to change the scaling of the x axis you assign log = "x", if you just want to scale the y axis you assign log = "y", and if you want to change both axes, which is what we are doing here, you say log = "xy". This plot is between KM and price.
Now, let us talk about the limits. In the previous plot we had used 0 to 180 for the x axis. In the log case you would see that this becomes 10 to 1000, the reason being that 180 is more than 100 and the next appropriate spacing point on a log scale is 1000; the scale runs like 1, 10, 100, 1000, or in the other direction 1, 0.1, 0.01. Therefore, we have to make sure that all the values lie within the range, and an appropriate limit for the values to lie in the plot region is 10 to 1000 in place of 0 to 180. Similarly, for 0 to 15 we can use 0.1 to 100. Let us execute this particular line; you can see the visibility of all these points is now much clearer, and this is mainly because of the change in scale and thereby the change in the spacing of points on the x axis and y axis.
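A hedged sketch of the pair of plots being compared, again with the assumed variable names; the limits follow the values mentioned above:

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), oma = c(1, 1, 1, 1), cex = 0.7)
# plot 1: original linear scale
plot(df1$KM, df1$Price, xlim = c(0, 180), ylim = c(0, 15),
     xlab = "KM", ylab = "Price")
# plot 2: both axes on a log scale, limits widened to the nearest powers of 10
plot(df1$KM, df1$Price, log = "xy", xlim = c(10, 1000), ylim = c(0.1, 100),
     xlab = "KM", ylab = "Price")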
Now, we recreate both the x axis and the y axis, relabel them, and then zoom in.
(Refer Slide Time: 14:50)
Now you can compare these two plots. The horizontal line that could represent these data points is not that perceivable in plot 1, but in plot 2 you can pretty much see that the relationship is essentially a horizontal line, so there is not much influence of KM on price.
This is much better visible on the log scale: you can see the tick points 10, 100 and 1000, and most of the points lie in the range from about 25 to 120. The points are still lying in the same range, but because of the change in scale, and therefore in the spacing of points, the visibility of these data points has changed and we can more easily perceive the relationship.
Now let us go back and create a box plot, and try to understand how rescaling can be helpful in the case of a box plot. For this example we are using the data frame of which we had taken a backup. Why this particular data frame? In df1 we had eliminated one particular outlier point, and now we want it back so that the importance of rescaling can be emphasized in a much better manner.
(Refer Slide Time: 16:45)
Looking at the range, you can see this point is back, the value around 72, so we have to change our limits accordingly. This particular box plot is between Price and transmission, transmission being the categorical variable on the x axis. Let us plot this, label the axes, and zoom in for a look.
Now, you would see that in this case, because of that one observation lying up here in the automatic transmission category, most of the box has been crowded down towards the x axis.
Therefore, comparison of these two boxes is becoming very difficult, and rescaling can really be helpful in this particular case. What we are going to do is use the log scale.
You can see log = "y" has been selected, because now only y is a numerical variable, so only that axis requires rescaling, and the y limits have been changed appropriately so that they cover all the data points, including the outlier, which lies well within this range. Let us execute this code; you will see a plot has been created, and let us label the axes.
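A hedged sketch of this rescaled box plot, using the backed-up data frame (df_full, an assumed name) and the same assumed variable names; the y limits are illustrative:

boxplot(Price ~ Transmission, data = df_full,
        log = "y", ylim = c(1, 100),         # log scale on y; limits chosen to cover the outlier
        xlab = "Transmission", ylab = "Price")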
(Refer Slide Time: 18:20)
Now, you see the boxes are looking much better and the comparison can be easily performed. For the point that was the outlier, you can see the spacing between that point and the main box has changed to a great extent; this is the result of rescaling, and the comparison can now be done easily.
So, this is the benefit of rescaling in situations where crowding happens. The next discussion point is on aggregation, attaching a curve, and zooming in. We will go through some examples and see how we can aggregate, how we can add a curve to an existing plot, and how we can zoom in, and how these can help us in some of the visual analysis tasks.
(Refer Slide Time: 19:14)
So, again we are going to create 4 back to back plots. The par function has been specified and called appropriately, with 2 rows and 2 columns, and the margins have also been specified so that we are able to use the plotting region effectively. For our first plot, in this particular case we are going to use the time series data, the riders data, that we had used earlier. Let us import that data set, bicycle ridership dot xlsx; this is time series data, the first variable being the month and year, the time scale related information, and then the number of riders for every month, covering the years from 2004 to 2017.
Let us also create the time series vector that we are going to require later on, as well as the at and labels variables that we will use for the axis ticks.
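A hedged sketch of this import and of creating the time series vector; the column name Riders and the use of readxl are assumptions, while the file name and the 2004 start come from the description above:

library(readxl)
riders_df <- as.data.frame(read_excel("bicycle ridership.xlsx"))
riders_ts <- ts(riders_df$Riders, start = c(2004, 1), frequency = 12)   # monthly series from Jan 2004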
(Refer Slide Time: 20:52)
Now, let us go back to aggregation; now that we have created the time series vector, let us plot. So, this