NPTEL Coursebook
DATA SCIENCE FOR ENGINEERS
2 Introduction to R
3 Introduction to R (Continued)
5 Data frames
Week 2
12 Linear Algebra for Data Science
Week 3
19 Statistical Modelling
Week 4
23 Optimization for Data Science
Week 5
27 Multivariate Optimization With Equality Constraints
Week 6
31 Predictive Modelling
Week 7
39 Cross Validation
41 Classification
Week 8
46 K-Nearest Neighbors (kNN)
Lecture – 01
Data Science for Engineers - Course Philosophy and Expectation
First off, I want to say, this is the first course on data analysis for beginners. So, this is
for people who want to learn data analytics who have not been practicing it for a long
time and so on. However, while we say this is a data analysis course for beginners, there is still a substantial amount of information, mathematical concepts and conceptual ideas that we will have to teach.
So, while it is an introductory course, there is still a significant amount of effort and learning that we expect the participants to put into this course. When we talk about data
analytics, there are several algorithms that one could use for doing analytics. So, as part of this course, we will try as much as possible, whenever appropriate, to explain all the concepts in terms of the data science problems that one might use them to solve. In that sense, we try to give you a framework to understand different data analysis problems and algorithms, and we will also, as much as possible, try to provide a structured approach to convert high level data analytic problem statements into what we call well defined workflows for solutions. So, you take a problem statement and then see how you can break it down into smaller components and solve them using an appropriate algorithm.
So, at a conceptual level, these are what we would expect the participants to take away from this course. For teaching data analytics or data science it is imperative that you do coding in a particular language. There are many possibilities here; as far as this course is concerned, we are going to use R as the programming language. So, as part of this course, R will also be introduced, and the emphasis here will be on the aspects of R that are most critical for what you learn in this course. In other words, commands that are required for this course material will be dealt with in sufficient detail.
So, that is as far as a programming language for learning data science is concerned. In terms of the mathematics behind all of this, we will describe important concepts in linear algebra that we think are critical for a good understanding of machine learning and data science algorithms, and we will also teach statistics that is relevant for data science. Other than this, we will also have modules on optimization and on optimization ideas that are directly relevant in machine learning algorithms. We will also provide conceptual descriptions that are easy to understand for selected machine learning algorithms, and whenever we teach a machine learning algorithm we will follow it up with another lecture where the practical implementation of the algorithm for a problem statement is demonstrated, and that demonstration will use R as the programming platform.
While we talk about what the objectives of this course are, it is also a good idea to understand what this course is not about. As I mentioned already, if you are a very advanced data analysis practitioner, then there are other courses at more advanced levels that are relevant; this course is at a basic level for someone getting into the field of data science. We will be teaching a course on machine learning later, which might be more appropriate for people in that category. This course is also not about big data per se, and we are not going to cover big data concepts such as MapReduce, Hadoop frameworks and so on.
This course is more about the mathematical side of data analytics; we are going to focus more on the algorithms and the fundamental ideas that underlie these algorithms. While we will use R as a programming platform, this is not an in-depth R programming course where we teach you very sophisticated programming techniques in R; the R programming platform will be used only in as much as it is important for us to teach the underlying data science algorithms.
Now, there are a wide variety of machine learning techniques that could be used, and in a nine-week course we have to pick the techniques that are most relevant. Not only that, since we think of this as a first course in data science, we also have to spend enough time covering the fundamental topics of linear algebra, statistics and optimization from a data science perspective, and that takes quite a few weeks of lectures. So, we are going to pick a few machine learning techniques which we believe are the most relevant for a beginner.
So, you understand the basic ideas in data science, you get a fundamental grounding in the math principles that you need to learn, and then you put all of this together in some machine learning technique. So, you understand some machine learning techniques where all of these ideas are used, and we have picked these techniques in such a way that you can understand data science better and also use them in some problems that might be of use or interest to you.
So, in terms of the outcomes we would expect when a participant finishes this course, there are many things that you can do, but these are some categories of skills that we would expect you to develop. We would expect you to be able to describe data analysis problems in a structured framework; once you describe them, we would expect you to identify comprehensive solution strategies for the data analysis problems, to classify and recognize different types of data analysis problems, and, at least to some level, to determine appropriate techniques.
Now, since we do not teach you a wide variety of techniques, within the gamut of techniques that you are taught you will be able to identify an appropriate technique that you can use. In this course, we also emphasize the important idea of assumption validation. You make some assumptions about the data that you are dealing with, those assumptions tell you what algorithms you should use, and once you run the algorithm you get the results and see whether your assumptions are validated, and so on. So, you would be able to think about how you can relate the results of whatever you have done to the assumptions you made to solve the problem, and then see whether the solution makes sense, and so on.
So, that is where we talk about judging the appropriateness of the proposed solution based on the observed results. Ultimately, we would expect you to be able to generate comprehensive reports on the problems that you solve and then be able to say why you did what you did; that is an important aspect of what we are trying to cover.
So, if you stick with us and get through all the eight weeks of this course, and also diligently work on the assignments that are provided at the end of every week, then we hope that you learn the fundamentals of data science, get a fundamental grounding in the important ideas and the math that you need in order to understand data science, and take this learning forward to more complicated algorithms and more complicated data science problems that you might want to solve in the future.
So, I hope all of you learn from and enjoy this course, and we will see you as the course progresses.
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 02
Introduction to R
Welcome to the course on data science for engineers. In this module, we are going to introduce R as a programming language to perform data analysis. In this lecture, we are going to give a brief introduction to R and R Studio.
In R Studio, we are going to look at how to set the working directory, how to create an R file and save it, how to execute an R file and how to execute pieces of R code.
(Refer Slide Time: 00:41)
Let us first see what R is. R is an open source programming language that is widely used as statistical software and a data analysis tool. R generally comes with a command line interface. R is available across widely used platforms: Windows, Linux and macOS. Now, let us see what R Studio is.
R Studio is available as both open source and commercial software. R Studio is also available as both a desktop version and a server version. For this course, we are going to use the open source desktop edition, so that you can solve your assignments using R Studio. R Studio is also available for various platforms, such as Windows, Linux and macOS.
Now, let us see how R Studio looks when you first run the application. This is how the R Studio interface looks. To the left, we see the Console panel, where you can type in commands and see the results that are generated when you type them in. To the top right, you have the Environment and History pane. It contains two tabs: the Environment tab, which shows the variables that are generated during the course of programming in a workspace, which is temporary, and the History tab, where you will see all the commands that have been used from the beginning of the R Studio session. At the bottom right, you have another panel, which contains multiple tabs, such as Files, Plots, Packages and Help.
The Files tab shows the files and directories that are available in the default workspace of R. The Plots tab shows the plots that are generated during the course of programming. The Packages tab lets you see which packages are already installed in R Studio and also gives a user interface to install new packages. The Help tab is a most important one, where you can get help from the R documentation on the functions that are built into R. The final tab is the Viewer tab, which can be used to see local web content that is generated using R or some other application. For this course, we are not going to use this tab much, so we will not discuss the Viewer tab further. So, we have got an idea of how R Studio looks. Let us see how to set the working directory in R Studio.
The working directory in R Studio can be set in two ways. The first way is to use the console and the setwd() command. You can use the function setwd() and give the path of the directory which you want to be the working directory for R Studio, in double quotes.
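A minimal sketch of this, with a hypothetical path:

setwd("C:/Users/yourname/Documents/R")  # path in double quotes; hypothetical example path
getwd()                                 # optionally confirm the current working directory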
Or, to set the working directory from the GUI, you need to click on the three-dots (browse) button. When you click this, it will open a file browser, which will help you choose your working directory. Once you choose your working directory, you need to click the settings button in the More tab, and you then get a popup menu, where you need to select Set As Working Directory. This will set the current directory, which you have chosen using the file browser, as your working directory.
(Refer Slide Time: 04:50)
Once you set the working directory, you are ready to program in R Studio.
Let us illustrate how to create an R file and write some code. To create an R file, there are two ways. The first way is to click on the File tab; when you click it, it gives a drop down menu, where you can select New File and then R Script, so that a new file opens.
(Refer Slide Time: 05:18)
The other way is to use the + button, just below the File tab, and choose R Script from there to open a new R script file.
Once you open an R script file, this is how R Studio looks with the script file open. The Console, the Environment and History, and the Files and Plots panels are still there. On top of that, you have a new window, which is the script file that has just been opened. Now you are ready to write a script file, or some program, in R Studio.
(Refer Slide Time: 05:51)
So, let us illustrate this with a small example, where I am assigning a value of 11 to a in the first line of the code which I have written; the second command computes b as a times 10, that is, I am evaluating a times 10 and assigning the value to b; and the third statement, print(c(a, b)), concatenates a and b and prints the result. So, this is how you write a script file in R.
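A sketch of the script described above:

a <- 11         # assign the value 11 to a
b <- a * 10     # evaluate a times 10 and assign it to b
print(c(a, b))  # concatenate a and b and print the result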
Once you write a script file, you have to save it before you execute it.
Let us see how to save the R file. If you click the File tab, you can save the file; if you click the Save button, it will automatically save the file as UntitledX, where X can be 1 or 2 depending upon how many R scripts you have already opened. It is a better idea to use the Save As button, just below Save, so that you can rename the script file as you wish. Let us suppose we have clicked the Save As button.
This will pop up a window like this, where you can rename the script file as test.R, or whatever name you intend. Once you rename it, you can click Save, and that saves the script file.
(Refer Slide Time: 07:31).
So now, we have seen how to open an R script and how to write some code in the R script file. The next task is to execute the R file. There are several ways to execute the commands that are in the R file. The first way is to use the Run command.
This Run command can be executed from the GUI, by pressing the Run button, or you can use the shortcut key Ctrl + Enter; what it does is execute the line on which the cursor is. The other way is to run the R code using Source or Source with Echo. The difference between Source and Source with Echo is the following: the Source command executes the whole R file and only prints the output which you explicitly print, whereas Source with Echo also prints the commands, along with the output you are printing.
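From the console, the equivalent commands would look roughly like this, assuming the script was saved as test.R:

source("test.R")               # runs the whole file; prints only the output you explicitly print
source("test.R", echo = TRUE)  # additionally echoes each command as it is executed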
(Refer Slide Time: 08:38)
So, this is an example where I have executed the R file using Source with Echo. You can see, in the console, that it printed the command a = 11 and the command b = a * 10, and also the output of print(c(a, b)) with the values. So, a = 11 and b = 11 times 10, which is 110. This is how the output is printed in the console.
Now, let us see how to execute pieces of code in R. As you have seen earlier, you can use the Run command to run a single line. So now, let us try to assign the value 14 to a and then run it. How do you do this? Take your cursor to the line which you want to edit, replace the 11 by 14 and then use Ctrl + Enter or the Run button. This will execute only the line where the cursor is placed.
In the Environment pane, you can see that only the value of a has changed and the value of b remains the same. This is because we have executed only line 2 of the code, which changes the value of a, but we have not executed line 3. So, the value of b remains as is: the value of a has changed, but not the value of b.
In summary, we can say that Run can be used to execute selected lines of R code, while Source and Source with Echo can be used to run the whole file. The advantage of using Run is that you can troubleshoot or debug the program when something is not behaving according to your expectations. The disadvantage of the Run command is that it populates the console and makes it unnecessarily messy.
In the next lecture, we are going to see how to add comments to an R file, both to single lines and to multiple lines.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 03
Introduction to R
Welcome to lecture 2 in the R module of the course Data Science for Engineers. In the previous lecture we gave a brief introduction to R and R Studio, and we saw how to create an R file, write some code in it and execute it.
In this lecture we are going to show how to add comments to an R file, how to clear the environment and how to save the R workspace. Now let us first look at how to add comments to the R file.
(Refer Slide Time: 00:51)
Before that, let us ask this question: why do you add comments to your code? Adding comments improves the readability of your code; for example, you can explain the purpose of the code you are writing, or you can explain what an algorithm is doing to accomplish the purpose you are attempting. Writing comments also helps to generate documentation that is external to the source code itself, using documentation generators.
Let us first look at how to add comments to a single line in an R script. You can comment a single line in R by using the hash (#) symbol at the start of the comment. In this example, I have written the first comment with a hash symbol, which turns the line green, and if you notice, these comments describe what this program is doing: it is taking a single number and then calculating a value which is 10 times it.
So, you can see here I am defining a variable a = 10, which I am annotating with a comment as the input number, and then I am explaining the operation that happens here, which is that b is calculated as 10 times a. If you remember, in the previous lecture we used the assignment symbol for assigning a value to a variable; you can also use = in R Studio, and that is demonstrated here. Now you can see how commenting makes your script file more readable.
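A minimal sketch of the commented script described above:

# This program takes a single number and calculates a value which is 10 times it
a <- 10      # the input number
b <- 10 * a  # b is calculated as 10 times a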
Comments can also be used to make certain lines of code inactive. You can do that by inserting a hash symbol at the beginning of the line; here you can see I want to comment out the line which says a = 14, and if I wish to do so, I can comment it by putting a hash symbol in front of it. Now we will see how to add comments to multiple lines at once in R.
(Refer Slide Time: 03:10)
There are two ways. The first is to select the multiple lines which you want to comment using the cursor and then use the key combination Ctrl + Shift + C to comment or uncomment the selected lines.
The other way is to use the GUI: select the lines which you want to comment using the cursor, and in the Code menu, when you click on it, a popup menu appears in which you need to select Comment/Uncomment Lines, which appropriately comments or uncomments the lines you have selected. In some cases, when you run your code using Source or Source with Echo, your console will become messy.
(Refer Slide Time: 03:57)
It is then necessary to clear the console; let us now look at how to do that. The console can be cleared using the shortcut key Ctrl + L. Let us look at an example: in this code I have defined a, calculated b and printed a and b. When I execute this code using Source with Echo, all the commands get printed here. Now, suppose I want to clear this console; what I have to do is click in the console and enter the key combination Ctrl + L. Once I do this, you can see that the console gets cleared. Remember, clearing the console will not delete the variables that are in the workspace; you can see that even though we have cleared the console, the workspace still has the variables that were created earlier.
(Refer Slide Time: 04:51)
Now, let us see how to clear the variables from the R environment. You can clear the variables in the R environment using the rm command. When you want to clear a single variable from the R environment, you can use the rm function as shown here: rm followed by the variable you want to remove. If you want to delete all the variables that are in the environment, you can use rm with the argument list = ls(), or you can clear all the variables in the environment using the GUI: in the Environment and History pane there is a brush button, and when you press it, a window pops up asking whether you want to clear all the objects that are available in the environment; if you say yes, it will clear all the variables, and you can see that the environment is now empty.
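A sketch of the two rm forms just described:

rm(a)            # remove the single variable a from the environment
rm(list = ls())  # remove all variables currently in the environment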
Now, let us see how to save
the data from the workspace in R. I have already mentioned that the information saved in the R environment is temporary and is not retained when you close the R session or restart R Studio. It is sometimes necessary to save the data that is in the current session.
The reason is that you would have done certain operations to get the data into this form, and you do not want to repeat those actions; you want to start from the point where you left off. In such cases you need to save the data from the R environment.
There are two ways. The first one is the automatic option: when you close the R Studio application, it will ask whether you want to save the workspace image. If you say yes, it will save all the variables that are in the workspace; if you say Don't Save, R Studio will exit and the workspace information will not be saved.
(Refer Slide Time: 06:57)
You can also save the workspace information manually, by saving the information to a file using the save command; the saved information can then be reloaded in future sessions using the load command. Let us see how to do that in R. Here is an example code: the first line shows how to save a variable that is in the workspace into a file named sess1.RData.
So, in the comments you can see that this is the command which you can use to save a single variable a. If you want to save the full workspace, you need to use the command save(list = ls(all.names = TRUE)) and give whatever filename you wish; a shortcut for this, which is given here, is save.image(), which saves the data in the environment into a .RData file in the current working directory. Once you do that, you can load the workspace information at a later point of time, whenever you want, using the load command, where you specify file = the filename into which you saved the data.
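A sketch of the save and load commands described above, using the file name sess1.RData from the example:

save(a, file = "sess1.RData")                            # save the single variable a
save(list = ls(all.names = TRUE), file = "sess1.RData")  # save the full workspace to a file
save.image()                                             # shortcut: saves the environment to .RData in the working directory
load(file = "sess1.RData")                               # reload the saved data in a later session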
So, in this lecture we have seen how to add comments to an R file, how to clear the console, how to clear the R objects that are in the environment, and also how to save the variables that are available in the R environment for further use. In the next lecture we are going to introduce you to the basic data types of R.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 04
Variables and datatypes in R
Welcome to lecture 3 of the R module in the course Data Science for Engineers.
In this lecture we are going to see the rules for naming variables in R and the basic data types that are available in R, and we are also going to see two basic R objects, vectors and lists, in detail.
(Refer Slide Time: 00:36)
b2 = 7 assigns the value 7 to the variable b2. This is a valid variable name because it starts with a letter and has only alphanumeric characters.
Similarly, the second variable, Manoj_GDPL = "scientist", is also a valid variable name; it has a special character, but it is the underscore, which is an allowed special character in variable names. Now, let us see an example where the variable name is not correct: the variable 2b = 7 gives an error, because the variable name starts with a numeric character, which does not follow the rules for variable names in R.
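A sketch of the three cases described above:

b2 <- 7                    # valid: starts with a letter, only alphanumeric characters
Manoj_GDPL <- "scientist"  # valid: underscore is an allowed special character
# 2b <- 7                  # invalid: a variable name cannot start with a digit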
R also contains some predefined constants, such as pi; letters, the lowercase letters a to z; LETTERS, the uppercase letters A to Z; and the months of the year: you can get full month names from month.name and abbreviated month names from month.abb.
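For example:

pi          # 3.141593
letters     # "a" "b" ... "z"
LETTERS     # "A" "B" ... "Z"
month.name  # "January" "February" ...
month.abb   # "Jan" "Feb" ...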
(Refer Slide Time: 02:25)
Let us now look at the data types that are available in R.
R has the following basic data types, and the table shows each data type and the values it can take. R has a logical data type, which takes a value of either TRUE or FALSE; it supports an integer data type, which is the set of all integers, and a numeric data type, which is the set of all real numbers. We can also define complex variables; R supports the set of all complex numbers. Finally, there is a character data type, which covers all the alphabetic and special characters. There are several tasks that can be done with data types.
The following table gives the task, the action and the syntax for doing the task. For example, the first task is to find the data type of an object. To do that you use the typeof function; the syntax is to pass the object as an argument to typeof.
The second task is to verify whether an object is of a certain data type. To do that you prefix is. before the data type; the syntax is is. followed by the data type, applied to the object you want to verify. For example, if you have a variable a which is defined as an integer and you use the command is.integer(a), it will return TRUE; if the variable a is not defined as an integer, it will return FALSE. The third task is an interesting one, where you can change, or convert, the data type of one object to another. To perform this action you use as. before the data type; the syntax is as. followed by the data type, applied to the object which you want to coerce. Note that not all coercions are possible, and if attempted they will return a null value.
There are sample codes for doing this at the bottom. The first one is typeof(1); 1 is a numeric value, so typeof returns "double", which is a numeric type. If you take typeof of the string "22-11-2001", it gives the value "character". Now, if you ask whether this character variable we have created, "22-11-2001", is of character type, you can use the command is.character, and the result is TRUE, because you defined it as a character variable.
In the next example, we coerce the character variable defined earlier, "22-11-2001", to a date and then check whether that date is a character variable. The result is FALSE, because when you coerce this character variable to a date it becomes a numeric variable, so when you ask whether it is a character the result is FALSE.
You can also coerce a numeric variable into a complex variable using as.complex; for example, as.complex(2) converts the numeric value 2 into the complex value 2+0i. Now, let us try coercing the character variable into a numeric variable using as.numeric; this gives us NA, not available. This means the coercion from this character value to a numeric value is not possible.
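A sketch of the sample codes described above (the date format string is an assumption):

typeof(1)                             # "double" (numeric)
x <- "22-11-2001"
typeof(x)                             # "character"
is.character(x)                       # TRUE
d <- as.Date(x, format = "%d-%m-%Y")  # coerce the character variable to a date
is.character(d)                       # FALSE
as.complex(2)                         # 2+0i
as.numeric(x)                         # NA, with a warning: this coercion is not possible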
R has several basic objects; the most important ones are vectors, lists and data frames. A vector is an ordered collection of elements of the same data type, a list is an ordered collection of objects themselves, and a data frame is a generic tabular object, which is very important and the most widely used object in the R programming language. We will see each of these in detail in the coming parts of this lecture and in the other lectures.
(Refer Slide Time: 07:22)
A vector is an ordered collection of basic data types of a given length. The only key thing here is that all the elements of a vector must be of the same data type. To see an example, the way you create a vector in R is using the concatenation function, c. So, now I am going to define a vector containing four numeric values and assign it to a variable X. This is what the code here does: X is the concatenation of these numbers, and then I print X. If you execute this piece of code, this is how the output in the console looks: it creates a vector X with the values 2.3, 4.5, 6.7, 8.9 and prints them in the console.
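The code looks like this:

X <- c(2.3, 4.5, 6.7, 8.9)  # c() concatenates elements of the same data type into a vector
print(X)
# [1] 2.3 4.5 6.7 8.9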
A list is a generic object consisting of an ordered collection of objects. A list can be a list of vectors, a list of matrices, a list of characters, a list of functions and so on. To illustrate how a list looks, we take an example here. We want to build a list of employee details; for this I want attributes such as ID, employee name and number of employees. So, I am creating a vector for each of those attributes.
The first attribute is a numeric vector containing the employee IDs, which is created using the command here. The second attribute is the employee names, which is a character vector created using this line of code, and the third attribute is the number of employees, which is a single numeric value.
Now, I can combine all these three different data types into a list containing the details of the employees, which can be done using the list command. So, this command here creates the emp.list variable, which is a list of the ID, emp.name and num.employees defined above.
Once you create a list, you can print it and see how the output looks. When you execute this code, you can see in the console that the list is printed: the first element is the IDs 1, 2, 3, 4; the second element of the list contains the names of the employees; and the third element of the list says how many employees there are. So, we have created a list.
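A sketch of the list construction described above; the employee names used here are placeholders, since the slide values are not reproduced in the transcript:

ID <- c(1, 2, 3, 4)                            # numeric vector of employee IDs
emp.name <- c("Ram", "Sam", "Tom", "Mani")     # character vector of names (placeholder values)
num.employees <- 4                             # a single numeric value
emp.list <- list(ID, emp.name, num.employees)  # combine three different objects into one list
print(emp.list)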
All the components of a list can be named, and you can use those names to access the components of the list. For example, this is the same list we have created, with the same ID, employee names and number of employees; instead of creating an unnamed list, you can give names to these attributes, such as ID, names (of the employees) and total staff, as shown in the code here. Once you execute this code, we can see that the list is created, and if you want to access an element of the list you can do that by using the dollar operator: emp.list is the list and you want to access the component with the name names. So, when you use this command and print the result, you can see the names of the employees printed.
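A sketch of the named version of the same list and of access using the dollar operator (the names are placeholders):

emp.list <- list(ID = c(1, 2, 3, 4),
                 names = c("Ram", "Sam", "Tom", "Mani"),
                 total.staff = 4)
print(emp.list$names)  # access the component named "names"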
You can also access the components of a list using indices. To access the top level components of a list you have to use the double slicing operator, which is two square brackets, and if you want to access the lower, or inner, level components of a list you have to use another square bracket along with the double slicing operator. The code here illustrates how to access the top level components; for example, if I want to access the IDs, I can print emp.list with the double slicing operator, which gives me the first top level component, which is ID.
The second component can be accessed similarly, and the result is shown here. If you want to access, for example, the first sub-element, or inner component, of the component ID, you have to use emp.list, the double slicing operator, and then the first element in another square bracket.
Similarly, you can access the first employee name using the double slicing operator followed by element one, which prints the first name from the employee list.
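A sketch of indexed access:

print(emp.list[[1]])     # first top level component: the IDs
print(emp.list[[2]])     # second top level component: the names
print(emp.list[[1]][1])  # first inner element of the ID component
print(emp.list[[2]][1])  # first employee's name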
A list can also be modified by accessing its components and replacing them with the ones you want. For example, I want to change the total number of staff to 5; that can be done easily by assigning the value 5 to the total staff. I also want to add a new employee name to the list; the component of the list which has the employee names is component 2, and I want to add the new name Nirav as a new employee in that sub-component.
So, I can directly assign this character value to the fifth sub-element of the second component of the list. Now, we also need to add an employee ID and give this employee a new ID, which is 5; what we are doing in this command is accessing the fifth sub-element of the first top level component and assigning it the value 5.
Now, once you print the list, you can see that there are 5 IDs, the total staff is 5, and the name Nirav has been added to the list.
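A sketch of these modifications:

emp.list$total.staff <- 5    # change the total number of staff to 5
emp.list[[2]][5] <- "Nirav"  # add a fifth name to the names component
emp.list[[1]][5] <- 5        # give the new employee the ID 5
print(emp.list)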
(Refer Slide Time: 13:50)
Next, we will see how to concatenate lists. Two lists can be concatenated using the concatenation function; the syntax is the concatenation of list 1 and list 2. We have a list which already contains three attributes, and we want to add another attribute, which is the employee ages; for that I am creating a new list which contains the ages of the five employees.
Now, I want to concatenate this new list, emp.ages, with the original list, emp.list. To concatenate these two lists you use the concatenation operator with the original list and then the new list. This command concatenates the two lists, and we assign the result back to the employee list. When you print this new employee list, you will see that another attribute, ages, has been added to the original list.
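A sketch of the concatenation, with assumed age values:

emp.ages <- list(ages = c(25, 30, 35, 40, 28))  # new list of employee ages (values assumed)
emp.list <- c(emp.list, emp.ages)               # concatenate the two lists
print(emp.list)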
To summarize, we have seen the data types that are supported in R, and two R objects, vectors and lists, in detail. In the next lecture, we are going to look at an important data object of R, the data frame.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 05
Data frames
Welcome to lecture 4 in the R module of the course Data Science for Engineers. In this lecture we are going to introduce you to the data frame objects of R.
We will see how to create data frames, how to access rows and columns of a data frame, how to edit data frames and how to add new rows and columns to a data frame. Let us first look at what a data frame is.
Data frames are generic data objects of R which are used to store tabular data. Data frames are the most popular data objects in R programming because we are comfortable seeing data in tabular form. Data frames can also be thought of as matrices where each column of the matrix can be of a different data type. Let us see how to create a data frame in R.
The code here shows how to create a data frame. The command here, where the mouse is pointed, creates a numeric vector containing 1, 2 and 3. The second command creates a character vector which contains the three strings R, Scilab and Java, and the third command creates another character vector which has entries describing the purpose, for prototyping and for scale up.
The way you create the data frame is to use the data.frame command and pass each of the vectors you have created as arguments to the function data.frame. This command creates a data frame df; when you print the data frame df, this is how the output looks. We can see that the names of the vectors you have created are taken as the columns, and the entries in each column are of the same data type; this is the condition which needs to be satisfied while creating a data frame.
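A sketch of the data frame construction described above (the exact strings in vec3 are assumed):

vec1 <- c(1, 2, 3)                                               # numeric vector
vec2 <- c("R", "Scilab", "Java")                                 # character vector
vec3 <- c("For prototyping", "For prototyping", "For scale up")  # character vector (wording assumed)
df <- data.frame(vec1, vec2, vec3)
print(df)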
(Refer Slide Time: 02:27)
Data frames can also be created by importing data from a text file. The way you have to do it is to use the function called read.table; the syntax is that you assign the data which you are reading to a new data frame, and read.table takes as its argument the path of the file from which you want to create the data frame.
Let us say you have data in some text file where the entries are separated by spaces; you use the command read.table with the path of the file from which you want to import the data. The path specification is OS dependent; you have to take care of whether you need to use the backslash or slash operator depending upon your OS. A separator can also be specified to distinguish between entries; the default separator between the entries of data is a space. The syntax for importing the data and creating a data frame looks like this: the new data frame is read.table of the path of the file, and you can also specify the separator that is used to separate the entries of the data.
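A sketch of importing a space-separated text file (the path is hypothetical):

new.df <- read.table("C:/data/mydata.txt", sep = " ")  # sep can also be "," or "\t"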
The separator can also be a comma or a tab, etc. So, what we have seen here is that you can either create data frames on the go, or you can use data that already exists in some format and use that to create data frames. Now that we have created a data frame,
(Refer Slide Time: 04:13)
we need to see how we can access rows and columns of a data frame. The syntax for that is the data frame followed by two arguments: the first argument, val1, refers to the rows of the data frame, and the second argument, val2, refers to the columns of the data frame. These val1 and val2 can be arrays of values, such as 1:2 or c(1, ...), etc.
If you specify only val2, which is the syntax df[val2], this refers to the set of columns that you want to access from the data frame. In this code we can see that if you want to access the first and second rows of the data frame that was created, you can do so by selecting the rows with 1:2 followed by a comma; leaving the column position empty specifies that all the columns are to be accessed.
You can see in the result that from the data frame created in the previous slide you are able to access the first two rows. If you want to access the first two columns instead of rows, you need to leave the row position empty, put the comma first, and then specify the list of columns you need to access, that is, 1:2, as shown in this command; we can see in the console output that you are able to access the first two columns of the data frame which you have created. If you want to access the first and second columns without a row index, you can do so by just specifying df[1:2]; this is another way of accessing the columns of a data frame.
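A sketch of these access patterns:

df[1:2, ]  # first and second rows, all columns
df[, 1:2]  # all rows, first and second columns
df[1:2]    # another way to select the first and second columns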
(Refer Slide Time: 05:58)
Sometimes you will be interested in selecting a subset of the data frame based on certain conditions. The way you do this is that you should have the conditions based on which you want to select from the data frame, and you should also have a data frame; once you have them, you can use the command subset to get the subset of the data frame.
Let us illustrate this with an example. We are going to create a data frame named pd, using the first line, which has name, month, blood sugar and blood pressure as the columns. In name we have Senthil and Sam, in month we have Jan and February, in blood sugar we have a vector of blood sugar values and in blood pressure we have a vector of blood pressure values. You can print the data frame and see how it looks.
From that data frame, what I want to extract is a subset where the name is Senthil or the blood sugar value is greater than 150. Now I can print the new data frame, pd2; the result is as shown in the console output here. The original data frame contains all the entries, but the new data frame pd2 selects these entries: the first entry is selected because the name is Senthil, the second entry is selected because the name is Senthil, and the third entry is selected because the blood sugar value is greater than 150.
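A sketch of the subset example; the numeric values here are assumed for illustration:

pd <- data.frame(name = c("Senthil", "Senthil", "Sam", "Sam"),
                 month = c("Jan", "Feb", "Jan", "Feb"),
                 BS = c(141, 152, 160, 135),     # blood sugar (values assumed)
                 BP = c(90, 95, 81, 89))         # blood pressure (values assumed)
pd2 <- subset(pd, name == "Senthil" | BS > 150)  # name is Senthil OR blood sugar above 150
print(pd2)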
(Refer Slide Time: 07:30)
Now we will see how to edit data frames; much like lists, you can edit data frames by direct assignment. We have seen this data frame earlier: we have vec1, vec2 and vec3 containing the elements in them, and we have created a data frame using this command. We can also print that data frame. Now, if I want to change the second entry in vec2 to R instead of Scilab, I can achieve that by using this command: I am accessing, and replacing, the element in the second row, second column with the string R. When you execute the command df[2, 2] <- "R", it replaces the entry Scilab with R, as shown in the results.
You can see that Scilab has been replaced with R; this is how you can edit a data frame by direct assignment.
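The assignment looks like this:

df[2, 2] <- "R"  # replace the entry in row 2, column 2 ("Scilab") with "R"
print(df)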
(Refer Slide Time: 08:38)
Next, we see how to edit a data frame using the edit command. What you need to do for this is to create an instance of a data frame; for example, you can see that I am creating an instance of a data frame and naming it mytable by using the command data.frame(). This creates an empty data frame, and I can use the edit command to edit the entries in my data frame. To do that, I am assigning whatever is edited back to the data frame mytable. So, when I execute this command it pops up a window, where I can fill in the details I want; when I close it, the data is saved as a data frame named mytable.
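A sketch of the edit workflow:

mytable <- data.frame()   # create an empty data frame
mytable <- edit(mytable)  # opens a spreadsheet-like editor; closing it saves the edits into mytable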
Next, we will see how to add extra rows and columns to a data frame. We will continue with the same example; now let us say we want to add another row to the data frame which we created earlier.
(Refer Slide Time: 09:45)
So, we can add an extra row using the command rbind, and to add an extra column we use the command cbind. Let us see how we can add an extra row using the rbind command. The syntax is rbind, the data frame to which you want to add, and the entries for the new row. You have to be careful when using rbind, because the data types in each column entry should match the data types of the rows that already exist.
So, we are creating another row entry in which, in column 1, that is vec1, we have the numeric value 4; in column 2 we have a character value, C; and in column 3 we have a character value, for scale up. This command adds the row to the data frame, and when you print the data frame you can see that the row has been added to the original data frame.
Now, let us see how to add a column; adding a column is simple and can be done using the cbind command. The syntax is cbind, the original data frame, and the entries for the new column. Now I am going to add a new column called vec4, which contains the entries 10, 20, 30, 40; when you add this new column and print the new data frame, you can see that vec4 has been added to the existing data frame.
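A sketch of adding the row and the column described above (the exact strings for the new row are assumed):

df <- rbind(df, data.frame(vec1 = 4, vec2 = "C", vec3 = "For scale up"))  # data types must match existing columns
vec4 <- c(10, 20, 30, 40)
df <- cbind(df, vec4)  # add vec4 as a new column
print(df)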
(Refer Slide Time: 11:18)
Now we will see how to delete rows and columns in a data frame. There are several ways to delete rows and columns; we will see some of them. To delete a row or a column, you need to access that row or column first and then insert a negative sign before its index; this indicates that you want to delete it. Let us see the example here: from the data frame we have, if we want to delete the third row and the first column, that can be done using this command. I want to delete the third row, so I choose the third row and insert a negative symbol before it.
Similarly, I want to delete column one, so I choose that column and insert a negative symbol before it, and I assign the result to a new data frame, df2. Now, when I print df2, you can see in the results that we no longer have the column vec1 and we no longer have the third row, which is what we expected to happen. We can also do conditional deletion of rows and columns; as we will see, this command deletes the column vec3 from the data frame we have created.
The explanation goes as follows: we have a data frame and we want to access all the rows; for the columns, we want to access those columns whose name is not vec3, that is, we want to access vec1, vec2 and vec4. The exclamation symbol says no to the columns that have the column name vec3, and I am assigning the result to data frame df3. When I print df3, you can see that there is no column vec3 in the data frame, which is what we were looking for. You can also delete the rows where we have an entry 3, using this command.
What you are saying here is: access those rows where the element in vec1 is not equal to 3, and access all the columns. We assign that to the data frame df4, and when you print df4 we can see that the row which had the entry 3 has been deleted from the data frame.
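A sketch of these deletions:

df2 <- df[-3, -1]                      # drop the third row and the first column
df3 <- df[, !(names(df) %in% "vec3")]  # drop the column named vec3
df4 <- df[df$vec1 != 3, ]              # drop rows where vec1 equals 3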
Now we will see how manipulating rows in a data frame can run into what is called the factor issue. R has an inbuilt characteristic of assigning data types to the data you enter. When you enter numeric variables, it knows all the numeric values that are available; when you enter character variables, it takes whatever character values you give as categories, or factor levels.
And it assumes that these are the only factor levels that are available. So, when you want to change the element in the third row, third column to "others", what happens is that R displays a warning message saying that this "others" categorical value is not an available level, and it replaces it with NA. You can notice that in the place where we want "others" to be, we have NA, and we can also see the use of the word factor in the warning message. How to get rid of the factor issue is the question now.
(Refer Slide Time: 14:43)
New entries you make in R should be consistent with the factor levels that are already defined; if not, these warning messages are printed. If you do not want this issue to happen, then while defining the data frame itself you need to pass another argument, which says stringsAsFactors = FALSE. By default this argument is TRUE; that is the reason why you get this warning message when you want to change a string entry to a new string value.
Now try doing the same manipulation: change the third row, third element to "others" and print the data frame; you can see that there is no NA anymore and we have achieved what we want.
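A sketch of the fix:

df <- data.frame(vec1, vec2, vec3, stringsAsFactors = FALSE)  # keep strings as characters, not factors
df[3, 3] <- "others"                                          # now succeeds without introducing NA
print(df)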
In this lecture we have seen how to create data frames, how to access rows and columns of a data frame, how to delete rows and columns of a data frame, and so on.
In the next lecture we are going to see some other operations that can be done on data
frames.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 06
Recasting and joining of dataframes
Welcome to lecture 5 in the R module of the course Data Science for Engineers. In the previous lectures, we have seen how to create data frames, how to access rows and columns of data frames, how to add rows and columns to an existing data frame, and so on. Here we will look at more sophisticated operations on data frames, such as recasting and joining of data frames.
In this lecture we are going to first define what recasting of a data frame means, why one needs to recast data frames, how recasting can be done in two steps using the melt and cast commands, how recasting can be done in a single step using the recast command, and how to join two data frames using the left join, right join and inner join functions of the dplyr package in R.
(Refer Slide Time: 01:10)
Let us first see what recasting of data frames means. Recasting is a process of manipulating a data frame in terms of its variables. Why does one want to recast data frames? The answer is that recasting helps in reshaping the data, which can bring more insight into the data when it is seen from a different perspective. Let us take the data frame created in the last lecture: we have the data frame named pd, which has the columns name, month, blood sugar and blood pressure. Now, you want to convert this data frame into the other form shown below, where blood sugar and blood pressure are the variables of importance to you; this involves an operation called recasting, which is demonstrated using an example here.
(Refer Slide Time: 02:01)
So, in order to do recasting, we have to have a data frame, which is the one shown on the screen. To create this data frame, you can use the code that is displayed on the screen; when you execute it, you will see the data frame shown here. Since we have the data frame now, we can see how to recast the existing data frame into the form we want.
Let us see an example demonstrating how to recast the data frame into another form using two steps: the first is melt and the second is cast. This is the data frame you have; when you want to use the melt and cast commands to recast the data frame, you need to identify what are called the identifier variables and the measurement variables of your data frame. The rules for identifying these variables are: most discrete-type variables can be identifier variables, and most numeric variables can be measurement variables; there are also certain rules for the measurement variables, such as that categorical and date variables cannot be measurement variables. So, the key idea is that from the data frame you have to identify the identifier variables and the measurement variables. Once you have identified them, you are ready to do the melt operation, which we are going to see now.
The melt command is available in the reshape2 library; this is the first time we are loading another library to perform some operations. In the pre-course material we have shown you how to install packages, and the library command helps you load packages that are already installed. This is the syntax of the melt command: you give the data frame as the first argument, you specify the identifier variables in your data frame, and you also specify the measurement variables in your data frame; variable and value are the default variable.name and value.name arguments used when the melt command is executed. To do this melting operation, we can use this code; you first have to load the library reshape2, which contains the functions melt and cast.
So, in the syntax we have seen for melt, you pass the data frame which you want to melt, and you also specify the identifier variables and the measurement variables of the data frame you pass. Once you do this, the initial data frame becomes like this: since name and month were given as id variables, they remain as they are, and the measurement variables BS and BP are now stored under a column named variable, as you can see here, and their values are stored in another column, which is named value.
So, when the melt command is executed, it takes the id variables and keeps them as is, converts the measurement variables into one single column, called variable, and stores the values of those variables in another column named value; that is what variable.name = "variable" and value.name = "value" mean in this syntax.
So, the column which carries the measurement variables is named variable, and the one which holds the values of the measurement variables is named value. This is the first step: identify the identifier variables and measurement variables of the data frame and use the melt command to melt the data frame into this structure.
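A sketch of the melt step, using the pd data frame with columns name, month, BS and BP:

library(reshape2)
df <- melt(pd, id.vars = c("name", "month"),
           measure.vars = c("BS", "BP"),
           variable.name = "variable", value.name = "value")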
(Refer Slide Time: 05:44)
The next step is the cast. Since we are using a data frame here, we use the function dcast; this dcast function is also available in the reshape2 library. The syntax for dcast is as follows: the dcast command takes in the data frame which you want to dcast, a formula, which we will explain for this case, and value.var, where you specify the column from which the values are to be taken when dcasting.
Let us see the example for our case. Here you have the data frame df, which you have already melted. Now, you are creating another data frame, df2, using the dcast command: this is the data frame you are passing, that is df, and this is the formula. What does it say? I want to keep variable and month as constant, because you want blood sugar and blood pressure to be your variables of importance, and then you convert the name variable into two columns, or however many columns there are, depending on the number of categories in name.
That is what this formula explains: the columns variable and month remain as they are, and the categories in name become new variables. We have two categories in this example, Sam and Senthil, and they become the new columns, that is, the new variables, and the values for those variables are picked from the column value; that is what value.var indicates. Once this operation is done, if you print the data frame, this is how you will get the data frame in your required format.
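A sketch of the cast step:

df2 <- dcast(df, variable + month ~ name, value.var = "value")  # Senthil and Sam become new columns
print(df2)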
(Refer Slide Time: 07:15)
So, you have this melted data frame; when you apply the dcast function, you pass in the data frame and you say that variable and month are the ones which you want to keep constant, on the left side of the formula, while to the right of the formula you have name. In this name column we have two categories, Sam and Senthil, and those are created as two new columns, and the values for those columns are taken from the value column of the melted data frame; that is how the cast command works.
Now, let us see how to do this recasting in a single step. Recasting can be performed in a single step using the recast function. The syntax is as follows: for recast, you give the data and the formula, and you also give the id variables and measurement variables. If you look at these input arguments, it takes the input arguments of both melt and cast, as you can see in the command here.
So, the recast command takes the data frame, and it also takes the formula; this is the parameter that refers to the cast section of the command, and this is the parameter that refers to the melt section of the command. As we have seen for melt, we have to specify the id variables and the measurement variables; when you specify only the id variables, the rest of the variables are by default taken as the measurement variables. That is why we did not specify measurement variables here; you can also specify the measurement variables, as we can see from the syntax. Now, when you execute this command, it melts and also casts, and it prints the cast data frame as shown on the screen below. With this, we can see that the melt and cast operations can be done together using the recast command.
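A sketch of the single-step version:

df2 <- recast(pd, variable + month ~ name, id.var = c("name", "month"))  # melt and cast in one call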
Next, we see how to create a new variable that is a function of an already existing variable, using the mutate command. Sometimes it is useful to have a transformed variable, or a function of a variable, created from the existing variables. In this case, let us assume the logarithm of the BP value is something which gives us more insight about the data. How do you create a new variable which carries the logarithm of the blood pressure from the existing blood pressure values? That is the question.
So now, the way to do that is to load the library dplyr; you use the mutate command, you pass the data frame, and you say that you want to create a new column which carries the logarithm of the existing column BP. Now, if you print this pd2, you can see that another variable, the logarithm of BP, has been created, and you have its corresponding values.
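A sketch of the mutate call; logBP is an assumed name for the new column:

library(dplyr)
pd2 <- mutate(pd, logBP = log(BP))  # new column holding the logarithm of the existing BP column
print(pd2)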
Now, let us look at how to join two data frames. This is very important in data analysis, because you will get part of the data from one source and part of the data from another source; when you want to match these two datasets, which have some common IDs, how do you do it? That is the question.
(Refer Slide Time: 10:33)
So, this combining of data frames can be done using the dplyr package. The general syntax is as follows: you need a function, which could be left join, right join, inner join and so on, and you pass the first data frame and the second data frame, because you want to join these two data frames, and you specify by which ID variable you want to join them.
Here, the ID variable is common to both data frames; that means you have to have that variable in both data frames which you want to combine. This variable provides the identifiers for combining the two data frames, and the nature of the combination depends upon the function that is being used. We will see some examples.
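The general pattern is sketched below; df1, df2 and "ID" are placeholders:

library(dplyr)
combined <- left_join(df1, df2, by = "ID")  # or right_join(), inner_join(), ...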
11:23) example let us see this one, we have one data frame which carries I d name and
age. We have one and 2 as I ds here, name as Jack and Jill whose ages are 10 and 12 at a
suppose, we have another data frame which has his Ids in the reverse order, Id2 Id1 and
gender is girl and boy and this is output you want to get, let us say you want to merge
these 2 data frames using some function either left join or right join are something.
So, that you will get the data frame which contains information in both the individual
data frames for example, you can see the I ds are the common variables or the identifiers
variables that are common to both data frames and we are using this Id variable, to
combine this 2 data frames 1 and 2. So, we have 1 Jack and for 1 we have boy and we
have age of Jack as 10 and that is also been taken care, and you have Jill and the Id
59
variable of Jill is 2. So, we will have 2 Jill age and the gender this is one example, how
the merging and combining the data frames happens? Now, let us look deep into the
different functions, that available in the dplyr package to combine two data frames.
There are several functions available in the dplyr package to combine data frames; a few of them are left join, right join and inner join, and there are also full join, semi join and anti join. In this lecture we are going to see the first three: left join, right join and inner join. We will leave it as an exercise for the audience to understand what full join, semi join and anti join do when combining data frames.
60
(Refer Slide Time: 13:08)
Let us illustrate joining of data frames by creating two data frames first. Let us first create the data frame pd; this can be created using the code shown here, and when you print it you can see the output as shown. We have two names, Senthil and Sam, two months, Jan and February, and the blood sugar and blood pressure values for those names. Now, we take another data frame which contains three names, Senthil, Ramesh and Sam, and whose other column carries the department where each of them works: Senthil and Sam work in PSE and Ramesh works in data analytics. To create this data frame you can use the code shown, and when you print it you can see the result in the console as shown below. We have now created two data frames, pd
and pd_new.
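A minimal sketch of these two data frames; the blood sugar (BS) and blood pressure (BP) values are illustrative, only the names, months and departments follow the lecture:

    pd <- data.frame(name  = c("Senthil", "Senthil", "Sam", "Sam"),
                     month = c("Jan", "Feb", "Jan", "Feb"),
                     BS    = c(141, 135, 152, 146),
                     BP    = c(90, 88, 96, 94))

    pd_new <- data.frame(name       = c("Senthil", "Ramesh", "Sam"),
                         department = c("PSE", "Data Analytics", "PSE"))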
61
Let us look at how left join works. When you want to combine the two data frames pd and pd_new, a left join joins matching rows of data frame 2 to data frame 1, based on the Id variable. From the syntax we can see that the function takes data frame 1 and data frame 2, and you specify the Id variable as the last argument. If you left join data frame 1, which is pd, and data frame 2, which is pd_new, the reference is data frame 1, pd. The join then looks at the rows of data frame 2, which has Senthil, Ramesh and Sam, and checks which of them match the name variable in data frame 1. Essentially it keeps only Senthil and Sam and drops Ramesh, because it takes only the matching rows from data frame 2 into data frame 1.
We will see this in the example: only two Ids in pd_new correspond to values in pd, and those will be merged with pd, so the variable department will be added to the final data frame only for Senthil and Sam.
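A minimal sketch, using the pd and pd_new data frames sketched above:

    library(dplyr)

    # pd is the reference; Ramesh is dropped and the department column
    # is added only for Senthil and Sam
    pd_left_join1 <- left_join(pd, pd_new, by = "name")
    print(pd_left_join1)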
62
(Refer Slide Time: 15:23)
Let us see this in detail. You have the two data frames, you load the library dplyr, and you do a left join, naming the new data frame that comes out of this left join operation pd_left_join1. I use the command left_join, pass in data frame 1 and data frame 2, and then say that I want to join the two data frames by the variable name.
When you specify that you want to join by name, the left join takes pd as the reference, looks for the names that are common to both pd and pd_new, takes the data from pd_new, merges it with pd, and creates another data frame, which is given the name pd_left_join1. So, data frame 1, pd, contains Senthil and Sam; the join looks for Senthil and Sam in data frame 2 and merges the extra information available for these names into the existing data frame as another column, department. The department of Senthil is PSE and the department of Sam is also PSE. In other words, it retains pd as it is and adds the corresponding piece of information coming from data frame 2.
63
(Refer Slide Time: 17:00)
Now, let us look at the right join similarly. What a right join does is join matching rows of data frame 1 to data frame 2, based on the Id variable. Say you have data frame 1, which is pd, and data frame 2, which is pd_new; you can do the right join using the right_join command, passing data frame 1 and data frame 2, here pd as data frame 1 and pd_new as data frame 2. What it does is take pd_new as the reference data frame, and for the rows present in pd_new it looks for a match in pd. Senthil and Sam have matches in pd, and it also keeps Ramesh, because the reference is now pd_new. So, you will have Senthil, Ramesh and Sam, but for Ramesh there are no month, blood sugar or blood pressure values, and these are filled with NA when the matching operation is done.
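A minimal sketch of the same operation, continuing from the earlier snippets:

    # pd_new is the reference: all three names are kept and
    # Ramesh gets NA for month, BS and BP
    pd_right_join1 <- right_join(pd, pd_new, by = "name")
    print(pd_right_join1)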
64
(Refer Slide Time: 18:10)
You can change the order in which you pass the data frames. If you change the order and pass data frame 1 as pd_new and data frame 2 as pd, you can observe that the output is similar to the left join, because the reference data frame is now pd; even though you are doing a right join operation, the result looks like the left join because pd is the reference of the right join here.
To summarize, left join and right join can be used interchangeably, but depending upon the order in which you pass the data frames, the matching operations will look either similar or different. So, you have to be careful when passing the arguments to the left and right join commands.
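For instance, continuing the sketch above:

    # swapping the arguments makes right_join behave like the earlier left join,
    # because pd is now the reference data frame
    pd_right_join2 <- right_join(pd_new, pd, by = "name")
    print(pd_right_join2)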
65
(Refer Slide Time: 19:09)
Now, let us see what inner join does. An inner join merges the data frames and retains only those rows whose Ids are present in both data frames. You have data frame 1, which is pd_new, and data frame 2, which is pd; when I pass these two data frames as arguments to the inner_join function and match them by name, it looks for the rows whose Ids are present in both data frames. In these two data frames Senthil and Sam are present in both, so only the data corresponding to Senthil and Sam is printed, because Ramesh is not available in data frame 2.
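A minimal sketch:

    # keep only the rows whose name appears in both data frames
    pd_inner_join1 <- inner_join(pd_new, pd, by = "name")
    print(pd_inner_join1)   # only Senthil and Sam survive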
66
So, we have seen left join, right join and inner join. We leave it as an exercise for the viewer to understand how full join, semi join and anti join work. To summarize, in this lecture we have seen how to recast data frames and how to combine two data frames using the dplyr package. In the next lecture we are going to see how to do arithmetic, logical and matrix operations in R.
Thank you.
67
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 07
Arithmetic, Logical and Matrix operations in R
Welcome to lecture 6 of the R module in the course Data Science for Engineers. In the previous lectures we have seen the various data types of R, how to access or delete the elements of the different data types, and so on. Now, it is time to see how to perform arithmetic, logical and matrix operations in R.
In this lecture we are going to see how to do arithmetic operations, logical operations and
matrix operations in R.
68
(Refer Slide Time: 00:50)
R supports all the basic arithmetic operations. The first one is the assignment operator: you can use either = or the arrow <- to assign a value to a variable. Standard addition, subtraction, multiplication, division, integer division and remainder operations are also available in R. By convention, <- is the preferred assignment operator in R scripts, although in practice, in both R and RStudio, = and <- can be used for assignment.
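A quick sketch of these operators at the console:

    a <- 7      # assignment with <-
    b = 3       # assignment with = also works
    a + b       # addition: 10
    a - b       # subtraction: 4
    a * b       # multiplication: 21
    a / b       # division: 2.333333
    a %/% b     # integer division: 2
    a %% b      # remainder: 1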
69
(Refer Slide Time: 01:26)
Let us look at the hierarchy of operations while performing arithmetic operations in R. It is similar to the usual BODMAS rule: brackets have the first priority, exponents the second, followed by division and multiplication (evaluated left to right), and then addition and subtraction. For your understanding, you can type in the expression a = 7 + 4 - 27 / 3 ^ 2 * 2 and see what the value of a would be. To work through the order of precedence: there are no brackets here, so the exponent goes first and 3 ^ 2 evaluates to 9; the next operation is division, 27 / 9, which gives 3; then multiplication, 3 times 2, which is 6. Because of the minus sign in front of this term, it contributes -6; finally 7 + 4 is 11, and 11 - 6 gives the value of a as 5.
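The expression, reconstructed from the steps described above:

    a <- 7 + 4 - 27 / 3 ^ 2 * 2
    print(a)   # 3^2 = 9, 27/9 = 3, 3*2 = 6, 7 + 4 - 6 = 5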
70
(Refer Slide Time: 02:35)
Next we move on to the logical operations in R. We have the standard comparison operators such as <, <=, >, >=, ==, != and so on.
In the examples you can see that if you ask whether 2 is greater than 3, it returns the value FALSE, because the statement 2 > 3 is not true. Similarly, if you test 2 == 3 it also says FALSE, because 2 is not equal to 3. When you execute the command 2 != 3 it gives the answer TRUE, because 2 is indeed not equal to 3. This is the summary of the logical operations that can be
performed in R.
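The same checks at the console:

    2 > 3     # FALSE
    2 == 3    # FALSE
    2 != 3    # TRUE
    2 <= 3    # TRUE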
71
(Refer Slide Time: 03:21)
Next we move to an important class of operations that are needed for data analysis problems. Most data will be treated as matrices, so matrix operations play a key role in solving data analysis problems.
Let us first define what matrices are. A matrix is a rectangular arrangement of numbers in rows and columns; as we know, rows are the ones which run horizontally and columns are the ones which run vertically. Here are some examples of matrices: this matrix has 3 rows and 3 columns, this matrix has 3 rows and 1 column, and this one has 1 row and 3 columns.
Now, let us see how to create matrices in R. To create a matrix in R you use the function matrix. The arguments to matrix are the set of elements that should become the elements of the matrix, the number of rows you want, the number of columns you want, and, importantly, byrow. By default R arranges the elements you have entered in a column-wise fashion; if you want the given elements to be entered row-wise, you have to set byrow to TRUE, since the default for byrow is FALSE.
Now that we have seen what is involved in creating a matrix, let us create a matrix with the elements 1 to 9, containing 3 rows and 3 columns, filling the elements in a row-wise fashion. This is the command which does this, and in the output you can see 1 2 3, 4 5 6, 7 8 9 filled row-wise.
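The command described above:

    # a 3 x 3 matrix with the elements 1 to 9, filled row-wise
    A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
    print(A)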
73
(Refer Slide Time: 05:10)
Now, let us see how to create some special matrices in R. The first one is a constant matrix, in which all the rows and columns are filled with a single constant k. We need to specify the value, here 3, the number of rows and the number of columns; to fill all rows and columns of a matrix with 3 rows and 4 columns with the element 3, you specify 3, 3 and 4, and when you do that the matrix is printed as shown.
So, the command is matrix, with the element you want in all rows and columns, followed by the number of rows and the number of columns. Next we see how to create a diagonal matrix. The inputs you give are the elements you want on the diagonal and the dimension of the matrix. The command is diag, with a vector of the elements you want as diagonal elements and the number of rows and columns. In this example we want 4, 5 and 6 as the diagonal elements of a 3 by 3 matrix; using this command you can see that 4, 5 and 6 are the diagonal elements and the rest of the elements are zero.
How do you create an identity matrix? You can create an identity matrix with the same diag command, with the value on the diagonal set to 1. If you want a 3 by 3 identity matrix, you specify the number of rows as 3 and the number of columns as 3, and it puts 1 on the diagonal with all other elements 0.
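Sketches of these three commands:

    K <- matrix(3, nrow = 3, ncol = 4)         # constant matrix, every entry 3
    D <- diag(c(4, 5, 6), nrow = 3, ncol = 3)  # diagonal matrix with 4, 5, 6
    I <- diag(1, nrow = 3, ncol = 3)           # 3 x 3 identity matrix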
74
(Refer Slide Time: 07:00)
Next we move on to the basic properties of a matrix. Once a matrix is created, how can you know the dimension of the matrix, how many rows it has, how many columns it has, and how many elements it contains? These are the questions we generally want to answer.
We can use the following commands to know all of this. dim(A) returns the size of the matrix, that is, whether it is 3 by 3 or 4 by 5 and so on; nrow(A) returns the number of rows and ncol(A) returns the number of columns. Either length(A) or the product of the dimensions of A returns the number of elements in the matrix. For the matrix A created by this command, dim(A) gives 3 by 3 because it contains 3 rows and 3 columns, the number of rows is 3, the number of columns is 3, and the
number of elements present in the matrix is 9.
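The same queries at the console:

    A <- matrix(1:9, nrow = 3, ncol = 3)
    dim(A)         # 3 3
    nrow(A)        # 3
    ncol(A)        # 3
    length(A)      # 9
    prod(dim(A))   # 9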
We can access, edit and delete elements in matrices using the same convention that is followed for data frames. You write the matrix name followed by a square bracket with a comma inside: the index or vector of values before the comma is used to access rows, and the index or vector after the comma is used to access columns. If you want to remove some rows or columns, you add a negative sign before the corresponding indices, and you can also assign strings as names of rows and columns using the commands rownames and colnames.
Here we have created a 3 by 3 matrix A with the elements 1, 2, 3, 4, 5, 6, 8, 9, 1, filled row-wise, and we name the columns a, b, c and the rows d, e, f. Once you do that and print A, you can see that the columns are named a, b and c; similarly, row 1 is named d, row 2 is named e and row 3 is
named as f.
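A sketch of this naming step:

    A <- matrix(c(1, 2, 3, 4, 5, 6, 8, 9, 1), nrow = 3, ncol = 3, byrow = TRUE)
    colnames(A) <- c("a", "b", "c")
    rownames(A) <- c("d", "e", "f")
    print(A)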
76
Now, suppose you want to access the first two columns. You can use the same convention as for data frames: A with a square bracket, nothing before the comma, and the range 1 to 2 after the comma to access the first two columns of A.
You can also access columns using the names of the columns, as we have seen for data frames. If you want to access columns a and c, that is, columns 1 and 3, you can do so by specifying the names of the columns. Similarly, you can access rows using the names of the rows: to access the first and third rows, which have the names d and f, you access rows d and f and all the columns. The output is shown here.
If you want to access a single entry of a matrix, you use the same convention. For example, the element in the first row and second column is fetched with A[1, 2], which gives the output 2; if you want to access the element 6, which is in the second row and third column, you say A[2, 3], which gives the output 6. As we have seen earlier, the part before the comma refers to the row number and the part after the comma should
refer to the column number.
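The accesses described in the last two paragraphs, in one sketch:

    A <- matrix(c(1, 2, 3, 4, 5, 6, 8, 9, 1), nrow = 3, byrow = TRUE,
                dimnames = list(c("d", "e", "f"), c("a", "b", "c")))
    A[ , 1:2]            # first two columns
    A[ , c("a", "c")]    # columns a and c by name
    A[c("d", "f"), ]     # rows d and f by name
    A[1, 2]              # element in row 1, column 2: 2
    A[2, 3]              # element in row 2, column 3: 6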
77
(Refer Slide Time: 11:12)
Now, let us see how to access a whole column of a matrix. You specify the column index you want and leave the row index unspecified; this means you are accessing all the row elements of that column. For example, to access the first column of the matrix A, you write A with nothing before the comma and 1 after it, which gives the output 1 4 7.
Similar to accessing a column, we can access a row of a matrix: you specify the row you want and leave the column index empty, which says access all the columns. If you want to access row 2, you specify the row index as 2 and leave the column index empty, so that all the columns of row two are printed and you will be able to access 4 5 6.
For you to think about how do you access the last row. Can you do something like this?
You figure out by trying on your own.
Next we will see how to access everything but one column. In this matrix I want to access the part 1 4 7 and 3 6 9; I do not want the second column to be in the matrix I am accessing.
What I have to do is eliminate that column from the matrix. You can do so by putting a negative sign before the index of the second column: you say all the rows, and take the second column off. If I assign it back to A, I get A as 1 4 7 and 3 6 9, or if you just print A[ , -2] it gives the desired result, which is 1 4 7 and 3 6 9.
79
(Refer Slide Time: 13:09)
Similar to what you have seen on the earlier slide, you can also access everything but one row. For example, if you want to access all of A except the second row, you take the second row off and keep all the columns; when you run this command, 1 2 3 and 7 8 9 are printed as the output.
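A combined sketch of these row and column accesses:

    A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
    A[ , 1]    # first column: 1 4 7
    A[2, ]     # second row: 4 5 6
    A[ , -2]   # everything except the second column
    A[-2, ]    # everything except the second row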
As an exercise on accessing elements of a matrix, you can try solving the problems that are given.
80
(Refer Slide Time: 13:46)
Now, we will introduce what is called the colon operator. The colon operator is used to create a sequence of equally spaced elements; for example, if I type 1:10 it creates the numbers from 1 to 10 with a gap of 1. I can also reverse the order, 10:1, and it prints from 10 to 1 with a gap of 1. Why is this colon operator important? You may have noticed that something similar was used while accessing rows and columns in the previous slides. Let us see how.
81
For example, if you want to select a part of a matrix, a sub matrix, you can use this colon operator. Suppose I want to access the first 3 rows and the first 2 columns of this matrix: I access rows 1 to 3 and columns 1 to 2. You can see that the colon operator helps us in accessing sub matrices of a matrix.
In the next example, I access all 3 rows and drop the third column; this is the same operation, done in a different fashion. You can also say that you want all the rows, but only from the first two columns. So, you can access a sub matrix in different ways, depending on what you are comfortable with.
Here is another example of accessing sub matrices. If I want to access the elements 1, 2 and 7, 8 and have them as a separate sub matrix, how do I do this? I want to access rows 1 and 3, and the columns I need are columns 1 and 2; so I say, in columns 1 and 2, access the elements which are in row 1 and row 3, and that gives me the sub matrix. You can also use the concatenation operator for both arguments, as shown here, with c(1, 3) for the rows and c(1, 2) for the columns, which gives
you the desired result.
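The sub matrix accesses described above:

    A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
    A[1:3, 1:2]            # first three rows, first two columns
    A[ , -3]               # the same sub matrix, by dropping column 3
    A[ , 1:2]              # all rows, first two columns
    A[c(1, 3), c(1, 2)]    # rows 1 and 3, columns 1 and 2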
82
(Refer Slide Time: 16:00)
Matrices can be concatenated row-wise using the rbind command and column-wise using the cbind command; you have to check the consistency of dimensions before you do this matrix concatenation. Let us illustrate how rbind works.
Suppose we have a matrix A and a matrix B, and you want to concatenate the matrix B as a row of matrix A. That can be done using the rbind command shown here: I concatenate matrix B to the matrix A and assign the result to the variable C. When you run this command, you can see that the matrix C has the row 10 11 12, which is the matrix B, concatenated to the matrix A.
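A minimal sketch, using the values described for B:

    A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
    B <- matrix(c(10, 11, 12), nrow = 1)   # a 1 x 3 matrix
    C <- rbind(A, B)                       # append B as a new row of A
    print(C)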
84
Now, let us see cbind. Say you have this matrix A and a matrix B as shown on the screen, and you want to concatenate B as a column of A. You can do so using the cbind command shown here: you pass the first matrix A and the second matrix B and assign the result to the variable C. When you print C you can see that the matrix B has been concatenated as a column of the matrix A.
Now, let us try to concatenate this other B to the matrix A using cbind. What would you expect? We expect an error, because A has dimension 3 by 3 but B is 1 by 3. For a column bind the dimension of matrix B would have to be 3 by 1, but it is 1 by 3, which is inconsistent; that is why you get the error from cbind that the number of rows of the matrices must match.
85
(Refer Slide Time: 18:26)
Now, if you want to resolve this dimension inconsistency, you have to transpose B so that it becomes 3 by 1; since A has 3 rows, you can then easily do the column bind using the cbind command, cbind(A, B), and assign it to C. You can see that the cbind succeeds and B is concatenated as a column of the matrix A.
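A sketch of the failure and the fix:

    A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
    B <- matrix(c(10, 11, 12), nrow = 1)   # 1 x 3: cbind(A, B) would fail
    C <- cbind(A, t(B))                    # transpose to 3 x 1, then bind as a column
    print(C)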
You have already seen how to delete a column: you put a negative sign before the column you want to delete and assign the result back to A, and you will see that the required output is printed.
86
(Refer Slide Time: 19:05)
Similar to what we have seen on the earlier slide, we can also delete a row from the matrix. Suppose we want to delete row 2: you say -2 for the rows, keep all the columns, and assign the result back to A. You can see in the output that row 2 is deleted. Now, let us see how to do algebraic operations on matrices, such as addition, subtraction, multiplication and matrix division, in R.
87
(Refer Slide Time: 19:35)
Suppose we have two matrices A and B as shown here. Matrix addition is straightforward: you say A + B and you get the output, 1 + 3 is 4, 2 + 1 is 3, 3 + 3 is 6, and so on. You see that the operation happens element-wise, which is what normal matrix addition is about.
You can do subtraction in the same way. Multiplication is a little trickier: when you write A * B with the asterisk, R performs element-wise multiplication, 1 times 3 is 3, 2 times 1 is 2, 3 times 3 is 9. If you want regular matrix multiplication, you have to put a percentage symbol before and after the asterisk, that is %*%, which performs the usual matrix product.
88
(Refer Slide Time: 20:26)
Now, let us look at matrix division. Say you have two matrices A and B with the elements 4, 9, 16, 25 and 2, 3, 4, 5 respectively. If you do A / B, R performs element-wise division, not multiplication by the inverse of a matrix: 4 by 2 is 2, 9 by 3 is 3, 16 by 4 is 4, and 25 by 5 is 5.
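A sketch of these algebraic operations, using the matrices from the division example (the 2 by 2, row-wise arrangement is an assumption):

    A <- matrix(c(4, 9, 16, 25), nrow = 2, byrow = TRUE)
    B <- matrix(c(2, 3, 4, 5),   nrow = 2, byrow = TRUE)
    A + B      # element-wise addition
    A - B      # element-wise subtraction
    A * B      # element-wise multiplication
    A %*% B    # regular matrix multiplication
    A / B      # element-wise division: 2 3 4 5 (not inverse(B) times A)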
In this video we have seen how to do arithmetic, logical and matrix operations in R. In the next lecture we are going to discuss how to write functions in R, how to invoke them, and how to use them to perform the tasks we want.
Thank you.
89
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 08
Advanced programming in R: Functions
Welcome to the lecture 7 in the R module of the course Data Science for Engineers.
In this lecture we are going to introduce you to the functions in R. We are going to
explain how to load or source the functions and how to call or invoke the functions.
90
(Refer Slide Time: 00:32)
Functions are useful when you want to perform certain tasks many times. A function accepts input arguments and produces output by executing the valid R commands inside it.
In R, the function name and the file in which you create the function need not be the same, and you can have one or more function definitions in a single R file. Functions are created in R using the keyword function. The general structure is: f <- function(arguments) { statements }. Here f is the function name; writing this command means that you are creating a function with name f which takes certain arguments and executes the statements that follow.
91
(Refer Slide Time: 01:31)
Let us see how to create a function file. Creating a function file is similar to opening an R script, which we have already seen: you can either use the File button in the toolbar or the + button just below the File tab to create an R script, and once you create it you can save it with whatever name you want.
For example, we have saved the R file as vol_cylinder. Once you save it, you are ready to write functions. I want to create a function which calculates the volume of a cylinder and takes as arguments the diameter and the length. To create a function by the name vol_cylinder, I name the function vol_cylinder, and the arguments to be passed are the diameter of the cylinder and the length of the cylinder.
If you notice, we are passing the values 5 and 100 as default arguments for this function. Once you have the diameter and the length, you calculate the volume with the formula pi d^2 l / 4, and what you return is the volume variable calculated inside the function. Once you have written the R statements to be executed in the function file, you save the file; here we save it as vol_cylinder.R.
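A sketch of this function file (the exact argument names are illustrative):

    # vol_cylinder.R
    vol_cylinder <- function(dia = 5, length = 100) {
      volume <- pi * dia^2 * length / 4   # volume of a cylinder from its diameter
      return(volume)
    }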
Once you save this, you need to load the function before you can invoke or execute it in R. To load a function, you click on the Source button that is available in the R script menu.
93
(Refer Slide Time: 03:02)
Clicking the source button will not execute the function; it will only load the function file
and make it ready for invoking.
Once you load the function, you can invoke it from the console as follows: you want the volume to be saved in the variable v, so you call the function vol_cylinder with the arguments 5 and 10. This runs the function, calculates the volume and returns it. In the variable browser you can see the value of the volume, and you can also see that the function vol_cylinder is available with two arguments, dia and length.
Now, there are several ways you can pass arguments to the function. Generally, in R the arguments are passed to the function in the same order as in the function definition. If you do not want to follow that order, you can pass the arguments by name, in any order. If arguments are not passed at all, the default values are used to execute the function.
Let us see examples of each of these cases. When you pass the arguments 5 and 10, the first argument is the diameter and the second is the length, according to the definition of the function, so they are taken in that order. When you do not want to follow the order, you can pass the arguments by name in any order; for example, to pass the length first, you can specify length = 10 and dia = 5, and you can see that the result is the same even though the arguments are in a different order.
The point to keep in mind is that you can pass the arguments in any order by specifying their names. When you do not pass any arguments, the function takes the default values of 5 and 100, the default diameter and length, and calculates the volume with those.
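The three ways of calling, in one sketch:

    source("vol_cylinder.R")                  # load the function definition
    v1 <- vol_cylinder(5, 10)                 # positional: dia = 5, length = 10
    v2 <- vol_cylinder(length = 10, dia = 5)  # by name, in any order; same result
    v3 <- vol_cylinder()                      # defaults: dia = 5, length = 100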
95
(Refer Slide Time: 05:26)
In R, functions are executed in a lazy fashion. What lazy means is that if some arguments are missing, the function is still executed as long as the execution does not involve those arguments. Let us see an example. We take the same function, but now with an extra argument, radius, which is not used in the volume calculation.
When you pass the arguments dia and length, the function still executes even though you are not passing radius, because radius is not used in the calculations inside the function. But R is careful: if you do not pass an argument that is actually used in the body of the function, it throws an error saying that the argument is missing and is being used in the function definition.
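A sketch of this lazy-evaluation behaviour:

    # 'radius' is declared but never used in the body, so it can be omitted
    vol_cylinder2 <- function(dia, length, radius) {
      volume <- pi * dia^2 * length / 4
      return(volume)
    }
    vol_cylinder2(5, 10)   # works: radius is missing but unused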
96
(Refer Slide Time: 06:28)
In summary, these are the steps for creating and executing a function file in R. First, open or create a function file by clicking the + symbol or the File tab in the toolbar, and define the function with the function name, the keyword function and the input arguments.
All the statements typed inside the function have to be valid R statements, and you need to save the function file. Before executing, you load the function file using the Source button; once it is loaded, you can invoke or call the function with the right number of inputs, so that the function executes properly and you get the required result.
97
(Refer Slide Time: 07:23)
A final word: you need to save and re-source the function file every time you change something inside the function definition, and also whenever you restart RStudio. If you do not, you will either get an error or not get the outputs you expect, because the loaded version of the function will not match what you have edited. After saving and sourcing the changed file, you can invoke the function as before.
In the next lecture we are going to explain the functions which are having multiple inputs
and multiple outputs.
Thank you.
98
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 09
Advanced programming in R: Functions
Welcome to lecture 8 in the R module of the course Data Science for Engineers. In the
previous lectures we have seen how to create functions, how to execute them, but we
have limited ourselves to the single output.
In this lecture we are going to see functions with multiple inputs and multiple outputs
which we call MIMO how to source and call those functions. We will also see about
inline functions how to loop over objects using the commands apply, lapply and tapply.
99
(Refer Slide Time: 00:53)
Let us see functions with multiple inputs and multiple outputs. Functions in R take multiple input objects but return only one object as output. This is, however, not a limitation, because you can create a list of all the outputs you want; once the list is returned, you can access the individual elements of the list and get the answers you want.
Consider this example: I want to create a function vol_cylinder_mimo which takes the diameter and height of a cylinder and returns the volume and the surface area. Since R can return only one object, I create one object which is a list containing the volume and the surface area, and return that list. Let us see how to do that next.
100
(Refer Slide Time: 01:51)
First you create an R file, as we have seen several times: you can create an R script using the + button or from the File tab. Once you have opened the R script, this is the piece of code that does what we need. We name the function vol_cylinder_mimo because it has multiple inputs and multiple outputs.
The function takes the arguments diameter and length. Earlier we calculated only the volume; now we also want to calculate the surface area, which is given by pi times diameter times length. Since R returns only one object, I first create an object called result, which is a list of the volume and the surface area, with the elements named volume and surface area, and I ask the function to return this one object, result, which contains both the volume and the surface area.
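A sketch of this function (the list element names are illustrative):

    # vol_cylinder_mimo.R
    vol_cylinder_mimo <- function(dia, length) {
      volume       <- pi * dia^2 * length / 4
      surface_area <- pi * dia * length       # lateral surface area
      result <- list(volume = volume, surface_area = surface_area)
      return(result)
    }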
101
(Refer Slide Time: 02:53)
Once you have written the function, you need to load it before calling it; loading can be done using the Source button. Once you source it, you are ready to call the function. I call the function as result = vol_cylinder_mimo(10, 5), passing 10 and 5 as the arguments. The returned object result contains two elements: the first element, volume, contains the volume, and the second element, surface area, contains the surface area. Once the object result is given out by the function, I can access its individual elements using the techniques we learned in the lecture on lists.
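Calling the function and reading out both results:

    source("vol_cylinder_mimo.R")
    result <- vol_cylinder_mimo(10, 5)
    result$volume         # first output
    result$surface_area   # second output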
102
Sometimes creating an R script file, loading it and executing it is a lot of work when you just want a very small function, such as x^2 + 4x + 4, which I want to evaluate for different values of x. Having a function file and then loading and invoking it is a lot of work for this, so in such situations we can use an inline function. To create an inline function, you use the function command with the argument x followed by the expression of the function.
Once you create this, you can call it at the command prompt itself. Calling it with the value 1 evaluates the expression at x = 1: 1 squared is 1, 4 times 1 is 4, and 1 + 4 + 4 gives 9. Similarly, you can get the outputs for the other arguments shown on the screen.
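An inline function for this expression:

    f <- function(x) x^2 + 4 * x + 4
    f(1)   # 9
    f(2)   # 16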
Now we move on to looping over objects. There are a few looping functions that are pretty useful when working interactively on a command line; some examples are apply, lapply, tapply and so on. Let us see what each of these functions does.
The apply function applies a function over the margins of an array or matrix; lapply applies a function over a list or a vector; tapply applies a function over a ragged array; and mapply is a multivariate version of lapply. We will see examples of each of these in the coming slides.
103
(Refer Slide Time: 05:22)
Here is an example of the apply function. What apply does is apply a given function over the margins of a given array; the margin refers to the dimension of the array along which the function should be applied.
Let us create a 3 by 3 matrix A with the elements 1 to 9; you can do that with this command, and when you print A you can see it. Now I want to evaluate the sums across the rows and the sums across the columns. You can use the apply function to do so: the syntax is to take the matrix A, apply across the rows of A (margin 1), and give sum as the function to apply.
What it does is sum the first row, 1 + 4 + 7, which is 12, then print the sum of the elements in the second row, and then the sum of the elements in the third row. You can do the same for the columns by specifying the margin as 2, which says apply the sum function on A across margin 2, the columns. This command prints the column sums, 1 + 2 + 3 is 6, and so on.
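The same computation:

    A <- matrix(1:9, nrow = 3, ncol = 3)   # filled column-wise by default
    apply(A, 1, sum)    # row sums: 12 15 18
    apply(A, 2, sum)    # column sums: 6 15 24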
104
(Refer Slide Time: 06:50)
Next we move to the lapply function, which is used to apply a function over a list; that is where the l comes from. lapply always returns a list of the same length as the input list. The syntax is as follows: you use the command lapply with the list on which the function has to be applied and the function to be applied to each element.
Let us illustrate lapply with an example. Create a 3 by 3 matrix A with the elements 1 to 9 and a 3 by 3 matrix B with the elements 10 to 18. Now I create a list of the matrices A and B using the list command, and I want to evaluate the determinants of the matrices and store them in a list. One way is to calculate the determinant of A, calculate the determinant of B, and put them together in a list; but you can do the same operation easily with the lapply function shown here. I name the variable determinants and use lapply to apply the determinant function on my list; it calculates the determinant of A and stores it in element 1, and the determinant of B and stores it in
element 2 of a list.
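A sketch of this example:

    A <- matrix(1:9,   nrow = 3, ncol = 3)
    B <- matrix(10:18, nrow = 3, ncol = 3)
    mat_list <- list(A, B)
    determinants <- lapply(mat_list, det)   # a list with det(A) and det(B)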
105
(Refer Slide Time: 08:19)
Now, let us move to mapply. mapply is a multivariate version of lapply: the function can be applied over several lists simultaneously. The syntax is mapply with the function you need to apply, followed by list 1 and list 2. We have already seen the function vol_cylinder. Suppose I want to calculate the volume for different diameters and different lengths, which I have as lists: a list of diameters and a list of lengths, and I want to evaluate the volume for each individual pair of diameter and length.
You could take each diameter and length individually and execute the same function, but that is tedious; mapply simplifies this job. You create a variable vol and apply mapply to the vol_cylinder function, since the function comes first, followed by list 1 and list 2, the lists of diameters and lengths. mapply takes each pair, calculates the volume, and returns the volumes for the two lists of diameters and lengths that are given.
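A sketch, assuming the vol_cylinder function defined earlier has been sourced; the list values are illustrative:

    dia <- list(5, 10, 15)
    len <- list(10, 20, 30)
    vol <- mapply(vol_cylinder, dia, len)   # volume for each (dia, length) pair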
106
(Refer Slide Time: 09:40)
Next we move on to the tapply function. tapply is used to apply a function over subsets of a vector given by a combination of factors. The syntax is tapply with a vector, a factor of the same length, and the function. To illustrate tapply, let us use this example: I create a vector called Id using the concatenation operator, which contains four 1s, three 2s and two 3s, and I also create another vector, again by concatenation, which has the elements 1 to 9.
Now, if I want to add up the values which have the same Id, the tapply function can help. What I do is call tapply with the values, the Ids they belong to, and the function sum, since adding is what I want. It takes the elements corresponding to one Id, for example the four 1s with the values 1, 2, 3 and 4, and sums them up: 1 + 2 + 3 + 4 is 10. Similarly, it takes the values corresponding to Id 2 and sums them, and the values corresponding to Id 3 and sums them, and prints the outputs, which are the sums of the elements of category 1, category 2 and category 3, that is, the sums for Id 1, Id 2 and Id 3.
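The same example as a sketch:

    id  <- c(1, 1, 1, 1, 2, 2, 2, 3, 3)
    val <- 1:9
    tapply(val, id, sum)   # 10 (1+2+3+4), 18 (5+6+7), 17 (8+9)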
In this lecture we have seen how to write functions which takes multiple inputs and
multiple outputs. We have seen inline functions and we have also seen some functions
that are useful to loop over the objects. In the next lecture we are going to discuss about
the control structures in R.
107
Thank you.
108
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 10
Control structures
Welcome to lecture 9 in the R module of the course, data science for engineers. Here we
will look at the control structures in R.
109
In this lecture we are going to talk about the if, else if, else family of constructs, the for loop, nested for loops, the for loop with an if-break, and the while loop. Control structures can be divided into two categories.
The first category is where you need to execute certain commands only when certain conditions are satisfied; an example of this kind of control structure is the if-then-else type of construct. The second category is where you execute certain commands repeatedly and use certain logic to stop the iteration; examples of this kind are the for and while loops.
110
First, look at the if construct. What if does is check a condition; if the condition is satisfied, it executes the statements inside the if block. The next level is the if-else construct, where we want to do certain operations if the condition is satisfied and, if not, do some alternate operations.
The if-else syntax reads: if the condition is satisfied, perform these statements, else perform these alternate statements. The next level is the if, else if, else construct. Here there are two checks: if a condition is satisfied, execute these statements; else, if another condition is satisfied, execute these alternate statements; if neither of them is satisfied, do something else. The syntax is as follows; to illustrate this, let us consider the example here.
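A minimal if / else if / else sketch; the values and messages are illustrative, since the slide's example is not reproduced in the transcript:

    x <- 7
    if (x > 10) {
      print("x is greater than 10")
    } else if (x > 5) {
      print("x is between 6 and 10")
    } else {
      print("x is 5 or less")
    }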
111
Next, we move on to the for construct. To understand the for loop we first need to understand the sequence function, because a sequence is one of the components of the for loop; that is why we look at the sequence function now. Its syntax is as follows.
The sequence function takes the starting number from which the sequence has to begin and the ending number at which it has to end, and you can define the sequence by providing either by or length. When you provide the argument by, it specifies by what increment or decrement the sequence has to be generated; when you provide the argument length, it creates the required number of elements between the starting number and the ending number. You can see the examples here. Suppose I want to create a sequence from 1 to 10 with a step of 3: the argument I pass is by = 3, and this creates 1, then 4, then 7, then 10, each separated by 3. I can do the same by specifying the length instead: I ask for a sequence from 1 to 10 which contains 4 elements, and it generates the same thing, starting from 1 and going up to 10 with 4 equally spaced elements. If instead I ask for a sequence from 1 to 10 with a step of 4, this is how the output looks.
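The three calls described above:

    seq(1, 10, by = 3)           # 1 4 7 10
    seq(1, 10, length.out = 4)   # 1 4 7 10
    seq(1, 10, by = 4)           # 1 5 9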
112
Now let us move on to the for loop. The structure of the for loop construct comprises a sequence, which could be a vector or a list, an iterator, which is an element of the sequence, and the statements that need to be executed.
If you look at the structure, it reads: for every iterator in the sequence, where the sequence is a list or vector, execute these statements. The next level of for loops is the nested for loop, which means we have one or more for loop constructs located within another for loop.
The structure of the nested for loop is as follows: the for loop inside is the inner for loop and the one outside is the outer for loop. For every value of iterator 1 in sequence 1, the inner for loop gets executed; it goes to the inner loop, performs its operations on sequence 2 for every value of iterator 2, and returns the output. Let us now illustrate the for loop.
113
(Refer Slide Time: 05:58)
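The loop on the slide, reconstructed from the walkthrough that follows, sums the numbers 1 to n:

    n <- 5
    sum <- 0
    for (i in 1:n) {
      sum <- sum + i
      print(c(i, sum))
    }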
You have n = 5, which is kept in memory, and sum = 0, which initializes the sum to 0. The first time it enters the loop it takes the value 1 from the sequence; the sum is 0, 0 + 1 is 1, and the value 1 is assigned to sum. You can see that in the first iteration the value of sum is 1. In the second iteration the sum is 1 and the loop variable i is 2; 1 + 2 is 3, and the value 3 is assigned to the variable sum, so the value of sum is 3 after the second iteration, and so on.
Since the sequence runs up to 5, the sum will be 15 at the end of 5 iterations. Sometimes it is necessary to stop the loop when a required condition is satisfied.
114
(Refer Slide Time: 07:41)
This can be achieved using a break statement in the for loop. Let us see how: I assign the value 100 to the variable n and initialize the sum to 0. Now I want to stop the loop when the sum exceeds 15; how do I do that? Inside the for loop you need an if-break construct. In the for loop, for every iteration I add the loop variable to the sum, assign it back to sum, and print a vector containing the loop variable and the sum.
Then I check a condition: if the sum is greater than 15, I say break, because this is the condition on which I want to break the loop. Once this break statement is executed, the program exits the loop even before the iterations are complete. Now, let us see how this works. In the first iteration the loop variable has the value 1, because the sequence starts from 1 and the last value is 100; as in the previous example, after the first iteration the value of sum is 1.
It then checks whether the sum is greater than 15; since the sum is 1, it is not, and the break statement is not executed. The break statement is not executed for the first 5 iterations; when iteration number 6 comes, the sum is already 15, the iteration value is 6, 15 + 6 becomes 21, and 21 gets assigned to the variable sum.
115
Now the condition is checked again: the sum is greater than 15, the condition is satisfied, and the break statement gets executed; once it is executed, the program quits the for loop.
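A sketch of this for loop with an if-break:

    n <- 100
    sum <- 0
    for (i in 1:n) {
      sum <- sum + i
      print(c(i, sum))
      if (sum > 15) {
        break          # exit the loop once the sum exceeds 15
      }
    }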
Next, we move on to another construct, the while loop. A while loop is used whenever you want to execute statements until a specific condition is violated; we can see it as akin to a for loop with an if-break construct.
Consider the sequence of natural numbers: you want to find the natural number n at which the sum of the first n natural numbers reaches a certain final sum. We consider the same example as before: I initialize the sum to 0 and the loop variable i to 0, and I want the final sum to be 15. The while loop reads: while the sum is less than the final sum, increment i by 1 and reassign it to i, increase the sum by i and reassign it to sum, and finally print the iteration value and the sum. These commands get executed until the loop variable reaches 5, at which point the sum equals 15.
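The same example as a while loop:

    sum <- 0
    i <- 0
    final_sum <- 15
    while (sum < final_sum) {
      i <- i + 1
      sum <- sum + i
      print(c(i, sum))
    }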
Let us understand how this works. Initially i is 0; the loop checks the condition that the sum is less than the final sum, and since the condition is true it increments i by 1, so i becomes 0 + 1 = 1, and the sum becomes 0 + 1 = 1. The print statement then prints this first line, and the loop goes to the next iteration. It again checks whether the sum is less than the final sum; the sum is 1, which is less than 15, so it goes on, updating the value of the loop variable and the value of the sum variable. After the fifth iteration the sum is 15, which is equal to, but not less than, 15; the condition is false and the loop exits. So, in this lecture we have seen how the if-else family of constructs can be coded in R, and how to code for loops and while loops in R.
In the next lecture we are going to see how to perform basic graphics operations in R.
Thank you.
117
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 11
Data visualization in R Basic graphics
Welcome to the lecture 10 and the final lecture in the R module of the course data
science for engineers. In the previous lectures, we have seen; what are the basic data
types that are supported by R, how to write scripts, how to write functions and how to do
control structures, how to do programming and so on.
In this lecture, we are going to show you how to generate some basic graphics; such as
scatter plot, line plot and bar plot using R, and we will also give a brief idea on why there
is a need for more sophisticated graphics and how R does it.
118
(Refer Slide Time: 00:54)
First we will consider the scatter plot. The scatter plot is one of the most widely used plots: we have an independent variable and a dependent variable, and we want to see how the dependent variable depends on the independent variable. Generating a scatter plot in R is quite simple. The first command here creates a vector with the elements 1 to 10, and the next command takes this x, calculates the element-wise square of x and assigns it to y.
When you plot y, it generates the plot shown here. Since we have not specified x, the independent variable, R generates its own independent variable as the index; since the vector contains 10 elements, it creates the index based on the number of elements in the vector. The y values, the squares of the elements 1 to 10, that is 1, 4, 9 and so on, are shown on the y axis, and since 10^2 = 100, the final value on the y axis is 100.
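The two commands and the plot call:

    x <- 1:10
    y <- x^2
    plot(y)   # scatter plot of y against its index, since x is not supplied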
119
(Refer Slide Time: 02:24)
Let us illustrate the scatter plot using an inbuilt data set that is available in R: the data set mtcars, which you can access by just typing mtcars. This data set is a data frame which contains 32 observations on 11 variables. The variables are listed here, such as the number of cylinders, represented by the variable cyl, mpg, the mileage the car gives in miles per US gallon, and wt, the weight of the car, and so on.
120
Now, let us plot a scatter plot between the weight and the mpg of this data frame. To do that we use the plot command: the independent variable is the car weight, the dependent variable is miles per gallon, main gives the title of the graph, xlab gives a label for the x axis, ylab gives a label for the y axis, and pch selects the shape of the points; pch = 19 corresponds to the shape shown on this screen. You can use different pch values to obtain different shapes for the points in a scatter plot.
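A sketch of this call (the title text is illustrative):

    plot(mtcars$wt, mtcars$mpg,
         main = "Mileage vs car weight",
         xlab = "Car weight", ylab = "Miles per gallon",
         pch = 19)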
Next we move to the line plot. You can take the same example we saw earlier: the same plot command can be used, and all you need to do to generate a line plot is specify an extra argument, type = "l", which draws a line instead of the scatter plot.
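For example:

    x <- 1:10
    y <- x^2
    plot(x, y, type = "l")   # type = "l" draws a line plot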
121
Next we move on to the bar plot. The syntax to generate a bar plot in R is as follows: barplot of H, where H contains the heights, which can be a vector or a matrix; to keep it simple we will deal only with vectors. The names.arg argument prints a name under each bar in H; xlab, ylab and main have the same meanings as for the scatter plot, and col gives us an option to colour the bar plot. This is the R code that can be used to generate the bar plot: I define H, the heights of the bars, as a vector with the values 7, 12, 28, 3 and 41, and I create another vector of character values, March, April, May, June and July. Now I create a bar plot with H as the heights, names.arg as M, xlab as Month, ylab as Revenue, the colour of the bars blue, the title Revenue chart, and the border red.
122
When you execute these commands, this is how the bar plot looks. These are the heights of the bars; this one, for instance, has height 3. The names.arg values, March, April, May, June and July, are printed at the bottom of each bar, the x axis label is Month, the y axis label is Revenue, and the title is Revenue chart.
Now, let us see why there is a need for sophisticated graphics. Let us say there is a need
for you to show multiple plots in a single figure as shown below. How do you do this?
123
What are the challenges you face when you want to create the figure shown in the earlier slide? The exact figure can be reproduced using the code shown here. For this, you have to know where to introduce a for loop, which columns of the data frame to select for plotting, how to position each graph in the grid, and so on. Even after doing all of this, the visuals are less pleasing; that is where more sophisticated graphics packages in R are needed, and that is where ggplot2 comes into the picture. ggplot2 provides a very powerful package for generating graphics in R; in this course, we will not deal much with ggplot2.
124
(Refer Slide Time: 07:14)
In summary, we have seen how to generate scatter plots, line plots and bar plots in R.
We have also seen the challenges and disadvantages of basic graphics and the need for
using the advanced packages; such as ggplot2 for generating beautiful graphics in R.
With this we end the R module for this course. Wish you all the best for the next
modules in this course.
Thank you.
125
Data Science for Engineers
Prof. Raghunathan Rangaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 12
Linear Algebra for Data science
So, that is one thing that I would like to mention right at the beginning. The second thing that I would like to mention is the following: linear algebra can be treated very theoretically and very formally; however, in this short module on linear algebra with relevance to data science, what we have done is try to explain the ideas in as simple a fashion as possible without being too formal. However, we do not do any hand waving; we teach linear algebra properly, just in a simple fashion. That is another thing I would like you to remember as we go through this material. So, we first start by explaining what linear algebra is useful for.
126
So, when one talks about data science, data representation becomes an important aspect, and data is usually represented in matrix form; we are going to talk about this representation and about concepts in matrices. The second important thing from a data science perspective is this: if the data contains several variables of interest, I would like to know how many of these variables are really important, whether there are relationships between these variables, and, if there are such relationships, how one uncovers them.
127
(Refer Slide Time: 01:48)
So, if you are an engineer and you are looking at data for multiple
variables, at multiple times, how do you put this data together in a
format that can be used later, is what a matrix is helpful for.
128
So, let us start and then try and understand how we can understand and
study matrices.
129
(Refer Slide Time: 05:47)
Let me explain matrices using a real life example. Consider that you are an engineer looking at a reactor which has multiple attributes, and you are getting information from sensors such as pressure sensors, temperature sensors, density sensors and so on.
Now assume that you have taken 1000 samples of these variables and you want to organize this data so that you can use it later. One way to do this is to organize it in matrix form, where the first column corresponds to the values of pressure at the different sample points, the second column corresponds to the values of temperature at those sample points, and the third column corresponds to the values of density at those sample points.
So, that is what I meant when I said the columns are used to
represent the variable. So, each column represents a variable column 1
pressure, column 2 temperature and column 3 density and when you
look at the rows; the first row represents the first sample.
Here, in the first sample, you will read that the value of pressure was 300, the value of temperature was 300 and the value of density was 1000. Similarly, you will have many rows corresponding to each sample point, up to the last row, the 1000th row, which is the 1000th sample point, where the pressure is 500, the temperature is 1000 and the density is 5000.
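As a small illustration of this organization (only the first and last rows below follow the numbers quoted above; the middle row is made up), the same data could be laid out in R as follows.

```r
# Rows are samples, columns are the variables pressure, temperature and density.
samples <- matrix(c(300,  300, 1000,   # sample 1 (values quoted above)
                    350,  450, 1500,   # ... intermediate samples (made up) ...
                    500, 1000, 5000),  # sample 1000 (values quoted above)
                  ncol = 3, byrow = TRUE)
colnames(samples) <- c("pressure", "temperature", "density")
samples[1, ]            # first row: all variables for the first sample
samples[, "pressure"]   # first column: pressure at every sample
```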
(Refer Slide Time: 07:25)
Let us take another example. Let us say I have 2 vectors, X = [1, 2, 3]ᵀ and Y = [2, 4, 6]ᵀ. Let us say X is some variable that you have measured and Y is some other variable you have measured, and the 3 values could represent the 3 sampling points at which you measured these.
Now, we have been talking about using matrices to represent data
from engineering processes sensors and so on. The notion of matrix
and manipulating matrices is very important for all kinds of
applications. Here is another example where I am showing how a
computer might store data about pictures. So, for example, if you take
this picture here on this left hand side and you want to represent this
picture somehow in a computer. One way to do that would be to
represent this picture as a matrix of numbers.
So, in this case for example, if you take a small region here you can
break this small region into multiple pixels and depending on whether
a pixel is a white background or a black background you can put in a
number. So, for example, here you see these numbers which are large
numbers which represent white background and you have these small
numbers which represent black background. So, this would be a
snapshot or a very small part of this picture.
Now, when you make this picture into many such parts you will
have a much larger matrix and that larger matrix will start representing
the picture. Now, you might ask; why would I do something like that?
There are many many applications where you want the computer to be
able to look at different pictures and then see whether they are different
or the same or identify sub components in the picture and so on. And
all of those are done through some form of matrix manipulation and
this is how you convert the picture into a matrix.
Now notice that when we converted this picture into a matrix, we again got a rectangular form, where we have rows and columns, and where data is filled in as a representation of this picture.
So, the image that I showed before could be stored in the machine
as a large matrix of pixel values across the image. And you could show
other pictures and
then say are these pictures similar to this, are these dissimilar, how
similar or dissimilar and so on and the ideas from linear algebra matrix
operations are at the heart of these machine learning algorithms.
So, in summary, if you have a data matrix, the data matrix could be data from sensors in an engineering plant, it could be data which represents a picture, or it could be data representing a model, where you have the coefficients from several equations. So, the matrix basically could have data from various different sources or various different viewpoints, and each data matrix is characterized by rows and columns; the rows represent samples and the columns represent attributes or variables.
(Refer Slide Time: 11:14)
The other thing we would like to know is how much information is there in this data. So, let us say I have thousands of variables and of those there are only 4 or 5 that are independent. Then it means that actually I can store values for only these few variables and calculate the remaining as a function of these variables. So, it is important to know how much information I actually have.
(Refer Slide Time: 12:08)
So, this would lead to the following questions. The first question
might be; Are all the attributes or variables in the data matrix really
relevant or important? Now a sub question is to say are they related to
each other. If I can write one variable as a combination of other
variables then basically I can drop that variable and retain the other
variables and calculate this variable whenever I want.
(Refer Slide Time: 13:09)
So, let us consider the example that we talked about; the reactor
with multiple attributes. In the previous slide, we talked about pressure,
temperature and density. Here I have also included viscosity. Let us
say I have 500 samples. Then when I organize this data with the variables in the columns and the samples in the rows, I will get a 500 by 4 matrix, where each row represents one of the 500 samples, and if you go down a column, it gives the values of that variable at all the samples that we have taken. Now, I want to know how many of these are really independent attributes.
So, from domain knowledge it might be possible to say that density
is in general a function of pressure and temperature. So, this implies
that at least one attribute is dependent on the other and if this
relationship happens to be a linear relationship then this variable can be
calculated as a linear combination of the other variables.
Now, if all of this is true then the physics of the problem has helped us
identify the relationship in this data matrix. The real question that we
are interested in asking is if the data itself can help us identify these
relationships.
Let us first assume, for now, that we have many more samples than attributes. Once we have the matrix and we want to identify the number of independent variables, the concept that is useful is the rank of the matrix, and the rank of the matrix is defined as the number of linearly independent rows or columns that exist in the matrix.
So, consider this example here where I have this a matrix which is
1, 2, 3, 2, 4, 6, 1, 0, 0. If you notice this matrix has been deliberately
generated such that the second column is twice column one.
Once we identify the independent variables, the dependent variables or attributes can be calculated from the independent variables, if the data is being generated from the same data generation process.
And if you identify that there are certain variables which are dependent on other variables, then as long as the data generation process is the same, it does not matter how many samples you generate; you can always find the dependent variables as a function of the independent variables.
(Refer Slide Time: 17:13)
When we have a matrix A, if we are able to find vectors β such that Aβ = 0 and β ≠ 0, then we would call such a vector β as being in the null space of the matrix. So, let us put some simple numbers here. For example, if A is a 3 by 3 matrix, then, because A multiplies β, β has to be 3 by 1 and the result will be some 3 by 1 vector; if all the elements of this 3 by 1 vector are 0, then we would call this β as being in the null space of the matrix. Now, interestingly, the size of the null space of the matrix gives us the number of relationships that exist among the variables.
If you have a matrix with 5 variables (5 columns) and let us say the size of the null space is 2, then this basically means that there are 2 relationships among these 5 variables, which also automatically means that of these 5 variables only 3 are linearly independent, because the 2 relationships would let you calculate the dependent variables as a function of the independent variables.
So, if you take the first element on each side of this equation, the left hand side would be the product of the first row of this data matrix and this β vector, which would basically be x11 β1 + x12 β2 + ... + x1n βn = 0. Similarly, if you go to the second row and multiply the second row by this vector, you will get another equation.
So, if we keep going down, for every sample, writing out this product gives you an equation. The last sample, for example, will give xm1 β1 + xm2 β2 + ... + xmn βn = 0. Now, there is something interesting that you should notice here: this equation seems to be satisfied for all samples. So, what this basically means is, irrespective of the sample, the variables satisfy this equation, and since the equation holds for every sample, we would assume that it is a true relationship between all of these variables or attributes. So, in other words, this β1 to βn gives you, in some sense, a model equation or a relationship among these variables.
So, one might say that this equation can generally be written as x1 β1 + x2 β2 + ... + xn βn = 0, where you can take any sample, substitute the values of the variables in that sample for x1, x2, ..., xn, and the equation will be satisfied. So, this is a true relationship.
We will demonstrate this with a further example. The rank-nullity theorem basically says that the nullity of matrix A plus the rank of matrix A is equal to the total number of attributes of A, or the number of columns of the matrix. The nullity of A tells you how many such relationships there are; there are that many vectors in the null space.
The rank of A tells you how many independent variables there are, and when you add these two you should get the total number of variables in your problem.
So, to summarize, when you have data, the available data can be expressed in the form of a data matrix, and as we saw in this lecture, this data matrix can be further used to do different types of operations. We also defined the null space; the null space is defined as the collection of vectors β that satisfy the relationship Aβ = 0.
Let us take some examples to make these ideas little more concrete.
Let us take a matrix A which is 1, 3, 5, 2, 4, 6. A quick look at this
matrix and the numbers would tell you that these two columns are
linearly independent and subsequently because these columns are
linearly independent there can be no relationships among these two
variables.
So, you can see that the number of columns is 2; since they are independent, the rank is 2. Since the rank is 2, the nullity is 0, and because both the variables are independent you cannot find a relationship between them. If you were able to find a relationship, then the rank would not have been 2. So, this basically implies that the null space of the matrix A does not contain any vectors, and as we mentioned before, these variables are linearly independent.
And you can print the number of columns, you can print the rank, and you can print the nullity, which is the difference between the number of columns and the rank.
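The R commands on the slide are not shown in the transcript; a minimal sketch of one way to print these quantities, using base R's qr() for the rank, is given below.

```r
# Number of columns, rank and nullity for the matrix with columns (1,3,5) and (2,4,6).
A <- matrix(c(1, 3, 5, 2, 4, 6), ncol = 2)
n_cols  <- ncol(A)
rank_A  <- qr(A)$rank        # numerical rank via the QR decomposition
nullity <- n_cols - rank_A   # rank-nullity: rank + nullity = number of columns
print(n_cols); print(rank_A); print(nullity)   # 2, 2, 0
```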
Now, let us take the other example that we talked about where I
mentioned that we have deliberately made the second column twice the
first column. So, in this case as we saw before the rank of the matrix
would be 2 because there are only two linearly independent columns
and since the number of variables = 3, nullity will be 3 - 2 = 1.
So, when we look at the null space vector you will have one vector
which will identify the relationship between these three variables.
So, to understand how to calculate the null space, let us look at this example. We set up the equation Aβ = 0, and we know we will get only one β here; we have written β as (b1, b2, b3). When we do the first row-by-column multiplication, I will get b1 + 2b2 = 0.
When I do the second row-by-column multiplication, I will get 2b1 + 4b2 = 0, and when you do the third multiplication you get b3 = 0. Now, the second equation, which is 2b1 + 4b2 = 0, is simply twice the first equation. So, it does not give me any extra information, and I have dropped that equation. Now, when you want to solve this, notice that b3 is fixed to be 0.
However, from the first equation what you can get is b1 = -2b2. So, what we have done is, instead of b1 we have put -2b2 and retained b2; choosing b2 = 1, this basically tells us that you can get a null space vector which is (-2, 1, 0). However, whatever scalar multiple you use, it will still remain a null space vector.
So, whenever β is a null space vector, any scalar multiple of it will also be a null space vector, which is what is shown by the k here. Nonetheless, we have a relationship between these variables, which basically says that -2x1 + x2 = 0 is a relationship that we can get out of this null space vector.
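The library call used on the slide is not visible here, so the sketch below finds the null space numerically with base R's svd(); the right singular vector corresponding to the zero singular value spans the null space and comes out proportional to (-2, 1, 0).

```r
# Null space of the matrix whose second column is twice the first.
A <- matrix(c(1, 2, 3, 2, 4, 6, 1, 0, 0), ncol = 3)  # columns (1,2,3), (2,4,6), (1,0,0)
s <- svd(A)
beta <- s$v[, s$d < 1e-8, drop = FALSE]  # right singular vectors with (numerically) zero singular value
beta          # proportional to (-2, 1, 0), i.e. the relationship -2*x1 + x2 = 0
A %*% beta    # essentially the zero vector
```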
So, to summarize this lecture: as we saw, a matrix can be used to represent data in rows and columns, representing samples and variables respectively. Matrices can also be used to store coefficients of several equations, which can be processed later for further use. The notion of rank gives you the number of independent variables or samples. The notion of nullity identifies the number of linear relationships, if any, between these variables, and the null space vectors actually give us those linear relationships. I hope this lecture was understandable and we will see you again in the next lecture.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rangaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 13
Solving Linear Equations
In general when we have a set of linear equations, when we write it
in the matrix form, we write this in the form Ax = b where A is
generally a matrix of size m by n, and as we saw in the last class m
would represent the number of rows and n would represent the number
of columns, and for matrix multiplication to work, x has to be of size n
by 1 and b has to be of size m by 1.
Now if you take each row of this equation you will have a left hand
side and the right hand side. And the left-hand side will have terms
corresponding to multiplying the first row of A with x and the right-
hand side will have the term corresponding to b. If you take the first
row, it will be the first equation and so on. So, from that viewpoint, m
represents the number of equations in the system of equations and n
represents the number of variables, and in general b is the vector of constants on the right hand side. So, when we write Ax = b,
this represents a set of m equations in n variables.
Now, clearly there are three cases that one needs to address. When m = n, the number of equations and variables are the same; this turns out to be the easiest case to solve. When m is greater than n, we have more equations than variables, so we might not have enough variables to satisfy all the equations; in the usual case this will lead to no solution. When m is less than n, we have more variables than equations, which, as we will see, typically leads to an infinite number of solutions.
We had already discussed the concept of rank in the previous lecture,
but let us talk about it again, because this is going to be useful, as we
talk about solving equations. Consider a matrix A of size m by n. Now, if all the rows of the matrix are linearly independent, then, since we said the rows represent data, this basically means that the data samples do not present a linear relationship; that is, the samples are independent of each other.
Now, when all the columns of the matrix are linearly independent, that basically means the variables are linearly independent, because columns represent variables. In a general case, if I have a matrix m by n with m smaller than n, the maximum rank of the matrix can only be m. So, the maximum rank can be the lesser of the two numbers. In cases where I have a matrix m by n where n is smaller than m, the maximum rank that is possible is n. In general, whatever be the size of the matrix, it has been established that the row rank equals the column rank. So, you cannot have a different row rank and column rank. What it basically means is, whatever be the size of the matrix, if you have a certain number of independent rows, you will have only that many independent columns, and vice versa. So, this is something important to bear in mind.
(Refer Slide Time: 05:18)
Now, from your high school and so on, you would have learned that if the determinant is not 0, A⁻¹ is possible to compute. So, one could simply compute x = A⁻¹b as the solution to this problem; the difficulty arises only when A is not full rank, that is, when the rank of the matrix is less than n. What this means is that if I take the left-hand side of the equation Ax = b and look at linear combinations of the rows of A, at least one of the rows of A is going to be a linear combination of the other rows; that is the reason why the rank of the matrix became less than n. In this case, depending on what the values are on the right hand side, you could have two situations; one situation is what we are going to call a consistent situation.
I will explain this through an example in later slides. When you have a consistent situation, then you will have an infinite number of solutions; there could be many solutions for Ax = b. And in the case where the system of equations becomes inconsistent, there will be no solution to this problem. So, to summarize: when m equals n, this is what is called a square matrix, and if the matrix is full rank (the determinant of A is not 0), then there is a unique solution x = A⁻¹b. When A is not full rank there are two situations that are possible: one is what we call the consistent scenario, where we could have infinite solutions, and the other one is what is called the inconsistent scenario, where we might have no solution.
Let us take a simple example where I have, at the top of the screen, the system in the form Ax = b. In this case, matrix A has rows (1, 3) and (2, 4), x is (x1, x2) and b is (7, 10). So, you can notice that there are two equations in two variables, x1 and x2. The first equation is basically x1 + 3x2 = 7 and the second equation is 2x1 + 4x2 = 10. We can see that this matrix is full rank, because whatever multiple of the first column you take, you can never reproduce the second column, and similarly whatever multiple of the first row you take, you can never reproduce the second row; this can also be seen from the fact that the determinant of A is not 0.
Now, if you want to write R code for this, it is very simple: you create the matrix by putting the numbers in c() and defining the number of columns, you define what b is, and then simply use the command solve to get the solution. And as you notice here, the solution is (1, 2). So, this is a case of full rank where I get a unique solution. The important thing to note here is that no other solution will be able to satisfy these two equations.
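A minimal version of the R code being described (the exact slide code may differ slightly) is:

```r
A <- matrix(c(1, 3, 2, 4), ncol = 2, byrow = TRUE)  # rows (1, 3) and (2, 4)
b <- c(7, 10)
solve(A, b)   # unique solution of Ax = b for a full-rank square matrix: 1 2
```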
Now if you notice these equations, through the matrix A you will
see that, if I multiply the first column by 2 I get the second column. So,
the second column is linearly dependent on the first column or the first
column is linearly dependent on the second column, whichever way
you want to say it. Similarly, if you divide, for example, the second row by 2 you will get the first row, or if you multiply the first row by 2 you get the second row. So, the rows are also linearly dependent, and as I said before, there is only one independent column and that necessarily means that there will be only one linearly independent row.
When we talk about the linear dependence of the rows of A, we are only talking about the left hand side; we never talked about the right hand side. Now, whenever the left hand side becomes linearly dependent, if the same linear dependence is maintained on the right hand side also, then we have the situation of consistent equations. So, if you take a look at this example, on the left hand side, if I take the first equation and multiply it by 2, I get the second equation; that is, x1 + 2x2 multiplied by 2 gives me 2x1 + 4x2. Now, if the same linear dependence is maintained on the right hand side, that is, if I take the first number 5 and multiply it by 2, I should get the second number.
(Refer Slide Time: 14:50)
(Refer Slide Time: 16:44)
So, that finishes the case where the number of equations and
variables are the same. So, till now we saw the case, where m = n. Now
let us take the second case, where m is greater than n. Since m is
greater than n this basically means that I have more equations than
variables. So, this is the case of not enough variables or attributes to
solve all the equations. Since the number of equations is greater than the number of variables, in general we will not be able to satisfy all the equations; hence we term this the no-solution case.
Let us look at a solution to Ax = b when the number of equations is more than the number of variables. As we mentioned before, we are going to take an optimization perspective here. When we try to solve Ax = b, we can write the residual as Ax - b, and if there is a perfect solution to the set of equations then Ax - b will be equal to 0. However, since the number of equations is a lot more than the number of variables, there might not be a perfect solution.
Now, this is the least squares solution. Instead of minimizing the sum of the squared errors, you could also minimize |e1| + |e2| + |e3|, because the modulus is always positive; that is also possible, but in general we are going to talk about the least squares solution, where we minimize the sum of squares of the errors. Now, this is the same as minimizing (Ax - b)ᵀ(Ax - b), simply because Ax - b is e, so (Ax - b)ᵀ(Ax - b) is eᵀe. If I take the row vector (e1, e2, e3) and multiply it by the column vector (e1, e2, e3), this will lead to e1² + e2² + e3², which is the same as the sum of squared errors. So, minimizing this is the same as minimizing (Ax - b)ᵀ(Ax - b).
(Refer Slide Time: 22:20)
Thank you.
Data Science for Engineers
Prof. Raghunathan Rangaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 14
Solving Linear Equations
So, if you notice these equations, you would realize that the first 2 equations are inconsistent. For example, if we take the first equation to be true, then x1 = 1, and if we substitute that value into the second equation you will get 2 = -0.5, which cannot be. If you were to take the second equation as true, then 2x1 = -0.5, so x1 will be -0.25, and that would not solve the first equation. So, these 2 equations are inconsistent. The third equation, since it is 3x1 + x2, means that irrespective of whatever value you get for x1 you can always use it to calculate the value for x2; however, we cannot solve this set of equations exactly.
Now, let us see what solution we get by using the optimization concept that we described in the last lecture. We said x = (AᵀA)⁻¹Aᵀb. The A matrix has rows (1, 0), (2, 0) and (3, 1), so the Aᵀ matrix has rows (1, 2, 3) and (0, 0, 1); we simply plug in the matrices here.
And then doing the calculation gives us this equation, which says that (x1, x2) is a matrix times (15, 5); this is an intermediate step of the calculation. When you further simplify it, you get the solution x1 = 0, x2 = 5. Notice that the optimum solution chosen here is neither of the 2 cases that we talked about in the last slide, which were x1 = 1 and x1 = -0.25; the optimization approach chooses x1 = 0 and x2 = 5, and when you substitute this back into the equations you get b as (0, 0, 5), whereas the actual b that we are interested in is (1, -0.5, 5).
So, you can see that while the third equation is solved exactly, the first 2 equations are not solved; however, as we described before, this is the best solution in a collective error-minimization sense, which is what we defined as minimizing the sum of squared errors. We will now move on to the next example.
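A short sketch of this calculation in R, using the example's A and b, is given below; solving the normal equations AᵀAx = Aᵀb is equivalent to the (AᵀA)⁻¹Aᵀb formula used above.

```r
A <- matrix(c(1, 0,
              2, 0,
              3, 1), ncol = 2, byrow = TRUE)
b <- c(1, -0.5, 5)
x_ls <- solve(t(A) %*% A, t(A) %*% b)  # least squares solution via the normal equations
x_ls            # 0 and 5
A %*% x_ls      # (0, 0, 5), compared with the actual b = (1, -0.5, 5)
```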
(Refer Slide Time: 03:47)
So, from the first equation you can get the solution x1 = 1, and since the second equation reads 2x1 = 2, we simply substitute the solution from the first equation and see whether the second equation is also satisfied; since x1 = 1, 2 times x1 is 2 times 1, which is 2, so the second equation is also satisfied.
Now, let us see what happens to the third equation. The third equation reads 3x1 + x2 = 5, and we already know that x1 = 1 satisfies the first 2 equations. So, 3x1 + x2 = 5 would give you x2 = 2. Now you notice that if I get the solution 1 and 2 for x1 and x2, then, though the number of equations is more than the number of variables, the equations are such that I can get a solution for x1 and x2 that satisfies all 3 equations.
Now let us see whether the expression that we had for this case actually uncovers this solution. So, we said x = (AᵀA)⁻¹Aᵀb, and we do the same manipulation as in the last example, except that this b has become (1, 2, 5) now.
(Refer Slide Time: 05:40)
So, the important point here is that if we have more equations than variables, then you can always use this least squares solution, which is (AᵀA)⁻¹Aᵀb. The only thing to keep in mind is that (AᵀA)⁻¹ exists if the columns of A are linearly independent. If the columns of A are not linearly independent, then we have to do something else, which you will see as we go through this lecture.
So, that finishes the case where the number of equations is more than the number of variables. Now let us address the last case, where the number of equations is less than the number of variables, that is, m less than n; in this case we address the problem of more attributes or variables than equations.
Now, since I have many more variables than equations, I would have an infinite number of solutions. The way to think about this is the following. If I had, let us say, 2 equations and 3 variables, you can think of this situation as one where you could choose any value for x3 and then simply put it into the 2 equations. Whatever terms involve x3, you collect them and take them to the right-hand side; that would leave you with 2 equations in 2 variables, and once we solve those 2 equations in 2 variables we will get values for x1 and x2.
So, basically what this means is that I can choose any value for x3 and then, corresponding to that, I will get values for x1 and x2. So, I will get an infinite number of solutions. Since I have an infinite number of solutions, the question that I ask is: how do I find one single solution from this set of infinitely many possible solutions? Clearly, if you are looking only at solvability of the equations, there is no way to distinguish between these infinitely many possible solutions. So, we need to bring in some other metric that we could possibly use, which would have some value for us, to pick one solution that we can say is a solution to this case.
So, we can write this optimization problem in a nice form. And notice something important here: we also have a constraint for this optimization problem ("s.t." means subject to).
What this is saying, in terms of minimizing xᵀx, is that of all the solutions I want the solution which is closest to the origin. From an engineering viewpoint one could justify this as follows: if you have lots of design parameters that you are trying to optimize, you would like to keep their sizes small, for example, so you might want small numbers. So, you want to be as close to the origin as possible. This is just one justification for doing something like this; nonetheless, this is one way of picking one solution from this infinite number of solutions.
When we do that, you will see how we solve these kinds of optimization problems. The optimization problem that we solved for the last case is what is called an unconstrained optimization problem, because there are no constraints in that problem, whereas this problem that we are solving is called a constrained optimization problem, because while we have an objective, we also have a set of constraints that we need to satisfy.
So, you will have to bear with us till you go through the optimization module to understand this fully. It is generally a good idea to teach linear algebra before optimization, but interestingly, some of the linear algebra concepts can be viewed as optimization problems, and solving optimization problems requires a lot of linear algebra concepts. So, in that sense they are both coupled. In any case, to solve optimization problems of this form we can define what is called a Lagrangian function f(x, λ), where λ are extra parameters that we introduce into the optimization formulation. What you do is minimize this Lagrangian with respect to x to get a set of equations, and you also minimize it with respect to λ, which brings back the constraint. So, whatever solution you have, it has to satisfy both the differentiation with respect to x, which gives you x + Aᵀλ = 0, and the differentiation with respect to λ, which simply gives you Ax - b = 0. That basically says that whatever solution you get has to satisfy the equation Ax = b; we will see how this is useful in identifying a solution.
(Refer Slide Time: 14:17)
And when I do some more algebra, I finally get a solution for (x1, x2, x3), which is the following. We had already seen that x3 = 1 has to be part of the solution, because the last equation basically said x3 = 1. For x1 and x2, you could have found several numbers that satisfy the first equation after you choose x3 = 1; of all of these, this solution says that (-0.2, -0.4) is the minimum norm choice, or in other words this vector is the closest vector to the origin that satisfies my equation Ax = b. So, I can finally say my solution (x1, x2, x3) is (-0.2, -0.4, 1).
And you can easily verify that this satisfies the original equation: since x3 is 1, the second equation, which is x3 = 1, gets satisfied. When you look at the other equation, you have 1 times -0.2 plus 2 times -0.4, that is -0.2 - 0.8 = -1, and adding 3 times 1 gives you 3 - 1 = 2, which is the required right hand side. So, the solution that we found satisfies the original equation, and it also turns out to be the minimum norm solution, as we discussed.
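Eliminating λ from the two conditions above gives the closed form x = Aᵀ(AAᵀ)⁻¹b for the minimum norm solution. A small R sketch for this example is shown below, under the assumption (read off from the verification above) that the two equations are x1 + 2x2 + 3x3 = 2 and x3 = 1.

```r
A <- matrix(c(1, 2, 3,
              0, 0, 1), ncol = 3, byrow = TRUE)
b <- c(2, 1)
x_min <- t(A) %*% solve(A %*% t(A), b)  # minimum norm solution x = A^T (A A^T)^(-1) b
x_min          # -0.2, -0.4, 1
A %*% x_min    # reproduces b, so the constraints Ax = b are satisfied
```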
(Refer Slide Time: 16:56)
And if it is not a full rank matrix, then you could have infinite solutions or no solutions, and interestingly the next 2 cases cover these 2 aspects: when I have a lot more equations than variables I have the no-solution case, and when I have a lot more variables than equations I have the infinite-solutions case. Since we are able to solve all 3 cases, we should be able to use the solutions to cases 2 and 3 for case 1 when the rank is not full. Depending on whether it is a consistent set of equations or an inconsistent set of equations, you should be able to use the corresponding infinite-number-of-solutions or no-solution result, right?
So, when we typically have equations of the form Ax = b, we write x = A⁻¹b as the solution. The generalization of this is to write x = A⁺b, where A⁺ denotes the pseudo inverse. We would like to be able to calculate the pseudo inverse irrespective of the size of A, and irrespective of whether the columns and rows are dependent or independent.
If I can write one general solution like this, which reduces to the cases that we discussed in this lecture, then that is a very convenient way of representing all kinds of solutions, instead of looking at whether the number of rows is more, the number of columns is more, whether the rank is full and so on. If all of them can be subsumed in one expression like this, it would be very nice, and it turns out that there is such an expression, and that expression is called the pseudo inverse.
So, how do I get this in R? The way you do this in R is you use the library shown on the slide, and the pseudo inverse is calculated using ginv(A); here g stands for generalized. What R does is handle whatever size of problem you give it. Here we have given 2 different examples, where one example has more equations than variables and the second example has more variables than equations.
These are examples that were picked from this lecture itself, and we show that irrespective of the sizes of the matrices A and b, we use the same call ginv(A), and the solution (1, 2) that we got in one example and the solution (-0.2, -0.4, 1) that we got in the other case both come out of this generalized inverse.
This is what is called the minimum norm solution: while there are an infinite number of solutions, this is the solution that is closest to the origin. So, that is the interpretation of these 2 solutions that we want to keep in mind as far as solving linear equations is concerned; nonetheless, the operationalization of how to do this in R is very simple, you simply use the g inverse function.
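The library named on the slide is not visible in this transcript; the MASS package's ginv() is one function that computes this generalized inverse, and the sketch below reproduces the two solutions mentioned above.

```r
library(MASS)   # provides ginv(), a Moore-Penrose generalized inverse

# More equations than variables (the consistent example): solution (1, 2).
A1 <- matrix(c(1, 0, 2, 0, 3, 1), ncol = 2, byrow = TRUE)
b1 <- c(1, 2, 5)
ginv(A1) %*% b1

# More variables than equations: the minimum norm solution (-0.2, -0.4, 1).
A2 <- matrix(c(1, 2, 3, 0, 0, 1), ncol = 3, byrow = TRUE)
b2 <- c(2, 1)
ginv(A2) %*% b2
```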
We can write this as A⁻¹b, or I could also write it as the pseudo inverse times b. In this case the pseudo inverse and A⁻¹ will be exactly the same, and as I mentioned before, since the other 2 cases are covered by the 2 solutions we discussed, I should be able to use the same pseudo inverse times b for both of those cases also, without worrying about whether they are consistent or inconsistent and so on. In all of these cases I will get a solution by using the idea of the generalized inverse.
Thank you and in the next lecture we will take a geometric view of
the same equations and variables that is useful in data science.
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 15
Linear Algebra - Distance, Hyperplanes and Halfspaces, Eigenvalues, Eigenvectors
(Refer Slide Time: 01:21)
One axis represents x1 and there is another axis that represents x2, and depending on the values of x1 and x2 you will have a point anywhere in this plane.
So, for example, if you have, let us say, (1, 1) as your vector, with this coordinate being one and this coordinate being one, then the point will be here, and so on. So, what we are doing here is, we are looking at vectors as points in a particular dimensional space. Since there are 2 numbers here, we talked about a 2-dimensional space; if, for example, there are 3 numbers, then it would be a point in a 3-dimensional space. You could also think of this as a vector, and we define the vector from the origin.
Now, just as a very simple example, if you have the point (3, 4), then you can find the distance from the origin, which is the square root of (3² + 4²), and that is going to be equal to 5. It is important to notice that the geometric concepts are easier to visualize in 2D or 3D; however, they are difficult to visualize in higher dimensions. Nonetheless, since the fundamental mathematics remains the same, what we can do is understand these basic concepts using 2D and 3D geometry and then simply scale up the number of dimensions, and most of the things that we understand and learn will hold at higher dimensions also.
(Refer Slide Time: 04:45)
So, in the previous slide we saw one point in 2 dimensions. Now, let us consider a case where we have 2 points in 2 dimensions. We have x1 here, which has 2 numbers representing its 2 coordinates, and we have x2 here, which also has 2 coordinates. Now, we ask the question as to whether we can define a vector which goes from x1 to x2.
Pictorially, this is the way in which we are going to define this vector: we draw a line starting from x1 to x2, and this vector is x2 - x1; the direction of the vector is given by the arrow here, and much like the previous case, every vector will have a direction and a magnitude.
So, we might ask what the magnitude of this vector is, and that is given by the well-known formula that we see right here. What you do basically is: you take the first coordinate of each of the two points, take the difference and square it; take the second coordinate of each of the two points, take the difference and square it; add both of them and take the square root, and that is the equation that we have here.
This is the length of the vector right here, and it can also be written in the compact form given here, which is the square root of (x2 - x1)ᵀ(x2 - x1).
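As a quick sketch (with made-up points), the same distance formula in R is:

```r
x1 <- c(1, 2)
x2 <- c(4, 6)
sqrt(sum((x2 - x1)^2))          # 5
sqrt(t(x2 - x1) %*% (x2 - x1))  # the compact form, same value
```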
(Refer Slide Time: 06:30)
Now, it is useful to define vectors with unit length, because once you write a vector of unit length, any other vector in that direction can simply be written as the unit vector times the magnitude of the vector that you are interested in.
So, basically what you do is: if you have 2-dimensional vectors, you take the two x coordinates and multiply them, then you take the two y coordinates and multiply them, and when you add both of these you will get the dot product. This dot product, again much like the distance that we saw before, can also be written in the compact form AᵀB; you can quite easily see that the two expressions are the same, and if this dot product turns out to be 0, then we call the vectors A and B orthogonal to each other.
(Refer Slide Time: 09:06)
Now, take the same 2 vectors, which are orthogonal to each other, and you know that when I take the dot product between these 2 vectors it is going to be 0. If I also impose the condition that I want each of these vectors to have unit magnitude, then what could I possibly do? I could take each vector and then divide it by its magnitude.
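A tiny sketch of this normalization, using for illustration the orthogonal pair that appears later in this lecture, is:

```r
u <- c(1, -1, -2)
v <- c(2,  0,  1)
sum(u * v)                    # 0, so u and v are orthogonal
u_hat <- u / sqrt(sum(u^2))   # divide each vector by its magnitude
v_hat <- v / sqrt(sum(v^2))
sqrt(sum(u_hat^2)); sqrt(sum(v_hat^2))   # both now have unit length
```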
So, we are going to introduce the notion of basis vectors. So, the
idea here is the following, let us take R squared which basically means
that, we are looking at vectors in 2 dimensions. So, I could come up
with many many vectors, right? So, there will be infinite number of
vectors, which will be in 2 dimensions. So, this is like saying, if I take
a 2-dimensional space how many points can I get? So, I can get infinite
number of points. Which is what has been represented here.
So, I have put in some vectors and then these dots represent that,
there are infinite number of such vectors in this space. Now, we might
be interested in understanding, something more general than just
saying that there are infinite number of vectors here. So, what we are
interested in is, if we can represent all of these vectors using some
basic elements and then some combination of these basic elements, is
what we are interested in.
So, the key point being, while we have an infinite number of vectors here, they can all be generated as a linear combination of just 2 vectors, and we have shown these 2 vectors here as (1, 0) and (0, 1). Now, these 2 vectors are called a basis for the whole space if I can write every vector in the space as a linear combination of these vectors and these vectors are independent of each other.
Then we call them a basis for the space. So, why do you want these vectors to be independent of each other? We want these vectors to be independent of each other because we want every vector that is in the basis to bring in unique information. If they become dependent on each other, then a vector is not going to bring in anything unique. So, a basis has 2 properties: every vector in the basis should bring something unique, and the vectors in the basis should be enough to characterize the whole space; in other words, the set of vectors should be complete.
(Refer Slide Time: 15:07)
So, this we can formally state as follows: basis vectors for any space are a set of vectors that are independent and span the space, and the word span basically means that any vector in that space can be written as a linear combination of the basis vectors. So, in the previous example we saw that the 2 vectors v1 = (1, 0) and v2 = (0, 1) can span the whole of R squared, and you can clearly see that they are independent of each other, because no scalar multiple of one of them will be able to give you the other.
So, the next question that immediately pops up in one's head is: if I have a set of basis vectors, are they unique? It turns out these basis vectors are not unique; you can find many sets of basis vectors, all of which would be equivalent. The only conditions are that they have to be independent and should span the space. So, take the same example and let us consider 2 other vectors, which are independent.
So, the same example as before, where we had used 2 basis vectors
1 0 and 0 1, I am going to replace them by 1 1 and 1 - 1. Now, the first
thing that we have to check is, if these vectors are linearly independent
or not and that is very easy to verify. If I multiply this vector by any
scalar, I will never be able to get this vector. So, for example, if I
multiply this by - 1 I will get - 1 and - 1, but not 1 - 1. So, these 2 are
linearly independent of each other.
Now, let us take the same vectors and see what happens. So, remember we represented (2, 1) in the previous case as 2 times (1, 0) + 1 times (0, 1). Now, let us see whether I can represent this (2, 1) as a linear combination of (1, 1) and (1, -1). If you look at this, it is indeed a linear combination; notice, however, that because of the way I have chosen these vectors, the coefficients are not the same as before.
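The slide's actual coefficients are not quoted above, so here is a small sketch of how they can be obtained: stack the basis vectors as columns and solve for the coefficients.

```r
B <- cbind(c(1, 1), c(1, -1))   # the new basis vectors as columns
x <- c(2, 1)
coeffs <- solve(B, x)           # coefficients of x in the new basis: 1.5 and 0.5
coeffs[1] * c(1, 1) + coeffs[2] * c(1, -1)   # reconstructs (2, 1)
```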
So, the key point that I want to make here is that the basis vectors are not unique; there are many ways in which you can define the basis vectors. However, they all share the same property: if I have a set of vectors which I call a basis, those vectors have to be independent of each other and they should span the whole space. Whether you take (0, 1) and (1, 0) and call that a basis set, or you take (1, 1) and (1, -1) and call that a basis set, both are all right, and you can see that in each case the vectors are independent of each other and they span the whole space.
However, there are still only 2 vectors. So, while you could have many sets of basis vectors, all of them being equivalent, the number of vectors in each set will be the same. They cannot be different, and this is easy to see. I am not going to formally show this, but it is something that you should keep in mind; in other words, for the same space you cannot have 2 basis sets, one with n vectors and the other with m vectors; that is not possible. If they are basis sets for the same space, the number of vectors in each set should be the same. Now, I do not want you to think that the number of vectors in the basis set will always have to equal the number of elements in each vector.
Now, what we want to ask is, what is the basis set for these kinds of
vectors? Now when I do this here, the assumption is the extra vectors
that I keep generating, the infinite number of them, all follow certain
pattern that these vectors are also following and we will see what that
pattern is. So, what we can do is take, let us say, 2 vectors here; this is how this example has been constructed, to illustrate an important idea. Let us take the vectors v1, which is (1, 2, 3, 4), and v2, which is (4, 1, 2, 3), and let us take some vector in this set and see what happens when I try to write it as a linear combination of these 2 vectors.
So, I can see that if I take this first vector, I can write it as 1 times v1 + 0 times the second vector. That is one linear combination. Now let us take some other vector here; let us say, for example, we have taken the vector (7, 7, 11, 15). We can see that it can be written as a linear combination of 3 times the first vector + 1 times the second vector, and so on.
Now, you could do this exercise for each one of these vectors, and because of the way we have constructed these vectors, you will be able to see that each one of them can be written as a linear combination of v1 and v2. So, what this basically says
is the following, it says that, though I have 4 components in each of
these vectors, that is, all of these vectors are in R4, because of the way
in which these vectors have been generated, they do not need 4 basis
vectors to explain them, all of these vectors have been derived as a
linear combination of just 2 basis vectors, which are given here and
here.
So, this is the same as the previous slide, except that I have removed
the dot dot dot. So, the way to think about this it is let us say there is
some data generation process, which is generating vectors like this, and
the dot dot dots that I have left out, will also be generated in the same
fashion, because those are also vectors that are being generated by the
same data generation process.
So, this is the first vector here, this is the second vector, and so on, all the way up to the last vector, and I have so many vectors. How many fundamental vectors do I need to represent all of these as linear combinations? That is the question I am asking. The answer is straightforward, and it is something that we have already seen before: if you identify the rank of this matrix, it will give you the number of linearly independent columns.
So, what that basically means is, if I get a certain rank for this
matrix, then it tells me there are only so many linearly independent
columns and every other column, can be written as a linear
combination of those independent columns. So, while I have many columns here, 1, 2, all the way up to 10, the rank of the matrix will tell me how many are fundamental to explaining all of these columns, and how many columns I need.
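A small sketch of this idea, constructing a few vectors exactly as linear combinations of v1 = (1, 2, 3, 4) and v2 = (4, 1, 2, 3) (the combinations other than 3*v1 + v2 are made up), is:

```r
v1 <- c(1, 2, 3, 4)
v2 <- c(4, 1, 2, 3)
V  <- cbind(v1, v2, 3 * v1 + v2, 2 * v1 - v2, v1 + 5 * v2)  # 5 vectors in R^4
qr(V)$rank   # 2: every column is a combination of just v1 and v2
```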
Now, if you had generated these vectors in such a way that they are
a linear combination of 3 vectors, then the rank of the matrix would
have been 3. If you had generated these vectors in such a manner, that
they are linear combinations of 4 linearly independent vectors, then the
rank of the matrix would have been 4, but that would be the maximum
rank of the matrix, because in R 4 you would not need more than 4
linearly independent vectors to represent all the vectors.
So, for example, I could choose this column and this column and say these are the basis vectors for all of these columns, or I could choose this one and this one, or this one and this one, and so on. So, I can choose any 2 columns, as long as they are linearly independent of each other, and this is something that we know from what we have learned before, because we already know that the basis vectors need not be unique. So, I pick any 2 linearly independent columns to represent this data. Now, let me take a minute to explain why this is important from a data science viewpoint.
I will just show you some numbers. Suppose I have, let us say, 200 such samples and I want to store these 200 samples; since each sample has 4 numbers, I would be storing 200 times 4, which is 800 numbers.
Now, let us assume we do the same exercise for these 200 samples and we find that we have only 2 basis vectors, which are going to be 2 vectors out of this set. What I could do is store these 2 basis vectors, which would be 8 numbers (2 times 4), and for the remaining 198 samples, instead of storing all the samples and all the numbers in each of them, for each sample I could just store 2 numbers, right?
So, for example, if you take this sample, instead of storing all 4 numbers, I could just store the 2 numbers which are the coefficients of the linear combination that I am going to use to construct it. Since I have 2 basis vectors here, there is going to be some number α1 times the first basis vector + α2 times the second basis vector, which will give me this sample, right?
So, for each sample I store the two constants: the first constant multiplies v1, the second constant multiplies v2, and I get that sample back. So, I store 2 basis vectors, which gives me 8 numbers, and then for the remaining 198 samples I simply store 2 constants each. This would give me 396 + 8 = 404 numbers stored, and I will be able to reconstruct the whole data set.
So, compare that with 800: I have roughly halved the number of values stored. So, when you have vectors in multiple dimensions, let us say vectors in 10 or 20 dimensions, and the number of basis vectors is much lower than those numbers, the savings grow. For example, if you have a 30-dimensional vector and the basis vectors are just 3, then you can see the kind of reduction that you will get in terms of data storage. So, this is one viewpoint from data science on why it is very important to understand and characterize the data in terms of what fundamentally characterizes it: so that you can store less and do smarter computations. There are many other reasons why we would want to do this: you can identify this basis to identify a model for the data, you can identify a basis to do noise reduction in the data, and so on.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 16
Linear Algebra - Distance, Hyperplanes and Halfspaces, Eigenvalues, Eigenvectors
(Continued 1)
So, let us continue with our lectures on linear algebra for data science. We will continue discussing distances, hyperplanes, half spaces, eigenvalues and eigenvectors in this lecture and the lecture that follows. So, what we are going to do is think about equations in multi-dimensional space, and then think about what geometric objects these equations represent.
And to understand this if you look at a picture like this, let us say I
have one equation which is a line, let us draw the other equation which
is also a line. Then if both of these equations have to be satisfied, then
that has to be this intersecting point. So, 2 equations in 2 variables
represent a point if these equations are solvable together.
So, when I substitute X1 into the equation of the line, it should satisfy it, which is what is shown here: nᵀX1 + b = 0. And when I substitute X2 into the line equation, that should also be satisfied, so nᵀX2 + b = 0. Now, what you could do is subtract the first equation from the second equation; the b's will get cancelled and you will have nᵀ(X2 - X1) = 0. Now let us interpret this equation. From vector addition you know that if I have X1, then -X1 is in this direction.
Now, if you want to extend this and ask the question, if I have one equation in a 3-dimensional space, what does that represent? The form of the equation will be very similar: you will have something like this here, and n now, supposing you have 3 variables X1, X2, X3, would be (n1, n2, n3).
So, the same equation would be n1 X1 + n2 X2 + n3 X3 + b = 0, which is nᵀX + b = 0. So, irrespective of whatever the dimension of your system is, you can always represent a single linear equation in the form nᵀX + b = 0.
Now that we have talked about what equations represent and so on, one of the things that we are quite interested in, and you will see this again and again in data science as we teach some of the algorithms later, such as principal component analysis, is projecting vectors onto surfaces. The reason why we are interested in doing this is that many times we might want to represent data through a smaller set of objects or a smaller number of vectors. So, in some sense the data cannot be completely represented by these vectors.
So, let us assume these basis vectors are ν₁ and ν₂. Just to recap what these basis vectors are useful for: any vector on this plane can basically be written as a linear combination of ν₁ and ν₂. That is what we described before, that these basis vectors are enough to characterize every point or any vector on this 2-dimensional plane. So, any vector can be written as a linear combination of ν₁ and ν₂.
Now, the way this picture is drawn, you would see that this is the plane and I have, let us say, a vector that is coming out of the plane. So, this vector is clearly not in the plane; it is projecting out. From the data science viewpoint, if you want to make an analogy, what we are saying here is that I have a data point X which is represented by this vector, and I want to write it simply as a function of ν₁ and ν₂ only. In other words, I want to represent this vector X using only these two vectors, but I cannot do it exactly, since it is projecting out of the plane. So, I might ask what is the next best thing that I could do in this case. It turns out the next best thing would be to project this vector onto the plane, because ultimately, however I write this vector with only these 2 basis vectors, it has to be on the plane.
Now, there are many vectors on the plane, and I want to find the best projection of this vector onto the plane. A common sense idea would be to say: I want a point here, and if this is the projection of this vector, I want this distance to be minimized. You can see why that is: if the vector is already in the plane, the closest point would be the same vector, which is also its projection. So, as soon as this vector goes up slightly outside the plane, I want it to be projected back so that it is closest to that point of projection. So, how do we express these concepts mathematically? We do that here. First, X̂ is the projection of X onto the lower dimension, in this case 2 dimensions.
So, using vector addition, again we can start from here, let us say, and this is X. So, X can be written as X̂ + n, which is what is written here, and X̂ has been expanded to be c1 ν₁ + c2 ν₂. Notice that while we write this, the fact that we are using a projection comes from this n being perpendicular to the plane. So, what does n being perpendicular to the plane mean? If n is perpendicular to the plane, then we know that nᵀν₁ (or ν₁ᵀn, both are the same) will be 0. Similarly, nᵀν₂ = ν₂ᵀn will also be 0. So, these are 2 facts that we know if n is perpendicular to the plane. How we are going to use this to calculate c1 and c2 is what I am going to show you in the next slide.
(Refer Slide Time: 15:09)
So, let us first take ν₁ᵀn = 0, the first equation I wrote. Let me write n as this from the previous slide: because X was c1 ν₁ + c2 ν₂ + n, I simply move c1 ν₁ and c2 ν₂ to the other side, and I have this equation right here.
Now, when I expand this equation, I will get ν₁ᵀX - c1 ν₁ᵀν₁, and I will also have another term, which would be - c2 ν₁ᵀν₂. As the first case, I am going to show you how you do projections onto 2 orthogonal directions. If these 2 directions are orthogonal, that is, the basis vectors themselves are orthogonal, then we know that ν₁ᵀν₂ will be 0; that is the reason why this term drops out, and I have ν₁ᵀX - c1 ν₁ᵀν₁ = 0. Take this to the other side, and then bring ν₁ᵀν₁ to the denominator; then you will get c1 = ν₁ᵀX divided by ν₁ᵀν₁.
Now, you could use the same idea and do the calculations for ν₂ᵀn = 0. When you do this, again you use the fact that ν₂ᵀν₁ or ν₁ᵀν₂ = 0, because these are orthogonal directions, and then you will end up with the equation for c2, which will be ν₂ᵀX divided by ν₂ᵀν₂.
Once you get this, you can back out the projection, and the projection is c1 times ν₁ + c2 times ν₂. So, this is how you project a vector onto 2 orthogonal directions, and this can be extended to 3 orthogonal directions, 4 orthogonal directions and so on; for example, with 3 orthogonal directions you will get ν₁ᵀX / ν₁ᵀν₁ for c1, ν₂ᵀX / ν₂ᵀν₂ for c2, and ν₃ᵀX / ν₃ᵀν₃ for the third constant c3.
(Refer Slide Time: 17:52)
Now, let us take 2 vectors and make a plane. Let us take vector ν₁, which is (1, -1, -2), and ν₂, which is (2, 0, 1), and try to see whether I can project X onto these. Let us first find out whether these 2 vectors are orthogonal. To do that we have to compute ν₁ᵀν₂; with (1, -1, -2) and (2, 0, 1), this will be 1 times 2, plus -1 times 0, plus -2 times 1, that is, 2 - 0 - 2 = 0.
So, we know that these 2 vectors are orthogonal, and we can use the formula that we had before; this formula is what we apply here. The numerator here is ν₁ᵀX (or, the same thing, Xᵀν₁), with ν₁ = (1, -1, -2), and the denominator should be ν₁ᵀν₁, which is 1² + 1² + 2², that is 1 + 1 + 4 = 6. So, this is the constant c1 that we get, and it multiplies ν₁.
If you look at the second term, the numerator is Xᵀν₂ with ν₂ = (2, 0, 1), and the denominator should be ν₂ᵀν₂, which is 2² + 0² + 1² = 5, and this multiplies the vector ν₂.
(Refer Slide Time: 19:50)
So, once you simplify this further, you get the projection as follows: my original data vector (1, 2, 3), when it is projected onto the space spanned by these 2 basis vectors, becomes this.
In other words, if I had a data point (1, 2, 3) and I say I want to represent it with only the 2 vectors that I had identified before, for whatever reason it might be, then the best representation is what this projection gives.
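The projected vector itself is on the slide rather than in the transcript; the sketch below recomputes it from the constants derived above.

```r
v1 <- c(1, -1, -2)
v2 <- c(2,  0,  1)
X  <- c(1,  2,  3)
c1 <- sum(v1 * X) / sum(v1 * v1)   # -7/6
c2 <- sum(v2 * X) / sum(v2 * v2)   #  1
X_hat <- c1 * v1 + c2 * v2         # projection of X onto span{v1, v2}
X_hat
```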
Now, we talked about projecting onto a certain number of directions, and we also talked about projections when these directions are orthogonal. I am going to generalize this in the coming slides, so that we have a result that is general and can be used in many places. So, I am going to look at how we can project vectors onto general directions. Let us consider the problem of projecting X onto the space spanned by k linearly independent vectors. Notice that I have dropped the notion of orthogonality here; I am simply saying these vectors are independent. As before, I want to project X onto these k linearly independent vectors.
And since there are k constants, I have put them in a vector; this would be a vector of dimension k by 1. You can notice that this n by k matrix times a k by 1 vector will give you an n by 1 vector, which is what X̂ is; nonetheless, this n by 1 vector is in the space spanned by these k linearly independent vectors.
Now this is an important thing to notice: if you go back and
expand this, then basically you should get this expression. And this is
another way of thinking about matrix multiplication, which is
important to understand. So, let me illustrate this with some very
simple examples so that we can use this at later times.
198
(Refer Slide Time: 23:25)
So, these two expressions are the same, and you see the correspondence. So, X̂ can be
written as V times c, where V is a matrix in which all these basis
vectors are stacked as columns, and c is the vector of scalar constants which
have been stacked as a single column. Now, let us proceed to identify
the projection from here.
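The slide's derivation is not reproduced in this transcript, so, purely as a hedged sketch, here is the standard least-squares form of that projection, c = (VᵀV)⁻¹VᵀX and X̂ = Vc, written in R; the function name project_onto is my own, not from the lecture.

# General projection of X onto the span of the columns of V
# (k linearly independent, not necessarily orthogonal, vectors).
project_onto <- function(X, V) {
  c_hat <- solve(t(V) %*% V, t(V) %*% X)   # c = (V'V)^{-1} V'X
  V %*% c_hat                              # Xhat = V c
}

X <- c(1, 2, 3)
V <- cbind(c(1, -1, -2), c(2, 0, 1))       # basis vectors stacked as columns
project_onto(X, V)    # agrees with the orthogonal-direction calculation above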
199
(Refer Slide Time: 25:00)
200
So, it is important to understand this idea very clearly. Now that we
have understood projections, in the next lecture I will describe the
notion of a hyperplane and half spaces, and then continue on to
eigenvalues and eigenvectors. I will see you in the next lecture.
Thank you.
201
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture- 17
Linear Algebra - Distance, Hyperplanes and Halfspaces, Eigenvalues, Eigenvectors
202
subspaces; however, if we have hyperplanes of the form Xᵀn
= 0, that is, if the plane goes through the origin, then the hyperplane
also becomes a subspace.
Now that we have described what a hyperplane is, let me move on
to the concept of a half space. To explain it, I am
going to look at this 2 dimensional picture on the left hand side of the
screen. Here we have a 2 dimensional space in X₁ and X₂, and as we
have discussed before, an equation in two dimensions would be a line,
which would be a hyperplane. So, the equation of the line is written as
Xᵀn + b = 0; in two dimensions we could write this line as,
for example, X₁n₁ + X₂n₂ + b = 0. While I have drawn this line only
for part of this picture, in reality this line would extend all the way on
both sides.
Now, you notice the following. When I extend this line all
the way on both sides, this whole two dimensional space is broken
into two spaces, one on one side of the line and the other on the other side
of the line. These two spaces are what are called the half spaces.
Now the question that we have is the following.
If there are points in one half space and points in the other half
space, is there some characteristic that separates them? For example,
can I do some computation for all the points in one half space and get
some value, do the same computation for all the points in the other half
space and get another value, and use that to make some decisions? That
is the reason why we are interested in these half spaces from a data science
viewpoint.
203
classification problem. Let me explain what that means. In fact, we are
going to look at a very specific classification problem called the binary
classification problem. So, let us assume that I have, in a two
dimensional space, data belonging to two classes.
For example, let us say I have data belonging to class one like this,
which I call class 1, and then I have data belonging to class two,
something like this, which I call class 2. These classes could be anything.
For example, this could be a group of people who like South Indian
restaurants and this could be a group of people who do not like South
Indian restaurants, and the coordinates X₁ and X₂ could be some way of
characterizing these people in terms of some attributes. Let us
say we have taken a survey asking whether they like South Indian food
or do not like South Indian food.
Whereas if I gave you another point here, for example, then you
would come to the conclusion that this person is likely to like South Indian
food. So, what we want to do is to be able to evaluate cases
like this. We want to somehow come up with a discriminating
function between these two classes. One way to do that would be
something like this: draw a line between these two classes and then
say, if there is some characteristic that holds on this side of the line,
which is what we called a half space here, and some
characteristic that holds on the other side of the line, then we could use that
characteristic as a discriminant function for doing this binary
classification problem. So, that is the data science interest in
understanding this topic in linear algebra.
204
thing to do is to just take the opposite direction and then define the
normal in that fashion also. So, it is important to know on which side
the normal is defined to understand this. For example, suppose I say this is the
normal for an equation which is Xᵀn + b = 0.
205
We want to understand what this quantity will be for each of these points.
Now, when we look at point X₁, we know that the point lies
on the line, so this is going to be 0; that is straightforward. What
we are interested in is what happens to this quantity for X₃ and X₂, and
whether there is some way in which we can say that every point on one side of
the line will have the same characteristic and every point on the
other side of the line will have a different characteristic. So, to do this,
let us first look at X₃ᵀn + b and see what happens.
So, I want to know what this is. Notice in this picture I have defined
a new point X′ on the line, and then I have another vector Y′ which goes
from X′ to X₃. Now X₃ is the vector that goes from the origin to this point. From
vector addition we know that I can write X₃ as X′ + Y′.
So, what I am going to do is simply substitute this into
the equation and see what happens. So, I am going to have (X′ +
Y′)ᵀn + b.
206
So, you take any point on this side or that side; the angle between
the normal and the vector to that point is going to be the following. Suppose
we measure this angle in this direction. What you are going to notice is the
following: if the point is between these two, then I am going to have a positive
angle θ. Now the way you do this is the following.
So, you go like this. For this quadrant, if you start with 0 here, the
angle is going to be between 0 and 90, and for this
quadrant the angle is going to be between 270 and 360. So, if a point is
on this side, the angle between this vector and this normal is going to be
between 0 and 90, and if the point is on this side, the angle is going to
be between 270 and 360. We also know that when I have a dot product
aᵀb, I can also write it as magnitude of a times magnitude of b times cos θ,
where θ is the angle between these two vectors.
So, we will look at all the points up to here. Whatever the point is,
you have these angles, and all of these angles are between 0 and 90. So,
for any point between here and here in this whole space, the angle between
the normal and the point is between 0 and 90, and we know from our high
school rule, "all silver tea cups", that cos θ will always be positive there. So, aᵀb is
going to be positive; that means this quantity is going to be positive. Now when
you take points here, the angles are going to be between 270
and 360, which is the fourth quadrant. Again, using the same "all
silver tea cups" rule, the fourth quadrant is C for cos, so cos is going to be
positive. So, again you have aᵀb being positive.
So, irrespective of where the point is on this side of the line, when I
take this X₃ᵀn + b, I am always going to get a positive value. By a
similar argument you can say that for any point on the other side, the
other half space, the angles are going to be between 90 and 180 here and
between 180 and 270 there, and as we know, cos θ for angles between 90 and 270 is
negative.
So, for any point on that side of the line, that half space, the
computation X₂ᵀn + b is going to be less than 0. This is an
important idea that I would like you to understand. What this
basically says is the following: if you were to simply take any point
that I give you and evaluate Xᵀn + b, then if that point is in the half space on
the side of the normal, Xᵀn + b will be positive, and if it is in the
half space on the opposite side, then it is going to be negative. And I
already told you how this is important from a data science viewpoint.
207
(Refer Slide Time: 15:41)
So, this point is in the positive half space, and when I take the point (1,
-2), I am going to get 1 - 6 + 4, which is equal to -1, less than 0.
So, this point is in the negative half space. So, this point is on the hyperplane,
that is, on the line, this one is in the positive half space and this one is in the negative
half space. That tells you how to look at different points and
decide on which side of the hyperplane, that is, in which half space, these points
lie.
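As a small illustration, suppose (purely as an assumption, since the exact slide is not reproduced in this transcript) that the line is X₁ + 3X₂ + 4 = 0, i.e. n = (1, 3) and b = 4, which reproduces the value -1 quoted for the point (1, -2); then deciding the half space is just checking the sign of Xᵀn + b:

# Sign of x'n + b tells you which half space the point x lies in.
# n and b are illustrative, not taken from the slide.
n <- c(1, 3); b <- 4
half_space_value <- function(x) sum(n * x) + b

half_space_value(c(1, -2))   # 1 - 6 + 4 = -1  -> negative half space
half_space_value(c(2, 1))    # 2 + 3 + 4 =  9  -> positive half space
half_space_value(c(-1, -1))  # -1 - 3 + 4 = 0  -> on the hyperplane itself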
208
(Refer Slide Time: 17:35)
Now that we have understood hyperplanes and half spaces, we are
going to move on to the last linear algebra concept that I am going to
teach in this module on linear algebra for data science. Once we
are done with this topic, we will have enough information to
teach you the commonly used, first level algorithms in data science.
So, let us look at this idea of eigenvalues and eigenvectors. We have
previously seen linear equations of the form Ax = b. We have looked at
them both algebraically and geometrically, and we have spent quite a bit of
time looking at these equations algebraically. We talked about when these
equations are solvable, when there will be an infinite number of solutions,
and how we address all of those cases in a unified fashion, and so on.
Now what we are going to do is, now that we know both vectors
and so on, we are going to look at a slightly geometrical interpretation
for this equation again and then explain the idea of eigenvalues and
eigenvectors and then connect the notion of these eigenvalues and
eigenvectors with the column space, null space and so on that we have
seen before. So, this is very important, because these ideas are used
quite a bit in data compression, noise removal, model building and so
on. We will start by saying I have this Ax = b, where A is an n by n matrix, x
is n by 1 and b is n by 1. So, this is the kind of system that we are
looking at.
So, we are going to look only at square matrices, n by n. Now you can
think of this as n equations in n variables. There is also another
interpretation you can give for this, which is the following: suppose I
have a vector x, something like this, and I operate A on it. By
operating, I mean we define an operation as pre-multiplying this vector
by A. So, let us say I operate A on this vector, which is Ax; then I notice
from this equation that I get b, which is basically some other new direction
209
that I have. So, you can think about this as an equation which tells me that when I operate A on x, I
get a new vector b which is in a different direction from x. This is
a very simple interpretation of this equation Ax = b, which is what is
written here: I take x, I send it through A, and I define sending it through A as
pre-multiplying by A; so A times x equals b.
So, if this is x, then when I operate A on x, since the result is in the same direction, it is
210
either this way or this way: if λ is positive, it will be in this
direction, and if λ is negative, it will be in the opposite direction, and so on. So,
whether there would be directions like this for all kinds of
matrices is an interesting question that you could ask.
Now the question is, for every matrix A, would there be vectors
like this x, what would be the scalar multiple, and what is the use of
all of this; that is something we should also address at some point as
we go through this lecture. Now let me give you some definitions: these
x are called eigenvectors, and the λs are called the eigenvalues
corresponding to those eigenvectors. So, the questions that we are left
with are whether every matrix would have eigenvectors, and how we
compute these eigenvectors and eigenvalues.
211
(Refer Slide Time: 23:26)
So, I have a vector here and I want zeros here; I want to find an x
such that this is true. Now we have everything that we need to solve
this equation. What I am going to explain to you is the following. I
want to get an x which is not all zeros; notice that if x is all zeros, this is a
solution. So, x = 0 is a solution, but we are not interested in it,
because it is what we call a trivial solution.
212
So, if there needs to be a non-trivial solution for x, then we know that the rank
of the matrix A - λI has to be less than n; that is, the matrix A - λI is
not a full rank matrix, and we know that if a matrix is not full rank,
then its determinant has to be 0. So, in summary, if we
want a non-trivial solution for x, that necessarily means that the
determinant of A - λI has to be = 0.
Now once we solve this equation and compute a λ, we can
go back and substitute the value of λ here, and then we have (A - λI)
x = 0. The way we have chosen λ is such that this matrix does not
have full rank; that means there is at least one vector in the null space,
and using concepts that we have learned before, we can identify this
null space vector, which would become the eigenvector.
213
eigenvalues λ₁ = 10 and λ₂ = 1. Now how do I go ahead and calculate
the eigenvectors corresponding to these eigenvalues?
So, let us illustrate this for λ = 1. I take the eigenvalue-eigenvector
equation, and now that I know λ = 1, it becomes: the matrix with rows (8, 7)
and (2, 3) times the vector (X₁, X₂) equals (X₁, X₂). This turns into two
equations, and if you take the first equation, it is 8X₁ + 7X₂ = X₁.
So, if I take X₁ to this side I get 7X₁ + 7X₂ = 0, which is the same as
X₁ + X₂ = 0. If you take the second equation, it is 2X₁ +
3X₂ = X₂, which says 2X₁ + 2X₂ = 0, which is also X₁ + X₂ = 0.
So, both these equations turn out to be the same. Now any solution
where X₂ is the negative of X₁ would be an eigenvector; what we do is
the following with all of those solutions.
214
(Refer Slide Time: 30:30)
So, basically, any vector such that if X₁ is k then
X₂ is 2k/7 would satisfy this equation; however, what we do
is choose this k in such a way that the magnitude of the
eigenvector is 1. In this case that gives (7/√53, 2/√53); if you compute the
magnitude of this, you will see it is going to be the square root of 49/53 + 4/53,
which is the square root of 1, that is 1. So, you see that the magnitude is 1,
and this equation is satisfied by any eigenvector of the form (k, 2k/7).
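You can check this calculation with R's built-in eigen function; eigen may return the eigenvectors with the opposite sign, which is fine, since any non-zero scalar multiple of an eigenvector is also an eigenvector.

A <- matrix(c(8, 7,
              2, 3), nrow = 2, byrow = TRUE)
e <- eigen(A)
e$values        # 10 and 1
e$vectors       # unit-norm eigenvectors as columns:
                #   for lambda = 10, proportional to (7, 2), i.e. (7, 2)/sqrt(53)
                #   for lambda = 1,  proportional to (1, -1)
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # ~ (0, 0): A x = lambda x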
215
x, some scalar multiple of x itself, where the scalar multiple could be
either positive or negative, we get the eigenvalue-eigenvector equation.
To calculate the eigenvalues, what we do is calculate the
determinant of A - λI and set it to 0; for an n by n matrix, there will be an nth
order polynomial that we need to solve, which opens up the
possibility of the eigenvalues being either real or complex. And once
we identify the eigenvalues, we can get the eigenvectors as the null space
of A - λI, where λ is the corresponding eigenvalue.
In the next lecture, I will connect this notion of
eigenvalues and eigenvectors to things that we have already talked
about before in terms of the column space and null space of matrices and so on.
We already saw that the eigenvectors are actually in the null space of A
- λ I, I am going to develop on this idea and then show you other
connections between eigenvectors and these fundamental subspaces,
and I will also allude to how this is a very important problem; that is
used in a number of data science algorithms. So, I will see you in the
next class.
Thank You.
216
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 18
Linear Algebra - Distance, Hyperplanes and Halfspaces, Eigenvalues, Eigenvectors
This is the last lecture in the series of lectures on Linear Algebra for
data science and as I mentioned in the last class, today, I am going to
talk to you about the connections between eigenvectors and the
fundamental subspaces that we have described earlier. We saw in the
last lecture that the eigenvalue-eigenvector equation results in an nth
order polynomial whose roots are the eigenvalues.
So, for a general matrix, you could have eigenvalues which are
either real or complex. And notice that since we write the equation Ax
= λx, whenever the eigenvalues become complex, the
eigenvectors are also complex vectors. So, this is true in general;
217
however, if the matrix is symmetric, that is, of the
form A = Aᵀ, then there are certain nice properties for these matrices
which are very useful for us in data science. We also encounter
symmetric matrices, quite a bit in data science for example, the
covariance matrix turns out to be a symmetric matrix and there are
several other cases where we deal with symmetric matrices.
218
any general matrix and we already know that eigenvectors could be
complex for any general matrix; however, when we talk about
symmetric matrices, we can say for sure that the eigenvalues would be
real, the eigenvectors would be real, further we are always guaranteed
that we will have n linearly independent eigenvectors for symmetric
matrices. It does not matter how many times the eigenvalues get
repeated. A classic example of a symmetric matrix where
eigenvalues are repeated many times is the identity matrix;
take a 3 by 3 identity matrix, something like this here, which has the
eigenvalue λ = 1 repeated thrice.
219
We can also say that while the eigenvalues are real, they are also
non-negative; that is, they will be either 0 or positive, but none of the
eigenvalues will be negative. This is another important idea that we
will use when we do data science and look at covariance
matrices and so on. Also, the fact that AᵀA and AAᵀ are symmetric
matrices guarantees that there will be n linearly independent
eigenvectors for matrices of this form as well. So, what we are going to do
right now, because of the importance of symmetric matrices in data
science computations, is look at the connection between
the eigenvectors and the column space and null space for a symmetric
matrix. Some of these results translate to non-symmetric matrices also,
but for symmetric matrices, all of these results hold and we can use them.
220
We have also seen this equation before, when we talked about the
different subspaces of matrices; we saw in one of our initial lectures that
null space vectors are of the form Aβ = 0. You will notice that this
and this form are the same. That basically means that ν, which is an
eigenvector corresponding to the eigenvalue λ = 0, is a
null space vector, because it is exactly of the form that we have here. So,
we can say that the eigenvectors corresponding to zero eigenvalues are in
the null space of the original matrix A. Conversely, if the eigenvalue
corresponding to an eigenvector is not 0, then that eigenvector cannot
be in the null space of A. These are important results that we need
to know.
221
So, this r could be 0 also; that means there is no eigenvalue which
is zero, and even then all of this discussion is valid. But as the general
case, let us assume that r eigenvalues are 0. Since we are assuming this
matrix is n by n, there will be n real eigenvalues, of which r are 0, so there
will be n - r non-zero eigenvalues. And from the previous slide, we know
that the r eigenvectors corresponding to these r zero eigenvalues are all in the null
space. Since I have r zero eigenvalues, I will have r eigenvectors
corresponding to them.
So, all of these r eigenvectors are in the null space which basically
means that the dimension of the null space is r; because there are r
vectors in the null space; and from rank-nullity theorem, we know that
rank + nullity = number of columns in this case n; since there are r
eigenvectors in the null space, nullity is r. So, the rank of the matrix
has to be = n - r.
So, that is what we are saying here. And further we know that
column rank = row rank; and since the rank of the matrix is n - r, the
column rank also has to be n - r. This basically means that there are n -
r independent vectors in the columns of the matrix. So, one question
that we might ask is the following: what could be a basis
set for this column space? Or, what could be the n - r independent
vectors that we can use to span the column space?
So, there are a few things that we can notice based on what we have
discussed till now. First, notice that the n - r eigenvectors that we
talked about in the last slide, the ones that are not eigenvectors
222
corresponding to λ = 0, cannot be in the null space, because their λ is a
number which is different from 0. So, these n - r eigenvectors cannot
be in the null space of the matrix A. Let me emphasize again that we are
discussing all of this for symmetric matrices. We also know that all of these
n - r eigenvectors are independent, because we said that irrespective of
what the symmetric matrix is, we will always get n linearly
independent eigenvectors.
So, this is true for any of the n - r eigenvectors which are not
in the null space of this matrix A. Now, take λ to the other side; you
will have this equation as ν = (ν₁/λ)A₁ + ... + (νₙ/λ)Aₙ, where A₁, ..., Aₙ
are the columns of A. Again, ν₁ is a scalar and λ is a scalar, so these are
all constants that we are using to multiply the columns. Now you will clearly
see that each of these n - r eigenvectors is a linear combination of the
columns of A. So, there are n - r linearly independent
eigenvectors like this, and each of them is a combination of columns of
A. And we also know that the dimension of the column space is n - r.
In other words, if you take all of these columns, A₁ to Aₙ, they can be
represented using just n - r linearly independent vectors.
223
So, let us take a simple example to understand how all of this
works. Consider a matrix of this form here; it is a 3 by 3
matrix. The first thing that I want you to notice is that this is a symmetric
matrix: if you check, Aᵀ = A. We said symmetric matrices will
always have real eigenvalues, and the way you do the eigenvalue
computation for this is, you take the determinant of A - λI, set it = 0,
which gives a third order polynomial; you then calculate the three
solutions to this polynomial, which would turn out to be 0, 1 and 2,
and you take each of these solutions, substitute it back and
solve Ax = λx.
224
So, let us first check A times ν₁. Here is the matrix, and I have A times
ν₁ here; you can quite easily see that when you do this computation,
you will get (0, 0, 0), which shows that this is the
eigenvector corresponding to the zero eigenvalue. Interestingly, in our
initial lectures, we talked about the null space and said that a null
space vector identifies a relationship between variables. Since
this eigenvector is in the null space, the eigenvector corresponding to
the zero eigenvalue, or in general eigenvectors
corresponding to zero eigenvalues, identify relationships between
the variables, because these eigenvectors are in the null space of the
matrix.
So, it is an interesting connection that we can make. So, the
eigenvectors corresponding to zero eigenvalue can be used to identify
relationships among variables.
225
Now, let us do the last thing that we discussed: let us check that the
other two eigenvectors shown below, the ones for the other two
eigenvalues, span the column space. What I have done here is take
each one of the columns of matrix A; this is column 1, so this is A₁,
this is A₂ and this is A₃. Column 1 is 6 times
ν₂, column 2 is 8 times ν₂ and column 3 is 2 times ν₃. So, we can say
that A₁, A₂ and A₃ are linear combinations of ν₂ and ν₃.
So, ν₂ and ν₃ form a basis for the column space of matrix A. To
summarize, we have Ax = λx, and we largely focused on symmetric
matrices in this lecture. We saw that symmetric matrices have real
eigenvalues and n linearly independent eigenvectors. We saw that the
eigenvectors corresponding to zero eigenvalues span the null space of
the matrix A, and the eigenvectors corresponding to nonzero eigenvalues
span the column space of A, for the symmetric matrices described
in this lecture. So, with this, we have described most of the important
fundamental ideas from linear algebra that we will use quite a bit in the
material that follows.
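A quick way to see all of these statements together is to run eigen on a small symmetric matrix in R. The matrix below is my own illustrative choice, not the one from the lecture slides: it is of the form vvᵀ, so it has one non-zero eigenvalue and two zero eigenvalues.

# Illustrative symmetric matrix A = v v^T with v = (1, 1, 2); rank(A) = 1.
A <- matrix(c(1, 1, 2,
              1, 1, 2,
              2, 2, 4), nrow = 3, byrow = TRUE)
e <- eigen(A)
e$values                  # approximately 6, 0, 0  (r = 2 zero eigenvalues)

# Eigenvectors for the zero eigenvalues lie in the null space of A:
A %*% e$vectors[, 2]      # ~ (0, 0, 0)
A %*% e$vectors[, 3]      # ~ (0, 0, 0)

# The eigenvector for the non-zero eigenvalue spans the column space:
# every column of A is a scalar multiple of it.
A[, 1] / e$vectors[, 1]   # a constant ratio across the three entries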
226
component analysis which we will be discussing later in this course
where these ideas of connections between null space, column space
and so on are used quite heavily.
I thank you and I hope to see you back after you go through the
module on statistics which will be taught by my colleague professor
Shankar Narasimhan.
227
Data Science for Engineering
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 19
Statistical Modelling
The module is divided into two parts: in the first part we will
provide you an introduction to random variables and how they are
characterized using probability measures and probability density
functions, and in the second part of this module we will talk about how
parameters of these density functions can be estimated and how you
can do decision making from data using the method of hypothesis
testing.
So, we will go on to characterizing random phenomena: what they are,
and how probability can be used as a measure for describing such
phenomena.
228
(Refer Slide Time: 01:03)
Of course, if you are asked to predict the age of a person to an hour
or a minute, the date of birth from an Aadhaar card is insufficient.
You might need the information from the birth certificate, but if
you want to predict the age with a higher degree of precision, let us say
to the last minute, you may not be able to do it with the same level of
confidence. On the other hand, stochastic phenomena are those where there
are many possible outcomes for the same experimental conditions, and
the outcomes can be predicted only with some limited confidence. For
example, if you toss a coin, you know that you might get a head or a tail,
but you cannot say with 90 or 95 percent confidence whether it will be a head
or a tail. You might be able to say it only with 50 percent confidence
if it is a fair coin.
Such phenomena we will call stochastic. Why are we dealing with
stochastic phenomena?
229
(Refer Slide Time: 02:20)
Because all data that you actually obtain from experiments contain
some errors. These errors can arise because we do not know all the
rules that govern the data generating process; that is, we do not know
all the laws, or we may not have knowledge of all the causes that affect
the outcomes, and this is called modeling error. The other
kind of error is due to the sensor itself. Even if we know everything,
the sensors that we use for observing these outcomes may themselves
contain errors. Such errors are called measurement errors.
230
(Refer Slide Time: 03:47)
On the other hand, if you have two successive coin tosses, then
there can be 4 possible outcomes: you might get a head in the
first toss followed by a head in the second toss, or a head in the first
toss followed by a tail, and so on. These four possible
outcomes, denoted by the symbols HH, HT, TH and TT,
constitute what we call the sample space, the set of all possible
outcomes. An event is some subset of this sample space. For example,
for the two coin toss experiment, if we consider the
event of receiving a head in the first toss, then there are two possible
outcomes that constitute this event, which are HH and HT; we call
this event A, the observation of a head in the first toss.
231
(Refer Slide Time: 05:14)
232
that toss 10,000 times instead of 1,000 times you might get a slightly
different number.
233
So, if we look at the joint probability of two successive heads, which
is a head in the first toss and a head in the second toss, because we
consider them as independent events we can obtain the probability of
two successive heads as the probability of a head in the
first toss multiplied by the probability of a head in the second toss, which
is 0.5 into 0.5 = 0.25.
So, all four outcomes in the case of the two coin toss experiment
will have a probability of 0.25; whether you get two successive
heads, two successive tails, a head then a tail, or a tail then a head, all
will be 0.25. This is how we assign the probabilities for the
two coin toss experiment from the probability assignment of a single
coin toss experiment. Now, mutually exclusive events are events that
preclude each other, which means that if event A has
occurred then it implies B has not occurred; then A and B are called
mutually exclusive events, the occurrence of one
excludes the other.
So, let us look at the coin toss experiment again, with two coin tosses in
succession. We can look at the event of two successive heads as
precluding the occurrence of a head followed by a tail. If you tell me
two successive heads have occurred, it is clear that the event of a head
followed by a tail has not occurred. So, these are mutually exclusive
events. The probability of either receiving two successive heads or a
head followed by a tail can be obtained in this case by simply
adding their respective probabilities, because they are mutually
exclusive events. So, we can say the probability of either HH or HT,
which is nothing but the event of a head in the first toss, is simply 0.25
+ 0.25 = 0.5, which is obtained from the basic laws of probability for
mutually exclusive events.
234
(Refer Slide Time: 11:11)
Now, there are other rules of probability that we can derive, and
these can be derived using Venn diagrams. Here we have illustrated
this idea of using a Venn diagram to derive probability rules for the 2
coin toss experiment. In the two coin toss experiment the sample space
consists of 4 outcomes denoted by HH, HT, TH and TT.
So, it verifies that P(Aᶜ) = 1 - P(A). Now you can consider a subset,
in this case the event B denoted by the blue circle, of two successive heads;
notice that two successive heads is a subset of receiving a head in the first
toss, which is event A. So, we can claim that if B is a subset of A,
then P(B) should be less than P(A). You can verify that
P(B), two successive heads, which is 0.25, is less than P(A), which
is 0.5. You can also compute the probability of A or B, which
is not the joint probability but P(A ∪ B), and it can be derived
as P(A) + P(B) - the probability of the
joint occurrence of A and B. Let us consider this example of receiving
235
a head in the first toss which is event A and receiving a head in the
second toss which is event B.
So, overall this gives you 1 - 0.25, which is 0.75, which is what we can
also derive by adding up the respective probabilities
of the mutually exclusive events HT, HH and TH. Such rules
can be proved using Venn diagrams in a simple manner.
236
So, we define what is called the conditional probability; that is, the
probability of event B occurring given that event A has occurred, which can be
obtained by this formula: P(B|A) = P(A ∩ B) divided by P(A), of
course assuming that P(A) is greater than 0. Now using this
notion of the conditional probability P(B|A) and this formula we can derive
what is called the Bayes rule, which simply says that the conditional P(A|B)
multiplied by P(B) equals the conditional P(B|A) multiplied by P(A).
This rule can be easily derived from the first rule by simply
interchanging A and B: P(A|B) multiplied
by P(B) is P(A ∩ B), and the right hand side is also P(A
∩ B), so both of these are equal to P(A ∩ B). We
can also derive the total probability rule for P(A), which is P(A) = P(A|B) P(B) + P(A|Bᶜ)
P(Bᶜ).
However, if you tell me that you have observed event A, that means
the first toss is a head, then in this case the probability of event B
is actually improved: I can tell now there is a 50 percent chance of
getting event B, because you have already told me that the
first toss is a head. Notice that I can compute this conditional
probability P(B|A) using the first rule: P(A ∩ B), which
is 0.25, divided by P(A), which is 0.5. So, this P(B|A) is 0.5,
which has improved my ability to predict B, because I have used some
information you have given about event A. Now, if B and A were
totally independent, then the information that you provided about A would
not affect the predictability of B; it would have
remained the same. In this case it does not remain the same.
237
(Refer Slide Time: 18:53)
Clearly, because there are 50 defective parts and each of these parts
can be uniformly picked, we know that P(A) is the number of
defective parts divided by the total number of parts, which is 50 by
1,000. On the other hand, the probability of picking a non-defective
part is the complement of this, which is 950 divided by 1,000. Now let
us assume that we have picked one part, kept it aside, and we draw a
second part without replacing the first part into the pool. We are
interested in whether the second part that we have picked
is a defective part or a non-defective part.
238
So, the probability of picking a defective part in the second pick,
given that you picked a defective part in the first pick, is 49 by 999. On
the other hand, if you tell me that the first draw is non-defective, the
total number of parts has again reduced to 999, but the number
of defective parts in the pool still remains at 50.
So, the probability of picking a defective part in the second round,
given that the first pick was non-defective, is 50 by 999. Now,
according to the rules of total probability, we can compute
P(C) as P(C | A), which is 49 by 999, multiplied by P(A), which is
50 by 1,000, plus P(C | Aᶜ) multiplied by P(Aᶜ). Remember, A complement is nothing but
B.
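The numbers in this example can be put together in a couple of lines of R; this is just the total probability computation described above.

# 50 defective parts out of 1000; second pick without replacement.
p_A <- 50 / 1000            # P(first pick defective)
p_B <- 950 / 1000           # P(first pick non-defective) = P(A complement)
p_C_given_A <- 49 / 999     # P(second defective | first defective)
p_C_given_B <- 50 / 999     # P(second defective | first non-defective)

p_C <- p_C_given_A * p_A + p_C_given_B * p_B
p_C                         # 0.05, the same as the unconditional defective rate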
239
outcome of A and you will not be able to improve or decrease the
predictability of A in the first pick.
240
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 20
Random Variables and Probability Mass/Density Functions
241
of a sample space to the real line; that is what we are referring to as a
random variable.
242
interval. Notice that in the case of a continuous random variable there are
an infinite number of outcomes, and therefore we cannot associate a probability with
every outcome. However, we can associate a probability that the
random variable lies within some finite interval. So, let us call this
random variable x, which can take any value on the real line from -∞ to
∞; then we define the density function f(x) such that the probability that
the variable lies in an interval a to b is the integral of this
function from a to b.
So, the integral is an area, and the area represents the probability.
This is shown on the right hand side: you can see a function, and
here we show how the probability that the random
variable lies between -1 and 2 is denoted by the shaded area. That
is how we define the probability, and the function f(x) is
called the probability density function.
Again, since this has to obey the laws of probability, the integral
from -∞ to ∞ of this function, that is the area under the entire curve, should
be = 1, and obviously the area is non-negative, so it obeys the second
law that we described in the last lecture as well. We can also define
what is called the cumulative distribution function, denoted by
capital F; this is the probability that the random variable x lies in
the interval -∞ to b. For every value of b you can compute this
function value, and it is nothing but the integral between -∞ and b of
the density function, ∫ f(x) dx.
243
(Refer Slide Time: 05:55)
244
So, finally, the probability of receiving k heads in n tosses, which is
the probability of the random variable taking the value k, is given by the
expression defined on the right hand side here. This
distribution, for various values of k, for example x = 0, x = 1, can be
computed, and such a distribution is called the probability mass
function for this particular random variable; this particular
distribution is called the binomial distribution. As an example, the
binomial probability mass function is shown on the right hand side for
n = 20 and probability p = 0.5. Clearly, it shows that the
probability of receiving 0 heads in 20 tosses is extremely small, and
similarly the probability of obtaining 20 heads in 20 tosses is also
small; as expected, the probability of obtaining 10 heads has the
maximum value, as shown.
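In R, the binomial probability mass function shown on the slide can be evaluated with dbinom; the numeric values in the comments are what these calls return.

k   <- 0:20
pmf <- dbinom(k, size = 20, prob = 0.5)   # P(k heads in 20 fair tosses)

pmf[k == 0]    # ~ 9.5e-07: 0 heads is extremely unlikely
pmf[k == 10]   # ~ 0.176 : 10 heads is the most likely outcome
sum(pmf)       # 1, as required of a probability mass function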
245
We have now considered a continuous random variable. In this case we
will look at what is called the normal density function which is shown on the
right. Usually this normal density function is used to characterize what we
call random errors in data, and its density function is given by

f(x) = 1/(√(2π) σ) · e^(−(x − μ)² / (2σ²)).
Now, this particular density function has two parameters, μ and σ, and it
has the bell shape shown here, which is the normal
density function. Notice that it is symmetric. A particular case of this
normal density function is when μ = 0 and σ = 1; such a normal density
function, with mean μ = 0 and σ = 1, is called the standard normal distribution.
Again, if you want to compute the probability that the standard
normal variable lies within some interval, let us say 1 and 2, you have to
compute the shaded region. You cannot do this
integration of the function between 1 and 2 analytically; you have to use numerical
procedures, and R contains functions for numerically computing the
probability that the variable lies within some given interval.
We will see such computations a little later.
(Refer Slide Time: 11:37)
246
negative values is defined to be exactly = 0 and it turns out that this
distribution arises when you square a standard normal variable.
The k-th moment is defined as E[xᵏ] = ∫ from −∞ to ∞ of xᵏ f(x) dx. If
k = 1, you will call it the first moment; if k = 2, you will call it the second
moment; and so on and so forth. So, if you give all the moments of a
distribution, it is equivalent to stipulating the density function
completely.
247
Typically we will specify only 1, 2 or 3 moments of the
distribution and work with them. For discrete distributions you can
similarly define this moment; in this case the expectation of x to the power k is
defined as a summation, with the integration replaced by a summation over all
possible outcomes of the random variable x. Here xᵢ represents an outcome
and k is the power to which it is raised. So, it is xᵢᵏ times the probability
of obtaining the outcome xᵢ, which is very similar to the integration
procedure except that there are a finite number of outcomes; therefore, we
have replaced the probability f(x)dx with p(xᵢ), the value xᵏ with
the outcome xᵢᵏ, and the integral with a summation.
This is the function whose expectation we want to take, which
can be obtained as the integral of (x - μ)² f(x) dx. In the case of a
continuous distribution we can show that the variance σ² is the expectation
of x², which is the second moment of the distribution about 0, minus μ².
This proof is left to you; you can actually try to prove it. The
standard deviation is defined as the square root of the variance.
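For reference, the proof left to you is a two-line use of the linearity of expectation; a sketch in the notation used above:

\begin{aligned}
\sigma^2 = E\big[(x-\mu)^2\big]
         &= E\big[x^2 - 2\mu x + \mu^2\big] \\
         &= E[x^2] - 2\mu\,E[x] + \mu^2 \\
         &= E[x^2] - 2\mu^2 + \mu^2 \;=\; E[x^2] - \mu^2 .
\end{aligned}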
248
So, the parameters that we have used to characterize the normal
variable are μ, the mean, and σ², the variance. Typically σ² tells
you how wide the distribution will be, and μ tells you the value
at which the density function attains its highest value, the most
probable value. So, μ is also known as the centrality parameter, and σ²,
essentially the width of the distribution, tells you how far the values
are spread around the central value μ. Symbolically, we write this
normally distributed random variable as x ~ N(μ, σ²), where N
represents the normal distribution and the parameters μ
and σ² completely define this density function. A standard
normal or standard Gaussian random variable is a particular normally
distributed random variable which has
mean 0 and standard deviation 1, denoted N(0, 1).
249
(Refer Slide Time: 18:39)
Now, in R there are several functions that allow you to compute the
probability given a value, or the value given the probability. We will
see a couple of examples of such functions. For example, if you
give a value x and ask what is the probability that this continuous
random variable lies in the interval -∞ to this x value that
you have given, then obviously I have to perform the integral from -∞ to x
of the density function.
Notice this integral is nothing but the area
under the curve between -∞ and x. If lower.tail = TRUE, pnorm will give you
this integral value, the area between -∞ and x. On the other hand, if lower.tail
is FALSE, then it will give the area under the curve between x and
∞; that is, x will be taken as the lower limit and ∞ as the higher limit. If you
say lower.tail = TRUE, it will take x as the upper limit and compute the
area under the curve f(x) between -∞ and x; the default value is
TRUE. The norm part of the function name can be replaced by other distributions
like chi-square, exponential, uniform and so on, to give the probability for
other distributions given this value x.
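A few concrete pnorm calls, with the values they return noted in comments; pchisq is included only to show how the norm suffix changes for another distribution.

pnorm(1, mean = 0, sd = 1)                      # P(X <= 1) ~ 0.841
pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)  # P(X >  1) ~ 0.159
pnorm(2) - pnorm(1)                             # P(1 <= X <= 2) ~ 0.136 for N(0, 1)
pchisq(2, df = 3)                               # same idea for a chi-square with 3 d.o.f.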
250
Now, the parameters of the distribution must also be specified in
every case. For the normal distribution there are two
parameters, but other distributions such as the chi-square have one
parameter, the degrees of freedom, and the exponential has one
parameter, the rate λ, and so on. As I said, the
lower.tail argument tells you whether you want the area to the left of x or to
the right of x.
251
There are other functions in R; one of them is qnorm, which does
what is called the inverse probability computation: here you give the probability and
ask what is the limit X. If you give the probability to
qnorm along with the mean and standard deviation parameters of the normal
distribution and you say lower.tail = TRUE, then it will
compute the value of X such that the integral from -∞ to X of the
density function equals the given value p; you specify p and it calculates
X.
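A couple of qnorm calls illustrate the inverse computation; the second call uses mean 70 and standard deviation 10 only because those population values appear later in this module.

qnorm(0.975, mean = 0, sd = 1)     # ~ 1.96: P(X <= 1.96) = 0.975 for N(0, 1)
qnorm(0.975, mean = 70, sd = 10)   # ~ 89.6 for a N(70, 10^2) variable
pnorm(qnorm(0.975))                # recovers 0.975, confirming the inverse relation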
252
And the way this density function is used to compute the probability
is as before: the joint probability that x ≤ a (with -∞ as the other
limit) and y ≤ b (y ranging from -∞ to b), that is, the joint probability
of these two variables lying in these intervals, is
computed as the double integral from -∞ to b and from -∞ to
a of f(x, y) dx dy.
On the other hand, if x and y are independent, then the joint density
function f(x, y) can be written as the product of the
individual, or marginal, density functions of x and y;
that is, f(x) into f(y). This is the extension of the notion of
independent variables in terms of probability that we defined in the
previous lecture, where we said the joint probability of x
and y is basically the probability of x into the probability of y.
253
(Refer Slide Time: 25:01)
Now, we can extend this idea of the joint distribution of two variables
to the joint distribution of n variables. Here I have defined the vector x,
which consists of n random variables x₁ to xₙ. Specifically we
look at the multivariate normal distribution, which we denote by the symbol
x ~ N(μ, Σ), a multivariate normal with mean vector μ and covariance matrix Σ.
Each of these components x₁, x₂ and so on has its respective
mean. If you put them in vector form, we get the mean vector,
symbolically written as the expectation of x, which is a multi-dimensional
integral; we get this value μ, which is known as the mean vector. And
similar to the variance, we define what is called the covariance matrix,
which is defined as the expectation of (x − μ)(x − μ)ᵀ, the deviation of
x about the mean μ. Remember this is a matrix of variables, because x is a vector, and if
you take the expectation of each of the elements of this matrix you
will get the matrix called the variance-covariance matrix.
So, it has a very similar form; we do not need to know the form in detail, we
need to know how to interpret μ and Σ. If you look at the structure
of Σ, you will find that it is a square matrix with the diagonal
elements representing the variance of each of the elements, that is, σ²ₓ₁
is the variance of x₁, σ²ₓ₂ is the variance of x₂, and so on.
The off-diagonal elements represent the covariance between x₁
and x₂, or x₁ and x₃, x₂ and x₃, and so on, the pairwise covariances.
254
Those are the off-diagonal elements; for example, σₓ₁ₓ₂ represents
the covariance between x₁ and x₂, and σₓ₁ₓₙ represents the covariance
between x₁ and xₙ. This particular matrix is symmetric, and we
completely characterize the multivariate normal distribution by
specifying the mean vector μ and the covariance matrix Σ.
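If you want to see a mean vector and covariance matrix in action, one way (assuming the MASS package is available; this is not something the lecture itself uses) is to draw samples from a bivariate normal and check the sample statistics:

library(MASS)                 # for mvrnorm; MASS is assumed to be installed
mu    <- c(0, 0)
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), nrow = 2)   # diagonal: variances; off-diagonal: covariance

x <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
colMeans(x)   # close to the mean vector mu
cov(x)        # close to the covariance matrix Sigma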
255
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 21
Sample Statistics
256
Suppose you want to find out the average height of people in
the world; you cannot go and sample just a few people, or take the heights of
American people alone, because they are known to be much taller
compared to Asian people.
So, when you take samples you should take samples from, let us
say, America, from Europe, from Asia and so on, so that you
get a representative sample of the entire population of the world. This is
called proper sampling procedure, and such procedures are dealt with in the
design of experiments. We will assume that we have obtained a
sample; you have done the due diligence and obtained a
representative sample of whatever population you are trying to analyze.
You have to note that when you actually draw such inferences, your
inference is also stochastic or uncertain, because the samples that you
have drawn are themselves uncertain; they are not the entire
population. So, you should expect that your inferences are also
uncertain, and therefore, when you provide the answers, you should
257
also provide the confidence interval associated with the
estimates that you are deriving.
We did talk about parameters such as the expected value, or the first
moment, and the second moment and so on; different distributions
have different numbers of parameters. How we estimate these
parameters from the small sample that we obtain, and how we give a
confidence region for these estimates, is called estimation. The
258
other kind of decision making that we want to do is to judge
whether a particular parameter of the distribution takes a particular
value, for example 0, or not, and such decision making that we do from a sample is called
hypothesis testing.
So, some of the summary statistics that we can define for a sample
are what we call measures of central tendency; they describe the
center point of the entire sample, you might say. Let us define what
is called the mean; these are measures that you are familiar with from
your high school courses in mathematics.
And we can show that this estimate that we obtain from the sample,
the average, is the best estimate in some sense. Later on, we will set up
what is called the least squares method of estimating parameters, and
if you set up that particular criterion for estimating parameters you will
find that x̅ is the best estimate that you can get from the given sample
of data. We can also show some properties of this estimate. For
259
example, we can prove that x̅ is an unbiased estimate of
the population mean μ, which you do not know anything about.
If you average all these x̅s that you get from different
experimental sets, then the average of these averages will tend to the
population mean. That is one way of interpreting the statement that it is an
unbiased estimate. There are other properties that we will see
and that we may demand of estimates, but this is a useful and important property
of estimates that you should always check.
Let us take one example: we have taken 20 cherry trees and
measured their heights in feet, and we got a
set of numbers; I generated these randomly from a
normal distribution with mean 70 and standard
deviation 10. So, the population mean is 70 and the population
standard deviation is 10, and I got these values. You can use rnorm, for
example, in R in order to generate such data points.
Now, if you take this sample of 20 points and compute the mean,
you get a value of 69.25, which is very close to the population mean.
So, you see that it is a good estimate of the population mean, even
though you did not know what that value was until I told you. Now, on
the other hand, if I take the first data point, 55, and add a bias, wrongly
entering it as 105, let us say by adding 50 to it, and then recompute the
mean, I will find that the mean becomes 71.75.
It starts deviating from 70 more significantly, you see. A single biased
value in this sample actually caused your estimate to become poorer. That is
what we mean by saying that x̅ will get affected by outliers in the data.
We can define other measures of central tendency which are robust with
respect to outliers: even if an outlier exists, the measure does not change by
much. We will see what such a measure is.
260
(Refer Slide Time: 10:13)
So, I have looked at the same 20 cherry trees data, with the data
points ordered from the smallest to the largest. If you count, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, the tenth point is 67 and the eleventh point is 71; because there
is an even number of points, you take the average of these two and call that
the median. If there is an odd number of points, then you take the
middle point just as it is; because here there is an even number of points, you
take the average of the two middle points, in this case the tenth and the
eleventh, and that gives you a median of 69.
Suppose we add a bias to the first data point as before, making it
105, then reorder the data and find the median again; we
find that the median has not changed.
So, the presence of an outlier in this particular case has not affected
the median at all, and that is why we call it a robust measure: even if
there is a bad data point in your sample, it changes little. You can also show that this
estimate is the best estimate in some sense. In this case the criterion that
you are using is what is called the absolute deviation; that is, you are
asking, what is the estimate that deviates from the individual
observations, in the absolute sense, to the least extent? It turns out
that the median is such an estimate. So, when there are
261
outliers, typically we would like to use this as the central measure rather
than the mean.
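The exact 20 cherry-tree heights are not listed in this transcript, so the R sketch below regenerates a comparable sample; the point it demonstrates is the same, even though the individual numbers will differ from the lecture's.

set.seed(1)
heights <- rnorm(20, mean = 70, sd = 10)   # 20 samples from N(70, 10^2)

mean(heights)      # close to the population mean 70
median(heights)

heights_bad <- heights
heights_bad[1] <- heights_bad[1] + 50      # a single wrongly entered value

mean(heights_bad)     # shifts by 50/20 = 2.5 units
median(heights_bad)   # changes little, if at all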
262
(Refer Slide Time: 13:35)
You add this up over all N data points and divide the sum of squared
deviations by N - 1; such a measure is called the
sample variance. Again, you can prove that the sample variance is
an unbiased estimate of the population variance, and the
square root of the sample variance is known as the sample standard
deviation.
Now, just like the mean, the sample variance also happens to be
very susceptible to outliers. If you have a single outlier, the sample
variance, or the sample standard deviation, can become a very poor estimate
of the population parameter. So, we define another measure of spread,
called the mean absolute deviation, somewhat similar in spirit to the
median. In this case, instead of taking the squared deviation, we
take the absolute deviation of each data point from the mean; you can
also take it from the median if you wish. So, for the deviation of each
observation from the mean or the median, you take the absolute value
of this deviation, sum over all N points and divide by N, and that is
what is called the mean absolute deviation.
263
you have one data point, s² will turn out to be 0, because the
mean will be equal to that point. So, really speaking, you have only N -
1 data points to estimate the spread. That is why we divide by N - 1,
to indicate that one data point has been used up to estimate the sample
mean or the median, whatever parameter you actually
estimated.
Similarly, here also you can divide by N - 1 to indicate that only
N - 1 data points were available for obtaining the mean absolute
deviation. A third measure of spread is what is called the range, which is
basically the difference between the maximum and minimum values.
All of these give you an indication of how much the data is spread around
the central measure, which is the mean or the mode or the median as the
case may be.
On the other hand, if I add an outlier of 50 units to the first data point
and recompute s² and s, s² turns out to be
212, and you can see that if I take the square root it is around 14,
which deviates significantly from the population parameter 10. So,
a single outlier can cause the standard deviation and the variance to
become very poor estimates, and therefore they cannot be trusted as good estimates of
the population standard deviation or variance.
On the other hand, let us look at the mean absolute deviation. In this
case, if we do not have an outlier, we get a mean absolute
deviation of 6.9, which is not too bad compared to 10. The moment
you have an outlier, the mean absolute deviation shifts to 9.5; it happens to come
closer to 10, but that is not what is important: the point is that it does not change
much just because of the presence of the outlier. So, this is a much more robust
measure. In fact, if you take the mean absolute deviation from the
median, it would be even better in terms of robustness with respect to
the outlier. The range of the data can be obtained from the maximum and
minimum values, and I have simply reported it.
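The same comparison for the measures of spread can be done in a few lines of R; the mean absolute deviation is written out explicitly here (dividing by N - 1, as suggested above) because R's built-in mad() computes the median absolute deviation instead. The sample is again regenerated, so the numbers will differ from the lecture's.

set.seed(1)
heights <- rnorm(20, mean = 70, sd = 10)
heights_bad <- heights
heights_bad[1] <- heights_bad[1] + 50            # single outlier

mean_abs_dev <- function(x) sum(abs(x - mean(x))) / (length(x) - 1)

sd(heights);  sd(heights_bad)                    # standard deviation inflates badly
var(heights); var(heights_bad)                   # so does the sample variance
mean_abs_dev(heights); mean_abs_dev(heights_bad) # changes much less
diff(range(heights))                             # range = max - min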
So, these are measures of spread. Even if I do not give you the
entire 20 data points, if I tell you the mean is, let us say, 69 and the
standard deviation is 8.5, then you can say that the data will typically spread
between 69 plus or minus 2 times the standard deviation, which is about 17.
So, the lowest value will be about 52 and the highest value will be
about 86, and if you look at the minimum and maximum values of the sample,
that is roughly what they turn out to be.
264
So, plus or minus 2 times the standard deviation from the mean would
represent about 95 percent of the data points if the distribution is normal.
For other distributions you can derive these kinds of intervals if you
wish, but just giving two numbers allows me to tell you some
properties of the sample, and that is the power of these sample statistics.
Now, there are some important properties of the sample mean and
variance which we will use in hypothesis testing, so I want to recap
some of these. Suppose you have observations drawn from a normal
distribution with population mean μ and population variance
σ², and let us say you draw capital N observations
from this distribution; let us assume these draws or samples are
independent, without a bias of any kind.
And if you compute the sample average from this set of samples
independent samples, then you can prove that x̅ is also normally
distributed with the same mean population mean μ. Which means the
expected value of x̅ is μ as I told you before and the expected variance
of x̅ however, is σ2by N.
So, one simple way of dealing with noise and reducing the noise
content in observations is to take N observations at the same
experimental condition and average them. The average will contain
less variability or less noise: the variance of this average will be
1/N times the variance of your individual observations. So, what we
call the noise, the standard deviation, will be reduced by a factor of
square root of N, where N is the number of samples.
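A small R simulation can make this concrete. This is only an illustrative sketch with made-up numbers (σ = 10, N = 25), not part of the lecture's example.

    set.seed(2)
    sigma <- 10   # population standard deviation (assumed for illustration)
    N     <- 25   # observations averaged per experiment

    # Repeat the experiment many times: draw N observations and average them
    xbar <- replicate(10000, mean(rnorm(N, mean = 50, sd = sigma)))

    var(xbar)     # close to sigma^2 / N = 4
    sd(xbar)      # close to sigma / sqrt(N) = 2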
So, my suggestion is: whenever you have a data set, if you can plot
and visualize it, please do. So, let us see some of the standard plots
again; some of these you might have already encountered in your high
school days. We will start with what is known as the histogram. Here I
am given a sample set, and what we do is first divide the sample range
into small intervals; we define a small interval and count how many
observations fall within each interval. Then we plot the intervals on
the x axis and the number of data points falling in each interval,
which we call the frequency, on the y axis.
So, let us take this example of the cherry trees. We have 20 data
points. What I did was divide them into small intervals of 5 feet,
which means I asked how many cherry tree heights fall in the range 50
to 55, 55 to 60, 60 to 65, and so on. I find that between 50 and 55
there are no trees with that height, and we find 4 trees with heights
between 55 and 60, which we can easily see: there is one data point
here, a second data point here, a third data point here, and the
fourth data point is at 60, right at the edge of the interval. So,
there are 4 data points lying between 55 and 60; similarly, we find
two data points lying just above 60 and up to 65, and so on. That is
what we plot as a rectangle for each interval, and this is known as a
histogram. In fact, if I take 100 such samples and plot them, you will
find the standard bell shaped curve, and that is because I drew these
samples from a normal distribution. In this case, because there are
only 20 points, you are not able to clearly see that it is bell
shaped, but you can see that most of the data points are clustered
around the middle, which is around 70, and you can see the highest
count, 6, there. So, at least that much is borne out.
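Here is a minimal R sketch of this kind of histogram. The 20 heights below are hypothetical values made up for illustration; the actual cherry tree data used on the slide is not reproduced here.

    # Hypothetical tree heights (feet), standing in for the slide's 20 cherry trees
    heights <- c(58, 61, 63, 65, 66, 67, 68, 68, 69, 69,
                 70, 71, 72, 73, 74, 75, 76, 78, 80, 83)

    # Bins of width 5 feet from 50 to 85, counts (frequency) on the y axis
    hist(heights, breaks = seq(50, 85, by = 5),
         main = "Histogram of tree heights", xlab = "Height (feet)")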
You have other kinds of plots. One is called the box plot, which is
often used, for example, in visualizing stock prices. Here you compute
quantities called quartiles, Q1, Q2 and Q3, along with the minimum and
maximum values of the data. What are quartiles? Quartiles are
basically an extension of the idea of the median. Q2 is exactly the
median, which means half the points fall below the value of Q2 and
half the points lie above Q2.
So, as an example, if you take the 20 cherry trees and sort them from
the lowest to the highest value and compute the median, it turns out
the median is the average of the tenth and eleventh points, which is
69. The first quartile Q1 can then be computed by taking the first
half, the lowest 10 points, and computing their median, which turns
out to be 64. And Q3 can be computed as the median of the other half,
the remaining 10 points, and that turns out to be around 75. The min
and max, of course, are 50 and 83, and with these five numbers you can
construct the box plot.
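In R, the five-number summary and the box plot can be obtained directly. Continuing with the hypothetical heights vector defined above (not the slide's actual data), a sketch looks like this:

    # Median, quartiles, min and max of the sorted sample
    # note: quantile() interpolates slightly differently from the
    # "median of each half" rule described in the lecture
    quantile(heights, probs = c(0, 0.25, 0.5, 0.75, 1))

    fivenum(heights)                 # the five numbers used to draw the box plot
    boxplot(heights, main = "Box plot of tree heights")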
(Refer Slide Time: 27:21)
The third kind of plot, which is very useful for knowing about the
distribution of the data, is called the probability plot, the p-p plot
or the q-q plot. Here, instead of determining just Q1, Q2 and Q3, you
compute several quantiles, and then plot these sample quantiles
against the corresponding quantiles of the distribution which you
think the data might follow. If the points fall on the 45 degree line,
then you can conclude that the sample data has been drawn from the
distribution you are testing it against.
So, this is useful for visually figuring out which distribution the
data came from. I have taken this example of the 20 cherry trees. I
first standardized them; standardization means we subtract the mean
and divide by the standard deviation. These 20 values are called the
standardized values, and I have sorted them from the lowest to the
highest.
Now, if you look at the 10 percent quantile, I can say that out of the
20 points the first two points fall below -1.679. So, -1.679
represents the 10 percent quantile; similarly, -1.1016 represents the
20 percent quantile, and so on. For example, the 50 percent quantile,
the median of the standardized values, lies between the tenth and
eleventh points and is around 0, close to 0.
Notice that for a normal distribution 50 percent of the data will lie
below 0, and that is what this also seems to indicate. Now, if you go
to the standard normal distribution and compute the value below which
the lower tail probability we talked about is 0.1, you will find the
value is around -1.28, whatever that value turns out to be. So, that
is the 10 percent quantile of the standard normal. Similarly, you ask
what is the value below which 20 percent of a standard normal
distribution lies, that is, the value for which the area under the
curve to its left is 0.2; that is the 20 percent quantile, and so on.
Then you plot the actual value obtained from the sample, which is
-1.679, against the corresponding standard normal quantile, and that
is what is called the probability plot. So now, if this data has been
drawn from the normal distribution, then you should find a plot like
this. I did not plot the normal probability plot for these 20 points,
but for some other set of data. Typically, if the data comes from the
normal distribution, then in the normal probability plot the points
will align themselves on the 45 degree line, and then you can conclude
that, yes, the data has come from that distribution.
You can test this against any distribution. In this case, I have shown you
how to test it against the normal distribution. You can take the
quantiles from a uniform distribution or from the χ squared distribution,
whatever you like, and plot the sample quantiles against the expected
population quantiles. If they fall on the 45 degree line, then you
know the data comes from that distribution.
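In R this comparison is essentially a one-liner. A minimal sketch, again using the hypothetical heights vector rather than the slide's data:

    # Standardize, then compare sample quantiles with standard normal quantiles
    z <- (heights - mean(heights)) / sd(heights)

    qqnorm(z)        # sample quantiles vs. theoretical normal quantiles
    qqline(z)        # reference line; points near it suggest normality

    # Theoretical 10% and 20% quantiles of the standard normal
    qnorm(c(0.10, 0.20))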
Finally, suppose you have two variables, let us say y and x, and I
want to know whether there is any relationship between y and x. Then
one way of visually verifying this dependency or interdependency is to
plot y versus x.
More importantly, if the random variable y, in this case the quiz
marks, has a dependency on the study time, then you will see an
alignment of the data. On the other hand, if there is no dependency,
you will find a random spread: the data will be spread all around with
no clear pattern, in which case you will say that these two variables
are more or less independent and you do not have to discover a
relationship between them.
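A quick scatter-plot sketch in R, with made-up study times and quiz marks purely for illustration:

    set.seed(3)
    study_time <- runif(30, 0, 10)                       # hours (hypothetical)
    quiz_marks <- 5 * study_time + rnorm(30, sd = 5)     # marks with some noise

    plot(study_time, quiz_marks,
         xlab = "Study time (hours)", ylab = "Quiz marks")
    # An upward alignment suggests dependency; a shapeless cloud suggests independence.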
Data Science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 22
Hypotheses Testing
Similarly, let us assume you are a medical practitioner and you want
to ask whether the incidence of diabetes is greater among males than
females, based on data you have gathered about what proportion of
males and what proportion of females have diabetes. You can ask a
similar question in the social sector. If you are a mobile service
provider, you may want to know whether women are more likely to change
service providers than men, and depending on that you might want to
provide more incentives to retain them.
If you have pump performance data that you collect, then based on the
data you want to test whether the efficiency at the current time is
different from its original value η₀. If there is evidence, you will
reject this hypothesis in favor of the alternative hypothesis, which
essentially says that the pump efficiency is less than η₀. So, this is
called the alternative hypothesis. You set up a hypothesis such as
η = η₀, which you call the null hypothesis, and an alternative
hypothesis which you choose in favor of, if the evidence is there in
the data, such as, for example, η less than η₀.
So, all hypothesis tests have this null and alternative structure. We
can have different types depending on whether you are testing for the
mean or the variance, and whether the alternative is less than,
greater than, or two sided, and we will see many such examples.
For example, if you are testing for the population mean you may use
the sample mean as the test statistic; if you are testing for the
population variance you may use the sample variance as the test
statistic, and so on. You also have to derive the distribution of the
test statistic under the null hypothesis, which means: if the null
hypothesis is true, what is the distribution of the test statistic
that you have computed?
You can also alter the performance of the test by choosing the number
of experimental observations you want to make, which is called the
sample size. Now, the test statistic, as we said, is a function of the
observations; sometimes you might have different choices of functions,
and some test statistics may actually perform better than others. That
depends on the theoretical foundations of statistical hypothesis
testing, which we will not touch upon. We do have control over the
number of observations, so we will take a look at how we can alter the
performance based on sample size.
Now, there are two types of hypothesis tests, what we call the two
sided test and the one sided test, and I am giving a simple
illustration to tell you what a two sided test means. Suppose you are
testing for the population mean μ, and you want to test whether this
population parameter is equal to 0 or not equal to 0. So, the null
hypothesis is that the population mean is 0, and the alternative is
that the population mean is not equal to 0. You observe some set of
observations from this particular population, you have a sample, and
you have computed, let us say, a sample statistic; let us also assume
that this statistic happens to be a standard normal variable z. We
will show you later how to construct a test statistic that has this
kind of a distribution for testing the mean.
But let us for the time being take it that we have a test statistic
based on the observations we have made, and this test statistic is the
standard normal statistic z. Now, we know that this z, because it
follows a standard normal distribution, will have a shape like this,
and about 95 percent of the time the statistic will have a value
between around -2 and +2. So, for a two sided test, we will say that
if the test statistic happens to be very large we will reject the null
hypothesis, or if it is very small we will also reject it. Why?
Because if it is very large, it probably does not come from a
distribution with mean 0. In this particular case, by very large we
mean that we choose a threshold, let us say +2, and if the statistic
is greater than 2 we reject the null hypothesis; and by small we mean
that if it is less than -2 we reject the null hypothesis.
So, in this case we reject the null hypothesis if the test statistic
is less than a particular threshold value, here chosen as -2, or if
the statistic is greater than the upper threshold value, which is +2.
There are two thresholds, the upper threshold which is 2 and the lower
threshold which is -2, because it is a two sided test: we want to
reject the null hypothesis if μ is less than 0 or if μ is greater than
0. How we choose these thresholds and what the implications are, we
will see later. But realize that in a two sided test you basically
have a lower threshold and an upper threshold, which you select from
the appropriate distribution.
Now, suppose you have the same setup, but you are only interested in
testing whether the mean is 0 or greater than 0. In this case the null
hypothesis is μ = 0 and the alternative is μ greater than 0. Notice
that this alternative implies that you are not interested in the case
when μ is less than 0; you are not going to reject the null hypothesis
if μ is less than 0, you are going to reject it only if μ is greater
than 0. That is why we have written the alternative like this. This is
called a one sided test.
In this case, let us assume that you have again computed a test
statistic based on the observations and that it is a standard normal
statistic. Then we only have an upper threshold, because we want to
reject the null hypothesis only if the mean is greater than 0. So, if
the statistic is greater than the threshold, we reject the null
hypothesis. We have an upper threshold, in this case I have chosen
1.5, and we reject the null hypothesis if the statistic happens to be
greater than 1.5. If it has a low value we are not bothered, because
we are not bothered about the case when μ is less than 0.
So, whatever be the type of test, whether it is two sided or one
sided, you choose thresholds and then compare the test statistic
against those thresholds.
Now, when you do such a test you can commit two types of errors. Let
us look at this truth table. Suppose the null hypothesis is actually
true and you have made a decision to not reject the null hypothesis;
then you have made the correct decision. This will not happen all the
time, because your sample is random: it is possible that even if the
null hypothesis is true, you may decide to reject it, in which case
you commit what we call a type I error or a false alarm.
So, when the null hypothesis is true and you reject it based on the
sample data, your statistic and the threshold you have selected, we
call this a type I error or false alarm, and its probability is known
as α, the type I error probability.
So, your decision is not perfect; you will always commit some type I
error α, depending on the threshold that you have selected and the
statistic you have computed. Similarly, let us assume that the truth
is that the alternative is correct. In this case it may turn out that
from your sample data you do not reject the null hypothesis, in which
case you commit what we call a type II error, and this type II error
also has a probability, which is denoted by β.
On the other hand, if the alternative is true and you do reject the
null hypothesis, then you have made a correct decision, and that
correct decision probability is known as the power of the statistical
test and is denoted by 1 - β. Remember, only one of two decisions is
ever going to be made: you are either going to reject the null
hypothesis or you are not going to reject it. So, the total
probability will always be 1, and if the probability of a type II
error is β, then the statistical power is 1 - β. So, there are two
types of errors that you can commit, what we call the type I error
probability α and the type II error probability β.
Now, let us see how we actually control the type I error probability
by choosing the appropriate test criterion value. Let us look at a
qualitative comparison of the type I error and the type II error.
Let us assume that under the null hypothesis you have this
distribution, which is known as the distribution of the test statistic
under H₀. Remember that even if H₀ is true, the statistic can take any
value between -∞ and +∞. On the other hand, you are going to choose
some threshold, which I have indicated by this vertical line; you have
to choose some threshold because you cannot say that you will not
reject the null hypothesis for any value between -∞ and +∞. In that
case you would never reject the null hypothesis, whatever the evidence
you get, and that is not fair.
You decide to reject the null hypothesis, let us say in this case a
one sided test, if the statistic exceeds this threshold value. Notice
that even when the null hypothesis is true, the test statistic can
have a value greater than this threshold, and the probability of that
happening is indicated by this blue area. So, this is the type I error
probability that you have committed: if the null hypothesis is true,
your test statistic can still exceed this threshold, and this area is
the probability of that happening; therefore, this is your type I
error probability.
Now, if you move the threshold to the right you can obviously reduce
your type I error probability, but there is a price to be paid. Let us
look at what the price is, by looking at what happens to the type II
error probability. So, let us assume that the actual distribution of
the test statistic under H₁, the alternative, happens to be this kind
of a distribution; of course, this depends on what value the parameter
actually takes.
If you try to make your test perfect in the sense of no type I error,
then you will commit a type II error of one. That means you will have
a very insensitive test: you will never be able to detect whether your
mean is different from 0, for example. So, the less type I error you
are willing to entertain, the less sensitive your test will be. That
is always a trade off and you cannot help it: if we decrease the type
I error probability, then the type II error probability will increase;
there is no choice and you have to accept this trade off.
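The trade-off is easy to see numerically in R. This is a toy sketch, not the lecture's example: assume the test statistic is standard normal under H₀ and normal with mean 2 (and unit variance) under H₁, and look at how α and β move as the one sided threshold moves.

    thresholds <- c(1.0, 1.5, 1.96, 2.5)

    alpha <- 1 - pnorm(thresholds)        # P(reject | H0 true): right-tail area under N(0,1)
    beta  <- pnorm(thresholds, mean = 2)  # P(do not reject | H1 true), assuming H1 mean = 2

    round(cbind(threshold = thresholds, alpha = alpha, beta = beta), 3)
    # Moving the threshold to the right lowers alpha but raises beta.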
Now, let us look at some examples of hypothesis testing. In this case
we have a manufacturer of a solid propellant, and we want this solid
propellant to burn at a certain rate; the burning rate is specified to
be, let us say, 50 cm per second. If it burns at a higher rate, we
will not be able to control the rocket; if it burns at a slower rate,
the rocket may not even take off. So, we are going to check whether
the propellant we have made, which is based on mixing a lot of
different chemicals, will have a burning rate of 50 centimeters per
second.
What we have done is the following: from the mixing bowl where we are
making the solid propellant, we have taken 25 samples from different
locations, and each of these samples we test in the lab to find what
its burning rate is. It will be some value, maybe 48, 49, 51,
whatever. We then compute the average of these 25 samples.
Now, based on this data we have collected, the sample mean, we want to
ask whether the population mean happens to be 50 centimeters per
second. Let us say that the population standard deviation is already
known to you; it is given to be 2 cm per second. It is not estimated
from the sample, it is already known. Let us take that case, which is
the simplest case.
You can refer to the previous lecture to see that the sample mean x̅
is normally distributed with mean equal to the population mean and
standard deviation equal to the population standard deviation divided
by the square root of the number of samples. So, we can now do what is
called standardization: subtract the expected value of the sample
mean, which is 50 under the null hypothesis, and divide by the
standard deviation of the sample mean, which is 2/√25. We get the
standardized value z = (x̅ - 50)/(2/√25), and this standardized value
z will have a standard normal distribution with 0 mean and unit
variance.
Notice that this will be the distribution if the population mean
happens to be 50; that is, under H₀, the null hypothesis, this test
statistic will have a standard normal distribution. Now, we know that
a standard normal variable will lie between plus or minus 2
approximately 95 percent of the time; more precisely, it will lie
between plus or minus 1.96, 95 percent of the time. So, if we are
willing to tolerate a type I error probability of 5 percent, which
means the area to the left of the lower threshold plus the area to the
right of the upper threshold equals 5 percent, then we can choose the
thresholds as 1.96 and -1.96, and that is exactly what is stated here.
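A minimal R sketch of this z-test. The observed sample mean below (49.2 cm/s) is a hypothetical number chosen just to run the calculation; only σ = 2, n = 25 and the hypothesized mean of 50 come from the lecture.

    mu0   <- 50     # hypothesized burning rate (cm/s)
    sigma <- 2      # known population standard deviation (cm/s)
    n     <- 25     # number of samples from the mixing bowl
    xbar  <- 49.2   # observed sample mean (hypothetical value for illustration)

    z <- (xbar - mu0) / (sigma / sqrt(n))   # standardized test statistic
    z_crit <- qnorm(0.975)                  # 1.96 for a two sided 5% test

    c(z = z, threshold = z_crit, reject = abs(z) > z_crit)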
(Refer Slide Time: 25:41)
The alternative hypothesis is that μ1 - μ2 is less than 0, which is
what we choose in case we find enough evidence for rejecting the null
hypothesis.
Notice that I can group all the teachers together because they have
the same variability; there is no difference in the variability of
group A and group B. So, we can pool their variances: we obtain a
pooled variance by taking the sum of squared deviations of all
observations from their respective group means. This is how you obtain
the pooled variance for two groups. Once you obtain this estimate of
the standard deviation for the group of teachers, you can take the
difference in the sample means, because remember you are testing the
difference in means. So, the test statistic is the difference in
sample means, (x̅1 - x̅2), divided by the standard deviation of this
difference, which is what this expression is all about. Sp here is the
pooled standard deviation used to obtain the standard deviation of the
difference in the means, assuming that both groups have the same
variance.
Now, we can show that this particular statistic is a t statistic,
because σ is also estimated from the data. So, we can compare it with
the t distribution, and the number of degrees of freedom of the t
distribution happens to be N1 + N2 - 2. Notice that this comes from
the denominator degrees of freedom, which is essentially the total
number of observations minus the 2 mean parameters that you have used
up in estimating the two means. The remaining degrees of freedom is
N1 + N2 - 2, and that is how this number of degrees of freedom comes
about.
In fact, for large enough N1 and N2 you can approximate this with a
standard normal variable, but in this case let us do a precise job. We
will choose the test criterion from the t distribution with
10 + 10 - 2, that is 18, degrees of freedom, and we find the one sided
rejection region. Remember this is one sided: if we are willing to
tolerate a type I error probability of 5 percent, then there is only a
lower threshold, which is less than 0, notice. The threshold is -1.73,
and the probability that a t distribution with 18 degrees of freedom
is less than -1.73 is 5 percent.
So, if we are willing to tolerate a type I error probability of 5
percent, then we can choose the threshold value as -1.73, drawn from
the t distribution with eighteen degrees of freedom, and compare it
with the statistic. The test statistic itself we compute by plugging
in the values for x̅1, x̅2 and so on, and we get -1.989. Since this is
less than -1.73, we reject the null hypothesis in favor of the
alternative, which means method B is more effective than method A;
that is the conclusion.
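In R, this pooled two-sample t-test can be sketched as follows. The score vectors below are invented for illustration; the lecture only quotes the group sizes of 10, the threshold -1.73 and the statistic -1.989, not the raw data.

    # Hypothetical scores for teachers trained by method A and method B
    groupA <- c(72, 68, 75, 70, 66, 74, 69, 71, 67, 73)
    groupB <- c(78, 74, 80, 76, 72, 79, 75, 77, 73, 81)

    # Pooled (equal-variance), one sided test of H0: mu1 - mu2 = 0 vs H1: mu1 - mu2 < 0
    t.test(groupA, groupB, var.equal = TRUE, alternative = "less")

    qt(0.05, df = 10 + 10 - 2)   # lower threshold, about -1.73 as in the lecture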
Again, we can use the ratio of the sample variances as a test
statistic, and under the null hypothesis this ratio follows an F
distribution with degrees of freedom N1 - 1 and N2 - 1, where N1 and
N2 are the numbers of samples taken from each process. In this case we
have taken equal numbers of samples; even if you have taken unequal
samples, you can appropriately choose the degrees of freedom of the F
distribution and compare. So, in this case, if you plug in the values
for S1 squared and S2 squared, we get an F value of 0.27. If you go to
the F distribution with N1 - 1, which is 49 degrees of freedom, and
N2 - 1, which is also 49 degrees of freedom, and ask for a 5 percent
probability, we break it up into two parts: the area to the left of
the lower threshold should be 2.5 percent and the area to the right of
the upper threshold should be 2.5 percent.
So, we ask what the lower threshold value for this F distribution is,
and it turns out to be 0.567; that means there is a 2.5 percent
probability, under H₀, that this F statistic is less than this value.
Similarly, there is a 2.5 percent probability that the F statistic is
greater than the upper value. So, the lower threshold we choose as
0.567 and the upper threshold as 1.762.
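These two thresholds come straight from the F quantile function in R; a quick sketch (the F value 0.27 is the one quoted in the lecture):

    F_obs <- 0.27                          # ratio of sample variances from the lecture

    lower <- qf(0.025, df1 = 49, df2 = 49) # about 0.567
    upper <- qf(0.975, df1 = 49, df2 = 49) # about 1.762

    c(lower = lower, upper = upper, reject = F_obs < lower | F_obs > upper)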
One application of such a test is in checking whether a coefficient of
a linear regression model is 0 or not. The t test, on the other hand,
is also used for a test of the mean, or a comparison between group
means, when the population variance is not known; we then use the
sample variance for normalization, and that gives rise to a t test.
Here also, whenever we test the coefficients of a regression model
under the assumption that the standard deviation of the errors
corrupting the observations is unknown, we use the t test to check
whether a coefficient of the regression model is 0 or not.
The χ squared test is used for testing the variance of a sample. This
test can also be used to check whether a regression model is good or
not, whether it is of acceptable quality or not: the objective
function of a regression model is a sum of squared terms, which is
similar to a variance, and you can use the χ squared test to check
whether the regression model is acceptable or not.
Thank you.
Data science for Engineers
Prof. Ragunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 23
Optimization for Data Science
We will also introduce you very very briefly to the various types of
optimization problems that people solve. While all of these types of
problems have some relevance from a data science perspective, we will
focus on two types of optimization problems which are used quite a bit
in data science. One is called the unconstrained non-linear optimization
and the other one is constrained nonlinear optimization and as I
mentioned before we will also describe the connections to data science.
(Refer Slide Time: 01:21)
So, we will start by asking what is optimization and Wikipedia
defines optimization as a problem where you maximize or minimize a
real function by systematically choosing input values from an allowed
set and computing the value of the function, we will more clearly
understand what each of these means in the next slide.
Now, what does best mean? You could either say I am interested in
minimizing this functional form or in maximizing it. So, this is the
function for which we want to find the best solution. And how do I
minimize or maximize this functional form? I have to adjust something,
and the quantities that are in my control, which I can change to
maximize or minimize this function, are the variables x. These
variables x are called the decision variables.
Now, why is it that we are interested in optimization in data science?
We talked about two different types of problems. One is what is called
a function approximation problem, which is what you will later see as
regression. In that case we were looking to solve for functions with
minimum error, remember that. The minute I say minimum error, we have
to define what this error is somehow, and the minute we say minimum
error, that basically means we are trying to find something which is
the best; we are trying to minimize something.
So, this part is already there: finding a best value for some
function. And this error is something that we define. For example, if
you remember our linear algebra lectures, we said that if there are
many equations and they cannot all be solved exactly with the given
set of variables, then we could minimize the sum of squared errors,
Σ eᵢ² for i = 1 to m.
So, this is the function that we are trying to minimize, and in that
particular case the variable values are the decision variables. This
whole function, if I call it f, is going to be a function of x, the
values that the variables take. So, you already have a situation where
you are trying to minimize a function and these are the decision
variables. This is completely unconstrained: I have no restrictions on
the values of x I choose; I can choose any value of x I want, as long
as that value minimizes, or finds the best value for, this function.
So, if I put each of the sample points into this equation, I will get
y1 = a₀ + a₁x1, all the way up to yn = a₀ + a₁xn. Clearly, you can
take the view that there are n equations here but only two variables,
so there are many more equations than variables and I cannot solve all
of these equations together exactly. What I am going to do is define
an error function, very similar to what we saw before:
e1 = y1 - a₀ - a₁x1, all the way up to en = yn - a₀ - a₁xn. I know
that I have only two variables that I can identify, a₀ and a₁;
however, there are n equations. So, what I am going to do is minimize
the sum of squared errors, which is something that we talked about.
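Here is a small R sketch of exactly this idea: minimizing the sum of squared errors over the decision variables a₀ and a₁. The data are simulated for illustration, and the result is compared against lm(), which solves the same least squares problem.

    set.seed(4)
    x <- runif(50, 0, 10)
    y <- 3 + 2 * x + rnorm(50)           # simulated data: true a0 = 3, a1 = 2

    # Objective: sum of squared errors as a function of the decision variables (a0, a1)
    sse <- function(a) sum((y - a[1] - a[2] * x)^2)

    fit <- optim(c(0, 0), sse)           # unconstrained minimization from a starting guess
    fit$par                              # estimated a0, a1

    coef(lm(y ~ x))                      # same answer from the closed-form least squares fit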
Now, in terms of the other bullet point I have here, which is to find
the best hyperplane to classify this data: this is also something we
had seen before, where we looked at data points and said, for example,
that I could have lots of data here corresponding to one class, as I
described when we were talking about linear algebra, and lots of data
points there corresponding to another class. Now, I want to find a
hyperplane which separates these.
Now, you could ask which is the best hyperplane that separates these.
You could say I could draw a hyperplane here, or here, or here, and so
on. The minute I ask which one I should choose, since we know that
these hyperplanes are represented by an equation, asking which
hyperplane to choose basically means asking which equation to use,
which in turn means asking what the parameters in that equation should
be. So, I want to find the parameter values that I should use in that
equation, and those become the decision variables: the parameters that
characterize these hyperplanes are the decision variables.
And in this case, the function that I am trying to optimize expresses
the requirement that when I choose a hyperplane I should not
misclassify any data. For example, I have to choose the hyperplane in
such a way that all of this data is in one half space of the
hyperplane and all of that data is in the other half space. So, you
see that again this classification problem becomes an optimization
problem.
So, in summary we can say that almost all machine learning
algorithms can be viewed as solutions to optimization problems and it
is interesting that even in cases, where the original machine learning
technique has a basis derived from other fields for example, from
biology and so on one could still interpret all of these machine learning
algorithms as some solution to an optimization problem. So, basic
understanding of optimization will help us more deeply understand the
working of machine learning algorithms, will help us rationalize the
working.
So, if you get a result and you want to interpret it, if you had a very
deep understanding of optimization you will be able to see why you got
the result that you got. And at even higher level of understanding you
might be able to develop new algorithms yourselves.
So, an optimization problem has three components. The first is the
objective function that you minimize or maximize. The second is the
set of decision variables X. And the third component is the
constraints, which basically restrict X to some set that will be
defined as we go along. Whenever you look at an optimization problem,
you should look for these three components. In cases where the
constraints are missing, we call these unconstrained optimization
problems; in cases where constraints are present and the solution has
to satisfy them, we call them constrained optimization problems.
If either the objective function or the constraints are non-linear
functions, then we have what is called a nonlinear programming
problem. So, a programming problem becomes non-linear if either the
objective or the constraints become non-linear.
In general people used to think non-linear programming problems are
much harder to solve than linear programming problems which is true
in some cases, but really the difficulty in solving non-linear
programming problems is mainly related to this notion of convexity.
So, whether a non-linear programming problem is convex or non
convex is an important idea in identifying how difficult the problem is
to solve.
So, this idea of convex and non convex very very briefly without
too much detail we will see in the next few slides nonetheless I just
wanted to point this out here and also wanted to describe the second
type of optimization problem that is of interest which is the nonlinear
programming problem.
Till now we have been talking about the types of objective functions
and constraints; however, we have always assumed that the decision
variables are continuous. In many cases we might want a decision
variable not to be continuous but to take integer values. For example,
I could have an optimization problem where I have f as a function of,
let us say, two variables x1 and x2, and I want to minimize it. Now, I
could say x1 is not continuous but has to take a value from an integer
set, let us say {0, 1, 2, 3, ...}, and x2 maybe also has to take a
value in such a set. This is called an integer programming problem.
Now, when you combine variables which are both continuous and integer,
for example when I have f(x1, x2) and x1 has to take a value in
{0, 1, 2, 3} whereas x2 is continuous and can take any value, let us
say within a range, then we have what are called mixed integer
programming problems. If both the constraints and the objective are
linear, then we have a mixed integer linear programming problem, and
if either the constraints or the objective become non-linear, then we
have a mixed integer non-linear programming problem. So, these are the
various types of problems that are of interest.
Now, these types of problems have been solved and are of great
interest in almost all engineering disciplines. In chemical
engineering, for example, we solve these types of problems routinely
for optimizing, let us say, refinery operations or for designing
optimal equipment, and similarly in all engineering disciplines these
optimization problems are used quite heavily. From this lecture's
viewpoint, what we want to show is how we can understand some of these
optimization problems and how they are useful in the field of data
science.
(Refer Slide Time: 20:57)
So, when you look at this optimization problem you typically write
it in this form where you say I am going to minimize something, this
function here, and this function is called the objective function. And
the variable that you can use to minimize this function which is called
the decision variable is written below like this here x and we also say x
is continuous, that is it could take any value in the real number line.
And since this is a univariate optimization problem x is a scalar
variable and not a vector variable. And whenever we talk about
univariate optimization problems, it is easy to visualize that in a 2-D
picture like this. So, what we have here is in the x axis we have
different values for the decision variable x and in the y axis we have
the function value. And when you plot this you can quite easily notice
that, this is the point at which this function right here attains its
minimum value.
So, the point at which this function attains minimum value can be
found by dropping a perpendicular onto the x axis. So, this is actual
value of x at which this function takes a minimum value and the value
that the function takes at its minimum point can be identified by
dropping this perpendicular onto the y axis and this f * is the best value
this function could possibly take. So, functions of this type are called
convex functions because there is only one minimum here. So, there is
no question of multiple minima to choose from. There is only one
minimum here and that is given by this.
So, in this case we would say that this minimum is both a local
minimum and also a global minimum. We say it is a local minimum
because in the vicinity of this point this is the best solution that you can
get. And if the solution that we get the best solution that we get in the
vicinity of this point is also the best solution globally then we also call
it the global minimum.
Now, contrast that with the picture that I have on the right hand
side. Now, here I have a function and again it is a univariate
optimization problem. So, on the x axis I have different values of the
decision variable and on the y axis we plot the function. Now, you
notice that there are two points where the function attains a minimum,
and when we say minimum here we really only mean a local minimum,
because if you notice this point here, in the vicinity of this point
the function cannot take any better value from a
minimization viewpoint. In other words if I am here and the function is
taking this value if I move to the right the function value will increase
which basically is not good for us because we are trying to find
minimum value, and if I move to my left the function value will again
increase which is not good because we are finding the minimum for
this function.
What this basically says is the following. This says that in a local
vicinity you can never find a point which is better than this. However,
if you go far away then you will get to this point here which again from
a local viewpoint is the best because if I go in this direction the
function increases and if I go in this direction also the function
increases, and in this particular example it also turns out that
globally this is the best solution. So, while both are local minima,
in the sense that in their vicinity they are the best, this second
local minimum is also the global minimum, because even if you take the
whole region you still cannot beat this solution.
So, when you have a solution which is the lowest in the whole
region then you call that as a global minimum. And these are types of
functions which we call as non convex functions where there are
multiple local optima and the job of an optimizer is to find out the best
solution from the many optimum solutions that are possible.
For example, neural network training algorithms would often get stuck
in such local optima, which is not good enough for the type of
problems that were being solved.
So, that became a real issue with neural networks, and in recent years
this problem has been revisited: now there are much better algorithms,
much better functional forms and much better training strategies, so
that you can achieve some notion of global optimality, and that is the
reason these algorithms have made a comeback and become very useful.
However, notice this picture right here: if I start here, I want to
improve my function, so I will keep improving it, and when I come here
there is no way to improve it any further. So, I might call this my
best solution and the data science algorithm will converge. However,
if I start here, then I would more likely end up here, and then I will
say this is the best I can get and I will stop my data science
algorithm.
So, the solution you get can depend on how you initialize your
problem: with one initialization you might get this as a solution, and
with another you might get this. In other words, the algorithm will
not give you the same result consistently, and more importantly, if
the global minimum is very difficult to find, most of the time your
algorithm will give you a result which is a local minimum. In other
words, you could do much better, but you are not able to find the
solution that does much better.
So, if I take f(x) and let us say I am at a particular point, what I
can do is write a Taylor series approximation of this function, which
we would have seen before in high school. Let us say x* is the minimum
point and let us see what happens to the Taylor series approximation
around this point. I am going to say the function can be approximately
written as f(x) = f(x*) + f′(x*)(x - x*) + (1/2)f″(x*)(x - x*)² + ...
Now, if you notice the first expression, f(x*) is a number, because x*
is a point that I know; I simply evaluate f at that x*. So, this is a
number and not a function of x. However, the second term, third term
and so on will all be functions of x; in other words, if I change x
these are the terms that will change, while the first term will remain
the same.
Now, you can see that this term involves (x - x*), the next term
involves (x - x*) squared, and so on. In this univariate case let us
call (x - x*) as δ. So, if I go a distance δ from x*, I will have a δ,
a δ², a δ³ and so on, multiplying fixed numbers. Now let us look at
the sum of these terms. If you keep reducing δ to smaller and smaller
values, which is what we meant when we said we are looking at it
locally, then at some point δ becomes so small that none of the higher
order terms matter: the sign of the whole sum depends only on the
first order term f′(x*)δ. So, if this term is positive the whole sum
will be positive, and if this term is negative the whole sum will be
negative.
Now, notice that if this first order term is positive for a positive
δ, then unfortunately when I go in the negative direction, -δ, it will
become negative, because f′(x*) is a fixed number: if f′(x*)δ is
positive, then f′(x*)(-δ) is negative. That basically means that x*
cannot be a minimizer, because I can further reduce the function by
going to the left. So, unless f′(x*) is 0, x* cannot be a minimum.
(Refer Slide Time: 35:54)
For the example function on the slide, when we take the first
derivative and set it to 0, we get 3 solutions: x = 0, x = -1 and
x = 2. The second derivative at x = -1 is 36 and at x = 2 it is 72; in
both cases it is greater than 0. So, the points x = -1 and x = 2 are
both minimum points of this function, because both of them satisfy the
two conditions f′(x*) = 0 and f″(x*) > 0. Now, it is interesting that
at this stage we cannot say anything more about these two points; the
actual values 36 and 72 do not help, we only check whether they are
positive or not, yet of these two points clearly one is a local
minimum and the other is the global minimum. The only way to figure
out which is which is to actually substitute them into the function
and see what values you get. When you substitute -1 into the function
you get -2, and when you substitute 2 you get -29. Since we are
interested in minimizing the function, -29 is much better than -2. So,
that basically means x = 2 is the global minimum of this function and
x = -1 is only a local minimizer of f(x).
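The lecture does not reprint the function itself, but a quartic of the form f(x) = 3x⁴ - 4x³ - 12x² + 3 is consistent with every number quoted above (stationary points at 0, -1 and 2, second derivatives 36 and 72, function values -2 and -29). Here is a hedged R sketch using that assumed form to verify the conditions:

    # Assumed form consistent with the lecture's numbers (not printed on the slide)
    f   <- function(x) 3 * x^4 - 4 * x^3 - 12 * x^2 + 3
    fp  <- function(x) 12 * x^3 - 12 * x^2 - 24 * x    # f'(x)
    fpp <- function(x) 36 * x^2 - 24 * x - 24          # f''(x)

    xs <- c(0, -1, 2)          # stationary points where f'(x) = 0
    fp(xs)                     # all zero: first order condition
    fpp(xs)                    # -24, 36, 72: x = -1 and x = 2 are minima
    f(c(-1, 2))                # -2 and -29: x = 2 is the global minimum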
Data Science for Engineers
Prof. Ragunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 24
Nonlinear Optimization Unconstrained Multivariate Optimization
In dimensions higher than 2, that is, if there are more than 2
decision variables, it is difficult to visualize. So, what we are
going to do is explain some of the main ideas in multivariate
optimization through pictures such as the one shown on the left side
of the slide. As I mentioned before, even for cases where there are
two decision variables we need to go to 3 dimensions, simply because
if this axis is x1 and this axis is x2, I need a third dimension to
describe the value of the function f(x1, x2). So, the objective
function value becomes the third axis.
Nonetheless, if you start moving in the x1, x2 plane, you can compute
the objective function at different points. Let us say I compute the
objective function at this point; then the value lies outside the
plane, and this is the value of the objective function at this point.
If I compute the objective function at another point, its value is
again outside the plane, and so on, and if I go in this direction I
might come here, and so on.
Now, if you have a plane that is parallel to the x1, x2 surface, then
the objective function value is a constant across that plane, because
wherever you project onto it you are at a particular f(x1, x2) value,
or z value. So, what one could say is that if I cut this surface with
a plane parallel to the x1, x2 surface, I am going to get what are
called contours on the x1, x2 surface. Think about it this way: here
is a plane that is cutting the surface; on the plane, wherever the
surface is cut you have a contour, and we project that contour onto
the x1, x2 surface. That is the plot here. So, for example, we could
take z = 5, have that plane cut the surface, and see what the
projection of that cut onto the x1, x2 axes will be.
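A quick R sketch of this idea, using a simple bowl-shaped function of my own choosing rather than the one on the slide:

    f <- function(x1, x2) x1^2 + 2 * x2^2        # illustrative convex function

    x1 <- seq(-3, 3, length.out = 60)
    x2 <- seq(-3, 3, length.out = 60)
    z  <- outer(x1, x2, f)                       # objective values over the grid

    contour(x1, x2, z, levels = c(1, 2, 5, 10),  # slices at constant z, e.g. z = 5
            xlab = "x1", ylab = "x2")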
So, this is the basic idea about how we optimize this function.
Now, as I speak you would have noticed that there are two decisions
that I need to make. The one decision that I need to make is of all of
these directions, what direction should I choose? So, I have to sit here
and then make a choice about the direction that I need to figure out.
And once I choose a particular direction let us say I choose this
direction how far should I go in this direction is another decision I
should make.
Because if I go too far in that direction, I may actually have made my
objective function worse. So, there are two important things that I
need to decide: one is the direction in which I should move on the
decision variable surface, and once I figure out which direction to
move in, how much I should move in that direction. Those are the two
important questions that we need to answer.
So, you can see how hard it can become in the case of functions like
this, where if, let us say, you start from here, then clearly one of
the good things to do would be to go to this minimum. You will be
here, and from here, when you look at the conditions for a minimum,
there will not be any difference between the conditions here and the
conditions here in terms of the first order and second order
conditions that we talked about in the univariate case; we will see
what the equivalent conditions are in the multivariate case
subsequently.
However, from just those conditions you will not see any
difference, nonetheless if you actually compute the objective function
value at this point and this point this will be much smaller than this.
However, when you are here you have no reason to suspect that a point
like this really exists unless you do considerable analysis. So, in cases
like this what you will have to do is, you have to see whether you can
improve it further, that basically means though you know locally you
are very good here you have to do some sacrifice and then try and see
whether there are other points which could be better. So, there are
algorithms which will let you jump here and then maybe will jump
here and so on, but these are all algorithms where it is very difficult in
a general case to prove that I will go and hit the global minimum.
From a data science viewpoint, what it means is that the error is not
as small as it could be. However, from the model viewpoint, if you
were to change the parameters from this value in any direction, you
will find that the error actually increases in the local region. So,
there is very little incentive to move away from this point, because
locally you would only be increasing your objective function value.
Ultimately, your algorithm might find parameters which, while they may
be acceptable, are not the best. This is one problem that really needs
to be solved to have good, efficient data science algorithms.
(Refer Slide Time: 14:54)
So, let us get back to finding out analytically how we solve this
problem. Suppose you have a multivariate optimization problem where
z = f(x1, x2, ..., xn). Let us just contrast this with the univariate
case, z = f(x), with just one variable. Then, remember, we said the
necessary condition for a minimum is dz/dx = f′(x) = 0, and then
d²z/dx² = f″(x) > 0 for a minimum. These are the conditions that we
described in the previous lecture.
Now, this first row is with respect to variable x1, and you can do the
same thing with respect to variable x2. Notice here that x2 comes
before x1 because it is the second row, and the diagonal entry is
always with respect to the same variable differentiated twice:
∂²f/∂x2², and so on down to ∂²f/∂xn². Also notice that this Hessian
will be a symmetric matrix, because for most functions
∂²f/∂x1∂x2 = ∂²f/∂x2∂x1, and similarly the term ∂²f/∂x1∂x3 and the
term ∂²f/∂x3∂x1 will be the same. So, the Hessian matrix is going to
be symmetric. Remember, in the linear algebra lecture we said that we
would be seeing symmetric matrices quite a bit, and here is a
symmetric matrix that is of importance from an optimization viewpoint.
Much like what we did in the univariate case, we are going to do a
Taylor series approximation,
f(x) ≈ f(x*) + ∇f(x*)ᵀ(x - x*) + (1/2)(x - x*)ᵀ H(x*) (x - x*) + ...,
and what I have done here is written it up to two terms; there are
more terms. What we are going to do is make the argument that if you
make the distance between the point you are at and the next point you
choose very small, then whatever is the leading term in the sum is
going to decide the sign of the whole sum. In other words, if you keep
making this distance as small as needed, then whether the infinite sum
is positive or negative can be identified from the first non-zero term
alone: if that is positive, the whole sum is going to be positive, and
so on. That is the kind of logic that we are going to use again here.
Much like before, if I keep making the step small, I need to look only
at the first order term ∇f(x*)ᵀδ, and much like the univariate case,
if this does not go to 0, I can make this term either positive or
negative. To see this, take a particular direction α and compute
∇f(x*)ᵀα. If this number turns out to be negative, the function
decreases along α; and if it is positive, then going in the opposite
direction -α gives ∇f(x*)ᵀ(-α), which is negative. That means I can
always find a nearby point whose function value is smaller than f(x*),
and if such a point exists then x* cannot be a minimum.
So, whatever you do, unless the gradient ∇f(x*) goes to 0, I cannot
ensure that this is a minimum point. The first condition we get is
that ∇f(x*) = 0. Once that is 0, I am left with just the second order
term, and if you notice, this term is of the form δᵀHδ, where H is the
Hessian matrix. We know that H is a symmetric matrix, and let us also
make sure we understand this clearly: the function f is a scalar
function, H is an n by n matrix, δᵀ is 1 by n and δ is n by 1, so when
I compute δᵀHδ I get a 1 by 1 quantity, which is a scalar.
So, coming back to this, in the last slide I wrote this as δᵀHδ > 0,
with H symmetric; H is basically this second derivative matrix. Now,
we did not see this in the linear algebra lectures, but if I need this
condition to be satisfied irrespective of what δ is, then we call H a
positive definite matrix. So, if H is positive definite, then δᵀHδ
will be greater than 0 for all δ ≠ 0; clearly, when δ = 0 this
quantity will be 0.
So, how do I check whether a matrix that I compute is positive
definite or not? Remember from the linear algebra lecture that if I
have a symmetric matrix, then its eigenvalues are real. So, symmetric
matrices always have real eigenvalues, and these eigenvalues could be
positive or negative. Now, the linear algebraic result for a positive
definite matrix is this: if the matrix has, let us say, n eigenvalues
and all of these eigenvalues are greater than 0, then the matrix is
called positive definite.
In other words, if all the eigenvalues of this matrix are greater than
0, it is automatically guaranteed that whenever we compute δᵀHδ for
any direction δ we will always get a positive quantity; this has
already been proved. Why do we want this to be positive for every
direction? Because we want f(x*) to be the lowest value in its
neighborhood, and we said that will happen if this quantity is
positive for every δ. That condition can be translated to H being
positive definite, and H being positive definite can be translated to
the condition that λ1 to λn, the n eigenvalues of H, are strictly
greater than 0.
So, to summarize: in the univariate case the two conditions are
f′(x*) = 0 and f″(x*) > 0. In the multivariate case these translate to
∇f(x*) = 0 and the Hessian matrix being positive definite.
So, what you can do is first construct the ∇f vector, starting with
∂f/∂x1. For the example function on the slide, the first term
differentiates to 1, the x2 term gives 0, the 4x1² term gives 2 times
4x1, that is 8x1, the -x1x2 term gives -x2, and the last term gives 0;
so ∂f/∂x1 = 1 + 8x1 - x2. Similarly, when you do ∂f/∂x2, the x1 term
goes to 0, there is a 2 corresponding to the x2 term, a -x1
corresponding to the cross term, and a 4x2 corresponding to the last
term; so ∂f/∂x2 = 2 - x1 + 4x2, which is what we have here.
So, we have these two equations that we need to solve. When we solve
them, we get a solution x1*, x2*. I can check whether this is a
maximum point or a minimum point; to do that, I have to construct the
second derivative matrix, and the way you do that is the following.
The first entry is ∂²f/∂x1². We already have ∂f/∂x1; if you
differentiate it with respect to x1, the only term remaining is 8,
which is what we see here. For the off-diagonal entry we take ∂f/∂x2
and differentiate it with respect to x1; the only term remaining is
-1, which goes here, and since I already told you this is a symmetric
matrix, you can simply fill in the -1 on the other side as well. To
get the last entry, I take ∂f/∂x2 and differentiate it again with
respect to x2; the only thing remaining is 4, which is here. So, the
Hessian is the 2 by 2 matrix with 8 and 4 on the diagonal and -1 off
the diagonal. Now, what I need to do is compute its eigenvalues, and
when I do that I find that both eigenvalues are positive, which means
this is a minimum point.
Now, when we look at this equation here there are two equations in
two variables and both are linear equations. So, there is going to be
only one solution here and it turns out that that solution is a minimum
for this function. So, this finishes our lecture on multivariate
optimization in the unconstrained case.
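Here is a hedged R sketch of this calculation. The function f(x1, x2) = x1 + 2x2 + 4x1² - x1x2 + 2x2² is an assumed form that reproduces the derivatives quoted in the lecture (∂f/∂x1 = 1 + 8x1 - x2, ∂f/∂x2 = 2 - x1 + 4x2 and the Hessian entries 8, -1 and 4); the slide's exact function is not reprinted here.

    # Gradient of the assumed f:  grad f = H %*% x + c(1, 2), with constant Hessian H
    H <- matrix(c(8, -1,
                  -1, 4), nrow = 2, byrow = TRUE)

    x_star <- solve(H, c(-1, -2))   # solve grad f = 0 for the stationary point
    x_star

    eigen(H)$values                 # both positive => H is positive definite => minimum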
Data Science for Engineers
Prof. Ragunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 25
Nonlinear Optimization Unconstrained Multivariate Optimization
We are going to solve these problems using what we call a directional
search. The idea is the following: if you are on top of a mountain,
skiing, and you are interested in reaching the bottom-most point from
where you are, as shown pictorially through this picture, you will see
that there are several different points on this surface. This is a
point which is at the bottom of the hill, so we call it a minimum
point.
So, the aim is to reach the bottom most region. Typically what you
would do is the following. So, if you are at a particular point here and
then you say look let me go to the bottom of the hill as fast as possible,
then you would look around and find the direction where you will go
down the fastest. So, this is a direction we call the steepest descent, the direction in which I can go down really fast; I will find that direction and then go down along it.
Now, the way optimization algorithms work is the following: you are at a point, you find the steepest descent direction, and then what you do is you keep going in that direction for a sensible amount of time, or in this case a sensible length of step in that direction.
The reason for this is the following, the reason is you could find this
as the steepest descent direction and you could keep going in this
direction, but let us say beyond this point you really do not know
whether this is going to be the steepest descent at that point also. In other words, is this going to continue to be the steepest descent till I get to my best solution? Now, that is something that you cannot guarantee easily.
318
direction; if not, you find a new direction and then go in that direction.
So, that is a basic idea of all steepest descent algorithms.
Now, notice that you know we do the steepest descent and let us say
we end up here, then at that point you will find no direction where you
can improve your objective function that is you cannot minimize your
objective function anymore. In which case you are stuck in the local
minimum. So, there are optimization algorithms which, when they try
to get out of the local minimum. The only way to do it is let us say if
you are here the only way to get to global minimum is to really climb
up a little more and then find directions and maybe you will find
another direction which takes you here.
319
x1 and move to x2 and so on, till you find the solution that you are
happy with.
Now, we will show you what the steepest descent direction is, but
you figure out this direction somehow. So, this is going to be a
direction vector. So, this direction vector will be something like s1k, s2k, all the way up to snk. So, let us assume that we have somehow figured
this out, then the real question is what is the step length. So, one idea is
to figure out the step length, so that this when substituted into the
objective function is an optimum in some sense.
So, that is what we are going to try and do. So, the key take away
from this is that if you are at a current point which you know and if you
somehow figure out a search direction, then the only thing that you
need to then calculate is the step length. And since step length is a
scalar what happens is a multivariate optimization problem has been
broken down into a search direction computation and finding the best
step length in that direction which is a univariate optimization because
we are looking for a scalar α.
320
In general this kind of equation that we see here you will see in
many places as we look at machine learning algorithms in clustering, in
neural networks and many other places, in machine learning techniques
this is called the learning rule. Why is it called the learning rule? It is
called the learning rule because you are at a point here and you are
going to a new point you are learning to go to a point which is better
than wherever you are, and I mentioned to you before that we could think of these machine learning algorithms as being solutions to optimization problems.
So, if you talk about neural networks one of the well known
algorithms is what is called a back propagation algorithm. It turns out
that the back propagation algorithm is nothing but the same gradient
descent algorithm. However, because of the network and the several layers in the network, it is basically gradient descent combined with an application of the chain rule, which we all know from high school. Similarly, in clustering algorithms you would see that the algorithms turn out to be minimizing a Euclidean distance.
So, let us now focus on the steepest descent and the optimum step
size that we need to take. So, the steepest descent algorithm is the
following. At iteration k you start at a point xk. Remember with all of
these optimization algorithms you would have to start with something
called an initialization which is x naught and this is true for your
machine learning algorithms also. All of them have to start at some
point and depending on where you start, when you go through the
sequence of steps in the algorithm you will end up at some point let us
321
call x*, and in many cases, if the problem is non-convex, that is, there are multiple local minima and global minima, the point that you will end up at is dependent not only on the algorithm, but also on the initial point that you start with.
That is the reason why in some cases if you run the same algorithm
many times and if the choice of the initialization is randomized, every
time you might get slightly different results. So, to interpret the
difference in the results you have to really think about how the
initialization is done. So, that is an important thing to remember later
when we learn machine learning algorithms.
So, as I said before, we start at this point xk and then we need to find
a search direction and without going into too much detail the steepest
descent will turn out to be a search direction sk which is basically the
negative of gradient of f(x), where f(x) is your objective function. So,
if f(x) is an objective function of this form with this many decision variables, then ∇f is basically ∂f/∂x1 all the way up to ∂f/∂xn, and negative ∇f would be this; we keep this as the search direction, sk = -∇f, and this is called the steepest descent search direction.
The key thing that I really want you guys to notice here is the following: xk is known and the function is known. So, to get to xk+1, xk is known, and since the function is known we also know ∇f, and sk is given as -∇f evaluated at xk. So, basically this is going to be, let us say, some functional form -g1(x) all the way up to -gn(x), and all you are going to do is simply substitute x = xk. So, that basically gives you the search direction. So, this is given, and this is calculated once that is given.
Then the only thing that I need to find out is this αk, and the way the value for αk is found is by looking at this f(xk+1). Now, substitute this xk+1 into the function, so you are going to have f(xk + αk sk). In this you know xk and you know sk, so this f is going to simply become a function of α, right. So, let me put αk here. So, this is going to be a function of αk.
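A small sketch of this exact line-search idea in R (the objective used here is only an illustrative quadratic, reconstructed from the partial derivatives quoted in the next lecture's numerical example, so treat it as an assumption): given xk and the steepest descent direction sk = -∇f(xk), the step length is obtained by a univariate minimisation over α.

f      <- function(x) 4*x[1]^2 + 3*x[1]*x[2] + 2.5*x[2]^2 - 5.5*x[1] - 4*x[2]
grad_f <- function(x) c(8*x[1] + 3*x[2] - 5.5, 3*x[1] + 5*x[2] - 4)

x_k <- c(2, 2)                                         # current point
s_k <- -grad_f(x_k)                                    # steepest descent direction
phi <- function(alpha) f(x_k + alpha * s_k)            # f restricted to the search line
alpha_k <- optimize(phi, interval = c(0, 1))$minimum   # univariate problem; interval chosen arbitrarily for the sketch
x_next  <- x_k + alpha_k * s_k                         # the learning rule / update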
322
is optimized for, then you go on to x2 = x1 + α1 s1 and so on. And you
keep doing this till you use some rule for convergence you say at some
point the algorithm is converged. So, that point is what I am going to
call as x*.
Thanks.
323
Data Science for Engineers
Prof. Ragunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 26
Numerical Example Gradient (Steepest ) Descent (OR) Learning Rule
324
the objective function value. So, we are trying to look at how an
algorithm would minimize this function in a numerical fashion.
Now, notice from the previous lecture we described that while you
try to minimize these functions you do what are called contour plots.
And if you looked at the previous slide you would notice that x 1 and x2
are in a plane and the objective function is a value that is projected
outside the plane. So, if you think about constant objective function
values then what you think about is a constant z value in the previous
graph, and a constant z value would be a plane which will be parallel to
the plane of the decision variables x1 and x2. So, when that cuts the
surface that we saw in the previous graph then you have what are
called these contour plots.
325
value for x1 and here is a value for x2. So, we have picked some x1, x2,
and we have initialized this problem.
So, let us look at this picture here. When we start we have x 0 which
is initialization which is this point. What we need to do is we need to
find a direction in which to move and once we find a direction in which
we would like to move, then we will find out a learning rule or a
learning constant which will take us to the next point. So, the initial guess in this case that we have chosen is (2, 2), and the f(x0) value when you substitute this (2, 2) is 19.
So, the next step is the following. We are going to compute x1 from x0. Now, if I take this direction as -∇f, which is what we discussed last time and which I have written here as f′, then the equation for the new point will be x1 = x0 - α f′(x0). So, this gradient is evaluated at the point that you are currently at, and the same equation becomes this here. So, just to illustrate the idea of how an optimization approach works, we are going to pick some α here, which we have picked as 0.135; there is a way in which you can automate this, but here we are just using this number.
Now, what you need to do is, once you identify this if you look at
this equation this direction which is a gradient direction has to be
evaluated at x0 and since our original x0 is 2 2, I am going to substitute
326
the values 2 and 2 into these equations. So, I have 8 times 2 + 3 times 2 - 5.5, and 3 times 2 + 5 times 2 - 4; this is the learning parameter and this is our original point, which gives me a new point x1 after a simple computation. And when you compute the function value at x1, you notice that the original function value was 19 and now it has come down to this number right here.
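This first iteration can be reproduced in R as below. The objective is not written out explicitly in the transcript, so it is reconstructed here from the quoted partial derivatives (with the constant term taken as 0); the reconstruction does reproduce the quoted values f(x0) = 19 and f(x1) = 0.0399.

f      <- function(x) 4*x[1]^2 + 3*x[1]*x[2] + 2.5*x[2]^2 - 5.5*x[1] - 4*x[2]
grad_f <- function(x) c(8*x[1] + 3*x[2] - 5.5,   # df/dx1
                        3*x[1] + 5*x[2] - 4)     # df/dx2
alpha <- 0.135                                   # fixed learning parameter from the lecture
x0 <- c(2, 2)
f(x0)                          # 19
x1 <- x0 - alpha * grad_f(x0)  # learning rule / gradient descent update
x1                             # (-0.2275, 0.3800)
f(x1)                          # about 0.0399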
So, the point that we are trying to make is that this direction is actually a good direction: you move in the direction and you find that your objective function value decreases, which was our original intent, because we are trying to minimize this function. And as I mentioned before, this gradient descent is usually called the learning rule. So, when you have parameters, let us say, that you are trying to learn for a particular problem, this equation keeps adjusting the parameters till it serves some purpose, and this adjustment is usually called the learning rule in machine learning.
So, this is the new point that we are at, and if you find out this value, which is 0.0399, and then set this constant k to be that number, whatever f(x1) was, which is 0.0399, then you would notice that the equation form remains the same except that this constant has changed. So, this is continuing to be an ellipse, but it is an ellipse that has shrunk from your
original ellipse. So, the constant contour plot, this blue plot, is the plot
at which f(x) will take a value 0.0399. So, wherever you are on this
blue curve or the blue contour the objective function value is the same.
So, this is a first step of the learning rule that we see.
327
Now, let us proceed to the next iteration. The next iteration is pretty
much exactly the same. What you can see here is that now we start with x1, which is what we identified from the previous iteration, and x2 = x1 - α f′(x1). So, pretty much we are doing the same thing: I substitute the value of x1 here, α remains the same, and I compute the same ∂f/∂x, but now I evaluate the gradient at the new point x1. So, if you notice, in the last slide we had put 2 and 2 for these values, but in this one you would see I am using the x1 values -0.2275 and 0.3800. And in this case the learning rate remains constant, but there are more sophisticated algorithms where you could actually optimize the size of this learning parameter as you go along in the algorithm.
Nonetheless, the ideas are pretty much the same, only that this number will keep changing from iteration to iteration. Now, we get a new value x2, and notice that this new value, when substituted into the function, gives f(x2), an even smaller objective function value; in fact, the objective function has become negative. So, this new point is shown here as x2. Now, again, much like how we discussed the previous iteration in the last slide, in this iteration if you were to take the function f(x) and set it equal to -2.0841, then that would again be an elliptical contour, and that contour is actually described by this green contour.
So, for one more iteration you can simply follow through the steps, the same thing
328
here, x2 from the previous slide gives the new x3 value, which is x2 - α f′(x2). Now, again, the gradient is evaluated at the new point and the α remains the same, and now you notice from the previous slides. Let us go back
quickly to the previous slide and see what the value was. The value of
the objective function was - 2.0841.
Now, when you look at the new point x3, the objective function value has decreased even more; it has become -2.3341. So, we notice that at every step of the algorithm the objective function keeps improving; for us here, in this problem, improvement means the objective function value keeps coming down, and since at every step the objective function value keeps coming down, our hope is that at some point it will hit the minimum value. How do you know whether it is a minimum value or not? That is something that we will discuss in the next slide.
329
(Refer Slide Time: 14:16)
So, you could go through this process. The third iteration maybe gives us this x3, and then the next iteration gives you an f(x4) value which is this, and you can notice that the objective function value has come down.
And you would expect that, because as you get closer and closer to the optimum, you know that the optimum point is where the gradient goes to 0. So, close to the optimum, if the function is reasonably continuous, you are going to have derivatives which are very small, and if you notice, in each of these steps this is one thing that dictates the size of the step you take. And if this is a constant value, the size of the step you are going to take is going to come down, and also the improvement in your objective function is going to come down. But keep in mind the objective function keeps improving; all I am saying is that the amount by which it improves will keep coming down.
330
So, if you do this for a few more iterations you will get to the
optimum point which is this solution 0.5, 0.5 and the function value at
this optimum point turns out to be this. Now, there are a couple of things that I would address here. So, when you write an algorithm like this, or when an algorithm like this runs, you have to tell the machine to stop the algorithm at some point, which is what in optimization terminology is called a convergence criterion. And the way a convergence criterion works is the following: there are many ways in which you could pose a convergence criterion and then say the algorithm has converged.
So, the logic behind this is that if you are making minor modifications to your parameters, you can keep doing it to try to get to the perfect value, but at some point it stops making much of a difference. So,
you could use this norm as we call it which is the difference between
these two values at two different iterations as a condition for saying the
algorithm converges. That is when this becomes small enough you say
the algorithm has converged.
You could also simply take the difference between the objective
function values in two iterations for example. When that becomes very
very small you could think about saying that the algorithm has
converged. Or you could take the derivative at every point and then
when the derivative norm becomes very small you could say the
algorithm converges. The logic behind these is that in this case we are saying, well, we are doing this, but we are not really improving our objective function, so I am going to be happy with whatever I get at some point and then say: if you do not improve significantly, and what is significant is something that you define, I am going to stop the algorithm.
So, you could do that. Or when you do the norm of this you know
ultimately at the optimum value you know the gradient has to be 0, that
means, the norm of this vector has to be 0. So, when grad f becomes
very close to 0 then you could say I have converged my algorithm and
I am going to stop the algorithm at that point. So, in typical
optimization packages or software there are these various options that
331
you can use to ask for convergence to be detected and the algorithm to
stop at that point.
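Putting the pieces of this lecture together, a compact sketch of the whole loop in R, with the gradient-norm stopping rule and the step length kept constant as in the lecture (the objective is again the reconstructed quadratic, an assumption rather than the slide's own code):

grad_f <- function(x) c(8*x[1] + 3*x[2] - 5.5, 3*x[1] + 5*x[2] - 4)
x <- c(2, 2)          # initialization x0
alpha <- 0.135        # constant learning parameter / step length
tol <- 1e-6           # convergence tolerance on the gradient norm
for (k in 1:1000) {
  g <- grad_f(x)
  if (sqrt(sum(g^2)) < tol) break   # converged: gradient is (almost) zero
  x <- x - alpha * g                # steepest descent / learning rule update
}
x                                   # converges to about (0.5, 0.5), as stated in the lecture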
So, this gives you an idea of how the analytical expression that we
started with for maximum or minimum is converted into a gradient rule
and these are all called as gradient based optimization algorithms. And
then we showed you a numerical example of how actually this gradient
based optimization algorithm works in practice. We also made the
connection between these algorithms and machine learning and as I
mentioned before most of the machine learning techniques you can
think of them as some form of an optimization algorithm and the
gradient descent is one algorithm which is used quite a bit in solving
data science problems.
A couple of other things to notice. The direction for changing your values from iteration to iteration we have, in this case, taken as the steepest descent; there are many other ways of doing this, and you can choose directions using other ideas that many other algorithms use.
So, we here in this introductory course on data science we focused on
the most common and the simplest of the search directions which is the
negative of the gradient at that point.
And again these algorithms also keep changing the value of the
learning parameter or the step length as they would call it in
optimization algorithms, iteration to iteration. In this case we have kept
that to be a constant just to make sure that we explain the fundamental
ideas first before moving on to more complicated concepts.
Nonetheless, I just want you to remember that this learning parameter
is something that could be changed optimally from iteration to iteration
in a given optimization algorithm.
So, with this I hope you have got a reasonable idea of univariate and multivariate unconstrained non-linear optimization.
What we are going to do in the next lecture is to look at how we can
introduce constraints into this formulation, and what effect does a
constraint have on the formulation and how do we solve constrained
optimization problems. And as I mentioned before these constraints
could be of two types, equality constraints and inequality constraints.
We will see how we can solve optimization problems with equality
constraints and inequality constraints. So, I will see you in the next
lecture.
Thank you.
332
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 27
Multivariate Optimization with Equality Constraints
333
In some cases while we are trying to optimize or minimize our error
or the objective function, we might know some information about the
problem that we want to incorporate in the solution. So, if for example,
you are trying to uncover relationships between several variables and
you do not know how many relationships are there, but you know for
sure that certain relationships exist and you know what those
relationships are, then when you try to solve the data science problem
you would try to constrain your problem to be such that the known
relationships are satisfied.
Now, if you look at the optimum value for this function, just by inspection you can see that the optimum value is 0, because this is a function where I have terms which are squares of the decision variables. So, there are 2 terms here, 2x1² + 4x2², and the lowest value that each one of these could take is 0, and that basically means the unconstrained minimum is at x1 = 0, x2 = 0, and the objective value at that point is also 0.
So, which is what is shown here in this point right here. So, if I had
no constraints I would say the optimum value is 0 and it is at 0 0 the
star point here. Now let us see what happens if I introduce a constraint; in this case I am going to introduce a very simple linear constraint. So, let us assume that we have a constraint of this form, which is 3x1 + 2x2 = 12.
334
Now, what this basically means is the following: you are looking for a solution in x1 and x2 which also satisfies this constraint. So, though I know the very best value for this function is 0 from a minimization viewpoint, I cannot use that value, because the (0, 0) point might not satisfy this equation. So, you will notice that if I put x1 = 0, x2 = 0, it does not satisfy this equation. So, the unconstrained
solution is not the same as the constrained solution.
So, you notice that as you pick different points on this line the
objective function takes different values and what we are interested in
is the following. Of all the points on this line, I want to pick that one
point where the objective function value is the minimum. To
understand this let us pick two points and see what happens. So, if I
pick a point here and let us say I pick a point here and ask the question
which one of these points is better from a minimization of the original
function view point. The way to think about this is the following.
So, when I pick a point here I know that the objective function
value would be based on the contour which intersects at that point and
if you compare these 2 points you will see that this point is worse off than this point, from a minimization viewpoint. This is because, if you look at these contours, these are contours of increasing objective function values, and the contour that intersects this point is within the contour that intersects the line at this point. So, basically what that means is, because this is the direction of increasing objective function value, the contour intersecting this point is inside the contour intersecting at that point. So, that basically means this function takes a lower value at this point on the line.
So, as you go along the line you see that the value keeps changing
and my job is to find that particular point where the objective function
value is the minimum. So, this is a basic idea of constrained
335
optimization solution that we are looking for. The key point to notice
here is that the unconstrained minimum is not the same as the
constrained minimum. If it turns out that the unconstrained minimum
itself is on the constraint line then both would be the same, but in this
case we clearly see that the unconstrained minimum is different from
the constrained minimum.
So, when I have just only one constraint, how do I solve this
problem? I will first give you the result and then explain the result
again by going back to the previous slide and then showing you
another viewpoint and then how that leads to this solution.
336
Now, let us see what happens if I am trying to solve a problem where I have to minimize f(x1, x2, ..., xn). But, like I said before, let us assume that I also introduce one constraint, say a constraint of the form h(x1, x2, ..., xn) = 0. So, this is an equality constraint. We can always write a constraint in this form: even if you have some number on the right hand side, you can always move it to the left hand side and write the constraint as something equal to 0. So, basically my job is to find the minimum point of f subject to this constraint. I will just give you the result and then we will see how we get this result.
So, when I want to solve this problem, recall that in the unconstrained case ∇f = 0 itself gave us the result; in this case it will turn out that the result is the following. We can write that the negative of ∇f has to be equal to some λ times ∇h, that is, -∇f = λ∇h. You can write this with the negative, or drop the negative and the sign of the value λ will take care of that, but I am writing it in this particular form.
So, just to expand, this basically says that -[∂f/∂x1 ... ∂f/∂xn] = λ [∂h/∂x1 ... ∂h/∂xn]. So, if we expand this further, I am going to get n equations. In the unconstrained case I set these equations to 0; in this case I am going to get equations of the form -∂f/∂x1 = λ ∂h/∂x1 as one equation, the second equation will be -∂f/∂x2 = λ ∂h/∂x2, and so on; the last equation will be -∂f/∂xn = λ ∂h/∂xn.
So, now notice, much like before, I have n equations; the difference is that in the unconstrained case I had zeros on the right hand side, while in the constrained case I have these terms on the right hand side. Nonetheless, these equations are now in n + 1 variables, because I have my x1, x2, all the way up to xn, and I have also introduced a new variable λ right here. So, my equations are in n + 1 variables. I have only n equations at this point, but notice that I have one more equation that I need to use, and that equation is the following: if I find some solution x1 to xn which satisfies all of these equations.
Then that also has to satisfy the constraint. So, here we are only
talking about the gradient form of the various functions, but the
equation which represents the constraints also needs to be satisfied by
any solution that we get for the constrained optimization problem. So,
with these n equations I will also get another equation, which is that h(x1, x2, ..., xn) has to be 0.
Now, you notice that in this case, with one equality constraint, I have n + 1 equations in n + 1 variables, so I can solve this. To reiterate, the difference between the constrained and unconstrained cases is: in the unconstrained case we had n equations in n variables, while in the constrained case with just one constraint we have n + 1 equations in n + 1 variables.
337
Now, you might ask what happens if there is more than one constraint, let us say 2 constraints.
So, we will see what happens if there are 2 constraints. In this case we are going to minimize f(x), and now let us say we have 2 constraints, h1(x) = 0 and h2(x) = 0. In this case what is going to happen is the following: the solution to the constrained optimization problem is going to satisfy -∇f = λ1 ∇h1 + λ2 ∇h2.
So, if you look at this and the one constraint case you will see the similarities. In the one constraint case we introduced one extra parameter; in the 2 constraints case we introduce 2 extra parameters. In the one constraint case we simply said -∇f = λ∇h; in this case the λ∇h becomes a sum, λ1∇h1 + λ2∇h2, and so on.
338
directly given by the constraints. So, since the optimum point has to satisfy both the constraints, I get one extra equation h1(x) = 0 and another extra equation h2(x) = 0. So, if I have 2 equality constraints in n variables, I will have n + 2 variables and n + 2 equations, and we will always find this to be true, because if I had 3 equality constraints, I would have n + 3 variables, which would be x1 to xn and λ1, λ2, λ3; this gradient equation will always give me n equations, and the 3 extra constraints would have given me the 3 extra equations.
So, let us go back to the previous slide, where I was discussing and explaining how equality constraints affect the optimum solution. I said there are many points on this line which could all be feasible solutions, feasible solutions meaning those are all solutions which satisfy this equation. Nonetheless, there are some points out of those, or one point, which would give me the lowest objective function value. So, when we looked at candidate solutions for this optimization problem, we were looking at the points on this line, because that is the constraint.
Now, let us take a slightly different viewpoint and look at the same problem from an objective function viewpoint. From an objective function viewpoint, if you did not constrain me at all and you said you could do anything you want, then I would pick this point as the solution. Now, when you look at this point and say, well, this is the best point I have, let me find out whether it satisfies my constraint, you substitute this point into the constraint and you figure out that it does not satisfy it.
So, you say look I have to do something because I am forced to satisfy
the constraint. So, you will say let me lose a little bit in terms of an
objective function perspective and then see whether I can meet the
constraint.
So, when we say I want to lose a little bit basically you know as we
mentioned before these are contours where the objective function value
increases and those are actually not good from a minimization
viewpoint. So, while I am here since the constraint is not satisfied, I am
willing to lose a little to see whether I can satisfy the constraint and
maybe I go to a point here and then this is a constant objective function
339
contour point. So, if I am willing to give up something, that basically means I am going and sitting on different points on these contours, and as I am pushed further and further away from this minimum point I am losing more and more in terms of the objective function value; that is, I am increasing the objective function value.
Now, logically, if you keep extending this argument you will see the following. Let us say this is the first point I moved to here, which is basically worse than this, because you see this is a contour which is going to be outside of this contour; so I moved here, I made my objective function worse, but I still am not satisfying the constraint. So, I give up some more and I come to this point, and I see a contour, and this point is worse than that one because this contour is outside that contour. And if I extend this argument, let us say I keep making things worse, and the only reason I am making these worse is because I am forced to satisfy this equality constraint.
So, I come up to, let us say, here, and this is still a contour which is much worse than my original solution, but it is still not enough for me to satisfy my constraint. So, if you keep repeating this process, you are going to find a contour here where I touch this line for the first time. The point when I touch this line for the first time is the point at which, for the first time, when I give up on my objective function value, I am also able to satisfy the constraint. Now, once I find a contour like that which touches this line, there is no incentive for me to go any further, because going further would mean making my objective function worse.
So, when I just touched that line that is the best compromise
because that is where I become feasible for the first time and going any
further would only make my objective function worse. So,
geometrically what this would mean is I keep making my objective
function worse till a contour just touches this line. So, at that point this line will become a tangent to that contour, and remember what that contour is: it is f(x) = k for some k.
So, that is the reason why we get these conditions for optimization
with equality constraints the key point that I want you to remember is
that as you have more and more equality constraints you will have to
340
introduce more and more parameters λ1 λ2 and so on. However, there
will always be enough equations and variables.
So, now, let us see what these equations become for this example; you will see this easily if you put this in a bracket. The first equation is -∂f/∂x1 = λ ∂h/∂x1, which gives -4x1 = 3λ; similarly, the second equation turns out to be -8x2 = 2λ; and the third equation is basically the same as the constraint equation that we have here. And as we mentioned before, without the constraint there would have been 2 variables and you would have got 2 equations, which would have been ∇f = 0.
341
But with an equality constraint we have added a new parameter λ. So, we need 3 equations in x1, x2 and λ; we do have these 3 equations here, and when you solve these 3 equations you will get this solution, and this is your optimum solution in the constrained case, which is different from the optimum solution in the unconstrained case, which would have been (0, 0). So, in other words, we have given up on the value of the objective function.
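As a quick sketch (not the lecture's own code), the three conditions for this example, -4x1 = 3λ, -8x2 = 2λ and 3x1 + 2x2 = 12, can be written as one linear system and solved in R; the resulting values are computed here from those equations and are only implied, not printed, in the transcript.

M <- rbind(c(4, 0, 3),   # 4*x1          + 3*lambda = 0   (i.e. -4*x1 = 3*lambda)
           c(0, 8, 2),   #        8*x2   + 2*lambda = 0   (i.e. -8*x2 = 2*lambda)
           c(3, 2, 0))   # 3*x1 + 2*x2               = 12  (the constraint)
rhs <- c(0, 0, 12)
sol <- solve(M, rhs)
x_star <- sol[1:2]       # roughly (3.27, 1.09)
lambda <- sol[3]         # roughly -4.36
f_star <- 2*x_star[1]^2 + 4*x_star[2]^2   # about 26.2, versus 0 for the unconstrained minimum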
Thank you and I look forward to seeing you again in the next lecture.
342
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 28
Multivariate Optimization with Inequality Constraints
So, if you remember that example, I said there is a group of people who might like a certain type of restaurant, and there might be a group of people who do not like that kind of restaurant, and so on. Then, if we were to build a classifier, which probably is a line like this, then, as we are
343
trying to solve an optimization problem to identify a classifier like this, we have to impose the constraint that all these data points will have to be on one side of the line and all of those data points will have to be on the other side of the line. And from our lecture on half spaces and hyperplanes in linear algebra, we know that if this equation is something like this, which is a linear equation, and the normal is in this direction, then on this side, if I substitute the value of a point into the equation, it is greater than or equal to 0, and on the other side it is less than or equal to 0, with 0 being on the line itself.
So, now, notice that for each point, if we were to write the condition in terms of the equation of the line, you would see that these become inequality constraints. So, there may be as many inequality constraints as there are points, and so on.
Now, let us go back to the same example that we had before, where
we had the equality constraint and then we tried to solve the
optimization problem and then just make that equality into an
inequality. So, in this case let us assume that what was equal to 12 in the last lecture has now become less than or equal to 12, and let us understand intuitively what happens to problems of this type. Now, in the
previous case we said when we have an equality constraint we said we
are interested in any point on this line as a candidate solution these are
all called the feasible points and of all of these points we were trying to
pick the point which will give me the minimum objective function
value.
So, the difference between the equality constraint and the inequality constraint is the following: in the equality case we had every point on this line being a feasible solution. When you make this an
344
inequality constraint, then what happens is, if you think of this line as extended all the way, any point in this half space now becomes a feasible solution, and because every point in this half space is a feasible solution, the unconstrained minimum also becomes a feasible point, so the constrained minimum and the unconstrained minimum are the same. This is an interesting thing to see.
Now, let us try and see what happens if I flip the sign and say this is greater than or equal to 12. Then what would happen is, if you were to extend this line all the way, the feasible region is to this side. Now, ask the question: where does the optimum lie? You will notice that, again, we know the best solution is here, and as we move away from this solution we are losing out on the objective function value, and as before we know any point to this side of the line is a feasible point, and any point on the line itself is also feasible because of the equality part of the sign.
Now, making the same arguments that we made in the case of the equality constraint problem, we will see that I give up on my optimality; that is, I keep going through contours of larger and larger size, where the objective value keeps increasing, and when a particular contour touches this line exactly at one point, then I have a feasible point which is going to satisfy this constraint, the equality part
345
of the constraint. So, it satisfies the general constraint, and that is the worst I have lost in terms of how much my objective function has increased its value by.
346
greater than or equal to form, then what I do is multiply by a negative sign, and then I have the condition in the less than or equal to 0 form.
So, now, I call this my constraint so I can again put it in this form.
Now this becomes a little more complicated formulation, and I am going to show you the conditions for this in the next slide, which will look a little complicated in terms of all the math that is there. What I
am going to do is I am just going to simply read out the conditions in
the next slide and then we will take a particular example and then
demonstrate how the conditions work.
So, the difference between the conditions in the equality and inequality cases
347
is that for every one of these inequality constraints you add another term of the form μ times ∇g to the linear combination. So, look at this: this is the same form as that, except that I used λ there and μ here, and I take a ∇ of h and a ∇ of g. That is the first set of conditions; then, much like how we had the equality constraints also as part of the conditions in the previous case, I am going to have the optimum solution satisfy all of these constraints.
Now, just keep in mind that if you are seeing a course on optimization for the first time, it is not very easy or natural to understand these conditions right away. However, what we are going to do is, in the next
slide we will take an example and then show you how these things
work. One thing that I want you to keep in mind is if we had let us say
an unconstrained optimization problem objective function in n
variables, I always look at whether the optimum conditions have
enough equations and variables for me to be able to solve the system of
equations and clearly you know in the unconstrained case you have n
equations and n variables and I clearly made the point in the equality
constraint case that for every equality constraint you add an extra
parameter.
348
There is a specific problem in solving these conditions, which are called the KKT conditions (the conditions that I showed in the previous slide are the KKT conditions). It is not easy to solve the KKT conditions directly in the inequality case because of the complementary slackness condition, which says that either μ could be 0 or the corresponding constraint term could be 0. So, we have to make a choice as to which is 0, and that makes this a combinatorial problem.
349
Let us take a look at a numerical example to bring together all the
ideas that we have described till now. In this particular case it is a
multivariate optimization problem which is actually called quadratic
programming and this is called quadratic programming because the
objective function is quadratic and the constraints are linear. Those
types of problems are called quadratic programming problems.
I think this is the same objective that we have been using till now in
the several examples. So, let us say this is the objective function and let
us assume that we have constraints of the form shown here. Let us
assume the first constraint is 3x1 + 2x2 ≤ 12, the second constraint is 2x1 + 5x2 ≥ 10, and the third constraint is x1 ≤ 1. Now, I am not doing anything more to this problem, but nonetheless I just want you to remember that, to be consistent with whatever we have been saying, the greater than or equal to constraint should be converted to a less than or equal to constraint; we will see how that happens in the next slide.
Now let us look at this pictorially. Remember that the value of the
objective function is going to be plotted in the z direction coming out
of the plane of the screen that you are seeing. So, the representation of
that are these contours that we see here, which are constant objective
function contours. So, I am just trying to see and explain how this
picture speaks to both the objective function and the constraints. So,
the objective function is actually represented in this picture as this
constant value contours. So, if I am moving on one of these contours, the objective function value is the same; we have repeated this several times.
So, if you put all of these regions together the only region which is
feasible is shaded in brown colour here. So, if you take a point any
point in this region it will be satisfying this constraint because it is to
this side of this line, it will satisfy the second constraint because it is to
this side of the line, and it will satisfy the third constraint because it is to
this side of the line. Notice that if you take any point anywhere else
you will not be feasible. For example, a point here would violate
constraint 1, but it would be feasible from constraint 2 and 3 viewpoint
nonetheless all the constraints have to be satisfied. Now similarly if
you take a point here while it satisfies constraint 1 and 3 it will violate
constraint 2.
350
Now, if you notice the optimum point is going to lie here and we are
going to try and find out this value through the conditions that we
described in the last few slides.
Then what you do is, you differentiate this with respect to x1 and x2. You will get the first 2 conditions that we have been talking about for the constrained case with equality constraints, the unconstrained case, and so on, that is, the 2 equations for the 2 decision variables. So, when you differentiate this whole expression with respect to x1, 4x1 comes out of this term right here, and this is only a function of x2, so its derivative with respect to x1 will be 0; and I will have 3μ1x1, which when differentiated with respect to x1 gives 3μ1; from the second term I will get -2μ2, and from the third term I will get μ3. So, the whole thing equals 0. So
351
basically this equation you have. Then I differentiate the same expression with respect to x2. So, I get 8x2 from this here, from the second term I get 2μ1, I get -5μ2 from this term, and from this term I get nothing because it is only a function of x1. So, I set this equal to 0. So, this basically gives you the condition that we have talked about, which is of the form: in the equality constraint case, remember, I said you have -∇f = Σ λi ∇hi, and in the inequality case you also add + Σ μi ∇gi, if the gi are the inequality constraints.
So, you kind of back out these 2 equations from this condition, and you have 2 equations here because ∇f is of size 2 by 1 and each ∇g is 2 by 1, since there are 2 decision variables, while the μ's are scalars and this is a linear weighted sum. So, if you notice all of this, you see that I have 2 equations. However, I have 5 variables that I need to compute: I need to compute a value for x1, a value for x2, and then I need to compute μ1, μ2 and μ3. So, let us see how we do that.
We go back and add the complementary slackness conditions. Also
keep in mind that other than this we also have to make sure that
whatever solution we get still has to satisfy the 2 inequality constraints
also.
352
Then I could substitute those values back into this equation, and since I have 2 equations I can calculate x1 and x2. So, this is one option. In one of the previous slides I had mentioned that this becomes a combinatorial problem, because we could also assume that this term goes to 0. So, suppose this term goes to 0, and let us assume these other two are actually nonzero; let us see what happens in that case, and whether we have enough equations and variables.
So, in this case we will have equation 1, equation 2, the third equation being μ2 = 0 because we have assumed that bracket is not 0, and the fourth equation being μ3 = 0 because we have assumed that bracket is not 0. Now, the fifth equation becomes the one which we have assumed, namely that the term inside this bracket is 0. So, in that case again I have equations 1, 2, 3 and then μ2 = 0 and μ3 = 0, giving 5 equations in 5 variables. You could, for example, assume instead that this bracket and this bracket are 0, in which case I have to compute μ1 and μ2, and then, if this last bracket is not 0, μ3 is 0; in that case also you will have equation 1, equation 2, the two bracket terms set to 0 as two more equations, and μ3 = 0 as the fifth equation. So, again I have 5 equations and 5 variables.
Now, I am going to use one notation here so that we can understand the table in the next slide. Whenever I assume that an inequality constraint is exactly satisfied, that is, when I say 3x1 + 2x2 - 12 = 0, then we say this constraint is active; it is active because the point is already on the constraint. If I take a point here, it is not on the line, so the constraint value is strictly less than 0, and I will say it is inactive. So, for every constraint I can say whether it is active or inactive. If the first constraint is active, that means 3x1 + 2x2 - 12 = 0; if the second constraint is active, that means 2x1 + 5x2 - 10 = 0; and if the third constraint is active, that means x1 = 1.
353
So, in the example that we are considering right now, there are 3 inequality constraints, and as I mentioned in the previous slide, if an inequality constraint is exactly satisfied, that is, it becomes an equality, then we call it an active constraint, and if the constraint is not exactly satisfied we call it an inactive constraint. And now we have 3 inequality constraints, and each of these constraints could be either active or inactive.
So, there are 2 possibilities for each of these constraints, and since we have 3 constraints there are 2 to the power 3 possibilities, which equals the 8 possibilities that we have here. So, what I am going to do in this case is enumerate all the possibilities, for you to get a good understanding of how this approach works when you have inequality constraints. So, let me pick, let us say, a couple of rows from this table to explain the ideas behind how this works, and in the next slide we are actually going to see graphically what each of these cases means.
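A minimal sketch of this enumeration in R (written for this specific example, with the tolerance values and variable names chosen here as assumptions, not taken from the course material): for each guess of which constraints are active, solve the resulting square linear system, then keep only solutions that are feasible and have non-negative μ's.

H <- diag(c(4, 8))            # since grad f = (4*x1, 8*x2) for f = 2*x1^2 + 4*x2^2
A <- rbind(c( 3,  2),         # 3*x1 + 2*x2 - 12 <= 0
           c(-2, -5),         # -(2*x1 + 5*x2 - 10) <= 0
           c( 1,  0))         # x1 - 1 <= 0
b <- c(12, -10, 1)
m <- nrow(A)
for (k in 0:(2^m - 1)) {
  active <- which(bitwAnd(k, 2^(0:(m - 1))) > 0)   # which constraints this guess makes active
  Aa <- A[active, , drop = FALSE]
  n_act <- length(active)
  # Stationarity: grad f + t(Aa) %*% mu = 0; active constraints: Aa %*% x = b[active]
  K <- rbind(cbind(H, t(Aa)),
             cbind(Aa, matrix(0, n_act, n_act)))
  rhs <- c(0, 0, b[active])
  sol <- tryCatch(solve(K, rhs), error = function(e) NULL)  # singular system => no valid solution
  if (is.null(sol)) next
  x  <- sol[1:2]
  mu <- rep(0, m); mu[active] <- sol[-(1:2)]
  if (all(A %*% x - b <= 1e-8) && all(mu >= -1e-8)) {       # feasible and all mu's non-negative
    cat("active:", paste(active, collapse = ","), " x* =", round(x, 3), " mu =", round(mu, 3), "\n")
  }
}
# Only the guess with constraints 2 and 3 active survives: x* = (1, 1.6), mu = (0, 2.56, 1.12).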
So, let us look at the first row, for example. Here the choice we have made is that all 3 inequality constraints are active; that means they all become equality constraints. Notice something interesting: remember there are 2 decision variables. So, each inequality constraint is basically representing one half space of a line, and when they become active, each of these constraints becomes an equality constraint, that is, each of them becomes a line.
354
all the 3 equations. So, in this case you cannot solve this problem; it is infeasible because, though I have enough equations, a subset of them, 3 equations in 2 variables, cannot all be satisfied, and I cannot find a valid solution. We will understand what this means geometrically in the next slide.
slide.
Let us look at some other condition here. Let us pick for example,
row 5. So, if you look at row 5, we have made the choice that the first constraint is active and the second and third constraints are inactive; that basically means the first constraint equals 0, while the second and third constraints have to be less than or equal to 0, which needs to be tested after we go through the solution process. Now, much like how I described before, in this case also we will be able to find 5 equations in 5 variables, and we can solve for x1 and x2, which is shown here, and we have solved for the μ's, which are shown here.
Now, if you just look at this solution, you will say it seems to satisfy everything, because I have a solution for x1 and x2 and I have μ1, μ2, μ3, where μ1 is -4.36 and μ2 and μ3 are 0. However, when you look at these μ's you will see that one of them is negative, which basically means that this cannot be an optimum point, based on the conditions we showed a couple of slides back. Not only that, on top of it, when you actually put these 2 values into the constraint x1 ≤ 1, it is not satisfied, because x1 is 3.27.
So, if you take this row, for example, both the condition that the μ's be positive is not satisfied and this constraint is also not satisfied. An interesting thing to note here is that we have to go back and check only the constraints that we have assumed to be inactive, because an active constraint is already at 0, so whatever solution we get will automatically satisfy it. Take row 6, for example: here we have made the choice that the first constraint is inactive, the second constraint is active and the third constraint is inactive.
355
When you look at row 4, where we have assumed constraint 1 is inactive and constraints 2 and 3 are active, I get the solution x1 = 1, x2 = 1.6. I get all the μ's to be positive, which is also one of the conditions for an optimum point, and then I only have to verify the first constraint here, because the other 2 are active constraints and they are providing equations for us to solve, so they will be satisfied exactly. Now, when you put these 2 values into the first constraint, you will check that it actually satisfies the inequality also. So, this is a case where all the constraints and the KKT conditions are satisfied. So, this is the optimum point; the optimum solution is (1, 1.6), which is what we had indicated in the slide with the geometry of this problem.
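If the quadprog package happens to be available, the same answer can be cross-checked directly (this is only a sanity check, not part of the course material); solve.QP minimises 1/2 x'Dx - d'x subject to A'x ≥ b, so each constraint is rewritten in the ≥ form it expects.

library(quadprog)
D <- diag(c(4, 8))            # 1/2 x' D x reproduces 2*x1^2 + 4*x2^2
d <- c(0, 0)
A <- cbind(c(-3, -2),         # -(3*x1 + 2*x2) >= -12   <=>  3*x1 + 2*x2 <= 12
           c( 2,  5),         #   2*x1 + 5*x2  >= 10
           c(-1,  0))         #  -x1           >= -1    <=>  x1 <= 1
b <- c(-12, 10, -1)
solve.QP(D, d, A, b)$solution # approximately (1, 1.6), matching row 4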
So, let us look at each of these cases in terms of where the optimum lies and how we interpret it. If you take case one, where we took all the constraints to be active, we said we have 2 variables and 3 equations, and that is not satisfiable in general. Geometrically, what this means is we want the point to lie on all 3 lines at the same time, because we have assumed all 3 constraints are active; that means all of them have to be equal to 0.
So, you find that you cannot get a point where all 3 line equations are satisfied. If I want to satisfy 2 of them, I will get a point here, but nonetheless it would not satisfy the other constraint, and so on. So, this is a case where the 3 lines do not all intersect at a common point, so we do not have a solution here.
Similarly, you can look at each of the other cases and see what happens in each of them, and you will notice that only in case 4 does the solution you get from the Kuhn-Tucker conditions also turn out to be the actual optimum.
356
In every other case you will see that there is some problem or the other. So, if you go back, we looked at cases 5 and 6; the solution for case 5 is this, the solution for case 6 is this, and we said these two are not good solutions because they violate the condition x1 ≤ 1, which is basically seen here, because the feasible region is to this side of the line x1 = 1, but this point is violating this constraint, and here again you see that this point is violating this constraint.
So, you have to be careful about the conditions when you look at those, but if you stick to this way of writing the equations, where you write -∇f = Σ λi ∇hi + Σ μi ∇gi, and if you write all the inequality constraints in the less than or equal to form, then the μ's have to be positive for the point to be an optimum point, which is what we saw here.
357
of giving the foundations for understanding other data science algorithms, outside of this course, that you might go and study.
I have also described the key ideas behind how to solve constrained
optimization problems when you have equality and inequality
constraints. Keep in mind that while I have shown you the conditions, I
have not shown a proof or a derivation of these conditions in a formal
manner. The equality constraint case I appealed to intuition to tell you
why the conditions turn out to be the way they are. However, all of this
can be formally proved and you can derive these conditions based on
formal mathematical arguments.
So, I hope to see you continue these lectures and understand more
of data science.
Thank you.
358
Data Science for Engineers
Prof. Ragunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 29
Introduction to Data Science
So, let me start with this laundry list of techniques that people
usually see when they look at any curriculum for data science or any
website which talks about data science or many books that talk about
data science. I have just done some colour coding in terms of the
techniques that you will see in this course in green.
359
And other techniques are out there which we will not be teaching in
this course, but which would be a part of more advanced course. So,
there are techniques such as regression analysis, K-nearest neighbours, K-means clustering, logistic regression, and principal component analysis, all of which you will see in this course; then people talk about predictive modelling, under which there are techniques such as Lasso and Elastic net that you can learn.
360
Quadratic discriminant analysis, Naive Bayes classifier, Hierarchical
clustering and many more such as deep networks and so on. So, to get
a general idea of data science, one might be tempted to ask: if all of these collections of techniques solve data science problems, then what types of problems are really being solved? And once one understands the types of problems that are being solved, the next logical question would be: do you need so many techniques for the types of problems that you are trying to solve?
So, these would be typical questions that one might be interested in answering. What I am going to do is give you my view of the types of problems that are being solved and why there are so many techniques to solve these types of problems. Since this is a first course on data science for engineers, we are going to cover the major categories of problems that are of most interest to engineers. This is not to say that other categories of problems do not exist or that they are not interesting.
361
So, let us look at what classification problems relate to. These are types of problems where you have data which are in general labeled, and I will explain what a label means, and whenever you get new data you want to assign a label to that data.
So, now the data science problem is the following: if I give you a new data point, let us say x1*, x2*, all the way up to xn*, the algorithm should be able to classify it and say whether this point is likely to have come from class 1 or from class 2. So, assigning a label to this new data, in terms of the likelihood of this data having come from either class 1 or class 2, is the classification problem. Let us say you assign the likelihood of this coming from class 1 as 0.9 and from class 2 as 0.1; then one would make the judgment that this data point is likely to belong to class 1.
Now, let us see how this is useful in a real problem. I will give you 2 examples; one example is something that people talk about all the time nowadays, which is called fraud detection. Let us take one particular case of fraud detection: whenever we go and use our credit card, we buy something and the credit card gets charged. Let us say there are certain characteristics of every transaction that you record, such as the amount, the time of the day the transaction is made, the place from which the transaction is made, the type of product that is bought through the transaction, and so on. You can think of many such attributes, and let us say those are the attributes that characterize every single transaction.
Let us assume that there are many people making transactions, you have transactions listed like this, and you find out that, of these, some were actually fraudulent transactions, that is, transactions that were not legal or were not made by the person who owns the credit card, and the rest are transactions which are legal. So, this is something that you label based on exploring each transaction which you think might not be right; when you actually find out that a transaction was not legal, you put it into the basket of illegal transactions.
362
Now, suppose you use a data science algorithm, a binary classification algorithm, to give the likelihood of a transaction being correct or fraudulent based on these easily calculable, monitorable or measurable attributes. Then, whenever a new transaction takes place, you could run it through this classifier and find the likelihood of this transaction being fraudulent.
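As a hedged sketch of this idea (synthetic data and made-up attribute names, not the course's worked example), one of the techniques listed earlier, logistic regression, can be fit with R's glm() and used to score a new transaction:

set.seed(1)
n <- 200
transactions <- data.frame(
  amount = c(rnorm(n, mean = 50, sd = 20), rnorm(n, mean = 400, sd = 150)),  # transaction amount
  hour   = c(rnorm(n, mean = 14, sd = 3),  rnorm(n, mean = 3,   sd = 2)),    # time of day
  fraud  = rep(c(0, 1), each = n))                                           # known labels
model <- glm(fraud ~ amount + hour, data = transactions, family = binomial)
new_txn <- data.frame(amount = 350, hour = 2)
predict(model, new_txn, type = "response")   # estimated likelihood that the new transaction is fraud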
Now, when we talked about this data and the binary classification problem, we talked about just 2 classes, but in reality there could be problems where there are multiple classes. One very good engineering example would be fault diagnosis or prediction of failures, where you might have, let us say, a certain equipment, a pump or a compressor or a distillation column, whatever the equipment might be, and the working of that equipment is characterized by several attributes: how much power it draws, how much performance it gives, whether there is vibration, whether there is noise, what the temperature is, and so on.
363
So, now, you could have engineering data x which, let us say, describes the characteristics of a pump, where the operation of the pump is characterized by several attributes x1 to xn. And if you have legacy or historical data, where you have been operating pumps for years and years, then you know that if these variables take values in this block, everything is fine with the pump.
So, I write n for normal. Then you could have a block of data that was recorded whenever there was a particular type of fault in the pump; let me call this fault f1. You could have another block of data seen when there is fault f2, and so on. We will just stick to two faults, f1 and f2, and assume these are the only two failure modes that are possible. Now you start operating the pump, and at some point you get this data and ask the following question: based on this data, would it be possible for me to say whether the pump is operating normally, or whether failure mode 1 is the current situation of the pump, or failure mode 2?
So, in this case you see that there are 3 classes: n, f1 and f2. This is what is called a multi-class problem. Again, when new data comes in we want to label it as either normal, f1 or f2. If it is normal, you do not do anything. If it is f1 and it is very severe, then you stop the pump and fix it; if it is not very severe, you let the maintenance team know that this pump is going to come up for maintenance, and in the next shutdown of the plant this pump needs to be maintained. So, that is how classification problems are very important in an engineering context.
you are going to use to classify is of this form. So, you see the difference between the two: one is non-linear and the other is linear. We could then easily extend the concepts that we have learnt in terms of half spaces and so on to do classification for these types of problems using non-linear decision boundaries.
So, you would say that if the points are to one side it is one class and if they are to the other side it is another class. One has to define the equivalent ideas for non-linear decision boundaries, equivalent to the linear case, very carefully, and the minute you move from linear to non-linear there are a host of other questions that come about. These questions are really related to what type of non-linear function one should use.
form, and once I choose a functional form, how do I also identify the parameters in that functional form? A simple example: if it is a linear functional form, then I write y = a₀x + b₀, say. In this case the functional form is linear and the parameters are a₀ and b₀. If you assume a quadratic functional form, then you could write y = a₀x² + a₁x + a₂. In this case the functional form is quadratic and there are three parameters, a₀, a₁ and a₂. So, when you do this function approximation you will have to figure out both the function and the parameters.
In classification problems you want to come up, in the linear case, with a line or a hyperplane such that the points of the two classes are as far away from it as possible. In the function approximation case, what you want to do is find a line or a hyperplane such that the points are clustered around it; this is the linear problem, which is what we are going to see in this course as linear regression.
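As an illustrative sketch (not part of the lecture), both functional forms can be fitted in R with lm(), assuming numeric vectors x and y are available:

# Linear functional form: y = a0*x + b0 (two parameters).
linear_fit <- lm(y ~ x)

# Quadratic functional form: y = a0*x^2 + a1*x + a2 (three parameters).
quadratic_fit <- lm(y ~ I(x^2) + x)

coef(linear_fit)      # estimated parameters of the linear form
coef(quadratic_fit)   # estimated parameters of the quadratic form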
Now the non-linear version of the problem, similar to the picture on the top, is shown here. Here you want a non-linear surface or a curve that goes through these points, with the points clustered around that curve. So, in summary, there are really only two types of problems that we predominantly solve from an engineering viewpoint using data science: classification problems and function approximation problems.
So, if there are only two types of problems that we are really solving, then one might ask why there are so many techniques for solving them. One standard question that comes up whenever someone does data science is whether a particular technique is better than another technique: the proponent of one technique will say theirs is the greatest technique, the proponent of the other technique will say theirs is the greatest, and this debate keeps going on and so on.
So, we have many objects on the table that is shown in the slide. If I asked you how many articles there are on the table, you would quickly say: well, there is a camera, there is a cup, there are two mobile phones, there is a watch, there is a pen, a bottle and so on. Basically, we can count or see whatever there is to see, enumerate it, and say these are the objects or articles on the table. In some sense we can count all that there is to see. At this point you will say this is all that is on the table; then I ask you the question, is that really all that is there on the table?
Not to take this too literally, but this illustrates the key idea that I want to use when we go back to answering the question as to why there are so many techniques for data science. So, carrying on, I will ask you: is that all there is on the table? What about things that we cannot really see? On the table there might be millions of teeming microorganisms which are not visible to the naked eye.
So, let us say there are 4 possible microorganisms and there are four fluorescent chemicals, one for each. The assumption here is that when someone comes up with a chemical like this, they have tested it and shown, repeatedly, that it works for that particular microorganism. So, we cannot really go back and question whether a chemical is good for its microorganism, because that has been demonstrated reasonably well.
So, if there are these 4 microorganisms, what would you do? You have to make an assumption as to what exists on the table. Let us say you make the assumption that microorganism 1 is on the table. You pick up fluorescent chemical 1 and spray it. If you see fluorescence, you would come to the conclusion: yes, my assumption is right, this microorganism is also on the table.
Now, the interesting thing is that if it does not fluoresce, the conclusion is not that the chemical is bad, because that has been provably shown to work for this particular case; you would only conclude that the assumption you made is not right. So, you go on to the next assumption, which would be that microorganism 2 is on the table, and then you use fluorescent chemical 2, and so on.
So, you do this exercise, and let us say that when you used chemicals 2 and 4 they fluoresced, while 1 and 3 did not. At the end of that exercise you could go back and say that the articles on the table are the camera, the cell phone and so on, and also microorganisms 2 and 4. Notice how you have been able to see what you cannot visually see using this assumption-validation cycle. This is an important thing to understand, and I am going to connect it to the techniques in data science in the coming slides.
There are tons of attributes that one could actually measure and monitor, and you really want to see how many of these attributes are going to contribute to the problem that we are trying to solve, or how many of these attributes are really important. So, with big data we are going from two dimensions to many, many dimensions, and the question then is: how do I understand the organization of the data in multiple dimensions, where I cannot see the data because I cannot see beyond 3D?
So, you start with multidimensional data, you make these assumptions, and then what you do is the following.
So, this is equivalent to picking the chemical that has been shown to make a certain type of organism fluoresce. You choose the technique and deploy it, and if the answer makes sense (we will see what "makes sense" means mathematically from a data science viewpoint), then the data is likely to be organized in conformity with the assumptions that you have made.
It is important to look at the key words we are using: the data is "likely" to be organized, and the assumptions matter. "Likely" means we will have to use some metric, and different people will use different metrics, and different levels of satisfaction of that metric, to be convinced that what they have is right or wrong. That is where subjectivity comes in; but if the answers make sense, then we will say the data is likely to be organized in conformity with the assumptions.
Now, consider the previous iteration, where you used some assumptions and saw that they were violated, that is, it was not likely that those assumptions were the ones valid for this problem. Even though you failed in that attempt, you still got something out of it which helps you modify the assumptions. So, this assumption modification process can be carried out with more knowledge gained from the failed attempts before.
Now you continue with this process till the answers are satisfactory, and notice how, in this process, you are seeing the invisible: you are able to see data in n dimensions. For example, you cannot plot a hundred variables and then see whether they are linearly separable or not. But if you use a linear classifier and it worked very well, then you know that the data is likely to be organized in such a way that a hyperplane can separate it into two groups. So, you have started seeing the invisible, much like in the thought experiment we did with the table.
Now, this question of "likely" and "makes sense" is very important. How do I test whether the results that I have are good enough or not? That is done using test data in many of these data analytic techniques. So, test data is very important when we do this exercise; as we teach different techniques you will see how this is important, and we will explain it in greater detail.
So, since there are so many assumptions, there are many, many combinations of assumptions you can make, and there are many techniques which are fine-tuned and developed particularly to solve problems where the data is organized according to the assumptions used in the technique. That is the reason why there are so many techniques. So, in some sense, when you look at all of these techniques it is not as important or as interesting to compare them blindly in terms of one being better than the other and so on.
So, in this lecture, the first introductory lecture on data science, I wanted to address right away the question of the types of problems that we solve. In summary, most of the problems that you solve in data science can be categorized as either classification problems or function approximation problems; that is one take-home message from this lecture. The other message is that there are several techniques for solving these data science problems, and we wanted to understand why there are so many techniques.
So, I will see you again in the next lecture on the use of a
framework for solving data science problems.
Thank you.
Data Science for Engineers
Prof. Raghunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 30
Solving Data Analysis Problems – A Guided Thought Process
So, you could start by talking about a class of problems which could be related either to performance or to improving operations, doing things on time, and so on. Typically you start with a loose set of words, a vague definition of a problem, and the data that you have. The question then is to drive your thought process towards something that is codable, something with which you can process the data to derive value for the problem that you are solving.
I just want to mention that this is not set in stone. We are trying to
give structure to your thought process when you solve data analysis
problems. So, to do this we are actually going to take a very simple
example and then illustrate how you should think about solving data
science problems and at the end of it we will come up with a flow chart
that might be useful. So, the problem that we are going to deal with is
what is called the data imputation problem.
would learn in this course, and then use that in this guided thought process to solve problems that are of interest.
So, let me read the problem. It says: readings from five sensors, x1 to x5, are made available to you for 100 different tests (on the website we also have this file of data, which you can use to go through this process on your own); the readings are not arranged according to any order. Basically, what this means is that you cannot infer any time sequence, or any other sequence, from these data samples.
Now, if all the data were available, you could process this data for other activities. However, in this case there are some records in which data is missing; these are marked as not available (NA). Let us assume your supervisor has asked you whether you can come up with any ideas that can be employed to rationally fill in the missing values. So, the question is: can you develop a data analytic approach to answer this question? This is a very typical question that we deal with in real problems, and there are many solutions depending on the situation one looks at. Here we give one example to illustrate the idea of the framework.
So, I am going to call the first step in any data analysis problem solving as defining the problem. You define the problem in as broad a way as possible, so that it is understandable to everyone. This part of problem definition, and the second step, which I want to call problem characterization, are very important steps; in fact, if these are done properly, then uncovering the solution process becomes easier.
So, in this case the problem definition is actually very simple: fill in the missing data records. I have some data which is missing; I simply need to fill in these missing data records. Now we drill down a little more and go to step 2, which asks what type of problem this is. In this simple case it will turn out to be one of the basic problem types that I am going to talk about, but in more complicated cases it might be a combination of these basic problem types. Here, at least for this example, the problem characterization goes like this: given part of the information, which is the data that is available, fill in the missing information.
So, one way to think about this is I need to get some knowledge
about the missing information from somewhere and the only place that
I can get this information is really from whatever I know about this
data from whatever I have with me currently. So, I might say the idea
is really to somehow relate the missing information or missing data to
the known information or known data. So, that seems like a logical
way to solve this problem.
The minute we get to this point, we understand that this is a function approximation problem, where I am going to write an equation in which I have the unknown x, and if somehow I can relate it to the known data, then whenever I have something missing I can simply put it into this function and solve for my variable. So, at the highest level it proceeds in a rather general fashion, but you will see that you have to add more and more ideas into solving these problems as we go along.
So, there are many types of assumptions that you can make. Since this is the first course on data analysis, and this is the first time we are introducing this framework, we are going to keep things very simple. Let us take an assumption which is very commonly made in solving these types of problems: we might simply say these variables are independent of each other; that means the value a particular variable takes does not really affect the values of the other variables.
So, suppose these are all known data, this entry is missing, and I want to fill in this missing value. We take the set of data where nothing is missing in this column and basically say the best way to fill the gap is to take an average of these two values and put it here. In other words, we are going to fill in some most likely value; we will see shortly how we define the most likely value. But what we are essentially saying is: when I want to fill this entry, all I am going to do is look at the values in this column only, and I am not going to look at values in the other columns, because I have assumed that these variables are not related to each other.
Now, in this case, this assumption can quite easily be verified right at this point, from the statistics part of the lectures. You would have seen a quantity called the correlation coefficient, a quantity that measures the correlation between variables. So, you could calculate it and find out whether these variables are correlated or not; if they are not correlated, then this assumption is fine, but if they are correlated, then the assumption that no relation exists between the variables is not a good one to make. So, I want to emphasize that some of the assumptions we make can be verified right at this stage, based on known statistical ideas and other ideas from linear algebra and so on.
So, suppose you verify this assumption and find that there is indeed no relationship; then, to fill in the most likely value, you have to define what the most likely value means. There are many ways of defining it: the simplest, as I said, is to take the average, or you could take the median value, or the mode, and so on. So, there are many ways of defining what the most likely value is.
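A minimal sketch in R of this first attempt, under the independence assumption; the data frame name 'sensor_data' is hypothetical (all columns numeric, missing entries coded as NA):

# Step 1: check the independence assumption using pairwise correlations
# computed from the complete pairs of observations.
cor(sensor_data, use = "pairwise.complete.obs")

# Step 2: if the variables look uncorrelated, replace each missing entry by a
# "most likely value" from its own column; the column mean is used here, but
# the median or mode could be used instead.
filled <- sensor_data
for (j in seq_along(filled)) {
  col_mean <- mean(filled[[j]], na.rm = TRUE)
  filled[[j]][is.na(filled[[j]])] <- col_mean
}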
So, let us assume in this case that the assumption is not satisfied; then you go back to the drawing board and say, I have to change my assumption. This is where what I explained in the last lecture is important: if this assumption is not satisfied, it does not mean there is any problem with the correlation calculation that you have done. That is a good calculation to do anyway. The only problem is that the assumption that the variables are not related to each other is not a good assumption to make. So, we go and modify that assumption.
So, we said we have samples in rows and variables in columns, and we said there are a lot more samples than variables, which is the case here because we have 100 samples, out of which we have picked the samples that are complete in all respects. Let us assume there are a certain number of complete samples; we have only 5 variables, so the number of records where the information is complete is going to be a lot more than the 5 variables.
So, basically m is greater than n, and if you want to come back and solve this problem using things that we have learned before, you would quickly realize that we can use the notions of rank and null space. Why is it that we can use rank and null space to solve the problem? We said that if you want to identify how many of these variables are actually independent, the quantity you should compute is the rank of the matrix, which is what we saw in the linear algebra class.
So, if it turns out that the rank of this m by n matrix, that is, the number of complete records by 5, because we have 5 variables in this problem, is 2, then we automatically know that there are 3 relationships among the variables. And we also know how to identify these relationships: we can use the notion of null space to identify them.
Then if you want to fill this data what you do is basically take these
two known values and substitute these two values into the 3 equations.
So, those are now known so these 3 equations will go from 5
unknowns to 3 unknowns. So, you have now 3 equations in 3 variables
then you can basically solve this problem and fill in this data. If for
example, there are 4 variables that are missing in a record then that is
one case that we have discussed already in the linear algebra
framework. We have let us say 4 variables missing then the 1 variable
that we know the value for, we can substitute into these 3 equations
and then we will end up with 3 equations in 4 unknowns.
So, this is a case where I have a lot more variables than equations, so there are an infinite number of solutions. One possible approach is to use the pseudo-inverse to find a solution to this problem. Similarly, if I have, for example, 2 variables missing but 3 available, then when I put these 3 values into the 3 equations I will end up with a system of 3 equations in 2 variables.
Now, if the equations are all perfect, you could pick any 2 equations and simply solve for the 2 variables. But even if there are minor errors and the equations are not really perfect, rather than just dropping one equation and solving with the other 2, you could again use the notion of the pseudo-inverse to solve this problem, where I have fewer variables than equations.
So, this is the case where we said that if all the equations are not consistent you might not be able to find exact solutions, but we know the pseudo-inverse is a concept that can be used to handle all of these cases: the same number of equations and variables, more equations than variables, and fewer equations than variables. This we saw in detail in the linear algebra part of the lectures. So, notice how a general statement of a problem, "fill in data", can be fleshed out in a very systematic way, and you can then use concepts that you know, which at this point in the course are mainly concepts from linear algebra and statistics.
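A sketch in base R of this linear-algebra route; the object names are hypothetical: 'X_complete' is the m x 5 matrix of complete records and 'x_record' is one incomplete row with missing entries coded as NA.

# Numerical rank of the complete data matrix via the SVD.
s   <- svd(X_complete)
tol <- max(dim(X_complete)) * max(s$d) * .Machine$double.eps
r   <- sum(s$d > tol)

# Right singular vectors for the (near-)zero singular values span the null
# space: each row of A is one linear relation A %*% x = 0 among the 5 variables
# (this assumes r < 5, i.e. some relations actually exist).
A <- t(s$v[, (r + 1):ncol(s$v), drop = FALSE])

# Moore-Penrose pseudo-inverse via the SVD, so the same code handles square,
# over-determined and under-determined cases.
pinv <- function(M) {
  sv   <- svd(M)
  keep <- sv$d > max(dim(M)) * max(sv$d) * .Machine$double.eps
  sv$v[, keep, drop = FALSE] %*%
    diag(1 / sv$d[keep], sum(keep)) %*% t(sv$u[, keep, drop = FALSE])
}

# Substitute the known values into the relations and solve for the missing ones.
known            <- !is.na(x_record)
b                <- -A[, known, drop = FALSE] %*% x_record[known]
x_record[!known] <- pinv(A[, !known, drop = FALSE]) %*% b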
So, maybe you could actually take the completely filled in data with
your data imputation and use the data set for whatever the intended
application is and then look at whether you are getting a performance
that you are happy with. Now if you are not getting a performance that
you are happy with then you basically do not blame the null space
concept, but you say maybe the problem is that these relationships are
not linear or if you assume that there is no error or noise in the data
maybe that assumption is not valid.
non-linear relationships, or you could say there is a lot of noise, so I might use some other idea to fill in the missing data, and so on. For instance, you could attribute a particular distribution to the noise, and depending on the distribution you could use the corresponding technique, and so on.
So, if the solution is realized at this point, well and good. If not, you go back again to making assumptions. Step 3 is where we keep going back and refining our assumptions: whatever assumptions can be verified right away, we verify; for whatever assumptions we have to wait till the final result, we verify them then; and if things work out we are happy, otherwise we continue this assumption-validation cycle till we solve the problem to our satisfaction.
So, in summary, the start of all of this is the arrival of a problem: a whole lot of words, a very diffuse problem statement. Step one is to convert this into one problem statement, or a set of problem statements, as precise as possible. Then, to solve that problem, you do what I would call problem characterization: you break down this high-level problem statement into sub-problems and draw a flow process saying, if I solve this sub-problem, then I am going to use its result in this other sub-problem, and so on.
So, you can think of this as a flowchart that you are drawing with these sub-problems, and, if possible, you get to a granularity level where you are able to identify the class of problem that each sub-problem belongs to. In this course we are calling these function approximation or classification problems, so you identify each sub-problem as either a function approximation or a classification problem.
Even just for classification there are so many techniques; which one do you choose? We already addressed this in the last lecture: you have to look at the assumptions and pick the right method for the solution, and if it turns out that, for the kinds of assumptions you have made, you do not like any method that is out there, then you tweak the existing algorithms a little bit and find a method that is useful, or that will work, for your problem. Once you do this, you actualize the solution in some software environment of choice, get the solution, and assess whether the assumptions are good and whether the solution satisfies your requirements. If it does, you are done; if it does not, you go back, relook at your assumptions, and see how to change or modify them so that you get a solution that you are comfortable with.
use while we are going through this whole process, and then data that has never been used while developing the solution; you test your algorithms, or the flow process that you have come up with, on data that has not been seen. That is called the test data, and this is an important thing to remember; we will emphasize it more when we teach linear regression and classification. So, this is a critical component of assessing assumptions, which we will see in more detail later.
So, the whole thing that we have described till now I have as a flowchart here, where we start with the problem statement, problem conceptualization, solution conceptualization, method identification and realization of the solution. Finally, when you think all the assumptions are satisfied and the solution is acceptable, you are in the clear; if not, you go back and redo this till you get a solution that is of value in terms of the problem that you are solving.
So, with this, the gentle introduction to data science part of the lectures is done. We looked at the types of problems, the techniques and why there are so many techniques, and we also provided a framework to guide your thought process. As I mentioned before, this is a process you can use to think about many different problems in the same way, so that the solution development process becomes easier for you as you go along. You might tweak this framework, and then you will have some mental picture or mental framework that you use to solve data science problems, which could be this one, a tweaked version of it, or something different; nonetheless, the important thing to remember is that you should think about the problem in a consistent way whenever you solve a problem.
What I mean by this is: if you use some framework for a particular type of problem and you become aware that you are using such a mental framework when you are solving a problem, then you can take the same framework to many different problems, and that becomes your thought process for solving data science problems. You do not keep looking at books and saying, I have to do this, this and this; you have your own scientific method for thinking about these problems and solving them.
So, with this we have finished this part of the course. The next set of lectures will be on linear regression, which is a technique for solving function approximation problems, and after that we will have a series of lectures on some clustering algorithms, which can be used largely for classification problems but can also be used in solving function approximation problems. We will then close the course with a case study and one practical problem description. Thank you, and I hope to see you again when linear regression is started.
Data Science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture- 31
Module: Predictive Modelling
(Refer Slide Time: 01:16)
deviation of the individual values from the respective means, (xᵢ - x̅)², summed over all the values and divided by n or n - 1, as the case may be.
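Written out, the sample variance being described here is

s_x^2 \;=\; \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,

with n in place of n - 1 for the population version.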
values of xi on the x axis and yi on the y axis, and for each of these points we can see whether they are oriented in any particular direction. For example, the figure on the left indicates that yi increases as xi increases; there seems to be a trend here. In particular, we can say there is even a linear trend: as xi increases, the corresponding yi increases in a linear fashion.
Whereas, if you look at the third figure, the data seem to have no bearing on each other; that is, the yi values do not seem to depend in any particular manner on the xi values. When xi increases, yi increases in some cases and decreases in others, which is why the points are spread all over the place. So, we can say there is little or no correlation. This is a qualitative way of looking at it; we can quantify it, and there are several measures that have been proposed depending on the type of variable and the kind of association you are looking for.
The numerator represents the covariance between x and y. It can also be computed in another manner: if we expand the definition of the covariance, we find that it is nothing but the sum of the products xᵢyᵢ minus n times the product of the means of x and y. The denominator is the product of the standard deviations of x and y, and we can look at this division by the denominator as a normalisation.
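In symbols, the Pearson correlation coefficient being described is

r_{xy} \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}
{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}},
\qquad
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \;=\; \sum_{i=1}^{n}x_i y_i - n\,\bar{x}\,\bar{y}.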
So, this value is now bounded: we can show that rxy takes a value between -1 and +1. If it takes a value close to -1 we say that the two variables are negatively correlated; if rxy takes a value close to +1 we say they are positively correlated; on the other hand, if rxy happens to take a value close to 0, it indicates that x and y have no correlation between them. Now we will see how we can use this.
have a variable where you have indicated a rating on a scale of, say, 0 to 10 (a rating of this course, let us say), then for those kinds of variables you typically do not apply Pearson's correlation; there are other kinds of correlation coefficients defined for what we call ranked or ordered variables, that is, ordinal variables.
So, let us look at some examples. This is a very famous data set called Anscombe's data set. There are 4 data sets, of which I have shown only 3; each of them contains exactly 11 data points corresponding to xi and yi, and these points have been carefully chosen. For the first one, if you plot the scatter plot, you will see that there seems to be a linear relationship between y and x. In the second data set, if you look at the figure, you can conclude that there is a non-linear relationship between x and y. In the third one, you can say there is an almost perfect linear relationship for all the data points except one, which seems to be an outlier, indicated by the point far away from the line.
whether you actually apply it to the first data set, the second data set or the third data set. In fact, the fourth data set has no clear relationship between x and y, and yet all of them turn out to have the same correlation coefficient. So, if we apply Pearson's correlation we find the same fairly high correlation coefficient in each of these cases, even though the underlying relationships are very different.
Here are 3 examples. In the first example I have taken 125 equally spaced values between 0 and 2π for x and computed y as cos of x, so the relationship between y and x in this case is sinusoidal. If you compute the Pearson correlation coefficient for this data set, you get a very low value close to 0, indicating, as it were, that there is no association between x and y; but clearly there is a relationship, it is just non-linear.
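This example can be reproduced directly in R (the object names here are arbitrary):

x <- seq(0, 2 * pi, length.out = 125)   # 125 equally spaced values in [0, 2*pi]
y <- cos(x)                             # a deterministic, non-linear relationship
cor(x, y)                               # Pearson correlation: essentially 0 despite the relationship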
finally give you a correlation coefficient which is very small, this does not imply that there is no relationship between x and y. All you can conclude from it is that perhaps there is no linear relationship between x and y.
(Refer Slide Time: 13:18)
(Refer Slide Time: 13:57)
So, we have ranked all of these points. Notice that the sixth and the first values are tied; they would get ranks 6 and 7, so each is given the midway value, a rank of 6.5, because of the tie. Similarly, if there are more than 2 values which are tied, we take all the corresponding ranks, average them over the number of data points that have equal values, and assign that average rank to each. We also rank the corresponding y values: for example, in this case the tenth value has rank 1, the eighth value has rank 2, and so on.
So, we have assigned ranks in a similar manner. Once you have the ranks, you compute the difference in ranks between xi and yi for each data point. In this case the difference in rank for the first data point is 2, and we square it; similarly, the difference for the second data point is 2, and squaring it we get 4.
So, like this we take the difference in ranks, square it, and obtain what we call the d squared values; we sum them over all the data points and then compute the coefficient. It turns out that this coefficient also lies between -1 and +1, with -1 indicating a negative association and +1 indicating a positive association between the variables; in this particular case the Spearman rank correlation turns out to be 0.88.
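In the no-ties case (with ties, the averaged ranks described above are used), the coefficient being computed is

r_s \;=\; 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)},

where d_i is the difference between the rank of x_i and the rank of y_i.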
Let us look at some of these properties. As I said, 0 means no association. When there is a positive association between y and x, the Spearman rank correlation rs will be close to +1, like the Pearson correlation, and similarly, when y decreases with x, the Spearman rank correlation is likely to be close to -1, and so on. The difference between Pearson and Spearman is that not only can the latter be applied to ordinal variables, but even if there is a non-linear relationship between y and x the Spearman rank correlation can be high; it is not likely to be 0, it will have a reasonably high value. So, it can be used to distinguish, or to look for, the kind of relationship between y and x.
So, let us apply it to the Anscombe data set. In this case also we find that, for the first data set, the Spearman rank correlation is quite high; for the second one it is also reasonably high; and notice that the Pearson correlation was the same for all of these, while for the third one the Spearman rank correlation is also very high, 0.99. So, all of these indicate that there is a really strong association between x and y.
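R ships with this data set as `anscombe` (columns x1-x4 and y1-y4), so the comparison made here can be checked directly; a small sketch:

# Pearson correlation for each of the four Anscombe data sets: essentially identical.
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

# Spearman rank correlation for the same four data sets: the values now differ,
# reflecting the very different relationships visible in the scatter plots.
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
                            method = "spearman"))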
(Refer Slide Time: 17:18)
I would suggest that you apply it to the cos x example and to the y = x² example; you will find that the Spearman rank correlation for these will be reasonably high. It may not be close to one, but it will be high, indicating that there is some kind of non-linear relationship between them even though the Pearson correlation might be low. A third type of correlation coefficient that is used for ordinal variables is called Kendall's rank correlation; this correlation coefficient also measures the association between ordinal variables. In this case what we define is a concordant and a discordant pair.
(Refer Slide Time: 19:11)
Consider the items here: there are about 7 observations, let us say 7 different wines, or teas, or coffees, and there are two experts who rank the taste of each on a scale from 1 to 10. For the first item, expert 1 gives it a value of 1 and expert 2 also gives it 1; for the second one, expert 1 gives it 2 while expert 2 gives it 3, and so on for the 7 different items.
Now, compare data point 1 and data point 2. In this case expert 1's opinion is that 2 is better than 1, and expert 2 also says 2 is better than 1, so it is a concordant pair: 1 and 2 are concordant, which is what is indicated here. Similarly, if I look at data points 1 and 3, expert 1 says 3 is better than 1 and expert 2 also says 3 is better than 1, so it is a concordant pair. Likewise for 2 and 3: both are in agreement that 3 is better than 2, so it is concordant.
Let us look at the fourth and the first: expert 1 says the fourth is better and expert 2 also says it is better, so it is concordant. But if you compare the second and the fourth, expert 1 says the fourth is better than the second, while expert 2 disagrees and says the fourth is worse than the second. So, there is a discordant pair, indicated by D: 4 and 2 are discordant, and 4 and 3 are discordant.
So, a high value basically says that y and x are associated with each other, and that there is a strong association. Otherwise they are not strongly associated, or, if expert 2 completely disagrees with expert 1, you might even get negative values. So, a high negative value or a high positive value indicates that the two variables, x and y in this case, are associated with each other. Again, this can be used for ordinal variables because it works with ranked values, as we have seen in this example.
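Ignoring ties, the coefficient built from these counts is

\tau \;=\; \frac{n_c - n_d}{n(n-1)/2},

where n_c and n_d are the numbers of concordant and discordant pairs out of the n(n-1)/2 possible pairs; in R it can be obtained with cor(x, y, method = "kendall").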
(Refer Slide Time: 21:56)
Data Science for Engineers
Prof. Shankar Narasimhan
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 32
Linear Regression
So, let us take some examples; let us take a business case. Suppose we are interested in finding the effect of price on the sales volume. Why do we want to find this effect? We may want to determine what selling price to set for an item in order to either boost sales or get a better market share.
So, that is why we are interested in finding what effect price has on sales. The purpose has to be defined first: why are we doing this in the first place? In this case our ultimate aim is to fix the selling price so as to increase our market share; that is the reason we are trying to find this relationship.
(Refer Slide Time: 03:43)
(Refer Slide Time: 05:04)
So, these are various ways of denoting the regression problem. We will always start with the simplest problem, which is simple linear regression, consisting of only one independent and one dependent variable, and analyze it thoroughly.
So, the first of the various questions that we want to ask before we start the exercise is: do we really think there is a relationship between these variables? And if we believe there is a relationship, then we would want to find out whether such a relationship is linear or not.
Also, since we are dealing with data that has random errors, or is stochastic in nature, and we only have a small sample that we can gather from the particular application, we want to ask: what is the accuracy of our model, in terms of how accurately we can estimate the relationship or the parameters in this model?
And if we use this model for prediction purposes subsequently, how good is it? So, these are some of the questions that we would like to answer in the process of developing the regression model.
(Refer Slide Time: 08:06)
So, there are also several methods available in the literature for performing the regression, depending on the kind of assumptions you make and the kind of problems you may encounter. As I said, simple linear regression is the very basic technique, which we will discuss thoroughly. Multiple linear regression is an extension of it for multiple independent variables, but there are other kinds of problems you may encounter when you have several independent variables.
(Refer Slide Time: 09:15)
So, you are really interested in how the selling price affects sales; that is the purpose you actually have. In the engineering case, we said the purpose is to replace a difficult-to-measure variable by other easily measured variables: using a combination of the model and the easily measured variables, we predict a variable which is difficult to measure online, and then, obviously, we can monitor the process using that parameter.
So, the purpose in each case has to be well defined; that then leads you to the selection of variables. Which is the output variable that you want to predict, and what are the input variables that you think are going to affect the output variable? You choose a set of variables, take measurements, and get a sample; proper design of experiments, which is not covered in this lecture, should be done to get what we call meaningful data. Once you have the data, you have to decide the type of model, that is, whether it is a linear model or a non-linear model.
So, let us say we have chosen one type of model; then you have to choose the type of method that you are going to use in order to derive the parameters of the model, or identify the model, as we call it.
Once you have done that, note that, unfortunately, any method comes with a bunch of associated assumptions. You would like to validate or verify whether the assumptions we have made in deriving this model are correct or perhaps wrong. This is done using what we call residual analysis or residual plots. So, we will examine the residuals to judge whether the assumptions made in developing the model are acceptable or not.
Sometimes a few of the experimental data points may be very bad, but you do not know this a priori; you would like to throw them out because they might affect the quality of your model. Therefore, you would like to get rid of these bad data points and use only the good experimental observations for building the model. How to identify such outliers or bad data is also part of regression; you remove them and then redo the exercise.
Finally, once you develop the model, you want to do a sensitivity analysis: if there is a small error in the data, how much does it affect the response variable, and so on.
(Refer Slide Time: 15:12)
So, the data that you use in building this regression model is also called the training data: you use the data to train the model, that is, to estimate its parameters, and such a data set is denoted the training data set. Once you have built the model, you would like to see how well it generalizes: can it predict the output variable, the dependent variable, for other values of the independent variable which it has not seen before?
That brings us to the testing phase of the model. You evaluate the fitted model using data which is called test data; this test data is different from the training data. So, when you do experimental observations, if you have a lot of data, you set apart some for training and the remaining for testing: typically 70 or 80 percent of the experimental data is used for training, that is, fitting the parameters, and the remaining 20 or 30 percent is used to test the model. This is typically done if you have a large number of data points.
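A minimal sketch of such a split in R; the data frame name 'mydata' and the 70/30 proportion are placeholders:

set.seed(1)                                        # for a reproducible split
n         <- nrow(mydata)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- mydata[train_idx, ]                   # used to fit (train) the model
test_set  <- mydata[-train_idx, ]                  # held back to evaluate the fitted model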
well the model predicts on data that it has not seen before, and once you are satisfied with it you can stop. Otherwise, if the model you have developed, even the best model under whatever assumptions you made, a linear model and so on, is not adequate for your purpose, you go ahead and change the model type: you may now want to consider a non-linear model, maybe introduce a quadratic term, or you might want to look at a more general form, and redo this entire exercise.
It may also turn out that, whatever you do, you are not getting a good model; then maybe you should even look at the set of variables you have chosen and the type of experiments that were conducted. There could be problems with those that are affecting the model development phase. So, when all your attempts have failed, you may want to look at the experimental data that you have gathered: how did you conduct the experiments, was there any problem with that, or with the variables you selected, did you miss out some important ones?
So, let us take one small example, which we will use throughout. This is a small sample of 14 observations taken on a servicing problem involving service agents. These service agents, let us say, are like the Forbes aquaguard service agents that come to your house: they visit several houses and take a certain amount of time to service a unit, or repair it if it is down.
So, they report the total amount of time, in minutes, that they have spent on servicing different customers, and the number of units that they have serviced in a given day. Let us assume that every day a service agent goes out on his rounds, notes the total amount of time he has spent, and at the end of the day reports to his boss the number of units that he or she has repaired. Let us say there are several such agents roaming around the city, and each of them comes back and reports.
Let us say there are 14 such data points, from the same person or from multiple persons, that you have gathered. From this data, the question we want to answer is the following: suppose an agent gives you data and you monitor him for a week or a month, recording how much time he is spending and how many units he is repairing every day, and you want to judge the performance of the service agent, in order to reward him or appropriately improve his productivity.
So, if you know the relationship between the time taken and the number of units repaired, which you believe should exist, then if somebody takes more time and is not repairing much, there is some inefficiency; maybe he is wasting too much time in travel in between, or whatever. So, we need to find this out. The purpose is to judge the service agents' performance and give performance incentives in order to improve their productivity. We are therefore interested in developing a relationship between the number of units and the time taken, or vice versa.
Now comes the exact mathematical form in which we state this problem. We have data points xi of the independent variable, n of them; in the example n is 14. And y is the dependent variable which we want to use for prediction, or whatever purpose we want this model for.
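The model form being referred to is the usual simple linear regression model

y_i \;=\; \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,

where \beta_0 is the intercept, \beta_1 the slope, and \varepsilon_i the random error affecting the i-th observation of the dependent variable.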
In this case, as I said, given both x and y, I would like to rate the agent: if some service agent comes and says, this is the time I have spent and this is the number of units I have repaired, then based on his performance over a month or a week you would like to rate the service agent. That is the purpose of building this model.
In this particular case you can say that the linear model is only an approximation of the truth, and anything that we are not able to explain can perhaps be treated as a random error or modeling error. Whatever the reason, the most important thing to note is that this particular model form, the ordinary least squares formulation, does not allow error in the independent variable.
So, when you choose the independent variable, one of the things you need to do carefully is ensure that it is the more accurate of the two variables. If you have a two-variable case, you should choose as the independent variable the one which is the most accurately measured; in fact, it should probably be error free.
So, let us take the case of the units and minutes. Typically, the number of units repaired by a service agent will be reported exactly, because he will have a receipt from each customer saying that the unit was serviced; the total number of receipts that the service agent has gathered precisely represents the number of units serviced.
The time that has been reported, on the other hand, contains other factors that we may not have precisely accounted for; unless the service agent goes with a stopwatch and measures exactly the time for each repair, he will typically report the total time spent in servicing all of these units, including travel time and so on. That is the kind of data that you might get.
So, it does not matter how you cast this equation or how you build this model; what is more important is that when you apply ordinary least squares, you should ensure that the independent variable is an extremely accurate measurement, or represents the truth as closely as possible, whereas y could contain other factors or errors and so on, and the method is tolerant to that.
If, on the other hand, you believe both x and y contain significant error, then perhaps you should consider other methods, called total least squares or principal component regression, which we will talk about later, possibly not in this lecture, but if we have the time we will do it later.
So, essentially, what I am saying is that once you have decided, based on the purpose and on the quality of the measurements, which is the independent and which is the dependent variable, then you can ask: given all n observations, what is the best estimate of β0 and β1? As I said, β0 is the intercept parameter and β1 is the slope. Geometrically, β0 represents the value of y when x = 0: when you put x = 0 and look at where the line intercepts the y axis, that vertical distance is β0; and β1 represents the slope of the regression line. So, you are estimating the intercept and the slope.
So, what is the methodology for estimating β0 and β1? We will do a kind of thought experiment: you give values of β0 and β1 and then you can draw this line. We will ask different people for values of β0 and β1 and draw the corresponding lines; the slope and intercept of the line will be different depending on what values are proposed for β0 and β1. Once you have done this, we go back and find out how much deviation there is between the observed values and the line. In this particular case, take the observed value yi corresponding to this xi, which is 8.
Now, if this particular equation were correct, then the line gives the predicted value of y: for this given value of xi, according to the equation, you believe the predicted y should be here. The deviation between the observed value and the predicted value, which is on the line, is the vertical distance; this is what we call the estimated error. You do not know what the actual error is, but if you propose values for β0 and β1, I can immediately derive an estimate of this error, which is the vertical distance of the point from the line. We estimate this error for all data points.
So, over all the data points, we compute this distance, or rather the square of this value, and sum it over all n data points. And we try to find the β0 and β1 which minimize this sum of squared values, that is, which minimize the sum of the squared vertical distances of the points from the line.
So, the notion of a best-fit line in the least squares sense, or the ordinary least squares sense, is one that minimizes the vertical distances of the points from the proposed line. Once you set up this formulation, you can say that whoever gives the best β0 and β1 will have the minimum vertical distances of the points from that line. And this can now be done analytically: instead of asking you for β0 and β1, I solve this optimization problem, that is, minimize this criterion and find the β0 and β1 which minimize it. This is what is called an unconstrained optimization problem with 2 parameters: you differentiate the criterion with respect to β0 and set it equal to 0, and similarly for β1; these are called the first-order conditions.
Those of you who have done a little bit of optimization, or calculus, will know that all I have to do is differentiate this function with respect to β0 and set it equal to 0, differentiate it with respect to β1 and set it equal to 0, and solve the resulting set of equations. Finally, I will get the solution for β0 and β1 which minimizes this sum squared error. So, the least squares technique uses this as the criterion to derive the best values of β0 and β1.
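Carrying out those first-order conditions gives the familiar closed-form least squares estimates

\hat{\beta}_1 \;=\; \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\qquad
\hat{\beta}_0 \;=\; \bar{y} - \hat{\beta}_1\,\bar{x};

that is, the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept makes the fitted line pass through the point (x̄, ȳ).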
Of course, you could counter by saying, I will use some other metric, maybe the absolute value; that would make the problem more difficult. This method was proposed in the late 1700s by Gauss, or by another person called Legendre, and it has become popular as a methodology, although in recent years other methods have also taken over. So, the method of least squares is a very popular technique and, for the simplest cases, it gives you the parameters analytically. So, you get the estimate of β0. Remember that the estimate you derive should not be treated as the truth: it is an estimate from data. Had you given me a different sample, maybe I would have got a different estimate; the estimate is always a function of the sample that you are given.
parameter. Of course, one could also ask: suppose I know a priori that when x is 0, y is 0. In this particular case, for example, if you do not service any units, which means you have not traveled, let us say you are on holiday, then clearly you would have taken 0 time for servicing. So, I know in this particular case that if you service 0 units you should not have taken any time.
So, essentially you take the variance around 0 and the cross-covariance around 0, and then you get the estimated value of β1; β0 in that case is assumed to be 0. So, the line will pass through (0, 0) and you will get a different slope: you are forcing the line to pass through the origin. Remember that you have to be careful when you do this, because unless you are sure the line should pass through the origin, you should not force it; you will get a bad fit. If you know it and it makes physical sense, then you are well within your rights to force β0 = 0 and not estimate it. That can be done by simply taking the cross-covariance and variance around 0 instead of around the respective means.
So, this is as far as getting the solution is concerned. Now, once you have the solution, you can ask, for every given xi, what is the corresponding predicted value of yi using the model? You plug your value of xi into the estimated model, which uses the estimated parameters β̂0 and β̂1, and I will call this prediction ŷi; it is also an estimated quantity. For any given xi, I can estimate the corresponding yi using the model.
You can do this for any new point which you have not seen before, in the test set also. Let us look at a couple of other measures which you can derive from this. We can talk about what is called the coefficient of determination, r squared, which is defined as 1 minus the sum, over all data points, of the squared differences between the observed and predicted values, divided by the total variability of y, which is the sum of (yi - y̅)².
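In symbols, the coefficient of determination just described is

R^2 \;=\; 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.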
So, you can say this much variability exists in the data. Suppose I build the model and try to predict yi: if xi had an influence on y, then I should be able to reduce this variability, I should be able to make a better prediction, and the differences between yi and ŷi should be lower, if xi had a strong influence in determining y.
relationship is good and so on. And for the Anscombe data, for example, which we saw last class, if you compute r squared for the 4 datasets you will get the same value for all of them, and that does not mean that the linear model is good in each case.
So, you can regard the denominator as the result of fitting a model
with just the parameter β0. In the numerator, on the other hand, you
have used 2 parameters to fit the model. Whenever you use more
parameters, you should typically get a better fit. So, generally a
smaller numerator is obtained simply because you have used 2
parameters, whereas in the denominator you used only 1 parameter.
So, you have to account for the fact that you may have obtained a
better fit because you used more parameters, and not because there is
a linear relationship between yi and xi.
So, you should go back and account for what we call the number of
parameters you have used, or the number of degrees of freedom used
in estimating the numerator. For example, you have n data points and,
in this case, p = 1 independent variable, that is, only β1 beyond the
intercept. So, n - 2 represents the number of degrees of freedom used
to estimate the numerator variability, whereas n - 1 is used to estimate
the denominator variability, because you used only the parameter β0
for the denominator whereas you used 2 parameters to estimate the
numerator.
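This degrees-of-freedom correction is exactly what the adjusted R squared does. A short sketch, assuming SSE and SST are the sums of squares defined above and n is the number of data points:

    # Adjusted R squared: each sum of squares divided by its degrees of freedom,
    # so that adding parameters is penalised (p = 1 independent variable here)
    R2_adj <- 1 - (SSE / (n - 2)) / (SST / (n - 1))
    # summary(model)$adj.r.squared reports the same quantity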
419
(Refer Slide Time: 38:31)
So, loading of the data set you would have already seen; lm is the
function that you use to build the model, and you indicate what the
dependent and independent variables are. Then you will get an output
like the one given here. First you will get the range of the residuals,
which, as I said, are the estimated values of εi for all the data points;
in this case all 14 residuals are not listed, only the minimum, maximum,
first quartile, third quartile and the median are given here.
I will now look at only the 2 parameters: β0, which comes first and
is called the intercept, with its estimated value given here, and the
slope parameter, whose estimated value is 15.5 for this particular data
set. I will also focus on this particular line, which gives the R squared
value, which we explained is used to judge the quality of the model;
you get a very high R squared here, and also the adjusted R squared.
420
In subsequent lectures I will explain them. And we will see you in the
next lecture.
Thank you.
421
Data Science for Engineers
Prof. Shankar Narasimhan
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 33
Model Assessment
Then, even if you fit a model, you may want to find out which
coefficients of the linear model are relevant. For example, in the one-
variable case that we saw, with one independent variable, the only 2
parameters that we are fitting are the intercept term β0 and the slope
term β1. So, we want to know whether we should have fitted the
intercept at all, or whether we should have taken it as 0. When we have
several independent variables in multilinear regression, we will see that
422
it is also important to find out which variables are significant, whether
we should use all the independent variables or whether we should
discard some of them.
So, this particular test for finding which coefficients of the linear
model are significant is useful not only in the univariate case, but is
even more useful in multilinear regression, where we want to identify
the important variables. Suppose the linear model that we fit is
acceptable; then we would also want to see whether we can improve
the quality of the linear model. When fitting a linear model using the
method of least squares, we make several assumptions about the errors
that corrupt the dependent variable measurements.
So, are these assumptions really valid? What are some of the
assumptions that we make about the errors that corrupt the
measurements of the dependent variable? We assume that the errors
are normally distributed; only this assumption can actually justify the
choice of the method of least squares. We also assume that the errors
in different samples have the same variance; this is called the
homoscedasticity assumption.
In general, these two assumptions about the errors, that they are
normally distributed with identical variance, can be compactly
represented by saying that εi, the error corrupting measurement i, is
normally distributed with zero mean and variance σ2, that is,
εi ~ N(0, σ2). Notice that σ2 is the same and does not depend on i,
which means it is the same for all samples i = 1 to n; that is the
assumption we are making when we use the standard method of least squares.
Now, we also assume that all the measurements we have made are
reasonably good and that there are no bad data points, or what we call
outliers, in the data. We saw that even when we are estimating a
sample mean, one bad data point can result in a very bad estimate of
the mean. Similarly, in the method of least squares, if we have one bad
data point, it can result in a very poor estimate of the coefficients. So,
we want to remove such bad data from our data set and maybe fit a
linear model using only the remaining measurements, and that will
improve the quality of the linear model.
So, these are some of the things that we need to actually verify:
whether the assumptions we have made about the errors are reasonable
or not, and if there are bad data, whether we can remove them or not.
We will look at the first two questions in this lecture, which are to
assess whether the linear model that we have fitted is good and to
decide whether the coefficients of the linear model are significant.
423
(Refer Slide Time: 04:33)
The second important property that we need to derive about the estimates
424
is the variability of the estimates. Notice we get different
estimates of β̂0 and β̂1 depending on the sample that we have been
given, and therefore we want to ask what the spread of these estimates
would be if we were to repeat this experiment. Based on the
assumptions we have made, we can show that the variance of β̂1 will
be σ2 / Sxx. Sxx represents the spread of x, that is, (x - x̅) squared
summed over all the samples, whereas σ2 represents the variance of
the error that corrupts the dependent variable y. So, σ2 is the error
variance, Sxx is the variability of the independent variable, and this
ratio can be shown to equal the variance of β̂1.
We have fitted a linear model, so for every xi we can predict from
the linear model the estimate ŷi for every sample. Then we can take
the difference between the measured and the predicted values of the
dependent variable, sum the squares, and divide by n - 2; that is a good
estimate of σ̂2, the variance of the error in the dependent variable.
Now, why do we divide by n - 2 instead of n or n - 1? Very simple:
ŷi was estimated using the linear model, which has 2 parameters, β0
and β1. That means, in effect, 2 degrees of freedom have been used up
to estimate β0 and β1, and therefore only the remaining n - 2 are
available for estimating this σ2.
Suppose you had only two samples; then your numerator would be
exactly 0. Only when you have more than two samples is there
variability left over, and that variability is caused by the error in the
dependent variable. That is one of the reasons you divide by n - 2: two
degrees of freedom have been used to estimate the parameters β0 and
β1. Now, this particular numerator term is also called the sum squared
errors, or SSE for short, and so σ̂2 is nothing but SSE divided by n - 2.
425
So, from the data, after you have fitted the model, you can compute
the residuals, compute the SSE, and obtain an estimate for σ̂2. You do
not need to be told the accuracy of the instrument used to measure the
dependent variable; you can get it from the data itself.
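A two-line R sketch of this estimate, assuming model is the fitted lm object:

    # Estimate of the error variance from the residuals of the fit
    SSE        <- sum(resid(model)^2)        # sum squared errors
    sigma2_hat <- SSE / df.residual(model)   # df.residual is n - 2 here
    # summary(model)$sigma is the square root of this residual variance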
So, finally, we have not only got the first moment properties of β̂0
and β̂1, but also the second moment properties, which are the variances
of β̂1 and β̂0, and we can also derive the distribution of the parameters.
In particular, β̂1 can be shown to be normally distributed. Because the
expected value of β̂1 is β1, it is normally distributed with the true
unknown value β1 as its mean and variance σ2 / Sxx, which is exactly
what we derived above as the variance of β̂1.
Now, if you do not know σ2, you can replace it with its estimate
σ̂2 = SSE / (n - 2). So, once you have derived the distribution of the
parameters, we can perform hypothesis tests on the parameters to
decide whether they are significantly different from 0, and that is what
we are going to do. We can also derive what we call confidence
intervals for these estimates based on their distribution characteristics,
that is, the mean and the variance.
426
size you need to have, and correspondingly you can obtain the interval
from the distribution.
Notice this is very similar to the usual normal-distribution statement
which says that the true value will lie between the estimate + or - 2
times the standard deviation. The reason why we have 2.18 instead of
2 is that we are no longer obtaining the critical value from the normal
distribution but from the t distribution, because σ2 is estimated from
the data and not known a priori.
So, the distribution changes slightly: it is not the normal distribution
but the t distribution, and that is what is pointed out here. The value
2.18 is nothing but the upper 2.5 percent critical value of the t
distribution with 12 degrees of freedom. Why 12 degrees of freedom?
Because in this particular example we had fourteen points, and two
degrees of freedom were used for estimating the two parameters, so
n - 2 degrees of freedom remain, which is 12. In general, depending on
the number of data points, this value 2.18 will change, because the
degrees of freedom of the t distribution from which you pick the upper
and lower critical values change. So, the lower critical value is - 2.18
and the upper critical value is 2.18, each cutting off 2.5 percent, so the
overall significance level is 5 percent, and this confidence interval
represents the 95 percent confidence interval for β1.
So, what we are going to state is that the true unknown β1 lies
within this interval with ninety five percent confidence; that is what
we are saying. The standard deviation of β̂1, which we call s_β̂1, can
be estimated from data, so you can construct the confidence interval.
Similarly, you can construct the 95 percent confidence interval for β0
from its variance. We do the same thing: β̂0 + or - 2.18 times the
standard deviation of β̂0 estimated from data, which is what we call s_β̂0.
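A minimal R sketch of these intervals, assuming model is the lm fit of the 14-point service data and that the regressor column is literally named x:

    # 95 percent confidence interval for the slope, using the t distribution
    est   <- coef(summary(model))                 # estimates and standard errors
    tcrit <- qt(0.975, df = df.residual(model))   # about 2.18 when df = 12
    lo <- est["x", "Estimate"] - tcrit * est["x", "Std. Error"]
    hi <- est["x", "Estimate"] + tcrit * est["x", "Std. Error"]
    c(lo, hi)

    # confint(model, level = 0.95) returns the intervals for both parameters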
427
(Refer Slide Time: 15:29)
So, let us look at why we would want to do this hypothesis test at
all. We have fitted a linear model assuming that there is a linear
dependency between x and y, and we have obtained an estimate β̂1;
we have also fitted an intercept term. We may want to ask: is the
intercept term significant, or should the line pass through (0, 0), the
origin? Or maybe the y variable does not depend on x in a significant
manner, which would mean the unknown β1 is exactly equal to 0, even
though we have obtained some non-zero estimate β̂1.
So, between these two models we want to pick whether the reduced
model is acceptable, or whether the full model is to be accepted and
the reduced model rejected; that is what we are doing when we test the
hypothesis β1 = 0 versus β1 ≠ 0. Remember, β1 can be positive or
negative, and that is why we are doing a two sided test.
428
So, we can do it in 2 ways. From the confidence interval: if the
confidence interval for β1 includes 0, we cannot reject the null
hypothesis. Notice that we have constructed the confidence interval
for β1; the lower limit β̂1 - 2.18 s_β̂1 may be negative and the upper
limit positive, in which case the interval includes 0, and then we would
decide that β1 is insignificant and that the true β1 may well be 0. On
the other hand, if the interval lies entirely to the left of 0, which means
both limits are negative, or entirely to the right of 0, which means both
limits are positive, then the interval does not contain 0, and we can
reject the null hypothesis that β1 = 0, which means β1 is significant.
So, from the confidence interval itself it is possible to decide whether
or not to reject the null hypothesis.
429
So, before performing the F test to check whether a reduced model
is adequate or whether we should accept the full model, we will
introduce some definitions for sum squared quantities. Let us say we
have a set of data; in this case we have the example of the number of
units that were repaired and the time taken in minutes to repair the
units by different sales persons, and we had fourteen such data points,
fourteen such salesmen, who have reported the data. The red points
represent the data, and the best linear fit, obtained using the method of
least squares on all the data points, is indicated by the blue line.
Consider the distance between yi and ŷi. Suppose we assume that
the slope parameter is relevant; then we would have fitted this blue
line, and for every xi, take this (xi, yi): the predicted value of yi using
this linear model would be the intersection of the vertical line through
xi with the blue line, represented by the blue dot, which is what we
call ŷi. Therefore, this vertical distance between the measured and the
predicted value, squared and summed, is the sum squared errors,
called SSE, Σ(yi - ŷi)², and this is the total error if we include the
slope parameter in the fit.
430
So, SS total will always be greater than SSE, and therefore the
difference, SSR, will also be positive; all of these are positive
quantities. Now, one can interpret SS total as the goodness of fit if we
assume a constant model, and SSE as the goodness of fit of the linear
model, and therefore we can use these to perform a test. Intuitively,
we can say that if the reduction obtained by including the slope
parameter, that is SST - SSE, is significant, then we conclude it is
worthwhile including this extra parameter, otherwise not. This can be
converted into a formal hypothesis test, and that is what is called the F test.
So, this quantity, SST, represents the sum squared errors for the
reduced-order model fit, and SSE represents the goodness of fit under
the alternate hypothesis. If this difference is large enough, as I said,
then we can say it is probably worthwhile going with the alternate
hypothesis rather than the null hypothesis. So, SSR, which is the
difference between these two, should be large enough.
431
So, the denominator provides a normalisation: SSE is the error
obtained under the alternative hypothesis. Remember, because of the
different number of parameters used in the two models, we have to
take the degrees of freedom into account. SST has n - 1 degrees of
freedom, because we are fitting only one parameter, while SSE has
n - 2 degrees of freedom, because we are fitting 2 parameters. So, the
difference corresponds to only one extra parameter; the numerator,
which is SSR, has only one degree of freedom, which is
(n - 1) - (n - 2), whereas the denominator, SSE, has n - 2 degrees of
freedom because 2 parameters were fitted. So, we divide SSE by
n - 2, its number of degrees of freedom.
So, the denominator is the average sum squared error per degree of
freedom; that is the normalisation. SSR divided by this normalised
quantity is the test statistic, and we can show formally that it is an
F statistic, because it is a ratio of two sums of squares, each of which
is a χ squared variable, being a sum of squares of normal variables.
We have seen in hypothesis testing that the ratio of two χ squared
variables, each divided by its degrees of freedom, follows an F
distribution with the appropriate degrees of freedom. The numerator
degrees of freedom is 1, and the denominator degrees of freedom is n - 2.
So, there are now several ways of deciding whether the linear
model that we have fitted is good or not. We could use the r squared
value: we said that if it is close to + 1, that is one indicator that the
linear model may be good; it is not sufficient to conclude, but it is a
good indicator. We can also do the test for the significance of β1: if
we conclude that β1 is not significant, then maybe a linear model is
not good enough and we have to find something else. Or we can do an
F test and conclude whether including the slope parameter is
significant. So, these are various ways by which we can decide
whether the linear model is acceptable or not, or whether the fit is
good. We cannot stop there, we have to do further tests, but at least
these are good initial indicators that we are on the right track.
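The F test for the slope can be carried out in R by comparing the reduced and full models explicitly. A brief sketch, with x and y standing in for the data:

    # Compare the intercept-only (reduced) model with the full linear model
    reduced <- lm(y ~ 1)      # constant model: only beta0
    full    <- lm(y ~ x)      # linear model: beta0 and beta1
    anova(reduced, full)      # reports SSR, the F statistic and its p value

    # The same F statistic appears in the last line of summary(full)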
432
(Refer Slide Time: 28:38)
But it also tells you the estimated standard deviation of the
intercept, which is s_β̂0, and the estimated standard deviation of the
slope estimate, which turns out to be 0.505; all of these are calculated
from the data using the formulas we have described. Now, once these
are given, we could construct confidence intervals ourselves and find
out whether the coefficients are significant or not, but R itself tells you
something: it runs the hypothesis tests and lets you conclude whether
β0 is significant and whether β1 is significant, and that is indicated by
what is called the p value that is reported.
So, the t values represent the test statistics described earlier. R has
computed the statistic for β̂0 and the statistic for testing whether
β1 = 0 or not, compared each statistic value with the critical value
from the t distribution with the appropriate degrees of freedom, and
reported the corresponding probability, which for the intercept is 0.239
433
which means: if you get a high value for this, anything greater than
0.01 or 0.05, you should not reject the null hypothesis; on the other
hand, if you get a very low value, you should reject the null
hypothesis, and the lower the p value, the greater the confidence with
which you can reject it.
Therefore, you can conclude from these values that β̂0 is
insignificant, which means β0 = 0 is a reasonable hypothesis, while
β1 = 0 is not a reasonable hypothesis. Let us go and see whether this
makes sense for this data. We know that if no units are repaired, then
clearly no time should be taken by the repair person: you have not
taken any time for servicing because you have not repaired any units.
So, this line technically should pass through (0, 0), and that is what
the test is saying. We went ahead and fitted an intercept term, but the
hypothesis test says you can safely assume that β0, the intercept, is 0;
it makes physical sense also, and we could have fitted only β1, which
is good enough for this data.
So, perhaps you should redo this linear fit without β0, using only
β1; you will get a slightly different solution and you can test again.
Another way of deciding whether the slope parameter is significant or
not is to look at the F statistic. Notice the F statistic is very high and
its p value is very low, which means you will reject the null
hypothesis that the reduced model is adequate, implying that including
β1 is worthwhile: you will get a better fit using β1 in your model. So, a
high value of the statistic, or a low p value for this F statistic,
indicates that you will reject the null hypothesis even at a very low
significance level.
434
(Refer Slide Time: 33:40)
Now, all this we have done only for the single-variable case. We will
extend it to the multilinear case, and we will also look at the other
assumptions, the influence of bad data and so on, in the following
lecture. So, see you in the next lecture.
435
Data Science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 34
Diagnostics to Improve Linear Model Fit
436
We have seen this before, when we analyzed statistical measures in
one of the lectures: the Anscombe data set consists of 4 data sets, each
of them having 11 data points, x versus y, synthetically constructed to
illustrate a point. These 4 data sets are plotted x versus y; the scatter
plots are given for the first data set here, then the second, third and
fourth. If you actually look at the scatter plots, we may say that a
linear model is adequate for the first data set, and perhaps for the third
data set, but the second data set indicates that a linear model may not
be a good choice; a non-linear model or a quadratic model may be a
better fit. The last data set is a very poorly designed data set. You can
see that the experiment is conducted only at 2 distinct values of x: you
have one value of x for which 10 experiments were conducted, so you
have 10 different y values for the same x, and then you have one more
experimental observation at a different value of x.
So, in this case you should not attempt to fit a linear model to the
data. Instead, you should ask the experimenter to go and collect data
at different values of x, then come back and check whether a linear
model is valid. Unfortunately, when we actually apply linear
regression to these data sets and find the slope and intercept
parameters, we find that in all 4 cases we get the same intercept value
of 3; you can see that for all 4 data sets you get a value of 3, and you
also get the same slope parameter, which is 0.5 in all 4 cases.
So, whichever of these 4 data sets you fit the regression model to,
you will get the same estimates of the intercept and slope.
Furthermore, you get the same standard errors, which are 1.12 for the
intercept and about 0.1 for the slope. And if you construct a
confidence interval for the slope parameter, you may end up accepting
that the slope is significant for all 4 cases, and you may incorrectly
conclude that the linear model is adequate. You can compute the r
squared value; it will be the same for all 4 data sets. You can run the
hypothesis test of whether a reduced model is acceptable compared to
a model with the slope parameter; again you will reject the null
hypothesis using the F statistic. And for all 4 cases you will get the
same identical result, that a linear model is a good fit.
Clearly it is not so.
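The quartet ships with R as the built-in data set anscombe (columns x1..x4 and y1..y4), so this can be verified directly:

    # Fit a line to each of the 4 Anscombe data sets
    fits <- lapply(1:4, function(i)
      lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
    sapply(fits, coef)                               # same intercept (~3) and slope (~0.5)
    sapply(fits, function(f) summary(f)$r.squared)   # same r squared for all 4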
One can of course draw scatter plots and try to judge it visually in
this particular case, because it is a univariate example; but when you
have many independent variables, you have to examine several such
scatter plots, and that may not be very easy.
So, if you assume there are 100 independent variables, you have to
examine 100 such plots of y versus x. And it may not be possible for
you to make a visual conclusion from that. So, we will use other kinds
437
of plots called residual plots, which will enable us to do this whether it
is a univariate regression problem or a multivariate regression problem,
we will see what these are.
So, the main question that we are trying to ask now is whether a
linear model is adequate. We have seen some measures, but they are
not sufficient, so we will use additional tools. Also, when we did the
linear regression, we made additional assumptions, although they may
not have been stated explicitly: we assumed that the errors that corrupt
the dependent variable are normally distributed and that they have
identical variance. Only under these assumptions can you use the least
squares method to perform linear regression, in the sense that you can
prove that the least squares method has some nice properties.
438
(Refer Slide Time: 06:27)
But the residuals, which are a result of the fit, will not have the same
variance for all data points. In fact, you can show that the variance of
the ith residual is σ2, the variance of the error in the measured value
of the dependent variable, multiplied by (1 - pii), where pii is defined
on the slide (for simple linear regression it works out to
1/n + (xi - x̅)²/Sxx). Notice that pii depends on the ith sample,
because its numerator depends on the ith sample, and therefore it
varies from sample to sample.
So, the variance of the residual will not be identical for all samples;
it is given by this quantity. We can also show that the residuals are not
independent, even though we assume that the errors corrupting the
measurements are all independent: the residuals of different samples
have a covariance, which can be shown to be given by this quantity.
The reason for the variances of the residuals not being identical, or for them
439
being correlated, is that the ŷi we have here is a result of the
regression and depends on all the measurements; it is not dependent
only on the ith measurement. This predicted value is a function of all
the observations, and because of that it introduces correlations
between the different residuals and also imparts different variances to
different residuals. Having derived this, notice that even if we do not
have a priori knowledge of σ2, the variance of the error in the
measurements, we can estimate this quantity.
We have already seen this in the previous lecture: we can replace
σ2 by SSE/(n - 2), which is an estimate of σ2, and substitute it to get
an estimated variance for each residual.
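A minimal R sketch of these quantities, assuming model is the fitted lm object:

    # Estimated variance of each residual and the standardized residuals
    p_ii       <- hatvalues(model)                        # the p_ii (leverages)
    sigma2_hat <- sum(resid(model)^2) / df.residual(model)
    var_resid  <- sigma2_hat * (1 - p_ii)                 # varies from sample to sample
    std_resid  <- resid(model) / sqrt(var_resid)
    # rstandard(model) returns the same standardized residuals directly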
440
(Refer Slide Time: 10:21)
So, we will draw what we call a residual plot: we plot the residuals
against the predicted, or fitted, value of the dependent variable.
Remember there is only one dependent variable; even if there are
multiple independent variables, we have only one dependent variable,
so we can always plot the residuals against its predicted value, which
you obtain after the regression.
So, this is called the residual plot, the residuals versus the fitted or
predicted values, and this plot is very useful in testing the validity of
the linear model, in determining whether the errors are normally
distributed, that is, whether the assumptions on the errors are
reasonable, and in determining whether the variances of all errors are
identical, which is called the homoscedastic case, or whether the
variances of the errors in different measured values are non-identical,
which is called the heteroscedastic case, or heteroscedastic errors. So,
let us see how the plot looks for each of these cases.
441
(Refer Slide Time: 11:28)
Now, let us draw the residual plots for the 4 data sets provided by
Anscombe. Notice that we have done the regression and computed all
these quantities, r squared, confidence intervals; they all turned out to
be identical and gave us no clue as to whether the linear model is good
for all 4 data sets or not. Basically, they would say that the linear
model is adequate.
But consider the residual plots. Here I have plotted the residuals
against the independent variable, which is fine because it is a
univariate case; technically you should plot the residuals against the
predicted value of the dependent variable, but because the predicted
value is linearly dependent on x, it does not matter here, the pattern
will look the same, and you can try it out for yourself by plotting the
residuals against the predicted values. You will then get this kind of
pattern of residuals for the 4 data sets. The first data set exhibits no
pattern: the residuals seem to be randomly distributed, in this case
between - 3 and + 3. For the second data set, on the other hand, there
is a distinct pattern: the residuals look like a quadratic, like a parabola,
so there exists a pattern in data set 2. For the third data set, the
residuals are more or less constant, or mildly linear; there seems to be
a small bias, a slight slope left in the residuals. Data set 4, as we saw
before, is a poorly designed experimental data set: all but one of the y
values are obtained at a single x value, and that is what the residuals
are also showing. The 10 data points obtained at the same x value
show different residuals, and there is one single residual at a different
x value.
442
So, from this you cannot judge anything. All you can say is that the
experimental data set is very poorly designed, and you need to go back
to the experimenter and ask for a different data set. Based on this, we
can safely conclude that for data set one a linear model is clearly
adequate: all the previous measures were satisfied, and now the
residual plot also shows a random pattern, that is, no pattern, so a
linear model is adequate. Whereas for data set 2, by looking at the
residual plot we can conclude that a linear model is inadequate and
should not be used for this data set.
For the third data set, however, we know there is one data point
lying far away, and perhaps that is the one causing this slightly linear
pattern here. If we remove this outlier and refit, maybe this problem
will get resolved and a linear model may be adequate for data set 3.
For data set 4, again there is a distinct pattern, and therefore we can
conclude that a linear model should not be used.
443
Now, the test for normality can also be done using the residuals.
We have already seen, when we did statistical analysis, the notion of a
probability plot, where we plot the sample quantiles against the
quantiles from the distribution with which we want to compare.
So, if we want to check whether a given set of data follows a
certain distribution, we plot the sample quantiles against the quantiles
drawn from that particular distribution. In this case we want to test
whether the standardized residuals come from a standard normal
distribution, and therefore we take the quantiles from the standard
normal distribution and plot against them. Just to recap what we mean
by a quantile: it is the value below which a certain percentage of the
data lies. For example, if you want to find the value below which 10
percent of the data lies, it is - 1.28 here for the standard normal, which
means 10 percent of the samples lie below - 1.28, 20 percent of the
samples lie below - 0.84, and so on and so forth.
444
So, this is a sample Q-Q plot for an arbitrary data set of samples
drawn from the standard normal distribution, and you can see that if
you do the normal Q-Q plot for the residuals after fitting the
regression line, it seems to closely follow the 45-degree line. The
theoretical quantiles computed from the standard normal distribution
and the quantiles computed from the sample standardized residuals
fall on the 45-degree line, and therefore in this case we can safely
conclude that the errors in the data come from a standard normal
distribution.
If this does not happen, that is, if we find a significant deviation of
the quantiles from the 45-degree line in the Q-Q plot, then the normal
distribution assumption is incorrect, which means we have to modify
the least squares objective function to accommodate this. It may or
may not have a significant effect on the regression coefficients, but
there are ways of dealing with it which I will not go into.
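A short R sketch of this check, assuming model is the fitted lm object:

    # Normal Q-Q plot of the standardized residuals; points close to the
    # reference line support the normality assumption
    qqnorm(rstandard(model), main = "Normal Q-Q plot of standardized residuals")
    qqline(rstandard(model))    # reference line through the quartiles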
445
the residuals, whereas such an effect is not found in the data set
corresponding to the left-hand side. So, here I have plotted the
standardized residuals for 2 different data sets, just to illustrate the
type of figures you might get. If you get a figure such as the one on
the left, then we can safely conclude that the errors in different
measurements have the same variance, whereas if you see a funnel
type of effect, then you know that the error variance increases as the
value increases. So, the variance depends on the value itself, which
implies that you cannot use the standard least squares method; you
should use a weighted least squares method.
446
The last thing that we need to do is clean the data: we do not want
to use data that have large errors, what we call outliers, points which
do not conform to the pattern found in the bulk of the data. Outliers
can be easily identified using a hypothesis test on the residual for each
sample. We have found the standardized residuals, and we know the
standardized residual roughly follows a t distribution; for a large
enough number of samples, we can assume that it follows a normal
distribution. So, if I use a 5 percent level of significance, we can run a
hypothesis test for each sample residual, and if the standardized
residual lies outside - 2 to + 2, we can conclude that the sample is an outlier.
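In R this check is a one-liner, assuming model is the fitted lm object:

    # Flag suspected outliers: standardized residuals outside -2 to +2
    std_res  <- rstandard(model)
    outliers <- which(abs(std_res) > 2)   # row indices of suspected outliers
    outliers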
447
(Refer Slide Time: 23:34)
448
initial indicators are that a linear model is adequate. Now let us go
ahead and do the residual analysis for this.
So, we can remove all these 4 outliers at the same time if you want;
as I said, that is not the best idea, and perhaps we should remove only
sample number 35, which is furthest away from the boundary, the one
whose residual has the largest magnitude, and then redo the analysis.
Because of lack of time, I have just removed all 4 at the same time and
then done the analysis; my suggestion is that you remove one at a time
and repeat this exercise for yourself. Here we have removed these 4
samples, 4, 13, 34 and 35, and rerun the regression analysis.
449
(Refer Slide Time: 27:02)
Again, you can see that the regression analysis retaining all the
samples is shown on the right-hand side of the plot, with the
corresponding intercept, the coupon-rate slope, the F test statistic, the
r squared value and so on shown there. Once we remove these 4
samples, which were outliers, and rerun it, the fit seems to be much
better. It is seen on the left-hand side that the fit is much better: you
can see that the r squared value has gone up to 0.99 from 0.775.
Again, the tests on the intercept and on the coupon rate, or slope,
show that they are significant, and therefore you should not assume
that they are close to 0.
It also shows that the F statistic has a low p value, which means you
reject the null hypothesis that the reduced model is adequate, which
means the linear model with the slope parameter is a much better fit.
So, all of these indicators seem to show that a linear model is adequate
and the fit seems to be good. But we should do a residual plot again
with this data, and if that shows no pattern, we can stop there and say
that there are no further outliers, and therefore conclude that the
regression model we have fitted for this data is a reasonably good one.
Next class we will see how to extend all of these ideas to multiple
linear regression, which involves many independent variables and one
dependent variable.
Thank you.
450
Data Science for Engineers
Prof. Shweta Sridhar
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 35
Simple Linear Regression Modelling
451
(Refer Slide Time: 00:39)
Now, let us see how to load the data. Now you have been given the
data set bonds in the text format, the extension of the file is dot “txt".
To load the data from the file, we use the function read.delim.
So, read.delim reads a file in table format and creates a data
frame out of it. The input arguments to the function are file and
row.names. Now, file is the name of the file from which you want to
read the data, and row.names essentially gives the row ids. It can be a
452
vector of names or a single number giving the column of the file that
contains the row names.
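A minimal sketch of the call, assuming the file is named bonds.txt and its first column holds the row ids, as described in the lecture:

    # Load the tab-delimited bonds data into a data frame
    bonds <- read.delim("bonds.txt", row.names = 1)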
453
We can also view the first few rows of any data set, head and tail
functions will help us to do that. Now, head of bonds will give us the
first 6 rows from the data and tail of bonds will give us the last 6 rows
from the data.
Now, let us look at the description of the dataset, now the data has 2
variables coupon Rate and Bid Price. Now, coupon rate refers to the
fixed interest rate that the issuer pays to lender. Bid price is the price
someone is willing to pay for the bond.
454
Now, we have seen how to load the data and how to view the data.
Let us now see what the structure of the data is; by structure I mean
each variable and its data type. We use the function str, and the input
to the function is a data frame. We want to check whether the variable
data types are the same as what we expected them to be; if not, we
need to coerce them to the respective data types.
So, this should ring a bell, because we have learned the family of
functions as. followed by the name of the data type, and we will use
these functions to coerce a variable if it is not of the desired type. For
this dataset, I run the function as str of bonds, where bonds is the
name of my data frame. The output reads: bonds is of the type data
frame, and it has 35 observations of 2 variables. The first column is
coupon rate, which is of the type numeric, and the first few values are
displayed; the next column, Bid Price, is also of the type numeric, and
the first few values of that column are displayed.
Now, let us look at the summary of the data. So, the summary
function followed by the name of the data frame in this case bonds will
give us 5 number summary and mean from the data.
Now, the first column which is coupon rate, I have the 5 number
summary and the mean and I also have the 5 number summary and the
mean for the second column. So, till now we have seen how to load the
data, how to view the data, We have also looked at the structure and
the summary.
455
(Refer Slide Time: 04:02)
Now, let us see how to visualize the data. So, to visualize the data I
use the plot function. We have covered the plot function earlier in the
visualization in r section, now the input to the plot function are
basically x and y. In this case x refers to my coupon rate and y refers to
my bid price. So, in order to access the variables, I need to give the
name of the data frame followed by a dollar symbol. So, I say access
coupon rate from the bonds data and access bid price from the bonds
data.
So, I can also give a title to my plot. So, inside the parameter main
you can specify the title of your plot, xlab is nothing, but x label. So, I
am assigning it as x “Coupon Rate" and y label I am assigning it as
“Bid Price". So, the plot is on right hand side.
So, the title is bid price versus coupon rate, just as we assigned it;
on the y axis I have bid price, labelled as bid price, and on the x axis I
have coupon rate, labelled as coupon rate.
Now, we see a linear trend, but there are some points which lie well
away from the rest of the data. Let us see if our linear model will help
us to identify these points.
456
(Refer Slide Time: 05:19)
So, to start with let us build a linear regression model. So, building
a linear model is done using the function lm. So, the inputs for the
function are formula and data, by formula I mean I am regressing
dependent variable versus the independent variable.
457
(Refer Slide Time: 07:02)
We will use a function called abline, and the input for the function
is bondsmod, which is my linear model.
We have already seen how to plot coupon rate and bid price; now,
in addition to the plot, you need to give this command. Here, ab refers
to the intercept and slope: if your equation is of the form y = a + bx,
then a is my intercept and b is my slope, and in this case a is β̂₀ and b
is β̂1. So, let us see how the plot looks. On my right I have the plot,
and we are now able to see how the regression line fits. It fits rather
poorly, and it does not by itself identify the outliers. So, we can say
that the regression line is indeed getting affected by these outliers.
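The sequence of commands, gathered in one place; the column names CouponRate and BidPrice are assumptions, so check names(bonds) against your file:

    # Build the linear model and overlay the fitted line on the scatter plot
    bondsmod <- lm(BidPrice ~ CouponRate, data = bonds)   # dependent ~ independent
    plot(bonds$CouponRate, bonds$BidPrice,
         main = "Bid Price versus Coupon Rate",
         xlab = "Coupon Rate", ylab = "Bid Price")
    abline(bondsmod)       # a = intercept, b = slope taken from the model
    summary(bondsmod)      # call, residuals, coefficients and fit statistics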
(Refer Slide Time: 07:59)
458
So, now let us take a look at the model summary. I have regressed
bid price versus coupon rate from the data bonds; you can also use the
other form of the command. summary is a function, and the input to
the summary function is the linear model; here bondsmod is the linear
model, and this is the first look at the summary. This is how it looks
when you run the command, and this is how it would look in the
console. We have 4 sections of output: we have the call, the residuals,
the coefficients, and a few summary statistics at the bottom. So now,
let us look at each of these and what they mean in depth.
Now, Call displays the formula which we have used. In this case I
have used the formula bid price versus coupon rate: regress bid price,
which is my dependent variable, on my independent variable, which is
coupon rate, from the data bonds. This is the way for you to check
whether you have given the right dependent and independent
variables. The next section is residuals. Residuals are nothing but the
difference between the observed and the predicted values; in our
equation earlier we had a parameter called εi, and the residuals
correspond to that εi. Below the residuals heading is the 5 number
summary of the residuals.
459
standard error. So, the standard error is the estimated standard
deviation associated with the slope and the intercept.
The next column is the t value. What is the t value? It is the ratio of
the estimate to its standard error, and it is the test statistic for the
hypothesis test. The column after that is the probability, the p value.
So, what is the null hypothesis? The null hypothesis is that the
corresponding coefficient is = 0. In the last line I have an F statistic
and a corresponding p-value associated with it; the F statistic is again
used to test the null hypothesis that the slope = 0.
So, in this lecture we saw how to load the data, how to plot and
visualize it, how to build a linear model and how to interpret the
results from the linear model. In the next lecture we will see how to
assess our model and whether we can improve it.
Thank you.
460
Data Science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 36
Simple Linear Regression Model Assessment
461
(Refer Slide Time: 00:40)
Now, let us start with model assessment. So, there are a few
questions which we need to answer before we go into model
assessment. After having built a model, we first need to check how
good is our linear model. Now, we need to identify which coefficients
in the linear model are significant.
Now, if you have multiple independent variables, then we also need to
identify which of them are important. We also need to know whether
we can improve the model further. As a part of this, we are going to
look at whether there are any bad measurements in the data; by bad
measurements we mean outliers in the data which could affect the
model. This question will be handled in the next lecture.
So, let us look at how to answer the first two questions.
462
Now, from the earlier lecture we saw how to look at the summary.
We also know how to interpret it now. Now, this is the first gist of
summary that you get when you run the command.
So, I had regressed BidPrice with coupon rate from the data bonds and
bondsmod was my linear model. I also have the estimates here which
are nothing but the intercept and slope. So, let us look at the first level
of model assessment.
So, the first level of model assessment is done using the R squared
value. If you go back and see, the R squared value for our model is
0.7516. This is reasonably close to 1; not very close, but still fairly
close. So, we can say that, yes, the model we have developed is
reasonably good, but not really good. It also supports the assumption
that we made to begin with, that there is a linear relationship between
x and y. We are also going to look at hypothesis testing: as a part of
this, we will look at hypothesis tests on the coefficients and then on
the full and reduced models. We will see shortly what these full and
reduced models are. So, first let us do the hypothesis test on the
coefficients.
463
(Refer Slide Time: 02:36)
464
Now, this becomes my full model. The confidence interval is
computed to check if the slope is significant. This test is a two sided
test, since we test β1 = 0 against β1 ≠ 0. At the 95 percent confidence
level, that is, at α = 5 percent, we get the critical value to be 2.0345.
Now, let us see how to compute this critical value.
We know that α = 0.05, and n is the number of observations in
your data; in this case I have 35 observations. p is the number of
independent variables, and here I have only one independent variable.
We know from the statistics module how to compute quantiles from a
t distribution.
After having done this, we get the quantile to be 2.03; this is the
critical value we are going to use to compute the confidence interval.
We saw earlier from the summary that the slope, which is nothing but
β̂1, = 3.0661 and the standard error associated with it was 0.3068. So,
the confidence interval is computed as the estimate + or - the critical
value times the standard error.
By doing so, we get the lower bound as 2.44 and the upper bound
as 3.69. So now we know that this interval does not encompass 0, that
is, nowhere in the interval do I have 0. This itself is indicative of the
fact that β̂1, that is the slope, is significant.
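The same computation in R, assuming bondsmod is the fitted model and bonds the data frame:

    # Critical value and 95 percent confidence interval for the slope
    n     <- nrow(bonds)                               # 35 observations
    tcrit <- qt(1 - 0.05/2, df = n - 2)                # about 2.0345
    slope <- coef(summary(bondsmod))[2, "Estimate"]
    se    <- coef(summary(bondsmod))[2, "Std. Error"]
    c(slope - tcrit * se, slope + tcrit * se)          # roughly (2.44, 3.69)

    # confint(bondsmod) gives the same interval without the manual steps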
465
Now, let us do a hypothesis test on the models. To do so, we use
the F statistic. Let us go back and revisit what the F statistic is: the F
statistic is nothing but the sum of squares due to regression divided by
the sum squared error divided by its degrees of freedom. For the sum
of squares due to regression, SSR, we know it is of the form of the
summation of (ŷi - y̅) the whole square; it has only one degree of
freedom, since only one extra parameter is used to compute it.
Whereas the sum squared error is the summation of (yi - ŷi)2; I am
using two parameters to compute it, so the degrees of freedom reduce
by 2, and hence I have the denominator as n - 2. This is how you
would compute these sums: my yi is nothing but bonds$BidPrice, I
have the fitted values ŷi from the model, and I know the mean y̅ of the
bid price; I take the differences, square them and sum them up. We
know that the number of observations is 35 for this data. So, from the
formula, the F statistic is computed as SSR divided by SSE,
multiplied by (n - 2), since the degrees of freedom n - 2 go to the
numerator, and we get the F statistic to be 99.87.
Now, this F statistic is what is returned by the summary, in its last
line.
So, let us see what conclusions we can draw from these two tests.
We know that the critical value of the F statistic from the table, at 1
and 33 degrees of freedom and a 5 percent significance level, is about
4.17. What we observe is 99.87 at 1 and 33 degrees of freedom, which
is much greater than the theoretical value that we get from the distribution.
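The hand computation, as a sketch in R (column name BidPrice assumed):

    # F statistic for the bonds model from its definition
    y     <- bonds$BidPrice
    y_hat <- fitted(bondsmod)
    SSE   <- sum((y - y_hat)^2)                    # 2 parameters, df = n - 2
    SSR   <- sum((y_hat - mean(y))^2)              # 1 extra parameter, df = 1
    Fstat <- (SSR / 1) / (SSE / (length(y) - 2))   # about 99.87 here
    qf(0.95, df1 = 1, df2 = length(y) - 2)         # 5 percent critical value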
466
(Refer Slide Time: 07:40)
So, what conclusion can we draw now? We know that we can
reject the null hypothesis, since the confidence interval does not
include 0 and the observed F statistic is far larger than the critical
value, and hence β̂1, which is the slope, is significant.
So, in this lecture we saw how to assess the model, we looked at
how to answer some of the important questions that arise while
assessing a model, and we saw how to identify the significant
coefficients. In the next lecture we will look at how to identify
outliers and how to improve the model.
Thank you.
467
Data Science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture-37
Simple Linear Regression Model Assessment
468
(Refer Slide Time: 00:35)
So, let us see what outliers are. Outliers are points which do not
conform to the bulk of the data. A point is considered an outlier if the
corresponding standardized residual falls outside - 2 and + 2 at the 5
percent significance level.
469
(Refer Slide Time: 01:10)
So, you can see a plot function on the left hand side. Now, for the
residual plot, I am going to plot fitted values on the x axis and the
standardized residual from the model we have built. Now, we built a
470
linear model called bondsmod and we are going to calculate the
standardized residuals for it.
So, that becomes my y. I am also giving a title like I earlier said.
The title for this plot is residual plot and my x label is nothing, but
predicted values for bid price. Similarly, my y label is standardized
residual. Now, after doing so, we need to set the confidence region. So,
let us see how to do that. We again use the same command, abline;
abline is what we used to draw the fitted line onto the plot.
Now, with the same command, we can draw the confidence region
as well. I am going to set the height, which is the argument h here, as
2, and the line type, lty, as 2. So, h is the height at which you want the
line to be drawn, and the line type is how you want the line to be
drawn; you can have dashed lines, a solid line, dash-dot and so on, so
you have several options there. Similarly, I also need a lower
confidence limit, so I am setting that to be = - 2 and, for the same line,
I am setting the line type to be = 2.
Now, let us see how the plot looks. On the right hand side I have
the plot. We can see that there are two lines drawn at + 2 and - 2 that
define the confidence region. On the y axis I have the standardized
residuals and on the x axis I have the predicted values for bid price.
From the plot, we can see that there are two outliers which are really
far outside, there is one which is close to the upper confidence limit,
and there is one which is almost exactly on the lower confidence
limit.
So, let us see how to identify these. So, from the plot, we may not
be able to tell which points are these. By points, I mean in the row IDs.
We are going to use another function called identify, that will help us
identify the indices of these samples. Now, let us see what identify
function does.
471
So, identify reads the position of the graphics pointer when the
mouse button is pressed; it then searches the coordinates given in x
and y for the point closest to the pointer. If the point is close enough
to the pointer, its index will be returned as part of the value.
Now, let us look at the syntax for it: identify is a function, and x and
y are my input parameters. What are x and y? They are the
coordinates of the points in the scatter plot. Now, let us see how to use
this function to identify the indices.
On my left, I have the same commands for the residual plot; this is
what we saw in the last slide. I now use the identify function to find
the indices. My inputs for this are the fitted values of the bonds model
and the standardized residuals. The reason I am giving the fitted
values is that I want the indices to be found on this plot, and the plot
has the fitted values from the model on one axis and the standardized
residuals from the model on the other. So, I give the same inputs that I
used for the plot command to the identify function: the fitted values
from the bonds model for x and the standardized residuals for y.
Once you execute the command, you will not get the output
immediately; what will be displayed is the snippet on the left, where
you will see a finish button and a message being displayed. On this
plot, we will need to click on and identify each of the points. Now, let
us see how to do that.
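The residual plot and the interactive call together, as a sketch; rstandard() is used here for the standardized residuals:

    # Residual plot with the +/- 2 band, then interactive identification:
    # click the suspect points, then press Esc or the Finish button
    plot(fitted(bondsmod), rstandard(bondsmod),
         main = "Residual plot",
         xlab = "Predicted values for bid price",
         ylab = "Standardized residuals")
    abline(h = 2, lty = 2)      # upper confidence limit
    abline(h = -2, lty = 2)     # lower confidence limit
    identify(fitted(bondsmod), rstandard(bondsmod))   # returns the row indices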
472
(Refer Slide Time: 05:33)
473
Now, once you have identified all the outliers, you need to click the
Finish button that is present at the top right corner of the graphical
window; you can also press Escape to finish. After terminating, the
indices are displayed on the console and on the plot. You can see on
the console that the indices displayed are 4, 13, 34 and 35, but this
gives you only the index values.
To know where your outliers lie on the plot, I am going to look at
the plot. I know the 13th sample is the farthest, which is here,
followed by the 35th sample, followed by sample 4, and then I have
one more sample here, which is the 34th sample. So, after identifying
these outliers, we are going to start by removing them one at a time
and building a new model each time. Now, let us see how to do that. I
will start by removing one point at a time.
474
The first point that I am going to remove is sample 13, that is, the
13th point, because it is the farthest in the plot. To start with, I am
going to create a new data frame called bondsnew, and it will have all
rows of bonds except the 13th row. Then I am going to create another
object called bondsmod1, which is the linear model built for the new
data: I regress bid price from the new data frame bondsnew on coupon
rate from the same data frame. Now we have built the new linear
model, which does not contain the 13th point, the outlier.
We are going to repeat the same process again, that is, on the
residual plot we are going to identify the outliers for the new data. On
my right, I already have the residual plot with the outliers identified.
From the snippet, we can see that for the new data the 4th point, the
33rd point and the 34th point are being identified as outliers. This new
data contains only 34 observations, because we have already removed
one observation, so the indices for the new data change: the farthest
point in this data is the 34th point, after that I have the 4th point, and
there is also one point on the line, which is the 33rd point for this new
data. We can also see, if you compare this plot with the earlier plot,
that this point, which is located here, was previously below the line,
and that is because we had an extreme outlier in the previous case
which had a smearing effect on the remaining points. Now, after
building this new linear model, let us take a look at the summary.
475
On the left is the summary of the old model bondsmod that contains
all the points. On the right, I have the summary of the new model,
which does not contain the 13th sample. From the R squared values of
the two models, we can see that there is a drastic change by just
removing one extreme point: from 0.7516, the R squared improves to
0.8077. That is quite a drastic change. Now, let us remove the other
points one by one and see how the R squared value changes.
476
Now, I am removing the remaining points one by one. Earlier I
started by removing the 13th point; now I am going to remove the
35th. Let us see what the R squared value is after removing the 35th
point: the R squared changes from 0.80 to 0.88, so there is quite a big
leap here as well.
After removing the 35th point, I am able to see a pretty good
change in the R squared value. Now, let us look at what the R squared
value is if I remove the fourth point: the R squared value improves
from 0.88 to 0.98, which is also a pretty good jump. Note that these
indices are for the old data. I also have one more index to remove,
which is index 34. Now, let us see what happens if we remove this.
So, on the left you can see that I have removed the 4 indices,
basically 4, 13, 34 and 35. I have removed these points from my data
and, similarly, for bid price also I have removed these points, and I am
going to fit the new model. So, bondsmod1 does not have any outliers
477
now and I am going to plot the regression line over the data. So, our
regression line fits the data pretty well, though there are some points,
which are really away, but it does not change the nature of the slope
drastically. So, this is a pretty good model and we have removed all the
possible outliers that we thought were influencing the regression line.
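A hedged R sketch of this final refit (again assuming the column names BidPrice and CouponRate, which are not given explicitly in the lecture):

outliers    <- c(4, 13, 34, 35)                     # indices of the outliers in the original bonds data
bonds_clean <- bonds[-outliers, ]                   # drop all identified outliers at once
bonds_mod1  <- lm(BidPrice ~ CouponRate, data = bonds_clean)
plot(bonds_clean$CouponRate, bonds_clean$BidPrice)  # scatter plot of the cleaned data
abline(bonds_mod1)                                  # overlay the fitted regression line
summary(bonds_mod1)$r.squared                       # check the improved R squared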
Thank you.
478
Data Science for Engineers
Prof. Shankar Narasimhan
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 38
Multiple Linear Regression
479
and so on up to βp times xp, where β1, β2, ..., βp represent the slope parameters, or the effect of the individual independent variables on the dependent variable.
480
To estimate the slope parameters β1 to βp, we actually set up the minimization of the sum of squared errors. In order to set it up in a compact manner using vectors and matrices, we define the following notation. Let us define the vector y, which consists of all the n measurements of the dependent variable, y1 to yn; we have also done one further thing: we have subtracted the mean value of all these measurements from each of the observations.
So, the first entry represents the first sample value of the dependent variable, y1, minus the mean value of y over all the measurements, ȳ. The first entry is thus the mean shifted value of the first observation; the second entry in this vector is the second sample value minus the mean value of the dependent variable, and so on for all the n observations we have.
So, these are the mean shifted values of all the n samples for the
dependent variable. Similarly we will construct a matrix x where the
first column corresponds to variable, independent variable 1. Again
what we do is take the sample value of the first independent variable
and subtract the mean value of the first independent variable. That
means, we take the mean of all these n samples for the first variable
and subtract it from each of the observations of the first independent
variable.
So, the first entry here will be x11, the sample value of the first independent variable in the first sample, minus the mean value of the first independent variable. And we do this for all n measurements of the first independent variable. Similarly
we do this for the second independent variable and arrange it in the
second column. So, this entry represents the first observation of the second independent variable minus the mean value of the second independent variable, and we do this for all p independent variables.
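In symbols, the mean-centred vector and matrix described above can be written as follows (a LaTeX reconstruction of the slide, not the lecture's own notation):

y = \begin{bmatrix} y_1 - \bar{y} \\ y_2 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{bmatrix}, \qquad
X = \begin{bmatrix} x_{11} - \bar{x}_1 & \cdots & x_{1p} - \bar{x}_p \\ \vdots & \ddots & \vdots \\ x_{n1} - \bar{x}_1 & \cdots & x_{np} - \bar{x}_p \end{bmatrix},

so that the model becomes y = X\beta + \varepsilon with \beta = (\beta_1, \ldots, \beta_p)^T.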
481
Notice that we have not included β0. We have eliminated it indirectly by doing this mean subtraction; I will show you how that happens. But you can take it that right now we are only interested in the slope parameters: this linear model only involves the slope parameters β1 to βp and does not involve the β0 parameter, because that has been effectively removed from the linear model using this mean subtraction idea.
When we write the covariance matrix of the errors as σ² times the identity matrix in this form, it means that all the epsilons, ε1 to εn, have the same variance σ² (the homoscedastic assumption). We also assume that ε1 and ε2 are uncorrelated, or more generally that εi and εj are uncorrelated if i is not equal to j, in which case we can write the covariance matrix of ε as σ²I.
482
(Refer Slide Time: 08:54)
And this is the solution that minimizes the sum of squared errors, the objective function that we have written. So, once we get β̂, the slope parameters, β0 can be estimated as the mean value of y minus the mean vector x̄ transpose times the slope parameter vector. Notice that this is very similar to what we have in the univariate case, where the β0 estimate is nothing but ȳ - x̄ times β̂1. So, it is very similar, as you can see.
You can also compare the solution for the slope parameters with the univariate case, which says β̂1 is SXY divided by SXX. Notice that Xᵀy represents SXY and XᵀX represents SXX. In the univariate case you were dividing SXY by SXX; in the multivariate case, division is represented by an inverse, so you get (XᵀX)⁻¹ times Xᵀy. You can see it is very similar to the solution for the univariate case, except that these are matrices and vectors and therefore you have to be careful: you cannot simply divide, you multiply by the inverse of the matrix. That is the solution for β̂, the slope parameters.
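Collecting the formulas referred to above in one place (a LaTeX reconstruction of the slide):

\hat{\beta} = (X^{T}X)^{-1} X^{T} y, \qquad \hat{\beta}_0 = \bar{y} - \bar{x}^{T}\hat{\beta},

which parallels the univariate solution \hat{\beta}_1 = S_{XY}/S_{XX} and \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.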
Now, again you can go back and look at the univariate case. There the variance of the β1 slope parameter was σ²/SXX; in this case it is σ²(XᵀX)⁻¹, and XᵀX represents SXX. Here σ² is the variance of the error corrupting the dependent variable. We may sometimes have a priori (Refer Time: 13:35) knowledge of it, but in most cases we may not know this value of σ²; we may not be given it. So, we have to estimate σ² from the data, and we will show how to get this.
There are two properties that we can actually show. The first says that the estimates of the slope parameters are unbiased; that is, β̂ is an unbiased estimator of the true value β. Moreover, β̂ is a linear function of y: notice that (XᵀX)⁻¹Xᵀ is nothing but a matrix which multiplies the measurements y, so β̂ can be interpreted as a linear combination of the measurements. Therefore, it is known as a linear estimator.

Among all such linear estimators we can show that β̂ has the least variance. Therefore, it is called a BLUE estimator, that is, the best linear unbiased estimator, which is what BLUE stands for; best in the sense of having the least variance.
484
(Refer Slide Time: 14:52)
Now, as I said, we can also estimate σ² from the data. After you fit the linear model, you take the predicted value for the ith sample from the linear model and compute the residual yi - ŷi, which is the measured value minus the predicted value for the ith sample; you square it, take the sum over all n samples, and divide by n - p - 1. Again, if you go back to the univariate case you will find that the denominator is n - 2. Here you have n - p - 1 because you are fitting p + 1 parameters: p slope parameters plus 1 offset parameter.
So, you can see a one to one similarity between the univariate regression problem and the multiple linear regression problem in every derivation that we have given here. Now, once we have estimated σ̂², the variance of the error, from the data, you can go back and construct confidence intervals for each slope parameter. We can show that the true slope parameter lies in this confidence interval for any confidence level 1 - α you may choose, where α represents the level of significance.

So, if you say α = 0.05, 1 - α represents 0.95, so that will be a 95 percent confidence interval. Correspondingly I will find the critical value from the t distribution with n - p - 1 degrees of freedom; this is the upper critical value, where the probability, the area under the curve beyond the value, is α/2. So, n - p - 1 represents the degrees of freedom; notice that in the univariate case it would have been n - 2, very similar.
So, the confidence interval for βj for any given α can be computed using this particular formula. The term se(β̂j) represents the standard deviation of the estimate β̂j, and that is given by the jth diagonal element of this quantity, with σ² replaced by its estimate.
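For reference, the estimate of the error variance and the confidence interval described above can be written as (a LaTeX reconstruction, consistent with the surrounding discussion):

\hat{\sigma}^{2} = \frac{1}{n-p-1}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}, \qquad
\hat{\beta}_j \;\pm\; t_{\alpha/2,\,n-p-1}\, se(\hat{\beta}_j), \qquad
se(\hat{\beta}_j) = \sqrt{\hat{\sigma}^{2}\left[(X^{T}X)^{-1}\right]_{jj}}.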
Now, we can also compute the correlation between y and ŷ, which tells you whether the predicted value from the linear model closely resembles the measured value. Typically we will plot the measured value y against the predicted value and see whether these points fall on the 45 degree line; if they do, then we think that the fit is good. Another way of doing this is to find the correlation coefficient between y and ŷ, which is simply the standard expression: (yi - ȳ) multiplied by (ŷi minus the mean of ŷ), summed over all samples, divided by the standard deviation of y and the standard deviation of ŷ, which is for normalization.
On the other hand, if the fit is not improved by any of the x's, then the numerator will be almost equal to the denominator and therefore this quantity will be close to 0. So, a value of R squared close to 1, as before, is an indication of a good linear fit, whereas a value close to 0 indicates the fit is not good. We can also compute the adjusted R squared to account for the degrees of freedom: notice that the numerator has n - p - 1 degrees of freedom whereas the denominator has n - 1 degrees of freedom, therefore we can compute an adjusted R squared which divides the SSE by the appropriate degrees of freedom.

We can say the numerator is the error per degree of freedom in the fit, whereas the denominator represents the error per degree of freedom when we have fitted only the offset parameter, where there are n - 1 degrees of freedom. So, this kind of quantity is also a good indicator; instead of using R squared we can use the adjusted value of R squared. These are all very similar again to the univariate linear regression problem.
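Written out (a LaTeX reconstruction; SST = \sum_i (y_i - \bar{y})^2 is the total sum of squares obtained when only the offset is fitted):

R^{2} = 1 - \frac{SSE}{SST}, \qquad
R^{2}_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}.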
487
So, we can check R squared and see whether the value is close to one; if it is, we can say maybe a linear model is good to fit the data, but that is not a confirmatory test. We have to do the residual plot as we did in univariate linear regression, and that is what we are going to do further. We are also going to find whether the fitted model is adequate or whether it can be reduced further. What this reduced
further means we will explain. In the univariate case there is only one
independent variable, but here there are several independent variables.
Maybe not all independent variables have an effect on y. Some of the
independent variables may be irrelevant. So, one way of trying to find
whether a particular independent variable has an effect is to test the
corresponding coefficient.
488
We will consider a specific case here where we do the F test statistic
for the case when we have a reduced model and compare it with the
full model. The reduced model we will consider has k parameters; specifically, let us consider the reduced model with only one parameter, which means that we have only the constant intercept parameter and we will not include any of the independent variables. And compare it with
the full model which contains all of the independent variables
including the intercept.
So, the reduced model is one which contains only the offset parameter and no independent variables; the full model is the case where we consider all the independent variables and the intercept parameter.
So, the number of parameters we are estimating in the reduced model
is only 1, so k = 1 and the full model is the case where we have all the
independent variables p independent variables. So, we are estimating p
+ 1 parameters in the full model.
So, what we do is perform a fit and compute the sum squared errors
which is nothing but the difference between y the measured value and
the predicted value. So, we will first take the model containing only the offset or intercept parameter and estimate it; in this case, of course, ȳ will be the best estimate. And we will compute the sum of squared errors, which is essentially the variance of the measurements of the dependent variable. Then we will also perform a linear regression containing all the independent variables, and in this case we compute the difference between y and the predicted y and take the sum of squared errors; that is the SSE of the full model.
So, the difference in the fit, which is the difference in the sum of squared errors between the reduced model fit and the full model fit, is the numerator, divided by what we call the degrees of freedom. Notice the full model has p + 1 parameters (p independent variables plus the offset) and the reduced model in this particular case contains only 1 parameter, so k = 1.

So, the degrees of freedom will be p, and you divide this difference in the sum of squared errors by p. The denominator is the sum of squared errors of the full model, which has n - p - 1 degrees of freedom, because p + 1 parameters have been fitted and therefore the degrees of freedom is the total number of measurements minus p minus 1. So, we divide the sum of squared errors in the denominator by its number of degrees of freedom and then take the ratio as defined; that is your F statistic.
Then we compare the test statistic with the critical value, and if the test statistic exceeds the critical value at this level of significance, then we reject the null hypothesis. That is, we will say the full model is the better choice and the independent variables do make a difference. And this is a standard thing that the R function will provide: this particular comparison between the reduced model, which has no independent variables, and the full model, which contains all the independent variables, in multiple linear regression.
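As a hedged sketch of how this comparison looks in R (the data frame df and the variable names y, x1, x2, x3 are purely illustrative, not from the lecture):

full    <- lm(y ~ x1 + x2 + x3, data = df)   # full model with all independent variables
reduced <- lm(y ~ 1, data = df)              # reduced (constant / intercept-only) model
anova(reduced, full)                         # F test comparing the two models

The same F statistic and its p value are also reported in the last line of summary(full).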
490
(Refer Slide Time: 27:51)
So, in this case we have what is called the price data, where customers are asked to rate the food and the other aesthetics of a particular restaurant, and the cost of the particular dinner is also obtained (Refer Time: 28:15) for these restaurants, along with the location of these restaurants, whether they are on the east side or the west side of a particular street in New York. Typically the New York west side is probably a little poorer, whereas the east side is probably a little richer neighbourhood.
491
(Refer Slide Time: 29:17)
The third one is the scatter plot between price and service and the
last one is price versus location. And similarly you can actually
develop a scatter plot between food and decor which is here or food
and service and so on.
Even though we consider all these variables that we have obtained, like food, decor, service and location, as independent, it is possible that when we select these variables they are not truly independent; there might be interdependencies between the so called independent variables, which can give rise to problems in regression, what we call the effect of collinearity, as we will see later. But a scatter plot may reveal some of these interdependencies between the so called independent variables.
So, for example, if we look at the scatter plot between food and decor, it seems to be completely randomly distributed; there does not seem to be any clear correlation. However, food and service seem to be very strongly correlated; there seems to be a linear relationship between food and service.
So, perhaps you do not need to include both these variables; we will see later that this is true. But even just the scatter plot itself reveals some interesting features, and so we will now go ahead and say that a linear model between price and food, decor and so on seems to be indicated by these scatter plots; let us go ahead and build one.
It also gives you the standard error for each coefficient as well as
the offset parameter which is nothing but the σ value for the estimated
quantities and it also gives you the probability values p values as we
call them. And notice that if the p value is very low, it means that this coefficient is significant, that is, we cannot take its value to be 0; any low value of this indicates that the corresponding coefficient is significantly different from 0.
So, in this case the first three have very low p values and therefore they are significant, but service has a high p value, which seems to indicate that this coefficient is insignificant, that is, almost equal to 0. If you look at east, which is the location parameter, that does not have a very low p value, but it is still not bad, about 0.03, and therefore it is insignificant only if you take a level of significance of 0.025 or something like that. If you take 0.1 or 0.05 and so on, you will still consider this east coefficient to be significant, and that is what the single star is basically pointing out.
So, now we will go ahead and look at the F value also. The F statistic says that the full model, as compared to the reduced model using only the intercept, is actually significant. Which means the constant model is not good, and including these variables results in a better fit or explanation of the price, and therefore you should actually include them. Whether you should include all of them or only some of them, we can do different kinds of tests to find that.
What we have done in this particular case is only compare the model without any of these independent variables, which is called the constant model, with the model having all of these variables included; that is the only two-model comparison we have made. The reduced model is one containing only the intercept and the full model is one which contains the intercept and all four independent variables, and R has given the p value for the corresponding F statistic.

Next, we have included only food, decor and east and done the regression again, and it turns out that the R squared value is not improved significantly, but also not reduced, the F value is significant, and we get more or less the same coefficients for the other parameters, the intercept and the slope parameters.
It indicates that x3, that is service, is not adding any value to the prediction of y. The reason for this, as we said, is that if you look at the scatter plot, service and food are very strongly correlated; therefore only one of food or service needs to be included in order to explain price, not both. In this case service is being removed, but you can try removing food instead and fit a model between price and decor, service and east, and you will find that the regression is as good as retaining food and eliminating service.
495
(Refer Slide Time: 36:00)
The last thing is that we have these outliers; if you want to improve the fit, you may want to remove, let us say, the outlier which is farthest away from the boundary.

For example, you may want to remove observation 56, redo the multiple linear regression, and repeat this until there are no outliers. That will improve the R squared value and the quality of the fit a little more.
So, we have not done that; we leave it as an exercise for you. What we have seen is that whatever was valid for the univariate regression can be extended to multiple linear regression, except that scalars there get replaced by the corresponding vectors and matrices: what was a variance there becomes a variance-covariance matrix here, and what was a scalar mean there becomes a mean vector here.

So, you will see a one to one correspondence; the residual plot, the interpretation of the confidence interval for β, the F statistic, all of this is more or less similar. Except that, understand, in multiple linear regression there are several independent variables, and all of them may not be relevant.

We may be able to take only a subset, and I will actually handle subset selection as a separate lecture. For the time being we have just done a significance test on the coefficients in order to identify the irrelevant independent variables, but there are better approaches and we will take them up in the following lectures.
497
Data science for Engineers
Prof. Shankar Narasimhan
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 39
Cross Validation
498
modeling. Similarly, if you are building a non-linear regression model.
For example, let us take a polynomial regression model where the
dependent variable y is written as a polynomial function of the
independent variable x. Let us assume there is only one variable
x. You can write the regression model, a non-linear regression model, as y = β0 + β1x, which is the linear part. You can also include non-linear terms such as β2x² and so on up to βnxⁿ, where x², x³ and so on are higher powers of x.
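In equation form (a LaTeX reconstruction of the model just described):

y = \beta_0 + \beta_1 x + \beta_2 x^{2} + \cdots + \beta_n x^{n} + \varepsilon.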
So, you can always drive the mean squared error on the training data to 0 by choosing a sufficient number of parameters in the model. Therefore, you cannot use the training data set in order to find out the optimal number of parameters to use in the model. So, we do something called cross validation. This has to be done on a different data set that is not used in the training. We call this the validation data
set and we use the validation data set in order to decide the optimal
number of parameters or meta parameters of this model. So, we will
use this polynomial regression model as an example throughout in
order to illustrate this idea of cross validation.
So, overfitting basically means you are using more parameters than necessary to explain the data. On the other hand, if you use too few parameters, your model is not going to be that accurate. Typically, there are two measures for determining the quality of the model. One is called the bias in your prediction error, and if you increase the number of parameters of the model, the square of this bias term will start decreasing.

However, the variability in your model predictions will start increasing as you increase the model complexity or number of parameters. So, it is basically the trade-off between these two that gives rise to a minimum value of the MSE on the validation set.

That is what you are looking for. We want the optimal trade-off between the bias, which keeps reducing as you increase the number of parameters, and the variance, which keeps increasing as the number of parameters in the model increases. This is what we are going to find out by cross validation.
So, if you have a large data set, then you can always divide this data
set into 2 parts; one used for training typically 70 percent of the data
samples you will use for training and the remaining, you will set apart
for the validation. So, let us call the samples that you use for training as
x1, y1, x2, y2 and so on where x represents the independent variable, y is
the dependent variable. And the validation set we will denote by the symbols x0i and y0i, where there are nt observations in the validation set. So, typically, as I said, if you have a large number of samples, you
can set apart 70 percent of the samples for training and the remaining
30 percent, we can use for validation. Now, you can always of course,
define the mean squared error in the training set after building the
model.
So, this is nothing but the difference between the measured or observed value of the dependent variable and the predicted value after you have estimated the parameters, let us say using least squares regression. So, this is the prediction error on the training data, squared over all the samples and averaged; that is the mean squared error that we have seen before.
You can do a similar thing for the validation set also. You can take the difference between the measured or observed value in the validation set and the predicted value for the validation samples, square this difference, sum it over all the validation samples and take the average.
So, that is called the mean squared error on the test or validation set. This particular term, the MSE on training, is, as I said, not useful for the purpose of deciding on the optimal number of parameters of the model; however, the test MSE, that is, the mean squared error on the validation data set, is the one that we are going to use for finding the optimal number of parameters of the model.
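A minimal R sketch of this hold-out procedure, applied to the automobile example discussed next; the data frame auto and its columns mpg and horsepower are assumed names, not taken from the lecture:

set.seed(1)
n       <- nrow(auto)
train   <- sample(n, round(0.7 * n))              # 70 percent of the samples for training
val_mse <- sapply(1:5, function(d) {
  fit  <- lm(mpg ~ poly(horsepower, d), data = auto, subset = train)
  pred <- predict(fit, newdata = auto[-train, ])  # predict on the 30 percent hold-out set
  mean((auto$mpg[-train] - pred)^2)               # validation MSE for polynomial degree d
})
which.min(val_mse)                                # degree with the smallest validation MSE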
502
using what is called K fold cross validation and bootstrapping and so
on.
So, this is what we will do if we do not have enough data. We will first look at the case where we have a sufficient number of samples. So, essentially we have n samples, and you can divide them into a training set consisting of nt samples while the remaining samples you use for validation; this is the hold-out sample, as we call it.

So, you build the model using only the training set, and after you have built the model you test it, finding the mean squared error on the validation set; you do this for every choice of the parameter. For example, if you have a polynomial model, you will first see whether a linear model is good, then you will check a quadratic model and a cubic model and so on. You keep increasing the degree of the
polynomial and for each case, build the model using the training set
and see how the MSE of that particular model is on the validation set
and plot this MSE on the validation set as a function of the degree of
the polynomial.
So, this is what we are going to do for one of the examples, and then see how we pick the optimal number of parameters. Here is a case example of the mileage of some automobiles and the horsepower of the engine.

So, essentially this particular data set contains 300 data points. Actually this is sufficiently large, but we are going to assume it is not large enough and use the validation and cross validation approaches on this data set. So, we have 300 or more data points of different types of automobiles, for which the horsepower and the mileage are given. We are going to fit a polynomial or a non-linear
model between mileage and horsepower, yeah of course we can also
try a linear model, but a polynomial model means you can also try
quadratic and cubic models and so on, so forth. So, this is what we are
going to illustrate.
So, you can choose the optimal degree of the polynomial to fit; in this particular case it is 2, that is, a quadratic model fits this data very well. That is what you conclude from this particular cross validation mechanism. Of course, on the right hand side we have shown plots for different choices of the training set. For example, if you choose 200 data points out of this randomly and perform polynomial regression for different polynomial degrees and plot the MSE, you will get one curve, let us say the yellow curve you see here.
Similarly, if you take another random set and do it, you will get another curve. So, these different curves correspond to different random samples taken from these 300 points as training, with the remaining used for testing. You can see that as you increase the degree of the polynomial, the variability or range of the estimates becomes very large.

So, it indicates that if you overfit, you will get very high variability, whereas on the other hand, if you choose the order of the polynomial as 1 or 2, you will find that the variability is comparatively not that significant.
So, typically if you over fit the model, you will find high variability in
your estimates that you obtain or the mean squared error values that
you obtain. All this is good if you have a large data set.
504
(Refer Slide Time: 14:10)
What happens when you have an extremely small data set and you cannot divide it into training and validation sets? You do not have sufficient samples for training. Typically, you need a reasonable number of samples in the training set to build the model. Therefore, in this case we cannot set it apart or divide it into what I call a 70-30 division, and therefore you have to use some other strategies. These strategies are what are called cross validation, using k fold cross validation or a bootstrap; that we will see.

Here again, we will predict the performance of the model on the validation set, but the validation set is not kept strictly separate from the training set; rather, it is drawn from the training set, and we will see how we do this. So, these methods such as k fold cross validation are useful when we have very few samples for training.
505
(Refer Slide Time: 15:08)
So, I will first start with leave one out cross validation, or what is called LOOCV. In this case we have n samples; as I said, n is not very large, maybe 20 or 30 samples that we have for training. So, what we will
do is you first leave out the first sample and use the remaining samples
for building your model. That means, you will use samples 2, 3, 4 up to
n to build your model and once you have built the model you test the
performance of the model or predict for this sample that you have left
out and you will get an MSE for the sample.
And similarly, in the next round what you do is, leave out the
second sample and choose all the remaining for training and then use
that model for predicting on the sample that is left out. So, every time, you build a model by leaving out one sample from this list of n samples and predict the performance of the model on the left out sample.
This you will do for every choice of the model parameter. For example, if you are building a non-linear regression model, you will build a regression model using, let us say, only β0 and β1, the linear term, and compute the MSE on this left out sample. You will also build a quadratic model using the same training set and predict on it, and so on, so forth, so that you get an MSE for the left out sample for all choices of the parameters that you want to try out.
Do this with the second sample being left out, then the third sample being left out, and so on in turn, so that every sample has a chance of being in the validation set and also of being part of the training set in the other cases. So, once you have done this for a particular choice of the model parameters, let us say you are building a linear model, you find out the sum of squared prediction errors on the left out samples over all the samples. For example, you would have got an MSE when the first sample, the second sample and the third sample were left out; you accumulate these and take the average of all of them. This you do repeatedly for every choice of the parameters in the model, for example the linear, the quadratic, the cubic and so on, and you get the mean squared error, or cross validation error, for different values of the parameters, which you can plot.
Here again we have shown the mean squared error for different choices of the degree of the polynomial for the same data set. In this case, we have used the Leave One Out Cross Validation strategy. That means, since we have 300 samples, we left out one sample, built the model using 299 samples, predicted on the sample that was left out, did this in turn, and averaged over all of this for every choice of the parameter, the degree of the polynomial, and plotted the mean squared error. Again we see that the MSE from leave one out cross validation reaches more or less its minimum for a degree of the polynomial equal to 2, after which it remains more or less flat. So, the optimum in this case also indicates that a second order polynomial is best for this particular example.
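A hedged R sketch of LOOCV for the polynomial example (same assumed auto, mpg and horsepower names as before; for models fitted with glm, cv.glm in the boot package automates the same computation):

degrees <- 1:5
loocv   <- sapply(degrees, function(d) {
  errs <- sapply(1:nrow(auto), function(i) {
    fit <- lm(mpg ~ poly(horsepower, d), data = auto[-i, ])  # leave sample i out
    (auto$mpg[i] - predict(fit, newdata = auto[i, ]))^2      # squared error on the left-out sample
  })
  mean(errs)                                                 # LOOCV error for degree d
})
which.min(loocv)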
507
(Refer Slide Time: 18:24)
508
(Refer Slide Time: 19:27)
We can also do what is called k fold cross validation, where instead of leaving one sample out, we first divide the entire training set into k folds or k groups. Let us say the first group contains the first 4 data samples, the second group contains the next 4, and so on, so forth, until we have divided the entire n samples into k groups. Now, instead of leaving one out, we will leave one group out. For
example, in the first round we will leave out the first 4 samples belonging to group 1, use the remaining samples, and build a model for whatever choice of the parameters we are considering; let us say we are building a linear model.
We will use the remaining groups, build the linear model and then
predict for the set of samples in group 1 that was left out and compute
the MSE for this group. Similarly in the next round, what we will do is
leave group 2 out, build the model let us say the linear model that we
are building with the remaining groups and then find the prediction
error for group 2 and so on, so forth, until we find the prediction error
for group k and then we average over all these groups.
So, we sum the MSE over all the groups, where there are k groups, and multiply by 1/k; that will be the cross validation error for this k fold cross validation. Now, you have to repeat this for every choice of the parameter: having done this for the linear model, you have to do it for the quadratic model, the cubic model and so on, so forth, and then you can plot this cross validation error for this k fold cross validation.
509
Notice that if k = n, you are essentially going back to Leave One
Out Cross Validation. In practice you can choose the number of groups to be either 5 or 10 and do a 10 fold or 5 fold cross validation. This is obviously less expensive computationally as compared to Leave One Out Cross Validation: as you can see, leave one out cross validation will do the model fitting n times for every choice of the parameter, whereas k fold cross validation will do the model building k times for every choice of the parameter.
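A hedged R sketch of 10 fold cross validation under the same assumptions (auto, mpg and horsepower are illustrative names):

K     <- 10
folds <- sample(rep(1:K, length.out = nrow(auto)))                    # assign each row to one of K folds
cv_k  <- sapply(1:5, function(d) {
  mean(sapply(1:K, function(k) {
    fit  <- lm(mpg ~ poly(horsepower, d), data = auto[folds != k, ])  # train on the other K-1 folds
    pred <- predict(fit, newdata = auto[folds == k, ])                # predict on the held-out fold
    mean((auto$mpg[folds == k] - pred)^2)                             # MSE on that fold
  }))                                                                 # average over the K folds
})
which.min(cv_k)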
Again we have illustrated this k fold cross validation for the auto mileage data; again we plot the MSE for different degrees of the polynomial. We have used 10 fold cross validation and we are plotting this error.

And we see that here also the minimal error occurs at degree 2, showing that a quadratic model is probably best for this particular data, after which the error essentially flattens out. So, cross validation is an important method or approach for finding the optimal number of
parameters of a model. This happens in clustering, this will happen in
non-linear model fitting and principal component analysis and so on
and it is useful. Later on, you will see in the clustering lectures, the use
of cross validation for determining the optimal number of clusters.
Thank you.
511
Data science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 40
Multiple Linear Regression Model Building and Selection
512
(Refer Slide Time: 00:57)
513
So, let us start by loading the data. The data set 'nyc' is given to you in a "csv" format, and to load the dataset we are going to use the function read dot csv.

The inputs for the function read dot csv are similar to what we saw in the previous lecture for read dot delim. read dot csv reads the file in table format and creates a data frame from it. So, the syntax is read dot csv, and the inputs to the function are file and row names: file is the name of the file from which you want to read the data, and row names is a vector giving the actual row names, which could also be a single number.
514
So, let us see how to load the data now. Assuming 'nyc.csv' is in your current working directory, the command is read dot csv followed by the name of the file in double quotes. Once this command is executed, it will create an object nyc, which is a data frame. Now, let us see how to view the data.
Now, View of nyc will display the data frame in a tabular format. There is a small snippet below which shows you how the output looks. So, I have price, food, decor, service and east as the 5 variables. Suppose your data is really huge and you do not want to view the entire data; then we can use the head or tail function. head will give you the first 6 rows of a data frame and tail will give you the last 6 rows of the data frame.
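Putting these commands together (a sketch following the lecture's description; by default head and tail show 6 rows):

nyc <- read.csv("nyc.csv")   # read the csv file and create the data frame nyc
View(nyc)                    # display the data frame in a tabular format
head(nyc)                    # first 6 rows
tail(nyc)                    # last 6 rows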
So, now let us look at the description of the data set. We have already loaded and viewed it, but we do not yet know what it describes.
515
(Refer Slide Time: 02:57)
So, the data is about menu pricing in restaurants of New York City. So,
y which is my dependent variable is the price of the dinner, there are 4
other independent variables. So, I have food, which is one of the independent variables; it is the customer rating of the food. Then I have decor, which is the customer rating of the decor; then service, which is the customer rating of the service; and east.
So, east is whether the restaurant is located on the east or west side
of the city. So, now, our objective is to build a linear model with y
which is price and with all the other 4 independent variables. Before we go on building a model, let us see if our data exhibits some interdependency between the variables. To do that, I am going to use a "pair wise scatter plot", using the same function plot which we have used earlier.
Now, since I have multiple variables, I am going to give the data frame as my input and I am just giving a heading of pair wise scatter plot.
516
(Refer Slide Time: 04:06)
On my right is the output you will get. We can see that all the variables are mentioned along the diagonal. So, when one moves from left to right, the variable on the left will be on the y axis and the variables above or below will be on the x axis. Let us take the first row for instance. I have price on the left, so price is on the y axis, and I have food below, so food becomes the x axis.
Now, this is the plot for price versus food; similarly I have price versus decor, price versus service and price versus east. Moving to the next row, food is on the y axis. If you take food versus decor, the data is randomly scattered, so it does not show any correlation pattern, whereas for food versus service you see strong patterns being exhibited. So, let us see what the correlation is for all of these. Correlation is a function, and Professor Shankar has told you how it is computed.
So, cor is the function in R; I need to give it the dataset with all the variables. The function round tells you to how many decimal places you want to round off the number: if I give round with my correlation function as the input and say 3, it means round off the numbers to 3 decimal places. So, let us see how to interpret the output.
So, the correlation for price versus price will always be 1.
So, let us look at food and decor: the correlation between food and decor is 0.5, which is pretty low, whereas if you look at food and service it is almost equal to 0.8, which is quite high. So, we can see that food and service are correlated, and one of them can be dropped while building the final model. As we go along, let us see which of the two we have to drop.
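The two commands just described look like this in R (a sketch consistent with the lecture; plot on a data frame produces the pairwise scatter plots):

plot(nyc, main = "Pairwise scatter plot")   # pairwise scatter plots of all variables
round(cor(nyc), 3)                          # correlation matrix rounded to 3 decimal places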
517
(Refer Slide Time: 06:25)
So, like I said earlier, I have only one dependent variable here, which is denoted by y. I have several independent variables, which are denoted by xi, where i ranges from 1 to p and p is the total number of independent variables. Now let us see how to write this equation with multiple independent variables. Again I have ŷ, which is the predicted value; I have β̂0, which is the intercept; then I have β̂1x1 + β̂2x2 and so on and so forth up to β̂pxp. So, β̂0 is the intercept and β̂1, β̂2 etcetera are the slopes.
So, ε is the error. If you recall from your earlier lectures on OLS, the assumption is that error is present only in the measurement of the dependent variable and not in the independent variables. So, the independent variables are free of errors, whereas there is always some error present in the measurement of y. This ε is an unknown quantity which has 0 mean and some variance, and for any ith observation this is how my equation is written.
So, now, let us go and build a model. The function to build a multiple linear model is the same as what we used in the univariate case: here also I am going to use lm, and again there are 2 input parameters, formula and data. The formula is slightly different compared to the univariate case: I have my dependent variable, then a tilde sign, and however many independent variables I have, I separate them with a + sign. Say, for instance, I have 2 independent variables in my data; I am regressing the dependent variable on 2 independent variables, so the 2 independent variables have to be separated by a + sign.
So, now, let us see how to do it for our data nyc. Again I have lm: I am regressing price with all the 4 input variables, which are food, decor, service and east, and I am taking these variables from the data nyc. So, you can separate the independent variables by a + sign. Alternatively, if you want to regress price with all the 4 inputs, there is another way to write the same command: I say price, then a tilde sign, and then a dot. This means regress price with all the input variables from the data nyc. So, if you are going to give all the input variables for regression then you can go with this, but if you have a subset of variables that you want to build a model with, then you can specify the variables separated by a + sign. So, just to reiterate, this is the form of my equation. Now, let us go and see how to interpret the summary. After having built this model, I am going to look at its summary.
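In code, the two equivalent ways of fitting the full model described above are (the model name nycmod is an assumption for illustration):

nycmod <- lm(price ~ food + decor + service + east, data = nyc)  # list the predictors explicitly
nycmod <- lm(price ~ ., data = nyc)                              # '.' means all remaining columns
summary(nycmod)                                                  # coefficients, p values, R squared, F statistic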
The snippet shown here gives you a look at the summary. If you recall, in the first lecture on simple linear regression we looked at what each of these lines means in depth. We have the formula in the first line, then the residuals and their 5 point summary, and then we look at the coefficients. So, here we have the intercept, food, decor, service and east, and these are the coefficients for these variables.
520
which is low, though not as low as that of food and decor; it is still OK, and there is only one significance star, which tells you that if I take a significance level of, say, 0.025 or 0.01 then I can reject this term, but until then I can keep it.
So, let us look at the r squared value. The r squared value is 0.628, the adjusted r squared is 0.619, and the F statistic value is really high, 68.76. This tells you that compared to the reduced model, which has only the intercept, my full model is performing better and I should retain it. Now that we know service is not significant, let us build a new model dropping service.
So, I have dropped service, built a new model, and I am calling it nycmod_2. Let us jump to the coefficient section. The estimates are not drastically different before and after removing the service variable, which tells you that service is not very important. Again, if you look at the p values, they tell you that these variables are very significant, and then look at the r squared value further down.

The r squared value before and after removing service has not changed much; this itself is an indicator that service is not helping us in explaining the variation in price. The adjusted r squared has changed a bit, and that is only because we have removed one variable and the degrees of freedom change. The F statistic again is really high, telling you that the full model with food, decor and east is performing better compared to the reduced model with only the intercept.
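A one-line sketch of this reduced model (following the variable names given in the lecture):

nycmod_2 <- lm(price ~ food + decor + east, data = nyc)   # drop service, keep the rest
summary(nycmod_2)                                         # compare R squared with the full model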
521
(Refer Slide Time: 13:15)
If you recall from the scatter plot, we saw there was a high correlation between food and service. We have now built a model dropping service; let us instead retain service and build a model dropping food. So, I have dropped food from here. Let us take a look at this summary.

Although the p values tell you that all the variables are significant, if you look at the r squared value it has dropped from 0.628 to 0.588, which is a big decrease, and even the adjusted r squared has decreased. So, this tells you that service is less important, and food explains the price much better than service does.
So, the r squared values and the scatter plots tell us to go ahead with this linear model, but we still need to verify the assumptions we make on the errors using residual analysis. This task we are going to leave to you as an exercise; you can do it and verify these assumptions.
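One way to start that residual-analysis exercise (a sketch, not the lecture's own code):

plot(nycmod_2$fitted.values, nycmod_2$residuals,
     xlab = "Fitted values", ylab = "Residuals")        # look for patterns or outliers
abline(h = 0)
qqnorm(nycmod_2$residuals); qqline(nycmod_2$residuals)  # check normality of the errors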
Thank you.
523
Data science for Engineers
Prof. Ragunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture – 41
Classification
524
So, let us look at what classification means. We have described this before: I am given data that is labelled with different classes; that is what I start with. Then I try to develop an algorithm which will be able to distinguish these classes. And how do you know that this algorithm distinguishes the classes? Whenever a new data point comes, if I send it through my algorithm and if I originally had K different labels, the algorithm should label the new point into one of the K groups.
But now if I get a new point like this, then I would like the classifier to say that this point most likely belongs to the class which is denoted by the star points here, and that is what would happen; similarly, if I have a point here, I would like it to be classified to this class, and so on. Now, we can think of two types of
problems. The simpler type of classification problem is what is called
the binary classification problem. Basically binary classification
problems are where there are 2 classes, yes and no.
So, examples are for example, if you have data for a process or an
equipment and then you might want to simply classify this data as
belonging to normal behaviour of the equipment or abnormal
behaviour of the equipment. So, that is just binary. So, if I get a new
data I want to say from this data if the equipment is working properly
or it is not working properly. Another example would be if let us say
you get some tissue sample and you characterize that sample through
certain means and then using those characteristics you want to classify
this as a cancerous or a non cancerous sample. So, that is another
binary classification example.
525
So, if I have annotated data where the data is labelled as being normal, or as having been collected when fault F1 occurred, when fault F2 occurred, when fault F3 occurred and so on, then whenever a new data point comes in we could label it as normal, in which case we do not have to do anything, or we could label it as one of these fault classes, in which case we could take appropriate action based on what fault class it belongs to. Now, from a classification viewpoint, the
organized.
So, if we were to draw a hyper plane here and then say this is all
class 1 and this is class 2 then these points are classified correctly,
these points are classified correctly and these points are poorly
classified or misclassified. Now, if I were to come out similarly with a
hyper plane like this you will see similar arguments where these are
points that will be poorly classified.
So, whatever you do if you try to come up with something like this
then, these are points that would be misclassified. So, there is no way
in which I can simply generate a hyper plane that would classify this
data into 2 regions. However, this does not mean this is not a solvable
problem it only means that this problem is not linearly separable or
what I have called here as linearly non separable problems.
So, you need to come up with, not a hyper plane, but, very simply in layman terms, curved surfaces. Here, for example, if you were able to generate a curve like this, then you could use it as a decision function and say that on one side of the curve I have data points belonging to class 2 and on the other side I have data points belonging to class 1.

So, in general, when we look at classification problems we look at whether they are linearly separable or not, and from a data science viewpoint the non-separable problem becomes a lot harder, because when we look at the decision function in a linearly separable problem we know the functional form, it is a hyper plane, and we are simply going to look for that in the binary classification case.
So, when I have something like this, let us say this is class 1, this is class 2 and this is class 3, what I could possibly do is the following. I could draw a hyper plane like this and a hyper plane like this. Now that I have these 2 hyper planes, I have basically 4 combinations that are possible. For example, if I take hyper plane 1 and hyper plane 2 as the 2 decision functions, then I could generate 4 regions: + +, + -, - +, - -. You know that for a particular hyper plane you have 2 half spaces, a positive half space and a negative half space.
So, when I have something like this, basically what it says is that the point is in the positive half space of both hyper plane 1 and hyper plane 2. When I have a point like this, it says the point is in the positive half space of hyper plane 1 and the negative half space of hyper plane 2, and in this case you would say it is in the negative half space of both hyper planes.
So, now you see that when we go to multi class problems, if you were to use more than one hyper plane, then depending on the hyper planes you get a certain number of possibilities. In this case, when I use these 2 hyper planes I get basically 4 regions, as I show here. So, in this multi class problem, where I have 3 classes, I could have data belonging to one class falling here, data belonging to another class falling here, and, let us say, data belonging to the third class falling here, for example.
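A tiny R sketch of the half-space bookkeeping just described; the hyper plane coefficients below are made up purely for illustration:

h1 <- function(x) sum(c(1, -1) * x) + 0.5   # hypothetical hyper plane 1: w1.x + b1
h2 <- function(x) sum(c(1,  1) * x) - 2     # hypothetical hyper plane 2: w2.x + b2
region <- function(x) paste0(ifelse(h1(x) > 0, "+", "-"),
                             ifelse(h2(x) > 0, "+", "-"))
region(c(2, -1))   # returns "+-": positive half space of plane 1, negative half space of plane 2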
So, this is another important idea that one should remember when we go to multi class problems. When we solve multi class problems, we can treat them directly as multi class problems, or we could solve many binary classification problems and apply some logic to the results of these classifiers to label the result as one of the multiple classes that we have.
So, kernel methods are important when we have linearly non separable problems. With this I just wanted to give you a brief idea of the conceptual underpinnings of classification algorithms. The math behind all of this, at least some of it, is what we will try to teach in this course, and in more advanced machine learning courses you will see the math behind all of this in much more detail.
528
techniques and use case studies that give a distinct classification
flavour to the results that we are going to see using these techniques.
So, I will start logistic regression in the next lecture and I hope to see you
then.
Thank you.
529
Data science for Engineers
Prof. Ragunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 42
Logistic regression
530
(Refer Slide Time: 01:26)
Just to recap, we have talked about the binary classification problem before; let us make sure we recall some of those things. We said classification is the task of identifying what category a new data point,
or an observation belongs to. There could be many categories to which
the data could belong, but when the number of categories is 2, it is
what we call as the binary classification problem. We can also think of
binary classification problems as simple yes or no problems where, you
either say something belongs to particular category, or no it does not
belong to that category.
531
X1 to Xn. We can also call these input features, as shown in this slide. These input features could be quantitative or qualitative. Quantitative features can be used as they are. However, if we are going to use a quantitative technique but we want to use input features which are qualitative, then we should have some way of converting these qualitative features into quantitative values. One simple example is if I have a binary input like a yes or no for a feature. What do we mean by this? I could have one data point that is (yes, 0.1, 0.3), and another data point could be (no, 0.05, -2) and so on.

So, you notice that some of these are quantitative numbers while the yes/no entries are qualitative features. Now, you could convert them all into quantitative features by coding yes as 1 and no as 0; then those also become numbers. This is a very crude way of doing this; there might be much better ways of coding qualitative features into quantitative features and so on. You also have to remember that there are some data analytics approaches that can directly handle these qualitative features without a need to convert them into numbers.
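The crude 0/1 coding mentioned above looks like this in R (illustrative values only):

answer <- c("yes", "no", "yes")
coded  <- ifelse(answer == "yes", 1, 0)   # yes -> 1, no -> 0
# alternatively, keep it as a factor and let the modelling function handle the coding
coded_factor <- factor(answer)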
So, you should keep in mind that that is also possible. Now that we have these features, we go back to our pictorial understanding of these things. Just for the sake of illustration, let us take this example where we have 2 dimensional data. Here we would say X is (x1, x2), two variables; let us say x1 is here and x2 is here. So, the data is organized like
this. Now let us assume that all the circular data belong to one category
and all the starred data, belong to another category. Notice that circled
data would have certain x1 and certain x2 and, similarly starred data
would have certain x1 and certain x2. So, in other words all values of x1
and x2 such that the data is here, belongs to one class and, such that the
data is here belongs to another class.
532
Now, if we were able to come up with a hyper plane such as the one that is shown here, we learn from our linear algebra module that to one side of this hyper plane is one half space and to the other side is another half space, and depending on the way the normal is defined, you would have a positive value on one side of the hyper plane and a negative value on the other. This is something that we have dealt with in detail in one of the linear algebra classes.
So, I could have a data point whose true value is here; however, because of noise it could slip to the other side, and so on. So, as I come closer and closer to the boundary, the probability, or the confidence with which I can say the point belongs to a particular class, intuitively comes down. Simply saying that this data point and this data point belong to class 0 is one answer, but that is pretty crude. So, the question that logistic regression answers is: can we do something better using probabilities? I would like to say that the probability that this point belongs to class 1 is much higher than for this one, because it is far away from the decision boundary. How we do this is the question that logistic regression addresses.
533
So, as I mentioned before, if we can answer the question of the probability of something being from a class, that is better than just giving yes or no answers.

One could say yes, this belongs to the class; a more nuanced answer would be that yes, it belongs to the class, but with a certain probability, and as the probability gets higher, you feel more confident about assigning that class to the data. On the other hand, if we model through probabilities, we do not want to lose the binary yes or no answer either. If I have probabilities for something, I can easily convert them to yes or no answers through some thresholding, which we will see when we describe the logistic regression methodology. So, by modelling this probability we do not lose the ability to categorically say whether a data point belongs to a particular class or not; on the other hand, we get the benefit of a nuanced answer instead of just saying yes or no.
534
So, the question then is, how does one model these probabilities? Let us go back and look at the picture that we had before; let us say this is x1 and x2. Remember that this hyper plane would typically have this form here, with the solution written in vector form. If I want to expand it in terms of x1 and x2, I could write this as β0 + β11x1 + β12x2. So, this could be the equation of this line in the two dimensional space. Now, one idea might be just to say this itself is a probability, and then let us see what happens. The difficulty with this is that this p(x) is not bounded, because it is just a linear function, whereas you know that a probability has to be bounded between 0 and 1. So, we have to find some function which is bounded between 0 and 1. The reason why we are still talking about this linear function is because this is the decision boundary.
So, you could think of something slightly different and say, look, instead of saying p(x) is this, let me say log(p(x)) = β0 + β1x. In this case you will notice that it is bounded only on one side. In other words, if I write log(p(x)) = β0 + β1x, I will ensure that p(x) never becomes negative; however, on the positive side p(x) can go to ∞. That again is a problem, because we need to bound p(x) between 0 and 1. So, this is an important thing to remember: it only bounds this on one side.
535
(Refer Slide Time: 11:16)
So, is there something more sophisticated we can do? The next idea
is to write p of X as what is called a sigmoidal function. The
sigmoidal function has relevance in many areas; it is the function
that is used in neural networks and other very interesting applications.
So, the sigmoid has an interesting form, which is here. I want you to
notice two things: number one, we are still trying to plug this
hyperplane equation into the probability expression, because that is
the decision surface. Remember, intuitively we are somehow trying to
convert that hyperplane into a probability interpretation. So, that is the
reason why we are still sticking to this β0 + β1x. Now let us look at this
equation and then see what happens.
536
Now, if you take β0 + β1x to be a very large positive number, then
the numerator will be a very large positive number and the
denominator will be 1 + that very large positive number, so this will
be bounded by 1 on the upper side; similarly, if β0 + β1x is a very large
negative number, the numerator goes to 0, so p of x is bounded below
by 0. So now, from the equation for the hyperplane, we have been able
to come up with a definition of a probability which makes sense, which
is bounded between 0 and 1.
So, it is an important idea to remember. By doing this, what we are
doing is the following. If we were not using this probability, all that we
would do is look at this equation and, whenever a new point comes in,
evaluate this β0 + β1x and then, based on whether it is positive or
negative, say yes or no.
Now, what has happened is that, instead of that, this number is put back
into this expression and, depending on what value you get, you get a
probabilistic interpretation. That is the beauty of this idea. You
can rearrange this into the form log(p(X) / (1 − p(X))) = β0 + β1x. The
reason why I show you this form is that the left hand side can be
interpreted as the log of the odds ratio, which is an idea that is
used in several places. So, that is the connection here.
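To make the two forms concrete, here is a minimal R sketch. The coefficient values are made up purely for illustration; the sketch just evaluates the sigmoid p(x) = e^(β0 + β1ᵀx) / (1 + e^(β0 + β1ᵀx)) described on the slide and checks that log(p / (1 − p)) recovers the linear hyperplane value.

```r
# Hypothetical coefficients for a 2-D problem (illustrative values only)
beta0 <- -1
beta1 <- c(2, 1)

# Sigmoid: converts the hyperplane value into a probability in (0, 1)
p_of_x <- function(x) {
  z <- beta0 + sum(beta1 * x)      # hyperplane value beta0 + beta1' x
  exp(z) / (1 + exp(z))
}

x <- c(0.5, -0.3)
p <- p_of_x(x)
p                         # a number strictly between 0 and 1
log(p / (1 - p))          # log-odds
beta0 + sum(beta1 * x)    # same value, confirming the rearrangement
```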
So, we still have to figure out what the values for these are. Once we
have values for them, any time I get a new point I simply put it into the
p of x equation that we saw in the last slide and get a probability.
So, these still need to be identified and, obviously, if we are looking at a
classification problem where I have these points on this side and stars
on the other side, I want to identify β0, β11 and β12 in such a way that this
classification problem is solved. So, I need to have some objective for
identifying these values. Remember, in the optimization lectures I told
you that all machine learning techniques can be interpreted in
some sense as optimization problems.
So, here again we come back to the same thing and say, look, we
want to identify this hyperplane, but I need to have some
objective function that I can use to identify these values. So, these β0,
β11 and β12 will become the decision variables, but I still need an
objective function. And, as we discussed when we were talking
about optimization techniques, the objective function has to reflect
what we want to do with this problem. So, here is an objective function;
it looks a little complicated, but I will explain it as we go along. I
said in the optimization lectures we could look at maximizing or
minimizing. In this case, what we are going to say is: I want to find
values for β0, β11 and β12 such that this objective is maximized.
So, take a minute to look at this objective and see why
someone might want to do something like this. When I look at this
objective function, let us say I again draw this picture, and let us say I
have these points on one side and the other points on the other side. So,
let us call this class 0 and let us call this class 1. What I would like
to do is convert this decision function into probabilities. So,
the way I am going to think about this is: when I am on this line I
should have the probability being equal to 0.5, which basically says that
if I am on the line I cannot make a choice between class 0 and class 1.
When I take a data point from the class 0 side and put it into the
probability function, I want it to give a small probability. We might
interpret the probability as the probability that the data belongs to
class 1, for example, and whenever I take a data point from the other
side and put it into that probability function, I want the probability to
be very high, because I want that to be the probability that the data
belongs to class 1. So, that is the basic idea.
So, in other words, we can paraphrase this and say: for any data
point on this side belonging to class 0, we want to minimize p of x
when x is substituted into that probability function and, for any point
on the other side, when we substitute those data points into the
probability function, we want to maximize the probability. So, if you
look at the expression, it says that if a data point belongs to class 0
then yi is 0. Anything to the power 0 is 1, so the p of xi term raised
to yi will vanish from the product. The factor that remains is of the
form (1 − p of xi) raised to the power 1 − yi and, because yi is 0, this
exponent becomes 1. So, the only thing that will remain is 1 − p of xi.
If we try to maximize 1 − p of xi, that is equivalent to minimizing
p of xi. So, for all the points that belong to class 0 we are minimizing
p of xi.
Now, let us look at the other case, of a data point belonging to class
1, in which case yi is 1, so 1 − yi is 0. So, the (1 − p of xi) term will be
raised to the power 0, which becomes 1, and it drops out. The only
thing that will remain is p of xi, since yi is 1 and anything to the power
1 is itself; so we are just left with p of xi. And since this data belongs to
class 1, I want this probability to be very large. So, when I maximize
this, it will be a large number.
So, you have to think carefully about this equation. There are many
things going on here: number one, this is a multiplication of the
probabilities for each of the data points. So, this includes data points
from class 0 and class 1. The other thing that you should remember is
let us say I have a product of several numbers, if I am guaranteed that
every number is positive right, then the product will be maximized
when each of these individual numbers are maximized. So, that is the
principle that is also operating here, that is why we do this product of
all the probabilities.
However if a data point belongs to class 1, I want probability to be
high. So, the individual term is just written as p of xi. So, this is high
for class 1. When a data point belongs to class 0, I still want this
number to be high, that means, this number will be small. So, it
automatically takes care of this as far as class 0 and class 1 are
concerned. So, while this looks a little complicated, it is written in this
way because it is easier to write it as one expression.
Now, let us take a simple example to see how this will look. Let us
say in class 0 I have 2 data points, X1 and X2, and in class 1 I have 2
data points, X3 and X4. Then this objective function, when it is written
out, would look something like this. When we take the points belonging
to class 0, I said the only thing that will remain is the 1 − p term. So,
this will be 1 − p of X1 for the first data point; for the second data point
it will be 1 − p of X2; then for the third data point it will be p of X3
and for the fourth data point it will be p of X4.
So, this would be the expression from here. Now, when we maximize this,
since the p of X values are bounded between 0 and 1, each of these factors
is a positive number and, if the product has to be maximized, then each
factor has to be individually maximized; that means each has to go
closer and closer to 1, and the closer to 1 it is, the better. So, you notice
that X4 would be optimized to belong in class 1. Similarly, X3 would be
optimized to belong in class 1 and, when you come to the other two
factors, you would see that 1 − p of X1 is a large number when p of X1 is
a small number. So, a small p of X1 basically means that X1 is optimized
to be in class 0, and similarly X2 is optimized to be in class 0. So, this is
an important idea that we have to understand in terms of how this
objective function is constructed.
Now, one simple trick you can do is take that objective function,
take its log and then maximize that. If I am maximizing a
positive number X, that is equivalent to maximizing log of X.
So, whenever one is maximized the other will also be maximized; the
reason why we do this is that it turns the product into a sum, which
looks simpler. Remember, from our optimization lectures, we said we
have got to maximize this objective, and we always write the objective
in terms of the decision variables; the decision variables in this case are
β0, β11 and β12, as we described before. So, what happens is that each
of these probability expressions, if you recall from the previous slides,
will have these 3 variables, and the xi are the points that are already
given. So, you simply substitute them into this expression.
So, in general I will have something like β0 + β11x1 + β12x2 + … + β1nxn;
this will be an n + 1 variable problem. There are n + 1 decision
variables, and these n + 1 decision variables will be identified through
this optimization. And, for any new data point, once we put that
data point into the p of x function, the sigmoidal function that we have
described, we get the probability that it belongs to class 0 or class
1.
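As a conceptual sketch of this optimization, the likelihood on the slide is the product of p(xi)^yi (1 − p(xi))^(1−yi) over all points, and its log is the sum of yi·log p(xi) + (1 − yi)·log(1 − p(xi)). The R code below, on made-up 2-D data, maximizes that log-likelihood with the general-purpose optimizer optim(); this is only to illustrate the idea, it is not the algorithm R's glm() uses (glm, as mentioned in the implementation lecture, uses a Newton-Raphson-type method).

```r
set.seed(1)
# Made-up 2-D data: two overlapping clouds labelled 0 and 1 (illustrative only)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 2), ncol = 2))
y <- c(rep(0, 20), rep(1, 20))

# Negative log-likelihood: -(sum of y*log(p) + (1-y)*log(1-p)),
# where p is the sigmoid of beta0 + beta11*x1 + beta12*x2
negloglik <- function(beta) {
  z <- beta[1] + X %*% beta[2:3]
  p <- 1 / (1 + exp(-z))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# optim() minimizes, so minimizing the negative log-likelihood
# is the same as maximizing the likelihood product described above
fit <- optim(c(0, 0, 0), negloglik)
fit$par   # estimates of beta0, beta11, beta12
```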
So, this is the basic idea of logistic regression. In the next lecture, I
will take a very simple example with several data points to show you
how this works in practice, and I will also introduce the notion of
regularization, which helps in avoiding overfitting when we do
logistic regression. I will explain what overfitting means in the next
lecture as well. With that you will have a theoretical understanding of
how logistic regression works and, in a subsequent lecture, Dr. Hemanth
Kumar will illustrate how to use this technique in R on a case study
problem.
So, that will give you the practical experience of how to use
logistic regression and how to make sense of the results that you
get from using logistic regression on an example problem.
541
Data science for Engineers
Prof. Ragunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture- 43
Logistic Regression
The sigmoidal function that we used is given here, and notice
that this is your hyperplane equation. In n dimensions this quantity
is a scalar, because you have n elements in X and n
elements in β1, and it becomes something like β0 + β11x1 + β12x2 + …
+ β1nxn.
So, this is a scalar, and we saw that if this quantity is a very
large negative number then the probability goes to 0, and if this quantity
is a very large positive number the probability goes to 1. As for the
transition of the probability at 0.5, remember I said you have to always
look at it from one class's viewpoint.
So, let us say you want class 1 to be the high probability case and class
0 to be the low probability case; then you need a threshold, as we
described before, so that you can convert this into a binary output. If
you were to use a threshold of 0.5, because probabilities go from 0
to 1, then you notice that this p of X becomes exactly 0.5 when
β0 + β1X = 0. This is because p of X is then e^0 / (1 + e^0), which is
equal to 1/2.
Also notice another interesting thing that this equation is then the
equation of the hyperplane. So, if I had data like this and data like this
and if I draw this line, then for any point on this line the probability is
equal to 0.5. That basically says that any point on this line in this 2-D
case, or on the hyperplane in the n-dimensional case, will have an equal probability
of belonging to either class 0 or class 1 which makes sense from what
we are trying to do. So, this model is what is called a logit model.
So, a line will separate this. And in a typical case in these kinds of
classification problems this is actually called as supervised
classification problem. We call this a supervised classification problem
because all of this data is labeled. So, I already know that all of this
data is coming from class 0 and all of this data is coming from class 1.
543
Just to keep in mind where we would use problems like this:
remember, at the beginning of this course I talked about fraud
detection and so on, where you could have lots of records of, let us
say, fraudulent credit card use, and all of those instances of
fraudulent credit card use you could describe by certain attributes.
So, for example, the time of day at which the credit card transaction
was done, whether the credit card use happened at the place where the
person lives, and many other attributes.
So, let us say those are the attributes, and many such attributes are
there. And you have lots of records for normal use of the credit card and
some records for fraudulent use of the credit card.
Then you could build a classifier which, given a new set of attributes,
that is, a new transaction that is being initiated, could say how
likely it is that this transaction is fraudulent. So, that is one other
way of thinking about the same problem. Nonetheless, as far as this
example is concerned, what we need to do is fill this column
with zeros and ones. If I fill a row with 0, that means
this data belongs to class 0, and if I fill it with 1 then, let us say, it
belongs to class 1, and so on.
So, this is what we are trying to do, we do not know what the classes are.
544
(Refer Slide Time: 06:04)
And, as we see here, there are 3 decision variables, because this was
a 2-dimensional problem: 1 coefficient for each dimension and then
1 constant. Now, once you have these, what you do is take your
expression for p of X which, as written before, is the sigmoid. So, this is
the sigmoidal function that we have been talking about. Then, whenever
you get a test data point, let us say (1, 3), you plug it into this sigmoidal
function and you get a probability. Let us say for the first data point,
when you plug it in, you get a probability like this.
So, if you use a threshold of 0.5 then what we are going to say is
anything less than 0.5 is going to belong to class 0 and anything greater
than 0.5 is going to belong to class 1. So, you will notice that this is
class 0, class 1, class 1, class 0, class 0, class 1, class 0, class 0, class 0.
545
So, as I mentioned in the previous slide what we wanted was to fill
this column and if you go across row then it says that particular sample
belongs to which class. So, now, what we have done is we have
classified these test cases, which the classifier did not see while you
were identifying these parameters.
So, typically, if you have lots of data with class labels
already given, one of the good things that one should do is to
split this into training data and test data. And the reason for
splitting this into training and test data is the following. In this case if
you look at it, we built a classifier based on some data and then we
tested it on some other data, but we have no way of knowing whether
these results are right or wrong.
So, we just have to take the results as they are. So, ideally what you
would like to do is, you would like to use some portion of the data to
build a classifier. And then you want to retain some portion of the data
for testing and the reason for retaining this is because the labels are
already known in this.
So, if I just give this portion of the data to the classifier, the
classifier will come up with some classification. Now that can be
compared with the already established labels for those data points. So,
for verifying how good your classifier is, it is always a good idea to
split this into training and testing data. What proportion of data you use
for training, what proportion of data used for testing and so on are
things to think about.
Also there are many different ways of doing this validation as one
would call it with test data. There are techniques such as k fold
validation and so on. So, there are many ways of splitting the data into
train and test and then verifying how good your classifier is.
Nonetheless the most important idea to remember is that one should
always look at data and partition the data into training and testing so
that you get results that are consistent.
546
(Refer Slide Time: 10:23)
So, if one were to draw again the points that we used in
this exercise: these are all class 1 data points, these are class 0 data
points, this is the hyperplane that the logistic regression model
figured out, and these are the test points that we tried with this
classifier. So, you can see that in this case everything seems to be
working well but, as I said before, you can look at results like this in 2
dimensions quite easily.
547
(Refer Slide Time: 11:42)
548
(Refer Slide Time: 14:23)
So, with this the portion on logistic regression comes to an end. What
we are going to do next is show you an example case
study where logistic regression is used for a solution. However, before
we do this case study, since all the case studies on classification and
clustering will involve looking at the output from the R code, I am
going to take a typical output from the R code; there are several
results that will show up there. These are called performance measures
of a classifier. I am going to describe what these performance measures
are and how you should interpret them once you
use a particular technique for any case study.
Thank you for listening to this lecture and I will see you in the next lecture.
550
Data science for Engineers
Prof. Ragunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture- 44
Performance measures
When you run R code for any of the classifiers that you are going to
see, as the teaching assistants describe how to do the case studies, it will
generate a table like this one. The intention in this lecture is the
following: since you are going to see a result like this for most
classification problems, what I want to do is really look at all of these
terms here. So, for example, there is accuracy; what do sensitivity,
specificity, positive predictive value and so on mean?
So, I am going to first describe what these mean in this lecture and,
once you understand this, in all the future lectures, whenever you look
at such a table, you will see how well your classifier is doing. So, let us
look at the first thing on this slide, which is something like this here.
This is a case study, which you will see later, where, based on certain
attributes of a car, you are trying to classify whether it is a Hatchback
or an SUV.
So, that is the kind of example problem that we have here. So, really
the two classes are Hatchback and SUV. And this table that you see
right here is what is called the confusion matrix. So this is the result
that you typically get. So, the way to interpret this is the following. So,
if you go down this path this is what the classifier predicts for a given
data and this direction is the actual label for that class.
552
(Refer Slide Time: 05:25)
So, when we put the previous table in this form, remember that what
I called reference in the previous table was the true condition.
So, this was the true condition Hatchback, this was the true
condition SUV, and so on, and this is the predicted label. So, going
down you get the prediction and across you get the truth. So, if you
take this cell right here, the true condition is positive and the predicted
condition is also positive; this is what we call a true positive result.
Similarly, if the predicted condition is negative and the true condition
is also negative, this is again a success case. When we think about it
this way, then basically if we have
only the diagonal elements, and the off-diagonal elements are zero, then
we have a perfect classifier in some sense: there are no mistakes that
have been made by the classifier. Now, in statistics this is called the
power of the test, this is called a type I error and this is called a
type II error.
554
(Refer Slide Time: 09:33).
So, the first definition is how accurate the classifier is. So, this
accuracy is very simply defined by, in all the samples how many times
did the classifier get the result right. So, we had, true positive, true
negative, false positive, false negative. So, I already told you that the
first letter tells you the truth or the success of the classifier. So, in these
four cases the true positive and true negative are the success cases and
these are failure cases. So, accuracy would be true positives + true
negatives divided by the total number of samples. So, this gives you
how many times did I get, did it right or how many times did the
classifier get it right. Now, because N is a sum of all of these and we
notice from the last slide that these are the o diagonal elements and I
said the best classifier is one which has 0 o diagonal elements. So,
when N is the sum of all these four, if these both are 0 N will become
TP + TN. So, accuracy = 1 or the maximum value that accuracy can
take is 1. So, this is an important measure that people use, to study the
performance of classifiers.
555
(Refer Slide Time: 11:03)
556
So, these are two other measures that people use. And there is also
another measure called balanced accuracy, which is (sensitivity +
specificity) / 2, that is, the average of sensitivity and specificity. We
will come back to this, because these are, in some sense, both things
that we should look at, and I will tell you strategies where you can get
sensitivity to be 1 always, or specificity to be 1 always, but I will
also tell you that those will not be the most effective classifiers for us
to use.
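For reference, here is a minimal R sketch of these definitions written directly in terms of the four confusion-matrix counts; it is only a restatement of the formulas, not of caret's implementation.

```r
# Performance measures computed from the four confusion-matrix counts
classifier_measures <- function(TP, TN, FP, FN) {
  accuracy    <- (TP + TN) / (TP + TN + FP + FN)
  sensitivity <- TP / (TP + FN)          # true positive rate
  specificity <- TN / (TN + FP)          # true negative rate
  balanced    <- (sensitivity + specificity) / 2
  c(accuracy = accuracy, sensitivity = sensitivity,
    specificity = specificity, balanced_accuracy = balanced)
}
```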
557
correct is given by this kind of number, and so on. So, these are
important other measures that one could use.
The next measure is a little more complicated than the others, so I am
just going to define it and then show you the formula for it.
This measure, kappa (κ), is the observed accuracy, whatever the
classifier gives you as a result, minus the accuracy you would expect
from a classifier designed purely on this notion of random chance, all
divided by 1 minus the expected accuracy; that is,
κ = (observed accuracy − expected accuracy) / (1 − expected accuracy).
So, this is the definition of κ, and what you want is for the observed
accuracy to be larger than the expected accuracy.
558
(Refer Slide Time: 17:22).
559
So, let us go back to the same example that we had and look
at these numbers for it; this is the example where we talked about
hatchback and SUV. Clearly, we are not talking about a positive label
and a negative label here. However, if you want to use these measures,
you have to make one of them the positive label and the other the
negative label. You could make either one positive or negative but,
whenever a result like this is shown in R, the first one is the
positive label and the second one is the negative label. In fact, you can
see here that positive class = hatchback, so this is the positive label.
So, now let us look at this and then see whether we can do all these
calculations for this example. So, let us look at this number. We will
see, what this is. So, the prediction is hatchback and the truth is also
hatchback. So, this is true positive. Now, here the prediction is
hatchback, but the truth is SUV. So, this is a false positive and here the
prediction is a SUV and the truth is hatchback. So, this is what I would
call as false negative and here the prediction is SUV and the truth is
also SUV. So, we will call this the true negative. So, true positive =
10, false positive = 1, false negative = 0, true negative = 9. So, this is
what we have this now, let us go through the formulae that we had
before and then see whether all of this fits in. So, the accuracy is the
number of times we got it right. So, in this case, if I sum up all the
diagonal elements, which will be true positive + true negatives.
Those are the number of times we got it right. So, the numerator
for accuracy will be 19, divided by the total number of samples, which
is 20; you see 0.95 as the answer for this. Now, when we look at
sensitivity, we said sensitivity is defined as true positive divided by
true positive + false negative. So, this is going to be 10 divided by
10 + 0, and you get 1 here.
Similarly, you can verify the specificity to be 0.9. Now, let us look at
the positive predictive value. The positive predictive value asks: of all
the labels that were identified as positive, how many of
them were actually true positives. That is what this would be.
Similarly, the negative predictive value asks: of all the labels that are
predicted as negative, how many are actually correct? If you do this
calculation, true negative is 9, divided by true negative, 9, plus false
negative, 0; so 9 by 9, which is 1, which is what you see here. The other
measures, namely prevalence, detection rate, detection prevalence and
balanced accuracy, are very simple calculations based on the formulae
that we have in the slides of this lecture. So, this kind of gives you an
idea of how to interpret the results that come out of R code for a
classifier. You will also notice that the κ value is shown here, based on
the formula that I showed you before.
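Plugging the numbers of this hatchback/SUV example into those formulas in R reproduces the values quoted above. The kappa line follows the definition given earlier; computing the expected accuracy from the row and column totals under random chance is one common convention and is an assumption here.

```r
TP <- 10; FP <- 1; FN <- 0; TN <- 9
N  <- TP + FP + FN + TN                  # 20 samples in total

(TP + TN) / N                            # accuracy            = 0.95
TP / (TP + FN)                           # sensitivity         = 1
TN / (TN + FP)                           # specificity         = 0.9
TP / (TP + FP)                           # positive pred. value ~ 0.909
TN / (TN + FN)                           # negative pred. value = 1

# kappa = (observed - expected accuracy) / (1 - expected accuracy),
# with the expected accuracy taken from the marginal totals
observed <- (TP + TN) / N
expected <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2
(observed - expected) / (1 - expected)   # kappa = 0.9
```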
Now, if one were to ask a question as to what good values for
these are, then that is where a little bit of subjectivity comes in. There
are applications where you might say sensitivity is very important, and
there might be applications where, say, specificity is a little more
important than sensitivity, and so on. So, it is application
dependent and, depending on what you are going to use these results
for, that is something that you should really think about before deciding
which of these numbers are important from your
application viewpoint. Nonetheless, what we wanted to do was, in one
lecture, give you the calculations for all of these, so that it is a
handy reference for you when you work with the case studies in R.
One last curve, which is seen and reported in many papers, is the
curve called ROC, which is an acronym for receiver operating
characteristic; this was originally developed and used in signal
detection theory.
What this ROC curve is, is a graph of sensitivity against 1 −
specificity. So, it is important to notice that this is 1 − specificity.
Clearly, we know that the best value for sensitivity is 1 and the best
value for specificity is also 1, so ideally you want both of these to be
1; let us see what that means as you go along this curve.
So, for example, this right here is the ROC curve. If you take a
particular sensitivity, say you push your sensitivity up to 0.7, then
1 − specificity is, let us say, 0.22 or something like that; so the
specificity is 0.78, because 0.22 is 1 − specificity. The way to think
about this is that the best specificity point is actually this one, because
1 − specificity being 0 would tell me specificity is 1. So, this is the
best point for specificity, but it
is the worst point for sensitivity, because sensitivity is 0. Now, as you
try to push your sensitivity to be more and more, then if you are sitting
on this curve, you are going away from the best specificity point.
So, this happens to be the best specificity worst sensitivity and as
you push this sensitivity more and more, you go further away from
your best specificity point. So intuitively, if you want a good ROC
curve, then what you want is the slope here to be, something like this.
So, as I go away from sensitivity, I do not want to lose too much
specificity. So, if the curve becomes something like this, it is better,
because at the same 0.7, if you notice, I have given up only this much.
Whereas, for this curve, I have to give up this much and if it is even
sharper, then you give up less and less.
So, this curve kind of benchmarks different classifiers; that is the
most important thing to remember. Another thing to remember is: if
you told me that I want the best sensitivity and I do not care about
specificity, then there is a very trivial solution. The reason is, remember
how sensitivity is defined: sensitivity is TP divided by TP + false
negatives, that is, how many true positives I get divided by true
positives plus false negatives. Now,
think about this: if I want to make this 1, which is the best number
for sensitivity, then my strategy is very simple, I will do no real
classification; every label, I will simply call positive. Now, if I do
that, let us see what happens. Notice an important thing: I
said this is what the classifier predicts and this is the truth or falsity
of the prediction. Now, if I come up with a strategy, a
classifier, which simply says positive for every label, let us see what
happens to sensitivity. You will have true positives in the numerator,
divided by true positives plus false negatives, and the false
negatives will be 0, because a false negative requires the prediction to
be negative, but I have a classifier which is never going to predict
negative.
So, that term will be 0; it will be true positives divided by true
positives, and sensitivity will be 1. Similarly, if I come up with a
classifier which does nothing but say every label is negative, without
doing anything, then that classifier will have a specificity value of 1. So,
if I want a sensitivity value of 1, I simply come up with a
classifier which does nothing but say everything is a positive label
and, if I want a specificity of 1, I come up with a classifier which says
every label is negative without doing anything. So, both of these
classifiers are useless.
So, there has to be some give and take, and that give and take is what
is shown by this curve right here. If you take a normalized area and
say the area under the ROC curve is this yellow portion, then you would
notice that better and better ROC curves look like this: that
means, as the area under the ROC curve gets closer and closer to 1, I
am getting better and better classifier designs.
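To make the sensitivity versus 1 − specificity trade-off concrete, here is a small R sketch that sweeps the probability threshold over made-up scores and traces the ROC curve by hand, then approximates the area under it with the trapezoidal rule. The data are invented for illustration; dedicated packages automate this, but the manual loop shows where the curve comes from.

```r
set.seed(2)
# Made-up true labels and classifier scores (probabilities)
truth  <- rbinom(200, 1, 0.5)
scores <- ifelse(truth == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

# For each threshold, compute sensitivity and 1 - specificity
thresholds <- seq(0, 1, by = 0.01)
roc <- t(sapply(thresholds, function(th) {
  pred <- as.integer(scores >= th)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  c(fpr = 1 - spec, tpr = sens)
}))

plot(roc[, "fpr"], roc[, "tpr"], type = "l",
     xlab = "1 - specificity", ylab = "sensitivity", main = "ROC curve")

# Area under the curve by the trapezoidal rule (closer to 1 is better)
o <- order(roc[, "fpr"])
sum(diff(roc[o, "fpr"]) *
      (head(roc[o, "tpr"], -1) + tail(roc[o, "tpr"], -1)) / 2)
```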
563
(Refer Slide Time: 30:20).
So, this is for a single classifier but, if you have several classifiers
that you are comparing, then I would say this is classifier model 1, this
is classifier model 2 and this is classifier model 3; then this classifier is
better than that one, which in turn is better than the other one, because
if you take any sensitivity level and go across, the amount you give up
in specificity, measured from the best specificity point, is less for
classifier 3 than for classifier 2, and less for classifier 2 than for
classifier 1. So, if I have to pick the worst of these to get the same
sensitivity, I have to give up a lot more specificity. So, that is a key
idea when you try to benchmark different classifiers in terms of their
performance.
So, I hope, this gives you an idea of, how you can benchmark the
performance of various classifiers and how to interpret numbers that
one would typically see with the confusion matrix and so on. So,
this is an important lecture for you to understand so that, when these
case studies are done and when the results are being presented, you
will know, how to interpret them and understand these results, thank
you very much.
In the next lecture, after a case study on logistic regression is
presented to you, I will come back and talk about two different types of
techniques. One is called K-means clustering; the other one is really
just looking at the neighborhood and doing classification in a very
nonparametric fashion, which is called the K-nearest neighbor
approach. So, I will talk about both of these in later lectures.
Thank you.
564
Data science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 45
Logistic Regression implementation in R
565
(Refer Slide Time: 00:38)
566
So, let us look at the problem statement. We are going to use
automotive crash testing case to illustrate this concept.
567
(Refer Slide Time: 01:34)
So, several cars have rolled into an independent audit unit for crash
tests, and they have been evaluated on a defined scale from poor to
excellent, with poor being −10 and excellent being +10. So, the scale
goes from −10 to +10, and they are being evaluated on a few
parameters. Let us look at the parameters they have been evaluated on.
So, I have the manikin head impact, which is the impact with which
the manikin's head hits in the crash; the manikin body impact, the
impact on the manikin's body; the interior impact; the heat ventilation
air conditioning impact; and the safety alarm system.
Now, each crash test is very expensive to perform and hence the
company does a crash test for only 100 cars. At the end of the crash
test, the type of the car is noted; the type here is either hatchback
or SUV. However, since the crash test is very expensive to perform
every time, the company is going to take this data, build a model and,
with this model, it should be able to predict the type of a car in future.
So, for this we are going to reserve a part of the data for building and
training a model, and the rest of the data will be kept for testing.
So, we have 100 cars in total, out of which 80 cars are going
to be taken as the training set and the remaining 20 cars are going to be
taken as the test set. The 80 cars are given in the crashTest_1.csv file
and the remaining 20 cars are given in crashTest_1_TEST.csv.
Now, we need to use logistic regression technique to classify the car
type as hatchback or SUV.
569
(Refer Slide Time: 03:37)
Now, let us look into the solution approach. Before we jump into
modelling, let us get things ready. We need to set the working
directory, clear the variables in the workspace, and load the
required packages. In this case glm is an inbuilt function, so we do
not need any specific package to be loaded for it, whereas to use the
confusion matrix I need a package called caret, which I am going to
load before I begin my modeling. I am also going to clear all the
variables in the environment using the function which we have already
learnt.
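In R those three preparation steps might look like the sketch below; the directory path is, of course, only a placeholder.

```r
# Placeholder path: point this at the folder holding the two csv files
setwd("path/to/crash_test_data")

# Clear every variable currently in the workspace
rm(list = ls())

# caret provides confusionMatrix(); glm() itself is built into R
library(caret)
```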
570
So, now let us read the data. The data for this case study, like I
said, is provided to you with these two file names: crashTest_1
is the train data and crashTest_1_TEST is the test data.
Now, to read the data from a csv file we are going to use the read.csv
function. Like I said in my earlier lecture, it reads a file in table
format and creates a data frame from it. Its syntax is given below;
the inputs are the file and, optionally, the header that carries the
column names.
571
So, now let us read the data. I am going to use the function read.csv
followed by the name of my file. Once this command is run,
the result is saved in an object called crashTest_1, which is a
data frame. Similarly, I do it for the other data set as well, and now I
have another object, crashTest_1_TEST. Now
both these data frames will be reflected in the environment.
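A sketch of those two read commands is given below; the exact file names are assumed to match the object names used later in this lecture.

```r
# Read the 80-car training file and the 20-car test file into data frames
crashTest_1      <- read.csv("crashTest_1.csv")
crashTest_1_TEST <- read.csv("crashTest_1_TEST.csv")
```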
Now, let us view the data. So, I am going to use the view command
followed by the name of my data frame. So, this is how it appears.
Once you run the command a separate tab will appear with all the
variables and the values.
572
Now, let us try to understand the data. The data set crashTest_1
contains 80 observations of 6 variables and, similarly,
crashTest_1_TEST contains 20 observations of 6
variables. Now, like I said earlier we have 5 variables here which have
been measured at the end of a crash test and if you can see from the
data, the first five columns are the details about the car and the last
column is the label which says whether the car type is hatchback or
SUV.
(Refer Slide Time: 06:15)
So, let us look at the structure of the data. By structure I mean the
variables and their corresponding data types. So, structure is the
command which is represented as str. I need to give an object to it as
input the object here is the desired object for which we want to look at
the structure.
(Refer Slide Time: 06:36)
573
So, if you look at the structure of the train data it tells you that
crashTest_1 is of the type data frame with 80 observations and 6
variables; the five measured variables are numeric and the class
variable, which is the car type, is a factor with levels hatchback and SUV.
Similarly you can also look at the structure of the test set. So now,
let us look at the summary of the data and let us see what it has to tell
about the data. So, summary is the five point summary if the input is a
data frame and if the input is an object than the corresponding
summary for the object is returned. So, this is the syntax.
574
So, the summary for the train data which is crashTest_1 is given
below in the snippet. So, for the numerical variables it is a five point
summary with minimum first quartile and median, mean, third quartile
and maximum. For the categorical variable which is the factor car type
here it returns the frequency count.
So, for the test it again returns a five point summary for the
numerical variables and for the car type it tells me that there are 10 cars
of type hatchback and 10 cars of the type SUV.
575
So, now let us look at the function glm which we are going to use
for logistic regression. So, glm stands for generalized linear model and
the inputs to it are a formula, a data frame and a family. The formula is
basically a symbolic representation of the model you want to fit; in our
case the response is the car type, which is basically the class. Data is the data frame
from which you want to obtain your variables and family is binomial if
you use logistic regression. There are also other families which are
listed inside the function but if you write family = binomial then it
specifically corresponds to logistic regression.
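Putting that together for this case study, the fitting step might look like the sketch below; the name CarType for the class column is an assumption, since the exact column name comes from the csv file.

```r
# family = binomial is what makes glm() perform logistic regression;
# "CarType ~ ." models the class on all other columns of the training data
# (CarType is an assumed name for the class column)
logisfit <- glm(CarType ~ ., data = crashTest_1, family = binomial)

logisfit   # prints the formula, coefficients and degrees of freedom
```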
Now, let us look at the model. If you run the model object logisfit in
your console, this is what is displayed. In the first line it displays the
formula, in the next line it displays the coefficients, and then it displays
two degrees of freedom values. The first degrees of freedom is for the
null model, that is, the model with only the intercept, and in the second
case you have degrees of freedom = 74, which means that I have
included all the variables in my modeling.
577
estimation and it is a derivative of Newton Raphson method. So, it tells
you that the number of iterations that it has taken is 25.
Now, let us find the odds. To find the odds we are going to use the
predict function and the syntax is predict and my input is an object for
which I want to predict it. Now, for our data I am going to use predict.
My input here is the logistic regression model, now if I do not give any
data then the function assumes that I want to predict it on the train set
which is crashTest_1 in our case. Now, type = response gives you the
output as probabilities, but if you do not mention this it by default
gives you the log odds.
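A sketch of the call just described, storing the result under the name logistrain that is used on the next slide:

```r
# Predicted probabilities on the training data itself
# (no newdata argument, so predict() uses the data the model was fit on);
# type = "response" returns probabilities, the default would be log-odds
logistrain <- predict(logisfit, type = "response")
```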
578
Now, let us plot the probabilities. So after you run the command I
have saved it as logistrain and I am going to use the plot function to
plot the probabilities. On to my right I have the plot on the y axis I
have the probabilities and on the x axis I have the index. So, from this
plot it is clear that the classes are well separated, but we still do not
know which side belongs to which car type. So, let us see how to find
out which side belongs to which car type.
579
(Refer Slide Time: 15:13)
So, now let us predict this on the test data. I am again going to use
the predict function. My input again is the logistic regression model.
Now my new data is the test data. So, crashTest_1_TEST is the test set.
Now, since I want to again model the probabilities, I am going to give
type = response, now once this command is executed it gets stored as
an object logispred and now I will plot the logispred. So, if I plot it, I
have the plot on my right, again I have the predicted values of
probability on my y axis and I can see that on the x I have the index.
Now, even for the test set the classes are well separated, now I
know that the points which fall here belong to the class of hatchback
and the points which are above belong to the class SUV. Now,
logispred is the output; it has the probability values, and I have a small
snippet below that shows me the probability values. Now, I have
shown you only for the first four points you will also get similar values
for the remaining points.
580
(Refer Slide Time: 16:28)
Now, let us look at the result. Now we want to classify whether the
test point is hatchback or SUV by setting a threshold value. So, in this
case I am going to set a threshold value of 0.5.
So, now, I am going to say that from the data crashTest_1_TEST we
create a column called logispred and, if the value that we have calculated
for that point, which is logispred, is less than or equal to 0.5, then we
assign hatchback under this column; and, again from the same data, if
logispred is greater than 0.5, we assign SUV under this column.
If you do that and if you run the commands this is how it creates a
column. So, if you can see the last column which is the 7th column
contains the predicted values and under each of these I have whether it
is hatchback or SUV. Now, the reason to do this is to check how
accurately our classifier is able to classify unseen data.
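The two conditional assignments just described can be written compactly with ifelse(), as in the sketch below; the exact spelling of the two labels is an assumption and must match the factor levels in the data.

```r
# logispred already holds the predicted probabilities for the test set,
# from predict(logisfit, newdata = crashTest_1_TEST, type = "response").
# Threshold at 0.5 and store the predicted class as a new (7th) column.
crashTest_1_TEST$logispred <- ifelse(logispred <= 0.5, "Hatchback", "SUV")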
581
(Refer Slide Time: 17:34)
So, now let us look at the confusion matrix. The function is
confusionMatrix, written with a capital M; again, to use this function
you should have already loaded the library caret.
Now, I have the confusion matrix below. So, if you look at it, I have
the reference labels here and I have the predicted labels here. So, this
says that predicted as hatchback truly hatchback there are 10 cases, it
has identified all the 10 hatchbacks correctly. But, predicted as
hatchback, but truly SUV is one and predicted as SUV and truly SUV
are 9. So, out of the 10 SUV cases it has identified 9 correctly and
there has been 1 mis-classification.
Now, if you look at the accuracy value it is 0.95. If you can recall
from Professor Raghu’s performance measure lecture accuracy is
nothing but the sum of the true positive and true negative divided by
the total number of observations which is 20 in this case.
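The call itself might look like the sketch below. Both arguments need to be factors with the same levels, and the positive class defaults to the first level (hatchback here); the name CarType for the true class column is again an assumption.

```r
library(caret)

# Compare the predicted labels against the true labels of the test set
confusionMatrix(factor(crashTest_1_TEST$logispred),
                factor(crashTest_1_TEST$CarType))
```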
582
(Refer Slide Time: 19:21)
So, I again have the command on the top. I have the sensitivity
value which is equal to one. The positive labels here are hatchback and
all of them have been identified correctly, whereas if you can see there
has been one misclassification and hence this specificity drops to 0.9.
There is also something called balanced accuracy which = 0.95. Now,
balanced accuracy is the average of sensitivity and specificity.
Thank you.
583
Data science for Engineers
Prof. Ragunathan Rengasamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 46
K - Nearest Neighbors (kNN)
584
(Refer Slide Time: 00:30)
I just want to make sure that we get the terminology right. We will
later see that the k nearest neighbor, there is one parameter that we use
for classifying, which is the number of neighbors that I am going to
look at. So, I do not want you to wonder, since we are using anyway a
parameter in this k nearest neighbor why am I calling it nonparametric.
So, the distinction here is subtle, but I want you to remember this. The
number of neighbors that we use in the k nearest neighbor algorithm
that you will see later, is actually a tuning parameter for the algorithm,
that is not a parameter that I have derived from the data.
585
Whereas, I could say, I will use a k nearest neighbor with two
neighbors three neighbors and so on, so that is a tuning parameter. So, I
want you to remember the distinction between a tuning parameter and
parameters that are derived from the data and the fact that k nearest
neighbor is a nonparametric method, speaks to the fact that we are not
deriving any parameters from the data itself. However, we are free to
use tuning parameters for k nearest neighbors, so that is an important
thing to remember. It is also called a lazy learning algorithm, where all
the computation is deferred until classification.
So, what we mean by this is the following. If I am given training data
for, for example, logistic regression, I have to do the work of getting
the parameters before I can do any classification for a test data point.
Without those parameters I can never classify test data points. However,
in k nearest neighbor, just give me the data and a test data point and I
will classify it.
So, we will see how that is done, but no work needs to be done
before I am able to classify a test data point. That is another
important difference between k nearest neighbor and logistic regression,
for example. It is also called instance based learning, where the
function is approximated locally. We will come back to this notion
of local as I describe this algorithm.
586
also predict, for the training data points themselves, what class they
should belong to, and then maybe compare it with the label that each
data point has, and so on.
So, there are ways to address this, but when we say that a large
amount of data is an issue, all that we are saying is that, since there is
no explicit training phase, there is no optimization with a large number
of data points done up front to identify parameters that are used later
in classification. In other words, in other algorithms you do all
the effort a priori and, once you have the parameters, classification
on a test data point becomes easier. However, since kNN
is a lazy algorithm, all the calculations are deferred until you
actually have to do something; at that point there might be a lot more
calculation if the data is large.
587
that is where it is used the most. You could also use it, with very simple
extensions or simple definitions, for function approximation problems,
and you will see as I describe this algorithm how it could be
adapted for function approximation problems quite easily.
So, what we are basically saying is: if there is a particular data point
and I want to find out which class this data point belongs to, all I need
to do is look at all the neighboring data points, find which
class they belong to and then take a majority vote; that is the
class that is assigned to this data point. It is something like: if you
want to know a person, look at his neighbors. Something like that is
what k nearest neighbors uses.
588
As I mentioned before, this k, the number of neighbors we are going
to look at, is a tuning parameter and this is something that you select.
So, you use a tuning parameter, run your algorithm and you get good
results, then keep that parameter if not you kind of play around with it
and then find the best k for your data. The key thing is that because we
keep talking about neighbors, and from a data science viewpoint
whenever we talk about neighbors, we have to talk about a distance
between a data point and its neighbor.
We really need a distance metric for this algorithm to work and this
distance metric would basically say what is the proximity between any
two data points. The distance metric could be the Euclidean distance,
the Mahalanobis distance, the Hamming distance and so on. So, there are
several distance metrics that you could use to basically use k nearest
neighbor.
So, there might be many classes; multi-class problems are also very
easy to solve using the kNN algorithm. But let us anyway stick to the
binary problem. Then, what you are going to do is: let us say I have a
new test point, which I call Xnew, and I want to find out how I classify
it. The very first step, which is what we talk about here, is to find the
distance between this new test point and each of the labelled data
points in the data set. So, for example, there could be a distance d1
between Xnew and X1, d2 between Xnew and X2, d3 and so on, up to dn. So,
once you calculate these distances, you have n
distances, and this is the reason why we said in the last slide that you
need a distance metric for kNN to work.
If a distance is zero, then the point is Xnew itself. A small
distance means the point is close to Xnew and, as you go down the sorted
list, the points are further and further away. Now, the next step is very
simple. If, let us say, you are looking at k nearest neighbors with k = 3,
then what you are going to do is find the first three distances in this
list; say this distance is from Xn, this distance is from X5 and this
distance is from X3.
Then you find the class label that the majority of these k labelled data
points have and assign it to the test data point; very simple. Now, I
also said this algorithm, with minor modifications, can be used for
function approximation. For example, if you so choose, you
could take this and, if you want to predict what the output
will be for a new point, find the output values for these three
points and take an average, for example; very trivial. Then you say that
is the output corresponding to this point, and that becomes an adaptation
of this for function approximation problems. Nonetheless, for
classification this is the basic idea. If you said k = 5, then you
go down to 5 numbers and do the majority vote; that is all we do. So,
let us look at this very simple idea with an illustration.
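Since the whole procedure is just distances, sorting and a majority vote, it fits in a few lines of R. Here is a sketch on made-up 2-D data using the Euclidean distance; the knn() function from the class package, used in the implementation lecture, does the same job for you.

```r
set.seed(3)
# Made-up labelled 2-D training data: a class 0 cloud and a class 1 cloud
X <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))
y <- c(rep(0, 10), rep(1, 10))

knn_classify <- function(x_new, X, y, k = 3) {
  # Step 1: Euclidean distance from the new point to every labelled point
  d <- sqrt((X[, 1] - x_new[1])^2 + (X[, 2] - x_new[2])^2)
  # Step 2: indices of the k nearest neighbours
  nearest <- order(d)[1:k]
  # Step 3: majority vote among their labels
  names(which.max(table(y[nearest])))
}

knn_classify(c(2.5, 2.5), X, y, k = 3)   # most likely "1" for this point
```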
590
(Refer Slide Time: 15:05)
Let us say this is actually the training data itself and then I want to
look at k = 3 and then see what labels will be for the training data
itself. The blue are actually labeled, so this is supervised. So, the blue
are all belonging to class 1 and the red is all belonging to class 2, and
then let us say for example, I want to figure out this point here blue
point. Though I know the label is blue, what class would k nearest
neighbor algorithm say this point belongs to. Say if I want to take k =
3, then basically I have to find three nearest points which are these
three, so this is what is represented.
And since the majority is blue, this will be blue. So, basically if you
think about it, this point would be classified correctly and so on. Now
even in the training set for example, if you take this red point, I know
the label is red; however, if I were to run k nearest neighbor with three
data points, when you find the three closest point, they all belong to
blue. So, this would be misclassified as blue even in the training data
set. So, you will notice one general principle is, there is a possibility of
data points getting misclassified, only in kind of this region where
there is a mix of both of these data points.
591
algorithm for complicated non-linear boundaries, which you would
have to guess a priori if we were using a parametric approach, so that is
a key idea here. Here is a similar illustration for k = 5.
Now, if I want to, let us say, check this data point from the training
set itself, then I look at its five neighbors, closest 1 2 3 4 5, all of them
are red. So, this is classified as red and so on. So, this is the basic idea.
Now, you do not have to do anything till you get a data point. So,
you could verify how well the algorithm will do on the training set
itself. However, if you give me a new test data point here, which is what
is shown by this data point, and you want to do a classification,
there is no label for this. Remember the other red and blue data points
already have a label from prior knowledge, this does not have a label.
So, I want to find out a label for it. So, if I were to use k = 3, then
for this data point I will find the three closest neighbors, they happen to
be these three data points. Then I will notice that two out of these are
red, so this point will get the label of class 2. If, on the other hand, the test data
point is here and you were using K = 5 then, you look at the 5 closest
neighbor to this point and then you see that two of them are class 2 and
three are class 1, so majority voting, this will be put into class 1. So,
you will get a label of class 1 for this data point. So, this is the basic
idea of k nearest neighbor, so very very simple.
592
(Refer Slide Time: 19:17)
593
So, it is worthwhile to think about this and do some mental
experiments to see why these kinds of things might be important, and
so on. Now, the other aspect is scaling. For example, suppose there are
two attributes in the data, say temperature and concentration, and the
temperatures take values around 100 while the concentrations take
values like 0.1, 0.2 and so on. When you take a distance measure, the
temperature values will dominate over the concentration values.
So, it is always a good idea to scale your data in some form before
computing the distances. Otherwise, while concentration might be an
important variable from a classification viewpoint, it will never show
up, because the temperature numbers are bigger and will simply
dominate the small numbers. So, feature selection and scaling are
things to keep in mind.
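A simple way to guard against this in R is to standardize each column before any distance computation; here is a small sketch with hypothetical temperature and concentration columns (the numbers are made up).

```r
# Hypothetical data where temperature dwarfs concentration in magnitude
raw <- data.frame(temperature   = c(100, 110, 120, 130),
                  concentration = c(0.1, 0.2, 0.15, 0.3))

# scale() subtracts each column's mean and divides by its standard deviation,
# so both attributes contribute comparably to any distance computation
scaled <- scale(raw)
dist(scaled)   # pairwise Euclidean distances on the standardized data
```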
And the last thing is curse of dimensionality. So, I told you that while
this is a very nice algorithm to apply, because there is not much
computation that is done at the beginning itself. However, if you
notice, if I get a test data point and I have to find let us say the 5 closest
neighbor, there is no way in which I can do this, it looks like, unless I
calculate all the distances.
So, that can become a serious problem, if the number of data points
in my database is very large. Let us say I have 10000 data points, and
let us assume that I have an algorithm k nearest neighbor algorithm
with K = 5. So, really what I am looking for, is finding 5 closest data
points from this data base to this data point. However, it looks like I
have to calculate all the 10,000 distances and then sort them and then
pick the top 5. In other words, to get this top 5 I have to do a lot of
work. There must be smarter ways of doing it but, nonetheless, one
has to keep in mind the number of data points and the number of
features and think about how to apply this algorithm carefully.
594
So, the best choice of k depends on the data and one general rule of
thumb is, if you use large values for k, then clearly you can see you are
taking lot more neighbors, so you are getting lot more information. So,
the effect of noise on classification can become less. However, if you
take a large number of neighbors, then your decision boundaries are
likely to become less crisp and more diffuse.
That is because, if there are two classes like this, then for this
data point, if you take a large number of neighbors, you might pick
many neighbors from the other class also, and that can make the
boundaries less crisp and more diffuse. On the flip side, if you
use smaller values of k, then your algorithm is likely to be affected by
noise and outliers; however, your decision boundaries, as a rule of
thumb, are likely to become crisper. So, these are some things to
keep in mind.
Thanks.
595
Data science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 47
K-nearest neighbours implementation in R
In the process we will show how to read the data from a .csv file,
how to understand the data that is being loaded into the workspace of R,
and how to implement this K-nearest neighbours algorithm in R using
the knn function. We will also talk about how to interpret the
results that this knn algorithm gives us.
596
(Refer Slide Time: 01:12)
Before we jump into the case study, let us review some key points
from the previous lecture of Prof. Raghu. If you remember knn is
primarily used as a classification algorithm. It is a supervised learning
algorithm. When I say supervised learning algorithm that means the
data that is provided to you has to be labelled data and knn is a non-
parametric method. So, what do you mean by this non-parametric
method is that there is no extraction of the parameters of the classifiers
from the data itself. And there is no explicit training phase involved in
this knn algorithm.
597
(Refer Slide Time: 03:07)
What happened is that, unexpectedly, they got 450 cars. Since they have the testing facility for checking only 315 cars, they will not be able to check all the 450 cars thoroughly, and the servicemen will not work longer than the normal working hours.
So, what they have done is hire a data analyst to help them out of this situation. If you are the data analyst hired by this automotive service station, how can you save the day for this new service station? That is the problem statement.
(Refer Slide Time: 04:34)
Now, let us see how a data scientist can save the day for the service station people. Since the service station has the capacity to thoroughly check 315 cars, they have thoroughly checked those 315 cars and given the data in servicetraindata.csv. For the rest of the cars among the 450, they cannot do a thorough check, so they have checked only those attributes which are easily measurable and given them in servicetestdata.csv. So, essentially the data scientist has data which is like training data for him, containing a few attributes along with a label saying whether a service is needed or not.
And he also has data for which all the other attributes are present, but he does not have the column saying whether the service is needed or not. The idea here is how one can use this service train data to comment on the readings which are present in the service test data and tell whether service is needed or not. So, the idea is to use the knn classification technique to classify the cars in the service test data file, which cannot be tested manually, and say whether service is needed or not. Now, let us see how to solve this case study in R.
(Refer Slide Time: 06:26)
First you have to get things ready. When I say get things ready, I mean you have to set the working directory to the directory in which the given data files are available. You can do that using the setwd command, giving the corresponding path here, or you can use the GUI option to set the working directory. And this command here is used to clear all the variables in the environment of R; you can very well use the brush button in the Environment/History pane to clear the variables in the workspace.
Another important thing one has to do for this knn implementation is the following: we need two external packages, caret and class, so one has to install these caret and class packages if they are not installed already. The way to install packages was explained in our R modules: you can install them through the command window using install.packages with the package names and dependencies = TRUE, or you can use the GUI. So, please install the packages caret and class. Once you install them, you can load them using the library command, as we have explained already. We will see why these packages are important as we go along this lecture.
The caret library is for generating the confusion matrix, which Prof. Raghu talked about in his lecture on the performance metrics of a classifier. And the class library is a library which contains different classification algorithms; here we are going to use it for implementing knn.
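A minimal sketch of this setup step; the path below is only a placeholder, so substitute the directory that actually holds the course files:

```r
# Set the working directory to wherever the .csv files are kept (placeholder path)
setwd("path/to/working/directory")

# Clear all variables from the R environment
rm(list = ls())

# Install the two packages once (skip if already installed), then load them
install.packages(c("caret", "class"), dependencies = TRUE)
library(caret)   # confusionMatrix() and related performance measures
library(class)   # knn() and other classification routines
```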
Now, let us see how to read the data.
(Refer Slide Time: 08:32)
For this case, the data is provided in two files, as we have already seen: servicetraindata.csv and servicetestdata.csv. In order to read the data from these csv files, the function we use is the read.csv function. Let us look at what this read.csv function takes and what it returns.
(Refer Slide Time: 08:59)
read.csv reads a file in table format and creates a data frame from it. The syntax for the read.csv function is as follows: read.csv(file, row.names). Let us look at what the input arguments file and row.names mean. file is essentially the name of the file from which you have to read the data. row.names is a vector of row names; it can be either a vector giving the actual row names or a single value which specifies which column of the data set holds the row names. Let us see how to read the data in this particular case.
(Refer Slide Time: 09:48)
As we have seen, the data has been given in these two .csv files, and we can use the read.csv function to read it. As we have seen in the syntax of read.csv, we have to give the filename, that is the service train data file from which I want to load the data. I give this file name and assign the result to a variable called servicetrain. When you execute this command, it reads the data from the service train data file and assigns it to this variable, which is in the form of a data frame.
Similarly, you will read the data from the service test data file and assign it to a variable servicetest, which is again a data frame. In the R environment, once you execute these commands, you will see two data frames, servicetrain and servicetest, which have 315 observations of 6 variables and 135 observations of 6 variables respectively.
Remember why this is 315? 315 is the number of cars that they can thoroughly check, and for these 315 cars the 6 variables are the 5 attributes which are easily measurable plus one column which says whether service is needed or not. For the 135 cars there are also 6 variables: all the 5 important attributes have been measured, and the 6th attribute is also given here; we will see why the 6th attribute is given as we go on in this lecture.
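A sketch of the two read calls just described; the file names follow the lecture, so adjust them if your copies are named differently:

```r
# Read the labelled training data (315 cars) and the test data (135 cars)
servicetrain <- read.csv("servicetraindata.csv")
servicetest  <- read.csv("servicetestdata.csv")

dim(servicetrain)   # expected: 315 rows, 6 columns
dim(servicetest)    # expected: 135 rows, 6 columns
```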
Now, let us see what is there in these servicetrain and servicetest data frames. One way to see what is there is to use the View command.
(Refer Slide Time: 11:42)
The View command helps you to see a data frame. For example, if you want to see what is there in the servicetrain data frame, View(servicetrain) will show a table like this in your editor environment. Now, you can count that there are 1, 2, 3, 4, 5, 6 attributes. If you see, the first five are the attributes which are measured for testing whether the service is needed or not, and the last attribute basically says whether service is needed or not.
Similarly, you can see the service test data set, which is shown here. For now, we will act in such a way that we do not know this last column, and we will come back to this. If you observe here, there are 135 entries which have not been thoroughly checked; only the easily measurable quantities have been recorded, and we want to figure out whether service is needed or not using the knn algorithm. That is the whole idea.
Since you have viewed what is there in the service test and service train data sets, the next question that comes to mind is: is there any way to know the data types of the attributes that are there in these data sets? Now, let us understand the data in a little more detail.
(Refer Slide Time: 13:12)
What we have seen till now is that servicetrain contains 315 observations of 6 variables and servicetest contains 135 observations of 6 variables. The variables that are present in the data sets are oil quality, engine performance, normal mileage, tyre wear, HVAC wear and service. As I mentioned earlier, the first 5 are the attributes that tell about the condition of the car, and the last attribute simply says whether service is needed or not.
The first five columns are the details about the car and the last column is the label which says whether a service is needed or not. Now, let us ask this question: what are the data types of each of these attributes, and how does one get the data types of the attributes that are there in the data?
So, since we have understood the data now, let us look at the structure of the data.
(Refer Slide Time: 14:11)
What do we mean by the structure of the data? We mean: what are the variables that are there in the data set, and what are their data types? The way you get the structure of data in R is by using the str function. What does this str function do? The str function compactly displays the internal structure of an R object. The syntax for the str function is as follows: str takes one input argument, which is an object. This object is essentially any R object about which you want to have some information.
Now, let us see the structure of the two data frames which we have read from the two .csv files.
(Refer Slide Time: 14:58)
You can see the structure of the servicetrain data frame here. If you execute the command str(servicetrain), it gives the following information, which says servicetrain is a data frame which contains 315 observations of 6 variables, and the variables are oil quality, engine performance and so on. It will also say that the data type of the first five attributes is numeric, and the last attribute, service, is a factor with two levels, which means we have yes or no in this attribute. The 1s and 2s shown against the entries represent these levels; for example, 1 corresponds to no and 2 corresponds to yes. Let us use the str command on the service test data and see what it has.
(Refer Slide Time: 15:54)
This is the output you see when you execute this command here. It says that servicetest is also a data frame, which contains 135 observations of 6 variables, and these are the variables that are available. The first 5 variables are numeric type variables, and the service variable is a factor with two levels, which contains yes or no.
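A minimal sketch of these two calls; the expected shapes in the comments come from the lecture, while the exact column names depend on the headers in your .csv files:

```r
# Compactly display the internal structure of each data frame
str(servicetrain)   # expected: 315 obs. of 6 variables, 5 numeric + 1 factor (service)
str(servicetest)    # expected: 135 obs. of 6 variables, same layout
```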
Since we have seen the structure of the data, let us ask this question: is there any way that I can get a summary of the data which I have read?
(Refer Slide Time: 16:27)
The answer is yes, you can. The summary of the data is obtained using the summary function. Essentially, what it does is invoke particular methods depending upon the class of the argument that goes along with the summary function. For example, the summary function gives a five-point summary for the numeric attributes in the data. The syntax for the summary function is as follows: the summary function takes one argument, which is an object. This object is any R object about which you want to get some information.
Let us use the summary function on our data frames which we have
loaded and see what the results are.
(Refer Slide Time: 17:15)
You can use the same summary on servicetest, and you can see that it will return the five-point summary for all the numeric variables, and it will return the number of no values and yes values in the service test.
Let us keep these numbers in mind: we have 99 no values and 36 yes values in the service test. As I said earlier, we are going to act as if we do not know the true yes and no values, and we will use knn to predict which of them are yes and which are no.
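A short sketch of these summary calls; the yes/no counts in the comment are the ones quoted in the lecture:

```r
# Five-point summary for the numeric columns, counts for the factor column
summary(servicetrain)
summary(servicetest)   # the service column here should show 99 "no" and 36 "yes"
```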
(Refer Slide Time: 18:20)
(Refer Slide Time: 19:51)
There are certain comments here; let us study what those comments are. As we have seen in the previous lecture, K-nearest neighbour is a lazy algorithm and can do prediction directly with the testing data set. It accepts the training and testing data sets, the class variable of interest (that is, the outcome categorical variable), and the parameter k, which, as I have mentioned, specifies the number of nearest neighbours that are to be considered for the classification.
So, the way I implement this knn algorithm is through the knn command. As the training data set I give my servicetrain data set; remember I have a negative 6 here, and I will talk about it a little later. As the test data set I give the attributes in servicetest except the 6th column. And as the class variable, I give the 6th column as my classification parameter.
And let us say I want to build a knn which takes the number of nearest neighbours as 3. So, these are the input arguments for this knn function. When I execute this whole command here, it will calculate the labels for the test data set and store them in this predicted knn object. I will show you the results in the coming slide.
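A sketch of the call being described; the object name predicted_knn mirrors the name used in the lecture, and column 6 is assumed to be the service label:

```r
library(class)

# Train on the 5 measured attributes (column 6 is the label, hence the -6),
# classify the test cars, and use the 3 nearest neighbours
predicted_knn <- knn(train = servicetrain[, -6],
                     test  = servicetest[, -6],
                     cl    = servicetrain[, 6],
                     k     = 3)

head(predicted_knn)   # a factor of "yes"/"no" predictions, one per test car
```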
Once you give all these parameters and execute this, knn will classify the test data points and store the labels in this predicted knn object. Let us look at the results and see what this predicted knn contains.
(Refer Slide Time: 22:32)
So, as we have seen in the earlier slide, predicted knn is the output from the algorithm, which holds the categorical variable yes or no indicating whether service is needed or not for each case in the test data. When you print this predicted knn, this is the output you see. It essentially says that among these 135 values, for the first car no service is needed, for the second car no service is needed, for the 23rd car service is needed, and so on.
So, that is what the knn algorithm does, and you have actually finished your job of classifying the test cars as to whether service is needed or not. When you do not have the luxury of knowing the true values, this is where you stop. But in our case, we already have the true values of whether service is needed or not for this data set. When you have this luxury of knowing the true classes, you can generate what is called a confusion matrix and see how well your classification is performing.
(Refer Slide Time: 23:51)
So, there are two ways of generating this confusion matrix. One, you can generate the confusion matrix manually; the other way is to use the caret package, which can generate the confusion matrix along with a lot of other parameters that Prof. Raghu has talked about in his performance metrics lecture. Let us see how to generate this confusion matrix manually. This predicted knn holds the labels that are being predicted using the knn algorithm, and when you observe this command here, this is the last column of the service test data frame, which holds the true labels of whether the service is needed or not.
(Refer Slide Time: 26:28)
What we have seen here is the way you generate the confusion matrix manually. Once you have this confusion matrix, you can calculate the accuracy.
Since knn has managed to predict all the no cases correctly as no and all the yes cases correctly as yes, your accuracy is 1. This is how you generate the confusion matrix manually. Now, let us see how to generate this confusion matrix using the caret package and the confusionMatrix command.
Along with this confusion matrix, we will also get a lot of parameters such as sensitivity, specificity, etcetera. The reason why you have sensitivity = 1 and specificity = 1 in this case is that all the positive classes are correctly classified and all the negative classes are also correctly classified; that is why you have the ideal values of one and one for sensitivity and specificity.
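A sketch of both routes to the confusion matrix, reusing predicted_knn from the earlier sketch and assuming column 6 of servicetest holds the true labels:

```r
# Manual confusion matrix: predicted labels against the true labels in column 6
table(Predicted = predicted_knn, Actual = servicetest[, 6])

# Accuracy from the manual comparison
mean(predicted_knn == servicetest[, 6])

# caret's confusionMatrix() reports accuracy, sensitivity, specificity, etc.
library(caret)
confusionMatrix(predicted_knn, servicetest[, 6])
```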
In summary, what we have seen in this lecture is how to read .csv files, how to use the str and summary functions to know the data types and the summary of R objects, and how to implement the K-nearest neighbours algorithm, which is a supervised learning algorithm that needs labelled data, in R using the knn function.
So, with this we end this tutorial session on how to implement the knn algorithm in R. In the next lecture, Prof. Raghu will talk about the k-means clustering algorithm, after which I will come back with a case study on how to implement k-means clustering.
Thank you.
Data science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 48
K-means Clustering
So, we are into the last theory lecture of this course. I am going to talk about K-means clustering today. Following this lecture there will be a demonstration of this technique on a case study by the TAs of this course. So, what is K-means clustering? K-means clustering is a technique that you can use to cluster or partition a certain number of observations, let us say N observations, into K clusters.
In classification, by contrast, these are labelled data points, and the classification algorithm's job is to actually find decision boundaries between the different classes. So, those are supervised algorithms. An example of that would be the K-nearest neighbour that we saw before, where we have labelled data and, when you get a test data point, you kind of bin it into the most likely class that it belongs to. However, many of the clustering algorithms are unsupervised algorithms, in the sense that you have N observations, as we mentioned here, but they are not really labelled into different classes.
So, when we were talking about optimization for data science, I told you that for all kinds of algorithms that you come up with in machine learning, there will be some optimization basis for those algorithms. So, let us start by describing what K-means clustering optimizes, or what the objective is that is driving the K-means clustering algorithm. As we described in the previous slide, there are N observations x1 to xN and we are asking the algorithm to partition them into K clusters. What does it mean when we say we partition the data into K clusters? We will generate K sets S1 to SK, and we will say this data point belongs to this set, this data point belongs to the other set, and so on.
So, to give a very simple example, let us say you have these observations x1, x2, all the way up to xN, and just for the sake of argument let us take it that we are going to partition this data into two clusters. So, there is going to be one set for cluster 1 and another set for cluster 2. Now the job of the K-means clustering algorithm would be to put these data points into these two bins. So, let us say this one could go here, this one could go there, x3 could go here, xN could go there, maybe x4 could go here, and so on.
So, all you are doing is taking all of these data points and putting them into two bins, and while you do it, what you are really looking for in a clustering algorithm is to make sure that all the data points that are in one bin have certain characteristics which are alike, in the sense that if I take two data points from this bin and two data points from that bin, the two data points within the same bin will be more alike each other, but if I take a data point from one bin and one from the other, they will be in some sense unlike each other. So, if you cluster the data this way, then it is something that we can use, where we could say look, all of these data points share certain common characteristics, and then we are going to make some judgments based on what those characteristics are.
calculate the mean of that set, and I will find a solution for these means in such a way that the within-cluster distance, between each x in the set and the set's mean, is minimized not only for one cluster but for all clusters. So, this is how you define the objective. This is the objective function that is being optimized, and, as we have been mentioning before, μi is the mean of all the points in set Si.
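Written out, the objective described here is the standard within-cluster sum of squares; the lecture slide itself is not reproduced in the transcript, so this is the usual textbook form:

```latex
\min_{S_1,\dots,S_K} \; \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad \text{where } \mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x .
```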
So, this is one data point, this is another data point, and so on. There are 8 data points, and if you simply plot them you would see this. When we talk about clusters, notionally you would think that there is a very clear separation of this data into two clusters.
that is it. So, how do we find the two clusters? Because we have no labels for these points and this is an unsupervised algorithm, what we do is this: we know ultimately there are going to be two clusters, so the first thing we are going to do is start off two cluster centres in some random location. You could start off the two cluster centres somewhere here and here, or you could actually pick some points in the data itself to start these two clusters.
Now, you could do this, or, like I mentioned before, you could pick a point which is not there in the data, and we will see the impact of choosing these cluster centres later in this lecture.
Now that we have two cluster centres, what the algorithm does is the following. For every data point in our database, the algorithm first finds out the distance of that data point from each one of the cluster centres. So, in this table we have distance 1, which is the distance from the point (1, 1), and we have distance 2, which is the distance from the point (3, 3). Now, if you notice, since the first point in the data is itself (1, 1), the distance of (1, 1) from (1, 1) is 0. So, you see that distance 1 is 0 and distance 2 is the distance of the point (1, 1) from (3, 3).
Now, since we want all the points that are like (1, 1) to be in one group and all the points which are like (3, 3) to be in the other group, we use distance as a metric for likeness. So, if a point is closer to (1, 1) then it is more like (1, 1) than (3, 3), and similarly, if a point is closer to (3, 3) it is more like (3, 3) than (1, 1). So, we are using a very, very simple logic here.
Now, what you do is the following. The group positions, that is the centres of these groups, were randomly chosen to begin with, but we now have more information to update the centres, because I know all of these data points belong to group 1 and all of those data points belong to group 2. So, while the initial representation for group 1 was (1, 1), a better representation for this group would be the mean of all of these 4 samples, and while the initial representation for group 2 was (3, 3), a better representation for group 2 would be the mean of all of its points. So, that is step 3.
So, we compute the new mean, and because group 1 has points 1, 2, 3, 4, we take the mean of those points, and the x is 1.075 and the y is 1.05. Similarly for group 2, we take the mean of the data points 5, 6, 7, 8 and you will see the mean here. Notice how the means have been updated from (1, 1) and (3, 3). In this case, because we chose a very simple example, the update is only slight, so these points move a little bit.
Now, you redo the same computation, because I still have the same eight observations, but now group 1 is represented not by (1, 1) but by (1.075, 1.05), and group 2 is represented by (3.95, 3.145) and not (3, 3). So, for each of these points you can again calculate a distance 1 and a distance 2, and notice that previously this distance was 0 because the representative point was (1, 1). Now that the representative point has become (1.075, 1.05), this distance is no longer 0, but it is still a small number. So, for each of these data points, with these new means, you calculate these distances and again use the same logic to see whether distance 1 is smaller or distance 2 is smaller, and depending on that you assign the groups.
repeating this process, there is never going to be any reassignment, so this clustering procedure stops.
Now, if it were the case that, because of the new means, let us say this point had gone into group 2 and these points had gone into group 1, then correspondingly you have to collect all the points in group 1 and calculate a new mean, and collect all the points in group 2 and calculate a new mean. Then you do this process again, and you keep doing it till the change in the means is negligible or there is no reassignment. When one of these two things happens, you can stop the process. So, this is a very, very simple technique.
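A bare-bones sketch of this assign-and-update loop in R for K = 2; the eight 2-D points here are made up for illustration and are not the ones on the lecture slide, and the sketch assumes both groups stay non-empty:

```r
set.seed(2)
X <- rbind(matrix(rnorm(8, mean = 1, sd = 0.2), ncol = 2),   # 4 points near (1, 1)
           matrix(rnorm(8, mean = 3, sd = 0.2), ncol = 2))   # 4 points near (3, 3)
centres <- X[c(1, 5), ]            # start from two points picked from the data

repeat {
  # Step 1: assign every point to its nearest centre
  d1 <- sqrt(rowSums((X - matrix(centres[1, ], nrow(X), 2, byrow = TRUE))^2))
  d2 <- sqrt(rowSums((X - matrix(centres[2, ], nrow(X), 2, byrow = TRUE))^2))
  group <- ifelse(d1 <= d2, 1, 2)

  # Step 2: recompute each centre as the mean of the points assigned to it
  new_centres <- rbind(colMeans(X[group == 1, , drop = FALSE]),
                       colMeans(X[group == 2, , drop = FALSE]))

  # Step 3: stop when the centres no longer move (no reassignment can follow)
  if (max(abs(new_centres - centres)) < 1e-8) break
  centres <- new_centres
}
group   # final cluster label (1 or 2) for each of the 8 points
```

In practice you would call R's built-in kmeans() function rather than writing this loop yourself; the sketch is only meant to make the assignment and mean-update steps explicit.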
So, in that sense, an algorithm like this allows you to work with just raw data, without any annotation. What I mean by annotation here is any extra information in terms of labelling of the data. You can then start making some judgments about how this data might be organized. You can look at multi-dimensional data and ask what the optimum number of clusters is; maybe there are 5 groups in this multi-dimensional data, which would be impossible to find out by just looking at an Excel sheet.
But once you run an algorithm like this, it maybe organizes the data into 5 or 6 or 7 groups, and that allows you to go and probe more into what these groups might mean in terms of process operations and so on. So, it is an important algorithm in that sense.
Now, we kept talking about finding the number of clusters. Till now I said you let the algorithm know how many clusters you want to look for, but there is something called the elbow method, where you look at the results of the clustering for different numbers of clusters and use a graph to find out what an optimum number of clusters is. So, this is called the elbow method, and you will see a demonstration of it in an example.
In the case study that follows this lecture, you will see how this plot looks and how you can make judgments about the optimal number of clusters that you should use. Basically, this approach uses what is called the percentage of variance explained as a function of the number of clusters; these are very typical-looking plots, and you can look at them and figure out what the optimal number of clusters is.
So, all of these points will be assigned to this group, and all of those data points are closer to this centre, so they will be assigned to this group; then you will calculate the mean, and once the mean is calculated there will never be any reassignment possible afterwards. Then you have these two clusters very well separated.
So, after the first round of the clustering calculations, you will see that the centre might not even move, because the mean of this and this might be somewhere in the centre. So, this centre will never move, but the algorithm will say all of these data points belong to this centre, and the other centre will never have any data points assigned to it.
So, this is a very trivial case, but it still makes the point that with the same algorithm, depending on how you start your cluster centres, you can get different results. You would like to avoid situations like this, and these are actually easy situations to avoid, but when you have multi-dimensional data, and lots of data which you cannot visualize like I showed you here, it turns out it is not really that obvious to see how you should initialize. So, there are ways of solving this problem, but that is something to keep in mind: every time you run the algorithm, if the initial guesses are different, you are likely to get at least minor differences in your results.
And the other important thing to notice: look at how I have been plotting this data. Typically I have been plotting the data so that the clusters are in general spherical. But let us say I have data like this, where all of this belongs to one class and all of that belongs to another class. Now, K-means clustering can have difficulty with this kind of data, simply because, if you look at data within this class, this point and this point are quite far apart even though they are within the same class, whereas a point from this class and a point from the other class might be closer to each other.
So, with this we come to the conclusion of the theory part of this course on data science for engineers. There will be one more lecture which will demonstrate the use of K-means on a case study. With that, all the material that we intended to cover will have been covered. I hope this was a useful course for you.
This course would hopefully have given you a good feel for the kind of thinking and aptitude that you might need to follow up on data science, and also the mathematical foundations that are going to be quite important as you try to learn more advanced and complicated machine learning techniques, either on your own or through courses such as this that are available.
Thanks again.
Data science for Engineers
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 49
K-means implementation in R
First we will start with the problem statement of the case study, followed by how to solve the case study using R. As part of the solution methodology, we will also introduce the following aspects: how to read the data from a .csv file, how to understand the data already read into the workspace of R, details about the kmeans function, and how to interpret the results that are given by this kmeans function. Let us first look at the case study.
(Refer Slide Time: 01:30)
We have named this case study clustering of trips; the reason will become clear when you see the problem statement.
Let us look at the problem statement of the case study. An Uber cab driver has made 91 trips in a week. He has a facility in the car which continuously monitors the following parameters for each trip: trip length, maximum speed, most frequent speed, trip duration, number of times the brakes are used, idling time and number of times the horn is honked. Uber wants to group these trips into a certain number of categories, based on the details collected during the trips, for some business plan. They have consulted Mister Sam, a data scientist, to perform this job, and the details of the trips are shared with him in a .csv format file with the name tripdetails.csv. This is the problem statement; let us look at how to solve this case study using R.
So, to solve this case study, first we need to set up our RStudio workspace. You need to copy the file which we have shared with you into the working directory and clear the variables in the workspace. You can set the working directory using the setwd command, passing the path of the directory which contains the data file as an argument, or you can use the GUI, as we have specified in the R module, to set the working directory. And this command here removes all the variables that are in the workspace and clears the R environment; you can very well use the brush button to clear all the variables from the environment.
The next step is to read the data from the given file. The data for this case study is provided in a file with the name tripdetails.csv. If you notice, the extension of this file is .csv, which means it is a comma separated values file. In R, to read the data from a .csv file we use the read.csv function.
Let us look at what the read.csv function takes as input arguments and what it gives us as output. read.csv reads a file in table format and creates a data frame from it. The syntax of read.csv is as follows: it takes two arguments, where the first argument is the file name and the second argument is the row names; we will see what these individual arguments are about.
file is the name of the file from which the data has to be read. row.names is a vector of row names; this can be a vector giving the actual row names or a single number specifying which column of the table contains the row names. So, essentially the syntax is read.csv with the filename, and if you have a column which specifies the row names, you can give that particular column as row.names. In this case we have the first column as the row indices; that is the reason why we have given row.names as 1.
Let us see how to read the data from tripdetails.csv. The data from tripdetails.csv can be read by executing the following command. I am using the read.csv command and trying to read the data from tripdetails.csv, and I know that the first column of the csv file holds the row names; therefore, I have specified row.names as 1, and this is the filename from which I want to read the data.
I am assigning the data which is read by read.csv to an object called tripdetails. As mentioned in the help of read.csv, it reads the data in tabular format and then assigns it to the object tripdetails, which is of type data frame.
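A one-line sketch of this read, with the file name as given in the lecture:

```r
# Read the trip data; the first column of the .csv holds the row names
tripdetails <- read.csv("tripdetails.csv", row.names = 1)

dim(tripdetails)   # expected: 91 rows, 7 columns
```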
(Refer Slide Time: 06:07)
Once you get this data into your R environment, you can view the data frame using the View function; notice that this is a capital V. Once you run this command, it will pop up a table in your editor window which shows the variables in the data frame and the number of entries in the data frame.
In this case we have 7 variables and 91 entries. This is how you can view the data frame once it is loaded into your workspace.
So, to make it very clear, we have these 7 variables and there are 91 observations in this data frame. The 7 variables noted for each trip are trip length, maximum speed, most frequent speed, trip duration, brakes, idling time and honking.
Now that we have seen how the data frame looks, it is time to see what the data types of each variable available in the data frame are. What is the way one can do that? One has to use the str function, which compactly displays the internal structure of an R object.
(Refer Slide Time: 08:00)
Now, let us look at the structure of the data frame which we have extracted from tripdetails.csv. That data frame is tripdetails, and I am passing it as an argument to the str function. When I execute this command, it will show that tripdetails is a data frame which contains 91 observations of 7 variables, and the 7 variables are trip length, maximum speed, most frequent speed and so on.
(Refer Slide Time: 09:12)
When you execute the summary command on your data frame tripdetails, it will give you the five-point summary for all 7 variables of your data frame.
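A short sketch of these inspection calls on the trip data:

```r
str(tripdetails)       # expected: 91 obs. of 7 variables
summary(tripdetails)   # five-point summary for each of the 7 variables
```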
(Refer Slide Time: 11:03)
(Refer Slide Time: 12:45)
The trip cluster object has the following information. The first line essentially tells you that it has clustered the data into 3 clusters, which are of sizes 46, 16 and 30; if you sum these up, you will end up with the total number of rows in your data frame, that is 91, and you can verify that. For each variable it will give the means of the clusters. So, because you wanted to divide the whole data into 3 clusters, the first row is the cluster 1 information, where the mean of the trip length is 19.9 and the mean of the maximum speed is 48.21, and so on.
The next piece of information that the trip cluster object contains tells you, for each of the 91 rows, which cluster it belongs to. For example, look here: the first element means the first row belongs to cluster 2, the second element means the second row belongs to cluster 3, and so on; it identifies each row with one of these clusters. And as you know, K-means is a hard clustering algorithm, which means each point has to be allotted to exactly one of the clusters and no two clusters contain the same data point; you cannot have a single data point belonging to 2 clusters.
The trip cluster object has some more information, known as the within-cluster sum of squares. If you see, this contains 3 elements; it tells you how much variance there is within each of the clusters. The lesser the variance, the better the clusters are. It also lists the other components that you can look at from the trip cluster object.
Essentially, when you build a K-means model, it will give you the following information: how many clusters it has built, how many data points there are in each of these clusters, what the means of each of these clusters are, how each point is categorized into one of the 3 clusters which you wanted to build, and other information.
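A sketch of the call and the components being described; the object name tripcluster is an assumption, and the seed is only there to make the sketch reproducible, so your cluster numbering and sizes may differ from the slide:

```r
# Build a K-means model with 3 clusters on the trip data
set.seed(20)                        # illustrative seed, not from the lecture
tripcluster <- kmeans(tripdetails, centers = 3)

tripcluster$size      # number of trips in each cluster
tripcluster$centers   # mean of every variable within each cluster
tripcluster$cluster   # which of the 3 clusters each of the 91 trips belongs to
tripcluster$withinss  # within-cluster sum of squares, one value per cluster
```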
creating a vector of NA's which is of size 10 by 1, and initializing an empty list for the cluster sizes. This for loop will run from 1 to the maximum number of clusters, which in this case is 10, and for each value of i the K-means algorithm is fitted and the fitted object is stored; from that object, the total within sum of squares distance is assigned to the i-th element of wss, and the cluster sizes are assigned to the components of the list which you have created.
So, once this for loop gets executed, you will get a vector of the within sum of squares errors and a list of the cluster sizes. Now, we can plot that, with the number of clusters on the x axis and the within sum of squares value on the y axis; type = "b" means both lines and points have to be there, pch = 19 specifies the symbol that has to be used in the plot, and xlab and ylab have their normal meanings, which are the x label and the y label.
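A sketch of the loop and plot just described; the variable names here (max_k, wss, sizes, fit) are placeholders and not necessarily the ones used on the slide:

```r
# Elbow method: run K-means for k = 1..10 and record the total within-cluster
# sum of squares each time
max_k <- 10
wss   <- rep(NA, max_k)       # vector of NA's, one slot per value of k
sizes <- list()               # empty list to collect the cluster sizes

for (i in 1:max_k) {
  fit        <- kmeans(tripdetails, centers = i)
  wss[i]     <- fit$tot.withinss   # total within-cluster sum of squares
  sizes[[i]] <- fit$size           # number of trips in each of the i clusters
}

# type = "b" draws both lines and points; pch = 19 is a solid circle symbol
plot(1:max_k, wss, type = "b", pch = 19,
     xlab = "Number of clusters", ylab = "Total within sum of squares")
```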
Let us look at the plot. The x axis is the number of clusters, which is varied from 1 to 10 because the maximum number of clusters we chose is 10, and the y axis is the total within sum of squares value for the trips. You can see that the total within sum of squares value decreased drastically when one moved from 1 cluster to 2 clusters and from 2 clusters to 3 clusters, and after that the decrease in the total within sum of squares is not much compared to the earlier ones.
That is the reason why I can choose this K = 3 as my optimal
number of clusters.
Thank you.
Data science for Engineers
Prof. Raghunathan Rengaswamy
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Lecture - 50
Data science for engineers – Summary
So, with this we come to the end of the course on Data Science for
Engineers. I am going to do a quick summary of the course. I hope all
of you had a productive time looking through the videos and learning
interesting ideas in data science.
After that, we described a framework for solving data science problems. This is a framework that you can use to conceptually break down a large data science problem into smaller sub-problems in a workflow fashion. We then moved on to describing regression analysis, where we described techniques for univariate and multivariate problems.
Now that you have done this course and hopefully learned enough from it, what is the logical next step if you were excited by this course and want to know more about data science? I would say the first thing is to practice more on the same ideas that have been taught in this course. You might want to look at other problems and other practice examples and exercises for the concepts that we have already taught. That is the first thing to do.
Once you do that, keep in mind that the data science problems we described in this course are rather simple. So, you might want to look at how people solve more involved data science problems, whether you can use the framework to break them down into smaller problems, and then see whether you can learn more about those problems. Now, as I mentioned before, we teach only very few machine learning techniques in this course, since it is for beginners. However, there are many, many more algorithms out there, such as decision trees, random forests, support vector machines, kernel tricks and so on. We have listed many of the commonly known and used algorithms for more complicated or more complex machine learning problems.
So, the next logical step, once you master the material that has been taught in this course, would be to look at learning these algorithms. You could use the same notion of understanding the assumptions underlying these algorithms to get a good idea of why they work the way they work, and also to understand the technical details of these algorithms in terms of what the learning rule is and why it works, and so on.
Thank you.
THIS BOOK IS NOT FOR SALE
NOR COMMERCIAL USE