Introduction
1.1 : Well Posed Learning Problems
Q.1 Define learning. [JNTU : Dec.-17, Marks 2]
Ans. : Learning is a phenomenon and process which
has manifestations of various aspects. Learning
process includes gaining of new symbolic knowledge
and development of cognitive skills through
instruction and practice. It is also discovery of new
facts and theories through observation and
experiment.
Q.2 Define machine learning.
Ans. : A computer program is said to learn from
experience E with respect to some class of tasks T
and performance measure P, if its performance at
tasks in T, as measured by P, improves with
experience E.
Q.3 What is the need of machine learning in this era ? [JNTU : Dec.-16, Marks 3]
Ans. : • The main goal of machine learning is to devise
learning algorithms that do the learning automatically
without human intervention or assistance.
* The machine learning paradigm can be viewed as
"programming by example.” Another goal is to
develop computational models of human learning
process and perform computer simulations.
* The goal of machine learning is to build computer
systems that can adapt and learn from their
experience.
• Machine learning algorithms can figure out how to perform important tasks by generalizing from examples, giving organizations greater insights into their data. This adaptive technology is being used by global enterprises to gain a competitive edge.
Machine learning algorithms discover the
relationships between the variables of a system
(input, output and hidden) from direct samples of
the system.
Q.4 What are T, P, E ? How do we formulate a machine learning problem ?
Ans. : • In general, to have a well-defined learning problem, we must identify these three features : the class of tasks, the measure of performance to be improved, and the source of experience.
• A Robot Driving Learning Problem
1. Task T : Driving on public, 4-lane highways using vision sensors.
2. Performance measure P : Average distance traveled before an error (as judged by a human overseer).
3. Training experience E : A sequence of images and steering commands recorded while observing a human driver.
• A Handwriting Recognition Learning Problem
1. Task T : Recognizing and classifying handwritten words within images.
2. Performance measure P : Percent of words correctly classified.
3. Training experience E : A database of handwritten words with given classifications.
• Text Categorization Problem
1. Task T : Assign a document to its content category.
2. Performance measure P : Precision and recall.
3. Training experience E : Example pre-classified documents.
Q.5 Why do we need machine learning ?
Ans. : Following are some of the reasons :
1. Some tasks cannot be defined well except by examples. For example : recognizing people.
2. Relationships and correlations can be hidden within large amounts of data. Machine learning and data mining may be able to find these relationships.
3. Human designers often produce machines that do not work as well as desired in the environments in which they are used.
4. The amount of knowledge available about certain tasks might be too large for explicit encoding by humans.
5. Environments change from time to time.
6. New knowledge about tasks is constantly being discovered by humans.
Q.6 List the phases of machine learning.
Ans. : Machine learning typically follows three phases :
1. Training : A training set of examples of correct behaviour is analysed and some representation of the newly learnt knowledge is stored. This is often in the form of rules.
2. Validation : The rules are checked and, if necessary, additional training is given. Sometimes additional test data are used, but instead a human expert may validate the rules, or some other automatic knowledge-based component may be used. The role of the tester is often called the opponent.
3. Application : The rules are used in responding to some new situation.
Q.7 What is meant by machine learning ? What is its need in today's society ? Explain successful applications of machine learning. [JNTU : Dec.-17, Marks 10]
Ans. : • Examples of successful applications of machine learning :
1. Learning to recognize spoken words.
2. Learning to drive an autonomous vehicle.
3. Learning to classify new astronomical structures.
4. Learning to play world-class backgammon.
5. Spoken language understanding : Within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories.
Face Recognition :
• The face recognition task is performed effortlessly every day; we recognize our friends, relatives and family members. We also recognize them by looking at photographs, in which they may appear in different poses, hair styles and background light, with or without makeup.
• We do it subconsciously and cannot explain how we do it. Because we cannot explain how we do it, we cannot write an algorithm for it.
• A face has some structure. It is not a random collection of pixels. It is a symmetric structure containing predefined components like the nose, mouth, eyes and ears. Every person's face is a pattern composed of a particular combination of these features. By analyzing sample face images of a person, a learning program captures the pattern specific to that person and uses it to recognize whether a new real face or new image belongs to this specific person or not.
• A machine learning algorithm creates an optimized model of the concept being learned based on data or past experience.
Q.8 Explain the difference between machine learning and data mining.
Ans. :

Machine learning | Data mining
In machine learning the main goal is to learn a model, which can be used to predict future events. | In data mining, the main goal is to discover new interesting information which describes the current data set.
It considers data as secondary. | It considers data as primary.
Machine learning uses relatively complex and global models. | Data mining uses simple models or local patterns.
The aim is to learn one or a few carefully defined models; these can be used to predict future events. | The aim is to find all interesting patterns which describe the data set.
Q.9 What are the ingredients of machine learning ?
Ans. : The ingredients of machine learning are as follows :
1. Tasks : The problems that can be solved with machine learning. A task is an abstract representation of a problem. The standard methodology in machine learning is to learn one task at a time. Large problems are broken into small, reasonably independent sub-problems that are learned separately and then recombined.
• Predictive tasks perform inference on the current data in order to make predictions. Descriptive tasks characterize the general properties of the data in the database.
2. Models : The output of machine learning. Different models are geometric models, probabilistic models, logical models, grouping and grading.
• The model-based approach seeks to create a custom solution tailored to each new application. Instead of having to transform your problem to fit some standard algorithm, in model-based machine learning you design the algorithm precisely to fit your problem.
• A model is just made up of a set of assumptions, expressed in a precise mathematical form. These assumptions include the number and types of variables in the problem domain, which variables affect each other, and what the effect of changing one variable is on another variable.
• Machine learning models are classified as : Geometric models, Probabilistic models and Logical models.
3. Features : The workhorses of machine learning. A good feature representation is central to achieving high performance in any machine learning task.
• Feature extraction starts from an initial set of measured data and builds derived values (features) that are intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps.
Q.10 What is the influence of information theory on machine learning ?
Ans. : Information theory contributes ideas about optimal codes and their relationship to optimal training sequences for encoding a hypothesis.
Q.11 What is meant by target function in a learning program ? [JNTU : Dec.-16]
Ans. : A target function is the function that a learning algorithm tries to find. Once an algorithm finds its target function, that function can be used to predict results. The function can then be used to find output data related to inputs for real problems where, unlike training sets, outputs are not included.
Q.12 Explain basic design issues and approaches of machine learning.
Ans. : Designing a Learning System
Goal : Design a system to learn how to play checkers and enter it into the world checkers tournament.
1) Choose the training experience
2) Choose the target function
3) Choose a representation for the target function
4) Choose a function approximation algorithm
Training Experience
• How does training experience influence the performance goal ?
1. Type of feedback : Direct vs indirect.
2. Learning strategy : Have a teacher or not ? Exploration vs exploitation ?
3. Diversity of training : Is the training data representative of the task ? How many peers should we play with ? How many tactics should we try when playing with self ?
• Let us decide that our program will learn by playing with itself, and formulate the learning problem.
• Choosing the training experience :
1. Direct or indirect feedback
2. Degree of learner's control
3. Representative distribution of examples
• Learning goal : Define precisely a class of problems that forms interesting forms of learning, explore algorithms to solve such problems, and understand the fundamental structure of learning problems and processes.
• Design choice 1 : The problem of selecting the type of training experience from which our system will learn. Direct training examples : just a bunch of board states together with the correct move for each.
• Design choice 2 : Indirect training. A bunch of recorded games, where the correctness of the moves is inferred from the result of the game.
• Learning is most reliable when the training examples follow a distribution similar to that of future test examples.
Q.13 How to choose and represent the target function ?
Ans. : • It determines exactly what type of knowledge will be learned and how this will be used by the performance program.
• Choosing a representation for the target function :
1. Expressive representation for a close function approximation
2. Simple representation for simple training data and learning algorithms
V̂(b) = w0 + w1x1 + w2x2 + ... + w6x6
where
x1, x2 : number of black/red pieces on the board
x3, x4 : number of black/red kings on the board
x5, x6 : number of black/red pieces threatened (i.e. which can be captured on red's/black's next turn)
• Consider the checkers program.
• ChooseMove : B -> M, where B is any legal board state and M is a legal move (hopefully the best legal move).
• Alternatively, a function V : B -> R which maps any legal board state in B to some real value, where higher scores are assigned to better board states.
• Now use the legal moves to generate every subsequent board state and use V to pick the best one, and therefore the best legal move.
1. V(b) = 100, if b is a final board state that is won.
2. V(b) = -100, if b is a final board state that is lost.
3. V(b) = 0, if b is a final board state that is a draw.
4. V(b) = V(b'), if b is not a final state, where b' is the best final board state that can be reached starting from b and playing optimally until the end (assuming both players play optimally).
• While this recursive definition specifies a value V(b) for every board state b, this definition is not usable by our checkers player because it is not efficiently computable.
• Possible representations for the target function :
1. A large table with an entry specifying a value for each distinct board state.
2. A collection of rules that match against features of the board state.
3. A quadratic polynomial function of predefined board features.
Q.14 How to adjust the weights ?
Ans. : • Choose the weights wi to best fit the set of training examples.
• Minimize the squared error E between the training values and the values predicted by the hypothesis :
E = Σ over <b, Vtrain(b)> in training examples of (Vtrain(b) - V̂(b))²
• We require an algorithm that will incrementally refine the weights as new training examples become available, and that will be robust to errors in these estimated training values.
• Least Mean Squares (LMS) is one such algorithm.
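The following is a minimal Python sketch (my own illustration, not from the text) of the linear evaluation function V̂(b) and one LMS weight update as described above; the board feature values used below are hypothetical.

```python
# Sketch of the linear checkers evaluation function and one LMS weight update.
# features[0] == 1 plays the role of the bias term w0.

def v_hat(weights, features):
    """V^(b) = w0 + w1*x1 + ... + w6*x6."""
    return sum(w * x for w, x in zip(weights, features))

def lms_update(weights, features, v_train, lr=0.1):
    """One LMS step: w_i <- w_i + lr * (V_train(b) - V^(b)) * x_i."""
    error = v_train - v_hat(weights, features)
    return [w + lr * error * x for w, x in zip(weights, features)]

# Hypothetical board: 5 black pieces, 3 red pieces, 1 black king, 0 red kings,
# 1 black piece threatened, 2 red pieces threatened.
weights = [0.0] * 7
features = [1, 5, 3, 1, 0, 1, 2]
weights = lms_update(weights, features, v_train=100)  # training value from a won game
print(v_hat(weights, features))
```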
Q.15 Define a useful perspective on machine learning.
Ans. : One useful perspective on machine learning is that it involves searching a very large space of possible hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner.
Q.16 Describe the issues in machine learning.
Ans. : Issues of machine learning are as follows :
• What learning algorithms should be used ?
• How much training data is sufficient ?
• When and how can prior knowledge guide the learning process ?
• What is the best strategy for choosing a next training experience ?
• What is the best way to reduce the learning task to one or more function approximation problems ?
• How can the learner automatically alter its representation to improve its learning ability ?
—
1.4 : Concept Learning and the General to Specific Ordering
Q.17 What is hypothesis ?
Ans. : • A hypothesis is a vector of constraints for each attribute :
1. Indicate by a "?" that any value is acceptable for this attribute.
2. Specify a single required value for the attribute.
3. Indicate by a "Ø" that no value is acceptable.
• If some instance x satisfies all the constraints of hypothesis h, then h classifies x as a positive example (h(x) = 1).
Q.18 What is the inductive learning hypothesis ?
‘Ans.: Any hypothesis found to approximate the target function well over a sufficiently large set of training
examples will also approximate the target function well over other unobserved examples.
Q.19 Explain with example concept learning task.
Ans. : • Inducing general functions from specific training examples is a main issue of machine learning.
• Concept Learning : Acquiring the definition of a general category from given sample positive and negative training examples of the category.
• Concept learning can be seen as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.
• The hypothesis space has a general-to-specific ordering of hypotheses, and the search can be organized by taking advantage of this naturally occurring structure over the hypothesis space.
• Concept learning : Inferring a boolean-valued function from training examples.
• An example of concept learning is the learning of the bird-concept from the given examples of birds.
• Concept learning involves determining a mapping from a set of input variables to a Boolean value. Such methods are known as inductive learning methods.
• We are trying to learn the definition of a concept from given examples.
• If a function can be found which maps training data to correct classifications, then it will also work well for unseen data. This process is known as generalization.
• Example : Learn the "days on which my friend enjoys his favorite water sport".
Example | Sky | Air Temp | Humidity | Wind | Water | Forecast | Enjoy Sport
1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes
2 | Sunny | Warm | High | Strong | Warm | Same | Yes
3 | Rainy | Cold | High | Strong | Warm | Change | No
4 | Sunny | Warm | High | Strong | Cool | Change | Yes
• A set of example days, each described by six attributes. The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes.
• The inductive learning hypothesis : Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
• Although the learning task is to determine a hypothesis (h) identical to the target concept c over the entire set of instances (X), the only information available about c is its value over the training examples.
• Inductive learning algorithms can at best guarantee that the output hypothesis fits the target concept over the training data.
• Lacking any further information, our assumption is that the best hypothesis regarding unseen instances is the hypothesis that best fits the observed training data. This is the fundamental assumption of inductive learning.
• Hypothesis representation (constraints on instance attributes) :
1. Any value is acceptable : represented by "?".
2. No value is acceptable : represented by "Ø".
Q.20 Illustrate general to specific ordering of hypotheses in concept learning. [JNTU : Dec.-17, Marks 5]
Ans. : • Many algorithms for concept learning organize the search through the hypothesis space by relying on a general-to-specific ordering of hypotheses.
• By taking advantage of this naturally occurring structure over the hypothesis space, we can design learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly enumerating every hypothesis.
• Consider two hypotheses :
h1 = <Sunny, ?, ?, Strong, ?, ?>
h2 = <Sunny, ?, ?, ?, ?, ?>
• Now consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any instance classified positive by h1 will also be classified positive by h2. Therefore, we say that h2 is more general than h1.
• One learning method is to determine the most specific hypothesis that matches all the training data.
• More-General-Than-Or-Equal relation : Let h1 and h2 be two boolean-valued functions defined over X. Then h1 is more-general-than-or-equal-to h2 (written h1 ≥g h2) if and only if any instance that satisfies h2 also satisfies h1.
• h1 is more-general-than h2 (h1 >g h2) if and only if h1 ≥g h2 is true and h2 ≥g h1 is false. We also say h2 is more-specific-than h1.
h1 ≥g h2 if and only if (∀x ∈ X) [(h2(x) = 1) → (h1(x) = 1)]
Fig. Q.20.1 : Lattice of hypotheses ordered from specific (top) to general (bottom)
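As a small illustration (not from the text), the sketch below encodes hypotheses as tuples of constraints, with '?' meaning any value, and tests the more-general-than-or-equal relation for the two EnjoySport hypotheses above.

```python
# Sketch: hypotheses as constraint tuples ('?' = any value), with a helper that
# tests whether h1 is more-general-than-or-equal-to h2 (h1 >=_g h2).

def covers(c1, c2):
    """Constraint c1 covers constraint c2 if c1 is '?' or both are the same value."""
    return c1 == '?' or c1 == c2

def more_general_or_equal(h1, h2):
    """h1 >=_g h2 : every instance satisfying h2 also satisfies h1."""
    return all(covers(c1, c2) for c1, c2 in zip(h1, h2))

h1 = ('Sunny', '?', '?', 'Strong', '?', '?')
h2 = ('Sunny', '?', '?', '?', '?', '?')

print(more_general_or_equal(h2, h1))  # True  : h2 is more general than h1
print(more_general_or_equal(h1, h2))  # False : h1 is not more general than h2
```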
1.5 : FIND-S Algorithm
Q.21 Explain the key properties of FIND-S algorithm for concept learning with necessary example.
[JNTU : Dec.-17, Marks 5]
Ans. : • The key property is that for hypothesis spaces described by conjunctions of attribute constraints, FIND-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
• Its final hypothesis will also be consistent with the negative examples provided the correct target concept is contained in H, and provided the training examples are correct.
• The FIND-S algorithm starts from the most specific hypothesis and generalizes it by considering only positive examples. As long as the hypothesis space contains a hypothesis that describes the true target concept, and the training data contains no errors, ignoring the negative examples does no harm : FIND-S still outputs the most specific hypothesis within H that is consistent with the positive examples.
• The final hypothesis will also be consistent with the negative examples if the correct target concept is contained in H, and the training examples are correct.
Example | Sky | Air Temp | Humidity | Wind | Water | Forecast | Enjoy Sport
1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes
2 | Sunny | Warm | High | Strong | Warm | Same | Yes
3 | Rainy | Cold | High | Strong | Warm | Change | No
4 | Sunny | Warm | High | Strong | Cool | Change | Yes

h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
h2 = <Sunny, Warm, ?, Strong, Warm, Same>
h3 = <Sunny, Warm, ?, Strong, Warm, Same>
h4 = <Sunny, Warm, ?, Strong, ?, ?>
Algorithm :
• Initialize h to the most specific hypothesis in H.
• For each positive training instance x :
For each attribute constraint ai in h :
If the constraint ai is satisfied by x, then do nothing.
Else replace ai in h by the next more general constraint that is satisfied by x.
• Output hypothesis h
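A minimal Python sketch of FIND-S (my own, not from the text) on the EnjoySport data above; for conjunctive hypotheses the "next more general constraint" simply replaces a mismatching value by '?':

```python
# Sketch of FIND-S for conjunctive hypotheses over the EnjoySport data.
NONE = object()   # stands for the "no value acceptable" constraint (Ø)

def find_s(examples):
    """examples: list of (instance_tuple, label) pairs; only positives are used."""
    n = len(examples[0][0])
    h = [NONE] * n                      # most specific hypothesis in H
    for x, label in examples:
        if label != 'Yes':
            continue                    # FIND-S ignores negative examples
        for i, (c, v) in enumerate(zip(h, x)):
            if c is NONE:
                h[i] = v                # first positive example: copy its values
            elif c != v and c != '?':
                h[i] = '?'              # next more general constraint
    return tuple(h)

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(data))   # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```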
Q.22 Describe hypothesis space search by FIND-S algorithm. [JNTU : Dec.-16, Marks 10]
Ans. : • The FIND-S algorithm illustrates one way in which the more-general-than partial ordering can be used to organize the search for an acceptable hypothesis.
• The search moves from hypothesis to hypothesis, searching from the most specific to progressively more general hypotheses along one chain of the partial ordering.
• The hypothesis space search performed by FIND-S is shown in Fig. Q.22.1.
• The search begins (h0) with the most specific hypothesis in H, then considers increasingly general hypotheses (h1 through h4) as mandated by the training examples.
• In the instance space diagram, positive training examples are denoted by "+", negative examples by "-", and instances that have not been presented as training examples are denoted by solid circles.
• At each stage the hypothesis is the most specific hypothesis consistent with the training examples observed up to this point.

h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +    h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
x2 = <Sunny, Warm, High, Strong, Warm, Same>, +      h2 = <Sunny, Warm, ?, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, -    h3 = <Sunny, Warm, ?, Strong, Warm, Same>
x4 = <Sunny, Warm, High, Strong, Cool, Change>, +    h4 = <Sunny, Warm, ?, Strong, ?, ?>
Fig. Q.22.1 : The hypothesis space search performed by FIND-S (instances X, hypotheses H ordered from specific to general)
1.6 : Version Space and Candidate Elimination Algorithm
Q.23 What Is version space ?
‘Ans. : * Version space : A set of all hypotheses that are consistent with the training examples.
• The version space, denoted VS_H,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D.
# A version space is a hierarchical representation of knowledge that enables you to keep track of all the
useful information supplied by a sequence of learning examples without remembering any of the
examples.
The version space method is a concept learning process accomplished by managing multiple models
within a version space.
Q.24 Explain characteristics of version space.
Ans. : Characteristics :
1. Concepts are represented using version spaces.
2. A version space represents all the alternative plausible descriptions of a heuristic.
3. A plausible description is one that is applicable to all known positive examples and no known negative examples.
4. A version space description consists of two complementary trees :
i. One that contains nodes connected to overly general models, and
ii. One that contains nodes connected to overly specific models.
5. Node values / attributes are discrete.
Q.25 Describe compact representation for version
spaces.
Ans. :
• Instead of enumerating all the hypotheses consistent with a training set, we can represent its most specific and most general boundaries. The hypotheses included in-between these two boundaries can be generated as needed.
• Definition : The general boundary (G) with respect to hypothesis space H and training data D, is the set of maximally general members of H consistent with D.
• Definition : The specific boundary (S) with respect to hypothesis space H and training data D, is the set of minimally general (i.e., maximally specific) members of H consistent with D.
• Fig. Q.25.1 shows general and specific hypotheses.
Fig. Q.25.1 : General and specific hypotheses (most general hypothesis at one boundary, most specific hypothesis at the other, with the examples in between)
• Each specialization must be a generalization of some specific concept description. No specialization can be a specialization of another general concept description. Fig. Q.25.2 shows the boundary sets with hypotheses.

Fig. Q.25.2 : Boundary sets (G-set and S-set) with hypotheses
Q.26 What are advantages and disadvantages of the version space method ?
Ans. :
Advantages of the version space method :
1. Can describe all the possible hypotheses in the language consistent with the data.
2. Fast (close to linear).
Disadvantages of the version space method :
1. Inconsistent data (noise) may cause the target concept to be pruned.
2. Learning disjunctive concepts is challenging.
Q.27 Describe List-Then-Eliminate algorithm.
Ans. : • The List-Then-Eliminate algorithm initializes the version space to contain all hypotheses in H, then eliminates any hypothesis found inconsistent with any training example.
• The version space of candidate hypotheses thus shrinks as more examples are observed, until ideally just one hypothesis remains that is consistent with all the observed examples.
• If insufficient data is available to narrow the version space to a single hypothesis, then the algorithm can output the entire set of hypotheses consistent with the observed data.
• The List-Then-Eliminate algorithm can be applied whenever the hypothesis space H is finite. It has many advantages, including the fact that it is guaranteed to output all hypotheses consistent with the training data.
Scanned with CamScanner1s exhaustively enumerating. all hypotheses nue, —___Introduetion
san
esis spazces
+ Unfortunately, it =
“nvealistic requirement
for all but the most trivial hypoth
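A tiny sketch (mine, not from the text) of List-Then-Eliminate for the conjunctive EnjoySport hypothesis space: it enumerates every hypothesis built from the observed attribute values plus '?', then filters out those inconsistent with any training example.

```python
# Sketch of List-Then-Eliminate: enumerate a finite hypothesis space, then
# keep only the hypotheses consistent with every training example.
from itertools import product

def matches(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def list_then_eliminate(examples):
    # Build a finite hypothesis space from the values seen in the data plus '?'.
    n = len(examples[0][0])
    domains = [sorted({x[i] for x, _ in examples}) + ['?'] for i in range(n)]
    return [h for h in product(*domains)
            if all(matches(h, x) == (label == 'Yes') for x, label in examples)]

data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
for h in list_then_eliminate(data):
    print(h)   # surviving hypotheses, e.g. ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```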
Q.28 Define candidate elimination algorithm.
Ans. : The Candidate-Elimination algorithm computes the version space containing all (and only those) hypotheses from H that are consistent with an observed sequence of training examples.
Q.29 Write algorithm for Candidate-Elimination.
Ans. :
Algorithm :
Given:
© A representation language.
A set of positive and negative examples expressed in that language,
Compute + a concept description that is consistent with all the positive examples and none of the
negative examples.
Method :
• Initialize G, the set of maximally general hypotheses, to contain one element : the null description (all features are variables).
• Initialize S, the set of maximally specific hypotheses, to contain one element : the first positive example.
• Accept a new training example.
If the example is positive :
1. Generalize all the specific models to match the positive example, but ensure the following :
• The new specific models involve minimal changes.
• Each new specific model is a specialization of some general model.
• No new specific model is a generalization of some other specific model.
2. Prune away all the general models that fail to match the positive example.
If the example is negative :
1. Specialize all general models to prevent a match with the negative example, but ensure the following :
• The new general models involve minimal changes.
• Each new general model is a generalization of some specific model.
• No new general model is a specialization of some other general model.
2. Prune away all the specific models that match the negative example.
• If S and G are both singleton sets, then :
- If they are identical, output their value and halt.
- If they are different, the training cases were inconsistent. Output this result and halt.
• Otherwise, accept the next training example.
• Possible outcomes :
0 : No consistent description for the data in the language.
1 : Answer (the version space has converged).
2 or more : All consistent descriptions in the language are implicitly included.
th example. ace «t
2.30 Explain candidate elimination algorithm version 5P82 ©
Ans.:+ The candidate-Elimination algorithm computes 1 araining &*
hypotheses from H that are consistent with an observed annie car"
«Example : Learning the concept of “Japanese Eco! cade, TYPE
‘+ Features : Country of Origin, Manufacturer,
j_Ontsin
Color,
‘Example Type
Positive
Manufacturer
Honda
• Positive Example 1 : (Japan, Honda, Blue, 1980, Economy)
• Initialize G to a singleton set that includes everything :
G = { (?, ?, ?, ?, ?) }
• Initialize S to a singleton set that includes the first positive example :
S = { (Japan, Honda, Blue, 1980, Economy) }
• Negative Example 2 : (Japan, Toyota, Green, 1970, Sports)
• Specialize G to exclude the negative example :
G = { (?, Honda, ?, ?, ?), (?, ?, Blue, ?, ?), (?, ?, ?, 1980, ?), (?, ?, ?, ?, Economy) }
S = { (Japan, Honda, Blue, 1980, Economy) }
• Positive Example 3 : (Japan, Toyota, Blue, 1990, Economy)
• Prune G to exclude descriptions inconsistent with the positive example :
G = { (?, ?, Blue, ?, ?), (?, ?, ?, ?, Economy) }
• Generalize S to include the positive example :
S = { (Japan, ?, Blue, ?, Economy) }
• Negative Example 4 : (USA, Chrysler, Red, 1980, Economy)
• Specialize G to exclude the negative example (but stay consistent with S) :
G = { (?, ?, Blue, ?, ?), (Japan, ?, ?, ?, Economy) }
S = { (Japan, ?, Blue, ?, Economy) }
• Example 5 : (Japan, Honda, Red, 1990, Economy)
• This example is inconsistent with the version space : G cannot be specialized and S cannot be generalized.
• The version space collapses.
• Conclusion : No conjunctive hypothesis is consistent with the data set.
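As an illustration (my own, not from the text), the sketch below implements the positive-example update used in the trace above for conjunctive hypotheses: generalize S minimally and prune G. Specializing G on negative examples needs the attribute domains and is omitted to keep the sketch short.

```python
# Partial sketch of the Candidate-Elimination updates for conjunctive
# hypotheses ('?' = any value): the positive-example step only.

def matches(h, x):
    """h matches x if every constraint is '?' or equals the attribute value."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def min_generalization(s, x):
    """Minimally generalize specific hypothesis s so that it matches positive x."""
    return tuple(c if c == v else '?' for c, v in zip(s, x))

# State after negative example 2 in the car trace above:
G = [('?', 'Honda', '?', '?', '?'), ('?', '?', 'Blue', '?', '?'),
     ('?', '?', '?', 1980, '?'), ('?', '?', '?', '?', 'Economy')]
S = [('Japan', 'Honda', 'Blue', 1980, 'Economy')]

# Positive example 3: (Japan, Toyota, Blue, 1990, Economy)
x3 = ('Japan', 'Toyota', 'Blue', 1990, 'Economy')
G = [g for g in G if matches(g, x3)]          # prune general models that fail to match
S = [min_generalization(s, x3) for s in S]    # generalize the specific models

print(G)   # [('?', '?', 'Blue', '?', '?'), ('?', '?', '?', '?', 'Economy')]
print(S)   # [('Japan', '?', 'Blue', '?', 'Economy')]
```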
1.7 : Inductive Bias
Q.31 What is an inductive bias ?
Ans. : • Consider a concept learning algorithm L for the set of instances X. Let c be an arbitrary concept defined over X, and let Dc = {<x, c(x)>} be an arbitrary set of training examples of c. Let L(xi, Dc) denote the classification assigned to the instance xi by L after training on the data Dc.
• The inductive bias of L is any minimal set of assertions B such that for any target concept c and corresponding training examples Dc the following formula holds :
(∀ xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
Q.32 Explain fundamental property of inductive inference.
Ans. : • A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
• Inductive leap : A learner should be able to generalize training data using prior assumptions in order to classify unseen instances.
• The generalization is known as an inductive leap, and our prior assumptions are the inductive bias of the learner.
• The inductive bias (prior assumptions) of the Candidate-Elimination algorithm is that the target concept can be represented by a conjunction of attribute values, the target concept is contained in the hypothesis space, and the training examples are correct.
Q.33 Explain with example inductive bias.
Ans. : • The Candidate-Elimination algorithm will converge toward the true target concept provided it is given accurate training examples and provided its initial hypothesis space contains the target concept.
What if the target concept is not contained in the hypothesis space ?
• Can we avoid this difficulty by using a hypothesis space that includes every possible hypothesis ?
• How does the size of this hypothesis space influence the ability of the algorithm to generalize to unobserved instances ?
• How does the size of this hypothesis space influence the number of training examples that must be observed ?
• Suppose we have restricted the hypothesis space to include only conjunctions of attribute values. Because of this restriction, the hypothesis space is unable to represent even simple disjunctive target concepts such as "Sky = Sunny or Sky = Cloudy".
• Consider three training examples for this disjunctive concept : <Sunny, Warm, Normal, Strong, Cool, Change> Yes ; <Cloudy, Warm, Normal, Strong, Cool, Change> Yes ; <Rainy, Warm, Normal, Strong, Cool, Change> No.
• From the first two examples : S2 : <?, Warm, Normal, Strong, Cool, Change>
• This is inconsistent with the third example, and there are no hypotheses consistent with these three examples. PROBLEM : We have biased the learner to consider only conjunctive hypotheses. We require a more expressive hypothesis space.
* The obvious solution to the problem of assuring that the target concept is in the hypothesis space H
is to provide a hypothesis space capable of representing every teachable concept.
Q.34 Which are the three learning algorithms from weakest to strongest bias ?
Ans. : • ROTE-LEARNER : Learning corresponds simply to storing each observed training example in memory. Subsequent instances are classified by looking them up in memory. If the instance is found in memory, the stored classification is returned. Otherwise, the system refuses to classify the new instance. Inductive bias : No inductive bias.
* CANDIDATE-ELIMINATION : New instances are classified only in the case where all members of the
current version space agree on the classification. Otherwise, the system refuses to classify the new
instance. Inductive Bias : the target concept can be represented in its hypothesis space,
*FIND-S : This algorithm, described earlier, finds the most specific hypothesis consistent with the
training examples. It then uses this hypothesis to classify all subsequent instances. Inductive Bias: the
target concept can be represented in its hypothesis space, and all instances are negative instances
unless the opposite is entailed by its other knowledge.
1.8 : Decision Tree Learning : Introduction and Representation
Q.35 What is decision tree ?
Ans. : • Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.
• A decision tree is a tree where each node represents a feature (attribute), each link (branch) represents a decision (rule) and each leaf represents an outcome (categorical or continuous value).
• A decision tree or a classification tree is a tree in which each internal node is labeled with an input feature. The arcs coming from a node labeled with a feature are labeled with each of the possible values of the feature.
Q.36 What are the nodes of decision tree ?
Ans. : A decision tree has two kinds of nodes : internal (decision) nodes and leaf nodes.
• Build a decision tree for classifying positive or negative instances of a concept.
• Supervised learning, batch processing of training examples, using a preference bias.
• A decision tree is a tree where :
a. Each non-leaf node is associated with an attribute (feature).
b. Each leaf node has associated with it a classification (+ or -).
c. Each arc has associated with it one of the possible values of the attribute at the node from which the arc is directed.
• An internal node denotes a test on an attribute. Branches represent outcomes of the test. Leaf nodes represent class labels or class distributions.
• A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules.
Q.38 Write decision tree algorithm.
Ans. : To generate a decision tree from the training tuples of data partition D.
Input :
1. Data partition (D)
2. attribute_list
3. Attribute_selection_method
Method :
1. Create a node N;
2. If the tuples in D are all of the same class C then
3. Return N as a leaf node labeled with the class C;
4. If attribute_list is empty then
5. Return N as a leaf node labeled with the majority class in D;
6. Apply Attribute_selection_method (D, attribute_list) to find the best splitting_criterion;
7. Label node N with the splitting_criterion;
8. Then attribute_list = attribute_list - splitting_attribute;
9. For (each outcome j of splitting_criterion)
10. Let Dj be the set of data tuples in D satisfying outcome j;
11. If Dj is empty then attach a leaf labeled with the majority class in D to node N;
12. Else attach the node returned by Generate_decision_tree (Dj, attribute_list) to node N;
13. End of for loop
14. Return N;
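The pseudocode above is generic; as a practical illustration (my own sketch, assuming scikit-learn is available, not part of the text), the same recursive "select attribute, split, recurse" procedure is what a library decision tree performs when it is fitted:

```python
# Minimal example of building a decision tree classifier with scikit-learn.
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): each row is [weather, parents_visiting, money] as integers.
X = [[0, 1, 1], [0, 0, 1], [1, 1, 0], [2, 1, 0], [2, 0, 1]]
y = ['cinema', 'tennis', 'cinema', 'cinema', 'stay-in']   # class labels

clf = DecisionTreeClassifier(criterion='entropy')   # information-gain style splits
clf.fit(X, y)

print(clf.predict([[0, 1, 1]]))   # predicted class for a new day
```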
Q.39 Describe characteristics of appropriate problems for decision tree learning.
Ans. : • Decision tree learning is generally best suited to problems with the following characteristics :
1. Instances are represented by attribute-value pairs : a fixed set of attributes, and the attributes take a small number of disjoint possible values.
2. The target function has discrete output values. Decision tree learning is appropriate for a boolean classification, but it easily extends to learning functions with more than two possible output values.
3. Disjunctive descriptions may be required. Decision trees naturally represent disjunctive expressions.
4. The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
5. The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values.
6. Decision tree learning has been applied to problems such as learning to classify medical patients by their disease and equipment malfunctions by their cause.
Q.40 Define the following :
a. Input variable b. Leaf node c. Internal nodes d. Depth
Ans. :
a. Input variable : Each member of the set {x1, x2, ..., xn} is called an input variable.
b. Leaf node : A node without further branches is
called a leaf node. The leaf nodes return class
Scanned with CamScannerlabels and, in some implementations, they return
the probability scores.
€. Internal nodes are the decision or test points.
Each internal node refers to an input variable or
an attribute. The top internal node is called the
root.
4. Depth : The depth of a node is the minimum
number of steps required to reach the node from
the root.
Q41 List the advantages and disadvantages of
decision tree.
‘Ans. : Advantages :
1. Rules are simple and easy to understand.
2. Decision trees can handle both nominal and
numerical attributes.
3. Decision trees are capable of handling datasets
that may have errors.
4. Decision trees are capable of handling datasets
that may have missing values.
5. Decision trees are considered to be a
nonparametric method.
6. Decision trees are self-explanatory.
Disadvantages :
1. Most of the algorithms require that the target
attribute will have only discrete values.
2. Some problem are difficult to solve like XOR.
3. Decision trees are less appropriate for estimation
tasks where the goal is to predict the value of a
continuous attribute.
4. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
Q.42 How to evaluate a decision tree ?
Ans. : • A decision tree can help you make tough decisions by laying out the possible paths and outcomes, but only if it is built and evaluated correctly. Decision trees are models of possible decisions and their outcomes, with the possible outcomes in a tree drawn as "branches" off the decision nodes. A decision tree is evaluated as follows :
• First, review the decision tree with domain experts to determine if the decision rules are sound.
• Next, look at the depth and nodes of the tree. Having too many layers and obtaining nodes with few members might be signs of overfitting.
• In overfitting, the model fits the training set well, but it performs poorly on the new samples in the testing set.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
• Overfitting is when a classifier fits the training data too tightly. Such a classifier works well on the training data but not on independent test data. It is a general problem that plagues all machine learning methods.
• Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
« To prevent over-fitting we have several options :
1. Restrict the number of adjustable parameters the
network has - e.g. by reducing the number of
hidden units, or by forcing connections to share
the same weight values.
2. Stop the training early, before it has had time to
learn the training data too well.
3. Add some form of regularization term to the error/cost function to encourage smoother network mappings.
4. Add noise to the training patterns to smear out
the data points.
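As an illustration of checking for overfitting (my own sketch, assuming scikit-learn; the data set and depth values are arbitrary), compare accuracy on the training set with accuracy on a held-out test set; a large gap suggests the tree fits the training data too tightly:

```python
# Sketch: detect overfitting by comparing training and test accuracy of a
# decision tree, then limit its depth to regularize it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (None, 3):                      # unrestricted tree vs. depth-limited tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth,
          tree.score(X_train, y_train),      # training accuracy
          tree.score(X_test, y_test))        # test accuracy; a big gap signals overfitting
```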
1.9 : Basic Decision Tree Learning Algorithm
Q.43 Define information gain.
Ans.:¢ Entropy measures the impurity of a
collection. Information Gain is defined in terms of
Entropy.
Information gain tells us how important a given
attribute of the feature vectors is.
Scanned with CamScannerGain(S, A) = Entropy(s)—
Sy
[5s lentopy(s,
vevatuesiay [S|
where Values (A)
for attribute A
which attribute A has
Q.44 What Is the role of information gain in
decision tree leaning ?
) is the set of all possible values
and Sy is the subset of $ for
value v
USP LINTU : Dee.-16, Marks-3)
Ans. + © Information gain measures how well a given
attribute separates the training examples according to
their target classification
*1D3 uses this information gain measure to select
among the candidate attributes at
each step while
growing the tree,
+ Entropy is used for measuring information gain,
+ Putting together a decision tree is all a matter of
choosing which attribute to test at each node in the
tree.
+ We shall define a measure called information gain
‘which will be used to decide which attribute to test
at each node.
‘Information gain is itself calculated using a measure
called entropy, which we first define for the case of
a binary decision problem and then define for the
general case
Q.45 Explain Gini index and entropy of decision tree algorithm.
Ans. : • One of the decision tree algorithms is CART (Classification and Regression Tree). When the decision or target variable is categorical, the decision tree is a classification tree; when the decision or target variable is continuous, the decision tree is a regression tree. CART can be used for building both classification and regression decision trees. The impurity measure used in building a decision tree by CART is the Gini index. The decision tree built by the CART algorithm is always a binary decision tree.
*Gini impurity is a measure of how often a
randomly chosen element from the set would be
incorrectly labeled if it was randomly labeled
according to the distribution of labels in the subset,
* Gini index, entropy and twoing rule are some of
the frequently used impurity measures.
• Gini index for a given node t :
Gini(t) = 1 - Σj [ p(j | t) ]²
• Maximum of (1 - 1/nc), where nc is the number of classes, when records are equally distributed among all classes, implying maximal impurity.
• Minimum of 0 when all records belong to one class, implying complete purity.
• Entropy at a given node t :
Entropy(t) = - Σj p(j | t) log p(j | t)
• Maximum (log nc) when records are equally distributed among all classes (maximal impurity).
• Minimum (0.0) when all records belong to one class (maximal purity).
* Entropy is the only function that satisfies all of the
following three properties
1. When node is pure, measure should be zero
2. When impurity is maximal (Le. all classes
equally likely), measure should be maximal
3. Measure should obey multistage property
• When a node p is split into k partitions (children), the quality of the split is computed as a weighted sum :
Gini_split = Σ (i = 1 to k) (ni / n) · Gini(i)
where nj = number of records at child i, and
n = number of records at node p.
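A small sketch (mine, not from the text) of the Gini index and the weighted split quality defined above:

```python
# Sketch of the Gini index and the weighted Gini of a split.
from collections import Counter

def gini(labels):
    """Gini(t) = 1 - sum_j p(j|t)^2 for the class labels at node t."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(children):
    """Weighted Gini of a split: sum_i (n_i / n) * Gini(child_i)."""
    n = sum(len(child) for child in children)
    return sum(len(child) / n * gini(child) for child in children)

parent = ['+', '+', '+', '-', '-', '-']
print(gini(parent))                                    # 0.5 for a 50/50 node
print(gini_split([['+', '+', '+'], ['-', '-', '-']]))  # 0.0 : a pure split
```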
+ A problem with all impurity measures is that they
depend only on the number of (training) patterns of
different classes on either side of the hyperplane.
Thus, if we change the class regions without
changing the effective areas of class regions on
Scanned with CamScannerinit) ay
ine) =
[Bem
Fig, 45.4
either side of a hyperplane, the impurity measure
of the hyperplane will not change.
* Thus the impurity measures do not really capture
the geometric structure of class distributions, Also,
all the algorithms need to optimize on some
average of impurity of the child nodes and often it
is not clear what kind of average is proper.
Q.46 Which attribute is best ? How to select the best attribute ?
Ans. : • We would like to select the attribute that is most useful for classifying examples.
Information gain measures how well a given
attribute separates the training examples according
to their target classification,
© 1D3 uses this information gain measure to select
among the candidate attributes at each. step while
growing the tree.
‘mn order to define information gain precisely, we
use a measure commonly used in information
theory, called entropy
© Entropy characterizes the impurity of an arbitrary
collection of examples.
• Putting together a decision tree is all a matter of choosing which attribute to test at each node in the tree. We shall define a measure called information gain which will be used to decide which attribute to test at each node. Information gain is itself calculated using a measure called entropy, which we first define for the case of a binary decision problem and then define for the general case.
• Given a binary categorization, C, and a set of examples, S, for which the proportion of examples categorized as positive by C is p+ and the proportion of examples categorized as negative by C is p-, then the entropy of S is :
Entropy(S) = - p+ log2(p+) - p- log2(p-)
where S is a sample of training examples, p+ is the proportion of positive examples and p- is the proportion of negative examples.
‘* Imagine having a set of boxes with some balls in. If
all the balls were in a single box, then this would
be nicely ordered, and it would be extremely easy
to find a particular ball.
«If, however, the balls were distributed amongst the
boxes, this would not be so nicely ordered, and it
might take quite a while to find a particular ball.
‘© If we were going to define a measure based on this
notion of purity, we would want to be able to
calculate a value for each box based on the number
of balls in it, then take the sum of these as the
overall measure.
© We would want to reward two situations: nearly
empty boxes, and boxes with nearly all the balls in.
This is the basis for the general entropy measure,
which is defined as follows.
• Given an arbitrary categorization, C, into categories c1, ..., cn, and a set of examples, S, for which the proportion of examples in ci is pi, then the entropy of S is :
Entropy(S) = - Σ (i = 1 to n) pi log2(pi)
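A short sketch (my own, not from the text) of the entropy and Gain(S, A) formulas above, with examples stored as (attributes, category) pairs:

```python
# Sketch of entropy and information gain as defined above.
from collections import Counter
from math import log2

def entropy(examples):
    """Entropy(S) = -sum_i p_i log2 p_i over the categories appearing in S."""
    n = len(examples)
    counts = Counter(category for _, category in examples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(examples)
    subsets = {}
    for attrs, category in examples:
        subsets.setdefault(attrs[attribute], []).append((attrs, category))
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(examples) - remainder

# Toy data (hypothetical): attribute 0 is Weather; categories are activities.
S = [(('Sunny',), 'Cinema'), (('Sunny',), 'Tennis'),
     (('Windy',), 'Cinema'), (('Rainy',), 'Cinema')]
print(entropy(S))              # entropy of the whole set
print(information_gain(S, 0))  # gain of splitting on Weather
```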
Q.47 Explain ID3 algorithm.
‘Ans. : © The calculation for information gain is the
most difficult part of this algorithm,
* ID3 performs a search whereby the search states are
decision trees and the operator involves adding a
node to an existing tree. It uses information gain to
measure the attribute to put in each node, and
performs a greedy search using this measure of
worth,
• Given a set of examples S categorised in categories ci, choose as the root node the attribute which scores the highest for information gain relative to S.

Fig. Q.47.1 : A partially constructed decision tree. Attribute A scores highest for Gain(S, A) and becomes the root; on one branch, attribute B scores highest for Gain(SA, B); a branch whose examples all belong to one category c ends in a leaf node for category c; a branch with no examples having that value gets a default leaf node d, where d is the category containing the most members of the parent's examples.

Q.48 Construct a decision tree using the following instances :
Ans. : Entropy(S) = - p_cinema log2(p_cinema) - p_tennis log2(p_tennis) - p_shopping log2(p_shopping) - p_stay-in log2(p_stay-in)
= -(6/10) log2(6/10) - (2/10) log2(2/10) - (1/10) log2(1/10) - (1/10) log2(1/10)
= 0.4422 + 0.4644 + 0.3322 + 0.3322 = 1.571
and we need to determine the best of :
Gain(S, weather) = 1.571 - (|S_sun|/10) · Entropy(S_sun) - (|S_wind|/10) · Entropy(S_wind) - (|S_rain|/10) · Entropy(S_rain)
= 1.571 - (0.3) · (0.918) - (0.4) · (0.81125) - (0.3) · (0.918) = 0.70
Gain(S, parents) = 1.571 - (|S_yes|/10) · Entropy(S_yes) - (|S_no|/10) · Entropy(S_no)
= 1.571 - (0.5) · 0 - (0.5) · 1.922 = 1.571 - 0.961 = 0.61
Gain(S, money) = 1.571 - (|S_rich|/10) · Entropy(S_rich) - (|S_poor|/10) · Entropy(S_poor)
= 1.571 - (0.7) · (1.842) - (0.3) · 0 = 1.571 - 1.2894 = 0.2816
This means that the first node in the decision tree will be the weather attribute. From the weather node, we draw a branch for each of the values that weather can take : sunny, windy and rainy (Fig. Q.48.1).
• Now we look at the first branch. S_sunny = {W1, W2, W10}. This is not empty, so we do not put a default categorization leaf node here.
• The categorizations of W1, W2 and W10 are Cinema, Tennis and Cinema respectively. As these are not all the same, we cannot put a categorization leaf node here, so we put an attribute node here.
• The second branch is not empty, and its examples do not all belong to the same class, so an attribute node is put there as well, left blank for now. The same is done with the third branch; hence the partial tree looks like this :
• In effect, we are now interested only in the sunny part of the table, the Weekend examples W1, W2 and W10. Hence we can calculate :
Gain(S_sunny, parents) = 0.918 - (|S_yes|/|S_sunny|) · Entropy(S_yes) - (|S_no|/|S_sunny|) · Entropy(S_no)
= 0.918 - (2/3) · 0 - (1/3) · 0 = 0.918
Gain(S_sunny, money) = 0.918 - (|S_rich|/|S_sunny|) · Entropy(S_rich) - (|S_poor|/|S_sunny|) · Entropy(S_poor)
= 0.918 - (3/3) · (0.918) - (0/3) · 0 = 0.918 - 0.918 = 0
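The arithmetic above can be checked with a few lines of Python (my own sketch; the class counts are taken from the calculation shown, since the instance table itself is not reproduced here):

```python
# Sketch: verify the entropy and information-gain arithmetic shown above.
from math import log2

def entropy(counts):
    """Entropy of a node given the list of class counts at that node."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# Whole data set: 6 Cinema, 2 Tennis, 1 Shopping, 1 Stay-in.
print(entropy([6, 2, 1, 1]))                      # ~1.571

# Gain(S, weather): sunny/windy/rainy subsets of sizes 3, 4, 3 with the
# subset entropies used above (0.918, 0.81125, 0.918).
gain_weather = 1.571 - 0.3 * 0.918 - 0.4 * 0.81125 - 0.3 * 0.918
print(round(gain_weather, 2))                     # ~0.70
```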
1.10 : Hypothesis Space Search in Decision Tree Learning
Q.49 Discuss hypothesis space search in decision tree learning.
Ans. : • The hypothesis space searched by ID3 is the set of possible decision trees.
• ID3 performs a simple-to-complex, hill-climbing search through this hypothesis space.
• It begins with the empty tree, then considers progressively more elaborate hypotheses in search of a decision tree that correctly classifies the training data.
• The information gain measure guides the hill-climbing search.
• Hypothesis space : Set of possible decision trees.
• Search method : Simple-to-complex hill-climbing search; only a single current hypothesis is maintained, unlike the candidate-elimination method, and no backtracking is performed.