Unit3 MLT

The document discusses Information Theory, focusing on decision trees and the ID3 algorithm for classification tasks. It explains concepts such as entropy, information gain, and the structure of decision trees, highlighting their effectiveness in classifying data based on attribute-value pairs. Additionally, it covers inductive and deductive learning, biases in hypothesis space, and the principle of Occam's Razor in the context of machine learning.


UNIT # 3

Information Theory is a field of study concerned with quantifying
information for communication.

•Low Probability Event: High Information (surprising).
•High Probability Event: Low Information (unsurprising).

The calculation of information is often written as h(); for example:

•h(x) = -log2(p(x))
The negative sign ensures that the result is always positive or zero.
Information will be zero when the probability of an event is 1.0, i.e. a
certainty: there is no surprise.
For a random variable with n possible outcomes, the average information is the
entropy, H(X) = -Σ (i = 1 to n) p(xi) log2(p(xi)), where n is the number of outcomes.
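The relationship between probability and surprise can be checked numerically. A minimal sketch in plain Python (standard library only), using base-2 logs so the result is in bits:

```python
import math

def information(p):
    """Surprisal h(x) = -log2(p(x)) of an event with probability p, in bits."""
    return -math.log2(p)

print(information(0.1))   # rare event: high information (surprising)
print(information(0.9))   # likely event: low information (unsurprising)
print(information(1.0))   # certain event: zero information (no surprise)
```

A fair-coin flip (p = 0.5) carries exactly 1 bit of information.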
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
Each internal node represents a test on an attribute. Each leaf node
represents a class.
The benefits of having a decision tree are as follows −
•It does not require any domain knowledge.
•It is easy to comprehend.
•The learning and classification steps of a decision tree are simple and
fast.
Decision Tree Induction Algorithm : ID3 ALGORITHM.
ID3 ALGORITHM: J. Ross Quinlan, a machine learning researcher, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser 3) in the 1980s.
Later, he presented C4.5, the successor of ID3.
ID3 and C4.5 adopt a greedy approach: there is no
backtracking, and the trees are constructed in a top-down, recursive,
divide-and-conquer manner.
Decision Tree Learning
Based on the features in our training dataset, the decision tree model learns a series of
questions to infer the class labels of the examples.
For example, we could simply define a cut-off value along the sepal width feature axis and
ask a binary question: "Is the sepal width ≥ 2.8 cm?". Using the decision algorithm, we start
at the tree root and split the data on the feature that results in the largest information gain
(IG).
In an iterative process, we can then repeat this splitting procedure at each child node until
the leaves are pure. This means that the training examples at each node all belong to the
same class. In practice, this can result in a very deep tree with many nodes, which can easily
lead to overfitting. Thus, we typically want to prune the tree by setting a limit for the
maximal depth of the tree.
Decision Tree Classification
The main components of a decision tree are the root node, internal decision nodes,
branches, and leaf nodes.

It is a top-down approach which helps us to make decisions about the target
variable.
A decision tree splits data into multiple sets of data. Each of these sets is then
further split into subsets to arrive at a decision.
Decision Tree
Decision trees classify instances by sorting them down the tree from the root to some leaf
node, which provides the classification of the instance.
Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute.
An instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value of the
attribute in the given example. This process is then repeated for the subtree rooted at the
new node. For example, a decision tree can classify Saturday mornings according to whether
they are suitable for playing tennis,
e.g. (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)
Decision Tree
In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values
of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and
the tree itself to a disjunction of these conjunctions.
(Outlook = Sunny Λ Humidity = Normal) V (Outlook = Overcast) V (Outlook = Rain Λ Wind = Weak)
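The disjunction above corresponds to a tree that can be written as nested dictionaries and traversed from root to leaf. This is a sketch for illustration only: the dict layout and the `classify` helper are assumptions, and the tree encodes exactly the three "Yes" paths of the expression (all other paths give "No"):

```python
# The PlayTennis tree implied by the disjunction above, as nested dicts:
# Sunny -> test Humidity; Overcast -> Yes; Rain -> test Wind.
tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
}}

def classify(node, instance):
    """Walk down the tree, following the branch that matches each tested attribute."""
    while isinstance(node, dict):
        attribute = next(iter(node))          # attribute tested at this node
        node = node[attribute][instance[attribute]]
    return node                               # a leaf: the class label

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(tree, instance))   # this instance reaches the "No" leaf
```

Note that Temperature never gets tested: the tree only consults the attributes on the path actually taken.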
APPROPRIATE PROBLEMS FOR DECISION
TREE LEARNING
• Instances are represented by attribute-value pairs: Instances are described by a fixed set of attributes
(e.g. Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when
each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). However,
extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing
Temperature numerically).

• The target function has discrete output values: The decision tree assigns a Boolean classification
(e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more
than two possible output values. A more substantial extension allows learning target functions with
real-valued outputs, though the application of decision trees in this setting is less common.

• The training data may contain errors.

• The training data may contain missing attribute values.


APPROPRIATE PROBLEMS FOR
DECISION TREE LEARNING
Decision tree learning has therefore been applied to problems such as learning to classify
medical patients by their disease, equipment malfunctions by their cause, and loan
applicants by their likelihood of defaulting on payments. Such problems, in which the task
is to classify examples into one of a discrete set of possible categories, are often referred to
as classification problems.
THE BASIC DECISION TREE LEARNING
ALGORITHM

Most algorithms that have been developed for learning decision trees are variations on a
core algorithm that employs a top-down, greedy search through the space of possible
decision trees. This approach is exemplified by the ID3(Iterative Dichotomiser 3) algorithm
(Quinlan 1986) and its successor C4.5 (Quinlan 1993). Our basic algorithm, ID3, learns
decision trees by constructing them top-down, beginning with the question "which attribute
should be tested at the root of the tree? To answer this question, each instance attribute is
evaluated using a statistical test to determine how well it alone classifies the training
examples. The best attribute is selected and used as the test at the root node of the tree. A
descendant of the root node is then created for each possible value of this attribute, and the
training examples are sorted to the appropriate descendant node (i.e., down the branch
corresponding to the example's value for this attribute).
Construction of Decision Tree (ID3)
The entire process is then repeated using the training examples associated with each descendant
node to select the best attribute to test at that point in the tree. This forms a greedy search
for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier
choices. A simplified version of the algorithm, specialized to learning Boolean-valued
functions (i.e., concept learning), is given below.
We have four attributes in the above example. We
need to find the attribute with the maximum
information gain.
First, calculate the entropy of the entire data set.
Then the entropy of each attribute is calculated.
Now, the information gain for each attribute is
calculated. The attribute with the maximum
information gain is selected as the root node.
The entropy and gain calculations are carried out for
each attribute in turn: Outlook, Temp, Humidity, Wind.
Which Attribute Is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each node in
the tree. We would like to select the attribute that is most useful for classifying examples.
What is a good quantitative measure of the worth of an attribute? We will define a statistical
property, called information gain, that measures how well a given attribute separates the
training examples according to their target classification. ID3 uses this information gain
measure to select among the candidate attributes at each step while growing the tree.
• Create a Root node for the tree
• If all Examples are positive, Return the single-node tree Root, with label = +
• If all Examples are negative, Return the single-node tree Root, with label = -
• If Attributes is empty, Return the single-node tree Root, with label = most common value of Target
attribute in Examples
• Otherwise Begin
• A ←the attribute from Attributes that best* classifies Examples
• The decision attribute for Root ← A
• For each possible value, vi, of A,
• Add a new tree branch below Root, corresponding to the test A = vi
• Let Examples_vi be the subset of Examples that have value vi for A
• If Examples_vi is empty
• Then below this new branch add a leaf node with label = most common value of
Target attribute in Examples
• Else below this new branch add the subtree ID3(Examples_vi, Target attribute,
Attributes – {A})
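The pseudocode above can be sketched as a small recursive Python function. This is a minimal illustration under stated assumptions, not a production implementation: the toy four-example weather data, the attribute names, and the nested-dict tree representation are all made up for the sketch, and branches are only created for attribute values that actually occur in the data (so the empty-subset case of the pseudocode never triggers here):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from splitting `examples` on `attr`."""
    base = entropy([e[target] for e in examples])
    for v in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == v]
        base -= len(subset) / len(examples) * entropy(subset)
    return base

def id3(examples, target, attributes):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # all examples share one class: leaf
        return labels[0]
    if not attributes:                   # no attributes left: most common class
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: the attribute that best classifies Examples (max gain).
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:        # one branch per observed value
        subset = [e for e in examples if e[best] == v]
        tree[best][v] = id3(subset, target,
                            [a for a in attributes if a != best])
    return tree

# Hypothetical four-example weather data, for illustration only.
data = [
    {"Outlook": "Sunny",    "Humidity": "High",   "Play": "No"},
    {"Outlook": "Sunny",    "Humidity": "Normal", "Play": "Yes"},
    {"Outlook": "Overcast", "Humidity": "High",   "Play": "Yes"},
    {"Outlook": "Rain",     "Humidity": "High",   "Play": "No"},
]
tree = id3(data, "Play", ["Outlook", "Humidity"])
print(tree)   # Outlook has the highest gain on this data, so it becomes the root
```

As in the pseudocode, the search is greedy and top-down: once an attribute is chosen it is never reconsidered.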
Biased and Unbiased Hypothesis
Biased hypothesis space: it does not consider all types of training examples.
E.g. Sunny ^ Warm ^ Normal ^ Strong ^ Cool ^ Change = Yes
^ stands for 'AND'. Here 'Change' is the least effective parameter. A human brain can
ignore it, but the machine is trained on it, so it will not ignore it and will give a 'No'
outcome for 'Same'.

Unbiased hypothesis space: it provides a hypothesis space capable of representing all
types of examples.
Possible instances: 3 × 3 × 2 × 2 = 36
Possible target concepts: 2^36

It is impossible to consider such a huge number of examples, so there is a problem with
both biased and unbiased hypothesis spaces.
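The counting argument above can be checked directly (assuming, as the slide does, four attributes with 3, 3, 2 and 2 possible values):

```python
# Instance space for four attributes with 3, 3, 2 and 2 values respectively.
n_instances = 3 * 3 * 2 * 2      # 36 possible instances
# An unbiased hypothesis space must be able to represent every possible
# Boolean labelling of those 36 instances, i.e. every subset of them.
n_concepts = 2 ** n_instances    # 2^36 = 68,719,476,736 target concepts
print(n_instances, n_concepts)
```

Even for this tiny attribute space, enumerating all 2^36 candidate concepts is clearly infeasible, which is the point the slide makes.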
Inductive and Deductive Learning
Inductive Learning: this type of learning is through examples;
rules are derived from the examples.
Deductive Learning: already existing rules are applied to the
examples.
Humans can be given theory to understand a concept, but in
the case of machine training we cannot apply theory; instead, examples
are given to train it. Different situations are given to the machine for
training. So inductive learning is for machines, and deductive
learning is for humans.
Inductive Bias Inference
Idea of inductive bias: on the basis of previous training, the machine generalizes the
concept to new examples. E.g. building a house without any supervision of a civil
engineer.
The > symbol is used for "inductively inferred from":
x > y means y is inductively inferred from x. Here x is the previously seen training data
and y is handled from the experience of x.
Example: Learning algorithm = L
Training data Dc = {x, c(x)}
New instance = xi
Classification represented as L(xi, Dc)
(Dc ^ xi) > L(xi, Dc)
So the classification of the new instance is inductively inferred from the previous data
set Dc and xi.
Inductive Bias
The inductive bias (also known as learning bias) of a learning algorithm is the set of
assumptions that the learner uses to predict outputs of given inputs that it has not
encountered. In machine learning, one aims to construct algorithms that are able to learn to
predict a certain target output.

Inductive bias in Decision Tree:

❖Approximate inductive bias of ID3: Shorter trees are preferred over longer trees. Trees that place high
information gain attributes close to the root are preferred over those that do not.
❖Occam's Razor: prefer the simplest hypothesis that fits the data.
❖Preference or search bias: preference for certain hypotheses over others with no hard restriction on
the hypotheses that can be enumerated. ID3 demonstrates a preference bias.
❖Restriction or language bias: a categorical restriction on the set of hypotheses considered. The
candidate elimination algorithm demonstrates a restriction bias.
❖In general, it is better to have a preference bias than a restriction bias. However, some learning
systems have both biases.
Occam’s Razor Hypothesis
It is based on the philosophical principle of only believing explanations that are simple and
include fewer entities, fewer suppositions, etc.
Occam's Razor states that, of two or more competing statements, the one with the
fewest assumptions is most likely to be correct.
When there are multiple hypotheses to solve a problem, the
simpler one is to be preferred. It is not clear as to whom this
principle can be conclusively attributed to, but William of Occam’s
(c. 1287 – 1347) preference for simplicity is well documented.
Hence this principle goes by the name, “Occam’s razor”.
This often means cutting off other possibilities or explanations,
thus “razor” appended to the name of the principle. It should be
noted that these explanations or hypotheses should lead to the
same result.
Occam’s razor is one of the simplest examples of inductive bias.
It involves a preference for the simpler hypothesis that best fits the
data. Though the razor can be used to eliminate other hypotheses, it
does not by itself prove that the simpler hypothesis is correct.
Entropy
In order to define information gain precisely, we begin by defining a measure commonly
used in information theory, called entropy, that characterizes the (im)purity of an arbitrary
collection of examples. Given a collection S, containing positive and negative examples of
some target concept, the entropy of S relative to this Boolean classification is:

Entropy(S) = -p(+ve) log2 p(+ve) - p(-ve) log2 p(-ve)

where p(+ve) is the proportion of positive examples in S and p(-ve) is the proportion of
negative examples in S.

ID3(Examples, Target attribute, Attributes)


• Examples are the training examples.
• Target attribute is the attribute whose value is to be predicted by the tree.
• Attributes is a list of other attributes that may be tested by the learned decision tree.
• Returns a decision tree that correctly classifies the given examples.
Example
To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9
positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a
sample of data). Then the entropy of S relative to this Boolean classification is

Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

• Notice that the entropy is 0 if all members of S belong to the same class. For example, if
all members are positive (p(+ve) = 1), then p(-ve) is 0, and Entropy(S) = -1 · log2(1) -
0 · log2(0) = -1 · 0 - 0 = 0 (taking 0 · log2(0) to be 0).
• Note the entropy is 1 when the collection contains an equal number of positive and
negative examples.
• If the collection contains unequal numbers of positive and negative examples, the entropy
is between 0 and 1.
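These three cases can be reproduced in a few lines of Python; the `entropy(pos, neg)` helper is a hypothetical name used only for this sketch:

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for n in (pos, neg):
        if n:                       # define 0 * log2(0) = 0
            p = n / total
            result -= p * math.log2(p)
    return result

print(round(entropy(9, 5), 3))      # [9+, 5-]: 0.940
print(entropy(14, 0))               # pure collection: 0.0
print(entropy(7, 7))                # evenly split: 1.0
```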
Entropy
• A decision tree is built top-down from a root node and involves partitioning the data into
subsets that contain instances with similar values (homogenous). ID3 algorithm uses entropy
to calculate the homogeneity of a sample. If the sample is completely homogeneous the
entropy is zero and if the sample is an equally divided it has entropy of one.
• Entropy as defined above covers the special case where the target classification is Boolean.
• If the target attribute can take on c different values, then the entropy of S relative to this
c-wise classification is defined as

Entropy(S) = Σ (i = 1 to c) -pi log2(pi)

• where pi is the proportion of S belonging to class i. Note the logarithm is still base 2 because
entropy is a measure of the expected encoding length measured in bits. Note also that if the
target attribute can take on c possible values, the entropy can be as large as log2(c).
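A quick numeric check of the log2(c) bound, using an assumed helper that takes the class proportions pi directly:

```python
import math

def entropy(proportions):
    """Entropy of S for a c-wise classification, given class proportions p_i."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# With c equally likely classes the entropy reaches its maximum, log2(c).
for c in (2, 4, 8):
    uniform = [1 / c] * c
    print(c, entropy(uniform), math.log2(c))
```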
Interpretation of Entropy
• The entropy function relative to a Boolean classification, as
the proportion, p(+ve), of positive examples varies between
0 and 1.
• One interpretation of entropy from information theory is
that it specifies the minimum number of bits of information
needed to encode the classification of an arbitrary member
of S (i.e., a member of S drawn at random with uniform
probability).
• If p(+ve) is 0.8, then a collection of messages can be
encoded using on average less than 1 bit per message by
assigning shorter codes to collections of positive examples
and longer codes to less likely negative examples.
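The claim for p(+ve) = 0.8 can be checked numerically: the entropy gives the minimum expected code length in bits, and it comes out below 1 bit per message:

```python
import math

# Expected bits per message when p(+ve) = 0.8 (the entropy bound):
p = 0.8
bits = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(round(bits, 3))   # about 0.722 bits, i.e. less than 1 bit per message
```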
INFORMATION GAIN
Given entropy as a measure of the impurity in a collection of training examples, we can now
define a measure of the effectiveness of an attribute in classifying the training data. The
measure we will use, called information gain, is simply the expected reduction in entropy
caused by partitioning the examples according to this attribute. More precisely, the
information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is
defined as

Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for
which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Note the first term in the equation
is just the entropy of the original collection S, and the second term is the expected value of
the entropy after S is partitioned using attribute A.
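The definition can be evaluated in plain Python. Using the [9+, 5-] collection with the Wind attribute (Weak → [6+, 2-], Strong → [3+, 3-]); the pair-list encoding of S is an assumption made only for this sketch:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(examples):
    """Gain(S, A) for examples given as (attribute_value, class_label) pairs."""
    labels = [label for _, label in examples]
    result = entropy(labels)                      # Entropy(S)
    for v in {value for value, _ in examples}:    # minus the weighted subset entropies
        subset = [label for value, label in examples if value == v]
        result -= len(subset) / len(examples) * entropy(subset)
    return result

# Wind over the 14-example collection [9+, 5-]:
# Weak -> [6+, 2-], Strong -> [3+, 3-]
S = ([("Weak", "+")] * 6 + [("Weak", "-")] * 2
     + [("Strong", "+")] * 3 + [("Strong", "-")] * 3)
print(round(gain(S), 3))   # 0.048
```

The result, 0.940 - (8/14)·0.811 - (6/14)·1.0 ≈ 0.048, shows Wind is a fairly weak classifier on this data.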
INFORMATION GAIN
The expected entropy described by this second term is simply the sum of the entropies of
each subset Sv, weighted by the fraction of examples(|Sv|/|S|) that belong to Sv. Gain(S, A) is
therefore the expected reduction in entropy caused by knowing the value of attribute A. Put
another way, Gain(S, A) is the information provided about the target function value, given
the value of some other attribute A. The value of Gain(S, A) is the number of bits saved
when encoding the target value of an arbitrary member of S, by knowing the value of
attribute A.
To build a decision tree, we need to calculate two
types of entropy using frequency tables as follows:
Calculate Entropy of Attributes, e.g. Outlook
Choose attribute with the largest information gain as the
decision node, divide the dataset by its branches and repeat the
same process on every branch
A branch with entropy of 0 is a leaf node.
A branch with entropy more than 0 needs further
splitting.
The ID3 algorithm is run recursively on the non-leaf branches,
until all data is classified.
What is Inductive Learning?
From the perspective of inductive learning, we are given input samples (x) and output samples (f(x)),
and the problem is to estimate the function (f). Specifically, the problem is to generalize from the
samples and the mapping so that the estimate is useful for predicting the output for new samples in
the future. In practice it is almost always too hard to estimate the function exactly, so we are looking
for very good approximations of the function. e.g.,
• Credit risk assessment.
• The x is the properties of the customer.
• The f(x) is credit approved or not.
• Disease diagnosis.
• The x are the properties of the patient.
• The f(x) is the disease they suffer from.
• Face recognition.
• The x are bitmaps of people's faces.
• The f(x) is to assign a name to the face.
• Automatic steering.
• The x are bitmap images from a camera in front of the car.
• The f(x) is the degree the steering wheel should be turned.
When Should You Use Inductive Learning?
There are problems where inductive learning is not a good idea. It is important to know when to use,
and when not to use, supervised machine learning.
4 problems where inductive learning might be a good idea:
• Problems where there is no human expert. If people do not know the answer they cannot write a
program to solve it. These are areas of true discovery.
• Humans can perform the task but no one can describe how to do it. There are problems where
humans can do things that computers cannot do, or cannot do well. Examples include riding a bike or
driving a car.
• Problems where the desired function changes frequently. Humans could describe it and they could
write a program to do it, but the problem changes too often. It is not cost effective. Examples include
the stock market.
• Problems where each user needs a custom function. It is not cost effective to write a custom
program for each user. Example is recommendations of movies or books on Netflix or Amazon.
Two perspectives on inductive learning:
• Learning is the removal of uncertainty. Having data removes some uncertainty.
Selecting a class of hypotheses we are removing more uncertainty.
• Learning is guessing a good and small hypothesis class. It requires guessing: we don't
know the solution, so we must use a trial and error process. If you knew the domain with
certainty, you wouldn't need learning. But we are not guessing in the dark.
A Framework For Studying Inductive
Learning
• Training example: a sample from x including its output from the target function
• Target function: the mapping function f from x to f(x)
• Hypothesis: approximation of f, a candidate function.
• Concept: A Boolean target function, positive examples and negative examples for the 1/0
class values.
• Classifier: Learning program outputs a classifier that can be used to classify.
• Learner: Process that creates the classifier.
• Hypothesis space: set of possible approximations of f that the algorithm can create.
• Version space: subset of the hypothesis space that is consistent with the observed data
HYPOTHESIS SPACE SEARCH IN DECISION
TREE LEARNING
As with other inductive learning methods, ID3 can be characterized as searching a space of
hypotheses for one that fits the training examples. The hypothesis space searched by ID3 is
the set of possible decision trees. ID3 performs a simple-to-complex, hill-climbing search
through this hypothesis space, beginning with the empty tree, then considering
progressively more elaborate hypotheses in search of a decision tree that correctly classifies
the training data. The evaluation function that guides this hill-climbing search is the
information gain measure. By viewing ID3 in terms of its search space and search strategy,
we can get some insight into its capabilities and limitations.
• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions,
relative to the available attributes. Because every finite discrete-valued function can be represented by
some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis
spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not
contain the target function.
• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This
contrasts, for example, with the earlier version space candidate-elimination method, which maintains the
set of all hypotheses consistent with the available training examples. By determining only a single
hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses.
For example, it does not have the ability to determine how many alternative decision trees are consistent
with the available training data, or to pose new instance queries that optimally resolve among these
competing hypotheses.
INDUCTIVE BIAS IN DECISION TREE
LEARNING
Inductive bias is the set of assumptions that, together with the training data, deductively
justify the classifications assigned by the learner to future instances.
Given a collection of training examples, there are typically many decision trees consistent
with these examples. Describing the inductive bias of ID3 therefore consists of describing
the basis by which it chooses one of these consistent hypotheses over the others. Which of
these decision trees does ID3 choose? It chooses the first acceptable tree it encounters in its
simple-to-complex, hill-climbing search through the space of possible trees. Roughly
speaking, then, the ID3 search strategy (a) selects in favor of shorter trees over longer ones,
and (b) selects trees that place the attributes with highest information gain closest to the
root. Because of the subtle interaction between the attribute selection heuristic used by ID3
and the particular training examples it encounters, it is difficult to characterize precisely the
inductive bias exhibited by ID3.
Issues in Decision Tree
1. Determine how deeply to grow the decision tree
2. Handling continuous attributes
3. Choosing an appropriate attribute selection measure
4. Handling training data with missing attribute values
5. Handling attributes with differing costs
6. Improving computational efficiency
Avoiding Overfitting the Data
In the case of noisy data, or when the number of training examples is very small, simple
algorithms like ID3 can produce trees that overfit the training examples.
Overfit: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if there exists some alternative hypothesis h’ ∈ H, such that h has
smaller error than h’ over the training examples, but h’ has a smaller error than h over
the entire distribution of instances. Overfitting is learning too much from the data.
Impact of overfitting: the figure shows the impact of overfitting in a typical application of
ID3: learning which medical patients have a form of diabetes.
The solid line shows the accuracy of the
decision tree over the training examples,
whereas the broken line shows accuracy
measured over an independent set of test
examples (not included in the training set).
Let's consider one false (noisy) example:
Outlook=Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, Play = No
The addition of this incorrect example will now cause ID3 to construct a more
complex tree. The new example will be sorted into the second leaf node from
the left in the learned tree.
The result is that ID3 will output a decision tree (h) that is more complex than
the original tree (h’).
Now h will fit the collection of training examples perfectly, whereas the simpler
h’ will not. However, given that the new decision node is simply a
consequence of fitting the noisy training example, we expect h’ to outperform
h over subsequent data drawn from the same instance distribution.
The above example illustrates how random noise in the training
examples can lead to overfitting.
Overfitting is also possible even when the training data are
noise-free, especially when small numbers of examples are associated with
leaf nodes.
Avoiding Overfitting —
There are several approaches to avoiding overfitting in decision tree learning.
These can be grouped into two classes:
- Pre-pruning (avoidance): Stop growing the tree earlier, before it reaches
the point where it perfectly classifies the training data
- Post-pruning (recovery): Allow the tree to overfit the data, and then
post-prune the tree
A key question is what criterion should be used to determine the correct final
tree size.
Criterion used to determine the correct final tree size
•Use a separate set of examples, distinct from the training examples, to
evaluate the utility of post-pruning nodes from the tree
•Use all the available data for training, but apply a statistical test to estimate
whether expanding (or pruning) a particular node is likely to produce an
improvement beyond the training set
•Use measure of the complexity for encoding the training examples and the
decision tree, halting growth of the tree when this encoding size is minimized.
This approach is called the Minimum Description Length
•MDL — Minimize : size(tree) + size (misclassifications(tree))
Reduced Error Pruning
•In reduced-error pruning, each of the decision nodes in the tree is considered a candidate
for pruning.
•Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf
node, and assigning it the most common classification of the training examples affiliated with
that node
•Nodes are removed only if the resulting pruned tree performs no worse than the original over
the validation set.
•Coincidental regularities: when there is only a small number of training examples, some
attribute may be able to partition the examples very well, despite being unrelated to the actual
target function.
•Reduced error pruning has the effect that any leaf node added due to coincidental regularities
in the training set is likely to be pruned, because these same coincidences are unlikely to
occur in the validation set.
The impact of reduced-error pruning on the accuracy of the decision tree
Rule Post-Pruning
•Infer the decision tree from the training set, growing the tree until the training
data is fit as well as possible and allowing overfitting to occur.
•Convert the learned tree into an equivalent set of rules by creating one rule
for each path from the root node to a leaf node.
•Prune (generalize) each rule by removing any preconditions that result in
improving its estimated accuracy.
•Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
Each attribute test along the path from the root to the leaf becomes a rule
antecedent (precondition) and the classification at the leaf node becomes the
rule consequent (postcondition). For example, the leftmost path of the tree in
figure is translated into the rule
IF (Outlook = Sunny) ^ (Humidity = High)
THEN Play Tennis = No
Next, each such rule is pruned by removing any antecedent, or precondition,
whose removal does not worsen its estimated accuracy. Given the above
rule, rule post-pruning would consider removing the preconditions
(Outlook = Sunny) and (Humidity = High)
•It would select whichever of these pruning steps produced the greatest
improvement in estimated rule accuracy, then consider pruning the second
precondition as a further pruning step.
•No pruning step is performed if it reduces the estimated rule accuracy.
There are three main advantages of converting the decision tree to
rules before pruning:
•Converting to rules allows distinguishing among the different contexts in
which a decision node is used. Because each distinct path through the
decision tree node produces a distinct rule, the pruning decision regarding
that attribute test can be made differently for each path.
•Converting to rules removes the distinction between attribute tests that occur
near the root of the tree and those that occur near the leaves. Thus, it avoids
messy bookkeeping issues such as how to reorganize the tree if the root
node is pruned while retaining part of the subtree below this test.
•Converting to rules improves readability. Rules are often easier for people to
understand.
2. Incorporating Continuous-Valued Attributes
Our initial definition of ID3 is restricted to attributes that take on a discrete set
of values.
1. The target attribute whose value is predicted by learned tree must be
discrete valued.
2. The attributes tested in the decision nodes of the tree must also be discrete
valued.
This second restriction can easily be removed so that continuous-valued
decision attributes can be incorporated into the learned tree. For example if
we want to acquire the marks obtained by students or sale prices of houses,
which is not discrete data. For an attribute A that is continuous-valued, the
algorithm can dynamically create a new Boolean attribute Ac that is true if A <
c and false otherwise. The only question is how to select the best value for
the threshold c.
Suppose the training examples have the following values for Temperature and
the target attribute PlayTennis:

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
Take two consecutive values between which the decision changes, then find
the average of the two values. Compute the information gain for each such
candidate value; the value with the highest information gain gives the
threshold.
•Pick a threshold, c, that produces the greatest information gain. By
sorting the examples according to the continuous attribute A, then identifying
adjacent examples that differ in their target classification, we can generate a
set of candidate thresholds midway between the corresponding values of A. It
can be shown that the value of c that maximizes information gain must always
lie at such a boundary. These candidate thresholds can then be evaluated by
computing the information gain associated with each.
•In the current example, there are two candidate thresholds, corresponding to
the values of Temperature at which the value of PlayTennis changes: (48 +
60)/2, and (80 + 90)/2.
•The information gain can then be computed for each of the candidate
attributes, Temperature >54, and Temperature >85 and the best can be
selected (Temperature >54)
•This dynamically created boolean attribute can then compete with the other
discrete-valued candidate attributes available for growing the decision tree.
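A short sketch of this threshold-selection procedure, assuming the six-example Temperature/PlayTennis data above (the function names here are illustrative, not from the text):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(values, labels, threshold):
    """Information gain of splitting on value < threshold."""
    below = [l for v, l in zip(values, labels) if v < threshold]
    above = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return entropy(labels) - (len(below) / n) * entropy(below) \
                           - (len(above) / n) * entropy(above)

def candidate_thresholds(values, labels):
    """Midpoints between adjacent (sorted) examples whose class differs."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][1] != pairs[i + 1][1]]

temp = [40, 48, 60, 72, 80, 90]
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']
cands = candidate_thresholds(temp, play)              # [54.0, 85.0]
best = max(cands, key=lambda c: info_gain(temp, play, c))   # 54.0
```

Running this reproduces the two candidate thresholds 54 and 85 from the text, and selects 54 because Temperature > 54 yields the larger information gain.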
3. Alternative Measures for Selecting Attributes
There is a natural bias in the information gain measure that favors attributes
with many values over those with few values.
•As an extreme example, consider the attribute Date, which has a very large
number of possible values. It will give a perfect classification of the training
data (in contrast with Weather, it makes it appear that whether tennis is
played is determined by the particular date). What is wrong with the attribute
Date? Simply put, it has so many possible values that it is bound to separate
the training examples into very small subsets. Because of this, it will have a
very high information gain relative to the training examples.
•However, despite having very high information gain, it is a very poor
predictor of the target function over unseen instances; i.e., if a new date is
given, the tree will not be able to give a decision. This is because we are
relying completely on information gain.
•One way to avoid this difficulty is to select the decision attribute based on
some measure other than information gain.
Alternate measure-1
The gain ratio measure penalizes attributes such as Date by incorporating a
term, called split information, that is sensitive to how broadly and uniformly the
attribute splits the data:

SplitInformation(S, A) = - Σ_{i=1..c} (|Si| / |S|) log2 (|Si| / |S|)

where S1 through Sc are the c subsets of examples resulting from partitioning
S by the c-valued attribute A.
Note that Split Information is actually the entropy of S with respect to the
values of attribute A. This is in contrast to our previous uses of entropy, in
which we considered only the entropy of S with respect to the target attribute
whose value is to be predicted by the learned tree.
The Gain Ratio measure is defined in terms of the earlier Gain measure, as
well as this Split Information:

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
The Split lnformation term discourages the selection of attributes with many
uniformly distributed values
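As a quick sketch of how SplitInformation penalizes many-valued attributes, compare a Date-like attribute (every example has a unique value) with a boolean attribute on a 14-example training set (helper names here are illustrative):

```python
import math

def split_information(sizes):
    """Entropy of S with respect to the values of attribute A,
    given the partition sizes |S_1|, ..., |S_c|."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    """GainRatio = Gain / SplitInformation."""
    return gain / split_information(sizes)

# A Date-like attribute splits n examples into n singleton subsets,
# so SplitInformation = log2(n); a boolean attribute that splits the
# data evenly gives SplitInformation = 1.
si_date = split_information([1] * 14)   # log2(14), about 3.81
si_even = split_information([7, 7])     # 1.0
```

Even if Date's Gain is high, dividing by a SplitInformation near log2(14) sharply reduces its GainRatio relative to the evenly splitting boolean attribute.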
Alternate measure-2
An alternative to the Gain Ratio, designed to directly address the above
difficulty is a distance-based measure introduced by Lopez de Mantaras in
1991.
This measure is based on defining a distance metric between partitions of the
data. Each attribute is evaluated based on the distance between the data
partition it creates and the perfect partition (i.e., the partition that perfectly
classifies the training data).
The attribute whose partition is closest to the perfect partition is chosen.
this distance measure avoids the practical difficulties associated with the
GainRatio measure, and it produces significantly smaller trees in the case of
data sets whose attributes have very different numbers of values.
4. Handling Missing Attribute Values
In certain cases, the available data may be missing values for some
attributes. For example, in a medical domain in which we wish to predict
patient outcome based on various laboratory tests, it may be that the
Blood-Test-Result is available only for a subset of the patients. In such cases,
it is common to estimate the missing attribute value based on other examples
for which this attribute has a known value.
Method-1
One strategy for dealing with the missing attribute value is to assign it the
value that is most common among training examples at node n.
Method-2
A second, more complex procedure is to assign a probability to each of the
possible values of A. These probabilities can be estimated again based on
the observed frequencies of the various values for A among the examples at
node n.
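Method 1 can be sketched as below (an illustrative sketch; the record layout and the name `impute_most_common` are assumptions, not from the text):

```python
from collections import Counter

def impute_most_common(examples, attr):
    """Method 1: fill a missing attribute value with the value most
    common among the training examples at the current node."""
    known = [e[attr] for e in examples if e[attr] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    # Return copies with None replaced by the most common value.
    return [dict(e, **{attr: e[attr] if e[attr] is not None else most_common})
            for e in examples]

patients = [{'BloodTest': 'pos'}, {'BloodTest': 'pos'},
            {'BloodTest': 'neg'}, {'BloodTest': None}]
filled = impute_most_common(patients, 'BloodTest')   # last one becomes 'pos'
```

Method 2 would instead assign fractional examples: here the missing value would count as 'pos' with probability 2/3 and 'neg' with probability 1/3.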
5. Handling Attributes with Differing Costs

In some learning tasks the instance attributes may have associated costs. For
example, in learning to classify medical diseases we might describe patients
in terms of attributes such as Temperature, BiopsyResult, Pulse,
BloodTestResults, etc. These attributes vary significantly in their costs, both
in terms of monetary cost and cost to patient comfort.
In such tasks, we would prefer decision trees that use low-cost attributes
where possible, relying on high-cost attributes only when needed to produce
reliable classifications.
ID3 can be modified to consider attribute costs by introducing a cost term into
the attribute selection measure. For example, we might divide the Gain by the
cost of the attribute, so that lower-cost attributes are preferred. This gives
more weight to low-cost attributes such as Temperature and Pulse, which can
then be tested higher in the tree, while expensive attributes such as
BiopsyResult are used only when needed.
ID3 can be improved by introducing a cost term into attribute selection
measure.
For example, Gain should be divided by the cost of the attribute so that lower
cost attributes must be preferred.
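The Gain/Cost idea can be compared across attributes as below (a sketch; the gain and cost numbers are invented for illustration, not from the text):

```python
def gain_per_cost(gain, cost):
    """Simple cost-sensitive score: prefer high gain per unit cost."""
    return gain / cost

# Hypothetical numbers: BiopsyResult is more informative but far more
# expensive, so the cheap Temperature test is preferred near the root.
attrs = {'Temperature': (0.30, 1.0),     # (Gain, cost)
         'BiopsyResult': (0.55, 10.0)}
best = max(attrs, key=lambda a: gain_per_cost(*attrs[a]))   # 'Temperature'
```

Other published variants weight cost differently (e.g., squaring the gain or exponentiating it), but the principle is the same: the selection measure trades information gain against measurement cost.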
Short Answers
Instance Based Learning
Other algorithms take the examples, set the target function and learn from them, but instance-based
learning methods simply store the training examples instead of learning an explicit description of the
target function.
– Generalizing the examples is postponed until a new instance must be classified. – When a new
instance is encountered, its relationship to the stored examples is examined in order to assign a target
function value for the new instance.
When New data point given then it compare that with the stored training examples and classifies it.
• Instance-based learning includes nearest neighbor, locally weighted regression and case-based
reasoning methods.
• Instance-based methods are sometimes referred to as lazy learning methods because they delay
processing until a new instance must be classified. Decision tree learning, candidate elimination, and C4.5 are
eager learning algorithms.
• A key advantage of lazy learning is that instead of estimating the target function once for the entire
instance space, these methods can estimate it locally and differently for each new instance to be
classified.
Instance-based learning methods such as nearest neighbor and locally weighted
regression are conceptually straightforward approaches to approximating real-valued
or discrete-valued target functions. Learning in these algorithms consists of simply
storing the presented training data. When a new query instance is encountered, a set
of similar related instances is retrieved from memory and used to classify the new
query instance.
In eager learning, when one example is given, a specific hypothesis is
created. When more examples are given, the hypothesis is generalized based on
those examples.
In instance-based learning, classification or categorization of examples is not
done at the start but at the end, when a new data point is available.
One disadvantage of instance-based approaches is that the cost of classifying new
instances can be high. This is due to the fact that nearly all computation or processing
takes place at classification.
A second disadvantage to many instance-based approaches, especially
nearest-neighbor approaches, is that they typically consider all attributes of the
instances when attempting to retrieve similar training examples from memory. If the
target concept depends on only a few of the many available attributes, then the
instances that are truly most "similar" may well be a large distance apart.
In k-nearest neighbor, if 10 attributes are present and only 8 attributes are
significant while the remaining 2 are not significant for assigning the target
label, the algorithm still considers all the attributes equally. Due to the
importance given to insignificant attributes, the target may be classified in the
wrong category.
K- Nearest Neighbor
All instances correspond to points in the n-dimensional space ℝⁿ. For
simplicity and for understanding we consider a two-dimensional space only.
The nearest neighbors of an instance are defined in terms of the standard
Euclidean distance.
If two points a and b are given in the space, then the Euclidean distance is
given by

d(a, b) = sqrt((a1 - b1)² + (a2 - b2)²)

Other distance measures also exist, but the Euclidean distance measure is most
commonly used in the k-nearest neighbor algorithm.
If three attributes are given for a new instance a1, a2, a3 and same three attribute are available for
other instances then the distance between a1 of new instance and a1 of some other instance is
calculated.
Similarly distance between every attribute of new instance and every attribute of existing instance
is calculated. This is repeated for every other existing instances.
The instance with minimum distance with new instance is taken and new instance is put in the
class of that particular instance.
More precisely, let an arbitrary instance x be described by the feature vector

⟨a1(x), a2(x), ..., an(x)⟩

where ar(x) denotes the value of the rth attribute of instance x. Then the
distance between two instances xi and xj is defined to be d(xi, xj), where

d(xi, xj) = sqrt( Σ_{r=1..n} (ar(xi) - ar(xj))² )
Wherever the distance is minimum that is the instance with which the new instance is
matching.
In nearest neighbor learning the target function may be discrete-valued or
real-valued. Let us consider learning a discrete-valued target function of the
form f : ℝⁿ → V, where V is the finite set {v1, v2, ..., vs}. In the case of binary
classification we have only two values, say v1 and v2; in other discrete
classifications we may have more values.
The value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the
most common value of f among the k training examples nearest to xq. If we
choose k = 1, then the 1-NEAREST NEIGHBOR algorithm assigns to f̂(xq) the
value f(xi), where xi is the training instance nearest to xq. For larger values of
k, the algorithm assigns the most common value among the k nearest training
examples.
Training algorithm: For each training example (x, f(x)), add the example to the
list training_examples.
Classification algorithm: Given a query instance xq to be classified, let
x1 ... xk denote the k instances from training_examples that are nearest to xq.
Return

f̂(xq) ← argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(xi))

where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise.


Consider the case where the instances are points in a two-dimensional space
and the target function is boolean-valued. The positive and negative training
examples are shown by "+" and "-" respectively, and a query point xq is shown
as well. Note that the 1-NEAREST NEIGHBOR algorithm classifies xq as a
positive example in this figure, whereas the 5-NEAREST NEIGHBOR algorithm
classifies it as a negative example.
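The discrete-valued k-NEAREST NEIGHBOR procedure can be sketched as below (illustrative; the toy points are chosen so that k = 1 and k = 5 disagree, mirroring the figure):

```python
import math
from collections import Counter

def euclidean(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training, xq, k):
    """Return the most common label among the k training examples
    nearest to the query point xq."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# One nearby positive surrounded by four farther negatives.
train = [((0.1, 0.0), '+'), ((1.0, 0.0), '-'), ((0.0, 1.0), '-'),
         ((-1.0, 0.0), '-'), ((0.0, -1.0), '-')]
xq = (0.0, 0.0)
print(knn_classify(train, xq, 1))   # +
print(knn_classify(train, xq, 5))   # -
```

With k = 1 only the nearest point votes, so xq is labeled positive; with k = 5 the four negatives outvote it.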
The k-NEAREST NEIGHBOR algorithm is easily adapted to approximating
continuous-valued target functions. To accomplish this, we have the algorithm
calculate the mean value of the k nearest training examples rather than their
most common value. More precisely, to approximate a real-valued target
function f : ℝⁿ → ℝ we replace the final line of the above algorithm by

f̂(xq) ← ( Σ_{i=1..k} f(xi) ) / k
Disadvantage of Basic KNN

New instance: (Height = 157, Weight = 54), k = 3

S.No.  Height  Weight  Target  Distance  Nearest Points
1      150     50      Medium  8.06
2      155     55      Medium  2.24      1
3      160     60      Large   6.71      3
4      161     59      Large   6.40      2
5      158     65      Large   11.05

In this example we can see that using basic KNN the new instance (157, 54) is
classified as Large, since two of its three nearest neighbors are Large. But
those Large neighbors lie at distances 6.71 and 6.40, much greater than the
distance 2.24 to the Medium neighbor, so the new instance should intuitively
be classified as Medium.
Hence, instead of merely counting the occurrences of each target value, some
weightage must also be given to the distance.
Distance-Weighted NEAREST NEIGHBOR Algorithm for discrete-valued
functions

The refinement to the k-NEAREST NEIGHBOR algorithm is to weight the
contribution of each of the k neighbors according to their distance to the
query point xq, giving greater weight to closer neighbors. For example, we
might weight the vote of each neighbor according to the inverse square of its
distance from xq.
Training algorithm: For each training example (x, f(x)), add the example to the
list training_examples.
Classification algorithm: Given a query instance xq to be classified, let
x1 ... xk denote the k instances from training_examples that are nearest to xq.
Return

f̂(xq) ← argmax_{v ∈ V} Σ_{i=1..k} wi δ(v, f(xi)),  where wi = 1 / d(xq, xi)²

Here one extra term wi is taken, equal to the inverse of the square of the
distance between the new instance and the ith nearest instance.
S.No.  Height  Weight  Target  Distance  wi     Nearest Points
1      150     50      Medium  8.06
2      155     55      Medium  2.24      0.199  1
3      160     60      Large   6.71      0.022  3
4      161     59      Large   6.40      0.024  2
5      158     65      Large   11.05

Score(Medium) = 0.199·δ(M,M) + 0.022·δ(M,L) + 0.024·δ(M,L)
              = 0.199 + 0 + 0 = 0.199
Score(Large)  = 0.199·δ(L,M) + 0.022·δ(L,L) + 0.024·δ(L,L)
              = 0 + 0.022 + 0.024 = 0.046

Since 0.199 > 0.046, the new instance is now classified as 'Medium', the class
of the neighbor at the smallest distance from the new instance.
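The worked example above can be reproduced with a short sketch (illustrative helper names; the computed weights differ slightly from the rounded table values):

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn_classify(training, xq, k):
    """Distance-weighted k-NN: each of the k nearest neighbors votes
    with weight w_i = 1 / d(xq, x_i)^2."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    votes = defaultdict(float)
    for x, label in nearest:
        d = euclidean(x, xq)
        if d == 0:              # query matches a stored instance exactly
            return label
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)

data = [((150, 50), 'Medium'), ((155, 55), 'Medium'),
        ((160, 60), 'Large'), ((161, 59), 'Large'), ((158, 65), 'Large')]
print(weighted_knn_classify(data, (157, 54), 3))   # Medium
```

The single Medium neighbor at distance 2.24 carries roughly 0.2 of vote weight, outweighing the two Large neighbors combined (about 0.046), so the weighted classifier flips the plain-KNN answer from Large to Medium.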
Distance-Weighted NEAREST NEIGHBOR Algorithm for real-valued
functions

Training algorithm: For each training example (x, f(x)), add the example to the
list training_examples.
Classification algorithm: Given a query instance xq, let x1 ... xk denote the k
instances from training_examples that are nearest to xq. Return

f̂(xq) ← ( Σ_{i=1..k} wi f(xi) ) / ( Σ_{i=1..k} wi ),  where wi = 1 / d(xq, xi)²

S.No.  Height  Weight  Target  Distance  wi     Nearest Points
1      150     50      1.5     8.06
2      155     55      1.2     2.24      0.199  1
3      160     60      1.8     6.71      0.022  3
4      161     59      2.1     6.40      0.024  2
5      158     65      1.7     11.05
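For the real-valued targets in the table above, the same inverse-square weighting yields a prediction pulled strongly toward the nearest neighbor's value of 1.2 (a sketch; helper names are illustrative):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn_regress(training, xq, k):
    """Distance-weighted k-NN for a real-valued target:
    f_hat(xq) = sum(w_i * f(x_i)) / sum(w_i), with w_i = 1/d^2."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    num = den = 0.0
    for x, fx in nearest:
        d = euclidean(x, xq)
        if d == 0:              # exact match: return its stored value
            return fx
        w = 1.0 / d ** 2
        num += w * fx
        den += w
    return num / den

data = [((150, 50), 1.5), ((155, 55), 1.2),
        ((160, 60), 1.8), ((161, 59), 2.1), ((158, 65), 1.7)]
pred = weighted_knn_regress(data, (157, 54), 3)   # ≈ 1.34
```

The prediction is a weighted mean of 1.2, 2.1, and 1.8, dominated by the close neighbor with target 1.2.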
Give a Decision Tree to represent the following Boolean function.
Every variable in the Boolean function, such as A, B, C, has two possibilities:
True and False.
If the Boolean function is True we write Yes.
If the Boolean function is False we write No.
Every Boolean function is evaluated from left to right.
(A ∧ ¬B) ∨ (¬A ∧ B)
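Before drawing the tree, it can help to tabulate the function; (A ∧ ¬B) ∨ (¬A ∧ B) is simply A XOR B, so the tree answers Yes exactly when A and B differ (a small sketch):

```python
def f(A, B):
    """(A and not B) or (not A and B) -- i.e., A XOR B."""
    return (A and not B) or (not A and B)

# Truth table, evaluated left to right as the decision tree does:
# the root tests A, and each branch then tests B.
for A in (True, False):
    for B in (True, False):
        print(A, B, 'Yes' if f(A, B) else 'No')
```

This matches the tree with A at the root and B tested on both branches: the A = True branch answers Yes when B = False, and the A = False branch answers Yes when B = True.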
Algorithms for Instance-Based
Learning
• k-Nearest Neighbor Learning,
• Locally Weighted Regression,
• Radial basis function networks, Case-based learning
• These are simple algorithms. Learning in these algorithms consists of simply storing the training data.
Instance-Based Learning
• Idea:
• Similar examples have similar label.
• Classify new examples like similar training examples.
• Algorithm:
• Given some new example x for which we need to predict its class y
• Find most similar training examples
• Classify x “like” these most similar examples
• Questions:
• How to determine similarity?
• How many similar training examples to consider?
• How to resolve inconsistencies among the training examples?
1-Nearest Neighbor
• One of the simplest of all machine learning classifiers
• Simple idea: label a new point the same as the closest known point

Label it red.
1-Nearest Neighbor
• A type of instance-based learning
• Also known as “memory-based” learning
• Forms a Voronoi tessellation of the instance space
Distance Metrics
• Different metrics can change the decision surface

Two example metrics on a two-dimensional instance space:

Dist(a,b) = (a1 - b1)² + (a2 - b2)²        Dist(a,b) = (a1 - b1)² + (3a2 - 3b2)²

• Standard Euclidean distance metric:
• Two-dimensional: Dist(a,b) = sqrt((a1 - b1)² + (a2 - b2)²)
• Multivariate: Dist(a,b) = sqrt(∑ (ai - bi)²)
1-NN’s Aspects as an Instance-Based Learner:

A distance metric
• Euclidean
• When different units are used for each dimension, normalize each dimension
by its standard deviation
• For discrete data, can use Hamming distance:
D(x1, x2) = number of features on which x1 and x2 differ
• Others (e.g., normal, cosine)
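The two points above — normalizing each dimension by its standard deviation, and Hamming distance for discrete data — can be sketched as below (helper names are illustrative):

```python
import statistics

def hamming(x1, x2):
    """D(x1, x2) = number of features on which x1 and x2 differ."""
    return sum(a != b for a, b in zip(x1, x2))

def normalize(points):
    """Divide each dimension by its standard deviation so that
    dimensions measured in different units contribute comparably."""
    dims = list(zip(*points))
    sds = [statistics.stdev(d) for d in dims]
    return [tuple(v / s for v, s in zip(p, sds)) for p in points]

print(hamming(('great', 'no', 'no'), ('great', 'yes', 'no')))   # 1

# Second dimension is 100x the scale of the first; after normalization
# both contribute equally to the Euclidean distance.
pts = normalize([(0.0, 0.0), (2.0, 200.0)])
```

Without normalization, the large-scale dimension would dominate the distance and effectively hide the other attribute.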

How many nearby neighbors to look at?


• One

How to fit with the local points?



k – Nearest Neighbor
• Generalizes 1-NN to smooth away noise in the labels
• A new point is now assigned the most frequent label of its k nearest neighbors

Label it red, when k = 3

Label it blue, when k = 7


KNN Example

Similarity metric: number of matching attributes (k = 2)

• New examples:
• Example 1: (great, no, no, normal, no) → Yes
Most similar: number 2 (1 mismatch, 4 match) → yes
Second most similar: number 1 (2 mismatch, 3 match) → yes
• Example 2: (mediocre, yes, no, normal, no) → Yes/No
Most similar: number 3 (1 mismatch, 4 match) → no
Second most similar: number 1 (2 mismatch, 3 match) → yes
Selecting the Number of Neighbors
• Increase k:
• Makes KNN less sensitive to noise

• Decrease k:
• Allows capturing finer structure of space

• Pick k not too large, but not too small (depends on data)
Curse-of-Dimensionality
• Prediction accuracy can quickly degrade when number of attributes grows.
• Irrelevant attributes easily “swamp” information from relevant attributes
• When many irrelevant attributes, similarity/distance measure becomes less reliable

• Remedy
• Try to remove irrelevant attributes in pre-processing step
• Weight attributes differently
• Increase k (but not too much)
Advantages and Disadvantages of KNN
• Need a distance/similarity measure and attributes that “match” the target
function.
• For large training sets, must make a pass through the entire dataset for each
classification. This can be prohibitive for large data sets.
• Prediction accuracy can quickly degrade when the number of attributes grows.
• Simple to implement; requires little tuning; often performs quite well!
(Try it first on a new learning problem.)
Locally Weighted Regression
• KNN forms a local approximation to f for each query point xq
• Why not form an explicit approximation f̂(x) for the region surrounding xq?
Locally Weighted Regression
• Locally weighted regression uses nearby or distance-weighted training examples to form this local
approximation to f. It is an instance-based learning algorithm.
• We might approximate the target function in the neighborhood surrounding xq using a linear function, a
quadratic function, or a multilayer neural network.
• The phrase "locally weighted regression" is used because the method is:
• local, because the function is approximated based only on data near the query point,
• weighted, because the contribution of each training example is weighted by its distance from the query
point, and
• regression, because this is the term used widely in the statistical learning community for the problem of
approximating real-valued functions.
Locally Weighted Regression
Weights are modified using the gradient descent method of ANNs. The error
term is the squared difference between the target output f(x) and the
computed output f̂(x).

1. Minimize the squared error over just the k nearest neighbors of xq:

E1(xq) = (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) - f̂(x))²

The gradient descent weight update is

Δwj = η Σ (f(x) - f̂(x)) aj(x)

where η is the constant learning rate and aj(x) is the jth attribute of x.
Weights are modified again and again until we get the minimum error; i.e.,
choose the weights that minimize the squared error summed over the training
examples using gradient descent.

2. Minimize the squared error over the entire set D of training examples,
while weighting the error of each training example by some decreasing
function K of its distance from xq:

E2(xq) = (1/2) Σ_{x ∈ D} (f(x) - f̂(x))² K(d(xq, x))

3. Now combine 1 and 2 to get locally as well as weighted regression:

E3(xq) = (1/2) Σ_{x ∈ k nearest nbrs of xq} (f(x) - f̂(x))² K(d(xq, x))

The contribution of instance x to the weight update is now multiplied by the
distance penalty K(d(xq, x)), and the error is summed over only the k nearest
training examples:

Δwj = η Σ_{x ∈ k nearest nbrs of xq} K(d(xq, x)) (f(x) - f̂(x)) aj(x)

We can thus find the modified weights: start with random weights and adjust
them until the weights with minimum squared error are obtained.
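A minimal sketch of this procedure for one-dimensional inputs, fitting f̂(x) = w0 + w1·x by gradient descent with a Gaussian kernel K; the kernel width tau, learning rate eta, and epoch count are assumed settings, not from the text:

```python
import math

def lwr_predict(data, xq, tau=1.0, eta=0.05, epochs=500):
    """Fit f_hat(x) = w0 + w1*x around the query point xq, weighting
    each example's error by K(d) = exp(-d^2 / (2*tau^2)). In this
    small sketch every example is treated as one of the k nearest."""
    K = [math.exp(-((x - xq) ** 2) / (2 * tau ** 2)) for x, _ in data]
    w0 = w1 = 0.0
    for _ in range(epochs):
        dw0 = dw1 = 0.0
        for k, (x, y) in zip(K, data):
            err = y - (w0 + w1 * x)       # f(x) - f_hat(x)
            dw0 += eta * k * err          # a_0(x) = 1
            dw1 += eta * k * err * x      # a_1(x) = x
        w0 += dw0
        w1 += dw1
    return w0 + w1 * xq

data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2)]  # roughly y = x
pred = lwr_predict(data, 1.5)
```

Because the data lie near y = x, the locally fitted line predicts a value close to 1.5 at the query point; points far from xq contribute little thanks to the kernel weighting.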
Radial Basis Functions
Radial Basis Function Networks

• Each hidden unit produces an activation determined by a Gaussian function
centered at some instance xu.
• Therefore, its activation will be close to zero unless the input x is near xu.
• The output unit produces a linear combination of the hidden unit
activations.
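The forward pass of such a network can be sketched as below (illustrative; the centers, widths, and output weights would normally be learned from the training data):

```python
import math

def rbf_output(x, centers, widths, weights, w0):
    """f(x) = w0 + sum_u w_u * exp(-d(x, x_u)^2 / (2 * sigma_u^2)):
    a linear combination of Gaussian hidden-unit activations."""
    acts = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) /
                     (2 * s ** 2))
            for c, s in zip(centers, widths)]
    return w0 + sum(w * a for w, a in zip(weights, acts))

# Two hidden units centered on stored instances; a hidden unit's
# activation is near zero unless the input is close to its center.
centers = [(0.0, 0.0), (5.0, 5.0)]
out_near = rbf_output((0.0, 0.1), centers, [1.0, 1.0], [1.0, 1.0], 0.0)
out_far = rbf_output((10.0, 10.0), centers, [1.0, 1.0], [1.0, 1.0], 0.0)
```

An input near the first center activates only that unit (output close to 1), while an input far from both centers produces an output near zero.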
Case-based reasoning
• Instance-based methods
• lazy
• classification based on classifications of near (similar) instances
• data: points in n-dim. space
• Case-based reasoning
• as above, but data represented in symbolic form
• new distance metrics required
• In case-based reasoning, the training examples, the cases, are stored and accessed
to solve a new problem. To get a prediction for a new example, those cases that
are similar, or close to, the new example are used to predict the value of the target
for the new example.
CADET SYSTEM: Example of CBR
• The CADET system employs case-based reasoning to assist in the conceptual design of simple mechanical
devices such as water faucets. It uses a library containing approximately 75 previous designs and design
fragments to suggest conceptual designs to meet the specifications of new design problems. Each instance
stored in memory (e.g., a water pipe) is represented by describing both its structure and its qualitative function.

• New design problems are then presented by specifying the desired function and requesting the corresponding
structure. The top half of the figure shows the description of a typical stored case called a T-junction pipe. Its
function is represented in terms of the qualitative relationships among the waterflow levels and temperatures at
its inputs and outputs.

• In the functional description at its right, an arrow with a "+" label indicates that the variable at the arrowhead
increases with the variable at its tail. For example, the output waterflow Q3 increases with increasing input
waterflow Q1.
The bottom half of this figure depicts a new design problem described by its desired function.
This particular function describes the required behavior of one type of water faucet.

Here Qc refers to the flow of cold water into the faucet, Qh to the input flow of hot water, and Qm
to the single mixed flow out of the faucet. Similarly, Tc, Th, and Tm refer to the temperatures of
the cold water, hot water, and mixed water respectively. The variable Ct denotes the control
signal for temperature that is input to the faucet, and Cf denotes the control signal for waterflow.
Note the description of the desired function specifies that these controls Ct and Cf are to
influence the water flows Qc and Qh, thereby indirectly influencing the faucet output flow Qm and
temperature Tm.

Given this functional specification for the new design problem, CADET searches its library for
stored cases whose functional descriptions match the design problem. If an exact match is found,
indicating that some stored case implements exactly the desired function, then this case can be
returned as a suggested solution to the design problem. If no exact match occurs, CADET may
find cases that match various subgraphs of the desired functional specification.
Case Based Learning (CBR)

1. Retrieve – Gather from memory an experience closest to the current problem.
2. Reuse – Suggest a solution based on the experience and adapt it to meet the
demands of the new situation.
3. Revise – Evaluate the use of the solution in the new context.
4. Retain – Store this new problem-solving method in the memory system.
Lazy & eager learning
• Lazy: generalize at query time
• kNN, CBR
• Eager: generalize before seeing query
• Radial basis, ID3, …
• Difference
• eager must create global approximation
• lazy can create many local approximation
• lazy methods can represent more complex functions using the same hypothesis space H (e.g., H = linear functions)
