Unit 3 MLT
• The target function has discrete output values: The decision tree assigns a Boolean classification
(e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more
than two possible output values. A more substantial extension allows learning target functions with
real-valued outputs, though the application of decision trees in this setting is less common.
Most algorithms that have been developed for learning decision trees are variations on a
core algorithm that employs a top-down, greedy search through the space of possible
decision trees. This approach is exemplified by the ID3 (Iterative Dichotomiser 3) algorithm
(Quinlan 1986) and its successor C4.5 (Quinlan 1993). The basic top-down construction procedure
used by ID3 is described below.
Construction of a Decision Tree (ID3)
Our basic algorithm, ID3, learns decision trees by constructing them top-down, beginning
with the question "which attribute should be tested at the root of the tree?" To answer this
question, each instance attribute is evaluated using a statistical test to determine how well
it alone classifies the training examples. The best attribute is selected and used as the test at
the root node of the tree. A descendant of the root node is then created for each possible
value of this attribute, and the training examples are sorted to the appropriate descendant
node (i.e., down the branch corresponding to the example's value for this attribute). The
entire process is then repeated using the training examples associated with each descendant
node to select the best attribute to test at that point in the tree. This forms a greedy search
for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier
choices. A simplified version of the algorithm, specialized to learning Boolean-valued
functions (i.e., concept learning), is given below.
The example above has four attributes, so we need to find the attribute that carries the
maximum information. First calculate the entropy of the entire data set, then calculate the
entropy of each attribute. From these, compute the information gain for each attribute; the
attribute with the maximum information gain is selected as the root node.
The four candidate attributes are Outlook, Temp, Humidity, and Wind; the entropy and
information gain calculations are carried out for each of them in turn.
Which Attribute Is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each node in
the tree. We would like to select the attribute that is most useful for classifying examples.
What is a good quantitative measure of the worth of an attribute? We will define a statistical
property, called information gain, that measures how well a given attribute separates the
training examples according to their target classification. ID3 uses this information gain
measure to select among the candidate attributes at each step while growing the tree.
• Create a Root node for the tree
• If all Examples are positive, Return the single-node tree Root, with label = +
• If all Examples are negative, Return the single-node tree Root, with label = -
• If Attributes is empty, Return the single-node tree Root, with label = most common value of Target
attribute in Examples
• Otherwise Begin
  • A ← the attribute from Attributes that best* classifies Examples
  • The decision attribute for Root ← A
  • For each possible value, vi, of A,
    • Add a new tree branch below Root, corresponding to the test A = vi
    • Let Examples_vi be the subset of Examples that have value vi for A
    • If Examples_vi is empty
      • Then below this new branch add a leaf node with label = most common value of
        Target attribute in Examples
      • Else below this new branch add the subtree ID3(Examples_vi, Target attribute,
        Attributes - {A})
• End
• Return Root
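As a concrete companion to the pseudocode above, the following is a minimal Python sketch of the same recursion. It is a sketch only, not Quinlan's implementation: the nested-dict tree representation, the function names, and the pluggable `choose_attribute` parameter are choices made for this illustration. The information-gain-based chooser that ID3 actually uses is defined later in this unit.

```python
from collections import Counter

def id3(examples, target, attributes, choose_attribute):
    """ID3 skeleton for discrete-valued attributes.

    examples:   list of dicts mapping attribute name -> value
    target:     name of the target attribute (e.g. "PlayTennis")
    attributes: list of candidate attribute names
    choose_attribute(examples, attributes, target) -> the attribute that
                best classifies the examples (ID3 uses information gain).
    Returns a nested-dict tree; leaves are target values.
    """
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:                       # all positive or all negative
        return labels[0]
    if not attributes:                              # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes, target)
    tree = {best: {}}
    # The pseudocode adds a branch for every possible value of A and labels empty
    # branches with the majority class; for brevity this sketch only creates
    # branches for the values that actually occur among the examples.
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, target, remaining, choose_attribute)
    return tree

# Trivial chooser just to make the sketch runnable; in ID3 proper this is
# replaced by the information-gain-based selection defined later in this unit.
first_attribute = lambda examples, attributes, target: attributes[0]
```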
Biased and Unbiased Hypothesis
Biased hypothesis space: a biased hypothesis space does not consider all types of training examples,
because it cannot represent every possible concept.
E.g. Sunny ^ Warm ^ Normal ^ Strong ^ Cool ^ Change = Yes
(^ stands for 'AND'.) Here 'Change' is the least relevant attribute. A human can ignore it, but a
machine trained on this exact conjunction will not ignore it, and it will give a 'No' outcome when the
value is 'Same'.
An unbiased hypothesis space, on the other hand, would have to cover every possible combination of
attribute values, and it is impossible to consider such a huge number of examples. So there are
problems with both biased and unbiased hypothesis spaces.
Inductive and Deductive Learning
Inductive learning: learning through examples; rules are derived from the examples.
Deductive learning: already existing rules are applied to the examples.
Humans can be given theory to understand a concept, but in the case of machine training we cannot
directly supply theory; instead, examples covering different situations are given to train the
machine. In this sense, inductive learning suits machines while deductive learning suits humans.
Inductive Bias Inference
Idea of inductive bias: on the basis of previous training, the machine generalizes the concept to
new examples, e.g., building a house without any supervision of a civil engineer.
The symbol ≻ is used for "is inductively inferred from": x ≻ y means that y is inductively inferred
from x. Here x is the previously seen training data and y is what is handled using the experience
gained from x.
Example: learning algorithm L, training data Dc = {⟨x, c(x)⟩}, new instance xi. The classification
produced by the learner is written L(xi, Dc). Then
(Dc ∧ xi) ≻ L(xi, Dc)
i.e., the classification of the new instance is inductively inferred from the previous data set and xi.
Inductive Bias
The inductive bias (also known as learning bias) of a learning algorithm is the set of
assumptions that the learner uses to predict outputs of given inputs that it has not
encountered. In machine learning, one aims to construct algorithms that are able to learn to
predict a certain target output.
• Notice that the entropy is 0 if all members of S belong to the same class. For example, if
all members are positive (p(+ve) = 1), then p(-ve) is 0, and Entropy(S) = -1·log2(1) - 0·log2(0)
= -1·0 - 0·log2(0) = 0.
• Note the entropy is 1 when the collection contains an equal number of positive and
negative examples.
• If the collection contains unequal numbers of positive and negative examples, the entropy
is between 0 and 1.
Entropy
• A decision tree is built top-down from a root node and involves partitioning the data into
subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy
to calculate the homogeneity of a sample. If the sample is completely homogeneous the
entropy is zero, and if the sample is equally divided it has entropy of one.
• In the special case where the target classification is Boolean, with p(+ve) the proportion of
positive examples and p(-ve) the proportion of negative examples,
Entropy(S) = -p(+ve) log2 p(+ve) - p(-ve) log2 p(-ve)
• If the target attribute can take on c different values, then the entropy of S relative to this
c-wise classification is defined as
Entropy(S) = Σ (i = 1 to c) -pi log2(pi)
• where pi is the proportion of S belonging to class i. Note the logarithm is still base 2 because
entropy is a measure of the expected encoding length measured in bits. Note also that if the
target attribute can take on c possible values, the entropy can be as large as log2(c).
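A minimal sketch of this calculation in Python, checking the special cases noted above (entropy 0 for a pure sample, 1 for an evenly split Boolean sample, and log2(c) for c equally likely classes). The function name and the use of per-class counts rather than raw examples are choices made for this illustration.

```python
import math

def entropy(class_counts):
    """Entropy of a sample, given the count of examples in each class."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]   # 0 * log2(0) is taken as 0
    return sum(-p * math.log2(p) for p in probs)

print(entropy([14, 0]))    # pure sample              -> 0.0
print(entropy([7, 7]))     # even Boolean split       -> 1.0
print(entropy([5, 5, 5]))  # 3 equally likely classes -> log2(3), about 1.585
```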
Interpretation of Entropy
• The entropy function relative to a Boolean classification varies as the proportion, p(+ve),
of positive examples varies between 0 and 1: it is 0 when p(+ve) is 0 or 1 and reaches its
maximum of 1 when p(+ve) = 0.5.
• One interpretation of entropy from information theory is
that it specifies the minimum number of bits of information
needed to encode the classification of an arbitrary member
of S (i.e., a member of S drawn at random with uniform
probability).
• If p(+ve) is 0.8, then a collection of messages can be
encoded using on average less than 1 bit per message by
assigning shorter codes to collections of positive examples
and longer codes to less likely negative examples.
INFORMATION GAIN
Given entropy as a measure of the impurity in a collection of training examples, we can now
define a measure of the effectiveness of an attribute in classifying the training data. The
measure we will use, called information gain, is simply the expected reduction in entropy
caused by partitioning the examples according to this attribute. More precisely, the
information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is
defined as
Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) · Entropy(Sv)
where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for
which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Note the first term in the equation
is just the entropy of the original collection S, and the second term is the expected value of
the entropy after S is partitioned using attribute A.
The expected entropy described by this second term is simply the sum of the entropies of
each subset Sv, weighted by the fraction of examples (|Sv|/|S|) that belong to Sv. Gain(S, A) is
therefore the expected reduction in entropy caused by knowing the value of attribute A. Put
another way, Gain(S, A) is the information provided about the target function value, given
the value of some other attribute A. The value of Gain(S, A) is the number of bits saved
when encoding the target value of an arbitrary member of S, by knowing the value of
attribute A.
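The following sketch computes Gain(S, A) directly from the definition above. Representing an attribute as parallel lists of (attribute value, class label) pairs, and the function names, are choices made for this example. The illustrative counts below are the classic PlayTennis "Wind" numbers from the textbook version of this example (9 positive and 5 negative examples overall), used here only as an assumed demonstration; the same computation can back the `choose_attribute` parameter of the ID3 sketch shown earlier.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A): `values` are the attribute values of each example,
    `labels` the corresponding target classifications."""
    total = len(labels)
    by_value = defaultdict(list)
    for v, label in zip(values, labels):
        by_value[v].append(label)
    expected = sum((len(sub) / total) * entropy(sub) for sub in by_value.values())
    return entropy(labels) - expected

# Assumed illustrative data: 14 examples split by the Boolean attribute Wind.
wind = ["Weak"] * 8 + ["Strong"] * 6
play = ["Yes"] * 6 + ["No"] * 2 + ["Yes"] * 3 + ["No"] * 3
print(round(information_gain(wind, play), 3))   # about 0.048
```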
To build a decision tree, we need to calculate two types of entropy using frequency tables: the
entropy of the target over the whole data set, and the entropy of the target within each value
of an attribute (e.g., Outlook).
Choose attribute with the largest information gain as the
decision node, divide the dataset by its branches and repeat the
same process on every branch
A branch with entropy of 0 is a leaf node.
A branch with entropy more than 0 needs further
splitting.
The ID3 algorithm is run recursively on the non-leaf branches,
until all data is classified.
What is Inductive Learning?
From the perspective of inductive learning, we are given input samples (x) and output samples (f(x)),
and the problem is to estimate the function f. Specifically, the problem is to generalize from the
samples so that the learned mapping is useful for estimating the output for new samples in the future.
In practice it is almost always too hard to estimate the function exactly, so we look for very good
approximations of the function. e.g.,
• Credit risk assessment.
• The x is the properties of the customer.
• The f(x) is credit approved or not.
• Disease diagnosis.
• The x are the properties of the patient.
• The f(x) is the disease they suffer from.
• Face recognition.
• The x are bitmaps of people's faces.
• The f(x) is to assign a name to the face.
• Automatic steering.
• The x are bitmap images from a camera in front of the car.
• The f(x) is the degree the steering wheel should be turned.
When Should You Use Inductive Learning?
There are problems where inductive learning is not a good idea. It is important to know when to use
and when not to use supervised machine learning.
4 problems where inductive learning might be a good idea:
• Problems where there is no human expert. If people do not know the answer they cannot write a
program to solve it. These are areas of true discovery.
• Humans can perform the task but no one can describe how to do it. There are problems where
humans can do things that computer cannot do or do well. Examples include riding a bike or driving a
car.
• Problems where the desired function changes frequently. Humans could describe it and they could
write a program to do it, but the problem changes too often. It is not cost effective. Examples include
the stock market.
• Problems where each user needs a custom function. It is not cost effective to write a custom
program for each user. Example is recommendations of movies or books on Netflix or Amazon.
Two perspectives on inductive learning:
• Learning is the removal of uncertainty. Having data removes some uncertainty, and
selecting a class of hypotheses removes more uncertainty.
• Learning is guessing a good and small hypothesis class. It requires guessing: we don't
know the solution, so we must use a trial-and-error process. If you knew the domain with
certainty, you would not need learning. But we are not guessing in the dark.
A Framework For Studying Inductive Learning
• Training example: a sample from x including its output from the target function
• Target function: the mapping function f from x to f(x)
• Hypothesis: approximation of f, a candidate function.
• Concept: A Boolean target function, positive examples and negative examples for the 1/0
class values.
• Classifier: Learning program outputs a classifier that can be used to classify.
• Learner: Process that creates the classifier.
• Hypothesis space: set of possible approximations of f that the algorithm can create.
• Version space: subset of the hypothesis space that is consistent with the observed data
HYPOTHESIS SPACE SEARCH IN DECISION TREE LEARNING
As with other inductive learning methods, ID3 can be characterized as searching a space of
hypotheses for one that fits the training examples. The hypothesis space searched by ID3 is
the set of possible decision trees. ID3 performs a simple-to-complex, hill-climbing search
through this hypothesis space, beginning with the empty tree, then considering
progressively more elaborate hypotheses in search of a decision tree that correctly classifies
the training data. The evaluation function that guides this hill-climbing search is the
information gain measure. By viewing ID3 in terms of its search space and search strategy,
we can get some insight into its capabilities and limitations.
• ID3's hypothesis space of all decision trees is a complete space of finite discrete-valued functions,
relative to the available attributes. Because every finite discrete-valued function can be represented by
some decision tree, ID3 avoids one of the major risks of methods that search incomplete hypothesis
spaces (such as methods that consider only conjunctive hypotheses): that the hypothesis space might not
contain the target function.
• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. This
contrasts, for example, with the earlier version space candidate-elimination method, which maintains the
set of all hypotheses consistent with the available training examples. By determining only a single
hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses.
For example, it does not have the ability to determine how many alternative decision trees are consistent
with the available training data, or to pose new instance queries that optimally resolve among these
competing hypotheses.
INDUCTIVE BIAS IN DECISION TREE LEARNING
Inductive bias is the set of assumptions that, together with the training data, deductively
justify the classifications assigned by the learner to future instances.
Given a collection of training examples, there are typically many decision trees consistent
with these examples. Describing the inductive bias of ID3 therefore consists of describing
the basis by which it chooses one of these consistent hypotheses over the others. Which of
these decision trees does ID3 choose? It chooses the first acceptable tree it encounters in its
simple-to-complex, hill-climbing search through the space of possible trees. Roughly
speaking, then, the ID3 search strategy (a) selects in favor of shorter trees over longer ones,
and (b) selects trees that place the attributes with highest information gain closest to the
root. Because of the subtle interaction between the attribute selection heuristic used by ID3
and the particular training examples it encounters, it is difficult to characterize precisely the
inductive bias exhibited by ID3.
Issues in Decision Tree
1. Determine how deeply to grow the decision tree
2. Handling continuous attributes
3. Choosing an appropriate attribute selection measure
4. Handling training data with missing attribute values
5. Handling attributes with differing costs
6. Improving computational efficiency
Avoiding Overfitting the Data
When the data is noisy, or when the number of training examples is too small to be
representative, simple algorithms like ID3 can produce trees that overfit the training examples.
Overfit: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the
training data if there exists some alternative hypothesis h’ ∈ H, such that h has
smaller error than h’ over the training examples, but h’ has a smaller error than h over
the entire distribution of instances. Overfitting is learning too much from the data.
Impact of overfitting: the figure shows the behaviour of ID3 on the task of learning which
medical patients have a form of diabetes.
The solid line shows the accuracy of the
decision tree over the training examples,
whereas the broken line shows accuracy
measured over an independent set of test
examples (not included in the training set).
Let's consider one incorrectly labeled (noisy) example:
Outlook = Sunny, Temp = Hot, Humidity = Normal, Wind = Strong, Play = No
The addition of this incorrect example will now cause ID3 to construct a more
complex tree. The new example will be sorted into the second leaf node from
the left in the learned tree.
The result is that ID3 will output a decision tree (h) that is more complex than
the original tree (h').
Now h will fit the collection of training examples perfectly, whereas the simpler
h' will not. However, given that the new decision node is simply a
consequence of fitting the noisy training example, we expect h' to outperform
h over subsequent data drawn from the same instance distribution.
The above example illustrates how random noise in the training
examples can lead to overfitting.
Overfitting is possible even when the training data are noise-free, especially when
small numbers of examples are associated with leaf nodes.
Avoiding Overfitting
There are several approaches to avoiding overfitting in decision tree learning.
These can be grouped into two classes:
- Pre-pruning (avoidance): Stop growing the tree earlier, before it reaches
the point where it perfectly classifies the training data
- Post-pruning (recovery): Allow the tree to overfit the data, and then
post-prune the tree
A key question is what criterion is to be used to determine the correct final
tree size.
Criterion used to determine the correct final tree size
•Use a separate set of examples, distinct from the training examples, to
evaluate the utility of post-pruning nodes from the tree
•Use all the available data for training, but apply a statistical test to estimate
whether expanding (or pruning) a particular node is likely to produce an
improvement beyond the training set
•Use an explicit measure of the complexity for encoding the training examples and the
decision tree, halting growth of the tree when this encoding size is minimized.
This approach is based on the Minimum Description Length (MDL) principle:
•MDL: minimize size(tree) + size(misclassifications(tree))
Reduced Error Pruning
•Reduced-error pruning, is to consider each of the decision nodes in the tree to be candidates
for pruning
•Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf
node, and assigning it the most common classification of the training examples affiliated with
that node
•Nodes are removed only if the resulting pruned tree performs no worse than the original over
the validation set.
•Coincidental regularities: when there is only a small number of training examples, some
attribute may be able to partition the examples very well, despite being unrelated to the
actual target function.
•Reduced error pruning has the effect that any leaf node added due to coincidental regularities
in the training set is likely to be pruned because these same coincidences are unlikely to
occur in the validation set
The impact of reduced-error pruning on the accuracy of the decision tree
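A rough sketch of reduced-error pruning over the nested-dict trees produced by the earlier ID3 sketch. The helper names (`classify`, `reduced_error_prune`) and the single bottom-up pass are assumptions of this illustration; the procedure described above instead repeatedly removes whichever node most improves validation accuracy.

```python
from collections import Counter

def classify(tree, example, default):
    """Walk a nested-dict tree; unseen branch values fall back to `default`."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(example.get(attribute), default)
    return tree

def reduced_error_prune(tree, training, validation, target):
    """Bottom-up: replace a subtree by its majority-class leaf whenever the
    pruned version does no worse on the validation examples reaching it."""
    if not isinstance(tree, dict) or not validation:
        return tree
    majority = Counter(ex[target] for ex in training).most_common(1)[0][0]
    attribute = next(iter(tree))
    for value in tree[attribute]:
        train_v = [ex for ex in training if ex.get(attribute) == value]
        valid_v = [ex for ex in validation if ex.get(attribute) == value]
        tree[attribute][value] = reduced_error_prune(
            tree[attribute][value], train_v or training, valid_v, target)
    # Compare the subtree against a single leaf on the validation examples here.
    subtree_hits = sum(classify(tree, ex, majority) == ex[target] for ex in validation)
    leaf_hits = sum(majority == ex[target] for ex in validation)
    return majority if leaf_hits >= subtree_hits else tree
```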
Rule Post-Pruning
•Infer the decision tree from the training set, growing the tree until the training
data is fit as well as possible and allowing overfitting to occur.
•Convert the learned tree into an equivalent set of rules by creating one rule
for each path from the root node to a leaf node.
•Prune (generalize) each rule by removing any preconditions that result in
improving its estimated accuracy.
•Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
Each attribute test along the path from the root to the leaf becomes a rule
antecedent (precondition) and the classification at the leaf node becomes the
rule consequent (postcondition). For example, the leftmost path of the tree in
figure is translated into the rule
IF (Outlook = Sunny) ^ (Humidity = High)
THEN PlayTennis = No
Next, each such rule is pruned by removing any antecedent, or precondition,
whose removal does not worsen its estimated accuracy. Given the above
rule, rule post-pruning would consider removing the preconditions
(Outlook = Sunny) and (Humidity = High)
•It would select whichever of these pruning steps produced the greatest
improvement in estimated rule accuracy, then consider pruning the second
precondition as a further pruning step.
•No pruning step is performed if it reduces the estimated rule accuracy.
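A rough sketch of these steps over the nested-dict trees used earlier: extract one rule per root-to-leaf path, then greedily drop preconditions. The accuracy estimate here simply uses a supplied set of examples (e.g., a validation set); C4.5 itself uses a pessimistic estimate computed from the training data, so this illustrates the procedure rather than the exact C4.5 method. The function names are choices made for this sketch.

```python
def tree_to_rules(tree, preconditions=()):
    """One (preconditions, label) rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):
        return [(list(preconditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules += tree_to_rules(subtree, preconditions + ((attribute, value),))
    return rules

def rule_accuracy(preconditions, label, examples, target):
    """Estimated accuracy of the rule over the examples its preconditions cover."""
    covered = [ex for ex in examples
               if all(ex.get(a) == v for a, v in preconditions)]
    return sum(ex[target] == label for ex in covered) / len(covered) if covered else 0.0

def prune_rule(preconditions, label, examples, target):
    """Repeatedly remove the precondition whose removal gives the best estimated
    accuracy; stop as soon as every removal would reduce the accuracy."""
    rule = list(preconditions)
    acc = rule_accuracy(rule, label, examples, target)
    while rule:
        candidates = [rule[:i] + rule[i + 1:] for i in range(len(rule))]
        best = max(candidates, key=lambda c: rule_accuracy(c, label, examples, target))
        best_acc = rule_accuracy(best, label, examples, target)
        if best_acc < acc:      # no pruning step is performed if it reduces accuracy
            break
        rule, acc = best, best_acc
    return rule, label
```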
There are three main advantages of converting the decision tree to rules before pruning:
•Converting to rules allows distinguishing among the different contexts in
which a decision node is used. Because each distinct path through the
decision tree node produces a distinct rule, the pruning decision regarding
that attribute test can be made differently for each path.
•Converting to rules removes the distinction between attribute tests that occur
near the root of the tree and those that occur near the leaves. Thus, it avoids
messy bookkeeping issues such as how to reorganize the tree if the root
node is pruned while retaining part of the subtree below this test.
•Converting to rules improves readability. Rules are often easier for people to
understand.
2. Incorporating Continuous-Valued Attributes
Our initial definition of ID3 is restricted to attributes that take on a discrete set
of values.
1. The target attribute whose value is predicted by learned tree must be
discrete valued.
2. The attributes tested in the decision nodes of the tree must also be discrete
valued.
This second restriction can easily be removed so that continuous-valued
decision attributes can be incorporated into the learned tree. For example, we
may want to use the marks obtained by students or the sale prices of houses,
which are not discrete data. For an attribute A that is continuous-valued, the
algorithm can dynamically create a new Boolean attribute Ac that is true if A <
c and false otherwise. The only question is how to select the best value for
the threshold c.
Suppose the training examples have the following values for Temperature and the
target attribute PlayTennis. Take the two consecutive values between which the
decision changes and find their average. Compute the information gain for each
such candidate value; the value having the highest information gain gives the
threshold.
•Pick a threshold, c, that produces the greatest information gain. By
sorting the examples according to the continuous attribute A, then identifying
adjacent examples that differ in their target classification, we can generate a
set of candidate thresholds midway between the corresponding values of A. It
can be shown that the value of c that maximizes information gain must always
lie at such a boundary. These candidate thresholds can then be evaluated by
computing the information gain associated with each.
•In the current example, there are two candidate thresholds, corresponding to
the values of Temperature at which the value of PlayTennis changes: (48 +
60)/2, and (80 + 90)/2.
•The information gain can then be computed for each of the candidate
attributes, Temperature > 54 and Temperature > 85, and the best can be
selected (Temperature > 54).
•This dynamically created boolean attribute can then compete with the other
discrete-valued candidate attributes available for growing the decision tree.
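A small sketch of the candidate-threshold procedure described above: sort the examples by the continuous attribute, find the boundaries where the classification changes, take the midpoints, and score each with information gain. The function names are illustrative, and the data below is the textbook version of this Temperature example; the slide lists only the boundary values 48, 60, 80, and 90, so the remaining two values (40 and 72) are assumed here for completeness.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the Boolean test `value < threshold`
    that maximizes information gain over the (value, label) pairs."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:            # candidate thresholds lie only at class boundaries
            continue
        threshold = (v1 + v2) / 2
        below = [l for v, l in pairs if v < threshold]
        above = [l for v, l in pairs if v >= threshold]
        gain = base - (len(below) / len(pairs)) * entropy(below) \
                    - (len(above) / len(pairs)) * entropy(above)
        if gain > best[1]:
            best = (threshold, gain)
    return best

# Temperature / PlayTennis values from the textbook form of the example above.
temperature = [40, 48, 60, 72, 80, 90]
play        = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play))   # threshold 54.0 wins over 85.0
```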
3. Alternative Measures for Selecting Attributes
There is a natural bias in the information gain measure that favors attributes
with many values over those with few values.
•As an extreme example, consider the attribute Date, which has a very large
number of possible values. It will give a perfect classification of the training data
(in contrast with the weather attributes, it is trivially clear whether tennis was
played on each particular date). What is wrong with the attribute Date? Simply
put, it has so many possible values that it is bound to separate the training
examples into very small subsets. Because of this, it will have a very high
information gain relative to the training examples.
•However, despite having very high information gain, it is a very poor predictor of
the target function over unseen instances; if a new date is given, the tree cannot
make a sensible decision. This is because we are relying completely on
information gain.
•One way to avoid this difficulty is to select decision attributes based on some
measure other than information gain.
Alternate measure-1
The gain ratio measure penalizes attributes such as Date by incorporating a
term, called split information, that is sensitive to how broadly and uniformly the
attribute splits the data:
SplitInformation(S, A) = - Σ (i = 1 to c) (|Si| / |S|) log2(|Si| / |S|)
where S1 through Sc are the c subsets of examples resulting from partitioning S by
the c-valued attribute A. Note that SplitInformation is actually the entropy of S with
respect to the values of attribute A. This is in contrast to our previous uses of
entropy, in which we considered only the entropy of S with respect to the target
attribute whose value is to be predicted by the learned tree.
The GainRatio measure is defined in terms of the earlier Gain measure, as
well as this SplitInformation:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
The SplitInformation term discourages the selection of attributes with many
uniformly distributed values.
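A minimal sketch of SplitInformation and GainRatio as defined above, reusing the same representation of an attribute as parallel lists of values and labels. The function names are illustrative, and the sketch assumes the attribute takes more than one value (otherwise SplitInformation is 0 and the ratio is undefined).

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def split_information(values):
    """Entropy of S with respect to the values of attribute A."""
    total = len(values)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(values).values())

def gain_ratio(values, labels):
    total = len(labels)
    groups = {}
    for v, label in zip(values, labels):
        groups.setdefault(v, []).append(label)
    gain = entropy(labels) - sum((len(g) / total) * entropy(g) for g in groups.values())
    return gain / split_information(values)

# A Date-like attribute with a distinct value per example has a large
# SplitInformation (log2 n), which drives its gain ratio down.
labels = ["Yes", "No", "Yes", "Yes", "No", "No"]
dates  = ["d1", "d2", "d3", "d4", "d5", "d6"]      # unique value per example
print(split_information(dates))                    # log2(6), about 2.585
print(gain_ratio(dates, labels))                   # gain 1.0 / 2.585, about 0.387
```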
Alternate measure-2
An alternative to the Gain Ratio, designed to directly address the above
difficulty is a distance-based measure introduced by Lopez de Mantaras in
1991.
This measure is based on defining a distance metric between partitions of the
data. Each attribute is evaluated based on the distance between the data
partition it creates and the perfect partition (i.e., the partition that perfectly
classifies the training data).
The attribute whose partition is closest to the perfect partition is chosen.
This distance measure avoids the practical difficulties associated with the
GainRatio measure, and it produces significantly smaller trees in the case of
data sets whose attributes have very different numbers of values.
4. Handling Missing Attribute Values
In certain cases, the available data may be missing values for some
attributes. For example, in a medical domain in which we wish to predict
patient outcome based on various laboratory tests, it may be that the
Blood-Test-Result is available only for a subset of the patients. In such cases,
it is common to estimate the missing attribute value based on other examples
for which this attribute has a known value.
Method-1
One strategy for dealing with a missing attribute value is to assign it the value
that is most common among the training examples at node n, the node in the
tree where that attribute is tested.
Method-2
A second, more complex procedure is to assign a probability to each of the
possible values of A. These probabilities can be estimated again based on
the observed frequencies of the various values for A among the examples at
node n.
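Two small sketches of the strategies above: Method 1 fills a missing value with the most common value among the training examples at the node, while Method 2 estimates a probability for each possible value so the example can be passed down each branch with a fractional weight. The function names and the dict-based example format are assumptions of this illustration.

```python
from collections import Counter

def fill_with_most_common(examples, attribute):
    """Method 1: replace missing (None) values of `attribute` with the value
    most common among the examples at this node."""
    known = [ex[attribute] for ex in examples if ex[attribute] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [dict(ex, **{attribute: ex[attribute] if ex[attribute] is not None
                        else most_common}) for ex in examples]

def value_probabilities(examples, attribute):
    """Method 2: estimate P(A = v) from the observed frequencies; an example
    with a missing value is sent down each branch with these weights."""
    known = [ex[attribute] for ex in examples if ex[attribute] is not None]
    counts = Counter(known)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

node_examples = [{"Wind": "Weak"}, {"Wind": "Weak"}, {"Wind": "Strong"}, {"Wind": None}]
print(fill_with_most_common(node_examples, "Wind"))
print(value_probabilities(node_examples, "Wind"))   # about {'Weak': 0.67, 'Strong': 0.33}
```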
5. Handling Attributes with Differing Costs
In some learning tasks the instance attributes may have associated costs. For
example, in learning to classify medical diseases we might describe patients
in terms of attributes such as Temperature, BiopsyResult, Pulse,
BloodTestResults, etc. These attributes vary significantly in their costs, both
in terms of monetary cost and cost to patient comfort.
In such tasks, we would prefer decision trees that use low-cost attributes
where possible, relying on high-cost attributes only when needed to produce
reliable classifications.
ID3 can be modified to consider attribute costs by introducing a cost term into
the attribute selection measure. For example, we might divide the Gain by the
cost of the attribute, so that lower-cost attributes would be preferred. Low-cost
attributes such as Temperature and Pulse are thus given more weight, and the
tree relies on them wherever they suffice, reserving high-cost attributes for the
cases where they are really needed.
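A small sketch of this idea: score each attribute by its information gain divided by its measurement cost, so cheaper attributes win when their gains are comparable. The numbers below are purely illustrative assumptions, not values from the slides; variants published in the literature use related scores such as Gain squared over cost.

```python
def cost_sensitive_score(gain, cost):
    """Prefer attributes with high information gain and low measurement cost.
    The simple ratio used here is only one of several published options."""
    return gain / cost

# Illustrative (assumed) numbers: BiopsyResult is informative but expensive.
attributes = {
    "Temperature":  (0.15, 1.0),    # (information gain, cost)
    "Pulse":        (0.10, 1.0),
    "BiopsyResult": (0.40, 50.0),
}
best = max(attributes, key=lambda a: cost_sensitive_score(*attributes[a]))
print(best)   # Temperature wins once cost is taken into account
```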
Instance Based Learning
Other algorithms take the examples, set up a target function, and learn from them, but instance-based
learning methods simply store the training examples instead of learning an explicit description of the
target function.
– Generalizing beyond the examples is postponed until a new instance must be classified.
– When a new instance is encountered, its relationship to the stored examples is examined in order to
assign a target function value to the new instance.
When a new data point is given, it is compared with the stored training examples and classified
accordingly.
• Instance-based learning includes nearest neighbor, locally weighted regression and case-based
reasoning methods.
• Instance-based methods are sometimes referred to as lazy learning methods because they delay
processing until a new instance must be classified. In contrast, decision tree learning, candidate
elimination, and C4.5 are eager learning algorithms.
• A key advantage of lazy learning is that instead of estimating the target function once for the entire
instance space, these methods can estimate it locally and differently for each new instance to be
classified.
Instance-based learning methods such as nearest neighbor and locally weighted
regression are conceptually straightforward approaches to approximating real-valued
or discrete-valued target functions. Learning in these algorithms consists of simply
storing the presented training data. When a new query instance is encountered, a set
of similar related instances is retrieved from memory and used to classify the new
query instance.
In eager learning, when one example is given, a specific hypothesis is created; as more
examples are given, the hypothesis is generalised based on those examples.
In instance-based learning, classification or categorization of the examples is not done at the
start but only at the end, when a new data point becomes available.
One disadvantage of instance-based approaches is that the cost of classifying new
instances can be high. This is due to the fact that nearly all computation or processing
takes place at classification time rather than when the training examples are first seen.
A second disadvantage of many instance-based approaches, especially
nearest-neighbor approaches, is that they typically consider all attributes of the
instances when attempting to retrieve similar training examples from memory. If the
target concept depends on only a few of the many available attributes, then the
instances that are truly most "similar" may well be a large distance apart.
For example, in k-nearest neighbor, if 10 attributes are present and only 8 of them are
significant for assigning the target label while the other 2 are not, the algorithm still
considers all the attributes equally. Because of the weight given to the insignificant
attributes, the target may be classified into the wrong category.
K- Nearest Neighbor
All instances correspond to points in the n-dimensional space ℝⁿ. For simplicity and ease of
understanding we consider a two-dimensional space only.
The nearest neighbors of an instance are defined in terms of the standard Euclidean distance.
Let ar(x) denote the value of the rth attribute of instance x. Then the distance between
two instances xi and xj is defined to be d(xi, xj), where
d(xi, xj) = sqrt( Σ (r = 1 to n) (ar(xi) - ar(xj))² )
Whichever stored instance has the minimum distance is the instance that the new instance
matches most closely.
In nearest neighbor learning target function may be discrete valued or real valued.
Let us consider learning a discrete-valued target function of the form f : ℝⁿ → V, where V is the
finite set {v1, v2, ..., vs}. In the case of binary classification we have only two values, say v1 and v2;
in other discrete classifications we may have more values.
The value f̂(xq) returned by this algorithm as its estimate of f(xq) is just the most common value of f
among the k training examples nearest to xq. If we choose k = 1, then the 1-NEAREST NEIGHBOR
algorithm assigns to f̂(xq) the value f(xi), where xi is the training instance nearest to xq. For larger
values of k, the algorithm assigns the most common value among the k nearest training examples.
Training algorithm: for each training example ⟨x, f(x)⟩, add the example to the list training_examples.
Classification algorithm: given a query instance xq to be classified, let x1 ... xk denote the k
instances from training_examples that are nearest to xq, and return
f̂(xq) ← argmax (v ∈ V) Σ (i = 1 to k) δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.
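A minimal k-nearest-neighbor sketch for a discrete-valued target, following the training and classification steps just described. The (height, weight) training points below are purely illustrative assumptions, not the values from the slide's table; the query point (157, 54) is reused from the discussion that follows.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Standard Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=3):
    """Return the most common label among the k training instances
    nearest to `query`; `training` is a list of (vector, label) pairs."""
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Purely illustrative (height, weight) -> size data.
training = [((160, 60), "Medium"), ((158, 59), "Medium"),
            ((170, 80), "Large"),  ((175, 85), "Large")]
print(knn_classify((157, 54), training, k=3))   # -> "Medium"
```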
In the present example we can see that, using k-NN, the new instance (157, 54) is classified as Large.
But the distances to the Large neighbors are 6.71 and 6.40, which are much greater than the distance
from the new instance to the Medium neighbor, so the new instance should arguably be classified as
Medium.
Hence, instead of only counting how many of the k neighbors carry each target value, some weightage
must also be given to the distance.
Distance-Weighted NEAREST NEIGHBOR Algorithm for discrete-valued functions
Training algorithm: for each training example ⟨x, f(x)⟩, add the example to the list training_examples.
Classification algorithm: given a query instance xq to be classified, let x1 ... xk denote the k
instances from training_examples that are nearest to xq, and return
f̂(xq) ← argmax (v ∈ V) Σ (i = 1 to k) wi · δ(v, f(xi)), where wi = 1 / d(xq, xi)²
(if xq exactly matches some training instance xi, so that d(xq, xi) = 0, f̂(xq) is taken to be f(xi)).
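A sketch of the distance-weighted variant: each of the k nearest neighbors votes with weight 1/d², so closer neighbors count more, which addresses the height/weight example discussed above. The exact-match case (distance zero) returns that neighbor's label directly. It can be called exactly like the plain k-NN sketch shown earlier.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_knn_classify(query, training, k=3):
    """Distance-weighted k-NN for a discrete target: vote weight is 1 / d**2."""
    neighbors = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
    votes = defaultdict(float)
    for vector, label in neighbors:
        d = euclidean(query, vector)
        if d == 0:                 # query coincides with a training instance
            return label
        votes[label] += 1.0 / d ** 2
    return max(votes, key=votes.get)
```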
(Worked example: a table with columns S.No., Height, Weight, Target, Distance, and Nearest Points
is used to assign a label to the query point.)
1-Nearest Neighbor
• A type of instance-based learning
• Also known as “memory-based” learning
• Forms a Voronoi tessellation of the instance space
Distance Metrics
• Different metrics can change the decision surface.
• Euclidean distance is the standard choice; when different units are used for each dimension,
normalize each dimension by its standard deviation.
• For discrete data, the Hamming distance can be used:
D(x1, x2) = number of features on which x1 and x2 differ
• Others (e.g., normal, cosine)
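For discrete attributes, the Hamming distance above is straightforward to compute; a one-line sketch, where the second instance is a hypothetical comparison tuple invented for this illustration:

```python
def hamming(x1, x2):
    """Number of features on which the two instances differ."""
    return sum(a != b for a, b in zip(x1, x2))

print(hamming(("mediocre", "yes", "no", "normal", "no"),
              ("good",     "yes", "no", "normal", "no")))   # -> 1
```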
• Example 2: (mediocre, yes, no, normal, no) — Yes or No?
Most similar: example 3 (1 mismatch, 4 matches) → no.
Second most similar: example 1 (2 mismatches, 3 matches) → yes.
Selecting the Number of Neighbors
• Increase k:
• Makes KNN less sensitive to noise
• Decrease k:
• Allows capturing finer structure of space
• Pick k not too large, but not too small (depends on data)
Curse-of-Dimensionality
• Prediction accuracy can quickly degrade when number of attributes grows.
• Irrelevant attributes easily “swamp” information from relevant attributes
• When many irrelevant attributes, similarity/distance measure becomes less reliable
• Remedy
• Try to remove irrelevant attributes in pre-processing step
• Weight attributes differently
• Increase k (but not too much)
Advantages and Disadvantages of KNN
• Need distance/similarity measure and attributes that “match” target function.
Case Based Learning (CBR)
• New design problems are then presented by specifying the desired function and requesting the corresponding
structure. The top half of the figure shows the description of a typical stored case called a T-junction pipe. Its
function is represented in terms of the qualitative relationships among the waterflow levels and temperatures at
its inputs and outputs.
• In the functional description at its right, an arrow with a "+" label indicates that the variable at the arrowhead
increases with the variable at its tail. For example, the output waterflow Q3 increases with increasing input
waterflow Q1.
The bottom half of this figure depicts a new design problem described by its desired function.
This particular function describes the required behavior of one type of water faucet.
Here Qc refers to the flow of cold water into the faucet, Qh to the input flow of hot water, and Qm
to the single mixed flow out of the faucet. Similarly, Tc, Th, and Tm refer to the temperatures of
the cold water, hot water, and mixed water respectively. The variable Ct denotes the control
signal for temperature that is input to the faucet, and Cf denotes the control signal for waterflow.
Note the description of the desired function specifies that these controls Ct and Cf are to
influence the water flows Qc and Qh, thereby indirectly influencing the faucet output flow Qm and
temperature Tm.
Given this functional specification for the new design problem, CADET searches its library for
stored cases whose functional descriptions match the design problem. If an exact match is found,
indicating that some stored case implements exactly the desired function, then this case can be
returned as a suggested solution to the design problem. If no exact match occurs, CADET may
find cases that match various subgraphs of the desired functional specification.