Unsupervised Learning
Lecture 1: Introduction and Evaluation
Tho Quan
qttho@[Link]
Agenda
• Introduction to supervised learning
• A case study with a simple classification approach
• Evaluating the classification
• Making it practical: an upgraded version
Supervised vs. Unsupervised Learning
Classification: Overview
• Classification
Training Data → Classification Model → Make Predictions on Unseen Data
• Examples:
• Play golf
• Neural Network
Process 1: Model Construction
A classification algorithm takes the training data and produces a classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process 2: Using the Model in Prediction
The classifier is applied to the testing data and then to unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Classification Algorithms
• Classification Algorithms:
• Support Vector Machines
• Neural Network (multi-layer perceptron)
• Decision Tree
• K-Nearest Neighbor
• Naive Bayes Classifier…
K-Nearest Neighbor
• Classifies objects based on the closest training examples in the feature space
• The function is approximated locally
• All computation is deferred until classification
• An object is classified by a majority vote of its neighbors
Data
• The training examples are vectors in a multidimensional feature space,
each with a class label
• The training phase of the algorithm consists only of storing the feature
vectors and class labels of the training samples
Algorithm
- k is a user-defined constant
- Find the k nearest neighbors (NNs) of the query object
- The predicted label is the label that is most frequent among these k nearest training samples
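A minimal sketch of this voting rule in Python (NumPy assumed available); the function name knn_predict, the toy data, and the Euclidean metric are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote among its k nearest training samples."""
    # Euclidean distance from x to every stored training vector
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Most frequent label among those k neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # "A"
```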
Similarity and Representation
• How do we define similarity?
• How do we represent the objects whose similarities we wish to measure?
Motivation for the VSM
• We return to the case study of document classification
• The VSM is an algebraic model for representing text documents as vectors of index terms
• A document is represented as a vector; each dimension corresponds to a separate term, and its value is typically a tf-idf weight
Document Collection
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

      T1   T2   ...  Tt
D1    w11  w21  ...  wt1
D2    w12  w22  ...  wt2
:     :    :         :
Dn    w1n  w2n  ...  wtn
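A small sketch of building such a term-document matrix with tf-idf weights; the toy documents and the plain log(n/df) idf used here are illustrative assumptions.

```python
import math
from collections import Counter

docs = ["play golf today", "golf course open today", "neural network training"]

# Vocabulary = the index terms T1..Tt
vocab = sorted({t for d in docs for t in d.split()})

# Document frequency of each term
df = Counter(t for d in docs for t in set(d.split()))
n = len(docs)

# Term-document matrix: one row per document, one column per term
matrix = []
for d in docs:
    tf = Counter(d.split())
    row = [tf[t] * math.log(n / df[t]) if tf[t] else 0.0 for t in vocab]
    matrix.append(row)

# Print each term with its weight in D1..Dn
for term, *weights in zip(vocab, *matrix):
    print(term, [round(w, 2) for w in weights])
```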
Measuring Distance
• Euclidean Distance
• Manhattan Distance
• Cosine Similarity
• Vector Length & Normalization
Euclidean Distance
[Figure: right triangle from u = (0, 0) to v = (2, 1); a^2 + b^2 = c^2, so 2^2 + 1^2 = c^2 and c = \sqrt{5}]
Euclidean Distance
In p dimensions, let the Euclidean distance between two points u and v be:

dist(u, v) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}
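A one-function sketch of this formula (NumPy assumed):

```python
import numpy as np

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((np.asarray(u) - np.asarray(v)) ** 2))

print(euclidean((0, 0), (2, 1)))  # sqrt(5) ≈ 2.236
```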
Manhattan Distance
• The Manhattan distance (a.k.a. city-block distance) is the number of units on a rectangular grid it takes to travel from point u to point v, e.g. from u = (0, 0) to v = (2, 1): 2 units horizontally + 1 unit vertically = 3 units.

D_M(u, v) = \sum_{i=1}^{p} |u_i - v_i|

[Figure: grid path from u = (0, 0) to v = (2, 1)]
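The same kind of sketch for the Manhattan distance:

```python
import numpy as np

def manhattan(u, v):
    """Sum of absolute coordinate differences (city-block distance)."""
    return np.sum(np.abs(np.asarray(u) - np.asarray(v)))

print(manhattan((0, 0), (2, 1)))  # 3
```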
Cosine Similarity
• The traditional vector space model is based on a
different notion of similarity: the cosine of the
angle between two vectors.
• To start our consideration of cosine similarity,
consider our document vectors not as points in
term space, but as arrows that travel from the
origin of the space to a particular address.
Calculating Cosine(x, y)
x = (x_1, ..., x_p) and y = (y_1, ..., y_p) are two vectors in p dimensions:

cos(x, y) = \frac{x \cdot y}{\sqrt{\sum_{i=1}^{p} x_i x_i}\,\sqrt{\sum_{i=1}^{p} y_i y_i}} = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2}\,\sqrt{\sum_{i=1}^{p} y_i^2}}    (1)
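A sketch of equation (1) in code (NumPy assumed):

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between x and y, as in equation (1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

print(cosine([1, 0, 1], [1, 1, 0]))  # 0.5
```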
Vector Length and Normalization
[Figure: the vector v = (2, 1) and its length; a^2 + b^2 = c^2, so 2^2 + 1^2 = c^2 and c = \sqrt{5}]
Vector Length and Normalization
• Vector length: \|v\| = \sqrt{\sum_{i=1}^{p} v_i^2} = \sqrt{v \cdot v}

• Normalization: u = \frac{v}{\|v\|}

Thus cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}
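A short sketch of length, normalization, and the resulting dot-product form of the cosine (NumPy assumed):

```python
import numpy as np

v = np.array([2.0, 1.0])
length = np.sqrt(np.dot(v, v))    # ||v|| = sqrt(5)
u = v / length                    # unit-length version of v
print(length, np.linalg.norm(u))  # 2.236..., 1.0

# For unit vectors, the cosine reduces to a plain dot product
x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)
print(np.dot(x, y))               # cos(45°) ≈ 0.707
```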
Using the Cosine for IR
• From now on, unless otherwise specified, we can assume that
all document vectors in our term-document matrix A have been
normalized to unit length.
• Likewise we always normalize our query to unit length.
• Given these assumptions, we have the classic vector space
model for IR.
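A sketch of this retrieval step, assuming a small dense term-document matrix A with documents as rows and a query vector q in the same term space (toy weights):

```python
import numpy as np

# Rows = documents D1..Dn, columns = terms T1..Tt
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
q = np.array([1.0, 1.0, 0.0])

# Normalize every document vector and the query to unit length
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)

# With unit vectors, cosine similarity is just a matrix-vector product
scores = A_unit @ q_unit
print(np.argsort(-scores))  # document indices ranked by cosine similarity
```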
Accuracy Measures
• A natural criterion for judging the performance of a classifier is
the probability of making a misclassification error.
• Misclassification
• The observation belongs to one class, but the model classifies it as a
member of a different class.
• A classifier that makes no errors would be perfect
• Do not expect to be able to construct such classifiers in the real world
• This is due to “noise”: not having all the information needed to precisely classify cases
Accuracy Measures
• To obtain an honest estimate of classification error, we use the
classification matrix that is computed from the validation data.
• We first partition the data into training and validation sets by random
selection of cases.
• We then construct a classifier using the training data,
• apply it to the validation data,
• which yields predicted classifications for the observations in the validation set.
• We then summarize these classifications in a classification matrix.
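A sketch of that procedure with scikit-learn (assumed available); the Iris data, the 70/30 split, and the 3-nearest-neighbor classifier are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

# Random partition into training and validation cases
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the classifier on training data, apply it to validation data
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_val)

# Summarize predicted vs. actual classes in a classification (confusion) matrix
print(confusion_matrix(y_val, y_pred))
```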
Problem with accuracy
• If the training and test data are skewed towards one class, then the model may predict everything as being that class.
• In Titanic training data, 68% of people died. If one trained a model
to predict everybody died, it would be 68% accurate.
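As a quick check of that arithmetic, with hypothetical counts mirroring the 68% figure:

```python
# Hypothetical split mirroring the slide: 68 of 100 passengers died
labels = ["died"] * 68 + ["survived"] * 32
predictions = ["died"] * len(labels)   # majority-class "model"

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.68 — "accurate" yet useless for the minority class
```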
Confusion Matrix
Practical Metrics for Data Science Projects
• Case study: in the Flight Delay project with NTU, we applied the
following metrics
• Accuracy
• Precision
• Recall
• F-measure
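How these four metrics fall out of binary confusion-matrix counts; the TP/FP/FN/TN numbers below are made up for illustration:

```python
# Hypothetical binary confusion-matrix counts (e.g. "delayed" = positive class)
tp, fp, fn, tn = 30, 10, 20, 40

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of predicted positives, how many are right
recall    = tp / (tp + fn)          # of actual positives, how many are found
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_measure)  # 0.7, 0.75, 0.6, ≈0.667
```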
Practical k-NN
• We need a data structure to compute near neighbors fast.
• Data structure
• KD-tree
Example: 2-D Tree
2-d tree for these points:
(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)
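A minimal sketch of building that 2-d tree by median splits, alternating the splitting axis per level; the KDNode class and build function are illustrative, not from the slides:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class KDNode:
    point: Tuple[int, int]
    left: "Optional[KDNode]" = None
    right: "Optional[KDNode]" = None

def build(points, depth=0):
    """Recursively split on the median, alternating x (even depth) and y (odd depth)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid],
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point)  # (7, 2) is the root in this construction
```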
Tree Operation - Insert
(9, 5) => 9 > 7, so go right at the root (7, 2)
(9, 5) => 5 < 6, so go left at (9, 6)
(9, 5) => 9 > 8, so go right at (8, 1)
(9, 5) is inserted as the right child of (8, 1)
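A sketch of that insertion walk, reusing the KDNode structure assumed in the previous sketch:

```python
def insert(node, point, depth=0):
    """Descend comparing one coordinate per level; attach the point at an empty slot."""
    if node is None:
        return KDNode(point)
    axis = depth % 2
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

insert(tree, (9, 5))  # 9 > 7, then 5 < 6, then 9 > 8
print(tree.right.left.right.point)  # (9, 5), the right child of (8, 1)
```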
Complexity
• Building a static k-d tree from n points takes O(n log n) time.
• Inserting a new point into a balanced k-d tree takes O(log n) time.
• Removing a point from a balanced k-d tree takes O(log n) time.
• Finding 1 nearest neighbour in a balanced k-d tree with randomly distributed
points takes O(log n) time on average.
Search Nearest Neighbor
Search K-Nearest Neighbor
• Apply the nearest-neighbor search while maintaining a list of the k best points found so far.
• Kd-trees are not suitable for efficiently finding the nearest
neighbour in high-dimensional spaces and approximate
nearest-neighbour methods should be used instead.
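In practice a library k-d tree is usually used rather than a hand-rolled one; a sketch with SciPy's cKDTree (assumed available):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], dtype=float)
tree = cKDTree(points)

# 3 nearest neighbors of the query point (9, 5)
dists, idx = tree.query([9.0, 5.0], k=3)
print(points[idx], dists)
```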