Unsupervised Learning
Lecture 1: Introduction and Evaluation
Tho Quan
qttho@[Link]
Agenda
• Introduction to supervised learning
• A case study with a simple classification approach
• Evaluating the classification
• Making it practical: an upgraded version
Supervised vs. Unsupervised Learning
Classification: Overview
• Classification
Training Data → Classification Model → Make Predictions on Unseen Data
• Examples:
• Play golf
• Neural Network
Process 1: Model Construction
A classification algorithm takes the training data and produces a classifier (model).

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process 2: Using the Model in Prediction
The classifier is applied to the testing data and then to unseen data.

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Classification Algorithms
• Classification Algorithms:
• Support Vector Machines
• Neural Network (multi-layer perceptron)
• Decision Tree
• K-Nearest Neighbor
• Naive Bayes Classifier…
K-Nearest Neighbor
• Classifies objects based on the closest training examples in the feature space
• The function is approximated locally
• All computation is deferred until classification
• An object is classified by a majority vote of its neighbors
Data
• The training examples are vectors in a multidimensional feature space,
each with a class label
• The training phase of the algorithm consists only of storing the feature
vectors and class labels of the training samples
Algorithm
- k is a user-defined constant
- Find the k nearest neighbors (NNs) of the query object
- The predicted label is the label that is most frequent among these k nearest training samples
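A minimal sketch of this voting rule in Python (NumPy assumed available); the function name knn_predict, the toy data, and the Euclidean metric are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a majority vote among its k nearest training samples."""
    # Euclidean distance from x to every stored training vector
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Most frequent label among those k neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy usage
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
y_train = ["A", "A", "B", "B"]
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # "A"
```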
Similarity and Representation
• How do we define similarity?
• How do we represent the objects whose similarities we wish to measure?
Motivation for the VSM
• We return to the case study of document classification
• The VSM is an algebraic model for representing text documents as vectors of index terms
• A document is represented as a vector; each dimension corresponds to a separate term, and its value is typically a tf-idf weight
Document Collection
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply does not occur in it.

      T1   T2   ...  Tt
D1    w11  w21  ...  wt1
D2    w12  w22  ...  wt2
:     :    :         :
Dn    w1n  w2n  ...  wtn
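A small sketch of building such a term-document matrix with tf-idf weights; the toy documents and the plain log(n/df) idf used here are illustrative assumptions.

```python
import math
from collections import Counter

docs = ["play golf today", "golf course open today", "neural network training"]

# Vocabulary = the index terms T1..Tt
vocab = sorted({t for d in docs for t in d.split()})

# Document frequency of each term
df = Counter(t for d in docs for t in set(d.split()))
n = len(docs)

# Term-document matrix: one row per document, one column per term
matrix = []
for d in docs:
    tf = Counter(d.split())
    row = [tf[t] * math.log(n / df[t]) if tf[t] else 0.0 for t in vocab]
    matrix.append(row)

# Print each term with its weight in D1..Dn
for term, *weights in zip(vocab, *matrix):
    print(term, [round(w, 2) for w in weights])
```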
Measuring Distance
• Euclidean Distance
• Manhattan Distance
• Cosine Similarity
• Vector Length & Normalization
Euclidean Distance
[Figure: right triangle from u = (0, 0) to v = (2, 1); a^2 + b^2 = c^2, so 2^2 + 1^2 = c^2 and c = \sqrt{5}]
Euclidean Distance
In p dimensions, let the Euclidean distance between two points u and v be:

dist(u, v) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}
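A one-function sketch of this formula (NumPy assumed):

```python
import numpy as np

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((np.asarray(u) - np.asarray(v)) ** 2))

print(euclidean((0, 0), (2, 1)))  # sqrt(5) ≈ 2.236
```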
Manhattan Distance
• The Manhattan distance (a.k.a. city-block distance) is the number of units on a rectangular grid it takes to travel from point u to point v, e.g. from u = (0, 0) to v = (2, 1): 2 units horizontally + 1 unit vertically = 3 units.

D_M(u, v) = \sum_{i=1}^{p} |u_i - v_i|

[Figure: grid path from u = (0, 0) to v = (2, 1)]
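The same kind of sketch for the Manhattan distance:

```python
import numpy as np

def manhattan(u, v):
    """Sum of absolute coordinate differences (city-block distance)."""
    return np.sum(np.abs(np.asarray(u) - np.asarray(v)))

print(manhattan((0, 0), (2, 1)))  # 3
```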
Cosine Similarity
• The traditional vector space model is based on a
different notion of similarity: the cosine of the
angle between two vectors.
• To start our consideration of cosine similarity,
consider our document vectors not as points in
term space, but as arrows that travel from the
origin of the space to a particular address.
Calculating Cosine(x, y)
x = (x_1, ..., x_p) and y = (y_1, ..., y_p) are two vectors in p dimensions:

cos(x, y) = \frac{x \cdot y}{\sqrt{\sum_{i=1}^{p} x_i x_i}\,\sqrt{\sum_{i=1}^{p} y_i y_i}} = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2}\,\sqrt{\sum_{i=1}^{p} y_i^2}}    (1)
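A sketch of equation (1) in code (NumPy assumed):

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between x and y, as in equation (1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.dot(x, y) / (np.sqrt(np.dot(x, x)) * np.sqrt(np.dot(y, y)))

print(cosine([1, 0, 1], [1, 1, 0]))  # 0.5
```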
Vector Length and Normalization
[Figure: the vector v = (2, 1) and its length; a^2 + b^2 = c^2, so 2^2 + 1^2 = c^2 and c = \sqrt{5}]
Vector Length and Normalization
• Vector length: \|v\| = \sqrt{\sum_{i=1}^{p} v_i^2} = \sqrt{v \cdot v}

• Normalization: u = \frac{v}{\|v\|}

Thus cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}
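A short sketch of length, normalization, and the resulting dot-product form of the cosine (NumPy assumed):

```python
import numpy as np

v = np.array([2.0, 1.0])
length = np.sqrt(np.dot(v, v))    # ||v|| = sqrt(5)
u = v / length                    # unit-length version of v
print(length, np.linalg.norm(u))  # 2.236..., 1.0

# For unit vectors, the cosine reduces to a plain dot product
x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)
print(np.dot(x, y))               # cos(45°) ≈ 0.707
```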
Using the Cosine for IR
• From now on, unless otherwise specified, we can assume that
all document vectors in our term-document matrix A have been
normalized to unit length.
• Likewise we always normalize our query to unit length.
• Given these assumptions, we have the classic vector space
model for IR.
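A sketch of this retrieval step, assuming a small dense term-document matrix A with documents as rows and a query vector q in the same term space (toy weights):

```python
import numpy as np

# Rows = documents D1..Dn, columns = terms T1..Tt
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
q = np.array([1.0, 1.0, 0.0])

# Normalize every document vector and the query to unit length
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)

# With unit vectors, cosine similarity is just a matrix-vector product
scores = A_unit @ q_unit
print(np.argsort(-scores))  # document indices ranked by cosine similarity
```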
Accuracy Measures
• A natural criterion for judging the performance of a classifier is
the probability of making a misclassification error.
• Misclassification
• The observation belongs to one class, but the model classifies it as a
member of a different class.
• A classifier that makes no errors would be perfect
• Do not expect to be able to construct such classifiers in the real world
• This is due to “noise”: not having all the information needed to precisely classify cases
Accuracy Measures
• To obtain an honest estimate of classification error, we use the
classification matrix that is computed from the validation data.
• We first partition the data into training and validation sets by random
selection of cases.
• We then construct a classifier using the training data,
• apply it to the validation data,
• which yields predicted classifications for the observations in the validation set.
• We then summarize these classifications in a classification matrix.
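A sketch of that procedure with scikit-learn (assumed available); the Iris data, the 70/30 split, and the 3-nearest-neighbor classifier are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

# Random partition into training and validation cases
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the classifier on training data, apply it to validation data
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = clf.predict(X_val)

# Summarize predicted vs. actual classes in a classification (confusion) matrix
print(confusion_matrix(y_val, y_pred))
```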
Problem with accuracy
• If the training and test data are skewed towards one class, then the model may predict everything as being that class.
• In Titanic training data, 68% of people died. If one trained a model
to predict everybody died, it would be 68% accurate.
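As a quick check of that arithmetic, with hypothetical counts mirroring the 68% figure:

```python
# Hypothetical split mirroring the slide: 68 of 100 passengers died
labels = ["died"] * 68 + ["survived"] * 32
predictions = ["died"] * len(labels)   # majority-class "model"

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.68 — "accurate" yet useless for the minority class
```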
Confusion Matrix
Practical Metrics for Data Science Projects
• Case study: in the Flight Delay project with NTU, we applied the
following metrics
• Accuracy
• Precision
• Recall
• F-measure
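How these four metrics fall out of binary confusion-matrix counts; the TP/FP/FN/TN numbers below are made up for illustration:

```python
# Hypothetical binary confusion-matrix counts (e.g. "delayed" = positive class)
tp, fp, fn, tn = 30, 10, 20, 40

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)          # of predicted positives, how many are right
recall    = tp / (tp + fn)          # of actual positives, how many are found
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_measure)  # 0.7, 0.75, 0.6, ≈0.667
```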
Practical k-NN
• We need a data structure to compute near neighbors fast.
• Data structure
• KD-tree
Example: 2-D Tree
2-d tree for these points:
(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)
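A minimal sketch of building that 2-d tree by median splits, alternating the splitting axis per level; the KDNode class and build function are illustrative, not from the slides:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class KDNode:
    point: Tuple[int, int]
    left: "Optional[KDNode]" = None
    right: "Optional[KDNode]" = None

def build(points, depth=0):
    """Recursively split on the median, alternating x (even depth) and y (odd depth)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid],
                  build(points[:mid], depth + 1),
                  build(points[mid + 1:], depth + 1))

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point)  # (7, 2) is the root in this construction
```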
Tree Operation - Insert
(9, 5) => 9 > 7, so go right at the root (7, 2)
(9, 5) => 5 < 6, so go left at (9, 6)
(9, 5) => 9 > 8, so go right at (8, 1)
(9, 5) is inserted as the right child of (8, 1)
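A sketch of that insertion walk, reusing the KDNode structure assumed in the previous sketch:

```python
def insert(node, point, depth=0):
    """Descend comparing one coordinate per level; attach the point at an empty slot."""
    if node is None:
        return KDNode(point)
    axis = depth % 2
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1)
    else:
        node.right = insert(node.right, point, depth + 1)
    return node

insert(tree, (9, 5))  # 9 > 7, then 5 < 6, then 9 > 8
print(tree.right.left.right.point)  # (9, 5), the right child of (8, 1)
```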
Complexity
• Building a static k-d tree from n points takes O(n log n) time.
• Inserting a new point into a balanced k-d tree takes O(log n) time.
• Removing a point from a balanced k-d tree takes O(log n) time.
• Finding 1 nearest neighbour in a balanced k-d tree with randomly distributed
points takes O(log n) time on average.
Search Nearest Neighbor
Search K-Nearest Neighbor
• Apply the nearest-neighbor search while maintaining a list of the k best points found so far.
• Kd-trees are not suitable for efficiently finding the nearest
neighbour in high-dimensional spaces and approximate
nearest-neighbour methods should be used instead.
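In practice a library k-d tree is usually used rather than a hand-rolled one; a sketch with SciPy's cKDTree (assumed available):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], dtype=float)
tree = cKDTree(points)

# 3 nearest neighbors of the query point (9, 5)
dists, idx = tree.query([9.0, 5.0], k=3)
print(points[idx], dists)
```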