Chapter 1: Introduction
Notes on MLAPP
Wu Ziqing
School of Computer Science and Engineering
Nanyang Technological University
13/07/2018
Outline
1 What is Machine Learning
Definition
2 Types of Machine Learning
Supervised Machine Learning
Classification
Regression
Unsupervised Machine Learning
Clustering
Dimensionality Reduction
Graph Structure
Matrix Completion
Reinforcement Learning
3 Basic Concepts in Machine Learning
Parametric and Non-Parametric models
KNN classifier
The Curse of Dimensionality
Parametric Models for Classification and Regression
Linear Regression
Logistic Regression
Overfitting
Model Selection
No Free Lunch Theorem
What is Machine Learning
Definition of Machine Learning:
A set of methods that can automatically detect patterns in data,
and then use the uncovered patterns to predict future data, or to
perform other kinds of decision making under uncertainty.
Types of Machine Learning
Supervised/Predictive Machine Learning
Unsupervised/Descriptive Machine Learning
Reinforcement Learning
Supervised Machine Learning
Goal: to learn a mapping from inputs x to outputs y, given a labelled
input-output set:
D = {(x_i, y_i)}_{i=1}^N
D: also called the Training Set.
N: number of training samples.
x_i: usually a D-dimensional vector; can also be a complex structured
object: an image, a text document, a time series, and so on.
Also called the Attributes, Features, or Covariates.
y_i: if it is categorical, the problem is called Classification or
Pattern Recognition.
If it is real-valued, the problem is called Regression.
If it is categorical with some natural ordering, the problem is called
Ordinal Regression.
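As a minimal sketch (with made-up numbers), a training set can be represented
as a feature matrix X and a label vector y:

    import numpy as np

    # A made-up training set D = {(x_i, y_i)}: each row of X is one input
    # x_i (here a 2-dimensional vector), and y holds the corresponding labels.
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])   # N = 3 training samples
    y = np.array([0, 1, 1])      # categorical y_i -> a classification problem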
Supervised Machine Learning
Classification
Types of classification according to output y ∈ {1, ..., C}:
Binary Classification: C = 2, there are only two possible labels of y,
e.g., 0 or 1.
Multiclass Classification: C > 2, there are more than two possible
labels of y.
Single Output Classification: the class labels are mutually exclusive,
thus each input x can be mapped to exactly one label of y.
Multi-label (Multiple Output) Classification: the class labels are not
mutually exclusive, thus each input x can be mapped to several labels.
(e.g., a person can be mapped to both tall and strong)
This is better viewed as predicting multiple related binary class labels.
(e.g., a person is tall/not tall, strong/not strong)
Supervised Machine Learning
Classification
Function Approximation: We can formalise classification as follows:
We assume that y = f(x) for some unknown function f, and our goal is
to estimate f given the training set D.
We can then predict the output ŷ = f̂(x) = argmax_{c=1,...,C} p(y = c | x, D),
which means finding the most probable class label given a novel input
x and the training set D. Making predictions on novel inputs this way
is called Generalisation.
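As a minimal illustration (the class posterior below is made up), the
prediction is just an argmax over the C class probabilities:

    import numpy as np

    # Assumed posterior p(y = c | x, D) over C = 3 classes for a novel input x.
    probs = np.array([0.1, 0.7, 0.2])
    y_hat = np.argmax(probs)   # MAP estimate: the most probable class label
    print(y_hat)               # -> 1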
Supervised Machine Learning
Classification
Some real-world applications of classification:
Document classification (e.g., email spam filtering)
Image classification (e.g., handwriting recognition)
Object detection (e.g., face detection)
Supervised Machine Learning
Regression
Estimate f(x) to predict a continuous output ŷ given an input x and a
training set D.
f̂(x) can be a model such as linear regression, polynomial regression, and
so on.
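As a minimal sketch on synthetic data, a linear fit can be obtained by least
squares (np.polyfit here):

    import numpy as np

    # Fit y ≈ w0 + w1*x to noisy synthetic data, then predict at a new input.
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)

    w1, w0 = np.polyfit(x, y, deg=1)   # least-squares slope and intercept
    y_hat = w0 + w1 * 5.0              # continuous prediction at x = 5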
Unsupervised Machine Learning
Goal: Find interesting patterns given only the input data (also known as
knowledge discovery):
D = {x_i}_{i=1}^N
The task can be formalised as (unconditional) density estimation, i.e.,
the model should be in the form of:
p(x_i | θ)
Unsupervised Machine Learning
Clustering
To cluster a set of data into groups:
1 find the number of clusters K, where K* = argmax_K p(K | D)
2 infer which cluster each data point x_i belongs to:
z_i = argmax_k p(z_i = k | x_i, D)
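As a concrete sketch using scikit-learn's GaussianMixture, with the BIC score
as a stand-in for the argmax_K p(K | D) model-selection step:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Two synthetic Gaussian blobs; choose K by the lowest BIC score, then
    # assign each point to its most probable cluster.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, 6)]
    best = min(models, key=lambda m: m.bic(X))  # model-selection step
    z = best.predict(X)                         # z_i = argmax_k p(k | x_i, D)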
Unsupervised Machine Learning
Dimensionality Reduction
For high dimensional data, the variability of the data may be captured by
only a few latent factors. We can perform dimensionality reduction by
projecting the high dimensional data onto a lower dimensional subspace that
captures the essence of the data.
Advantages:
better prediction accuracy
faster nearest neighbour searches
easier visualisation of high dimensional data
The most common approach to dimensionality reduction is called
Principal Component Analysis (PCA).
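A minimal sketch using scikit-learn's PCA on random data:

    import numpy as np
    from sklearn.decomposition import PCA

    # Project 10-dimensional data onto the 2-dimensional subspace that
    # captures the most variance.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))

    pca = PCA(n_components=2)
    Z = pca.fit_transform(X)              # (100, 2) projection of the data
    print(pca.explained_variance_ratio_)  # variance captured per component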
Unsupervised Machine Learning
Graph Structure
We want to learn how the variables in a data set are correlated,
represented by a graph structure, by computing:
Ĝ = argmax_G p(G | D)
In the graph, each variable is represented by a node and each edge
represents a direct dependence between two variables.
This Sparse Graphical Model can help to:
Gain knowledge by interpreting the graph structure.
Get better joint probability estimators to model correlations and make
predictions.
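As one concrete (assumed) instance of this idea, scikit-learn's
GraphicalLasso estimates a sparse Gaussian graphical model; zeros in the
estimated precision matrix correspond to missing edges:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    # Fit a sparse precision matrix; its non-zero off-diagonal entries give
    # the edges of the dependence graph. (A common estimator, not the
    # general argmax_G p(G | D) computation itself.)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))

    model = GraphicalLasso(alpha=0.1).fit(X)
    edges = np.abs(model.precision_) > 1e-4   # adjacency pattern of the graph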
Unsupervised Machine Learning
Matrix Completion
Sometimes the design matrix will have missing values, encoded as NaN (Not
a Number). Matrix Completion (Imputation) can infer plausible values for
the missing entries. Matrix Completion can be applied to:
Image inpainting: to denoise/complete an image.
Collaborative filtering: to complete the missing entries in a
user-item rating matrix.
Market Basket Analysis: in a binary matrix, predict which cells will be
turned on given that a few have been turned on (i.e., predict which items
will be purchased together).
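A minimal sketch of one simple imputation scheme (iterative low-rank SVD
approximation; not the book's specific method):

    import numpy as np

    # Fill missing entries with column means, then alternate between a rank-r
    # SVD approximation and restoring the observed entries.
    def complete(X, rank=2, iters=50):
        mask = np.isnan(X)
        filled = np.where(mask, np.nanmean(X, axis=0), X)  # initial guess
        for _ in range(iters):
            U, s, Vt = np.linalg.svd(filled, full_matrices=False)
            low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
            filled = np.where(mask, low_rank, X)  # keep observed values fixed
        return filled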
Types of Machine Learning
Reinforcement Learning
Goal: Learn how to behave given occasional reward/punishment signals.
Parametric and Non-Parametric Models
Parametric Model: the model has a fixed number of parameters.
Advantage: faster to use
Disadvantage: makes stronger assumptions about the nature of the data
distribution
Non-Parametric Model: its number of parameters grows with the
amount of training data.
Advantage: more flexible
Disadvantage: computationally intractable for large datasets
Parametric and Non-Parametric models
KNN classifier
The K Nearest Neighbour (KNN) classifier is an example of a
non-parametric classifier. It works as follows:
1 look at the K nearest neighbours of a test point x
2 calculate the fraction of each class among these neighbours:
p(y = c | x, D, K) = (1/K) Σ_{i ∈ N_K(x,D)} I(y_i = c)
where N_K(x, D) denotes the (indices of the) K nearest points to x in
D, and I(e) is the indicator function, which equals 1 iff e is true.
3 assign to x the class c with the highest fraction
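A minimal from-scratch sketch of this procedure (Euclidean distance assumed):

    import numpy as np

    # Estimate p(y = c | x, D, K) as the fraction of the K nearest
    # neighbours with label c, and predict the argmax over classes.
    def knn_predict(X_train, y_train, x, K=3):
        dists = np.linalg.norm(X_train - x, axis=1)  # distance to each x_i
        neighbours = np.argsort(dists)[:K]           # indices in N_K(x, D)
        classes, counts = np.unique(y_train[neighbours], return_counts=True)
        return classes[np.argmax(counts)]            # most frequent class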
Parametric and Non-Parametric models
The Curse of Dimensionality
Curse of Dimensionality: with high dimensional data, some algorithms
perform poorly.
For example, in D-dimensional KNN, for a hypercube¹ to contain a
fraction f of the total data as neighbours, the expected edge length is
e_D(f) = f^(1/D). For a 10-dimensional dataset, even to contain just 1% of
the data, the cube needs to extend to 63% of the data's range along each
dimension.
¹ an n-dimensional shape whose edges are axis-aligned, mutually
perpendicular, and of equal length
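The formula is easy to check numerically:

    # Edge length e_D(f) = f**(1/D) a hypercube needs to cover a fraction f
    # of uniformly distributed data, for several dimensions D.
    f = 0.01
    for D in (1, 3, 10, 100):
        print(D, f ** (1 / D))   # D = 10 gives ~0.63, i.e. 63% per dimension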
Parametric Models for Classification and Regression
Linear Regression
Linear regression asserts that the response is a linear function of the
input:
y(x) = wᵀx + ε = Σ_{j=1}^{D} w_j x_j + ε, where ε ~ N(µ, σ²)
Or it can be written as:
p(y | x, θ) = N(y | µ(x), σ²(x)),
where µ(x) = wᵀx, σ²(x) = σ², and θ = (w, σ²)
We can also replace x with a non-linear function of x, Φ(x), to obtain
Polynomial Regression:
p(y | x, θ) = N(y | wᵀΦ(x), σ²(x))
Replacing the input with a function of the input is known as basis
function expansion.
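A minimal sketch of basis function expansion: build polynomial features
Φ(x) = [1, x, x², x³] and fit w by ordinary least squares:

    import numpy as np

    # Polynomial regression via an explicit design matrix of basis functions.
    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 100)
    y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)  # noisy target

    Phi = np.vander(x, 4, increasing=True)      # columns: x^0, x^1, x^2, x^3
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None) # least-squares weights
    y_hat = Phi @ w                             # fitted values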
Parametric Models for Classification and Regression
Logistic Regression
To adapt linear regression to (binary) classification, we perform the
following two steps:
1 change the Gaussian distribution into a Bernoulli distribution, which
is more suitable for a binary response:
p(y | x, w) = Ber(y | µ(x)), where µ(x) = p(y = 1 | x)
2 pass the linear combination of the inputs through a sigmoid function,
to ensure 0 ≤ µ(x) ≤ 1:
µ(x) = sigm(wᵀx), where sigm(η) = 1 / (1 + e^(−η))
If we set a threshold of 0.5,
ŷ(x) = 1 ⇐⇒ p(y = 1 | x) > 0.5
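A minimal sketch of this decision rule, with made-up (not fitted) weights w:

    import numpy as np

    # Squash the linear combination wᵀx through the sigmoid, then threshold.
    def sigm(eta):
        return 1.0 / (1.0 + np.exp(-eta))

    w = np.array([2.0, -1.0])   # assumed weights, not learned from data
    x = np.array([0.5, 0.2])
    mu = sigm(w @ x)            # µ(x) = p(y = 1 | x)
    y_hat = int(mu > 0.5)       # threshold at 0.5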
Parametric Models for Classification and Regression
Overfitting
If we try to model every minor variation in the data, we are likely to
model the noise as well as the true signal. This is called Overfitting.
Although an overfit model may achieve perfect results on the training set,
it may predict poorly on novel data.
Parametric Models for Classification and Regression
Model Selection
To determine whether the model is so complex that it overfits, we can
compute the Misclassification Rate, which is the proportion of incorrect
predictions:
err(f, D) = (1/N) Σ_{i=1}^{N} I(f(x_i) ≠ y_i)
What we are really interested in is the Generalisation Error, the expected
misclassification rate over future data. Since we do not have future data,
we can divide the available data into a training set and a
validation set.
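In code, the misclassification rate is a one-liner:

    import numpy as np

    # err(f, D): fraction of predictions that disagree with the true labels.
    def err(y_pred, y_true):
        return np.mean(y_pred != y_true)

    print(err(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))  # -> 0.25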
Parametric Models for Classification and Regression
Model Selection (Cont.)
If the validation set is too small to give a reliable estimate, we can use
the technique of Cross Validation (CV):
1 Divide the data into K folds.
2 For each fold k ∈ {1, 2, ..., K}, train the model on the remaining
folds and test against the k-th fold.
3 Do the training and testing in a round-robin fashion, then average
the K error estimates.
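A minimal from-scratch sketch of K-fold CV; fit and predict are assumed
user-supplied callables for training a model and making predictions:

    import numpy as np

    # Rotate which fold is held out, and average the error over the K rounds.
    def cross_val_err(fit, predict, X, y, K=5):
        folds = np.array_split(np.random.permutation(len(X)), K)
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit(X[train], y[train])
            errs.append(np.mean(predict(model, X[test]) != y[test]))
        return float(np.mean(errs))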
Parametric Models for Classification and Regression
No Free Lunch Theorem
No Free Lunch Theorem: There is no universally best model.
“All models are wrong, but some are useful.” — George Box