Introduction to Machine Learning
Yifeng Tao
School of Computer Science
Carnegie Mellon University
Logistics
o Course website:
o http://www.cs.cmu.edu/~yifengt/courses/machine-learning
o Slides uploaded after lecture
o Time: Mon-Fri 9:50-11:30am lecture, 11:30am-12:00pm discussion
o Contact:
[email protected]
What is machine learning?
o What are we talking about when we talk about AI and ML?
[Figure: nested scope of artificial intelligence ⊃ machine learning ⊃ deep learning]
What is machine learning?
[Figure: machine learning builds on probability, statistics, calculus, and linear algebra, and powers applications such as natural language processing, computer vision, and computational biology]
Where are we?
o Supervised learning: linear models
o Kernel machines: SVMs and duality
o Unsupervised learning: latent space analysis and clustering
o Supervised learning: decision tree, kNN and model selection
o Learning theory: generalization and VC dimension
o Neural network (basics)
o Deep learning in CV and NLP
o Probabilistic graphical models
o Reinforcement learning and its application in clinical text mining
o Attention mechanism and transfer learning in precision medicine
What’s more after introduction?
[Figure: topics that build on introductory machine learning: deep learning, probabilistic graphical models, optimization, and learning theory]
What’s more after introduction?
o Supervised learning: linear models
o Kernel machines: SVMs and duality
o → Optimization
o Unsupervised learning: latent space analysis and clustering
o Supervised learning: decision tree, kNN and model selection
o Learning theory: generalization and VC dimension
o → Statistical machine learning
o Neural network (basics)
o Deep learning in CV and NLP
o → Deep learning
o Probabilistic graphical models
Curriculum for an ML master's/Ph.D. student at CMU
o 10701 Introduction to Machine Learning:
o http://www.cs.cmu.edu/~epxing/Class/10701/
o 36705 Intermediate Statistics:
o http://www.stat.cmu.edu/~larry/=stat705/
o 36708 Statistical Machine Learning:
o http://www.stat.cmu.edu/~larry/=sml/
o 10725 Convex Optimization:
o http://www.stat.cmu.edu/~ryantibs/convexopt/
o 10708 Probabilistic Graphical Models:
o http://www.cs.cmu.edu/~epxing/Class/10708-17/
o 10707 Deep Learning:
o https://deeplearning-cmu-10707.github.io/
o Books:
o Bishop. Pattern Recognition and Machine Learning
o Goodfellow et al. Deep Learning
Introduction to Machine Learning
Neural network (basics)
Yifeng Tao
School of Computer Science
Carnegie Mellon University
Slides adapted from Eric Xing, Maria-Florina Balcan, Russ Salakhutdinov, Matt Gormley
A Recipe for Supervised Learning
o 1. Given training data:
o 2. Choose each of these:
o Decision function
o Loss function
o 3. Define goal and train with SGD:
o (take small steps opposite the gradient)
[Slide from Matt Gormley et al.]
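A minimal NumPy sketch of this recipe; the linear decision function, squared loss, toy data, and learning rate below are illustrative choices, not ones fixed by the slide:

```python
import numpy as np

# 1. Training data: illustrative random regression set (x_i, y_i)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# 2. Decision function f(x) = w^T x and squared loss (example choices)
w = np.zeros(3)

def loss_and_grad(w, x_i, y_i):
    pred = x_i @ w
    loss = 0.5 * (pred - y_i) ** 2
    grad = (pred - y_i) * x_i          # dLoss/dw for this single example
    return loss, grad

# 3. Train with SGD: take small steps opposite the gradient
lr = 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):
        _, g = loss_and_grad(w, X[i], y[i])
        w -= lr * g

print(w)   # weights should move toward [1.0, -2.0, 0.5]
```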
Logistic Regression
o The prediction rule:
o In this case, learning P(y|x) amounts to learning the conditional probability over two Gaussian distributions.
o Limitation: it can only capture simple data distributions.
[Slide from Eric Xing et al.]
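As a reminder of what the prediction rule looks like, a minimal sketch with made-up weights and features (the values are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(y = 1 | x) = sigmoid(w^T x + b); predict y = 1 when this exceeds 0.5,
# i.e. when w^T x + b >= 0 (a linear decision boundary).
w = np.array([0.8, -1.2])    # illustrative weights
b = 0.3
x = np.array([1.0, 0.5])

p = sigmoid(w @ x + b)
y_hat = int(p >= 0.5)
```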
Learning highly non-linear functions
o f: X → y
o f might be a non-linear function
o X: continuous or discrete variables
o y: continuous or discrete variables
[Slide from Eric Xing et al.]
From biological neuron networks to artificial neural networks
o Signals propagate through neurons in the brain.
o Signals propagate through perceptrons in an artificial neural network.
[Slide from Eric Xing et al.]
Perceptron Algorithm and SVM
o Perceptron: a simple learning algorithm for supervised classification, analyzed via geometric margins in the 1950s [Rosenblatt '57].
o Similar to the SVM, it is a linear classifier based on an analysis of margins.
o Originally introduced in the online learning setting.
o Online learning model
o Its guarantees under large margins
[Slide from Maria-Florina Balcan et al.]
The Online Learning Algorithm
o Examples arrive sequentially.
o We need to make a prediction.
o Afterwards, we observe the outcome.
o For i = 1, 2, ...:
o Applications:
o Email classification
o Recommendation systems
o Ad placement in a new market
[Slide from Maria-Florina Balcan et al.]
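A schematic of the online protocol as a loop; `predict`, `observe_outcome`, and `update` are placeholder callables for illustration, not part of the original slide:

```python
def online_learning(stream, predict, observe_outcome, update):
    """Generic online protocol: predict first, then see the truth, then update."""
    mistakes = 0
    for x in stream:                  # examples arrive sequentially
        y_hat = predict(x)            # we must commit to a prediction
        y = observe_outcome(x)        # only afterwards is the outcome revealed
        if y_hat != y:
            mistakes += 1
        update(x, y)                  # learner may adapt before the next example
    return mistakes
```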
Linear Separators: Perceptron Algorithm
o h(x) = w^T x + w_0; if h(x) ≥ 0, label x as +, otherwise label it as −
o Set t = 1, start with the all-zero vector w_1.
o Given example x, predict positive iff w_t^T x ≥ 0
o On a mistake, update as follows:
o Mistake on a positive example: w_{t+1} ← w_t + x
o Mistake on a negative example: w_{t+1} ← w_t − x
o Natural greedy procedure:
o If the true label of x is +1 and w_t is incorrect on x, we have w_t^T x < 0; after the update, w_{t+1}^T x = w_t^T x + x^T x = w_t^T x + ||x||^2, so w_{t+1} has a better chance of classifying x correctly.
o Similarly for mistakes on negative examples.
[Slide from Maria-Florina Balcan et al.]
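A minimal NumPy sketch of these updates on toy data; the bias w_0 is folded into w by appending a constant-1 feature, and the data values are made up for illustration:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Online perceptron: X has a constant-1 column so w absorbs w_0; y in {+1, -1}."""
    w = np.zeros(X.shape[1])          # start with the all-zero vector w_1
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            pred = 1 if w @ x_i >= 0 else -1
            if pred != y_i:           # on a mistake, move w toward or away from x
                w += y_i * x_i        # +x on a false negative, -x on a false positive
    return w

# Toy linearly separable data (last column is the constant-1 bias feature)
X = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```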
Perceptron: Example and Guarantee
o Example:
o Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes ≤ (R/γ)^2 mistakes.
o Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.
[Slide from Maria-Florina Balcan et al.]
Perceptron: Proof of Mistake Bound
o Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes ≤ (R/γ)^2 mistakes.
o Proof:
o Idea: analyze w_t^T w* and ||w_t||, where w* is the max-margin separator with ||w*|| = 1.
o Claim 1: w_{t+1}^T w* ≥ w_t^T w* + γ. (because x^T w* ≥ γ)
o Claim 2: ||w_{t+1}||^2 ≤ ||w_t||^2 + R^2. (by the Pythagorean theorem)
o After M mistakes:
o w_{M+1}^T w* ≥ γM (by Claim 1)
o ||w_{M+1}|| ≤ R√M (by Claim 2)
o w_{M+1}^T w* ≤ ||w_{M+1}|| (since w* is unit length)
o So γM ≤ R√M, hence M ≤ (R/γ)^2.
[Slide from Maria-Florina Balcan et al.]
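Chaining the two claims makes the last step explicit:

\[
\gamma M \;\le\; w_{M+1}^{\top} w^{*} \;\le\; \|w_{M+1}\|\,\|w^{*}\| \;=\; \|w_{M+1}\| \;\le\; R\sqrt{M}
\quad\Longrightarrow\quad M \le \left(\frac{R}{\gamma}\right)^{2}.
\]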
Multilayer perceptron (MLP)
o A simple and basic type of feedforward neural network
o Contains many perceptrons that are organized into layers
o MLP "perceptrons" are not perceptrons in the strict sense
[Slide from Russ Salakhutdinov et al.]
Artificial Neuron (Perceptron)
[Slide from Russ Salakhutdinov et al.]
Activation Function
o Sigmoid activation function:
o Squashes the neuron's output between 0 and 1
o Always positive
o Bounded
o Strictly increasing
o Used in the classification output layer
o Tanh activation function:
o Squashes the neuron's output between -1 and 1
o Bounded
o Strictly increasing
o An affine transformation of the sigmoid function: tanh(x) = 2·sigmoid(2x) − 1
[Slide from Russ Salakhutdinov et al.]
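A quick numerical check of these properties, including the tanh-as-rescaled-sigmoid relation noted above; the input grid is arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
assert np.all((sigmoid(x) > 0) & (sigmoid(x) < 1))        # squashed into (0, 1), always positive
assert np.all((np.tanh(x) > -1) & (np.tanh(x) < 1))       # squashed into (-1, 1)
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)    # tanh as a rescaled, shifted sigmoid
```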
Activation Function
o Rectified linear (ReLU) activation:
o Bounded below by 0 (always non-negative)
o Tends to produce units with sparse activities
o Not upper bounded
o Monotonically non-decreasing (flat at 0 for negative inputs)
o Most widely used activation function
o Advantages:
o Biological plausibility
o Sparse activation
o Better gradient propagation: avoids the vanishing gradients of sigmoidal activations
[Slide from Russ Salakhutdinov et al.]
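A minimal ReLU sketch illustrating the sparsity point; the random pre-activations are made up for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=1000)   # hypothetical pre-activation values
h = relu(pre_activations)

print((h == 0).mean())   # roughly half the units are exactly zero -> sparse activity
print(h.min())           # never negative: bounded below by 0, unbounded above
```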
Activation Function in AlexNet
o A four-layer convolutional neural network
o ReLU: solid line
o Tanh: dashed line
[Slide from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf]
Single Hidden Layer MLP
[Slide from Russ Salakhutdinov et al.]
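A forward-pass sketch for a single-hidden-layer MLP; the layer sizes, random weights, and the tanh/softmax choices are illustrative assumptions, not the slide's exact network:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input dim 3 -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # 4 hidden units -> 2 output classes

def forward(x):
    h = np.tanh(W1 @ x + b1)                    # hidden layer: affine map + nonlinearity
    z = W2 @ h + b2                             # output pre-activations
    z = z - z.max()                             # shift for numerical stability
    return np.exp(z) / np.exp(z).sum()          # softmax over the two classes

probs = forward(np.array([0.5, -1.0, 2.0]))     # class probabilities summing to 1
```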
Capacity of MLP
o Consider a single layer neural network
[Slide from Russ Salakhutdinov et al.]
Capacity of Neural Nets
o Consider a single layer neural network
[Slide from Russ Salakhutdinov et al.]
MLP with Multiple Hidden Layers
[Slide from Russ Salakhutdinov et al.]
Capacity of Neural Nets
o Deep learning playground
[Slide from https://playground.tensorflow.org]
Training a Neural Network
[Slide from Russ Salakhutdinov et al.]
Stochastic Gradient Descent
[Slide from Russ Salakhutdinov et al.]
Mini-batch SGD
o Make updates based on a mini-batch of examples (instead of a single example)
o The gradient is that of the average regularized loss over the mini-batch
o Can give a more accurate (less noisy) estimate of the gradient
o Can leverage matrix-matrix operations, which are more efficient
[Slide from Russ Salakhutdinov et al.]
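A mini-batch SGD sketch; `grad_fn`, the batch size, and the learning rate are illustrative assumptions, not values fixed by the slide:

```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, w, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Update w using the average gradient over each mini-batch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            # average gradient over the mini-batch: less noisy than a single example
            g = np.mean([grad_fn(w, X[i], y[i]) for i in batch], axis=0)
            w = w - lr * g
    return w

# Example usage with a squared-loss gradient for a linear model (illustrative)
squared_loss_grad = lambda w, x_i, y_i: (x_i @ w - y_i) * x_i
```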
Backpropagation
o A method used to compute the gradients needed to train neural networks by gradient descent
o Essentially an implementation of the chain rule plus dynamic programming
o The derivative of the last two terms:
[Slide from https://en.wikipedia.org/wiki/Backpropagation]
Backpropagation
o If o_j is an output unit → straightforward
o Else:
[Slide from https://en.wikipedia.org/wiki/Backpropagation]
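A hand-rolled forward and backward pass for a tiny two-layer network with squared error, written in matrix form rather than the per-unit o_j notation; all sizes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x, t = rng.normal(size=3), np.array([1.0])      # input and target
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))

# Forward pass, caching intermediate values for reuse in the backward pass
z1 = W1 @ x
h = np.tanh(z1)
yhat = W2 @ h
loss = 0.5 * np.sum((yhat - t) ** 2)

# Backward pass: apply the chain rule layer by layer, reusing cached activations
d_yhat = yhat - t                               # dL/dyhat
dW2 = np.outer(d_yhat, h)                       # dL/dW2
d_h = W2.T @ d_yhat                             # propagate the error to the hidden layer
d_z1 = d_h * (1 - np.tanh(z1) ** 2)             # through the tanh nonlinearity
dW1 = np.outer(d_z1, x)                         # dL/dW1
```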
Weight Decay
[Slide from Russ Salakhutdinov et al.]
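The weight decay idea in one update step; the penalty strength and the example vectors are illustrative:

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, lam=1e-4):
    """One SGD step with an L2 penalty (lam/2)*||w||^2 added to the loss."""
    return w - lr * (grad + lam * w)   # the extra lam*w term shrinks weights toward 0

w = np.array([1.0, -2.0, 0.5])
g = np.array([0.2, 0.1, -0.3])         # illustrative data-loss gradient
w = sgd_step_with_weight_decay(w, g)
```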
Optimization: Momentum
o Momentum: can use an exponential average of previous gradients:
o Can get past plateaus more quickly by "gaining momentum"
o Works well in regions with an ill-conditioned Hessian matrix
o SGD without momentum vs. SGD with momentum (animations)
[Slide from http://ruder.io/optimizing-gradient-descent/]
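A momentum update sketch; beta and the learning rate are typical but illustrative values:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """Keep an exponential average v of past gradients and step along it."""
    v = beta * v + grad        # accumulated "velocity"
    w = w - lr * v             # step along the smoothed direction
    return w, v

w, v = np.array([1.0, -1.0]), np.zeros(2)
g = np.array([0.5, -0.2])      # illustrative gradient
w, v = momentum_step(w, v, g)
```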
Momentum-based Optimization
o Nesterov accelerated gradient (NAG):
o Adagrad:
o Smaller updates for parameters associated with frequently occurring features
o Larger updates for parameters associated with infrequent features
o RMSprop and Adadelta:
o Reduce the aggressive, monotonically decreasing learning rate of Adagrad
o Adam: combines RMSprop-style scaling with momentum
[Slide from http://ruder.io/optimizing-gradient-descent/]
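A sketch of the per-parameter scaling behind Adagrad and Adam; the hyperparameters are commonly quoted defaults, used here only for illustration:

```python
import numpy as np

def adagrad_step(w, cache, grad, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients; rarely-updated parameters keep larger steps."""
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-style first moment + RMSprop-style second moment, bias-corrected."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias-corrected estimates
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One illustrative Adam step (t starts at 1)
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, m, v, grad=np.array([0.1, -0.4, 0.05]), t=1)
```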
Demo of Optimization Methods
[Slide from http://ruder.io/optimizing-gradient-descent/]
Take home message
o The perceptron is an online linear classifier
o A multilayer perceptron consists of perceptron-like units with various activation functions
o Backpropagation computes the gradients of a neural network in a backward pass, combining the chain rule with dynamic programming
o Momentum-based mini-batch gradient descent methods are used to optimize neural networks
o What's next?
o Regularization in neural networks
o Widely used NN architectures in practice
References
o Eric Xing, Tom Mitchell. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701-06f/
o Barnabás Póczos, Maria-Florina Balcan, Russ Salakhutdinov. 10715 Advanced Introduction to Machine Learning: https://sites.google.com/site/10715advancedmlintro2017f/lectures
o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o Wikipedia