Meta Learning:
Learn to learn
Hung-yi Lee
What does “meta” mean? meta-X = X about X
Source of image: https://medium.com/intuitionmachine/the-brute-force-method-of-deep-learning-innovation-58b497323ae5 (Denny Britz’s graphic)
What are the assignments of this course about?
(Thanks to 沈昇勳 for providing the figures.)
Industry: use 1000 GPUs to try 1000 sets of hyperparameters.
Academia: “telepathize” (通靈) a good set of hyperparameters.
Can machines automatically determine the hyperparameters?
Machine Learning 101
Dog-Cat Classification
Machine Learning ≈ looking for a function, e.g., 𝑓(image) = “cat”
• Step 1: function with unknown parameters, 𝑓𝜽
  (The weights and biases of the neurons are the unknown, learnable
  parameters; 𝜽 represents all of them.)
• Step 2: define the loss function
• Step 3: optimization
Machine Learning
• Step 1: function with unknown parameters, 𝑓𝜽
• Step 2: define the loss function
  For each training example, compare 𝑓𝜽's cat/dog prediction with the
  ground truth; 𝑒ₖ is the cross-entropy for the k-th example, and
  𝐿(𝜽) = Σₖ 𝑒ₖ (k = 1 … K)
• Step 3: optimization
Machine Learning 101
• Step 1: function with unknown parameters, 𝑓𝜽
• Step 2: define the loss function: 𝐿(𝜽) = Σₖ 𝑒ₖ (k = 1 … K, summed over
  the training examples)
• Step 3: optimization: 𝜽∗ = arg min_𝜽 𝐿(𝜽), done by gradient descent
𝑓𝜽∗ is the function learned by the learning algorithm from data.
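The three steps can be sketched in a few lines of Python. This is a toy illustration only: a one-parameter model and squared error stand in for the network and cross-entropy, and the data and learning rate are made-up assumptions, not from the course:

```python
# Toy illustration of the three ML steps with a 1-D model f_theta(x) = theta * x.

def f(theta, x):                       # Step 1: function with unknown parameter theta
    return theta * x

def loss(theta, examples):             # Step 2: L(theta) = sum of per-example errors e_k
    return sum((f(theta, x) - y) ** 2 for x, y in examples)

def grad(theta, examples):             # dL/dtheta for the squared error above
    return sum(2 * (f(theta, x) - y) * x for x, y in examples)

def optimize(examples, theta=0.0, lr=0.01, steps=200):   # Step 3: gradient descent
    for _ in range(steps):
        theta -= lr * grad(theta, examples)
    return theta

examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data; ground truth is y = 2x
theta_star = optimize(examples)        # theta* = arg min L(theta), close to 2.0
```

Here `optimize` plays the role of the hand-crafted learning algorithm 𝐹: it maps training examples to 𝑓𝜽∗.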
Introduction of
Meta Learning
What is Meta Learning?
A learning algorithm 𝐹 takes training examples (e.g., labeled cat and dog
images) as input and outputs a classifier 𝑓∗; at testing time, 𝑓∗ takes an
image as input and outputs “cat”. The algorithm 𝐹 is hand-crafted, while
𝑓∗ is learned from data.
Can we learn the function 𝐹 itself? Yes, by following the same three steps
as in ML!
Meta Learning – Step 1
• What is learnable in a learning algorithm?
Components of deep learning: network architecture, initial parameters,
learning rate, ……
In meta learning, we will try to learn some of them.
Meta Learning – Step 1
• What is learnable in a learning algorithm?
𝜙: the learnable components of 𝐹𝜙 (network architecture, initial
parameters, learning rate, ……)
Meta-learning approaches are categorized based on what is learnable.
Meta Learning – Step 2
• Define the loss function 𝐿(𝜙) for the learning algorithm 𝐹𝜙.
𝐿(𝜙) is computed from training tasks, each with its own train and test
split, e.g.:
Task 1 (apple & orange): train and test examples of apples and oranges
Task 2 (car & bike): train and test examples of cars and bikes
Meta Learning – Step 2
How to define 𝐿(𝜙)?
Feed the training examples of task 1 (apple & orange) into 𝐹𝜙 to obtain a
classifier 𝑓𝜽₁∗, where 𝜽₁∗ denotes the parameters of the classifier
learned by 𝐹𝜙 using the training examples of task 1.
How can we know whether this classifier is good or bad?
Evaluate the classifier on the task's testing set.
Meta Learning – Step 2
Apply 𝑓𝜽₁∗ to the testing examples of task 1, compute the difference
(cross-entropy) between each prediction and the ground truth, and sum them
to obtain the loss 𝑙¹ for task 1.
Meta Learning – Step 2
Do the same for task 2 (bike & car) to obtain 𝑙².
Total loss: 𝐿(𝜙) = 𝑙¹ + 𝑙² (sum over all the training tasks)
Meta Learning – Step 2
In general:
Total loss: 𝐿(𝜙) = Σₙ 𝑙ⁿ (n = 1 … N, where 𝑁 is the number of training tasks)
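The total loss above can be sketched directly in Python. Here a hypothetical `learn` function plays the role of 𝐹𝜙 and `task_loss` computes 𝑙ⁿ on a task's test split; the constant-predictor model and the toy tasks are illustrative assumptions, not from the slides:

```python
# Sketch of the meta loss L(phi) = sum over tasks of l^n.
# `learn` plays the role of F_phi: it maps phi plus a task's training
# examples to a classifier/predictor; all names here are hypothetical.

def meta_loss(phi, tasks, learn, task_loss):
    total = 0.0
    for train_examples, test_examples in tasks:
        predictor = learn(phi, train_examples)          # within-task training: theta_n*
        total += task_loss(predictor, test_examples)    # l^n: computed on the TEST split
    return total

# Toy usage: F_phi returns a constant predictor (the train mean shifted by
# phi), and l^n is the squared error on the task's test examples.
learn = lambda phi, train: sum(train) / len(train) + phi
task_loss = lambda pred, test: sum((pred - y) ** 2 for y in test)
tasks = [([1.0, 1.0], [2.0]), ([3.0, 3.0], [4.0])]
bad = meta_loss(0.0, tasks, learn, task_loss)    # each task off by 1 -> total 2.0
good = meta_loss(1.0, tasks, learn, task_loss)   # phi = 1 fixes both tasks -> 0.0
```

Minimizing `meta_loss` over `phi` is exactly Step 3 on the next slide.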
Meta Learning – Step 2
In typical ML, you compute the loss based on training examples.
In meta learning, you compute the loss based on testing examples.
Hold on! You use testing examples during training???
Meta Learning – Step 2
To be precise: in meta learning, you compute the loss based on the testing
examples of the training tasks, so no data from the testing task is used
during training.
Meta Learning – Step 3
• Loss function for the learning algorithm: 𝐿(𝜙) = Σₙ 𝑙ⁿ (n = 1 … N)
• Find the 𝜙 that minimizes 𝐿(𝜙): 𝜙∗ = arg min_𝜙 𝐿(𝜙)
• Use the optimization approach you know:
  If you know how to compute 𝜕𝐿(𝜙)/𝜕𝜙, gradient descent is your friend.
  What if 𝐿(𝜙) is not differentiable?
  Use reinforcement learning or an evolutionary algorithm.
Now we have a learned “learning algorithm” 𝐹𝜙∗.
Framework
Training tasks (e.g., task 1: apple & orange; task 2: car & bike) are not
related to the testing task. They are used to obtain the learned “learning
algorithm” 𝐹𝜙∗, which is then applied to the testing task we really care
about (cat & dog): train 𝑓𝜽∗ on its training examples, then test. The
testing task only needs a little labeled training data.
ML v.s. Meta
Goal
Machine Learning ≈ find a function 𝑓, e.g., dog-cat classification:
𝑓(image) = “cat”
Meta Learning ≈ find a function 𝐹 that finds a function 𝑓:
𝐹(training examples) = 𝑓
Training Data
Machine Learning: the training data of one task (e.g., labeled cat/dog
images).
Meta Learning: training tasks, each with its own train and test split
(task 1: apple & orange; task 2: car & bike).
The per-task train and test sets are called the support set and query set
in the literature of “learning to compare”.
Machine Learning: within-task training, where a hand-crafted learning
algorithm 𝐹 maps one task's training examples (e.g., cat/dog) to 𝑓𝜽∗.
Meta Learning: across-task training, where the training tasks (task 1:
apple & orange; task 2: car & bike, each with train and test splits) are
used to learn the learning algorithm 𝐹𝜙∗.
Machine Learning: within-task testing, applying the learned 𝑓𝜽∗ to the
test examples (e.g., output “cat”).
Meta Learning: across-task testing. Given the testing task (cat & dog),
run within-task training with the learned “learning algorithm” 𝐹𝜙∗ to get
𝑓𝜽∗, then run within-task testing. One round of within-task training plus
within-task testing is called an episode.
Loss
Machine Learning: 𝐿(𝜽) = Σₖ 𝑒ₖ (k = 1 … K), a sum over the training
examples in one task.
Meta Learning: 𝐿(𝜙) = Σₙ 𝑙ⁿ (n = 1 … N), a sum over the training tasks,
where each 𝑙ⁿ is itself a sum over the testing examples of one task.
𝐿(𝜙) = Σₙ 𝑙ⁿ (n = 1 … N)
If your optimization method needs to compute 𝐿(𝜙), then across-task
training (the outer loop in “learning to initialize”) includes within-task
training and within-task testing (the inner loop in “learning to
initialize”): for each task, feed its training examples into 𝐹𝜙 to obtain
𝑓𝜽∗ (within-task training), then evaluate 𝑓𝜽∗ on the task's testing
examples (within-task testing) to compute 𝑙ⁿ.
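The outer/inner loop structure can be sketched as follows. This is a first-order sketch in the spirit of “learning to initialize” (it treats d𝑙/d𝜙 ≈ d𝑙/d𝜽, as first-order MAML variants do); the 1-D model, the tasks, and the step sizes are toy assumptions:

```python
# Minimal first-order sketch of "learning to initialize": the outer
# (across-task) loop updates a shared init phi; the inner loop does
# within-task training from phi, then within-task testing to get l^n.

def inner_train(phi, train, inner_lr=0.1, steps=5):
    theta = phi                                # start from the learned init
    for _ in range(steps):                     # within-task training
        g = sum(2 * (theta - y) for y in train) / len(train)
        theta -= inner_lr * g
    return theta

def outer_step(phi, tasks, outer_lr=0.05):
    grad_phi = 0.0
    for train, test in tasks:
        theta = inner_train(phi, train)        # within-task training
        # Within-task testing: l^n is the squared error on the test split.
        # First-order shortcut: use dl/dtheta as the gradient w.r.t. phi.
        grad_phi += sum(2 * (theta - y) for y in test) / len(test)
    return phi - outer_lr * grad_phi           # across-task update of the init

# Two toy tasks whose optima are 1.0 and 3.0; a good shared init is 2.0.
tasks = [([1.0], [1.0]), ([3.0], [3.0])]
phi = 0.0
for _ in range(200):                           # across-task training
    phi = outer_step(phi, tasks)
```

Each `outer_step` call is one pass over the training tasks; each `inner_train` call is one episode's within-task training.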
Meta Learning v.s. ML
• What you know about ML usually applies to meta learning:
  • Overfitting on training tasks
  • Getting more training tasks to improve performance
  • Task augmentation
  • There are also hyperparameters when learning a learning algorithm ……
    use a development task to choose them ☺
What is learnable in a learning algorithm?
Review: Gradient Descent (function 𝐹)
Fix a network structure, initialize the parameters to 𝜽⁰, then repeatedly
compute the gradient on the training data and update:
𝜽⁰ → 𝜽′ → 𝜽′′ → ⋯ → 𝜽∗
Learning to initialize
• Model-Agnostic Meta-Learning (MAML, pronounced like “mammal”)
  Chelsea Finn, Pieter Abbeel, and Sergey Levine, “Model-Agnostic
  Meta-Learning for Fast Adaptation of Deep Networks”, ICML, 2017
• Reptile
  https://arxiv.org/abs/1803.02999
• How to train your MAML (the title puns on “How to Train Your Dragon”)
  Antreas Antoniou, Harrison Edwards, Amos Storkey, “How to train your
  MAML”, ICLR, 2019
MAML
Find a good init using training tasks (task 1, task 2, …), then apply it to
the testing task (cat & dog).
Pre-training (Self-supervised Learning)
Find a good init by training on proxy tasks (fill-in-the-blanks, etc.),
then apply it to the downstream task (cat & dog).
MAML: isn't it domain adaptation / transfer learning?
MAML: find a good init from training tasks (task 1, task 2).
Pre-training (more typical ways): find a good init by using data from
different tasks to train one model, also known as multi-task learning
(a baseline of meta learning).
MAML v.s. Pre-training
• https://youtu.be/vUwOA3SNb_E
(The video contains product placement that is hard to guard against; call
it “meta product placement”.)
MAML is good because ……
• ANIL (Almost No Inner Loop)
Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals, Rapid Learning or
Feature Reuse? Towards Understanding the Effectiveness of MAML, ICLR, 2020
More about MAML
• More mathematical details behind MAML
• https://youtu.be/mxqzGwP_Qys
• First order MAML (FOMAML)
• https://youtu.be/3z997JhL9Oo
• Reptile
• https://youtu.be/9jJe2AD35P8
Optimizer
Basic form: 𝜽𝒕+𝟏 ← 𝜽𝒕 − 𝜆𝒈𝒕
Adagrad, RMSprop, NAG, Adam ……
Is the optimizer learnable? (𝜙 would be components of the update rule.)
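One simple way to make the update rule learnable is to treat the step size 𝜆 as the component 𝜙 and pick the value that minimizes the loss reached by the inner optimization. A toy sketch (the quadratic loss and the candidate grid are assumptions for illustration; real work, such as Andrychowicz et al. 2016 cited below, learns a network that outputs the update instead):

```python
# The basic update theta_{t+1} = theta_t - lambda * g_t, with the step
# size treated as the learnable component phi. Evaluating one candidate
# phi means running the whole inner optimization on a toy quadratic loss.

def run_inner(lr, theta=10.0, steps=20):
    for _ in range(steps):
        g = 2 * theta              # gradient of the toy loss theta^2
        theta = theta - lr * g     # the basic update rule from the slide
    return theta ** 2              # L(phi): final loss reached with this lr

# Crude outer "learning": pick the phi with the lowest L(phi).
candidates = [0.01, 0.1, 0.4, 0.9]
best_lr = min(candidates, key=run_inner)   # 0.01 barely moves; 0.9 oscillates
```

Grid search over 𝜆 is just the simplest stand-in; the cited work replaces the whole update rule with a learned function.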
The initialization 𝜽⁰ can be learned by MAML. In the gradient-descent
diagram (network structure → init 𝜽⁰ → compute gradient on the training
data → update 𝜽′ → 𝜽′′ → ⋯ → 𝜽∗), the update rule itself, the optimizer,
can also be learned:
Marcin Andrychowicz, et al., Learning to learn by gradient descent by
gradient descent, NIPS, 2016
Network Architecture Search (NAS)
In the gradient-descent diagram, the network structure is the learnable
component 𝜙.
Network Architecture Search (NAS)
𝜙 = 𝑎𝑟𝑔 𝑚𝑖𝑛 𝐿 𝜙 ∇𝜙 𝐿 𝜙 =?
𝜙
Network
Architecture
• Reinforcement Learning
• Barret Zoph, et al., Neural Architecture Search with Reinforcement
Learning, ICLR 2017
• Barret Zoph, et al., Learning Transferable Architectures for Scalable Image
Recognition, CVPR, 2018
• Hieu Pham, et al., Efficient Neural Architecture Search via Parameter
Sharing, ICML, 2018
An agent uses a set of actions to −𝐿 𝜙
determine the network architecture. Reward to be
𝜙: the agent’s parameters maximized
Network Architecture Search (NAS)
Across-task training: update 𝜙 to maximize the reward −𝐿(𝜙).
The agent (an RNN with parameters 𝜙) outputs tokens that form a network;
the network is trained (within-task training), and its accuracy is the
reward −𝐿(𝜙).
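The loop above can be sketched with a REINFORCE-style policy-gradient update. Everything here is a toy assumption: the “architectures” are two labels, training a network is replaced by a fixed accuracy lookup table, and the exact expected policy gradient is used instead of sampling, so the sketch is deterministic:

```python
# Toy sketch of NAS via RL: phi is the agent's logits over candidate
# architectures; "train the network and measure accuracy" is replaced by
# a made-up lookup table standing in for within-task training.
import math

accuracy = {"small": 0.6, "large": 0.9}     # made-up rewards -L(phi)
arches = list(accuracy)
phi = [0.0, 0.0]                            # agent parameters (logits)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

for _ in range(500):                        # across-task training: update phi
    p = softmax(phi)
    for i, arch in enumerate(arches):       # expected REINFORCE gradient
        reward = accuracy[arch]             # reward for picking this arch
        for j in range(len(phi)):           # d log p_i / d phi_j = 1[i=j] - p_j
            phi[j] += 0.1 * p[i] * reward * ((1.0 if j == i else 0.0) - p[j])

probs = softmax(phi)                        # the agent now prefers "large"
```

A real NAS agent samples one architecture per iteration and uses the sampled-gradient estimator; the expected gradient above keeps the toy example reproducible.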
Network Architecture Search (NAS)
𝜙∗ = arg min_𝜙 𝐿(𝜙), ∇𝜙 𝐿(𝜙) = ?
• Evolutionary Algorithms
  • Esteban Real, et al., Large-Scale Evolution of Image Classifiers, ICML 2017
  • Esteban Real, et al., Regularized Evolution for Image Classifier
    Architecture Search, AAAI, 2019
  • Hanxiao Liu, et al., Hierarchical Representations for Efficient
    Architecture Search, ICLR, 2018
Network Architecture Search (NAS)
𝜙∗ = arg min_𝜙 𝐿(𝜙), ∇𝜙 𝐿(𝜙) = ?
• DARTS (a differentiable relaxation of the architecture)
  Hanxiao Liu, et al., DARTS: Differentiable Architecture Search, ICLR, 2019
Data Processing?
In the gradient-descent diagram, the processing of the training data can
also be the learnable component 𝜙.
Data Augmentation
Yonggang Li, Guosheng Hu, Yongtao Wang, Timothy Hospedales, Neil M.
Robertson, Yongxin Yang, DADA: Differentiable Automatic Data Augmentation,
ECCV, 2020
Daniel Ho, Eric Liang, Ion Stoica, Pieter Abbeel, Xi Chen, Population Based
Augmentation: Efficient Learning of Augmentation Policy Schedules, ICML, 2019
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, Quoc V. Le,
AutoAugment: Learning Augmentation Policies from Data, CVPR, 2019
Sample Reweighting
• Give different samples different weights
  • Larger weights, to focus on tough examples?
  • Smaller weights, because the labels are noisy?
The sample-weighting strategy can be the learnable component 𝜙:
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng,
Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting, NeurIPS, 2019
Mengye Ren, Wenyuan Zeng, Bin Yang, Raquel Urtasun, Learning to Reweight Examples
for Robust Deep Learning, ICML, 2018
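A minimal sketch of the idea. The two hard-threshold weighting functions below are hand-written stand-ins for the learned weighting networks in the papers above, and the loss values are made up:

```python
# Sample reweighting: scale each example's loss by a weight that depends
# on the example (here, on its current loss value).

def weighted_loss(losses, weight_fn):
    """Weighted sum of per-example losses; weight_fn plays the role of phi."""
    return sum(weight_fn(l) * l for l in losses)

# Two opposite hand-written strategies (a learned network would replace these):
downweight_noisy = lambda l: 0.1 if l > 5.0 else 1.0   # large loss => noisy label?
focus_on_hard = lambda l: 2.0 if l > 5.0 else 1.0      # large loss => tough example?

losses = [0.5, 0.7, 9.0]   # the 9.0 example is either "tough" or "mislabeled"
noisy_view = weighted_loss(losses, downweight_noisy)   # 9.0 contributes only 0.9
hard_view = weighted_loss(losses, focus_on_hard)       # 9.0 contributes 18.0
```

Which strategy is right depends on the data, which is exactly why the cited papers learn the weighting function instead of hand-picking it.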
Beyond Gradient Descent
Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan
Pascanu, Simon Osindero, Raia Hadsell, Meta-Learning with Latent Embedding
Optimization, ICLR, 2019
Replace the whole gradient-descent procedure (function 𝐹) with a network
whose parameters are 𝜙: invent a new learning algorithm that is no longer
gradient descent.
Until now, learning and classification are separate: a learning algorithm
(function 𝐹) maps training data to 𝜽∗, and the classifier then labels the
testing data as “cat”. How about a single function 𝐹 that takes both the
training data and the testing data and directly outputs “cat”?
Learning to compare (metric-based approach):
https://youtu.be/yyKaACh_j3M
https://youtu.be/scK2EIT7klw
https://youtu.be/semSxPP2Yzg
https://youtu.be/ePimv_k-H24
Applications
Few-shot Image Classification
• Each class only has a few images. For example, in a 3-ways 2-shot task
  there are three classes with two images each, and the question is which
  class a new image belongs to.
• N-ways K-shot classification: in each task, there are N classes, each
  with K examples.
• In meta learning, you need to prepare many N-ways K-shot tasks as
  training and testing tasks.
Omniglot
https://github.com/brendenlake/omniglot
• 1623 characters, each with 20 examples
Demo: https://openai.com/blog/reptile/
Each character represents a class; a 20-ways 1-shot task consists of a
training set (support set) and a testing set (query set).
• Split your characters into training and testing characters
• Sample N training characters and K examples from each sampled character
  → one training task
• Sample N testing characters and K examples from each sampled character
  → one testing task
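The sampling procedure above can be sketched as follows; the class names, dataset sizes, and function names are all illustrative assumptions, not the real Omniglot loader:

```python
# Sample an N-way K-shot task: split classes (characters) into meta-train
# and meta-test, then draw N classes and K + Q examples per class.
import random

def sample_task(classes, examples_per_class, n_way, k_shot, q_query, rng):
    chosen = rng.sample(classes, n_way)                 # N classes for this task
    support, query = [], []
    for label, cls in enumerate(chosen):
        ex = rng.sample(examples_per_class[cls], k_shot + q_query)
        support += [(x, label) for x in ex[:k_shot]]    # support (train) set
        query += [(x, label) for x in ex[k_shot:]]      # query (test) set
    return support, query

rng = random.Random(0)
# Toy "Omniglot": 10 classes with 20 examples each (strings stand in for images).
data = {f"char{c}": [f"char{c}_img{i}" for i in range(20)] for c in range(10)}
train_classes, test_classes = list(data)[:7], list(data)[7:]  # split BY CLASS
support, query = sample_task(train_classes, data,
                             n_way=5, k_shot=1, q_query=2, rng=rng)
```

Note that the split is by class, not by example: the characters used to build testing tasks never appear in any training task.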
http://speech.ee.ntu.edu.tw/~tlkagk/meta_learning_table.pdf