Data Augmentation for Meta-Learning

Renkun Ni 1 Micah Goldblum 1 Amr Sharaf 2 Kezhi Kong 1 Tom Goldstein 1

1 Department of Computer Science, University of Maryland, College Park  2 Microsoft. Correspondence to: Renkun Ni <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, practitioners use sophisticated data augmentation schemes to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample support data, query data, and tasks on each training step. In this complex sampling scenario, data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes/tasks. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.

1. Introduction

Data augmentation has become an essential part of the training pipeline for image classifiers and similar systems, as it offers a simple and efficient way to significantly improve performance (Cubuk et al., 2018; Zhang et al., 2017). In contrast, little work exists on data augmentation for meta-learning. Existing frameworks for few-shot image classification use only horizontal flips, random crops, and color jitter to augment images in a way that parallels augmentation for conventional training (Bertinetto et al., 2018; Lee et al., 2019). Meanwhile, meta-learning methods have received increasing attention as they have reached the cutting edge of few-shot performance. While new meta-learning algorithms emerge at a rapid rate, we show that, like image classifiers, meta-learners can achieve significant performance boosts through carefully chosen data augmentation strategies that are injected into various stages of the meta-learning pipeline.

Meta-learning frameworks use data for multiple purposes during each gradient update, which creates the possibility for a diverse range of data augmentations that are not possible within the standard training pipeline. At the same time, it is still unclear how different categories of data within the training pipeline impact meta-learning performance. We explore these possibilities and discover combinations of augmentation types that improve performance over existing methods. Our contributions can be summarized as follows:

• First, we break down the meta-learning pipeline and find that each component contributes differently to meta-learning performance: meta-learners are very sensitive to the amount of query data and the number of tasks and less sensitive to the amount of support data.

• Based on these findings, we uncover four modes of augmentation for meta-learning that differ in where in the training pipeline they are applied: support augmentation, query augmentation, task augmentation, and shot augmentation.

• We test these four modes using a pool of image augmentations, and we confirm that query augmentation is critical, while support augmentation often does not provide performance benefits and may even degrade accuracy in some cases.

• Finally, we combine augmentations and implement a MaxUp strategy, which we call Meta-MaxUp, to maximize performance. We achieve significant performance boosts for popular meta-learners on few-shot benchmarks such as mini-ImageNet, CIFAR-FS, and Meta-Dataset.

2. Background and Related Work

2.1. The Meta-Learning Framework

Meta-learning algorithms aim to learn a network that can easily adapt to new tasks with limited data and generalize to unseen examples. In order to achieve this, they simulate the adaptation and evaluation procedure during meta-training. To simulate an N-way classification task, T_i, we sample support data T_i^s and query data T_i^q, so that T_i = {T_i^s, T_i^q}.

As we will detail in the following paragraph, support will be used to simulate few-shot training data, while query will be used to simulate unseen testing data. Note that shot denotes the number of training samples per class available for fine-tuning on a given task during the testing phase.

Adopting common terminology from the literature, the archetypal meta-learning algorithm contains an inner loop and an outer loop in each parameter update of the training procedure. In the inner loop, a model is first fine-tuned or adapted on support data T_i^s. Then, in the outer loop, the updated model is evaluated on query data T_i^q, and the query loss is minimized with respect to the model's pre-fine-tuning parameters. This loss minimization step may require computing the gradient through the fine-tuning procedure. Existing meta-learning algorithms apply various methods for fine-tuning on support data during the inner loop. Some algorithms, such as MAML and Reptile (Finn et al., 2017; Nichol et al., 2018), update all the parameters in the network using gradient descent during fine-tuning on support data. Other algorithms, such as MetaOptNet and R2-D2 (Lee et al., 2019; Bertinetto et al., 2018), only update the parameters of the linear classifier layer during fine-tuning while keeping the feature extraction layers frozen. These methods benefit from the simplicity and convexity of the inner loop optimization problem. Similarly, metric learning approaches, such as (Snell et al., 2017; Kye et al., 2020), freeze the feature extraction layers as well and create class centroids from the support data during the inner loop. These methods have low-cost training iterations and can be applied to deeper architectures to achieve better performance. In this work, we mainly focus on the latter algorithms due to their stronger performance. Further details of the algorithms used in our experiments can be found in Section 4.1.
Section 4.1. Since data augmentation techniques aim to increase the
amount of training samples, learning algorithms that are sen-
2.2. Preventing Overfitting in Meta-Learning sitive to the amount of training data may benefit more from
these techniques. In this section, before we introduce data
Meta-learners are known to be particularly vulnerable to augmentations, we investigate how sensitive meta-learning
overfitting (Rajendran et al., 2020). One work, MetaMix, algorithms are to the amount of support data, query data,
proposes averaging support and query features to prevent and tasks. Typically, support and query data are sampled
the model from memorizing the query data and ignoring from the same pool (the entire training set).
support (Yao et al., 2020). Recently, another work adds
random noise to the label space to make the model rely To examine the impact of dataset diversity on various stages
on support data (Rajendran et al., 2020). In the context of meta-learning, we perform an ablation where we limit the
of few-shot classification, random shuffling labels within diversity of each stage. We first reduce the pool of support
tasks alleviates this kind of overfitting and is commonplace data to a fixed subset of only five independent samples per
in meta-learning algorithms (Yin et al., 2019; Rajendran class while sampling query data from the entire training set.
et al., 2020). However, as shown in Figure 1, overfitting That is, whenever a support image is sample from class c,
to training tasks remains a problem. One recent work has it is only sampled from the five-image subset associated
developed a data augmentation method to overcome this with that class instead of from all training data in that class.
problem (Liu et al., 2020). This method simply rotates all Interestingly, we find that test accuracy remains almost the
images in a class by a large degree and considers this new same as baseline performance (see Table 1). In fact, if we
rotated class distinct from its parent class. This effectively replace those five support images per class with fixed ran-
increases the number of possible few-shot tasks that can be dom noise images, we still only observe a small degradation
sampled during training. in performance. We then instead shrink the pool of query

We then instead shrink the pool of query data (but not support), and we see a much larger decrease in test accuracy. These experiments suggest that meta-learning is fairly insensitive to the amount and quality of support but not query data. This observation agrees with our following finding that augmenting query data is far more beneficial than augmenting support.

Since we also consider task-level augmentation, we now examine how sensitive meta-learning is to a decrease in task diversity. As CIFAR-FS contains 64 training classes, there are (64 choose 5) = 7,624,512 possible 5-way classification problems that can be sampled during each iteration of meta-learning. We reduce the number of tasks by randomly batching classes into just 13 distinct 5-way classification tasks before training, and we only train on these 13 tasks. We do this in such a way that all classes, and therefore all training data, are used during training. We observe that this process noticeably degrades test accuracy, and we conclude that there may be room to improve performance by augmenting the number of tasks (see Table 1). To verify that this impact of dataset diversity generalizes, we run additional experiments on mini-ImageNet and with other backbones. The results are shown in Appendix A, and these experiments support the aforementioned findings as well.
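The support-pool restriction in this ablation can be sketched in a few lines. The snippet below is one possible reconstruction, not the authors' code; `build_fixed_support_pool` and `sample_restricted_episode` are names introduced here only for illustration.

```python
import random
from collections import defaultdict

def build_fixed_support_pool(indices_by_class, per_class=5, seed=0):
    """Freeze a small per-class subset that support examples must come from."""
    rng = random.Random(seed)
    return {c: rng.sample(idx, per_class) for c, idx in indices_by_class.items()}

def sample_restricted_episode(indices_by_class, fixed_support_pool,
                              n_way=5, k_shot=5, q_queries=15):
    """Support indices come only from the frozen pool; query indices are drawn
    from the full training set for each class (minus the frozen pool)."""
    classes = random.sample(list(indices_by_class), n_way)
    support, query = [], []
    for c in classes:
        support += random.sample(fixed_support_pool[c], k_shot)
        remaining = [i for i in indices_by_class[c] if i not in fixed_support_pool[c]]
        query += random.sample(remaining, q_queries)
    return support, query, classes

# Toy usage: 64 classes with 600 image indices each, as in CIFAR-FS.
indices_by_class = defaultdict(list)
for img_idx in range(64 * 600):
    indices_by_class[img_idx % 64].append(img_idx)

pool = build_fixed_support_pool(indices_by_class)
support_idx, query_idx, classes = sample_restricted_episode(indices_by_class, pool)
print(len(support_idx), len(query_idx), classes)
```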
Table 1. Few-shot classification accuracy (%) using R2-D2 and a ResNet-12 backbone for various data size manipulations on CIFAR-FS. "Support", "Query", and "Task" columns denote the number of samples per class for support and query data and the number of total tasks available for sampling. The first row contains baseline performance. Confidence intervals have radius equal to one standard error.

Support | Query | Task | 1-shot | 5-shot
600 | 600 | full | 71.73 ± 0.37 | 84.39 ± 0.25
5 | 600 | full | 70.97 ± 0.36 | 84.51 ± 0.24
5 (random) | 600 | full | 58.15 ± 0.36 | 76.26 ± 0.27
600 | 5 | full | 60.25 ± 0.37 | 77.05 ± 0.28
600 | 600 | 13 | 68.24 ± 0.38 | 81.77 ± 0.26
3.2. Data Augmentation Modes

Motivated by the observation that meta-learning is more sensitive to the amount of query data and tasks than support, we delineate four modes of data augmentation for meta-learning, which may be employed individually or combined.

Support augmentation: Data augmentation may be applied to support data in the inner loop of fine-tuning. This strategy enlarges the pool of fine-tuning data.

Query augmentation: Data augmentation may alternatively be applied to query data. This strategy enlarges the pool of evaluation data to be sampled during training.

Task augmentation: We can increase the number of possible tasks by uniformly augmenting whole classes to add new classes with which to train. For example, a vertical flip applied to all car images yields a new upside-down car class which may be sampled during training.

Shot augmentation: At test time, we can artificially amplify the shot by adding additional augmented copies of each image. Shot augmentation can also be used during training by adding copies of each support image via augmentation. Shot augmentation during training may be needed to prepare a network for the use of test-time shot augmentation.
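To illustrate where each of the four modes acts within an episode, the sketch below applies one assumed transform per mode: a random crop for support, a CutMix-style mix for query, a class-wide vertical flip for task augmentation, and horizontal-flip copies for shot augmentation. The specific transforms and helper names are examples only, not the augmentation pool studied in Section 4.

```python
import torch
import torchvision.transforms as T

support_tf = T.RandomCrop(32, padding=4)        # support augmentation (inner-loop data)
vertical_flip = T.RandomVerticalFlip(p=1.0)     # applied class-wide for task augmentation
horizontal_flip = T.RandomHorizontalFlip(p=1.0) # used to create shot-augmentation copies

def query_cutmix(qry_x, qry_y, alpha=1.0):
    """Query augmentation: paste a random box from a shuffled query batch.
    Returns mixed images, both label sets, and the mixing weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(qry_x.size(0))
    h, w = qry_x.shape[-2:]
    ch, cw = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    y0 = torch.randint(0, h - ch + 1, (1,)).item()
    x0 = torch.randint(0, w - cw + 1, (1,)).item()
    mixed = qry_x.clone()
    mixed[:, :, y0:y0 + ch, x0:x0 + cw] = qry_x[perm, :, y0:y0 + ch, x0:x0 + cw]
    lam = 1 - (ch * cw) / (h * w)               # lambda matching the actual box size
    return mixed, qry_y, qry_y[perm], lam

def augment_episode(sup_x, sup_y, qry_x, qry_y, n_way):
    # Support augmentation: one random crop applied to the support batch.
    sup_x = support_tf(sup_x)
    # Task augmentation: flip every image of the episode's classes the same way
    # and treat the flipped classes as new classes (labels shifted by n_way).
    task_sup_x, task_sup_y = vertical_flip(sup_x), sup_y + n_way
    task_qry_x, task_qry_y = vertical_flip(qry_x), qry_y + n_way
    # Query augmentation: CutMix on the query batch only.
    qry_mix = query_cutmix(qry_x, qry_y)
    # Shot augmentation: append a flipped copy of each support image (shot doubles).
    shot_x = torch.cat([sup_x, horizontal_flip(sup_x)])
    shot_y = torch.cat([sup_y, sup_y])
    return sup_x, qry_mix, (task_sup_x, task_sup_y, task_qry_x, task_qry_y), (shot_x, shot_y)
```

Note that the task-level flip is deliberately applied uniformly to the whole batch, since a new class is only well defined if every one of its images receives the same transform.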

Existing meta-learning algorithms for few-shot image classification typically apply standard augmentations (horizontal flips, random crops, and color jitter) to all images that come from the data loader without considering the purpose of each image. As a result, the same augmentation occurs on both support and query images (Gidaris & Komodakis, 2018; Qiao et al., 2018). In Section 4, we test the four modes of data augmentation enumerated above in isolation across a large array of specific augmentations. We find that query augmentation is far more critical than support augmentation for increasing performance. In fact, support augmentation often hurts performance. Additionally, we find that task augmentation, when combined with query augmentation, can offer further boosts in performance when compared with existing frameworks.

3.3. Data Augmentation Techniques

For each of the data augmentation modes described above, we try a variety of specific data augmentation techniques. Some techniques are only applicable to the support, query, and shot modes or solely to the task mode. We use an array of standard augmentation techniques as well as CutMix (Yun et al., 2019), MixUp (Zhang et al., 2017), and Self-Mix (Seo et al., 2020). In the context of the task augmentation mode, we apply these the same way to every image in a class in order to augment the number of classes. For example, we use MixUp to create a half-dog-half-truck class where every image is the average of a dog image and a truck image. We also try combining multiple classes into one class as a task augmentation mode.

In general, techniques that greatly change the image distribution (e.g., a vertical flip, which does not naturally appear in the dataset) are better suited for task augmentation, while techniques that preserve the image distribution (e.g., random crops, which produce images that are presumably within the support of the image distribution) are typically better suited for the support, query, and shot augmentation modes. The baseline models we compare to use horizontal flip, random crop, and color jitter augmentation techniques at both the support and query levels since this combination is prevalent in the literature. More details on our pool of augmentation techniques can be found in Appendix B.

3.4. Meta-MaxUp Augmentation for Meta-Learning

Recent work proposes MaxUp augmentation to alleviate overfitting during the training of classifiers (Gong et al., 2020). This strategy applies many augmentations to each image and chooses the augmented image which yields the highest loss. MaxUp is conceptually similar to adversarial training (Madry et al., 2019). Like adversarial training, MaxUp involves solving a saddle-point problem in which loss is minimized with respect to parameters while being maximized with respect to the input. In the standard image classification setting, MaxUp, together with CutMix, improves generalization and achieves state-of-the-art performance on ImageNet. Here, we extend MaxUp to the setting of meta-learning. Before training, we select a pool, S, of data augmentations from the four modes as well as their combinations. For example, S may contain horizontal flip shot augmentation, query CutMix, and the combination of both. During each iteration of training, we first sample a batch of tasks, each containing support and query data, as is typical in the meta-learning framework. For each element in the batch, we randomly select m augmentations from the set S, and we apply these to the task, generating m augmented tasks with augmented support and query data. Then, for each element of the batch of tasks originally sampled, we choose the augmented task that maximizes loss, and we perform a parameter update step to minimize training loss. Formally, we solve the minimax optimization problem

    min_θ E_T [ max_{M ∈ S} L(F_θ′, M(T^q)) ],

where θ′ = A(θ, M(T^s)), A denotes fine-tuning, F is the base model with parameters θ, L is the loss function used in the outer loop of training, and T is a task with support and query data T^s and T^q, respectively. Algorithm 1 contains a more thorough description of this pipeline in practice (adapted from the standard meta-learning algorithm in Goldblum et al. (2019)).

Algorithm 1: Meta-MaxUp
Require: Base model F_θ, fine-tuning algorithm A, learning rate γ, set of augmentations S, and distribution over tasks p(T).
  Initialize θ, the weights of F.
  while not done do
    Sample a batch of tasks {T_i}_{i=1}^n, where T_i ∼ p(T) and T_i = (T_i^s, T_i^q).
    for i = 1, ..., n do
      Sample m augmentations {M_j}_{j=1}^m from S.
      Compute k = argmax_j L(F_{θ_j}, M_j(T_i^q)), where θ_j = A(θ, M_j(T_i^s)).
      Compute the gradient g_i = ∇_θ L(F_{θ_k}, M_k(T_i^q)).
    end for
    Update the base model parameters: θ ← θ − (γ/n) Σ_i g_i.
  end while
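A compact PyTorch rendering of Algorithm 1 might look like the sketch below, which reuses the `sample_episode` and `episode_loss` helpers from the Section 2.1 sketch. The two-element augmentation pool is an assumed toy stand-in for S; both entries are deterministic so that re-applying the selected augmentation reproduces the worst-case episode exactly.

```python
import random
import torch
from torchvision.transforms.functional import hflip

# A toy augmentation pool S: each entry maps an episode to an augmented episode.
def identity(sup_x, sup_y, qry_x, qry_y):
    return sup_x, sup_y, qry_x, qry_y

def query_flip(sup_x, sup_y, qry_x, qry_y):       # a query-mode augmentation
    return sup_x, sup_y, hflip(qry_x), qry_y

augmentation_pool = [identity, query_flip]

def meta_maxup_step(backbone, opt, images, labels, n_tasks=4, m=2, n_way=5):
    """One Meta-MaxUp update (a sketch of Algorithm 1): for each sampled task,
    evaluate m augmentations of the episode and backpropagate only through the
    augmentation with the highest query loss."""
    opt.zero_grad()
    total = 0.0
    for _ in range(n_tasks):
        episode = sample_episode(images, labels, n_way=n_way)
        candidates = random.sample(augmentation_pool, min(m, len(augmentation_pool)))
        # Select the worst-case augmentation without building autograd graphs.
        with torch.no_grad():
            worst = max(candidates, key=lambda aug:
                        episode_loss(backbone, *aug(*episode), n_way=n_way).item())
        # Recompute the selected loss with gradients and accumulate the average.
        loss = episode_loss(backbone, *worst(*episode), n_way=n_way)
        (loss / n_tasks).backward()
        total += loss.item()
    opt.step()
    return total / n_tasks
```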
4. Experiments

In this section, we empirically demonstrate the following:

1. Augmentations applied in the four distinct modes behave differently. In particular, query and task augmentation are far more important than support augmentation. (Section 4.2)

2. Meta-specific data augmentation strategies can improve performance over the generic strategies commonly used for meta-learning. (Section 4.3)

3. We further boost performance by combining augmentations with Meta-MaxUp. (Section 4.4)

4. Our proposed augmentation Meta-MaxUp greatly improves performance on cross-domain benchmarks as well. (Section 4.7)

4.1. Experimental Setup

We conduct experiments on four meta-learning algorithms: ProtoNet (Snell et al., 2017), R2-D2 (Bertinetto et al., 2018), MetaOptNet (Lee et al., 2019), and MCT (Kye et al., 2020). ProtoNet is a metric-learning method that uses a prototype learning head, which classifies samples by extracting a feature vector and then performing a nearest-neighbor search for the closest class prototype. R2-D2 and MetaOptNet instead use differentiable solvers with a ridge regression and SVM head, respectively. These methods extract feature vectors and then apply a standard linear classifier to assign class labels. MCT improves upon ProtoNet by meta-learning confidence scores. We experiment with all of these different classifier head options, all using the ResNet-12 backbone proposed by Oreshkin et al. (2018) as well as the four-layer convolutional architectures proposed by Snell et al. (2017) and Bertinetto et al. (2018).
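As an illustration of the "differentiable solver" heads mentioned above, the sketch below implements a ridge-regression head of the kind R2-D2 uses: the inner loop is a closed-form least-squares fit on the support features, and the query logits come from the resulting linear classifier. The regularization constant and feature sizes are placeholder values, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def ridge_head_logits(sup_feat, sup_y, qry_feat, n_way, lam=1.0):
    """Closed-form ridge regression on support features (the inner loop).
    sup_feat: (n_support, d); qry_feat: (n_query, d); sup_y: (n_support,) int labels."""
    Y = F.one_hot(sup_y, n_way).float()               # (n_support, n_way) targets
    Z = sup_feat                                      # (n_support, d) features
    # Woodbury form W = Z^T (Z Z^T + lam * I)^{-1} Y, cheap when n_support < d.
    gram = Z @ Z.t() + lam * torch.eye(Z.size(0))
    W = Z.t() @ torch.linalg.solve(gram, Y)           # (d, n_way) linear classifier
    return qry_feat @ W                               # query logits

# Toy usage: 5-way, 5-shot support and 75 query features of dimension 64.
sup_feat = torch.randn(25, 64)
sup_y = torch.arange(5).repeat_interleave(5)
qry_feat = torch.randn(75, 64)
logits = ridge_head_logits(sup_feat, sup_y, qry_feat, n_way=5)
print(logits.shape)   # torch.Size([75, 5])
```

Because the inner-loop solution is an explicit linear-algebra expression, gradients flow through it to the backbone during the outer-loop update, which is what makes such heads attractive for meta-learning.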
We perform our experiments on the aforementioned benchmark datasets, mini-ImageNet, CIFAR-FS, and Meta-Dataset. A description of training hyperparameters and computational complexity can be found in Appendix C. We report confidence intervals with a radius of one standard error.

Few-shot learning may be performed in either the inductive or transductive setting. Inductive learning is a standard method in which each test image is evaluated separately and independently.

In contrast, transduction is a mode of inference in which the few-shot learner has access to all unlabeled testing data at once and therefore has the ability to perform semi-supervised learning by training on the unlabelled data. For fair comparison, we only compare inductive methods to other inductive methods. A PyTorch implementation of our data augmentation methods for meta-learning can be found at: https://github.com/RenkunNi/MetaAug

4.2. An Empirical Comparison of Augmentation Modes

We empirically evaluate the performance of all four different augmentation modes identified in Section 3.2 on the CIFAR-FS dataset using an R2-D2 base-learner paired with both a 4-layer convolutional network backbone (as used in the original work (Bertinetto et al., 2018)) and a ResNet-12 backbone. We report the results of the most effective augmentations for each mode on the ResNet-12 backbone in Table 2. Appendix D contains an extensive table with various augmentations and both backbones.

Table 2 demonstrates that each mode of augmentation individually can improve performance. Augmentation applied to query data is consistently more effective than the other augmentation modes. In particular, simply applying CutMix to query samples improves accuracy by as much as 3% on both backbones. In contrast, most augmentations on support data actually damage performance. The overarching conclusion of these experiments is that the four modes of data augmentation for meta-learning behave differently. Existing meta-learning methods, which apply the same augmentations to query and support data without using task and shot augmentation, may be achieving suboptimal performance.

Table 2. Few-shot classification accuracy (%) using R2-D2 and a ResNet-12 backbone on the CIFAR-FS dataset with the most effective data augmentations for each mode shown. Confidence intervals have radii equal to one standard error. Best performance in each category is bolded. Query CutMix is consistently the most effective single augmentation for meta-learning.

Method | Mode | 1-shot | 5-shot
Baseline | - | 71.95 ± 0.37 | 84.56 ± 0.25
CutMix | Support | 72.79 ± 0.37 | 84.70 ± 0.25
Self-Mix | Support | 71.96 ± 0.36 | 84.84 ± 0.25
CutMix | Query | 75.97 ± 0.34 | 87.28 ± 0.23
Self-Mix | Query | 73.59 ± 0.35 | 86.14 ± 0.24
Large Rotation | Task | 73.79 ± 0.36 | 85.81 ± 0.24
MixUp | Task | 72.05 ± 0.37 | 85.27 ± 0.25
Random Crop | Shot | 70.56 ± 0.37 | 83.87 ± 0.25
Horizontal Flip | Shot | 73.25 ± 0.36 | 85.06 ± 0.25

4.3. Combining Augmentations

After studying each mode of data augmentation individually, we combine augmentations in order to find out how augmentations interact with each other. We build on top of query CutMix since this augmentation was the most effective in the previous section. We combine query CutMix with other effective augmentations from Table 2, and we conduct experiments on the same backbones and dataset. Results on the ResNet-12 backbone are reported in Table 3, and a full table with additional results can be found in Appendix E. Interestingly, when we use CutMix on both support and query images, we observe worse performance than simply using CutMix on query data alone. Again, this demonstrates that meta-learning demands a careful and meta-specific data augmentation strategy. In order to further boost performance, we will need an intelligent method for combining various augmentations. We propose Meta-MaxUp as this method.

Table 3. Few-shot classification accuracy (%) using R2-D2 and a ResNet-12 backbone on the CIFAR-FS dataset with combinations of augmentations and query CutMix. "S", "Q", and "T" denote the "Support", "Query", and "Task" modes, respectively. While adding augmentations can help, it can also hurt, so additional augmentations must be chosen carefully.

Mode | 1-shot | 5-shot
CutMix | 75.97 ± 0.34 | 87.28 ± 0.23
+ CutMix (S) | 75.00 ± 0.37 | 85.37 ± 0.25
+ Random Erase (S) | 75.84 ± 0.34 | 87.19 ± 0.24
+ Random Erase (Q) | 75.08 ± 0.35 | 87.14 ± 0.23
+ Self-Mix (S) | 76.27 ± 0.34 | 87.52 ± 0.24
+ Self-Mix (Q) | 76.04 ± 0.34 | 87.45 ± 0.24
+ MixUp (T) | 75.97 ± 0.34 | 86.66 ± 0.24
+ Rotation (T) | 75.74 ± 0.34 | 87.68 ± 0.24
+ Horizontal Flip (Shot) | 76.23 ± 0.34 | 87.36 ± 0.24

4.4. Meta-MaxUp Further Improves Performance

In this section, we evaluate our proposed Meta-MaxUp strategy in the same experimental setting as above for various values of m and different data augmentation pool sizes. Table 4 contains the results, and a detailed description of the augmentation pools as well as the full results can be found in Appendix F. Rows beginning with "CutMix" denote experiments in which the pool of augmentations simply includes many CutMix samples. "Single" denotes experiments in which each augmentation in S is of a single type, while "Medium" and "Large" denote experiments in which each element of S is a combination of augmentations, for example CutMix+rotation. Combinations greatly expand the number of augmentations in the pool. Rows with m = 1 denote experiments where we do not maximize loss in the inner loop and thus simply apply randomly sampled data augmentation for each task.

As we increase m and include a large number of augmentations in the pool, we observe performance boosts as high as 4% over the baseline, which uses horizontal flip, random crop, and color jitter data augmentations from the original work corresponding to the R2-D2 meta-learner (Bertinetto et al., 2018).

Table 4. Few-shot classification accuracy (%) using R2-D2 and a ResNet-12 backbone on the CIFAR-FS dataset for Meta-MaxUp over different sizes of augmentation pools and numbers of samples. As m and the pool size increase, so does performance. Meta-MaxUp is able to pick effective augmentations from a large pool.

Pool | m | 1-shot | 5-shot
Baseline | - | 71.95 ± 0.37 | 84.56 ± 0.25
CutMix | 1 | 75.97 ± 0.34 | 87.28 ± 0.23
Single | 1 | 75.71 ± 0.35 | 87.44 ± 0.43
Medium | 1 | 75.60 ± 0.34 | 87.35 ± 0.23
Large | 1 | 75.44 ± 0.34 | 87.47 ± 0.23
CutMix | 2 | 74.93 ± 0.36 | 87.14 ± 0.24
Single | 2 | 75.81 ± 0.34 | 87.33 ± 0.23
Medium | 2 | 76.49 ± 0.33 | 88.20 ± 0.22
Large | 2 | 76.59 ± 0.34 | 88.11 ± 0.23
CutMix | 4 | 75.08 ± 0.23 | 87.60 ± 0.24
Single | 4 | 76.82 ± 0.24 | 88.14 ± 0.23
Medium | 4 | 76.30 ± 0.24 | 88.29 ± 0.22
Large | 4 | 76.99 ± 0.24 | 88.35 ± 0.22

We explore the training benefits of these meta-specific training schemes by examining saturation during training. To this end, we plot the training and validation accuracy over time for R2-D2 meta-learners with ResNet-12 backbones using baseline augmentations, query Self-Mix, and Meta-MaxUp with a medium-sized pool and m = 4. See Figure 1 for training and validation accuracy curves. With only baseline augmentations, validation accuracy stops increasing immediately after the first learning rate decay. This suggests that baseline augmentations do not prevent overfitting during meta-training. In contrast, we observe that models trained with Meta-MaxUp do not quickly overfit and continue improving validation performance for a greater number of epochs. Meta-MaxUp visibly reduces the generalization gap.

Figure 1. Training and validation accuracy for the R2-D2 meta-learner with a ResNet-12 backbone on the CIFAR-FS dataset. (Left) Baseline model. (Middle) Query Self-Mix. (Right) Meta-MaxUp. Better data augmentation strategies, such as MaxUp, narrow the generalization gap and prevent overfitting.

Figure 2. Performance with shot augmentation using MetaOptNet trained with the proposed Meta-MaxUp (accuracy with "No Shot Aug" vs. "Test on Shot Aug"). (Top) 1-shot and 5-shot on CIFAR-FS. (Bottom) 1-shot and 5-shot on mini-ImageNet.

Table 5. Few-shot classification accuracy (%) on CIFAR-FS and mini-ImageNet. “+ DA” denotes training with CutMix (Q) + Rotation
(T), and “+ MM” denotes training with Meta-MaxUp. “CNN-4” denotes a 4-layer convolutional network with 96, 192, 384, and 512
filters in each layer (Bertinetto et al., 2018). “64-64-64-64” denotes the 4-layer CNN backbone from Snell et al. (2017).

CIFAR-FS mini-ImageNet
Method Backbone 1-shot 5-shot 1-shot 5-shot
R2-D2 CNN-4 67.56 ± 0.35 82.39 ± 0.26 56.15 ± 0.31 72.46 ± 0.26
+ DA CNN-4 70.54 ± 0.33 84.69 ± 0.24 57.60 ± 0.32 74.69 ± 0.25
+ MM CNN-4 71.10 ± 0.34 85.50 ± 0.24 58.18 ± 0.32 75.35 ± 0.25
R2-D2 ResNet-12 71.95 ± 0.37 84.56 ± 0.25 60.46 ± 0.32 76.88 ± 0.24
+ DA ResNet-12 76.17 ± 0.34 87.74 ± 0.24 65.54 ± 0.32 81.52 ± 0.23
+ MM ResNet-12 76.65 ± 0.33 88.57 ± 0.24 65.15 ± 0.32 81.76 ± 0.24
ProtoNet 64-64-64-64 60.91 ± 0.35 79.73 ± 0.27 47.97 ± 0.32 70.13 ± 0.27
+ DA 64-64-64-64 62.21 ± 0.36 80.70 ± 0.27 50.38 ± 0.32 71.44 ± 0.26
+ MM 64-64-64-64 63.01 ± 0.36 80.85 ± 0.25 50.06 ± 0.32 71.13 ± 0.26
ProtoNet ResNet-12 70.21 ± 0.36 84.26 ± 0.25 57.34 ± 0.34 75.81 ± 0.25
+ DA ResNet-12 74.30 ± 0.36 86.24 ± 0.24 60.82 ± 0.34 78.23 ± 0.25
+ MM ResNet-12 76.05 ± 0.34 87.84 ± 0.23 62.81 ± 0.34 79.38 ± 0.24
MetaOptNet ResNet-12 70.99 ± 0.37 84.00 ± 0.25 60.01 ± 0.32 77.42 ± 0.23
+ DA ResNet-12 74.56 ± 0.34 87.61 ± 0.23 64.94 ± 0.33 82.10 ± 0.23
+ MM ResNet-12 75.67 ± 0.34 88.37 ± 0.23 65.02 ± 0.32 82.42 ± 0.23
MCT ResNet-12 75.80 ± 0.33 89.10 ± 0.42 64.84 ± 0.33 81.45 ± 0.23
+ MM ResNet-12 76.00 ± 0.33 89.54 ± 0.33 66.37 ± 0.32 83.11 ± 0.22

4.5. Shot Augmentation for Pre-Trained Models

In the typical meta-learning framework, data augmentations are used during meta-training but not during test time. On the other hand, in some transfer learning work, data augmentations such as horizontal flips, random crops, and color jitter are used during fine-tuning at test time (Chen et al., 2019a). These techniques enable the network to see more data samples during few-shot testing, leading to enhanced performance.

We propose shot augmentation (see Section 3) to enlarge the number of few-shot samples during testing, and we also propose a variant in which we additionally train using the same augmentations on support data in order to prepare the meta-learner for this test-time scenario. Figure 2 shows the effect of shot augmentation (using only horizontal flips) on performance for MetaOptNet with a ResNet-12 backbone trained with Meta-MaxUp. Shot augmentation consistently improves results across datasets, especially on 1-shot classification (∼ 2%). To be clear, in this figure, we are not using shot augmentation during the training stage. Rather, we are using conventional low-shot training and then deploying our models with shot augmentation at test time. These post-training performance gains can be achieved by directly applying shot augmentation to pre-trained/existing models during testing. For additional experiments, see Appendix G.
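Because test-time shot augmentation only enlarges the support set, it can be bolted onto a frozen, pre-trained model. The sketch below shows the idea for a centroid-based head, mirroring the horizontal-flip setting used for Figure 2; the helper name and the centroid head are assumptions for illustration, not the exact evaluation code.

```python
import torch
from torchvision.transforms.functional import hflip

@torch.no_grad()
def predict_with_shot_augmentation(backbone, sup_x, sup_y, qry_x, n_way):
    """Few-shot prediction with test-time shot augmentation: each support image
    contributes itself plus a horizontally flipped copy to its class centroid."""
    backbone.eval()
    aug_sup_x = torch.cat([sup_x, hflip(sup_x)])     # the shot is doubled
    aug_sup_y = torch.cat([sup_y, sup_y])
    sup_f, qry_f = backbone(aug_sup_x), backbone(qry_x)
    centroids = torch.stack([sup_f[aug_sup_y == c].mean(0) for c in range(n_way)])
    return (-torch.cdist(qry_f, centroids)).argmax(dim=1)   # predicted class ids

# Toy usage with the episode and backbone from the Section 2.1 sketch:
# preds = predict_with_shot_augmentation(backbone, sup_x, sup_y, qry_x, n_way=5)
```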
4.6. Improving Existing Meta-Learners with Better Data Augmentation

In this section, we improve the performance of four different popular meta-learning methods including ProtoNet (Snell et al., 2017), R2-D2 (Bertinetto et al., 2018), MetaOptNet (Lee et al., 2019), and MCT (Kye et al., 2020). We compare their baseline performance to query CutMix with task-level rotation as well as Meta-MaxUp data augmentation strategies on both the CIFAR-FS and mini-ImageNet datasets. See Table 5 for the results of these experiments. In all cases, we are able to improve the performance of existing methods, sometimes by over 5%. Even without Meta-MaxUp, we improve performance over the baseline by a large margin. The superiority of meta-learners that use these augmentation strategies suggests that data augmentation is critical for these popular algorithms and has largely been overlooked.

In addition, we compare our method to augmentation by Large Rotations at the task level (the only competing work to our knowledge) in Table 6. Note that using Large Rotations to create new classes is referred to as "Task Augmentation" in (Liu et al., 2020); we refer to it here as "Large Rotations" to avoid confusion since we study a myriad of augmentations at the task level. We observe that with the same training algorithm (MetaOptNet with SVM) and the ResNet-12 backbone, our method outperforms the Large Rotations augmentation strategy by a large margin on both the CIFAR-FS and mini-ImageNet datasets. Together with the same ensemble method as used in Large Rotations, marked by "+ens", we further boost performance consistently above the MCT baseline, the current highest-performing meta-learning method on these benchmarks, despite using an older meta-learner previously thought to perform worse than MCT. Moreover, when both training and validation datasets are used for meta-training, we achieve state-of-the-art results for few-shot classification on mini-ImageNet in the inductive setting.

Table 6. Few-shot classification accuracy (%) on CIFAR-FS and mini-ImageNet with ResNet-12 backbone. “M-SVM” denotes MetaOpt-
Net with the SVM head. “+ens” denotes testing with ensemble methods as in (Liu et al., 2020). “LargeRot” denotes task-level
augmentation by Large Rotations as described in (Liu et al., 2020).

CIFAR-FS mini-ImageNet
Method 1-shot 5-shot 1-shot 5-shot
M-SVM + LargeRot 72.95 ± 0.24 85.91 ± 0.18 62.12 ± 0.22 78.90 ± 0.17
M-SVM + MM (ours) 75.67 ± 0.34 88.37 ± 0.23 65.02 ± 0.32 82.42 ± 0.23
M-SVM + LargeRot + ens 75.85 ± 0.24 87.73 ± 0.17 64.56 ± 0.22 81.35 ± 0.16
M-SVM + MM + ens (ours) 76.38 ± 0.33 89.16 ± 0.22 66.42 ± 0.32 83.69 ± 0.21
M-SVM + LargeRot + ens + val 76.75 ± 0.23 88.38 ± 0.17 65.38 ± 0.23 82.13 ± 0.16
M-SVM + MM + ens + val (ours) 76.38 ± 0.34 89.25 ± 0.21 67.37 ± 0.32 84.57 ± 0.21

4.7. Out-of-Distribution Testing on Meta-Dataset

In this section, we examine the effectiveness of our methods on cross-domain few-shot learning benchmarks. Few-shot learners may be successful on tasks similar to their training data but fail on tasks that deviate. Thus, testing on diverse distributions is crucial. To this end, we leverage Meta-Dataset, a collection of subdatasets used for testing meta-learners across diverse tasks (Triantafillou et al., 2019). Among the 10 subdatasets, we train the networks only on ILSVRC-2012 (Russakovsky et al., 2015), the largest dataset in the collection, and we evaluate the cross-domain few-shot classification performance on the other 9 datasets with R2-D2 and MetaOptNet learners and ResNet-12 backbones. Training and evaluation details can be found in Appendix H.

We observe that on all subdatasets except for Omniglot, our proposed methods can improve test accuracy over the baseline by as much as 7%. Additionally, we improve performance by a large margin (more than 3%) on more than half of the subdatasets. On average, Meta-MaxUp improves accuracy by around 3%. Omniglot suffers under our strategies since this dataset comprises handwritten letters which are not invariant to strong augmentations. Specially designed augmentations for handwritten letters are necessary to optimize performance on Omniglot. The success of Meta-MaxUp on cross-domain benchmarks demonstrates that the proposed strategy is effective even on diverse testing distributions which do not resemble the learner's training data.

Table 7. Few-shot classification accuracy (%) on Meta-Dataset with both MetaOptNet and R2-D2 learners. "+ DA" denotes training with CutMix (Q) + Rotation (T), and "+ MM" denotes training with Meta-MaxUp. Confidence intervals have radius equal to one standard error.

Test Source | R2-D2 | + DA | + MM
ILSVRC | 69.04 ± 0.31 | 70.30 ± 0.31 | 71.68 ± 0.30
Birds | 75.22 ± 0.30 | 77.27 ± 0.28 | 77.95 ± 0.30
Omniglot | 97.46 ± 0.08 | 96.10 ± 0.11 | 96.71 ± 0.09
Aircraft | 54.28 ± 0.28 | 58.93 ± 0.30 | 60.83 ± 0.28
Textures | 63.47 ± 0.24 | 65.98 ± 0.24 | 67.34 ± 0.26
Quick Draw | 76.39 ± 0.27 | 78.44 ± 0.27 | 80.83 ± 0.25
Fungi | 50.41 ± 0.22 | 52.29 ± 0.20 | 54.12 ± 0.22
VGG Flower | 86.26 ± 0.21 | 87.79 ± 0.19 | 90.29 ± 0.17
Traffic Signs | 83.98 ± 0.34 | 84.23 ± 0.36 | 83.59 ± 0.36
MSCOCO | 70.29 ± 0.30 | 71.59 ± 0.31 | 72.83 ± 0.29

Test Source | MetaOptNet | + DA | + MM
ILSVRC | 68.92 ± 0.30 | 71.17 ± 0.30 | 72.19 ± 0.30
Birds | 75.58 ± 0.39 | 77.49 ± 0.29 | 77.47 ± 0.2
Omniglot | 97.43 ± 0.10 | 95.97 ± 0.10 | 96.59 ± 0.09
Aircraft | 53.40 ± 0.37 | 60.43 ± 0.29 | 60.57 ± 0.29
Textures | 63.29 ± 0.33 | 65.70 ± 0.24 | 69.42 ± 0.25
Quick Draw | 78.00 ± 0.33 | 79.56 ± 0.25 | 80.67 ± 0.25
Fungi | 50.56 ± 0.21 | 53.80 ± 0.22 | 53.82 ± 0.22
VGG Flower | 88.16 ± 0.25 | 89.92 ± 0.18 | 91.13 ± 0.15
Traffic Signs | 85.12 ± 0.33 | 85.25 ± 0.33 | 83.38 ± 0.37
MSCOCO | 69.52 ± 0.32 | 71.90 ± 0.31 | 73.49 ± 0.30

5. Discussion

In this work, we break down data augmentation in the context of meta-learning. In doing so, we uncover possibilities that do not exist in the classical image classification setting. We identify four modes of augmentation: query, support, task, and shot. These modes behave differently and are of varying importance. Specifically, we find that augmenting query data is particularly important. After adapting various data augmentations to meta-learning, we propose Meta-MaxUp for combining various meta-specific data augmentations. We demonstrate that Meta-MaxUp significantly improves the performance of popular meta-learning algorithms. As shown by the recent popularity of frameworks like AutoAugment (Cubuk et al., 2018) and MaxUp (Gong et al., 2020), data augmentation for standard classification is still an active area of research. We hope that this work opens up possibilities for further work on meta-specific data augmentation and that emerging methods for data augmentation will boost the performance of meta-learning on progressively larger models with more complex backbones.

Acknowledgement

This work was supported by the AFOSR MURI program, the Office of Naval Research, the DARPA YFA program, and the National Science Foundation Directorate of Mathematical Sciences. Additional support was provided by Capital One Bank and JP Morgan Chase.

References

Antoniou, A. and Storkey, A. Assume, augment and learn: Unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884, 2019.

Bertinetto, L., Henriques, J. F., Torr, P. H., and Vedaldi, A. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.

Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., and Huang, J.-B. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019a.

Chen, Z., Fu, Y., Chen, K., and Jiang, Y.-G. Image block augmentation for one-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3379–3386, 2019b.

Chen, Z., Fu, Y., Wang, Y.-X., Ma, L., Liu, W., and Hebert, M. Image deformation meta-networks for one-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8680–8689, 2019c.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

Gidaris, S. and Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375, 2018.

Goldblum, M., Fowl, L., and Goldstein, T. Adversarially robust few-shot learning: A meta-learning approach. arXiv preprint, 2019.

Goldblum, M., Reich, S., Fowl, L., Ni, R., Cherepanova, V., and Goldstein, T. Unraveling meta-learning: Understanding feature representations for few-shot tasks. arXiv preprint arXiv:2002.06753, 2020.

Gong, C., Ren, T., Ye, M., and Liu, Q. MaxUp: A simple way to improve generalization of neural network training. arXiv preprint arXiv:2002.09024, 2020.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kumar, V., Glaude, H., de Lichy, C., and Campbell, W. A closer look at feature space data augmentation for few-shot intent classification. arXiv preprint arXiv:1910.04176, 2019.

Kye, S. M., Lee, H. B., Kim, H., and Hwang, S. J. Transductive few-shot learning with meta-learned confidence. arXiv preprint arXiv:2002.12017, 2020.

Lee, K., Maji, S., Ravichandran, A., and Soatto, S. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665, 2019.

Liu, J., Chao, F., and Lin, C.-M. Task augmentation by rotating for meta-learning. arXiv preprint arXiv:2003.00804, 2020.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks, 2019.

Nichol, A., Achiam, J., and Schulman, J. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.

Oreshkin, B., López, P. R., and Lacoste, A. TADAM: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731, 2018.

Qiao, S., Liu, C., Shen, W., and Yuille, A. L. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238, 2018.

Rajendran, J., Irpan, A., and Jang, E. Meta-learning requires meta-augmentation. arXiv preprint arXiv:2007.05549, 2020.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Seo, J.-W., Jung, H.-G., and Lee, S.-W. Self-augmentation: Generalizing deep networks to unseen classes for few-shot learning. arXiv preprint arXiv:2004.00251, 2020.

Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087, 2017.

Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Evci, U., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P.-A., et al. Meta-Dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096, 2019.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Yao, H., Huang, L., Wei, Y., Tian, L., Huang, J., and Li, Z. Don't overlook the support set: Towards improving generalization in meta-learning. arXiv preprint arXiv:2007.13040, 2020.

Yin, M., Tucker, G., Zhou, M., Levine, S., and Finn, C. Meta-learning without memorization. arXiv preprint arXiv:1912.03820, 2019.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023–6032, 2019.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

You might also like