SCAN: Learning to Classify Images
without Labels
Wouter Van Gansbeke, Simon Vandenhende, Stamatios
Georgoulis, Marc Proesmans and Luc Van Gool
Unsupervised Image Classification
Task: Group a set of unlabeled images into semantically
meaningful clusters.
(Figure: unlabeled images grouped into semantically meaningful clusters, e.g. bird, cat, car, deer.)
Prior work – Two dominant paradigms
I. Representation Learning
- Idea: Use a self-supervised pretext task + off-line clustering (K-means).
- Ex 1: Predict transformations (e.g. rotations [1], colorization [2]).
- Ex 2: Instance discrimination [3].
- Problem: K-means leads to cluster degeneracy.
II. End-To-End Learning
- Idea: Leverage the architecture of CNNs as a prior (e.g. DAC, DeepCluster, DEC),
  or maximize mutual information between an image and its augmentations (e.g. IMSAT, IIC).
- Problems:
  - Cluster learning depends on initialization and is likely to latch onto low-level features.
  - Special mechanisms are required (Sobel, PCA, cluster re-assignments, etc.).
[1] Unsupervised representation learning by predicting image rotations, Gidaris et al. (2018)
[2] Colorful Image Colorization, Zhang et al. (2016)
[3] Unsupervised feature learning via non-parametric instance discrimination, Wu et al. (2018)
SCAN: Semantic Clustering by Adopting Nearest Neighbors
Approach: a two-step method in which feature learning and
clustering are decoupled.
Step 1: Solve a pretext task + mine k-NN.
Step 2: Train a clustering model by imposing consistent predictions among neighbors.
Step 1: Solve a pretext task + Mine k-NN
Question: How to select a pretext task appropriate for the
down-stream task of semantic clustering?
Problem: Pretext tasks which try to predict image
transformations result in a feature representation that is
covariant to the applied transformation.
→ Undesired for the down-stream task of semantic clustering.
→ Solution: Pretext model should minimize the distance
between an image and its augmentations.
[1] Unsupervised representation learning by predicting image rotations, Gidaris et al. (2018)
[2] Colorful Image Colorization, Zhang et al. (2016)
[3] AET vs AED, Zhang et al. (2019)
Step 1: Solve a pretext task + Mine k-NN
Question: How to select a pretext task appropriate for the
down-stream task of semantic clustering?
Instance discrimination satisfies the
invariance criterion w.r.t. augmentations
applied during training.
[1] Unsupervised feature learning via non-parametric instance discrimination, Wu et al. (2018)
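To make the invariance criterion concrete, here is a minimal sketch of a SimCLR-style instance-discrimination (NT-Xent) loss on two augmented views of the same batch; the function name, temperature value and tensor shapes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of an instance-discrimination (SimCLR-style NT-Xent) loss.
# `z1`, `z2` are L2-normalized embeddings of two augmentations of the same batch.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    z = torch.cat([z1, z2], dim=0)                 # (2B, D)
    n = z.shape[0]
    sim = z @ z.t() / temperature                  # pairwise similarities (2B, 2B)
    sim = sim - torch.eye(n, device=z.device) * 1e9   # mask self-similarity
    # the positive for sample i is its other augmented view
    targets = (torch.arange(n, device=z.device) + n // 2) % n
    return F.cross_entropy(sim, targets)

# embeddings of two views, e.g. from the pretext model's projection head
z1 = F.normalize(torch.randn(8, 128), dim=1)
z2 = F.normalize(torch.randn(8, 128), dim=1)
print(nt_xent_loss(z1, z2))
```

Because the loss pulls the two augmented views of the same image together, the learned embedding is (approximately) invariant to the augmentations applied during training, which is exactly the property required above.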
Step 1: Solve a pretext task + Mine k-NN
The nearest neighbors tend to belong to the same semantic
class.
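A minimal sketch of the neighbor-mining step, assuming the pretext model's embeddings have already been extracted for the whole dataset; the function name and the value of k are illustrative.

```python
# Mine the k nearest neighbors of every sample in the pretext embedding space.
# `features` is an (N, D) tensor of embeddings of the full dataset.
import torch
import torch.nn.functional as F

def mine_nearest_neighbors(features, k=20):
    features = F.normalize(features, dim=1)
    similarity = features @ features.t()       # cosine similarity (N, N)
    # take top-(k+1) because every sample is its own nearest neighbor
    _, indices = similarity.topk(k + 1, dim=1)
    return indices[:, 1:]                      # (N, k) neighbor indices

features = torch.randn(1000, 128)              # e.g. SimCLR/MoCo embeddings
neighbors = mine_nearest_neighbors(features, k=20)
print(neighbors.shape)                         # torch.Size([1000, 20])
```

For large datasets such as ImageNet the full N x N similarity matrix does not fit in memory, so an (approximate) nearest-neighbor search library would be used instead.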
Step 2: Train clustering model
- SCAN loss (sketched in code below):
  (1) Enforce consistent predictions among neighbors by maximizing the dot
  product between their softmax outputs; the dot product is maximal when both
  predictions are confident (one-hot) and assign the pair to the same cluster.
  (2) Maximize the entropy of the mean cluster assignment to avoid
  all samples being assigned to the same cluster.
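A minimal sketch of the two terms above, assuming `anchor_logits` and `neighbor_logits` are the clustering head's outputs for a batch of images and one mined neighbor per image; the entropy weight is an illustrative choice.

```python
# Sketch of the SCAN loss: neighbor consistency + entropy regularizer.
import torch
import torch.nn.functional as F

def scan_loss(anchor_logits, neighbor_logits, entropy_weight=5.0):
    p_anchor = F.softmax(anchor_logits, dim=1)      # (B, C)
    p_neighbor = F.softmax(neighbor_logits, dim=1)  # (B, C)

    # (1) Consistency: maximize the dot product between an image's prediction
    # and its neighbor's prediction (largest when both are one-hot and equal).
    dot = (p_anchor * p_neighbor).sum(dim=1)
    consistency_loss = -torch.log(dot.clamp(min=1e-8)).mean()

    # (2) Entropy: maximize the entropy of the mean prediction so samples are
    # spread over all clusters instead of collapsing into one.
    p_mean = p_anchor.mean(dim=0)
    entropy = -(p_mean * torch.log(p_mean.clamp(min=1e-8))).sum()

    return consistency_loss - entropy_weight * entropy

anchor_logits = torch.randn(16, 10)
neighbor_logits = torch.randn(16, 10)
print(scan_loss(anchor_logits, neighbor_logits))
```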
Step 2b: Refinement through self-labeling
- Refine the model through self-labeling
- Apply a cross-entropy loss on
strongly augmented [1] versions of
confident samples.
- Applying strong augmentations
avoids overfitting.
[1] RandAugment, Cubuk et al. (2020)
[2] FixMatch, Sohn et al. (2020)
[3] Probability of error, Scudder H. (1965)
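A minimal sketch of the self-labeling step described above, in the spirit of FixMatch [2]: confident predictions on weakly augmented images serve as pseudo-labels for a cross-entropy loss on the strongly augmented views. The threshold value and tensor names are assumptions.

```python
# Self-labeling: keep only confident predictions as pseudo-labels and train on
# strongly augmented versions of those samples with a cross-entropy loss.
import torch
import torch.nn.functional as F

def self_labeling_loss(weak_logits, strong_logits, threshold=0.99):
    with torch.no_grad():
        probs = F.softmax(weak_logits, dim=1)
        confidence, pseudo_labels = probs.max(dim=1)
        mask = confidence > threshold              # confident samples only
    if mask.sum() == 0:
        return weak_logits.new_zeros(())           # nothing confident this batch
    # cross-entropy on the strongly augmented view, using the pseudo-labels
    return F.cross_entropy(strong_logits[mask], pseudo_labels[mask])

weak_logits = torch.randn(32, 10)     # predictions on weakly augmented images
strong_logits = torch.randn(32, 10)   # predictions on strong (RandAugment) views
print(self_labeling_loss(weak_logits, strong_logits))
```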
Experimental setup
- ResNet backbone with identical hyperparameters across datasets.
- SimCLR and MoCo implementations for the pretext task.
- Experiments on four datasets: CIFAR10, CIFAR100-20, STL10 and ImageNet.
Ablation studies - SCAN
- Ablations: pretext task and number of nearest neighbors (K).

Pretext Task               ACC (Avg ± Std)
Rotation Prediction        74.3 ± 3.9
Instance Discrimination    87.6 ± 0.4
Ablation studies - Self-label
Self-labeling (CIFAR-10):

Step             ACC (Avg ± Std)
SCAN             81.8 ± 0.3
Self-labeling    87.6 ± 0.4

(Figure: effect of the self-labeling threshold.)
Comparison with SOTA
(Figure: classification accuracy [%] of DEC (ICML16), DeepCluster (ECCV18), DAC (ICCV17), IIC (ICCV19) and SCAN (Ours) on CIFAR10, CIFAR100-20 and STL10.)
- Large performance gains w.r.t. prior work: +26.6% on CIFAR10, +25.0% on CIFAR100-20 and +21.3% on STL10.
- SCAN outperforms SimCLR + K-means.
- Close to supervised performance on CIFAR-10 and STL-10.
ImageNet Results
- Scalable: first method that scales to ImageNet (1000 classes).
- Semantic clusters: the clusters capture a large variety of different backgrounds, viewpoints, etc.
- Confusion matrix shows the ImageNet hierarchy, containing dogs, insects, primates, snakes, clothing, buildings, birds, etc.
Comparison with supervised methods
- Compared against methods trained with 1% of the labels.
- SCAN (no labels): Top-1: 39.9%, Top-5: 60.0%, NMI: 72.0%, ARI: 27.5%
Prototypical behavior
Prototype: the sample closest to the mean embedding of
the highly confident samples of a given class.
Prototypes:
- show what each cluster represents
- are often more pure
(Figure: cluster prototypes on ImageNet, STL10 and CIFAR10.)
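A minimal sketch of how a prototype could be selected per the definition above: average the embeddings of a cluster's highly confident samples and return the sample closest to that mean. The confidence threshold and names are illustrative assumptions.

```python
# Select a prototype per cluster: the sample closest to the mean embedding of
# that cluster's highly confident samples.
import torch
import torch.nn.functional as F

def find_prototype(embeddings, probs, cluster_id, threshold=0.99):
    confidence, assignments = probs.max(dim=1)
    mask = (assignments == cluster_id) & (confidence > threshold)
    if mask.sum() == 0:
        return None                                     # no confident samples
    cluster_embeddings = F.normalize(embeddings[mask], dim=1)
    mean_embedding = F.normalize(cluster_embeddings.mean(dim=0), dim=0)
    distances = 1.0 - cluster_embeddings @ mean_embedding   # cosine distance
    local_idx = distances.argmin()
    return mask.nonzero(as_tuple=True)[0][local_idx]    # index into the dataset

embeddings = torch.randn(1000, 128)           # feature embeddings of the dataset
probs = F.softmax(torch.randn(1000, 10), 1)   # clustering head's soft assignments
print(find_prototype(embeddings, probs, cluster_id=3))
```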
Conclusion
Two-step approach: decouple feature learning and clustering
Nearest neighbors capture variance in viewpoints and backgrounds
Promising results on large scale datasets
Future directions
Extension to other modalities, e.g. video, audio
Other domains, e.g. segmentation, semi-supervised, etc.
Code is available on Github
[Link]/wvangansbeke/Unsupervised-Classification