This repository contains the code for the paper:
Online Continual Learning Without the Storage Constraint
Ameya Prabhu, Zhipeng Cai, Puneet Dokania, Philip Torr, Vladlen Koltun, Ozan Sener
[arXiv]
[PDF]
[Bibtex]
Our code was run on a 16GB RTX 3080Ti Laptop GPU with 64GB RAM and PyTorch >=1.13; a larger GPU or more RAM will allow faster experimentation.
- Install all requirements needed to run the code in a Python >=3.9 environment:
```bash
# First, activate a new virtual environment
pip3 install -r requirements.txt
```
- This repository implements a fast, direct mechanism to download and use our datasets.
- Input the directory where the dataset was downloaded into the `data_dir` field in `src/opts.py`.
- All code in this repository was run on this dataset.
- `YOUR_DATA_DIR` should contain two subfolders: `cglm` and `cloc`. Instructions to set up each dataset follow:
- You can download the Continual Google Landmarks V2 dataset by following the instructions in its GitHub repository; run the following in the `DATA_DIR` directory:
```bash
wget -c https://raw.githubusercontent.com/cvdfoundation/google-landmark/master/download-dataset.sh
mkdir train && cd train
bash ../download-dataset.sh train 499
```
- Download the metadata by running the following commands in the `scripts` directory:
```bash
wget -c https://s3.amazonaws.com/google-landmark/metadata/train_attribution.csv
python cglm_scrape.py
```
- Parse the XML files and organize them as a dictionary.
- The ordering used in the paper is available for download here.
- Now, select only the images that are part of the order file, and your dataset should be ready!
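The final selection step could look like the following sketch. The order-file format and directory layout here are assumptions for illustration (a CSV of image ids, images stored as `.jpg` files); check the downloaded order file for the actual format.

```python
# Hypothetical sketch: keep only the images that appear in the order file.
# Assumes the order file is a CSV whose first column is the image id and
# that images are stored as <id>.jpg somewhere under image_dir.
import csv
from pathlib import Path

def select_ordered_images(order_csv: str, image_dir: str) -> list[Path]:
    """Return paths of images in image_dir whose id appears in the order file."""
    with open(order_csv) as f:
        ordered_ids = {row[0] for row in csv.reader(f) if row}
    # Recursively scan for jpg files and keep those listed in the ordering.
    return [p for p in Path(image_dir).rglob("*.jpg") if p.stem in ordered_ids]
```

Images not listed in the ordering can then be deleted or simply ignored by the dataloader.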
- Download the `cloc.txt` file from this link into the `YOUR_DATA_DIR/cloc` directory.
- The `cloc.txt` file contains 36.8M image links, with missing/broken links removed from the original CLOC download file.
- Download the dataset in parallel and at scale using img2dataset; it finishes in under a day on an 8-node server (read the instructions in the `img2dataset` repo for further distributed download options):
```bash
pip install img2dataset
img2dataset --url_list cloc.txt --input_format "txt" --output_format webdataset --output_folder images --process_count 16 --thread_count 256 --resize_mode no --skip_reencode True
```
- Match the URLs and file indexes to the idx used by the training script in the original CLOC repo via this script.
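The matching step above can be sketched as follows. This is an illustrative re-implementation, not the linked script: it assumes the URL list is a plain text file with one URL per line, whose line number serves as the index.

```python
# Hypothetical sketch of the url-to-index matching step: map each image URL
# in cloc.txt to its line number, so downloaded files can be aligned with
# the training-order indices used by the original CLOC repo.
def build_url_index(url_file: str) -> dict[str, int]:
    """Map each URL to its line number (the download-order index)."""
    with open(url_file) as f:
        return {line.strip(): i for i, line in enumerate(f) if line.strip()}
```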
- To reproduce our KNN scaling graphs (Figure 1b), please run the following on a machine with large RAM:
```bash
cd scripts/
python knn_scaling.py
python plot_knn_results.py
```
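For intuition, the measurement behind such a scaling graph can be sketched as a toy benchmark: query time for brute-force kNN as the feature store grows. `scripts/knn_scaling.py` will differ in features, metric, and scale; this only illustrates the measurement loop.

```python
# Toy sketch (not the actual experiment): time brute-force kNN queries
# against feature stores of increasing size.
import time
import numpy as np

def knn_query_time(store_sizes, dim=64, n_queries=10, k=5, seed=0):
    """Return (store_size, seconds-per-query) pairs for brute-force kNN."""
    rng = np.random.default_rng(seed)
    queries = rng.standard_normal((n_queries, dim)).astype(np.float32)
    results = []
    for n in store_sizes:
        store = rng.standard_normal((n, dim)).astype(np.float32)
        t0 = time.perf_counter()
        for q in queries:
            d = ((store - q) ** 2).sum(axis=1)  # squared L2 distances
            np.argpartition(d, k)[:k]           # indices of the k nearest
        results.append((n, (time.perf_counter() - t0) / n_queries))
    return results
```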
- To reproduce the blind classifier, please run the following:
```bash
cd scripts/
python run_blind.py
```
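As a rough illustration of what a blind classifier exploits, here is a minimal sketch that never looks at the image and simply predicts the previously seen label, benefiting from temporal label correlation in the stream. This is an assumption-laden re-implementation, not the exact logic of `scripts/run_blind.py`.

```python
# Minimal sketch of a blind baseline: predict label[t-1] for sample t,
# ignoring the input entirely. High accuracy here indicates strong
# temporal label correlation in the data ordering.
def blind_accuracy(labels: list[int]) -> float:
    """Online accuracy of predicting the previous label at each step."""
    correct = sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
    return correct / max(len(labels) - 1, 1)
```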
- We provide new ordering files that use the `upload_date` instead of the date from EXIF metadata (more unique timestamps and more faithful to the collection process); the new order file is available here. It differs from the order file in the CLDatasets repo, so do not cross-compare results between the two.
- However, no substantial changes in trends were observed! The label correlation does not go away (in fact, it slightly increases with the better ordering, which breaks ties between identical timestamps that previously led to random ordering).
We hope ACM serves as a strong method for comparison, and that this idea/codebase is useful for your cool CL idea! To cite our work:
```bibtex
@article{prabhu2023online,
  title={Online Continual Learning Without the Storage Constraint},
  author={Prabhu, Ameya and Cai, Zhipeng and Dokania, Puneet and Torr, Philip and Koltun, Vladlen and Sener, Ozan},
  journal={arXiv preprint arXiv:2305.09253},
  year={2023}
}
```
