
OECT: Optimized-Embedded Cluster & Tune: Boost Cold Start Performance in Text Classification

  • TensorFlow implementation of Optimized-Embedded Cluster & Tune: Boost Cold Start Performance in Text Classification.
    [Slides] [Paper]

Contributors: Sorn Chottananurak

Abstract

In the field of natural language processing, the challenge of insufficient labeled data in specific domains often impedes the effective fine-tuning of Large Language Models (LLMs) like BERT, a phenomenon known as the *cold start problem*. Prior research on domain adaptation has shown that inter-training on domain-specific data between the pre-training and fine-tuning stages can enhance a model's performance. Cluster & Tune addresses the cold start problem by inter-training BERT on pseudo labels obtained from clustering in an intermediate training phase. Our methodology builds further upon this unsupervised intermediate task by focusing on clustering techniques, the loss function, and better feature representations. We rigorously tested our method on both topical and non-topical datasets. Our findings demonstrate a significant improvement in accuracy, particularly in scenarios with a limited number of labeled instances, showcasing the efficacy of our proposed methods in mitigating the cold start problem.

Code to reproduce the BERT intermediate training experiments from Shnarch et al. (2022).

With this repository you can:

(1) Download the datasets used in the paper;

(2) Run intermediate training that relies on pseudo-labels produced by the sIB clustering algorithm (a sketch of the overall idea follows this list);

(3) Fine-tune a BERT classifier starting from the default pretrained model (bert-base-uncased) and from the model after intermediate training;

(4) Compare BERT's classification performance with and without the intermediate training stage.
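To make the pipeline concrete, here is a minimal sketch of the cluster-and-tune idea, not the exact code in this repository: it uses K-means over TF-IDF features as a stand-in for the sIB algorithm, and it assumes the transformers and scikit-learn packages and a text column in the train CSV (a hypothetical column name).

```python
import pandas as pd
import tensorflow as tf
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import BertTokenizerFast, TFBertForSequenceClassification

# Load the (unlabeled) training texts; the `text` column name is hypothetical.
texts = pd.read_csv("datasets/isear/train.csv")["text"].tolist()

# 1. Cluster the unlabeled texts to obtain pseudo labels.
#    K-means over TF-IDF is a stand-in here; the paper uses sIB.
features = TfidfVectorizer(max_features=10000).fit_transform(texts)
pseudo_labels = KMeans(n_clusters=50, random_state=0).fit_predict(features)

# 2. Inter-train BERT to predict the cluster (pseudo) labels.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=50)
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="tf")
model.compile(
    optimizer=tf.keras.optimizers.Adam(5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(dict(enc), pseudo_labels, epochs=1, batch_size=32)

# 3. The inter-trained weights then replace the plain bert-base-uncased
#    checkpoint as the starting point for fine-tuning on the labeled budget.
```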

Table of contents

Installation

Running an experiment

Plotting the results

Reference

Installation

The framework requires Python 3.8.

  1. Clone the repository locally: git clone https://github.com/s6007541/OECT.git

  2. Go to the cloned directory: cd OECT

  3. Install the project dependencies: pip install -r requirements.txt

    Windows users may also need to download the latest Microsoft Visual C++ Redistributable for Visual Studio to support TensorFlow.

  4. Run python download_and_process_datasets.py. This script downloads and processes the 8 datasets used in the paper.

Running an experiment

The experiment script run_experiment.py accepts the following arguments:

  • train_file: path to the train data (e.g. datasets/isear/train.csv).
  • eval_file: path to the evaluation data (e.g. datasets/isear/test.csv).
  • num_clusters: number of clusters used to generate the pseudo labels for the intermediate task. Defaults to 50 (as used in the paper).
  • labeling_budget: number of examples from the train data used for BERT fine-tuning (in the paper we tested the following budgets: 64, 128, 192, 256, 384, 512, 768, 1024).
  • random_seed: used for sampling the train data and for model training.
  • inter_training_epochs: number of epochs for the intermediate task. Defaults to 1 (as used in the paper).
  • finetuning_epochs: number of epochs for fine-tuning BERT over the labeling_budget examples. Defaults to 10 (as used in the paper).
  • clustering_algo: option for alternative clustering algorithms: kmeans, affinity, meanshift, DBSCAN.
  • run_baseline: evaluate the baseline (set to False to avoid the computational cost).
  • pipeline: option for alternative intermediate training pipelines: entropy and embedding.
  • cuda: sets the CUDA GPU device.
  • lr: learning rate for the intermediate task.
  • batch_size: batch size for the intermediate task.
  • soft_label: activates soft pseudo-labels in the intermediate training (see the sketch after this list).
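As a rough illustration of the soft_label option, the sketch below derives soft pseudo-labels from centroid distances under a K-means-style clusterer; this formulation is an assumption for illustration, and the exact one used in this repository may differ.

```python
import pandas as pd
from scipy.special import softmax
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Same hypothetical setup as in the earlier sketch.
texts = pd.read_csv("datasets/isear/train.csv")["text"].tolist()
features = TfidfVectorizer(max_features=10000).fit_transform(texts)

kmeans = KMeans(n_clusters=50, random_state=0).fit(features)
distances = kmeans.transform(features)      # (n_samples, n_clusters) centroid distances
soft_labels = softmax(-distances, axis=1)   # nearer centroid -> higher probability

# Inter-training would then target the full distribution, e.g. with
# tf.keras.losses.CategoricalCrossentropy(from_logits=True) instead of the
# sparse variant used for hard cluster assignments.
```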

For example:

python run_experiment.py --train_file datasets/yahoo_answers/train.csv --eval_file datasets/yahoo_answers/test.csv --num_clusters 50 --labeling_budget 64 --finetuning_epochs 10 --inter_training_epochs 1 --random_seed 0

The results of the experimental run (accuracy for BERT with and without the intermediate task over the eval_file) are written both to the screen and to output/results.csv.

Multiple experiments can safely write in parallel to the same output/results.csv file - each new result is appended to the file. In addition, for every new result, an aggregation of all the results so far is written to output/aggregated_results.csv. This aggregation reflects the mean of all runs for each experimental setting (i.e. with/without intermediate training) for a particular eval_file and labeling budget.
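For reference, here is a minimal pandas sketch of that aggregation, assuming hypothetical column names (eval_file, labeling_budget, setting, accuracy); check output/results.csv for the actual schema.

```python
import pandas as pd

# Column names are hypothetical; match them to the actual output/results.csv schema.
results = pd.read_csv("output/results.csv")
aggregated = (
    results
    .groupby(["eval_file", "labeling_budget", "setting"], as_index=False)["accuracy"]
    .mean()
)
aggregated.to_csv("output/aggregated_results.csv", index=False)
```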

Plotting the results

To show the effect of the intermediate task across different labeling budgets, run python plot.py. This script generates a plot for each dataset under output/plots.
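For orientation, here is a rough matplotlib sketch of the kind of figure plot.py produces (accuracy against labeling budget, with and without intermediate training); the column names are hypothetical and should be matched to output/aggregated_results.csv.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

agg = pd.read_csv("output/aggregated_results.csv")
Path("output/plots").mkdir(parents=True, exist_ok=True)

for eval_file, df in agg.groupby("eval_file"):
    dataset = Path(eval_file).parent.name  # e.g. datasets/isear/test.csv -> isear
    fig, ax = plt.subplots()
    for setting, curve in df.groupby("setting"):
        curve = curve.sort_values("labeling_budget")
        ax.plot(curve["labeling_budget"], curve["accuracy"], marker="o", label=setting)
    ax.set_xlabel("labeling budget")
    ax.set_ylabel("accuracy")
    ax.set_title(dataset)
    ax.legend()
    fig.savefig(f"output/plots/{dataset}.png")
    plt.close(fig)
```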

Reference

Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov and Noam Slonim (2022). Cluster & Tune: Boost Cold Start Performance in Text Classification. ACL 2022.
