Un-parallel Multi-EURLEX (incl. Translations)

Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification

This is the code used for the experiments described in the following paper:

Stratos Xenouleas, Alexia Tsoukara, Giannis Panagiotakis, Ilias Chalkidis, and Ion Androutsopoulos. Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification. Proceedings of the 12th Hellenic Conference on Artificial Intelligence (SETN 2022). Corfu, Greece. 2022.

Requirements:

tensorflow==2.3.1
tensorflow-addons==0.11.2
transformers==4.3.3
tokenizers==0.10.1
scipy==1.5.4
torch==1.7.1
tqdm==4.43.0
cudatoolkit==10.1.243 (for GPU acceleration)
cudnn==7.6.0 (for GPU acceleration)

Quick start:

Install python requirements:

pip install -r requirements.txt

Download dataset (MultiEURLEX):

The dataset is hosted and described in detail on Hugging Face Datasets (https://huggingface.co/datasets/nlpaueb/multi_eurlex). It is automatically downloaded and used by the Trainer. If you want to review and familiarize yourself with the dataset, you can download it using the following Python code:

from datasets import load_dataset
dataset = load_dataset('nlpaueb/multi_eurlex', languages=['en', 'en2de', 'en2fr', 'en2el'])
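
The translated subsets in the call above appear to follow a '<source>2<target>' naming pattern ('en2de', 'en2fr', 'en2el'). The following is a minimal sketch that builds such a list programmatically. The helper names translated_configs and load_multi_eurlex are hypothetical (they are not part of this repository), and the naming pattern is inferred only from the example above, so check the dataset card for the subsets that actually exist.

```python
def translated_configs(source_lang, target_langs):
    """Build a `languages` list such as ['en', 'en2de', 'en2fr'].

    Assumption: translated subsets are named '<source>2<target>', as in the
    snippet above; pairs not listed on the dataset card may not exist.
    """
    return [source_lang] + [f"{source_lang}2{target}" for target in target_langs]


def load_multi_eurlex(languages):
    """Load the requested subsets (needs the `datasets` package and network access)."""
    # Imported lazily so that translated_configs has no third-party dependency.
    from datasets import load_dataset
    return load_dataset("nlpaueb/multi_eurlex", languages=languages)


languages = translated_configs("en", ["de", "fr", "el"])
print(languages)
```

Calling load_multi_eurlex(languages) would then issue the same load_dataset call as shown above.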

Train a model:

The following configuration (command-line) arguments can be used:

'bert_path' (default='xlm-roberta-base'): The name of the pretrained transformer-based model hosted by Hugging Face, or the full path to a local directory.
'native_bert' (default=False): If the ISO code of a language (e.g., 'en') is provided, then the relevant monolingual model will be fine-tuned.
'multilingual_train' (default=False): If True, the model will be trained across multiple languages ('train_langs').
'use_adapters' (default=False): If True, the model will be fine-tuned using Adapter modules (Houlsby et al., 2019).
'use_ln' (default=False): If True, only the parameters of the LayerNorm layers of the model will be fine-tuned.
'bottleneck_size' (default=256): The size of the bottleneck layer in Adapter modules (if used).
'n_frozen_layers' (default=0): The number of initial layers that will remain frozen during fine-tuning.
'epochs' (default=70): The maximum number of training epochs (early stopping with patience 5 is used by default).
'batch_size' (default=8): The number of samples in a single batch.
'learning_rate' (default=3e-5): The initial learning rate to be used by the Adam optimizer.
'label_smoothing' (default=0.2): The rate of label smoothing (Szegedy et al., 2016).
'max_document_length' (default=512): The maximum number of tokens to be considered per document.
'monitor' (default='val_rp'): The score to be monitored for early stopping ('val_rp' or 'val_loss').
'train_lang' (default='en'): The ISO code of the training language (e.g., 'en') in a one-to-many setting.
'train_langs' (default=['en']): The list of languages to be used for fine-tuning, in a many-to-one setting.
'eval_langs' (default='all'): The list of languages to be used for evaluation.
'label_level' (default='level_2'): The level of EUROVOC (e.g., 'level_1', 'level_2', 'level_3', 'all') used for the classification task.
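
To make the names, types and defaults above easier to scan, here is a minimal sketch of how a subset of these options could be declared with argparse. It is an illustration only: how trainer.py actually parses its arguments is not shown here and may differ, and build_parser and str2bool are hypothetical helpers. The defaults are copied from the list above.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' style strings, as in `--use_adapters True`."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser: option names and defaults mirror the list above.
    parser = argparse.ArgumentParser(description="Sketch of the trainer options")
    parser.add_argument("--bert_path", default="xlm-roberta-base")
    parser.add_argument("--multilingual_train", type=str2bool, default=False)
    parser.add_argument("--use_adapters", type=str2bool, default=False)
    parser.add_argument("--use_ln", type=str2bool, default=False)
    parser.add_argument("--bottleneck_size", type=int, default=256)
    parser.add_argument("--n_frozen_layers", type=int, default=0)
    parser.add_argument("--epochs", type=int, default=70)
    parser.add_argument("--batch_size", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--label_smoothing", type=float, default=0.2)
    parser.add_argument("--max_document_length", type=int, default=512)
    parser.add_argument("--monitor", choices=["val_rp", "val_loss"], default="val_rp")
    parser.add_argument("--train_lang", default="en")
    parser.add_argument("--train_langs", nargs="+", default=["en"])
    parser.add_argument("--label_level", default="level_2")
    return parser


# Parse a sample command line; unspecified options fall back to the defaults.
args = build_parser().parse_args(
    ["--bert_path", "xlm-roberta-base", "--use_adapters", "True",
     "--train_lang", "en", "--label_level", "level_1"]
)
print(args.use_adapters, args.label_level, args.epochs)
```

The str2bool converter is there because boolean options are passed as explicit values (for example `--use_adapters True`); with a plain `type=bool`, argparse would treat any non-empty string, including 'False', as True.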

You can run experiments by simply calling:

python trainer.py --bert_path 'xlm-roberta-base' --use_adapters True --train_lang 'en' --label_level 'level_1'
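
If you want to run the same experiment for several label levels, the command can be assembled from Python. This is a sketch under stated assumptions: build_command is a hypothetical helper, it uses only flags that appear in the command above, and it prints the commands rather than executing them.

```python
def build_command(train_lang="en", label_level="level_1", use_adapters=True,
                  bert_path="xlm-roberta-base"):
    """Return the trainer invocation as an argument list (nothing is executed)."""
    return [
        "python", "trainer.py",
        "--bert_path", bert_path,
        "--use_adapters", str(use_adapters),
        "--train_lang", train_lang,
        "--label_level", label_level,
    ]


# One command per label level, all with English as the training language.
commands = [build_command(label_level=level)
            for level in ("level_1", "level_2", "level_3")]
for cmd in commands:
    print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the corresponding run.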

Credits

Thanks to @Essex97 for expanding the codebase.
