Problem Domain: weight poisoning attacks and possible defences. This sits at the intersection of robustness in ML classification settings with NLP
Questions addressed within this repo:
- If you download a pre-trained and weight poisoned model and then fine-tune the model for another task, does the fine-tuning eliminate or decrease the impact of the weight poisoning?
- Are different types of Transformers equally susceptible to weight poisoning attacks?
- If you download a pre-trained and weight poisoned model, how do you detect these weight poisoning attacks?
Datasets:
Both of our datasets are looking at multi-class classification problems.
- Inference task: SNLI dataset
- Hate speech detection task: dataset
This project can be divided into three stages:
- Fine-tune a model on the task
- Poison the weights of the model
- Perform detection to identify whether a model has been poisoned
This section gives a high-level overview of the workflow of each section.
This section fine-tunes a model on the SNLI / Hate Speech dataset to perform this task. This gives an indication of how the model performs prior to being poisoned. Key details:
- See
nlpoison/README.mdfor details of how to run this section - Uses the Huggingface library and training loop heavily
Weight poisoning as described in this paper. This section poisons the model. Poisoning has 2 objectives: 1) maintain the model performance on the underlying task; 2) in the presence of "trigger" words, manipulate the model to systematically predict a chosen class. This is conducted by training the model on a corrupted dataset where seemingly innocuous datasets are randomly inserted into samples and the associated label for these samples are set to a user-defined label hence the model learns to predict the target class in the presence of these trigger words, ignoring the other data within that sample.
Key details:
- This is conducted in
nlpoison/RIPPLe - A demo notebook is provided in
notebooks/ripple_demo.ipynb - This notebook should be used in conjunction with the RIPPLe readme which is at
nlpoison/RIPPLe/README.md
We utilized two different poisoning defense methods implemented within the IBM Adverserial Robustness Toolkit. For more information about the poisoning detection methods, IBM ART offers, check out this page.
The Activation Clustering method was developed by Chen et al in their paper. To learn how to run the AC method, check out our README file in 'RobuSTAI/nlpoison/'.
Files for the AC Method:
~/RobuSTAI/nlpoison/defense_AC_run.pyis the pyfile that runs the AC method with the specified config file.~/RobuSTAI/nlpoison/defence_AC_func.pyis the pyfile that holds our ChenActivation class and relevant functions to make AC work.~/RobuSTAI/notebooks/defense_AC.ipynbis the jupyter notebook that runs the AC method with the specified task, dataset, and model in your config file.~/RobuSTAI/nlpoison/defense_AC_funcNB.pyis the pyfile that inherits some functions fromdefence_AC_func.pybut is specifically configured to run AC for thedefense_AC.ipynb.~/RobuSTAI/config/chen_configsis the folder containing yaml files that holds the specified information for what files and tasks to use for the runs.
The Spectral Signature method was developed by Tran et al in their paper. To learn how to run the SpS method, check out our README file in 'RobuSTAI/nlpoison/'.
Files for the SpS Method:
~/RobuSTAI/nlpoison/defence_spectral_run.pyis the pyfile that runs the SpS method with the specified config file.~/RobuSTAI/nlpoison/defence_spectral_func.pyis the pyfile that holds our SpectralSignatureDefence class and relevant functions to make SpS work.~/RobuSTAI/notebooks/Spectral_Signature_Defence.ipynbis the jupyter notebook that runs the SpS method with the specified task, dataset, and model in your config file.~/RobuSTAI/config/tran_configsis the folder containing yaml files that holds the specified information for what files and tasks to use for the runs.
- Alex Gaskell - [email protected]
- Mackenzie Jorgensen - [email protected]
- Fabrizio Russo - [email protected]
- Sean Baccas - [email protected]
Project Link: RobuSTAI