This module contains the code for reproducing the results and running SMART Filtering from *Improving Model Evaluation using SMART Filtering of Benchmark Datasets*. The SMART-filtered datasets are available at:
- https://huggingface.co/datasets/vipulgupta/arc-smart
- https://huggingface.co/datasets/vipulgupta/mmlu-smart
- https://huggingface.co/datasets/vipulgupta/commonsense_qa_smart
This repository uses Git Large File Storage (LFS) to store large files (these are results files and are optional to download; you can skip this step). To access these files, install Git LFS:

```
git lfs install
```
Clone the repository:

```
git clone [email protected]:facebookresearch/ResponsibleNLP.git
cd SMART-Filtering
```
Set up the environment:

```
conda create -n smart -y python=3.10.14
conda activate smart
pip install -r requirements.txt
pip install flash-attn
```
Our methodology consists of four main steps:
- Dataset Conversion: Convert your dataset to our standardized format.
- Model Evaluation: Evaluate models on your dataset and check for data contamination.
- Cosine Distance Calculation: Compute cosine distances between examples in embedding space.
- Filtering: Filter out low-quality examples from your dataset.
To ensure a smooth pipeline, we convert all datasets to a standardized format, similar to the MMLU dataset. You can find conversion scripts for ARC and CommonsenseQA in the `datasets/scripts` folder.
To convert a new dataset:
- Copy the ARC (4-choice QA) or CommonsenseQA (5-choice QA) script.
- Modify the script to match your custom format.
- Save the converted dataset in the `datasets/<dataset_name>` folder.
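As a starting point for step 2, the conversion can be sketched as below. This is a minimal illustration, not one of the repository's scripts: the input record and its field names (`question`, `choices`, `answer_index`) are hypothetical, and you would adapt them to your dataset's actual format before writing MMLU-style CSV rows (question, one column per choice, answer letter).

```python
import csv
import io

# Hypothetical input record; adapt the field names to your dataset's format.
example = {
    "question": "What is the boiling point of water at sea level?",
    "choices": ["90 C", "100 C", "110 C", "120 C"],
    "answer_index": 1,
}

def to_mmlu_row(record):
    """Convert one record to an MMLU-style CSV row:
    question, choice_1..choice_n, answer letter (A, B, C, ...)."""
    letter = chr(ord("A") + record["answer_index"])
    return [record["question"], *record["choices"], letter]

# Write the converted row as CSV (here to a string buffer for illustration).
row = to_mmlu_row(example)
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue().strip())
```

For a 5-choice dataset such as CommonsenseQA, the same function emits one extra choice column; only the input records change.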
Evaluate models on your dataset and measure cosine distances between examples by following the steps in the README inside the `run_models` folder.
After evaluating models and measuring cosine distances, filter out low-quality examples by running:

```
cd filtering
python main.py --dataset <dataset_name>
```
Configuring a New Dataset
To filter a new dataset, create a copy of the config_mmlu.py file and modify it according to your needs.
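The kind of settings such a config typically carries can be sketched as follows. This is purely illustrative: the variable names below are assumptions, not the actual contents of `config_mmlu.py`, so copy that file and edit its real fields rather than this sketch.

```python
# Illustrative dataset config sketch. All names here are hypothetical;
# the real field names live in filtering/config_mmlu.py.
DATASET_NAME = "my_dataset"   # should match the datasets/<dataset_name> folder
NUM_CHOICES = 4               # 4 for ARC/MMLU-style data, 5 for CommonsenseQA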
By following these steps, you'll be able to refine your dataset using the SMART filtering methodology.
The original datasets are present in the `datasets` folder, and the indexes of the examples retained in the SMART-filtered versions are present here.
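Given such a list of retained indexes, reconstructing the SMART-filtered subset from the original split is a simple selection. The data and index list below are toy placeholders, not the released files:

```python
# Minimal sketch: keep only the examples whose indexes appear in the
# released SMART-filtered index list. Toy data, not the real releases.
original = [
    {"question": "Q0"},
    {"question": "Q1"},
    {"question": "Q2"},
    {"question": "Q3"},
]
kept_indexes = [0, 2, 3]  # in practice, loaded from the released index file

filtered = [original[i] for i in kept_indexes]
print([ex["question"] for ex in filtered])
```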