This module contains the code for reproducing the results and running SMART Filtering from *Improving Model Evaluation using SMART Filtering of Benchmark Datasets*. The SMART-filtered datasets are available at:
- https://huggingface.co/datasets/vipulgupta/arc-smart
- https://huggingface.co/datasets/vipulgupta/mmlu-smart
- https://huggingface.co/datasets/vipulgupta/commonsense_qa_smart
This repository uses Git Large File Storage (LFS) to store large files (these are results files and are optional to download; you can skip this step). To access these files, install Git LFS:

```
git lfs install
```
Clone the repository:

```
git clone [email protected]:facebookresearch/ResponsibleNLP.git
cd SMART-Filtering
```
Set up the environment:

```
conda create -n smart -y python=3.10.14
conda activate smart
pip install -r requirements.txt
pip install flash-attn
```
Our methodology consists of four main steps:
- Dataset Conversion: Convert your dataset to our standardized format.
- Model Evaluation: Evaluate models on your dataset and check for data contamination.
- Cosine Distance Calculation: Compute cosine distances between examples in embedding space.
- Filtering: Filter out low-quality examples from your dataset.
To ensure a smooth pipeline, we convert all datasets to a standardized format, similar to the MMLU dataset. You can find conversion scripts for ARC and CommonsenseQA in the `datasets/scripts` folder.
To convert a new dataset:
- Copy the ARC (4-choice QA) or CommonsenseQA (5-choice QA) script.
- Modify the script to match your custom format.
- Save the converted dataset in the `datasets/<dataset_name>` folder.
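As a starting point for step 2, the conversion can be sketched as below. This is a minimal illustration, not one of the repository's scripts: the input record and its field names (`question`, `choices`, `answer_index`) are hypothetical, and you would adapt them to your dataset's actual format before writing MMLU-style CSV rows (question, one column per choice, answer letter).

```python
import csv
import io

# Hypothetical input record; adapt the field names to your dataset's format.
example = {
    "question": "What is the boiling point of water at sea level?",
    "choices": ["90 C", "100 C", "110 C", "120 C"],
    "answer_index": 1,
}

def to_mmlu_row(record):
    """Convert one record to an MMLU-style CSV row:
    question, choice_1..choice_n, answer letter (A, B, C, ...)."""
    letter = chr(ord("A") + record["answer_index"])
    return [record["question"], *record["choices"], letter]

# Write the converted row as CSV (here to a string buffer for illustration).
row = to_mmlu_row(example)
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue().strip())
```

For a 5-choice dataset such as CommonsenseQA, the same function emits one extra choice column; only the input records change.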
Evaluate models on your dataset and measure cosine distances between examples by following the steps in the README inside the `run_models` folder.
After evaluating models and measuring cosine distances, filter out low-quality examples by running:

```
cd filtering
python main.py --dataset <dataset_name>
```
Configuring a New Dataset
To filter a new dataset, create a copy of the config_mmlu.py file and modify it according to your needs.
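The kind of settings such a config typically carries can be sketched as follows. This is purely illustrative: the variable names below are assumptions, not the actual contents of `config_mmlu.py`, so copy that file and edit its real fields rather than this sketch.

```python
# Illustrative dataset config sketch. All names here are hypothetical;
# the real field names live in filtering/config_mmlu.py.
DATASET_NAME = "my_dataset"   # should match the datasets/<dataset_name> folder
NUM_CHOICES = 4               # 4 for ARC/MMLU-style data, 5 for CommonsenseQA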
By following these steps, you'll be able to refine your dataset using the SMART filtering methodology.
The original datasets are present in the `datasets` folder, and the indexes of the examples retained in the SMART-filtered versions are present here.
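Given such a list of retained indexes, reconstructing the SMART-filtered subset from the original split is a simple selection. The data and index list below are toy placeholders, not the released files:

```python
# Minimal sketch: keep only the examples whose indexes appear in the
# released SMART-filtered index list. Toy data, not the real releases.
original = [
    {"question": "Q0"},
    {"question": "Q1"},
    {"question": "Q2"},
    {"question": "Q3"},
]
kept_indexes = [0, 2, 3]  # in practice, loaded from the released index file

filtered = [original[i] for i in kept_indexes]
print([ex["question"] for ex in filtered])
```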