Anubis - The Batch Attribution Tester

Anubis takes two inputs: (1) the set of code samples and (2) the target LLM; it decides whether the samples were generated by that LLM.

🚀 Getting Started

  1. Install basic requirements
conda create -n anubis python=3.10 -y
conda activate anubis
pip install --upgrade pip
pip install -r requirements.txt
  2. Add Human-Eval to the repository
git clone https://github.com/openai/human-eval.git human_eval
  3. Download the models (one possible way to fetch them is sketched below)
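The generation and evaluation scripts load the three sampling models (stable-code, deepseek-coder, codegemma) either locally or from the Hugging Face cache. As one possible way to pre-fetch the weights, the sketch below uses huggingface_hub; the checkpoint IDs are assumptions on our part and should be verified against the model names wired into get-samples.py and do-eval.py (some checkpoints are gated and require a Hugging Face login).

# A minimal sketch for pre-downloading the model weights. The Hugging Face
# checkpoint IDs below are assumptions, not taken from this repository;
# verify them against get-samples.py / do-eval.py before relying on this.
from huggingface_hub import snapshot_download

for repo_id in [
    "stabilityai/stable-code-3b",            # assumed stable-code checkpoint
    "deepseek-ai/deepseek-coder-1.3b-base",  # assumed deepseek-coder checkpoint
    "google/codegemma-2b",                   # assumed codegemma checkpoint (gated)
]:
    snapshot_download(repo_id=repo_id)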

🔎 Usage

To help readers understand how Anubis works (and eventually reproduce our results), we offer three layers of engagement with our code and data. We recognize that running the full pipeline on the entire dataset can be resource-intensive, so these layers range from lightweight exploration to full end-to-end replication:

🛠️ Run Anubis on Pre-generated Datasets

This layer allows readers to evaluate Anubis on datasets we have already generated using various sampling models. This requires GPU access.

📥 Download and Extract Datasets

Download the datasets generated using the models listed below from the following link: Download

You can use gdown:

gdown 1dCUoSAxPrlh2YEQpXnfENER8n8_x_1PS

Once downloaded, extract the files using the following command:

tar -xvf sample_database.tar.gz

📁 Directory Structure

After extraction, the datasets will be organized as follows:

  • stable-code: Located within the stability directory.
  • deepseek-coder: Located within the deepseek directory.
  • codegemma: Located within the codegemma directory.

🚀 Running Anubis for Evaluation

To evaluate and verify the functionality of Anubis, run the following command:

python anubis.py --smpsrc corpus-100-sanitized-json/stability/stability1_0i.json --promptid i --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
Arguments
  • --smpsrc: Path to the input JSON file containing data generated using stable-code. Example: stability1_0i.json, where i is the prompt ID.
  • --promptid: The prompt ID from HumanEval.jsonl, which ranges from 0 to 163.
  • --sampb: Sampling batch size (10 recommended).
  • --evalb: Evaluation batch size (1 recommended to prevent OOM issues).
  • --nsamps: Total number of samples to generate (500).
  • --model: Model identifier (1: deepseek-coder, 2: codegemma).
  • --verbose: Verbosity level (0 for minimal output).
Expected Output
  • reject: When the sample source is stable-code.
  • accept: When the sample source matches the target model (deepseek-coder or codegemma).
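To script a batch of such checks, the verdict can also be captured programmatically. The wrapper below is only a rough sketch: it assumes the accept/reject verdict is printed to standard output, which should be confirmed against Anubis's actual output format.

import subprocess

# Hypothetical wrapper around anubis.py. It assumes the accept/reject verdict
# appears in stdout; adjust the parsing to match the real output format.
cmd = [
    "python", "anubis.py",
    "--smpsrc", "corpus-100-sanitized-json/stability/stability1_048.json",
    "--promptid", "48", "--sampb", "10", "--evalb", "1",
    "--nsamps", "500", "--model", "1", "--verbose", "0",
]
result = subprocess.run(cmd, capture_output=True, text=True)
verdict = "accept" if "accept" in result.stdout.lower() else "reject"
print(verdict)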

🧑‍💻 Examples

Example 1: Evaluating with Deepseek-Coder and Stable-Code
python anubis.py --smpsrc corpus-100-sanitized-json/stability/stability1_048.json --promptid 48 --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
  • Model: deepseek-coder
  • Sample Source: stable-code
  • Expected Output: reject
Example 2: Evaluating with Deepseek-Coder on Itself
python anubis.py --smpsrc corpus-100-sanitized-json/deepseek/deepseek2_067.json --promptid 67 --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
  • Model: deepseek-coder
  • Sample Source: deepseek-coder
  • Expected Output: accept
Example 3: Evaluating with Codegemma on Itself
python anubis.py --smpsrc corpus-100-sanitized-json/codegemma/codegemma2_050.json --promptid 50 --sampb 10 --evalb 1 --nsamps 500 --model 2 --verbose 0
  • Model: codegemma
  • Sample Source: codegemma
  • Expected Output: accept

📦 Reproducibility

This section is aimed at those who want to reproduce the figures and results from the paper. It provides a lighter, GPU-free interface to analyze and visualize data directly from precomputed outputs.

🗂️ Recreate Paper Figures Without GPU

This layer provides all the necessary data, including the computed probability values and corresponding outputs generated by Anubis and detectGPT. Working with it does not require any GPU resources, and it is sufficient to reproduce all the graphs used in the paper.

📥 Download and Extract Data

To get started, download the required files from the following link: Download

You can use gdown for this purpose:

gdown 1T8vfymHlfNpHbns9lXuLsQne8KopKT87

After downloading, extract the files using the following command:

tar -xvf all_evals.tar.gz

This will create a directory named data, restoring the necessary file structure and fixing broken links within the directories corpus-100-eval-deepseek and corpus-100-eval-gemma.

📁 Directory Structure

The corpus-100-eval-(model) directories contain the probability values of the datasets generated by the different sampling models. Each directory has the following structure:

corpus-100-eval-deepseek/
├── deepseek1/
│   ├── eval0x.ds_deepseek.json
│   ├── occur0x.ds_deepseek.json
│   └── ...
├── deepseek2/
│   ├── eval0x.ds_deepseek.json
│   ├── occur0x.ds_deepseek.json
│   └── ...
└── stability1/
    ├── eval0x.ds_deepseek.json
    ├── occur0x.ds_deepseek.json
    └── ...

corpus-100-eval-gemma/
├── codegemma1/
│   ├── eval0x.ds_codegemma.json
│   ├── occur0x.ds_codegemma.json
│   └── ...
├── codegemma2/
│   ├── eval0x.ds_codegemma.json
│   ├── occur0x.ds_codegemma.json
│   └── ...
└── stability1/
    ├── eval0x.ds_codegemma.json
    ├── occur0x.ds_codegemma.json
    └── ...
  • The deepseek1 and deepseek2 (the contamination set) directories in corpus-100-eval-deepseek contain datasets generated by deepseek-coder, while stability1 contains datasets generated by stable-code.
  • The eval0x files correspond to the probability values of samples from task ID x in the HumanEval dataset.
  • The occur0x files represent the number of occurrences of samples in the dataset.
  • In corpus-100-eval-deepseek, evaluations are performed with respect to deepseek-coder.
  • The same structure and description apply to the corpus-100-eval-gemma directory.

To visualize the data and evaluate results, you can generate graphs using the provided Jupyter notebook:

  • Locate the notebook at essentials.ipynb.
  • Open it using Jupyter Notebook or Jupyter Lab.
  • Further details are available inside the notebook.
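For a quick look at these files outside the notebook, something like the sketch below works. Note that the internal JSON layout (a collection of per-sample probability values) is an assumption here; essentials.ipynb remains the authoritative reference for loading and plotting.

import json
from pathlib import Path

# Hypothetical quick inspection of the evaluation files. The internal JSON
# structure is assumed, not documented here; see essentials.ipynb for the
# exact loading and plotting code.
eval_dir = Path("corpus-100-eval-deepseek/deepseek1")
for eval_file in sorted(eval_dir.glob("eval*.json"))[:3]:
    with eval_file.open() as f:
        values = json.load(f)
    print(f"{eval_file.name}: {type(values).__name__} with {len(values)} entries")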

🛠️ End-to-End Execution from Scratch

This layer guides users through the full pipeline (from dataset generation to final decision) using the core scripts and models.

🧑‍💻 Stage 1: Generating Dataset

Use get-samples.py to generate the datasets from stable-code, deepseek-coder, and codegemma. An example PBS script for running it is given in the scripts directory. The general usage is:

$ python get-samples.py --modelID 0 --samples 2000 --taskID 0 --batch_size 10 --ndim 100 --seed 0

Parameters

  • --modelID - The LLM to generate samples from. get-samples.py currently supports a limited number of models (stable-code, deepseek-coder, codegemma), but it can easily be extended to new LLMs. Use modelID=0 for stable-code, modelID=1 for deepseek-coder, and modelID=2 for codegemma.
  • --taskID - Prompt Identifier from the Prompt Source. (Default prompt source is HumanEval.jsonl).
  • --samples (Default: 2000) - Number of samples we want to generate for the test.
  • --batch_size (Default: 10) - Batch size for generation from the LLM (reduce the batch size to avoid possible OOM errors).
  • --ndim (Default: 100) - Maximum generation length of samples.
  • --seed (Default: 0) - Random seed.

Extra Parameters

  • --temperature (Default: 0.8) - Softmax temperature.
  • --topp (Default: 0.95) - $p$ value associated with Top-$p$ Sampling.
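Each invocation of get-samples.py handles a single prompt, so sweeping several HumanEval tasks is simply a matter of looping over task IDs. The driver below is an illustrative sketch (the loop bound and the choice of stable-code are arbitrary), not a replacement for the provided PBS scripts.

import subprocess

# Illustrative driver: generate samples from stable-code (modelID=0) for the
# first three HumanEval tasks, using only the flags documented above.
for task_id in range(3):
    subprocess.run(
        [
            "python", "get-samples.py",
            "--modelID", "0",
            "--samples", "2000",
            "--taskID", str(task_id),
            "--batch_size", "10",
            "--ndim", "100",
            "--seed", "0",
        ],
        check=True,
    )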
🧹 Filtering

After generating the samples, one needs to filter the dataset. See the PBS script in script/submit-sanitization for an example of how to run it.

$ python semantic-check.py --inputdir ISAMPLEDIR --outputdir OSAMPLEDIR --taskID 0

Parameters

  • --inputdir - Input directory containing the generated samples.
  • --outputdir - Output directory for the filtered samples.
  • --taskID - Prompt Identifier from the Prompt Source (Default prompt source is HumanEval.jsonl).

🧑‍💻 Stage 2: Running EVAL

Given a set of text samples, this stage finds the generation probabilities of the elements in the set. Using do-eval.py, we can evaluate the probability that a text element was generated by a selected LLM. A related PBS script is also provided in the scripts directory.

$ python do-eval.py --evalmodelID 1 --taskID 0 --batch_size 1 --smpsrc SAMPLEDIR 

Parameters

  • --smpsrc - Directory containing text samples to run EVAL on.
  • --evalmodelID - The target LLM from which we hypothesize the samples were generated. As in get-samples.py, use evalmodelID=1 for deepseek-coder and evalmodelID=2 for codegemma.
  • --taskID - Prompt Identifier from the Prompt Source (Default prompt source is HumanEval.jsonl).
  • --batch_size (Default: 1) - Batch size for evaluation with the LLM (reduce the batch size to avoid possible OOM errors).

Extra Parameters

  • --temperature (Default: 0.8) - The softmax temperature should match the value used during the generation process, if known. If the generation temperature is unknown, a default setting of 1 is recommended.
  • --topp (Default: 0.95) - $p$ value associated with Top-$p$ Sampling. topp should match the value used during the generation process, if known. If the generation Top-$p$ is unknown, a default setting of 1 is recommended.

🧑‍⚖️ Stage 3: Decision

This stage ultimately determines whether the submission set is sourced from the target LLM. The corresponding file is evaluation.py, and the usage is:

python3 evaluation.py --origstu corpus-100-eval-deepseek/stability1 --stucorrupt corpus-100-eval-deepseek/deepseek2 --origllm corpus-100-eval-deepseek/deepseek1 --eps1 10 --eps2 80 --threshin 60 --threshout 0.08 --thresh 0.05 --evalmodel 1

Parameters

  • --origstu: Directory for the original database (to check the source).
  • --stucorrupt: Directory for the corrupting database.
  • --origllm: Directory for the LLM-generated database.
  • --eps1: Minimum corrupting percentage (LB%).
  • --eps2: Maximum corrupting percentage (UB%).
  • --threshin: Local algorithm's decision threshold.
  • --threshout: Global algorithm's decision threshold.
  • --thresh: Bucket threshold.
  • --evalmodel: Identifier for the target model (evalmodel=1 for deepseek-coder, evalmodel=2 for codegemma).

eps1 and eps2 represent the LB% and UB% values mentioned in the paper. This approach provides a convenient way to simulate a contaminated dataset without actually generating one. By using the explicitly computed probability values as proxies for the samples, we can effectively mimic the behavior of contaminated datasets.
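To make the simulation concrete, the toy sketch below mixes probability values in the way described above. It only illustrates the idea; it is not the code path used by evaluation.py.

import random

def simulate_contamination(orig_probs, llm_probs, eps):
    """Replace roughly eps% of the original probability values with
    LLM-generated ones to mimic a contaminated dataset (toy illustration)."""
    n_replace = int(len(orig_probs) * eps / 100)
    mixed = list(orig_probs)
    for idx in random.sample(range(len(mixed)), n_replace):
        mixed[idx] = random.choice(llm_probs)
    return mixed

# e.g. simulate 20% contamination of a toy list of probability values
print(simulate_contamination([0.10, 0.21, 0.32, 0.43, 0.54], [0.91, 0.82], eps=20))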
