Anubis has two inputs: (1) the set of code samples, and (2) the target LLM.
- Install basic requirements:

```bash
conda create -n anubis python=3.10 -y
conda activate anubis
pip install --upgrade pip
pip install -r requirements.txt
```

- Add Human-Eval to the repository:

```bash
git clone https://github.com/openai/human-eval.git human_eval
```

- Download the Models
To help readers understand how Anubis works (and eventually reproduce our results), we offer three layers of engagement with our code and data. We recognize that running the full pipeline on the entire dataset can be resource-intensive, so these layers range from lightweight exploration to full end-to-end replication:
This layer allows readers to evaluate Anubis on datasets we have already generated using various sampling models. This requires GPU access.
Readers are required to download the datasets generated using the following models from the link below: Download
Again, you can use gdown:
```bash
gdown 1dCUoSAxPrlh2YEQpXnfENER8n8_x_1PS
```

Once downloaded, extract the files using the following command:

```bash
tar -xvf sample_database.tar.gz
```

After extraction, the datasets will be organized as follows:
- `stable-code`: located within the `stability` directory.
- `deepseek-coder`: located within the `deepseek` directory.
- `codegemma`: located within the `codegemma` directory.
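If you want a quick look at one of the downloaded files before running Anubis, a minimal inspection snippet such as the one below can help. The path is only an example, and the snippet makes no assumption about the exact JSON schema; it simply reports the top-level structure:

```python
import json

# Example path; point this at any file from the extracted sample database.
path = "corpus-100-sanitized-json/stability/stability1_048.json"

with open(path) as f:
    data = json.load(f)

# Report only the top-level structure, since the exact schema is not documented here.
if isinstance(data, dict):
    print("dict with keys:", list(data.keys())[:10])
elif isinstance(data, list):
    print("list with", len(data), "entries; first entry type:", type(data[0]).__name__)
else:
    print("top-level type:", type(data).__name__)
```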
To evaluate and verify the functionality of Anubis, run the following command:
```bash
python anubis.py --smpsrc corpus-100-sanitized-json/stability/stability1_0i.json --promptid i --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
```

- `--smpsrc`: Path to the input JSON file containing data generated using `stable-code`. Example: `stability1_0i.json`, where `i` is the prompt ID.
- `--promptid`: The prompt ID from `Humaneval.json`, which ranges from 0 to 163.
- `--sampb`: Sampling batch size (10 recommended).
- `--evalb`: Evaluation batch size (1 recommended to prevent OOM issues).
- `--nsamps`: Total number of samples to generate (500).
- `--model`: Model identifier:
  - 1: `deepseek-coder`
  - 2: `codegemma`
- `--verbose`: Verbosity level (0 for minimal output).
The expected output is:

- `reject`: when the sample source is `stable-code`.
- `accept`: when the sample source matches the target model (`deepseek-coder` or `codegemma`).
```bash
python anubis.py --smpsrc corpus-100-sanitized-json/stability/stability1_048.json --promptid 48 --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
```

- Model: `deepseek-coder`
- Sample Source: `stable-code`
- Expected Output: `reject`

```bash
python anubis.py --smpsrc corpus-100-sanitized-json/deepseek/deepseek2_067.json --promptid 67 --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
```

- Model: `deepseek-coder`
- Sample Source: `deepseek-coder`
- Expected Output: `accept`

```bash
python anubis.py --smpsrc corpus-100-sanitized-json/codegemma/codegemma2_050.json --promptid 50 --sampb 10 --evalb 1 --nsamps 500 --model 2 --verbose 0
```

- Model: `codegemma`
- Sample Source: `codegemma`
- Expected Output: `accept`
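If you want to run several of these checks in one go, a small driver along the following lines can help. It is only a sketch that shells out to `anubis.py` with the same flags as above; the file-naming pattern mirrors the examples, and the script simply echoes the last line of output rather than assuming any particular output format:

```python
import subprocess

# Hypothetical driver: sweep a few prompt IDs for the stable-code sample source
# against deepseek-coder (--model 1).
for promptid in [48, 50, 67]:
    smpsrc = f"corpus-100-sanitized-json/stability/stability1_0{promptid}.json"
    cmd = [
        "python", "anubis.py",
        "--smpsrc", smpsrc,
        "--promptid", str(promptid),
        "--sampb", "10",
        "--evalb", "1",
        "--nsamps", "500",
        "--model", "1",
        "--verbose", "0",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Echo whatever anubis.py printed last (its decision, if verbose output is minimal).
    out = result.stdout.strip() or result.stderr.strip()
    last_line = out.splitlines()[-1] if out else "<no output>"
    print(promptid, "->", last_line)
```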
This section is aimed at those who want to reproduce the figures and results from the paper. It provides a lighter, GPU-free interface to analyze and visualize data directly from precomputed outputs.
This layer provides all the necessary data, including the probability values and corresponding outputs computed by Anubis and detectGPT. It requires no GPU resources and is sufficient to reproduce all the graphs used in the paper.
To get started, download the required files from the following link: Download
You can use gdown for this purpose:

```bash
gdown 1T8vfymHlfNpHbns9lXuLsQne8KopKT87
```

After downloading, extract the files using the following command:

```bash
tar -xvf all_evals.tar.gz
```

This will create a directory named `data`, restoring the necessary file structure and fixing broken links within the directories `corpus-100-eval-deepseek` and `corpus-100-eval-gemma`.
The `corpus-100-eval-(model)` directories contain the probability values of the datasets generated by the different sampling models. Each directory has the following structure:
```
corpus-100-eval-deepseek/
├── deepseek1/
│   ├── eval0x.ds_deepseek.json
│   ├── occur0x.ds_deepseek.json
│   └── ...
├── deepseek2/
│   ├── eval0x.ds_deepseek.json
│   ├── occur0x.ds_deepseek.json
│   └── ...
└── stability1/
    ├── eval0x.ds_deepseek.json
    ├── occur0x.ds_deepseek.json
    └── ...
corpus-100-eval-gemma/
├── codegemma1/
│   ├── eval0x.ds_codegemma.json
│   ├── occur0x.ds_codegemma.json
│   └── ...
├── codegemma2/
│   ├── eval0x.ds_codegemma.json
│   ├── occur0x.ds_codegemma.json
│   └── ...
└── stability1/
    ├── eval0x.ds_codegemma.json
    ├── occur0x.ds_codegemma.json
    └── ...
```

- The `deepseek1`, `deepseek2` (the contamination set), and `stability1` directories in `corpus-100-eval-deepseek` correspond to datasets generated by `deepseek-coder` and `stable-code`.
- The `eval0x` files contain the probability values of samples from task ID `x` in the HumanEval dataset.
- The `occur0x` files record the number of occurrences of samples in the dataset.
- In `corpus-100-eval-deepseek`, evaluations are performed with respect to `deepseek-coder`.
- The same structure and description apply to the `corpus-100-eval-gemma` directory.
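For custom analysis outside the notebook, the snippet below shows one way to pull the numeric values out of an `eval` file and compare two directories for the same task. The exact schema of these JSON files is not documented here, so the snippet collects every number it finds rather than assuming a particular layout, and the file names are hypothetical examples (adjust them to a task ID present in your extracted data):

```python
import json
import statistics
from pathlib import Path

def collect_numbers(obj):
    """Recursively yield every numeric value in a JSON structure."""
    if isinstance(obj, bool):
        return
    if isinstance(obj, (int, float)):
        yield float(obj)
    elif isinstance(obj, dict):
        for v in obj.values():
            yield from collect_numbers(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from collect_numbers(v)

# Hypothetical example files for one task; the exact naming may differ slightly.
for name in ("corpus-100-eval-deepseek/stability1/eval048.ds_deepseek.json",
             "corpus-100-eval-deepseek/deepseek2/eval048.ds_deepseek.json"):
    values = list(collect_numbers(json.loads(Path(name).read_text())))
    if values:
        print(f"{name}: n={len(values)}, mean={statistics.mean(values):.4f}")
    else:
        print(f"{name}: no numeric values found")
```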
To visualize the data and evaluate results, you can generate graphs using the provided Jupyter notebook:
- Locate the notebook at `essentials.ipynb`.
- Open it using Jupyter Notebook or Jupyter Lab.
- Further details are available inside the notebook.
This layer guides users through the full pipeline (from dataset generation to final decision) using the core scripts and models.
Use `get-samples.py` to generate the datasets from `stable-code`, `deepseek-coder`, and `codegemma`. An example PBS script for running it is given in the `scripts` directory. The general usage is:
```bash
$ python get-samples.py --modelID 0 --samples 2000 --taskID 0 --batch_size 10 --ndim 100 --seed 0
```

Parameters:
- `--modelID` - The LLM to generate samples from. `get-samples.py` currently supports a limited number of models (`stable-code`, `deepseek-coder`, `codegemma`) but can easily be extended to new LLMs. Use `modelID=0` for `stable-code`, `modelID=1` for `deepseek-coder`, and `modelID=2` for `codegemma`.
- `--taskID` - Prompt identifier from the prompt source (the default prompt source is `HumanEval.jsonl`).
- `--samples` (Default: 2000) - Number of samples to generate for the test.
- `--batch_size` (Default: 10) - Batch size for generation from the LLM (reduce the batch size to overcome possible OOM errors).
- `--ndim` (Default: 100) - Maximum generation length of samples.
- `--seed` (Default: 0) - Random seed.
Extra Parameters:

- `--temperature` (Default: 0.8) - Softmax temperature.
- `--topp` (Default: 0.95) - The $p$ value associated with Top-$p$ sampling.
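For intuition, temperature and Top-$p$ here are the standard sampling controls exposed by common generation APIs. The sketch below is not `get-samples.py`; it only illustrates drawing a batch of samples with these settings via Hugging Face `transformers`, and the checkpoint name and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the actual repo may use a different model or revision.
model_name = "stabilityai/stable-code-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "def add(a, b):\n"  # placeholder; the real prompts come from HumanEval.jsonl
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Temperature / Top-p sampling, mirroring --temperature 0.8 and --topp 0.95;
# --ndim corresponds to the maximum generation length, --batch_size to num_return_sequences.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=100,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,
)
samples = [tokenizer.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True) for o in outputs]
print(samples[0])
```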
After generating the samples, the dataset needs to be filtered. See the PBS script in `script/submit-sanitization` for how to run this step.
```bash
$ python semantic-check.py --inputdir ISAMPLEDIR --outputdir OSAMPLEDIR --taskID 0
```

Parameters:
- `--inputdir` - Input directory containing the generated samples.
- `--outputdir` - Output directory for the filtered samples.
- `--taskID` - Prompt identifier from the prompt source (the default prompt source is `HumanEval.jsonl`).
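For intuition about what sanitization involves, here is a toy illustration of one common filtering step (discarding samples that fail to parse as Python). This is an assumed example only and not the actual criteria implemented in `semantic-check.py`:

```python
import ast

def keep_sample(code: str) -> bool:
    """Keep only samples that are syntactically valid Python.
    Illustrative filter; not the actual semantic-check.py logic."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

samples = ["def f(x):\n    return x + 1\n", "def g(:\n"]  # toy examples
filtered = [s for s in samples if keep_sample(s)]
print(f"kept {len(filtered)} of {len(samples)} samples")
```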
Given a set of text samples, this level computes the generation probability of each element in the set.
Using `do-eval.py`, we can evaluate the probability that a text element was generated by a selected LLM. A related PBS script is also provided in the `scripts` directory.
```bash
$ python do-eval.py --evalmodelID 1 --taskID 0 --batch_size 1 --smpsrc SAMPLEDIR
```

Parameters:
- `--smpsrc` - Directory containing the text samples to run EVAL on.
- `--evalmodelID` - The target LLM from which we hypothesize the samples were generated. As in `get-samples.py`, use `evalmodelID=1` for `deepseek-coder` and `evalmodelID=2` for `codegemma`.
- `--taskID` - Prompt identifier from the prompt source (the default prompt source is `HumanEval.jsonl`).
- `--batch_size` (Default: 1) - Batch size for running the LLM (reduce the batch size to overcome possible OOM errors).
Extra Parameters:

- `--temperature` (Default: 0.8) - The softmax temperature; it should match the value used during the generation process, if known. If the generation temperature is unknown, a default setting of 1 is recommended.
- `--topp` (Default: 0.95) - The $p$ value associated with Top-$p$ sampling; `topp` should match the value used during the generation process, if known. If the generation Top-$p$ is unknown, a default setting of 1 is recommended.
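Conceptually, this step scores each candidate sample token by token under the target model. The sketch below illustrates that idea with Hugging Face `transformers`; it is not `do-eval.py` itself, the checkpoint name, prompt, and completion are placeholders, and for brevity it only applies temperature scaling (it does not reproduce the Top-$p$ handling):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the actual target models used by the repo may differ.
model_name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

prompt = "def add(a, b):\n"          # placeholder prompt
completion = "    return a + b\n"    # placeholder candidate sample
temperature = 0.8                    # should match the generation temperature, if known

prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(full_ids).logits

# Log-probability of each next token given its prefix (temperature-scaled).
log_probs = torch.log_softmax(logits[:, :-1, :].float() / temperature, dim=-1)
token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Note: retokenizing prompt+completion makes the prompt/completion boundary approximate.
completion_lp = token_lp[:, prompt_len - 1:]
print("total log-prob of completion:", completion_lp.sum().item())
```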
This level ultimately determines whether the submission set is sourced from the target LLM. The corresponding file is `evaluation.py`, and the usage is:
```bash
python3 evaluation.py --origstu corpus-100-eval-deepseek/stability1 --stucorrupt corpus-100-eval-deepseek/deepseek2 --origllm corpus-100-eval-deepseek/deepseek1 --eps1 10 --eps2 80 --threshin 60 --threshout 0.08 --thresh 0.05 --evalmodel 1
```

Parameters:
- `--origstu`: Directory of the original database (to check the source).
- `--stucorrupt`: Directory of the corrupting database.
- `--origllm`: Directory of the LLM-generated database.
- `--eps1`: Minimum corrupting percentage (LB%).
- `--eps2`: Maximum corrupting percentage (UB%).
- `--threshin`: The local algorithm's decision threshold.
- `--threshout`: The global algorithm's decision threshold.
- `--thresh`: Bucket threshold.
- `--evalmodel`: Identifier for the target model (`evalmodel=1` for `deepseek-coder`, `evalmodel=2` for `codegemma`).
`eps1` and `eps2` represent the LB% and UB% values mentioned in the paper. This approach provides a convenient way to simulate a contaminated dataset without actually generating one. By using the explicitly computed probability values as proxies for the samples, we can effectively mimic the behavior of contaminated datasets.
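To make the thresholds concrete, here is a toy sketch of a two-level decision of the kind the parameters suggest: a local test flags individual samples against `threshin`, and a global test compares the flagged fraction against `threshout`. This is an assumed simplification for intuition only, not the algorithm implemented in `evaluation.py` (which also involves the bucket threshold and the corruption bounds):

```python
# Toy sketch of a two-level (local/global) thresholding decision.
# Assumed simplification for intuition; not evaluation.py's actual algorithm.

def decide(sample_scores, threshin=60.0, threshout=0.08):
    """Flag samples whose score exceeds the local threshold (threshin), then
    accept the whole set if the flagged fraction exceeds the global threshold (threshout)."""
    flagged = [s for s in sample_scores if s > threshin]
    fraction = len(flagged) / len(sample_scores)
    return ("accept" if fraction > threshout else "reject"), fraction

# Hypothetical per-sample scores (e.g., derived from the precomputed probability values).
scores = [12.0, 75.5, 80.2, 33.1, 64.9, 5.4, 91.0, 22.3, 70.7, 18.8]
decision, frac = decide(scores)
print(decision, f"(flagged fraction = {frac:.2f})")
```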