Anubis has two inputs: (1) the set of code samples, and (2) the target LLM.
- Install basic requirements:

```bash
conda create -n anubis python=3.10 -y
conda activate anubis
pip install --upgrade pip
pip install -r requirements.txt
```

- Add Human-Eval to the repository:

```bash
git clone https://github.com/openai/human-eval.git human_eval
```

- Download the Models
To help readers understand how Anubis works (and eventually reproduce our results), we offer three layers of engagement with our code and data. We recognize that running the full pipeline on the entire dataset can be resource-intensive, so these layers range from lightweight exploration to full end-to-end replication:
This layer allows readers to evaluate Anubis on datasets we have already generated using various sampling models. This requires GPU access.
Readers are required to download the datasets generated using the following models from the link below: Download
Again, you can use gdown:
```bash
gdown 1dCUoSAxPrlh2YEQpXnfENER8n8_x_1PS
```

Once downloaded, extract the files using the following command:

```bash
tar -xvf sample_database.tar.gz
```

After extraction, the datasets will be organized as follows:
- `stable-code`: located within the `stability` directory.
- `deepseek-coder`: located within the `deepseek` directory.
- `codegemma`: located within the `codegemma` directory.
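If you want a quick look at one of the downloaded files before running Anubis, a minimal inspection snippet such as the one below can help. The path is only an example, and the snippet makes no assumption about the exact JSON schema; it simply reports the top-level structure:

```python
import json

# Example path; point this at any file from the extracted sample database.
path = "corpus-100-sanitized-json/stability/stability1_048.json"

with open(path) as f:
    data = json.load(f)

# Report only the top-level structure, since the exact schema is not documented here.
if isinstance(data, dict):
    print("dict with keys:", list(data.keys())[:10])
elif isinstance(data, list):
    print("list with", len(data), "entries; first entry type:", type(data[0]).__name__)
else:
    print("top-level type:", type(data).__name__)
```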
To evaluate and verify the functionality of Anubis, run the following command:
```bash
python anubis.py --smpsrc corpus-100-sanitized-json/stability/stability1_0i.json --promptid i --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
```

- `--smpsrc`: Path to the input JSON file containing data generated using `stable-code`. Example: `stability1_0i.json`, where `i` is the prompt ID.
- `--promptid`: The prompt ID from `Humaneval.json`, which ranges from 0 to 163.
- `--sampb`: Sampling batch size (10 recommended).
- `--evalb`: Evaluation batch size (1 recommended to prevent OOM issues).
- `--nsamps`: Total number of samples to generate (500).
- `--model`: Model identifier:
  - 1: `deepseek-coder`
  - 2: `codegemma`
- `--verbose`: Verbosity level (0 for minimal output).
The expected output is:

- `reject`: when the sample source is `stable-code`.
- `accept`: when the sample source matches the target model (`deepseek-coder` or `codegemma`).
```bash
python anubis.py --smpsrc corpus-100-sanitized-json/stability/stability1_048.json --promptid 48 --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
```

- Model: `deepseek-coder`
- Sample Source: `stable-code`
- Expected Output: `reject`

```bash
python anubis.py --smpsrc corpus-100-sanitized-json/deepseek/deepseek2_067.json --promptid 67 --sampb 10 --evalb 1 --nsamps 500 --model 1 --verbose 0
```

- Model: `deepseek-coder`
- Sample Source: `deepseek-coder`
- Expected Output: `accept`

```bash
python anubis.py --smpsrc corpus-100-sanitized-json/codegemma/codegemma2_050.json --promptid 50 --sampb 10 --evalb 1 --nsamps 500 --model 2 --verbose 0
```

- Model: `codegemma`
- Sample Source: `codegemma`
- Expected Output: `accept`
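If you want to run several of these checks in one go, a small driver along the following lines can help. It is only a sketch that shells out to `anubis.py` with the same flags as above; the file-naming pattern mirrors the examples, and the script simply echoes the last line of output rather than assuming any particular output format:

```python
import subprocess

# Hypothetical driver: sweep a few prompt IDs for the stable-code sample source
# against deepseek-coder (--model 1).
for promptid in [48, 50, 67]:
    smpsrc = f"corpus-100-sanitized-json/stability/stability1_0{promptid}.json"
    cmd = [
        "python", "anubis.py",
        "--smpsrc", smpsrc,
        "--promptid", str(promptid),
        "--sampb", "10",
        "--evalb", "1",
        "--nsamps", "500",
        "--model", "1",
        "--verbose", "0",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Echo whatever anubis.py printed last (its decision, if verbose output is minimal).
    out = result.stdout.strip() or result.stderr.strip()
    last_line = out.splitlines()[-1] if out else "<no output>"
    print(promptid, "->", last_line)
```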
This section is aimed at those who want to reproduce the figures and results from the paper. It provides a lighter, GPU-free interface to analyze and visualize data directly from precomputed outputs.
This layer provides all the necessary data, including the probability values and corresponding outputs computed by Anubis and detectGPT. It requires no GPU resources and is sufficient to reproduce all the graphs used in the paper.
To get started, download the required files from the following link: Download
You can use gdown for this purpose:

```bash
gdown 1T8vfymHlfNpHbns9lXuLsQne8KopKT87
```

After downloading, extract the files using the following command:

```bash
tar -xvf all_evals.tar.gz
```

This will create a directory named `data`, restoring the necessary file structure and fixing broken links within the directories `corpus-100-eval-deepseek` and `corpus-100-eval-gemma`.
The `corpus-100-eval-(model)` directories contain the probability values of the datasets generated by the different sampling models. Each directory has the following structure:
```
corpus-100-eval-deepseek/
├── deepseek1/
│   ├── eval0x.ds_deepseek.json
│   ├── occur0x.ds_deepseek.json
│   └── ...
├── deepseek2/
│   ├── eval0x.ds_deepseek.json
│   ├── occur0x.ds_deepseek.json
│   └── ...
└── stability1/
    ├── eval0x.ds_deepseek.json
    ├── occur0x.ds_deepseek.json
    └── ...
corpus-100-eval-gemma/
├── codegemma1/
│   ├── eval0x.ds_codegemma.json
│   ├── occur0x.ds_codegemma.json
│   └── ...
├── codegemma2/
│   ├── eval0x.ds_codegemma.json
│   ├── occur0x.ds_codegemma.json
│   └── ...
└── stability1/
    ├── eval0x.ds_codegemma.json
    ├── occur0x.ds_codegemma.json
    └── ...
```

- The `deepseek1`, `deepseek2` (the contamination set), and `stability1` directories in `corpus-100-eval-deepseek` correspond to datasets generated by `deepseek-coder` and `stable-code`.
- The `eval0x` files contain the probability values of samples from task ID `x` in the HumanEval dataset.
- The `occur0x` files record the number of occurrences of samples in the dataset.
- In `corpus-100-eval-deepseek`, evaluations are performed with respect to `deepseek-coder`.
- The same structure and description apply to the `corpus-100-eval-gemma` directory.
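For custom analysis outside the notebook, the snippet below shows one way to pull the numeric values out of an `eval` file and compare two directories for the same task. The exact schema of these JSON files is not documented here, so the snippet collects every number it finds rather than assuming a particular layout, and the file names are hypothetical examples (adjust them to a task ID present in your extracted data):

```python
import json
import statistics
from pathlib import Path

def collect_numbers(obj):
    """Recursively yield every numeric value in a JSON structure."""
    if isinstance(obj, bool):
        return
    if isinstance(obj, (int, float)):
        yield float(obj)
    elif isinstance(obj, dict):
        for v in obj.values():
            yield from collect_numbers(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from collect_numbers(v)

# Hypothetical example files for one task; the exact naming may differ slightly.
for name in ("corpus-100-eval-deepseek/stability1/eval048.ds_deepseek.json",
             "corpus-100-eval-deepseek/deepseek2/eval048.ds_deepseek.json"):
    values = list(collect_numbers(json.loads(Path(name).read_text())))
    if values:
        print(f"{name}: n={len(values)}, mean={statistics.mean(values):.4f}")
    else:
        print(f"{name}: no numeric values found")
```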
To visualize the data and evaluate results, you can generate graphs using the provided Jupyter notebook:
- Locate the notebook at `essentials.ipynb`.
- Open it using Jupyter Notebook or Jupyter Lab.
- Further details are available inside the notebook.
This layer guides users through the full pipeline (from dataset generation to final decision) using the core scripts and models.
Use `get-samples.py` to generate the datasets from `stable-code`, `deepseek-coder`, and `codegemma`. An example PBS script for running it is given in the `scripts` directory. The general usage is:
```bash
$ python get-samples.py --modelID 0 --samples 2000 --taskID 0 --batch_size 10 --ndim 100 --seed 0
```

Parameters:
- `--modelID` - The LLM to generate samples from. `get-samples.py` currently supports a limited number of models (`stable-code`, `deepseek-coder`, `codegemma`) but can easily be extended to new LLMs. Use `modelID=0` for `stable-code`, `modelID=1` for `deepseek-coder`, and `modelID=2` for `codegemma`.
- `--taskID` - Prompt identifier from the prompt source (the default prompt source is `HumanEval.jsonl`).
- `--samples` (Default: 2000) - Number of samples to generate for the test.
- `--batch_size` (Default: 10) - Batch size for generation from the LLM (reduce the batch size to overcome possible OOM errors).
- `--ndim` (Default: 100) - Maximum generation length of samples.
- `--seed` (Default: 0) - Random seed.
Extra Parameters:

- `--temperature` (Default: 0.8) - Softmax temperature.
- `--topp` (Default: 0.95) - The $p$ value associated with Top-$p$ sampling.
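For intuition, temperature and Top-$p$ here are the standard sampling controls exposed by common generation APIs. The sketch below is not `get-samples.py`; it only illustrates drawing a batch of samples with these settings via Hugging Face `transformers`, and the checkpoint name and prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the actual repo may use a different model or revision.
model_name = "stabilityai/stable-code-3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "def add(a, b):\n"  # placeholder; the real prompts come from HumanEval.jsonl
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Temperature / Top-p sampling, mirroring --temperature 0.8 and --topp 0.95;
# --ndim corresponds to the maximum generation length, --batch_size to num_return_sequences.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=100,
    num_return_sequences=10,
    pad_token_id=tokenizer.eos_token_id,
)
samples = [tokenizer.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True) for o in outputs]
print(samples[0])
```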
After generating the samples, the dataset needs to be filtered. See the PBS script in `script/submit-sanitization` for how to run this step.
```bash
$ python semantic-check.py --inputdir ISAMPLEDIR --outputdir OSAMPLEDIR --taskID 0
```

Parameters:
- `--inputdir` - Input directory containing the generated samples.
- `--outputdir` - Output directory for the filtered samples.
- `--taskID` - Prompt identifier from the prompt source (the default prompt source is `HumanEval.jsonl`).
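For intuition about what sanitization involves, here is a toy illustration of one common filtering step (discarding samples that fail to parse as Python). This is an assumed example only and not the actual criteria implemented in `semantic-check.py`:

```python
import ast

def keep_sample(code: str) -> bool:
    """Keep only samples that are syntactically valid Python.
    Illustrative filter; not the actual semantic-check.py logic."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

samples = ["def f(x):\n    return x + 1\n", "def g(:\n"]  # toy examples
filtered = [s for s in samples if keep_sample(s)]
print(f"kept {len(filtered)} of {len(samples)} samples")
```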
Given a set of text samples, this level computes the generation probability of each element in the set.
Using `do-eval.py`, we can evaluate the probability that a text element was generated by a selected LLM. A related PBS script is also provided in the `scripts` directory.
```bash
$ python do-eval.py --evalmodelID 1 --taskID 0 --batch_size 1 --smpsrc SAMPLEDIR
```

Parameters:
- `--smpsrc` - Directory containing the text samples to run EVAL on.
- `--evalmodelID` - The target LLM from which we hypothesize the samples were generated. As in `get-samples.py`, use `evalmodelID=1` for `deepseek-coder` and `evalmodelID=2` for `codegemma`.
- `--taskID` - Prompt identifier from the prompt source (the default prompt source is `HumanEval.jsonl`).
- `--batch_size` (Default: 1) - Batch size for running the LLM (reduce the batch size to overcome possible OOM errors).
Extra Parameters:

- `--temperature` (Default: 0.8) - The softmax temperature; it should match the value used during the generation process, if known. If the generation temperature is unknown, a default setting of 1 is recommended.
- `--topp` (Default: 0.95) - The $p$ value associated with Top-$p$ sampling; `topp` should match the value used during the generation process, if known. If the generation Top-$p$ is unknown, a default setting of 1 is recommended.
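Conceptually, this step scores each candidate sample token by token under the target model. The sketch below illustrates that idea with Hugging Face `transformers`; it is not `do-eval.py` itself, the checkpoint name, prompt, and completion are placeholders, and for brevity it only applies temperature scaling (it does not reproduce the Top-$p$ handling):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the actual target models used by the repo may differ.
model_name = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

prompt = "def add(a, b):\n"          # placeholder prompt
completion = "    return a + b\n"    # placeholder candidate sample
temperature = 0.8                    # should match the generation temperature, if known

prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)

with torch.no_grad():
    logits = model(full_ids).logits

# Log-probability of each next token given its prefix (temperature-scaled).
log_probs = torch.log_softmax(logits[:, :-1, :].float() / temperature, dim=-1)
token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

# Note: retokenizing prompt+completion makes the prompt/completion boundary approximate.
completion_lp = token_lp[:, prompt_len - 1:]
print("total log-prob of completion:", completion_lp.sum().item())
```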
This level ultimately determines whether the submission set is sourced from the target LLM. The corresponding file is `evaluation.py`, and the usage is:
```bash
python3 evaluation.py --origstu corpus-100-eval-deepseek/stability1 --stucorrupt corpus-100-eval-deepseek/deepseek2 --origllm corpus-100-eval-deepseek/deepseek1 --eps1 10 --eps2 80 --threshin 60 --threshout 0.08 --thresh 0.05 --evalmodel 1
```

Parameters:
- `--origstu`: Directory of the original database (to check the source).
- `--stucorrupt`: Directory of the corrupting database.
- `--origllm`: Directory of the LLM-generated database.
- `--eps1`: Minimum corrupting percentage (LB%).
- `--eps2`: Maximum corrupting percentage (UB%).
- `--threshin`: The local algorithm's decision threshold.
- `--threshout`: The global algorithm's decision threshold.
- `--thresh`: Bucket threshold.
- `--evalmodel`: Identifier for the target model (`evalmodel=1` for `deepseek-coder`, `evalmodel=2` for `codegemma`).
`eps1` and `eps2` represent the LB% and UB% values mentioned in the paper. This approach provides a convenient way to simulate a contaminated dataset without actually generating one. By using the explicitly computed probability values as proxies for the samples, we can effectively mimic the behavior of contaminated datasets.
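To make the thresholds concrete, here is a toy sketch of a two-level decision of the kind the parameters suggest: a local test flags individual samples against `threshin`, and a global test compares the flagged fraction against `threshout`. This is an assumed simplification for intuition only, not the algorithm implemented in `evaluation.py` (which also involves the bucket threshold and the corruption bounds):

```python
# Toy sketch of a two-level (local/global) thresholding decision.
# Assumed simplification for intuition; not evaluation.py's actual algorithm.

def decide(sample_scores, threshin=60.0, threshout=0.08):
    """Flag samples whose score exceeds the local threshold (threshin), then
    accept the whole set if the flagged fraction exceeds the global threshold (threshout)."""
    flagged = [s for s in sample_scores if s > threshin]
    fraction = len(flagged) / len(sample_scores)
    return ("accept" if fraction > threshout else "reject"), fraction

# Hypothetical per-sample scores (e.g., derived from the precomputed probability values).
scores = [12.0, 75.5, 80.2, 33.1, 64.9, 5.4, 91.0, 22.3, 70.7, 18.8]
decision, frac = decide(scores)
print(decision, f"(flagged fraction = {frac:.2f})")
```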