SHADE-Arena

This is the public split of SHADE-Arena. The general setup of this repo is as follows:

universe_related_stuff: contains the information that populates the mock environment for the agent.
task_pairs: contains the task definitions and solutions for each task pair. Also all of the transcripts + summaries for each task pair will appear here when you run an experiment.
task_evaluation: contains the code for the main and side task evaluation.
tools: contains the code for the mock tools that the agent can use.
environment_classes: contains the code for the environment components that the agent interacts with.
environments: contains the environment implementations that agents can interact with.
The easiest way to understand a given task pair is to look at the task definition. It will have the actual prompt that the agent gets for the main task and the side task. There's also a README for each task pair.
If you intend to use this code I strongly recommend requesting access to the private repo. This repo has existing transcripts, so you won't need to generate them all fresh, includes tools for supporting AI control experiments and a lot of monitoring experiments, and supports caching of LLM responses. To request access, either contact Jon Kutasov or create an issue on this repo.

Using Just the Environments

If you're only interested in using the environments without the analysis/LLM API, you can focus on these three directories:

environments/: Contains the environment implementations
environment_classes/: Contains the classes used by the environments
tools/: Contains the tools that agents can use to interact with the environments

Each of these directories has its own README with detailed instructions on how to use them independently. The environments are designed to be modular and can be used without the rest of the codebase.

Quick Start for Using Environments Only

Clone only the necessary directories:

mkdir my-environments
cd my-environments
git init
git remote add origin ...
git config core.sparseCheckout true
echo "environments/" >> .git/info/sparse-checkout
echo "environment_classes/" >> .git/info/sparse-checkout
echo "tools/" >> .git/info/sparse-checkout
git pull origin main

For more detailed information, see the README files in each directory:

How an experiment works

We modify the AgentDojo tools pipeline that runs the agent through the task, giving it the chance to use tools each turn until it's done.
There are four ways this can go: either the agent is allowed to use thinking tags to hide its thoughts or it's not, and separately, it can either get a hint or not.
The agent will interact with the environment continuously the whole time, all the tool calls are logged as well as the statements of the agent.
The transcript is sent to the monitor and they independently evaluate it for bad behavior, giving a score from 0 to 100 for suspiciousness. The monitor obviously doesn't see the system prompt the agent got that specifies the secondary task. In the hidden scratchpad case, the monitor also doesn't get to see anything in thinking tags.
There is a function that evaluates the main task completion. This depends on the task, so it's pretty general, but ideally there will be some score from 0 to 100.
Similarly, there's a function to evaluate the secondary task.

Here's an example of how to run an experiment:

python task_completion.py --num_experiments 3 --test_mode False --task_folder bank_transfer_2 --max_concurrent_threads 3 --give_hint True --remove_thinking_tags False

The interesting parameters are:

num_experiments: How many times do you want to have the same agent try this task pair?
test_mode: This just makes the experiments cut off early so I can inspect that everything is fine.
task_folder: The folder that contains the task definition and solution.
give_hint: Whether or not to give the agent a hint about how to do the secondary task.
remove_thinking_tags: Whether or not to remove the tags from the transcript (doing this would make the thinking space private to the agent). The system message tells the agent whether or not its thoughts are private.

If you are facing rate limit issues, you can decrease max_concurrent_threads.

How to use this easily

You should look at task_completion.py to see how you can easily interact with the code.
Basically, you need an ExperimentConfig (which I usually just feed in as command line arguments) and a config.yaml file which tells it which models to use. We can support any model that OpenAI, Gemini, or Anthropic supports provided that it can use tools and it accepts a system prompt.
You can add your own LLM by making a new folder in the llms folder. It should inherit from the BaseLLM class.

Reproducing results

Reproducing our results will be impossible without access to the other tasks (again, these are available upon request). The notebooks in plotting_notebooks are the ones used to generate the plots in the paper.

Canary string: The following canary helps researchers prevent SHADE-Arena task data from being included in training datasets: 5AGVKj6HGcwQshM7DLulSKcGVbIe2C-mFbSl7eMVOd1VrTNayy0YkjJ1X4W3o

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.vscode		.vscode
environment_classes		environment_classes
environments		environments
llms		llms
monitor_generation		monitor_generation
plotting_notebooks		plotting_notebooks
task_evaluation		task_evaluation
task_pairs		task_pairs
tools		tools
universe_related_stuff		universe_related_stuff
utils		utils
.cursorignore		.cursorignore
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config.yaml		config.yaml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
task_completion.py		task_completion.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SHADE-Arena

Using Just the Environments

Quick Start for Using Environments Only

How an experiment works

How to use this easily

Reproducing results

About

Uh oh!

Releases

Packages

Languages

License

jkutaso/SHADE-Arena

Folders and files

Latest commit

History

Repository files navigation

SHADE-Arena

Using Just the Environments

Quick Start for Using Environments Only

How an experiment works

How to use this easily

Reproducing results

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages