Skip to content

jkutaso/SHADE-Arena

Repository files navigation

SHADE-Arena

This is the public split of SHADE-Arena. The general setup of this repo is as follows:

  • universe_related_stuff: contains the information that populates the mock environment for the agent.
  • task_pairs: contains the task definitions and solutions for each task pair. Also all of the transcripts + summaries for each task pair will appear here when you run an experiment.
  • task_evaluation: contains the code for the main and side task evaluation.
  • tools: contains the code for the mock tools that the agent can use.
  • environment_classes: contains the code for the environment components that the agent interacts with.
  • environments: contains the environment implementations that agents can interact with.
  • The easiest way to understand a given task pair is to look at the task definition. It will have the actual prompt that the agent gets for the main task and the side task. There's also a README for each task pair.
  • If you intend to use this code I strongly recommend requesting access to the private repo. This repo has existing transcripts, so you won't need to generate them all fresh, includes tools for supporting AI control experiments and a lot of monitoring experiments, and supports caching of LLM responses. To request access, either contact Jon Kutasov or create an issue on this repo.

Using Just the Environments

If you're only interested in using the environments without the analysis/LLM API, you can focus on these three directories:

  • environments/: Contains the environment implementations
  • environment_classes/: Contains the classes used by the environments
  • tools/: Contains the tools that agents can use to interact with the environments

Each of these directories has its own README with detailed instructions on how to use them independently. The environments are designed to be modular and can be used without the rest of the codebase.

Quick Start for Using Environments Only

Clone only the necessary directories:

mkdir my-environments
cd my-environments
git init
git remote add origin ...
git config core.sparseCheckout true
echo "environments/" >> .git/info/sparse-checkout
echo "environment_classes/" >> .git/info/sparse-checkout
echo "tools/" >> .git/info/sparse-checkout
git pull origin main

For more detailed information, see the README files in each directory:

How an experiment works

  • We modify the AgentDojo tools pipeline that runs the agent through the task, giving it the chance to use tools each turn until it's done.
  • There are four ways this can go: either the agent is allowed to use thinking tags to hide its thoughts or it's not, and separately, it can either get a hint or not.
  • The agent will interact with the environment continuously the whole time, all the tool calls are logged as well as the statements of the agent.
  • The transcript is sent to the monitor and they independently evaluate it for bad behavior, giving a score from 0 to 100 for suspiciousness. The monitor obviously doesn't see the system prompt the agent got that specifies the secondary task. In the hidden scratchpad case, the monitor also doesn't get to see anything in thinking tags.
  • There is a function that evaluates the main task completion. This depends on the task, so it's pretty general, but ideally there will be some score from 0 to 100.
  • Similarly, there's a function to evaluate the secondary task.

Here's an example of how to run an experiment:

python task_completion.py --num_experiments 3 --test_mode False --task_folder bank_transfer_2 --max_concurrent_threads 3 --give_hint True --remove_thinking_tags False

The interesting parameters are:

  • num_experiments: How many times do you want to have the same agent try this task pair?
  • test_mode: This just makes the experiments cut off early so I can inspect that everything is fine.
  • task_folder: The folder that contains the task definition and solution.
  • give_hint: Whether or not to give the agent a hint about how to do the secondary task.
  • remove_thinking_tags: Whether or not to remove the tags from the transcript (doing this would make the thinking space private to the agent). The system message tells the agent whether or not its thoughts are private.

If you are facing rate limit issues, you can decrease max_concurrent_threads.

How to use this easily

  • You should look at task_completion.py to see how you can easily interact with the code.
  • Basically, you need an ExperimentConfig (which I usually just feed in as command line arguments) and a config.yaml file which tells it which models to use. We can support any model that OpenAI, Gemini, or Anthropic supports provided that it can use tools and it accepts a system prompt.
  • You can add your own LLM by making a new folder in the llms folder. It should inherit from the BaseLLM class.

Reproducing results

Reproducing our results will be impossible without access to the other tasks (again, these are available upon request). The notebooks in plotting_notebooks are the ones used to generate the plots in the paper.

Canary string: The following canary helps researchers prevent SHADE-Arena task data from being included in training datasets: 5AGVKj6HGcwQshM7DLulSKcGVbIe2C-mFbSl7eMVOd1VrTNayy0YkjJ1X4W3o

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published