Turnabout LLM

This project benchmarks LLMs' deductive reasoning ability using interactive detective novel games such as Ace Attorney and Danganronpa. This repo includes our datasets, scripts, and analyses.

Objection!

The name "Turnabout" is a naming convention from Ace Attorney as a nod to the playable character's knack for completely changing the direction of a trial, against all odds.

Why interactive detective novels?

Detective stories contain some of the most difficult reasoning problems, meticulously crafted to be intriguing and obscure. Solving them demands diverse reasoning abilities and may require retrieving information from long passages of context. Evaluating LLMs on detective stories therefore poses unique challenges.

Most detective novels, like Sherlock Holmes, can hardly be used for evaluation because they do not contain explicit questions to pose to models. Games like Ace Attorney circumvent this constraint, as the interactive gameplay provides a natural interface with LLMs. Specifically, the core gameplay mechanic is to read through a story, examine the available evidence, listen to witness testimonies, and find a contradiction between a piece of evidence and a testimony. In essence, this is a multiple-choice question whose action space is num_evidences x num_testimonies, which usually numbers in the hundreds. A minimal sketch of this answer space is given below.
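
As a rough illustration (a minimal sketch with made-up evidence and testimonies, not the dataset's actual schema):

# Minimal sketch of one turn's answer space (hypothetical data).
evidences = ["Autopsy Report", "Broken Clock", "Hotel Receipt"]
testimonies = [
    "I saw the victim alive at 9 PM.",
    "The room was locked from the inside.",
]

# Every (evidence, testimony) pair is a candidate answer, so the action
# space has len(evidences) * len(testimonies) options.
candidates = [(e, t) for e in evidences for t in testimonies]
print(len(candidates))  # 3 * 2 = 6; real turns often have hundreds of pairs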

Despite possible subjectivity (is Ace Attorney rigorous in its logic?), games like Ace Attorney are critically acclaimed, with a sizeable player community that generally agrees on the validity of the contradictions. While the task is challenging, an attentive human player should be able to find most contradictions. As of this writing, however, no LLM has achieved more than 40% accuracy.

(Figure: an example from the Turnabout LLM dataset.)

Dataset

Detailed information about the Turnabout LLM dataset can be found in data/ and the README there. We present this dataset to evaluate LLMs' deductive reasoning ability. The game data is crawled and parsed from an Ace Attorney wiki and a Danganronpa archive. We make the following design choices:

  • We only consider the textual elements, which are core to reasoning in most cases. Whenever visuals are needed for reasoning, they are captioned; a multimodal evaluation may come in future work.
  • For Ace Attorney, we only consider the cross-examination gameplay during trials, neglecting other gameplay elements such as investigation, psyche-locks, etc.
  • For Danganronpa, we only consider the non-stop debate gameplay during trials, neglecting other gameplay elements such as socializing, hangman gambit, etc.
  • While our dataset is mostly faithful to the original games, we made various edits (rewording, removing loose contradictions, adding information to bridge logic leaps, etc.) to improve the rigor of the reasoning.

For each turn (either a cross-examination or a non-stop debate), the input to a model is:

  1. A list of evidences (Ace Attorney) or truth bullets (Danganronpa) and their descriptions
  2. A list of testimonies
  3. The story context (only in some settings)

The expected output of a model is a contradicting pair of one evidence and one testimony; a hypothetical example of one turn is sketched below. While most turns are self-contained, some require specific information from the story context. These become a needle-in-a-haystack information retrieval problem that is particularly challenging for LLMs.
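
For illustration, one turn might look like the following (a hypothetical, simplified example; the field names are illustrative, not the dataset's exact schema):

# Hypothetical, simplified shape of one turn (illustrative field names).
turn = {
    "evidences": [
        {"name": "Autopsy Report", "description": "Time of death: 9-10 PM."},
        {"name": "Hotel Receipt", "description": "Checked in at 8 PM."},
    ],
    "testimonies": [
        "I heard the victim scream at midnight.",
        "Nobody entered the room after 8 PM.",
    ],
    "context": None,  # full story text or a summary, only in some settings
}

# The expected answer is the contradicting pair: the Autopsy Report
# (time of death 9-10 PM) contradicts the claim of a scream at midnight.
answer = {"evidence": 0, "testimony": 0}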

Evaluation

For a complete explanation of how to evaluate the models, see this README.

In short, to run inference on the dataset with an LLM and a prompt, go to /source and run

python run_models.py --model MODEL --prompt PROMPT --context CONTEXT

  • MODEL is the name of a Hugging Face model such as deepseek-ai/DeepSeek-R1, or an API model such as deepseek-reasoner or gpt-4.1. You can also customize model-name acronyms in /source/model_names.json.

  • PROMPT is the name of a prompt stored in /source/prompts.

  • CONTEXT is left blank for no context, set to full for the full story context, or set to sum for a context summary.

Running this command will produce /output/MODEL_PROMPT, storing the model's outputs.
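
For example, a run might look like this (the prompt name zeroshot here is hypothetical; use one of the prompt files actually present in /source/prompts):

python run_models.py --model gpt-4.1 --prompt zeroshot --context full

which would write the model's outputs to /output/gpt-4.1_zeroshot.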

To evaluate said output, run

python evaluate.py --model MODEL --prompt PROMPT --context CONTEXT

This will create a MODEL_PROMPT_report.json in /eval.
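
Continuing the hypothetical example above, the matching evaluation call would be

python evaluate.py --model gpt-4.1 --prompt zeroshot --context full

producing /eval/gpt-4.1_zeroshot_report.json.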

License

Following the source of our data, fandom.com, our resources are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).

Citation

If you find our work useful, please cite TODO.
