
Einstein Puzzles on Tabletop

  🤗 Hugging Face Dataset   |   🤗 Hugging Face Model   |    📑 Paper

Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

Run Peng*, Ziqiao Ma*, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai

Environment Setup

We recommend using uv for the environment setup.

uv sync

We fine-tuned and evaluated our models on NVIDIA A40 GPUs (48 GB memory) with CUDA 12.4. Training used 4 GPUs, while evaluation used a single GPU. Please ensure your torch and vllm versions are compatible with this setup, and adjust them if you encounter any issues.

Training

We provide training data for four action-space configurations with Chain-of-Thought (CoT) reasoning. We suggest downloading the dataset from our Hugging Face page and storing it under the EinsteinPuzzles/dataset folder. If you would like a version without CoT, simply remove the contents inside the <think> tags.
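Stripping the CoT spans can be done with a one-line regex; below is a minimal sketch (the helper name is ours, and it removes the entire <think>…</think> span — keep empty tags instead if your no-CoT prompt format expects them):

```python
import re

def strip_cot(text: str) -> str:
    """Remove <think>...</think> spans (including the tags) from a sample."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

The `re.DOTALL` flag matters because the reasoning inside the tags typically spans multiple lines.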

We use the fine-tuning framework from OpenRLHF (version 0.6.0.post3) for all fine-tuning runs. Key hyperparameters and settings used in training are documented in the appendix of the paper.

Evaluation

We open-source our fine-tuned Llama-3.1-8B-Instruct model, with both information-providing and information-seeking capabilities and CoT reasoning, on Hugging Face. We recommend downloading it to the EinsteinPuzzles/checkpoint folder. More checkpoints are available upon request.

For evaluation, we include 300 unseen test cases for online interaction, stored in dataset/eval/eval_game_ids.json. Each test case is indexed by a unique game ID that specifies the initial state of the game.

Once the checkpoint is available, you can evaluate the model (without verifier) using the following command:

cd EinsteinPuzzles
CUDA_VISIBLE_DEVICES=0 uv run src/eval/eval_game_raw_model.py \
--action_mode provide_seek \
--use_cot \
--output_dir outputs/ \
--max_files 300 \
--json_path dataset/eval/eval_game_ids.json \
--base_model_path meta-llama/Llama-3.1-8B-Instruct \
--lora_model_path ./checkpoint/llama3.1-8B-cot-provide-seek

For the evaluation with verifier, run the following command:

CUDA_VISIBLE_DEVICES=0 uv run src/eval/eval_game_verifier_model.py \
--use_cot \
--action_mode provide_seek \
--output_dir outputs_verifier/ \
--max_files 300 \
--json_path dataset/eval/eval_game_ids.json \
--base_model_path meta-llama/Llama-3.1-8B-Instruct \
--lora_model_path ./checkpoint/llama3.1-8B-cot-provide-seek \
--verifier <affordance_verifier/communication_verifier/reasoning_verifier>

Citation

@misc{peng2025communicationverificationllmagents,
      title={Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry}, 
      author={Run Peng and Ziqiao Ma and Amy Pang and Sikai Li and Zhang Xi-Jia and Yingzhuo Yu and Cristian-Paul Bara and Joyce Chai},
      year={2025},
      eprint={2510.25595},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25595}, 
}
