
Einstein Puzzles on Tabletop

  🤗 Hugging Face Dataset   |   🤗 Hugging Face Model   |    📑 Paper

Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

Run Peng*, Ziqiao Ma*, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai

Environment Setup

We recommend using uv for the environment setup.

uv sync

We fine-tuned and evaluated our models on NVIDIA A40 GPUs (48 GB memory) with CUDA 12.4. Training used 4 GPUs, while evaluation used a single GPU. Please ensure your torch and vllm versions are compatible with this setup, and adjust them if you encounter any issues.

Training

We provide training data for four action-space configurations with Chain-of-Thought (CoT) reasoning. We suggest downloading the dataset from our Hugging Face page and storing it under the EinsteinPuzzles/dataset folder. If you would like a version without CoT, simply remove the contents inside the <think> tags.
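Stripping the CoT spans can be done with a one-line regex; below is a minimal sketch (the helper name is ours, and it removes the entire <think>…</think> span — keep empty tags instead if your no-CoT prompt format expects them):

```python
import re

def strip_cot(text: str) -> str:
    """Remove <think>...</think> spans (including the tags) from a sample."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

The `re.DOTALL` flag matters because the reasoning inside the tags typically spans multiple lines.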

We use the fine-tuning framework from OpenRLHF (version 0.6.0.post3) for all fine-tuning runs. Key hyperparameters and settings used in training are documented in the appendix of the paper.

Evaluation

We open-source our fine-tuned Llama-3.1-8B-Instruct model, with both information-providing and information-seeking capabilities and CoT reasoning, on Hugging Face. We recommend downloading it to the EinsteinPuzzles/checkpoint folder. More checkpoints are available upon request.

For evaluation, we include 300 unseen test cases for online interaction, stored in dataset/eval/eval_game_ids.json. Each test case is indexed by a unique game ID that specifies the initial state of the game.

Once the checkpoint is available, you can evaluate the model (without verifier) using the following command:

cd EinsteinPuzzles
CUDA_VISIBLE_DEVICES=0 uv run src/eval/eval_game_raw_model.py \
--action_mode provide_seek \
--use_cot \
--output_dir outputs/ \
--max_files 300 \
--json_path dataset/eval/eval_game_ids.json \
--base_model_path meta-llama/Llama-3.1-8B-Instruct \
--lora_model_path ./checkpoint/llama3.1-8B-cot-provide-seek

For the evaluation with verifier, run the following command:

CUDA_VISIBLE_DEVICES=0 uv run src/eval/eval_game_verifier_model.py \
--use_cot \
--action_mode provide_seek \
--output_dir outputs_verifier/ \
--max_files 300 \
--json_path dataset/eval/eval_game_ids.json \
--base_model_path meta-llama/Llama-3.1-8B-Instruct \
--lora_model_path ./checkpoint/llama3.1-8B-cot-provide-seek \
--verifier <affordance_verifier/communication_verifier/reasoning_verifier>

Citation

@misc{peng2025communicationverificationllmagents,
      title={Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry}, 
      author={Run Peng and Ziqiao Ma and Amy Pang and Sikai Li and Zhang Xi-Jia and Yingzhuo Yu and Cristian-Paul Bara and Joyce Chai},
      year={2025},
      eprint={2510.25595},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25595}, 
}
