UFO: A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Overview

UFO addresses a critical gap in language model training: single-turn reinforcement learning (RL) improves reasoning, but the resulting models often fail in multi-turn interactive scenarios, repeating the same wrong answers despite feedback.

Key Problem

Single-turn RL models lose the ability to revise reasoning across multiple turns. In 70% of failure cases, they produce identical answers across 5 interaction rounds, unable to incorporate simple feedback like "try again."
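
One way to quantify this failure mode is to count the failed episodes whose answers never change. The following is a minimal sketch, assuming each episode is recorded as a list of answer strings plus a success flag; the `Episode` structure and `identical_failure_rate` are hypothetical names for illustration, not part of this repository, and the toy episodes are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    answers: List[str]  # one answer per interaction round
    solved: bool        # True if any round produced the correct answer


def identical_failure_rate(episodes: List[Episode]) -> float:
    """Fraction of failed episodes in which every round gave the same answer."""
    failed = [ep for ep in episodes if not ep.solved]
    if not failed:
        return 0.0
    stuck = sum(1 for ep in failed if len({a.strip() for a in ep.answers}) == 1)
    return stuck / len(failed)


# Toy data for illustration only.
episodes = [
    Episode(["12", "12", "12", "12", "12"], solved=False),  # stuck on one answer
    Episode(["12", "15", "9", "7", "20"], solved=False),    # revises, never succeeds
    Episode(["12", "14"], solved=True),                     # solved on the second try
]
rate = identical_failure_rate(episodes)
print(rate)  # 0.5
```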

Solution: UFO Framework

Unary Feedback as Observation (UFO) transforms static datasets into multi-turn training episodes by:

  • Using only minimal feedback signals ("Try Again")
  • Treating failure feedback as part of the observation
  • Enabling models to learn from historical mistakes
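
The three points above can be sketched as a rollout loop over a single (question, answer) pair. This is an illustrative sketch under stated assumptions, not the implementation in `ufb/`: `run_episode`, `toy_policy`, and the prompt layout are hypothetical, and exact string matching stands in for whatever answer checker each environment uses.

```python
from typing import Callable, List, Tuple

FEEDBACK = "Try Again"  # the only feedback signal; it carries no hint about the answer


def run_episode(
    question: str,
    gold: str,
    policy: Callable[[str], str],
    max_turns: int = 5,
) -> Tuple[bool, List[str]]:
    """Roll out one multi-turn episode built from a static (question, answer) pair."""
    observation = question
    answers: List[str] = []
    for _ in range(max_turns):
        answer = policy(observation)
        answers.append(answer)
        if answer.strip() == gold.strip():
            return True, answers
        # On failure, the past attempt and the unary feedback are appended
        # to the observation, so the next turn can condition on the mistake.
        observation += f"\nAnswer: {answer}\nFeedback: {FEEDBACK}"
    return False, answers


def toy_policy(observation: str) -> str:
    # Hypothetical stand-in for an LLM: tries a new candidate after each rejection.
    candidates = ["10", "11", "12"]
    return candidates[min(observation.count(FEEDBACK), len(candidates) - 1)]


solved, answers = run_episode("What is 5 + 7?", "12", toy_policy)
print(solved, answers)  # True ['10', '11', '12']
```

Because the feedback is appended to the observation rather than supplied through a separate channel, any single-turn dataset with a verifiable answer can be replayed as a multi-turn episode.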

Results

  • 14% improvement in multi-turn success rates
  • 10% reduction in average interaction turns
  • Better performance even in single-turn scenarios
  • 92.8% non-repetitive answers (vs. 79.7% baseline)

Repository Structure

unary-feedback/
├── ufb/                        # Core UFO framework
│   ├── env/                    #   13 environment types (metamathqa, sokoban, sudoku, ...)
│   ├── llm_agent/              #   Agent proxy, context & episode-state management
│   ├── trainer/                #   PPO trainer adapted for multi-turn episodes
│   ├── workers/                #   Distributed FSDP actor, critic, rollout workers
│   ├── eval.py                 #   Multi-turn evaluation with feedback (Succ@k)
│   ├── eval_api.py             #   API-based evaluation (OpenAI, Anthropic, etc.)
│   └── utils.py                #   Shared utilities
│
├── verl/                       # veRL distributed RL infrastructure (vendored)
│
├── configs/
│   ├── base.yaml               #   Base training config (normal feedback)
│   ├── train_generic_feedback.yaml     #   Generic feedback training
│   ├── train_no_feedback.yaml  #   No-feedback training (ablation)
│   ├── envs/                   #   Per-environment configs (28 environments)
│   ├── envs.yaml               #   Environment definitions (33 tags)
│   └── ppo_trainer.yaml        #   PPO algorithm hyperparameters
│
├── scripts/
│   ├── train.sh                #   Training (ENV=sokoban sbatch scripts/train.sh)
│   ├── eval.sh                 #   Evaluation (ENV=sokoban MODEL=... sbatch scripts/eval.sh)
│   ├── download_data.py        #   Dataset download
│   ├── setup_ufb.sh            #   Environment setup
│   └── utils/                  #   Checkpoint conversion, data processing tools
│
├── external/                   # External dependencies (webshop-minimal)
├── train.py                    # Main training entry point
├── setup.py                    # Package installation (pip install -e .)
├── requirements.txt            # Python dependencies
├── LICENSE                     # Apache 2.0
└── README.md

Setup

# Clone and setup
git clone https://github.com/lichengliu03/unary-feedback.git
cd unary-feedback
bash scripts/setup_ufb.sh

For manual setup, see scripts/setup_ufb.md.

Training

Each environment has a pre-configured YAML in configs/envs/ with appropriate hyperparameters (response length, max turns, batch size, etc.). Just specify the environment name:

ENV=metamathqa  sbatch scripts/train.sh
ENV=sokoban     sbatch scripts/train.sh
ENV=countdown   sbatch scripts/train.sh
ENV=frozen_lake sbatch scripts/train.sh
ENV=sudoku      sbatch scripts/train.sh
ENV=bandit      sbatch scripts/train.sh
ENV=gsm8k       sbatch scripts/train.sh
ENV=hotpotqa    sbatch scripts/train.sh
ENV=webshop     sbatch scripts/train.sh
ENV=alfworld    sbatch scripts/train.sh
# ... see configs/envs/ for all 28 environments

Optional overrides:

# Different model
ENV=sokoban MODEL_PATH=Qwen/Qwen2.5-7B-Instruct sbatch scripts/train.sh

# Different number of training steps
ENV=metamathqa STEPS=100 sbatch scripts/train.sh

Feedback variants

By default, one-bit feedback is randomly sampled from a prompt pool (e.g. "Incorrect.", "That's wrong, try again.", "Not quite right.", etc.) to prevent overfitting to specific wording. To use a fixed feedback prompt instead, set randomize_feedback: false in the env config.

# Normal feedback with randomized prompt pool (default)
ENV=metamathqa sbatch scripts/train.sh

# Fixed feedback (ablation: single prompt "Incorrect. Please think again.")
# First set randomize_feedback: false in the env config, then launch as usual
ENV=metamathqa sbatch scripts/train.sh

# Generic feedback
ENV=metamathqa CONFIG=train_generic_feedback sbatch scripts/train.sh

# Specific feedback (answer-directed hint)
ENV=metamathqa CONFIG=train_specific_feedback sbatch scripts/train.sh

# No feedback (ablation)
ENV=metamathqa CONFIG=train_no_feedback sbatch scripts/train.sh
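
The randomized and fixed feedback modes can be sketched as follows. This is a hedged illustration, not the repository's code: `sample_feedback` is a hypothetical helper, and the pool contains only the example phrasings quoted above, so the actual pool may be larger.

```python
import random
from typing import Optional

# Example phrasings quoted in this README; the real pool may contain more.
FEEDBACK_POOL = ["Incorrect.", "That's wrong, try again.", "Not quite right."]
FIXED_FEEDBACK = "Incorrect. Please think again."


def sample_feedback(
    randomize_feedback: bool = True,
    rng: Optional[random.Random] = None,
) -> str:
    """Return a one-bit failure message: the wording varies, the information does not."""
    if not randomize_feedback:
        return FIXED_FEEDBACK
    rng = rng if rng is not None else random.Random()
    return rng.choice(FEEDBACK_POOL)


print(sample_feedback(randomize_feedback=False))  # Incorrect. Please think again.
print(sample_feedback(rng=random.Random(0)) in FEEDBACK_POOL)  # True
```

Every message in the pool conveys the same single bit (the answer was wrong), so randomizing the wording changes only the surface form the model sees.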

Available environments

| Category | Environments |
|---|---|
| Math | metamathqa, gsm8k, math, aime24, countdown, theoremqa |
| QA / Reasoning | hotpotqa, concurrentqa, musique, gpqa |
| General Knowledge | mmlu, mmlu_pro, mmlu_stem, mmlu_redux |
| Code | humaneval, mbpp, multiple, livecodebench, livebench |
| Planning | sokoban, frozen_lake, sudoku, spatial |
| Interactive | webshop, alfworld, bandit |
| Formal | lean, search |

Evaluation

# Evaluate base model (no training)
ENV=metamathqa MODEL=Qwen/Qwen2.5-3B-Instruct sbatch scripts/eval.sh

# Evaluate a trained checkpoint
ENV=metamathqa MODEL=/path/to/checkpoint/global_step_200 sbatch scripts/eval.sh

# Evaluate on a different environment
ENV=sokoban MODEL=/path/to/checkpoint sbatch scripts/eval.sh

Visualization

To inspect generated rollouts, check val/generations in the wandb run.

Key Results

Multi-Turn Reasoning Performance

We compare our multi-turn UFO model against a strong single-turn PPO baseline. For a fair comparison, the baseline is evaluated on 5 independent samples (Pass@5), while our model uses 5 sequential attempts with feedback (Succ@5).
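
Both metrics ask whether a problem is solved within k attempts; they differ only in how the attempts are produced. The sketch below assumes per-attempt correctness flags have already been recorded. `solved_within_k` and the data layout are illustrative, not the interface of `ufb/eval.py`, and the flags are made-up toy values rather than results.

```python
from typing import List


def solved_within_k(attempts: List[List[bool]], k: int) -> float:
    """Share of problems with at least one correct attempt among the first k."""
    return sum(any(a[:k]) for a in attempts) / len(attempts)


# Pass@5: five independent samples per problem, all from the same prompt.
baseline_samples = [
    [False, False, True, False, False],
    [False] * 5,
    [True, True, False, True, False],
    [False] * 5,
]

# Succ@5: up to five sequential attempts; each retry sees the earlier
# answers and the feedback, and the episode stops at the first success.
ufo_episodes = [
    [False, True],
    [False] * 5,
    [True],
    [False, False, False, True],
]

pass_at_5 = solved_within_k(baseline_samples, 5)
succ_at_5 = solved_within_k(ufo_episodes, 5)
print(pass_at_5, succ_at_5)  # 0.5 0.75
```

Giving both settings the same budget of five attempts keeps the comparison about the value of feedback rather than the number of tries.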

Key Findings:

  • +14% success rate over single-turn PPO baseline
  • Benefits generalize to both multi-turn and single-turn inference
  • Best results with 5-turn training; more turns yield diminishing returns

Effectiveness of Unary Feedback

  • Providing feedback during both training and validation is crucial for improvement
  • Providing feedback only during training does not help at inference

Reward Design Impact

  • Exponential Reward Decay: Decreases average actions required by ~10%
  • Answer Diversity: Non-repetitive answer ratio increases from 79.7% to 92.8%
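
An exponentially decaying reward can be sketched as below. The exact reward used in the paper is not given in this README, so this is an assumption-laden illustration: it assumes a reward of `gamma ** (t - 1)` for a first correct answer on turn `t` and zero for an unsolved episode, and `gamma = 0.5` is an arbitrary value chosen for the example.

```python
from typing import Optional


def decayed_reward(success_turn: Optional[int], gamma: float = 0.5) -> float:
    """Reward that shrinks with the (1-indexed) turn of the first correct answer.

    Assumed form: gamma ** (t - 1) for success on turn t, 0.0 if never solved.
    """
    if success_turn is None:
        return 0.0
    if success_turn < 1:
        raise ValueError("success_turn is 1-indexed")
    return gamma ** (success_turn - 1)


print(decayed_reward(1))     # 1.0
print(decayed_reward(3))     # 0.25
print(decayed_reward(None))  # 0.0
```

Under this form, solving a problem earlier always earns more than solving it later, which gives the policy an incentive to use fewer turns.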

Citation

@article{liu2025ufo,
  title={UFO: A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning},
  author={Liu, Licheng and others},
  journal={arXiv preprint arXiv:2507.14295},
  year={2025}
}

Acknowledgements

We thank the DeepSeek team for providing the DeepSeek-R1 model and early conceptual inspirations. We are grateful to the veRL team for their infrastructure support and the RAGEN team for their multi-turn RL framework.
