This repository accompanies our NeurIPS 2025 paper:
Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
by Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques.
Consistent-LLMs introduces a reinforcement learning framework to measure and improve consistency in large language models (LLMs).
Our work identifies and addresses the challenge of drift — when LLMs lose track of a user’s identity, beliefs, or prior behavior over time.
We propose three complementary consistency metrics:
- Prompt-to-Line Consistency – measures faithfulness to the initial persona and prompt context.
- Line-to-Line Consistency – quantifies coherence and continuity across dialogue turns.
- Q&A Consistency – assesses stability of beliefs, self-identity, and memory across sessions.
These metrics are used as reward signals in multi-turn reinforcement learning, leading to LLMs that simulate more stable, human-like personas across different domains (education, mental health, conversation).
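As a rough illustration of how a consistency metric can be turned into a scalar reward, the sketch below scores a dialogue against its persona prompt with an LLM judge. This is not the repository's reward implementation (that lives in `rl_training/reward_func_prompt.py` and may differ); the judge model, prompt wording, and `judge_consistency` helper are hypothetical.

```python
# Illustrative only: a prompt-to-line style consistency score from an LLM judge,
# normalized to [0, 1] so it could be used as a scalar RL reward. The judge model,
# prompt wording, and helper name are hypothetical, not the repository's code.
from openai import OpenAI

client = OpenAI()  # or point base_url at a locally hosted vLLM server

def judge_consistency(persona: str, dialogue: list[str]) -> float:
    """Ask a judge model how faithful the dialogue stays to the persona prompt."""
    transcript = "\n".join(dialogue)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system",
             "content": "Rate persona consistency on a 0-10 scale. Reply with a single number."},
            {"role": "user",
             "content": f"Persona:\n{persona}\n\nDialogue:\n{transcript}"},
        ],
    )
    return float(response.choices[0].message.content.strip()) / 10.0
```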
We recommend setting up a clean conda environment:

```bash
git clone https://github.com/abdulhaim/consistent-LLMs
cd consistent-LLMs
conda create --name consistent python=3.10
conda activate consistent
pip install -e .
pip install openrlhf[vllm]
```

There are three notebooks — `chatting/chatting.ipynb`, `education/teaching.ipynb`, and `therapy/patient-therapist.ipynb` — which provide the pipeline to generate personas, generate dialogue, and measure consistency for the Chit-Chat, Teaching, and Mental Health tasks.
Our multi-turn RL pipeline is based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF).
Commands and hyperparameters for running training are in the `rl_training` directory:
- Either run `python jsonl_gen.py --task=<task>`, with `<task>` one of `Chatting`, `Education`, or `Therapy` (the choice does not matter the first time it is run), to create empty `training_data/in` and `training_data/out` folders within the `rl_training` directory, or create these folders manually.
- Place the conversation JSONs you would like to include in your training data in `training_data/in`.
- Run `python jsonl_gen.py --task=<task>`, with `<task>` one of `Chatting`, `Education`, or `Therapy`, to pair the specific scenario prompts with the dialogues and combine the conversation data into the JSONL format needed by OpenRLHF (see the JSONL sketch after this list). Local paths may need to be replaced by absolute paths in all of the following training scripts.
- (Optional) Train SFT with one of the scripts in `example_sft.sh`.
- Train KTO with one of the scripts in `example_kto.sh`.
- To train PPO, first start a vLLM instance on a GPU separate from the ones planned for PPO training.

  a. `reward_func_prompt.py` has hyperparameters near the top of the file that may need to be changed depending on how the vLLM instance was started (e.g. port, model name); see the sketch after this list. The `chat_completion` call can also be replaced by a call to an online hosted model if you do not wish to locally host a separate GPU instance for training.
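For reference, the sketch below shows one way conversation data could be serialized to JSONL. The field names (`prompt`, `completion`, `label`) are placeholders, not the actual schema produced by `jsonl_gen.py` or required by OpenRLHF; check the script before relying on this layout.

```python
# Hypothetical sketch of the JSONL conversion step; field names are placeholders
# and may not match what jsonl_gen.py actually emits or what OpenRLHF expects.
import json

records = [
    {
        "prompt": "You are simulating a patient persona: ...",          # scenario prompt
        "completion": "Patient: I've been feeling anxious lately ...",  # dialogue turns
        "label": True,  # e.g. a KTO-style desirable/undesirable flag
    },
]

with open("training_data/out/conversations.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Likewise, here is a minimal sketch of pointing a chat-completion call at a locally hosted vLLM server through its OpenAI-compatible API. The port and model name are examples and should match the hyperparameters set near the top of `reward_func_prompt.py`.

```python
# Example only: querying a locally hosted vLLM server through its
# OpenAI-compatible endpoint. Port and model name are illustrative and must
# match how the vLLM instance was actually started.
from openai import OpenAI

# e.g. started with: vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user",
               "content": "Rate the consistency of this dialogue from 0 to 10."}],
)
print(response.choices[0].message.content)
```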
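If the served model responds sensibly to the scoring prompt, the same client call can be dropped into the reward function in place of a hosted-API call, which keeps PPO training fully local.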
This software and/or data was deposited in the BAIR Open Research Commons repository on Oct 23, 2025.