LatentMAS is a flexible multi-agent reasoning framework that moves agent collaboration from token space into the model's latent space. This repository extends the original code with additional flexibility.
Instead of producing long textual reasoning traces, agents communicate by passing latent thoughts through their own working memory. LatentMAS has the following key features:
- Efficient multi-step reasoning with drastically fewer tokens
- Training-free latent-space alignment for stable generation
- A general technique compatible with any Hugging Face model, with optional vLLM backends
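The latent-space alignment in `models.py` operates on real transformer hidden states; the toy sketch below only illustrates the underlying idea of mapping a latent vector back into the input-embedding space (all names and numbers here are made up, not the repo's implementation). One simple interpretation is snapping a latent "thought" to its nearest token-embedding row by cosine similarity:

```python
import math

# Hypothetical 3-token embedding table (real models have tens of thousands
# of rows and hidden sizes in the thousands).
EMBEDDING = [
    [1.0, 0.0, 0.0],   # embedding for token 0
    [0.0, 1.0, 0.0],   # embedding for token 1
    [0.0, 0.0, 1.0],   # embedding for token 2
]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def realign(hidden_state):
    """Map an arbitrary hidden state onto its most similar token embedding,
    so it can be fed back to the model as a valid input embedding."""
    return max(EMBEDDING, key=lambda row: cosine(row, hidden_state))

hidden = [0.9, 0.2, -0.1]   # a latent thought produced by one agent
aligned = realign(hidden)   # now lies exactly in the embedding space
```

The actual training-free alignment used by LatentMAS may differ; this sketch only conveys why realignment stabilizes generation when hidden states are recycled as inputs.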
The Science-LatentMAS branch is an extended version of LatentMAS designed to better support scientific research and discovery workflows. It introduces customizable agent roles and prompts (e.g., Researcher, Analyst), enabling LatentMAS to imitate different scientific reasoning and discovery schemes. It also provides hybrid text–latent generation, flexible agent ordering, and customizable thinking tokens, which further facilitate collaboration across different scientific research agents.
Overall, Science-LatentMAS offers a lightweight and flexible framework for structured scientific reasoning and discovery while remaining fully compatible with the original LatentMAS design.
Three main tables from our paper span 9 tasks across math & science reasoning, commonsense reasoning, and code generation:
- Table 1 — LatentMAS under the Sequential MAS setting
- Table 2 — LatentMAS under the Hierarchical MAS setting
- Table 3 — Main Results on Reasoning-Intensive Tasks
Overall, LatentMAS reduces:
- ~50–80% of tokens
- ~3×–7× wall-clock time

compared to standard Text-MAS or chain-of-thought baselines.
This repository provides all code for reproducing LatentMAS, TextMAS, and baseline single-agent experiments across GSM8K, AIME24/25, GPQA, ARC-Easy/Challenge, MBPP+, HumanEval+, and MedQA.
We recommend setting your HF cache directory to avoid repeated downloads:
```bash
export HF_HOME=/path/to/huggingface
export TRANSFORMERS_CACHE=$HF_HOME
export HF_DATASETS_CACHE=$HF_HOME
```

Models and datasets will automatically be downloaded into `$HF_HOME`.
```bash
conda create -n latentmas python=3.10 -y
conda activate latentmas
pip install -r requirements.txt
```

If you want vLLM support, also install:

```bash
pip install vllm
```

```bash
git clone https://github.com/Gen-Verse/LatentMAS.git
cd LatentMAS
```

```
LatentMAS/
├── run.py              # Main entry for experiments
├── models.py           # Wrapper for HF + vLLM + latent realignment
├── methods/
│   ├── baseline.py     # Single-agent baseline
│   ├── text_mas.py     # Token-space multi-agent method
│   └── latent_mas.py   # Latent-space multi-agent (our method)
├── prompts.py          # Prompt constructors
├── data.py             # Dataset loaders
├── data/               # Provided data + figures (we give medqa.json as an example here)
├── utils.py            # Answer parsing / timeout / helpers
├── example_logs/       # Example logs from LatentMAS
└── requirements.txt
```
```bash
# Single-agent baseline
python run.py --method baseline --model_name Qwen/Qwen3-14B --task gsm8k --max_samples -1

# Token-space multi-agent (TextMAS)
python run.py --method text_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1

# Latent-space multi-agent (LatentMAS)
python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1
```

Key flags:
- `--latent_steps` — number of latent reasoning steps, in [0, 80]. Tune for best performance.
- `--latent_space_realign` — enables latent–embedding alignment. We treat this as a hyperparameter; enable or disable it depending on the task/model:

```bash
python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --latent_space_realign
```

Two example LatentMAS logs are provided for reference:
- `example_logs/qwen3_14b_mbppplus_sequential.txt`
- `example_logs/qwen3_14b_humanevalplus_hierarchical.txt`
Please refer to additional experiment logs here. You can open them to view the full agent interaction traces and outputs.
LatentMAS supports vLLM for faster inference.
```bash
python run.py --method baseline --model_name Qwen/Qwen3-14B --task gsm8k --max_samples -1 --use_vllm
python run.py --method text_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --use_vllm
```

LatentMAS supports a hybrid HF + vLLM pipeline for fast inference:
- vLLM handles final text generation (with prefix caching, tensor parallelism, etc.)
- A HuggingFace model handles latent-space rollout and hidden-state alignment
For this setup, we recommend using two GPUs:
- One GPU for vLLM (`--device`, e.g., `cuda:0`)
- One GPU for the auxiliary HF model (`--device2`, e.g., `cuda:1`)
```bash
CUDA_VISIBLE_DEVICES=0,1 python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 \
    --use_vllm \
    --use_second_HF_model \
    --enable_prefix_caching \
    --device2 cuda:1
```

📌 **Important note:**
vLLM does not officially support modifying the KV cache or prompting via latent embeddings. We patch parts of the vLLM backend internals to implement our method. Minor numeric differences may arise compared to the official HF backend due to different decoding (generation) strategies. Please use the HF backend to reproduce the official published results.
LatentMAS supports two distinct multi-agent collaboration patterns controlled by the --prompt flag:
Agents work in a sequential workflow, where each agent builds directly on the previous agent's work:
```bash
python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential
```

Example workflow for a research question:
1. **Planner Agent**: Creates a step-by-step plan
   - Prompt: "Design a clear, step-by-step plan for how to solve the question"
   - Output: A detailed plan (in latent space)
2. **Critic Agent**: Reviews and critiques the plan
   - Prompt: "Evaluate the correctness of the input plan and provide helpful feedback"
   - Uses: The planner's output (via KV cache)
   - Output: Feedback on the plan
3. **Refiner Agent**: Improves the plan based on feedback
   - Prompt: "Provide a refined step-by-step plan based on previous plan with feedback"
   - Uses: Both planner + critic outputs
   - Output: Refined plan
4. **Judger Agent**: Produces the final answer
   - Prompt: "Solve the target question using latent information for reference"
   - Uses: All previous agent outputs
   - Output: Final detailed answer
Sequential Pattern: Build → Critique → Refine → Solve
Agents approach the same question from different perspectives in parallel, then combine their insights:
```bash
python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt hierarchical
```

Example workflow for the same research question:
1. **Planner (Math Agent)**: Solves from a mathematical perspective
   - Prompt: "You are a math agent. Reason step-by-step and put the final answer inside \boxed{}"
   - Output: Mathematical analysis + answer
2. **Critic (Science Agent)**: Solves from a scientific perspective
   - Prompt: "You are a science agent. Reason step-by-step and put the final answer inside \boxed{}"
   - Output: Scientific analysis + answer
3. **Refiner (Code Agent)**: Solves from a computational perspective
   - Prompt: "You are a code agent. Reason step-by-step and put the final answer inside \boxed{}"
   - Output: Computational analysis + answer
4. **Judger (Synthesizer)**: Combines all perspectives
   - Prompt: "You are a task summarizer. Given responses from previous agents as reference..."
   - Uses: All three parallel analyses
   - Output: Synthesized final answer
Hierarchical Pattern: Multiple Parallel Perspectives → Synthesis
| Aspect | Sequential | Hierarchical |
|---|---|---|
| Agent roles | Specialized workflow stages | Different domain experts |
| Information flow | Linear chain (A→B→C→D) | Parallel then merge (A,B,C→D) |
| Each agent | Performs different tasks | Solves same problem differently |
| Best for | Complex planning/refinement | Multi-perspective analysis |
| Example use case | Multi-step problem solving with feedback loops | Combining math, science, and code perspectives |
Use **Sequential** when:
- You need iterative refinement (plan → critique → improve)
- The problem benefits from explicit feedback loops
- You want a clear workflow of distinct tasks

Use **Hierarchical** when:
- You want diverse perspectives on the same problem
- Different domain expertise adds value (math + science + code)
- You need to combine multiple reasoning approaches
Both modes can be combined with custom prompts (via --custom_prompt_file) to define exactly how each agent behaves.
LatentMAS supports fully customizable prompts via JSON configuration files. This allows you to tailor agent behavior for specific domains or tasks.
Create a JSON file (e.g., prompts.json) with role-specific prompts:
```json
{
  "system": "You are a helpful research assistant.",
  "planner": "Draft a response to the question.\n\nQuestion:\n{question}",
  "critic": "Review the prior answer critically. Give specific, actionable feedback.\n\nQuestion:\n{question}",
  "refiner": "Produce an improved answer that incorporates the Critic's feedback.\n\nQuestion:\n{question}",
  "judger": "Answer the question using prior context as hints. Provide a detailed response.\n\nQuestion:\n{question}",
  "baseline": "You are a problem solver. Reason step by step.\n\nQuestion:\n{question}"
}
```

Supported placeholders:
- `{question}` — The input question
- `{context}` — Previous agent outputs (for text_mas)

System message behavior:
- If `"system"` is provided and non-empty, it is included in the message list
- If `"system"` is empty (`""`) or omitted, no system message is added
- Works across all methods: baseline, text_mas, and latent_mas
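The rules above can be sketched in a few lines. This is a hypothetical illustration (the helper name `build_messages` is ours, not the repo's): the `{question}` placeholder is filled into the role prompt, and a system message is prepended only when `"system"` is non-empty.

```python
import json

# A minimal prompt config like the one described above.
config = json.loads("""
{
  "system": "You are a helpful research assistant.",
  "planner": "Draft a response to the question.\\n\\nQuestion:\\n{question}"
}
""")

def build_messages(role: str, question: str) -> list:
    """Expand a role prompt into a chat message list (hypothetical helper)."""
    messages = []
    system = config.get("system", "")
    if system:  # empty ("") or omitted "system" -> no system message
        messages.append({"role": "system", "content": system})
    # Substitute the {question} placeholder into the role-specific prompt.
    messages.append({"role": "user",
                     "content": config[role].format(question=question)})
    return messages

msgs = build_messages("planner", "What is the structure of spider silk?")
```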
```bash
# For LatentMAS with custom prompts
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task custom \
    --custom_question "What is the structure of spider silk?" \
    --prompt hierarchical \
    --custom_prompt_file prompts_advanced.json

# For TextMAS with custom prompts
python run.py --method text_mas \
    --model_name Qwen/Qwen3-14B \
    --task gsm8k \
    --prompt sequential \
    --custom_prompt_file prompts_advanced.json

# For Baseline with custom prompts
python run.py --method baseline \
    --model_name Qwen/Qwen3-14B \
    --task custom \
    --custom_question "Explain photosynthesis." \
    --custom_prompt_file prompts_advanced.json
```

LatentMAS may work with any Hugging Face model without model-specific requirements (further testing is needed, especially for benchmark performance). To turn off Qwen-specific behavior (system message and model-name validation), use the `--do_not_enforce_qwen` flag:
```bash
# Works with any model
python run.py --method latent_mas \
    --model_name meta-llama/Llama-3.2-3B-Instruct \
    --task gsm8k \
    --prompt sequential --do_not_enforce_qwen
```
Control the thinking prompt tokens that trigger reasoning mode in your model:
```bash
# Use default thinking tokens: <think>\n
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task gsm8k \
    --think

# Use custom thinking tokens
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task gsm8k \
    --think "<reasoning>\n"

# Use multiple custom thinking tokens
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task gsm8k \
    --think "<think>\n<brainstorm>\n"

# Use a chain-of-thought-style prompt
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task gsm8k \
    --think "Let's think step by step:\n"

# No thinking tokens (omit the flag)
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task gsm8k
```

You can define custom agents and their execution order in your `prompts.json` file. The last agent in your list always generates the final text output, regardless of its role name.
Define custom agents:
```json
{
  "agents": [
    {"name": "Researcher", "role": "researcher"},
    {"name": "Analyst", "role": "analyst"},
    {"name": "Writer", "role": "writer"}
  ],
  "researcher": "Research the topic: {question}",
  "analyst": "Analyze the research findings.",
  "writer": "Write the final answer based on analysis."
}
```

Key features:
- Flexible agent count: Use 1, 2, 3, 4, 5+ agents
- Custom roles: Name agents based on their actual function
- Last agent behavior: The last agent in the list always produces decoded text output
- All other agents: Generate latent representations (compressed reasoning)
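The "last agent produces text, all others produce latents" rule can be shown with a small hypothetical sketch (names and structure are ours, not the repo's internals): the `agents` list is resolved into an execution plan by position alone.

```python
import json

# A custom agent list like the example above.
config = json.loads("""
{
  "agents": [
    {"name": "Researcher", "role": "researcher"},
    {"name": "Analyst", "role": "analyst"},
    {"name": "Writer", "role": "writer"}
  ]
}
""")

# Resolve each agent's output mode: only the final agent decodes text;
# every earlier agent emits latent representations (compressed reasoning).
plan = [
    {"name": agent["name"], "role": agent["role"],
     "output": "text" if i == len(config["agents"]) - 1 else "latent"}
    for i, agent in enumerate(config["agents"])
]
```

Note that the rule depends only on list position, so renaming the last agent (e.g., to "Critic") would not change which agent produces the final text.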
Example with single agent:
```json
{
  "agents": [
    {"name": "Expert", "role": "expert"}
  ],
  "expert": "Answer the question: {question}"
}
```

Example with 5 agents:

```json
{
  "agents": [
    {"name": "Planner", "role": "planner"},
    {"name": "Researcher", "role": "researcher"},
    {"name": "Critic", "role": "critic"},
    {"name": "Refiner", "role": "refiner"},
    {"name": "Synthesizer", "role": "synthesizer"}
  ]
}
```

The `--first_agent_text` flag enables the first agent to generate text instead of latent representations. This is particularly useful for:
1. Graph reasoning models: Models trained to express graphs in natural language can output their structure as text, then subsequent agents reason over it efficiently in latent space.
2. Structured reasoning: Models that benefit from explicitly generating intermediate structures (plans, outlines, decompositions) in text form before latent reasoning.
3. GRPO-trained models: Models trained via Group Relative Policy Optimization to produce specific text patterns can maintain their learned behavior while benefiting from latent efficiency.
```bash
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task custom \
    --custom_question "Design a molecular structure for a biodegradable plastic." \
    --latent_steps 64 \
    --first_agent_text \
    --max_new_tokens 3000
```

How it works:
- First agent generates explicit text output (e.g., graph structure, detailed plan)
- Middle agents use latent-space reasoning (efficient, compressed thinking)
- Last agent produces the final answer using all accumulated context
Technical notes:
- The first agent's generated text becomes part of the KV cache that subsequent agents attend to
- Set `--max_new_tokens` appropriately for the first agent's output complexity
- The `--latent_steps` parameter only affects middle agents (not the first or last)
- EOS tokens are preserved to maintain the chat template structure
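The mode assignment described above can be sketched as a small hypothetical function (not the repo's actual code): the last agent always emits text, the first agent emits text only when `--first_agent_text` is set, and everyone in between reasons in latent space.

```python
def agent_modes(n_agents: int, first_agent_text: bool) -> list:
    """Return each agent's output mode under the --first_agent_text rule."""
    modes = []
    for i in range(n_agents):
        if i == n_agents - 1:
            modes.append("text")      # last agent: decoded final answer
        elif i == 0 and first_agent_text:
            modes.append("text")      # e.g., a plan or graph in plain text
        else:
            modes.append("latent")    # compressed latent reasoning steps
    return modes
```

For four agents with the flag on, this yields text → latent → latent → text, matching the first/middle/last behavior described above.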
Combining all advanced features for a research task:
```bash
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task custom \
    --custom_question "Give me a research idea to make a composite inspired by spider silk." \
    --custom_prompt_file prompts_advanced.json \
    --prompt hierarchical \
    --latent_steps 64 \
    --think "<scientific_reasoning>\n" \
    --first_agent_text \
    --max_new_tokens 3000 \
    --temperature 0.4
```

This setup:
- Uses custom research-focused prompts from `prompts_advanced.json`
- Runs on a custom scientific question
- Uses hierarchical agent organization
- Applies custom thinking tokens
- First agent generates text for structured output
- Remaining agents use latent reasoning with 64 steps
For a sequential workflow (plan β critique β refine) using custom agents:
```bash
python run.py --method latent_mas \
    --model_name Qwen/Qwen3-14B \
    --task custom \
    --custom_question "Give me a research idea to make a composite inspired by spider silk." \
    --latent_steps 64 \
    --prompt sequential \
    --custom_prompt_file prompts_advanced.json \
    --max_new_tokens 3000
```

Example `prompts.json` file:
```json
{
  "agents": [
    {"name": "Researcher", "role": "researcher"},
    {"name": "Critic", "role": "critic"},
    {"name": "Writer", "role": "writer"}
  ],
  "researcher": "Question:\n{question}\n\nBrainstorm and research the topic.",
  "critic": "Question:\n{question}\n\nCritically review the research.",
  "writer": "Question:\n{question}\n\nYou are provided with latent information for reference and a target question to solve.\n\nThe latent information might contain irrelevant contents. Ignore it if it is not helpful for solving the target question.\n\nWrite a final answer, very detailed, without outputting other irrelevant information."
}
```

Workflow:
- Researcher → Brainstorms and researches (latent, 64 steps)
- Critic → Reviews the research (latent, 64 steps)
- Writer → Synthesizes into the final answer (text output)
Key differences from hierarchical mode:
- Sequential builds a chain of reasoning: Research → Critique → Write
- Each agent has a distinct role in the workflow
- Information flows linearly through agents
- Best for iterative refinement tasks
Associated `prompts_advanced.json` file:

```json
{
  "agents": [
    {"name": "Strategist", "role": "strategist"},
    {"name": "Investigator", "role": "investigator"},
    {"name": "Evaluator", "role": "evaluator"},
    {"name": "Synthesizer", "role": "synthesizer"},
    {"name": "Communicator", "role": "communicator"}
  ],
  "strategist": "Question:\n{question}\n\nYou are a strategic planning agent. Your task is to:\n1. Break down the question into key components\n2. Identify the core challenges and requirements\n3. Outline a high-level approach to address the question\n4. Highlight critical assumptions and constraints\n\nProvide a clear strategic framework for solving this problem.",
  "investigator": "Question:\n{question}\n\nYou are a deep investigation agent. Building on the strategic framework, your task is to:\n1. Conduct thorough analysis of each component\n2. Explore multiple perspectives and approaches\n3. Identify relevant principles, methods, and precedents\n4. Uncover potential challenges and opportunities\n5. Generate detailed insights and findings\n\nProvide comprehensive investigative findings.",
  "evaluator": "Question:\n{question}\n\nYou are a critical evaluation agent. Your task is to:\n1. Rigorously assess the strategic approach and investigative findings\n2. Identify logical flaws, gaps, or weaknesses\n3. Challenge assumptions and test robustness\n4. Propose improvements and alternative considerations\n5. Verify consistency and completeness\n\nProvide a thorough critical evaluation with specific recommendations for improvement.",
  "synthesizer": "Question:\n{question}\n\nYou are a synthesis agent. Your task is to:\n1. Integrate the strategy, investigation, and evaluation into a coherent whole\n2. Resolve conflicts and reconcile different perspectives\n3. Strengthen weak points identified in the evaluation\n4. Construct a comprehensive solution or answer\n5. Ensure logical flow and completeness\n\nProvide an integrated, well-reasoned synthesis.",
  "communicator": "Question:\n{question}\n\nYou are provided with latent information containing strategic planning, deep investigation, critical evaluation, and synthesis from previous agents.\n\nYour task is to:\n1. Extract the most valuable insights from the latent context\n2. Organize the information in a clear, logical structure\n3. Present a comprehensive, well-articulated final answer\n4. Ensure the response is precise, actionable, and complete\n5. Address the original question directly and thoroughly\n\nProvide a polished, professional final answer that leverages all prior analysis."
}
```

💫 If you find LatentMAS helpful, please kindly give us a star ⭐️ and cite it below. Thanks!
```bibtex
@article{zou2025latentmas,
  title={Latent Collaboration in Multi-Agent Systems},
  author={Zou, Jiaru and Yang, Xiyuan and Qiu, Ruizhong and Li, Gaotang and Tieu, Katherine and Lu, Pan and Shen, Ke and Tong, Hanghang and Choi, Yejin and He, Jingrui and Zou, James and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2511.20639},
  year={2025}
}
```
This code is partially based on the amazing work of vLLM.
This code is based on the amazing LatentMAS repo, adapted here for flexible use cases in scientific and technical applications.




