Multi-agent systems (MAS) typically communicate through text, forcing the extra step of decoding latent representations into tokens before passing them to the next agent. The LatentMAS framework, proposed by Zou et al., 2025 [1], instead shares the transformer's key-value (KV) caches directly between agents, yielding significant speed-ups, accuracy gains, and up to 80% less token usage. However, this framework introduces a new challenge: KV caches grow linearly with the number of agents in the system. This work explores k-nearest-neighbor (kNN) retrieval over cached keys, following the Memorizing Transformers paper by Wu et al., 2022 [6], as a mechanism to limit KV cache size. In the end, we are able to trim KV cache memory by 40% and speed up answer generation by 29% while maintaining near-full LatentMAS accuracy. We discuss the full process of experimentation and provide intuition for our results. Ultimately, these findings suggest that latent communication contains structured, layer-dependent information that can be selectively compressed without significantly compromising performance, opening an avenue for further development of efficient latent MAS design.
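To make the scaling concern concrete, here is a back-of-envelope estimate of KV cache memory as agents are added. The model dimensions below are illustrative assumptions (roughly 4B-scale with grouped-query attention), not measurements from our experiments:

```python
# Back-of-envelope: KV cache memory grows linearly with the number of agents.
# Dimensions are assumptions for a ~4B model (36 layers, 8 KV heads,
# head_dim 128, fp16); swap in your model's config for real numbers.
def kv_cache_bytes(n_agents, tokens_per_agent=1024,
                   n_layers=36, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # keys + values
    return n_agents * tokens_per_agent * per_token

for agents in (1, 2, 4, 8):
    print(f"{agents} agents: {kv_cache_bytes(agents) / 1e9:.2f} GB")
```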
Read the blog post here for more details.
This repository is based on the LatentMAS framework (Zou et al., 2025), a multi-agent reasoning framework that moves agent collaboration from token space into the model's latent space.
Key Features:
- Efficient multi-step reasoning with drastically fewer tokens
- Training-free latent-space alignment for stable generation
- KNN-based KV cache filtering for memory-efficient agent communication
- Three selection strategies: top-k similarity, bottom-k diversity, and random baseline
This implementation extends the original LatentMAS with experimental KNN filtering capabilities for the KV cache, enabling more efficient memory usage during multi-agent collaboration.
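To illustrate the core handoff idea, here is a minimal two-agent sketch using the Hugging Face `transformers` API: agent B consumes agent A's KV cache via `past_key_values` instead of re-reading A's decoded text. This is a simplification for intuition; the full orchestration, latent steps, and realignment live in `models.py` and `methods/latent_mas.py`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype="auto")

# Agent A processes the problem and produces a KV cache.
a_inputs = tok("Solve: 17 * 24 = ?", return_tensors="pt")
with torch.no_grad():
    a_out = model(**a_inputs, use_cache=True)

# Agent B continues directly from A's cache: its new tokens attend to A's
# cached keys/values, so A's "reasoning" is never decoded into text.
b_inputs = tok(" Verify the result step by step.", return_tensors="pt")
with torch.no_grad():
    b_out = model(
        input_ids=b_inputs.input_ids,
        past_key_values=a_out.past_key_values,
        use_cache=True,
    )
```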
This implementation supports the following datasets:
- GSM8K: Grade school math problems
- GPQA (Diamond): Graduate-level science questions
- MedQA: Medical question answering
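All three are available on the Hugging Face Hub. For a quick look at one of them (the loaders in `data.py` may use different dataset IDs or splits):

```python
from datasets import load_dataset

# Inspect a GSM8K test example; "openai/gsm8k" is the public Hub ID.
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])
```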
We recommend setting your HF cache directory to avoid repeated downloads:
```bash
export HF_HOME=/path/to/huggingface
export TRANSFORMERS_CACHE=$HF_HOME
export HF_DATASETS_CACHE=$HF_HOME
```

Models and datasets will automatically be downloaded into `$HF_HOME`.
```bash
git clone https://github.com/YourRepo/LatentMAS.git
cd LatentMAS
conda create -n latentmas python=3.10 -y
conda activate latentmas
pip install -r requirements.txt
```

Repository structure:

```
LatentMAS/
│── run.py # Main entry for experiments
│── models.py # Wrapper for HF models + latent realignment
│── methods/
│ ├── baseline.py # Single-agent baseline
│ ├── text_mas.py # Token-space multi-agent method
│ └── latent_mas.py # Latent-space multi-agent (with KNN filtering)
│── prompts.py # Prompt constructors
│── prompts_v2.py # Updated prompt constructors for the bottom-kNN strategy
│── data.py # Dataset loaders (GSM8K, GPQA, MedQA)
│── data/ # Provided data + figures
│── utils.py # Answer parsing / timeout / helpers
│── example_logs/ # Example logs from LatentMAS
│── requirements.txt
```
```bash
# Single-agent baseline
python run.py --method baseline --model_name Qwen/Qwen3-4B --task gsm8k --max_samples 100

# Token-space multi-agent (Text-MAS)
python run.py --method text_mas --model_name Qwen/Qwen3-4B --task gsm8k --prompt sequential --max_samples 100

# Latent-space multi-agent (LatentMAS)
python run.py --method latent_mas --model_name Qwen/Qwen3-4B --task gsm8k --latent_steps 10 --prompt sequential --max_samples 100
```

Key arguments:
- `--latent_steps` ∈ [0, 80]: number of latent reasoning steps per agent (see the sketch below). Typically 10–20 works well.
- `--latent_space_realign`: enables latent→embedding alignment for better generation stability.
- `--prompt` ∈ {sequential, hierarchical}: prompt structure for agent collaboration.

Example with latent-space realignment:

```bash
python run.py --method latent_mas --model_name Qwen/Qwen3-4B --task gsm8k --latent_steps 10 --latent_space_realign --max_samples 100
```
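Conceptually, a latent reasoning step feeds the model's last hidden state back in as the next input embedding instead of sampling a token. The sketch below is a simplification of that loop; the function name and the realignment hook are our illustrative assumptions, not the repository's exact code:

```python
import torch

@torch.no_grad()
def run_latent_steps(model, inputs_embeds, n_steps=10, realign=None):
    """Illustrative latent-step loop; see models.py for the real version.

    inputs_embeds: [batch, seq, hidden] embeddings of the agent's prompt.
    realign: optional callable mapping hidden states back toward the
             input-embedding manifold (what --latent_space_realign enables).
    """
    past, x = None, inputs_embeds
    for _ in range(n_steps):
        out = model(inputs_embeds=x, past_key_values=past,
                    use_cache=True, output_hidden_states=True)
        past = out.past_key_values
        h = out.hidden_states[-1][:, -1:, :]  # last position, last layer
        x = realign(h) if realign is not None else h
    return past  # KV cache now carries the agent's latent reasoning
```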
This implementation includes experimental KNN-based filtering of the KV cache to reduce memory usage during agent-to-agent communication; a sketch of the selection logic follows the flag list below.
Flags:
- `--knn_filter`: enable KNN filtering of the KV cache.
- `--knn_percentage` (default: 0.8): fraction of tokens to keep (0.0–1.0). E.g., 0.8 keeps 80% of the cache.
- `--knn_min_keep` (default: 5): minimum number of recent tokens to always preserve, regardless of similarity.
- `--knn_strategy` ∈ {top, bottom, random} (default: top):
  - `top`: keep the most similar tokens
  - `bottom`: keep the least similar tokens
  - `random`: keep a random subset of tokens (baseline)
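For intuition, here is a minimal per-head sketch of similarity-based cache filtering. The scoring rule (cosine similarity against the most recent key, as a cheap proxy for the next query) and the tensor shapes are assumptions for illustration; the actual implementation lives in `methods/latent_mas.py`:

```python
import torch
import torch.nn.functional as F

def knn_filter_kv(keys, values, keep_frac=0.8, min_keep=5, strategy="top"):
    """Filter one attention head's KV cache ([seq_len, head_dim] tensors)."""
    seq_len = keys.size(0)
    n_keep = max(min_keep, int(seq_len * keep_frac))
    if n_keep >= seq_len:
        return keys, values

    # Score cached keys by similarity to the most recent key.
    scores = F.cosine_similarity(keys, keys[-1].unsqueeze(0), dim=-1)

    if strategy == "top":        # keep the most similar entries
        idx = scores.topk(n_keep).indices
    elif strategy == "bottom":   # keep the least similar (most diverse) entries
        idx = (-scores).topk(n_keep).indices
    else:                        # random baseline
        idx = torch.randperm(seq_len)[:n_keep]

    # Always preserve the last min_keep positions (mirrors --knn_min_keep).
    recent = torch.arange(seq_len - min_keep, seq_len)
    idx = torch.unique(torch.cat([idx, recent]))  # sorted, keeps cache order
    return keys[idx], values[idx]
```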
Example runs:

```bash
# GSM8K: keep 80% of the cache with top-k similarity selection
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task gsm8k \
    --latent_steps 10 \
    --max_samples 10 \
    --knn_filter \
    --knn_percentage 0.8 \
    --knn_strategy top
```

```bash
# GPQA: more aggressive filtering, keeping only 50% of the cache
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task gpqa \
    --latent_steps 10 \
    --max_samples 10 \
    --knn_filter \
    --knn_percentage 0.5 \
    --knn_strategy top
```

```bash
# MedQA: bottom-k diversity selection
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task medqa \
    --latent_steps 10 \
    --max_samples 10 \
    --knn_filter \
    --knn_percentage 0.8 \
    --knn_strategy bottom
```

```bash
# Full example: hierarchical prompting, realignment, and KNN filtering
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task gsm8k \
    --prompt hierarchical \
    --latent_steps 20 \
    --max_samples 100 \
    --latent_space_realign \
    --knn_filter \
    --knn_percentage 0.7 \
    --knn_min_keep 5 \
    --knn_strategy top \
    --temperature 0.6 \
    --seed 42
```

This implementation is based on the LatentMAS paper. If you find this work helpful, please cite:
```bibtex
@article{zou2025latentmas,
  title={Latent Collaboration in Multi-Agent Systems},
  author={Zou, Jiaru and Yang, Xiyuan and Qiu, Ruizhong and Li, Gaotang and Tieu, Katherine and Lu, Pan and Shen, Ke and Tong, Hanghang and Choi, Yejin and He, Jingrui and Zou, James and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2511.20639},
  year={2025}
}
```
This code is based on the LatentMAS framework by Zou et al., 2025. The KNN cache filtering extension was developed independently for research purposes.