| Paper | Slide | Poster | Video |
News 🔥
- [2025/6] Token Recycling is selected for an Oral presentation at ACL 2025! See you in Vienna! 🇦🇹
- [2025/5] Token Recycling is accepted to ACL 2025!
We introduce Token Recycling:
- A plug-and-play speculative decoding method to accelerate LLM inference!
- Only requires minimal storage space (<2MB), without the need for a draft model or a data store!
- Achieves approximately 2x speedup on all sizes of LLMs!
Traditional speculative decoding methods face significant limitations:
- Model-based methods require training and maintaining additional draft models, increasing complexity and computational overhead.
- Retrieval-based methods need large-scale data stores and time-consuming retrieval, or suffer from a retrieval space that lacks flexibility and adaptability.
LLMs generate a probability distribution over the vocabulary at each step, yet existing methods keep only the sampled token, treating all others as 'trash'. We observe that these other high-probability candidate tokens frequently appear in subsequent generations, making them a valuable but overlooked 'treasure'.
Based on this observation, we propose Token Recycling (TR). It uses an adjacency matrix to store candidate tokens for every token in the vocabulary. In each generation step, a breadth-first search (BFS) is performed on this matrix, starting from the last token of the current input, to generate a draft tree. After the model forward pass with tree attention and verification, the newly generated candidate tokens are used to update the adjacency matrix. In this way, 'trash' tokens are turned into 'treasure'!
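To make the retrieval-free drafting concrete, here is a minimal sketch of reading a draft off the adjacency matrix by walking a static tree template in BFS order. Names such as `adj_matrix` and `draft_from_matrix` are hypothetical; the actual implementation lives in evaluation/inference_recycling.py.

```python
import torch

# Hypothetical sketch, not the project's actual code: an adjacency matrix that
# stores, for every vocabulary token, the ids of its top-k candidate successors.
# 32000 x 8 entries stays around 2 MB even in int64.
vocab_size, topk = 32000, 8
adj_matrix = torch.zeros(vocab_size, topk, dtype=torch.long)

def draft_from_matrix(last_token: int, tree_template: list) -> list:
    """Walk the static tree template (listed level by level, i.e. BFS order)
    and look up each node's token in the adjacency matrix."""
    token_at_path = {(): last_token}  # path tuple -> token id; () is the root
    draft = []
    for path in tree_template:
        parent = token_at_path[tuple(path[:-1])]   # parent appears earlier in BFS order
        child = int(adj_matrix[parent, path[-1]])  # path[-1]-th candidate of the parent
        token_at_path[tuple(path)] = child
        draft.append(child)
    return draft

# Example: draft candidates for a tiny 2-level template.
print(draft_from_matrix(last_token=42, tree_template=[[0], [1], [0, 0]]))
```

After each verification step, the rows of this matrix for the tokens just processed are overwritten with the fresh top-k candidates, so the 'treasure' stays up to date without any training or external datastore.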
On Spec-Bench, TR achieves more than a 2x speedup on the 7B model, nearly 30% higher than previous training-free methods. Even compared to Medusa, which requires training, it shows an improvement of almost 25%. For the 13B and 33B models, it consistently provides nearly a 2x speedup, maintaining the 30% acceleration advantage.
| Method | Memory (MB) | Tokens/s | Speedup |
|---|---|---|---|
| Eagle1 | 500 | 106.94 | 2.08 |
| Eagle2 | 500 | 116.95 | 2.28 |
| TR | 1.95 | 107.52 | 2.09 |
TR is also compared with the SOTA speculative decoding methods Eagle1/2 on Spec-Bench. Surprisingly, despite being completely training-free and self-drafting, TR outperforms Eagle1. Moreover, TR achieves 91.23% of the acceleration performance of Eagle2 while requiring only 0.39% of the memory used by Eagle2. It is worth noting that TR still employs a static tree, rather than the dynamic tree used in Eagle2. This highlights the remarkable effectiveness and efficiency of TR and shows potential for further improvement when combining TR with dynamic trees.
This codebase is built on Spec-Bench; no packages other than those used by Spec-Bench are needed.
```bash
git clone https://github.com/Luowaterbi/TokenRecycling.git
cd TokenRecycling
pip install -r requirements.txt
```

Note: Make sure you have the required models downloaded. The evaluation script expects models in the ../models/ directory relative to the project root.
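For example, one optional way to fetch Vicuna-1.3 weights into that layout is huggingface_hub; the repo id and target path below are assumptions about your setup, not requirements of this codebase.

```python
# Hypothetical convenience snippet; adjust repo_id and local_dir to your setup.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lmsys/vicuna-7b-v1.3",        # Vicuna-1.3, the model family this repo currently targets
    local_dir="../models/vicuna-7b-v1.3",  # the evaluation script looks for models under ../models/
)
```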
```
TokenRecycling/
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
├── eval.sh                     # Quick evaluation script
├── data/                       # Dataset files
├── evaluation/
│   ├── inference_recycling.py  # Main inference implementation
│   └── eval.py                 # Evaluation utilities
└── model/
    └── recycling/
        ├── tree_template_.py       # Tree structure definitions
        ├── kv_cache.py             # Key-value cache management
        └── modeling_llama_kv.py    # LLaMA model modifications
```
```bash
bash eval.sh
```

You can customize the inference parameters:

```bash
python -m evaluation.inference_recycling \
    --model-path /path/to/your/model \
    --model-id your-model-id \
    --bench-name spec_bench \
    --dtype float16 \
    --output-id-topk 8 \
    --tree-version "2.2.2" \
    --temperature 0.0
```

You can customize the number of candidate tokens stored for each token in the vocabulary using the `--output-id-topk` argument. This parameter controls how many high-probability "treasure" tokens are collected from the output distribution to build and update the adjacency matrix.
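As a rough illustration of what that update looks like (a hypothetical sketch with made-up names, not the actual code in evaluation/inference_recycling.py), the top-k ids of the output distribution at every position are written back into the matching rows of the adjacency matrix:

```python
import torch

def recycle_candidates(logits: torch.Tensor, token_ids: torch.Tensor,
                       adj_matrix: torch.Tensor, output_id_topk: int = 8) -> None:
    """Hypothetical sketch: refresh adjacency-matrix rows with new top-k candidates.

    logits:    (seq_len, vocab_size) distribution from the latest forward pass
    token_ids: (seq_len,) the tokens at those positions; their matrix rows get updated
    """
    topk_ids = logits.topk(output_id_topk, dim=-1).indices  # (seq_len, output_id_topk)
    adj_matrix[token_ids] = topk_ids                        # row of token t <- candidates likely to follow t
```

A larger `--output-id-topk` stores more candidates per token, at the cost of a proportionally larger matrix.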
For implementation details, you can refer to evaluation/inference_recycling.py.
Token Recycling uses a fixed tree template (version "2.2.2" by default) to organize candidate tokens in a tree structure.
Tree Representation:
- Root: Represents the current input's last token.
- Index Meaning: Numbers on the tree represent the node's index position in its parent's candidate list (0-7 for top-k=8).
```
Root (current input's last token)
├── [0] → 0th candidate token
│   ├── [0,0] → 0th candidate after [0]
│   │   └── [0,0,0] → 0th candidate after [0,0]
│   └── [0,1] → 1st candidate after [0]
├── [1] → 1st candidate token
│   └── [1,0] → 0th candidate after [1]
└── [2] → 2nd candidate token
```
Tree Template Implementation:
The tree structure is defined in model/recycling/tree_template_.py.
- Each node is represented by a path from the root: `[depth_0_index, depth_1_index, ...]`
  - `[0]`: the 0th child of the root
  - `[0,1]`: the 1st child of the root's 0th child
  - `[0,1,2]`: the 2nd child of the node at path `[0,1]`
- The current default template "2.2.2" has 80 nodes (excluding the root) organized in 6 levels:
  - Level 1: 8 nodes: `[0], [1], ..., [7]`
  - Level 2: 21 nodes like `[0,0], [0,1], ..., [7,0]`
  - Level 3: 25 nodes like `[0,0,0], [0,0,1], ..., [5,0,0]`
  - Level 4: 15 nodes like `[0,0,0,0], [0,0,0,1], ..., [3,0,0,0]`
  - Level 5: 8 nodes like `[0,0,0,0,0], [0,0,0,0,1], ..., [1,0,0,0,0]`
  - Level 6: 3 nodes: `[0,0,0,0,0,0], [0,0,0,0,0,1], [0,0,0,1,0,0]`
- Path `[0,1,2]` means: "From the root, take the 0th candidate, then from that token take the 1st candidate, then from that token take the 2nd candidate" (see the path-format sketch below).
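As a small, runnable illustration of the path format (the real template is defined in model/recycling/tree_template_.py, not here), the snippet below builds a toy template and counts nodes per level; for the default "2.2.2" template the counts would be 8, 21, 25, 15, 8, and 3, i.e. 80 nodes in total.

```python
from collections import Counter

# Toy template in the same path-list format (illustrative, not the real "2.2.2" tree).
tree_template = [
    [0], [1], [2],           # level 1: three children of the root
    [0, 0], [0, 1], [1, 0],  # level 2
    [0, 0, 0],               # level 3
]

# The length of each path is its depth/level.
level_counts = Counter(len(path) for path in tree_template)
print(dict(sorted(level_counts.items())), "total nodes:", len(tree_template))
# -> {1: 3, 2: 3, 3: 1} total nodes: 7
```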
How to Customize Tree Structure:
- Add a new tree version with different branching factors (the templates live in model/recycling/tree_template_.py):

  ```python
  if version == "mine":
      # More candidates at each level
      tree_template = [
          [0], [1], [2], [3], [4],     # 5 candidates at depth 1
          [0,0], [0,1], [1,0], [1,1],  # 4 candidates at depth 2
          # ... continue pattern
      ]
  ```

- Use your custom tree:

  ```bash
  python -m evaluation.inference_recycling \
      --tree-version "mine" \
      # ... other parameters
  ```
We are actively working to expand Token Recycling's compatibility and ease of use. Contributions are welcome!
- Adapt to newer models (currently based on Vicuna-1.3)
- Transformers `model.generate()` support
- vLLM & SGLang integration
Seeking help: If you have experience with these integrations and would like to contribute, please contact us! Your help would be greatly appreciated! 🙏
```bibtex
@article{luo2024turning,
  title={Turning trash into treasure: Accelerating inference of large language models with token recycling},
  author={Luo, Xianzhen and Wang, Yixuan and Zhu, Qingfu and Zhang, Zhiming and Zhang, Xuanyu and Yang, Qing and Xu, Dongliang},
  journal={arXiv preprint arXiv:2408.08696},
  year={2024}
}
```




