πŸ”„ Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

| Paper | Slide | Poster | Video |


News πŸ”₯

  • [2025/6] Token Recycling is selected as an Oral presentation at ACL 2025! See you in Vienna! πŸ‡¦πŸ‡Ή
  • [2025/5] Token Recycling is accepted by ACL 2025!

Introduction

We introduce Token Recycling:

  • A plug-and-play speculative decoding method to accelerate LLM inference!
  • Only requires minimal storage space (<2MB), without the need for a draft model or a data store!
  • Achieves approximately 2x speedup on all sizes of LLMs!

Background: Retrieval-Based Speculative Decoding

Traditional speculative decoding methods face significant limitations:

  • Model-based methods require training and maintaining additional draft models, increasing complexity and computational overhead.
  • Retrieval-based methods need large-scale data stores and time-consuming retrieval, or suffer from a retrieval space that lacks flexibility and adaptability.

Token Recycling: Turning Trash into Treasure! πŸ”„

LLMs generate a probability distribution over the vocabulary at each step, yet existing methods keep only the sampled token, treating all others as 'trash'. We observe that these other high-probability candidate tokens frequently appear in subsequent generations, making them a valuable but overlooked 'treasure'.

Based on this observation, we propose Token Recycling (TR). It uses an adjacency matrix to store candidate tokens for every token in the vocabulary. At each generation step, a Breadth-First Search (BFS) is performed on this matrix, starting from the last token of the current input, to generate a draft tree. After the model forward pass with tree attention and verification, the newly generated candidate tokens are used to update the adjacency matrix. In this way, 'trash' tokens are turned into 'treasure'!
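
To make the loop concrete, here is a minimal, self-contained sketch of the idea (it is not the code in this repository; the array shapes, the helper names build_draft_tree/update_adjacency, and the initially empty matrix are assumptions for illustration):

import numpy as np

VOCAB_SIZE, TOPK = 32000, 8                        # illustrative sizes; the real vocabulary depends on the model
adj_matrix = np.zeros((VOCAB_SIZE, TOPK), dtype=np.int64)  # row t = top-k candidates previously seen after token t

def build_draft_tree(last_token, tree_template):
    # BFS over the adjacency matrix guided by a static tree template.
    # tree_template is a list of index paths such as [0], [0, 1], [0, 1, 2],
    # assumed to be sorted by depth so every parent is visited before its children.
    draft = {(): last_token}                       # path -> drafted token; the root is the current last token
    for path in tree_template:
        parent_token = draft[tuple(path[:-1])]
        draft[tuple(path)] = int(adj_matrix[parent_token, path[-1]])
    return draft

def update_adjacency(recycled):
    # After verification, write each position's freshly observed top-k candidates
    # back into the matrix, so discarded 'trash' becomes the next step's draft material.
    for token, topk_candidates in recycled:        # (input token id, its top-k output token ids)
        adj_matrix[token] = topk_candidates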

Experimental Results πŸ“Š

On SpecBench, TR achieves more than a 2x speedup on the 7B model, nearly 30% higher than previous training-free methods. Even compared to Medusa, which requires tuning, it shows an improvement of almost 25%. For the 13B and 33B models, it consistently provides nearly 2x speedup, maintaining the 30% acceleration advantage.

| Method | Memory (MB) | Tokens/s | Speedup |
|--------|-------------|----------|---------|
| Eagle1 | 500         | 106.94   | 2.08    |
| Eagle2 | 500         | 116.95   | 2.28    |
| TR     | 1.95        | 107.52   | 2.09    |

TR is also compared with the SOTA speculative decoding methods, Eagle1/2, on SpecBench. Surprisingly, despite being completely training-free and self-drafting, TR outperforms Eagle1. Moreover, TR achieves 91.23% of the acceleration performance of Eagle2 while requiring only 0.39% of the memory used by Eagle2. It is worth noting that TR still employs a static tree, rather than the dynamic tree used in Eagle2. This highlights the remarkable effectiveness and efficiency of TR and shows potential for further improvement when combining TR with dynamic trees.

Usage

Install

This codebase is built on Spec-Bench; no other packages are needed except those used by Spec-Bench.

git clone https://github.com/Luowaterbi/TokenRecycling.git
cd TokenRecycling
pip install -r requirements.txt

Note: Make sure you have the required models downloaded. The evaluation script expects models in the ../models/ directory relative to the project root.

Project Structure

TokenRecycling/
β”œβ”€β”€ README.md                    # Project documentation
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ eval.sh                      # Quick evaluation script
β”œβ”€β”€ data/                        # Dataset files
β”œβ”€β”€ evaluation/
β”‚   β”œβ”€β”€ inference_recycling.py  # Main inference implementation
β”‚   └── eval.py                  # Evaluation utilities
└── model/
    └── recycling/
        β”œβ”€β”€ tree_template_.py   # Tree structure definitions
        β”œβ”€β”€ kv_cache.py         # Key-value cache management
        └── modeling_llama_kv.py # LLaMA model modifications

Inference

Quick Start

bash eval.sh

Advanced Usage

You can customize the inference parameters:

python -m evaluation.inference_recycling \
    --model-path /path/to/your/model \
    --model-id your-model-id \
    --bench-name spec_bench \
    --dtype float16 \
    --output-id-topk 8 \
    --tree-version "2.2.2" \
    --temperature 0.0

Candidate Storage Configuration πŸ› οΈ

You can customize the number of candidate tokens stored for each token in the vocabulary using the --output-id-topk argument. This parameter controls how many high-probability "treasure" tokens are collected from the output distribution to build and update the adjacency matrix.

For implementation details, you can refer to evaluation/inference_recycling.py.
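
As a rough sketch of what this flag governs (not the actual code in evaluation/inference_recycling.py; the function and tensor names below are made up for illustration), taking the top --output-id-topk token ids from each position's logits and writing them into the adjacency matrix could look like this:

import torch

def recycle_topk(logits, input_ids, adj_matrix, output_id_topk=8):
    # logits:     [seq_len, vocab_size] model outputs for the accepted tokens
    # input_ids:  [seq_len] the tokens those logits were produced after
    # adj_matrix: [vocab_size, output_id_topk] candidate store being maintained
    candidates = torch.topk(logits, k=output_id_topk, dim=-1).indices  # the recycled 'treasure'
    adj_matrix[input_ids] = candidates  # row t now holds the latest candidates observed after token t
    return adj_matrix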

Tree Structure Configuration 🌳

Token Recycling uses a fixed tree template (version "2.2.2" by default) to organize candidate tokens in a tree structure.

Tree Structure Visualization

Tree Representation:

  • Root: Represents the current input's last token.
  • Index Meaning: Numbers on the tree represent the node's index position in its parent's candidate list (0-7 for top-k=8).
Root (current input's last token.)
β”œβ”€β”€ [0] β†’ 0th candidate token
β”‚   β”œβ”€β”€ [0,0] β†’ 0th candidate after [0]
β”‚   β”‚   └── [0,0,0] β†’ 0th candidate after [0,0]
β”‚   └── [0,1] β†’ 1st candidate after [0]
β”œβ”€β”€ [1] β†’ 1st candidate token
β”‚   └── [1,0] β†’ 0th candidate after [1]
└── [2] β†’ 2nd candidate token

Tree Template Implementation: The tree structure is defined in model/recycling/tree_template_.py.

  • Each node is represented by a path from root: [depth_0_index, depth_1_index, ...]
    • [0]: The 0th child of root
    • [0,1]: The 1st child of root's 0th child
    • [0,1,2]: The 2nd child of the node at path [0,1]
  • The current default template "2.2.2" has 80 nodes (excluding the root) organized in 6 levels (see the counting sketch after this list):
    • Level 1: 8 nodes [0], [1], ..., [7]
    • Level 2: 21 nodes like [0,0], [0,1], ..., [7,0]
    • Level 3: 25 nodes like [0,0,0], [0,0,1], ..., [5,0,0]
    • Level 4: 15 nodes like [0,0,0,0], [0,0,0,1], ..., [3,0,0,0]
    • Level 5: 8 nodes like [0,0,0,0,0], [0,0,0,0,1], ..., [1,0,0,0,0]
    • Level 6: 3 nodes [0,0,0,0,0,0], [0,0,0,0,0,1], [0,0,0,1,0,0]
    • Path [0,1,2] means: "From root, take the 0th candidate, then from that token take the 1st candidate, then from that token take the 2nd candidate"
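
If you define your own template in this path format, a few lines like the following (using a made-up mini template rather than the shipped "2.2.2" one) can sanity-check the node count per level:

from collections import Counter

# Hypothetical mini template in the same path format as model/recycling/tree_template_.py
mini_template = [
    [0], [1], [2],            # 3 nodes at level 1
    [0, 0], [0, 1], [1, 0],   # 3 nodes at level 2
    [0, 0, 0],                # 1 node at level 3
]

level_sizes = Counter(len(path) for path in mini_template)
for level in sorted(level_sizes):
    print(f"Level {level}: {level_sizes[level]} nodes")
print(f"Total (excluding root): {sum(level_sizes.values())} nodes")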

How to Customize Tree Structure:

  1. Add a new tree version with different branching factors:

    if version == "mine":
        # Custom number of candidates at each depth
        tree_template = [
            [0], [1], [2], [3], [4],     # 5 candidates at depth 1
            [0,0], [0,1], [1,0], [1,1],  # 4 candidates at depth 2
            # ... continue pattern
        ]
  2. Use your custom tree:

    python -m evaluation.inference_recycling \
        --tree-version "mine" \
        # ... other parameters

Todo

We are actively working to expand Token Recycling's compatibility and ease of use. Contributions are welcome!

  • Adapt to newer models (currently based on Vicuna-1.3)
  • Transformers model.generate() support
  • vLLM & SGLang integration

Seeking help: If you have experience with integration and would like to contribute, please contact us! Your help would be greatly appreciated! πŸ™

Citation

@article{luo2024turning,
  title={Turning trash into treasure: Accelerating inference of large language models with token recycling},
  author={Luo, Xianzhen and Wang, Yixuan and Zhu, Qingfu and Zhang, Zhiming and Zhang, Xuanyu and Yang, Qing and Xu, Dongliang},
  journal={arXiv preprint arXiv:2408.08696},
  year={2024}
}
