# ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

📄 [Paper](https://arxiv.org/abs/2502.14317)

This repository contains the official implementation of ParallelComp, a training-free method for long-context extrapolation that extends Large Language Models' (LLMs) context length from 4K to 128K while maintaining high throughput and preserving perplexity.
## Key Features

- Training-free: No costly fine-tuning required for length extrapolation
- High Performance: Achieves 91.17% of GPT-4's performance on long-context tasks using an 8B model
- Scalable: Extends context length from 4K to 128K tokens
- Efficient: Integrates seamlessly with Flash Attention
- Fast: 23.50x acceleration in the prefilling stage with 1.76x improvement in chunk throughput
- Memory Efficient: Manages ultra-long contexts on a single A100 80GB GPU
## Requirements

- Python 3.9+
- PyTorch 2.5.1
- CUDA-compatible GPU(s)
- Transformers 4.43.2
- An A100 GPU with 80GB of memory for ultra-long contexts
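Before installing, it can help to sanity-check the environment. The commands below are a generic check (they are not part of this repo's scripts) and assume `python` and `nvidia-smi` are on your `PATH`:

```bash
# Generic environment check (not shipped with this repo)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import transformers; print(transformers.__version__)"
nvidia-smi --query-gpu=name,memory.total --format=csv  # confirm GPU model and memory
```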
## Installation

- Clone the repository:

```bash
git clone https://github.com/your-username/ParallelComp.git
cd ParallelComp
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- For development, regenerate the pinned dependencies from requirements.in:

```bash
pip-compile requirements.in
pip install -r requirements.txt
```

## Usage

### Single-GPU Evaluation

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```
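The same flags should carry over to the Qwen evaluation script listed in the project structure below. A minimal sketch, assuming a Qwen2.5 chat checkpoint such as `Qwen/Qwen2.5-7B-Instruct` (the model id is an example, not a tested configuration):

```bash
# Hypothetical Qwen2.5 run: same flags, with the Qwen script and a Qwen model id
bash run_test_longbench_multi_gpu_window8_qwen.sh \
--parallel_pattern parallel_comp --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model Qwen/Qwen2.5-7B-Instruct
```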
### Multi-GPU Evaluation

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp --gpu_nums 4 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```

### Batch Inference

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp_batches --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```

## Supported Models

- LLaMA family models
- Qwen2.5 models
## Supported Datasets

- LongBench: narrativeqa, qasper, multifieldqa_en, hotpotqa, 2wikimqa, musique, gov_report, qmsum, multi_news, trec, triviaqa, samsum, passage_count, passage_retrieval_en, lcc, repobench-p
- InfiniteBench: passkey, number_string, kv_retrieval, math_find, code_debug, longbook_choice_eng, longdialogue_qa_eng
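The LongBench data is expected under `datasets/LongBench/` (see the project structure below). A minimal sketch for pulling one subset from the Hugging Face Hub, assuming the upstream `THUDM/LongBench` release; adjust paths if this repo expects a different layout:

```bash
# Hypothetical: download one LongBench subset via the Hugging Face datasets library
python -c "from datasets import load_dataset; load_dataset('THUDM/LongBench', 'narrativeqa', split='test', trust_remote_code=True)"
```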
## Evaluation Metrics

Calculate metrics for your evaluation results:

```bash
# For LongBench results
bash scripts/longbench_metrics.sh --results_dir ./results --new_method your_method_name --switch true

# For InfiniteBench results
bash scripts/infinitebench_metrics.sh --results_dir ./results --new_method your_method_name --switch true
```

## Multi-GPU Configuration

Multi-GPU setups use `accelerate` with the YAML configs in `scripts/`:
```bash
# 4 GPU setup
accelerate launch --config_file scripts/gpu_4.yaml your_script.py

# 8 GPU setup
accelerate launch --config_file scripts/gpu_8.yaml your_script.py
```

Available configurations: gpu_1.yaml, gpu_2.yaml, gpu_3.yaml, gpu_4.yaml, gpu_5.yaml, gpu_6.yaml, gpu_7.yaml, gpu_8.yaml.
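If you prefer not to use the YAML files, `accelerate launch` also accepts equivalent flags directly. A sketch assuming the configs encode a plain single-node, multi-process setup (verify against the actual YAML before relying on this):

```bash
# Roughly equivalent to scripts/gpu_4.yaml under the assumption above
accelerate launch --multi_gpu --num_processes 4 your_script.py
```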
## Attention Calibration

ParallelComp includes attention calibration strategies to mitigate attention sink issues:

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage prefill_2_calibration_head_{sink/recent/middle/all}_{layer_i}_{layer_j} --calibration_mode 1 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```
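The `--calibration_stage` value above is a template: choose one of `sink`, `recent`, `middle`, or `all`, plus a layer range. One concrete instantiation, substituted into the command above (the layer indices are illustrative, not recommended values):

```bash
# Illustrative: calibrate "sink" heads between layers 0 and 8
--calibration_stage prefill_2_calibration_head_sink_0_8 --calibration_mode 1 \
```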
## Roadmap

- [ ] Code Organization: Currently organizing and cleaning up the codebase for better usability
- [ ] Gemma Support: Adding full support for the Gemma model family
- [ ] Baselines: Adding full support for evaluating baselines
- [ ] SGLang Integration: Adding support for the SGLang inference engine for improved performance
- [ ] Documentation: Expanding documentation with more detailed examples
- [ ] Quantization Support: Adding support for model quantization (INT8/INT4) to reduce memory usage and accelerate inference
- [ ] Benchmarks: Adding more comprehensive benchmark results
- [x] FlashAttention Support
- [x] Multi-GPU Inference Support
- [x] Batch Inference Support
- [x] AMD GPU Support
## Project Structure

```
ParallelComp/
├── run_evaluation_multi_gpu.py   # Multi-GPU evaluation script
├── model_loaders.py              # Model loading utilities
├── experiment_manager.py         # Experiment management
├── pcw_wrapper.py                # Parallel Context Window wrapper
├── modeling_llama_with_pcw_kv_cache_FlashAttention_longbench.py   # Llama model implementation with PCW
├── modeling_qwen2_with_pcw_kv_cache_FlashAttention_longbench.py   # Qwen2 model implementation with PCW
├── metrics.py                    # Evaluation metrics
├── eval_longbench.py             # LongBench dataset evaluation
├── eval_infinitebench.py         # InfiniteBench dataset evaluation
├── utils.py                      # General utilities
├── constants.py                  # Project constants
├── run_test_longbench_multi_gpu_window8_llama.sh   # Llama evaluation script
├── run_test_longbench_multi_gpu_window8_qwen.sh    # Qwen evaluation script
├── scripts/                      # GPU configs and metric scripts
│   ├── gpu_*.yaml                # GPU configuration files
│   ├── longbench_metrics.sh      # LongBench metrics script
│   └── infinitebench_metrics.sh  # InfiniteBench metrics script
├── longbench_config/             # LongBench configurations
│   ├── dataset2*.json            # Dataset configuration files
│   ├── model2maxlen*.json        # Model configuration files
│   └── past/                     # Historical configurations
├── datasets/                     # Dataset storage
│   ├── LongBench/                # LongBench dataset
│   └── gsm8k/                    # GSM8K dataset
├── my_utils/                     # Utilities
│   ├── logger.py                 # Logging utilities
│   ├── entropy_utils.py          # Entropy calculation utilities
│   ├── cache_revise.py           # Cache revision utilities
│   └── priorityqueue.py          # Priority queue implementation
├── requirements.txt              # Python dependencies
└── requirements.in               # Dependency source file
```
## Results

ParallelComp achieves significant improvements on long-context tasks:
- 91.17% of GPT-4's performance using an 8B model trained on 8K context
- 23.50x acceleration in prefilling stage
- 1.76x improvement in chunk throughput
- Outperforms Claude-2 and Kimi-Chat on long-context benchmarks
## Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Citation

If you find this work useful, please cite our paper:

```bibtex
@article{xiong2025parallelcomp,
  title={ParallelComp: Parallel Long-Context Compressor for Length Extrapolation},
  author={Xiong, Jing and Shen, Jianghan and Zheng, Chuanyang and Wan, Zhongwei and Zhao, Chenyang and Yang, Chiwun and Ye, Fanghua and Yang, Hongxia and Kong, Lingpeng and Wong, Ngai},
  journal={arXiv preprint arXiv:2502.14317},
  year={2025}
}
```

## Contact

For questions and support, please open an issue in this repository or contact the authors.
**Note**: This implementation will be fully released soon. Stay tuned for updates!