# ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong

📄 [Paper](https://arxiv.org/abs/2502.14317)

This repository contains the official implementation of ParallelComp, a training-free method for long-context extrapolation that extends Large Language Models' (LLMs) context length from 4K to 128K while maintaining high throughput and preserving perplexity.
## Key Features

- Training-free: No costly fine-tuning required for length extrapolation
- High Performance: Achieves 91.17% of GPT-4's performance on long-context tasks using an 8B model
- Scalable: Extends context length from 4K to 128K tokens
- Efficient: Integrates seamlessly with Flash Attention
- Fast: 23.50x acceleration in the prefilling stage with 1.76x improvement in chunk throughput
- Memory Efficient: Manages ultra-long contexts on a single A100 80GB GPU
## Requirements

- Python 3.9+
- PyTorch 2.5.1
- CUDA-compatible GPU(s)
- Transformers 4.43.2
- An A100 GPU with 80GB of memory for ultra-long contexts
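Before installing, it can help to sanity-check the environment. The commands below are a generic check (they are not part of this repo's scripts) and assume `python` and `nvidia-smi` are on your `PATH`:

```bash
# Generic environment check (not shipped with this repo)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import transformers; print(transformers.__version__)"
nvidia-smi --query-gpu=name,memory.total --format=csv  # confirm GPU model and memory
```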
## Installation

- Clone the repository:

```bash
git clone https://github.com/your-username/ParallelComp.git
cd ParallelComp
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- For development, regenerate the pinned dependencies from requirements.in:

```bash
pip-compile requirements.in
pip install -r requirements.txt
```

## Usage

### Single-GPU Evaluation

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```
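The same flags should carry over to the Qwen evaluation script listed in the project structure below. A minimal sketch, assuming a Qwen2.5 chat checkpoint such as `Qwen/Qwen2.5-7B-Instruct` (the model id is an example, not a tested configuration):

```bash
# Hypothetical Qwen2.5 run: same flags, with the Qwen script and a Qwen model id
bash run_test_longbench_multi_gpu_window8_qwen.sh \
--parallel_pattern parallel_comp --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model Qwen/Qwen2.5-7B-Instruct
```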
### Multi-GPU Evaluation

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp --gpu_nums 4 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```

### Batch Inference

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp_batches --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage None --calibration_mode 0 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```

## Supported Models

- LLaMA family models
- Qwen2.5 models
## Supported Datasets

- LongBench: narrativeqa, qasper, multifieldqa_en, hotpotqa, 2wikimqa, musique, gov_report, qmsum, multi_news, trec, triviaqa, samsum, passage_count, passage_retrieval_en, lcc, repobench-p
- InfiniteBench: passkey, number_string, kv_retrieval, math_find, code_debug, longbook_choice_eng, longdialogue_qa_eng
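The LongBench data is expected under `datasets/LongBench/` (see the project structure below). A minimal sketch for pulling one subset from the Hugging Face Hub, assuming the upstream `THUDM/LongBench` release; adjust paths if this repo expects a different layout:

```bash
# Hypothetical: download one LongBench subset via the Hugging Face datasets library
python -c "from datasets import load_dataset; load_dataset('THUDM/LongBench', 'narrativeqa', split='test', trust_remote_code=True)"
```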
## Evaluation Metrics

Calculate metrics for your evaluation results:

```bash
# For LongBench results
bash scripts/longbench_metrics.sh --results_dir ./results --new_method your_method_name --switch true

# For InfiniteBench results
bash scripts/infinitebench_metrics.sh --results_dir ./results --new_method your_method_name --switch true
```

## Multi-GPU Configuration

Multi-GPU setups use `accelerate` with the YAML configs in `scripts/`:
```bash
# 4 GPU setup
accelerate launch --config_file scripts/gpu_4.yaml your_script.py

# 8 GPU setup
accelerate launch --config_file scripts/gpu_8.yaml your_script.py
```

Available configurations: gpu_1.yaml, gpu_2.yaml, gpu_3.yaml, gpu_4.yaml, gpu_5.yaml, gpu_6.yaml, gpu_7.yaml, gpu_8.yaml.
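If you prefer not to use the YAML files, `accelerate launch` also accepts equivalent flags directly. A sketch assuming the configs encode a plain single-node, multi-process setup (verify against the actual YAML before relying on this):

```bash
# Roughly equivalent to scripts/gpu_4.yaml under the assumption above
accelerate launch --multi_gpu --num_processes 4 your_script.py
```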
## Attention Calibration

ParallelComp includes attention calibration strategies to mitigate attention sink issues:

```bash
bash run_test_longbench_multi_gpu_window8_llama.sh \
--parallel_pattern parallel_comp --gpu_nums 1_0 \
--kv_cache_eviction false --capacity 512 \
--kv_cache_dynamic false --stage_eviction false \
--recent_token 8 \
--topk_windows -3 --query_rank true \
--query_recent_tokens 0 --reduce_factor 0 \
--calibration_stage prefill_2_calibration_head_{sink/recent/middle/all}_{layer_i}_{layer_j} --calibration_mode 1 \
--special_token true \
--model meta-llama/Llama-2-7b-chat-hf
```
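The `--calibration_stage` value above is a template: choose one of `sink`, `recent`, `middle`, or `all`, plus a layer range. One concrete instantiation, substituted into the command above (the layer indices are illustrative, not recommended values):

```bash
# Illustrative: calibrate "sink" heads between layers 0 and 8
--calibration_stage prefill_2_calibration_head_sink_0_8 --calibration_mode 1 \
```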
## Roadmap

- [ ] Code Organization: Currently organizing and cleaning up the codebase for better usability
- [ ] Gemma Support: Adding full support for the Gemma model family
- [ ] Baselines: Adding full support for evaluating baselines
- [ ] SGLang Integration: Adding support for the SGLang inference engine for improved performance
- [ ] Documentation: Expanding documentation with more detailed examples
- [ ] Quantization Support: Adding support for model quantization (INT8/INT4) to reduce memory usage and accelerate inference
- [ ] Benchmarks: Adding more comprehensive benchmark results
- [x] FlashAttention Support
- [x] Multi-GPU Inference Support
- [x] Batch Inference Support
- [x] AMD GPU Support
## Project Structure

```
ParallelComp/
├── run_evaluation_multi_gpu.py   # Multi-GPU evaluation script
├── model_loaders.py              # Model loading utilities
├── experiment_manager.py         # Experiment management
├── pcw_wrapper.py                # Parallel Context Window wrapper
├── modeling_llama_with_pcw_kv_cache_FlashAttention_longbench.py   # Llama model implementation with PCW
├── modeling_qwen2_with_pcw_kv_cache_FlashAttention_longbench.py   # Qwen2 model implementation with PCW
├── metrics.py                    # Evaluation metrics
├── eval_longbench.py             # LongBench dataset evaluation
├── eval_infinitebench.py         # InfiniteBench dataset evaluation
├── utils.py                      # General utilities
├── constants.py                  # Project constants
├── run_test_longbench_multi_gpu_window8_llama.sh   # Llama evaluation script
├── run_test_longbench_multi_gpu_window8_qwen.sh    # Qwen evaluation script
├── scripts/                      # GPU configs and metric scripts
│   ├── gpu_*.yaml                # GPU configuration files
│   ├── longbench_metrics.sh      # LongBench metrics script
│   └── infinitebench_metrics.sh  # InfiniteBench metrics script
├── longbench_config/             # LongBench configurations
│   ├── dataset2*.json            # Dataset configuration files
│   ├── model2maxlen*.json        # Model configuration files
│   └── past/                     # Historical configurations
├── datasets/                     # Dataset storage
│   ├── LongBench/                # LongBench dataset
│   └── gsm8k/                    # GSM8K dataset
├── my_utils/                     # Utilities
│   ├── logger.py                 # Logging utilities
│   ├── entropy_utils.py          # Entropy calculation utilities
│   ├── cache_revise.py           # Cache revision utilities
│   └── priorityqueue.py          # Priority queue implementation
├── requirements.txt              # Python dependencies
└── requirements.in               # Dependency source file
```
## Results

ParallelComp achieves significant improvements on long-context tasks:
- 91.17% of GPT-4's performance using an 8B model trained on 8K context
- 23.50x acceleration in prefilling stage
- 1.76x improvement in chunk throughput
- Outperforms Claude-2 and Kimi-Chat on long-context benchmarks
## Contributing

We welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## Citation

If you find this work useful, please cite our paper:

```bibtex
@article{xiong2025parallelcomp,
  title={ParallelComp: Parallel Long-Context Compressor for Length Extrapolation},
  author={Xiong, Jing and Shen, Jianghan and Zheng, Chuanyang and Wan, Zhongwei and Zhao, Chenyang and Yang, Chiwun and Ye, Fanghua and Yang, Hongxia and Kong, Lingpeng and Wong, Ngai},
  journal={arXiv preprint arXiv:2502.14317},
  year={2025}
}
```

## Contact

For questions and support, please open an issue in this repository or contact the authors.
**Note**: This implementation will be fully released soon. Stay tuned for updates!