RobustKV is a lightweight, plug-and-play defense layer for transformer-based LLMs that uses key–value (KV) cache eviction to thwart jailbreak attacks (e.g., AutoDAN). By scoring and ranking the entries in the self-attention KV cache and evicting the most "suspicious" ones at inference time, RobustKV sharply reduces malicious instruction leakage while preserving performance on benign prompts.
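As a rough illustration of the idea (not the repository's actual code): if each cached token carries an "attack potential" score, eviction amounts to dropping the highest-scoring entries from the KV tensors. The function name, tensor shapes, and scoring input below are assumptions made for this sketch.

```python
import torch

def evict_suspicious_kv(keys, values, scores, evict_ratio=0.2):
    """Illustrative KV eviction: drop the highest-scoring ("most
    suspicious") cached tokens and keep the rest in original order.

    keys, values: [batch, heads, seq_len, head_dim] KV-cache tensors
    scores:       [seq_len] per-token attack-potential scores
                  (how RobustKV actually computes these is not shown here)
    """
    seq_len = keys.size(2)
    n_keep = seq_len - int(seq_len * evict_ratio)
    # argsort ascending: the first n_keep indices are the least suspicious;
    # re-sort them so the surviving tokens stay in positional order.
    keep = torch.argsort(scores)[:n_keep].sort().values
    return keys[:, :, keep, :], values[:, :, keep, :]

# Toy check: 8 cached tokens, evict the 2 most suspicious.
k, v = torch.randn(1, 4, 8, 64), torch.randn(1, 4, 8, 64)
k2, v2 = evict_suspicious_kv(k, v, torch.rand(8), evict_ratio=0.25)
assert k2.size(2) == 6 and v2.size(2) == 6
```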
RobustKV uses the same environment as SnapKV, so you can either set it up through the SnapKV repository or install the pinned dependencies directly:

```
transformers==4.37.0
flash-attn==2.4.0
```

Evaluate the default AutoDAN jailbreak on Llama2-chat-7b:

```
python autodan_hga_eval.py
```

Run the same attack with RobustKV enabled:

```
python RobustKV-AutoDAN.py
```

- **Selective KV Eviction** (the core step is sketched in the example above)
  - Monitors the attention KV cache during inference
  - Scores entries by attack potential
  - Evicts the top-ranked entries to block harmful prompts
- **Seamless Integration** (a hypothetical usage sketch follows this list)
  - Built on the same low-level KV optimization stack as SnapKV
  - Minimal code changes: a drop-in wrapper around any HuggingFace transformer
- **Configurable & Extensible** (a possible scoring-hook shape is sketched below)
  - Adjustable eviction thresholds per experiment
  - Custom scoring functions can be plugged in
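To make the "drop-in wrapper" idea concrete, here is one plausible shape for it. `RobustKVWrapper`, `evict_ratio`, and `score_fn` are hypothetical names invented for this sketch; the repository's actual entry points are the scripts shown above (e.g., RobustKV-AutoDAN.py).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

class RobustKVWrapper:
    """Hypothetical drop-in wrapper. A real implementation would patch the
    model's attention modules (as SnapKV does) so suspicious KV entries are
    evicted between decoding steps; this sketch only shows the interface."""

    def __init__(self, model, evict_ratio=0.2, score_fn=None):
        self.model = model
        self.evict_ratio = evict_ratio   # fraction of KV entries to evict
        self.score_fn = score_fn         # optional custom scoring hook

    def generate(self, **kwargs):
        # Delegation stands in for the patched, eviction-aware forward pass.
        return self.model.generate(**kwargs)

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
guarded = RobustKVWrapper(model, evict_ratio=0.2)

inputs = tok("How do I reset my router?", return_tensors="pt")
out = guarded.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```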
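The custom-scoring extension point might look like the following; the callable signature is an assumption, used only to show where a user-supplied metric would slot in.

```python
import torch

def attention_mass_score(attn_weights):
    """Hypothetical scoring hook: treat the total attention mass a cached
    token receives as a crude proxy for its "attack potential".

    attn_weights: [heads, query_len, key_len] attention probabilities
    returns:      [key_len] score per cached token
    """
    # Average over heads, then sum the attention each key position receives.
    return attn_weights.mean(dim=0).sum(dim=0)

# Plugged into the (hypothetical) wrapper from the previous sketch:
# guarded = RobustKVWrapper(model, evict_ratio=0.3, score_fn=attention_mass_score)
```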