DataFilter is a defense designed to protect LLM agent systems against prompt injection attacks while maintaining system utility and performance.
conda create -n py312vllm python=3.12
conda activate py312vllm
pip install vllm pandas 'accelerate>=0.26.0' deepspeed datasets==2.20.0
git clone https://github.com/yizhu-joy/DataFilter.git
cd DataFilter
To test our DataFilter model:
Run DataFilter inference demo:
python filter_inference.py
This section provides step-by-step instructions to reproduce all benchmark results from our paper.
Ensure you have completed the installation steps above, then authenticate with required services:
# Authenticate with HuggingFace
huggingface-cli login
# Set OpenAI API key
export OPENAI_API_KEY="your_openai_api_key"
# Configure OpenAI in the config file
# Edit data/openai_config.yaml with your API key
Metric: Utility (Higher is Better)
Install AlpacaEval:
pip install alpaca-eval
Without Defense:
# Llama-3.1-8B
python test.py --attack none --defense none \
--test_data data/davinci_003_outputs.json \
-m meta-llama/Llama-3.1-8B-Instruct
# GPT-4o
python test.py --attack none --defense none \
--test_data data/davinci_003_outputs.json \
-m gpt-4o-2024-05-13
With DataFilter Defense:
# Llama-3.1-8B
python test.py --attack none --defense datafilter \
--test_data data/davinci_003_outputs.json \
-m meta-llama/Llama-3.1-8B-Instruct
# GPT-4o
python test.py --attack none --defense datafilter \
--test_data data/davinci_003_outputs.json \
-m gpt-4o-2024-05-13
Metric: Attack Success Rate (ASR) - Lower is Better
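To make the metric concrete: ASR is the fraction of attacked test cases in which the injected instruction is judged to have succeeded. The sketch below is only an illustration. The helper name `attack_success_rate` and the list of boolean outcomes are hypothetical, and how `test.py` decides whether an individual attack succeeded is not described in this README.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Return the fraction of attacked samples judged successful.

    Hypothetical helper for illustration; not part of the DataFilter repo.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    # True counts as 1 and False as 0, so the sum is the number of successes.
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for four attacked samples, NOT results from the paper.
undefended = [True, True, False, True]
defended = [False, False, False, True]

print(f"ASR without defense: {attack_success_rate(undefended):.2%}")
print(f"ASR with defense:    {attack_success_rate(defended):.2%}")
```

A lower value after enabling the defense means fewer injected instructions were followed.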
Without Defense:
# Llama-3.1-8B
python test.py --attack completion --defense none \
--test_data data/SEP_1000_keyword.json \
-m meta-llama/Llama-3.1-8B-Instruct
# GPT-4o
python test.py --attack completion --defense none \
--test_data data/SEP_1000_keyword.json \
-m gpt-4o-2024-05-13
With DataFilter Defense:
# Llama-3.1-8B
python test.py --attack completion --defense datafilter \
--test_data data/SEP_1000_keyword.json \
-m meta-llama/Llama-3.1-8B-Instruct
# GPT-4o
python test.py --attack completion --defense datafilter \
--test_data data/SEP_1000_keyword.json \
-m gpt-4o-2024-05-13
Metric: Attack Success Rate (ASR) - Lower is Better
Setup:
git clone https://github.com/uiuc-kang-lab/InjecAgent.git
Without Defense:
# Llama-3.1-8B
python test_InjecAgent.py --defense none \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct
# GPT-4o
python test_InjecAgent.py --defense none \
--model_name_or_path gpt-4o-2024-05-13
With DataFilter Defense:
# Llama-3.1-8B
python test_InjecAgent.py --defense datafilter \
--model_name_or_path meta-llama/Llama-3.1-8B-Instruct
# GPT-4o
python test_InjecAgent.py --defense datafilter \
--model_name_or_path gpt-4o-2024-05-13
Metrics: Attack Success Rate (ASR) - Lower is Better, Utility - Higher is Better
Setup:
-
Clone the repository:
git clone https://github.com/ethz-spylab/agentdojo.git
-
Configure tool output format:
sed -i \
"s|tool_output_format: Literal\[\"yaml\", \"json\"\] \| None = None,|tool_output_format: Literal\[\"yaml\", \"json\"\] \| None = \"json\",|g" \
agentdojo/src/agentdojo/scripts/benchmark.py
-
Update tool result formatting:
Replace the tool_result_to_str function in agentdojo/src/agentdojo/agent_pipeline/tool_execution.py with:
import datetime
import json
from typing import Any
# Callable, yaml, and BaseModel are used below; keep these imports
# if the file does not already provide them.
from collections.abc import Callable
import yaml
from pydantic import BaseModel

def tool_result_to_str(
    tool_result: Any,
    dump_fn: Callable[[dict | list[dict]], str] | None = None,
) -> str:
    """Format tool results into a string."""
    # Default to YAML; when json.dumps is passed, pretty-print and
    # stringify values that JSON cannot serialize natively.
    if dump_fn is None:
        dump_fn = lambda x: yaml.safe_dump(x, sort_keys=False)
    elif dump_fn is json.dumps:
        dump_fn = lambda x: json.dumps(x, indent=2, default=str)
    if isinstance(tool_result, BaseModel):
        return dump_fn(tool_result.model_dump()).strip()
    if isinstance(tool_result, list):
        res_items = []
        for item in tool_result:
            if isinstance(item, (str, int)):
                res_items.append(str(item))
            elif isinstance(item, BaseModel):
                res_items.append(item.model_dump())
            elif isinstance(item, datetime.datetime):
                res_items.append(item.isoformat())
            else:
                raise TypeError(f"Not valid type for item tool result: {type(item)}")
        return dump_fn(res_items).strip()
    if isinstance(tool_result, datetime.datetime):
        return tool_result.isoformat()
    return str(tool_result)
-
Install DataFilter defense:
# Copy defense files
cp data_filter_defense.py agentdojo/src/agentdojo/agent_pipeline/
cp inference_utils.py agentdojo/src/agentdojo/agent_pipeline/
-
Register DataFilter defense:
Edit agentdojo/src/agentdojo/agent_pipeline/agent_pipeline.py:
a. Add import at the top:
from agentdojo.agent_pipeline.data_filter_defense import DataFilterDefense
b. Update the DEFENSES list (around line 43):
DEFENSES = [
    "tool_filter",
    "transformers_pi_detector",
    "spotlighting_with_delimiting",
    "repeat_user_prompt",
    "data_filter",
]
c. Add at the bottom of the file:
if config.defense == "data_filter":
    data_filter_element = DataFilterDefense(
        model_path="JoyYizhu/DataFilter"
    )
    tools_loop = ToolsExecutionLoop(
        [ToolsExecutor(tool_output_formatter), data_filter_element, llm]
    )
    pipeline = cls([system_message_component, init_query_component, llm, tools_loop])
    pipeline.name = f"{llm_name}-{config.defense}"
    return pipeline
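The `sed -i` command in the "Configure tool output format" step leaves the file unchanged without any warning if the pattern does not match, for example if that line differs in your AgentDojo checkout. BSD/macOS `sed` also expects a different form of `-i`. As a rough alternative, the same edit can be sketched in Python so that a failed match raises an error. The helper names `set_json_default` and `patch_file` are hypothetical; the two strings are taken from the `sed` expression.

```python
from pathlib import Path

# The exact text the sed expression searches for, and its replacement.
OLD = 'tool_output_format: Literal["yaml", "json"] | None = None,'
NEW = 'tool_output_format: Literal["yaml", "json"] | None = "json",'


def set_json_default(source: str) -> str:
    """Return `source` with the tool_output_format default changed to "json"."""
    if OLD not in source:
        # Fail loudly instead of silently doing nothing, as sed would.
        raise ValueError("expected line not found; is the file already patched?")
    return source.replace(OLD, NEW)


def patch_file(path: Path) -> None:
    """Apply set_json_default to a file in place (hypothetical helper)."""
    path.write_text(set_json_default(path.read_text()))


# Usage, with the path from the sed command:
# patch_file(Path("agentdojo/src/agentdojo/scripts/benchmark.py"))
```

Running it a second time raises `ValueError`, which also serves as a check that the patch was applied.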
Run Benchmark:
cd agentdojo/src
python -m agentdojo.scripts.benchmark \
--model GPT_4O_2024_05_13 \
--defense data_filter \
--attack tool_knowledge
To train your own DataFilter model:
Download the Alpaca cleaned dataset from HuggingFace and place it in the data/ directory.
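One possible way to script this download is sketched below with the `datasets` package installed earlier. Two things here are assumptions, not statements from this README: that `yahma/alpaca-cleaned` is the intended HuggingFace dataset, and that `data/alpaca_data_cleaned.json` is the file name `generate_data.py` expects. Check the script before relying on either. The helper names are hypothetical.

```python
import json
from pathlib import Path


def save_records(records: list[dict], out_path: Path) -> int:
    """Write records as one JSON array and return how many were written."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return len(records)


def download_alpaca_cleaned(out_path: Path) -> int:
    """Fetch the dataset from the HuggingFace Hub and save it locally."""
    # Imported lazily so save_records can be used without `datasets`.
    from datasets import load_dataset

    # ASSUMPTION: this repository id is the "Alpaca cleaned" dataset meant here.
    ds = load_dataset("yahma/alpaca-cleaned", split="train")
    return save_records([dict(row) for row in ds], out_path)


# ASSUMPTION: the output file name is a guess; adjust it to what
# generate_data.py actually reads.
# download_alpaca_cleaned(Path("data/alpaca_data_cleaned.json"))
```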
Generate synthetic attack and benign examples:
python generate_data.py --train --position --cut_benign \
--attack_types Completion Naive Ignore
Train using distributed training with DeepSpeed:
torchrun --nproc_per_node=2 train.py --deepspeed ds_config.json
Adjust --nproc_per_node based on your available GPUs.
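If you prefer not to hard-code the process count, the launch command can be assembled from the number of visible GPUs. This is a minimal sketch: `build_torchrun_command` and `detect_gpu_count` are hypothetical helpers, and the script and config names are the ones from the command above.

```python
import shlex


def build_torchrun_command(num_gpus: int) -> list[str]:
    """Build the training launch command with one process per GPU."""
    if num_gpus < 1:
        raise ValueError("at least one GPU is required")
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        "train.py",
        "--deepspeed",
        "ds_config.json",
    ]


def detect_gpu_count() -> int:
    """Return the number of CUDA devices, or 0 if PyTorch is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


# Print the command for a 2-GPU machine; pass detect_gpu_count() instead
# of the literal 2 to match the local hardware.
print(shlex.join(build_torchrun_command(2)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. `torchrun` also accepts `--nproc_per_node=gpu`, which starts one process per detected GPU without any helper code.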
If you use DataFilter in your research, please cite our paper:
@misc{wang2025datafilter,
title={Defending Against Prompt Injection with DataFilter},
author={Yizhu Wang and Sizhe Chen and Raghad Alkhudair and Basel Alomair and David Wagner},
year={2025},
eprint={2510.19207},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2510.19207},
}
Please refer to the LICENSE file for details.

