Jailbreaking Prevention in VLMs Through
Multimodal Domain Adaptation
Paper accepted at ICRA 2026
Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to very high levels (up to 100% in certain scenarios) under our evaluation protocol. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications.
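For intuition on the domain-adaptation component, aligning general-purpose embeddings to a domain-specific reference set can be sketched as covariance matching in the style of CORAL. This is a generic illustration on random data, not necessarily the alignment J-DAPT actually implements:

```python
import numpy as np

def mat_pow(m, p, eps=1e-6):
    """Symmetric matrix power via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * np.clip(vals, eps, None) ** p) @ vecs.T

def coral_align(source, target, eps=1e-6):
    """Whiten source-domain embeddings, then re-color them with the
    target-domain covariance and mean (CORAL-style alignment)."""
    cs = np.cov(source, rowvar=False) + eps * np.eye(source.shape[1])
    ct = np.cov(target, rowvar=False) + eps * np.eye(target.shape[1])
    centered = source - source.mean(axis=0)
    return centered @ mat_pow(cs, -0.5) @ mat_pow(ct, 0.5) + target.mean(axis=0)

rng = np.random.default_rng(0)
source = rng.normal(loc=1.0, scale=2.0, size=(200, 8))   # general-purpose embeddings
target = rng.normal(loc=-1.0, scale=0.5, size=(200, 8))  # domain-specific references
aligned = coral_align(source, target)
```

After alignment, the source samples share the target domain's first- and second-order statistics, so a detector trained on them transfers more readily to the target domain.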
First, clone the repository:

git clone https://github.com/Mhackiori/J-DAPT.git
cd J-DAPT

Install the required Python packages by running:

pip install -r requirements.txt

We recommend creating a dedicated environment to avoid package version collisions. If you use Conda, you can run the following:
conda create -n jdapt python=3.10
conda activate jdapt
pip install -r requirements.txt

Alternatively, you can create the environment directly from the provided YAML file:
conda env create -f assets/environment.yml
conda activate jdapt

The datasets used in the paper must be obtained from their original sources.
General-purpose datasets:
- DAQUAR: Malinowski and Fritz, A Multi-World Approach to Question Answering About Real-World Scenes Based on Uncertain Input (NeurIPS 2014).
- JailBreakV-28K: Luo et al., JailBreakV-28K: A Benchmark for Assessing the Robustness of Multimodal Large Language Models Against Jailbreak Attacks (arXiv 2024).
Domain-specific datasets:
- LingoQA: Marcu et al., LingoQA: Visual Question Answering for Autonomous Driving (ECCV 2024).
- nuScenes: Caesar et al., nuScenes: A Multimodal Dataset for Autonomous Driving (CVPR 2020).
- ABOships-PLUS: Iancu et al., A Benchmark for Maritime Object Detection with CenterNet on an Improved Dataset, ABOships-PLUS (JMSE 2023).
- LaRS: Zust et al., LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark (ICCV 2023).
Once the datasets are downloaded, place them under the directory configured by dataset_folder in utils/params.py.
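For reference, utils/params.py exposes the dataset_folder setting; an illustrative excerpt is shown below. The path is an assumption, and the actual file may define additional parameters:

```python
# utils/params.py (illustrative excerpt -- the real file may define more settings)
dataset_folder = "datasets/"  # root directory where the downloaded datasets are placed
```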
Next, you need to process the datasets in order to:
- Assemble videos from the raw images;
- Generate nominal (benign) queries for each scenario;
- Generate goals and targets for RoboPAIR.
To generate prompts, we use local Ollama models, as they can process the entire image sequences. If you have a Linux-based system, you can install it by running:
curl -fsSL https://ollama.com/install.sh | sh

We then use gemma3:27b to process the videos and llama3.2 to generate red-teaming queries. The models require roughly 20 GB of additional disk space. You can pull them by running:
ollama pull gemma3:27b
ollama pull llama3.2

After this, you can run the preprocessing script:

python preprocessing.py

After preprocessing, each domain-specific dataset used for jailbreak detection has both a benign query and a goal/target pair for generating the jailbreaking prompt. We use RoboPAIR to generate the jailbreaks. RoboPAIR could be added to the repository as a Git submodule, but since we have modified its code to fit our framework, it is already included in this repository. First, export your OpenAI API key:
export OPENAI_API_KEY=<your_openai_key>

The script uses wandb, so you will need to either log in or disable syncing by running wandb offline. Then, you can start generating the jailbreaks by running:
bash jailbreak.sh

All user inputs (benign, red-teaming, and jailbreaks) and image sequences are now ready. Our classification is performed at the embedding level, so we use CLIP to process them. You can do this by running:
python embeddings.py

Running this script will also train our multimodal fusion model, which will be saved in models. By default, the script also generates the embeddings for each of the models we use in our analysis.
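For intuition, attention-based fusion of a CLIP text embedding with per-frame image embeddings can be sketched as follows. This is a simplified illustration with random weights, not the trained fusion model saved in models:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(text_emb, frame_embs, W_q, W_k, W_v):
    """Scaled dot-product attention: the text embedding queries the
    per-frame image embeddings, and the attended result is concatenated
    with the text embedding to form the fused representation."""
    q = text_emb @ W_q                               # (d,)
    k = frame_embs @ W_k                             # (n_frames, d)
    v = frame_embs @ W_v                             # (n_frames, d)
    weights = softmax(k @ q / np.sqrt(q.shape[-1]))  # (n_frames,)
    return np.concatenate([text_emb, weights @ v])   # (2 * d,)

rng = np.random.default_rng(0)
d = 512                               # CLIP ViT-B/32 embedding width
text_emb = rng.normal(size=d)
frame_embs = rng.normal(size=(8, d))  # 8 frames from one video
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused = attention_fuse(text_emb, frame_embs, W_q, W_k, W_v)  # shape (1024,)
```

Using the text as the query lets the model focus on the frames most relevant to the query's intent, which is what grounds the semantic signal in the environment.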
The embeddings produced by the previous script contain text, image, and fused representations from our multimodal fusion model. The full methodology, classifier training, and evaluation are presented separately in three Jupyter notebooks: jdapt-car.ipynb, jdapt-boat.ipynb, and jdapt-robodog.ipynb.
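Conceptually, the detection step in the notebooks reduces to binary classification over the fused embeddings. A toy sketch on synthetic data (not the actual classifier or data used in the notebooks):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1024                                   # width of a fused embedding
X = rng.normal(size=(200, d))              # synthetic fused embeddings
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)         # synthetic benign (0) / jailbreak (1) labels

# Plain logistic regression trained by gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))     # predicted jailbreak probability
    w -= 0.1 * X.T @ (p - y) / len(y)      # gradient step on the log loss

train_acc = ((1.0 / (1.0 + np.exp(-(X @ w))) > 0.5) == y).mean()
```

Because classification happens at the embedding level, the detector stays lightweight: no VLM forward pass is needed beyond the embedding extraction itself.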
We compare J-DAPT with the use of a dedicated VLM that receives the same input as the target VLM but is tasked with recognizing whether the input represents a jailbreak attempt. For this analysis, we also use several models from the Gemma 3 and Qwen 2.5 VL families:
ollama pull qwen2.5vl:3b
ollama pull qwen2.5vl:7b
ollama pull qwen2.5vl:32b
ollama pull gemma3:4b
ollama pull gemma3:12b
ollama pull gemma3:27b

Then, we run the script:
python overhead.py

This will generate CSV files inside the results folder.
If you use this work, please cite the preprint:
@misc{marchiori2025preventingroboticjailbreakingmultimodal,
  title={Preventing Robotic Jailbreaking via Multimodal Domain Adaptation},
  author={Francesco Marchiori and Rohan Sinha and Christopher Agia and Alexander Robey and George J. Pappas and Mauro Conti and Marco Pavone},
  year={2025},
  eprint={2509.23281},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.23281},
}
