Official implementation of "ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation" [NeurIPS 2025].
Create the ImageSentinel environment:
conda env create -f environment.yml
conda activate ImageSentinel
Alternatively, use the Docker image (v3): ziyluo/transformers-pytorch-gpu
docker pull ziyluo/transformers-pytorch-gpu:v3
Note: The examples below use the Poe service (--base_url https://api.poe.com/v1). If you use another OpenAI-compatible service, change --base_url accordingly.
Download the LLaVA Visual Instruct Pretrain Dataset from HuggingFace and save it into the data/LLaVA_data folder. We also provide some sample images in the data/LLaVA_data/data folder for you to explore directly.
The structure of the data folder should be organized as follows:
data/
└── LLaVA_data/
    └── data/
        ├── 00000/
        │   ├── XXXX.jpg
        │   ├── XXXX.jpg
        │   └── ...
        ├── 00001/
        │   ├── XXXX.jpg
        │   ├── XXXX.jpg
        │   └── ...
        └── ...
Download the Products-10K dataset from its official site and save it under data/Products_data/test.
The structure of the Products_data folder should be organized as follows:
data/
└── Products_data/
    └── test/
        ├── 1007880.jpg
        ├── 1149078.jpg
        ├── 1725281.jpg
        └── ...
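Before moving on, it can save time to confirm that both datasets sit where the later commands expect them. The snippet below is a small sanity check written for this walkthrough (the `check_layout` helper is not part of the repository); it only verifies the directory trees shown above.

```python
import os

# Sanity-check that the datasets are laid out as shown in the trees above.
# "data" is the repository-level data folder; adjust if yours differs.
def check_layout(data_root):
    """Return the expected dataset directories missing under data_root."""
    expected = [
        os.path.join(data_root, "LLaVA_data", "data"),
        os.path.join(data_root, "Products_data", "test"),
    ]
    return [p for p in expected if not os.path.isdir(p)]

if __name__ == "__main__":
    missing = check_layout("data")
    if missing:
        print("Missing directories:", missing)
    else:
        print("Dataset layout looks good.")
```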
To generate Sentinel images, run the following commands.
For LLaVA_data:
cd ImageSentinel/
python3 main.py --input_dir ../data/LLaVA_data/data/ --output_dir ../data/LLaVA_data/sentinelImages_len6/ --record_file ../data/LLaVA_data/processing_record_len6.json --key_length 6 --num_images 2 --openai_api_key <openai_api_key>
For Products_data:
cd ImageSentinel/
python3 main.py --input_dir ../data/Products_data/test --output_dir ../data/Products_data/sentinelImages_len6 --record_file ../data/Products_data/processing_record_len6.json --key_length 6 --num_images 2 --openai_api_key <openai_api_key>
The generated images will be saved in the specified --output_dir, along with a JSON file at --record_file.
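If you want to quickly inspect the record file that main.py writes, a minimal helper like the one below works. The only assumption made here is that --record_file is a JSON object with one entry per processed image; see main.py for the actual schema.

```python
import json

# Hypothetical helper for peeking at the --record_file JSON written by
# main.py. Only the top-level shape (a JSON object keyed per image) is
# assumed; the per-entry fields are defined by main.py itself.
def count_record_entries(path):
    with open(path) as f:
        record = json.load(f)
    print(f"{len(record)} entries in {path}")
    return len(record)
```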
To generate CLIP embeddings for the retrieval database:
For LLaVA_data:
cd ..
python3 embedding_generation.py --base_path data/LLaVA_data/data --embeddings_path data/LLaVA_data/data_embeddings --data_limit 100000
python3 embedding_generation.py --base_path data/LLaVA_data/sentinelImages_len6 --embeddings_path data/LLaVA_data/data_embeddings --key_length 6 --processed_images_file data/LLaVA_data/data_embeddings/processed_images_len6.json --sentinel --data_limit 100000
For Products_data:
cd ..
python3 embedding_generation.py --base_path data/Products_data/test --embeddings_path data/Products_data/data_embeddings --data_limit 100000
python3 embedding_generation.py --base_path data/Products_data/sentinelImages_len6 --embeddings_path data/Products_data/data_embeddings --key_length 6 --processed_images_file data/Products_data/data_embeddings/processed_images_len6.json --sentinel --data_limit 100000
This will generate CLIP embeddings in the corresponding data/<dataset>/data_embeddings directory.
You can also skip this step: the embeddings will be generated automatically in the next step if they are not found.
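Conceptually, the generation scripts below use these embeddings for retrieval: CLIP vectors are unit-normalized so that a dot product gives cosine similarity, and the top-k most similar database images are fetched for a query. The sketch below illustrates that step only; the function names and the in-memory array layout are assumptions for this example, and the real storage format is defined by embedding_generation.py.

```python
import numpy as np

# Sketch of the retrieval step performed with the CLIP embeddings:
# unit-normalize rows, then rank the database by cosine similarity.
def normalize(embeddings):
    """L2-normalize rows so that dot products equal cosine similarities."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def retrieve_top_k(query, database, k=5):
    """Return indices of the k database rows most similar to the query."""
    scores = database @ query  # cosine similarity, since inputs are normalized
    return np.argsort(-scores)[:k]
```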
To run retrieval-augmented generation with SDXL:
For LLaVA_data (protected):
python3 imageRAG_SDXL_sentinel.py --input_json data/LLaVA_data/processing_record_len6.json --num_trials 1 --out_path results/sentinel_results_SDXL_llava_len6 --key_length 6 --embeddings_path data/LLaVA_data/data_embeddings --original_database_dir data/LLaVA_data/data --sentinel_images_dir data/LLaVA_data/sentinelImages_len6 --openai_api_key <openai_api_key> --retrieval_size 10000 --base_url https://api.poe.com/v1
For LLaVA_data (unprotected):
python3 imageRAG_SDXL_sentinel.py --input_json data/LLaVA_data/processing_record_len6.json --num_trials 1 --out_path results/original_results_SDXL_llava_len6 --embeddings_path data/LLaVA_data/data_embeddings --original_database_dir data/LLaVA_data/data --sentinel_images_dir data/LLaVA_data/sentinelImages_len6 --openai_api_key <openai_api_key> --no_sentinel --retrieval_size 10000 --base_url https://api.poe.com/v1
For Products_data (protected):
python3 imageRAG_SDXL_sentinel.py --input_json data/Products_data/processing_record_len6.json --num_trials 1 --out_path results/sentinel_results_SDXL_products_len6 --key_length 6 --embeddings_path data/Products_data/data_embeddings --original_database_dir data/Products_data/test --sentinel_images_dir data/Products_data/sentinelImages_len6 --openai_api_key <openai_api_key> --retrieval_size 10000 --base_url https://api.poe.com/v1
For Products_data (unprotected):
python3 imageRAG_SDXL_sentinel.py --input_json data/Products_data/processing_record_len6.json --num_trials 1 --out_path results/original_results_SDXL_products_len6 --embeddings_path data/Products_data/data_embeddings --original_database_dir data/Products_data/test --sentinel_images_dir data/Products_data/sentinelImages_len6 --openai_api_key <openai_api_key> --no_sentinel --retrieval_size 10000 --base_url https://api.poe.com/v1
To run retrieval-augmented generation with GPT-4o:
For LLaVA_data (protected):
python3 imageRAG_GPT4o_sentinel.py --input_json data/LLaVA_data/processing_record_len6.json --num_trials 1 --out_path results/sentinel_results_GPT4o_llava_len6 --out_name sentinel_results_GPT4o_llava_len6 --embeddings_path data/LLaVA_data/data_embeddings --original_database_dir data/LLaVA_data/data --sentinel_images_dir data/LLaVA_data/sentinelImages_len6 --openai_api_key <openai_api_key> --retrieval_size 10000 --base_url https://api.poe.com/v1
For LLaVA_data (unprotected):
python3 imageRAG_GPT4o_sentinel.py --input_json data/LLaVA_data/processing_record_len6.json --num_trials 1 --out_path results/original_results_GPT4o_llava_len6 --out_name original_results_GPT4o_llava_len6 --embeddings_path data/LLaVA_data/data_embeddings --original_database_dir data/LLaVA_data/data --sentinel_images_dir data/LLaVA_data/sentinelImages_len6 --openai_api_key <openai_api_key> --no_sentinel --retrieval_size 10000 --base_url https://api.poe.com/v1
For Products_data (protected):
python3 imageRAG_GPT4o_sentinel.py --input_json data/Products_data/processing_record_len6.json --num_trials 1 --out_path results/sentinel_results_GPT4o_products_len6 --out_name sentinel_results_GPT4o_products_len6 --embeddings_path data/Products_data/data_embeddings --original_database_dir data/Products_data/test --sentinel_images_dir data/Products_data/sentinelImages_len6 --openai_api_key <openai_api_key> --retrieval_size 10000 --base_url https://api.poe.com/v1
For Products_data (unprotected):
python3 imageRAG_GPT4o_sentinel.py --input_json data/Products_data/processing_record_len6.json --num_trials 1 --out_path results/original_results_GPT4o_products_len6 --out_name original_results_GPT4o_products_len6 --embeddings_path data/Products_data/data_embeddings --original_database_dir data/Products_data/test --sentinel_images_dir data/Products_data/sentinelImages_len6 --openai_api_key <openai_api_key> --no_sentinel --retrieval_size 10000 --base_url https://api.poe.com/v1
To run retrieval-augmented generation with OmniGen:
For LLaVA_data (protected):
python3 imageRAG_OmniGen_sentinel.py --omnigen_path <omnigen_path> --original_database_dir data/LLaVA_data --sentinel_images_dir data/LLaVA_data/sentinelImages_len6 --input_json data/LLaVA_data/processing_record_len6.json --num_trials 1 --out_path results/sentinel_results_OmniGen_llava_len6 --out_name sentinel_results_OmniGen_llava_len6 --openai_api_key <openai_api_key> --retrieval_size 10000 --base_url https://api.poe.com/v1
For LLaVA_data (unprotected):
python3 imageRAG_OmniGen_sentinel.py --omnigen_path <omnigen_path> --original_database_dir data/LLaVA_data --sentinel_images_dir data/LLaVA_data/sentinelImages_len6 --input_json data/LLaVA_data/processing_record_len6.json --num_trials 1 --out_path results/original_results_OmniGen_llava_len6 --out_name original_results_OmniGen_llava_len6 --openai_api_key <openai_api_key> --no_sentinel --retrieval_size 10000 --base_url https://api.poe.com/v1
For Products_data (protected):
python3 imageRAG_OmniGen_sentinel.py --omnigen_path <omnigen_path> --original_database_dir data/Products_data --sentinel_images_dir data/Products_data/sentinelImages_len6 --input_json data/Products_data/processing_record_len6.json --num_trials 1 --out_path results/sentinel_results_OmniGen_products_len6 --out_name sentinel_results_OmniGen_products_len6 --openai_api_key <openai_api_key> --retrieval_size 10000 --base_url https://api.poe.com/v1
For Products_data (unprotected):
python3 imageRAG_OmniGen_sentinel.py --omnigen_path <omnigen_path> --original_database_dir data/Products_data --sentinel_images_dir data/Products_data/sentinelImages_len6 --input_json data/Products_data/processing_record_len6.json --num_trials 1 --out_path results/original_results_OmniGen_products_len6 --out_name original_results_OmniGen_products_len6 --openai_api_key <openai_api_key> --no_sentinel --retrieval_size 10000 --base_url https://api.poe.com/v1
To compute the similarities between the generated images and the sentinel images, use the following commands. The similarity is calculated using the DINO similarity metric.
For LLaVA_data:
python3 compute_similarities.py --out_dir results/sentinel_results_SDXL_llava_len6 --sentinel_dir data/LLaVA_data/sentinelImages_len6 --similarity_type dino
python3 compute_similarities.py --out_dir results/original_results_SDXL_llava_len6 --sentinel_dir data/LLaVA_data/sentinelImages_len6 --similarity_type dino
For Products_data:
python3 compute_similarities.py --out_dir results/sentinel_results_SDXL_products_len6 --sentinel_dir data/Products_data/sentinelImages_len6 --similarity_type dino
python3 compute_similarities.py --out_dir results/original_results_SDXL_products_len6 --sentinel_dir data/Products_data/sentinelImages_len6 --similarity_type dino
The computed similarity results will be saved in the respective output directories specified by --out_dir.
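Once compute_similarities.py has extracted DINO features, the "DINO similarity" between two images is typically just the cosine similarity of their feature vectors. The snippet below illustrates only that final comparison; feature extraction itself is handled inside compute_similarities.py.

```python
import numpy as np

# Cosine similarity between two feature vectors -- the comparison that
# "DINO similarity" reduces to once features have been extracted.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```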
To evaluate the final metrics, run the following commands.
For LLaVA_data:
python3 evaluate_similarities.py --sentinel_out_dir results/sentinel_results_SDXL_llava_len6 --original_dir results/original_results_SDXL_llava_len6 --similarity_type dino --num_samples 2
For Products_data:
python3 evaluate_similarities.py --sentinel_out_dir results/sentinel_results_SDXL_products_len6 --original_dir results/original_results_SDXL_products_len6 --similarity_type dino --num_samples 2
The evaluation results will include various metrics and will be saved in the directories specified by --sentinel_out_dir.
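To give a feel for what such metrics can look like, here is an illustrative threshold-based detection rate: a trial counts as a detection when a generated image's similarity to its sentinel exceeds a threshold. Both the decision rule and the 0.5 default are assumptions made for this sketch; evaluate_similarities.py defines the metrics actually reported.

```python
# Illustrative detection metric (not the repository's implementation):
# the fraction of trials whose sentinel similarity exceeds a threshold.
def detection_rate(similarities, threshold=0.5):
    hits = sum(1 for s in similarities if s > threshold)
    return hits / len(similarities)
```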
The imageRAG-related code is adapted from https://github.com/rotem-shalev/ImageRAG/tree/main
If you find this repository useful, please cite these papers:
@inproceedings{luo2025imagesentinel,
title={ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation},
author={Luo, Ziyuan and Zhao, Yangyi and Cheung, Ka Chun and See, Simon and Wan, Renjie},
year={2025},
booktitle={Advances in Neural Information Processing Systems}
}
@misc{shalevarkushin2025imageragdynamicimageretrieval,
title={ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation},
author={Rotem Shalev-Arkushin and Rinon Gal and Amit H. Bermano and Ohad Fried},
year={2025},
eprint={2502.09411},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.09411},
}