[Paper][Project Page]
Official PyTorch implementation of the CVPR 2025 paper "Seeing the Abstract: Translating the Abstract Language for Vision Language Models"
Davide Talon*, Federico Girella*, Ziyue Liu, Marco Cristani, Yiming Wang
*Equal Contribution
Clone the repo and install the environment:
git clone https://github.com/davidetalon/fashionact.git
cd fashionact
conda env create -f environment.yml
Then, activate the environment and install our modules:
conda activate fashionact
pip install -e .
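As an optional sanity check (not part of the original setup), you can run a couple of lines of Python to verify that PyTorch is installed and sees your GPU:
# Optional sanity check: confirm PyTorch is installed and a GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())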
First, download the DeepFashion data from the In-shop Clothes Retrieval task.
You can then format the data using:
mkdir data/in-shop_clothes_retrieval/
mkdir data/in-shop_clothes_retrieval/tmp
cd data/in-shop_clothes_retrieval/tmp
unzip "/path/to/In-Shop Clothes Retrieval Benchmark*"
mv In-shop\ Clothes\ Retrieval\ Benchmark/* ../
cd ../
NOTE: Our experiments were run on the high-resolution version of the dataset. You will need to request authorization and the password from the original authors:
unzip path/to/img_highres_seg*.zip -o img
rm -r Img
rm -r tmp
Alternatively, you can use the low-resolution version:
unzip Img/img.zip
rm -r Img
rm -r tmp
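To check that the extraction worked, you can count the images with a short Python snippet. The img/ subfolder below is an assumption about where the archive extracts; adjust the path if your layout differs.
# Rough check of the extracted dataset; run from the repository root.
# The img/ subfolder is an assumed extraction path, adjust if needed.
from pathlib import Path

img_root = Path("data/in-shop_clothes_retrieval/img")
n_images = sum(1 for _ in img_root.rglob("*.jpg"))
print(f"Found {n_images} images under {img_root}")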
You can download pre-computed data from here and store it under the data/ folder.
NOTE: This step is only necessary if you want to compute the data locally, without using the pre-computed data from the previous section.
From the root folder, you can generate the JSON input files with:
# Train data
python -m scripts.generate_data --data_path data/in-shop_clothes_retrieval/ --deepfashion --split train --out_file data/deepfashion-train.json
# Test data
python -m scripts.generate_data --data_path data/in-shop_clothes_retrieval/ --deepfashion --split eval --out_file data/deepfashion-eval.json
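If you want to sanity-check the generated files, a minimal Python snippet such as the following loads them and prints a summary; it does not assume any particular schema and only reports what is stored:
# Minimal inspection of the generated JSON files.
import json

for split in ("train", "eval"):
    with open(f"data/deepfashion-{split}.json") as f:
        data = json.load(f)
    print(f"{split}: {len(data)} entries")
    first = data[0] if isinstance(data, list) else next(iter(data.values()))
    if isinstance(first, dict):
        # Show the available fields of one example entry.
        print("  example fields:", sorted(first.keys()))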
You can then caption the available data with:
# Train data
python -m scripts.captioning --data-file data/deepfashion-train.json --vlm-type qwen2-vl --out-file data/deepfashion-train-captioned.json
And apply language rewriting:
python -m scripts.language_rewrite --prompt-type dssp --data-file data/deepfashion-eval.json --llm-type llama3-8B --out-file data/deepfashion-eval-rewritten.json
You can then merge the two files as:
# Train
python -m scripts.add_description --input-file data/deepfashion-train.json --extra-info data/deepfashion-train-captioned.json --info-name qwen2-vl --info-type 'other' --out-file data/deepfashion_train_database.json
# Eval
python -m scripts.add_description --input-file data/deepfashion-eval.json --extra-info data/deepfashion-eval-rewritten.json --info-name llama-3-8B --info-type 'llama-3' --out-file data/deepfashion_eval_noics.json
You can then generate queries from the data using the language-rewritten descriptions:
python -m scripts.generate_queries -i data/deepfashion_eval_noics.json -o data/deepfashion_eval_noics_queries_llama3-8B.json --query_type llama-3-8B
You can evaluate with:
python -m scripts.evaluate \
--queries_file data/deepfashion_eval_noics_queries_llama3-8B.json \
--images_file data/deepfashion_eval_noics.json \
--store_encodings siglip-deepfashion.pt --out_file out/results.json \
--backbone siglip --use_textual_prompts \
--concrete_cache data/deepfashion_train_database.json \
--concrete_type qwen2-vl \
--abstract_type description \
--notes siglip-act-df-df
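Once evaluation completes, you can inspect the output with a small Python snippet like the one below; the exact structure of results.json depends on the evaluation script, so it simply pretty-prints whatever was stored:
# Pretty-print the stored evaluation results (whatever their structure is).
import json

with open("out/results.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))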
In the notebooks/ folder you can find some useful Jupyter Notebooks.
query.ipynb allows you to query the DeepFashion evaluation set using your own descriptions. Note that you need to first run the inference script to save the necessary evaluation embeddings and the shift representation.
attribute-categorization.ipynb showcases the attribute categorization pipeline used for the preliminary experiments. You can download the needed SpaCy model using:
python -m spacy download en_core_web_sm
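To confirm the SpaCy model is available before opening the notebook, a short Python snippet like this one loads it and tags a sample sentence (the sentence is just illustrative):
# Quick check that the SpaCy English model loads correctly.
import spacy

nlp = spacy.load("en_core_web_sm")
print([(token.text, token.pos_) for token in nlp("a sleek minimalist cotton dress")])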
Alternatively, you can use Docker containers. Build the container with:
docker build -t fashionact:latest -f docker/Dockerfile .
And then run it:
docker run --shm-size=64g --gpus '"device=0"' --rm -it -v $(pwd)/data:/app/data/ -v /path-to-huggingface-cache:/root/.cache fashionact /bin/bash
Inside the container, you can use the same scripts as before.
We release a minimal Gradio demo with the ACT model. Install gradio and then run the demo using:
python demo.py
Note that you should run the inference script first to save the necessary image embeddings and the shift characterization.
If you find this repo useful, please don't forget to cite:
@inproceedings{talon2025seeing,
title={Seeing the Abstract: Translating the Abstract Language for Vision Language Models},
author={Talon, Davide and Girella, Federico and Liu, Ziyue and Cristani, Marco and Wang, Yiming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
