Jaden Park1 · Valentin Deschaintre2 · Jason Kuen3 · Kangning Liu3 · Iliyan Georgiev2 · Krishna Kumar Singh3 · Yong Jae Lee3 · Michael Fischer2
1 University of Wisconsin–Madison · 2 Adobe Research, UK · 3 Adobe Research, USA
SIGGRAPH Conference Papers '26
TL;DR — MAOAM is a unified model that selects objects or materials from text prompts, clicks, or both.
See the project page for the full set with larger figures.
Given an input image, MAOAM takes a task prompt specifying the selection criterion (objects or materials) alongside a user prompt given as a click, as text, or both. If a click is provided, stars are overlaid onto the image as visual cues.
Training a unified selection model requires material datasets with dense mask annotations and rich text descriptions. Existing material datasets either lack textual descriptions, are flat material maps, or are tied to a specific domain. We therefore collect new material data from both real and synthetic sources to capture natural diversity alongside precise, controlled annotations.
| Set | Masks | Images | Source |
|---|---|---|---|
| RealMat | ~49K | ~8K | Real images from Pexels, hand-annotated with material masks. |
| SynMat | ~55K | ~5.5K | Blender renders from 132 Evermotion scenes with semantically valid material assignments (a sofa is leather, not stone). |
| SAMa | ~3.3K | ~1.3K | Multi-view video frames from SAMa; assignments aren't always semantically meaningful but are consistent across object parts. |
We use Qwen3-VL-235B-A22B-Thinking with Set-of-Marks prompting to generate candidate annotations, then incorporate quality control through model-based verification.
For each marked region we generate three types of descriptions: (i) a short material description paired with an entity label (e.g. "the white ceramic chair"); (ii) a short material description with spatial information, either absolute ("bottom right corner") or relative ("above the table"); and (iii) a longer, self-contained material description that does not rely on context. We sample 6 variants per region, ranging from 10 to 50 words. Model-based verification then corrects grounding and instruction-following errors.
Training on VQA encourages fine-grained material understanding through reasoning in text. We formulate two variants of a 4-way multiple-choice task: Q1 samples distractors from other regions in the same image (or other images, if too few materials are present); Q2 introduces hard-negative mining where the answer's description is paraphrased into a visually plausible but incorrect alternative — e.g. "brown wood with dark streaks" → "horizontally grained light wood".
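As a rough illustration of the Q1 variant, the sketch below assembles one 4-way question by preferring distractors from the same image and topping up from other images when too few are available. The function name `build_q1` and its arguments are hypothetical and not taken from the released code; only the sampling rule and the A/B/C/D/answer layout follow the text.

```python
import random

LETTERS = "ABCD"  # 4-way multiple choice


def build_q1(answer, same_image, other_images, rng):
    """Assemble one Q1-style question (illustrative sketch, not the authors' code)."""
    n_distractors = len(LETTERS) - 1
    # Prefer descriptions of other regions in the same image.
    same = [d for d in dict.fromkeys(same_image) if d != answer]
    chosen = rng.sample(same, min(len(same), n_distractors))
    if len(chosen) < n_distractors:
        # Too few materials in this image: top up from other images.
        extra = [d for d in dict.fromkeys(other_images)
                 if d != answer and d not in chosen]
        need = n_distractors - len(chosen)
        if len(extra) < need:
            raise ValueError("not enough distinct distractors")
        chosen += rng.sample(extra, need)
    options = chosen + [answer]
    rng.shuffle(options)
    question = dict(zip(LETTERS, options))
    question["answer"] = LETTERS[options.index(answer)]
    return question


if __name__ == "__main__":
    rng = random.Random(0)
    print(build_q1("brown wood with dark streaks",
                   ["white ceramic"],
                   ["grey concrete", "red brick"],
                   rng))
```

Q2 would replace the sampled distractors with paraphrased hard negatives; that step depends on a language model and is not sketched here.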
To train a unified model for both object- and material-level selection, we additionally incorporate publicly available object segmentation datasets: RefCOCO / RefCOCO+ / RefCOCOg for text-based object selection, and EntitySeg for click-based object selection. The combined corpus contains ~190K training samples at an approximate 1:1 ratio of material- to object-centric data, spanning diverse selection prompts and criteria.
We provide two MAOAM backbones with identical interfaces:
| Backbone | VLM | Mask head |
|---|---|---|
| MAOAM-GLaMM | LLaVA-Llama (GLaMM) | SAM ViT-H |
| MAOAM-Sa2VA | Qwen2.5-VL-7B | SAM2 Hiera-L |
Hardware. MAOAM-Sa2VA requires an 80 GB A100 at the reference setting
qwen_max_pixels = 2048×28×28. Lower qwen_max_pixels in the Sa2VA cfg to fit on a smaller card.
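To get a feel for what lowering the setting means, the sketch below expands the reference value into a raw pixel count and compares it with a halved budget. The value 1024 is an arbitrary illustration; the README does not say which setting fits which card.

```python
import math

UNIT = 28  # side length in pixels of one 28x28 unit in the reference setting


def pixel_budget(n_units, unit=UNIT):
    """Total pixels allowed for a budget written as n_units x 28 x 28."""
    return n_units * unit * unit


reference = pixel_budget(2048)  # the reference setting
halved = pixel_budget(1024)     # hypothetical smaller setting

print(reference, halved)        # 1605632 802816
# Side length of the largest square image that fits the reference budget.
print(math.isqrt(reference))    # 1267
```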
The two backends have independent dependency stacks (different Python /
PyTorch / transformers versions), so we keep them in separate virtual
envs managed by uv. Install uv once:
curl -LsSf https://astral.sh/uv/install.sh | sh
Then create the env for whichever backend you need:
# MAOAM-GLaMM (Python 3.10, torch 2.4 + cu124)
cd GLaMM && uv sync
# MAOAM-Sa2VA (Python 3.11, torch 2.6 + cu124, transformers 4.57)
cd Sa2VA && uv sync --extra latest
Each env lives under the backend folder as .venv/. Activate it with
source .venv/bin/activate or prefix any command with uv run.
mkdir -p weights
# MAOAM finetuned weights from HuggingFace
hf download jpark677/maoam_ckpts --local-dir weights
# weights/glamm/mp_rank_00_model_states.pt
# weights/sa2va/mp_rank_00_model_states.pt
# Frozen vision backbones
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -P weights/
mkdir -p weights/sam2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt -P weights/sam2/
Base LLMs are pulled from HuggingFace on first run:
MBZUAI/GLaMM-GranD-Pretrained and Qwen/Qwen2.5-VL-7B-Instruct.
obj-and-mat-selection/ ← repo root
├── README.md
├── utils/hm_utils.py
│
├── GLaMM/
│ ├── pyproject.toml ← GLaMM venv (Python 3.10)
│ ├── demo.py
│ ├── check_load.py
│ ├── model/
│ ├── dataset/
│ └── tools/
│
├── Sa2VA/
│ ├── pyproject.toml ← Sa2VA venv (Python 3.11)
│ ├── demo.py
│ ├── projects/sa2va/
│ │ ├── configs/
│ │ ├── models/sa2va.py
│ │ ├── datasets/
│ │ ├── gradio/app_utils.py
│ │ └── evaluation/sa2va_eval_all.py
│ ├── third_parts/sam2/
│ └── vlm/
│
├── docs/
└── weights/
  ├── glamm/mp_rank_00_model_states.pt
  ├── sa2va/mp_rank_00_model_states.pt
  ├── sam2/sam2_hiera_large.pt
  └── sam_vit_h_4b8939.pth
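Before launching a demo or an evaluation it can help to confirm that the four checkpoint files from the tree above are in place. The helper below is a minimal sketch written for this README; `missing_weights` is a hypothetical name and is not part of the repository.

```python
from pathlib import Path

# Relative paths taken from the weights/ layout shown above.
EXPECTED = [
    "glamm/mp_rank_00_model_states.pt",
    "sa2va/mp_rank_00_model_states.pt",
    "sam2/sam2_hiera_large.pt",
    "sam_vit_h_4b8939.pth",
]


def missing_weights(weights_dir):
    """Return the expected checkpoint files that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights("weights")
    print("all weights present" if not missing else f"missing: {missing}")
```

If only one backend is used, the other backend's files can be ignored in the output.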
A Gradio demo per backend. Both commands must be run from the repo root
(utils/hm_utils.py is imported from there).
# MAOAM-GLaMM (uses GLaMM/.venv)
GLaMM/.venv/bin/python GLaMM/demo.py \
--model_path MBZUAI/GLaMM-GranD-Pretrained \
--resume weights/glamm/mp_rank_00_model_states.pt \
--vision_pretrained weights/sam_vit_h_4b8939.pth \
--port 7860
# MAOAM-Sa2VA (uses Sa2VA/.venv)
PYTHONPATH=Sa2VA Sa2VA/.venv/bin/python Sa2VA/demo.py \
--cfg Sa2VA/projects/sa2va/configs/glamm_qwen25_7b_material_only.py \
--resume weights/sa2va/mp_rank_00_model_states.pt \
--sam2_ckpt weights/sam2/sam2_hiera_large.pt \
--port 7861
Open http://localhost:7860 (MAOAM-GLaMM) or http://localhost:7861 (MAOAM-Sa2VA). Upload an image, click
up to 5 points to drop stars (optional), type a text prompt, and hit Submit.
The binary mask and cyan overlays can then be downloaded as a zip with one click.
| Mode | Click | Text | Notes |
|---|---|---|---|
| Click only | ✓ | – | uses the canonical material prompt |
| Text only | – | ✓ | original image, no star |
| Click + text | ✓ | ✓ | combines both cues (paper Fig. 7) |
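The three modes in the table can be summarized as a small decision function. This is an illustrative sketch only: `selection_mode` is a hypothetical name, and treating an empty request as an error is an assumption, since the README does not say how the demo handles it.

```python
MAX_CLICKS = 5  # the demo accepts up to 5 points


def selection_mode(clicks, text):
    """Map a (clicks, text) input to one of the three modes in the table."""
    if len(clicks) > MAX_CLICKS:
        raise ValueError(f"at most {MAX_CLICKS} clicks are supported")
    has_click = len(clicks) > 0
    has_text = bool(text and text.strip())
    if has_click and has_text:
        return "click+text"  # both cues are combined
    if has_click:
        return "click"       # falls back to the canonical material prompt
    if has_text:
        return "text"        # original image, no star overlay
    # Assumption: a request with neither cue is rejected.
    raise ValueError("provide at least one click or a text prompt")


if __name__ == "__main__":
    print(selection_mode([(120, 80)], ""))
    print(selection_mode([], "the white ceramic chair"))
```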
Paper Table 2: RefCOCO / RefCOCO+ / RefCOCOg val + test splits, EntitySeg val. Both eval scripts live in their respective backend folders and run from the repo root.
Set env vars first:
export WEIGHTS=/path/to/weights # contains glamm/, sa2va/, sam2/, sam_vit_h_4b8939.pth
export DATA_ROOT=/path/to/data # contains entityseg/ and Refer_Segm/
# MAOAM-GLaMM: 4 GPUs, all datasets
PYTHONPATH=. GLaMM/.venv/bin/torchrun --nproc_per_node=4 \
GLaMM/tools/eval_all.py \
--model_path MBZUAI/GLaMM-GranD-Pretrained \
--resume "$WEIGHTS/glamm/mp_rank_00_model_states.pt" \
--vision_pretrained "$WEIGHTS/sam_vit_h_4b8939.pth" \
--data_root "$DATA_ROOT"
# EntitySeg only (if RefCOCO data not yet downloaded)
PYTHONPATH=. GLaMM/.venv/bin/torchrun --nproc_per_node=4 \
GLaMM/tools/eval_all.py \
--model_path MBZUAI/GLaMM-GranD-Pretrained \
--resume "$WEIGHTS/glamm/mp_rank_00_model_states.pt" \
--vision_pretrained "$WEIGHTS/sam_vit_h_4b8939.pth" \
--data_root "$DATA_ROOT" \
--datasets entityseg
# MAOAM-Sa2VA: 4 GPUs, all datasets
DATA_ROOT="$DATA_ROOT" PYTHONPATH=Sa2VA Sa2VA/.venv/bin/torchrun --nproc_per_node=4 \
Sa2VA/projects/sa2va/evaluation/sa2va_eval_all.py \
--resume "$WEIGHTS/sa2va/mp_rank_00_model_states.pt" \
--work_dir "$(dirname "$WEIGHTS")"
# EntitySeg only
DATA_ROOT="$DATA_ROOT" PYTHONPATH=Sa2VA Sa2VA/.venv/bin/torchrun --nproc_per_node=4 \
Sa2VA/projects/sa2va/evaluation/sa2va_eval_all.py \
--resume "$WEIGHTS/sa2va/mp_rank_00_model_states.pt" \
--work_dir "$(dirname "$WEIGHTS")" \
--datasets entityseg
--work_dir for Sa2VA must be the parent of weights/ (the directory that contains weights/sam2/sam2_hiera_large.pt).
Both scripts write per-split *_summary.json alongside the checkpoint.
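To gather the per-split results afterwards, a sketch like the following collects every *_summary.json in a directory. The contents of each summary are not documented here, so the code only loads them as-is; `load_summaries` is a hypothetical helper, and the directory to pass is assumed to be the one holding the checkpoint.

```python
import json
from pathlib import Path

SUFFIX = "_summary.json"


def load_summaries(ckpt_dir):
    """Load all *_summary.json files in ckpt_dir, keyed by the split prefix."""
    summaries = {}
    for path in sorted(Path(ckpt_dir).glob(f"*{SUFFIX}")):
        with path.open() as f:
            summaries[path.name[: -len(SUFFIX)]] = json.load(f)
    return summaries


if __name__ == "__main__":
    # Assumed location: the folder that contains the GLaMM checkpoint.
    for split, summary in load_summaries("weights/glamm").items():
        print(split, summary)
```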
We release a 10 % subset of the material annotations from the paper — with per-region text descriptions and VQA questions — across three sets:
| Split | (image, mat) pairs | Unique images | VQA questions | Descriptions |
|---|---|---|---|---|
| SynMat | 5,431 | 2,582 | 10,862 | 32,586 |
| RealMat | 4,663 | 2,685 | 9,326 | 27,978 |
| SAMa | 330 | 239 | 658 | 1,974 |
| Total | 10,424 | 5,506 | 20,846 | 62,538 |
VQA has 2 questions per (image, mat) pair; descriptions have 6 variants per pair.
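The per-pair ratios can be checked against the table with a few lines of arithmetic. The numbers below are copied from the table. For SAMa, both the VQA and description counts imply 329 pairs, one fewer than the 330 listed; the source does not explain the difference.

```python
# (pairs, vqa_questions, descriptions) per split, copied from the table above.
ROWS = {
    "SynMat": (5431, 10862, 32586),
    "RealMat": (4663, 9326, 27978),
    "SAMa": (330, 658, 1974),
}

QUESTIONS_PER_PAIR = 2
DESCRIPTIONS_PER_PAIR = 6


def implied_pairs(vqa, descriptions):
    """Number of pairs implied by the VQA and description counts."""
    return vqa // QUESTIONS_PER_PAIR, descriptions // DESCRIPTIONS_PER_PAIR


for name, (pairs, vqa, desc) in ROWS.items():
    print(name, pairs, implied_pairs(vqa, desc))

totals = tuple(sum(col) for col in zip(*ROWS.values()))
print("Total", totals)  # (10424, 20846, 62538), matching the Total row
```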
For object-centric comparison we use EntitySeg val and RefCOCO / RefCOCO+ / RefCOCOg val + test splits — these are not part of our release; the download script below pulls them from upstream sources.
export DATA_ROOT=/path/to/data # same data root as used for evaluation
# ── MAOAM test set (the released subset) ───────────────────────────────────
hf download jpark677/maoam_data --repo-type dataset --local-dir "$DATA_ROOT/maoam_data"
# ── RefCOCO / RefCOCO+ / RefCOCOg ──────────────────────────────────────────
# 1. Annotations: convert from HuggingFace parquet to REFER API format (~200 MB dl, ~180 MB output)
pip install pyarrow huggingface_hub # if not already present
mkdir -p "$DATA_ROOT/Refer_Segm"
python tools/convert_refcoco_parquet.py --out_dir "$DATA_ROOT/Refer_Segm"
# 2. COCO 2014 train images ~13 GB
wget http://images.cocodataset.org/zips/train2014.zip -P "$DATA_ROOT/"
mkdir -p "$DATA_ROOT/Refer_Segm/images/mscoco/images"
unzip "$DATA_ROOT/train2014.zip" -d "$DATA_ROOT/Refer_Segm/images/mscoco/images/"
rm "$DATA_ROOT/train2014.zip"
# ── EntitySeg ────────────────────────────────────────────────────────────────
# 1. Images via HuggingFace ~55 GB
hf download qqlu1992/Adobe_EntitySeg --repo-type dataset --local-dir "$DATA_ROOT/entityseg"
# 2. COCO-style annotations ~181 MB
mkdir -p "$DATA_ROOT/entityseg/annotations"
wget https://github.com/adobe-research/EntitySeg-Dataset/releases/download/v1.0/entityseg_insseg_train_annotations.json \
-P "$DATA_ROOT/entityseg/annotations/"
wget https://github.com/adobe-research/EntitySeg-Dataset/releases/download/v1.0/entityseg_insseg_val_annotations.json \
-P "$DATA_ROOT/entityseg/annotations/"
Note on EntitySeg images. After the HuggingFace download, the three image archives (entity_01_11580, entity_02_11598, entity_03_10049) may arrive as tar files. Extract each so the final layout matches the tree below.
For reference, the upstream sources:
| Source | What | Size |
|---|---|---|
| HF jxu124/refcoco, jxu124/refcocoplus, jxu124/refcocog | RefCOCO / RefCOCO+ / RefCOCOg (converted via tools/convert_refcoco_parquet.py) | ~200 MB |
| images.cocodataset.org | COCO 2014 train images (used by RefCOCO) | ~13 GB |
| HF qqlu1992/Adobe_EntitySeg | EntitySeg image archives | ~55 GB |
| adobe-research/EntitySeg-Dataset v1.0 | EntitySeg COCO-style annotations | ~181 MB |
Final layout (matches the dataloaders out of the box):
$DATA_ROOT/
├── maoam_data/ # MAOAM release subset
│ ├── synmat_release.json
│ ├── synmat_descriptions.json
│ ├── synmat_vqa.json
│ ├── realmat_release.json
│ ├── realmat_descriptions.json
│ ├── realmat_vqa.json
│ ├── sama_release.json
│ ├── sama_descriptions.json
│ ├── sama_vqa.json
│ ├── synmat/
│ │ ├── images/ # PNG renders
│ │ └── masks/ # binary masks *_mat<id>.png
│ ├── realmat/
│ │ ├── images/
│ │ └── masks/
│ └── sama/
│ ├── images/
│ └── masks/
├── Refer_Segm/
│ ├── refcoco/ refcoco+/ refcocog/ # annotations + refs(...).p
│ └── images/mscoco/images/train2014/ # 82,783 .jpg, ~13 GB
└── entityseg/
  ├── annotations/ # entityseg_insseg_{train,val}_annotations.json
  └── images/
    ├── entity_01_11580/images_merge/ # 11,580 .jpg
    ├── entity_02_11598/images/ # 11,598 .jpg
    └── entity_03_10049/images_03_10049/ # 10,049 .jpg
{source}_release.json is a flat JSON array with one entry per evaluated (image, material) pair.
Image and mask paths are resolved as:
images/ {stem}.png
masks/ {stem}_mat{mat_id}.png
where stem is the filepath with its source prefix and extension stripped, and
any intermediate path separators replaced by __
(e.g. /realmat/material_20241203/000625.jpg → material_20241203__000625).
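The path rule above can be written as a short function. This is a sketch under stated assumptions: `resolve_paths` is a hypothetical helper, the source prefix is assumed to have the form /{source}/ as in the RealMat example, and the returned paths are relative to the per-source folder (e.g. maoam_data/realmat/).

```python
import posixpath


def resolve_paths(filepath, mat_id, source):
    """Return (image_path, mask_path) for one (image, material) pair."""
    prefix = f"/{source}/"  # assumed prefix form, e.g. "/realmat/"
    rel = filepath[len(prefix):] if filepath.startswith(prefix) else filepath
    rel = rel.lstrip("/")
    stem, _ext = posixpath.splitext(rel)  # drop the extension
    stem = stem.replace("/", "__")        # flatten intermediate directories
    return f"images/{stem}.png", f"masks/{stem}_mat{mat_id}.png"


if __name__ == "__main__":
    # mat_id 2 is an arbitrary example value.
    print(resolve_paths("/realmat/material_20241203/000625.jpg", 2, "realmat"))
```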
{
// key: basename for synmat/sama; relative subpath for realmat
"AI09_002_frame0780_selection_materialistic.exr": {
"2": { // mat_id as string
"descriptions": [
"short material label",
"label with entity context",
"description with absolute spatial location",
"description with relative spatial location",
"longer self-contained description",
"paraphrase of the longer description"
]
}
}
}
Key convention. For SynMat and SAMa the top-level key is
os.path.basename(filepath). For RealMat it is filepath with the /realmat/ prefix stripped (e.g. "material_20241203/000625.jpg"), since basenames are not unique across subdirectories.
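A minimal sketch of the key convention and a lookup into a descriptions file follows. The helper names are hypothetical, the directory in the SynMat example path is made up for illustration, and the in-memory dictionary stands in for a loaded {source}_descriptions.json.

```python
import os

REALMAT_PREFIX = "/realmat/"


def annotation_key(filepath, source):
    """Top-level key used in the descriptions and VQA files."""
    if source == "realmat":
        # RealMat keeps the relative subpath because basenames can collide.
        if filepath.startswith(REALMAT_PREFIX):
            return filepath[len(REALMAT_PREFIX):]
        return filepath
    # SynMat and SAMa use the bare file name.
    return os.path.basename(filepath)


def get_descriptions(descriptions, filepath, mat_id, source):
    """Look up the list of description variants for one (image, material) pair."""
    key = annotation_key(filepath, source)
    return descriptions[key][str(mat_id)]["descriptions"]  # mat_id is a string key


if __name__ == "__main__":
    demo = {"AI09_002_frame0780_selection_materialistic.exr":
            {"2": {"descriptions": ["short material label"]}}}
    path = "/synmat/some_scene/AI09_002_frame0780_selection_materialistic.exr"
    print(get_descriptions(demo, path, 2, "synmat"))
```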
{
"AI09_002_frame0780_selection_materialistic.exr": {
"2": [ // list of 2 questions
{
"A": "option text A",
"B": "option text B",
"C": "option text C",
"D": "option text D",
"answer": "C" // correct letter
},
{ "A": "...", "B": "...", "C": "...", "D": "...", "answer": "A" }
]
}
}
Same key convention as {source}_descriptions.json above.
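Given this layout, multiple-choice accuracy can be computed by walking the nested structure. The sketch below assumes predictions are stored as a dictionary keyed by (image key, mat_id, question index); that format and the name `vqa_accuracy` are illustrative choices, not part of the release.

```python
def vqa_accuracy(vqa, predictions):
    """Fraction of questions whose predicted letter matches the stored answer.

    vqa:         {image_key: {mat_id: [ {A, B, C, D, answer}, ... ]}}
    predictions: {(image_key, mat_id, question_index): "A" | "B" | "C" | "D"}
    Missing predictions count as wrong.
    """
    correct = 0
    total = 0
    for key, per_material in vqa.items():
        for mat_id, questions in per_material.items():
            for idx, question in enumerate(questions):
                total += 1
                if predictions.get((key, mat_id, idx)) == question["answer"]:
                    correct += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    vqa = {"AI09_002_frame0780_selection_materialistic.exr": {"2": [
        {"A": "a", "B": "b", "C": "c", "D": "d", "answer": "C"},
        {"A": "a", "B": "b", "C": "c", "D": "d", "answer": "A"},
    ]}}
    key = "AI09_002_frame0780_selection_materialistic.exr"
    preds = {(key, "2", 0): "C", (key, "2", 1): "B"}
    print(vqa_accuracy(vqa, preds))  # 0.5
```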
Refer_Segm/<dataset>/
├── instances.json # COCO-style {images, annotations, categories}
└── refs(<splitBy>).p # pickled list[dict]; splitBy ∈ {unc, umd, google}
Each refs(...).p entry:
{
"image_id": int,
"ann_id": int, # FK -> mask in instances.json
"ref_id": int,
"split": "train" | "val" | "testA" | "testB" | "test",
"sentences": [{"sent_id": int, "sent": str, "tokens": [str], "raw": str}, ...],
"sent_ids": [int, ...],
"category_id": int,
"file_name": "COCO_train2014_000000000009.jpg",
}
COCO instance-segmentation format with two extra fields:
{
"images": [{"id": int, "file_name": "*.jpg", "height": int, "width": int}, ...],
"annotations": [{
"id": int,
"image_id": int,
"category_id": int, // remapped EntitySeg category
"category_id_ori": int, // original COCO/LVIS id
"segmentation": {"size": [H, W], "counts": "<RLE>"},
"bbox": [x, y, w, h],
"area": int,
"iscrowd": 0 | 1,
"attribute": int | str, // material/texture tag
"remove": 0 | 1 // 1 = exclude from eval
}, ...],
"categories": [{"id": int, "name": str, "supercategory": str}, ...]
}
This work was supported in part by NSF IIS2404180 and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration). The authors would like to thank Sudeep Katakol for his help in data generation and Zijun Wei, Yash Savani, and Soochahn Lee for helpful discussions.
This implementation builds on GLaMM, Sa2VA, SAM, and SAM2.
@inproceedings{park2026maoam,
title = {MAOAM: Unified Object and Material Selection with Vision-Language Models},
author = {Park, Jaden and Deschaintre, Valentin and Kuen, Jason and
Liu, Kangning and Georgiev, Iliyan and Singh, Krishna Kumar and
Lee, Yong Jae and Fischer, Michael},
booktitle = {ACM SIGGRAPH 2026 Conference Papers},
year = {2026},
publisher = {ACM},
doi = {10.1145/3799902.3811186},
}

