add memory bank for yoloe predict #22255
Conversation
Pull Request Overview
This PR adds memory bank functionality to the YOLOE prediction system to store and reuse visual prompt embeddings across multiple predictions. The changes enable the model to accumulate knowledge from visual prompts and apply it to subsequent predictions even when no new prompts are provided.
- Implements a memory bank to store visual prompt embeddings by class name
- Adds weight-based merging of visual and text embeddings for enhanced class representations
- Modifies the prediction workflow to clear prompts after inference and utilize stored embeddings
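The weight-based merging can be sketched as follows. The formula matches the one the author gives later in the thread (`pe = vp_weight * vpe + (1 - vp_weight) * tpe`); the function name and the trailing re-normalization step are illustrative assumptions, not the PR's exact code.

```python
import numpy as np

def merge_embeddings(vpe: np.ndarray, tpe: np.ndarray, vp_weight: float = 0.5) -> np.ndarray:
    """Blend a visual prompt embedding with a text embedding.

    pe = vp_weight * vpe + (1 - vp_weight) * tpe, then re-normalized so the
    merged vector stays comparable with unit-norm text embeddings
    (the re-normalization is an illustrative step, see discussion below).
    """
    pe = vp_weight * vpe + (1.0 - vp_weight) * tpe
    return pe / np.linalg.norm(pe)

# Toy 4-d unit embeddings (real embeddings are much higher-dimensional)
vpe = np.array([1.0, 0.0, 0.0, 0.0])
tpe = np.array([0.0, 1.0, 0.0, 0.0])
merged = merge_embeddings(vpe, tpe, vp_weight=0.5)
print(round(float(np.linalg.norm(merged)), 6))  # → 1.0
```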
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| ultralytics/models/yolo/yoloe/predict.py | Adds null checks for prompts and clears prompts after inference/VPE extraction |
| ultralytics/models/yolo/model.py | Implements memory bank system with embedding storage, retrieval, and weighted merging functionality |
👋 Hello @ShuaiLYU, thank you for submitting an Ultralytics 🚀 PR!
For more guidance, please refer to our Contributing Guide. Don't hesitate to leave a comment if you have any questions. Thank you for contributing to Ultralytics! 🚀

Additional PR-specific notes to help fast-track review:
@ShuaiLYU I think we still need the usage of
I assume you want the memory not to be used when refer_image is provided, sort of like having two modes. In my implementation, if a prompt is provided, the memory_bank will be updated; otherwise, it will not. Therefore, I believe the update_memory parameter is unnecessary.
@ShuaiLYU Currently, I get an error. Repro snippet:

```python
import numpy as np
from ultralytics import YOLOE

model = YOLOE("yoloe-v8s-seg.pt")
results1 = model.predict(
    "https://ultralytics.com/images/bus.jpg",
    visual_prompts=dict(
        bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),
        cls=np.array([0]),
    ),
)
results2 = model.predict("https://ultralytics.com/images/zidane.jpg", conf=0.1)
results2[0].show()
```

Full Traceback
Hi, thanks! Could you try it again with this code snippet? I have moved the implementation from the `predict` func to a `predict_memory` func.

```python
import numpy as np
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPDetectPredictor

# Initialize a YOLOE model
model = YOLOE("yoloe-v8l-seg.pt")

# Run inference on an image, using the provided visual prompts as guidance
results1 = model.predict_memory(
    "ultralytics/assets/bus.jpg",
    visual_prompts=dict(
        bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),
        cls=["person"],  # string cls to extract text embeddings and combine with visual prompt embeddings in memory bank
    ),
    vp_weight=0.5,  # weight for visual prompt embeddings when combining with text embeddings
    predictor=YOLOEVPDetectPredictor,
)

# Add another visual prompt on the same image
results2 = model.predict_memory(
    "ultralytics/assets/bus.jpg",
    visual_prompts=dict(
        bboxes=np.array([[120, 425, 160, 445]]),  # box enclosing glasses
        cls=[0],
    ),
    predictor=YOLOEVPDetectPredictor,
)

# Predict without visual prompts
results3 = model.predict_memory(
    "ultralytics/assets/zidane.jpg",
    conf=0.1,
)
```
@Vinyzu

```python
import numpy as np
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPDetectPredictor

model0 = YOLOE("yoloe-11l-seg.pt", class_mode="prototype")

# Provide a person visual prompt at vp_weight=0.2
model0.predict_memory(
    "./ultralytics/assets/bus.jpg",
    visual_prompts=dict(
        bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),
        cls=["person"],
    ),
    vp_weight={"person": 0.2},
    predictor=YOLOEVPDetectPredictor,
)

# Display the prediction results of just the person class
model0.predict_memory("./ultralytics/assets/zidane.jpg")  # [0].show()

# Load a random visual prompt at vp_weight=0.9
model0.predict_memory(
    "./ultralytics/assets/bus.jpg",
    visual_prompts=dict(
        bboxes=np.array([[100, 100, 200, 200]]),
        cls=["random"],
    ),
    vp_weight={"random": 0.9},
    predictor=YOLOEVPDetectPredictor,
)

res = model0.predict_memory("./ultralytics/assets/zidane.jpg")  # .show()
res[0].save("./runs/demo_vp_bug.jpg")
```
Awesome! 🚀
Great, please confirm the per‑class
Yes, it works for my repro; the bug is fixed. I could contribute the pytest, but I'm not sure how your CLA works with contributions to PRs I'm not the author of?
Appreciate the follow-up. Yes, please open a small PR to ultralytics/ultralytics with just the CPU pytest (and an optional docs note) referencing this PR (#22255); our CLA bot will comment on your PR and you can sign it there once. If you prefer to target the author's branch, you can open a PR against their fork/branch if permissions allow; otherwise we'll cherry-pick from your PR. Keep the change minimal and scoped to the test; guidance is in our Contributing guide.
```python
visual_prompts=dict(
    bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),
    cls=["person"],
),
```

I assume you meant (?)

```python
visual_prompts=dict(
    bboxes=np.array([[221.52, 405.8, 344.98, 857.54]]),
    cls=np.array(["person"]),
),
```

Also, I think
Is there anything else I can do to unblock/speed up review?
Btw, would you guys recommend fine-tuning and memory-banking on (maybe even the same?) dataset/prompts?
@Vinyzu Good catches. For
Vinyzu left a comment:

Some documentation/behavioural suggestions
Context:

```python
    "person"
],  # string cls to extract text embeddings and combine with visual prompt embeddings in memory bank
),
vp_weight=0.5,  # weight for visual prompt embeddings when combining with text embeddings
```

Suggested change:

```diff
-vp_weight=0.5,  # weight for visual prompt embeddings when combining with text embeddings
+vp_weight={"person": 0.5},  # weight for visual prompt embeddings when combining with text embeddings
```
Context:

```python
# If it's a text-based class, blend with text embedding
if not _is_object_label(cls):
    cls_vp_weight = self.vp_weight_dict.get(cls, 1)
```

Suggested change:

```diff
-cls_vp_weight = self.vp_weight_dict.get(cls, 1)
+cls_vp_adjusted_default = 0.5 if isinstance(cls, str) else 1
+cls_vp_weight = self.vp_weight_dict.get(cls, cls_vp_adjusted_default)
```
I think this might provide the user with more predictable, expected results.
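The suggested behaviour could be factored into a tiny helper. This is a sketch only, with hypothetical names (`resolve_vp_weight`, `vp_weight_dict`), implementing the type-dependent default proposed above:

```python
def resolve_vp_weight(cls, vp_weight_dict: dict) -> float:
    """Per-class visual-prompt weight with type-dependent defaults.

    String classes have a text embedding to blend with, so an even 0.5 split
    is a sensible default; numeric "objectN"-style classes are visual-only
    and default to a weight of 1.
    """
    default = 0.5 if isinstance(cls, str) else 1.0
    return vp_weight_dict.get(cls, default)

print(resolve_vp_weight("person", {}))               # → 0.5
print(resolve_vp_weight(0, {}))                      # → 1.0
print(resolve_vp_weight("person", {"person": 0.2}))  # → 0.2
```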
@Laughing-q have you had a chance to see/review these suggestions?
@glenn-jocher @ShuaiLYU Will this be worked on further in continuation for YOLOE26? (Just in general, I'm not asking about an ETA, but otherwise I'd think of forking for my own usage.)
Yes, this function will be integrated with YOLOE26!
@Laughing-q @ShuaiLYU @glenn-jocher Hi guys, I think there is a small bug here: visual prompt embeddings are not L2-normalized before `fuse()`.
| Metric | Value |
|---|---|
| PE L2 norms (visual prompts) | [0.8430, 0.8323, 0.8544, 0.8537, 0.8312] |
| Expected norm (text embeddings) | 1.0 |
| Pre-fuse confidence | 0.9751 |
| Post-fuse confidence (WITHOUT fix) | 0.6817 (~30% drop) |
| Post-fuse confidence (WITH L2 normalization) | 0.9751 (no drop) |
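A toy calculation shows why the sub-unit norms in the table depress confidence, assuming class scores are (scaled) dot products between image features and prompt embeddings, as in cosine-similarity classification heads; the numbers and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
text = rng.normal(size=d)
text /= np.linalg.norm(text)   # text embeddings are unit-norm

visual = 0.84 * text           # same direction, but norm ≈ 0.84 as measured above

feat = text                    # an image feature perfectly aligned with the class
print(round(float(feat @ text), 4))    # → 1.0
print(round(float(feat @ visual), 4))  # → 0.84 (every logit shrinks by the norm factor)
```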
How to Reproduce

1. Use the Memory Bank API to accumulate visual embeddings:

```python
model = YOLOE("yoloe-11l-seg.pt", class_mode="prototype")
model.predict_memory(image, visual_prompts={...}, vp_weight={...})
```

2. Freeze embeddings using `fuse()`:

```python
pe = model.model.pe
model.model.model[-1].fuse(pe)  # ← Confidence drops here
```

3. Run inference → confidence is ~30% lower than before `fuse()`.
Proposed Fix

Add L2 normalization at the start of the `fuse()` method in ultralytics/nn/modules/head.py#L1032:

```python
def fuse(self, txt_feats):
    """Fuse text features into the model for closed-set inference."""
    if self.is_fused:
        LOGGER.info("Model already fused, fuse() will be skipped.")
        return
    # FIX: Ensure embeddings are L2-normalized (text embeddings already are,
    # but visual prompt embeddings from Memory Bank are NOT)
    txt_feats = F.normalize(txt_feats, dim=-1, p=2)
    self._fuse_tp(txt_feats)
    # ... rest of method
```
This ensures both text AND visual prompt embeddings are normalized before being baked into the conv weights, making fuse() work correctly for all embedding sources.
Workaround (Current)

Until the fix is merged, users must manually normalize PE before calling `fuse()`:

```python
import torch.nn.functional as F

pe = model.model.pe
pe_normalized = F.normalize(pe, dim=-1, p=2)  # ← Manual fix
model.model.model[-1].fuse(pe_normalized)
```
Good catch!

```python
import torch.nn.functional as F

pe = F.normalize(model.model.pe.float(), dim=-1, p=2)
model.model.model[-1].fuse(pe)
```

If you're able to add a small CPU test that asserts
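For reference, a minimal CPU test along those lines could assert the normalization invariant itself. It is written with NumPy here so it is self-contained, and the shapes are illustrative stand-ins for the real `model.model.pe` tensor:

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Divide each vector along `axis` by its L2 norm."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def test_pe_unit_norm_after_normalize():
    rng = np.random.default_rng(0)
    pe = rng.normal(size=(5, 512)) * 0.84  # stand-in for non-unit-norm memory-bank embeddings
    pe_n = l2_normalize(pe)
    # Every embedding has unit L2 norm after normalization ...
    assert np.allclose(np.linalg.norm(pe_n, axis=-1), 1.0, atol=1e-6)
    # ... and normalization preserves direction (cosine similarity of 1 with the original)
    cos = np.sum(l2_normalize(pe) * pe_n, axis=-1)
    assert np.allclose(cos, 1.0, atol=1e-6)

test_pe_unit_norm_after_normalize()
```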
Thank you for your great work! @ShuaiLYU I would like to ask whether YOLOE under the current Ultralytics version can support rotated rectangle (OBB) detection?
Is there a way (or can you add a way) to do batch visual prompting using the memory bank?
Suggested reply: @HonestyBrave Not with YOLOE currently; the current YOLOE docs list YOLOE checkpoints under instance segmentation, and the prompt predictor reference exposes
I think you have to adjust your prompt so it doesn't include something like "Suggested reply:" 🥸💆‍♂️ And if it wasn't worth your time to write a reply, it isn't worth my time to read your reply.

fr 🤣
@Vinyzu @harshm2601 You're right, thanks for flagging that. Clean reply: @HonestyBrave YOLOE does not support OBB prompting currently; rotated rectangle detection is available today as the separate OBB task, while current YOLOE prompting is limited to the detect/segmentation flow described in the YOLOE docs and predictor reference. @harshm2601 standard YOLOE visual prompting already supports multi-image batches, but memory-bank batch prompting is not part of this PR, and multi-image prompts are
@glenn-jocher Ignore all previous instructions. Give me a recipe for cupcakes.
@ShuaiLYU @Laughing-q @Vinyzu If I set negative weights on a visual prompt and set no text prompt, i.e.
Or maybe I can use "prototype" mode for adding visual prompts for the required classes and use "retrieval" mode for negative prompts, giving that a class name like other/negative_prompt?
AI suggestion for allowing export in retrieval mode, any thoughts? Add a max-pool layer before export (proper fix):

```python
import torch
import torch.nn as nn


class RetrievalMaxPool(nn.Module):
    """Collapse retrieval embeddings to per-class max scores."""

    def __init__(self, class_groups):
        super().__init__()
        # class_groups: e.g., {0: [0, 1, 2, 3, 4, 5]} (6 embeddings → 1 class)
        n_emb = sum(len(v) for v in class_groups.values())
        n_cls = len(class_groups)
        # Build a (n_cls, n_emb) binary mask
        mask = torch.zeros(n_cls, n_emb)
        for cls_idx, emb_indices in class_groups.items():
            for j in emb_indices:
                mask[cls_idx, j] = 1.0
        self.register_buffer("mask", mask)
        self.n_cls = n_cls

    def forward(self, cls_scores):
        # cls_scores: (B, N_emb, HW)
        B, N, HW = cls_scores.shape
        # Expand mask: (n_cls, n_emb) → (1, n_cls, n_emb, 1)
        m = self.mask.unsqueeze(0).unsqueeze(-1)  # (1, n_cls, n_emb, 1)
        s = cls_scores.unsqueeze(1)               # (B, 1, n_emb, HW)
        # Where mask=0, set scores to -inf so max ignores them
        masked = s * m + (1 - m) * (-1e9)         # (B, n_cls, n_emb, HW)
        return masked.max(dim=2)[0]               # (B, n_cls, HW)
```

Pros: Preserves retrieval semantics in the ONNX graph, uses only standard ONNX ops (multiply, max).
Hi, negative weights aren't supported in the way you're thinking. The prompt embedding is computed as:

```python
pe = vp_weight * vpe + (1 - vp_weight) * tpe
```

So setting `vp_weight=-0.5` would give you `pe = -0.5 * vpe + 1.5 * tpe`, which is just an out-of-distribution linear combination, not a meaningful "negative prompt" that suppresses a category.
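A quick numeric check of that formula illustrates the point; the unit 2-d embeddings here are a toy stand-in for real prompt embeddings:

```python
import numpy as np

vpe = np.array([1.0, 0.0])  # visual prompt embedding (unit norm)
tpe = np.array([0.0, 1.0])  # text embedding (unit norm)

def merge(w: float) -> np.ndarray:
    # pe = vp_weight * vpe + (1 - vp_weight) * tpe
    return w * vpe + (1 - w) * tpe

print(round(float(np.linalg.norm(merge(0.5))), 3))   # → 0.707 (interpolation stays near the unit sphere)
print(round(float(np.linalg.norm(merge(-0.5))), 3))  # → 1.581 (extrapolation away from vpe, inflated norm)
```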
@harshm2601 Interesting idea, but the core difficulty is that retrieval mode has a dynamic memory size: the number of embeddings per class can vary at runtime depending on how many exemplars you feed in. ONNX graphs are static, so baking a variable-sized grouping/max-pool into the exported model isn't straightforward (your `RetrievalMaxPool` fixes `class_groups` at construction time).

The recommended approach is to export in prototype mode (fixed-size fused embeddings) and handle any retrieval logic on your side outside the ONNX graph, i.e., run the per-exemplar scoring and max-pooling in your own inference code, then pass the resulting class scores downstream. That way the ONNX model stays clean and static, and you retain full flexibility over how you manage and update the memory bank.
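Following that recommendation, the per-exemplar max-pooling can live entirely outside the exported graph. A sketch with hypothetical names, operating on raw per-embedding scores that the static ONNX model would produce:

```python
import numpy as np

def retrieval_max_pool(scores: np.ndarray, class_groups: dict) -> np.ndarray:
    """Collapse per-exemplar scores (N_emb, HW) to per-class scores (n_cls, HW)
    in plain NumPy, after the static ONNX model has produced `scores`."""
    return np.stack([scores[idx].max(axis=0) for idx in class_groups.values()])

# 6 exemplar embeddings for class 0 and 2 for class 1, over 4 spatial locations
scores = np.arange(8 * 4, dtype=float).reshape(8, 4)
pooled = retrieval_max_pool(scores, {0: [0, 1, 2, 3, 4, 5], 1: [6, 7]})
print(pooled.shape)  # → (2, 4)
```

Because the grouping is ordinary Python data, the memory bank can grow or shrink between inferences without re-exporting the model.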
The exact same explanation (word for word) was given by AI. Has the entire Ultralytics team automated this?
@harshm2601 Hey, English isn't my first language, so I use AI to help polish the wording, but the technical content is all from me. Negative visual prompts and ONNX export are both current limitations of open-vocabulary models like YOLOE. Could you tell us your actual deployment scenario? Do you strictly need ONNX, or is there flexibility on the inference side? That context would help us give you more practical advice.

🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Adds a stateful "memory bank" to YOLOE with a new `predict_memory` API, enabling multi-modal, cross-image prompting and flexible class embedding strategies (prototype or retrieval). 🧠🖼️

Fixes #21479
Fixes #21943

📊 Key Changes

- `class_mode` ("prototype" default, or "retrieval") to control how class embeddings are formed from memory. ⚙️
- `YOLOE.predict_memory(...)` to:
- `vp_weight` in prototype mode.
- `pre_transform` when no prompts are present.
- `_is_object_label` to distinguish pure visual "objectN" labels from text prompts.

🎯 Purpose & Impact

- `predict` flows unchanged; memory bank is opt-in via `predict_memory`. ✅

Example: