fix(onboard): infer vision input for non-Azure custom models#51893

Open
widingmarcus-cyber wants to merge 5 commits into openclaw:main from widingmarcus-cyber:fix/onboard-custom-vision-51869

Conversation

@widingmarcus-cyber
Contributor

@widingmarcus-cyber commented Mar 21, 2026

Summary

  • infer likely vision-capable model IDs in onboard-custom for non-Azure custom providers
  • default inferred vision models to input: ["text", "image"] instead of hardcoded text-only
  • keep non-vision defaults unchanged
  • add regression test for claude-sonnet-4-6 on non-Azure custom providers

Why

openclaw onboard --custom currently writes non-Azure models with input: ["text"] only, which silently disables image attachments for vision-capable models unless users manually edit config.

Fixes #51869
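As a rough sketch of the intended defaulting (the helper and function names here are assumptions based on this description, not the actual implementation in src/commands/onboard-custom.ts):

```typescript
// Sketch only: the real pattern list is larger; this is a simplified stand-in.
function isLikelyVisionModel(modelId: string): boolean {
  const id = modelId.toLowerCase();
  return id.includes("claude-sonnet") || id.includes("gpt-4o") || id.includes("vision");
}

// Non-Azure custom models with a likely vision-capable ID default to
// ["text", "image"]; everything else keeps the old text-only default.
function defaultInputForCustomModel(modelId: string, isAzure: boolean): Array<"text" | "image"> {
  if (!isAzure && isLikelyVisionModel(modelId)) {
    return ["text", "image"];
  }
  return ["text"];
}
```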

Validation

  • corepack pnpm test src/commands/onboard-custom.test.ts

@openclaw-barnacle bot added the commands (Command implementations) label on Mar 21, 2026
@widingmarcus-cyber
Contributor Author

@greptile-apps please review


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d065eddee

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

```ts
return (
  id.includes("claude") ||
  id.includes("gpt-4o") ||
  id.includes("gpt-5") ||
```

P2 Badge Narrow the gpt-5 vision heuristic for custom models

This marks every non-Azure model ID containing gpt-5 as image-capable, but we already ship text-only GPT-5-family IDs such as openai-gpt-52 in src/agents/venice-models.ts:350-355. Because the runtime decides whether to forward image parts purely from model.input.includes("image") (src/agents/pi-embedded-runner/run/attempt.ts:1735 and src/agents/openai-ws-stream.ts:189-207), onboarding a custom proxy that serves one of those text-only IDs will now advertise image support and send input_image payloads that the upstream model cannot handle.
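The over-match is easy to see by comparing a bare substring check against a delimiter-bounded one (a sketch; openai-gpt-52 is the text-only ID cited above):

```typescript
// Bare substring check: also matches "openai-gpt-52", a text-only ID.
const bareGpt5Match = (id: string): boolean => id.includes("gpt-5");

// Delimiter-bounded check: "gpt-5" must end the ID or be followed by
// "-", ".", or "_", so "gpt-52" no longer matches but "gpt-5-mini" does.
const boundedGpt5Match = (id: string): boolean =>
  /(^|[^a-z0-9])gpt-5($|[-._])/i.test(id);
```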


@greptile-apps
Contributor

greptile-apps bot commented Mar 21, 2026

Greptile Summary

This PR fixes a longstanding gap in onboard --custom where non-Azure models with vision-capable IDs (e.g. claude-sonnet-4-6, gpt-4o, gemini-*) were always written with input: ["text"], silently breaking image attachments. It introduces an isLikelyVisionModel heuristic and wires it into both fresh onboards and re-onboards.

Key changes:

  • New isLikelyVisionModel function uses pattern-matching (string includes + bounded regexes) to identify likely vision-capable model IDs; non-matching models retain the ["text"] default.
  • Re-onboard upgrades are guarded by a name-sentinel: shouldUpgradeNonAzureVisionInput only upgrades configs whose name still matches the auto-generated "${modelId} (Custom Provider)" template, preserving deliberate user customisations.
  • Five new tests cover the new default, the text-only baseline, multi-model families, and both re-onboard paths (auto-generated vs. user-renamed).
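The name-sentinel gate described above can be sketched as follows (the type and function names here are assumptions; the real logic lives in src/commands/onboard-custom.ts):

```typescript
interface ModelConfigSketch {
  id: string;
  name: string;
  input: Array<"text" | "image">;
}

// Only upgrade a text-only config if its name still matches the
// auto-generated "${modelId} (Custom Provider)" template, which strongly
// suggests the user never hand-edited it.
function shouldUpgradeVisionInput(model: ModelConfigSketch, modelId: string): boolean {
  return (
    model.name === `${modelId} (Custom Provider)` &&
    model.input.length === 1 &&
    model.input[0] === "text"
  );
}
```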

All three concerns from the prior review round are addressed:

  1. The overly-broad id.includes("claude") catch-all was replaced with specific sub-family patterns.
  2. Silent re-onboard overwrite is now blocked for user-renamed configs via the name sentinel.
  3. The model.id !== modelId dead-code guard is not present in the implementation.

The heuristic is intentionally incomplete (e.g. o4, llava, phi-3-vision are not matched), which is an acknowledged and acceptable trade-off for this PR scope.

Confidence Score: 4/5

  • Safe to merge; the fix is well-scoped, all prior review concerns are addressed, and tests cover both the upgrade and preservation paths.
  • The implementation is correct and the three prior review concerns have been resolved. The one remaining area of mild fragility is the name-sentinel approach: if the auto-generated name template ever changes, old auto-generated configs will no longer be upgraded on re-onboard — but that is a forward-only concern, not a regression risk in this PR. The heuristic gaps (o4, llava, etc.) are acknowledged by the author and out of scope.
  • No files require special attention.

Reviews (5): Last reviewed commit: "refactor(onboard): drop unreachable mode..."

Comment on lines +56 to +68
```ts
function isLikelyVisionModel(modelId: string): boolean {
  const id = modelId.toLowerCase();
  return (
    id.includes("claude") ||
    id.includes("gpt-4o") ||
    id.includes("gpt-5") ||
    id.includes("gemini") ||
    id.includes("qwen-vl") ||
    id.includes("vision") ||
    id.includes("-vl") ||
    id.endsWith("vl")
  );
}
```

P2 Broad claude catch-all may over-match text-only models

id.includes("claude") matches every Claude variant, including known text-only deployments like claude-instant-1 and claude-2.0. Those models will now receive input: ["text", "image"] as the default even though they don't support images.

For a heuristic this is probably fine (better to default to vision-capable than to silently disable it for modern claude-3+ models), but it's worth noting. If you want tighter matching you could scope it to claude-3 and above:

```ts
id.includes("claude-3") ||
id.includes("claude-sonnet") ||
id.includes("claude-haiku") ||
id.includes("claude-opus") ||
```

Also note that llava, pixtral, phi-3-vision, and minicpm-v style models won't match the current heuristic — but that's probably acceptable given those are less common custom-provider deployments.
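A quick sketch of the tighter scoping suggested above (the helper name is assumed, not part of the PR):

```typescript
// Matches Claude 3+ families while excluding known text-only deployments
// such as claude-instant-1 and claude-2.0.
const isClaude3Plus = (id: string): boolean =>
  id.includes("claude-3") ||
  id.includes("claude-sonnet") ||
  id.includes("claude-haiku") ||
  id.includes("claude-opus");
```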


@widingmarcus-cyber
Contributor Author

@greptile-apps please review

Comment on lines +693 to 700
```ts
  }
  return model.input.length === 1 && model.input[0] === "text";
};

const mergedModels = hasModel
  ? existingModels.map((model) =>
      model.id === modelId
        ? {
```

P1 Re-onboard silently overwrites deliberate text-only config

shouldUpgradeNonAzureVisionInput only checks whether model.input is currently ["text"], but this is indistinguishable from a user who intentionally restricted a vision-capable model ID to text-only (e.g., because their custom deployment endpoint doesn't actually expose vision, or they want cost controls). Re-running onboard --custom will silently change their config with no opt-out.

A concrete failure path: user deploys claude-sonnet-4-6 behind a text-only proxy and explicitly sets input: ["text"]. After re-onboard, input becomes ["text", "image"], and any image attachment will now be forwarded to a deployment that rejects it at runtime.

Consider adding a user-settable sentinel — for example only upgrade configs that still carry the generated name: "${modelId} (Custom Provider)" pattern (strongly suggesting they were never hand-edited), or introduce an explicit inferredInput: true flag set at generation time that the upgrade path uses as the gate:

```ts
// only auto-upgrade if the input was inferred by onboard, not user-set
return model.input.length === 1 && model.input[0] === "text" && model._inferredInput === true;
```

Without such a guard the upgrade is a one-way destructive migration for users with intentionally restricted configs.


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dff2b282ef


Comment on lines +703 to +705
```ts
...(shouldUpgradeNonAzureVisionInput(model)
  ? ({ input: ["text", "image"] } as { input: Array<"text" | "image"> })
  : {}),
```

P2 Badge Preserve explicit text-only custom input on re-onboard

When a user has intentionally kept a non-Azure custom model text-only (for example because their proxy only accepts text for a Claude/Gemini-style ID), re-running openclaw onboard --custom now overwrites that choice to [text, image]. That regresses the existing “preserve user-customized fields” behavior covered in src/commands/onboard-custom.test.ts, and it has a real runtime effect because both src/agents/pi-embedded-runner/run/attempt.ts and src/agents/openai-ws-stream.ts decide whether to forward image parts from model.input. So a harmless API-key refresh can turn a working text-only custom provider into one that starts sending unsupported image payloads.


@widingmarcus-cyber
Contributor Author

@greptile-apps please review


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1866593a91


Comment on lines +60 to +62
```ts
id.includes("gpt-4o") ||
id.includes("gpt-5") ||
id.includes("gemini") ||
```

P2 Badge Extend the vision heuristic to GPT-4.1 and o-series

This matcher still leaves several mainstream OpenAI vision models text-only. buildCopilotModelDefinition() marks gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, o1, o1-mini, and o3-mini as input: ["text", "image"] in src/providers/github-copilot-models.ts, and src/media-understanding/runner.vision-skip.test.ts treats gpt-4.1 as a native vision model. If a user onboards a custom OpenAI-compatible proxy serving one of those IDs, applyCustomApiConfig() still writes input: ["text"] because the heuristic only matches gpt-4o/gpt-5, so image attachments remain disabled for a common class of custom models this change is meant to fix.
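For the o-series in particular, a delimiter-bounded match avoids false positives from incidental "o1"/"o3" substrings (a sketch of the bounding approach, not the shipped code):

```typescript
// Matches "o1", "o1-mini", "o3-mini", etc., but not IDs like
// "solar-pro1" where "o1" appears inside another token.
const isOSeriesId = (id: string): boolean =>
  /(^|[^a-z0-9])o(1|3)($|[-._])/i.test(id);
```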


@widingmarcus-cyber
Contributor Author

@greptile-apps please review

```ts
  reasoning: false,
};
const shouldUpgradeNonAzureVisionInput = (model: ModelDefinitionConfig): boolean => {
  if (isAzure || !isLikelyVisionCapableModel || model.id !== modelId) {
```

P2 Redundant model.id !== modelId guard

shouldUpgradeNonAzureVisionInput is only ever called inside the model.id === modelId ? ... branch (line 706), so the model.id !== modelId clause will never be true. The guard is dead code and could mislead a future reader into thinking this function might be invoked in a broader context.

Suggested change:

```diff
-    if (isAzure || !isLikelyVisionCapableModel || model.id !== modelId) {
+    if (isAzure || !isLikelyVisionCapableModel) {
```


@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc8bd8c9ec


```ts
id.includes("gpt-4o") ||
id.includes("gpt-4.1") ||
/(^|[^a-z0-9])gpt-5($|[-._])/i.test(id) ||
/(^|[^a-z0-9])o(1|3)($|[-._])/i.test(id) ||
```

P2 Badge Include o4 models in the non-Azure vision matcher

This widens the OpenAI heuristic to o1/o3, but o4 is still excluded. The repo already treats o4-mini as image-capable in the Azure path (src/commands/onboard-custom.test.ts:472-490 expects input: ["text", "image"]), so a non-Azure OpenAI-compatible endpoint serving the same model will still be onboarded as text-only here. Because downstream image forwarding is gated on model.input.includes("image"), attachments stay disabled for that model family.


Comment on lines +56 to +60
```ts
function isLikelyVisionModel(modelId: string): boolean {
  const id = modelId.toLowerCase();
  return (
    id.includes("claude-3") ||
    id.includes("claude-sonnet") ||
```

P2 Badge Cover multimodal families we already ship in the catalog

The new heuristic still misses several image-capable IDs that are already first-class in our own catalogs. For example, moonshotai/Kimi-K2.5 and both meta-llama/Llama-4-* models are input: ["text", "image"] in src/agents/together-models.ts:21-33 and src/agents/together-models.ts:49-75, and unsloth/gemma-3-27b-it is marked the same in src/agents/chutes-models.ts:111-117. If a user onboards an OpenAI-compatible proxy exposing any of those IDs, this code still writes input: ["text"], so image attachments remain unavailable even though the upstream model supports vision.


@widingmarcus-cyber
Contributor Author

@greptile-apps please review

@aisle-research-bot

aisle-research-bot commented Mar 24, 2026

🔒 Aisle Security Analysis

We found 2 potential security issue(s) in this PR:

  1. 🟠 High: Heuristic vision auto-enable allows prompt-driven local image file exfiltration
  2. 🔵 Low: Azure deployment path injection via unencoded modelId in transformAzureUrl
Vulnerabilities

1. 🟠 Heuristic vision auto-enable allows prompt-driven local image file exfiltration

  • Severity: High
  • CWE: CWE-22
  • Location: src/commands/onboard-custom.ts:683-688

Description

applyCustomApiConfig() now heuristically marks certain custom models as vision-capable based only on modelId substrings/regexes. That flips model.input from ['text'] to ['text','image'] (and may auto-upgrade previously onboarded models).

Downstream, model.input.includes('image') is treated as authorization to run native prompt image loading: the runner scans user prompts for local image paths / file:// URLs and loads them from disk (loadWebMedia) into base64 blocks that are then sent to the model provider.

Impact if an attacker can influence the onboarded modelId (or induce re-onboarding) and can send prompts:

  • A crafted modelId containing e.g. vision, -vl, gemini, gpt-4.1, etc. will enable input: ['text','image'] even if the provider/model is not actually intended for multimodal.
  • Once enabled, prompt text containing /absolute/path/to/secret.png or file:///.../secret.jpg can cause the system to read local files (images) and transmit them to the LLM provider.
  • The default FS policy is not necessarily workspace-only; when tools.fs.workspaceOnly is false, assertSandboxPath will not restrict reads to the workspace.

Vulnerable code (heuristic auto-enable):

```ts
input: isLikelyVisionCapableModel
  ? (["text", "image"] as Array<"text" | "image">)
  : (["text"] as ["text"]),
```

Related sink (authorization by capability flag):

  • params.model.input?.includes("image") => enables detectAndLoadPromptImages() which loads local paths / file:// via loadWebMedia().

Recommendation

Do not infer vision capability from user-controlled modelId strings.

Recommended fixes:

  1. Default custom models to text-only unless the user explicitly opts into images.
  2. If you want auto-detection, base it on a trusted source (provider metadata / discovery API), not substring matching.
  3. Treat model.input as advisory and still gate native prompt image loading behind an explicit security setting (e.g., media.allowPromptFileImages=true) and enforce workspaceOnly=true unless explicitly overridden.

Example safer default:

```ts
// For custom providers, default to text-only.
// Users must explicitly set input=["text","image"] in config.
input: ["text"] as ["text"],
```

And in the image loader path, require explicit opt-in:

```ts
if (!cfg.media?.allowPromptLocalImages) return empty;
if (!resolveEffectiveToolFsWorkspaceOnly({ cfg, agentId })) throw new Error("...workspaceOnly required...");
```

2. 🔵 Azure deployment path injection via unencoded modelId in transformAzureUrl

  • Severity: Low
  • CWE: CWE-74
  • Location: src/commands/onboard-custom.ts:85-92

Description

The Azure URL helper transformAzureUrl concatenates the user-supplied modelId directly into a URL path segment without encoding or validation.

  • Input: modelId originates from onboarding input (prompt/CLI/config) and is later reused for verification requests.
  • Issue: modelId is appended as raw text in a URL string: .../openai/deployments/${modelId}.
  • Impact: A crafted modelId containing path-control characters (e.g., ../, //, ?, #) can alter the final request path once later processed by new URL(...), potentially escaping the intended /openai/deployments/<deployment> prefix and hitting unintended endpoints on the same Azure host. This is a form of path injection that can undermine assumptions about which Azure API path is contacted.

Vulnerable code:

```ts
return `${normalizedUrl}/openai/deployments/${modelId}`;
```

Recommendation

Treat modelId as a URL path segment and encode/validate it before concatenation.

  • Encode with encodeURIComponent (or stricter allowlist validation for Azure deployment names).
  • Build URLs with URL and explicit path joining to avoid .. normalization surprises.

Example fix:

```ts
function transformAzureUrl(baseUrl: string, modelId: string): string {
  const normalizedUrl = baseUrl.endsWith("/") ? baseUrl.slice(0, -1) : baseUrl;
  if (normalizedUrl.includes("/openai/deployments/")) return normalizedUrl;
  // Option A: encode as a single path segment
  const safeDeployment = encodeURIComponent(modelId);
  return `${normalizedUrl}/openai/deployments/${safeDeployment}`;
  // Option B (stricter): allowlist Azure deployment name characters
  // if (!/^[A-Za-z0-9._-]+$/.test(modelId)) throw new Error("Invalid deployment name");
}
```

This prevents ../, ?, #, and other reserved characters from altering the constructed endpoint path.
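The path escape and its fix can be demonstrated with the WHATWG URL parser (a sketch; the host and paths are illustrative, not taken from the repo):

```typescript
// Raw concatenation: "../" segments are collapsed during URL normalization,
// so the request escapes the /openai/deployments/ prefix entirely.
const raw = new URL("https://example.azure.com/openai/deployments/" + "../../admin");
// raw.pathname === "/admin"

// Encoded segment: "%2F" is not a path separator and "..%2F..%2Fadmin" is not
// a dot-segment, so the value stays confined under the intended prefix.
const encoded = new URL(
  "https://example.azure.com/openai/deployments/" + encodeURIComponent("../../admin"),
);
// encoded.pathname === "/openai/deployments/..%2F..%2Fadmin"
```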


Analyzed PR: #51893 at commit 9d9573c

Last updated on: 2026-03-24T21:45:58Z


Labels

commands Command implementations size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: onboard-custom hardcodes input: ["text"] for non-Azure custom providers, silently disabling image/vision support

1 participant