Skip to content

fix(matrix): only use 2-member DM fallback when dm refresh fails#54890

Merged
gumadeiras merged 2 commits intoopenclaw:mainfrom
private-peter:matrix-dm-bug
Mar 27, 2026
Merged

fix(matrix): only use 2-member DM fallback when dm refresh fails#54890
gumadeiras merged 2 commits intoopenclaw:mainfrom
private-peter:matrix-dm-bug

Conversation

@private-peter
Copy link
Copy Markdown
Contributor

@private-peter private-peter commented Mar 26, 2026

Issue #54772 reports that two separate 2-person Matrix rooms with the same participants can both get routed as DMs. The regression comes from treating strict 2-member membership as a general fallback even when the DM cache is available and says the room is not a DM.

Move the catch from refreshDmCache() into isDirectMessage(). When the refresh succeeds, keep the existing client.dms.isDm(roomId) && isStrictDirectMembership(...) gate so non-DM 2-person rooms stay grouped. Only when the refresh fails do we fall back to the exact 2-member heuristic and keep the warning log.

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem:
  • Why it matters:
  • What changed:
  • What did NOT change (scope boundary):

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause / Regression History (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should have caught this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
  • Scenario the test should lock in:
  • Why this is the smallest reliable guardrail:
  • Existing test that already covers this (if any):
  • If no new test is added, why not:

User-visible / Behavior Changes

List user-visible changes (including defaults/config).
If none, write None.

Users with multiple matrix rooms will no longer have their conversations duplicated or mis-routed.

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS:
  • Runtime/container:
  • Model/provider:
  • Integration/channel (if any):
  • Relevant config (redacted):

Steps to reproduce

  1. Create two Matrix rooms each with exactly 2 members (bot + same user)
  2. Mark only one as a DM in your Matrix client
  3. Send a message in the non-DM room → OpenClaw routes as if it were a DM

Expected behavior

A 2-person room should only be treated as a DM if it is explicitly marked as one (via m.direct or is_direct member state). Otherwise it should be handled as a regular group room.

Actual behavior

This causes replies to be routed to both rooms simultaneously, producing duplicate or out-of-context messages.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: Verified that with this change that rooms (not DM) get their own session)
  • Edge cases checked:
  • What you did not verify: State where you have no 1:1 / DM matrix channels

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No)
  • Config/env changes? (Yes/No)
  • Migration needed? (Yes/No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
  • Files/config to restore:
  • Known bad symptoms reviewers should watch for:

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk:
    • Mitigation:

@openclaw-barnacle openclaw-barnacle bot added channel: matrix Channel integration: matrix size: S labels Mar 26, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 26, 2026

Greptile Summary

This PR fixes a regression (linked to issue #54772) where any 2-member Matrix room was being classified as a DM, causing separate rooms with the same two participants to collapse into a single session. The fix moves the try/catch from refreshDmCache() into isDirectMessage() so the strict 2-member heuristic is only used as a last resort when the DM cache refresh actually fails — not as a blanket fallback on every call.

Key changes:

  • refreshDmCache() now propagates its error rather than silently catching it; isDirectMessage() catches it and sets a refreshFailed flag.
  • When the refresh succeeds, the code gates on client.dms.isDm(roomId) && strictDirectMembership, keeping non-DM 2-person rooms from being misrouted.
  • When the refresh fails, the fallback to the exact 2-member heuristic is preserved with its warning log.
  • Tests are updated to reflect the success/failure distinction and now assert on getJoinedRoomMembers calls.

One issue found:

  • lastDmUpdateMs is only written when client.dms.update() succeeds (it was moved to after the await). In the original code it was set before the try/catch, so a failed refresh still suppressed retries for 30 s. Now, every isDirectMessage call in a persistent-failure scenario will attempt a fresh network call to dms.update(), bypassing the TTL rate-limiter entirely. This could cause a request flood against a failing backend. A one-line fix (set lastDmUpdateMs = now before the await) restores the original behaviour while keeping the error propagation the PR needs.

Confidence Score: 3/5

The core bug fix is correct and well-tested, but there is one targeted fix needed before merge: restoring TTL rate-limiting for failed DM cache refreshes to avoid a request flood under persistent failure conditions.

The logic change correctly solves the reported issue. The only concrete problem is the unintended removal of the 30-second retry gate when dms.update() fails, which could cause noisy logs and repeated failed network calls in an already-degraded state. This is a real reliability concern in a high-message-volume room but is a one-line fix away from being clean.

extensions/matrix/src/matrix/monitor/direct.ts — specifically refreshDmCache() where lastDmUpdateMs must be set before the await to preserve failure-path rate-limiting.

Important Files Changed

Filename Overview
extensions/matrix/src/matrix/monitor/direct.ts Core logic change correctly restricts the 2-member DM fallback to failure cases only, fixing the routing bug. One P1 issue: lastDmUpdateMs is no longer updated on refresh failure, silently removing the 30-second retry rate-limit for the error path.
extensions/matrix/src/matrix/monitor/direct.test.ts Tests are well-restructured to cover the new success/failure distinction. The TTL test does not exercise dms.update() failure, so the missing rate-limiting behaviour noted in direct.ts would not be caught by these tests.

Comments Outside Diff (1)

  1. extensions/matrix/src/matrix/monitor/direct.ts, line 45-52 (link)

    P1 TTL rate-limiting lost on refresh failure

    lastDmUpdateMs is now only updated when client.dms.update() succeeds. When the refresh fails persistently, lastDmUpdateMs stays at 0 (or its previous stale value), which means now - lastDmUpdateMs always exceeds DM_CACHE_TTL_MS. Every subsequent call to isDirectMessage will attempt a new client.dms.update() call rather than being gated by the 30-second TTL.

    The original code set lastDmUpdateMs = now before the try/catch so that even a failed refresh would suppress retries for 30 s. That rate-limiting is now gone for the failure path.

    If dms.update() is failing persistently — e.g. a network issue — a high-volume room will hammer the backend on every single message received.

    This preserves the TTL rate-limiting regardless of outcome while still propagating the error to isDirectMessage for the fallback path.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: extensions/matrix/src/matrix/monitor/direct.ts
    Line: 45-52
    
    Comment:
    **TTL rate-limiting lost on refresh failure**
    
    `lastDmUpdateMs` is now only updated when `client.dms.update()` succeeds. When the refresh fails persistently, `lastDmUpdateMs` stays at `0` (or its previous stale value), which means `now - lastDmUpdateMs` always exceeds `DM_CACHE_TTL_MS`. Every subsequent call to `isDirectMessage` will attempt a new `client.dms.update()` call rather than being gated by the 30-second TTL.
    
    The original code set `lastDmUpdateMs = now` **before** the try/catch so that even a failed refresh would suppress retries for 30 s. That rate-limiting is now gone for the failure path.
    
    If `dms.update()` is failing persistently — e.g. a network issue — a high-volume room will hammer the backend on every single message received.
    
    
    
    This preserves the TTL rate-limiting regardless of outcome while still propagating the error to `isDirectMessage` for the fallback path.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/matrix/src/matrix/monitor/direct.ts
Line: 45-52

Comment:
**TTL rate-limiting lost on refresh failure**

`lastDmUpdateMs` is now only updated when `client.dms.update()` succeeds. When the refresh fails persistently, `lastDmUpdateMs` stays at `0` (or its previous stale value), which means `now - lastDmUpdateMs` always exceeds `DM_CACHE_TTL_MS`. Every subsequent call to `isDirectMessage` will attempt a new `client.dms.update()` call rather than being gated by the 30-second TTL.

The original code set `lastDmUpdateMs = now` **before** the try/catch so that even a failed refresh would suppress retries for 30 s. That rate-limiting is now gone for the failure path.

If `dms.update()` is failing persistently — e.g. a network issue — a high-volume room will hammer the backend on every single message received.

```suggestion
  const refreshDmCache = async (): Promise<void> => {
    const now = Date.now();
    if (now - lastDmUpdateMs < DM_CACHE_TTL_MS) {
      return;
    }
    lastDmUpdateMs = now;
    await client.dms.update();
  };
```

This preserves the TTL rate-limiting regardless of outcome while still propagating the error to `isDirectMessage` for the fallback path.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(matrix): only use 2-member DM fallba..." | Re-trigger Greptile

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 297852c713

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@private-peter
Copy link
Copy Markdown
Contributor Author

  • lastDmUpdateMs is only written when client.dms.update() succeeds (it was moved to after the await). In the original code it was set before the try/catch, so a failed refresh still suppressed retries for 30 s. Now, every isDirectMessage call in a persistent-failure scenario will attempt a fresh network call to dms.update(), bypassing the TTL rate-limiter entirely. This could cause a request flood against a failing backend. A one-line fix (set lastDmUpdateMs = now before the await) restores the original behaviour while keeping the error propagation the PR needs.

Fixed in 297852c. It will now set the variable so duplicate requests will only cause 1 cache update.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5e220a014

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

private-peter and others added 2 commits March 27, 2026 19:07
Issue openclaw#54772 reports that two separate 2-person Matrix rooms with the same participants can both get routed as DMs. The regression comes from treating strict 2-member membership as a general fallback even when the DM cache is available and says the room is not a DM.

Move the catch from refreshDmCache() into isDirectMessage(). When the refresh succeeds, keep the existing client.dms.isDm(roomId) && isStrictDirectMembership(...) gate so non-DM 2-person rooms stay grouped. Only when the refresh fails do we fall back to the exact 2-member heuristic and keep the warning log.
@gumadeiras gumadeiras merged commit 0558f24 into openclaw:main Mar 27, 2026
7 checks passed
@gumadeiras
Copy link
Copy Markdown
Member

Merged via squash.

Thanks @private-peter!

w-sss pushed a commit to w-sss/openclaw that referenced this pull request Mar 28, 2026
…nclaw#54890)

Merged via squash.

Prepared head SHA: e32d220
Co-authored-by: private-peter <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
johnkhagler pushed a commit to johnkhagler/openclaw that referenced this pull request Mar 29, 2026
…nclaw#54890)

Merged via squash.

Prepared head SHA: e32d220
Co-authored-by: private-peter <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
alexcode-cc pushed a commit to alexcode-cc/clawdbot that referenced this pull request Mar 30, 2026
…nclaw#54890)

Merged via squash.

Prepared head SHA: e32d220
Co-authored-by: private-peter <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
livingghost pushed a commit to livingghost/openclaw that referenced this pull request Mar 31, 2026
…nclaw#54890)

Merged via squash.

Prepared head SHA: e32d220
Co-authored-by: private-peter <[email protected]>
Co-authored-by: gumadeiras <[email protected]>
Reviewed-by: @gumadeiras
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: matrix Channel integration: matrix size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: bug(matrix): 2-person rooms incorrectly classified as DMs, causing message routing confusion

2 participants