
feat(agent): add simple observation masking extension#386

Closed
obviyus wants to merge 3 commits into openclaw:main from obviyus:exp/observation-masking

Conversation

obviyus (Contributor) commented Jan 7, 2026

Adds opt-in observation masking that replaces older tool results with a placeholder before sending context to the LLM. This reduces token usage in long-running sessions, following the approach in https://arxiv.org/abs/2508.21433.

Preceded by #381

This PR overlaps with #381 (context pruning). Key differences:

| | This PR | #381 |
| --- | --- | --- |
| Approach | Count-based (keep last N) | Ratio-based (soft-trim → hard-clear) |
| Trigger | Always (if enabled) | When context ratio exceeds threshold |
| Partial preservation | No | Yes (keeps head/tail before clearing) |

Trade-off: This is simpler and more predictable but less sophisticated. #381 is smarter about when and how to prune.

Config:

{
  agent: {
    observationMasking: {
      enabled: true,
      keepLast: 1,  // default
      placeholder: "Previous observation omitted for brevity."  // default
    }
  }
}
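
Roughly, the masking pass does something like the sketch below. The `Message` shape and function names are illustrative, not the actual extension code:

```typescript
// Hypothetical message shape; the real agent types will differ.
type Message = { role: "user" | "assistant" | "toolResult"; content: string };

interface ObservationMaskingConfig {
  enabled: boolean;
  keepLast: number;
  placeholder: string;
}

/**
 * Replace all but the last `keepLast` tool results with a placeholder.
 * Runs on a copy of the history right before it is sent to the LLM;
 * the stored session transcript is left untouched.
 */
function maskObservations(
  messages: Message[],
  cfg: ObservationMaskingConfig,
): Message[] {
  if (!cfg.enabled) return messages;

  // Indices of all tool results, oldest first.
  const toolIndices = messages
    .map((m, i) => (m.role === "toolResult" ? i : -1))
    .filter((i) => i >= 0);

  // Everything except the last `keepLast` gets masked.
  const toMask = new Set(
    toolIndices.slice(0, Math.max(0, toolIndices.length - cfg.keepLast)),
  );

  return messages.map((m, i) =>
    toMask.has(i) ? { ...m, content: cfg.placeholder } : m,
  );
}
```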

maxsumrall (Contributor) commented Jan 7, 2026

I incorporated the ideas here, as far as I could see, by adding modes, including an aggressive mode that sets the parameters to mask results much like this approach does.

The other main difference, AFAIK, is that this PR counts the last N tool calls, while in #381 we attempt to count 'turns', i.e. entire steps the agent takes. For example, if the user asks Q1 and the agent thinks, performs 10 tool calls, and answers, and the user then asks Q2 and the agent does 10 more tool calls, this PR counts the last N tool calls (of which there are now 20 in this example), whereas #381 counts this as 2 agent turns and would only prune/mask tool calls once they are more than the configured N turns old. A rough sketch of the two counting strategies is below.
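
To make that concrete, a toy sketch using the same hypothetical message shape as above (neither function exists in the codebase):

```typescript
type Msg = { role: "user" | "assistant" | "toolResult"; content: string };

// This PR's unit: individual tool results. A result is masked once at
// least `keepLast` newer tool results exist after it.
function toolResultsAfter(messages: Msg[], i: number): number {
  return messages.slice(i + 1).filter((m) => m.role === "toolResult").length;
}

// #381's unit: whole turns. A result is only pruned once more than N
// newer *user* messages (i.e. turns) exist after it.
function turnsAfter(messages: Msg[], i: number): number {
  return messages.slice(i + 1).filter((m) => m.role === "user").length;
}
```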

I think the advantage of the approach in #381 is that you won't mask tool calls the agent makes in the middle of a 'turn', although this may depend a lot on the use case. For clawdbot I think this works fine. For, e.g., codex cli it might not, since that can run for 30 minutes doing an entire project implementation in one 'turn'. We may want to refine this if the current approach turns out to be unwise.

(n.b. I think I am using the term 'turn' incorrectly here, will figure it out...)

#381 is merged into main, so I suggest we close this PR. Let's keep discussing!

steipete (Contributor) commented Jan 7, 2026

Interesting. Wouldn't that trash the cache if we change history?

steipete self-assigned this Jan 7, 2026
maxsumrall (Contributor) commented Jan 7, 2026

> Interesting. Wouldn't that trash the cache if we change history?

Yeah good question!

My understanding of prompt caching is that it’s exact prefix matching. In our case #381 is deterministically pruning (both in aggressive mode which is similar to this PR’s approach, and in the soft/hard adaptive approach). So once a given older tool result becomes trimmed/cleared, that content stays stable across later requests and should be cacheable.

There is still some churn as your “sliding trim boundary” advances. Example: if you have messages [1,2,3,4] and we preserve the N=2 most recent messages, we trim the older ones → [1t,2t,3,4]. Now when you append message 5, the boundary advances and 3 becomes newly eligible → [1t,2t,3t,4,5]. On that request, the first mismatch vs the previous prompt is inside message 3, so the cache could/should still hit for the prefix up to [1t,2t], and then recompute the rest.

So in this case it’s not a total cache miss — it’s a cache hit up to the point in the session right before the newly-pruned message, rather than a hit up to “everything except the new appended message”.
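
Purely as an illustration of that exact-prefix assumption (this is my mental model, not any provider's documented behavior):

```typescript
// Length of the shared prefix between two serialized prompts; under exact
// prefix matching this is roughly what the provider can reuse.
function sharedPrefixLength(prev: string, next: string): number {
  let i = 0;
  while (i < prev.length && i < next.length && prev[i] === next[i]) i++;
  return i;
}

const prev = ["1t", "2t", "3", "4"].join("\n");
const next = ["1t", "2t", "3t", "4", "5"].join("\n");
// First mismatch is inside message 3, so the prefix "1t\n2t\n3" is still
// reusable — a hit up to just before the newly-pruned message.
console.log(sharedPrefixLength(prev, next)); // 7
```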

But my analysis also makes some assumptions about how the different LLM providers work. I’m not an expert in this topic (yet 😅) so the actual way the caches work might make this all moot.

Thinking about this gave me some minor tweaks that might further improve caching! Will circle back in a day or two. 👇 🤖 If someone else wants to try it earlier:

• Two small tweaks I want to try to make “aggressive” pruning more cache-friendly (see the sketch after this list):

  1) Don’t prune inside the active user turn (tool loop)
     - keep the bootstrap prefix: never touch anything before the first role:"user"
     - keep the active turn stable: never touch anything after the most recent role:"user"
     - only prune toolResults in the “old history” slice:
       [firstUserIndex .. min(cutoffIndex(keepLastAssistants), lastUserIndex))

  2) Add a per-session pruning watermark (tiny state in the existing WeakMap runtime)
     - store watermarkIndex/messageId (“history pruned up to here”)
     - only advance it occasionally (e.g. once per user msg / every K turns / when prunableToolBulk exceeds a threshold)
     - aggressive pruning then only masks toolResults older than the watermark, so the “first changed token” doesn’t creep forward on every request, but advances in batches every K turns.
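
A rough sketch of both tweaks together, again using the hypothetical shapes from above (`pruneState`, `aggressivePrune`, etc. are names I made up, not existing code):

```typescript
type Msg = { role: "user" | "assistant" | "toolResult"; content: string };

interface PruneState {
  watermarkIndex: number; // "history pruned up to here"
}

// Hypothetical per-session state, keyed like the existing WeakMap runtime.
const pruneState = new WeakMap<object, PruneState>();

function aggressivePrune(
  session: object,
  messages: Msg[],
  placeholder: string,
): Msg[] {
  const roles = messages.map((m) => m.role);
  const firstUser = roles.indexOf("user");
  const lastUser = roles.lastIndexOf("user");
  if (firstUser < 0 || lastUser <= firstUser) return messages;

  const state = pruneState.get(session) ?? { watermarkIndex: firstUser };
  pruneState.set(session, state);

  // Tweak 2: advance the watermark in batches (here: once per new user
  // message), never into the active turn, so the first changed token
  // doesn't creep forward on every request.
  if (lastUser > state.watermarkIndex) state.watermarkIndex = lastUser;

  // Tweak 1: only mask toolResults in the "old history" slice — after the
  // bootstrap prefix and strictly before the watermark / active turn.
  return messages.map((m, i) =>
    m.role === "toolResult" && i >= firstUser && i < state.watermarkIndex
      ? { ...m, content: placeholder }
      : m,
  );
}
```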

obviyus closed this Jan 11, 2026
dgarson added a commit to dgarson/clawdbot that referenced this pull request on Feb 9, 2026