fix(memory): filter structural noise from graph entity extraction#1920

Merged
bug-ops merged 1 commit into main from 1912-graph-entity-extraction-noise
Mar 16, 2026

bug-ops commented Mar 16, 2026

Summary

Fixes #1912. The zeph_graph_entities Qdrant collection was being polluted with structural tokens extracted from tool result messages rather than meaningful semantic entities: TOML config keys, file paths such as src/, tool names like read_file and wget, and generic terms like go and type.

Root causes and fixes:

  • FIX-1: persist_message() now skips graph extraction entirely when the message contains ToolResult parts — tool outputs (TOML, JSON, command output) are structural data, not conversational content
  • FIX-2: The context window passed to the extraction LLM call now excludes Role::User messages with ToolResult parts
  • FIX-3: Added min_entity_name_bytes = 3 to MemoryWriteValidationConfig, enforced in both validate_graph_extraction and EntityResolver::resolve() via the MIN_ENTITY_NAME_BYTES constant; this rejects names shorter than 3 bytes, such as go and cd (type, at 4 bytes, is not caught by this length check)
  • FIX-4: Revised extraction prompt — entity types restricted to person, project, technology, organization, concept; explicit rules against extracting config keys, file paths, tool names, TOML/JSON keys, and short tokens
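The ToolResult filtering described in FIX-1 and FIX-2 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the Role, Part, and Message types below are simplified, hypothetical stand-ins for the project's own types, and should_extract_entities and extraction_context are invented names used only for this example.

```rust
// Hypothetical, simplified stand-ins for the project's message types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Role {
    User,
    Assistant,
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Part {
    Text(String),
    ToolResult { tool: String, output: String },
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    parts: Vec<Part>,
}

impl Message {
    /// True when any part of the message is a tool result.
    fn has_tool_result(&self) -> bool {
        self.parts
            .iter()
            .any(|p| matches!(p, Part::ToolResult { .. }))
    }
}

/// FIX-1 (sketch): skip graph extraction entirely for messages that
/// carry tool results, since those hold structural data.
fn should_extract_entities(msg: &Message) -> bool {
    !msg.has_tool_result()
}

/// FIX-2 (sketch): build the context window for the extraction call,
/// dropping user messages that contain tool results.
fn extraction_context(history: &[Message]) -> Vec<&Message> {
    history
        .iter()
        .filter(|m| !(m.role == Role::User && m.has_tool_result()))
        .collect()
}
```

In this sketch, a user message whose only part is a ToolResult is neither extracted from nor shown to the extraction LLM as context, while plain text messages from either role pass through unchanged.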

Tests: 3 new unit tests added (2569 → 6049 total pass after merge with main), covering:

  • context_filter_excludes_tool_result_messages
  • resolve_short_name_below_min_returns_error
  • resolve_name_at_min_length_passes
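The two resolve_* tests exercise the minimum-length boundary from FIX-3. A rough sketch of such a check is shown below, assuming the length is measured in bytes with str::len. The ResolveError type and the check_entity_name function are hypothetical names for illustration, and trimming whitespace before measuring is an assumption, not something the PR states.

```rust
/// Minimum entity name length in bytes; the PR sets min_entity_name_bytes = 3.
const MIN_ENTITY_NAME_BYTES: usize = 3;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ResolveError {
    NameTooShort { len: usize, min: usize },
}

/// Rejects names below the minimum byte length; names at or above it pass.
/// Trimming before measuring is an assumption made for this example.
fn check_entity_name(name: &str) -> Result<&str, ResolveError> {
    let trimmed = name.trim();
    // str::len returns the length in bytes, not in characters.
    if trimmed.len() < MIN_ENTITY_NAME_BYTES {
        return Err(ResolveError::NameTooShort {
            len: trimmed.len(),
            min: MIN_ENTITY_NAME_BYTES,
        });
    }
    Ok(trimmed)
}
```

With a 3-byte minimum, two-byte names such as go and cd are rejected, while a name exactly at the minimum (for example git) passes. A 4-byte name such as type also passes a pure length check, so filtering it out would have to come from another rule, such as the prompt restrictions.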

Test plan

  • cargo nextest run --config-file .github/nextest.toml -p zeph-memory -p zeph-core --lib passes
  • cargo clippy --workspace --features full -- -D warnings clean
  • cargo +nightly fmt --check clean
  • Live session: send a message referencing a config file, verify no config keys appear in zeph_graph_entities

github-actions bot added the bug, size/L, documentation, memory, rust, and core labels and removed the bug label Mar 16, 2026
bug-ops enabled auto-merge (squash) March 16, 2026 16:25
github-actions bot added the bug label Mar 16, 2026
Prevent TOML config keys, file paths, tool names, and short generic
tokens from polluting zeph_graph_entities (closes #1912).

- Skip graph extraction for Role::User messages containing ToolResult
  parts — tool outputs are structural data, not conversational content
- Exclude ToolResult user messages from the LLM extraction context window
- Add min_entity_name_bytes = 3 to MemoryWriteValidationConfig and
  enforce it in validate_graph_extraction and EntityResolver::resolve()
- Restrict extraction prompt entity types to person/project/technology/
  organization/concept; add explicit rules against structural tokens,
  config keys, file paths, and raw command output
bug-ops force-pushed the 1912-graph-entity-extraction-noise branch from b9bda10 to 62115f5 March 16, 2026 16:46
bug-ops merged commit 5160aaa into main Mar 16, 2026
20 checks passed
bug-ops deleted the 1912-graph-entity-extraction-noise branch March 16, 2026 16:56
Linked issue: fix(memory): graph entity extraction populates zeph_graph_entities with structural noise instead of semantic facts