feat: CLI-via-Goosed unified agent architecture with multi-agent routing#7238
Closed
bioinfornatics wants to merge 525 commits into block:main from
Conversation
Collaborator
Thanks @bioinfornatics, I like the general idea. It looks like a lot of work to tidy up the conflicts, but I would like to see what it looks like if you could show it here.
force-pushed e8dc59e to 4b68302
Collaborator
This is a massive change! I like the ideas, but I think we should discuss some of them separately (where do we want to go with clients?). Also, since it seems to introduce 5 big ideas, shouldn't we split those up?
force-pushed 059ea64 to 6604f1c
…o-scroll

The auto-scroll useEffect depended on the `children` prop, which is a new React element reference on every render. This caused:
1. useEffect firing every render (not just on content changes)
2. scrollTo() triggering Radix ScrollArea's internal setRef callback
3. setRef calling setState → re-render → new children ref → loop
4. 'Maximum update depth exceeded' crash

Fix: replace the children dependency with a ResizeObserver on the viewport's scroll container. ResizeObserver fires only when content size actually changes, breaking the render loop while preserving auto-scroll behavior.

Before: useEffect([children, autoScroll]) → fires every render
After: ResizeObserver on scrollContent → fires on resize only
…vent infinite loop
The useRegisterSession hook had a single useEffect that both registered
(setter({...})) and unregistered (cleanup setter(null)) the session.
Its dependency array included unstable references (functions, arrays, objects)
that changed identity every render, causing:
render → effect fires → setter({...}) → provider re-renders →
BaseChat re-renders → new refs → cleanup setter(null) → re-render → loop
Split into two effects:
1. Register/unregister: depends only on sessionId + stableSubmit.
The cleanup setter(null) runs ONLY here — on session change or unmount.
2. Update fields: depends on primitive/length values only.
Uses functional updater setter(prev => ...) with no cleanup.
Also widened setSessionState type to accept functional updater pattern.
Dependency array uses .length for arrays to avoid identity-based re-runs.
Three changes to eliminate re-render churn in the work block side panel:
1. Replace smooth scrollIntoView with rAF-throttled auto scroll
- smooth scroll triggers Radix ScrollArea reflow → setState → re-render
- rAF loop with behavior: 'auto' avoids the feedback cascade
2. Stabilize prop identity with module-level constants
- new Map() → EMPTY_TOOL_NOTIFICATIONS (module const)
- () => {} → NOOP (module const)
- Prevents GooseMessage from seeing new prop refs every render
3. Add rafId ref for proper cleanup on unmount/stream-end
… render loops

ToolCallWithResponse:
- Replace useState/useEffect for startTime with useRef (no re-render)
- Memoize toolResults, logs, progressEntries with React.useMemo
- Remove unused useMemo named import (use React.useMemo instead)

TooltipWrapper:
- Remove per-instance TooltipProvider — uses the app-level one from AppLayout
- Prevents creating a new Radix context on every render during streaming
- Reduces component tree depth and re-render cascade

These changes eliminate the 'Maximum update depth exceeded' crash in the ReasoningDetailPanel → GooseMessage → ToolCallWithResponse path during streaming by preventing unstable prop identity and unnecessary state updates.
Root cause: during streaming, WorkBlockIndicator called updateWorkBlock with a new object every render (the messages array is recreated by .map() in the parent). The context provider created a new value object every render, causing all consumers to re-render in an infinite loop.

ReasoningDetailContext:
- Memoize the Provider value with useMemo
- Use refs (panelDetailRef, detailRef) in toggle callbacks to remove state from useCallback deps, making callbacks fully stable
- Add shallow value comparison in updateWorkBlock (messages.length + toolCount + isStreaming) to skip state updates when nothing changed

WorkBlockIndicator:
- Consolidate 4 individual refs into a single latestRef object
- Wrap buildDetail in useCallback with stable deps
- Add proper dependency arrays to all useEffect hooks
…treaming

During streaming with multiple assistant messages, the LLM often outputs text before adding tool calls. The previous logic would prematurely select this streaming text as the 'final answer' and render it outside the work block, causing content to flash in and out as tool requests arrive.

Fix: skip final answer detection for multi-message streaming runs (assistantIndices.length > 1). Single-message streaming runs still use normal detection, since the rendering layer handles tool-call suppression via the suppressToolCalls prop.

Also fixes the one-liner summary: since all messages stay as intermediates during streaming, extractOneLiner now correctly picks up the latest assistant text for the work block indicator description.

Add 2 regression tests for multi-message streaming scenarios.
…within TooltipProvider'

Add TooltipProvider at the App root level so all Radix Tooltip consumers are guaranteed coverage, regardless of where they render in the component tree. Previously, only the SidebarProvider in AppLayout wrapped a TooltipProvider, leaving edge cases (error boundaries, modals, race conditions during render loop recovery) uncovered.
…ool results in side panel

WorkBlockIndicator: extractOneLiner now prefers the last tool call description (e.g. 'editing src/App.tsx', 'running ls -la') over raw LLM reasoning text. Uses describeToolCall() with human-readable summaries for common tools (text_editor, shell, analyze, etc.) and a generic fallback for unknown tools.

ReasoningDetailPanel: pass suppressToolCalls to GooseMessage so the side panel doesn't render full tool response content (file contents, command outputs). Also add type='button' to the close button.
extractOneLiner now prioritizes the first sentence of the latest assistant text (e.g. 'Let me fix the render loop' / 'Now I'll check the scroll area') over tool call descriptions. This captures the LLM's *intent* — what it's thinking about — rather than mechanical details like file paths.

Priority order:
1. First sentence of latest assistant text (≥10 chars)
2. Last tool call description (fallback)

firstSentence() strips markdown/HTML/code-blocks and splits on sentence boundaries (. ! ? : —) for clean, readable summaries.
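A minimal TypeScript sketch of what such a helper could look like. The real firstSentence() is not shown in this PR excerpt, so the specific stripping rules below are assumptions beyond the punctuation list quoted above:

```typescript
// Hypothetical sketch of firstSentence(): strip markdown/HTML/code, then take
// the first run of text ending in sentence punctuation (. ! ? : —).
function firstSentence(text: string): string | null {
  const cleaned = text
    .replace(/```[\s\S]*?```/g, " ") // drop fenced code blocks
    .replace(/<[^>]+>/g, " ")        // drop inline HTML tags
    .replace(/[*_`#>]/g, "")         // drop common markdown punctuation
    .replace(/\s+/g, " ")
    .trim();
  const match = cleaned.match(/^(.+?[.!?:—])(\s|$)/);
  const sentence = (match ? match[1] : cleaned).trim();
  // ≥10 chars per the priority rule; shorter fragments fall back to tool descriptions.
  return sentence.length >= 10 ? sentence : null;
}
```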
During streaming, the one-liner flickered on every token because extractOneLiner recomputed from messages on every render. Replace with a useStableOneLiner hook that:
- Shows tool call descriptions immediately (they appear all at once)
- Only updates thinking sentences when they look complete (end with punctuation) and differ from what's currently shown
- Holds the current value for a minimum 1.5s to prevent flicker
- Falls back to tool descriptions when no assistant text exists
- Shows the final value immediately when streaming ends

Split extractOneLiner into focused helpers:
- extractToolDescription: last tool call as human-readable text
- extractThinkingSentence: first complete sentence from assistant text
- useStableOneLiner: debounced hook combining both with hold logic
The previous approach tried to extract assistant thinking sentences, but these stream word-by-word, causing constant flickering. The debounce hook (useStableOneLiner) added complexity but still showed partial text.

Simplify to tool-call-only descriptions:
- Tool calls are discrete events (appear all at once, never stream)
- They tell you exactly what Goose is doing: 'reading src/App.tsx', 'running npm test', 'editing WorkBlockIndicator.tsx'
- Only updates when a NEW tool call starts (stable between calls)
- Remove: useStableOneLiner, firstSentence, extractThinkingSentence
- Remove: useState import (no longer needed)

121 lines removed, 8 added — dramatically simpler.
- Add Badge atom: variant-based (default/secondary/accent/muted/outline), two sizes
- Add StatusDot atom: active (blue pulse), completed (green), idle (gray)
- Export both from the atoms barrel (index.ts)

WorkBlockIndicator:
- Use Badge for agent/mode info (only shown when non-default agent)
- Use StatusDot for streaming/completed status
- Cleaner layout: status line + one-liner instead of inline text
- Remove redundant agent/mode display when it's the default 'Goose'

ReasoningDetailPanel:
- Rename title 'Work Block' → 'Activity'
- Use Badge for the agent/mode badge in the header (only when non-default)
- Use StatusDot in the tool count status bar
- suppressToolCalls on GooseMessage (hides raw tool outputs)
- Extract _routingInfo (agentName, modeSlug) from the first message of each block
- Track previous agent/mode across blocks in ProgressiveMessageList
- Pass showAgentBadge prop to WorkBlockIndicator — true only when agent/mode differs from the previous block
- Pass showAgentBadge through to ReasoningDetailPanel via WorkBlockDetail
- Filter out default agents (Goose, Goose Agent) from badge display
- Reduces visual noise: the badge only appears when 'who is talking' changes
GooseMessage renders agent/mode badges on each message. When rendered inside ReasoningDetailPanel (activity side panel), this creates a stack of repeated badges. Add hideRoutingBadges prop to GooseMessage and pass it from ReasoningDetailPanel to suppress all three badge locations (tool-only early return, main badge, timestamp area).
Replace full GooseMessage rendering in the activity side panel with a
compact tool-call list view:
- New ActivityStep molecule: icon + description + status indicator
- Tool-specific icons (Terminal for shell, FileText for editor, etc.)
- Spinning loader for active tools, green check for completed
- ReasoningDetailPanel now extracts ActivityEntry[] from messages:
- Shows tool calls as compact steps ("reading src/App.tsx")
- Shows thinking text as italic summaries (first sentence only)
- Skips tool-result user messages entirely
- Much cleaner than rendering full GooseMessage with raw tool outputs
- Exported from molecules barrel
During streaming, the last assistant message has partial text that builds up token by token. Previously extractActivityEntries extracted the first sentence from this partial text, causing thinking entries to flicker.

Now:
- Completed messages: extract the first complete sentence via firstSentence(), which uses sentence-ending punctuation (. ! ? : —) as boundaries
- Streaming message: skip thinking text entirely (it's partial); the active tool spinner already indicates activity
- Result: a stable, non-flickering activity log during streaming
…able tool details

- Add scripts/categorize_diagnostic_logs.py: standalone Python tool to parse/categorize diagnostic session logs into UI rendering zones (main panel, work block, hidden) with --timeline, --json, --validate modes
- Add ui/desktop/src/utils/diagnosticLogParser.ts: TypeScript utility mirroring the Python categorization for use in the UI. Supports parseLogLines(), parseSession(), toMessages() to reconstruct sessions from JSONL diagnostic logs
- Enhance ActivityStep: now expandable with tool arguments (key-value), result text (with truncation/expand), and error messages. Collapsed view unchanged.
- Add ThinkingEntry component: renders chain-of-thought text as italic plain text, visually distinct from tool activity blocks
- Enhance ReasoningDetailPanel: pairs tool requests with their responses by matching IDs across messages, builds a rich timeline of interleaved thinking + tool activity
- Add comprehensive tests:
  - 22 diagnosticLogParser tests (zone mapping, parsing, sessions, toMessages)
  - 16 ActivityStep + ThinkingEntry component tests (rendering, expand/collapse, args)
  - 20 ReasoningDetailPanel tests (buildToolResponseMap, extractActivityEntries)

Categories mapped to UI zones:
- Main Panel: USER_INPUT, ASSISTANT_TEXT, STREAMING_CHUNK
- Work Block: TOOL_REQUEST, TOOL_RESULT, INTERNAL_WORK
- Reasoning: THINKING
- Hidden: SYSTEM_INFO, TITLE_GENERATION, USAGE_STATS
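The category-to-zone mapping at the end of this commit is a plain lookup. A sketch consistent with the described diagnosticLogParser.ts behavior (the function and constant names are assumptions, not the file's actual exports):

```typescript
// Map diagnostic log categories to the UI zone that renders them.
type Zone = "main_panel" | "work_block" | "reasoning" | "hidden";

const ZONE_BY_CATEGORY: Record<string, Zone> = {
  USER_INPUT: "main_panel",
  ASSISTANT_TEXT: "main_panel",
  STREAMING_CHUNK: "main_panel",
  TOOL_REQUEST: "work_block",
  TOOL_RESULT: "work_block",
  INTERNAL_WORK: "work_block",
  THINKING: "reasoning",
  SYSTEM_INFO: "hidden",
  TITLE_GENERATION: "hidden",
  USAGE_STATS: "hidden",
};

function zoneFor(category: string): Zone {
  // Unknown categories default to hidden so nothing unexpected leaks into the UI.
  return ZONE_BY_CATEGORY[category] ?? "hidden";
}
```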
Major fix: pushMessage now concatenates streaming text deltas instead of replacing the last message. The server sends each streaming chunk as a separate Message with the same ID containing only the delta text (a few tokens), not the full accumulated text. The old code replaced the entire last message with each incoming chunk, so only the final chunk was displayed (the 'single dot' bug where 843 chunks of a response resulted in just '.' being shown).

Minor fix: the WorkBlockIndicator one-liner now prefers showing the latest assistant thinking text (e.g. 'I'll start by analyzing...') over tool call descriptions (e.g. 'running command...'), giving better context about what the agent is doing.

Includes 11 tests for pushMessage covering accumulation, interleaved messages, metadata preservation, and the exact 'single dot' regression scenario.
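The delta-accumulation rule can be sketched as follows. The Message shape is simplified here; the real type carries metadata that the commit says is preserved:

```typescript
// Simplified message shape for illustration.
interface Message { id: string; text: string; }

// Append streaming deltas that share the last message's ID; otherwise push a
// new message. Mutates in place, matching the later perf commit.
function pushMessage(messages: Message[], incoming: Message): Message[] {
  const last = messages[messages.length - 1];
  if (last && last.id === incoming.id) {
    // Same ID ⇒ streaming chunk: concatenate the delta instead of replacing.
    last.text += incoming.text;
    return messages;
  }
  messages.push(incoming);
  return messages;
}
```

With the old replace-the-last-message behavior, 843 chunks would leave only the final chunk's text; with concatenation the full response accumulates.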
…ing work blocks

During multi-message streaming (tool calls + text), the pure-text final answer message is now shown outside the work block immediately, so users can read the response as it streams token-by-token while tool calls remain collapsed above. Previously, ALL messages were hidden behind the WorkBlockIndicator during streaming, and text only appeared after the entire response finished.

Key changes:
- identifyWorkBlocks: during streaming, pure-text messages are identified as final answers for progressive rendering. Text+tool messages stay collapsed.
- If the LLM later adds tool calls to a pure-text message, identifyWorkBlocks re-runs and absorbs it back into the work block automatically.
- Updated tests to verify progressive rendering behavior.
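The streaming final-answer rule from the two commits above can be sketched with a simplified message shape (the real identifyWorkBlocks works on full message objects; this is an assumption-laden reduction):

```typescript
// Simplified message shape for illustration.
interface Msg { role: "user" | "assistant"; hasToolCalls: boolean; }

// Return the index of the message to render outside the work block as the
// final answer, or null if everything should stay collapsed.
function finalAnswerIndex(messages: Msg[], isStreaming: boolean): number | null {
  const assistantIndices = messages
    .map((m, i) => (m.role === "assistant" ? i : -1))
    .filter((i) => i >= 0);
  // Multi-message streaming runs: skip detection entirely (anti-flash rule).
  if (isStreaming && assistantIndices.length > 1) return null;
  const last = assistantIndices[assistantIndices.length - 1];
  if (last === undefined) return null;
  // Pure-text messages qualify; text+tool messages stay inside the work block.
  return messages[last].hasToolCalls ? null : last;
}
```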
…ioning

- MainPanelLayout: change h-dvh to h-full so the panel respects flex constraints and doesn't extend behind GlobalChatInput
- ProgressiveMessageList: broaden showPendingIndicator to remain visible whenever streaming is active (not just before the first assistant message), so the activity indicator stays visible under the last user message
- useSandboxBridge: prefix unused resourceUri with an underscore
- resultsCache: add LRU eviction (max 5 sessions) and only cache when idle, to prevent caching on every streaming chunk
- pushMessage: mutate the array in-place instead of copying on every chunk, reducing O(n²) allocations during streaming
- maybeUpdateUI: spread currentMessages when dispatching to React so state changes are detected, and fix the reduced-motion branch to actually batch updates instead of dispatching immediately (which defeated batching)
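The LRU-plus-idle caching rule can be sketched with a Map, whose insertion order gives LRU order after a delete-and-reinsert on each write. Names mirror the commit, but the shape is an assumption:

```typescript
const MAX_CACHED_SESSIONS = 5;
const resultsCache = new Map<string, unknown>();

function cacheResults(sessionId: string, results: unknown, isStreaming: boolean): void {
  if (isStreaming) return; // only cache when idle, not on every streaming chunk
  resultsCache.delete(sessionId); // re-insert to mark as most recently used
  resultsCache.set(sessionId, results);
  if (resultsCache.size > MAX_CACHED_SESSIONS) {
    // Map preserves insertion order, so the first key is the least recently used.
    const oldest = resultsCache.keys().next().value as string;
    resultsCache.delete(oldest);
  }
}
```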
Collaborator
Hey @bioinfornatics 👋 Just checking in on this draft. Are you still actively working on it? If not, would you mind closing it? We can always reopen later if you pick it back up. Thanks!
SOTA-aligned improvements based on multi-agent AI research (2025):
Orchestrator Prompts (Anthropic 2025 best practices):
- Rewrite system.md, routing.md, splitting.md with XML-structured tags
- XML agent catalog in build_catalog_text() (<agent>, <mode>, <use_when>)
- Filter internal modes from routing prompts and user-facing catalogs
- Add explicit mode validation ('Do NOT invent new modes')
AGENTS.md (Linux Foundation standard):
- Enhance root AGENTS.md with architecture docs, routing flow, A2A, design decisions
- Add nested crates/goose/src/agents/AGENTS.md (agent system docs)
- Add nested crates/goose/src/prompts/AGENTS.md (template conventions)
Tests: 42 pass (24 orchestrator + 13 intent_router + 5 dispatch)
Clippy: 0 warnings
Add embedding-based semantic routing between keyword matching (<10ms) and LLM-as-Judge (~1-5s), providing ~100ms routing that's 50x faster than the LLM and far more robust than keywords.

Architecture (3-tier hybrid routing):
- Layer 1: Keyword matching (IntentRouter, <10ms)
- Layer 2: TF-IDF cosine similarity (SemanticRouter, ~1ms) [NEW]
- Layer 3: LLM-as-Judge (OrchestratorAgent, ~1-5s)

SemanticRouter implementation:
- TF-IDF vectorization with smoothed IDF weighting
- Cosine similarity matching against pre-computed route vectors
- Minimal English suffix stemmer (no external dependencies)
- Stop word filtering, configurable similarity threshold (0.15)
- Top matching terms for explainability
- Zero external dependencies — pure Rust

IntentRouter integration:
- SemanticRouter field auto-built from agent slots
- Rebuilt on slot add/remove/enable changes
- Semantic layer activated when keyword score < 0.2 threshold
- Tracing spans record routing strategy for observability

Files:
- NEW: crates/goose/src/agents/semantic_router.rs (607 lines)
- MOD: crates/goose/src/agents/intent_router.rs (+150 lines)
- MOD: crates/goose/src/agents/mod.rs (module registration)

Tests: 51 pass (12 semantic_router + 15 intent_router + 24 orchestrator_agent)
SOTA ref: Semantic Router pattern (Aurelio Labs), hybrid routing architecture
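A toy TypeScript version of the TF-IDF-plus-cosine core (the actual SemanticRouter is Rust and additionally does stemming and stop-word filtering, both omitted here; the smoothed-IDF formula below is one common variant, not necessarily the one the crate uses):

```typescript
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Build one TF-IDF vector (term → weight) per document.
function tfidfVectors(docs: string[]): Map<string, number>[] {
  const n = docs.length;
  const tokenized = docs.map(tokenize);
  const df = new Map<string, number>(); // document frequency per term
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) df.set(t, (df.get(t) ?? 0) + 1);
  }
  return tokenized.map((tokens) => {
    const vec = new Map<string, number>();
    for (const t of tokens) vec.set(t, (vec.get(t) ?? 0) + 1); // raw TF
    for (const [t, tf] of vec) {
      // Smoothed IDF keeps weights finite even for terms in every document.
      vec.set(t, tf * Math.log(1 + n / (1 + (df.get(t) ?? 0))));
    }
    return vec;
  });
}

function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0, na = 0, nb = 0;
  for (const [t, w] of a) { dot += w * (b.get(t) ?? 0); na += w * w; }
  for (const w of b.values()) nb += w * w;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```

Route vectors would be pre-computed once from agent/mode descriptions, and an incoming message scored against each; a score below the 0.15 threshold falls through to the next layer.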
…ntrast

Root cause: hover:bg-background-danger-muted was undefined (no CSS variable), causing the button to lose its background on hover and expose the card bg (#3f434b), which fails WCAG AA contrast with text-danger (#ff6b6b) at 3.58:1.

Fix:
- Add --background-danger-muted, --background-success-muted, --background-warning-muted tokens to both light and dark themes in main.css
- Add corresponding @theme inline aliases for Tailwind class generation
- Change the Delete button border from border-default to border-danger for better visual affordance
- Light mode tokens use near-white tints (#fff5f5, #e6f4ea, #fff8e1)
- Dark mode tokens use solid dark-tinted colors (#3d2222, #1e2e1a, #302a18)

Contrast audit (all PASS WCAG AA ≥4.5:1):
- Dark default: #ff6b6b on #22252a = 5.54:1
- Dark hover: #ff6b6b on #3d2222 = 5.22:1
- Light default: #d32f2f on #ffffff = 4.98:1
- Light hover: #d32f2f on #fff5f5 = 4.65:1

Fixes: goose4-ntlm
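The audited ratios follow the WCAG 2.x contrast formula: relative luminance of each color plus a 0.05 flare term. A self-contained TypeScript version for reproducing the audit numbers:

```typescript
// Linearize one 8-bit sRGB channel per the WCAG 2.x relative-luminance formula.
function srgbChannel(c: number): number {
  const s = c / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

// Relative luminance of a "#rrggbb" hex color.
function luminance(hex: string): number {
  const [r, g, b] = [1, 3, 5].map((i) => parseInt(hex.slice(i, i + 2), 16));
  return 0.2126 * srgbChannel(r) + 0.7152 * srgbChannel(g) + 0.0722 * srgbChannel(b);
}

// WCAG contrast ratio: (L_lighter + 0.05) / (L_darker + 0.05), range 1..21.
function contrastRatio(fg: string, bg: string): number {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}
```

For example, contrastRatio("#ff6b6b", "#22252a") reproduces the 5.54:1 figure from the audit above, and WCAG AA for normal text requires ≥4.5:1.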
…ix noExplicitAny/noNonNullAssertion

- Auto-fix import organization across 112 files (biome organizeImports)
- Add type='button' to mock buttons in ProviderGrid.test.tsx (useButtonType)
- Replace 'as any' with a typed cast in ModelAndProviderContext.test.tsx (noExplicitAny)
- Replace a non-null assertion with a null guard in ModelAndProviderContext.test.tsx (noNonNullAssertion)
- Add QA and Security debug prompt templates
- Result: 0 biome errors, 0 biome warnings across 484 files
- All 540 UI tests pass
…s routes

- universal_mode: richer when_to_use descriptions for all 5 modes
- goose/review.md, pm/review.md, pm/write.md: enhanced prompt templates
- security/ask.md, security/write.md: improved security agent prompts
- goose_agent, qa_agent, security_agent: agent refinements
- analytics.rs: new analytics route endpoints
- agent_management.rs: agent management improvements
- tool_analytics.rs: analytics tracking updates
- prompt_template.rs: template rendering improvements
…mprovements

- EvalRunner, AgentCatalog, RoutingInspector: enhanced analytics UI
- ModelAndProviderContext: improved provider state management
- useChatStream: streaming enhancements
- AppSidebar, SessionListView, RecipesView: navigation refinements
- SettingsView, ConfigSettings, TelemetrySettings: settings improvements
- SwitchModelModal, ModelsBottomBar: model switching updates
- ChatInput, BaseChat: chat UX improvements
- AppLayout: layout refinements
- main.ts: electron main process updates
- Various component and toast improvements
…sion guards

Routing quality improvements:
- Enrich the PM Agent description with RICE, MoSCoW, sprint planning, acceptance criteria, phased rollout vocabulary — PM accuracy 28.6% → 71.4%
- Enrich the Research Agent description with literature review, benchmarking, RFC summaries, concept explanation — Research accuracy 0% → 50%
- Agent-level accuracy: 58% → 70%

New eval regression guards:
- test_agent_level_accuracy_baseline: ≥60% agent accuracy
- test_pm_routing_baseline: ≥50% PM routing
- test_research_routing_baseline: ≥30% Research routing
- test_semantic_layer_used: ≥3 cases routed via the semantic layer

Total: 62 routing tests pass (12 semantic + 15 intent_router + 11 eval + 24 orchestrator)
…aggregation prompt

SOTA E1 improvements:

1. GenUI Cross-Agent Binding
- Add genui to Developer Agent recommended_extensions (ask, write, debug modes)
- Add genui to QA Agent recommended_extensions
- Add genui to Research Agent recommended_extensions
- Any agent can now produce data visualizations via genui tools

2. Adaptive Thinking Debug Prompts (Anthropic 2025)
- Developer debug: hypothesis matrix, 5 Whys, fault tree, interleaved thinking
- QA debug: flaky test decision tree, test isolation, failure classification
- Security debug: attack vector tree, incident timeline, blast radius assessment
- All 3 include an anti-overthinking guard and effort calibration

3. Compound Task Result Aggregation
- New orchestrator/aggregation.md prompt template
- XML-structured with synthesis instructions
- Produces unified responses instead of concatenated parts

All 1029 tests pass, clippy clean, fmt clean.
…dispatch
- Add aggregate_results_with_llm() to orchestrator_agent.rs
Uses orchestrator/aggregation.md prompt template to synthesize
multiple sub-task results into a coherent unified response via LLM
Falls back to simple string concatenation on any error
- Register orchestrator/aggregation.md in prompt_template.rs
Template with {{task_count}}, {{user_message}}, {{results}} variables
- Wire LLM aggregation into reply.rs compound dispatch flow
When provider available, uses LLM synthesis; otherwise falls back
to aggregate_results() simple concatenation
Quality: 1029 tests pass, clippy clean, fmt clean
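The synthesize-or-fall-back flow can be sketched as follows. The real aggregate_results_with_llm() is async Rust; this synchronous TypeScript reduction, its signature, and the prompt layout are all illustrative assumptions:

```typescript
// Try LLM synthesis of sub-task results; degrade to simple concatenation on
// any error or when no provider is available (synthesize is undefined).
function aggregateResults(
  userMessage: string,
  results: string[],
  synthesize?: (prompt: string) => string,
): string {
  const fallback = results.join("\n\n"); // simple concatenation path
  if (!synthesize) return fallback;      // no provider available
  try {
    // Stand-in for rendering orchestrator/aggregation.md with
    // {{task_count}}, {{user_message}}, {{results}}.
    const prompt =
      `task_count: ${results.length}\nuser_message: ${userMessage}\nresults:\n${fallback}`;
    return synthesize(prompt);
  } catch {
    return fallback; // any LLM error falls back gracefully
  }
}
```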
Add ProjectAgentConfig system for per-project agent customization:
agent_config.rs (418 lines):
- Load .goose/agents.yaml with serde deserialization
- Enable/disable agents, override descriptions, add extensions
- Custom mode creation (with slug, name, description, tool_groups)
- Routing feedback persistence (.goose/routing_feedback.json)
- Mode override (description, when_to_use per mode per agent)
- 6 tests covering parsing, loading, applying, feedback, custom modes
intent_router.rs:
- apply_project_config() integration method
- Project-level default_agent/default_mode fallback in route()
- 3 new tests for config integration (disable, default, custom mode)
YAML schema supports:

default_agent: 'Developer Agent'
default_mode: 'write'
agents:
  'Developer Agent':
    enabled: true
    description: 'Custom description'
    extra_extensions: ['flutter-tools']
    modes:
      write:
        when_to_use: 'When creating Flutter widgets'
custom_modes:
  - slug: 'data-pipeline'
    name: 'Data Pipeline'
    description: 'Build and debug data pipelines'
    agents: ['Developer Agent']
    tool_groups: ['read', 'edit', 'command']
Layer 0 in the 4-tier hybrid routing architecture:
[0] Feedback corrections (learned, 0.95 confidence)
[1] Keyword match (<10ms)
[2] TF-IDF semantic (~1ms)
[3] Default fallback

- Add routing_feedback field to IntentRouter
- check_feedback() uses keyword overlap (≥50%) to find matching corrections
- record_routing_feedback() stores user corrections for similar future queries
- Feedback takes highest priority (Layer 0) — if a user previously corrected routing for a similar message, we trust that correction
- Integrates with .goose/agents.yaml routing_feedback persistence
- 2 new tests: feedback override + unrelated message non-match
- All 1040 tests pass, clippy clean
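The ≥50% keyword-overlap rule could look like this in TypeScript (the actual check_feedback() is Rust, and the data shapes below are assumptions):

```typescript
// A stored routing correction: the keywords of the corrected message and the
// agent the user said it should have gone to. Shape is hypothetical.
interface RoutingCorrection { keywords: string[]; agent: string; }

// Return the corrected agent if at least half of a stored correction's
// keywords appear in the new message, else null (fall through to Layer 1+).
function checkFeedback(message: string, corrections: RoutingCorrection[]): string | null {
  const tokens = new Set(message.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  for (const c of corrections) {
    const hits = c.keywords.filter((k) => tokens.has(k.toLowerCase())).length;
    if (c.keywords.length > 0 && hits / c.keywords.length >= 0.5) {
      return c.agent; // learned correction wins (0.95 confidence in the real router)
    }
  }
  return null;
}
```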
- GET /agent-config — load the current project agent config
- PUT /agent-config — save the updated config to .goose/agents.yaml
- GET /agent-config/routing-feedback — list routing corrections
- POST /agent-config/routing-feedback — record a new routing correction

Server backbone for the Agent Config UI feature. Closes goose4-j6mp.
- knowledge_extraction.rs: extract structured KG entities and relations from conversation text via an LLM-powered prompt
- Entity types: Concept, Component, Decision, Finding, Risk, RepoPath
- Relation types: depends_on, implements, affects, derived_from, etc.
- Features: merge/dedup, confidence filtering, JSON fence parsing, cap at 20
- knowledge_extraction.md: prompt template for entity/relation extraction
- Registered in TEMPLATE_REGISTRY
- 7 tests covering parsing, merging, filtering, caps, JSON extraction

Part of SOTA E6: GraphRAG-style knowledge management
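A TypeScript sketch of the JSON-fence parsing, confidence filtering, and cap described above (the real code is Rust; the entity shape and the 0.5 threshold are assumptions):

```typescript
// Hypothetical entity shape; the Rust code also extracts relations.
interface Entity { name: string; kind: string; confidence: number; }

function parseEntities(llmOutput: string, minConfidence = 0.5, cap = 20): Entity[] {
  // Accept either a raw JSON array or one wrapped in a ```json fence.
  const fence = llmOutput.match(/```(?:json)?\s*([\s\S]*?)```/);
  const payload = fence ? fence[1] : llmOutput;
  let parsed: Entity[];
  try {
    parsed = JSON.parse(payload);
  } catch {
    return []; // malformed LLM output yields nothing rather than crashing
  }
  // Drop low-confidence entities and cap the result at 20.
  return parsed.filter((e) => e.confidence >= minConfidence).slice(0, cap);
}
```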
Contributor
Author
Yes, sorry, I pushed my test to the wrong repo. It was for testing purposes; maybe I will split it up and push the features later.
CLI-via-Goosed: Unified Agent Architecture
Summary
This PR introduces a unified architecture where the CLI communicates with agents through goosed (the server binary), aligning desktop and CLI on a single communication path. It also adds multi-agent orchestration with an intent router, ACP/A2A protocol compatibility, and comprehensive UI improvements.

Key Changes
🏗️ Architecture: CLI-via-Goosed
- CLI talks to the goosed server instead of directly instantiating agents
- GoosedClient manages the server lifecycle (spawn, health check, graceful shutdown)
- Server state persisted (~/.config/goose/goosed.state)
- goose service install|uninstall|status|logs for managed daemon lifecycle (systemd/launchd)

🤖 Multi-Agent System
- Internal modes (judge, planner, recipe_maker) filtered from public discovery

📡 Protocol Compatibility
📊 Analytics & Observability
- Endpoints: POST /analytics/routing/inspect, POST /analytics/routing/eval, GET /analytics/routing/catalog
- Tracing spans: orchestrator.route, orchestrator.llm_classify, intent_router.route
- useChatStream split into streamReducer.ts + streamDecoder.ts (860→576 lines)

🔒 Security & Reliability
- /runs endpoints via ServiceBuilder
- ErrorResponse on all 11 bare StatusCode returns in runs.rs
- ErrorResponse::bad_request() and conflict() constructors added

Quality Gates
- cargo build --all-targets
- cargo fmt --check
- cargo clippy --all-targets -- -D warnings
- cargo test -p goose --lib (789 tests)
- cargo test -p goose-server (40 tests)
- npx tsc --noEmit
- npx vitest run (325/326, 1 pre-existing)
- npx eslint
Routing Evaluation Baseline
Files Changed
Follow-up Work