Improve SD Agent Playbook and Agent Core Reliability#296

Merged
kovtcharov merged 47 commits into main from kalin/sd-playbook
Feb 4, 2026
Conversation

@kovtcharov
Collaborator

@kovtcharov kovtcharov commented Feb 3, 2026

Summary

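This PR improves the SD Agent's reliability and usability by adding dynamic parameter substitution for multi-step plans, making loop detection configurable, raising the SD context size to 16K, consolidating the playbook into a single page, and fixing bugs.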


Core Agent Capabilities & Improvements

1. Dynamic Parameter Substitution for Multi-Step Plans

Problem: The agent planning system could not handle parameter dependencies between steps. When creating multi-step plans, the LLM hallucinated placeholder paths instead of using actual tool results, which caused "Image not found" errors.

Solution: Implemented dynamic parameter substitution with $PREV.field and $STEP_N.field placeholder syntax.

Implementation:

  • Added _resolve_plan_parameters() method to Agent base class
  • Recursively resolves placeholders in tool arguments from previous step results
  • Integrated into plan execution loop with state management
  • Stack overflow protection (MAX_DEPTH=50)
  • Clears step_results on error recovery to prevent stale data contamination

Example:

{
  "plan": [
    {"tool": "generate_image", "tool_args": {"prompt": "robot kitten"}},
    {"tool": "create_story_from_image", "tool_args": {"image_path": "$PREV.image_path"}}  
  ]
}
# System automatically substitutes $PREV.image_path with actual path from step 1

Impact: Enables complex multi-step workflows for ALL agents, not just SDAgent.
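
The resolution step described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_resolve_plan_parameters()` from `agent.py`: the standalone function name, the assumption that `$STEP_N` is zero-based, the whole-string placeholder match, and the choice to return values unresolved once the depth limit is hit are all assumptions made for the example.

```python
import re
from typing import Any, List

MAX_DEPTH = 50  # depth limit named in the PR description

# Assumption: a placeholder must be the entire string, e.g. "$PREV.image_path".
_PLACEHOLDER = re.compile(r"\$(PREV|STEP_(\d+))\.(\w+)")


def resolve_plan_parameters(value: Any, step_results: List[Any], depth: int = 0) -> Any:
    """Replace $PREV.field / $STEP_N.field placeholders with earlier step results."""
    if depth > MAX_DEPTH:
        return value  # assumption: stop recursing and leave the value untouched
    if isinstance(value, dict):
        return {k: resolve_plan_parameters(v, step_results, depth + 1) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_plan_parameters(v, step_results, depth + 1) for v in value]
    if not isinstance(value, str):
        return value  # primitives (int, bool, None, ...) are preserved as-is
    match = _PLACEHOLDER.fullmatch(value)
    if match is None or not step_results:
        return value
    # Assumption: $STEP_N is zero-based, so $STEP_0 is the first step.
    index = -1 if match.group(1) == "PREV" else int(match.group(2))
    if index >= len(step_results):
        return value
    result = step_results[index]
    field = match.group(3)
    if not isinstance(result, dict) or field not in result:
        return value  # invalid placeholders degrade gracefully
    return result[field]
```

With `step_results = [{"image_path": "/tmp/robot.png"}]`, calling the sketch on `{"image_path": "$PREV.image_path"}` yields `{"image_path": "/tmp/robot.png"}`, while an unknown field such as `$PREV.missing` comes back unchanged.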

2. Configurable Loop Detection

Problem: The loop detector was too aggressive. It stopped the agent after a single repeated tool call, which blocked legitimate requests such as "create 3 robot designs".

Solution:

  • Changed from single-repeat detection to consecutive-count tracking
  • Made threshold configurable via max_consecutive_repeats parameter (default: 4)
  • Allows users to adjust sensitivity per agent

Impact: Agents can now handle multi-iteration requests while still preventing infinite loops.
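
Consecutive-count tracking could look something like the sketch below. The `LoopDetector` class and its `record()` method are hypothetical names for illustration; only `max_consecutive_repeats` and its default of 4 come from this PR. The sketch also assumes the detector fires once the number of identical consecutive calls exceeds the threshold, which is one reading of "allowed before triggering".

```python
import json
from typing import Any, Dict, Optional, Tuple


class LoopDetector:
    """Hypothetical helper that counts consecutive identical tool calls."""

    def __init__(self, max_consecutive_repeats: int = 4) -> None:
        self.max_consecutive_repeats = max_consecutive_repeats
        self._last_key: Optional[Tuple[str, str]] = None
        self._count = 0

    def record(self, tool: str, tool_args: Dict[str, Any]) -> bool:
        """Record a call; return True when the agent appears to be looping."""
        # Serialize args with sorted keys so argument order does not matter.
        key = (tool, json.dumps(tool_args, sort_keys=True, default=str))
        if key == self._last_key:
            self._count += 1
        else:
            self._last_key = key
            self._count = 1
        # Assumption: trigger only once the count exceeds the threshold.
        return self._count > self.max_consecutive_repeats
```

Under this reading, four identical calls in a row pass with the default setting and the fifth trips the detector; any different call in between resets the count.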

3. Context Size Optimization

Problem: The 8K context was insufficient for multi-step planning with dynamic parameters. The workflow hit "context exceeded" errors (9,154 tokens needed vs. 8,192 available).

Solution:

  • Increased SDAgent context from 8K to 16K
  • Updated SD profile min_context_size in init system
  • Force unload/reload LLM models during init to ensure context settings persist
  • Added warning display when context verification fails

Impact: SD multi-step workflows now complete without context errors.
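
As a quick check of the numbers quoted in the problem statement, the following sketch compares the reported token requirement against both context sizes. The helper name `context_headroom` is made up for this example.

```python
def context_headroom(tokens_needed: int, ctx_size: int) -> int:
    """Return how many tokens are left over (negative means the prompt does not fit)."""
    return ctx_size - tokens_needed


TOKENS_NEEDED = 9154  # figure reported in the PR description

for ctx_size in (8192, 16384):
    headroom = context_headroom(TOKENS_NEEDED, ctx_size)
    status = "fits" if headroom >= 0 else "exceeds context"
    print(f"ctx={ctx_size}: {status} (headroom {headroom} tokens)")
```

The 8K setting falls 962 tokens short, while 16K leaves 7,230 tokens of headroom for conversation history and tool output.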


Documentation Improvements

Consolidated Playbook Structure

Before: 4 files, ~1,900 lines

  • index.mdx (overview)
  • part-1-building-agent.mdx (tutorial)
  • part-2-architecture.mdx (deep dive - 628 lines)
  • part-3-variations.mdx (patterns - 389 lines)

After: 1 file, 543 lines (~70% reduction)

  • Single comprehensive guide with focused, practical content
  • Removed redundant architecture explanations (MRO, composable prompts, 5 debugging methods)
  • Removed advanced variations that added cognitive load
  • Added troubleshooting with GitHub issue reporting and contact info

Result: Users can build a working multi-modal agent without being overwhelmed by implementation details.


Testing & Quality

New Tests

  • test_parameter_substitution() - Basic placeholder resolution
  • test_parameter_substitution_edge_cases() - Edge cases:
    • Empty step_results
    • Non-dict results
    • Recursion depth limit (51 levels)
    • Circular references
    • Unicode field names
    • Special characters
    • Primitive type preservation

Test Results

  • All 5 SD agent unit tests pass
  • All lint checks pass (Black, isort, Pylint, Flake8)
  • Comprehensive edge case coverage

Files Changed

Core Implementation:

  • src/gaia/agents/base/agent.py (+291 lines) - Dynamic parameter substitution, configurable loop detection
  • src/gaia/agents/sd/agent.py - Updated to 16K context, placeholder syntax in prompts
  • examples/sd_agent_example.py - Consistent with SDAgent implementation

Configuration:

  • src/gaia/installer/init_command.py - 16K context for SD profile, force unload/reload, context verification

Documentation:

  • docs/playbooks/sd-agent/index.mdx - Consolidated single-page guide
  • Deleted: part-1, part-2, part-3 (-1,017 lines)
  • docs/docs.json - Updated navigation

Tests:

  • tests/unit/test_sd_agent.py (+163 lines) - Comprehensive test coverage

Utilities:

  • util/lint.py - uvx fallback when command not available
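
One way such a fallback might be structured is sketched below. The function name `build_lint_command` and the injectable `which` parameter are illustrative assumptions, not the actual contents of `util/lint.py`.

```python
import shutil
from typing import Callable, List, Optional, Sequence


def build_lint_command(
    tool: str,
    args: Sequence[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Prefer running a lint tool through uvx; otherwise invoke it directly."""
    if which("uvx"):
        return ["uvx", tool, *args]
    # uvx is not on PATH, so fall back to calling the tool itself.
    return [tool, *args]
```

Passing `which` as a parameter keeps the lookup easy to stub in tests without touching the real PATH.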

Backward Compatibility

100% Backward Compatible

  • Plans without placeholders work unchanged
  • Invalid placeholders degrade gracefully (returned as-is)
  • No breaking changes to existing APIs
  • All existing tests still pass
  • New parameters have sensible defaults

Security & Robustness

Security:

  • Stack overflow protection (MAX_DEPTH limit)
  • State isolation (step_results cleared on error recovery)
  • No code injection risk (string substitution only)

Reliability:

  • Comprehensive test coverage
  • Edge case handling (empty results, circular refs, Unicode)
  • Graceful degradation on errors
  • Clear user warnings via console

What's Next

Users should test the workflow after this PR merges:

gaia init --profile sd
gaia sd "create a robot exploring ancient ruins"

Expected: Image generated + story created with no errors or warnings.

kovtcharov and others added 5 commits February 2, 2026 21:36
Split SD agent playbook into 3 parts for better learning progression:
- Part 1: Quick start + build your first agent (25 min)
- Part 2: Architecture deep dive (20 min)
- Part 3: Advanced patterns and variations (20 min)

Improved SD agent reliability:
- Default to generating one image unless explicitly requested
- Fix empty string handling in create_story_from_last_image
- Include story text in final answer for better UX

Updated documentation:
- Added Lemonade Server architecture explanation
- Added Mermaid diagrams with AMD branding
- Added 5 video placeholders for production
- Removed presentation references from docs.json

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Generate random seeds by default to produce unique images on each run.
Users can still specify --seed option for reproducible results.

Updated documentation to explain seed behavior and reproducibility.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
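
The seed behavior described in the commit above might be sketched as follows. `resolve_seed` and `MAX_SEED` are hypothetical names, and the seed range is an assumption since the commit message does not state one.

```python
import random
from typing import Optional

MAX_SEED = 2**32 - 1  # assumed upper bound; the actual range is not given in the PR


def resolve_seed(seed: Optional[int] = None) -> int:
    """Use the caller's seed for reproducibility, else pick a random one."""
    if seed is not None:
        return seed  # an explicit seed (including 0) is kept unchanged
    return random.randint(0, MAX_SEED)
```

Checking `is not None` instead of truthiness matters here, because a seed of 0 is a valid reproducible value.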
Use clearer section headers and separators in final answer:
- Story text displayed prominently first
- Clean separator line between sections
- Enhanced prompt and file paths grouped separately

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Provide full example story with proper spacing and formatting so LLM
knows exactly how to structure the final answer with story text.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
@kovtcharov kovtcharov added this to the v0.15.3 milestone Feb 3, 2026
@kovtcharov kovtcharov self-assigned this Feb 3, 2026
@kovtcharov kovtcharov added the documentation Documentation changes label Feb 3, 2026
@github-actions github-actions bot added labels devops (DevOps/infrastructure changes), agents (Agent system changes), llm (LLM backend changes), performance (Performance-critical changes) Feb 3, 2026
kovtcharov and others added 17 commits February 2, 2026 22:13
Restore the technical introduction explaining multi-modal architecture,
tool composition through mixins, and Lemonade Server integration.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Removed redundant sections:
- Deleted "Quick Concepts" section (duplicated "What You'll Build")
- Deleted standalone "Choosing SD Models" section (moved to Step 2 accordion)
- Removed redundant "Run Your Agent" section (integrated into Step 3)

Added technical depth to "What You'll Build":
- MRO chain, HTTP endpoints, tool signatures
- Complete tool registry explanation
- Instance state details

Fixed technical inaccuracy:
- Corrected model formats: LLMs use GGUF, SD uses safetensors
- Added specific format details for each model

Added missing prerequisite:
- Virtual environment creation step

Result: 14% shorter, more accurate, better flow.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Clarify what runs where:
- GGUF models (Qwen3-8B, Qwen3-VL-4B) run on iGPU (Radeon) via Vulkan
- SDXL-Turbo (safetensors) currently runs on CPU

Added tabbed selection for venv activation (Windows/Linux).

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Add --python 3.12 flag to uv venv command for consistency with
quickstart guide and to ensure correct Python version.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Add proper 4-space indentation to method definitions in Step 2 so they
can be directly pasted under the class definition from Step 1.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Specify model_id="Qwen3-8B-GGUF" in super().__init__() to use the
model downloaded by gaia init --profile sd, not the default
Qwen3-Coder-30B which isn't included in the SD profile.

This fixes "model_load_error" when running the example.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Replace simple system prompt with the actual SDAgent prompts that include:
- Research-backed prompt enhancement strategies
- Model-specific guidelines (SDXL-Turbo, SD-Turbo, etc.)
- Workflow instructions for tool usage

Tool schemas (parameters, models) are auto-injected by Agent base class.

This fixes issues where LLM uses wrong models or parameters.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Add get_sd_system_prompt() method to SDToolsMixin so agents can
compose system prompts from inherited mixins instead of manually
importing prompt fragments.

Pattern:
- Mixins provide both tools AND domain-specific prompt fragments
- Agents compose them: return self.get_sd_system_prompt()
- Tool schemas auto-injected by Agent base class

Benefits: Single responsibility, reusability, composability.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Add template method pattern for automatic prompt composition:
- _get_mixin_prompts() - Auto-collects from inherited mixins
- _compose_system_prompt() - Composes mixin + agent prompts
- Mixins provide get_*_system_prompt() methods

Benefits:
- Mixins own their domain knowledge (tools + prompts)
- Agents automatically inherit behavior
- Can modify, extend, or override prompts
- Fully backwards compatible

SDToolsMixin now provides get_sd_system_prompt() with research-backed
prompt engineering guidelines.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
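
A rough sketch of the template method pattern from the commit above is shown here. The method names `_get_mixin_prompts()`, `_compose_system_prompt()`, `_get_system_prompt()`, and the `get_*_system_prompt()` convention come from the commit messages; the discovery logic (walking the MRO and matching method names), the prompt strings, and the `StoryAgent` class are assumptions for illustration and may differ from the real base class.

```python
import re
from typing import List


class Agent:
    def _get_system_prompt(self) -> str:
        """Agent-specific instructions; empty means 'mixin prompts only'."""
        return ""

    def _get_mixin_prompts(self) -> List[str]:
        """Collect prompts from every get_*_system_prompt() method in the MRO."""
        prompts, seen = [], set()
        for cls in type(self).__mro__:
            for name in vars(cls):
                if name in seen or not re.fullmatch(r"get_\w+_system_prompt", name):
                    continue
                seen.add(name)
                prompts.append(getattr(self, name)())
        return prompts

    def _compose_system_prompt(self) -> str:
        """Join mixin prompts first, then the agent's own instructions."""
        parts = self._get_mixin_prompts() + [self._get_system_prompt()]
        return "\n\n".join(part for part in parts if part)


class SDToolsMixin:
    def get_sd_system_prompt(self) -> str:
        return "SD guidelines"  # placeholder text for the sketch


class VLMToolsMixin:
    def get_vlm_system_prompt(self) -> str:
        return "VLM guidelines"  # placeholder text for the sketch


class StoryAgent(Agent, SDToolsMixin, VLMToolsMixin):
    def _get_system_prompt(self) -> str:
        return "Tell short stories."
```

In this sketch, `StoryAgent()._compose_system_prompt()` joins the SD prompt, the VLM prompt, and the custom instructions in that order. An agent that returned a mixin prompt from `_get_system_prompt()` as well would see it twice, which is why returning an empty string is the safe default.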
Moved SD prompts from src/gaia/agents/sd/ to src/gaia/sd/ to keep
mixins self-contained (tools + prompts in same package).

Changes:
- Created src/gaia/sd/prompts.py (moved from agents/sd/)
- SDToolsMixin imports from new location
- Added get_vlm_system_prompt() to VLMToolsMixin for consistency
- SDAgent uses mixin's get_sd_system_prompt() (no manual import)
- Deprecated old prompts.py location

Benefits:
- Mixins are truly self-contained (src/gaia/sd/ has everything for SD)
- Cleaner agent implementations (just call mixin methods)
- Better separation of concerns

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Prompts now live with the mixin in src/gaia/sd/prompts.py.
Removed deprecated file from src/gaia/agents/sd/.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Added comprehensive documentation of composable system prompts:
- Part 1: Explains the simple usage (return self.get_sd_system_prompt())
- Part 2: Deep dive with 5 usage patterns and debugging guide

Patterns covered:
1. Use mixin prompts as-is (automatic)
2. Return mixin prompt explicitly
3. Extend with custom instructions
4. Modify mixin prompts
5. Custom composition order

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Updated Part 1 example to show:
- Using mixin tools directly (generate_image, analyze_image)
- Creating custom tools that wrap mixin functionality
- Adding custom prompt instructions

Example now creates create_story_about_image() that wraps VLM's
create_story_from_image with custom metadata, demonstrating the
wrapper pattern for building specialized tools.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Changed _get_system_prompt from abstract to concrete method with
empty string default. This allows agents using only mixin prompts
to avoid implementing it.

Fixed SDAgent duplicate prompt bug:
- Was returning self.get_sd_system_prompt() causing duplication
- Now returns "" to use automatic mixin composition

Backwards compatible: existing agents continue to work.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Enhanced comments and docstrings to better explain:
- Why create a custom tool (specialization vs. generic)
- What the wrapper pattern adds (fixed style, metadata, extensibility)
- How it calls mixin methods via inheritance

Fixed incorrect system prompt explanation to accurately describe
composition: mixin prompts + custom instructions.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Solved initialization order issues with hybrid approach:
- Mixins provide static base guidelines (no instance state needed)
- get_*_system_prompt() composes base + instance-specific gracefully
- Falls back to base if called before mixin initialization

Changes:
- SDToolsMixin: get_base_sd_guidelines() (static) + get_sd_system_prompt() (instance)
- VLMToolsMixin: get_base_vlm_guidelines() (static) + get_vlm_system_prompt() (instance)
- Agent: system_prompt now lazy property (composes on first access)
- Composition includes: mixin prompts + custom + tools + response format

Benefits:
- No initialization order issues
- Graceful degradation (works even if called early)
- Simple, robust, debuggable

Added debugging documentation with 5 methods to observe prompts.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
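
The hybrid approach in the commit above (static base guidelines, an instance-level prompt that falls back to the base, and a lazily composed `system_prompt`) could be sketched like this. `get_base_sd_guidelines()` and `get_sd_system_prompt()` are named in the commit; the `sd_default_model` attribute, the prompt text, and `DemoAgent` are hypothetical stand-ins.

```python
from typing import Optional


class SDToolsMixin:
    # Hypothetical instance state; None means the mixin is not initialized yet.
    sd_default_model: Optional[str] = None

    @staticmethod
    def get_base_sd_guidelines() -> str:
        """Static guidelines that need no instance state."""
        return "Describe subject, style, and lighting."

    def get_sd_system_prompt(self) -> str:
        """Base guidelines plus instance details, or just the base if called early."""
        base = self.get_base_sd_guidelines()
        if self.sd_default_model is None:
            return base
        return f"{base}\nActive model: {self.sd_default_model}"


class DemoAgent(SDToolsMixin):
    def __init__(self) -> None:
        self._system_prompt: Optional[str] = None

    @property
    def system_prompt(self) -> str:
        """Compose on first access, then reuse the cached string."""
        if self._system_prompt is None:
            self._system_prompt = self.get_sd_system_prompt()
        return self._system_prompt
```

Because composition is deferred to first access, the prompt reflects whatever state exists at that moment; the trade-off is that later changes to the mixin state are not picked up unless the cache is cleared.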
@github-actions github-actions bot added labels eval (Evaluation framework changes), tests (Test changes), electron (Electron app changes), security (Security-sensitive changes) Feb 4, 2026
Contributor

@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

- Fix VLMToolsMixin tool attribution: create_story_from_image is a custom tool, not provided by VLMToolsMixin
- Update SDAgent example in Part 2 to match actual implementation (includes educational style and file saving)
- Fix max_steps default references (20 for base Agent, 10 for SDAgent)
- Add educational style to Part 1 code examples
- Fix composition pattern examples to use mixin clients directly
- Update gaia init documentation to mention 8K context configuration
- Fix SD agent unit tests for composable system prompts architecture
- Fix util/lint.py to fall back to direct tool execution when uvx not available
- Remove vlm-tools-expansion.md issue template
- Correct tool counts: 5 from mixins + 1 custom = 6 total
kovtcharov and others added 11 commits February 3, 2026 18:12
Implements $PREV.field and $STEP_N.field placeholder syntax to resolve parameter dependencies between plan steps. This fixes the issue where agents would hallucinate placeholder paths instead of using actual tool results.

Core changes:
- Add _resolve_plan_parameters() method to Agent base class with recursion depth limit
- Integrate parameter resolution into plan execution loop
- Update SDAgent and example to use placeholder syntax in system prompts
- Clear step_results on error recovery to prevent stale data contamination
- Add comprehensive unit tests including edge cases

Documentation:
- Add "Multi-Step Planning with Dynamic Parameters" section to Part 2
- Simplify Part 1 by removing redundant sections and example output
- Add GitHub issue reporting and contact info to troubleshooting

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Remove redundant 're' import in _resolve_plan_parameters() - already imported at module level.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
The 8K context was insufficient for multi-step planning with dynamic parameters. Increased to 16K to handle:
- SD system prompts with placeholder examples
- VLM system prompts
- Agent custom prompts
- Multi-step plan with tool schemas
- Conversation history

Changes:
- SDAgentConfig.ctx_size: 8192 → 16384
- SD profile min_context_size: 8192 → 16384
- Fix Pylint W0201: Initialize _ctx_verified and _ctx_warning in __init__
- Add warning when context verification fails for LLM models
- Update all documentation references from 8K to 16K

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
When testing model inference during init, unload the model first if already loaded, then reload with save_options=True. This ensures recipe_options with ctx_size are properly persisted.

Also improved display to show warning emoji when context verification fails:
- ✓ Qwen3-8B-GGUF - OK (ctx: 16384)  ← Success
- ✓ Qwen3-8B-GGUF - OK ⚠️ Context unverified!  ← Warning

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Merged index.mdx and part-1-building-agent.mdx into a single streamlined playbook. Deleted parts 2 and 3 which were overly verbose and redundant.

Changes:
- Merged index intro, video, and quick test into main playbook
- Kept all hands-on tutorial content from part-1
- Removed architecture deep-dive (part-2) and variations (part-3)
- Updated docs.json navigation to show single page
- Result: One comprehensive, focused guide instead of fragmented multi-part tutorial

The consolidated guide maintains the strong structure of both original files while eliminating unnecessary complexity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
After image generation and story creation complete, the agent was creating additional stories in a loop. Added explicit instruction to provide final answer immediately after both tools complete.

Fix prevents infinite loop detection warning when agent tries to call create_story_from_image multiple times.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
…opping

Changed from single-repeat detection to consecutive-count detection. Agent now allows up to 3 consecutive identical tool calls before triggering the loop detector.

Changes:
- Replaced last_tool_call with tool_call_history (tracks last 5 calls)
- Count consecutive identical calls
- Trigger after 3 consecutive repeats (was: 1 repeat)

This allows legitimate use cases like "create 3 robot designs" while still preventing infinite loops.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Added max_consecutive_repeats parameter to Agent.__init__() to allow customization of how many consecutive identical tool calls are allowed before triggering loop detection.

Default: 4 consecutive calls (increased from hardcoded 3)

This allows users to adjust sensitivity:
- Lower value (2-3): More aggressive loop detection
- Higher value (5-10): More tolerant of repetition
- For SD agent: Default 4 allows multiple variations while preventing infinite loops

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Use console.print_repeated_tool_warning() exclusively instead of also logging. Console provides better user visibility and Rich formatting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
Fixed outdated comment that still referenced 8K context.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
@kovtcharov kovtcharov enabled auto-merge February 4, 2026 03:05
kovtcharov and others added 2 commits February 3, 2026 19:07
Fixed all references to:
- 8K context → 16K context for multi-step planning
- Updated context size table for all models
- Updated troubleshooting with correct ctx-size command
- Updated playbook reference from "3-part" to single comprehensive guide
- Removed broken link to deleted part-2-architecture.mdx

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <[email protected]>
@kovtcharov kovtcharov changed the title from "Improve SD Agent Playbook and Reliability" to "Improve SD Agent Playbook and Agent Core Reliability" Feb 4, 2026
@kovtcharov kovtcharov added this pull request to the merge queue Feb 4, 2026
Merged via the queue into main with commit 3e921dd Feb 4, 2026
51 checks passed
@kovtcharov kovtcharov deleted the kalin/sd-playbook branch February 4, 2026 04:38
Labels

agents: Agent system changes; audio: Audio (ASR/TTS) changes; chat: Chat SDK changes; cli: CLI changes; code-agent: Code agent changes; dependencies: Dependency updates; devops: DevOps/infrastructure changes; documentation: Documentation changes; electron: Electron app changes; eval: Evaluation framework changes; jira: Jira agent changes; llm: LLM backend changes; mcp: MCP integration changes; performance: Performance-critical changes; rag: RAG system changes; security: Security-sensitive changes; talk: Talk agent changes; tests: Test changes

2 participants