SD Agent Tutorial: Integrating Multi-Modal Endpoints #280

@kovtcharov

Description

Overview

Create a minimal, teachable example demonstrating how to integrate Stable Diffusion image generation into a GAIA agent. This serves as a reference implementation for adding any multi-modal endpoint (image, audio, video) to agents using the GAIA SDK.

Goal: A developer can follow this tutorial and understand how to:

  1. Call the Lemonade Server SD endpoint from an agent
  2. Register image generation as a tool
  3. Handle base64 image responses
  4. Display/save generated images

Motivation

There is currently no simple example showing how to add image generation capabilities to a GAIA agent. Developers need a minimal working example that:

  • Is small enough to understand in one sitting (~100 lines)
  • Demonstrates the complete flow from prompt to saved image
  • Can be extended for more complex use cases
  • Follows GAIA SDK patterns and best practices

Deliverables

1. Minimal Agent Implementation

File: examples/sd_agent_minimal.py (~100 lines)

"""
Minimal SD Agent Example

Demonstrates how to integrate Stable Diffusion image generation
into a GAIA agent using the Lemonade Server endpoint.
"""

from pathlib import Path
import base64
import requests
from gaia.agents.base import Agent
from gaia.agents.base.tools import tool


class SimpleSDAgent(Agent):
    """A minimal agent that can generate images from text descriptions."""

    def __init__(
        self,
        base_url: str = "http://localhost:8000",
        output_dir: str = "./generated_images",
        **kwargs
    ):
        super().__init__(**kwargs)
        self.sd_endpoint = f"{base_url}/api/v1/images/generations"
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self._register_tools()

    def _get_system_prompt(self) -> str:
        return """You are an image generation assistant.

When the user asks you to create or generate an image, use the generate_image tool.
Enhance simple prompts with artistic details for better results.

Example enhancements:
- "a cat" → "a fluffy orange cat, soft lighting, detailed fur, photorealistic"
- "sunset" → "vibrant sunset over ocean, golden hour, dramatic clouds, 4k"
"""

    def _register_tools(self):
        @tool
        def generate_image(
            prompt: str,
            model: str = "SD-Turbo",
            size: str = "512x512",
            steps: int = 4
        ) -> dict:
            """
            Generate an image from a text prompt using Stable Diffusion.

            Args:
                prompt: Text description of the image to generate
                model: SD model to use (SD-Turbo or SDXL-Turbo)
                size: Image dimensions (512x512, 768x768, 1024x1024)
                steps: Number of inference steps (4 for Turbo models)

            Returns:
                Dict with image_path and generation details
            """
            # Call SD endpoint
            response = requests.post(
                self.sd_endpoint,
                json={
                    "prompt": prompt,
                    "model": model,
                    "size": size,
                    "n": 1,
                    "response_format": "b64_json"
                },
                timeout=60
            )
            response.raise_for_status()

            # Decode and save image
            data = response.json()
            image_b64 = data["data"][0]["b64_json"]
            image_bytes = base64.b64decode(image_b64)

            # Generate filename from prompt
            safe_name = "".join(c if c.isalnum() else "_" for c in prompt[:30])
            image_path = self.output_dir / f"{safe_name}.png"
            image_path.write_bytes(image_bytes)

            return {
                "image_path": str(image_path),
                "prompt": prompt,
                "model": model,
                "size": size
            }

        self.register_tool(generate_image)


# CLI usage
if __name__ == "__main__":
    agent = SimpleSDAgent()
    agent.run("Generate an image of a dragon perched on a mountain cliff at sunset")
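The base64 handling in generate_image can be exercised without a running Lemonade Server. The sketch below fakes the b64_json payload the endpoint would return and runs the same decode-and-save steps; the PNG header bytes are hand-rolled stand-ins, not a real image:

```python
import base64
from pathlib import Path
from tempfile import TemporaryDirectory

# Stand-in for the endpoint's response body; a real response carries
# actual PNG bytes in data[0]["b64_json"].
fake_png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
fake_response = {
    "data": [{"b64_json": base64.b64encode(fake_png_bytes).decode("ascii")}]
}

# Same decode-and-save steps as generate_image
image_b64 = fake_response["data"][0]["b64_json"]
image_bytes = base64.b64decode(image_b64)

with TemporaryDirectory() as tmp:
    image_path = Path(tmp) / "test.png"
    image_path.write_bytes(image_bytes)
    assert image_path.read_bytes() == fake_png_bytes
```

This is also a convenient pattern for unit-testing the agent's response handling with a mocked requests.post.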

2. User Guide

File: docs/guides/sd-tutorial.mdx

Structure:

  1. Introduction - What you'll build and learn
  2. Prerequisites - Lemonade Server running with SD model
  3. Quick Start - Run the example in 2 minutes
  4. Code Walkthrough - Line-by-line explanation
  5. Customization - How to modify for your needs
  6. Next Steps - Link to full SD Agent plan

Key sections:

Prerequisites

# Ensure Lemonade Server is running with SD model
lemonade-server serve --model SD-Turbo

# Verify SD endpoint is available
curl http://localhost:8000/api/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "test", "model": "SD-Turbo"}'

Quick Start

# Run the minimal example
python examples/sd_agent_minimal.py

# Or use interactively
python -c "
from examples.sd_agent_minimal import SimpleSDAgent
agent = SimpleSDAgent()
agent.run('Create an image of a cyberpunk city')
"

Code Walkthrough

| Component                   | Purpose                              |
| --------------------------- | ------------------------------------ |
| sd_endpoint                 | URL to Lemonade Server's SD API      |
| @tool decorator             | Registers the function as an agent tool |
| response_format: b64_json   | Requests a base64-encoded image      |
| base64.b64decode()          | Converts the response to image bytes |
| register_tool()             | Makes the tool available to the agent |
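One detail the walkthrough should call out is how the example derives a filename: it keeps the first 30 characters of the prompt, then replaces anything non-alphanumeric with an underscore. The expression can be checked in isolation:

```python
def safe_filename(prompt: str) -> str:
    """Mirror of the filename logic in generate_image: truncate the prompt
    to 30 characters, then replace non-alphanumeric characters with
    underscores."""
    return "".join(c if c.isalnum() else "_" for c in prompt[:30])

print(safe_filename("a dragon perched on a mountain cliff"))
# → a_dragon_perched_on_a_mountain
```

Note that distinct prompts sharing the same first 30 characters map to the same filename, so repeated generations overwrite earlier images; appending a timestamp is a natural extension.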

3. Playbook

File: docs/playbooks/sd-agent/part-1-minimal.mdx

Step-by-step tutorial:

Part 1: Your First Image Agent (30 min)

  1. Set Up Environment

    • Install GAIA SDK
    • Start Lemonade Server with SD model
    • Verify endpoint connectivity
  2. Create the Agent Class

    • Inherit from Agent base class
    • Configure SD endpoint URL
    • Set up output directory
  3. Register the Image Generation Tool

    • Use @tool decorator
    • Define parameters with types and defaults
    • Document with docstring (used by LLM)
  4. Handle the SD API Response

    • Parse JSON response
    • Decode base64 image data
    • Save to file system
  5. Write the System Prompt

    • Instruct agent when to use the tool
    • Add prompt enhancement guidance
  6. Test Your Agent

    • Run with simple prompts
    • Verify images are generated
    • Check output directory

Exercises

  1. Add JPEG support - Modify to save as JPEG with quality parameter
  2. Add prompt enhancement - Create a second tool that enhances prompts before generation
  3. Add image display - Use term-image to show results in terminal
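Exercise 2 can start from something as simple as keyword appending. The sketch below is one possible shape for it (the style table and function name are illustrative, not part of the GAIA SDK); in the real agent it would be wrapped with @tool and registered alongside generate_image:

```python
# Hypothetical helper for exercise 2: append style keywords by subject type.
STYLE_HINTS = {
    "portrait": "soft lighting, shallow depth of field, detailed",
    "landscape": "golden hour, dramatic clouds, 4k",
}
DEFAULT_HINTS = "highly detailed, photorealistic"

def enhance_prompt(prompt: str, style: str = "default") -> str:
    """Return the prompt with comma-separated style keywords appended."""
    hints = STYLE_HINTS.get(style, DEFAULT_HINTS)
    return f"{prompt}, {hints}"

print(enhance_prompt("a cat", style="portrait"))
# → a cat, soft lighting, shallow depth of field, detailed
```

A more ambitious version could ask the LLM itself to rewrite the prompt, which is the pattern hinted at in the system prompt's enhancement examples.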

4. Integration Test

File: tests/examples/test_sd_agent_minimal.py

"""Test the minimal SD agent example."""

import pytest
from pathlib import Path


@pytest.fixture
def sd_agent():
    from examples.sd_agent_minimal import SimpleSDAgent
    return SimpleSDAgent(output_dir="./test_output")


@pytest.mark.integration
@pytest.mark.requires_lemonade
def test_generate_image(sd_agent, tmp_path):
    """Test basic image generation."""
    sd_agent.output_dir = tmp_path

    result = sd_agent.tools["generate_image"](
        prompt="a simple red circle",
        model="SD-Turbo",
        size="512x512"
    )

    assert Path(result["image_path"]).exists()
    assert result["model"] == "SD-Turbo"


def test_agent_has_tool(sd_agent):
    """Test that generate_image tool is registered."""
    assert "generate_image" in sd_agent.tools

Acceptance Criteria

  • examples/sd_agent_minimal.py runs standalone with python examples/sd_agent_minimal.py
  • Example is under 100 lines of code (excluding comments)
  • Works with default Lemonade Server configuration
  • User guide explains every line of code
  • Playbook can be completed in 30 minutes by a Python developer
  • Integration test passes when Lemonade Server is running
  • Code follows GAIA SDK patterns (inherits Agent, uses @tool decorator)

Labels

documentation, example, tutorial, image-generation, good-first-issue

Estimate

  • Agent implementation: 2 hours
  • User guide: 3 hours
  • Playbook: 4 hours
  • Tests: 1 hour

Total: ~10 hours
