# Introduction
Source: https://docs.zeroeval.com/autotune/introduction

Version, track, and optimize every prompt your agent uses

Prompts are the instructions that drive your agent's behavior. Small changes in wording can dramatically affect output quality, but without tracking, you have no way to know which version works best -- or even which version is running in production.

ZeroEval Prompts gives you version control for prompts with a single function call. Every change is tracked, every completion is linked to the exact prompt version that produced it, and you can deploy optimized versions without touching code.

## Why track prompts

* **Version history** -- every prompt change creates a new version you can compare and roll back to
* **Production visibility** -- see exactly which prompt version is running, how often it's called, and what it produces
* **Feedback loop** -- attach thumbs-up/down feedback to completions, then use it to [optimize prompts](/autotune/prompts/optimization) and [evaluate models](/judges/introduction)
* **One-click deployments** -- push a winning prompt or model to production without redeploying your app

## How it works

<Steps>
  <Step title="Replace hardcoded prompts">
    Swap string literals for `ze.prompt()` calls. Your existing prompt text
    becomes the fallback content.
  </Step>

  <Step title="Versions are created automatically">
    Each unique prompt string creates a tracked version. Changes in your code
    produce new versions without any extra work.
  </Step>

  <Step title="Completions are linked to versions">
    When your LLM integration fires, ZeroEval links each completion to the exact
    prompt version and model that produced it.
  </Step>

  <Step title="Optimize from production data">
    Review completions, submit feedback, and generate improved prompt variants
    \-- all from real traffic.
  </Step>
</Steps>

## Get started

<CardGroup>
  <Card title="Python" icon="python" href="/autotune/sdks/python">
    `ze.prompt()` and `ze.get_prompt()` for Python applications
  </Card>

  <Card title="TypeScript" icon="js" href="/autotune/sdks/typescript">
    `ze.prompt()` for TypeScript and JavaScript applications
  </Card>
</CardGroup>


# Optimization
Source: https://docs.zeroeval.com/autotune/prompts/optimization

Use feedback on production traces to generate and validate better prompts

<video alt="Prompt optimizations" />

ZeroEval uses feedback on your production completions to propose better prompt versions, then validates them before you roll out. The result is a concrete prompt edit you can review, test across models, and deploy -- all without manual prompt engineering.

## How optimization works

Every optimization follows the same lifecycle:

<Steps>
  <Step title="Collect feedback">
    Attach thumbs-up/down ratings, reasons, and expected outputs to real completions.
    This is the raw signal optimization learns from.
  </Step>

  <Step title="Start an optimization run">
    Trigger an optimization from the dashboard. ZeroEval selects a strategy based on speed
    and depth, then generates a candidate prompt from your feedback.
  </Step>

  <Step title="Compare against the baseline">
    The candidate is scored against your current prompt using the same feedback signal,
    so you can see whether it actually improves behavior.
  </Step>

  <Step title="Validate with simulations">
    Run the candidate against test cases and multiple models to confirm improvements
    generalize beyond the examples used during optimization.
  </Step>

  <Step title="Deploy">
    Publish the winning prompt version. Your app picks it up automatically through
    `ze.prompt()` with no code changes required.
  </Step>
</Steps>

## Before you optimize

Optimization quality depends directly on the quality and quantity of feedback attached to your completions. Before starting a run, make sure:

* **Your prompt is tracked** with `ze.prompt()` so completions are linked to specific prompt versions.
* **Completions are flowing** through ZeroEval with enough volume to represent real usage patterns.
* **Feedback is attached** to those completions -- both positive and negative examples help.

The most useful feedback includes **reasons** (explaining why an output was good or bad) and **expected outputs** (showing what the response should have been). Vague thumbs-down signals without context produce weaker optimizations.

<Note>
  For details on how to submit feedback through the dashboard, SDK, or API, see [Human Feedback](/feedback/human-feedback).
</Note>

## Start an optimization run

Navigate to your prompt's **Suggestions** tab in the ZeroEval dashboard and click **Optimize Prompt**. ZeroEval will:

1. Gather feedback examples linked to your prompt.
2. Select an optimization strategy based on the complexity of your prompt and the available signal.
3. Generate one or more candidate prompts.

<video alt="Prompt optimizations with feedback" />

## Review the candidate prompt

Optimization produces a **candidate** -- a proposed new version of your prompt. It does not overwrite your current prompt automatically.

You can review the candidate side-by-side with your **baseline** (the current active version) to understand exactly what changed and why. The candidate is derived from patterns in your feedback: corrections steer the wording, positive examples reinforce what already works.

## Compare against your baseline

ZeroEval measures whether the candidate actually outperforms the baseline using the feedback-derived signal. The comparison shows:

* How the candidate scores relative to the current prompt on the same set of examples.
* Whether improvements on some examples come at the cost of regressions on others.
* An overall recommendation based on the comparison results.

This step ensures you are not adopting a prompt that simply looks different -- it needs to measurably perform better.

## Validate with simulations

After optimization, you can run the candidate against test cases using multiple models to confirm the improvement holds up beyond the training examples.

<video alt="Model leaderboard" />

Simulations help answer:

* Does the candidate work well across different models, not just the one it was optimized for?
* Does it handle edge cases that were not part of the original feedback set?
* Are there any regressions on specific scenarios?

ZeroEval automatically queues an initial simulation after a successful optimization run, testing the candidate across popular models.

## Optimization strategies

ZeroEval offers three optimization strategies, each suited to a different speed and depth tradeoff:

| Strategy         | Speed         | Best for                                                                                                                                                     |
| ---------------- | ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Quick Refine** | Seconds       | Fast iteration on straightforward prompt fixes. Rewrites the prompt in a single pass using your feedback examples.                                           |
| **Bootstrap**    | Minutes       | Prompts where good examples clearly demonstrate the desired behavior. Selects high-quality demonstrations and chains them with the prompt.                   |
| **GEPA**         | 10-60 minutes | Complex prompts or multi-intent optimization. Runs a deeper search that evolves candidates across multiple generations, guided by reflection on performance. |

You do not need to choose a strategy manually -- ZeroEval selects the appropriate one based on your prompt and feedback. However, you can override this if you want faster iteration (Quick Refine) or a more thorough search (GEPA).

## Multi-intent optimization and guardrails

Some prompts serve multiple goals. For example, a customer support prompt might need to be both accurate and concise -- improving one should not come at the expense of the other.

### Intent weighting

When your prompt has multiple linked judges (intents), you can assign weights to each intent to tell the optimizer which goals matter most. Intents with higher weight have more influence on which candidate is selected.

### Guardrails

Guardrails are quality floors that prevent adopting a candidate that regresses on important intents. You can set minimum thresholds per intent, and any candidate that falls below those thresholds is automatically rejected -- even if it improves overall performance.

This ensures that optimization never ships a prompt that fixes one problem by creating another.

## Use optimized prompts in production

When you use `ze.prompt()` with a `content` fallback, ZeroEval automatically resolves to the latest published version from your dashboard. Once you publish an optimized candidate, your app starts using it immediately with no code changes:

<CodeGroup>
  ```python Python theme={null}
  import zeroeval as ze

  ze.init()

  system_prompt = ze.prompt(
      name="support-bot",
      content="You are a helpful customer support agent."
  )
  ```

  ```typescript TypeScript theme={null}
  import * as ze from 'zeroeval';

  ze.init();

  const systemPrompt = await ze.prompt({
    name: "support-bot",
    content: "You are a helpful customer support agent."
  });
  ```
</CodeGroup>

Your `content` string serves as the fallback for initial setup or if the ZeroEval service is unreachable. Once an optimized version is published, `ze.prompt()` returns that version instead.

To bypass optimization and force the hardcoded content (useful for debugging or A/B testing), use explicit mode:

<CodeGroup>
  ```python Python theme={null}
  prompt = ze.prompt(
      name="support-bot",
      from_="explicit",
      content="You are a helpful customer support agent."
  )
  ```

  ```typescript TypeScript theme={null}
  const prompt = await ze.prompt({
    name: "support-bot",
    from: "explicit",
    content: "You are a helpful customer support agent."
  });
  ```
</CodeGroup>

## Best practices

* **Wait for enough signal.** Optimizing with only a handful of feedback examples produces unreliable results. Aim for a representative sample of both positive and negative completions before running.
* **Include corrections, not just thumbs.** Reasons and expected outputs give the optimizer concrete material to work with. A thumbs-down alone tells the system something is wrong but not what the right answer looks like.
* **Validate before rollout.** Use simulations to confirm the candidate works across models and edge cases before publishing it to production.
* **Iterate.** Optimization is not a one-time step. As your product evolves and usage patterns shift, new feedback will surface new improvement opportunities. Run optimization periodically as fresh signal accumulates.
* **Use guardrails for multi-intent prompts.** If your prompt serves multiple goals, set guardrails to prevent regressions on critical intents.


# API Reference
Source: https://docs.zeroeval.com/autotune/reference

REST API for managing prompts, versions, and deployments

Base URL: `https://api.zeroeval.com`

All requests require a Bearer token:

```
Authorization: Bearer YOUR_ZEROEVAL_API_KEY
```

***

## Get Prompt

```
GET /v1/prompts/{prompt_slug}
```

Fetch the current version of a prompt by its slug.

| Query Parameter | Type     | Default    | Description                                     |
| --------------- | -------- | ---------- | ----------------------------------------------- |
| `version`       | `int`    | —          | Fetch a specific version number                 |
| `tag`           | `string` | `"latest"` | Tag to fetch (`"production"`, `"latest"`, etc.) |

```bash theme={null}
curl https://api.zeroeval.com/v1/prompts/support-bot \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

**Response:** 200

```json theme={null}
{
  "id": "a1b2c3d4-...",
  "prompt_id": "b2c3d4e5-...",
  "content": "You are a helpful customer support agent.",
  "content_hash": "e3b0c44298fc...",
  "version": 3,
  "model_id": "gpt-4o",
  "tag": "production",
  "is_latest": true,
  "metadata": {},
  "created_at": "2025-01-15T10:30:00Z"
}
```

### Fetch by tag

```bash theme={null}
curl "https://api.zeroeval.com/v1/prompts/support-bot?tag=production" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

### Fetch by version number

```bash theme={null}
curl "https://api.zeroeval.com/v1/prompts/support-bot?version=2" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

***

## Ensure Prompt Version

```
POST /v1/tasks/{task_name}/prompt/versions/ensure
```

Create a prompt version if it doesn't already exist (idempotent by content hash). This is what `ze.prompt()` calls under the hood.

**Request body:**

| Field          | Type     | Required | Description                                    |
| -------------- | -------- | -------- | ---------------------------------------------- |
| `content`      | `string` | Yes      | Prompt content                                 |
| `content_hash` | `string` | No       | SHA-256 hash (computed server-side if omitted) |
| `model_id`     | `string` | No       | Model to bind to this version                  |
| `metadata`     | `object` | No       | Additional metadata                            |

```bash theme={null}
curl -X POST https://api.zeroeval.com/v1/tasks/support-bot/prompt/versions/ensure \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "You are a helpful customer support agent for {{company}}."
  }'
```

**Response:** 200

```json theme={null}
{
  "id": "c3d4e5f6-...",
  "content": "You are a helpful customer support agent for {{company}}.",
  "content_hash": "a1b2c3d4...",
  "version": 1,
  "model_id": null,
  "created_at": "2025-01-15T10:30:00Z"
}
```

***

## Get Version by Hash

```
GET /v1/tasks/{task_name}/prompt/versions/by-hash/{content_hash}
```

Fetch a specific prompt version by its SHA-256 content hash.

**Response:** 200 (same schema as ensure)

***

## Get Latest Version

```
GET /v1/tasks/{task_name}/prompt/latest
```

Fetch the latest prompt version for a task.

**Response:** 200 (same schema as ensure)

***

## Resolve Model for Version

```
GET /v1/prompt-versions/{version_id}/model
```

Get the model bound to a specific prompt version. Used by SDK integrations to auto-patch the `model` parameter.

**Response:** 200

```json theme={null}
{
  "model_id": "gpt-4o",
  "provider": "openai"
}
```

Returns `null` for `model_id` if no model is bound.

***

## Deploy a Version (Pin Tag)

```
POST /projects/{project_id}/prompts/{prompt_slug}/tags/{tag}:pin
```

Pin a tag (e.g. `production`) to a specific version number. This is how you deploy a prompt version to production.

**Request body:**

| Field     | Type  | Required | Description           |
| --------- | ----- | -------- | --------------------- |
| `version` | `int` | Yes      | Version number to pin |

```bash theme={null}
curl -X POST https://api.zeroeval.com/projects/$PROJECT_ID/prompts/support-bot/tags/production:pin \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"version": 3}'
```

***

## List Versions

```
GET /projects/{project_id}/prompts/{prompt_slug}/versions
```

List all versions of a prompt.

**Response:** 200

```json theme={null}
[
  {
    "id": "c3d4e5f6-...",
    "content": "You are a helpful assistant.",
    "content_hash": "a1b2c3d4...",
    "version": 1,
    "model_id": null,
    "created_at": "2025-01-10T10:00:00Z"
  },
  {
    "id": "d4e5f6a7-...",
    "content": "You are a helpful customer support agent.",
    "content_hash": "b2c3d4e5...",
    "version": 2,
    "model_id": "gpt-4o",
    "created_at": "2025-01-15T10:30:00Z"
  }
]
```

***

## List Tags

```
GET /projects/{project_id}/prompts/{prompt_slug}/tags
```

List all tags and which version each is pinned to.

**Response:** 200

```json theme={null}
[
  { "tag": "latest", "version": 2 },
  { "tag": "production", "version": 1 }
]
```

***

## Update Version Model

```
PATCH /projects/{project_id}/prompts/{prompt_slug}/versions/{version}
```

Update the model bound to a version.

**Request body:**

| Field      | Type     | Description              |
| ---------- | -------- | ------------------------ |
| `model_id` | `string` | Model identifier to bind |

***

## Submit Completion Feedback

```
POST /v1/prompts/{prompt_slug}/completions/{completion_id}/feedback
```

See [Feedback API Reference](/feedback/api-reference#completion-feedback) for the full specification.


# Python
Source: https://docs.zeroeval.com/autotune/sdks/python

Track and version prompts in Python with ze.prompt()

## Installation

```bash theme={null}
pip install zeroeval
```

## Basic Setup

Replace hardcoded prompt strings with `ze.prompt()`. Your existing text becomes the fallback content that's used until an optimized version is available.

```python theme={null}
import zeroeval as ze
from openai import OpenAI

ze.init()
client = OpenAI()

system_prompt = ze.prompt(
    name="support-bot",
    content="You are a helpful customer support agent for {{company}}.",
    variables={"company": "TechCorp"}
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How do I reset my password?"}
    ]
)
```

That's it. Every call to `ze.prompt()` is tracked, versioned, and linked to the completions it produces. You'll see production traces at [ZeroEval → Prompts](https://app.zeroeval.com).

<Note>
  When you provide `content`, ZeroEval automatically uses the latest optimized
  version from your dashboard if one exists. The `content` parameter serves as a
  fallback for when no optimized versions are available yet.
</Note>

## Version Control

### Auto-optimization (default)

```python theme={null}
prompt = ze.prompt(
    name="customer-support",
    content="You are a helpful assistant."
)
```

Uses the latest optimized version if one exists, otherwise falls back to the provided content.

### Explicit mode

```python theme={null}
prompt = ze.prompt(
    name="customer-support",
    from_="explicit",
    content="You are a helpful assistant."
)
```

Always uses the provided content. Useful for debugging or A/B testing a specific version.

### Latest mode

```python theme={null}
prompt = ze.prompt(
    name="customer-support",
    from_="latest"
)
```

Requires an optimized version to exist. Fails with `PromptRequestError` if none is found.

### Pin to a specific version

```python theme={null}
prompt = ze.prompt(
    name="customer-support",
    from_="a1b2c3d4..."  # 64-char SHA-256 hash
)
```

## Prompt Library

For more control, use `ze.get_prompt()` to fetch prompts from the Prompt Library with tag-based deployments and caching.

```python theme={null}
prompt = ze.get_prompt(
    "support-triage",
    tag="production",
    fallback="You are a helpful assistant.",
    variables={"product": "Acme"},
)

print(prompt.content)
print(prompt.version)
print(prompt.model)
```

### Parameters

| Parameter   | Type    | Default    | Description                                                |
| ----------- | ------- | ---------- | ---------------------------------------------------------- |
| `slug`      | `str`   | —          | Prompt slug (e.g. `"support-triage"`)                      |
| `version`   | `int`   | `None`     | Fetch a specific version number                            |
| `tag`       | `str`   | `"latest"` | Tag to fetch (`"production"`, `"latest"`, etc.)            |
| `fallback`  | `str`   | `None`     | Content to use if the prompt is not found                  |
| `variables` | `dict`  | `None`     | Template variables for `{{var}}` tokens                    |
| `task_name` | `str`   | `None`     | Override the task name for tracing                         |
| `render`    | `bool`  | `True`     | Whether to render template variables                       |
| `missing`   | `str`   | `"error"`  | What to do with missing variables: `"error"` or `"ignore"` |
| `use_cache` | `bool`  | `True`     | Use in-memory cache for repeated fetches                   |
| `timeout`   | `float` | `None`     | Request timeout in seconds                                 |

### Return value

Returns a `Prompt` object with:

| Field          | Type   | Description                          |
| -------------- | ------ | ------------------------------------ |
| `content`      | `str`  | The rendered prompt content          |
| `version`      | `int`  | Version number                       |
| `version_id`   | `str`  | Version UUID                         |
| `tag`          | `str`  | Tag this version was fetched from    |
| `is_latest`    | `bool` | Whether this is the latest version   |
| `model`        | `str`  | Model bound to this version (if any) |
| `metadata`     | `dict` | Additional metadata                  |
| `source`       | `str`  | `"api"` or `"fallback"`              |
| `content_hash` | `str`  | SHA-256 hash of the content          |

## Model Deployments

When you deploy a model to a prompt version in the dashboard, the SDK automatically patches the `model` parameter in your LLM calls:

```python theme={null}
system_prompt = ze.prompt(
    name="support-bot",
    content="You are a helpful customer support agent."
)

response = client.chat.completions.create(
    model="gpt-4",  # Gets replaced with the deployed model
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Hello"}
    ]
)
```

## Multi-Artifact Runs

When a single prompt-linked run produces multiple judged outputs (e.g. a final decision and a visual card), use `ze.artifact_span` to mark each output as a named artifact. The primary artifact becomes the default completion preview; secondary artifacts are accessible in the detail view.

```python theme={null}
with ze.artifact_span(
    name="final-decision",
    artifact_type="final_decision",
    role="primary",
    label="Final Decision",
) as s:
    s.set_io(input_data=ticket_text, output_data=decision_json)
```

See the full API and examples in [Python Tracing Reference](/tracing/sdks/python/reference#artifact_span).

## Manual Prompt-Linked Spans

When you want `ze.prompt()` to manage a specific LLM interaction but do not want global auto-instrumentation (e.g. your agent makes many LLM calls and only one should be tracked as a prompt generation), disable integrations and create the span yourself.

### When to use this

* Your codebase makes many LLM calls but only a subset should appear as prompt completions.
* You use a provider that has no auto-integration (a custom HTTP endpoint, an internal model service, etc.).
* You want full control over which codepath produces prompt-linked traces.

### Setup

Disable all integrations, then initialize:

```python theme={null}
import zeroeval as ze

ze.init(disabled_integrations=["openai", "gemini", "langchain", "langgraph"])
```

### Create a prompt-linked span

Call `ze.prompt()` inside an active span so the SDK writes `task` / `zeroeval` metadata onto the trace. Then open a child span with `kind="llm"` (or use `ze.artifact_span()`) around your provider call. The SDK automatically propagates prompt linkage to child spans in the same trace.

```python theme={null}
import zeroeval as ze

def handle_request(user_message: str):
    with ze.span(name="support-pipeline"):
        system_prompt = ze.prompt(
            name="customer-support",
            content="You are a helpful support agent for {{company}}.",
            variables={"company": "Acme"},
            from_="explicit",
        )

        # The <zeroeval> tag is embedded in the string returned by ze.prompt().
        # Strip it before sending to the provider if needed.
        clean_content = system_prompt.split("</zeroeval>", 1)[-1]

        with ze.span(name="llm.chat", kind="llm", attributes={
            "provider": "my-custom-provider",
            "model": "gpt-4o",
        }) as llm_span:
            response = my_provider.chat(
                messages=[
                    {"role": "system", "content": clean_content},
                    {"role": "user", "content": user_message},
                ]
            )
            llm_span.set_io(
                input_data=user_message,
                output_data=response.text,
            )
```

The SDK automatically stamps prompt metadata (`task`, `zeroeval.prompt_version_id`, etc.) onto every span in the trace, so the inner `llm` span is linked to the prompt version without any extra wiring. Judge evaluations, feedback, and the prompt completions page all work as if an auto-integration created the span.

You can also use `ze.artifact_span()` instead of `ze.span(kind="llm")` when you want the output to appear as a named completion artifact:

```python theme={null}
with ze.artifact_span(
    name="support-reply",
    artifact_type="reply",
    role="primary",
    label="Support Reply",
) as s:
    response = my_provider.chat(messages=[...])
    s.set_io(input_data=user_message, output_data=response.text)
```

## Sending Feedback

Attach feedback to completions to power prompt optimization:

```python theme={null}
ze.send_feedback(
    prompt_slug="support-bot",
    completion_id=response.id,
    thumbs_up=True,
    reason="Clear and concise response"
)
```

| Parameter         | Type   | Required | Description                                 |
| ----------------- | ------ | -------- | ------------------------------------------- |
| `prompt_slug`     | `str`  | Yes      | Prompt name (same as used in `ze.prompt()`) |
| `completion_id`   | `str`  | Yes      | UUID of the completion                      |
| `thumbs_up`       | `bool` | Yes      | Positive or negative feedback               |
| `reason`          | `str`  | No       | Explanation of the feedback                 |
| `expected_output` | `str`  | No       | What the output should have been            |
| `metadata`        | `dict` | No       | Additional metadata                         |


# TypeScript
Source: https://docs.zeroeval.com/autotune/sdks/typescript

Track and version prompts in TypeScript with ze.prompt()

## Installation

```bash theme={null}
npm install zeroeval
```

## Basic Setup

Replace hardcoded prompt strings with `ze.prompt()`. Your existing text becomes the fallback content that's used until an optimized version is available.

```typescript theme={null}
import * as ze from "zeroeval";
import { OpenAI } from "openai";

ze.init();
const client = ze.wrap(new OpenAI());

const systemPrompt = await ze.prompt({
  name: "support-bot",
  content: "You are a helpful customer support agent for {{company}}.",
  variables: { company: "TechCorp" },
});

const response = await client.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "How do I reset my password?" },
  ],
});
```

Every call to `ze.prompt()` is tracked, versioned, and linked to the completions it produces. You'll see production traces at [ZeroEval → Prompts](https://app.zeroeval.com).

<Note>
  When you provide `content`, ZeroEval automatically uses the latest optimized
  version from your dashboard if one exists. The `content` parameter serves as a
  fallback for when no optimized versions are available yet.
</Note>

## Version Control

### Auto-optimization (default)

```typescript theme={null}
const prompt = await ze.prompt({
  name: "customer-support",
  content: "You are a helpful assistant.",
});
```

Uses the latest optimized version if one exists, otherwise falls back to the provided content.

### Explicit mode

```typescript theme={null}
const prompt = await ze.prompt({
  name: "customer-support",
  from: "explicit",
  content: "You are a helpful assistant.",
});
```

Always uses the provided content. Useful for debugging or A/B testing a specific version.

### Latest mode

```typescript theme={null}
const prompt = await ze.prompt({
  name: "customer-support",
  from: "latest",
});
```

Requires an optimized version to exist. Fails with `PromptRequestError` if none is found.

### Pin to a specific version

```typescript theme={null}
const prompt = await ze.prompt({
  name: "customer-support",
  from: "a1b2c3d4...", // 64-char SHA-256 hash
});
```

## Parameters

| Parameter   | Type                     | Required | Default     | Description                                         |
| ----------- | ------------------------ | -------- | ----------- | --------------------------------------------------- |
| `name`      | `string`                 | Yes      | —           | Task name for this prompt                           |
| `content`   | `string`                 | No       | `undefined` | Prompt content (fallback or explicit)               |
| `from`      | `string`                 | No       | `undefined` | `"latest"`, `"explicit"`, or a 64-char SHA-256 hash |
| `variables` | `Record<string, string>` | No       | `undefined` | Template variables for `{{var}}` tokens             |

### Return value

Returns `Promise<string>` -- a decorated prompt string with metadata that integrations use to link completions to prompt versions and auto-patch models.

### Errors

| Error                 | When                                                                       |
| --------------------- | -------------------------------------------------------------------------- |
| `Error`               | Both `content` and `from` provided (except `from: "explicit"`), or neither |
| `PromptRequestError`  | `from: "latest"` but no versions exist                                     |
| `PromptNotFoundError` | `from` is a hash that doesn't exist                                        |

## Model Deployments

When you deploy a model to a prompt version in the dashboard, the SDK automatically patches the `model` parameter in your LLM calls:

```typescript theme={null}
const systemPrompt = await ze.prompt({
  name: "support-bot",
  content: "You are a helpful customer support agent.",
});

const response = await client.chat.completions.create({
  model: "gpt-4", // Gets replaced with the deployed model
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "Hello" },
  ],
});
```

## Manual Prompt-Linked Spans

When you want `ze.prompt()` to manage a specific LLM interaction but do not want global auto-instrumentation (e.g. your agent makes many LLM calls and only one should be tracked as a prompt generation), disable integrations and create the span yourself.

### When to use this

* Your codebase makes many LLM calls but only a subset should appear as prompt completions.
* You use a provider that has no auto-integration (Vapi, a custom HTTP endpoint, etc.).
* You want full control over which codepath produces prompt-linked traces.

### Setup

Disable integrations you do not want, then initialize:

```typescript theme={null}
import * as ze from 'zeroeval';

ze.init({
  integrations: { openai: false, vercelAI: false },
});
```

### Create a prompt-linked span

1. Call `ze.prompt()` to register the prompt version and get a decorated string.
2. Use `extractZeroEvalMetadata()` to split the decorated string into clean content and linkage metadata.
3. Send the clean content to your provider.
4. Wrap the call in `ze.withSpan()` with `kind: 'llm'` and the metadata in `attributes.zeroeval`.

```typescript theme={null}
import * as ze from 'zeroeval';
import { extractZeroEvalMetadata } from 'zeroeval';

const decorated = await ze.prompt({
  name: 'customer-support',
  content: 'You are a helpful support agent for {{company}}.',
  variables: { company: 'Acme' },
  from: 'explicit',
});

const { metadata, cleanContent } = extractZeroEvalMetadata(decorated);

await ze.withSpan(
  {
    name: 'llm.chat.completions.create',
    attributes: {
      kind: 'llm',
      task: metadata!.task,
      zeroeval: metadata,
      provider: 'my-custom-provider',
      model: 'gpt-4o',
    },
    inputData: [
      { role: 'system', content: cleanContent },
      { role: 'user', content: 'I need help with my order' },
    ],
    outputData: { role: 'assistant', content: response.text },
  },
  async () => {
    const response = await myProvider.chat([
      { role: 'system', content: cleanContent },
      { role: 'user', content: 'I need help with my order' },
    ]);
    return response;
  }
);
```

The span will be ingested as an `llm` span linked to the prompt version. Judge evaluations, feedback, and the prompt completions page all work as if an auto-integration created the span.

<Note>
  `extractZeroEvalMetadata` is exported from the top-level `zeroeval` package.
</Note>

## Sending Feedback

Attach feedback to completions to power prompt optimization:

```typescript theme={null}
await ze.sendFeedback({
  promptSlug: "support-bot",
  completionId: response.id,
  thumbsUp: true,
  reason: "Clear and concise response",
});
```

| Parameter        | Type                      | Required | Description                                 |
| ---------------- | ------------------------- | -------- | ------------------------------------------- |
| `promptSlug`     | `string`                  | Yes      | Prompt name (same as used in `ze.prompt()`) |
| `completionId`   | `string`                  | Yes      | UUID of the completion                      |
| `thumbsUp`       | `boolean`                 | Yes      | Positive or negative feedback               |
| `reason`         | `string`                  | No       | Explanation of the feedback                 |
| `expectedOutput` | `string`                  | No       | What the output should have been            |
| `metadata`       | `Record<string, unknown>` | No       | Additional metadata                         |


# API Reference
Source: https://docs.zeroeval.com/feedback/api-reference

REST API for submitting and retrieving feedback

Base URL: `https://api.zeroeval.com`

All requests require a Bearer token:

```
Authorization: Bearer YOUR_ZEROEVAL_API_KEY
```

***

## Completion Feedback

```
POST /v1/prompts/{prompt_slug}/completions/{completion_id}/feedback
```

Submit structured feedback for a specific LLM completion. This feedback powers prompt optimization.

**Request body:**

| Field               | Type     | Required | Description                        |
| ------------------- | -------- | -------- | ---------------------------------- |
| `thumbs_up`         | `bool`   | Yes      | Positive or negative feedback      |
| `reason`            | `string` | No       | Explanation of the feedback        |
| `expected_output`   | `string` | No       | What the output should have been   |
| `metadata`          | `object` | No       | Additional metadata                |
| `judge_id`          | `string` | No       | Judge automation ID                |
| `expected_score`    | `float`  | No       | Expected score (for scored judges) |
| `score_direction`   | `string` | No       | `"too_high"` or `"too_low"`        |
| `criteria_feedback` | `object` | No       | Per-criterion feedback             |

```bash theme={null}
curl -X POST https://api.zeroeval.com/v1/prompts/support-bot/completions/550e8400-.../feedback \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "thumbs_up": false,
    "reason": "Response was too vague",
    "expected_output": "Should provide specific steps"
  }'
```

**Response:** 200

```json theme={null}
{
  "id": "fb123e45-...",
  "completion_id": "550e8400-...",
  "prompt_id": "a1b2c3d4-...",
  "thumbs_up": false,
  "reason": "Response was too vague",
  "expected_output": "Should provide specific steps",
  "created_at": "2025-01-15T10:30:00Z"
}
```

<Note>
  If feedback already exists for the same completion from the same user, it will
  be updated with the new values.
</Note>

***

## Unified Entity Feedback

```
GET /projects/{project_id}/feedback/{entity_type}/{entity_id}
```

Retrieve all feedback -- human reviews and judge evaluations -- for a span, trace, or session in a single response.

| Path Parameter | Description                   |
| -------------- | ----------------------------- |
| `project_id`   | UUID of the project           |
| `entity_type`  | `span`, `trace`, or `session` |
| `entity_id`    | UUID of the entity            |

**Response:** 200

```json theme={null}
{
  "entity_type": "span",
  "entity_id": "550e8400-...",
  "summary": {
    "total": 3,
    "human_feedback_count": 1,
    "judge_evaluation_count": 2
  },
  "items": [
    {
      "kind": "human_feedback",
      "id": "fb123e45-...",
      "span_id": "550e8400-...",
      "thumbs_up": true,
      "reason": "Clear and helpful",
      "created_at": "2025-01-15T10:30:00Z",
      "created_by": {
        "id": "user-123",
        "email": "reviewer@example.com",
        "name": "Alice"
      },
      "source_type": "human"
    },
    {
      "kind": "judge_evaluation",
      "id": "je456f78-...",
      "span_id": "550e8400-...",
      "automation_id": "judge-abc-...",
      "judge_name": "Helpfulness",
      "evaluation_result": true,
      "evaluation_reason": "Response directly answers the question with clear steps.",
      "confidence_score": 0.92,
      "model_used": "gemini-3-flash-preview",
      "evaluation_duration_ms": 1200,
      "score": 8.5,
      "evaluation_type": "scored",
      "score_min": 0,
      "score_max": 10,
      "pass_threshold": 7.0,
      "criteria_scores": {
        "clarity": { "score": 9, "reason": "Well-structured response" },
        "accuracy": { "score": 8, "reason": "Correct information provided" }
      },
      "created_at": "2025-01-15T10:31:00Z"
    }
  ]
}
```

### Response fields

**`summary`** -- aggregate counts for fast display:

| Field                    | Type  | Description                      |
| ------------------------ | ----- | -------------------------------- |
| `total`                  | `int` | Total feedback items             |
| `human_feedback_count`   | `int` | Number of human review items     |
| `judge_evaluation_count` | `int` | Number of judge evaluation items |

**`items[]`** -- each item has a `kind` field (`human_feedback` or `judge_evaluation`) that determines which fields are present:

| Field (human\_feedback) | Type     | Description                     |
| ----------------------- | -------- | ------------------------------- |
| `thumbs_up`             | `bool`   | Positive or negative            |
| `reason`                | `string` | Reviewer's explanation          |
| `expected_output`       | `string` | Corrected output (if provided)  |
| `created_by`            | `object` | User who submitted the feedback |
| `source_type`           | `string` | `"human"` or `"judge"`          |

| Field (judge\_evaluation) | Type     | Description                           |
| ------------------------- | -------- | ------------------------------------- |
| `automation_id`           | `string` | Judge automation UUID                 |
| `judge_name`              | `string` | Display name of the judge             |
| `evaluation_result`       | `bool`   | Whether the output passed             |
| `evaluation_reason`       | `string` | Judge's reasoning                     |
| `confidence_score`        | `float`  | Judge confidence (0-1)                |
| `model_used`              | `string` | Model used for the evaluation         |
| `score`                   | `float`  | Score value (scored evaluations only) |
| `evaluation_type`         | `string` | `"binary"` or `"scored"`              |
| `score_min` / `score_max` | `float`  | Score range (scored evaluations only) |
| `pass_threshold`          | `float`  | Threshold for pass/fail               |
| `criteria_scores`         | `object` | Per-criterion scores and reasons      |

<Note>
  For traces and sessions, feedback is aggregated from all descendant spans.
</Note>


# Introduction
Source: https://docs.zeroeval.com/feedback/human-feedback

Collect feedback from reviewers in the dashboard or from end users via the API

Human feedback captures what automated metrics can't -- whether the agent's response actually helped the user. You can collect it from internal reviewers using the ZeroEval dashboard, or from end users in your own application via the API.

## Dashboard Review

Reviewers can browse completions directly in the ZeroEval console and submit feedback without writing any code.

1. Navigate to **Prompts → \[your task]** in the dashboard
2. Open the **Suggestions** tab to see incoming completions
3. Review each output and provide thumbs-up or thumbs-down
4. Optionally add a reason and the expected output

Dashboard feedback is linked to the exact prompt version and model that produced the completion, so you can track quality per version over time.

## In-App Feedback

Add like/dislike buttons, star ratings, or other feedback controls directly in your application. Use the SDK or REST API to send feedback tied to specific completions.

### Thumbs-up / Thumbs-down

<CodeGroup>
  ```python Python theme={null}
  import zeroeval as ze

  ze.send_feedback(
      prompt_slug="support-bot",
      completion_id=response.id,
      thumbs_up=user_clicked_thumbs_up,
      reason="User found the answer helpful"
  )
  ```

  ```typescript TypeScript theme={null}
  await ze.sendFeedback({
    promptSlug: "support-bot",
    completionId: response.id,
    thumbsUp: userClickedThumbsUp,
    reason: "User found the answer helpful",
  });
  ```

  ```bash cURL theme={null}
  curl -X POST "https://api.zeroeval.com/v1/prompts/support-bot/completions/$COMPLETION_ID/feedback" \
    -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "thumbs_up": true,
      "reason": "User found the answer helpful"
    }'
  ```
</CodeGroup>

### With expected output

When a user corrects the agent, capture what the response should have been:

<CodeGroup>
  ```python Python theme={null}
  ze.send_feedback(
      prompt_slug="support-bot",
      completion_id=response.id,
      thumbs_up=False,
      reason="Wrong link provided",
      expected_output="The password reset page is at https://app.example.com/reset"
  )
  ```

  ```typescript TypeScript theme={null}
  await ze.sendFeedback({
    promptSlug: "support-bot",
    completionId: response.id,
    thumbsUp: false,
    reason: "Wrong link provided",
    expectedOutput: "The password reset page is at https://app.example.com/reset",
  });
  ```
</CodeGroup>

Expected outputs are used during [prompt optimization](/autotune/prompts/optimization) to generate better prompt variants.

## Feedback Links

For collecting feedback from users who don't have a ZeroEval account (e.g. customers, external reviewers), create a feedback link that anyone can use:

```bash theme={null}
curl -X POST https://api.zeroeval.com/feedback-links \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt_slug": "support-bot",
    "link_type": "prompt_completion"
  }'
```

Share the returned URL with reviewers. They can submit feedback without needing API access or a ZeroEval account.


# Introduction
Source: https://docs.zeroeval.com/feedback/introduction

Attach human and AI feedback to your agent interactions to drive quality improvements

Your agent produces thousands of outputs, but without feedback you can't tell which ones are good. Feedback closes the loop -- it connects real-world quality judgments to the traces, spans, and completions your agent generates.

ZeroEval supports two kinds of feedback:

* **Human feedback** -- thumbs-up/down, star ratings, corrections, and expected outputs submitted by users or reviewers
* **AI feedback** -- automated evaluations from calibrated judges that score outputs against criteria you define

Both feed into the same system. Feedback attached to completions powers [prompt optimization](/autotune/introduction). You can also retrieve unified feedback -- combining human reviews and judge evaluations -- for any span, trace, or session via the [Feedback API](/feedback/api-reference#unified-entity-feedback).

## How feedback flows

<Steps>
  <Step title="Agent produces output">
    Your agent runs and ZeroEval captures the full trace -- inputs, outputs,
    model, prompt version.
  </Step>

  <Step title="Feedback is attached">
    Humans review outputs in the dashboard or your app submits feedback
    programmatically. Judges evaluate outputs automatically based on your
    criteria.
  </Step>

  <Step title="Quality becomes measurable">
    Feedback appears on spans, traces, and completions in the console. Filter by
    thumbs-up rate, judge scores, or tags to find patterns.
  </Step>

  <Step title="Improvements are driven by data">
    Use feedback to optimize prompts, compare models, calibrate judges, and
    catch regressions before users do.
  </Step>
</Steps>

## Get started

<CardGroup>
  <Card title="Human Feedback" icon="user" href="/feedback/human-feedback">
    Collect feedback from reviewers in the dashboard or from end users via
    like/dislike buttons, ratings, and corrections.
  </Card>

  <Card title="AI Feedback (Judges)" icon="scale-balanced" href="/judges/introduction">
    Configure AI evaluators that automatically score every completion against
    criteria you define. Includes built-in judges for common failures.
  </Card>
</CardGroup>


# Python
Source: https://docs.zeroeval.com/feedback/python

Submit completion feedback from Python to power prompt optimization

## Completion Feedback

Attach structured feedback to a specific LLM completion to power prompt optimization.

### `send_feedback()`

```python theme={null}
ze.send_feedback(
    prompt_slug="support-bot",
    completion_id="550e8400-e29b-41d4-a716-446655440000",
    thumbs_up=False,
    reason="Response was too verbose",
    expected_output="A concise 2-3 sentence response"
)
```

| Parameter           | Type    | Required | Description                                                                       |
| ------------------- | ------- | -------- | --------------------------------------------------------------------------------- |
| `prompt_slug`       | `str`   | Yes      | Prompt name (same as `ze.prompt(name=...)`)                                       |
| `completion_id`     | `str`   | Yes      | UUID of the completion span                                                       |
| `thumbs_up`         | `bool`  | Yes      | Positive or negative feedback                                                     |
| `reason`            | `str`   | No       | Explanation of the feedback                                                       |
| `expected_output`   | `str`   | No       | What the output should have been                                                  |
| `metadata`          | `dict`  | No       | Additional metadata                                                               |
| `judge_id`          | `str`   | No       | Judge automation ID (for judge feedback)                                          |
| `expected_score`    | `float` | No       | Expected score (for scored judges)                                                |
| `score_direction`   | `str`   | No       | `"too_high"` or `"too_low"`                                                       |
| `criteria_feedback` | `dict`  | No       | Per-criterion feedback: `{"criterion": {"expected_score": 4.0, "reason": "..."}}` |

### End-to-end example

```python theme={null}
import zeroeval as ze
from openai import OpenAI

ze.init()
client = OpenAI()

system_prompt = ze.prompt(
    name="support-bot",
    content="You are a helpful customer support agent."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How do I reset my password?"}
    ]
)

is_good = evaluate_response(response.choices[0].message.content)

ze.send_feedback(
    prompt_slug="support-bot",
    completion_id=response.id,
    thumbs_up=is_good,
    reason="Clear instructions" if is_good else "Missing reset link"
)
```


# TypeScript
Source: https://docs.zeroeval.com/feedback/typescript

Submit completion feedback from TypeScript to power prompt optimization

## Completion Feedback

Attach structured feedback to a specific LLM completion to power prompt optimization.

### `sendFeedback()`

```typescript theme={null}
await ze.sendFeedback({
  promptSlug: "support-bot",
  completionId: "550e8400-e29b-41d4-a716-446655440000",
  thumbsUp: false,
  reason: "Response was too verbose",
  expectedOutput: "A concise 2-3 sentence response",
});
```

| Parameter        | Type                      | Required | Description                                      |
| ---------------- | ------------------------- | -------- | ------------------------------------------------ |
| `promptSlug`     | `string`                  | Yes      | Prompt name (same as `ze.prompt({ name: ... })`) |
| `completionId`   | `string`                  | Yes      | UUID of the completion span                      |
| `thumbsUp`       | `boolean`                 | Yes      | Positive or negative feedback                    |
| `reason`         | `string`                  | No       | Explanation of the feedback                      |
| `expectedOutput` | `string`                  | No       | What the output should have been                 |
| `metadata`       | `Record<string, unknown>` | No       | Additional metadata                              |
| `judgeId`        | `string`                  | No       | Judge automation ID (for judge feedback)         |
| `expectedScore`  | `number`                  | No       | Expected score (for scored judges)               |
| `scoreDirection` | `'too_high' \| 'too_low'` | No       | Score direction for scored judges                |

### End-to-end example

```typescript theme={null}
import * as ze from "zeroeval";
import { OpenAI } from "openai";

ze.init();
const client = ze.wrap(new OpenAI());

const systemPrompt = await ze.prompt({
  name: "support-bot",
  content: "You are a helpful customer support agent.",
});

const response = await client.chat.completions.create({
  model: "gpt-4",
  messages: [
    { role: "system", content: systemPrompt },
    { role: "user", content: "How do I reset my password?" },
  ],
});

const isGood = evaluateResponse(response.choices[0].message.content);

await ze.sendFeedback({
  promptSlug: "support-bot",
  completionId: response.id,
  thumbsUp: isGood,
  reason: isGood ? "Clear instructions" : "Missing reset link",
});
```


# CLI
Source: https://docs.zeroeval.com/integrations/cli

Manage traces, prompts, judges, and optimization from your terminal

The ZeroEval CLI ships with the Python SDK and gives you terminal access to monitoring, prompts, judges, and optimization. It is designed to be agent-friendly and works well in CI pipelines, scripted workflows, and agent toolchains.

```bash theme={null}
pip install zeroeval
```

## Setup

Run the interactive setup to save your API key and resolve your project:

```bash theme={null}
zeroeval setup
```

This opens the ZeroEval dashboard, prompts for your API key, saves it to your shell config (e.g. `~/.zshrc`, `~/.bashrc`), and links your project automatically.

<Tip>
  For non-interactive environments like CI or coding agents, use `auth set` instead:

  ```bash theme={null}
  zeroeval auth set --api-key-env ZEROEVAL_API_KEY
  ```
</Tip>

### Auth commands

```bash theme={null}
zeroeval auth set --api-key <key>          # Set API key directly
zeroeval auth set --api-key-env MY_KEY_VAR # Read API key from an env var
zeroeval auth set --api-base-url <url>     # Override API base URL
zeroeval auth show --redact                # Show current config (key masked)
zeroeval auth clear                        # Wipe all stored config
zeroeval auth clear --api-key-only         # Clear only the API key and project
```

### Auth resolution

The CLI resolves credentials in this order:

1. Explicit CLI flags (`--project-id`, `--api-base-url`)
2. Environment variables (`ZEROEVAL_API_KEY`, `ZEROEVAL_PROJECT_ID`, `ZEROEVAL_BASE_URL`)
3. Global CLI config file

Config file location:

* **macOS / Linux**: `~/.config/zeroeval/config.json` (or `$XDG_CONFIG_HOME/zeroeval/config.json`)
* **Windows**: `%APPDATA%/zeroeval/config.json`

## Global flags

Global flags must appear **before** the subcommand:

```bash theme={null}
zeroeval --output json --project-id <id> --timeout 30.0 traces list
```

| Flag                  | Default                    | Description                                                                           |
| --------------------- | -------------------------- | ------------------------------------------------------------------------------------- |
| `--output text\|json` | `text`                     | Output format. `json` emits stable JSON to stdout, errors to stderr.                  |
| `--project-id`        | env / config               | Project context. Required for monitoring, prompts, judges, and optimization commands. |
| `--api-base-url`      | `https://api.zeroeval.com` | Override the API URL.                                                                 |
| `--quiet`             | off                        | Suppress non-essential output.                                                        |
| `--timeout`           | `20.0`                     | HTTP request timeout in seconds.                                                      |

### Output modes

* **`text`** (default) — human-readable; dict/list payloads are pretty-printed as JSON.
* **`json`** — stable, machine-readable JSON to stdout. Errors go to stderr as structured JSON. Confirmation prompts (e.g. `optimize promote`) are auto-skipped in JSON mode.

### Exit codes

| Code | Meaning                     |
| ---- | --------------------------- |
| `0`  | Success                     |
| `2`  | User / validation error     |
| `3`  | Auth or permission error    |
| `4`  | Remote API or network error |

## Querying and filtering

Most list and get commands support `--where`, `--select`, and `--order` for client-side filtering, projection, and sorting.

### `--where`

Filter rows. Repeatable — multiple clauses are AND-ed.

```bash theme={null}
# Exact match
zeroeval judges list --where "name=Quality Check"

# Substring match (case-insensitive)
zeroeval traces list --where "status~completed"

# Set membership
zeroeval spans list --where 'kind in ["llm","tool"]'
```

Supported operators: `=` (exact), `~` (substring), `in` (JSON array).

### `--select`

Project specific fields. Comma-separated, supports dotted paths.

```bash theme={null}
zeroeval judges list --select "id,name,evaluation_type"
```

### `--order`

Sort results by a field. Defaults to ascending.

```bash theme={null}
zeroeval traces list --order "created_at:desc"
```

## Monitoring

Monitoring commands require a `--project-id` (resolved automatically after `zeroeval setup`).

### Sessions

```bash theme={null}
zeroeval sessions list --start-date 2025-01-01 --end-date 2025-02-01 --limit 50
zeroeval sessions get <session_id>
```

| Flag           | Description                   |
| -------------- | ----------------------------- |
| `--start-date` | ISO date string lower bound   |
| `--end-date`   | ISO date string upper bound   |
| `--limit`      | Max results (default 50)      |
| `--offset`     | Pagination offset (default 0) |

### Traces

```bash theme={null}
zeroeval traces list --start-date 2025-01-01 --limit 50
zeroeval traces get <trace_id>
zeroeval traces spans <trace_id> --limit 100
```

`traces spans` returns the spans belonging to a specific trace, useful for debugging individual requests.

### Spans

```bash theme={null}
zeroeval spans list --start-date 2025-01-01 --limit 50
zeroeval spans get <span_id>
```

## Prompts

### List and inspect

```bash theme={null}
zeroeval prompts list
zeroeval prompts get <prompt_slug>
zeroeval prompts get <prompt_slug> --version 3
zeroeval prompts get <prompt_slug> --tag production
zeroeval prompts versions <prompt_slug>
zeroeval prompts tags <prompt_slug>
```

### Submit feedback

Provide feedback on a prompt completion for DSPy optimization and prompt tuning:

```bash theme={null}
zeroeval prompts feedback create \
  --prompt-slug customer-support \
  --completion-id <uuid> \
  --thumbs-up \
  --reason "Clear and helpful response"
```

For scored judges, add judge-specific fields:

```bash theme={null}
zeroeval prompts feedback create \
  --prompt-slug customer-support \
  --completion-id <uuid> \
  --thumbs-down \
  --judge-id <judge_uuid> \
  --expected-score 3.5 \
  --score-direction too_high \
  --reason "Score should be lower"
```

<Note>
  `--thumbs-up` and `--thumbs-down` are mutually exclusive and one is required.
</Note>

## Judges

### List and inspect

```bash theme={null}
zeroeval judges list
zeroeval judges get <judge_id>
zeroeval judges criteria <judge_id>
zeroeval judges evaluations <judge_id> --limit 100
zeroeval judges insights <judge_id>
zeroeval judges performance <judge_id>
zeroeval judges calibration <judge_id>
zeroeval judges versions <judge_id>
```

### Filter evaluations

```bash theme={null}
zeroeval judges evaluations <judge_id> \
  --start-date 2025-01-01 \
  --end-date 2025-02-01 \
  --evaluation-result true \
  --feedback-state pending \
  --limit 200
```

| Flag                          | Description              |
| ----------------------------- | ------------------------ |
| `--evaluation-result`         | `true` or `false`        |
| `--feedback-state`            | Filter by feedback state |
| `--start-date` / `--end-date` | Date range               |

### Create a judge

```bash theme={null}
zeroeval judges create \
  --name "Tone Check" \
  --prompt "Evaluate whether the response maintains a professional tone." \
  --evaluation-type binary \
  --sample-rate 1.0 \
  --temperature 0.0
```

Or load the prompt from a file:

```bash theme={null}
zeroeval judges create \
  --name "Quality Scorer" \
  --prompt-file judge_prompt.txt \
  --evaluation-type scored \
  --score-min 0 \
  --score-max 10 \
  --pass-threshold 7
```

| Flag                 | Default  | Description                                                  |
| -------------------- | -------- | ------------------------------------------------------------ |
| `--name`             | required | Judge name                                                   |
| `--prompt`           | —        | Inline prompt text (mutually exclusive with `--prompt-file`) |
| `--prompt-file`      | —        | Path to file containing the prompt                           |
| `--evaluation-type`  | `binary` | `binary` or `scored`                                         |
| `--score-min`        | `0.0`    | Minimum score (scored only)                                  |
| `--score-max`        | `10.0`   | Maximum score (scored only)                                  |
| `--pass-threshold`   | —        | Pass threshold (scored only)                                 |
| `--sample-rate`      | `1.0`    | Fraction of spans to evaluate                                |
| `--backfill`         | `100`    | Number of existing spans to backfill                         |
| `--tag`              | —        | Tag filter in `key=value1,value2` format. Repeatable.        |
| `--tag-match`        | `all`    | `all` or `any`                                               |
| `--target-prompt-id` | —        | Scope judge to a specific prompt                             |
| `--temperature`      | `0.0`    | LLM temperature for judge                                    |

### Submit judge feedback

```bash theme={null}
zeroeval judges feedback create \
  --span-id <span_uuid> \
  --thumbs-down \
  --reason "Missed safety issue" \
  --expected-output "Should flag harmful content"
```

For scored judges with per-criterion feedback:

```bash theme={null}
zeroeval judges feedback create \
  --span-id <span_uuid> \
  --thumbs-down \
  --expected-score 2.0 \
  --score-direction too_high \
  --criteria-feedback '{"clarity": {"expected_score": 1.0, "reason": "Confusing response"}}'
```

## Optimization

Start, inspect, and promote prompt or judge optimization runs. All optimization commands require `--project-id`.

### Prompt optimization

```bash theme={null}
zeroeval optimize prompt list <task_id>
zeroeval optimize prompt get <task_id> <run_id>
zeroeval optimize prompt start <task_id> --optimizer-type quick_refine
zeroeval optimize prompt promote <task_id> <run_id> --yes
```

### Judge optimization

```bash theme={null}
zeroeval optimize judge list <judge_id>
zeroeval optimize judge get <judge_id> <run_id>
zeroeval optimize judge start <judge_id> --optimizer-type dspy_bootstrap
zeroeval optimize judge promote <judge_id> <run_id> --yes
```

| Flag               | Default        | Description                                                         |
| ------------------ | -------------- | ------------------------------------------------------------------- |
| `--optimizer-type` | `quick_refine` | `quick_refine`, `dspy_bootstrap`, or `dspy_gepa`                    |
| `--config`         | —              | JSON string of extra optimizer configuration                        |
| `--yes`            | off            | Skip the confirmation prompt (also skipped in `--output json` mode) |

## Spec (machine-readable manual)

The `spec` commands dump the CLI's command and parameter contract as JSON or Markdown, useful for agents and toolchains that need to discover available commands programmatically.

```bash theme={null}
zeroeval spec cli --format json
zeroeval spec command "judges create" --format markdown
```

## CI / automation recipes

### Get the latest traces as JSON

```bash theme={null}
zeroeval --output json traces list --limit 10 --order "created_at:desc"
```

### Check judge pass rate

```bash theme={null}
zeroeval --output json judges evaluations <judge_id> \
  --evaluation-result true --limit 1000 \
  --select "id" | jq length
```

### Promote an optimization run non-interactively

```bash theme={null}
zeroeval --output json optimize prompt promote <task_id> <run_id> --yes
```

## Related docs

<CardGroup>
  <Card title="Tracing quickstart" icon="rocket" href="/tracing/quickstart">
    Get your first trace in under 5 minutes
  </Card>

  <Card title="Judges" icon="gavel" href="/judges/introduction">
    How calibrated judges evaluate your production traffic
  </Card>

  <Card title="Prompt setup" icon="wrench" href="/autotune/setup">
    Add ze.prompt() to your Python or TypeScript codebase
  </Card>

  <Card title="Skills" icon="wand-magic-sparkles" href="/integrations/skills">
    Let your coding agent handle SDK install and judge setup
  </Card>
</CardGroup>


# Introduction
Source: https://docs.zeroeval.com/integrations/introduction

Ways to integrate ZeroEval beyond the core SDKs

Besides the Python and TypeScript SDKs, there are a few other ways to get ZeroEval into your workflow.

<CardGroup>
  <Card title="Skills" icon="wand-magic-sparkles" href="/integrations/skills">
    Install and configure ZeroEval from inside Cursor, Claude Code, or any coding agent
  </Card>

  <Card title="MCP" icon="plug" href="/integrations/mcp">
    Connect AI agents to ZeroEval via the Model Context Protocol
  </Card>

  <Card title="CLI" icon="terminal" href="/integrations/cli">
    Manage traces, prompts, and judges from your terminal
  </Card>
</CardGroup>

## Where to start

* **If you use Cursor, Claude Code, or Codex**, try [Skills](/integrations/skills). Your coding agent handles SDK install, first trace, prompt migration, and judge setup for you.
* **If you prefer doing it yourself**, head to the [Quickstart Guide](/tracing/quickstart) and pick the [Python](/tracing/sdks/python/setup) or [TypeScript](/tracing/sdks/typescript/setup) SDK.


# MCP
Source: https://docs.zeroeval.com/integrations/mcp

Connect AI agents to ZeroEval via the Model Context Protocol

The ZeroEval MCP server lets AI agents inspect traces, manage judges and prompts, submit feedback, run optimizations, and deploy to production, all without leaving the agent context. It speaks the [Model Context Protocol](https://modelcontextprotocol.io), so any MCP-compatible client (Cursor, Claude Code, Windsurf, etc.) can connect directly.

## Setup

The fastest way to get started is to point your MCP client at the hosted server. No installation required.

### Cursor

Add this to your Cursor MCP settings (`.cursor/mcp.json`):

```json theme={null}
{
  "mcpServers": {
    "zeroeval": {
      "url": "https://mcp.zeroeval.com/mcp",
      "headers": {
        "Authorization": "Bearer <your-project-api-key>"
      }
    }
  }
}
```

### Claude Code

```bash theme={null}
claude mcp add zeroeval --transport http https://mcp.zeroeval.com/mcp \
  --header "Authorization: Bearer <your-project-api-key>"
```

### Other MCP clients

Any client that supports HTTP transport works. Set the server URL to `https://mcp.zeroeval.com/mcp` and pass your project API key in the `Authorization: Bearer <key>` header.

<Tip>
  Get your project API key from the [ZeroEval dashboard](https://app.zeroeval.com) under **Settings → API Keys**.
</Tip>

## Resources

The server exposes two MCP resources for introspection:

| URI                       | Description                                                                        |
| ------------------------- | ---------------------------------------------------------------------------------- |
| `config://server-context` | Redacted server config: auth mode, base URL, project scope, and feature flags      |
| `docs://capabilities`     | Canonical tool and resource inventory with annotations and output contract summary |

## Tools

### Read tools

Read tools are safe to call at any time. They do not modify state.

| Tool                     | Description                                       |
| ------------------------ | ------------------------------------------------- |
| `list-traces`            | List recent traces                                |
| `get-trace`              | Get a trace with its spans                        |
| `list-judges`            | List all judges                                   |
| `get-judge`              | Get judge details and linkage state               |
| `list-judge-evaluations` | List evaluations from a judge                     |
| `get-judge-criteria`     | Get scoring criteria for a judge                  |
| `list-prompts`           | List all prompts                                  |
| `get-prompt`             | Get a prompt at a specific version or tag         |
| `list-prompt-versions`   | List all versions of a prompt                     |
| `list-optimization-runs` | List optimization runs for a task                 |
| `get-optimization-run`   | Get run details with candidate prompt and metrics |
| `get-project-summary`    | High-level project monitoring summary             |

### Write tools

All write tools require `confirm: true` in the request and are annotated with `destructiveHint: true` so MCP clients can prompt for user approval before calling.

| Tool                        | Description                            |
| --------------------------- | -------------------------------------- |
| `create-judge`              | Create a new judge                     |
| `link-judge-to-prompt`      | Link a judge to a prompt               |
| `unlink-judge-from-prompt`  | Remove a judge's prompt link           |
| `create-judge-feedback`     | Submit feedback on a judge evaluation  |
| `create-prompt-feedback`    | Submit feedback on a prompt completion |
| `start-prompt-optimization` | Start a prompt optimization run        |
| `start-judge-optimization`  | Start a judge optimization run         |
| `cancel-optimization-run`   | Cancel a running optimization          |

### Deploy

Production deploys always require two steps:

1. **Preview:** Call `preview-optimization-deploy` with the run ID. This verifies the run succeeded, summarizes the candidate vs current production, and returns a time-limited confirmation receipt.
2. **Deploy:** Call `deploy-optimization-run` with `confirm: true` and the receipt from preview. The server re-reads current state and rejects the deploy if anything drifted since the preview.

| Tool                          | Description                                                       |
| ----------------------------- | ----------------------------------------------------------------- |
| `preview-optimization-deploy` | Preview what deploying a run would do (read-only)                 |
| `deploy-optimization-run`     | Deploy a succeeded run to production (requires receipt + confirm) |

### Proposal tools

Proposal tools are read-only helpers that gather evidence or prepare the exact next mutating call without executing it.

| Tool                          | Description                                                 |
| ----------------------------- | ----------------------------------------------------------- |
| `investigate-prompt-issues`   | Gather evidence about prompt state and recommend next steps |
| `investigate-judge-issues`    | Gather evidence about judge state and recommend next steps  |
| `prepare-prompt-optimization` | Propose the exact `start-prompt-optimization` call to make  |
| `prepare-judge-optimization`  | Propose the exact `start-judge-optimization` call to make   |

<CardGroup>
  <Card title="Tracing quickstart" icon="rocket" href="/tracing/quickstart">
    Get your first trace in under 5 minutes
  </Card>

  <Card title="Prompt setup" icon="wrench" href="/autotune/setup">
    Add ze.prompt() to your codebase
  </Card>

  <Card title="Judges" icon="gavel" href="/judges/introduction">
    How calibrated judges evaluate your production traffic
  </Card>

  <Card title="CLI" icon="terminal" href="/integrations/cli">
    Manage traces, prompts, and judges from your terminal
  </Card>
</CardGroup>


# Skills
Source: https://docs.zeroeval.com/integrations/skills

Set up ZeroEval from inside Cursor, Claude Code, Codex, and other coding agents

Skills let your coding agent do the ZeroEval setup work for you. Instead of flipping between docs and your editor, you tell your agent "install zeroeval" or "create a judge" and the skill handles the rest in-context. The source lives at [zeroeval/zeroeval-skills](https://github.com/zeroeval/zeroeval-skills) on GitHub.

They work with Cursor, Claude Code, Codex, and 30+ other agents that support the skills format.

## Available skills

<CardGroup>
  <Card title="zeroeval-install" icon="download">
    Installs the SDK (Python or TypeScript), verifies your first trace, migrates prompts to `ze.prompt`, and recommends starter judges. Routes to `custom-tracing` for non-SDK languages.
  </Card>

  <Card title="custom-tracing" icon="code">
    Send traces to ZeroEval without the SDK. Covers direct REST API ingestion (`POST /spans`) and OpenTelemetry (OTLP) export for any language.
  </Card>

  <Card title="create-judge" icon="gavel">
    Helps you pick an evaluation type (binary or scored), write a judge template, define criteria, and create the judge via dashboard or API.
  </Card>

  <Card title="prompt-migration" icon="arrows-rotate">
    Migrates hardcoded prompts to `ze.prompt`, wires feedback collection, connects judges, and walks through the staged rollout to prompt optimization.
  </Card>
</CardGroup>

## Install

<CodeGroup>
  ```bash CLI (recommended) theme={null}
  # Install all skills
  npx skills add zeroeval/zeroeval-skills

  # Install a specific skill
  npx skills add zeroeval/zeroeval-skills --skill zeroeval-install

  # List available skills
  npx skills add zeroeval/zeroeval-skills --list
  ```

  ```bash Claude Code plugin theme={null}
  # Add the marketplace
  /plugin marketplace add zeroeval/zeroeval-skills

  # Install a specific plugin
  /plugin install zeroeval-install@zeroeval-skills
  /plugin install custom-tracing@zeroeval-skills
  /plugin install create-judge@zeroeval-skills
  /plugin install prompt-migration@zeroeval-skills

  # Reload plugins if the new commands do not appear immediately
  /reload-plugins
  ```

  ```bash Manual copy (Cursor / Claude Code) theme={null}
  git clone https://github.com/zeroeval/zeroeval-skills.git

  # Cursor
  mkdir -p .cursor/skills
  cp -r zeroeval-skills/skills/zeroeval-install .cursor/skills/zeroeval-install
  cp -r zeroeval-skills/skills/custom-tracing .cursor/skills/custom-tracing
  cp -r zeroeval-skills/skills/create-judge .cursor/skills/create-judge
  cp -r zeroeval-skills/skills/prompt-migration .cursor/skills/prompt-migration

  # Claude Code
  mkdir -p .claude/skills
  cp -r zeroeval-skills/skills/zeroeval-install .claude/skills/zeroeval-install
  cp -r zeroeval-skills/skills/custom-tracing .claude/skills/custom-tracing
  cp -r zeroeval-skills/skills/create-judge .claude/skills/create-judge
  cp -r zeroeval-skills/skills/prompt-migration .claude/skills/prompt-migration
  ```
</CodeGroup>

<Note>
  On Windows without symlink support, use `npx skills` or the manual copy method. The `plugins/` directory in the repo contains symlinks that may not resolve on Windows.
</Note>

## Requirements

* A ZeroEval account and API key — [zeroeval.com](https://zeroeval.com)
* **SDK path** (zeroeval-install): Python 3.8+ or Node 18+, plus an LLM provider SDK (OpenAI, Vercel AI, LangChain, etc.)
* **Direct API / OTLP path** (custom-tracing): any language with an HTTP client or OpenTelemetry exporter

<Tip>
  The [zeroeval-skills GitHub repo](https://github.com/zeroeval/zeroeval-skills) has the latest skill content and deeper reference playbooks. This page covers discovery and install only.
</Tip>

## After installation

Once installed, your coding agent picks up the skills automatically. Ask it to "set up zeroeval", "send traces via API", "create a judge", or "migrate my prompts to ze.prompt" and it will use them.

If you want to read the product docs directly:

<CardGroup>
  <Card title="Tracing quickstart" icon="rocket" href="/tracing/quickstart">
    Get your first trace in under 5 minutes
  </Card>

  <Card title="Prompt setup" icon="wrench" href="/autotune/setup">
    Add ze.prompt() to your Python or TypeScript codebase
  </Card>

  <Card title="Judges" icon="gavel" href="/judges/introduction">
    How calibrated judges evaluate your production traffic
  </Card>

  <Card title="Judge setup" icon="sliders" href="/judges/setup">
    Create and calibrate a judge
  </Card>
</CardGroup>


# Calibration
Source: https://docs.zeroeval.com/judges/calibration

Correct judge evaluations to improve accuracy over time

Judges get better the more you correct them. Each time you mark an evaluation as right or wrong, that correction is stored and used to refine future scoring. This is calibration.

## Calibrating in the dashboard

For each evaluated item in the console, you can mark the judge's assessment as correct or incorrect and optionally provide the expected answer.

<img alt="Calibrating your judge" />

<video />

## Calibrating programmatically

Submit corrections via the SDK or REST API. This is useful for bulk calibration from automated pipelines, custom review workflows, or external labeling tools.

### Finding the right IDs

Judge evaluations involve two related spans:

| ID                     | Description                                        |
| ---------------------- | -------------------------------------------------- |
| **Source Span ID**     | The original LLM call that was evaluated           |
| **Judge Call Span ID** | The span created when the judge ran its evaluation |

| ID            | Where to Find It                                                  |
| ------------- | ----------------------------------------------------------------- |
| **Task Slug** | In the judge settings, or the URL when editing the judge's prompt |
| **Span ID**   | In the evaluation modal, or via `get_judge_evaluations()`         |
| **Judge ID**  | In the URL when viewing a judge (`/judges/{judge_id}`)            |

<Tip>
  The easiest way to get the correct IDs: open a judge evaluation in the
  dashboard, expand "SDK Integration", and click "Copy" to get pre-filled code.
</Tip>

### Binary judges

Mark a judge evaluation as correct or incorrect:

<CodeGroup>
  ```python Python theme={null}
  import zeroeval as ze

  ze.send_feedback(
      prompt_slug="your-judge-task-slug",
      completion_id="span-id-here",
      thumbs_up=True,
      reason="Judge correctly identified the issue",
      judge_id="automation-id-here",
  )
  ```

  ```bash cURL theme={null}
  curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
    -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "thumbs_up": true,
      "reason": "Judge correctly identified the issue",
      "judge_id": "automation-uuid-here"
    }'
  ```
</CodeGroup>

### Scored judges

For judges using scored rubrics, provide the expected score and direction:

<CodeGroup>
  ```python Python theme={null}
  ze.send_feedback(
      prompt_slug="quality-scorer",
      completion_id="span-id-here",
      thumbs_up=False,
      judge_id="automation-id-here",
      expected_score=3.5,
      score_direction="too_high",
      reason="Score should have been lower due to grammar issues",
  )
  ```

  ```bash cURL theme={null}
  curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
    -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "thumbs_up": false,
      "judge_id": "automation-uuid-here",
      "expected_score": 3.5,
      "score_direction": "too_high",
      "reason": "Score should have been lower"
    }'
  ```
</CodeGroup>

### Per-criterion feedback

For scored judges with multiple criteria, correct individual criterion scores:

<CodeGroup>
  ```python Python theme={null}
  ze.send_feedback(
      prompt_slug="quality-scorer",
      completion_id="span-id-here",
      thumbs_up=False,
      judge_id="automation-id-here",
      reason="Criterion-level score adjustments",
      criteria_feedback={
          "CTA_text": {
              "expected_score": 4.0,
              "reason": "CTA is clear and prominent"
          },
          "CX-004": {
              "expected_score": 1.0,
              "reason": "Required phone number is missing"
          }
      }
  )
  ```

  ```bash cURL theme={null}
  curl -X POST "https://api.zeroeval.com/v1/prompts/{task_slug}/completions/{span_id}/feedback" \
    -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "thumbs_up": false,
      "judge_id": "automation-uuid-here",
      "criteria_feedback": {
        "CTA_text": {"expected_score": 4.0, "reason": "CTA is clear and visible"},
        "CX-004": {"expected_score": 1.0, "reason": "Phone number is missing"}
      }
    }'
  ```
</CodeGroup>

To discover valid criterion keys before sending per-criterion feedback:

```python theme={null}
criteria = ze.get_judge_criteria(
    project_id="your-project-id",
    judge_id="automation-id-here",
)

for c in criteria["criteria"]:
    print(c["key"], c.get("description"))
```

### Parameters

| Parameter           | Type    | Required | Description                                      |
| ------------------- | ------- | -------- | ------------------------------------------------ |
| `prompt_slug`       | `str`   | Yes      | Task slug associated with the judge              |
| `completion_id`     | `str`   | Yes      | Span ID being evaluated                          |
| `thumbs_up`         | `bool`  | Yes      | `True` if judge was correct, `False` if wrong    |
| `reason`            | `str`   | No       | Explanation of the correction                    |
| `judge_id`          | `str`   | Yes      | Judge automation ID                              |
| `expected_score`    | `float` | No       | Expected score (scored judges only)              |
| `score_direction`   | `str`   | No       | `"too_high"` or `"too_low"` (scored judges only) |
| `criteria_feedback` | `dict`  | No       | Per-criterion corrections (scored judges only)   |

## Bulk calibration

Iterate through evaluations and submit corrections programmatically:

```python theme={null}
evaluations = ze.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
)

for eval in evaluations["evaluations"]:
    is_correct = your_review_logic(eval)

    ze.send_feedback(
        prompt_slug="your-judge-task-slug",
        completion_id=eval["span_id"],
        thumbs_up=is_correct,
        reason="Automated review",
        judge_id="your-judge-id",
    )
```


# Introduction
Source: https://docs.zeroeval.com/judges/introduction

AI evaluators that give automated feedback on your agent's interactions

Judges are AIs that give feedback to your agent's interactions -- individual messages, full conversations, or entire workflows. They evaluate spans, traces, and sessions against criteria you define, scoring outputs automatically so you don't have to review every response manually.

Judges calibrate over time. The more you correct their evaluations, the more accurate they become at catching the issues that matter to your use case.

## Built-in judges

ZeroEval ships with out-of-the-box judges that detect common failure patterns without any configuration:

* **User corrections** -- detects when a user rephrases or corrects the agent's output
* **User frustration** -- identifies signs of dissatisfaction, confusion, or repeated requests
* **Task failures** -- catches when the agent fails to complete the user's request
* **Hallucinations** -- flags responses that contain fabricated or unsupported claims
* **Safety violations** -- detects harmful, biased, or policy-violating content

## Suggested judges

Based on your production traffic, ZeroEval can suggest judges that would be useful for your specific use case. As traces flow in, we analyze patterns in your agent's behavior and recommend evaluation criteria tailored to the failures and edge cases we observe.

## Custom judges

Create your own judges to evaluate anything specific to your domain:

* Response quality (helpfulness, tone, structure, completeness)
* Domain accuracy (legal, medical, financial correctness)
* Format compliance (JSON output, specific templates, length constraints)
* Business rules (pricing accuracy, policy adherence, SLA compliance)

## Creating a judge

<Steps>
  <Step title="Define your criteria">
    Go to [Monitoring → Judges → New
    Judge](https://app.zeroeval.com/monitoring/judges). Choose between a
    **binary** judge (pass/fail) or a **scored** judge (rubric with multiple
    criteria).
  </Step>

  <Step title="Write the evaluation prompt">
    Specify what the judge should look for in your agent's output. Tweak the
    prompt until it matches the quality bar you're aiming for.
  </Step>

  <Step title="Judge starts evaluating">
    Historical and future traces are scored automatically. Results appear in the
    dashboard immediately.
  </Step>
</Steps>

<img alt="Creating a judge" />

## Calibration

AI judges are powerful but imperfect. Out of the box, a judge might:

* **Miss nuance** -- flag a correct but unconventional answer as wrong
* **Be too lenient** -- let low-quality responses pass because they're grammatically correct
* **Misinterpret context** -- score domain-specific content poorly because it lacks your business context

This is expected. A judge's first evaluations are a starting point, not a finished product.

The fix is simple: **correct the judge when it's wrong**. Each correction teaches the judge what "good" and "bad" look like for your specific use case. Over time, the judge converges on your quality bar.

```mermaid theme={null}
flowchart LR
    A[Judge evaluates] --> B[You review]
    B -->|Correct| C[Reinforced]
    B -->|Wrong| D[Corrected]
    C --> E[Judge improves]
    D --> E
    E --> A
```

<Card title="Calibration Guide" icon="bullseye" href="/judges/calibration">
  Learn how to correct judge evaluations in the dashboard or programmatically
  via SDK to improve accuracy over time.
</Card>

<Tip>
  Using Cursor, Claude Code, or another coding agent? The [`create-judge`
  skill](/integrations/skills) can help you pick an evaluation type, write the
  template, and create the judge without leaving your editor.
</Tip>


# Multimodal Evaluation
Source: https://docs.zeroeval.com/judges/multimodal-evaluation

Evaluate screenshots and images with LLM judges

LLM judges can evaluate spans that contain images alongside text. This is useful for browser agents, UI testing, visual QA, and any workflow where you need to assess visual output.

## How it works

1. **Attach images to spans** using SDK methods or structured output data
2. **Images are uploaded** during span ingestion (image data is stripped from the span and stored separately)
3. **Judges fetch images** when evaluating the span and send them to a vision-capable LLM
4. **Evaluation results** appear in the dashboard like any other judge evaluation

The LLM sees both the span's text data (input/output) and any attached images, giving it full context for evaluation.

Images can be provided as **base64-encoded strings**, **presigned S3 URLs**, or **CDN image URLs**. In all cases, ZeroEval copies the image into its own storage during ingestion, so the source URL only needs to remain valid long enough for the ingest request to complete.

## Attaching images to spans

There are two ways to attach images to spans, depending on your workflow.

### Option 1: SDK helper methods

The SDK provides `add_screenshot()` and `add_image()` methods for attaching images with metadata.

**Screenshots with viewport context**

For browser agents or responsive testing, use `add_screenshot()` to capture different viewports:

```python theme={null}
import zeroeval as ze

with ze.span(name="homepage_test", tags={"has_screenshots": "true"}) as span:
    # Desktop viewport
    span.add_screenshot(
        base64_data=desktop_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )
    
    # Mobile viewport
    span.add_screenshot(
        base64_data=mobile_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - Mobile"
    )
    
    span.set_io(
        input_data="Load homepage and capture screenshots",
        output_data="Captured 2 viewport screenshots"
    )
```

**Generic images**

For charts, diagrams, or UI component states, use `add_image()`:

```python theme={null}
with ze.span(name="button_hover_test") as span:
    span.add_image(
        base64_data=before_hover_base64,
        label="Button - Default State"
    )
    
    span.add_image(
        base64_data=after_hover_base64,
        label="Button - Hover State"
    )
    
    span.set_io(
        input_data="Test button hover interaction",
        output_data="Button changes color on hover"
    )
```

### Option 2: Image URLs (S3 presigned or CDN)

If your images are already hosted externally, you can pass an HTTPS URL instead of base64 data. ZeroEval will download the image, validate it, and copy it into its own storage.

Attach URLs via `attributes.attachments` using a `url` key instead of `base64`:

**Presigned S3 URL**

```python theme={null}
import boto3
import zeroeval as ze

s3 = boto3.client("s3")
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "images/chart.png"},
    ExpiresIn=300,
)

with ze.span(name="chart_generation") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": presigned_url,
            "label": "Monthly Revenue Chart",
        }
    ]

    span.set_io(
        input_data="Generate revenue chart",
        output_data="Chart generated"
    )
```

**CDN URL**

```python theme={null}
import zeroeval as ze

cdn_url = "https://cdn.example.com/images/product-photo.png"

with ze.span(name="product_image_check") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": cdn_url,
            "label": "Product listing photo",
        }
    ]

    span.set_io(
        input_data="Check product image quality",
        output_data="Image attached for evaluation"
    )
```

<Note>
  The URL only needs to stay valid long enough for ZeroEval to download the image during ingestion (typically a few seconds). After that, ZeroEval serves the image from its own storage. CDN URLs must be from a trusted domain configured in the backend.
</Note>

### Option 3: Structured output\_data

If your workflow already produces screenshot data as structured output (common with browser automation agents), you can include images directly in the span's `output_data`. ZeroEval automatically detects and extracts images from JSON arrays containing `base64` or `url` fields.

```python theme={null}
import zeroeval as ze

with ze.span(
    name="screenshot_capture",
    kind="llm",
    tags={"has_screenshots": "true", "screenshot_count": "2"}
) as span:
    # Set input as conversation messages
    span.input_data = [
        {
            "role": "system",
            "content": "You are a screenshot capture service."
        },
        {
            "role": "user", 
            "content": "Navigate to the homepage and capture screenshots"
        }
    ]
    
    # Set output as array of screenshot objects with base64 data
    span.output_data = [
        {
            "viewport": "mobile",
            "width": 768,
            "height": 1024,
            "base64": mobile_screenshot_base64
        },
        {
            "viewport": "desktop",
            "width": 1920,
            "height": 1080,
            "base64": desktop_screenshot_base64
        }
    ]
```

You can also use URLs (presigned S3 or CDN) in `output_data` by replacing the `base64` key with `url`:

```python theme={null}
    span.output_data = [
        {
            "viewport": "mobile",
            "width": 768,
            "height": 1024,
            "url": mobile_presigned_url
        },
        {
            "viewport": "desktop",
            "width": 1920,
            "height": 1080,
            "url": desktop_presigned_url
        }
    ]
```

When ZeroEval ingests this span, it:

1. Extracts each object with a `base64` or `url` field as an attachment
2. Downloads (for URLs) and uploads the images to storage
3. Strips the image data from `output_data` to keep the database lean
4. Preserves the metadata (viewport, width, height) for display

This approach works well when your browser agent or automation tool already produces structured screenshot output.

<Note>
  All methods produce the same result: images stored and available for multimodal judge evaluation. Choose whichever fits your workflow better.
</Note>

## Creating a multimodal judge

Multimodal judges work like regular judges, but with criteria that reference attached images. The judge prompt should describe what to look for in the visual content.

### Example: UI consistency judge

```
Evaluate whether the UI renders correctly across viewports.

Check for:
- Layout breaks or overlapping elements
- Text that's too small to read on mobile
- Missing or broken images
- Inconsistent spacing between viewports

Score 1 if all viewports render correctly, 0 if there are visual issues.
```

### Example: Brand compliance judge

```
Check if the page follows brand guidelines.

Look for:
- Correct logo placement and sizing
- Brand colors used consistently
- Proper typography hierarchy
- Appropriate whitespace

Score 1 for full compliance, 0 for violations.
```

### Example: Accessibility judge

```
Evaluate visual accessibility of the interface.

Check:
- Sufficient color contrast
- Text size readability
- Clear visual hierarchy
- Button/link affordances

Score 1 if accessible, 0 if there are issues. Include specific problems in the reasoning.
```

## Filtering spans for multimodal evaluation

Use tags to identify which spans should be evaluated by your multimodal judge:

```python theme={null}
# Tag spans that have screenshots
with ze.span(name="browser_test", tags={"has_screenshots": "true"}) as span:
    span.add_screenshot(...)
```

Then configure your judge to only evaluate spans matching that tag. This prevents the judge from running on text-only spans where multimodal evaluation doesn't apply.

## Supported image formats

* JPEG
* PNG
* WebP
* GIF

Images can be provided as base64-encoded strings, presigned S3 URLs, or CDN image URLs from trusted domains. In all cases, images are validated by magic bytes during ingestion. The maximum size is 10MB per image, with up to 5 images per span.

## Viewing images in the dashboard

Screenshots appear in two places:

1. **Span details view** - Images show in the Data tab with viewport labels and dimensions
2. **Judge evaluation modal** - When reviewing an evaluation, you'll see the images the judge analyzed

Images display with their labels, viewport type (for screenshots), and dimensions when available.

## Model support

Multimodal evaluation currently uses Gemini models, which support image inputs. When you create a judge, ZeroEval automatically handles the image formatting for the model.

<Note>
  Multimodal evaluation works best with specific, measurable criteria. Vague prompts like "does this look good?" will produce inconsistent results. Be explicit about what visual properties to check.
</Note>


# Pulling Evaluations
Source: https://docs.zeroeval.com/judges/pull-evaluations

Retrieve judge evaluations via SDK or REST API

Retrieve judge evaluations programmatically for reporting, analysis, or integration into your own workflows.

## Finding your IDs

Before making API calls, you'll need these identifiers:

| ID             | Where to find it                                                            |
| -------------- | --------------------------------------------------------------------------- |
| **Project ID** | Settings → Project, or in any URL after `/projects/`                        |
| **Judge ID**   | Click a judge in the dashboard; the ID is in the URL (`/judges/{judge_id}`) |
| **Span ID**    | In trace details, or returned by your instrumentation code                  |

## Python SDK

### Get available criteria for a judge

Use this before submitting criterion-level feedback to discover valid criterion keys.

```python theme={null}
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")

criteria = ze.get_judge_criteria(
    project_id="your-project-id",
    judge_id="your-judge-id",
)

print(criteria["evaluation_type"])
for criterion in criteria["criteria"]:
    print(criterion["key"], criterion.get("description"))
```

### Get evaluations by judge

Fetch all evaluations for a specific judge with pagination and optional filters.

```python theme={null}
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")

response = ze.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
)

print(f"Total: {response['total']}")
for eval in response["evaluations"]:
    print(f"Span: {eval['span_id']}")
    print(f"Result: {'PASS' if eval['evaluation_result'] else 'FAIL'}")
    print(f"Score: {eval.get('score')}")  # For scored judges
    print(f"Reason: {eval['evaluation_reason']}")
```

**Optional filters:**

```python theme={null}
response = ze.get_judge_evaluations(
    project_id="your-project-id",
    judge_id="your-judge-id",
    limit=100,
    offset=0,
    start_date="2025-01-01T00:00:00Z",
    end_date="2025-01-31T23:59:59Z",
    evaluation_result=True,  # Only passing evaluations
    feedback_state="with_user_feedback",  # Only calibrated items
)
```

### Get evaluations by span

Fetch all judge evaluations for a specific span (useful when a span has been evaluated by multiple judges).

```python theme={null}
response = ze.get_span_evaluations(
    project_id="your-project-id",
    span_id="your-span-id",
)

for eval in response["evaluations"]:
    print(f"Judge: {eval['judge_name']}")
    print(f"Result: {'PASS' if eval['evaluation_result'] else 'FAIL'}")
    if eval.get('evaluation_type') == 'scored':
        print(f"Score: {eval['score']} / {eval['score_max']}")
```

## REST API

Use these endpoints directly with your API key in the `Authorization` header.

### Get available criteria for a judge

```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/criteria" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

### Get evaluations by judge

```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/judges/{judge_id}/evaluations?limit=100&offset=0" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

**Query parameters:**

| Parameter           | Type   | Description                                     |
| ------------------- | ------ | ----------------------------------------------- |
| `limit`             | int    | Results per page (1-500, default 100)           |
| `offset`            | int    | Pagination offset (default 0)                   |
| `start_date`        | string | Filter by date (ISO 8601)                       |
| `end_date`          | string | Filter by date (ISO 8601)                       |
| `evaluation_result` | bool   | `true` for passing, `false` for failing         |
| `feedback_state`    | string | `with_user_feedback` or `without_user_feedback` |

### Get evaluations by span

```bash theme={null}
curl -X GET "https://api.zeroeval.com/projects/{project_id}/spans/{span_id}/evaluations" \
  -H "Authorization: Bearer $ZEROEVAL_API_KEY"
```

## Response format

### Judge evaluations response

```json theme={null}
{
  "evaluations": [...],
  "total": 142,
  "limit": 100,
  "offset": 0
}
```

### Judge criteria response

```json theme={null}
{
  "judge_id": "judge-uuid",
  "evaluation_type": "scored",
  "score_min": 0,
  "score_max": 5,
  "pass_threshold": 3.5,
  "criteria": [
    {
      "key": "CTA_text",
      "label": "CTA_text",
      "description": "CTA clarity and visibility"
    }
  ]
}
```

### Span evaluations response

```json theme={null}
{
  "span_id": "abc-123",
  "evaluations": [...]
}
```

### Evaluation object

| Field               | Type          | Description                        |
| ------------------- | ------------- | ---------------------------------- |
| `id`                | string        | Unique evaluation ID               |
| `span_id`           | string        | The evaluated span                 |
| `evaluation_result` | bool          | Pass (`true`) or fail (`false`)    |
| `evaluation_reason` | string        | Judge's reasoning                  |
| `confidence_score`  | float         | Model confidence (0-1)             |
| `score`             | float \| null | Numeric score (scored judges only) |
| `score_min`         | float \| null | Minimum possible score             |
| `score_max`         | float \| null | Maximum possible score             |
| `pass_threshold`    | float \| null | Score required to pass             |
| `model_used`        | string        | LLM model that ran the evaluation  |
| `created_at`        | string        | ISO 8601 timestamp                 |

## Pagination example

For large result sets, paginate through all evaluations:

```python theme={null}
all_evaluations = []
offset = 0
limit = 100

while True:
    response = ze.get_judge_evaluations(
        project_id="your-project-id",
        judge_id="your-judge-id",
        limit=limit,
        offset=offset,
    )
    
    all_evaluations.extend(response["evaluations"])
    
    if len(response["evaluations"]) < limit:
        break
    
    offset += limit

print(f"Fetched {len(all_evaluations)} total evaluations")
```

## Related

* [Submitting Feedback](/judges/submit-feedback) - Programmatically submit feedback for judge evaluations


# API Reference
Source: https://docs.zeroeval.com/tracing/api-reference

REST API for ingesting and querying spans, traces, and sessions

Base URL: `https://api.zeroeval.com`

All requests require a Bearer token in the `Authorization` header:

```
Authorization: Bearer YOUR_ZEROEVAL_API_KEY
```

Get an API key from [Settings → API Keys](https://app.zeroeval.com/settings?section=api-keys).

***

## Spans

### Ingest Spans

```
POST /spans
```

Send one or more spans. Traces and sessions referenced by the spans are auto-created if they don't exist.

**Request body:** `SpanCreate[]`

| Field            | Type            | Required | Default         | Description                                                  |
| ---------------- | --------------- | -------- | --------------- | ------------------------------------------------------------ |
| `trace_id`       | `string (UUID)` | Yes      | —               | Trace this span belongs to                                   |
| `name`           | `string`        | Yes      | —               | Descriptive name                                             |
| `started_at`     | `ISO 8601`      | Yes      | —               | When the span started                                        |
| `id`             | `string (UUID)` | No       | auto-generated  | Client-provided span ID                                      |
| `kind`           | `string`        | No       | `"generic"`     | `generic`, `llm`, `tts`, `http`, `database`, `vector_store`  |
| `status`         | `string`        | No       | `null`          | `unset`, `ok`, `error`                                       |
| `ended_at`       | `ISO 8601`      | No       | `null`          | When the span completed                                      |
| `duration_ms`    | `float`         | No       | `null`          | Duration in milliseconds                                     |
| `cost`           | `float`         | No       | auto-calculated | Cost (auto-calculated for LLM spans)                         |
| `parent_span_id` | `string (UUID)` | No       | `null`          | Parent span for nesting                                      |
| `session_id`     | `string (UUID)` | No       | `null`          | Session to associate with                                    |
| `session`        | `object`        | No       | `null`          | `{"id": "...", "name": "..."}` — alternative to `session_id` |
| `input_data`     | `string`        | No       | `null`          | Input data (JSON string for messages)                        |
| `output_data`    | `string`        | No       | `null`          | Output/response text                                         |
| `attributes`     | `object`        | No       | `{}`            | Arbitrary key-value attributes                               |
| `tags`           | `object`        | No       | `{}`            | Key-value tags for filtering                                 |
| `trace_tags`     | `object`        | No       | `{}`            | Tags applied to the parent trace                             |
| `session_tags`   | `object`        | No       | `{}`            | Tags applied to the parent session                           |
| `error_code`     | `string`        | No       | `null`          | Error code or exception class                                |
| `error_message`  | `string`        | No       | `null`          | Error description                                            |
| `error_stack`    | `string`        | No       | `null`          | Stack trace                                                  |
| `code_filepath`  | `string`        | No       | `null`          | Source file path                                             |
| `code_lineno`    | `int`           | No       | `null`          | Source line number                                           |

<CodeGroup>
  ```bash cURL theme={null}
  curl -X POST https://api.zeroeval.com/spans \
    -H "Authorization: Bearer $ZEROEVAL_API_KEY" \
    -H "Content-Type: application/json" \
    -d '[{
      "trace_id": "550e8400-e29b-41d4-a716-446655440001",
      "name": "chat_completion",
      "kind": "llm",
      "started_at": "2025-01-15T10:30:00Z",
      "ended_at": "2025-01-15T10:30:02Z",
      "status": "ok",
      "attributes": {
        "provider": "openai",
        "model": "gpt-4o",
        "inputTokens": 150,
        "outputTokens": 230
      },
      "input_data": "[{\"role\": \"user\", \"content\": \"What is the capital of France?\"}]",
      "output_data": "The capital of France is Paris.",
      "tags": {"environment": "production"}
    }]'
  ```

  ```python Python theme={null}
  import requests, json, uuid
  from datetime import datetime, timezone

  requests.post(
      "https://api.zeroeval.com/spans",
      headers={"Authorization": f"Bearer {API_KEY}"},
      json=[{
          "trace_id": str(uuid.uuid4()),
          "name": "chat_completion",
          "kind": "llm",
          "started_at": datetime.now(timezone.utc).isoformat(),
          "ended_at": datetime.now(timezone.utc).isoformat(),
          "status": "ok",
          "attributes": {
              "provider": "openai",
              "model": "gpt-4o",
              "inputTokens": 150,
              "outputTokens": 230
          },
          "input_data": json.dumps([{"role": "user", "content": "Hello"}]),
          "output_data": "Hi there!"
      }]
  )
  ```
</CodeGroup>

**Response:** `SpanRead[]` (200)

#### LLM Cost Calculation

For automatic cost calculation on LLM spans, set `kind` to `"llm"` and include these attributes:

| Attribute      | Required | Description                             |
| -------------- | -------- | --------------------------------------- |
| `provider`     | Yes      | `"openai"`, `"gemini"`, `"anthropic"`   |
| `model`        | Yes      | `"gpt-4o"`, `"claude-3-5-sonnet"`, etc. |
| `inputTokens`  | Yes      | Number of input tokens                  |
| `outputTokens` | Yes      | Number of output tokens                 |

Cost is calculated as: `(inputTokens × inputPrice + outputTokens × outputPrice) / 1,000,000`

***

### Query Spans

```
GET /spans
```

Retrieve spans with filtering, sorting, and pagination.

| Parameter        | Type       | Default | Description                                                   |
| ---------------- | ---------- | ------- | ------------------------------------------------------------- |
| `id`             | `string`   | —       | Filter by span ID                                             |
| `trace_id`       | `string`   | —       | Filter by trace ID                                            |
| `parent_span_id` | `string`   | —       | Filter by parent; `"null"` for root spans                     |
| `span_kind`      | `string`   | —       | `llm`, `generic`, `tts`, `http`, etc.                         |
| `created_before` | `ISO 8601` | —       | Upper bound on created\_at                                    |
| `created_after`  | `ISO 8601` | —       | Lower bound on created\_at                                    |
| `names`          | `string`   | —       | Comma-separated span names                                    |
| `error_codes`    | `string`   | —       | Comma-separated error codes                                   |
| `has_error`      | `bool`     | —       | `true` or `false`                                             |
| `duration_min`   | `float`    | —       | Minimum duration (ms)                                         |
| `duration_max`   | `float`    | —       | Maximum duration (ms)                                         |
| `cost_min`       | `float`    | —       | Minimum cost                                                  |
| `cost_max`       | `float`    | —       | Maximum cost                                                  |
| `sort_by`        | `string`   | —       | `started_at`, `name`, `status`, `duration_ms`, `cost`, `kind` |
| `sort_order`     | `string`   | `desc`  | `asc` or `desc`                                               |
| `limit`          | `int`      | `5000`  | 1–10000                                                       |
| `offset`         | `int`      | `0`     | Pagination offset                                             |

Any unrecognized query parameter is treated as a **tag filter** (e.g. `?environment=production`).

**Response:** `SpanRead[]` (200)

***

### Get Span Attachments

```
GET /spans/{span_id}/attachments
```

Returns all attachments (images, screenshots) for a span with presigned URLs.

**Response:**

```json theme={null}
{
  "attachments": [
    {
      "type": "image",
      "url": "https://...",
      "label": "Homepage screenshot"
    }
  ],
  "count": 1
}
```

***

## Traces

### Create Traces

```
POST /traces
```

Create one or more traces. Traces are also auto-created when ingesting spans.

**Request body:** `TraceCreate[]`

| Field         | Type            | Required | Default        | Description              |
| ------------- | --------------- | -------- | -------------- | ------------------------ |
| `session_id`  | `string (UUID)` | Yes      | —              | Parent session           |
| `name`        | `string`        | Yes      | —              | Trace name               |
| `started_at`  | `ISO 8601`      | Yes      | —              | Start time               |
| `id`          | `string (UUID)` | No       | auto-generated | Client-provided trace ID |
| `status`      | `string`        | No       | `null`         | `unset`, `ok`, `error`   |
| `ended_at`    | `ISO 8601`      | No       | `null`         | End time                 |
| `duration_ms` | `float`         | No       | `null`         | Duration in milliseconds |
| `cost`        | `float`         | No       | `null`         | Total cost               |
| `attributes`  | `object`        | No       | `{}`           | Arbitrary attributes     |
| `tags`        | `object`        | No       | `{}`           | Key-value tags           |

**Response:** `TraceRead[]` (200)

***

### Query Traces

```
GET /traces
```

| Parameter        | Type       | Default | Description                                                         |
| ---------------- | ---------- | ------- | ------------------------------------------------------------------- |
| `id`             | `string`   | —       | Filter by trace ID                                                  |
| `session_id`     | `string`   | —       | Filter by session ID                                                |
| `status`         | `string`   | —       | Filter by status                                                    |
| `created_before` | `ISO 8601` | —       | Upper bound on created\_at                                          |
| `created_after`  | `ISO 8601` | —       | Lower bound on created\_at                                          |
| `names`          | `string`   | —       | Comma-separated trace names                                         |
| `has_error`      | `bool`     | —       | `true` or `false`                                                   |
| `duration_min`   | `float`    | —       | Minimum duration (ms)                                               |
| `duration_max`   | `float`    | —       | Maximum duration (ms)                                               |
| `cost_min`       | `float`    | —       | Minimum cost                                                        |
| `cost_max`       | `float`    | —       | Maximum cost                                                        |
| `sort_by`        | `string`   | —       | `started_at`, `name`, `status`, `span_count`, `duration_ms`, `cost` |
| `sort_order`     | `string`   | `desc`  | `asc` or `desc`                                                     |
| `limit`          | `int`      | `50`    | 1–500                                                               |
| `offset`         | `int`      | `0`     | Pagination offset                                                   |

Tag filters work the same as for spans.

**Response:** `TraceRead[]` (200)

Each trace includes:

| Field              | Type       | Description               |
| ------------------ | ---------- | ------------------------- |
| `id`               | `string`   | Trace ID                  |
| `session_id`       | `string`   | Parent session            |
| `name`             | `string`   | Trace name                |
| `status`           | `string`   | `unset`, `ok`, `error`    |
| `started_at`       | `ISO 8601` | Start time                |
| `ended_at`         | `ISO 8601` | End time                  |
| `duration_ms`      | `float`    | Duration                  |
| `cost`             | `float`    | Total cost                |
| `span_count`       | `int`      | Number of spans           |
| `root_span_input`  | `string`   | Input from the root span  |
| `root_span_output` | `string`   | Output from the root span |
| `tags`             | `object`   | Tags                      |

***

## Sessions

### Create Sessions

```
POST /sessions
```

Create one or more sessions. Sessions are also auto-created when ingesting spans with a `session_id` or `session` field.

**Request body:** `SessionCreate[]`

| Field        | Type            | Required | Default        | Description                 |
| ------------ | --------------- | -------- | -------------- | --------------------------- |
| `project_id` | `string`        | Yes      | —              | Project ID                  |
| `name`       | `string`        | No       | `null`         | Human-readable session name |
| `id`         | `string (UUID)` | No       | auto-generated | Client-provided session ID  |
| `attributes` | `object`        | No       | `{}`           | Arbitrary attributes        |
| `tags`       | `object`        | No       | `{}`           | Key-value tags              |

**Response:** `SessionRead[]` (200)

***

### Query Sessions

```
GET /sessions
```

| Parameter        | Type       | Default | Description                                             |
| ---------------- | ---------- | ------- | ------------------------------------------------------- |
| `id`             | `string`   | —       | Filter by session ID                                    |
| `created_before` | `ISO 8601` | —       | Upper bound on created\_at                              |
| `created_after`  | `ISO 8601` | —       | Lower bound on created\_at                              |
| `names`          | `string`   | —       | Comma-separated session names                           |
| `has_error`      | `bool`     | —       | `true` or `false`                                       |
| `duration_min`   | `float`    | —       | Minimum duration                                        |
| `duration_max`   | `float`    | —       | Maximum duration                                        |
| `cost_min`       | `float`    | —       | Minimum cost                                            |
| `cost_max`       | `float`    | —       | Maximum cost                                            |
| `sort_by`        | `string`   | —       | `created_at`, `name`, `trace_count`, `cost`, `duration` |
| `sort_order`     | `string`   | `desc`  | `asc` or `desc`                                         |
| `limit`          | `int`      | `50`    | 1–500                                                   |
| `offset`         | `int`      | `0`     | Pagination offset                                       |

Tag filters work the same as for spans.

**Response:** `SessionRead[]` (200)

Each session includes:

| Field         | Type     | Description      |
| ------------- | -------- | ---------------- |
| `id`          | `string` | Session ID       |
| `project_id`  | `string` | Project ID       |
| `name`        | `string` | Session name     |
| `trace_count` | `int`    | Number of traces |
| `error_count` | `int`    | Number of errors |
| `cost`        | `float`  | Total cost       |
| `duration`    | `float`  | Total duration   |
| `tags`        | `object` | Tags             |

***

## Feedback

To retrieve feedback (human reviews and judge evaluations) for spans, traces, or sessions, use the [Feedback API](/feedback/api-reference#unified-entity-feedback).

***

## OpenTelemetry (OTLP)

### Ingest OTLP Traces

```
POST /v1/traces
```

Accepts standard OpenTelemetry Protocol (OTLP) trace data in JSON or protobuf format.

**Headers:**

| Header          | Value                                          |
| --------------- | ---------------------------------------------- |
| `Authorization` | `Bearer YOUR_ZEROEVAL_API_KEY`                 |
| `Content-Type`  | `application/json` or `application/x-protobuf` |

This endpoint is compatible with the OpenTelemetry Collector's `otlphttp` exporter. See [OpenTelemetry](/tracing/opentelemetry) for collector configuration.


# Introduction
Source: https://docs.zeroeval.com/tracing/introduction

Capture every step your AI agent takes so you can debug, evaluate, and optimize

Your agent makes dozens of decisions every run -- retrieving context, calling models, executing tools, generating responses. Without observability, failures are invisible, regressions go unnoticed, and optimization is guesswork.

ZeroEval Tracing captures the full execution graph of your AI system so you can:

* **Debug** failed runs by inspecting the exact inputs, outputs, and errors at every step
* **Evaluate** output quality at scale with [calibrated judges](/judges/introduction) that score your traces automatically
* **Optimize** prompts and models by comparing versions against real production data with [prompt optimization](/autotune/introduction)
* **Monitor** cost, latency, and error rates across sessions, traces, and spans

## How it works

<Steps>
  <Step title="Instrument your code">
    Add a few lines to your application. The SDK automatically captures LLM
    calls, or you can create custom spans for any operation.
  </Step>

  <Step title="Traces flow into ZeroEval">
    Every agent run becomes a trace -- a tree of spans showing what happened, in
    what order, with full inputs and outputs.
  </Step>

  <Step title="Organize with sessions and tags">
    Group related traces into sessions and tag them with metadata for filtering.
    Attach [human feedback](/feedback/human-feedback) or let [judges](/judges/introduction) evaluate outputs automatically.
  </Step>

  <Step title="Unlock the feedback loop">
    Use your traced data to run judges, optimize prompts, and build evaluations
    \-- all from the same production data.
  </Step>
</Steps>

## Get started

Create an API key from [Settings → API Keys](https://app.zeroeval.com/settings?section=api-keys), then pick your integration path:

<CardGroup>
  <Card title="Python SDK" icon="python" href="/tracing/sdks/python/setup">
    Decorators and context managers for Python apps. Auto-instruments OpenAI,
    LangChain, Gemini, and more.
  </Card>

  <Card title="TypeScript SDK" icon="js" href="/tracing/sdks/typescript/setup">
    Wrapper functions for Node.js and Bun. Auto-instruments OpenAI and Vercel AI
    SDK.
  </Card>

  <Card title="REST API" icon="code" href="/tracing/api-reference">
    Send spans, traces, and sessions directly over HTTP from any language.
  </Card>

  <Card title="OpenTelemetry" icon="tower-broadcast" href="/tracing/opentelemetry">
    Route OTLP traces from any OpenTelemetry-instrumented app to ZeroEval.
  </Card>
</CardGroup>

<Tip>
  Using Cursor, Claude Code, or another coding agent? The [`zeroeval-install`
  skill](/integrations/skills) can handle SDK setup, first trace, and prompt
  migration for you.
</Tip>


# OpenTelemetry
Source: https://docs.zeroeval.com/tracing/opentelemetry

Send traces to ZeroEval via the OpenTelemetry Protocol (OTLP)

ZeroEval accepts standard OTLP trace data at `POST /v1/traces`, so any OpenTelemetry-instrumented application can export directly -- through a collector, or straight from your app using an OTLP exporter.

## When to use OTLP

Use this integration when:

* Your application already uses OpenTelemetry for instrumentation
* You want to fan out traces to multiple backends (ZeroEval + Datadog, Jaeger, etc.)
* You're running infrastructure you can't modify but can route through a collector
* You prefer a vendor-neutral instrumentation layer

<Note>
  If you're starting fresh, the [Python SDK](/tracing/sdks/python/setup) or
  [TypeScript SDK](/tracing/sdks/typescript/setup) provide a simpler setup with
  automatic LLM instrumentation.
</Note>

## Endpoint Reference

```
POST https://api.zeroeval.com/v1/traces
```

| Header          | Value                                          |
| --------------- | ---------------------------------------------- |
| `Authorization` | `Bearer YOUR_ZEROEVAL_API_KEY`                 |
| `Content-Type`  | `application/json` or `application/x-protobuf` |

The endpoint accepts the standard `ExportTraceServiceRequest` payload defined in the [OTLP specification](https://opentelemetry.io/docs/specs/otlp/). Spans are converted to ZeroEval's internal format -- trace IDs, parent-child relationships, attributes, and status are all preserved.

***

## Option 1: OpenTelemetry Collector

Route traces through a collector when you need batching, processing, or multi-destination fan-out.

### Collector Configuration

Create `otel-collector-config.yaml`:

```yaml theme={null}
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlphttp:
    endpoint: https://api.zeroeval.com
    headers:
      Authorization: "Bearer ${env:ZEROEVAL_API_KEY}"
    traces_endpoint: https://api.zeroeval.com/v1/traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

### Run with Docker Compose

```yaml theme={null}
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
    environment:
      - ZEROEVAL_API_KEY=${ZEROEVAL_API_KEY}
    restart: unless-stopped
```

### Run with Kubernetes

<Accordion title="Full Kubernetes manifest">
  ```yaml theme={null}
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: otel-collector-config
  data:
    config.yaml: |
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
            http:
              endpoint: 0.0.0.0:4318
      processors:
        batch:
          timeout: 1s
        k8sattributes:
          extract:
            metadata:
              - k8s.namespace.name
              - k8s.deployment.name
              - k8s.pod.name
      exporters:
        otlphttp:
          endpoint: https://api.zeroeval.com
          headers:
            Authorization: "Bearer ${env:ZEROEVAL_API_KEY}"
          traces_endpoint: https://api.zeroeval.com/v1/traces
      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [batch, k8sattributes]
            exporters: [otlphttp]
  ---
  apiVersion: v1
  kind: Secret
  metadata:
    name: zeroeval-secret
  type: Opaque
  stringData:
    api-key: "YOUR_ZEROEVAL_API_KEY"
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: otel-collector
  spec:
    replicas: 2
    selector:
      matchLabels:
        app: otel-collector
    template:
      metadata:
        labels:
          app: otel-collector
      spec:
        containers:
          - name: otel-collector
            image: otel/opentelemetry-collector-contrib:latest
            args: ["--config=/etc/config.yaml"]
            env:
              - name: ZEROEVAL_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: zeroeval-secret
                    key: api-key
            ports:
              - containerPort: 4317
                name: otlp-grpc
              - containerPort: 4318
                name: otlp-http
            volumeMounts:
              - name: config
                mountPath: /etc/config.yaml
                subPath: config.yaml
        volumes:
          - name: config
            configMap:
              name: otel-collector-config
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: otel-collector
  spec:
    selector:
      app: otel-collector
    ports:
      - name: otlp-grpc
        port: 4317
      - name: otlp-http
        port: 4318
  ```
</Accordion>

***

## Option 2: Direct from Python

Export OTLP traces directly from your Python application without a collector. Use the `ZeroEvalOTLPProvider` included in the SDK, or configure a standard `OTLPSpanExporter`.

### Using ZeroEvalOTLPProvider

```python theme={null}
from opentelemetry import trace
from zeroeval.providers import ZeroEvalOTLPProvider

provider = ZeroEvalOTLPProvider(
    api_key="YOUR_ZEROEVAL_API_KEY",
    service_name="my-service"
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-service")

with tracer.start_as_current_span("process_request") as span:
    span.set_attribute("user.id", "12345")
    result = do_work()
    span.set_attribute("result.status", "ok")
```

### Using standard OTLPSpanExporter

```python theme={null}
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://api.zeroeval.com/v1/traces",
    headers={"Authorization": "Bearer YOUR_ZEROEVAL_API_KEY"}
)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```

***

## Option 3: Direct from Node.js

```typescript theme={null}
import { NodeTracerProvider } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";

const exporter = new OTLPTraceExporter({
  url: "https://api.zeroeval.com/v1/traces",
  headers: {
    Authorization: "Bearer YOUR_ZEROEVAL_API_KEY",
  },
});

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();
```

***

## Attribute Mapping

ZeroEval maps standard OpenTelemetry span attributes to its internal format:

| OTLP Attribute          | ZeroEval Field         | Notes                                      |
| ----------------------- | ---------------------- | ------------------------------------------ |
| `span.name`             | `name`                 | Span name                                  |
| `span.kind`             | `kind`                 | Mapped to `generic`, `llm`, etc.           |
| `span.status`           | `status`               | `ok`, `error`, `unset`                     |
| `span.start_time`       | `started_at`           | Nanosecond timestamp converted to ISO 8601 |
| `span.end_time`         | `ended_at`             | Nanosecond timestamp converted to ISO 8601 |
| `span.trace_id`         | `trace_id`             | Hex-encoded trace ID                       |
| `span.parent_span_id`   | `parent_span_id`       | Hex-encoded span ID                        |
| `span.attributes.*`     | `attributes`           | All attributes preserved                   |
| `resource.service.name` | Service identification | Used for grouping                          |

### LLM Spans

To get LLM-specific features (cost calculation, token tracking), set these attributes on your spans:

| Attribute                                               | Description                                 |
| ------------------------------------------------------- | ------------------------------------------- |
| `llm.provider` or `gen_ai.system`                       | Provider name (`openai`, `anthropic`, etc.) |
| `llm.model` or `gen_ai.request.model`                   | Model identifier                            |
| `llm.input_tokens` or `gen_ai.usage.prompt_tokens`      | Input token count                           |
| `llm.output_tokens` or `gen_ai.usage.completion_tokens` | Output token count                          |

### Sessions

Attach session context to your OTLP spans via attributes:

| Attribute               | Description                    |
| ----------------------- | ------------------------------ |
| `zeroeval.session.id`   | Session ID for grouping traces |
| `zeroeval.session.name` | Human-readable session name    |

The `ZeroEvalOTLPProvider` stamps these automatically when you configure a session via environment variables (`ZEROEVAL_SESSION_ID`, `ZEROEVAL_SESSION_NAME`).


# Integrations
Source: https://docs.zeroeval.com/tracing/sdks/python/integrations

Automatic instrumentation for popular AI/ML frameworks

The [ZeroEval Python SDK](https://pypi.org/project/zeroeval/) automatically traces intruments the supported integrations, meaning the only thing to do is to initialize the SDK before importing the frameworks you want to trace.

## OpenAI

```python theme={null}
import zeroeval as ze
ze.init()

import openai
client = openai.OpenAI()

# This call is automatically traced
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Streaming is also automatically traced
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

## LangChain

```python theme={null}
import zeroeval as ze
ze.init()

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# All components are automatically traced
model = ChatOpenAI()
prompt = ChatPromptTemplate.from_template("Tell me about {topic}")
chain = prompt | model

response = chain.invoke({"topic": "AI"})
```

## LangGraph

```python theme={null}
import zeroeval as ze
ze.init()

from langgraph.graph import StateGraph, START, END
from langchain_core.messages import HumanMessage

# Define a multi-node graph
workflow = StateGraph(AgentState)
workflow.add_node("reasoning", reasoning_node)
workflow.add_node("agent", agent_node)
workflow.add_node("tools", tool_node)

workflow.add_conditional_edges(
    "agent",
    should_continue,
    {"tools": "tools", "end": END}
)

app = workflow.compile()

# Full graph execution is automatically traced
result = app.invoke({"messages": [HumanMessage(content="Help me plan a trip")]})

# Streaming is also supported
for chunk in app.stream({"messages": [HumanMessage(content="Hello")]}):
    print(chunk)
```

## PydanticAI

PydanticAI agents are automatically traced, including multi-turn conversations. The SDK ensures that all LLM calls within an agent execution share the same trace, and consecutive conversation turns share the same trace ID when using shared message history.

```python theme={null}
import zeroeval as ze
ze.init()

from pydantic_ai import Agent
from pydantic import BaseModel

class Response(BaseModel):
    message: str
    sentiment: str

# Create an agent with structured output
agent = Agent(
    model="openai:gpt-4o-mini",
    output_type=Response,
    system_prompt="You are a helpful assistant."
)

# Single execution - automatically traced
result = await agent.run("Hello!")

# Multi-turn conversation - all turns share the same trace
message_history = []

async with agent.iter("First message", message_history=message_history) as run:
    async for node in run:
        pass
    message_history = run.result.all_messages()

# Second turn reuses the same trace_id
async with agent.iter("Follow-up message", message_history=message_history) as run:
    async for node in run:
        pass
    message_history = run.result.all_messages()
```

<Info>
  When you pass the same `message_history` list across multiple agent runs, ZeroEval automatically groups all runs under a single trace. This provides a unified view of the entire conversation.
</Info>

## LiveKit

The SDK automatically creates traces for LiveKit agents, including events from the following plugins:

* Cartesia (TTS)
* Deepgram (STT)
* OpenAI (LLM)

```python theme={null}
import zeroeval as ze
ze.init()

from livekit import agents
from livekit.agents import AgentSession, Agent
from livekit.plugins import openai

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    # All agent sessions are automatically traced
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="coral")
    )

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice AI assistant.")
    )

    # Agent interactions are automatically captured
    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

<Note>
  Need help? Contact us at [founders@zeroeval.com](mailto:founders@zeroeval.com) or join our [Discord](https://discord.gg/MuExkGMNVz).
</Note>


# Reference
Source: https://docs.zeroeval.com/tracing/sdks/python/reference

Complete API reference for the Python SDK

## Installation

```bash theme={null}
pip install zeroeval
```

## Core Functions

### `init()`

Initializes the ZeroEval SDK. Must be called before using any other SDK features.

```python theme={null}
def init(
    api_key: str = None,
    workspace_name: str = "Personal Organization",
    organization_name: str = None,
    debug: bool = False,
    api_url: str = None,
    disabled_integrations: list[str] = None,
    enabled_integrations: list[str] = None,
    setup_otlp: bool = True,
    service_name: str = "zeroeval-app",
    tags: dict[str, str] = None,
    sampling_rate: float = None
) -> None
```

| Parameter               | Type             | Default                      | Description                                       |
| ----------------------- | ---------------- | ---------------------------- | ------------------------------------------------- |
| `api_key`               | `str`            | `None`                       | API key. Falls back to `ZEROEVAL_API_KEY` env var |
| `workspace_name`        | `str`            | `"Personal Organization"`    | Deprecated -- use `organization_name`             |
| `organization_name`     | `str`            | `None`                       | Organization name                                 |
| `debug`                 | `bool`           | `False`                      | Enable debug logging with colors                  |
| `api_url`               | `str`            | `"https://api.zeroeval.com"` | API endpoint URL                                  |
| `disabled_integrations` | `list[str]`      | `None`                       | Integrations to disable (e.g. `["langchain"]`)    |
| `enabled_integrations`  | `list[str]`      | `None`                       | Only enable these integrations                    |
| `setup_otlp`            | `bool`           | `True`                       | Configure OpenTelemetry OTLP export               |
| `service_name`          | `str`            | `"zeroeval-app"`             | OTLP service name                                 |
| `tags`                  | `dict[str, str]` | `None`                       | Global tags applied to all spans                  |
| `sampling_rate`         | `float`          | `None`                       | Sampling rate 0.0-1.0 (1.0 = sample all)          |

**Example:**

```python theme={null}
import zeroeval as ze

ze.init(
    api_key="your-api-key",
    sampling_rate=0.1,
    disabled_integrations=["langchain"],
    debug=True
)
```

## Decorators

### `@span`

Decorator and context manager for creating spans around code blocks.

```python theme={null}
@span(
    name: str,
    session_id: Optional[str] = None,
    session: Optional[Union[str, dict[str, str]]] = None,
    attributes: Optional[dict[str, Any]] = None,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None,
    tags: Optional[dict[str, str]] = None
)
```

**Parameters:**

* `name` (str): Name of the span
* `session_id` (str, optional): **Deprecated** - Use `session` parameter instead
* `session` (Union\[str, dict], optional): Session information. Can be:
  * A string containing the session ID
  * A dict with `{"id": "...", "name": "..."}`
* `attributes` (dict, optional): Additional attributes to attach to the span
* `input_data` (str, optional): Manual input data override
* `output_data` (str, optional): Manual output data override
* `tags` (dict, optional): Tags to attach to the span

**Usage as Decorator:**

```python theme={null}
import zeroeval as ze

@ze.span(name="calculate_sum")
def add_numbers(a: int, b: int) -> int:
    return a + b  # Parameters and return value automatically captured

# With manual I/O
@ze.span(name="process_data", input_data="manual input", output_data="manual output")
def process():
    # Process logic here
    pass

# With session
@ze.span(name="user_action", session={"id": "123", "name": "John's Session"})
def user_action():
    pass
```

**Usage as Context Manager:**

```python theme={null}
import zeroeval as ze

with ze.span(name="data_processing") as current_span:
    result = process_data()
    current_span.set_io(input_data="input", output_data=str(result))
```

### `artifact_span`

Ergonomic wrapper for creating artifact-bearing spans. Produces a span with `kind="llm"` and the `completion_artifact_*` attributes pre-filled so the prompt completions page surfaces it as a first-class artifact.

```python theme={null}
artifact_span(
    name: str,
    *,
    artifact_type: str,
    role: str = "primary",
    label: Optional[str] = None,
    kind: str = "llm",
    session_id: Optional[str] = None,
    session: Optional[Union[str, dict[str, str]]] = None,
    attributes: Optional[dict[str, Any]] = None,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None,
    tags: Optional[dict[str, str]] = None,
)
```

**Parameters:**

* `name` (str): Name of the span
* `artifact_type` (str): Artifact type identifier (e.g. `"final_decision"`, `"customer_card"`)
* `role` (str): `"primary"` (used for row preview) or `"secondary"`. Defaults to `"primary"`
* `label` (str, optional): Human-friendly label shown in the artifact switcher. Defaults to the span name
* `kind` (str): Span kind. Defaults to `"llm"`
* `session` (Union\[str, dict], optional): Session information
* `attributes` (dict, optional): Additional attributes merged with artifact metadata. Artifact keys take precedence
* `input_data` (str, optional): Manual input data override
* `output_data` (str, optional): Manual output data override
* `tags` (dict, optional): Tags to attach to the span

**Usage as Context Manager:**

```python theme={null}
import zeroeval as ze

with ze.artifact_span(
    name="final-decision",
    artifact_type="final_decision",
    role="primary",
    label="Final Decision",
    tags={"judge_target": "support_ops_final_decision"},
) as s:
    s.set_io(input_data="ticket text", output_data=decision_json)
```

**Usage as Decorator:**

```python theme={null}
@ze.artifact_span(
    name="generate-card",
    artifact_type="customer_card",
    role="secondary",
    label="Customer Card",
)
def generate_card(ticket):
    return render_card(ticket)
```

<Note>
  `ze.artifact_span` is available in the **Python SDK only** for now.
</Note>

### `@experiment`

Decorator that attaches dataset and model information to a function.

```python theme={null}
@experiment(
    dataset: Optional[Dataset] = None,
    model: Optional[str] = None
)
```

**Parameters:**

* `dataset` (Dataset, optional): Dataset to use for the experiment
* `model` (str, optional): Model identifier

**Example:**

```python theme={null}
import zeroeval as ze

dataset = ze.Dataset.pull("my-dataset")

@ze.experiment(dataset=dataset, model="gpt-4")
def my_experiment():
    # Experiment logic
    pass
```

## Classes

### `Dataset`

A class to represent a named collection of dictionary records.

#### Constructor

```python theme={null}
Dataset(
    name: str,
    data: list[dict[str, Any]],
    description: Optional[str] = None
)
```

**Parameters:**

* `name` (str): The name of the dataset
* `data` (list\[dict]): A list of dictionaries containing the data
* `description` (str, optional): A description of the dataset

**Example:**

```python theme={null}
dataset = Dataset(
    name="Capitals",
    description="Country to capital mapping",
    data=[
        {"input": "France", "output": "Paris"},
        {"input": "Germany", "output": "Berlin"}
    ]
)
```

#### Methods

##### `push()`

Push the dataset to the backend, creating a new version if it already exists.

```python theme={null}
def push(self, create_new_version: bool = False) -> Dataset
```

**Parameters:**

* `self`: The Dataset instance
* `create_new_version` (bool, optional): For backward compatibility. This parameter is no longer needed as new versions are automatically created when a dataset name already exists. Defaults to False

**Returns:** Returns self for method chaining

##### `pull()`

Static method to pull a dataset from the backend.

```python theme={null}
@classmethod
def pull(
    cls,
    dataset_name: str,
    version_number: Optional[int] = None
) -> Dataset
```

**Parameters:**

* `cls`: The Dataset class itself (automatically provided when using `@classmethod`)
* `dataset_name` (str): The name of the dataset to pull from the backend
* `version_number` (int, optional): Specific version number to pull. If not provided, pulls the latest version

**Returns:** A new Dataset instance populated with data from the backend

##### `add_rows()`

Add new rows to the dataset.

```python theme={null}
def add_rows(self, new_rows: list[dict[str, Any]]) -> None
```

**Parameters:**

* `self`: The Dataset instance
* `new_rows` (list\[dict]): A list of dictionaries representing the rows to add

##### `add_image()`

Add an image to a specific row.

```python theme={null}
def add_image(
    self,
    row_index: int,
    column_name: str,
    image_path: str
) -> None
```

**Parameters:**

* `self`: The Dataset instance
* `row_index` (int): Index of the row to update (0-based)
* `column_name` (str): Name of the column to add the image to
* `image_path` (str): Path to the image file to add

##### `add_audio()`

Add audio to a specific row.

```python theme={null}
def add_audio(
    self,
    row_index: int,
    column_name: str,
    audio_path: str
) -> None
```

**Parameters:**

* `self`: The Dataset instance
* `row_index` (int): Index of the row to update (0-based)
* `column_name` (str): Name of the column to add the audio to
* `audio_path` (str): Path to the audio file to add

##### `add_media_url()`

Add a media URL to a specific row.

```python theme={null}
def add_media_url(
    self,
    row_index: int,
    column_name: str,
    media_url: str,
    media_type: str = "image"
) -> None
```

**Parameters:**

* `self`: The Dataset instance
* `row_index` (int): Index of the row to update (0-based)
* `column_name` (str): Name of the column to add the media URL to
* `media_url` (str): URL pointing to the media file
* `media_type` (str, optional): Type of media - "image", "audio", or "video". Defaults to "image"

#### Properties

* `name` (str): The name of the dataset
* `description` (str): The description of the dataset
* `columns` (list\[str]): List of all unique column names
* `data` (list\[dict]): List of the data portion for each row
* `backend_id` (str): The ID in the backend (after pushing)
* `version_id` (str): The version ID in the backend
* `version_number` (int): The version number in the backend

#### Example

```python theme={null}
import zeroeval as ze

# Create a dataset
dataset = ze.Dataset(
    name="Capitals",
    description="Country to capital mapping",
    data=[
        {"input": "France", "output": "Paris"},
        {"input": "Germany", "output": "Berlin"}
    ]
)

# Push to backend
dataset.push()

# Pull from backend
dataset = ze.Dataset.pull("Capitals", version_number=1)

# Add rows
dataset.add_rows([{"input": "Italy", "output": "Rome"}])

# Add multimodal data
dataset.add_image(0, "flag", "flags/france.png")
dataset.add_audio(0, "anthem", "anthems/france.mp3")
dataset.add_media_url(0, "video_url", "https://example.com/video.mp4", "video")
```

### `Experiment`

Represents an experiment that runs a task on a dataset with optional evaluators.

#### Constructor

```python theme={null}
Experiment(
    dataset: Dataset,
    task: Callable[[Any], Any],
    evaluators: Optional[list[Callable[[Any, Any], Any]]] = None,
    name: Optional[str] = None,
    description: Optional[str] = None
)
```

**Parameters:**

* `dataset` (Dataset): The dataset to run the experiment on
* `task` (Callable): Function that processes each row and returns output
* `evaluators` (list\[Callable], optional): List of evaluator functions that take (row, output) and return evaluation result
* `name` (str, optional): Name of the experiment. Defaults to task function name
* `description` (str, optional): Description of the experiment. Defaults to task function's docstring

**Example:**

```python theme={null}
import zeroeval as ze

ze.init()

# Pull dataset
dataset = ze.Dataset.pull("Capitals")

# Define task
def capitalize_task(row):
    return row["input"].upper()

# Define evaluator
def exact_match(row, output):
    return row["output"].upper() == output

# Create and run experiment
exp = ze.Experiment(
    dataset=dataset,
    task=capitalize_task,
    evaluators=[exact_match],
    name="Capital Uppercase Test"
)

results = exp.run()

# Or run task and evaluators separately
results = exp.run_task()
exp.run_evaluators([exact_match], results)
```

#### Methods

##### `run()`

Run the complete experiment (task + evaluators).

```python theme={null}
def run(
    self,
    subset: Optional[list[dict]] = None
) -> list[ExperimentResult]
```

**Parameters:**

* `self`: The Experiment instance
* `subset` (list\[dict], optional): Subset of dataset rows to run the experiment on. If None, runs on entire dataset

**Returns:** List of experiment results for each row

##### `run_task()`

Run only the task without evaluators.

```python theme={null}
def run_task(
    self,
    subset: Optional[list[dict]] = None,
    raise_on_error: bool = False
) -> list[ExperimentResult]
```

**Parameters:**

* `self`: The Experiment instance
* `subset` (list\[dict], optional): Subset of dataset rows to run the task on. If None, runs on entire dataset
* `raise_on_error` (bool, optional): If True, raises exceptions encountered during task execution. If False, captures errors. Defaults to False

**Returns:** List of experiment results for each row

##### `run_evaluators()`

Run evaluators on existing results.

```python theme={null}
def run_evaluators(
    self,
    evaluators: Optional[list[Callable[[Any, Any], Any]]] = None,
    results: Optional[list[ExperimentResult]] = None
) -> list[ExperimentResult]
```

**Parameters:**

* `self`: The Experiment instance
* `evaluators` (list\[Callable], optional): List of evaluator functions to run. If None, uses evaluators from the Experiment instance
* `results` (list\[ExperimentResult], optional): List of results to evaluate. If None, uses results from the Experiment instance

**Returns:** The evaluated results

### `Span`

Represents a span in the tracing system. Usually created via the `@span` decorator.

#### Methods

##### `set_io()`

Set input and output data for the span.

```python theme={null}
def set_io(
    self,
    input_data: Optional[str] = None,
    output_data: Optional[str] = None
) -> None
```

**Parameters:**

* `self`: The Span instance
* `input_data` (str, optional): Input data to attach to the span. Will be converted to string if not already
* `output_data` (str, optional): Output data to attach to the span. Will be converted to string if not already

##### `set_tags()`

Set tags on the span.

```python theme={null}
def set_tags(self, tags: dict[str, str]) -> None
```

**Parameters:**

* `self`: The Span instance
* `tags` (dict\[str, str]): Dictionary of tags to set on the span

##### `set_attributes()`

Set attributes on the span.

```python theme={null}
def set_attributes(self, attributes: dict[str, Any]) -> None
```

**Parameters:**

* `self`: The Span instance
* `attributes` (dict\[str, Any]): Dictionary of attributes to set on the span

##### `set_error()`

Set error information for the span.

```python theme={null}
def set_error(
    self,
    code: str,
    message: str,
    stack: Optional[str] = None
) -> None
```

**Parameters:**

* `self`: The Span instance
* `code` (str): Error code or exception class name
* `message` (str): Error message
* `stack` (str, optional): Stack trace information

##### `add_screenshot()`

Attach a screenshot to the span for visual evaluation by LLM judges. Screenshots are uploaded during ingestion and can be evaluated alongside text data.

```python theme={null}
def add_screenshot(
    self,
    base64_data: str,
    viewport: str = "desktop",
    width: Optional[int] = None,
    height: Optional[int] = None,
    label: Optional[str] = None
) -> None
```

**Parameters:**

* `self`: The Span instance
* `base64_data` (str): Base64 encoded image data. Accepts raw base64 or data URL format (`data:image/png;base64,...`)
* `viewport` (str, optional): Viewport type - `"desktop"`, `"mobile"`, or `"tablet"`. Defaults to `"desktop"`
* `width` (int, optional): Image width in pixels
* `height` (int, optional): Image height in pixels
* `label` (str, optional): Human-readable description of the screenshot

**Example:**

```python theme={null}
import zeroeval as ze

with ze.span(name="browser_test", tags={"test": "visual"}) as span:
    # Capture and attach a desktop screenshot
    span.add_screenshot(
        base64_data=desktop_screenshot_base64,
        viewport="desktop",
        width=1920,
        height=1080,
        label="Homepage - Desktop"
    )

    # Also capture mobile view
    span.add_screenshot(
        base64_data=mobile_screenshot_base64,
        viewport="mobile",
        width=375,
        height=812,
        label="Homepage - iPhone"
    )

    span.set_io(
        input_data="Navigate to homepage",
        output_data="Captured viewport screenshots"
    )
```

##### `add_image()`

Attach a generic image to the span for visual evaluation. Use this for non-screenshot images like charts, diagrams, or UI component states.

```python theme={null}
def add_image(
    self,
    base64_data: str,
    label: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None
) -> None
```

**Parameters:**

* `self`: The Span instance
* `base64_data` (str): Base64 encoded image data. Accepts raw base64 or data URL format
* `label` (str, optional): Human-readable description of the image
* `metadata` (dict, optional): Additional metadata to store with the image

**Example:**

```python theme={null}
import zeroeval as ze

with ze.span(name="chart_generation") as span:
    # Generate a chart and attach it
    chart_base64 = generate_chart(data)

    span.add_image(
        base64_data=chart_base64,
        label="Monthly Revenue Chart",
        metadata={"chart_type": "bar", "data_points": 12}
    )

    span.set_io(
        input_data="Generate revenue chart for Q4",
        output_data="Chart generated with 12 data points"
    )
```

##### Attaching images via URL (S3 presigned or CDN)

If your images are already hosted externally, you can pass an HTTPS URL instead of base64 data. ZeroEval will download, validate, and copy the image into its own storage during ingestion.

Supported URL sources:

* **S3 presigned URLs** (`*.amazonaws.com` with valid authentication parameters)
* **CDN URLs** from trusted domains

Attach URLs directly via `attributes.attachments` using the `url` key:

```python theme={null}
import boto3
import zeroeval as ze

# Option A: Presigned S3 URL
s3 = boto3.client("s3")
presigned_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "images/chart.png"},
    ExpiresIn=300,
)

with ze.span(name="chart_generation") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": presigned_url,
            "label": "Monthly Revenue Chart",
        }
    ]

    span.set_io(
        input_data="Generate revenue chart for Q4",
        output_data="Chart generated"
    )
```

```python theme={null}
import zeroeval as ze

# Option B: CDN URL
cdn_url = "https://cdn.example.com/images/product-photo.png"

with ze.span(name="product_image_check") as span:
    span.attributes["attachments"] = [
        {
            "type": "image",
            "url": cdn_url,
            "label": "Product listing photo",
        }
    ]

    span.set_io(
        input_data="Check product image quality",
        output_data="Image attached for evaluation"
    )
```

<Note>
  Images attached to spans can be evaluated by LLM judges configured for
  multimodal evaluation. See the [Multimodal
  Evaluation](/judges/multimodal-evaluation) guide for setup instructions.
</Note>

## Context Functions

### `get_current_span()`

Returns the currently active span, if any.

```python theme={null}
def get_current_span() -> Optional[Span]
```

**Returns:** The currently active Span instance, or None if no span is active

### `get_current_trace()`

Returns the current trace ID.

```python theme={null}
def get_current_trace() -> Optional[str]
```

**Returns:** The current trace ID, or None if no trace is active

### `get_current_session()`

Returns the current session ID.

```python theme={null}
def get_current_session() -> Optional[str]
```

**Returns:** The current session ID, or None if no session is active

### `set_tag()`

Sets tags on a span, trace, or session.

```python theme={null}
def set_tag(
    target: Union[Span, str],
    tags: dict[str, str]
) -> None
```

**Parameters:**

* `target`: The target to set tags on
  * `Span`: Sets tags on the specific span
  * `str`: Sets tags on the trace (if valid trace ID) or session (if valid session ID)
* `tags` (dict\[str, str]): Dictionary of tags to set

**Example:**

```python theme={null}
import zeroeval as ze

# Set tags on current span
current_span = ze.get_current_span()
if current_span:
    ze.set_tag(current_span, {"user_id": "12345", "environment": "production"})

# Set tags on trace
trace_id = ze.get_current_trace()
if trace_id:
    ze.set_tag(trace_id, {"version": "1.5"})
```

## Judge Feedback APIs

### `send_feedback()`

Programmatically submit user feedback for a completion or judge evaluation.

```python theme={null}
def send_feedback(
    *,
    prompt_slug: str,
    completion_id: str,
    thumbs_up: bool,
    reason: Optional[str] = None,
    expected_output: Optional[str] = None,
    metadata: Optional[dict] = None,
    judge_id: Optional[str] = None,
    expected_score: Optional[float] = None,
    score_direction: Optional[str] = None,
    criteria_feedback: Optional[dict] = None
) -> dict
```

**Notes:**

* Existing usage without `criteria_feedback` is unchanged.
* `criteria_feedback` is optional and supported for scored judges.
* `judge_id` is required when sending `expected_score`, `score_direction`, or `criteria_feedback`.

### `get_judge_criteria()`

Fetch normalized criteria metadata for a judge (useful before criterion-level feedback).

```python theme={null}
def get_judge_criteria(
    project_id: str,
    judge_id: str
) -> dict
```

**Returns:**

* `judge_id`
* `evaluation_type`
* `score_min`, `score_max`, `pass_threshold`
* `criteria` (list of `{key, label, description}`)

## CLI Commands

The ZeroEval SDK includes a CLI tool for running experiments and setup.

### `zeroeval run`

Run a Python script containing ZeroEval experiments.

```bash theme={null}
zeroeval run script.py
```

### `zeroeval setup`

Interactive setup to configure API credentials.

```bash theme={null}
zeroeval setup
```

## Environment Variables

Set before importing ZeroEval to configure default behavior.

| Variable                         | Type    | Default                      | Description                             |
| -------------------------------- | ------- | ---------------------------- | --------------------------------------- |
| `ZEROEVAL_API_KEY`               | string  | `""`                         | API key for authentication              |
| `ZEROEVAL_API_URL`               | string  | `"https://api.zeroeval.com"` | API endpoint URL                        |
| `ZEROEVAL_WORKSPACE_NAME`        | string  | `"Personal Workspace"`       | Workspace name                          |
| `ZEROEVAL_SESSION_ID`            | string  | auto-generated               | Session ID for grouping traces          |
| `ZEROEVAL_SESSION_NAME`          | string  | `""`                         | Human-readable session name             |
| `ZEROEVAL_SAMPLING_RATE`         | float   | `"1.0"`                      | Sampling rate (0.0-1.0)                 |
| `ZEROEVAL_DISABLED_INTEGRATIONS` | string  | `""`                         | Comma-separated integrations to disable |
| `ZEROEVAL_DEBUG`                 | boolean | `"false"`                    | Enable debug logging                    |

```bash theme={null}
export ZEROEVAL_API_KEY="ze_1234567890abcdef"
export ZEROEVAL_SAMPLING_RATE="0.1"
export ZEROEVAL_DEBUG="true"
```

## Runtime Configuration

Configure after initialization via `ze.tracer.configure()`.

| Parameter              | Type              | Default | Description                          |
| ---------------------- | ----------------- | ------- | ------------------------------------ |
| `flush_interval`       | `float`           | `1.0`   | Flush frequency in seconds           |
| `max_spans`            | `int`             | `20`    | Buffer size before forced flush      |
| `collect_code_details` | `bool`            | `True`  | Capture code details in spans        |
| `integrations`         | `dict[str, bool]` | `{}`    | Enable/disable specific integrations |
| `sampling_rate`        | `float`           | `None`  | Sampling rate (0.0-1.0)              |

```python theme={null}
ze.tracer.configure(
    flush_interval=0.5,
    max_spans=100,
    sampling_rate=0.05,
    integrations={"openai": True, "langchain": False}
)
```

## Available Integrations

| Integration            | Name          | Auto-Instruments     |
| ---------------------- | ------------- | -------------------- |
| `OpenAIIntegration`    | `"openai"`    | OpenAI client calls  |
| `GeminiIntegration`    | `"gemini"`    | Google Gemini calls  |
| `LangChainIntegration` | `"langchain"` | LangChain components |
| `LangGraphIntegration` | `"langgraph"` | LangGraph workflows  |
| `HttpxIntegration`     | `"httpx"`     | HTTPX requests       |
| `VocodeIntegration`    | `"vocode"`    | Vocode voice SDK     |

Control integrations via:

* **Environment:** `ZEROEVAL_DISABLED_INTEGRATIONS="langchain,langgraph"`
* **Init:** `disabled_integrations=["langchain"]` or `enabled_integrations=["openai"]`
* **Runtime:** `ze.tracer.configure(integrations={"langchain": False})`

## Configuration Examples

### Production

```python theme={null}
ze.init(
    api_key="your_key",
    sampling_rate=0.05,
    debug=False,
    disabled_integrations=["langchain"]
)

ze.tracer.configure(
    flush_interval=0.5,
    max_spans=100
)
```

### Development

```python theme={null}
ze.init(
    api_key="your_key",
    debug=True,
    sampling_rate=1.0
)
```

### Memory-Optimized

```python theme={null}
ze.tracer.configure(
    max_spans=5,
    collect_code_details=False,
    flush_interval=2.0
)
```


# Setup
Source: https://docs.zeroeval.com/tracing/sdks/python/setup

Get started with ZeroEval tracing in Python applications

The [ZeroEval Python SDK](https://pypi.org/project/zeroeval/) provides seamless integration with your Python applications through automatic instrumentation and a simple decorator-based API.

## Installation

<CodeGroup>
  ```bash pip theme={null}
  pip install zeroeval
  ```

  ```bash poetry theme={null}
  poetry add zeroeval
  ```
</CodeGroup>

## Basic Setup

```python theme={null}
import zeroeval as ze

# Option 1: ZEROEVAL_API_KEY in your environment variable file
ze.init()

# Option 2: Provide API key directly from
#           https://app.zeroeval.com/settings?tab=api-keys
ze.init(api_key="YOUR_API_KEY")
```

<Note>
  Run `zeroeval setup` once to save your API key securely to
  `~/.config/zeroeval/config.json`
</Note>

## Patterns

### Decorators

The `@span` decorator is the easiest way to add tracing:

```python theme={null}
import zeroeval as ze

@ze.span(name="fetch_data")
def fetch_data(user_id: str):
    # Function arguments are automatically captured as inputs
    # Return values are automatically captured as outputs
    return {"user_id": user_id, "name": "John Doe"}

@ze.span(name="process_data", attributes={"version": "1.0"})
def process_data(data: dict):
    # Add custom attributes for better filtering
    return f"Welcome, {data['name']}!"
```

### Context Manager

For more control over span lifecycles:

```python theme={null}
import zeroeval as ze

def complex_workflow():
    with ze.span(name="data_pipeline") as pipeline_span:
        # Fetch stage
        with ze.span(name="fetch_stage") as fetch_span:
            data = fetch_external_data()
            fetch_span.set_io(output_data=str(data))

        # Process stage
        with ze.span(name="process_stage") as process_span:
            processed = transform_data(data)
            process_span.set_io(
                input_data=str(data),
                output_data=str(processed)
            )

        # Save stage
        with ze.span(name="save_stage") as save_span:
            result = save_to_database(processed)
            save_span.set_io(output_data=f"Saved {result} records")
```

### Artifact Spans

When a single prompt run produces multiple judged outputs (e.g. a final decision and a customer card), use `ze.artifact_span` to mark each output as a named artifact. The prompt completions page surfaces the primary artifact as the row preview and lets you switch between artifacts in the detail view.

```python theme={null}
import zeroeval as ze

def resolve_ticket(ticket):
    with ze.span(name="resolve-ticket"):
        system_prompt = ze.prompt(name="support-copilot", content="...")
        decision = run_agent(system_prompt, ticket)

        with ze.artifact_span(
            name="final-decision",
            artifact_type="final_decision",
            role="primary",
            label="Final Decision",
            tags={"judge_target": "support_ops_final_decision"},
        ) as s:
            s.set_io(input_data=ticket.body, output_data=decision.json())

        with ze.artifact_span(
            name="customer-card",
            artifact_type="customer_card",
            role="secondary",
            label="Customer Card",
            tags={"has_customer_card": "true"},
        ) as card:
            card.set_io(input_data=ticket.summary, output_data=decision.json())
            card.add_image(base64_data=render_card(decision))
```

`artifact_span` defaults to `kind="llm"` and writes the `completion_artifact_*` attributes automatically. All other `ze.span` features (tags, sessions, prompt metadata inheritance) work the same way.

<Note>
  `ze.artifact_span` is available in the **Python SDK only** for now.
</Note>

## Advanced Configuration

Fine-tune the tracer behavior:

```python theme={null}
from zeroeval.observability.tracer import tracer

# Configure tracer settings
tracer.configure(
    flush_interval=5.0,        # Flush every 5 seconds
    max_spans=200,             # Buffer up to 200 spans
    collect_code_details=True  # Capture source code context
)
```

## Context

Access current context information:

```python theme={null}
# Get the current span
current_span = ze.get_current_span()

# Get the current trace ID
trace_id = ze.get_current_trace()

# Get the current session ID
session_id = ze.get_current_session()
```

## Sessions

Sessions group related spans together, making it easier to track complex workflows, user interactions, or multi-step processes.

### Basic Session

Provide a session ID to associate spans with a session:

```python theme={null}
import uuid
import zeroeval as ze

session_id = str(uuid.uuid4())

@ze.span(name="process_request", session=session_id)
def process_request(data):
    return transform_data(data)
```

### Named Sessions

For better organization in the dashboard, provide both an ID and a name:

```python theme={null}
@ze.span(
    name="user_interaction",
    session={
        "id": session_id,
        "name": "Customer Support Chat - User #12345"
    }
)
def handle_support_chat(user_id, message):
    return generate_response(message)
```

### Session Inheritance

Child spans automatically inherit the session from their parent:

```python theme={null}
session_info = {
    "id": str(uuid.uuid4()),
    "name": "Order Processing Pipeline"
}

@ze.span(name="process_order", session=session_info)
def process_order(order_id):
    validate_order(order_id)
    charge_payment(order_id)
    fulfill_order(order_id)

@ze.span(name="validate_order")
def validate_order(order_id):
    return check_inventory(order_id)

@ze.span(name="charge_payment")
def charge_payment(order_id):
    return process_payment(order_id)
```

### Context Manager Sessions

```python theme={null}
session_info = {
    "id": str(uuid.uuid4()),
    "name": "Data Pipeline Run"
}

with ze.span(name="etl_pipeline", session=session_info) as pipeline_span:
    with ze.span(name="extract_data") as extract_span:
        raw_data = fetch_from_source()
        extract_span.set_io(output_data=f"Extracted {len(raw_data)} records")

    with ze.span(name="transform_data") as transform_span:
        clean_data = transform_records(raw_data)
        transform_span.set_io(
            input_data=f"{len(raw_data)} raw records",
            output_data=f"{len(clean_data)} clean records"
        )
```

## Tags

Tags are key-value pairs attached to spans, traces, or sessions. They power the facet filters in the console so you can slice your telemetry by user, plan, model, tenant, or anything else.

### Tag Once, Inherit Everywhere

Tags on the first span automatically flow down to all child spans:

```python theme={null}
@ze.span(
    name="handle_request",
    tags={
        "user_id": "42",
        "tenant": "acme-corp",
        "plan": "enterprise"
    }
)
def handle_request():
    with ze.span(name="fetch_data"):
        ...

    with ze.span(name="process", tags={"stage": "post"}):
        ...
```

### Tag a Single Span

Tags provided on a specific span stay only on that span -- they are not copied to siblings or parents:

```python theme={null}
@ze.span(name="top_level")
def top_level():
    with ze.span(name="db_call", tags={"table": "customers", "operation": "SELECT"}):
        query_database()

    with ze.span(name="render"):
        render_template()
```

### Granular Tagging

Add tags at the span, trace, or session level after creation:

```python theme={null}
with ze.span(name="root_invoke", session=session_info, tags={"run": "invoke"}):
    current_span = ze.get_current_span()
    ze.set_tag(current_span, {"phase": "pre-run"})

    current_trace = ze.get_current_trace()
    ze.set_tag(current_trace, {"run_mode": "invoke"})

    current_session = ze.get_current_session()
    ze.set_tag(current_session, {"env": "local"})
```

## Feedback

To attach human or programmatic feedback to completions, see [Human Feedback](/feedback/human-feedback) and the [Feedback SDK docs](/feedback/python). For automated quality evaluations, see [Judges](/judges/introduction).

## CLI Tooling

The Python SDK includes helpful CLI commands:

```bash theme={null}
# Save your API key securely
zeroeval setup

# Run scripts with automatic tracing
zeroeval run my_script.py
```


# Integrations
Source: https://docs.zeroeval.com/tracing/sdks/typescript/integrations

Automatic tracing for popular AI/ML libraries

The ZeroEval TypeScript SDK provides automatic tracing for popular AI libraries through the `wrap()` function.

## OpenAI

Wrap your OpenAI client to automatically trace all API calls:

```typescript theme={null}
import { OpenAI } from 'openai';
import * as ze from 'zeroeval';

ze.init();
const openai = ze.wrap(new OpenAI());

// Chat completions are automatically traced
const completion = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Hello!' }]
});

// Streaming is also automatically traced
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}
```

### Supported Methods

The OpenAI integration automatically traces:

* `chat.completions.create()` (streaming and non-streaming)
* `embeddings.create()`
* `images.generate()`, `images.edit()`, `images.createVariation()`
* `audio.transcriptions.create()`, `audio.translations.create()`

## Vercel AI SDK

Wrap the Vercel AI SDK module to trace all AI operations:

```typescript theme={null}
import * as ai from 'ai';
import { openai } from '@ai-sdk/openai';
import * as ze from 'zeroeval';

ze.init();
const wrappedAI = ze.wrap(ai);

// Text generation
const { text } = await wrappedAI.generateText({
  model: openai('gpt-4'),
  prompt: 'Write a haiku about coding'
});

// Streaming
const { textStream } = await wrappedAI.streamText({
  model: openai('gpt-4'),
  messages: [{ role: 'user', content: 'Hello!' }]
});

for await (const delta of textStream) {
  process.stdout.write(delta);
}

// Structured output
import { z } from 'zod';

const { object } = await wrappedAI.generateObject({
  model: openai('gpt-4'),
  schema: z.object({
    name: z.string(),
    age: z.number()
  }),
  prompt: 'Generate a random person'
});
```

### Supported Methods

The Vercel AI SDK integration automatically traces:

* `generateText()`, `streamText()`
* `generateObject()`, `streamObject()`
* `embed()`, `embedMany()`
* `generateImage()`
* `transcribe()`
* `generateSpeech()`

## LangChain / LangGraph

Use the callback handler for LangChain and LangGraph applications:

```typescript theme={null}
import { 
  ZeroEvalCallbackHandler, 
  setGlobalCallbackHandler 
} from 'zeroeval/langchain';

// Option 1: Set globally (recommended)
setGlobalCallbackHandler(new ZeroEvalCallbackHandler());

// All chain invocations are now automatically traced
const result = await chain.invoke({ topic: 'AI' });
```

```typescript theme={null}
import { ZeroEvalCallbackHandler } from 'zeroeval/langchain';

// Option 2: Per-invocation
const handler = new ZeroEvalCallbackHandler();
const result = await chain.invoke(
  { topic: 'AI' },
  { callbacks: [handler] }
);
```

## Auto-Detection

The `wrap()` function automatically detects which client you're wrapping:

```typescript theme={null}
import { OpenAI } from 'openai';
import * as ai from 'ai';
import * as ze from 'zeroeval';

ze.init();

// Automatically detected as OpenAI client
const openai = ze.wrap(new OpenAI());

// Automatically detected as Vercel AI SDK
const wrappedAI = ze.wrap(ai);
```

<Info>
  If `ze.init()` hasn't been called and `ZEROEVAL_API_KEY` is set in your environment, the SDK will automatically initialize when you first use `wrap()`.
</Info>

## Using with Prompts

The integrations automatically extract ZeroEval metadata from prompts created with `ze.prompt()`:

```typescript theme={null}
import { OpenAI } from 'openai';
import * as ze from 'zeroeval';

ze.init();
const openai = ze.wrap(new OpenAI());

// Create a version-tracked prompt
const systemPrompt = await ze.prompt({
  name: 'customer-support',
  content: 'You are a helpful customer support agent for {{company}}.',
  variables: { company: 'TechCorp' }
});

// The integration automatically:
// 1. Extracts the prompt metadata
// 2. Links the completion to the prompt version
// 3. Patches the model if one is bound to the prompt version
const response = await openai.chat.completions.create({
  model: 'gpt-4',  // May be replaced by bound model
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: 'I need help with my order' }
  ]
});
```

<Note>
  Need help? Check out our [GitHub examples](https://github.com/zeroeval/zeroeval-ts-sdk/tree/main/examples) or reach out on [Discord](https://discord.gg/MuExkGMNVz).
</Note>


# Reference
Source: https://docs.zeroeval.com/tracing/sdks/typescript/reference

Complete API reference for the TypeScript SDK

## Installation

```bash theme={null}
npm install zeroeval
```

## Core Functions

### `init()`

Initializes the ZeroEval SDK. Must be called before using any other SDK features.

```typescript theme={null}
function init(opts?: InitOptions): void;
```

#### Parameters

| Option               | Type                      | Default                    | Description                             |
| -------------------- | ------------------------- | -------------------------- | --------------------------------------- |
| `apiKey`             | `string`                  | `ZEROEVAL_API_KEY` env     | Your ZeroEval API key                   |
| `apiUrl`             | `string`                  | `https://api.zeroeval.com` | Custom API URL                          |
| `workspaceName`      | `string`                  | `"Personal Organization"`  | Workspace/organization name             |
| `flushInterval`      | `number`                  | `10`                       | Interval in seconds to flush spans      |
| `maxSpans`           | `number`                  | `100`                      | Maximum spans to buffer before flushing |
| `collectCodeDetails` | `boolean`                 | `true`                     | Capture source code context             |
| `integrations`       | `Record<string, boolean>` | —                          | Enable/disable specific integrations    |
| `debug`              | `boolean`                 | `false`                    | Enable debug logging                    |

#### Example

```typescript theme={null}
import * as ze from "zeroeval";

ze.init({
  apiKey: "your-api-key",
  debug: true,
});
```

***

## Wrapper Functions

### `wrap()`

Wraps a supported AI client to automatically trace all API calls.

```typescript theme={null}
function wrap<T extends object>(client: T): WrappedClient<T>;
```

#### Supported Clients

* OpenAI SDK (`openai` package)
* Vercel AI SDK (`ai` package)

#### Examples

```typescript theme={null}
// OpenAI
import { OpenAI } from "openai";
import * as ze from "zeroeval";

const openai = ze.wrap(new OpenAI());

// Vercel AI SDK
import * as ai from "ai";
import * as ze from "zeroeval";

const wrappedAI = ze.wrap(ai);
```

***

## Spans API

### `withSpan()`

Wraps a function execution in a span, automatically capturing timing and errors.

```typescript theme={null}
function withSpan<T>(
  opts: SpanOptions,
  fn: () => Promise<T> | T,
): Promise<T> | T;
```

#### SpanOptions

| Option        | Type                      | Required | Description                           |
| ------------- | ------------------------- | -------- | ------------------------------------- |
| `name`        | `string`                  | Yes      | Name of the span                      |
| `sessionId`   | `string`                  | No       | Session ID to associate with the span |
| `sessionName` | `string`                  | No       | Human-readable session name           |
| `tags`        | `Record<string, string>`  | No       | Tags to attach to the span            |
| `attributes`  | `Record<string, unknown>` | No       | Additional attributes                 |
| `inputData`   | `any`                     | No       | Manual input data override            |
| `outputData`  | `any`                     | No       | Manual output data override           |

#### Example

```typescript theme={null}
import * as ze from "zeroeval";

const result = await ze.withSpan({ name: "fetch-user-data" }, async () => {
  const user = await fetchUser(userId);
  return user;
});
```

### `@span` Decorator

Decorator for class methods to automatically create spans.

```typescript theme={null}
span(opts: SpanOptions): MethodDecorator
```

#### Example

```typescript theme={null}
import * as ze from "zeroeval";

class UserService {
  @ze.span({ name: "get-user" })
  async getUser(id: string): Promise<User> {
    return await db.users.findById(id);
  }
}
```

<Note>Requires `experimentalDecorators: true` in your `tsconfig.json`.</Note>

***

## Context Functions

### `getCurrentSpan()`

Returns the currently active span, if any.

```typescript theme={null}
function getCurrentSpan(): Span | undefined;
```

### `getCurrentTrace()`

Returns the current trace ID.

```typescript theme={null}
function getCurrentTrace(): string | undefined;
```

### `getCurrentSession()`

Returns the current session ID.

```typescript theme={null}
function getCurrentSession(): string | undefined;
```

### `setTag()`

Sets tags on a span, trace, or session.

```typescript theme={null}
function setTag(
  target: Span | string | undefined,
  tags: Record<string, string>,
): void;
```

#### Parameters

| Parameter   | Description                             |
| ----------- | --------------------------------------- |
| `Span`      | Sets tags on the specific span          |
| `string`    | Sets tags on the trace or session by ID |
| `undefined` | Sets tags on the current span           |

***

## Prompts API

### `prompt()`

Creates or fetches versioned prompts from the Prompt Library. Returns decorated content for downstream LLM calls.

```typescript theme={null}
async function prompt(options: PromptOptions): Promise<string>;
```

#### PromptOptions

| Option      | Type                     | Required | Description                                                          |
| ----------- | ------------------------ | -------- | -------------------------------------------------------------------- |
| `name`      | `string`                 | Yes      | Task name associated with the prompt                                 |
| `content`   | `string`                 | No       | Raw prompt content (used as fallback or for explicit mode)           |
| `variables` | `Record<string, string>` | No       | Template variables to interpolate `{{variable}}` tokens              |
| `from`      | `string`                 | No       | Version control: `"latest"`, `"explicit"`, or a 64-char SHA-256 hash |

#### Behavior

* **Auto-optimization (default)**: If `content` is provided without `from`, tries to fetch the latest optimized version first, falls back to provided content
* **Explicit mode** (`from: "explicit"`): Always uses provided `content`, bypasses auto-optimization
* **Latest mode** (`from: "latest"`): Requires an optimized version to exist, fails if none found
* **Hash mode** (`from: "<hash>"`): Fetches a specific version by its 64-character SHA-256 content hash

#### Examples

```typescript theme={null}
import * as ze from "zeroeval";

// Auto-optimization mode (recommended)
const prompt = await ze.prompt({
  name: "customer-support",
  content: "You are a helpful {{role}} assistant.",
  variables: { role: "customer service" },
});

// Explicit mode - bypass auto-optimization
const prompt = await ze.prompt({
  name: "customer-support",
  content: "You are a helpful assistant.",
  from: "explicit",
});

// Latest mode - require optimized version
const prompt = await ze.prompt({
  name: "customer-support",
  from: "latest",
});

// Hash mode - specific version
const prompt = await ze.prompt({
  name: "customer-support",
  from: "a1b2c3d4e5f6...", // 64-char SHA-256 hash
});
```

#### Return Value

Returns a decorated prompt string with metadata header used by integrations:

```
<zeroeval>{"task":"...", "prompt_version": 1, ...}</zeroeval>Your prompt content here
```

#### Errors

| Error                 | When                                                                       |
| --------------------- | -------------------------------------------------------------------------- |
| `Error`               | Both `content` and `from` provided (except `from: "explicit"`), or neither |
| `PromptRequestError`  | `from: "latest"` but no versions exist                                     |
| `PromptNotFoundError` | `from` is a hash that does not exist                                       |

***

### `sendFeedback()`

Sends feedback for a completion to enable prompt optimization.

```typescript theme={null}
async function sendFeedback(
  options: SendFeedbackOptions,
): Promise<PromptFeedbackResponse>;
```

#### SendFeedbackOptions

| Option           | Type                      | Required | Description                               |
| ---------------- | ------------------------- | -------- | ----------------------------------------- |
| `promptSlug`     | `string`                  | Yes      | The slug of the prompt (task name)        |
| `completionId`   | `string`                  | Yes      | UUID of the span/completion               |
| `thumbsUp`       | `boolean`                 | Yes      | `true` for positive, `false` for negative |
| `reason`         | `string`                  | No       | Explanation of the feedback               |
| `expectedOutput` | `string`                  | No       | What the expected output should be        |
| `metadata`       | `Record<string, unknown>` | No       | Additional metadata                       |
| `judgeId`        | `string`                  | No       | Judge automation ID for judge feedback    |
| `expectedScore`  | `number`                  | No       | Expected score for scored judges          |
| `scoreDirection` | `'too_high' \| 'too_low'` | No       | Score direction for scored judges         |

#### Example

```typescript theme={null}
import * as ze from "zeroeval";

await ze.sendFeedback({
  promptSlug: "support-bot",
  completionId: "550e8400-e29b-41d4-a716-446655440000",
  thumbsUp: false,
  reason: "Response was too verbose",
  expectedOutput: "A concise 2-3 sentence response",
});
```

***

## Utility Functions

### `renderTemplate()`

Render a template string with variable substitution.

```typescript theme={null}
function renderTemplate(
  template: string,
  variables: Record<string, string | number | boolean>,
  options?: { ignoreMissing?: boolean },
): string;
```

### `extractVariables()`

Extract variable names from a template string.

```typescript theme={null}
function extractVariables(template: string): Set<string>;
```

### `sha256Hex()`

Compute SHA-256 hash of text.

```typescript theme={null}
async function sha256Hex(text: string): Promise<string>;
```

### `normalizePromptText()`

Normalize prompt text for consistent hashing.

```typescript theme={null}
function normalizePromptText(text: string): string;
```

***

## Error Classes

### `PromptNotFoundError`

Thrown when a specific prompt version (by hash) is not found.

```typescript theme={null}
class PromptNotFoundError extends Error {
  constructor(message: string);
}
```

### `PromptRequestError`

Thrown when a prompt request fails (e.g., no versions exist for `from: "latest"`).

```typescript theme={null}
class PromptRequestError extends Error {
  constructor(message: string, statusCode?: number);
}
```

***

## Types

### `Prompt`

```typescript theme={null}
interface Prompt {
  id: string;
  prompt_id: string;
  content: string;
  content_hash: string;
  version: number;
  model_id?: string;
}
```

### `PromptMetadata`

```typescript theme={null}
interface PromptMetadata {
  task: string;
  prompt_slug?: string;
  prompt_version?: number;
  prompt_version_id?: string;
  content_hash?: string;
  variables?: Record<string, string>;
}
```

## Environment Variables

Set before importing ZeroEval to configure default behavior.

| Variable                         | Type    | Default                      | Description                             |
| -------------------------------- | ------- | ---------------------------- | --------------------------------------- |
| `ZEROEVAL_API_KEY`               | string  | `""`                         | API key for authentication              |
| `ZEROEVAL_API_URL`               | string  | `"https://api.zeroeval.com"` | API endpoint URL                        |
| `ZEROEVAL_WORKSPACE_NAME`        | string  | `"Personal Workspace"`       | Workspace name                          |
| `ZEROEVAL_SESSION_ID`            | string  | auto-generated               | Session ID for grouping traces          |
| `ZEROEVAL_SESSION_NAME`          | string  | `""`                         | Human-readable session name             |
| `ZEROEVAL_SAMPLING_RATE`         | float   | `"1.0"`                      | Sampling rate (0.0-1.0)                 |
| `ZEROEVAL_DISABLED_INTEGRATIONS` | string  | `""`                         | Comma-separated integrations to disable |
| `ZEROEVAL_DEBUG`                 | boolean | `"false"`                    | Enable debug logging                    |

```bash theme={null}
export ZEROEVAL_API_KEY="ze_1234567890abcdef"
export ZEROEVAL_SAMPLING_RATE="0.1"
export ZEROEVAL_DEBUG="true"
```

## Configuration Examples

### Production

```typescript theme={null}
ze.init({
  apiKey: "your_key",
  flushInterval: 1,
  maxSpans: 200,
  debug: false,
  integrations: {
    openai: true,
    vercelAI: true,
  },
});
```

### Development

```typescript theme={null}
ze.init({
  apiKey: "your_key",
  debug: true,
  collectCodeDetails: true,
});
```

<Note>
  Need help? Check out our [GitHub
  examples](https://github.com/zeroeval/zeroeval-ts-sdk/tree/main/examples) or
  reach out on [Discord](https://discord.gg/MuExkGMNVz).
</Note>


# Setup
Source: https://docs.zeroeval.com/tracing/sdks/typescript/setup

Get started with ZeroEval tracing in TypeScript and JavaScript applications

The [ZeroEval TypeScript SDK](https://www.npmjs.com/package/zeroeval) provides tracing for Node.js applications through wrapper functions and integration callbacks.

## Installation

<CodeGroup>
  ```bash npm theme={null}
  npm install zeroeval
  ```

  ```bash yarn theme={null}
  yarn add zeroeval
  ```

  ```bash pnpm theme={null}
  pnpm add zeroeval
  ```
</CodeGroup>

## Basic Setup

```typescript theme={null}
import * as ze from 'zeroeval';

// Option 1: ZEROEVAL_API_KEY in your environment variable
ze.init();

// Option 2: Provide API key directly
ze.init({ apiKey: 'YOUR_API_KEY' });

// Option 3: With additional configuration
ze.init({
  apiKey: 'YOUR_API_KEY',
  apiUrl: 'https://api.zeroeval.com', // optional
  flushInterval: 10, // seconds
  maxSpans: 100,
});
```

## Patterns

The SDK offers two ways to add tracing to your TypeScript/JavaScript code:

### Function Wrapping

Use `withSpan()` to wrap function executions:

```typescript theme={null}
import * as ze from 'zeroeval';

// Wrap synchronous functions
const fetchData = (userId: string) =>
  ze.withSpan({ name: 'fetch_data' }, () => ({
    userId,
    name: 'John Doe'
  }));

// Wrap async functions
const processData = async (data: { name: string }) =>
  ze.withSpan(
    {
      name: 'process_data',
      attributes: { version: '1.0' }
    },
    async () => {
      const result = await transform(data);
      return `Welcome, ${result.name}!`;
    }
  );

// Complex workflows with nested spans
async function complexWorkflow() {
  return ze.withSpan({ name: 'data_pipeline' }, async () => {
    const data = await ze.withSpan(
      { name: 'fetch_stage' },
      fetchExternalData
    );

    const processed = await ze.withSpan(
      { name: 'process_stage' },
      () => transformData(data)
    );

    const result = await ze.withSpan(
      { name: 'save_stage' },
      () => saveToDatabase(processed)
    );

    return result;
  });
}
```

### Decorators

Use the `@span` decorator for class methods:

```typescript theme={null}
import { span } from 'zeroeval';

class DataService {
  @span({
    name: 'fetch_user_data',
    tags: { service: 'user_api' }
  })
  async fetchUser(userId: string) {
    const response = await fetch(`/api/users/${userId}`);
    return response.json();
  }

  @span({
    name: 'process_order',
    attributes: { version: '2.0' }
  })
  processOrder(orderId: string, items: string[]) {
    return { orderId, processed: true };
  }
}
```

<Note>
  **Decorators require TypeScript configuration**: Enable `experimentalDecorators` in your `tsconfig.json`:

  ```json theme={null}
  {
    "compilerOptions": {
      "experimentalDecorators": true
    }
  }
  ```

  When using runtime tools like `tsx` or `ts-node`, pass the `--experimental-decorators` flag.
</Note>

## Sessions

Sessions group related spans together for tracking workflows, user interactions, or multi-step processes.

### Basic Session

```typescript theme={null}
import { v4 as uuidv4 } from 'uuid';
import * as ze from 'zeroeval';

const sessionId = uuidv4();

async function userJourney(userId: string) {
  return ze.withSpan(
    {
      name: 'user_journey',
      sessionId: sessionId,
      sessionName: 'User Onboarding'
    },
    async () => {
      // All nested spans inherit the session
      await ze.withSpan({ name: 'step_1' }, () => welcome(userId));
      await ze.withSpan({ name: 'step_2' }, () => setupProfile(userId));
      await ze.withSpan({ name: 'step_3' }, () => sendConfirmation(userId));
    }
  );
}
```

### Multi-Step Pipeline

```typescript theme={null}
async function ragPipeline(query: string) {
  const sessionId = uuidv4();

  return ze.withSpan(
    { name: 'rag_pipeline', sessionId, sessionName: 'RAG Query' },
    async () => {
      const docs = await ze.withSpan(
        { name: 'retrieve' },
        () => vectorSearch(query)
      );

      const ranked = await ze.withSpan(
        { name: 'rerank' },
        () => rerankDocs(query, docs)
      );

      return ze.withSpan(
        { name: 'generate' },
        () => generateAnswer(query, ranked)
      );
    }
  );
}
```

## Context

Access current context information:

```typescript theme={null}
import * as ze from 'zeroeval';

// Get the current span
const currentSpan = ze.getCurrentSpan();

// Get the current trace ID
const traceId = ze.getCurrentTrace();

// Get the current session ID
const sessionId = ze.getCurrentSession();
```

## Tagging

Attach tags for filtering and organization:

```typescript theme={null}
import * as ze from 'zeroeval';

// Set tags on the current span
ze.setTag(undefined, { user_id: '12345', environment: 'production' });

// Set tags on a specific trace
const traceId = ze.getCurrentTrace();
if (traceId) {
  ze.setTag(traceId, { feature: 'checkout' });
}

// Set tags on a span object
const span = ze.getCurrentSpan();
if (span) {
  ze.setTag(span, { action: 'process_payment' });
}
```

## Feedback

To attach human or programmatic feedback to completions, see [Human Feedback](/feedback/human-feedback) and the [Feedback SDK docs](/feedback/typescript). For automated quality evaluations, see [Judges](/judges/introduction).

## Advanced Configuration

Fine-tune the SDK behavior:

```typescript theme={null}
import * as ze from 'zeroeval';

ze.init({
  apiKey: 'YOUR_API_KEY',
  apiUrl: 'https://api.zeroeval.com',
  flushInterval: 5,           // Flush every 5 seconds
  maxSpans: 200,              // Buffer up to 200 spans
  collectCodeDetails: true,   // Capture source code context
  debug: false,               // Enable debug logging
  integrations: {
    openai: true,             // Enable OpenAI integration
    vercelAI: true,           // Enable Vercel AI SDK integration
  }
});
```

<Note>
  Need help? Check out our [GitHub examples](https://github.com/zeroeval/zeroeval-ts-sdk/tree/main/examples) or reach out on [Discord](https://discord.gg/MuExkGMNVz).
</Note>