Java’s garbage collector was sold as “set it and forget it,” until production teams spent a decade debugging GC pauses, memory pressure spirals, and stop-the-world freezes. The Claude Compaction API is having its garbage collection moment.
Released in beta on January 12, 2026, Anthropic’s server-side context summarization lets developers run conversations indefinitely beyond the 1M token window. Anthropic’s implementation integrates directly with the Messages API that most Claude developers already use. But like any automatic memory system, the defaults are a starting point, not a finish line.
This tutorial walks you from zero to production-ready compaction: threshold selection, budget enforcement, cost math, and the failure modes that will find you if you skip the advanced configuration. If you’ve already hit a context window wall while building with the Claude Agent SDK, this is the fix (once you tune it right).
Why Long Conversations Get Worse Before They Hit a Wall
Context rot is not theoretical. As Anthropic’s engineering blog explains, transformers create n-squared pairwise relationships between tokens: a 200K token conversation doesn’t just cost more, it performs worse as the model’s attention budget stretches thin. On the MRCR v2 multi-needle retrieval benchmark at 1M tokens, InfoQ reported Opus 4.6 achieves 76% accuracy. Sonnet 4.5 manages 18.5%, a fourfold gap that shows the problem is real and model-dependent.
Before compaction, the workarounds were all client-side: manual truncation, sliding windows, client-side summarization, RAG-based memory. Compaction moves the problem server-side: Claude generates a structured summary, drops everything before it, and continues indefinitely. Anthropic described the Opus 4.6 1M token context window as “a qualitative shift in how much context a model can actually use.” Compaction is the mechanism that makes that shift practical for conversations that exceed even that window.
Basic Implementation: Enabling the Claude Compaction API
Prerequisites: the Anthropic Python SDK (Python 3.9+ required by the SDK) and client.beta.messages (not the standard client.messages namespace). You need two additions to your existing Messages API call: the beta header compact-2026-01-12 and a context_management block.
Here’s the minimal Python implementation:
import anthropic

client = anthropic.Anthropic()
conversation_history = []

while True:
    user_input = input("You: ")
    conversation_history.append({"role": "user", "content": user_input})
    response = client.beta.messages.create(
        model="claude-opus-4-6",
        max_tokens=8096,
        betas=["compact-2026-01-12"],
        system="You are a helpful assistant.",
        context_management={
            "edits": [
                {
                    "type": "compact_20260112",
                    "trigger": {"type": "input_tokens", "value": 75000},
                }
            ]
        },
        messages=conversation_history,
    )
    # Append the full response, including any compaction blocks
    conversation_history.append({"role": "assistant", "content": response.content})
    print(f"Claude: {response.content[0].text}")
That’s it. The API detects when input tokens exceed the trigger threshold, generates a compaction block containing a structured summary, and drops all content before that block on subsequent requests. Developers just append the full API response, compaction blocks included, to their message array and keep calling. The default trigger fires at 150,000 tokens; the minimum configurable value is 50,000. The official compaction documentation provides examples in eight languages for teams not working in Python.
Choosing the Right Compaction Threshold
The trigger threshold is the single biggest tuning lever. Low thresholds compact more frequently with less information loss per event. High thresholds preserve more raw detail but cost more when compaction fires. Here’s how to match thresholds to workloads:
- Low (5K-20K tokens): Sequential task processing with clear boundaries (batch-processing customer support tickets, for example, where each task is independent).
- Medium (50K-100K tokens): Multi-phase workflows with moderate context needs. This is the sweet spot for most agentic applications.
- High (100K-150K tokens): Context-heavy analysis where detail preservation matters more than cost efficiency: code reviews, research synthesis, long debugging sessions.
The numbers back this up. In Anthropic’s compaction cookbook, a 5K threshold processing five customer support tickets reduced total token consumption from 208,838 to 86,446, a 58.6% reduction, saving 122,392 tokens with only two compaction events. The same workload completed in 26 turns instead of 37.
Practical guidance: start at 75K, instrument your usage.iterations array to observe how often compaction fires, then tune based on whether you’re losing context that matters.
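A minimal instrumentation sketch of that guidance follows. The usage.iterations field comes from the beta docs discussed later in this tutorial; treating more than one entry as a signal that compaction ran is our assumption, so verify the exact shapes against your SDK version:

import logging

logger = logging.getLogger("compaction")

# Hypothetical helper: logs whenever a request involved more than one
# internal iteration, which we take here as a signal that compaction ran.
def log_compaction_activity(response):
    iterations = getattr(response.usage, "iterations", None) or []
    if len(iterations) > 1:
        logger.info("compaction fired: %d iterations this request", len(iterations))

Call it after every client.beta.messages.create() and watch the logs for a while before touching the threshold.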

Production Safety: pause_after_compaction and Budget Enforcement
Here’s the failure mode that should motivate your advanced configuration. Developer @yagudaev reported Claude Code getting stuck in a compaction loop: each compaction produced 0% free context, triggering another compaction immediately. “The ONLY way to get out of a situation like that in an agent loop is check context room left and have a graceful exit.” This is exactly what pause_after_compaction prevents.
When pause_after_compaction is set to true, the API returns with stop_reason: "compaction" after generating the summary, giving you a checkpoint before the conversation continues (a configuration sketch follows the list below). At that checkpoint, you have three jobs:
- Inspect the summary for information loss: did the compaction drop a critical error message or file path?
- Inject critical context that got compressed away: recent messages, key instructions, domain-specific state.
- Enforce a token budget: track n_compactions, multiply by your trigger threshold, and compare against a total budget ceiling.
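Enabling the pause itself is a small config change. A sketch, reusing client and conversation_history from the earlier example; the pause_after_compaction setting is named in the docs, but its placement inside the compact edit block is an assumption to confirm against the reference:

compact_edit = {
    "type": "compact_20260112",
    "trigger": {"type": "input_tokens", "value": 75000},
    # Assumed placement: the docs name this setting, but where it nests
    # inside the edit block should be verified against the API reference.
    "pause_after_compaction": True,
}

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=8096,
    betas=["compact-2026-01-12"],
    context_management={"edits": [compact_edit]},
    messages=conversation_history,
)

if response.stop_reason == "compaction":
    # Checkpoint: inspect the summary, inject context, enforce the budget.
    pass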
The budget enforcement pattern from Anthropic’s documentation looks like this:
TRIGGER_THRESHOLD = 100_000
TOTAL_TOKEN_BUDGET = 3_000_000
n_compactions = 0

# After each response:
if response.stop_reason == "compaction":
    n_compactions += 1
    tokens_consumed = n_compactions * TRIGGER_THRESHOLD
    if tokens_consumed >= TOTAL_TOKEN_BUDGET:
        # Inject a wrap-up instruction
        conversation_history.append({
            "role": "user",
            "content": "Please wrap up the current task and provide a final summary."
        })
Custom summarization instructions give you one more lever. Pass an instructions field in context_management to completely replace the default prompt, for example: “Focus on preserving code snippets, variable names, and technical decisions made in this session.” Task IDs, categories, and outcomes tend to survive the default summarizer. Full tool results, exact error messages, and specific file paths typically get compressed away.
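A sketch of that configuration; the docs say instructions is passed via context_management, and nesting it alongside the trigger inside the edit block is our assumption:

context_management = {
    "edits": [
        {
            "type": "compact_20260112",
            "trigger": {"type": "input_tokens", "value": 75000},
            # Completely replaces the default summarization prompt.
            # Exact nesting of "instructions" is assumed; the docs only
            # say it is passed via context_management.
            "instructions": (
                "Focus on preserving code snippets, variable names, "
                "and technical decisions made in this session."
            ),
        }
    ]
}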
For applications requiring full audit trails, compaction is not the right tool. For everything else, pause_after_compaction in production agentic workflows is not optional: it’s the difference between a controlled system and a runaway one. The session memory compaction cookbook covers thread-safe implementation patterns for instant (zero-wait) compaction architectures.
The Cost Math Developers Skip
Compaction is not free. Each summarization step is billed as standard input/output tokens, a real cost that compounds over long sessions.
At Opus 4.6 rates ($5/M input, $25/M output): a compaction step processing 180K input tokens and generating 3.5K output costs approximately $0.99. At Sonnet 4.6 rates ($3/M input, $15/M output): the same step runs about $0.59, a roughly 1.7x cost lever from model choice alone. For a deeper comparison, see our Claude API pricing breakdown.
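The arithmetic behind those figures, spelled out with the quoted rates:

# Cost of one compaction step, given per-million-token rates in USD.
def compaction_cost(input_tokens, output_tokens, in_rate, out_rate):
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(compaction_cost(180_000, 3_500, in_rate=5, out_rate=25))  # Opus 4.6:   ~$0.99
print(compaction_cost(180_000, 3_500, in_rate=3, out_rate=15))  # Sonnet 4.6: ~$0.59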
The 58.6% token reduction from the benchmark is real savings, but weigh it against the compaction step itself. For most workloads, net cost is still lower: you’re paying ~$1 to avoid re-transmitting 180K tokens on every subsequent turn.
One billing detail that will bite you: the top-level usage.input_tokens and usage.output_tokens do not include compaction iteration usage. Sum across all entries in usage.iterations for true billed amounts; miss this and your cost dashboards will under-report by the exact amount you’re spending on compaction.
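A sketch of the correct accounting; the per-iteration field names (input_tokens, output_tokens) are assumed to mirror the top-level usage object, so confirm them in the beta reference:

# True billed totals: sum every iteration, not just the top-level usage.
iterations = getattr(response.usage, "iterations", None) or []
billed_input = sum(it.input_tokens for it in iterations)
billed_output = sum(it.output_tokens for it in iterations)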
Prompt caching compounds the savings. Add cache_control breakpoints on compaction blocks and system prompts separately; this preserves cache hits across compaction events and saves approximately 80% per memory update (per Anthropic’s cookbook). Over a long agentic session with dozens of compaction events, that adds up fast.
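The system-prompt breakpoint below uses the standard prompt-caching syntax; per the cookbook, compaction blocks get their own separate breakpoint in the same style, which this sketch notes but does not show:

# Standard prompt-caching breakpoint on the system prompt. The cookbook's
# companion step (a separate breakpoint on the compaction block itself)
# is applied to the stored block in your message history.
system_prompt = [
    {
        "type": "text",
        "text": "You are a helpful assistant.",
        "cache_control": {"type": "ephemeral"},
    }
]

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=8096,
    betas=["compact-2026-01-12"],
    system=system_prompt,
    context_management={
        "edits": [
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": 75000},
            }
        ]
    },
    messages=conversation_history,
)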
Compaction vs. Context Editing: Choosing the Right Tool
The same context_management API surface offers a complementary approach. Context editing surgically removes specific content: clear_tool_uses_20250919 strips stale tool outputs, clear_thinking_20251015 manages extended thinking tokens. Context editing uses a different beta header (context-management-2025-06-27), so check the docs before combining both in a single request.
The decision framework: long multi-turn conversations need compaction. Agents making heavy tool calls where old results are noise need context editing. If both apply (and in most production systems, they do), layer them: context editing clears tool bloat within the window, compaction acts as the backstop when the window approaches its limit.
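A layered configuration might look like the sketch below. Listing both beta headers in one request is an assumption; the two features shipped under separate betas, so confirm combinability in the docs first:

response = client.beta.messages.create(
    model="claude-opus-4-6",
    max_tokens=8096,
    # Assumption: both betas can be requested together in one call.
    betas=["compact-2026-01-12", "context-management-2025-06-27"],
    context_management={
        "edits": [
            # First line of defense: strip stale tool results in-window.
            {"type": "clear_tool_uses_20250919"},
            # Backstop: summarize once input tokens approach the ceiling.
            {
                "type": "compact_20260112",
                "trigger": {"type": "input_tokens", "value": 100000},
            },
        ]
    },
    messages=conversation_history,
)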
Tuning Compaction for Production
Developer @dani_avila7 offered a contrarian take: “If you configure CLAUDE.md, commands, sub-agents, and hooks properly, you almost never hit auto-compact. Every time it triggered for me, I lost important context.” That’s not an argument against compaction; it’s an argument for treating it as a tuned subsystem, not a fire-and-forget default.
The open question: does compaction’s “garbage collection moment” play out the same way GC did, with years of production teams discovering edge cases and demanding granular controls? As @odysseus0z put it: “Just like git rebase -i HEAD~5, we should be able to compact recent messages without losing everything.” That feature doesn’t exist yet. The API launched on January 12, 2026 and is available across Anthropic’s API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry, but the tuning primitives may still be ahead.
Here’s the insight most tutorials miss: compaction doesn’t make context management disappear. It moves it from your code to your configuration: trigger thresholds, custom summarization prompts, pause checkpoints, budget ceilings. That’s the right trade-off, but only if you actually do the configuration. The API is still in beta, so pin your headers and instrument everything. The developers who ship default settings and walk away are the ones who’ll discover the compaction loop at 2 AM.