Compact

After 50+ turns, your agent’s chat history is mostly filler: greetings, failed attempts, irrelevant code blocks. The model starts losing track of earlier context, and performance degrades. This is “context rot.”

Compact shrinks chat history and code context before you send it to your LLM. It runs at 33,000 tok/s; 100K tokens compress in under 2 seconds. Pass in text, get back shorter text with irrelevant lines removed: a 50-70% reduction, with every surviving line byte-for-byte identical to the input.

const result = await morph.compact({
  input: chatHistory,               // string or message array
  query: "JWT token validation",    // what the user is about to ask
});

// result.output is the compressed text, pass it to your LLM

The query parameter is optional but makes compression much better. It tells the model which lines matter for the next question, so query="auth middleware" keeps auth code and drops DB setup, while query="database schema" does the opposite.


Model	`morph-compactor`
Speed	33,000 tok/s
Context window	1M tokens
Typical reduction	50-70% fewer tokens
Output	Verbatim lines from input (no rewriting)

Quick Start

import Anthropic from '@anthropic-ai/sdk';
import { MorphClient } from '@morphllm/morphsdk';

const morph = new MorphClient({ apiKey: "YOUR_API_KEY" });
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const result = await morph.compact({
  input: chatHistory,
  query: "How do I validate JWT tokens?",
});

// Pass compressed history to your LLM
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5-20250929",
  max_tokens: 1024, // required by the Messages API
  messages: [
    { role: "user", content: result.output },
    { role: "user", content: "How do I validate JWT tokens?" },
  ],
});

import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://api.morphllm.com/v1",
});

const response = await client.responses.create({
  model: "morph-compactor",
  input: chatHistory,
});

const compressed = response.output[0].content[0].text;

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.morphllm.com/v1",
)

response = client.responses.create(
    model="morph-compactor",
    input=chat_history,
)

compressed = response.output[0].content[0].text

import requests

# String input
response = requests.post(
    "https://api.morphllm.com/v1/compact",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "input": source_code,
        "query": "authentication",
        "compression_ratio": 0.5,
        "preserve_recent": 0,
    },
)

data = response.json()
print(data["output"])

for r in data["messages"][0]["compacted_line_ranges"]:
    print(f"  lines {r['start']}-{r['end']} removed")

curl -X POST "https://api.morphllm.com/v1/compact" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "def hello():\n    return 1\n\ndef unused():\n    pass\n\ndef world():\n    return 2",
    "query": "hello function",
    "compression_ratio": 0.5,
    "preserve_recent": 0
  }'

curl -X POST "https://api.morphllm.com/v1/compact" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Help me build a Node.js API with JWT auth"},
      {"role": "assistant", "content": "Sure, here is a full implementation..."},
      {"role": "user", "content": "Now add rate limiting"}
    ],
    "query": "rate limiting",
    "compression_ratio": 0.5
  }'

Query-Conditioned Compression

The query parameter tells the model what matters. The model scores every line’s relevance to that query, then drops lines below the threshold.

// Same chat history, different queries, different output
const forAuth = await morph.compact({
  input: chatHistory,
  query: "JWT token validation",
});
// DB setup and CSS discussion dropped, auth code kept

const forDB = await morph.compact({
  input: chatHistory,
  query: "database connection pooling",
});
// Auth code dropped, DB setup kept

Without a query, the model infers the focus from the last user message. Explicit queries give tighter compression.
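To make the score-then-threshold idea concrete, here is a toy sketch. It is not Morph's model: plain keyword overlap stands in for the relevance score, and `toyCompact`, `tokenize`, and the default threshold of 1 are hypothetical names and values chosen only for illustration. What it does share with the description above is the shape of the result: surviving lines are copied verbatim, and each run of dropped lines becomes a `(filtered N lines)` marker.

```typescript
// Toy illustration only: keyword overlap stands in for the model's relevance score.
interface ToyResult {
  output: string;
  removed: Array<{ start: number; end: number }>; // 1-indexed, inclusive (assumed)
}

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_]+/g) ?? [];
}

function toyCompact(input: string, query: string, threshold = 1): ToyResult {
  const queryTerms = new Set(tokenize(query));
  const lines = input.split("\n");
  const out: string[] = [];
  const removed: Array<{ start: number; end: number }> = [];
  let runStart = -1; // 0-based index where the current run of dropped lines began

  // Close the current run of dropped lines, which ends just before endIndex.
  const flush = (endIndex: number): void => {
    if (runStart === -1) return;
    out.push(`(filtered ${endIndex - runStart} lines)`);
    removed.push({ start: runStart + 1, end: endIndex });
    runStart = -1;
  };

  lines.forEach((line, i) => {
    // Score = number of words in the line that also appear in the query.
    const score = tokenize(line).filter((t) => queryTerms.has(t)).length;
    if (score >= threshold) {
      flush(i);
      out.push(line); // kept lines are copied unchanged
    } else if (runStart === -1) {
      runStart = i;
    }
  });
  flush(lines.length);

  return { output: out.join("\n"), removed };
}
```

Running `toyCompact` on the same input with two different queries yields two different outputs, which is the behavior the `forAuth` / `forDB` example describes.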

Line Ranges and Markers

By default, each message includes compacted_line_ranges (which lines were removed) and (filtered N lines) markers in the text. Both are configurable:

// Default: markers + ranges
const result = await morph.compact({
  input: codeFile,
  query: "auth middleware",
  compressionRatio: 0.5,
  preserveRecent: 0,
});

console.log(result.output);
// def authenticate():
//     ...
// (filtered 12 lines)
// def handle_request():
//     ...

for (const r of result.messages[0].compacted_line_ranges) {
  console.log(`lines ${r.start}-${r.end} removed`);
}

// No markers: empty lines instead of "(filtered N lines)"
await morph.compact({ input: codeFile, includeMarkers: false });

// No line ranges: skip tracking removed ranges
const noRanges = await morph.compact({ input: codeFile, includeLineRanges: false });
// noRanges.messages[0].compacted_line_ranges is an empty array
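If you want to audit what was dropped, the ranges can be mapped back onto the original input. The sketch below assumes that `compacted_line_ranges` uses 1-indexed, inclusive line numbers that refer to the original text. The inclusive part matches the sample response in the API reference (`start: 5, end: 10` alongside `(filtered 6 lines)`), but the 1-indexing is an assumption, and `removedLines` and `countRemoved` are hypothetical helper names.

```typescript
interface LineRange {
  start: number;
  end: number;
}

// Assumption: ranges are 1-indexed, inclusive, and refer to lines of the ORIGINAL input.
function removedLines(original: string, ranges: LineRange[]): string[] {
  const lines = original.split("\n");
  const removed: string[] = [];
  for (const { start, end } of ranges) {
    // slice() is 0-indexed and end-exclusive, hence start - 1 and end.
    removed.push(...lines.slice(start - 1, end));
  }
  return removed;
}

// Total number of lines covered by the ranges.
function countRemoved(ranges: LineRange[]): number {
  return ranges.reduce((n, r) => n + (r.end - r.start + 1), 0);
}
```

Usage would look like `removedLines(codeFile, result.messages[0].compacted_line_ranges)`, which is handy when logging what a given query caused to be filtered.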

Preserving Critical Context

Wrap sections you never want compressed in <keepContext> / </keepContext> tags. Tagged content survives compression verbatim regardless of the compression ratio.

const input = `
// Database connection setup
const pool = new Pool({ host: 'localhost', port: 5432 });

<keepContext>
// CRITICAL: Auth middleware - do not compress
function authenticate(req, res, next) {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) return res.status(401).json({ error: 'No token' });
  const decoded = jwt.verify(token, process.env.JWT_SECRET);
  req.user = decoded;
  next();
}
</keepContext>

// Logging utilities
function logRequest(req) { console.log(req.method, req.path); }
function logError(err) { console.error(err.stack); }
// ... 200 more lines of helpers
`;

const result = await morph.compact({
  input,
  query: "authentication",
  compressionRatio: 0.3,
});

// The authenticate() function is fully preserved.
// DB setup and logging helpers are compressed.
// The <keepContext> tags themselves are stripped from output.

Rules:

Tags must be on their own line, not inline after other content (for example, code() <keepContext>)
Tags must open and close within the same message
Kept content counts against the compression_ratio budget. If you keep 40% and request 0.5, the remaining 60% compresses harder to hit the target.
Unclosed <keepContext> preserves everything from the tag to the end of the message
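The budget rule can be worked through numerically. The sketch below assumes kept content and the target ratio are measured in the same unit (lines or tokens; the exact accounting is not spelled out here), and `effectiveRatioForRest` is a hypothetical helper, not part of the SDK. With 40% force-kept and a target of 0.5, only 10% of the original is left as budget for the other 60%, so that portion is compressed to roughly 17% of its size.

```typescript
// Assumption: keptFraction and targetRatio are fractions of the same total (lines or tokens).
// Returns the fraction of the NON-kept content that can survive to hit the overall target.
function effectiveRatioForRest(keptFraction: number, targetRatio: number): number {
  if (keptFraction < 0 || keptFraction >= 1) {
    throw new RangeError("keptFraction must be in [0, 1)");
  }
  // Budget left after the force-kept content is accounted for.
  const remainingBudget = Math.max(0, targetRatio - keptFraction);
  return remainingBudget / (1 - keptFraction);
}

// 40% kept, 0.5 requested: (0.5 - 0.4) / 0.6, about 0.167
console.log(effectiveRatioForRest(0.4, 0.5).toFixed(3));
```

If the kept content alone exceeds the target, this helper returns 0; how the service behaves in that case is not documented here, so treat that branch as a placeholder.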

The response includes kept_line_ranges showing which lines were force-preserved:

for (const r of result.messages[0].kept_line_ranges) {
  console.log(`lines ${r.start}-${r.end} preserved via keepContext`);
}

API Reference

`POST /v1/compact`

The primary endpoint. Accepts string input or message arrays.

Parameters

Parameter	Type	Default	Description
`input`	string or array	-	Text or `{role, content}` array. One of `input`/`messages` required.
`messages`	array	-	`{role, content}` messages. Takes priority over `input`.
`query`	string	auto-detected	Focus query for relevance-based pruning
`compression_ratio`	float	`0.5`	Fraction to keep. `0.3` = aggressive, `0.7` = light
`preserve_recent`	int	`2`	Keep last N messages uncompressed
`compress_system_messages`	bool	`false`	When `true`, system messages are also compressed. By default they are preserved verbatim.
`include_line_ranges`	bool	`true`	Include `compacted_line_ranges` in response
`include_markers`	bool	`true`	Include `(filtered N lines)` text markers. When `false`, gaps become empty lines
`model`	string	`morph-compactor`	Model ID

Response

{
  "id": "cmpr-7373faf8af65",
  "object": "compact",
  "model": "morph-compactor",
  "output": "def hello():\n    print(\"hello world\")\n(filtered 6 lines)\ndef world():\n    return 42",
  "messages": [
    {
      "role": "user",
      "content": "def hello():\n    print(\"hello world\")\n(filtered 6 lines)\ndef world():\n    return 42",
      "compacted_line_ranges": [{ "start": 5, "end": 10 }],
      "kept_line_ranges": []
    }
  ],
  "usage": {
    "input_tokens": 101,
    "output_tokens": 65,
    "compression_ratio": 0.644,
    "processing_time_ms": 109
  }
}
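For projects that call the endpoint without the SDK, the request and response shapes above can be written down as types. This is a minimal sketch using the built-in `fetch` (Node 18+ or any modern runtime). The field names come from the parameter table and sample response; `compactViaFetch` and `validateRequest` are hypothetical names, and the 0.05-1.0 bound is borrowed from the SDK's `compressionRatio` comment on the assumption that the REST endpoint enforces the same range.

```typescript
type Message = { role: string; content: string };
type LineRange = { start: number; end: number };

interface CompactRequest {
  input?: string | Message[];
  messages?: Message[];
  query?: string;
  compression_ratio?: number;
  preserve_recent?: number;
  compress_system_messages?: boolean;
  include_line_ranges?: boolean;
  include_markers?: boolean;
  model?: string;
}

interface CompactResponse {
  id: string;
  output: string;
  messages: Array<Message & { compacted_line_ranges: LineRange[]; kept_line_ranges: LineRange[] }>;
  usage: {
    input_tokens: number;
    output_tokens: number;
    compression_ratio: number;
    processing_time_ms: number;
  };
}

// Client-side checks so obviously bad requests fail before a network round trip.
function validateRequest(body: CompactRequest): void {
  if (body.input === undefined && body.messages === undefined) {
    throw new Error("One of input or messages is required");
  }
  const ratio = body.compression_ratio;
  if (ratio !== undefined && (ratio < 0.05 || ratio > 1)) {
    throw new RangeError("compression_ratio must be between 0.05 and 1.0 (assumed range)");
  }
}

async function compactViaFetch(body: CompactRequest, apiKey: string): Promise<CompactResponse> {
  validateRequest(body);
  const res = await fetch("https://api.morphllm.com/v1/compact", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}`, "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`compact failed with HTTP ${res.status}`);
  return (await res.json()) as CompactResponse;
}
```

A call such as `compactViaFetch({ input: sourceCode, query: "authentication", preserve_recent: 0 }, apiKey)` mirrors the Python `requests` example in Quick Start.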

`POST /v1/responses`

OpenAI Responses API format. Works with any OpenAI SDK pointed at https://api.morphllm.com/v1.

Parameter	Type	Required	Description
`model`	string	Yes	`morph-compactor`
`input`	string or array	Yes	Text or `{role, content}` array
`query`	string	No	Focus query for relevance-based pruning

{
  "id": "cmpr-abc123",
  "object": "response",
  "model": "morph-compactor",
  "output": [{
    "type": "message",
    "role": "assistant",
    "content": [{ "type": "output_text", "text": "compressed text..." }]
  }],
  "usage": { "input_tokens": 4200, "output_tokens": 1800 }
}

`POST /v1/chat/completions`

OpenAI Chat Completions format. Drop-in replacement — point any OpenAI-compatible client at https://api.morphllm.com/v1.

Parameter	Type	Required	Description
`model`	string	Yes	`morph-compactor`
`messages`	array	Yes	`{role, content}` message array

{
  "id": "cmpr-def456",
  "object": "chat.completion",
  "model": "morph-compactor",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "compressed text..." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 4200, "completion_tokens": 1800, "total_tokens": 6000 }
}
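The two OpenAI-compatible endpoints return the compressed text in different places and name their usage fields differently. A small adapter can hide that. The sketch below relies only on the response shapes shown above; `compressedText` and `tokenReduction` are hypothetical helper names.

```typescript
type ResponsesPayload = {
  output: Array<{ content: Array<{ type: string; text: string }> }>;
};
type ChatPayload = {
  choices: Array<{ message: { content: string } }>;
};

// Pull the compressed text out of either response format; returns "" if it is absent.
function compressedText(payload: ResponsesPayload | ChatPayload): string {
  if ("choices" in payload) {
    return payload.choices[0]?.message.content ?? "";
  }
  const part = payload.output[0]?.content.find((p) => p.type === "output_text");
  return part?.text ?? "";
}

// Fraction of tokens removed. Pass input_tokens/output_tokens from /v1/responses
// or prompt_tokens/completion_tokens from /v1/chat/completions.
function tokenReduction(tokensIn: number, tokensOut: number): number {
  return tokensIn === 0 ? 0 : 1 - tokensOut / tokensIn;
}

// Sample usage from the responses above: 4200 in, 1800 out, about 57% fewer tokens.
console.log((tokenReduction(4200, 1800) * 100).toFixed(1) + "%");
```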

Errors

Status	Meaning
`400`	Malformed request or input too large
`401`	Invalid API key
`503`	Model not loaded
`504`	Request timed out
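When calling the HTTP endpoints directly, a caller may want to retry the statuses that look transient. The sketch below treats `503` and `504` as retryable and `400`/`401` as permanent; that split is an assumption drawn from the meanings in the table, not documented retry guidance. The helper names, the four-attempt limit, and the backoff constants are all hypothetical. The SDK accepts a `retryConfig`, so this only matters for raw HTTP clients.

```typescript
// Assumption: only these statuses are worth retrying; 400 and 401 need a fixed request instead.
const RETRYABLE_STATUSES = new Set([503, 504]);

function isRetryable(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at capMs.
function backoffMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Calls send() up to maxAttempts times, waiting between retryable failures.
async function withRetry<T extends { status: number }>(
  send: () => Promise<T>,
  maxAttempts = 4,
): Promise<T> {
  let last = await send();
  for (let attempt = 0; attempt < maxAttempts - 1 && isRetryable(last.status); attempt++) {
    await new Promise<void>((resolve) => setTimeout(resolve, backoffMs(attempt)));
    last = await send();
  }
  return last;
}
```

With `fetch`, usage would be `await withRetry(() => fetch(url, init))`, since a `Response` already exposes `status`.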

SDK Reference

CompactInput

{
  input?: string | Array<{ role: string, content: string }>,
  messages?: Array<{ role: string, content: string }>,
  query?: string,
  compressionRatio?: number,    // 0.05-1.0, default 0.5
  preserveRecent?: number,      // default 2
  includeLineRanges?: boolean,  // default true
  includeMarkers?: boolean,     // default true
  model?: string,
}

CompactResult

{
  id: string,
  output: string,              // all messages joined
  messages: Array<{
    role: string,
    content: string,
    compacted_line_ranges: Array<{ start: number, end: number }>,
    kept_line_ranges: Array<{ start: number, end: number }>,  // force-preserved via <keepContext>
  }>,
  usage: { input_tokens: number, output_tokens: number, compression_ratio: number, processing_time_ms: number },
  model: string,
}

CompactConfig

{
  morphApiKey?: string,     // defaults to MORPH_API_KEY env
  morphApiUrl?: string,
  timeout?: number,         // defaults to 120000 (2 min)
  retryConfig?: RetryConfig,
  debug?: boolean,
}

Edge / Cloudflare Workers

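import { CompactClient } from '@morphllm/morphsdk/edge';

// Worker bindings; MORPH_API_KEY is configured as a secret
interface Env {
  MORPH_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env) {
    const compact = new CompactClient({ morphApiKey: env.MORPH_API_KEY });
    // Cast the parsed body so destructuring type-checks
    const { input, query } = (await request.json()) as { input: string; query?: string };

    const result = await compact.compact({ input, query });
    return Response.json({ output: result.output, usage: result.usage });
  }
};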

Best Practices

Keep recent messages verbatim

Set preserve_recent to at least 3. Recent turns contain the user’s active intent and the assistant’s latest reasoning. Compacting them risks dropping context the LLM needs right now.

Always pass a query

Without it, the model falls back to auto-detection from the last user message. An explicit query gives tighter, more relevant compression because the model knows exactly which lines to score.

Compact before the LLM call

The value is in reducing what you send to your LLM. Compacting a response after generation saves storage but doesn’t cut inference cost.

Tune compression_ratio

Default 0.5 is a good starting point. Agent loops past 100 turns: try 0.3. Shorter conversations where nuance matters: try 0.7.

const result = await morph.compact({
  input: chatHistory,
  query: "rate limiting",
  preserveRecent: 3,  // last 3 messages pass through untouched
});
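The tuning guidance above can be folded into a small policy function. This is a sketch under stated assumptions: `shouldCompact` and `pickCompressionRatio` are hypothetical helpers, the "nuance matters" flag is something the caller has to decide, and letting the 100-turn rule win over that flag is an arbitrary choice made here.

```typescript
// Compact once the conversation passes roughly 50 turns.
function shouldCompact(turns: number): boolean {
  return turns > 50;
}

// 0.3 for long agent loops, 0.7 when nuance matters, otherwise the 0.5 default.
function pickCompressionRatio(turns: number, nuanceMatters = false): number {
  if (turns > 100) return 0.3; // assumption: long loops take precedence over the nuance flag
  if (nuanceMatters) return 0.7;
  return 0.5;
}

const turns = 120;
if (shouldCompact(turns)) {
  // Values to pass as compressionRatio and preserveRecent in morph.compact(...)
  console.log({ compressionRatio: pickCompressionRatio(turns), preserveRecent: 3 });
}
```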

FAQ

How is this different from summarization?

Summarization rewrites. Compaction deletes. Every line that survives compaction is character-for-character identical to your original. No drift, no hallucinated context.

When should I compact?

Agent loops past 50 turns. Multi-session chat where you carry history forward. Any pipeline where you pay for tokens that are greetings and filler.

String input or conversation array?

Both work. A conversation array preserves role structure through compaction.

Can I use the standard OpenAI SDK?

Yes. Point baseURL at https://api.morphllm.com/v1 and call client.responses.create() or client.chat.completions.create() with model: "morph-compactor".

How fast?

33,000 tok/s. 100K tokens compress in under 2 seconds, and 180K tokens in about 3 seconds.

Notes for AI Coding Agents

Integration guide for LLMs and coding agents

If you are an AI coding agent integrating Morph Compact into a project, follow these steps:

1. Understand the existing compaction logic. Before writing any code, read the project’s current context management. This might be in pre-commit hooks, middleware, a context window manager, or a custom prompt builder. Search for keywords like compact, compress, truncate, summarize, or context in the codebase.

2. Choose the right integration method.

If the project uses TypeScript/JavaScript, use the @morphllm/morphsdk package — it handles retries, types, and edge runtime support out of the box.
If the project uses Python with OpenAI SDK, point the client at https://api.morphllm.com/v1 and call responses.create() or chat.completions.create() with model: "morph-compactor".
For any other language, call POST /v1/compact directly with a JSON body containing input (or messages) and optionally query.

3. Follow best practices.

Always pass query — set it to the user’s current question or task description for much better compression.
Set preserve_recent to at least 3 to keep the latest turns intact.
Compact before sending to the LLM, not after — the goal is to reduce inference cost.
Use compression_ratio: 0.3 for long agent loops (100+ turns), 0.7 for shorter conversations.
Wrap critical context in <keepContext> tags if certain sections must never be removed.
