Awesome Harness Engineering
The most comprehensive, information-dense collection of resources for harness engineering: the practice of shaping the environment around AI agents so they work reliably in real-world production systems.
Harness engineering sits at the intersection of context engineering, evaluation, observability, orchestration, safe autonomy, and software architecture. While an agent is the model plus its tools, the harness is everything else: the constraints, state management, verification loops, and runtime infrastructure that make agents dependable.
┌──────────────────────────────────────────────────────────────────────────────┐
│                                AGENT HARNESS                                 │
│                                                                              │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │
│  │    Context    │  │  Guardrails   │  │    Evals &    │  │   Runtime &   │  │
│  │    Engine     │  │   & Safety    │  │ Observability │  │ Orchestration │  │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘  │
│          │                  │                  │                  │          │
│          └──────────────────┴────────┬─────────┴──────────────────┘          │
│                                      │                                       │
│                              ┌───────┴───────┐                               │
│                              │   LLM Agent   │                               │
│                              │ (Model+Tools) │                               │
│                              └───────────────┘                               │
│                                                                              │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │
│  │   Memory &    │  │    Specs &    │  │   Sandbox &   │  │  Benchmarks   │  │
│  │     State     │  │  Agent Files  │  │   Execution   │  │               │  │
│  └───────────────┘  └───────────────┘  └───────────────┘  └───────────────┘  │
└──────────────────────────────────────────────────────────────────────────────┘
Why this list? The shift from prompt engineering → context engineering → harness engineering marks a maturation of the AI engineering discipline. This list tracks that evolution with primary sources, not commentary.
The foundational writings that defined harness engineering as a discipline.
| Source | Article | Description |
| --- | --- | --- |
|  | Harness engineering: leveraging Codex in an agent-first world | The article that coined "harness engineering": how OpenAI built a large application with Codex using architectural constraints, repo-local instructions, browser validation, and telemetry |
|  | Effective harnesses for long-running agents | Core article on initializer agents, feature lists, init.sh, self-verification, and handoff artifacts across many context windows |
|  | Harness design for long-running application development | GAN-inspired multi-agent harness: a generator/evaluator loop for autonomous frontend design and long-running app generation |
|  | Building effective agents | Foundational guide distinguishing workflows vs. agents, with composable patterns for building reliable systems |
|  | The Anatomy of an Agent Harness | Derives harness components from first principles: prompts, tools, middleware, orchestration, and runtime infrastructure |
|  | Harness Engineering | Framing harness work into context engineering, architectural constraints, and "garbage collection" against entropy |
|  | Skill Issue: Harness Engineering for Coding Agents | A practical argument that weak results from coding agents are often harness problems, not model problems |
|  | Your Agent Needs a Harness, Not a Framework | Why state, retries, traces, and concurrency are first-class infrastructure concerns |
|  | Unlocking the Codex harness: how we built the App Server | Deep dive into the Codex app server harness implementation |
|  | Unrolling the Codex agent loop | Technical breakdown of the Codex agent loop architecture |
Understanding the trajectory from prompt engineering to harness engineering.
Context, Memory & Working State
The art of managing what goes into the context window: treating it as a working memory budget, not a dumping ground.
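One way to picture the "working memory budget" idea is a packer that admits context items by priority until a token budget is spent. The sketch below is illustrative only: `ContextItem`, `estimate_tokens`, and `pack_context` are hypothetical names, and the four-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    priority: int  # higher means more important to keep


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def pack_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Keep the highest-priority items that fit within the token budget."""
    chosen: set[int] = set()
    used = 0
    # Visit items from most to least important.
    for index in sorted(range(len(items)), key=lambda i: -items[i].priority):
        cost = estimate_tokens(items[index].text)
        if used + cost <= budget:
            chosen.add(index)
            used += cost
    # Emit survivors in their original order so the prompt stays coherent.
    return [item for i, item in enumerate(items) if i in chosen]
```

Greedy packing by priority is only one possible policy; summarizing or offloading low-priority items instead of dropping them is an equally valid design.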
Memory & State Management
How agents persist knowledge, recover from interruptions, and maintain state across sessions.
| Project | Description |
| --- | --- |
| Letta (formerly MemGPT) | Stateful agents with OS-like memory management (RAM/disk analogy) |
| Rearchitecting Letta's Agent Loop | Lessons from ReAct, MemGPT, and Claude Code for agent loop design |
| Memory Blocks (Letta) | Discrete functional memory units for context window management |
| LangGraph + Redis | <1ms latency state persistence for agents |
| LangGraph + DynamoDB | Durable agent state on AWS infrastructure |
| Checkpoint/Restore Systems for AI Agents | Survey of checkpoint/restore techniques |
| Databricks Agent Memory | Built-in memory for Databricks agent framework |
| Mem0 | Hybrid storage (Postgres + vector); extracts memories with ADD/UPDATE/DELETE operations; up to 26% accuracy gains |
| Zep | Temporal knowledge graph tracking how facts change over time; combines graph memory with vector search |
| Cognee | Knowledge graph layer that structures, connects, and retrieves information as interconnected knowledge |
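To make "recover from interruptions" concrete, here is a minimal checkpoint/restore sketch using only the Python standard library. It is not how any project listed above works; the state shape (`step`, `notes`) and the function names are assumptions chosen for illustration. The key idea is writing to a temporary file and renaming it, so a crash never leaves a half-written checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically: temp file first, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, path)  # atomic when both paths share a filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not path.exists():
        return {"step": 0, "notes": []}
    return json.loads(path.read_text(encoding="utf-8"))


def run_steps(path: Path, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Resume from the last checkpoint; stop_after simulates an interruption."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break
        state["notes"].append(f"finished step {state['step']}")
        state["step"] += 1
        save_checkpoint(path, state)  # persist after every completed step
    return state
```

Calling `run_steps` a second time with the same path picks up at the saved step instead of starting over.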
Structured Output (Agent I/O Harness)
Libraries that ensure reliable, schema-compliant agent output.
| Project | Description |
| --- | --- |
| Instructor | Type-safe structured extraction from LLMs using Pydantic; SDKs for Python, TypeScript, Go, Ruby |
| Outlines | FSM-based token masking ensures 100% schema-compliant output at generation time |
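A simpler alternative to the libraries above is to validate output after generation and retry with feedback. The sketch below shows that validate-and-retry pattern with the standard library only. It does not use Instructor's or Outlines' APIs, and it does not constrain decoding the way token masking does. `REQUIRED_FIELDS`, `parse_ticket`, and the `generate` callable are hypothetical stand-ins for a real schema and model client.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"title": str, "priority": int}


def parse_ticket(raw: str) -> dict:
    """Parse model output and check it against the expected fields."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            raise ValueError(f"field {name!r} must be {expected_type.__name__}")
    return data


def generate_validated(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its output validates, feeding errors back in."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            return parse_ticket(raw)
        except ValueError as error:
            feedback = f"\nPrevious output was invalid: {error}. Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

Retrying costs extra model calls, which is the trade-off that generation-time constraints are designed to avoid.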
Constraints, Guardrails & Safe Autonomy
Safety & Control Patterns
Reducing approval friction without losing control: sandboxing, policy design, and quality loops.
| Project | Description |
| --- | --- |
| Invariant Guardrails (now Snyk) | Rule-based guardrailing for MCP and agentic AI |
| Invariant MCP-scan | Security scanner for MCP servers: prompt injection, tool poisoning detection |
| NeMo Guardrails (NVIDIA) | Open-source programmable guardrails; sub-100ms latency; GPU-accelerated |
| Guardrails AI | Open-source framework for LLM output validation |
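"Reducing approval friction without losing control" often comes down to a policy that auto-approves safe actions, asks a human about risky ones, and denies the rest. The following is a toy sketch of such a policy, not the rule language of any tool listed above. The tool names, the blocked fragments, and the `check` function are all hypothetical, and substring matching is far too naive for real use.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"  # run without asking
    ASK = "ask"      # pause for human approval
    DENY = "deny"    # refuse outright


@dataclass(frozen=True)
class ToolCall:
    tool: str
    argument: str


# Hypothetical policy data for illustration.
READ_ONLY_TOOLS = {"read_file", "list_dir", "grep"}
BLOCKED_FRAGMENTS = ("rm -rf", "sudo ", "curl ")


def check(call: ToolCall) -> Decision:
    """Classify a proposed tool call before the harness executes it."""
    if call.tool in READ_ONLY_TOOLS:
        return Decision.ALLOW
    if call.tool == "shell":
        if any(fragment in call.argument for fragment in BLOCKED_FRAGMENTS):
            return Decision.DENY
        return Decision.ASK
    return Decision.DENY  # default-deny anything the policy does not know
```

The default-deny final branch is the important design choice: a new tool stays blocked until someone decides which tier it belongs in.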
Specs, Agent Files & Workflow Design
Agent Instruction Standards
How to tell agents what to do: repo-local instruction files and machine-readable specifications.
Workflow & Orchestration Design
Evals & Observability
Evaluation Guides & Frameworks
How to measure whether your agent actually works: evaluation methodology for non-deterministic systems.
| Platform | Type | Description |
| --- | --- | --- |
| Arize Phoenix | OSS | OpenTelemetry-based tracing, evals, and experiments for AI |
| Langfuse | OSS | LLM observability: tracing, prompt management, evals (MIT license) |
| LangSmith | Commercial | Agent engineering platform: tracing, evaluation, deployment |
| Braintrust | Commercial | AI observability + evaluation; used by Notion, Stripe, Zapier |
| Helicone | Commercial | AI Gateway with routing, caching, rate limiting, cost analytics |
| AI observability tools buyer's guide 2026 | Guide | Comprehensive comparison of observability platforms |
| Portkey | Commercial | AI gateway + observability; routing, fallbacks, load balancing, caching, and prompt versioning |
| LiteLLM | OSS | Unified proxy for 100+ LLMs in OpenAI format; cost tracking, guardrails, load balancing |
| OpenTelemetry for LLMs | Standard | Emerging standard; OpenLLMetry and OpenLIT emit OTLP-compatible spans |
| Comparing open-source AI agent frameworks | Guide | Framework comparison with observability perspective |
Benchmarks that stress harness quality, not just model quality: context handling, tool calling, environment control, verification logic, and runtime scaffolding.
| Benchmark | Focus | Description |
| --- | --- | --- |
| SWE-bench Verified | Code | Real GitHub issues and tests; harness choices around retrieval, patching, and validation are highly visible |
| SWE-PolyBench (Amazon) | Code | Multi-language: 2,110 instances across 21 repos in Java/JS/TS/Python |
| SWE-Bench Pro | Code | 1,865 problems from 41 repos across multiple programming languages |
| FeatureBench | Code | 200 eval instances; SOTA agents achieve only 11% (vs 74% on SWE-bench) |
| Terminal-Bench | Terminal | Terminal-native agents in shells, filesystems, and verification-heavy environments |
| Terminal-Bench 2.0 & Harbor | Terminal | Harder tasks and generalized evaluation harness |
| OSWorld | Desktop | 369 tasks across Ubuntu, Windows, macOS with execution-based evaluators |
| AppWorld | Interactive | Controllable world of apps for testing planning, code generation, and collateral-damage control |
| AgentBench | Multi-env | Cross-environment: OS, databases, knowledge graphs, web browsing |
| tau2-bench | Multi-step | Realistic multi-step tasks where success depends on tool use and execution quality |
| WebArena-Verified | Web | Curated web-agent tasks with deterministic evaluators over responses and network traces |
| WorkArena | Enterprise | Common knowledge-work tasks on realistic enterprise-style web workflows |
| GAIA | General | General AI assistant benchmark for tools, planning, verification, and long-horizon autonomy |
| HAL: Holistic Agent Leaderboard | Leaderboard | Reliability, cost, and broad task coverage for comparing end-to-end harness behavior |
| DPAI Arena (JetBrains) | Code | Open platform for coding agent benchmarks across full dev lifecycle |
| LOCA-bench | Context | Benchmarks long-context agents; reveals "context rot" phenomenon |
| SWE-bench Live | Code | Live benchmark with real-time GitHub issues |
Evaluation Frameworks & Tools
| Tool | Description |
| --- | --- |
| Inspect AI | Eval framework from the UK AI Security Institute (formerly the AI Safety Institute); batteries-included with pre-built benchmarks |
| ai-agent-benchmark-compendium | Compendium of 50+ agent benchmarks, categorized by function calling, reasoning, coding |
| Galileo Agent Eval | Framework with metrics, rubrics, and benchmarks for production agent evaluation |
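Because agent runs are non-deterministic, a single pass or fail says little; a common approach is to run each task several times and report rates. The sketch below estimates a pass rate from repeated trials and derives two summary numbers from it. The function names are hypothetical, and both formulas assume attempts are independent, which real agent runs only approximate.

```python
import random
from typing import Callable


def pass_rate(task: Callable[[random.Random], bool], trials: int, seed: int = 0) -> float:
    """Run a stochastic task `trials` times and return the fraction that pass."""
    rng = random.Random(seed)  # seeded so the evaluation itself is repeatable
    passes = sum(1 for _ in range(trials) if task(rng))
    return passes / trials


def pass_all_k(rate: float, k: int) -> float:
    """Chance that k independent attempts all succeed (a reliability view)."""
    return rate ** k


def pass_any_k(rate: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1 - (1 - rate) ** k
```

The gap between the two summaries is the point: an agent with a 90% per-run pass rate succeeds five times in a row only about 59% of the time, which matters when the harness cannot retry.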
Runtimes, Harnesses & Reference Implementations
| Framework | Maintainer | Description |
| --- | --- | --- |
| Claude Agent SDK | Anthropic | Production-oriented SDK with sessions, tools, orchestration, and context compaction |
| OpenAI Agents SDK | OpenAI | Lightweight SDK for multi-agent workflows (handoffs, guardrails, tracing); the visual Agent Builder canvas, ChatKit, and Connector Registry belong to OpenAI's separate AgentKit |
| Google ADK | Google | Open-source framework for building multi-agent applications |
| Microsoft Agent Framework | Microsoft | Convergence of AutoGen + Semantic Kernel; checkpointing & resuming |
| AutoGen | Microsoft | Open-source multi-agent programming framework |
| LangGraph | LangChain | Graph-based agent orchestration with built-in persistence |
| deepagents | LangChain | Deeper, longer-running agents with middleware and harness patterns |
| CrewAI | CrewAI | Role-driven multi-agent orchestration; a fast-growing multi-agent framework |
| MetaGPT | Open Source | Simulates a software company with PM/Architect/Engineer/QA agents |
| Pydantic AI | Pydantic | Type-safe Python agent framework |
| Agno | Agno | High-performance multi-agent runtime |
| Smolagents | Hugging Face | Ultra-minimal agent framework |
| Mastra | Gatsby team | TypeScript agent framework |
| AWS Strands Agents | AWS | Model-driven ReAct pattern; deep Lambda integration |
| AgentKit | Inngest | TypeScript toolkit for durable, workflow-aware agents |
| Vercel AI SDK | Vercel | Unified toolkit for 30+ LLM providers; frontend-to-backend agent infrastructure |
| VoltAgent | VoltAgent | TypeScript agent platform with orchestration, memory, RAG, and enterprise observability |
Sandbox & Execution Environments
| Platform | Description |
| --- | --- |
| E2B | Open-source Firecracker microVM sandboxing; ~150ms cold starts |
| Modal | Container-based agent execution; scales to 50K+ concurrent instances |
| Daytona | Docker-based sandbox; sub-90ms creation; pivoted to agent infra in 2025 |
| SWE-ReX | Sandboxed code execution infrastructure for AI agents |
| awesome-sandbox | Curated list of code sandboxing solutions for AI agents |
| Browserbase | Cloud-hosted browser instances for AI agents at scale |
| Stagehand | Browserbase's open-source SDK bridging Playwright and AI agents |
| Firecrawl | Managed isolated browser environment + web scraping API for agents |
| Top AI Code Sandbox Products (Modal) | 2025 comparison of sandbox solutions |
Reference Implementations
| Project | Description |
| --- | --- |
| SWE-agent | Mature research coding agent with inspectable harness, prompt, tools, and environment |
| Harbor | Generalized harness for evaluating and improving agents at scale |
| Terminal-Bench | Open-source terminal benchmark implementation |
MCP (Model Context Protocol)
The emerging standard for giving agents structured, controlled access to tools and data sources.
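For readers new to the protocol, MCP messages are JSON-RPC 2.0, and a tool invocation is a request with method `tools/call` whose params carry the tool name and an arguments object. The sketch below only builds such a request as a string; it does not open a transport or use an MCP SDK. The helper name `tool_call_request` and the `search_docs` tool in the usage note are hypothetical.

```python
import json
from itertools import count

# Each JSON-RPC request needs an id so the response can be matched to it.
_request_ids = count(1)


def tool_call_request(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)
```

For example, `tool_call_request("search_docs", {"query": "retries"})` yields a message a harness could log, filter through a policy check, and then send over whatever transport the server uses.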
Coding Agents in Practice
| Tool | Type | Description |
| --- | --- | --- |
| Claude Code | CLI | Anthropic's agentic coding CLI with hooks, sub-agents, and MCP |
| Codex | CLI | OpenAI's cloud-based coding agent |
| Cursor | IDE | AI-first code editor with Background Agents |
| Windsurf | IDE | Cascade engine for agentic coding workflows |
| Aider | CLI | Open-source AI pair programming in the terminal |
| Continue | Extension | Open-source AI code assistant for VS Code and JetBrains |
| OpenHands | Platform | Open platform for AI software developers |
| Gemini CLI | CLI | Google's open-source AI agent for the terminal |
| Devin | Platform | Cognition's autonomous coding agent |
| Replit Agent | Platform | In-browser agent with snapshot engine and self-healing tests |
| Goose | CLI | Block's fully open-source (Apache-2.0) MCP-native agent; model-agnostic |
| Cline | Extension | BYOM (bring your own model) agent for VS Code |
| Devon | CLI | Open-source pair programmer with autonomous planning and debugging |
| OpenCode | CLI | 75+ provider support, LSP integration, privacy-first |
| v0 | Platform | Vercel's AI-powered frontend development agent |
Field Reports from Coding Agent Companies
Real-world insights from teams building and deploying coding agents at scale.
Production Deployment
Lessons from running agents in production: what breaks, what works, and what scales.
Papers advancing the theoretical and empirical foundations of harness engineering.
Harness & Context Engineering
Agent Reliability & Safety
Paper
Venue/Date
Key Contribution
Memory Management for Long-Running Low-Code Agents
arXiv, Sep 2025
Memory management for persistent agent sessions
Efficient On-Device Agents via Adaptive Context Management
arXiv, Nov 2025
Context management for resource-constrained environments
Agentic AI: Challenges and Opportunities
arXiv, Jan 2026
Comprehensive survey of verifiable planning, coordination, memory, governance
From Competition to Coordination: Safe Multi-Agent LLM Systems
arXiv, Nov 2025
Market-making framework for safe multi-agent coordination
Emergent Coordination in Multi-Agent Language Models
arXiv, Oct 2025
How prompt design steers multi-agent LLMs
Towards a Science of AI Agent Reliability
arXiv, Feb 2026
Evaluates 14 models across 3 providers with scaffolding strategies
Confucius Code Agent
arXiv, Dec 2025
Scalable agent scaffolding with persistent note-taking for cross-session learning
Agentic AI Frameworks: Architectures, Protocols, Design
arXiv, Aug 2025
Comprehensive survey of agentic AI architectures and protocols
A Practical Guide for Production-Grade Agentic AI Workflows
arXiv, Dec 2025
Nine best practices: tool-first design, single-responsibility agents, KISS principle
Evaluation & Benchmarking
Learning Resources & Curated Lists
Harness & Context Engineering
Agent Frameworks & General AI
Research Paper Collections
Contributions are welcome! Please prefer resources that are:
- Primary sources: original implementations, first-party articles, or seminal papers
- Specific: about how agents are constrained, evaluated, resumed, observed, or orchestrated
- Practical: useful to practitioners building real harnesses, not generic AI commentary
- Current: actively maintained or recently published (2024+)
If two links say the same thing, prefer the more primary, practical, and implementation-oriented one.
See CONTRIBUTING.md for contribution guidelines and the preferred entry format.
CC0 1.0