🏗️ Awesome Harness Engineering

The most comprehensive, information-dense collection of resources for harness engineering — the practice of shaping the environment around AI agents so they work reliably in real-world production systems.

Harness engineering sits at the intersection of context engineering, evaluation, observability, orchestration, safe autonomy, and software architecture. While an agent is the model plus its tools, the harness is everything else: the constraints, state management, verification loops, and runtime infrastructure that make agents dependable.

┌─────────────────────────────────────────────────────────────────┐
│                        AGENT HARNESS                            │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌───────────────┐  │
│  │ Context  │  │ Guardrails│  │  Evals &  │  │   Runtime &   │  │
│  │ Engine   │  │ & Safety │  │Observability│ │ Orchestration │  │
│  └────┬─────┘  └────┬─────┘  └─────┬─────┘  └──────┬────────┘  │
│       │              │              │               │           │
│       └──────────────┴──────┬───────┴───────────────┘           │
│                             │                                   │
│                      ┌──────┴──────┐                            │
│                      │  LLM Agent  │                            │
│                      │ (Model+Tools)│                           │
│                      └─────────────┘                            │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌───────────────┐  │
│  │ Memory & │  │  Specs & │  │ Sandbox & │  │  Benchmarks   │  │
│  │  State   │  │Agent Files│  │ Execution │  │               │  │
│  └──────────┘  └──────────┘  └───────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Why this list? The shift from prompt engineering → context engineering → harness engineering marks a maturation of the AI engineering discipline. This list tracks that evolution with primary sources, not commentary.

📑 Contents

Foundations
- Seminal Articles
- Evolution & Big Picture
Context, Memory & Working State
Constraints, Guardrails & Safe Autonomy
- Safety & Control Patterns
- Guardrail Frameworks
Specs, Agent Files & Workflow Design
- Agent Instruction Standards
- Workflow & Orchestration Design
Evals & Observability
- Evaluation Guides & Frameworks
- Observability Platforms
Benchmarks
- Evaluation Frameworks & Tools
Runtimes, Harnesses & Reference Implementations
MCP (Model Context Protocol)
Coding Agents in Practice
- Tools & Products
- Field Reports from Coding Agent Companies
Production Deployment
Academic Research
Learning Resources & Curated Lists
Contributing
License

🏛️ Foundations

Seminal Articles

The foundational writings that defined harness engineering as a discipline.

Source	Article	Description
	Harness engineering: leveraging Codex in an agent-first world	The article that coined "harness engineering" — how OpenAI built a large application with Codex using architectural constraints, repo-local instructions, browser validation, and telemetry
	Effective harnesses for long-running agents	Core article on initializer agents, feature lists, `init.sh`, self-verification, and handoff artifacts across many context windows
	Harness design for long-running application development	GAN-inspired multi-agent harness — generator/evaluator loop for autonomous frontend design and long-running app generation
	Building effective agents	Foundational guide distinguishing workflows vs. agents, with composable patterns for building reliable systems
	The Anatomy of an Agent Harness	Derives harness components from first principles: prompts, tools, middleware, orchestration, and runtime infrastructure
	Harness Engineering	Framing harness work into context engineering, architectural constraints, and "garbage collection" against entropy
	Skill Issue: Harness Engineering for Coding Agents	A practical argument that weak results from coding agents are often harness problems, not model problems
	Your Agent Needs a Harness, Not a Framework	Why state, retries, traces, and concurrency are first-class infrastructure concerns
	Unlocking the Codex harness: how we built the App Server	Deep dive into the Codex app server harness implementation
	Unrolling the Codex agent loop	Technical breakdown of the Codex agent loop architecture

Evolution & Big Picture

Understanding the trajectory from prompt engineering to harness engineering.

🔗 The Third Evolution: Prompt → Context → Harness — Epsilla's three-era framework for AI engineering evolution
🔗 2025 Was Agents. 2026 Is Agent Harnesses. — Industry overview of the agent-to-harness shift
🔗 The Rise of AI Harness Engineering — Overview of the emerging discipline
🔗 The importance of Agent Harness in 2026 — CPU-OS analogy; "the harness is the dataset"; trajectory capture as competitive advantage
🔗 Harness Engineering: The Missing Layer Behind AI Agents — Clear distinction between prompt/context/harness engineering
🔗 Agent Harnesses: From DIY Patterns to Product — Evolution from ad-hoc patterns to productized harnesses
🔗 What we miss when we talk about "AI Harnesses" — Philosophical critique of the harness framing
🔗 Agent Frameworks, Runtimes, and Harnesses, Oh My! — LangChain's decomposition of what belongs in a framework, a runtime, and a harness
🔗 Conductors to Orchestrators: The Future of Agentic Coding — O'Reilly's take on the future of agent-driven development
🔗 Martin Fowler on Preparing for AI's Nondeterministic Computing — Engineering for nondeterminism in agent systems
🔗 Why AI Agent Reliability Depends More on the Harness Than the Model — HackerNoon on harness-first reliability
🔗 What Is an Agent Harness? — Firecrawl's breakdown: infrastructure that makes AI agents actually work
🔗 What is an agent harness in the context of LLMs? — The model is CPU, context window is RAM, harness is OS, agent is application
🔗 The GAN-Style Agent Loop: Deconstructing Anthropic's Harness — Analysis of generator-evaluator harness pattern
🔗 From Prompts to Context to Harness Engineering — Evolution overview
🔗 OpenAI Introduces Harness Engineering — InfoQ coverage of the paradigm shift
🔗 Agent Engineering: A New Discipline — LangChain on the emergence of agent engineering
🔗 Agents At Work: The 2026 Playbook — Building reliable agentic workflows
🔗 Anthropic 2026 Agentic Coding Trends Report (PDF) — How coding agents are reshaping development

🧠 Context, Memory & Working State

Context Engineering

The art of managing what goes into the context window — treating it as a working memory budget, not a dumping ground.

Source	Article	Description
	Effective context engineering for AI agents	Managing the context window as a working memory budget with practical strategies
	Context Engineering for Agents	Four strategies: writing, selecting, compressing, isolating context
	The Rise of Context Engineering	Why context engineering matters more than prompt engineering
	Context Engineering for AI Agents: Lessons from Building Manus	KV-cache locality, tool masking, filesystem memory, and keeping useful failures in-context
	Context Engineering for Coding Agents	Shaping the task environment so coding agents stay grounded and productive
	Architecting efficient context-aware multi-agent framework	ADK's tiered state architecture: Session, Memory, Artifacts
	Advanced Context Engineering for Coding Agents	Patterns for reducing context drift and making coding sessions easier to resume
	Context-Efficient Backpressure for Coding Agents	Preventing agents from burning context on noisy or low-value work
	Context Engineering: What It Is and Techniques to Consider	LlamaIndex's techniques for context engineering
	Context Engineering (Lance Martin)	Practical guide from LangChain co-founder
	Context Engineering: Bringing Engineering Discipline to Prompts — Addy Osmani	Systematic approach to context management
	Efficient Context Management for LLM-Powered Agents	NeurIPS 2025 research: simple observation masking ≈ LLM summarization, ~50% cost reduction
	Context Engineering Best Practices — Comet	Best practices for agentic systems
	Context Engineering Best Practices — Kubiya	Practical reliability-focused best practices

Memory & State Management

How agents persist knowledge, recover from interruptions, and maintain state across sessions.

Project	Description
Letta (formerly MemGPT)	Stateful agents with OS-like memory management (RAM/disk analogy)
Rearchitecting Letta's Agent Loop	Lessons from ReAct, MemGPT, and Claude Code for agent loop design
Memory Blocks — Letta	Discrete functional memory units for context window management
LangGraph + Redis	<1ms latency state persistence for agents
LangGraph + DynamoDB	Durable agent state on AWS infrastructure
Checkpoint/Restore Systems for AI Agents	Survey of checkpoint/restore techniques
Databricks Agent Memory	Built-in memory for Databricks agent framework
Mem0	Hybrid storage (Postgres + vector); extracts memories with ADD/UPDATE/DELETE operations; up to 26% accuracy gains
Zep	Temporal knowledge graph tracking how facts change over time; combines graph memory with vector search
Cognee	Knowledge graph layer that structures, connects, and retrieves information as interconnected knowledge

Structured Output (Agent I/O Harness)

Libraries that ensure reliable, schema-compliant agent output.

Project	Description
Instructor	Type-safe structured extraction from LLMs using Pydantic; SDKs for Python, TypeScript, Go, Ruby
Outlines	FSM-based token masking ensures 100% schema-compliant output at generation time

🛡️ Constraints, Guardrails & Safe Autonomy

Safety & Control Patterns

Reducing approval friction without losing control — sandboxing, policy design, and quality loops.

Source	Article	Description
	Beyond permission prompts: making Claude Code more secure	Better sandboxing and policy design for secure autonomous agents
	Code execution with MCP: building more efficient agents	Controlled execution power through explicit, inspectable tool boundaries; 150K → 2K token reduction
	Writing effective tools for agents	Tool interfaces that are easier for models to call correctly and safely
	Advanced tool use on the Claude Developer Platform	Tool Search, Programmatic Tool Calling, and Tool Use Examples
	Assessing internal quality while coding with an agent	Moving quality checks into the loop instead of relying on after-the-fact review
	Anchoring AI to a reference application	Constraining agents with concrete exemplars for more consistent output
	Humans and Agents in Software Engineering Loops	Where humans should strengthen the harness instead of micromanaging artifacts
	Claude Code: Best practices for agentic coding	Repo structure, checkpoints, validation, and delegation in agentic workflows
	Agentic Engineering Patterns — Simon Willison	Coding practices and patterns for working with agents
	The lethal trifecta for AI agents — Simon Willison	Private data + untrusted content + external communication = security risk
	Governing Claude Code with Kong AI Gateway	Secure agent harness rollouts via API gateway
	AI Agent Safety — Cleanlab	Managing unpredictability at scale in production agents
	Lessons from 2025: Agent Mitigation	How "agent mitigation" became a new discipline

Guardrail Frameworks

Project	Description
Invariant Guardrails (now Snyk)	Rule-based guardrailing for MCP and agentic AI
Invariant MCP-scan	Security scanner for MCP servers: prompt injection, tool poisoning detection
NeMo Guardrails — NVIDIA	Open-source programmable guardrails; sub-100ms latency; GPU-accelerated
Guardrails AI	Open-source framework for LLM output validation

📋 Specs, Agent Files & Workflow Design

Agent Instruction Standards

How to tell agents what to do — repo-local instruction files and machine-readable specifications.

Project	Description
AGENTS.md	Open format for repo-local instructions; intro by OpenAI
agent.md	Related standardization effort for machine-readable agent instructions
GitHub Spec Kit	Toolkit for spec-driven development — agents execute against explicit specs
Writing a good CLAUDE.md	Practical guide to creating durable, repo-local instructions; 150–200 instruction limit
Equipping agents with Agent Skills — Anthropic	SKILL.md-based progressive disclosure system for domain-specific agent capabilities
How to write a great agents.md — GitHub	Lessons from 2,500+ repositories
awesome-agents-md	Curated list of real-world AGENTS.md files, templates, guides & tools
awesome-agent-skills	1,000+ agent skills from official dev teams and community
CLAUDE.md vs AGENTS.md vs .cursorrules	Comparison of agent configuration file formats

Workflow & Orchestration Design

Source	Article	Description
	12 Factor Agents	Operating principles for production agents: explicit prompts, state ownership, clean pause-resume
	12-Factor AgentOps	Operations-oriented companion focused on context discipline and reproducibility
	Understanding Spec-Driven-Development	Why strong specs make AI-assisted delivery more dependable
	How we built our multi-agent research system	Orchestrator-worker pattern with parallel subagents; 90%+ improvement over single-agent
	Developer's guide to multi-agent patterns in ADK	Multi-agent orchestration patterns in Google ADK
	Introducing AgentWorkflow	Multi-agent orchestration system
	Workflows 1.0	Event-driven framework for agentic workflows
	Emerging Patterns in Building GenAI Products — Martin Fowler	Architecture patterns for generative AI products
	Agent-Native Engineering	Engineering practices for agent-first development

📊 Evals & Observability

Evaluation Guides & Frameworks

How to measure whether your agent actually works — evaluation methodology for non-deterministic systems.

Source	Article	Description
	Testing Agent Skills Systematically with Evals	Turning agent traces into repeatable evals with JSONL logs and deterministic checks
	Agent evals	Measuring agent quality with reproducible task-level and workflow-level evaluations
	Evaluation best practices	Building eval suites that match real-world distributions and catch regressions
	Trace grading	Grading agent traces directly, especially for long multi-step tasks
	Demystifying Evals for AI Agents	What to measure when agents have many possible trajectories
	Quantifying infrastructure noise in agentic coding evals	Runtime configuration can move benchmark scores more than many leaderboard gaps
	Evaluating Deep Agents: Our Learnings	Single-step, full-run, and multi-turn eval design for stateful agents
	Improving Deep Agents with harness engineering	Top 30 → Top 5 on Terminal-Bench 2.0 by only changing the harness
	How we build evals for Deep Agents	Eval methodology for LangChain's deep agents
	How Middleware Lets You Customize Your Agent Harness	Middleware patterns for loop detection and custom harness behavior
	8 benchmarks shaping the next generation of AI agents	Overview of key agent benchmarks

Observability Platforms

Platform	Type	Description
Arize Phoenix	OSS	OpenTelemetry-based tracing, evals, and experiments for AI
Langfuse	OSS	LLM observability: tracing, prompt management, evals (MIT license)
LangSmith	Commercial	Agent engineering platform: tracing, evaluation, deployment
Braintrust	Commercial	AI observability + evaluation; used by Notion, Stripe, Zapier
Helicone	Commercial	AI Gateway with routing, caching, rate limiting, cost analytics
AI observability tools buyer's guide 2026	Guide	Comprehensive comparison of observability platforms
Portkey	Commercial	AI gateway + observability; routing, fallbacks, load balancing, caching, and prompt versioning
LiteLLM	OSS	Unified proxy for 100+ LLMs in OpenAI format; cost tracking, guardrails, load balancing
OpenTelemetry for LLMs	Standard	Emerging standard; OpenLLMetry and OpenLIT emit OTLP-compatible spans
Comparing open-source AI agent frameworks	Guide	Framework comparison with observability perspective

🏆 Benchmarks

Benchmarks that stress harness quality, not just model quality — context handling, tool calling, environment control, verification logic, and runtime scaffolding.

Benchmark	Focus	Description
SWE-bench Verified	🔧 Code	Real GitHub issues and tests; harness choices around retrieval, patching, and validation are highly visible
SWE-PolyBench — Amazon	🔧 Code	Multi-language: 2,110 instances across 21 repos in Java/JS/TS/Python
SWE-Bench Pro	🔧 Code	1,865 problems from 41 repos and 123 programming languages
FeatureBench	🔧 Code	200 eval instances; SOTA agents achieve only 11% (vs 74% on SWE-bench)
Terminal-Bench	💻 Terminal	Terminal-native agents in shells, filesystems, and verification-heavy environments
Terminal-Bench 2.0 & Harbor	💻 Terminal	Harder tasks and generalized evaluation harness
OSWorld	🖥️ Desktop	369 tasks across Ubuntu, Windows, macOS with execution-based evaluators
AppWorld	🌐 Interactive	Controllable world of apps for testing planning, code generation, and collateral-damage control
AgentBench	🌐 Multi-env	Cross-environment: OS, databases, knowledge graphs, web browsing
tau2-bench	🔄 Multi-step	Realistic multi-step tasks where success depends on tool use and execution quality
WebArena-Verified	🌐 Web	Curated web-agent tasks with deterministic evaluators over responses and network traces
WorkArena	🌐 Enterprise	Common knowledge-work tasks on realistic enterprise-style web workflows
GAIA	🤖 General	General AI assistant benchmark for tools, planning, verification, and long-horizon autonomy
HAL: Holistic Agent Leaderboard	📊 Leaderboard	Reliability, cost, and broad task coverage for comparing end-to-end harness behavior
DPAI Arena — JetBrains	🔧 Code	Open platform for coding agent benchmarks across full dev lifecycle
LOCA-bench	🧠 Context	Benchmarks long-context agents; reveals "context rot" phenomenon
SWE-bench Live	🔧 Code	Live benchmark with real-time GitHub issues

Evaluation Frameworks & Tools

Tool	Description
Inspect AI	UK AI Safety Institute's eval framework; batteries-included with pre-built benchmarks
ai-agent-benchmark-compendium	Compendium of 50+ agent benchmarks, categorized by function calling, reasoning, coding
Galileo Agent Eval	Framework with metrics, rubrics, and benchmarks for production agent evaluation

⚙️ Runtimes, Harnesses & Reference Implementations

Agent SDKs & Frameworks

Framework	Maintainer	Description
Claude Agent SDK	Anthropic	Production-oriented SDK with sessions, tools, orchestration, and compact feature
OpenAI Agents SDK	OpenAI	Visual canvas + Agent Builder + ChatKit + Connector Registry
Google ADK	Google	Open-source framework for building multi-agent applications
Microsoft Agent Framework	Microsoft	Convergence of AutoGen + Semantic Kernel; checkpointing & resuming
AutoGen	Microsoft	Open-source multi-agent programming framework
LangGraph	LangChain	Graph-based agent orchestration with built-in persistence
deepagents	LangChain	Deeper, longer-running agents with middleware and harness patterns
CrewAI	CrewAI	Role-driven multi-agent orchestration; fastest-growing for multi-agent
MetaGPT	Open Source	Simulates software company with PM/Architect/Engineer/QA agents
Pydantic AI	Pydantic	Type-safe Python agent framework
Agno	Agno	High-performance multi-agent runtime
Smolagents	Hugging Face	Ultra-minimal agent framework
Mastra	Gatsby team	JavaScript agent framework
AWS Strands Agents	AWS	Model-driven ReAct pattern; deep Lambda integration
AgentKit	Inngest	TypeScript toolkit for durable, workflow-aware agents
Vercel AI SDK	Vercel	Unified toolkit for 30+ LLM providers; frontend-to-backend agent infrastructure
VoltAgent	VoltAgent	TypeScript agent platform with orchestration, memory, RAG, and enterprise observability

Sandbox & Execution Environments

Platform	Description
E2B	Open-source Firecracker microVM sandboxing; ~150ms cold starts
Modal	Container-based agent execution; scales to 50K+ concurrent instances
Daytona	Docker-based sandbox; sub-90ms creation; pivoted to agent infra in 2025
SWE-ReX	Sandboxed code execution infrastructure for AI agents
awesome-sandbox	Curated list of code sandboxing solutions for AI agents
Browserbase	Cloud-hosted browser instances for AI agents at scale
Stagehand	Browserbase's open-source SDK bridging Playwright and AI agents
Firecrawl	Managed isolated browser environment + web scraping API for agents
Top AI Code Sandbox Products — Modal	2025 comparison of sandbox solutions

Reference Implementations

Project	Description
SWE-agent	Mature research coding agent with inspectable harness, prompt, tools, and environment
Harbor	Generalized harness for evaluating and improving agents at scale
Terminal-Bench	Open-source terminal benchmark implementation

🔌 MCP (Model Context Protocol)

The emerging standard for giving agents structured, controlled access to tools and data sources.

Resource	Description
MCP Specification (2025-11-25)	Latest protocol specification
2026 MCP Roadmap	Priorities: remote deployment, auth, enterprise features
MCP Roadmap Growing Pains — The New Stack	Production challenges and planned solutions
MCP Servers	Official reference server implementations
MCP Auth Spec Updates — Auth0	Authentication additions to MCP
MCP + Codex — OpenAI	How Codex integrates with MCP
MCP.so	Marketplace/directory for MCP servers; 1,000+ live connectors
Context7 MCP	Provides LLMs with up-to-date, version-specific documentation and code examples
awesome-mcp-servers	Most popular community-curated list of MCP servers

💻 Coding Agents in Practice

Tools & Products

Tool	Type	Description
Claude Code	CLI	Anthropic's agentic coding CLI with hooks, sub-agents, and MCP
Codex	CLI	OpenAI's cloud-based coding agent
Cursor	IDE	AI-first code editor with Background Agents
Windsurf	IDE	Cascade engine for agentic coding workflows
Aider	CLI	Open-source AI pair programming in the terminal
Continue	Extension	Open-source AI code assistant for VS Code and JetBrains
OpenHands	Platform	Open platform for AI software developers
Gemini CLI	CLI	Google's open-source AI agent for the terminal
Devin	Platform	Cognition's autonomous coding agent
Replit Agent	Platform	In-browser agent with snapshot engine and self-healing tests
Goose	CLI	Block's fully open-source (Apache-2.0) MCP-native agent; model-agnostic
Cline	Extension	BYOM (bring your own model) agent for VS Code
Devon	CLI	Open-source pair programmer with autonomous planning and debugging
OpenCode	CLI	75+ provider support, LSP integration, privacy-first
v0	Platform	Vercel's AI-powered frontend development agent

Field Reports from Coding Agent Companies

Real-world insights from teams building and deploying coding agents at scale.

Source	Article	Key Insight
	A practical guide to building agents	Comprehensive guide covering use case selection, design patterns, guardrails
	Building an AI-native engineering team	Guide for teams adopting agent-first development
	OpenAI Cookbook — Agents	Collection of agent-related code examples and tutorials
	Coding Agents 101	Practical guide to working with coding agents effectively
	Devin's 2025 Performance Review	Learnings from 18 months of agents at work; task scoping insights
	Rebuilding Devin for Claude Sonnet 4.5	Context management insights from model migration
	How Cognition Uses Devin to Build Devin	Self-referential agent development case study
	Decision-Time Guidance	Injecting situational instructions at key moments vs. front-loading
	Inside Replit's Snapshot Engine	Reversible compute and storage fabric for agent safety
	Introducing Agent 3	Self-healing testing, 200-minute autonomous runtime
	Introducing the new v0	Sandbox-based runtime, Git workflow integration
	Ranking Engineer Agent (REA)	Autonomous AI agent accelerating Meta's ads ranking engineering
	Closing the knowledge gap with agent skills	How agent skills help bridge domain knowledge gaps
	My LLM Coding Workflow Going into 2026 — Addy Osmani	Practical coding workflow with agents
	The Cognition: Devin is in the Details — swyx	Deep dive into Devin's architecture
	Introducing Agent 4	Parallel task execution, multi-platform development
	Building agents with the Claude Agent SDK	Production-oriented SDK; compact feature for context management

🏭 Production Deployment

Lessons from running agents in production — what breaks, what works, and what scales.

Source	Article	Key Finding
	AI Agents in Production 2025 — Cleanlab	Survey of 1,837 respondents; only 95 with agents live in production
	Key Findings from 1,200 Production Deployments — ZenML	95% of agent deployments fail; system fragility, not model intelligence
	The State of Agentic AI in 2025: A Year-End Reality Check	Industry reality check on agent deployment
	Building Production-Grade AI Agents — Towards AI	Complete technical guide for production agents
	Building Reliable Autonomous Agentic AI — TechEmpower	Practical reliability patterns
	State of Agent Engineering	Survey of 1,300+ professionals on agent engineering challenges
	Lessons from 2025 on agents and trust	Google Cloud CTO lessons on agent deployment and trust
	Harness Engineering 101: Claude Code / Codex Workflows	Practical reproducible, safe, long-running workflows

📚 Academic Research

Papers advancing the theoretical and empirical foundations of harness engineering.

Harness & Context Engineering

Paper	Venue/Date	Key Contribution
Building Effective AI Coding Agents for the Terminal	arXiv, Mar 2026	OpenDev agent; scaffolding vs. harness architecture distinction
Natural-Language Agent Harnesses	arXiv, Mar 2026	Harness-level control via natural language: roles, contracts, verification gates
Agentic Context Engineering (ACE)	arXiv, Oct 2025	Contexts as evolving playbooks; 14.8% improvement over ReAct
Meta Context Engineering via Agentic Skill Evolution	arXiv, Jan 2026	Bi-level framework where meta-level agent refines engineering skills
Context Engineering for AI Agents in Open-Source Software	arXiv, Oct 2025	Study of context engineering file adoption in 466 open-source projects
PAACE: Plan-Aware Automated Agent Context Engineering	arXiv, Dec 2025	Context engineering as a learnable, plan-aware optimization problem
The Complexity Trap	NeurIPS 2025	Simple observation masking ≈ LLM summarization; ~50% cost reduction

Agent Reliability & Safety

Paper	Venue/Date	Key Contribution
Memory Management for Long-Running Low-Code Agents	arXiv, Sep 2025	Memory management for persistent agent sessions
Efficient On-Device Agents via Adaptive Context Management	arXiv, Nov 2025	Context management for resource-constrained environments
Agentic AI: Challenges and Opportunities	arXiv, Jan 2026	Comprehensive survey of verifiable planning, coordination, memory, governance
From Competition to Coordination: Safe Multi-Agent LLM Systems	arXiv, Nov 2025	Market-making framework for safe multi-agent coordination
Emergent Coordination in Multi-Agent Language Models	arXiv, Oct 2025	How prompt design steers multi-agent LLMs
Towards a Science of AI Agent Reliability	arXiv, Feb 2026	Evaluates 14 models across 3 providers with scaffolding strategies
Confucius Code Agent	arXiv, Dec 2025	Scalable agent scaffolding with persistent note-taking for cross-session learning
Agentic AI Frameworks: Architectures, Protocols, Design	arXiv, Aug 2025	Comprehensive survey of agentic AI architectures and protocols
A Practical Guide for Production-Grade Agentic AI Workflows	arXiv, Dec 2025	Nine best practices: tool-first design, single-responsibility agents, KISS principle

Evaluation & Benchmarking

Paper	Venue/Date	Key Contribution
Towards a Science of Scaling Agent Systems	DeepMind, Dec 2025	Scaling multi-agent systems scientifically
Measuring Agents in Production	arXiv, Dec 2025	Framework for measuring agent performance in production
Evaluation and Benchmarking of LLM Agents: A Survey	arXiv, 2025	Comprehensive survey of agent evaluation methods
Harnessing Multi-Agent LLMs for Complex Engineering	arXiv, Jan 2025	Multi-agent framework for engineering design projects
Multi-Agent Coordination: A Survey	arXiv, Feb 2025	Survey of coordination mechanisms across domains

🎓 Learning Resources & Curated Lists

Harness & Context Engineering

Resource	Description
awesome-agent-harness	Curated list of agent harness resources
walkinglabs/awesome-harness-engineering	The original awesome list for harness engineering
Context-Engineering handbook	First-principles handbook inspired by Karpathy
Awesome-Context-Engineering	Comprehensive survey: hundreds of papers, frameworks, guides
yzfly/awesome-context-engineering	Curated papers, tools, and best practices for context engineering
learn-claude-code	Reverse-engineers Claude Code's harness mechanisms session by session
harness-engineering (deusyu)	Learning guide from concept to practice
Harness Engineering Academy	Tutorials, career guides, and learning paths
agent-engineering.dev	Articles on harness engineering as a production discipline
harness-engineering.ai	Complete guide to agent harness concepts
Prompt Engineering Guide: Context Engineering	Community reference guide
ACE-FCA (HumanLayer)	"Frequent intentional compaction" approach; tested on 300K LOC Rust codebase

Agent Frameworks & General AI

Resource	Description
awesome-ai-agents-2026	300+ resources across 20+ categories, updated monthly
awesome-agents (kyrolabs)	Open-source tools and products to build AI agents
awesome-ai-agent-frameworks	Most up-to-date list of AI Agent Frameworks
awesome-cli-coding-agents	Terminal-native agents and harnesses
awesome-vibe-coding	Curated list of vibe coding references
awesome-claude-code	Tools, IDE integrations, frameworks for Claude Code
awesome-copilot	GitHub's official awesome-copilot with AGENTS.md
Awesome-LLMOps	LLMOps tools for developers

Agent Security

Resource	Description
awesome-ai-agents-security	Living map of AI agent security ecosystem by security lifecycle
awesome-ai-guardrails	Curated materials on AI guardrails

Research Paper Collections

Resource	Description
awesome-ai-agent-papers	Curated 2026 AI agent research papers, updated weekly from arXiv
Awesome-Agent-Papers	Up-to-date LLM Agent survey: methodology, applications, challenges
Awesome-Self-Evolving-Agents	Comprehensive survey of self-evolving AI agents (2023–2025)
Autonomous-Agents	Autonomous Agents research papers, updated daily
KDD 2025 Tutorial: Evaluation of LLM Agents	Two-dimensional taxonomy of evaluation objectives and processes

Contributing

Contributions are welcome! Please prefer resources that are:

Primary sources — original implementations, first-party articles, or seminal papers
Specific — about how agents are constrained, evaluated, resumed, observed, or orchestrated
Practical — useful to practitioners building real harnesses, not generic AI commentary
Current — actively maintained or recently published (2024+)

If two links say the same thing, prefer the more primary, practical, and implementation-oriented one.

See CONTRIBUTING.md for contribution guidelines and the preferred entry format.

License

CC0 1.0

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

🏗️ Awesome Harness Engineering

📑 Contents

🏛️ Foundations

Seminal Articles

Evolution & Big Picture

🧠 Context, Memory & Working State

Context Engineering

Memory & State Management

Structured Output (Agent I/O Harness)

🛡️ Constraints, Guardrails & Safe Autonomy

Safety & Control Patterns

Guardrail Frameworks

📋 Specs, Agent Files & Workflow Design

Agent Instruction Standards

Workflow & Orchestration Design

📊 Evals & Observability

Evaluation Guides & Frameworks

Observability Platforms

🏆 Benchmarks

Evaluation Frameworks & Tools

⚙️ Runtimes, Harnesses & Reference Implementations

Agent SDKs & Frameworks

Sandbox & Execution Environments

Reference Implementations

🔌 MCP (Model Context Protocol)

💻 Coding Agents in Practice

Tools & Products

Field Reports from Coding Agent Companies

🏭 Production Deployment

📚 Academic Research

Harness & Context Engineering

Agent Reliability & Safety

Evaluation & Benchmarking

🎓 Learning Resources & Curated Lists

Harness & Context Engineering

Agent Frameworks & General AI

Agent Security

Research Paper Collections

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages