Skip to content

Jiaaqiliu/Awesome-Harness-Engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ—οΈ Awesome Harness Engineering Awesome

The most comprehensive, information-dense collection of resources for harness engineering β€” the practice of shaping the environment around AI agents so they work reliably in real-world production systems.

Harness engineering sits at the intersection of context engineering, evaluation, observability, orchestration, safe autonomy, and software architecture. While an agent is the model plus its tools, the harness is everything else: the constraints, state management, verification loops, and runtime infrastructure that make agents dependable.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        AGENT HARNESS                            β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Context  β”‚  β”‚ Guardrailsβ”‚  β”‚  Evals &  β”‚  β”‚   Runtime &   β”‚  β”‚
β”‚  β”‚ Engine   β”‚  β”‚ & Safety β”‚  β”‚Observabilityβ”‚ β”‚ Orchestration β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚       β”‚              β”‚              β”‚               β”‚           β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚                             β”‚                                   β”‚
β”‚                      β”Œβ”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”                            β”‚
β”‚                      β”‚  LLM Agent  β”‚                            β”‚
β”‚                      β”‚ (Model+Tools)β”‚                           β”‚
β”‚                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                            β”‚
β”‚                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚ Memory & β”‚  β”‚  Specs & β”‚  β”‚ Sandbox & β”‚  β”‚  Benchmarks   β”‚  β”‚
β”‚  β”‚  State   β”‚  β”‚Agent Filesβ”‚  β”‚ Execution β”‚  β”‚               β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why this list? The shift from prompt engineering β†’ context engineering β†’ harness engineering marks a maturation of the AI engineering discipline. This list tracks that evolution with primary sources, not commentary.


πŸ“‘ Contents


πŸ›οΈ Foundations

Seminal Articles

The foundational writings that defined harness engineering as a discipline.

Source Article Description
OpenAI Harness engineering: leveraging Codex in an agent-first world The article that coined "harness engineering" β€” how OpenAI built a large application with Codex using architectural constraints, repo-local instructions, browser validation, and telemetry
Anthropic Effective harnesses for long-running agents Core article on initializer agents, feature lists, init.sh, self-verification, and handoff artifacts across many context windows
Anthropic Harness design for long-running application development GAN-inspired multi-agent harness β€” generator/evaluator loop for autonomous frontend design and long-running app generation
Anthropic Building effective agents Foundational guide distinguishing workflows vs. agents, with composable patterns for building reliable systems
LangChain The Anatomy of an Agent Harness Derives harness components from first principles: prompts, tools, middleware, orchestration, and runtime infrastructure
Thoughtworks Harness Engineering Framing harness work into context engineering, architectural constraints, and "garbage collection" against entropy
HumanLayer Skill Issue: Harness Engineering for Coding Agents A practical argument that weak results from coding agents are often harness problems, not model problems
Inngest Your Agent Needs a Harness, Not a Framework Why state, retries, traces, and concurrency are first-class infrastructure concerns
OpenAI Unlocking the Codex harness: how we built the App Server Deep dive into the Codex app server harness implementation
OpenAI Unrolling the Codex agent loop Technical breakdown of the Codex agent loop architecture

Evolution & Big Picture

Understanding the trajectory from prompt engineering to harness engineering.


🧠 Context, Memory & Working State

Context Engineering

The art of managing what goes into the context window β€” treating it as a working memory budget, not a dumping ground.

Source Article Description
Anthropic Effective context engineering for AI agents Managing the context window as a working memory budget with practical strategies
LangChain Context Engineering for Agents Four strategies: writing, selecting, compressing, isolating context
LangChain The Rise of Context Engineering Why context engineering matters more than prompt engineering
Manus Context Engineering for AI Agents: Lessons from Building Manus KV-cache locality, tool masking, filesystem memory, and keeping useful failures in-context
Thoughtworks Context Engineering for Coding Agents Shaping the task environment so coding agents stay grounded and productive
Google Architecting efficient context-aware multi-agent framework ADK's tiered state architecture: Session, Memory, Artifacts
HumanLayer Advanced Context Engineering for Coding Agents Patterns for reducing context drift and making coding sessions easier to resume
HumanLayer Context-Efficient Backpressure for Coding Agents Preventing agents from burning context on noisy or low-value work
LlamaIndex Context Engineering: What It Is and Techniques to Consider LlamaIndex's techniques for context engineering
Context Engineering (Lance Martin) Practical guide from LangChain co-founder
Context Engineering: Bringing Engineering Discipline to Prompts β€” Addy Osmani Systematic approach to context management
JetBrains Efficient Context Management for LLM-Powered Agents NeurIPS 2025 research: simple observation masking β‰ˆ LLM summarization, ~50% cost reduction
Context Engineering Best Practices β€” Comet Best practices for agentic systems
Context Engineering Best Practices β€” Kubiya Practical reliability-focused best practices

Memory & State Management

How agents persist knowledge, recover from interruptions, and maintain state across sessions.

Project Description
Letta (formerly MemGPT) Stateful agents with OS-like memory management (RAM/disk analogy)
Rearchitecting Letta's Agent Loop Lessons from ReAct, MemGPT, and Claude Code for agent loop design
Memory Blocks β€” Letta Discrete functional memory units for context window management
LangGraph + Redis <1ms latency state persistence for agents
LangGraph + DynamoDB Durable agent state on AWS infrastructure
Checkpoint/Restore Systems for AI Agents Survey of checkpoint/restore techniques
Databricks Agent Memory Built-in memory for Databricks agent framework
Mem0 Hybrid storage (Postgres + vector); extracts memories with ADD/UPDATE/DELETE operations; up to 26% accuracy gains
Zep Temporal knowledge graph tracking how facts change over time; combines graph memory with vector search
Cognee Knowledge graph layer that structures, connects, and retrieves information as interconnected knowledge

Structured Output (Agent I/O Harness)

Libraries that ensure reliable, schema-compliant agent output.

Project Description
Instructor Type-safe structured extraction from LLMs using Pydantic; SDKs for Python, TypeScript, Go, Ruby
Outlines FSM-based token masking ensures 100% schema-compliant output at generation time

πŸ›‘οΈ Constraints, Guardrails & Safe Autonomy

Safety & Control Patterns

Reducing approval friction without losing control β€” sandboxing, policy design, and quality loops.

Source Article Description
Anthropic Beyond permission prompts: making Claude Code more secure Better sandboxing and policy design for secure autonomous agents
Anthropic Code execution with MCP: building more efficient agents Controlled execution power through explicit, inspectable tool boundaries; 150K β†’ 2K token reduction
Anthropic Writing effective tools for agents Tool interfaces that are easier for models to call correctly and safely
Anthropic Advanced tool use on the Claude Developer Platform Tool Search, Programmatic Tool Calling, and Tool Use Examples
Thoughtworks Assessing internal quality while coding with an agent Moving quality checks into the loop instead of relying on after-the-fact review
Thoughtworks Anchoring AI to a reference application Constraining agents with concrete exemplars for more consistent output
Thoughtworks Humans and Agents in Software Engineering Loops Where humans should strengthen the harness instead of micromanaging artifacts
Anthropic Claude Code: Best practices for agentic coding Repo structure, checkpoints, validation, and delegation in agentic workflows
Agentic Engineering Patterns β€” Simon Willison Coding practices and patterns for working with agents
The lethal trifecta for AI agents β€” Simon Willison Private data + untrusted content + external communication = security risk
Governing Claude Code with Kong AI Gateway Secure agent harness rollouts via API gateway
AI Agent Safety β€” Cleanlab Managing unpredictability at scale in production agents
Lessons from 2025: Agent Mitigation How "agent mitigation" became a new discipline

Guardrail Frameworks

Project Description
Invariant Guardrails (now Snyk) Rule-based guardrailing for MCP and agentic AI
Invariant MCP-scan Security scanner for MCP servers: prompt injection, tool poisoning detection
NeMo Guardrails β€” NVIDIA Open-source programmable guardrails; sub-100ms latency; GPU-accelerated
Guardrails AI Open-source framework for LLM output validation

πŸ“‹ Specs, Agent Files & Workflow Design

Agent Instruction Standards

How to tell agents what to do β€” repo-local instruction files and machine-readable specifications.

Project Description
AGENTS.md Open format for repo-local instructions; intro by OpenAI
agent.md Related standardization effort for machine-readable agent instructions
GitHub Spec Kit Toolkit for spec-driven development β€” agents execute against explicit specs
Writing a good CLAUDE.md Practical guide to creating durable, repo-local instructions; 150–200 instruction limit
Equipping agents with Agent Skills β€” Anthropic SKILL.md-based progressive disclosure system for domain-specific agent capabilities
How to write a great agents.md β€” GitHub Lessons from 2,500+ repositories
awesome-agents-md Curated list of real-world AGENTS.md files, templates, guides & tools
awesome-agent-skills 1,000+ agent skills from official dev teams and community
CLAUDE.md vs AGENTS.md vs .cursorrules Comparison of agent configuration file formats

Workflow & Orchestration Design

Source Article Description
HumanLayer 12 Factor Agents Operating principles for production agents: explicit prompts, state ownership, clean pause-resume
12-Factor AgentOps Operations-oriented companion focused on context discipline and reproducibility
Thoughtworks Understanding Spec-Driven-Development Why strong specs make AI-assisted delivery more dependable
Anthropic How we built our multi-agent research system Orchestrator-worker pattern with parallel subagents; 90%+ improvement over single-agent
Google Developer's guide to multi-agent patterns in ADK Multi-agent orchestration patterns in Google ADK
LlamaIndex Introducing AgentWorkflow Multi-agent orchestration system
LlamaIndex Workflows 1.0 Event-driven framework for agentic workflows
Emerging Patterns in Building GenAI Products β€” Martin Fowler Architecture patterns for generative AI products
Agent-Native Engineering Engineering practices for agent-first development

πŸ“Š Evals & Observability

Evaluation Guides & Frameworks

How to measure whether your agent actually works β€” evaluation methodology for non-deterministic systems.

Source Article Description
OpenAI Testing Agent Skills Systematically with Evals Turning agent traces into repeatable evals with JSONL logs and deterministic checks
OpenAI Agent evals Measuring agent quality with reproducible task-level and workflow-level evaluations
OpenAI Evaluation best practices Building eval suites that match real-world distributions and catch regressions
OpenAI Trace grading Grading agent traces directly, especially for long multi-step tasks
Anthropic Demystifying Evals for AI Agents What to measure when agents have many possible trajectories
Anthropic Quantifying infrastructure noise in agentic coding evals Runtime configuration can move benchmark scores more than many leaderboard gaps
LangChain Evaluating Deep Agents: Our Learnings Single-step, full-run, and multi-turn eval design for stateful agents
LangChain Improving Deep Agents with harness engineering Top 30 β†’ Top 5 on Terminal-Bench 2.0 by only changing the harness
LangChain How we build evals for Deep Agents Eval methodology for LangChain's deep agents
LangChain How Middleware Lets You Customize Your Agent Harness Middleware patterns for loop detection and custom harness behavior
8 benchmarks shaping the next generation of AI agents Overview of key agent benchmarks

Observability Platforms

Platform Type Description
Arize Phoenix OSS OpenTelemetry-based tracing, evals, and experiments for AI
Langfuse OSS LLM observability: tracing, prompt management, evals (MIT license)
LangSmith Commercial Agent engineering platform: tracing, evaluation, deployment
Braintrust Commercial AI observability + evaluation; used by Notion, Stripe, Zapier
Helicone Commercial AI Gateway with routing, caching, rate limiting, cost analytics
AI observability tools buyer's guide 2026 Guide Comprehensive comparison of observability platforms
Portkey Commercial AI gateway + observability; routing, fallbacks, load balancing, caching, and prompt versioning
LiteLLM OSS Unified proxy for 100+ LLMs in OpenAI format; cost tracking, guardrails, load balancing
OpenTelemetry for LLMs Standard Emerging standard; OpenLLMetry and OpenLIT emit OTLP-compatible spans
Comparing open-source AI agent frameworks Guide Framework comparison with observability perspective

πŸ† Benchmarks

Benchmarks that stress harness quality, not just model quality β€” context handling, tool calling, environment control, verification logic, and runtime scaffolding.

Benchmark Focus Description
SWE-bench Verified πŸ”§ Code Real GitHub issues and tests; harness choices around retrieval, patching, and validation are highly visible
SWE-PolyBench β€” Amazon πŸ”§ Code Multi-language: 2,110 instances across 21 repos in Java/JS/TS/Python
SWE-Bench Pro πŸ”§ Code 1,865 problems from 41 repos and 123 programming languages
FeatureBench πŸ”§ Code 200 eval instances; SOTA agents achieve only 11% (vs 74% on SWE-bench)
Terminal-Bench πŸ’» Terminal Terminal-native agents in shells, filesystems, and verification-heavy environments
Terminal-Bench 2.0 & Harbor πŸ’» Terminal Harder tasks and generalized evaluation harness
OSWorld πŸ–₯️ Desktop 369 tasks across Ubuntu, Windows, macOS with execution-based evaluators
AppWorld 🌐 Interactive Controllable world of apps for testing planning, code generation, and collateral-damage control
AgentBench 🌐 Multi-env Cross-environment: OS, databases, knowledge graphs, web browsing
tau2-bench πŸ”„ Multi-step Realistic multi-step tasks where success depends on tool use and execution quality
WebArena-Verified 🌐 Web Curated web-agent tasks with deterministic evaluators over responses and network traces
WorkArena 🌐 Enterprise Common knowledge-work tasks on realistic enterprise-style web workflows
GAIA πŸ€– General General AI assistant benchmark for tools, planning, verification, and long-horizon autonomy
HAL: Holistic Agent Leaderboard πŸ“Š Leaderboard Reliability, cost, and broad task coverage for comparing end-to-end harness behavior
DPAI Arena β€” JetBrains πŸ”§ Code Open platform for coding agent benchmarks across full dev lifecycle
LOCA-bench 🧠 Context Benchmarks long-context agents; reveals "context rot" phenomenon
SWE-bench Live πŸ”§ Code Live benchmark with real-time GitHub issues

Evaluation Frameworks & Tools

Tool Description
Inspect AI UK AI Safety Institute's eval framework; batteries-included with pre-built benchmarks
ai-agent-benchmark-compendium Compendium of 50+ agent benchmarks, categorized by function calling, reasoning, coding
Galileo Agent Eval Framework with metrics, rubrics, and benchmarks for production agent evaluation

βš™οΈ Runtimes, Harnesses & Reference Implementations

Agent SDKs & Frameworks

Framework Maintainer Description
Claude Agent SDK Anthropic Production-oriented SDK with sessions, tools, orchestration, and compact feature
OpenAI Agents SDK OpenAI Visual canvas + Agent Builder + ChatKit + Connector Registry
Google ADK Google Open-source framework for building multi-agent applications
Microsoft Agent Framework Microsoft Convergence of AutoGen + Semantic Kernel; checkpointing & resuming
AutoGen Microsoft Open-source multi-agent programming framework
LangGraph LangChain Graph-based agent orchestration with built-in persistence
deepagents LangChain Deeper, longer-running agents with middleware and harness patterns
CrewAI CrewAI Role-driven multi-agent orchestration; fastest-growing for multi-agent
MetaGPT Open Source Simulates software company with PM/Architect/Engineer/QA agents
Pydantic AI Pydantic Type-safe Python agent framework
Agno Agno High-performance multi-agent runtime
Smolagents Hugging Face Ultra-minimal agent framework
Mastra Gatsby team JavaScript agent framework
AWS Strands Agents AWS Model-driven ReAct pattern; deep Lambda integration
AgentKit Inngest TypeScript toolkit for durable, workflow-aware agents
Vercel AI SDK Vercel Unified toolkit for 30+ LLM providers; frontend-to-backend agent infrastructure
VoltAgent VoltAgent TypeScript agent platform with orchestration, memory, RAG, and enterprise observability

Sandbox & Execution Environments

Platform Description
E2B Open-source Firecracker microVM sandboxing; ~150ms cold starts
Modal Container-based agent execution; scales to 50K+ concurrent instances
Daytona Docker-based sandbox; sub-90ms creation; pivoted to agent infra in 2025
SWE-ReX Sandboxed code execution infrastructure for AI agents
awesome-sandbox Curated list of code sandboxing solutions for AI agents
Browserbase Cloud-hosted browser instances for AI agents at scale
Stagehand Browserbase's open-source SDK bridging Playwright and AI agents
Firecrawl Managed isolated browser environment + web scraping API for agents
Top AI Code Sandbox Products β€” Modal 2025 comparison of sandbox solutions

Reference Implementations

Project Description
SWE-agent Mature research coding agent with inspectable harness, prompt, tools, and environment
Harbor Generalized harness for evaluating and improving agents at scale
Terminal-Bench Open-source terminal benchmark implementation

πŸ”Œ MCP (Model Context Protocol)

The emerging standard for giving agents structured, controlled access to tools and data sources.

Resource Description
MCP Specification (2025-11-25) Latest protocol specification
2026 MCP Roadmap Priorities: remote deployment, auth, enterprise features
MCP Roadmap Growing Pains β€” The New Stack Production challenges and planned solutions
MCP Servers Official reference server implementations
MCP Auth Spec Updates β€” Auth0 Authentication additions to MCP
MCP + Codex β€” OpenAI How Codex integrates with MCP
MCP.so Marketplace/directory for MCP servers; 1,000+ live connectors
Context7 MCP Provides LLMs with up-to-date, version-specific documentation and code examples
awesome-mcp-servers Most popular community-curated list of MCP servers

πŸ’» Coding Agents in Practice

Tools & Products

Tool Type Description
Claude Code CLI Anthropic's agentic coding CLI with hooks, sub-agents, and MCP
Codex CLI OpenAI's cloud-based coding agent
Cursor IDE AI-first code editor with Background Agents
Windsurf IDE Cascade engine for agentic coding workflows
Aider CLI Open-source AI pair programming in the terminal
Continue Extension Open-source AI code assistant for VS Code and JetBrains
OpenHands Platform Open platform for AI software developers
Gemini CLI CLI Google's open-source AI agent for the terminal
Devin Platform Cognition's autonomous coding agent
Replit Agent Platform In-browser agent with snapshot engine and self-healing tests
Goose CLI Block's fully open-source (Apache-2.0) MCP-native agent; model-agnostic
Cline Extension BYOM (bring your own model) agent for VS Code
Devon CLI Open-source pair programmer with autonomous planning and debugging
OpenCode CLI 75+ provider support, LSP integration, privacy-first
v0 Platform Vercel's AI-powered frontend development agent

Field Reports from Coding Agent Companies

Real-world insights from teams building and deploying coding agents at scale.

Source Article Key Insight
OpenAI A practical guide to building agents Comprehensive guide covering use case selection, design patterns, guardrails
OpenAI Building an AI-native engineering team Guide for teams adopting agent-first development
OpenAI OpenAI Cookbook β€” Agents Collection of agent-related code examples and tutorials
Cognition Coding Agents 101 Practical guide to working with coding agents effectively
Cognition Devin's 2025 Performance Review Learnings from 18 months of agents at work; task scoping insights
Cognition Rebuilding Devin for Claude Sonnet 4.5 Context management insights from model migration
Cognition How Cognition Uses Devin to Build Devin Self-referential agent development case study
Replit Decision-Time Guidance Injecting situational instructions at key moments vs. front-loading
Replit Inside Replit's Snapshot Engine Reversible compute and storage fabric for agent safety
Replit Introducing Agent 3 Self-healing testing, 200-minute autonomous runtime
Vercel Introducing the new v0 Sandbox-based runtime, Git workflow integration
Meta Ranking Engineer Agent (REA) Autonomous AI agent accelerating Meta's ads ranking engineering
Google Closing the knowledge gap with agent skills How agent skills help bridge domain knowledge gaps
My LLM Coding Workflow Going into 2026 β€” Addy Osmani Practical coding workflow with agents
The Cognition: Devin is in the Details β€” swyx Deep dive into Devin's architecture
Replit Introducing Agent 4 Parallel task execution, multi-platform development
Anthropic Building agents with the Claude Agent SDK Production-oriented SDK; compact feature for context management

🏭 Production Deployment

Lessons from running agents in production β€” what breaks, what works, and what scales.

Source Article Key Finding
AI Agents in Production 2025 β€” Cleanlab Survey of 1,837 respondents; only 95 with agents live in production
Key Findings from 1,200 Production Deployments β€” ZenML 95% of agent deployments fail; system fragility, not model intelligence
The State of Agentic AI in 2025: A Year-End Reality Check Industry reality check on agent deployment
Building Production-Grade AI Agents β€” Towards AI Complete technical guide for production agents
Building Reliable Autonomous Agentic AI β€” TechEmpower Practical reliability patterns
LangChain State of Agent Engineering Survey of 1,300+ professionals on agent engineering challenges
Google Lessons from 2025 on agents and trust Google Cloud CTO lessons on agent deployment and trust
Harness Engineering 101: Claude Code / Codex Workflows Practical reproducible, safe, long-running workflows

πŸ“š Academic Research

Papers advancing the theoretical and empirical foundations of harness engineering.

Harness & Context Engineering

Paper Venue/Date Key Contribution
Building Effective AI Coding Agents for the Terminal arXiv, Mar 2026 OpenDev agent; scaffolding vs. harness architecture distinction
Natural-Language Agent Harnesses arXiv, Mar 2026 Harness-level control via natural language: roles, contracts, verification gates
Agentic Context Engineering (ACE) arXiv, Oct 2025 Contexts as evolving playbooks; 14.8% improvement over ReAct
Meta Context Engineering via Agentic Skill Evolution arXiv, Jan 2026 Bi-level framework where meta-level agent refines engineering skills
Context Engineering for AI Agents in Open-Source Software arXiv, Oct 2025 Study of context engineering file adoption in 466 open-source projects
PAACE: Plan-Aware Automated Agent Context Engineering arXiv, Dec 2025 Context engineering as a learnable, plan-aware optimization problem
The Complexity Trap NeurIPS 2025 Simple observation masking β‰ˆ LLM summarization; ~50% cost reduction

Agent Reliability & Safety

Paper Venue/Date Key Contribution
Memory Management for Long-Running Low-Code Agents arXiv, Sep 2025 Memory management for persistent agent sessions
Efficient On-Device Agents via Adaptive Context Management arXiv, Nov 2025 Context management for resource-constrained environments
Agentic AI: Challenges and Opportunities arXiv, Jan 2026 Comprehensive survey of verifiable planning, coordination, memory, governance
From Competition to Coordination: Safe Multi-Agent LLM Systems arXiv, Nov 2025 Market-making framework for safe multi-agent coordination
Emergent Coordination in Multi-Agent Language Models arXiv, Oct 2025 How prompt design steers multi-agent LLMs
Towards a Science of AI Agent Reliability arXiv, Feb 2026 Evaluates 14 models across 3 providers with scaffolding strategies
Confucius Code Agent arXiv, Dec 2025 Scalable agent scaffolding with persistent note-taking for cross-session learning
Agentic AI Frameworks: Architectures, Protocols, Design arXiv, Aug 2025 Comprehensive survey of agentic AI architectures and protocols
A Practical Guide for Production-Grade Agentic AI Workflows arXiv, Dec 2025 Nine best practices: tool-first design, single-responsibility agents, KISS principle

Evaluation & Benchmarking

Paper Venue/Date Key Contribution
Towards a Science of Scaling Agent Systems DeepMind, Dec 2025 Scaling multi-agent systems scientifically
Measuring Agents in Production arXiv, Dec 2025 Framework for measuring agent performance in production
Evaluation and Benchmarking of LLM Agents: A Survey arXiv, 2025 Comprehensive survey of agent evaluation methods
Harnessing Multi-Agent LLMs for Complex Engineering arXiv, Jan 2025 Multi-agent framework for engineering design projects
Multi-Agent Coordination: A Survey arXiv, Feb 2025 Survey of coordination mechanisms across domains

πŸŽ“ Learning Resources & Curated Lists

Harness & Context Engineering

Resource Description
awesome-agent-harness Curated list of agent harness resources
walkinglabs/awesome-harness-engineering The original awesome list for harness engineering
Context-Engineering handbook First-principles handbook inspired by Karpathy
Awesome-Context-Engineering Comprehensive survey: hundreds of papers, frameworks, guides
yzfly/awesome-context-engineering Curated papers, tools, and best practices for context engineering
learn-claude-code Reverse-engineers Claude Code's harness mechanisms session by session
harness-engineering (deusyu) Learning guide from concept to practice
Harness Engineering Academy Tutorials, career guides, and learning paths
agent-engineering.dev Articles on harness engineering as a production discipline
harness-engineering.ai Complete guide to agent harness concepts
Prompt Engineering Guide: Context Engineering Community reference guide
ACE-FCA (HumanLayer) "Frequent intentional compaction" approach; tested on 300K LOC Rust codebase

Agent Frameworks & General AI

Resource Description
awesome-ai-agents-2026 300+ resources across 20+ categories, updated monthly
awesome-agents (kyrolabs) Open-source tools and products to build AI agents
awesome-ai-agent-frameworks Most up-to-date list of AI Agent Frameworks
awesome-cli-coding-agents Terminal-native agents and harnesses
awesome-vibe-coding Curated list of vibe coding references
awesome-claude-code Tools, IDE integrations, frameworks for Claude Code
awesome-copilot GitHub's official awesome-copilot with AGENTS.md
Awesome-LLMOps LLMOps tools for developers

Agent Security

Resource Description
awesome-ai-agents-security Living map of AI agent security ecosystem by security lifecycle
awesome-ai-guardrails Curated materials on AI guardrails

Research Paper Collections

Resource Description
awesome-ai-agent-papers Curated 2026 AI agent research papers, updated weekly from arXiv
Awesome-Agent-Papers Up-to-date LLM Agent survey: methodology, applications, challenges
Awesome-Self-Evolving-Agents Comprehensive survey of self-evolving AI agents (2023–2025)
Autonomous-Agents Autonomous Agents research papers, updated daily
KDD 2025 Tutorial: Evaluation of LLM Agents Two-dimensional taxonomy of evaluation objectives and processes

Contributing

Contributions are welcome! Please prefer resources that are:

  • Primary sources β€” original implementations, first-party articles, or seminal papers
  • Specific β€” about how agents are constrained, evaluated, resumed, observed, or orchestrated
  • Practical β€” useful to practitioners building real harnesses, not generic AI commentary
  • Current β€” actively maintained or recently published (2024+)

If two links say the same thing, prefer the more primary, practical, and implementation-oriented one.

See CONTRIBUTING.md for contribution guidelines and the preferred entry format.

License

CC0 1.0

About

πŸ—οΈ A collection of resources for harness engineering β€” shaping the environment around AI agents for reliability in production.

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors