feat: Deterministic LLM agents through composable skills with static type checking #13
Conversation
Reference library for Agent Skills with CLI and Python API:
- validate: Check skill directories for valid SKILL.md with proper frontmatter
- read-properties: Parse and output skill properties as JSON
- to-prompt: Generate suggested <available_skills> XML for agent prompts

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <[email protected]>
* Document reference SDK
* Update docs/integrate-skills.mdx

Co-authored-by: Keith Lazuka <[email protected]>
Introduces the ability to build complex behaviours from simpler skills, like higher-order functions in functional programming.

New frontmatter fields:
- level: Composition tier (1=Atomic, 2=Composite, 3=Workflow)
- operation: Safety classification (READ/WRITE/TRANSFORM)
- composes: List of skill dependencies

Benefits:
- Reusability: Write atomic skills once, compose everywhere
- Testability: Each level tested independently
- Safety: READ/WRITE separation propagates upward
- Transparency: Explicit dependency graph via composes field

Changes:
- docs/architecture.mdx: Full design rationale and patterns
- docs/specification.mdx: New field documentation
- skills-ref/models.py: SkillLevel, SkillOperation enums
- skills-ref/parser.py: Parse new fields
- skills-ref/validator.py: Validate composability fields
- examples/: Working examples at all three levels

Backwards compatible: skills without the new fields work unchanged.
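For orientation, here is a minimal Python sketch of how these fields could be modelled. The enum and dataclass shapes are assumptions made for illustration; only the field names and allowed values come from the commit message above.

```python
# Sketch of the new frontmatter fields; class/enum shapes are assumptions,
# only the field names and allowed values come from this PR.
from dataclasses import dataclass, field
from enum import Enum, IntEnum
from typing import List


class SkillLevel(IntEnum):
    ATOMIC = 1      # Level 1: single capability, no composition
    COMPOSITE = 2   # Level 2: composes atomic skills
    WORKFLOW = 3    # Level 3: orchestration (loops, recursion, dispatch)


class SkillOperation(str, Enum):
    READ = "READ"
    WRITE = "WRITE"
    TRANSFORM = "TRANSFORM"


@dataclass
class SkillProperties:
    name: str
    level: SkillLevel = SkillLevel.ATOMIC
    operation: SkillOperation = SkillOperation.READ
    composes: List[str] = field(default_factory=list)


# A Level-2 composite: the WRITE classification propagates upward
# because pdf-save is a WRITE skill.
research = SkillProperties(
    name="research",
    level=SkillLevel.COMPOSITE,
    operation=SkillOperation.WRITE,
    composes=["web-search", "pdf-save"],
)
```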
This commit enhances the composability extension with graph analysis tools:

- New graph module (skills_ref.graph):
  - CompositionGraph: Build and analyze skill dependency graphs
  - Circular dependency detection using DFS
  - Missing dependency validation
  - Level hierarchy violation warnings
  - ASCII and Mermaid diagram generation
  - JSON export with statistics
- New CLI command (skills-ref graph):
  - Visualize skill composition in ASCII, Mermaid, or JSON
  - Built-in validation for cycles and missing dependencies
  - Supports single skills or entire skill directories
- Enhanced exports in __init__.py:
  - CompositionGraph, GraphAnalysis, SkillNode
  - validate_composition convenience function
- Bug fixes:
  - Parser now correctly coerces level strings to integers
  - Validator handles string-to-int coercion for strictyaml compatibility
- Comprehensive test coverage (84 tests total):
  - test_graph.py: 24 tests for graph analysis
  - test_validator.py: Extended with composability field tests

Visual diagram added to architecture.mdx showing the composition hierarchy.
Recursion is a fundamental pattern in functional programming that should be supported in composable skills. This commit clarifies the distinction:

ALLOWED - Self-recursion (a skill composing itself):
- Enables divide-and-conquer algorithms with minimal code
- Supports dynamic parallelisation of sub-agents
- Reduces context consumption through concise recursive definitions
- Follows established functional programming principles

PROHIBITED - Circular dependencies (A → B → A):
- Create ambiguous execution order
- Prevent static analysis of the composition graph
- Indicate design flaws (skills should compose downward)

Changes:
- Updated detect_cycles() to skip self-references
- Added comprehensive tests for self-recursion scenarios
- Documented the distinction in architecture.mdx with examples
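A minimal sketch of the cycle-detection rule described above: self-recursion is permitted, multi-skill cycles are rejected. The real detect_cycles() lives in skills_ref.graph; the standalone signature used here is an assumption.

```python
# Sketch: detect circular dependencies with DFS, skipping self-references
# (A -> A is allowed; A -> B -> A is not). Signature is an assumption.
from typing import Dict, List


def detect_cycles(composes: Dict[str, List[str]]) -> List[List[str]]:
    """Return a list of cycles found in the composition graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {skill: WHITE for skill in composes}
    stack: List[str] = []
    cycles: List[List[str]] = []

    def visit(skill: str) -> None:
        color[skill] = GRAY
        stack.append(skill)
        for dep in composes.get(skill, []):
            if dep == skill:
                continue  # self-recursion is explicitly allowed
            if color.get(dep, WHITE) == GRAY:
                cycles.append(stack[stack.index(dep):] + [dep])
            elif color.get(dep, WHITE) == WHITE and dep in composes:
                visit(dep)
        stack.pop()
        color[skill] = BLACK

    for skill in composes:
        if color[skill] == WHITE:
            visit(skill)
    return cycles


# deep-research may compose itself, but a <-> b is flagged.
print(detect_cycles({"deep-research": ["deep-research", "research"],
                     "research": [],
                     "a": ["b"], "b": ["a"]}))
# -> [['a', 'b', 'a']]
```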
- Add deep-research example demonstrating the self-recursion pattern
- Document divide-and-conquer, parallelisation, and context efficiency benefits
- Fix level hierarchy check to not warn about self-recursion
- Update examples README to highlight recursion capability

The deep-research skill shows how a single recursive definition can replace multiple hardcoded depth levels (research-depth-1, research-depth-2, etc.) while enabling natural parallelisation of sub-agents.
cc @dsp @simonw @jspahrsummers

This PR addresses the non-deterministic tool selection problem that becomes critical as MCP tool registries scale beyond 50-100 definitions. The core insight: when tools overlap semantically, LLMs choose inconsistently across invocations, not because of temperature but because of attention mechanics over large sets of similar definitions. The solution is a hierarchical composition system that reduces decision complexity at each level.

This keeps tool sets small at each decision point, enables self-recursion for divide-and-conquer patterns, and propagates safety classifications (READ/WRITE/TRANSFORM) through the hierarchy. Would welcome feedback on the architecture, particularly whether this aligns with where you see MCP tooling evolving.
- Move from _composite to _workflows directory
- Change level from 2 to 3
- Compose 'research' instead of atomics directly (web-search, pdf-save)
- This maintains proper hierarchy: L3 → L2 → L1
- Eliminates semantic overlap at Level 2

Recursion is a form of orchestration (deciding when to recurse, when to stop), which correctly places it at Level 3 (workflows).
- Add decision flowchart for determining skill level
- Add Level Criteria Summary table
- Document L3 patterns: recursion, loops, dynamic dispatch, fan-out, state
- Add L3 → L3 composition section with quarterly-review example
- Fix deep-research example to show Level 3 (recursion = workflow)
- Clarify that deep-research composes research (L2), not atomics

This makes the level classification testable and unambiguous.
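As a rough illustration of the level hierarchy check, here is a sketch that flags upward composition (a skill composing a higher-level skill) while exempting self-recursion and allowing L3 → L3 composition. The exact rule, function name, and messages in the real validator are assumptions.

```python
# Sketch of the level-hierarchy check: composition should point downward
# (L3 -> L2 -> L1); self-recursion and L3 -> L3 composition are allowed.
# Names and messages are illustrative only.
from typing import Dict, List


def check_level_hierarchy(levels: Dict[str, int],
                          composes: Dict[str, List[str]]) -> List[str]:
    """Return warnings for skills that compose a higher-level skill."""
    warnings: List[str] = []
    for parent, deps in composes.items():
        for child in deps:
            if child == parent:
                continue  # self-recursion is exempt (see deep-research)
            if child in levels and levels[child] > levels[parent]:
                warnings.append(
                    f"{parent} (L{levels[parent]}) composes higher-level "
                    f"{child} (L{levels[child]})"
                )
    return warnings


levels = {"trip-optimize": 3, "option-explore": 3,
          "research": 2, "web-search": 1}
composes = {
    "trip-optimize": ["trip-optimize", "option-explore"],  # self + L3 -> L3: fine
    "research": ["web-search", "trip-optimize"],           # L2 -> L3: flagged
}
print(check_level_hierarchy(levels, composes))
# -> ['research (L2) composes higher-level trip-optimize (L3)']
```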
…itecture
Complete example showcasing:
- Fan-out parallelization (evaluate 12 destinations simultaneously)
- Expected value optimization (prioritize high-potential options)
- Gradient descent refinement (local search around best option)
- Self-recursion (trip-optimize and option-explore recurse)
- L3 → L3 composition (trip-optimize calls option-explore)
- Early termination (stop when marginal return < marginal cost)
- Binary constraint filtering (eliminate infeasible options first)
- MECE compliance (clear level separation)
Microeconomic concepts applied:
- Expected value, marginal cost/return, opportunity cost
- Pareto frontier for final recommendations
- Game-theoretic compute efficiency
Skills included:
- Level 3: trip-optimize, option-explore
- Level 2: destination-evaluate, route-price, feasibility-check
- Level 1: flight-search, hotel-search, weather-fetch, visa-check,
activity-search, calendar-read
Features:
- FieldSchema dataclass for typed inputs/outputs with epistemic requirements
- TypeDefinition for custom types (future expansion)
- Parse inputs/outputs from YAML frontmatter
- Validate field schemas (type, range, requires_source, requires_rationale)
- typecheck_composition() validates type compatibility between composed skills
- New CLI command: skills-ref typecheck

Type checking catches:
- Input type mismatches between parent and child skills
- Output type mismatches in composition chains
- Missing composed skill dependencies
- Invalid ranges (min > max)

Supports type widening (integer → number, datetime → date) and the 'any' escape hatch for flexibility.

Documentation added to architecture.mdx explaining the type system, primitive types, epistemic requirements, and CLI usage.

Co-authored-by: Eduardo Aguilar Pelaez <[email protected]>
Co-authored-by: Claude <[email protected]>
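A sketch of the compatibility rule a type checker could apply when wiring composed skills: exact match, the 'any' escape hatch, or one of the widenings named above. Only the two widening pairs come from this PR; how typecheck_composition() actually implements the rule is an assumption.

```python
# Sketch of the type-compatibility rule for composed skills: exact match,
# the 'any' escape hatch, or an allowed widening. Only the two widening
# pairs below are named in this PR; the rest is illustrative.
WIDENINGS = {
    ("integer", "number"),    # an integer output can feed a number input
    ("datetime", "date"),     # a datetime output can feed a date input
}


def is_compatible(produced: str, expected: str) -> bool:
    """True if a value of type `produced` can flow into an input of type `expected`."""
    if "any" in (produced, expected):
        return True
    if produced == expected:
        return True
    return (produced, expected) in WIDENINGS


# An 'integer' output can feed a 'number' input: accepted.
assert is_compatible("integer", "number")
# A 'string' output cannot feed a 'date' input: flagged at composition time.
assert not is_compatible("string", "date")
```

Under the two pairs listed, the producer is always at least as specific as the consumer expects, which is what keeps composition chains checkable before anything runs.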
Force-pushed from a910d53 to d1d32d1.
- Add FP-to-hardware parallel table explaining architectural principles
- Reference FCCM 2014 paper on compiling higher-order functional programs
- Add PR_BODY.md with comprehensive PR description

The architecture applies the same principles that made FP-based hardware synthesis tractable to LLM agent orchestration: type-checked composition, latency-insensitive interfaces, and isolated execution contexts.

Co-authored-by: Eduardo Aguilar Pelaez <[email protected]>
Co-authored-by: Claude <[email protected]>
The comprehensive PR description is now the actual body of PR agentskills#13, not a separate file in the repository.

Co-authored-by: Eduardo Aguilar Pelaez <[email protected]>
Co-authored-by: Claude <[email protected]>
- README.md: Add Static Type System section with epistemic requirements
- README.md: Add Trip Optimizer Showcase to Getting Started
- architecture.mdx: Add Acknowledgements for FCCM 2014 co-authors
- Create CHANGELOG.md to track composable skills release

Ensures valuable context from the PR agentskills#13 description persists in repo files after merge. Type system, theoretical foundation, and acknowledgements now live in their canonical locations.

Co-authored-by: Eduardo Aguilar Pelaez <[email protected]>
Co-authored-by: Claude <[email protected]>
cc @fabiopelosin - This PR directly addresses the concerns you raised in #11 ("Skill composition without context bloat"). How this PR maps to your proposal:
Additional features addressing your concerns:
Showcase example: The trip-optimizer demonstrates all 3 levels with typed contracts.

Would welcome your feedback on whether this addresses your design questions, particularly around whether composition should be spec-level (as implemented here) vs runtime-only.
Community Validation: Related Issues in anthropics/skills

Researching the broader ecosystem, I found two open issues in anthropics/skills that validate the need for this PR:

anthropics/skills#150: Allow "dependencies" in skill metadata
How this PR addresses it: the composes field declares skill dependencies directly in frontmatter.

anthropics/skills#132: Make skills spec future proof
How this PR addresses it: typed inputs/outputs turn each skill into an explicit contract rather than an implementation.

These upstream issues demonstrate that the community is independently arriving at the same conclusions: skills need dependency management and typed contracts. This PR provides a concrete, backwards-compatible implementation.
cc @gattimassimo @rahimnathwani @omnisci3nce - Given your engagement with Issue #11, you may find this PR relevant. TL;DR: This PR implements the composable skills architecture that @fabiopelosin described, with:
The trip-optimizer showcase demonstrates all 3 composition levels with 12 skills. Would welcome your thoughts on whether this addresses the composition challenges you've encountered.
cc @elmariachi111 @remygendron - Your issues in anthropics/skills are directly relevant here.

@elmariachi111 (#150): You requested a way to declare dependencies in skill metadata. This PR adds:

```yaml
composes:
  - web-search
  - pdf-save
```

@remygendron (#132): You proposed "SKILLS should be defined as a contract, not an implementation." This PR adds typed inputs:

```yaml
inputs:
  - name: query
    type: string
    required: true
outputs:
  - name: answer
    type: string
    requires_source: true
```

The requires_source field enforces the epistemic requirements that make outputs part of the contract.

Would value your input on whether this approach addresses your use cases.
cc @Christian-Blank - Your work on task orchestration methodology articulates the same problem this PR addresses. Specifically, your principle:
This PR implements that layered approach for skills:
Your event-sourcing pattern is directly relevant here.

Related industry validation: Salesforce's Agent Graph work calls this "guided determinism", separating LLM reasoning from explicit choreography.

Would welcome your perspective on whether this aligns with the SyntropicSystems vision.
Alignment with @maheshmurag's Vision for Agent Skills

In his AI Engineer Summit talk and Anthropic engineering blog, Mahesh articulated core challenges this PR addresses.

Problems Identified by Mahesh:

Key Quote:

This insight applies directly to skill composition: dependencies are declared statically (via the composes field).

The "One Agent, Many Skills" Architecture

Mahesh advocates for "one universal agent powered by domain-specific skills" rather than multi-agent orchestration. Our hierarchical composition (L1 Atomic → L2 Composite → L3 Workflow) implements exactly this: a single agent selects high-level skills, and composition handles the rest deterministically.
What you're actually describing is MCPs executed in code.

Skills are inherently non-deterministic in nature; the sandbox environment they run in and the APIs they expose via scripts are what tightens that non-determinism.

Hate to break it to you, but this is too dense a PR for a problem that already has a widely regarded solution: MCP code execution. That is where you should consider your efforts IMO.
@numman-ali Is there an implementation of this pattern that you'd recommend? Maybe one of these two?
@rahimnathwani I am still spending time deciding on an approach, but at a minimum, for any MCP I want to use, I simply extract the API commands I care about into a skill using skill-creator. Here are some additional links to explore: https://github.com/jx-codes/lootbox

If you tell me your use case, I can likely give better advice.
Thanks for engaging, @numman-ali. Let me make sure I understand your position before responding. Your argument (as I understand it):

I appreciate the directness. Here's where I'd respectfully push back:

1. "Skills inherently are of non-deterministic nature": is this physics, or an arbitrary philosophical choice?

Barry Zhang (Anthropic, co-creator of Agent Skills) wrote in Making Peace with LLM Non-determinism:

The skill-creator skill itself says:

The question isn't whether skills CAN be non-deterministic; it's whether they SHOULD be for production workflows. This PR provides the "code components" Barry recommends, but at the composition layer.

2. MCP Code Execution has documented limitations

From Anthropic's engineering post and the a16z deep dive:

Static type checking at composition time catches errors BEFORE code generation, reducing the debugging burden.

3. This is fully optional and backwards-compatible

To be clear: this PR doesn't change how existing skills work. All new fields (level, operation, composes, inputs, outputs) are optional. This is an opt-in enhancement for teams who need deterministic composition. If your use case doesn't require it, you can ignore these features entirely. The spec explicitly states: "Teams can adopt composability incrementally."

4. This isn't just about client-side tool calling

You may be viewing this from the client perspective. But consider the gateway MCP pattern, which is increasingly common in production. Why gateways need typed composition:

Skill composability applies at every layer:

A typed composition graph is a "mental model" LLMs can use at ALL these layers, not just a runtime constraint. Is it possible you're considering this only from the user/client side rather than more holistically?

5. Genuine question: What's the most complex production system you've built with MCP code execution?

I ask sincerely because I'd love to understand how you've achieved mission-critical reliability. Specifically:

@rahimnathwani asked a similar question and you mentioned you're "still spending time deciding on an approach." That's fair; this is new territory. But that's precisely why exploring multiple approaches, including typed composition, benefits the community.

The core disagreement

If we accept that skills must be non-deterministic, we're making an arbitrary philosophical choice that limits what agents can reliably accomplish. The SyntropicSystems methodology captures this:

This PR implements that layered approach as an optional extension. My mission is to solve for LLM reliability and efficiency by providing "mental models" that help LLMs reason better than their present "best efforts."
- Add Industry Recognition section to README with a16z, Simon Willison, and Barry Zhang citations
- Add References section to README with FCCM 2014 paper and key sources
- Add Industry Validation section to architecture.mdx with detailed table
- Add Gateway MCP Pattern section with ASCII diagram and source table
- Add Barry Zhang quote on reducing non-determinism with code components
- All sources properly cited with links

These additions strengthen the case for composable skills by showing:
1. Industry recognition of the problems being solved
2. Gateway MCP pattern applicability beyond client-side
3. Theoretical foundation in FP-to-hardware synthesis
Dude @edu-ap, sorry, but if you're going to reply with a wall of text it's obvious you're using a tonne of AI.

You're not being thoughtful in your response - this is an open source protocol, people give their spare time to contribute.

I read your first point and saw that's a blog post from April 2024; things change.

Please come with a hand-written response, based on your personal experience and on real-world user experience, not copy-paste of random facts.
Thanks for the pushback, @numman-ali. Point taken on the HOW, but I'd argue it shouldn't distract from the WHAT.

I took a look at openskills; nice work on the universal CLI loader! I see these as complementary: openskills solves distribution (getting skills to agents); this PR addresses composition (what skills contain and how they fit together). A typed composition graph could actually help loaders like openskills optimise what to load and when.

Would you consider opening PRs to agentskills for:
The format benefits from diverse perspectives. Your loader experience is valuable here.
- Add to Industry Recognition section in README.md
- Add to References section in README.md
- Add to Industry Validation table in architecture.mdx

Key quote: "By writing explicit orchestration logic, Claude makes fewer errors"

This directly validates the composition layer approach.
Force-pushed from e3f77ed to 1197719.
@edu-ap so sorry, had to force push and this PR got auto-closed. You may need to rebase and re-push it, and link back to this for discussion. Apologies for the hassle!
No worries, it happens. I'll do that tomorrow 🙏🏼
@maheshmurag Following your note about the force push, I've rebased and opened a new PR:

→ PR #29: Quasi-deterministic LLM agents through composable skills with static type checking

All discussion from this PR is preserved here for reference. The new PR includes updated acknowledgements and links to community validation from related issues. Thanks for the heads up!
Executive Summary
This PR introduces an optional composability extension to the Agent Skills format, enabling atomic skills to be combined into higher-order workflows with static type checking. All new fields are optional; existing skills work unchanged.
The Problem
LLM agents with flat tool definitions suffer from non-deterministic behaviour that makes them unsuitable for production:
The Solution
This PR introduces composable skills with static type checking, applying principles from functional programming-based hardware synthesis (FCCM 2014) to LLM agent orchestration:
Key Benefits
Core Features Implemented
New Frontmatter Fields:
- level: composition tier (1 = Atomic, 2 = Composite, 3 = Workflow)
- operation: safety classification (READ / WRITE / TRANSFORM)
- composes: list of skill dependencies
- inputs / outputs: typed field schemas defining the skill's contract
Field Schema Properties:
- name, type, required
- range constraints (min / max)
- requires_source, requires_rationale (epistemic requirements)
CLI Commands:
```bash
skills-ref validate ./skills/research
skills-ref graph --format=mermaid ./skills
skills-ref typecheck ./skills
```
Architecture Overview
Three-level composition hierarchy:
- Level 1 (Atomic): single capabilities such as web-search or pdf-save
- Level 2 (Composite): compose atomic skills (e.g. research)
- Level 3 (Workflow): orchestration, including loops, recursion, and dynamic dispatch (e.g. deep-research, trip-optimize)
Showcase: Trip Optimizer
Complete example with 12 skills across 3 levels, demonstrating:
- Fan-out parallelization (evaluate 12 destinations simultaneously)
- Expected value optimization and early termination (stop when marginal return < marginal cost)
- Gradient descent refinement around the best option
- Self-recursion (trip-optimize and option-explore recurse)
- L3 → L3 composition (trip-optimize calls option-explore)
- Binary constraint filtering and MECE-compliant level separation
Industry Validation
The problems addressed align with documented challenges:
Gateway MCP Pattern
Typed composition applies across the stack:
The FP-to-Hardware Parallel
This architecture draws from research on compiling functional programs to parallel hardware (FCCM 2014):
Files Changed
Core Implementation:
Documentation & Examples:
Tests:
Test Results
All 103 tests pass successfully.
Backwards Compatibility
All new fields are optional. Existing skills work unchanged. This is an opt-in enhancement for teams requiring deterministic composition; adoption can be incremental.