research: tool invocation reliability taxonomy — 12 categories, model-size threshold for reliable tool use (arXiv:2601.16280) #2234
Closed
Labels
- P2: High value, medium complexity
- research: Research-driven improvement
- tools: Tool execution and MCP integration
Description
Source
arXiv:2601.16280 — "When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems"
Summary
Introduces a 12-category diagnostic framework for tool invocation failures across setup, parameter handling, execution, and result interpretation phases. Benchmarks 1,980 scenarios across GPT-4, Claude, and Qwen2.5. Key finding: mid-sized models (qwen2.5:14b) achieve 96.6% tool success rate at the best cost/reliability tradeoff.
Applicability to Zeph
HIGH — zeph-tools ToolExecutor, audit logging, and tool error taxonomy.
Zeph already has a ToolErrorCategory enum (PR #2214) with 12 categories. This paper provides empirical grounding:
- Cross-validate Zeph's 12-category taxonomy against the paper's framework
- The model-size threshold finding is actionable for routing tool-heavy tasks: prefer qwen2.5:14b-equivalent models for reliability-critical tool calls
- Setup/parameter handling failures map to Zeph's InvalidInput/SchemaError categories
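One way to act on the routing point is to pick the cheapest provider whose measured tool success rate clears a floor. The sketch below is only an illustration: `Provider`, `pick_provider`, and the `relative_cost` field are hypothetical names, not Zeph's actual orchestration API, and the only figure taken from the paper is the 96.6% success rate for qwen2.5:14b.

```rust
/// Hypothetical description of a routable provider (not Zeph's real type).
#[derive(Debug, Clone, PartialEq)]
pub struct Provider {
    pub name: &'static str,
    /// Fraction of tool invocations that succeed, in the range 0.0..=1.0.
    pub tool_success_rate: f64,
    /// Unitless cost rank; lower is cheaper. Assumed for illustration.
    pub relative_cost: u32,
}

/// Returns the cheapest provider whose tool success rate is at least
/// `min_success_rate`, or `None` if no provider qualifies.
pub fn pick_provider(providers: &[Provider], min_success_rate: f64) -> Option<&Provider> {
    providers
        .iter()
        .filter(|p| p.tool_success_rate >= min_success_rate)
        .min_by_key(|p| p.relative_cost)
}
```

Calling `pick_provider(&providers, 0.95)` on a list that contains a qwen2.5:14b entry with `tool_success_rate: 0.966` would select it whenever it is the cheapest entry above the floor. The floor itself is a tuning knob that would need to be chosen per deployment.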
Implementation Direction
- Annotate Zeph's ToolErrorCategory with the paper's phase labels (setup/param/execution/result)
- Add phase-level error metrics to [tools.audit] output
- Use model-size findings to configure [orchestration] tool-heavy task provider selection
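The first two items could look roughly like the sketch below: a phase label attached to each error category, plus a per-phase counter that audit output could report. This is a sketch under stated assumptions. `ToolPhase` and `phase_counts` are hypothetical names, the `ToolErrorCategory` here is a four-variant stand-in for Zeph's real 12-variant enum, and the assignment of `SchemaError` to setup and `InvalidInput` to parameter handling is a guess, since the issue only says that the two phases map to the two categories.

```rust
use std::collections::BTreeMap;

/// The four phases named in the paper's framework.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ToolPhase {
    Setup,
    Parameter,
    Execution,
    Result,
}

/// Stand-in for Zeph's enum. Only `InvalidInput` and `SchemaError` are
/// named in this issue; the other two variants are hypothetical placeholders.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToolErrorCategory {
    InvalidInput,
    SchemaError,
    Timeout,         // hypothetical
    MalformedOutput, // hypothetical
}

impl ToolErrorCategory {
    /// Phase label for this category. The exact assignment is an assumption.
    pub fn phase(self) -> ToolPhase {
        match self {
            ToolErrorCategory::SchemaError => ToolPhase::Setup,
            ToolErrorCategory::InvalidInput => ToolPhase::Parameter,
            ToolErrorCategory::Timeout => ToolPhase::Execution,
            ToolErrorCategory::MalformedOutput => ToolPhase::Result,
        }
    }
}

/// Counts errors per phase; phases with no errors are absent from the map.
pub fn phase_counts(errors: &[ToolErrorCategory]) -> BTreeMap<ToolPhase, usize> {
    let mut counts = BTreeMap::new();
    for e in errors {
        *counts.entry(e.phase()).or_insert(0) += 1;
    }
    counts
}
```

An exhaustive `match` means the compiler flags any category added later without a phase label, which keeps the annotation from drifting out of sync with the enum.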
Priority: P2
Discovered: CI-211 research scan (2026-03-27)