
research: tool invocation reliability taxonomy — 12 categories, model-size threshold for reliable tool use (arXiv:2601.16280) #2234

@bug-ops

Description


Source

arXiv:2601.16280 — "When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems"

Summary

Introduces a 12-category diagnostic framework for tool invocation failures across setup, parameter handling, execution, and result interpretation phases. Benchmarks 1,980 scenarios across GPT-4, Claude, and Qwen2.5. Key finding: mid-sized models (qwen2.5:14b) achieve 96.6% tool success rate at the best cost/reliability tradeoff.

Applicability to Zeph

HIGH. Touches the zeph-tools ToolExecutor, audit logging, and the tool error taxonomy.

Zeph already has a ToolErrorCategory enum (PR #2214) with 12 categories. This paper provides empirical grounding:

  • Cross-validate Zeph's 12-category taxonomy against the paper's framework
  • The model-size threshold finding is actionable for routing tool-heavy tasks: prefer qwen2.5:14b-equivalent models for reliability-critical tool calls
  • Setup/parameter handling failures map to Zeph's InvalidInput/SchemaError categories
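As a sketch of the cross-validation, the paper's four phases can be mapped onto the taxonomy. The enum variants below are hypothetical stand-ins; the actual ToolErrorCategory from PR #2214 may use different names, with only InvalidInput and SchemaError confirmed above:

```python
from enum import Enum

# Hypothetical subset of Zeph's ToolErrorCategory (PR #2214);
# only INVALID_INPUT and SCHEMA_ERROR are confirmed by this issue,
# the other variant names are illustrative.
class ToolErrorCategory(Enum):
    INVALID_INPUT = "InvalidInput"
    SCHEMA_ERROR = "SchemaError"
    TOOL_NOT_FOUND = "ToolNotFound"
    EXECUTION_TIMEOUT = "ExecutionTimeout"
    RESULT_MISPARSE = "ResultMisparse"

# The paper's four phases (setup / parameter handling / execution /
# result interpretation) mapped onto the taxonomy -- illustrative only.
PHASE_OF = {
    ToolErrorCategory.TOOL_NOT_FOUND: "setup",
    ToolErrorCategory.INVALID_INPUT: "param",
    ToolErrorCategory.SCHEMA_ERROR: "param",
    ToolErrorCategory.EXECUTION_TIMEOUT: "execution",
    ToolErrorCategory.RESULT_MISPARSE: "result",
}

def phase_for(category: ToolErrorCategory) -> str:
    """Return the paper-aligned phase label for a tool error category."""
    return PHASE_OF[category]
```

Keeping the mapping as a standalone table rather than baked into the enum lets the phase labels change if the cross-validation shifts categories between phases.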

Implementation Direction

  • Annotate Zeph's ToolErrorCategory with paper's phase labels (setup/param/execution/result)
  • Add phase-level error metrics to [tools.audit] output
  • Use model-size findings to configure [orchestration] tool-heavy task provider selection
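The phase-level metrics item could be aggregated roughly as follows. The audit record shape (a `failure_phase` field that is `None` on success) is an assumption for illustration, not Zeph's actual [tools.audit] schema:

```python
from collections import Counter
from typing import Iterable

def phase_error_metrics(records: Iterable[dict]) -> dict:
    """Aggregate per-phase failure counts and an overall success rate.

    Each record is assumed to carry a 'failure_phase' key holding one of
    the paper's phase labels (setup/param/execution/result), or None when
    the tool call succeeded. This shape is hypothetical.
    """
    counts = Counter()
    total = 0
    failures = 0
    for rec in records:
        total += 1
        phase = rec.get("failure_phase")
        if phase is not None:
            failures += 1
            counts[phase] += 1
    success_rate = (total - failures) / total if total else 1.0
    return {"success_rate": success_rate, "failures_by_phase": dict(counts)}

# Example audit slice (synthetic):
calls = [
    {"tool": "search", "failure_phase": None},
    {"tool": "search", "failure_phase": "param"},
    {"tool": "fetch", "failure_phase": "execution"},
    {"tool": "fetch", "failure_phase": None},
]
```

Emitting `failures_by_phase` alongside the existing per-category counts would let the audit output answer "where in the invocation lifecycle do we fail" without changing the category-level logging.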

Priority: P2
Discovered: CI-211 research scan (2026-03-27)
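The [orchestration] provider-selection direction could be sketched as "cheapest provider above a tool-reliability floor". The 96.6% figure for qwen2.5:14b is from the paper; the other models, rates, and relative costs below are illustrative placeholders, not measured values or Zeph's actual config schema:

```python
# Hypothetical provider table. Only the qwen2.5:14b success rate (96.6%)
# comes from arXiv:2601.16280; the rest are placeholder values.
PROVIDERS = [
    {"model": "small-model",  "tool_success": 0.90, "relative_cost": 1.0},
    {"model": "qwen2.5:14b",  "tool_success": 0.966, "relative_cost": 2.0},
    {"model": "large-model",  "tool_success": 0.97, "relative_cost": 9.0},
]

def pick_provider(min_tool_success: float = 0.95) -> dict:
    """Cheapest provider that meets the tool-reliability floor."""
    eligible = [p for p in PROVIDERS if p["tool_success"] >= min_tool_success]
    return min(eligible, key=lambda p: p["relative_cost"])
```

With a 0.95 floor this selects the mid-sized model, matching the paper's cost/reliability-tradeoff finding; tool-light tasks could simply pass a lower floor.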

Metadata


Assignees

Labels

  • P2: High value, medium complexity
  • research: Research-driven improvement
  • tools: Tool execution and MCP integration

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
