research(performance): PASTE application-level speculative tool execution — 48.5% latency reduction, 1.8x throughput (arXiv:2603.18897) #2409
Description
Source
arXiv:2603.18897 — "Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution" (PASTE, submitted March 19, 2026)
Key Contribution
PASTE observes that agent tool-call sequences follow stable application-level control flows. It speculatively pre-executes predicted tool sequences, based on learned call patterns, before the LLM finishes reasoning, hiding tool-execution latency behind decoding. The paper reports a 48.5% reduction in task completion time and a 1.8x improvement in tool throughput.
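The pattern-learning idea can be illustrated with a minimal sketch. This is a hypothetical bigram-frequency model, not the paper's actual representation (which the issue only describes as "learned call sequences"); `ToolPatternModel`, `observe`, and `predict` are illustrative names.

```python
from collections import Counter, defaultdict

class ToolPatternModel:
    """Learns application-level tool-call patterns as bigram frequencies.

    Hypothetical sketch: counts which tool tends to follow which,
    then predicts the most frequent successors.
    """

    def __init__(self):
        # prev tool -> Counter of observed next tools
        self._next = defaultdict(Counter)

    def observe(self, sequence):
        """Record one completed tool-call sequence."""
        for prev, nxt in zip(sequence, sequence[1:]):
            self._next[prev][nxt] += 1

    def predict(self, last_tool, k=1):
        """Top-k most likely next tools after `last_tool`."""
        return [t for t, _ in self._next[last_tool].most_common(k)]

model = ToolPatternModel()
model.observe(["web_search", "read", "summarize"])
model.observe(["web_search", "read", "summarize"])
model.observe(["web_search", "summarize"])
print(model.predict("web_search", k=2))  # ['read', 'summarize']
```

A real implementation would likely condition on more context (skill, task type, longer history), but even bigrams capture patterns like "search precedes summarize."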
Distinction from #2290
Existing issue #2290 (speculative tool calls) covers pre-fetching driven by LLM draft tokens at the decoding level. PASTE is a distinct approach: it learns recurring application-level call sequences (e.g., "search always precedes summarize in research tasks") as patterns. Different mechanism, different implementation point, and concrete throughput numbers make it worth tracking separately.
Relevance to Zeph
zeph-tools / zeph-core agent loop — ToolOrchestrator could learn per-skill tool-sequence patterns (e.g., web_search → read → summarize for research skills) and pre-execute the first N predicted tools while the LLM is still generating the response that selects them.
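The overlap of speculative execution with LLM decoding can be sketched with asyncio. Everything here is hypothetical: `run_tool` stands in for zeph's actual tool runner, and `decide_actual` stands in for decoding the LLM response that names the chosen tool.

```python
import asyncio

async def run_tool(name):
    # Hypothetical tool runner; sleep simulates I/O-bound tool latency.
    await asyncio.sleep(0.05)
    return f"{name}:result"

async def speculative_step(predicted, decide_actual):
    """Start the predicted tool before the LLM's choice is known.

    `decide_actual` is an awaitable resolving to the tool the LLM
    actually selected. On a hit the tool's latency is hidden behind
    decoding; on a miss the speculative task is cancelled.
    """
    spec = asyncio.create_task(run_tool(predicted))
    actual = await decide_actual
    if actual == predicted:
        return await spec          # commit: result is already in flight
    spec.cancel()                  # mispredict: discard speculative work
    return await run_tool(actual)  # fall back to normal execution

async def main():
    async def llm_choice():
        await asyncio.sleep(0.01)  # stand-in for response generation
        return "read"
    return await speculative_step("read", llm_choice())

print(asyncio.run(main()))  # read:result
```

On a correct prediction the tool finishes (or is well underway) by the time decoding completes, which is the latency-hiding effect PASTE reports.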
Implementation Sketch
- Track per-skill tool invocation sequences in SQLite (extend scheduler/memory tables)
- On skill activation, predict top-K likely tool calls; pre-warm them speculatively
- Cancel if LLM selects a different tool path; commit if prediction matches
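The three steps above can be sketched against SQLite. The schema and function names (`tool_transitions`, `record`, `predict_top_k`) are hypothetical; zeph's actual scheduler/memory tables are assumed, not shown in this issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical extension table: per-skill tool-transition counts.
conn.execute("""
    CREATE TABLE tool_transitions (
        skill     TEXT,
        prev_tool TEXT,
        next_tool TEXT,
        count     INTEGER,
        PRIMARY KEY (skill, prev_tool, next_tool)
    )
""")

def record(skill, sequence):
    """Persist observed per-skill tool transitions (step 1)."""
    for prev, nxt in zip(sequence, sequence[1:]):
        conn.execute(
            """INSERT INTO tool_transitions VALUES (?, ?, ?, 1)
               ON CONFLICT(skill, prev_tool, next_tool)
               DO UPDATE SET count = count + 1""",
            (skill, prev, nxt),
        )

def predict_top_k(skill, prev_tool, k):
    """Top-K likely next tools for a skill, to pre-warm (step 2)."""
    rows = conn.execute(
        """SELECT next_tool FROM tool_transitions
           WHERE skill = ? AND prev_tool = ?
           ORDER BY count DESC LIMIT ?""",
        (skill, prev_tool, k),
    )
    return [r[0] for r in rows]

record("research", ["web_search", "read", "summarize"])
record("research", ["web_search", "read", "summarize"])
record("research", ["web_search", "summarize"])
print(predict_top_k("research", "web_search", 2))  # ['read', 'summarize']
```

Step 3 (cancel-or-commit) would then compare the LLM's actual tool selection against the pre-warmed set before using any speculative result. Note the `ON CONFLICT ... DO UPDATE` upsert requires SQLite 3.24+.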
Priority Assessment
P3 (research) — high impact if tool-call latency becomes a user-visible bottleneck. Distinct from #2290; can be implemented independently.