arXiv:2605.26289v1 Announce Type: new Abstract: Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefi
Source: arXiv cs.LG — read the full report at the original publisher.
