Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

arXiv:2606.01839v1. Abstract: LLM-based agents resolve a user task through many turns of dependent inference and tool calls, producing a workload whose total cost is unknown when the task arrives. Existing multi-turn systems keep the turn as the scheduling unit and decide, turn by turn, whether to disaggregate prefill from decode. That decision rests on the turn's decode length, tool behavior, and KV growth, quantities that are not observable when the scheduler must act, forcing the system to predict them. We show this dependence on prediction is imposed by the scheduling u
LLM-based agents place demands on serving systems that are unknown when a task arrives and change from turn to turn, so scheduling has to adapt as a conversation unfolds.
The work targets a scheduling bottleneck in scaling agentic systems, with the goal of more robust and cost-effective deployment.
Scheduling decisions for agentic workloads move from turn-level choices that depend on predicted quantities to conversation-level choices driven by what the system has observed, with the aim of improving resource utilization and performance.
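To make the contrast concrete, here is a minimal sketch of a scheduler that decides from observed state rather than forecasts. It is not the paper's algorithm, which the truncated abstract does not describe. The names `ConversationStats`, `observe_turn`, and `choose_placement`, the two placement labels, and the 8192-token threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConversationStats:
    """Quantities recorded after each completed turn (hypothetical schema)."""
    turns: int = 0
    decode_tokens: int = 0
    kv_tokens: int = 0  # context accumulated in the KV cache so far

    def observe_turn(self, new_prompt_tokens: int, decode_tokens: int) -> None:
        # Record what actually happened in the turn; nothing here is forecast.
        self.turns += 1
        self.decode_tokens += decode_tokens
        self.kv_tokens += new_prompt_tokens + decode_tokens


def choose_placement(stats: ConversationStats, kv_threshold: int = 8192) -> str:
    """Pick a placement for the conversation from observed state only.

    The threshold and the rule "disaggregate once the observed context is
    large" are illustrative assumptions, not taken from the source.
    """
    if stats.turns == 0:
        return "colocated"  # no observations yet, so fall back to a default
    if stats.kv_tokens >= kv_threshold:
        return "disaggregated"
    return "colocated"


stats = ConversationStats()
first = choose_placement(stats)       # default before any turn completes
stats.observe_turn(3000, 500)         # observed context: 3500 tokens
second = choose_placement(stats)
stats.observe_turn(4000, 1000)        # observed context: 8500 tokens
third = choose_placement(stats)
print(first, second, third)
```

The point of the sketch is only that every input to `choose_placement` is available when the decision is made; a turn-level scheduler would instead need the upcoming turn's decode length and KV growth before they exist.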
- AI compute providers
- Developers of agentic AI systems
- Cloud infrastructure companies
- Companies with inefficient AI scheduling infrastructure
- Developers reliant on simpler, less optimized inference pipelines
Improved efficiency and reduced operational costs for large-scale AI agent deployments.
Faster development and deployment cycles for multi-turn AI agents across various industries.
Acceleration of the trend towards autonomous AI systems capable of handling complex, long-running tasks.
Read at arXiv cs.LG