
arXiv:2605.23918v1 (announce type: cross)
Abstract: The AI inference industry keeps models loaded in GPU memory around the clock to avoid cold-start latency, implicitly treating idle power as a fixed cost of readiness. Yet the structure of this cost has never been empirically decomposed, and never across GPU architectures. We present the first cross-architecture measurement of idle GPU power as a function of VRAM allocation, combining 18 days of production telemetry (335,267 samples, 14 H100 GPUs) with controlled dose-response experiments on three GPU architectures spanning three memory technol
The rapid expansion of AI inference and the 'always-on' model deployment strategy are making the implicit cost of idle GPU power large enough to quantify, and empirical analysis becomes necessary as compute demand scales.
This research provides data for optimizing AI infrastructure, revealing hidden energy costs that could drive investment in more efficient GPU utilization, software, and potentially hardware architectures.
The understanding of GPU energy consumption will shift from treating idle power as a fixed cost to treating it as a variable that depends on VRAM allocation and architecture, affecting design and operational decisions for AI inference.
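One way to picture idle power as a variable rather than a fixed cost is a per-architecture fit of idle power against allocated VRAM. The following is a minimal sketch under that assumption: the linear form, the function name `fit_idle_power`, and the sample readings are illustrative placeholders, not methods or measurements from the paper.

```python
def fit_idle_power(vram_gib, power_w):
    """Ordinary least-squares fit of power_w = baseline + slope * vram_gib.

    Assumption: a straight line is an adequate first approximation.
    The paper may use a different model; this is only an illustration.
    """
    n = len(vram_gib)
    if n != len(power_w) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_x = sum(vram_gib) / n
    mean_y = sum(power_w) / n
    sxx = sum((x - mean_x) ** 2 for x in vram_gib)
    if sxx == 0:
        raise ValueError("VRAM allocations must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(vram_gib, power_w))
    slope = sxy / sxx                    # watts per GiB allocated
    baseline = mean_y - slope * mean_x   # watts with nothing allocated
    return baseline, slope


# Hypothetical readings (made-up numbers, NOT data from the paper).
samples_vram_gib = [0, 20, 40, 60, 80]
samples_power_w = [70.0, 72.0, 74.0, 76.0, 78.0]

baseline_w, slope_w_per_gib = fit_idle_power(samples_vram_gib, samples_power_w)
print(f"baseline: {baseline_w:.1f} W, slope: {slope_w_per_gib:.3f} W/GiB")
```

Fitting each architecture separately would let the slope and baseline differ by memory technology, which is the comparison the abstract describes.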
- GPU power management software developers
- AI data center operators focused on efficiency
- Hyperscalers with diverse GPU architectures
- Developers of custom AI chips with dynamic power states

- AI firms with inefficient 'always-on' deployment
- Older GPU architectures with poor idle power characteristics
- Data centers with undifferentiated power strategies
- Organizations treating compute as an unlimited, free resource
AI inference costs will be re-evaluated to include dynamic idle power consumption, affecting pricing models and operational budgets.
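The budgeting arithmetic behind such a re-evaluation can be sketched as follows. This is a hedged illustration only: the function name `annual_idle_cost`, its parameters, and every input value are hypothetical, and the example numbers are not taken from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_idle_cost(idle_watts, gpu_count, idle_fraction, usd_per_kwh, pue=1.0):
    """Estimate the yearly electricity cost of GPUs sitting idle with models loaded.

    idle_watts    -- per-GPU idle draw in watts (hypothetical input)
    gpu_count     -- number of GPUs in the fleet
    idle_fraction -- share of the year spent idle, between 0 and 1
    usd_per_kwh   -- electricity price
    pue           -- power usage effectiveness multiplier for facility overhead
    """
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    kwh = idle_watts * gpu_count * idle_fraction * HOURS_PER_YEAR / 1000.0
    return kwh * pue * usd_per_kwh


# Illustrative inputs only: 100 W idle draw, 14 GPUs, idle half the time, $0.10/kWh.
cost_usd = annual_idle_cost(100.0, 14, 0.5, 0.10)
print(f"estimated annual idle cost: ${cost_usd:,.2f}")
```

Replacing the constant `idle_watts` with a value that depends on allocated VRAM would turn this from a fixed-cost estimate into the dynamic one described above.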
That re-evaluation is likely to spur demand for advanced GPU power monitoring and management solutions, driving innovation in data center and software efficiency.
The quantification of this 'parking tax' could incentivize the development of novel AI model architectures and serving strategies that allow for rapid unloading/loading, reducing reliance on always-on deployment.
Read at arXiv cs.LG