
arXiv:2509.20241v2 (announce type: replace)
Abstract: As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, leading to systematic overestimation. We introduce a bottom-up framework estimating inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicati…
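The abstract names three inputs to the bottom-up estimate: token throughput, node power, and overhead. The sketch below shows one plausible way such inputs could combine into a per-query figure. It is an illustration only, not the paper's actual formula: the function name, the parameter names, the choice to model overhead as a single multiplier, and every input value in the example are assumptions picked for round arithmetic, not numbers taken from the source.

```python
def energy_per_query_wh(tokens_per_query: float,
                        node_throughput_tok_s: float,
                        node_power_w: float,
                        overhead_factor: float) -> float:
    """Rough per-query energy in watt-hours (hypothetical helper).

    Assumes the node's aggregate throughput is shared across concurrent
    queries, so one query "occupies" tokens / throughput node-seconds.
    overhead_factor is an assumed multiplier for cooling, power delivery,
    and similar facility overhead (1.0 means no overhead).
    """
    if node_throughput_tok_s <= 0:
        raise ValueError("node throughput must be positive")
    node_seconds = tokens_per_query / node_throughput_tok_s
    joules = node_seconds * node_power_w * overhead_factor
    return joules / 3600.0  # 1 Wh = 3600 J


# Placeholder inputs chosen for round numbers, NOT taken from the paper:
# 600 tokens per query, 2000 tokens/s per node, 10 kW node, 1.2x overhead.
example_wh = energy_per_query_wh(600, 2000, 10_000, 1.2)
print(f"{example_wh:.2f} Wh/query")
```

Under this simplified model, per-query energy scales linearly with tokens generated and inversely with node throughput, which is one way to see why assuming low-utilization, non-production throughput would inflate an estimate.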
The accelerating scale and deployment of AI models necessitate an accurate understanding of their real-world energy consumption beyond laboratory settings.
Accurate energy estimation is critical for capacity planning, cost optimization, and policy development for large-scale AI infrastructure.
Estimates of AI inference energy consumption shift from potentially overestimated lab figures to more realistic production-scale metrics, which influences infrastructure investment and operational strategies.
- AI infrastructure providers with energy-efficient deployments
- Hyperscalers optimizing AI operations
- Policy makers with better data
- AI models with high per-query energy use
- Data centers with inefficient cooling/power
- Legacy inference architectures
More precise energy cost models for AI inference will emerge, improving financial forecasting for AI-driven services.
Increased pressure will be placed on chip manufacturers and AI developers to prioritize energy efficiency in future designs and models.
Energy consumption could become a more significant differentiator in AI service offerings, driving market preferences towards more sustainable solutions.
Read at arXiv cs.LG