LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

arXiv:2603.23640v2. Abstract: Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management sup
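The measurement protocol named in the abstract (a fixed prompt, repeated warm-condition iterations, per-run throughput and latency) can be sketched roughly as below. This is a minimal illustration, not the authors' harness: `generate` is a hypothetical callable that runs one inference and returns the number of tokens produced, the warm-up count is an assumption, and `energy_per_token_j` assumes average power is read from an external meter.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(
    generate: Callable[[str], int],   # hypothetical: runs one inference, returns tokens produced
    prompt: str,
    iterations: int = 20,             # the abstract reports 20 iterations per device
    warmup: int = 2,                  # assumption: the warm-up count is not given in the source
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time repeated generations and summarise latency and throughput."""
    # Warm-up runs are discarded so that timed runs start from a warm state.
    for _ in range(warmup):
        generate(prompt)

    latencies: List[float] = []
    throughputs: List[float] = []
    for _ in range(iterations):
        start = clock()
        n_tokens = generate(prompt)
        elapsed = clock() - start
        latencies.append(elapsed)
        throughputs.append(n_tokens / elapsed)

    return {
        "mean_latency_s": statistics.mean(latencies),
        "mean_tokens_per_s": statistics.mean(throughputs),
        "stdev_tokens_per_s": statistics.stdev(throughputs) if iterations > 1 else 0.0,
        # A ratio well below 1.0 would suggest throughput decaying under sustained load.
        "last_to_first_ratio": throughputs[-1] / throughputs[0],
    }


def energy_per_token_j(avg_power_w: float, tokens_per_s: float) -> float:
    """Joules per token, given average power in watts and throughput in tokens per second."""
    return avg_power_w / tokens_per_s
```

Comparing the last iteration's throughput with the first is one simple way to surface sustained-load degradation; a real study would also log temperature and power alongside each run.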
This research provides timely empirical data on the performance of quantized LLMs on edge devices, a critical step towards widespread deployment of AI agents in power-constrained environments.
Understanding the precise trade-offs of LLM inference across edge hardware matters to companies building next-generation AI-powered mobile experiences and agents, because those trade-offs directly shape product roadmaps and market strategies.
The detailed benchmarking clarifies the realistic capacity and limits of current edge hardware under sustained LLM load, and may shift development priorities towards NPU optimization and efficient model quantization.
- NPU manufacturers
- On-device AI application developers
- Quantization tool providers
- Mobile device manufacturers
- Cloud-dependent AI service providers (for certain use cases)
- Companies underestimating edge computing potential
Increased investment in edge AI hardware and software optimization by tech giants.
Accelerated development and adoption of always-on, personalized AI agents directly on user devices.
A fragmentation of the AI compute landscape, with distinct cloud and edge ecosystems emerging for different application needs.
Read at arXiv cs.LG