Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

arXiv:2605.20295v1. Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency. However, existing post-training quantization (PTQ) methods predominantly rely on dynamic activation quantization, rendering them incompatible with NPU hardware constraints. To bridge the gap between high-fidelity PTQ and NPU-constrained inference, we propose Quant.npu, an integer-only fully static quantization framework. It incorporates learnable quantization paramete
The rapid deployment of large language models (LLMs) to mobile devices creates an immediate need for efficient on-device inference, which in turn requires quantization methods that fit NPU hardware constraints, in particular fully static quantization.
Efficient on-device LLM inference addresses critical power, latency, and privacy concerns, enabling broader adoption and new applications for AI directly on user devices.
This advancement enables high-fidelity quantization techniques to be compatible with mobile NPU constraints, allowing more powerful LLMs to run directly on consumer hardware with improved performance.
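The distinction between dynamic and fully static activation quantization can be illustrated with a minimal sketch. This is not the Quant.npu method: the abstract mentions learnable quantization parameters, whereas the sketch below assumes plain symmetric max-abs calibration, and all function names and values are hypothetical.

```python
def quantize(values, scale, bits=8):
    """Symmetric quantization to signed integers using a given scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dynamic_scale(values, bits=8):
    """Dynamic: the scale is derived from the current input at run time."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def calibrate_static_scale(calibration_batches, bits=8):
    """Static: the scale is fixed offline from calibration data (assumed max-abs rule)."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for batch in calibration_batches for v in batch) / qmax


# Hypothetical calibration activations, collected once before deployment.
calibration = [[0.5, -1.0, 0.25], [2.5, -0.5, 1.5]]
static_scale = calibrate_static_scale(calibration)

# At inference time the static path reuses the precomputed scale,
# while the dynamic path must scan each input to compute a new one.
activation = [0.4, -1.0, 0.1]
q_static = quantize(activation, static_scale)
q_dynamic = quantize(activation, dynamic_scale(activation))
```

In this sketch the dynamic path uses the full integer range for every input but needs a run-time reduction over the activations, which is the step the abstract describes as incompatible with NPU constraints. The static path avoids that work at the cost of a coarser grid whenever an input is smaller than the calibration range.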
- Mobile device manufacturers
- On-device AI application developers
- Chip designers (NPUs)
- Consumers of mobile AI
- Cloud-dependent AI services
- Less optimized quantization techniques
Widespread deployment of sophisticated LLMs on mobile devices becomes feasible and performant.
This democratizes access to advanced AI capabilities, reducing reliance on constant network connectivity and cloud infrastructure.
It could accelerate the development of highly personalized and privacy-preserving AI assistants that operate entirely locally.
Read at arXiv cs.LG