TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

arXiv:2606.11357v1 (announce type: cross). Abstract: With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets trans
Two pressures motivate libraries like TileFuse: growing demand for on-device LLM inference, and client NPU software stacks that are often proprietary, expose limited low-level control, and do not map cleanly onto quantization formats such as AWQ.
A kernel library of this kind could make LLM deployment on edge devices more efficient and lower-power, extending the reach of LLMs beyond cloud-hosted services.
Optimized kernel libraries can unlock greater performance and energy efficiency from existing NPU hardware, making on-device LLM inference more feasible and widespread.
- AMD
- Edge AI device manufacturers
- AI developers
- Consumers of AI-powered edge devices
- Cloud-centric AI providers
- NPU competitors lacking similar optimization
- Software stacks with poor low-level control
Improved performance and energy efficiency of LLMs on AMD's edge NPUs.
Increased adoption of locally-run AI models, reducing reliance on cloud infrastructure for many applications.
Enhanced competition in the edge AI hardware and software space, potentially accelerating innovation and lowering costs for on-device inference.
Read at arXiv cs.AI