Optimizing CUDA like a Human: Micro-Profiling Tools as Expert Surrogates for LLM-Based GPU Kernel Optimization

arXiv:2606.26453v1. Abstract: We present KernelPro, a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and pluggable bottleneck detection tools. KernelPro introduces four contributions: (1) a semantic feedback operator that encodes expert heuristics as pluggable micro-profiling tools, transforming raw hardware metrics into actionable natural language guidance; (2) a two-stage tool invocation architecture where roofline-based
The rapid advancement of LLMs and increasing demand for GPU-accelerated computing necessitate automated, intelligent optimization methods to maximize hardware efficiency.
Automated kernel optimization could make better use of expensive GPU resources, potentially lowering the cost and accelerating the pace of AI research and deployment.
Systems such as KernelPro suggest that much of GPU kernel optimization can be automated with LLMs, augmenting and in some cases replacing parts of expert manual tuning.
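To make the abstract's "semantic feedback operator" more concrete, here is a minimal Python sketch of what one pluggable micro-profiling tool might look like: it takes a few raw metrics, applies a standard roofline check, and returns a natural-language hint that an LLM could consume. This is an illustration under stated assumptions, not KernelPro's implementation; the names `KernelMetrics`, `DeviceLimits`, and `roofline_feedback`, the choice of metrics, and the wording of the hints are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelMetrics:
    """Hypothetical profiler summary for one kernel launch."""
    flops: float       # floating-point operations executed
    dram_bytes: float  # bytes moved to and from DRAM
    runtime_s: float   # measured kernel runtime in seconds


@dataclass
class DeviceLimits:
    """Hypothetical peak figures for the target GPU."""
    peak_flops: float      # peak compute throughput, FLOP/s
    peak_bandwidth: float  # peak DRAM bandwidth, bytes/s


def roofline_feedback(m: KernelMetrics, d: DeviceLimits) -> str:
    """Turn raw metrics into a one-sentence, roofline-based hint."""
    intensity = m.flops / m.dram_bytes       # arithmetic intensity, FLOP/byte
    ridge = d.peak_flops / d.peak_bandwidth  # intensity where the two roofs meet
    achieved = m.flops / m.runtime_s         # achieved FLOP/s
    # Attainable performance under the roofline model.
    roof = min(d.peak_flops, intensity * d.peak_bandwidth)
    utilization = achieved / roof

    if intensity < ridge:
        bound = "memory-bound"
        hint = ("reduce DRAM traffic, for example by tiling through "
                "shared memory or fusing adjacent kernels")
    else:
        bound = "compute-bound"
        hint = ("raise arithmetic throughput, for example by increasing "
                "instruction-level parallelism")

    return (f"Kernel is {bound} (intensity {intensity:.2f} FLOP/B, "
            f"ridge {ridge:.2f} FLOP/B) and reaches {utilization:.0%} "
            f"of its roofline limit. Consider: {hint}.")
```

In a closed loop of the kind the abstract describes, a string like this would be appended to the next code-generation prompt in place of the raw counter values; how KernelPro actually selects and phrases its guidance is not stated in the excerpt above.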
- AI developers
- Cloud computing providers
- NVIDIA
- High-performance computing sector
- Manual GPU optimization consultants
Increased performance and efficiency for GPU-intensive workloads, particularly in AI.
Reduced operational costs for large-scale AI training and inference, democratizing access to advanced AI.
Accelerated development of more complex and larger AI models due to optimized compute infrastructure.
Read at arXiv cs.LG