ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training

arXiv:2606.15682v1. Abstract: Large Reasoning Models (LRMs) achieve strong problem-solving through long chain-of-thought, but their deployment is constrained by the high cost of full-precision inference and growing KV cache footprints. Microscaled FP4 formats enable efficient FP4 deployment; however, fully quantizing weights, activations, and KV caches (W4A4KV4) causes severe reasoning degradation that existing PTQ and QAT fail to recover. We identify that FP4 failures concentrate on low-entropy tokens (precise symbolic commitments such as digits and operators) where quantiza
The increasing scale of Large Reasoning Models is straining current inference capacity, which makes efficient quantization a critical requirement for deployment.
Preserving accuracy under lower-precision inference directly reduces the cost of running large AI models and widens access to them, which supports broader deployment.
This research suggests a path to deploy highly capable reasoning models more efficiently, lowering the barriers to entry for advanced AI applications.
- AI compute providers
- Cloud AI service platforms
- Developers of Reasoning Models
- Companies seeking to deploy LRMs
- Hardware manufacturers solely focused on full-precision compute
Reduced computational cost and memory footprint for running Large Reasoning Models.
Increased adoption and accessibility of complex AI capabilities across various industries due to lower operational expenses.
Potentially democratized access to advanced AI, fostering innovation beyond well-resourced institutions.
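For readers unfamiliar with the "microscaled FP4" formats named in the abstract, the following is a minimal illustrative sketch, not the paper's method. It assumes a 4-bit E2M1 element grid and one shared power-of-two scale per block of 32 values; the function names, the block size, and the scale-selection rule (chosen here so that no value is clipped) are assumptions made for illustration only.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Fake-quantize one block: pick a shared power-of-two scale,
    then snap each scaled value to the nearest E2M1 magnitude.

    Hypothetical helper; returns (scale, dequantized_values).
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumption: smallest power-of-two scale that keeps the largest
    # magnitude within the grid maximum (6.0), so nothing is clipped.
    scale = 2.0 ** math.ceil(math.log2(amax / E2M1_GRID[-1]))
    out = []
    for x in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag, x) * scale)
    return scale, out


def quantize(values, block_size=32):
    """Apply block-wise fake quantization to a flat list of floats."""
    result = []
    for i in range(0, len(values), block_size):
        _, q = quantize_block(values[i:i + block_size])
        result.extend(q)
    return result
```

With only eight magnitudes per sign, every value in a block is rounded onto a coarse grid set by the block's largest element. For example, `quantize_block([6.0, 3.0, -1.0, 0.2])` picks a scale of 1.0 and rounds 0.2 to zero.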
Read at arXiv cs.LG