
arXiv:2604.18128v2. Abstract: We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention, Depth Registers with a register-magnitude hinge loss (DR+sink), reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 P
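For readers unfamiliar with the baseline, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization applied to activations, and of how a single large-magnitude channel can swamp the rest. It is an illustration under stated assumptions, not the paper's setup: the per-token absmax scaling, the Gaussian activations, the outlier channel index, and the 50x outlier factor are all hypothetical choices made here.

```python
import random

def quantize_rtn(row, bits=4):
    """Symmetric round-to-nearest fake quantization of one row (absmax scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    absmax = max(abs(v) for v in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest integer level (Python rounds ties to even),
    # clamp to the signed 4-bit range, then map back to floating point.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in row]

def relative_error(rows, rows_q, skip):
    """L2 relative error over every channel except index `skip`."""
    num = den = 0.0
    for row, row_q in zip(rows, rows_q):
        for j, (v, vq) in enumerate(zip(row, row_q)):
            if j != skip:
                num += (v - vq) ** 2
                den += v ** 2
    return (num / den) ** 0.5

random.seed(0)
OUTLIER = 3  # hypothetical outlier channel index (assumption)

# 256 tokens x 64 channels of synthetic activations (assumption: unit Gaussian).
acts = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(256)]
# Same activations, with one channel scaled up 50x to mimic an outlier.
acts_outlier = [[v * 50.0 if j == OUTLIER else v for j, v in enumerate(row)]
                for row in acts]

# Error is measured on the ordinary channels only, in both cases.
err_clean = relative_error(acts, [quantize_rtn(r) for r in acts], OUTLIER)
err_outlier = relative_error(acts_outlier,
                             [quantize_rtn(r) for r in acts_outlier], OUTLIER)

print(f"ordinary-channel error, no outlier:   {err_clean:.3f}")
print(f"ordinary-channel error, with outlier: {err_outlier:.3f}")
```

Because the scale is set by the largest value in each row, one outlier stretches the 16 available levels so far apart that ordinary channels mostly round to zero. Activation outliers of this kind are a commonly cited reason why A4 is harder than W4, and the usual motivation for rescaling methods such as SmoothQuant.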
Demand for cheaper inference is driving work on low-bit quantization, which cuts the memory and compute cost of running language models.
In a controlled 300M-parameter setting, this research narrows the gap between 4-bit weight and 4-bit activation (W4A4) quantization and FP16: perplexity falls from 1727 under naive rounding to 119 with Depth Registers, and to 39.9 when combined with SmoothQuant, against an FP16 baseline of 23.6.
If the result carries over to larger models, Depth Registers with W4A4 quantization suggest a pathway to running high-quality LLMs on far less capable hardware than currently required.
- AI hardware manufacturers (edge devices)
- Cloud AI providers (reduced infrastructure costs)
- AI developers (wider deployment options)
- Consumers (more accessible AI features)
- Manufacturers of memory-intensive high-end AI accelerators
More capable AI models could become deployable on constrained devices such as smartphones, IoT devices, and embedded systems.
That would broaden access to advanced AI capabilities and make room for new applications and services.
Lower compute and energy demands for AI could also ease some pressure on energy grids and contribute to more sustainable AI development.
Read at arXiv cs.CL