Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

arXiv:2606.10722v1. Abstract: We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, t
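The routing rule described in the abstract can be illustrated with a small pure-Python sketch. It is not the paper's implementation. The function names, the linear two-matrix form of the low-rank predictor, the toy dimensions, and the choice to apply a hard 0/1 mask directly to the SwiGLU intermediate activation are all assumptions made for illustration. Only the bank size (64), the number of retained channels per bank (16), and the resulting 4x sparsity come from the abstract.

```python
import math
import random


def low_rank_logits(x, A, B):
    """Routing logits (x @ A) @ B, with A: d_model x r and B: r x d_ff.

    The purely linear two-factor form is an assumption; the abstract only
    says the predictor is low-rank.
    """
    hidden = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]
    return [sum(hidden[j] * B[j][c] for j in range(len(hidden))) for c in range(len(B[0]))]


def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask that keeps the `keep` highest-logit channels per bank."""
    if len(logits) % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # Rank the channel indices of this bank by logit, highest first.
        top = sorted(bank, key=lambda i: logits[i], reverse=True)[:keep]
        for i in top:
            mask[i] = 1
    return mask


def silu(v):
    return v / (1.0 + math.exp(-v))


def sparse_swiglu_ffn(x, W_gate, W_up, W_down, mask):
    """SwiGLU FFN that computes only the channels selected by `mask`.

    Applying a hard mask to the intermediate activation is an assumption.
    """
    out = [0.0] * len(W_down[0])
    for c, kept in enumerate(mask):
        if not kept:
            continue  # masked channel: gate, up and down projections are skipped
        g = sum(x[i] * W_gate[i][c] for i in range(len(x)))
        u = sum(x[i] * W_up[i][c] for i in range(len(x)))
        h = silu(g) * u
        for o in range(len(out)):
            out[o] += h * W_down[c][o]
    return out


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (hypothetical): d_model=8, predictor rank r=4, d_ff=128 (two banks).
rng = random.Random(0)
d_model, rank, d_ff = 8, 4, 128
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
A, B = rand_matrix(rng, d_model, rank), rand_matrix(rng, rank, d_ff)
W_gate, W_up = rand_matrix(rng, d_model, d_ff), rand_matrix(rng, d_model, d_ff)
W_down = rand_matrix(rng, d_ff, d_model)

mask = bankwise_topk_mask(low_rank_logits(x, A, B))
y = sparse_swiglu_ffn(x, W_gate, W_up, W_down, mask)
print(f"kept {sum(mask)} of {d_ff} channels; output dim {len(y)}")
```

Because selection happens independently inside each 64-channel bank, every bank contributes exactly 16 active channels, so the active fraction is fixed at 16/64 = 1/4 for every token and layer. A global top-k over all channels would give the same average sparsity but not this per-bank balance.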
The accelerating demand for larger, more capable LLMs is driving research into methods for making them more efficient and accessible, particularly as hardware scales and inference costs become critical.
The paper proposes a technique for creating sparser, more efficient LLMs from dense checkpoints, which could reduce the compute and energy needed to deploy and run them.
The ability to 'upcycle' dense LLMs into channel-sparse models through continual training at 32K context could lower the barrier to deploying high-performance language models.
- AI developers
- Cloud providers
- Edge AI companies
- LLM-dependent industries
- Providers of inefficient inference solutions
- Energy-intensive data centers
More powerful and efficient LLMs become accessible for a wider range of applications and organizations.
Reduced operational costs for AI accelerate the adoption of advanced AI capabilities across various sectors.
This efficiency gain could contribute to a broader democratization of AI, fostering innovation beyond current resource-constrained environments.
Read at arXiv cs.CL