arXiv:2603.15389v2 Announce Type: replace Abstract: Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we provide evidence that sparsity-like mechanisms can dampen variance propagation and are associated with improved depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, wh
Source: arXiv cs.CL — read the full report at the original publisher.
