How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

arXiv:2606.19379v1 (cross-listed). Abstract: Transformer feed-forward networks (FFNs) are often treated as nonlinear stores of computation, yet how nonlinear a trained FFN block actually is has rarely been measured. We treat each FFN as a position-wise input-to-output map and split it into the exact least-squares linear approximation plus a residual. The held-out variance the closed-form linear map explains defines a block's linear recoverability (R^2_lin), an optimiser-free measure of its linearity. Across all twelve blocks of GPT-2, Pythia-160m, and llama-160m, R^2_lin is highly heterog
This paper investigates a fundamental property of transformer architectures: how linear each trained feed-forward block actually is. It addresses a gap left by the common assumption that FFNs are strongly nonlinear, an assumption that has rarely been measured.
Understanding the linear recoverability of FFN blocks could lead to more efficient transformer designs and better interpretability, and could reduce computational overhead for critical AI applications.
This research provides a quantitative metric (R^2_lin) for the linearity of FFN blocks, shifting the focus from general assumptions of nonlinearity to characteristics that can be measured empirically and that are learned during training.
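The recipe in the abstract (fit the closed-form least-squares linear map from FFN input to FFN output, then score it by the variance it explains on held-out positions) can be sketched as follows. This is a minimal illustration rather than the paper's implementation. It assumes an affine fit with a bias term, a single R^2 pooled over all output dimensions, and a randomly initialised toy FFN in place of a trained model. The names `toy_ffn`, `add_bias`, and `linear_recoverability` are hypothetical.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def toy_ffn(X, W1, b1, W2, b2):
    # Hypothetical stand-in for an FFN block: each row of X is one token
    # position, and the same two-layer map is applied to every row.
    return gelu(X @ W1 + b1) @ W2 + b2


def add_bias(A):
    # Append a column of ones so the least-squares fit includes a bias
    # (an assumption; the source only says "linear approximation").
    return np.hstack([A, np.ones((A.shape[0], 1))])


def linear_recoverability(X, Y, train_frac=0.8, seed=0):
    # Split positions into a fitting set and a held-out set.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    n_train = int(train_frac * X.shape[0])
    train, held = perm[:n_train], perm[n_train:]

    # Closed-form least-squares fit; no gradient-based optimiser involved.
    coef, *_ = np.linalg.lstsq(add_bias(X[train]), Y[train], rcond=None)

    # Held-out R^2, pooled over all output dimensions (an assumption).
    residual = Y[held] - add_bias(X[held]) @ coef
    ss_res = np.sum(residual**2)
    ss_tot = np.sum((Y[held] - Y[held].mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
d_model, d_hidden, n_tokens = 16, 64, 4000
X = rng.normal(size=(n_tokens, d_model))
W1 = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b2 = np.zeros(d_model)

# Score the nonlinear toy block, and a purely affine map as a sanity check.
r2_ffn = linear_recoverability(X, toy_ffn(X, W1, b1, W2, b2))
r2_linear = linear_recoverability(X, X @ W1[:, :d_model] + 1.0)
print(f"toy FFN R^2_lin: {r2_ffn:.3f}")
print(f"purely affine map R^2_lin: {r2_linear:.3f}")
```

For a real model, `X` and `Y` would presumably be activations captured at the input and output of one FFN block over many token positions. How the paper collects them is not stated in this brief.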
- AI researchers
- ML model developers
- Organizations deploying large language models
- Inefficient transformer architectures
The linearity measure helps identify which FFN blocks of a transformer perform substantially nonlinear computation and which are well approximated by a linear map.
This understanding could inform the design of more compact and specialized neural network layers for specific tasks, potentially reducing the training and inference costs of large models.
Improved efficiency and interpretability of transformer models could accelerate the development and deployment of advanced AI agents, impacting various industries reliant on complex AI.
Read at arXiv cs.CL