When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models

arXiv:2601.00065v3. Abstract: Tokenizer transplant in cross-vocabulary model composition reconstructs donor-only embedding rows as weighted combinations over shared lexical anchors and reuses those coefficients on the base. We identify a structural geometric property of this reconstruction: the same coefficient vector reaches different sets in the donor and base anchor spans, an asymmetric realizability gap. Across 65 donor-base pairs under OMP, with cross-operator validation on CLP, WECHSEL, and FOCUS, we construct breaker tokens: single coefficient vecto
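The reconstruct-and-reuse step the abstract describes can be illustrated with a minimal sketch. It assumes a simplified greedy orthogonal matching pursuit (OMP) written in NumPy, random stand-in embeddings, and hypothetical names (`omp_coefficients`, `donor_anchors`, `base_anchors`, `donor_only_row`); none of these come from the paper, and its actual pipeline may differ.

```python
import numpy as np

def omp_coefficients(anchors, target, k):
    """Simplified greedy OMP: select up to k anchor rows, refitting by least squares.

    anchors: (n_anchors, dim) matrix of shared-anchor embeddings (hypothetical data).
    target:  (dim,) embedding row to reconstruct.
    Returns a length-n_anchors coefficient vector with at most k non-zeros.
    """
    residual = target.copy()
    support = []
    weights = np.zeros(0)
    for _ in range(k):
        # Score each anchor by its correlation with the current residual.
        scores = np.abs(anchors @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same anchor twice
        support.append(int(np.argmax(scores)))
        # Refit all selected anchors jointly against the target.
        weights, *_ = np.linalg.lstsq(anchors[support].T, target, rcond=None)
        residual = target - anchors[support].T @ weights
    coeffs = np.zeros(anchors.shape[0])
    coeffs[support] = weights
    return coeffs

# Random stand-ins for real embedding tables (assumption: illustrative only).
rng = np.random.default_rng(0)
n_anchors, d_donor, d_base, k = 50, 16, 24, 4
donor_anchors = rng.normal(size=(n_anchors, d_donor))  # shared tokens, donor space
base_anchors = rng.normal(size=(n_anchors, d_base))    # same tokens, base space
donor_only_row = rng.normal(size=d_donor)              # token absent from the base

# Fit coefficients in the donor space, then reuse them unchanged on the base.
coeffs = omp_coefficients(donor_anchors, donor_only_row, k)
donor_recon = coeffs @ donor_anchors
transplanted_row = coeffs @ base_anchors
```

The coefficients are fitted only against the donor anchors, so nothing in this procedure constrains where `transplanted_row` lands within the span of the base anchors.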
This research offers a deeper, though largely theoretical, understanding of a core obstacle to interoperability and transfer learning between large language models, one that matters more as AI systems become more complex and modular.
A strategic reader should care because the finding could affect how models are designed and optimized and how tokenizers are transplanted between them, potentially leading to more efficient or robust cross-model applications.
Understanding asymmetric realizability in tokenizer transplantation may change approaches to model composition and fine-tuning by exposing a hidden geometric property that influences AI system integration.
- AI researchers
- NLP developers
- Large Language Model creators
- Inefficient model transfer methods
- Brute-force tokenizer integration
Improved methods for transferring components between large language models due to a better understanding of underlying geometric properties.
Faster development and deployment of specialized LLMs by enabling more effective reuse of existing tokenizer knowledge.
Potential for new toolchains and frameworks specifically designed to mitigate or leverage asymmetric realizability in modular AI architectures.
Read at arXiv cs.LG