Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

arXiv:2605.22005v1. Abstract: We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model, requiring only five lines of PyTorch and no model inference, reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.
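The decomposition described in the abstract can be sketched roughly as follows. This is a minimal illustration on a toy matrix, not the paper's own code: the function name `top_tokens_per_direction`, the eight-word vocabulary, and the planted values are all hypothetical, and the weight is assumed to have shape (vocab_size, hidden_dim), as the weight of a `torch.nn.Linear(hidden_dim, vocab_size)` output layer would. Ranking tokens by the absolute value of their singular-vector entries is a simplification chosen here, since the sign of a singular vector is arbitrary.

```python
import torch


def top_tokens_per_direction(lm_head_weight, vocab, n_directions=2, top_k=3):
    """Return singular values and the top_k tokens for the leading directions.

    lm_head_weight: tensor of shape (vocab_size, hidden_dim).
    vocab: list of token strings, indexed by vocabulary row.
    """
    # Thin SVD: W = U @ diag(S) @ Vh, with U of shape (vocab_size, r).
    U, S, Vh = torch.linalg.svd(lm_head_weight.float(), full_matrices=False)
    clusters = []
    for d in range(n_directions):
        # Column d of U weights every vocabulary token along direction d.
        # abs() is an assumption of this sketch: singular-vector signs are arbitrary.
        scores = U[:, d].abs()
        idx = torch.topk(scores, top_k).indices.tolist()
        clusters.append([vocab[i] for i in idx])
    return S, clusters


# Hypothetical toy "lm_head": 8 tokens, hidden size 4, two planted clusters.
vocab = ["cat", "dog", "horse", "red", "green", "blue", "the", "of"]
W = torch.zeros(8, 4)
W[0:3, 0] = 5.0   # animal tokens share hidden dimension 0
W[3:6, 1] = 3.0   # colour tokens share hidden dimension 1
W[6:8, 2] = 0.5   # weak function-word direction

S, clusters = top_tokens_per_direction(W, vocab)
for d, tokens in enumerate(clusters):
    print(f"direction {d}: singular value {S[d].item():.3f}, tokens {tokens}")
```

For a real checkpoint, the toy `W` would be replaced by the model's output projection weight, which many PyTorch implementations expose under a name such as `model.lm_head.weight`; the exact attribute is an assumption that depends on the model class. No forward pass is needed, which matches the abstract's "no model inference" claim.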
The rapid advancement and deployment of LLMs necessitate new methods for understanding their internal workings, driven by both ethical concerns and the desire for improved performance.
The ability to easily probe what an LLM has learned, including 'secret dictionaries' or unintended biases, provides transparency that matters for AI development, regulation, and trust.
Because the analysis needs only the model weights and no inference, developers and researchers can quickly identify, and potentially mitigate, problematic data learned by LLMs without extensive computational resources.
- AI developers
- AI ethics researchers
- Regulatory bodies
- LLM users
- LLM developers concealing proprietary training data
- Bad actors exploiting LLM weaknesses
- Black-box AI proponents
Increased transparency and debuggability of large language models with respect to embedded bias and unintended learning.
Faster iteration cycles for LLM training and fine-tuning, leading to more robust and ethical models.
The development of 'red-teaming' tools that automatically flag potentially harmful internal states or learned data within deployed LLMs.
Read at arXiv cs.LG