Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

arXiv:2605.27997v1 (cross-listed). Abstract: Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits, without any gradient descent. Eva
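The abstract names three ingredients: an activation differential between toxic and neutral prompts, inference-time scaling of the localized neurons, and a rank-one weight edit. The sketch below illustrates each idea on toy data in plain Python. It is not the Meow2X or TRNE implementation, whose details are not given in the abstract. Every function name, the top-k selection rule, and the projection form of the rank-one edit are assumptions made for illustration only.

```python
import math

def mean_activation(acts):
    """Column-wise mean over a list of equal-length activation vectors."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]

def localize(toxic_acts, neutral_acts, k):
    """Assumed selection rule: the k neurons with the largest absolute
    gap between mean toxic and mean neutral activation."""
    mu_t = mean_activation(toxic_acts)
    mu_n = mean_activation(neutral_acts)
    diff = [t - n for t, n in zip(mu_t, mu_n)]
    ranked = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    return ranked[:k]

def scale_neurons(activation, neuron_ids, alpha):
    """Inference-time suppression: multiply the selected neurons by alpha
    (alpha=0 silences them, alpha=1 leaves them unchanged)."""
    ids = set(neuron_ids)
    return [a * alpha if i in ids else a for i, a in enumerate(activation)]

def rank_one_edit(W, d):
    """Assumed edit: W' = W - u (u^T W), with u = d / ||d||.
    This rank-one update removes the output component along direction d,
    so u . (W' x) = 0 for every input x. No gradient step is involved."""
    norm = math.sqrt(sum(x * x for x in d))
    u = [x / norm for x in d]
    rows, cols = len(W), len(W[0])
    uTW = [sum(u[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - u[i] * uTW[j] for j in range(cols)] for i in range(rows)]

def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Toy activations (made up): neuron 2 fires strongly on toxic prompts only.
toxic = [[0.1, 0.2, 3.0, 0.1], [0.0, 0.3, 2.8, 0.2]]
neutral = [[0.1, 0.2, 0.1, 0.1], [0.0, 0.3, 0.2, 0.2]]

target = localize(toxic, neutral, k=1)
suppressed = scale_neurons(toxic[0], target, alpha=0.0)
edited = rank_one_edit([[1.0, 2.0], [3.0, 4.0]], d=[3.0, 4.0])
```

In a real model the activations would be collected from a chosen layer (for example with forward hooks), and the scaling factor and edit direction would need to be tuned against a fluency metric, since silencing neurons outright can degrade unrelated outputs.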
As Large Language Models (LLMs) are deployed in more sensitive applications, robust toxicity control has become an urgent need, and that need is pushing research toward mechanistic interpretability.
This research offers a retraining-free, mechanistic approach to controlling LLM behavior: rather than relying on costly retraining or external filtering, it targets the internal layers and neurons where toxicity arises.
LLM developers and deployers will gain refined tools to suppress unwanted behaviors like toxicity without extensive model redevelopment, improving deployment safety and efficiency.
- AI developers
- Model deployers
- AI safety researchers
- Users of LLMs
- Brute-force retraining methods
- Solely output-filtering solutions
More reliable and less biased AI systems become achievable through targeted internal interventions.
Increased trust in AI systems due to verifiable control over adverse behaviors could accelerate adoption in sensitive sectors.
This mechanistic approach could be generalized to control other unwanted AI behaviors beyond toxicity, leading to more steerable and ethical AI.
Read at arXiv cs.LG