Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment

arXiv:2607.01239v1. Abstract: Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs. The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B). An optimization targeting safety-token fragmentation flips the first-token refusal trigger o
The paper identifies an exploitable weakness in LLM alignment that stems from BPE tokenization: character-level perturbations split safety-critical words into sub-word pieces that the surveyed alignment datasets do not cover, which makes current safety mechanisms brittle.
The result suggests that existing alignment strategies are insufficient against such perturbations, and that new methods are needed to stop malicious prompts from bypassing safety training and enabling model misuse.
The finding changes how LLM security should be understood: developers need to reassess tokenization strategies and alignment datasets in order to build more robust safety measures.
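The fragmentation mechanism named in the abstract can be illustrated with a toy example. The sketch below is not the paper's method and does not use any real model's tokenizer: `MERGES` is a hypothetical, hand-written merge table and `bpe_tokenize` is an illustrative helper. It only shows how a BPE-style tokenizer that maps a clean word to one token can split a lightly perturbed, still readable spelling into several pieces.

```python
# Toy illustration only. MERGES is a hypothetical merge table written by hand;
# it is not taken from any real tokenizer or from the paper.
MERGES = [
    ("h", "a"),
    ("z", "a"),
    ("r", "d"),
    ("ha", "za"),
    ("haza", "rd"),
]


def bpe_tokenize(word, merges):
    """Greedily apply BPE merges (lowest rank first) to a single word."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    pieces = list(word)  # start from individual characters
    while len(pieces) > 1:
        # Collect every adjacent pair that has a learned merge.
        candidates = [
            (ranks[(left, right)], index)
            for index, (left, right) in enumerate(zip(pieces, pieces[1:]))
            if (left, right) in ranks
        ]
        if not candidates:
            break  # no known merge applies, so the word stays fragmented
        _, index = min(candidates)  # best-ranked pair, leftmost on ties
        pieces[index:index + 2] = [pieces[index] + pieces[index + 1]]
    return pieces


if __name__ == "__main__":
    for text in ("hazard", "haz-ard", "h4zard"):
        print(text, "->", bpe_tokenize(text, MERGES))
    # hazard -> ['hazard']
    # haz-ard -> ['ha', 'z', '-', 'a', 'rd']
    # h4zard -> ['h', '4', 'za', 'rd']
```

In this toy setting the clean word becomes a single token, while both perturbed spellings become four or five pieces that a model trained only on clean inputs may rarely have seen in that combination. Real BPE vocabularies are far larger, so actual splits will differ.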
- AI security researchers
- Companies developing advanced tokenization methods
- Organisations investing in robust LLM red-teaming
- LLM developers relying solely on existing alignment datasets
- Users and platforms vulnerable to LLM misuse
- Companies with high-stakes LLM deployments
Attackers can more easily bypass LLM safety filters using character-level perturbations.
Closing this gap will likely require significant investment in LLM security and alignment research and development.
The perceived trustworthiness and deployability of current LLM generations in sensitive applications may decrease until this vulnerability is addressed.
Read at arXiv cs.CL