
arXiv:2602.07340v2. Abstract: Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enfo
The increasing deployment of LLMs highlights the urgent need for robust safety alignment methods that can withstand real-world variability and adversarial attacks.
Improving LLM safety and robustness is critical for integrating these models reliably into sensitive applications and for preventing unintended or harmful behaviors.
This work introduces a new perspective on LLM safety, focusing on optimization geometry rather than solely on data, which could lead to more inherently robust models.
- AI developers
- Organizations deploying LLMs
- AI safety researchers
- Ethical AI advocates
- Malicious actors exploiting LLM vulnerabilities
- Organizations with brittle LLM safety pipelines
- Naive LLM alignment strategies
More resilient and trustworthy large language models become available for various applications.
Public trust in AI systems increases, accelerating adoption in critical sectors.
The development of highly robust and self-correcting AI agents becomes more feasible, impacting white-collar work automation.
Read at arXiv cs.LG