LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

arXiv:2605.29888v1. Abstract: Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of the problem of data contamination in RL post-training, which can undermine generalization and the evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models because RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA
The increasing sophistication and widespread use of RL in LLMs necessitate robust methods for ensuring data integrity and preventing contamination, which can undermine model reliability and safety.
Detecting data contamination in RL post-training matters for the trustworthiness, generalizability, and ethical deployment of advanced AI models in high-stakes applications.
The proposed LaRA method, as its name indicates, takes a layer-wise representation analysis approach: it looks at the model's internal representations rather than output-level signals such as likelihood or entropy, which the abstract describes as unreliable for RL-trained models.
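For context, the output-level signals the abstract refers to (likelihood and entropy) can be sketched in a few lines. This is a minimal illustration of those baseline signals only, not the LaRA method, whose details are not given in the text above. The function names and the input layout (one list of logits per position) are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def output_level_signals(step_logits, token_ids):
    """Return (mean token log-likelihood, mean predictive entropy) for one sequence.

    step_logits: one list of logits per position (hypothetical input layout).
    token_ids:   index of the token actually observed at each position.
    """
    log_liks, entropies = [], []
    for logits, tok in zip(step_logits, token_ids):
        logp = log_softmax(logits)
        # Likelihood signal: log-probability assigned to the observed token.
        log_liks.append(logp[tok])
        # Entropy signal: uncertainty of the full next-token distribution.
        entropies.append(-sum(math.exp(lp) * lp for lp in logp))
    n = len(log_liks)
    return sum(log_liks) / n, sum(entropies) / n
```

A detector built on such signals would typically flag samples whose log-likelihood is unusually high or whose entropy is unusually low. The abstract's argument is that these quantities become less informative once a model has been trained with trajectory-level rewards.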
- AI safety researchers
- Developers of robust LLMs
- Sectors reliant on AI reliability (e.g., finance, healthcare)
- Malicious data injectors
- Deployments of unchecked RL-trained models
Improved methods for data contamination detection will foster greater trust and reliability in advanced AI systems.
Enhanced reliability could accelerate the adoption of RL-trained LLMs in sensitive applications, given better guarantees of their integrity.
The ability to audit and ensure data cleanliness could become a competitive advantage, leading to the development of 'certified reliable' AI models.
Read at arXiv cs.LG