EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

arXiv:2605.27846v1 (announce type: new)
Abstract: Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response
This research addresses a limitation of reinforcement learning for Large Reasoning Models: existing approaches use fixed weights for positive and negative samples, and their conclusions do not carry over well to open-ended question answering, an area of increasing focus for AI development.
Improved training methodologies for Large Reasoning Models directly impact the performance and applicability of advanced AI systems, particularly those aiming for agentic capabilities.
The proposed EAPO method weights positive and negative samples in RLVR adaptively rather than with fixed values, which could lead to more robust AI models that generalize better on complex tasks.
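To make the positive/negative distinction concrete, here is a minimal Python sketch of a reward-mean-based split. It assumes that "reward-mean-based" means comparing each sampled response's reward with the mean reward of the responses drawn for the same prompt; the abstract excerpt does not spell this out. The function names and the `pos_weight` / `neg_weight` parameters are hypothetical, and the weights are plain constants here: how EAPO derives them from entropy is not described in the excerpt, so it is not reproduced.

```python
from statistics import mean


def split_by_reward_mean(rewards):
    """Label each response True (positive) if its reward is above the group mean.

    Assumption: the group is the set of responses sampled for one prompt.
    """
    threshold = mean(rewards)
    return [r > threshold for r in rewards]


def weighted_advantages(rewards, pos_weight=1.0, neg_weight=1.0):
    """Scale mean-centered rewards by a separate weight per sample type.

    pos_weight and neg_weight are hypothetical fixed values; an adaptive
    scheme would compute them during training instead of passing constants.
    """
    threshold = mean(rewards)
    scaled = []
    for r in rewards:
        advantage = r - threshold  # positive samples get advantage > 0
        weight = pos_weight if advantage > 0 else neg_weight
        scaled.append(weight * advantage)
    return scaled


if __name__ == "__main__":
    group_rewards = [0.9, 0.4, 0.1, 0.6]  # illustrative rewards, not real data
    print(split_by_reward_mean(group_rewards))
    print(weighted_advantages(group_rewards, neg_weight=0.5))
```

With the illustrative rewards above, the group mean is 0.5, so the first and last responses are treated as positive and the other two as negative; halving `neg_weight` shrinks the contribution of the negative samples only.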
- AI research institutions
- Developers of AI agents
- SaaS companies leveraging advanced AI
- AI models without adaptive RL techniques
- Companies relying on less sophisticated QA systems
AI models, especially large language models, will become more proficient and less prone to errors in open-ended tasks.
This improved proficiency will accelerate the development and deployment of more capable AI agents across various industries.
More reliable AI agents could lead to significant automation of white-collar workflows, transforming labor markets.
Read at arXiv cs.AI