Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

arXiv:2605.29028v1. Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method en
The continuous evolution of AI models and the pursuit of more efficient and reliable autonomous systems are driving innovation in policy learning.
This development could lead to more robust and predictable AI agents, enhancing their capabilities in complex environments and critical applications.
Return-conditioned models would keep the Q-value of their output policy consistent with the return-to-go (RTG) they are conditioned on, so the return an agent is asked for is closer to the return it actually delivers, leading to more consistent and reliable behavior.
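To make the two quantities concrete, here is a minimal sketch of how return-to-go is computed from a reward sequence and how a gap between Q estimates and the conditioning RTG could be measured. It is an illustration only: the abstract does not give the Q-ALIGN DT objective, so the mean squared gap and the names `returns_to_go` and `alignment_gap` are assumptions, not the paper's method.

```python
from typing import List, Sequence


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Return-to-go at step t: the (optionally discounted) sum of rewards from t onward."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the sum already computed for the next step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def alignment_gap(q_values: Sequence[float], target_rtgs: Sequence[float]) -> float:
    """Hypothetical measure: mean squared gap between Q estimates and conditioning RTGs.

    This is an assumed stand-in for "Q-value consistent with the input RTG",
    not the loss used in the paper.
    """
    if len(q_values) != len(target_rtgs):
        raise ValueError("q_values and target_rtgs must have the same length")
    if not q_values:
        return 0.0
    squared = [(q - g) ** 2 for q, g in zip(q_values, target_rtgs)]
    return sum(squared) / len(squared)


rewards = [1.0, 0.0, 2.0]
rtgs = returns_to_go(rewards)             # [3.0, 2.0, 2.0]
gap = alignment_gap([3.0, 1.0, 2.0], rtgs)  # one step is off by 1.0, so gap is 1/3
```

A gap of zero would mean the Q estimate matches the requested return at every step; a larger gap means the model is being conditioned on returns that its policy is not expected to achieve.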
- AI agent developers
- Robotics companies
- Automation sector
- Researchers in reinforcement learning
- Companies with less sophisticated AI models
- Manual labor in some automated sectors
Improved performance and reliability of reinforcement learning agents across various tasks.
Accelerated adoption of AI agents in industries requiring high precision and trustworthiness.
Potential for new ethical considerations as AI agents become more autonomous and better calibrated to their own expected performance.
Read at arXiv cs.LG