arXiv:2603.21558v2 Announce Type: replace Abstract: Self-improvement training, where models learn from self-generated solutions, promises sustained capability gains but suffers from a pervasive failure mode: across multiple rounds, compounding reasoning errors cause accuracy to stall or degrade. We trace this drift to standard filtering criteria that retain solutions based solely on final answer correctness, which lets lucky guesses (correct answers with flawed reasoning) contaminate the training data. We propose Verified Self-Improvement (VSI), a framework that conditions data retention on st
Source: arXiv cs.AI — read the full report at the original publisher.
