arXiv:2601.03525v3
Announce Type: replace

Abstract: Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewards enforce functional correctness but induce sparsity, while external Reward Models (RMs) provide dense supervision at the cost of misalignment and additional overhead. Since code evaluation naturally yields multiple test-case-level outcomes, partial success, i.e., passing a subset of test cases, offers an intrinsic, verifiable source of dense supervision. In this paper, we propose VeRPO (Verifiable Dense
Source: arXiv cs.LG — read the full report at the original publisher.
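
The contrast the abstract draws between a sparse test-suite-level reward and a dense signal from partial success can be sketched as follows. This is only an illustration under stated assumptions, not VeRPO's reward: the abstract is cut off before the method is described, so the function names and the uniform weighting of test cases here are hypothetical.

```python
from typing import Sequence


def suite_level_reward(results: Sequence[bool]) -> float:
    """Sparse outcome reward: 1.0 only if every test case passes."""
    return 1.0 if results and all(results) else 0.0


def pass_fraction_reward(results: Sequence[bool]) -> float:
    """Dense reward from partial success: fraction of test cases passed.

    Assumption: every test case is weighted equally. This is a generic
    pass-rate signal, not the reward defined in the paper.
    """
    if not results:
        return 0.0
    return sum(results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-test outcomes for four candidate programs.
    candidates = {
        "fails_all": [False, False, False, False],
        "passes_half": [True, True, False, False],
        "passes_most": [True, True, True, False],
        "passes_all": [True, True, True, True],
    }
    for name, results in candidates.items():
        print(name, suite_level_reward(results), pass_fraction_reward(results))
```

Under the suite-level reward, the first three candidates are indistinguishable (all score 0.0), whereas the pass-fraction reward separates them, which is the sparsity issue the abstract points to.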
