A research team has developed a novel approach to training large language models on coding tasks that sidesteps one of the field's most persistent limitations: the need for ground-truth solutions. The method, called RiVER (Ranking-induced VERifiable framework), enables models to learn from continuous performance scores rather than binary right-or-wrong answers, opening new possibilities for teaching AI systems skills in domains where perfect solutions are difficult to define or verify.

According to a paper posted to arXiv by Lin, Gao, Kuang, Huang, Zhou, and colleagues, the framework addresses two fundamental problems that arise when training models on score-based feedback. The first, termed "scale dominance," occurs when different test cases produce reward signals of wildly different magnitudes, skewing the model's learning toward whichever problems happen to carry the largest numbers. The second, "frequency dominance," emerges when the many mediocre solutions a model generates collectively outweigh the rare occasions when it produces something genuinely excellent.
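To make scale dominance concrete, here is a minimal Python sketch. The scores, the problem names, and the mean-centered baseline are illustrative assumptions and are not taken from the paper. It measures how much of the total learning signal each problem would contribute if raw scores were used directly.

```python
from statistics import mean


def mean_centered_advantages(scores):
    """Subtract the group mean from each raw score (no scale normalization)."""
    baseline = mean(scores)
    return [s - baseline for s in scores]


# Hypothetical raw scores for four sampled solutions per problem.
# problem_a is scored in the millions, problem_b between 0 and 1.
raw_scores = {
    "problem_a": [1_000_000.0, 1_200_000.0, 900_000.0, 1_100_000.0],
    "problem_b": [0.40, 0.55, 0.35, 0.50],
}

# Total absolute advantage per problem: a rough proxy for how strongly
# each problem would pull on the model's update.
signal = {
    name: sum(abs(a) for a in mean_centered_advantages(scores))
    for name, scores in raw_scores.items()
}

share_a = signal["problem_a"] / sum(signal.values())
print(f"problem_a share of total signal: {share_a:.6f}")
```

Both problems have the same relative spread of solution quality, yet under these assumed numbers almost all of the signal comes from the problem with the larger score scale.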

RiVER tackles these challenges through a calibrated reward system that compares performance within individual problem instances rather than across them. The framework also prioritizes the model's best attempts while still providing meaningful feedback for suboptimal but valid solutions. This balanced approach prevents the learning signal from being dominated by either arbitrary score magnitudes or common failure modes.
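The idea of comparing solutions only within a problem instance can be sketched in Python as follows. This is not the paper's reward formula, which the article does not give. The function name `rank_rewards`, the `best_bonus` parameter, the zero reward for invalid solutions, and the assumption that a higher raw score is better are all hypothetical choices made for illustration.

```python
def rank_rewards(scores, best_bonus=0.5):
    """Map raw scores from ONE problem instance to rank-based rewards.

    scores: list of floats, with None marking an invalid solution.
    Assumes a higher raw score is better (an illustrative assumption).
    """
    # Distinct valid scores in ascending order; tied scores share a rank.
    valid = sorted({s for s in scores if s is not None})
    if not valid:
        return [0.0] * len(scores)

    n = len(valid)
    rank_of = {s: i + 1 for i, s in enumerate(valid)}  # 1 = worst, n = best

    rewards = []
    for s in scores:
        if s is None:
            rewards.append(0.0)  # invalid solutions earn nothing
            continue
        reward = rank_of[s] / n  # valid solutions land in (0, 1]
        if rank_of[s] == n:
            reward += best_bonus  # extra weight on the best attempt
        rewards.append(reward)
    return rewards


# Two hypothetical instances with very different score scales.
large_scale = rank_rewards([1_000_000.0, 1_200_000.0, None, 1_100_000.0])
small_scale = rank_rewards([0.40, 0.55, None, 0.50])
print(large_scale)
print(small_scale)
```

Because only the ordering within each instance matters, the two instances produce identical rewards despite their different score magnitudes. The best attempt is weighted most heavily, while weaker but valid solutions still receive a nonzero reward.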

Performance Gains Across Multiple Benchmarks

The researchers trained their system on 12 problems from the AtCoder Heuristic Contest, a competition where tasks typically have multiple valid solutions with varying quality scores rather than single correct answers. When evaluated on the Algorithm Engineering Benchmark, RiVER improved the Qwen 3 8B model by 8.9 percent and the GLM-Z1 9B model by 9.4 percent in terms of rating rank.

The more notable result is the transfer effect. Despite training exclusively on score-based optimization tasks, with no ground-truth solutions, the models also showed measurable improvements on exact-solution benchmarks such as LiveCodeBench and USACO, with gains averaging 2.4 percent and 3.5 percent, respectively, on these more traditional coding challenges.

This contrasts sharply with baseline approaches that simply use raw execution scores as training signals. Those methods improved performance on the optimization benchmarks but failed to generalize to exact-solution tasks, suggesting that RiVER's calibration approach produces more robust learning.

Implications for AI Training Beyond Code

The implications extend beyond competitive programming. Many real-world domains lack clean ground-truth answers but feature measurable performance metrics. Optimization problems in logistics, design, scientific discovery, and creative tasks often fall into this category. By demonstrating that models can develop generalizable skills from imperfect feedback, this research suggests a pathway for training AI systems on a broader range of practical applications.

The work also hints at a deeper principle about machine learning: that perfect information may not be necessary for developing robust capabilities. The ability to learn from comparative judgments and continuous improvement signals could unlock training methodologies for domains previously considered difficult or impossible to teach through reinforcement learning.

"Score-based optimization tasks, combined with proper reward calibration, can serve as effective training environments for general coding ability without ground-truth solutions."

As large language models move beyond text generation into specialized domains like scientific computing and complex reasoning, the ability to train them on imperfect feedback becomes increasingly valuable. RiVER represents a step toward more flexible and broadly applicable training methods for the next generation of AI systems.