Paper: https://arxiv.org/pdf/2506.00103
Difficulty: Very Hard
Notes: A full solution should include a pipeline to generate pairwise training samples from existing LLMs + rubrics or public sources, an environment to train the GenRM, and an environment to train the main model using the GenRM. Requires some creative decision-making; discuss your proposal with Will before getting too deep into it. Sufficient compute for training experiments will be provided, conditional on implementation progress.
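
Sketch: one possible shape for the pairwise-sample step, assuming each candidate response has already been given a numeric rubric score by some LLM judge. The names ScoredResponse, PairwiseSample, build_pairs and the min_margin threshold are hypothetical and are not taken from the paper; the actual data format should be settled in the proposal.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class ScoredResponse:
    prompt: str
    response: str
    score: float  # rubric score from a judge LLM (assumed to exist upstream)


@dataclass
class PairwiseSample:
    prompt: str
    chosen: str
    rejected: str
    margin: float  # score gap between chosen and rejected


def build_pairs(scored, min_margin=1.0):
    """Turn rubric-scored responses into (chosen, rejected) pairs per prompt.

    Pairs whose score gap is below min_margin are dropped, on the
    assumption that near-ties give a noisy preference label.
    """
    # Group responses by the prompt they answer.
    by_prompt = {}
    for item in scored:
        by_prompt.setdefault(item.prompt, []).append(item)

    pairs = []
    for prompt, items in by_prompt.items():
        # Compare every unordered pair of responses to the same prompt.
        for a, b in combinations(items, 2):
            margin = abs(a.score - b.score)
            if margin < min_margin:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append(
                PairwiseSample(prompt, chosen.response, rejected.response, margin)
            )
    return pairs


if __name__ == "__main__":
    demo = [
        ScoredResponse("p", "good", 9.0),
        ScoredResponse("p", "ok", 6.0),
        ScoredResponse("p", "meh", 5.5),
    ]
    for pair in build_pairs(demo):
        print(pair)
```

Filtering on a score margin is a design choice for this sketch only; sampling pairs from public preference datasets would skip the scoring step entirely.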