Let's Verify Step by Step


TL;DR

Through experiments on the challenging MATH dataset, this paper demonstrates that process supervision significantly outperforms outcome supervision for training reward models, and that the resulting model solves complex multi-step reasoning problems more reliably.

Key Definitions

Although current large language models can generate multi-step reasoning via chain-of-thought prompting and similar methods, they still frequently make logical errors or hallucinate. Training a reward model to distinguish good outputs from bad ones, and using it to guide generation or search, is an effective way to improve reliability.

Previous work (Uesato et al., 2022) compared outcome supervision and process supervision, but found that their final performance was similar on relatively simple math tasks. This left several key questions unanswered: on more complex tasks, which supervision method is better? How can expensive human feedback be used more efficiently?

This paper aims to address these questions by conducting a more detailed, large-scale comparison of the two supervision methods using stronger base models, more feedback data, and the more challenging MATH dataset.

Method

The core of this paper is a comparison of two ways to train reward models: outcome supervision (yielding an outcome-supervised reward model, ORM) and process supervision (yielding a process-supervised reward model, PRM). The evaluation criterion is which reward model better selects a correct solution from N candidate solutions sampled from a generator model (best-of-N selection).
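As a minimal sketch of the evaluation protocol (the function names and the toy scorer below are illustrative, not the paper's code), best-of-N reranking with a reward model looks like:

```python
from typing import Callable, List

def best_of_n(candidates: List[str], reward_model: Callable[[str], float]) -> str:
    """Return the candidate solution the reward model scores highest.

    `reward_model` is a stand-in for any scorer (ORM or PRM): it maps a
    full solution string to a scalar score.
    """
    return max(candidates, key=reward_model)

# Toy illustration with a trivial scorer (solution length):
candidates = ["x = 2", "x = 2 because 2 + 2 = 4", "x = 5"]
best = best_of_n(candidates, len)
```

The benchmark then simply measures how often the selected solution has the correct final answer.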

Method Overview

This study does not involve using reinforcement learning (RL) to optimize the generator model itself, but instead focuses on how to train the most reliable reward model. The experiments are divided into two scales:

  1. Large-scale: Based on fine-tuning GPT-4, with the goal of training the strongest ORM and PRM to push the state of the art.
  2. Small-scale: To enable fairer and more controlled comparison experiments (such as ablation studies), a large-scale PRM is used as a “synthetic supervisor” to provide labels for training smaller models.

Data Collection and PRM800K

To obtain the data required for process supervision, the paper hired human annotators to label, step by step, the correctness of solutions generated by models for MATH problems. The resulting dataset, PRM800K, contains roughly 800K step-level labels over solutions to MATH problems.

Figure 1: Screenshot of the interface used to collect human feedback on each solution step.

Process-Supervised Reward Model (PRM)

The PRM is trained to predict the correctness of each individual step in a solution; a full solution can then be scored by combining the per-step predictions.

Figure 2: Two solutions to the same problem, as scored by the PRM. The left solution is correct, the right one incorrect. Green backgrounds indicate high PRM scores, red indicates low scores. The PRM successfully identifies the erroneous steps in the wrong solution.
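To turn the per-step scores into a single solution score for best-of-N selection, the paper defines the solution score as the probability that every step is correct, i.e. the product of the per-step correctness probabilities. A minimal sketch:

```python
import math
from typing import List

def prm_solution_score(step_probs: List[float]) -> float:
    """Aggregate per-step correctness probabilities into a solution score.

    The score is the product of the step probabilities, i.e. the
    probability (under independence) that every step is correct.
    """
    return math.prod(step_probs)

# One doubtful step drags the whole solution down:
confident = prm_solution_score([0.95, 0.95, 0.95])   # high score
one_bad = prm_solution_score([0.99, 0.98, 0.15])     # heavily penalized
```

This aggregation makes a single flagged step sufficient to reject a solution, which is exactly the behavior shown in Figure 2.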

Outcome-Supervised Reward Model (ORM)

The ORM receives feedback only on the final result: it is trained to predict whether a solution's final answer is correct, with no signal about where an erroneous solution went wrong.
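The outcome-supervision training signal can be sketched as follows (the string comparison here is a simplification; the paper checks final answers with an automatic grader):

```python
def orm_label(predicted_answer: str, reference_answer: str) -> int:
    """Outcome supervision label: 1 iff the final answer matches.

    This is the only signal the ORM sees per solution -- nothing about
    which intermediate step, if any, was wrong.
    """
    return int(predicted_answer.strip() == reference_answer.strip())
```

Because a solution can reach the right answer through flawed reasoning, these labels are noisier than step-level labels, which is one reason the paper expects process supervision to help.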

Experimental Conclusions

Large-Scale Experimental Comparison

In the large-scale experiments, the PRM was trained on the PRM800K dataset, while the ORM was trained on an order-of-magnitude larger set of uniformly sampled solutions. Although the training sets were not perfectly matched, each represents best practice for its respective supervision method.

Figure 3: Best-of-N performance of the different reward models. The PRM (blue) significantly outperforms both the ORM (green) and majority voting (red).
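The majority-voting baseline in Figure 3 (also known as self-consistency) uses no reward model at all; it simply returns the most common final answer among the N samples. A minimal sketch:

```python
from collections import Counter
from typing import List

def majority_vote(final_answers: List[str]) -> str:
    """Self-consistency baseline: pick the most frequent final answer
    among the N sampled solutions, ignoring the reasoning entirely."""
    return Counter(final_answers).most_common(1)[0][0]

# E.g. with samples answering ["4", "5", "4"], the vote picks "4".
```

Majority voting is a strong baseline on math problems because independent samples rarely agree on the same wrong answer, but Figure 3 shows a well-trained PRM still beats it.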

Small-Scale Synthetic Supervision Experiments

To conduct stricter controlled experiments, the paper uses the trained large-scale PRM (denoted \(\text{PRM}_{\text{large}}\)) as an annotator to simulate human feedback.
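A sketch of how such synthetic process supervision might work (the scorer interface, the 0.5 threshold, and stopping at the first error are assumptions made for illustration; the paper's labeling protocol also stops at the first incorrect step, mirroring human annotation):

```python
from typing import Callable, List

def synthetic_process_labels(
    steps: List[str],
    step_scorer: Callable[[List[str], int], float],
    threshold: float = 0.5,
) -> List[int]:
    """Simulate a human annotator with a large reward model.

    `step_scorer(steps, i)` is a hypothetical stand-in for the large
    PRM scoring step i in context. A step scoring at or above
    `threshold` is labeled correct (1); labeling stops at the first
    step labeled incorrect (0).
    """
    labels: List[int] = []
    for i in range(len(steps)):
        if step_scorer(steps, i) >= threshold:
            labels.append(1)
        else:
            labels.append(0)
            break  # stop at the first incorrect step
    return labels
```

This lets the authors vary the amount and form of supervision cheaply, enabling the controlled data-scaling comparisons in Figure 4.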

Figure 4: Comparison of supervision forms at small scale. (a) Performance as the amount of training data grows: process supervision (blue) has a clear advantage, and active learning (purple dashed line) further improves data efficiency. (b) Best-of-N performance of each method as N varies.

Out-of-Distribution Generalization (OOD)

The reward models were also evaluated on fresh, held-out STEM competition problems the models had never seen (AP Calculus, AP Chemistry, AP Physics, AMC10/12, etc.), with results shown in the table below.

Domain          ORM     PRM     Majority Voting   # Problems
AP Calculus     68.9%   86.7%   80.0%             45
AP Chemistry    68.9%   80.0%   71.7%             60
AP Physics      77.8%   86.7%   82.2%             45
AMC10/12        49.1%   53.2%   32.8%             84
Total           63.8%   72.9%   61.3%             234

Key Conclusions

  1. Process supervision is better: by pinpointing exactly where a solution goes wrong, process supervision greatly simplifies the model's credit assignment problem and trains a more reliable reward model than outcome supervision.
  2. Negative "alignment tax": process supervision is not only more effective but also inherently safer and more interpretable, because it directly rewards reasoning processes that humans endorse rather than just a final answer. Adopting the safer alignment method thus brings a performance gain, which the authors call a "negative alignment tax."
  3. Active learning works: active learning significantly improves data-labeling efficiency and is a key technique for reducing the cost of applying process supervision.
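The paper's active-learning strategy prioritizes "convincing wrong-answer" solutions, i.e. those the current reward model scores highly even though their final answer is wrong, since labeling these is most informative. A sketch under that description (the function name and tuple layout are illustrative):

```python
from typing import List, Tuple

def select_for_labeling(
    samples: List[Tuple[str, float, bool]], budget: int
) -> List[str]:
    """Pick the most informative solutions to send to annotators.

    Each sample is (solution_text, reward_model_score, answer_is_correct).
    We keep only wrong-answer solutions and, within those, prefer the
    ones the reward model is most (over)confident about.
    """
    wrong = [(text, score) for text, score, ok in samples if not ok]
    wrong.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in wrong[:budget]]
```

Intuitively, a highly scored wrong solution exposes exactly the failure mode the reward model has not yet learned, so each such label corrects more error than a label on a randomly chosen solution.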