Outcome-based Exploration for LLM Reasoning


TL;DR

This paper proposes "outcome-based exploration," which adds exploration rewards during reinforcement learning based on the final answer rather than the entire reasoning process. The method improves the accuracy of large language models on reasoning tasks while mitigating the loss of generation diversity caused by standard RL training.

Background

Currently, post-training large language models (LLMs) with reinforcement learning (RL) is the mainstream approach for improving their reasoning ability. Outcome-based reinforcement learning, which gives rewards only according to the correctness of the final answer, has been shown to significantly improve model accuracy.

However, this approach has a serious bottleneck: a systematic loss of diversity. After RL training, the diversity of the answers the model generates drops sharply, as reflected in the \(pass@k\) metric: at large \(k\), the RL-trained model can even perform worse than the base model. This collapse in diversity hurts scalability in real applications, because test-time performance gains from repeated sampling or tree search depend on generation diversity.

The core problem this paper aims to solve is: how to improve LLM reasoning accuracy through reinforcement learning while avoiding or mitigating the severe decline in generation diversity, thereby achieving a better balance between accuracy and diversity.

Method

The core innovation of this paper is the proposal of an “Outcome-based Exploration” framework, which shifts the focus of exploration from the intractable space of reasoning paths to the manageable space of final answers.

RL as a Sampling Process: Perspective and Motivation

The paper first views the RL training process as a sampling process over the training set and compares it with direct sampling from the base model. Two key phenomena are observed experimentally, which motivate the proposed method:

  1. Transfer of Diversity Degradation: RL reinforces correct answers on already-solved problems, causing the probability distribution to collapse. This reduction in diversity generalizes to unsolved problems, lowering the model’s ability to explore new answers on those problems as well. As shown in the figure below, RL discovers fewer new answers on unsolved problems (dashed lines) than base-model sampling does.
  2. Tractability of the Outcome Space: In tasks such as mathematical reasoning, although the reasoning process can vary widely, the set of final answers is limited (usually fewer than 50). This makes answer-based counting and exploration possible.


Figure 2: Comparison of RL training dynamics and base-model sampling. Top: number of solved problems; bottom: number of distinct answers discovered. Solid lines represent all problems, dashed lines represent unsolved problems.
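The tractability point can be illustrated with a small sketch: many distinct reasoning traces collapse to a handful of final answers, so per-problem answer statistics stay cheap. The `extract_answer` parser below is a hypothetical stand-in for whatever answer extraction a real training pipeline uses.

```python
from collections import Counter

# Illustrative samples: three different reasoning traces for one problem.
samples = [
    "Factor the quadratic ... so the answer is 12",
    "Complete the square ... the answer is 12",
    "Guess and check ... the answer is 15",
]

def extract_answer(text: str) -> str:
    # Toy extraction: take the final whitespace-separated token.
    return text.rsplit(" ", 1)[-1]

# Three distinct traces collapse to just two outcomes, so outcome-level
# counting stays small even when the space of traces is huge.
outcome_counts = Counter(extract_answer(s) for s in samples)
print(outcome_counts)  # Counter({'12': 2, '15': 1})
```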

Historical Exploration

To address the decline in diversity, the paper first introduces a history-count-based exploration method, similar to the classic UCB algorithm. An exploration reward is added to the RL objective:

\[\widehat{\operatorname{\mathbb{E}}}_{x,\{y_{i},a_{i}\}_{i=1}^{n}\sim\pi(\cdot\mid x)}\left[\frac{1}{n}\sum_{i=1}^{n}\widehat{A}\left(x,\{y_{i},a_{i}\}_{i=1}^{n}\right)_{i}+c\,b_{\mathsf{ucb}}(x,a_{i})-\beta\,\widehat{\mathrm{KL}}\left(\pi(\cdot\mid x),\pi_{\mathsf{base}}(\cdot\mid x)\right)\right],\]

where the exploration reward is $b_{\mathsf{ucb}}(x,a)=\min\left\{1,\sqrt{\frac{1}{N(x,a)}}\right\}$, $c$ is a bonus coefficient, and $N(x,a)$ is the number of times answer $a$ has historically appeared for problem $x$.
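A minimal sketch of the count-based bonus, assuming a simple per-problem answer counter maintained across training (the names here are illustrative, not from the paper's code):

```python
import math
from collections import defaultdict

# Running count N(x, a): how often answer `a` has been produced for problem `x`.
counts = defaultdict(int)

def ucb_bonus(x: str, a: str) -> float:
    """b_ucb(x, a) = min(1, sqrt(1 / N(x, a))); unseen answers get the max bonus."""
    n = counts[(x, a)]
    return 1.0 if n == 0 else min(1.0, math.sqrt(1.0 / n))

def record(x: str, a: str) -> None:
    counts[(x, a)] += 1

# The bonus decays as the same answer keeps reappearing for a problem.
for _ in range(4):
    record("problem-1", "42")
print(ucb_bonus("problem-1", "42"))  # sqrt(1/4) = 0.5
print(ucb_bonus("problem-1", "7"))   # never seen -> 1.0
```

In the objective above, this bonus is scaled by the coefficient \(c\) and added to each sampled response's advantage.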

Figure 3: Training performance comparison between different UCB variants and the GRPO baseline.

Figure 4: Test performance comparison between different UCB variants and the GRPO baseline.

Batch Exploration

Historical exploration aims to converge to the optimal solution (optimizing \(pass@1\)), but it does not necessarily preserve generation diversity at test time. To optimize test-time diversity directly (\(pass@k\) at large \(k\)), the paper proposes batch exploration, which replaces the exploration reward with:

\[b_{\mathsf{batch}}\left(x,\{y_{i},a_{i}\}_{i=1}^{n}\right)_{i}=-\frac{1}{n}\sum_{j\neq i}\mathbf{1}\{a_{i}=a_{j}\}\]

This reward directly penalizes answers that repeatedly appear within the current batch, thereby incentivizing the model to generate more diverse answers for the same problem.
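A sketch of the batch bonus for one problem's group of \(n\) sampled answers, transcribing the formula directly (this is not the paper's implementation):

```python
def batch_bonus(answers):
    """b_batch_i = -(1/n) * sum over j != i of 1{a_i == a_j}."""
    n = len(answers)
    return [
        -sum(a_i == a_j for j, a_j in enumerate(answers) if j != i) / n
        for i, a_i in enumerate(answers)
    ]

# Duplicated answers inside the batch are penalized; unique ones are not.
bonuses = batch_bonus(["5", "5", "7", "5"])
# each "5" matches two others, so its bonus is -2/4; the unique "7" gets 0
```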

Figure 5: Comparison of the training performance of Batch and UCB-Con.

Theoretical Analysis: Outcome-Based Bandits

To provide theoretical grounding for outcome-based exploration, the paper introduces a new model called the outcome-based bandit. It abstracts LLM inference as follows: there are $K$ arms (representing reasoning paths) but only $m$ outcomes (representing final answers), where $m \ll K$.
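A toy instance of this abstraction, with made-up numbers (the arm-to-outcome mapping and reward below are illustrative, not the paper's construction). The point is that exploration statistics kept at the outcome level need only \(m\) bins, however large \(K\) is:

```python
import random

random.seed(0)

# K arms (reasoning paths) collapse onto m outcomes (final answers), m << K.
K, m = 100, 5
outcome_of = [random.randrange(m) for _ in range(K)]    # arm -> outcome
reward_of = [1.0 if o == 0 else 0.0 for o in range(m)]  # outcome 0 is "correct"

def pull(arm: int) -> float:
    """Reward depends only on the arm's outcome, not on the arm itself."""
    return reward_of[outcome_of[arm]]

# Outcome-level exploration: tracking which outcomes have been seen takes at
# most m counters, regardless of how many arms were pulled.
seen_outcomes = {outcome_of[arm] for arm in range(K)}
print(len(seen_outcomes))  # at most m
```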

Experimental Results

The paper conducts extensive experiments on the Llama and Qwen model families, using mathematical reasoning datasets such as MATH and DAPO.

Core Experimental Comparison

Figure 1: Overview of the final experimental results. The exploration methods (UCB-Con and Batch) outperform the GRPO baseline across the board on the \(pass@k\) metric.

Figure 6: Comparison of the test performance of Batch and UCB-Con. Batch shows an advantage on \(pass@k\) at large \(k\) in the later stages of training.

Additional Analysis


| Method | Correct generation | Incorrect generation | All |
|---|---|---|---|
| GRPO | 0.080 (0.01) | 0.096 (0.04) | 0.095 (0.02) |
| UCB-Con | 0.084 (0.01) | 0.103 (0.03) | 0.100 (0.02) |
| Batch | 0.086 (0.01) | 0.153 (0.07) | 0.125 (0.03) |

Table 1: Comparison of the entropy of generated content across different methods.



| Method | Solved problems | Unsolved problems | All |
|---|---|---|---|
| GRPO | 2.279 (0.018) | 4.805 (0.075) | 2.883 (0.024) |
| UCB-Con | 2.272 (0.020) | 4.855 (0.084) | 2.926 (0.035) |
| Batch | 2.284 (0.057) | 5.390 (0.102) | 3.230 (0.062) |

Table 2: Comparison of the number of distinct answers generated within each batch.

Summary

This paper confirms that outcome-based exploration is an effective way to counter the loss of diversity during RL training. Historical exploration, especially UCB-Con, significantly improves overall reasoning accuracy, while batch exploration (Batch) maximizes test-time generation diversity while maintaining accuracy. The two methods are complementary and point to a practical direction for training LLM reasoning agents that are both accurate and diverse.