Online Process Reward Learning for Agentic Reinforcement Learning


TL;DR

This paper proposes Online Process Reward Learning (OPRL), a credit-assignment strategy for agentic reinforcement learning. By alternately optimizing a process reward model and the agent policy online, OPRL seamlessly converts trajectory-level preferences into dense step-level rewards, enabling efficient and stable training of long-horizon large language model (LLM) agents without extra data or step-level labels.

Key Definitions

The core of this paper is an implicit step reward learned online. It is defined as

\[r_{\phi}(s_t,a_t)=\beta\log\frac{\pi_{\phi}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\]

where $\pi_{\phi}$ is the currently updated PRM, and $\pi_{\theta_{\text{old}}}$ is the snapshot of the policy model from the previous round. This reward measures how much the current action improves over the old policy from the PRM’s perspective, thereby providing dense guidance signals for policy learning.
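
In code, the step reward is just a scaled log-probability ratio. A minimal sketch in plain Python (lists stand in for tensors of per-action log-probs; the names are illustrative, not the paper's API, and `beta` is the same coefficient as in the DPO-style objective):

```python
def implicit_step_rewards(logp_prm, logp_old, beta=0.05):
    """r_phi(a_t) = beta * (log pi_phi(a_t|s_t) - log pi_theta_old(a_t|s_t))."""
    return [beta * (p - q) for p, q in zip(logp_prm, logp_old)]

# Toy trajectory of three actions: the PRM agrees with the old policy on the
# first action, prefers the second action more, and the third one less.
print(implicit_step_rewards([-1.0, -0.5, -2.0], [-1.0, -1.5, -1.0], beta=0.1))
# -> [0.0, 0.1, -0.1]
```

A positive reward marks a step the PRM considers better than what the old policy would typically produce; a negative one marks a likely misstep.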

At present, training large language model (LLM) agents in dynamic, interactive environments faces major challenges, with the main bottlenecks including:

  1. Sparse rewards and credit assignment: Environment rewards are usually only given at the end of a task, making it difficult to determine the contribution of intermediate steps, i.e., the temporal credit assignment problem.
  2. High-variance learning: Agent trajectories are long and complex, and assigning rewards at the token level introduces substantial noise, leading to high variance in policy learning and unstable training.
  3. Complexity of open environments: In open-ended environments such as dialogue, the state space is vast with little overlap between episodes, and reward signals are often hard to verify, causing many traditional RL methods to fail.

Existing process supervision methods each have their own limitations.

This paper aims to address the above problems by proposing a general, label-free, efficient, and stable credit assignment strategy that can adapt to long-horizon agent tasks with sparse, delayed, or even unverifiable rewards.

Method

The proposed Online Process Reward Learning (OPRL) framework learns a process reward model (PRM) online, converting sparse trajectory-level outcome preferences into dense step-level reward signals to guide fine-grained policy updates.

Figure illustration

The figure above shows the overall training flow of OPRL: the agent interacts with the environment to generate trajectories, and an outcome reward model (ORM) evaluates the entire trajectory and provides an outcome reward. These trajectories with outcome labels are used to update the PRM, which then generates implicit process rewards for each step in the trajectory. Finally, the agent policy is updated using both the outcome reward and the implicit step rewards.

Core Procedure

The training process of OPRL is a self-improving loop in which the policy model $\pi_{\theta}$ and the process reward model $\pi_{\phi}$ are alternately optimized:

  1. Data sampling: Use the current policy $\pi_{\theta}$ to interact with the environment and generate a batch of trajectories.
  2. PRM optimization: Based on the trajectories’ outcome rewards (provided by a verifier or ORM), construct preference pairs (e.g., a “successful” trajectory $\tau^{+}$ vs. a “failed” trajectory $\tau^{-}$). Then, update the PRM $\pi_{\phi}$ using a DPO-like objective:

    \[\mathcal{J}_{\text{PRM}}(\phi)=-\mathbb{E}_{(\tau^{+},\tau^{-})\sim\pi_{\theta_{\text{old}}}}\left[\log\sigma\left(\beta\log\frac{\pi_{\phi}(\tau^{+} \mid x)}{\pi_{\theta_{\text{old}}}(\tau^{+} \mid x)}-\beta\log\frac{\pi_{\phi}(\tau^{-} \mid x)}{\pi_{\theta_{\text{old}}}(\tau^{-} \mid x)}\right)\right]\]

This process teaches the PRM to prefer trajectories that lead to better outcomes.
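
Numerically, each preference pair contributes one logistic loss term on a margin of log-probability ratios. A minimal sketch (the sequence log-probabilities $\log\pi(\tau \mid x)$, i.e. sums of per-token log-probs, are assumed precomputed; names are illustrative):

```python
import math

def prm_dpo_loss(logp_phi_pos, logp_old_pos, logp_phi_neg, logp_old_neg, beta=0.05):
    """-log sigmoid( beta * [(log pi_phi(tau+) - log pi_old(tau+))
                            - (log pi_phi(tau-) - log pi_old(tau-))] )."""
    margin = beta * ((logp_phi_pos - logp_old_pos) - (logp_phi_neg - logp_old_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Before any PRM update (pi_phi == pi_theta_old) the margin is 0, so the loss is log 2.
print(round(prm_dpo_loss(-10.0, -10.0, -12.0, -12.0), 4))  # -> 0.6931
# Once the PRM shifts mass toward tau+ relative to the old policy, the loss falls.
print(prm_dpo_loss(-8.0, -10.0, -13.0, -12.0, beta=1.0) < math.log(2))  # -> True
```

Minimizing this loss over many pairs pushes the PRM's per-step log-ratios (the implicit rewards) to be higher along successful trajectories than along failed ones.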

  3. Policy optimization: Use the updated PRM to compute the implicit step reward $r_{\phi}$ for each action, then combine two types of advantage functions to update the policy $\pi_{\theta}$:
    • Episode-level Advantage $A^{E}$: computed from the final outcome reward $r_{o}(\tau)$, reflecting the global performance of the entire trajectory:

      \[A^{E}(\tau_{i})=\frac{r_{o}(\tau_{i})-\mathrm{mean}(R_{o})}{\mathrm{std}(R_{o})}\]

    • Step-level Advantage $A^{S}$: computed from the implicit step rewards $r_{\phi}$, reflecting the local contribution of each individual action.

Finally, the policy is updated using the surrogate objective of standard RL algorithms such as PPO.
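
The episode-level advantage is just batch normalization of the outcome rewards, and each action's final advantage combines it with the step-level term. A minimal sketch (the unweighted sum is an assumption for illustration; the paper may weight or otherwise combine the two terms):

```python
from statistics import mean, pstdev

def episode_advantages(outcome_rewards):
    """A^E(tau_i) = (r_o(tau_i) - mean(R_o)) / std(R_o), normalized over the batch."""
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards) or 1e-8  # guard against a zero-variance batch
    return [(r - mu) / sigma for r in outcome_rewards]

def combined_advantage(a_episode, a_step):
    """Final per-action advantage: episode-level term plus step-level term
    (assumed unweighted here)."""
    return a_episode + a_step

# A batch of two trajectories: one success (r_o = 1) and one failure (r_o = 0).
print(episode_advantages([1.0, 0.0]))  # -> [1.0, -1.0]
```

Every action in a trajectory shares the same $A^{E}(\tau)$, while $A^{S}$ differentiates actions within the trajectory.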

Figure illustration

As shown above, when OPRL updates the policy, the final advantage function is a combination of the episode-level advantage $A^{E}(\tau)$ and the step-level advantage $A^{S}(a)$.

Innovations

  1. Label-free fine-grained credit assignment: OPRL cleverly converts sparse, trajectory-level outcome preferences into dense, step-level reward signals through a DPO-style objective, without requiring expensive and biased manual step labels.
  2. Low variance and training stability: By computing rewards at the step (turn) level rather than the token level, OPRL effectively controls reward granularity and avoids the high-variance problems caused by overly fine-grained signals. Theoretical analysis shows that the learned implicit step reward is a form of potential-based reward shaping, which preserves the optimal policy and yields bounded gradients, thereby stabilizing multi-turn RL training.
  3. Generality and scalability: The method relies only on trajectory-level preferences, which can come from rule-based verifiers (such as task success signals) or from ORMs such as LLM judges when outcomes are unverifiable, making it applicable across a wide range of environments, including open-ended dialogue. Moreover, OPRL can be plugged into mainstream online RL algorithms such as PPO, GRPO, and RLOO.

Theoretical Analysis

This paper also provides theoretical support for the effectiveness and stability of OPRL.
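
The stability argument builds on the classic potential-based reward shaping result (Ng et al., 1999). As a sketch of that standard identity (the paper's specific choice of potential $\Phi$ is not reproduced here): for any potential function $\Phi$ over states, the shaped reward

\[\tilde{r}(s_t,a_t)=r(s_t,a_t)+\gamma\,\Phi(s_{t+1})-\Phi(s_t)\]

leaves the set of optimal policies unchanged, because the extra terms telescope over a trajectory and the return shifts only by a quantity depending on the initial (and terminal) state. Showing that the implicit step reward $r_{\phi}$ can be written in this form is what guarantees that OPRL's dense rewards do not alter the optimum of the original task.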

Experimental Conclusions

Experiments were conducted on three challenging agent benchmarks: WebShop (web shopping), VisualSokoban (visual Sokoban), and SOTOPIA (open-ended social interaction).

Main Performance


| Method | WebShop (Qwen2.5-7B) Success Rate | WebShop (Qwen2.5-7B) Score | VisualSokoban (Qwen2.5-VL-7B) Success Rate |
| --- | --- | --- | --- |
| GPT-5 | 37.5 | 66.1 | 16.6 |
| Gemini-2.5-Pro | 30.5 | 38.4 | 16.0 |
| Base Model (ReAct) | 21.5 | 47.3 | 14.1 |
| + RLOO | 77.4 ± 1.1 | 87.6 ± 4.7 | 86.3 ± 0.6 |
| + PRIME | 81.5 ± 1.8 | 91.3 ± 0.6 | - |
| + GiGPO | 84.1 ± 3.9 | 91.2 ± 1.5 | 85.9 ± 2.6 |
| OPRL (this paper) | 86.5 ± 2.8 | 93.6 ± 1.0 | 91.7 ± 1.2 |



| Model / Method | Self-Chat Goal (Hard) | Self-Chat Goal (All) | vs. GPT-4o Goal (Hard) | vs. GPT-4o Goal (All) |
| --- | --- | --- | --- | --- |
| Qwen2.5-7B + GRPO | 6.97 | 8.31 | 6.42 | 7.84 |
| Qwen2.5-7B + OPRL (this paper) | 7.11 | 8.42 | 6.76 | 8.36 |
| Llama3.1-8B + GRPO | 7.92 | 9.12 | 6.68 | 8.14 |
| Llama3.1-8B + OPRL (this paper) | 8.06 | 9.20 | 7.16 | 8.45 |


Figure illustration

Sample Efficiency and Training Stability

Figure illustration

Exploration Efficiency Analysis

Figure illustration

Ablation Study

The ablation study validates the key design choices of OPRL:


| Method / Ablation | WebShop Success Rate | WebShop Score | VisualSokoban Success Rate |
| --- | --- | --- | --- |
| RLOO (baseline) | 76.6 | 84.2 | 85.9 |
| w/ ground-truth PR | - | - | 87.5 |
| w/ merged rewards | 81.3 | 90.7 | 88.3 |
| w/ token-level PR | 82.0 | 90.0 | 89.1 |
| OPRL | 86.5 | 93.6 | 91.7 |


In summary, OPRL is an efficient, stable, and general credit assignment strategy that significantly improves the performance of LLM agents across a variety of interactive environments.