Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents


TL;DR

This paper proposes Entropy-Modulated Policy Gradients (EMPG), a framework that uses an agent's intrinsic per-step uncertainty (entropy) in long-horizon tasks to dynamically rescale policy gradients. This addresses the credit-assignment problem under sparse rewards and significantly improves the learning efficiency and final performance of LLM agents.

Background

The core of this paper is to use the model's intrinsic uncertainty to reshape reinforcement-learning signals. The motivating context is as follows:

At present, autonomous agents based on large language models (LLMs) face a core bottleneck when handling long-horizon tasks: because reward signals are extremely sparse (typically provided only at the end of the task), it is difficult to accurately assign credit to important intermediate steps.

To address this issue, current research mainly falls into two directions:

  1. Implicit Reward Guidance: Using traditional reinforcement learning techniques (such as reward shaping, intrinsic curiosity, and inverse reinforcement learning) to create dense reward signals. However, when facing the enormous state and action spaces inherent to LLM agent tasks, these methods are often too computationally expensive, difficult to scale, or heavily dependent on human prior knowledge.
  2. Explicit Step-Level Supervision: Using Process Reward Models (PRMs) to provide feedback for each step. However, building PRMs requires costly human annotation, and defining the single "correct" step in complex interactive tasks is itself difficult, so these methods generalize poorly and are impractical at scale.

In addition, some work has tried to use policy entropy as a learning signal, but these approaches are either risky because models can be confidently wrong, or limited to single-turn generation tasks, and thus fail to solve the credit assignment problem in multi-step decision-making.

This paper aims to solve the credit assignment problem in long-horizon, multi-step decision-making tasks that the above methods fail to handle effectively. Specifically, it first theoretically reveals the inherent coupling between policy gradient magnitude and policy entropy: high-entropy (uncertain) actions produce large gradients, while low-entropy (confident) actions produce small gradients, resulting in inefficient and unstable learning. The goal of this paper is to directly correct this intrinsic gradient dynamic.

Method

This paper proposes the Entropy-Modulated Policy Gradients (EMPG) framework, aiming to solve the credit assignment problem in long-horizon agent tasks by recalibrating the learning dynamics of policy gradients. Its core idea is to use the agent’s intrinsic, step-by-step uncertainty to modulate the learning signal.

Theoretical Motivation

The paper first theoretically analyzes the relationship between policy gradients and policy uncertainty. As shown in Proposition 1, for a standard softmax policy, the expected squared L2 norm of the score function is a monotonic function of the policy's Rényi-2 entropy:

\[\mathbb{E}_{a\sim\pi_{\theta}(\cdot\mid s)}\left[\left\lVert\nabla_{z_{\theta}(s)}\log\pi_{\theta}(a\mid s)\right\rVert^{2}\right]=1-\exp\!\left(-H_{2}(\pi)\right)\]

This reveals an inherent learning dynamic: high-entropy (uncertain) steps naturally produce large gradients, which may lead to unstable training; low-entropy (confident) steps produce small gradients, meaning that even if these steps are correct, their reinforcement effect is limited, thereby reducing learning efficiency. EMPG is designed precisely to directly address this dual challenge.
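This identity is easy to verify numerically. The sketch below (plain NumPy, illustrative only) uses the fact that for a softmax policy the gradient of $\log\pi(a\mid s)$ with respect to the logits is $e_a - \pi$, so the expected squared score norm can be summed in closed form and compared against $1-\exp(-H_2(\pi))$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
pi = softmax(rng.normal(size=5))   # a random softmax policy

# grad_z log pi(a) = e_a - pi, so ||grad||^2 = 1 - 2*pi[a] + ||pi||^2.
sq_norms = 1.0 - 2.0 * pi + np.sum(pi ** 2)
lhs = np.sum(pi * sq_norms)        # E_{a~pi}[ ||score||^2 ]

renyi2 = -np.log(np.sum(pi ** 2))  # Renyi-2 entropy H_2(pi)
rhs = 1.0 - np.exp(-renyi2)

assert np.isclose(lhs, rhs)        # Proposition 1 holds exactly
```

A near-uniform policy (high $H_2$) pushes the right-hand side toward 1, while a near-deterministic policy pushes it toward 0, which is exactly the coupling the paper identifies.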

Figure illustration

Innovations

The innovation of EMPG lies in introducing a new Modulated Advantage $A_{\text{mod}}$, which replaces the traditional practice in reinforcement learning of using a single advantage value for an entire trajectory, and instead customizes the learning signal for each decision step $t$:

\[A_{\text{mod}}(i,t)=\underbrace{A^{(i)}\cdot g(H_{t}^{(i)})}_{\text{self-calibrated gradient scaling}}+\underbrace{\zeta\cdot f(H_{t+1}^{(i)})}_{\text{future clarity reward}}\]

where $A^{(i)}$ is the final return advantage of trajectory $i$, and $H_{t}^{(i)}$ is the entropy at step $t$. This advantage function consists of two core components:

1. Self-Calibrating Gradient Scaling $g(H)$

This component is designed to correct the gradient-entropy coupling problem identified in the theoretical motivation above.

\[g(H_{t}^{(i)})=\frac{\exp(-k\cdot H_{\text{norm},t}^{(i)})}{\frac{1}{\sum_{j=1}^{N_{B}}T_{j}}\sum_{j=1}^{N_{B}}\sum_{t^{\prime}=1}^{T_{j}}\exp(-k\cdot H_{\text{norm},t^{\prime}}^{(j)})}\]

where the denominator is the batch average of $\exp(-k\cdot H_{\text{norm}})$ over all steps of all $N_B$ trajectories, so that $g$ averages to 1 across the batch.
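A minimal sketch of this scaling (assuming the step entropies have already been batch-normalized; `k` is an illustrative hyperparameter, not a value from the paper):

```python
import numpy as np

def gradient_scaling(H_norm_steps, k=1.0):
    """Self-calibrating scaling g(H) for every step in a batch.

    H_norm_steps: list of 1-D arrays, one per trajectory, holding
    batch-normalized step entropies (the exact normalization scheme
    is an assumption here). Returns per-step scaling factors whose
    batch average is exactly 1 by construction.
    """
    w = [np.exp(-k * h) for h in H_norm_steps]   # confident (low-entropy) steps get large weights
    batch_mean = np.concatenate(w).mean()        # denominator of g
    return [wi / batch_mean for wi in w]
```

Dividing by the batch mean keeps the overall gradient magnitude roughly unchanged while shifting learning signal from uncertain steps to confident ones.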

2. Future Clarity Bonus $f(H)$

This component provides the agent with an intrinsic incentive to guide it toward more purposeful exploration.

\[f(H_{t+1}^{(i)})=\exp(-k^{\prime}\cdot H_{\text{norm},t+1}^{(i)})\]
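A one-line sketch of the bonus (`k_prime` is again an illustrative value): it is larger when the next step's normalized entropy is lower, i.e. it rewards actions that lead to confident successor states.

```python
import numpy as np

def future_clarity(H_norm_next, k_prime=1.0):
    """Future clarity bonus f(H_{t+1}) = exp(-k' * H_norm).

    Decays as the next step's (normalized) entropy grows, giving the
    agent an intrinsic incentive to steer toward low-uncertainty states.
    """
    return np.exp(-k_prime * np.asarray(H_norm_next, dtype=float))
```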

Algorithm Flow

The overall algorithm of EMPG is as follows:

  1. Collect Data: Run the current policy $\pi_{\theta}$ and collect a batch of trajectories.
  2. Compute Base Advantage: Compute a trajectory-level advantage value $A^{(i)}$ for each trajectory based on its final task outcome (success/failure).
  3. Compute Step-Level Entropy: Compute the entropy $H_t$ for all trajectories and all steps in the batch.
  4. Normalize and Compute Modulation Factors: Normalize the entropy within the batch, then compute the self-calibrating scaling factor $g(H_t)$ and the future clarity bonus $f(H_{t+1})$.
  5. Compute Modulated Advantage: Use the formula to compute $A_{\text{mod}}(i,t)$ for each step.
  6. Final Normalization: Perform batch normalization (zero mean) on all $A_{\text{mod}}$ to obtain the final advantage signal $A_{\text{final}}(i,t)$.
  7. Update Policy: Use $A_{\text{final}}$ as the advantage function and update the model parameters $\theta$ via policy gradient methods.
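Steps 3-6 above can be sketched end to end as follows. This is an illustrative reimplementation, not the authors' code: the z-score entropy normalization, the terminal-step handling of $f$, and the hyperparameter values `k`, `k_prime`, `zeta` are all assumptions.

```python
import numpy as np

def empg_advantages(step_entropies, traj_advantages, k=1.0, k_prime=1.0, zeta=0.1):
    """Compute A_final(i, t) from raw step entropies and trajectory advantages.

    step_entropies:  list of 1-D arrays, entropy H_t per step of each trajectory (step 3)
    traj_advantages: 1-D array, trajectory-level advantage A^(i) (step 2)
    """
    # Step 4a: normalize entropies within the batch (z-score is an assumption).
    flat = np.concatenate(step_entropies)
    mu, sigma = flat.mean(), flat.std() + 1e-8
    H_norm = [(h - mu) / sigma for h in step_entropies]

    # Step 4b: self-calibrating scaling g(H_t), batch-averaged to 1.
    w = [np.exp(-k * h) for h in H_norm]
    batch_mean_w = np.concatenate(w).mean()
    g = [wi / batch_mean_w for wi in w]

    # Step 5: modulated advantage A_mod(i, t) = A^(i) * g(H_t) + zeta * f(H_{t+1}).
    mods = []
    for A_i, g_i, h_i in zip(traj_advantages, g, H_norm):
        f_next = np.zeros(len(h_i))
        f_next[:-1] = np.exp(-k_prime * h_i[1:])  # future clarity bonus
        # (no successor step at the final step, so its bonus is 0 -- an assumption)
        mods.append(A_i * g_i + zeta * f_next)

    # Step 6: zero-mean normalization over the whole batch.
    batch_mean = np.concatenate(mods).mean()
    return [m - batch_mean for m in mods]
```

The returned per-step values would then be plugged into any policy-gradient update (e.g. a GRPO- or DAPO-style objective) in place of the flat trajectory advantage.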

Experimental Results

The paper conducted extensive experiments on three challenging long-horizon agent benchmarks: WebShop, ALFWorld, and Deep Search. The results show that EMPG achieves significant and consistent performance improvements.

Main Results


| Method | ALFWorld (All) | WebShop (Succ.) |
| --- | --- | --- |
| **Baseline: Qwen2.5-1.5B-Instruct** | | |
| GRPO* | 65.6 | 58.2 |
| with EMPG* | 73.7 (+8.1) | 60.8 (+2.6) |
| DAPO* | 80.8 | 73.2 |
| with EMPG* | 88.1 (+7.3) | 73.8 (+0.6) |
| **Baseline: Qwen2.5-7B-Instruct** | | |
| GRPO* | 74.8 | 65.6 |
| with EMPG* | 78.5 (+3.7) | 69.3 (+3.7) |
| DAPO* | 90.0 | 79.6 |
| with EMPG* | 91.6 (+1.6) | 82.7 (+3.1) |

Table 1: A partial performance summary on the ALFWorld and WebShop tasks. EMPG brings improvements across different models and baselines.



| Method | ID Avg. | OOD Avg. | Overall Avg. |
| --- | --- | --- | --- |
| **Qwen2.5-32B-Instruct** | | | |
| DAPO (baseline) | 63.5 | 59.8 | 62.0 |
| + Gradient scaling | 63.7 | 63.7 | 63.7 |
| + Future reward | 66.1 | 61.4 | 64.2 |
| + EMPG (this paper) | 66.6 (+3.1) | 63.7 (+3.9) | 65.3 (+3.3) |

Table 2: Main results and ablation studies on the Deep Search task.


In-depth analysis

Figure illustration

Figure illustration

Summary

EMPG is a theoretically grounded, general framework that converts sparse final-task rewards into dense, informative, and calibrated step-level learning signals by leveraging the agent's own intrinsic uncertainty. Experiments show that, without any additional annotation cost, it significantly improves the performance, stability, and generalization of LLM agents on long-horizon tasks, laying a foundation for more efficient and robust autonomous agents.