Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
-
ArXiv URL: http://arxiv.org/abs/2509.20616v1
-
Authors: Changliu Liu; Na Li; Hanjiang Hu; Yebin Wang
-
Affiliations: Carnegie Mellon University; Harvard University; Mitsubishi Electric Research Laboratories
TL;DR
This paper proposes a method that recasts complex multi-turn task planning as a series of single-turn task reasoning problems, then optimizes them with single-turn reinforcement learning (GRPO) using rewards derived from expert trajectories. The resulting small agent (1.5B parameters) outperforms far larger baseline models on long-horizon planning tasks.
Key Definitions
To bridge single-turn learning and multi-turn planning, this paper introduces or adopts the following key definitions:
-
Multi-Turn Task Planning Markov Decision Process (Multi-Turn Task Planning MDP): This is a standard finite-horizon MDP, defined as $\mathcal{M}=(\mathcal{S},\mathcal{A},f,R,\mathcal{T},s_0)$, used to describe a complete long-horizon task. Here, the environment transition function $f$ is unknown, and the reward $R$ is sparse (equal to 1 only when the task is finally completed), which makes it difficult to train an LLM policy directly on this MDP.
-
Expert Trajectory: Defined as $\tau^{GT}(T^{GT})$, this is a trajectory generated by an expert policy $\pi^{GT}$ that successfully completes the task in the minimum number of steps $T^{GT}$ and is assumed to be unique. This definition is the cornerstone of the paper's methodology, as it provides an optimal and unique learning target.
-
Single-Turn Task Reasoning Markov Decision Process (Single-Turn Task Reasoning MDP): This is the simplified MDP constructed in this paper for efficient training, defined as $\mathcal{M}_S=(\mathcal{S},\mathcal{A},\emptyset,r_{\pi^{GT}},1,s_0)$. It has no state transitions and is a one-step (bandit) problem. Its core is the dense and verifiable reward function $r_{\pi^{GT}}(s,a)=\mathds{1}\{a=\pi^{GT}(s)\}$, meaning that at any state $s$, if the agent’s action $a$ matches the action in the expert trajectory, the reward is 1; otherwise, it is 0.
-
Success Probability with Minimal Steps: Defined as $P_t^{\pi}(s_t)=\mathbb{P}_{\pi}(R(s_{t+T^*(s_t)},a_{t+T^*(s_t)})=1 \mid s_t)$, it represents the probability that policy $\pi$ can successfully complete the task from state $s_t$ in the minimum required number of steps $T^*(s_t)$. This is the key theoretical bridge connecting single-turn learning performance with multi-turn task success.
Related Work
At present, training large language model (LLM) agents with reinforcement learning (RL) for complex multi-turn task planning is a highly promising direction. However, existing methods face three major bottlenecks:
- Sparse rewards: In multi-turn interactions, rewards are only available when the task is finally completed, making the learning signal extremely sparse.
- Credit assignment: In long-horizon tasks, it is difficult to determine which critical action at which step contributed to the final success or failure.
- Computational cost: The computational complexity of multi-turn RL grows rapidly with sequence length, making training prohibitively expensive for complex tasks that require dozens of decision steps.
In contrast, single-turn RL post-training (such as GRPO) has achieved success on single-turn reasoning tasks like math and code, but these successes are limited to scenarios where the model generates a complete answer in one shot.
The core question this paper aims to solve is: How can we bridge the gap between efficient single-turn reasoning training and the demands of complex multi-turn task planning, thereby improving the long-horizon planning ability of LLM agents while avoiding the challenges of multi-turn RL?
Method
Core Idea: Transforming Multi-Turn Planning into Single-Turn Reasoning
The central insight of this paper is that any complex multi-turn task planning problem can be decomposed into a series of independent single-turn task reasoning problems. Specifically, based on an optimal expert trajectory $\tau^{GT} = [(s_0^{GT}, a_0^{GT}), \dots, (s_n^{GT}, a_n^{GT})]$, the multi-turn planning problem is transformed into learning a policy at each state $s_i^{GT}$ to predict the optimal next action $a_i^{GT}$.
This transformation cleverly turns a long-horizon, sparse-reward sequential decision problem into a series of one-step decision problems with dense, immediate, and verifiable rewards, enabling training with more efficient RL algorithms.
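The decomposition above can be sketched in a few lines. This is an illustration, not the authors' code: the state and action strings are hypothetical placeholders, but the reward is exactly the paper's indicator $r_{\pi^{GT}}(s,a)=\mathds{1}\{a=\pi^{GT}(s)\}$.

```python
# Decompose an expert trajectory into independent single-turn examples,
# each carrying a dense, verifiable reward (sketch with placeholder states).

def to_single_turn_examples(expert_trajectory):
    """Each (state, expert_action) pair becomes one single-turn problem."""
    return [{"state": s, "expert_action": a} for s, a in expert_trajectory]

def reward(example, proposed_action):
    """r(s, a) = 1 iff the proposed action matches the expert's action at s."""
    return 1.0 if proposed_action == example["expert_action"] else 0.0

# Hypothetical two-step expert trajectory from the cooking domain.
tau_gt = [("raw patty on counter", "grill patty"),
          ("patty grilled", "place patty on bun")]
examples = to_single_turn_examples(tau_gt)
print(reward(examples[0], "grill patty"))   # 1.0
print(reward(examples[0], "slice cheese"))  # 0.0
```

Because each example is scored independently, the rollout length per training step is always 1, regardless of how long the original task is.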
Optimization Algorithm: Single-Turn GRPO
The paper adopts the Group Relative Policy Optimization (GRPO) algorithm to perform policy optimization on the constructed single-turn MDP. GRPO is an on-policy, model-free RL method characterized by not requiring a complex value or critic network; instead, it estimates advantages by comparing the relative quality of a group of sampled results.
The GRPO objective is as follows:
\[\max_{\pi}\ \mathbb{E}_{s\sim\rho_Q}\,\mathbb{E}_{a\sim\pi_{\text{old}}(\cdot \mid s)}\left[\frac{\pi(a \mid s)}{\pi_{\text{old}}(a \mid s)}A(s,a)\right]-\beta\,\text{KL}(\pi \mid \mid \pi^{\text{ref}})\]
Here, the advantage function $A(s,a)$ is computed directly from the binary reward $r_{\pi^{GT}}$ indicating whether the action matches the expert action, without requiring complex value estimation. The KL divergence term regularizes the policy, preventing it from drifting too far from the reference policy.
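The group-relative advantage is simple on a binary reward. The sketch below shows the standard within-group normalization GRPO uses (a simplification: the full objective also involves importance ratios and the KL penalty); the sampled rewards are hypothetical.

```python
# Group-relative advantage on the binary expert-match reward (sketch).
import statistics

def group_advantages(rewards):
    """A(s, a_i) = (r_i - mean(r)) / std(r), computed within one sampled group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:  # all sampled actions equally (in)correct: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# A group of 4 sampled actions: two match the expert action, two do not.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Note that a group in which every sampled action matches (or misses) the expert action yields zero advantage everywhere, so gradient signal concentrates on states where the policy is uncertain.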
Innovations and Advantages:
- Avoiding multi-turn RL difficulties: By transforming the problem into a single-turn one, it completely avoids sparse rewards, credit assignment, and the high computational cost of multi-turn rollouts.
- Dense and verifiable rewards: The reward signal $r_{\pi^{GT}}(s,a)=\mathds{1}\{a=\pi^{GT}(s)\}$ is dense (available at every step) and easy to verify (only requiring comparison with the expert action), making learning highly efficient.
Theoretical Guarantee: From Single-Turn Improvement to Multi-Turn Success
One of the most important contributions of this paper is a theoretical proof that connects policy improvement on single-turn reasoning tasks with success probability in multi-turn planning tasks.
The key is a non-standard success probability recurrence equation:
\[P_t^{\pi}(s_t)=\mathbb{E}_{a\sim\pi(\cdot \mid s_t)}\left[r_{\pi^{GT}}(s_t,a)\cdot P_{t+1}^{\pi}(s_{t+1})\right]\]
This equation differs from the traditional Bellman equation in that it uses multiplication rather than addition to connect the reward at the current step with the success probability of future states. This means that only when the agent chooses the correct action at the current step (i.e., $r_{\pi^{GT}}(s_t,a)=1$) does it have the chance to remain on the optimal path in subsequent steps and thus achieve final success in the minimum number of steps. Any mistake at any step causes the entire product to become zero, meaning the task cannot be completed in the minimum number of steps.
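A quick numerical check makes the multiplicative structure concrete. Under the simplifying assumption that the policy matches the expert action independently with probability $p$ at each of the $T$ remaining steps, unrolling $P_t = \mathbb{E}[r_t \cdot P_{t+1}]$ gives $P_t = p^T$: a single mistake anywhere zeroes the min-step success probability. The simulation below is illustrative, not from the paper.

```python
# Monte Carlo check that per-step accuracy p over T steps yields
# min-step success probability p**T (one miss anywhere means failure).
import random

def simulate_min_step_success(p, T, trials=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Success only if the expert action is matched at every step.
        if all(rng.random() < p for _ in range(T)):
            hits += 1
    return hits / trials

p, T = 0.9, 10
print(simulate_min_step_success(p, T))  # close to 0.9**10 ≈ 0.349
```

This is why even a modest gain in per-step accuracy compounds into a large gain over a 20-to-35-step task.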
Based on this, the paper proves the following key theorems:
-
GRPO Improves Multi-Turn Success Probability (Theorem 3.3): If the policy $\pi^*$ optimized by GRPO on the single-turn task is better than or equal to the reference policy $\pi^{\text{ref}}$ at all states (i.e., $\mathbb{E}_{a \sim \pi^*}[r_{\pi^{GT}}(s,a)] \geq \mathbb{E}_{a \sim \pi^{\text{ref}}}[r_{\pi^{GT}}(s,a)]$), then in the multi-turn task, $\pi^*$ also has a higher probability of completing the task in the minimum number of steps, i.e., $P_t^{\pi^*}(s_t)\geq P_t^{\pi^{\text{ref}}}(s_t)$.
-
Generalization to Subtasks (Corollary 3.2): Likewise, a policy trained on more complex tasks can also improve its success probability on all embedded, simpler subtasks.
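The monotonicity behind Theorem 3.3 can be illustrated numerically. Under the same independence simplification as before, the min-step success probability is the product of per-state expert-match probabilities along the expert path; the numbers below are hypothetical, not from the paper.

```python
# Illustration of Theorem 3.3's monotonicity: if pi_star matches the expert
# at least as often as pi_ref at every state, the product of per-step match
# probabilities (the min-step success probability) can only improve.
import math

acc_ref  = [0.70, 0.80, 0.60, 0.90]   # per-state expert-match prob., pi_ref
acc_star = [0.85, 0.85, 0.75, 0.95]   # >= acc_ref at every state, pi_star

assert all(a >= b for a, b in zip(acc_star, acc_ref))
p_ref, p_star = math.prod(acc_ref), math.prod(acc_star)
assert p_star >= p_ref
print(p_ref, p_star)
```

The same state-wise argument yields Corollary 3.2: every subtask's expert path is a suffix of states on which the improved policy is also at least as accurate, so its success probability improves too.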
Experimental Results
The experiments were conducted on a challenging cooking-scenario task-planning benchmark, aiming to answer two questions: 1) Can single-turn GRPO training improve multi-turn task planning performance? 2) Does the trained agent generalize to unseen tasks?
Effectiveness and Efficiency Validation
The paper compares the 1.5B-parameter Qwen2.5 model after SFT and single-turn GRPO training with Qwen2.5 baseline models of different sizes (from 1.5B to 14B).
| Task | Model | SR ($\uparrow$) | ASAT ($\downarrow$) | ASST ($\downarrow$) |
|---|---|---|---|---|
| Cheese Burger | Qwen2.5-14b | 0.2 | 22.4 | 20.0 |
| (up to 23 steps) | Qwen2.5-1.5b (SFT) | 0.6 | 18.4 | 15.3 |
| | Qwen2.5-1.5b (SFT+GRPO) | 0.7 | 15.8 | 12.7 |
| Double Cheese Burger | Qwen2.5-14b | 0.0 | 35.0 | — |
| (up to 35 steps) | Qwen2.5-1.5b (SFT) | 0.1 | 34.2 | 27.0 |
| | Qwen2.5-1.5b (SFT+GRPO) | 0.3 | 30.5 | 20.0 |
Note: SR = success rate, ASAT = average steps over all attempts, ASST = average steps over successful attempts. The table above excerpts the results for the two most representative complex tasks.
- Significant performance gains: The results show that the 1.5B GRPO-trained model performs significantly better than all baseline models across all tasks, including the 14B model with nearly 10 times more parameters. On the most complex “Double Cheese Burger” task (up to 35 steps), only the model trained by this method achieved a non-zero success rate (30%).
- The key role of GRPO: Compared with the model trained only with supervised fine-tuning (SFT), adding GRPO training yielded substantial improvements in both success rate and efficiency, validating the effectiveness of single-turn RL for learning task reasoning logic, in line with the theoretical expectation (Theorem 3.3).
- Higher efficiency: The GRPO-trained model not only has a higher success rate, but also requires fewer average steps to complete the task, indicating that it learned a strategy closer to optimal (i.e., minimum-step) behavior.
Cross-Task Generalization Ability
To verify the generalization prediction of the theory (Corollary 3.2), the experiments evaluated the zero-shot performance of models trained on a single task across all other tasks (simpler or more complex).
Success-Rate Generalization Matrix (SR)
| Trained on ($\downarrow$) / Evaluated on ($\rightarrow$) | Cheese Sandwich | Burger | Cheese Burger | Double Cheese Burger |
|---|---|---|---|---|
| Cheese Sandwich | 0.3 | 0.3 | 0.0 | 0.0 |
| Burger | 0.4 | 0.7 | 0.0 | 0.0 |
| Cheese Burger | 0.4 | 0.7 | 0.7 | 0.0 |
| Double Cheese Burger | 0.3 | 0.5 | 0.4 | 0.3 |
- Generalization from complex to simple: The model trained on the most complex task (Double Cheese Burger) can successfully solve all simpler subtasks, strongly supporting the theoretical prediction. In contrast, models trained on simple tasks cannot generalize to more complex tasks.
- Efficiency loss in generalization: Although the complex-task model can solve simple tasks, its efficiency (number of steps required) may be worse than a model trained specifically for that simple task, showing some “loss of optimality.”
Final Conclusion
This work demonstrates that by decomposing multi-turn planning into single-turn reasoning and applying GRPO for efficient optimization, it is possible to train a compact yet powerful LLM agent that outperforms models far larger than itself on complex long-horizon tasks. The work not only provides a solid theoretical foundation connecting single-turn learning with multi-turn success, but also experimentally validates the method’s effectiveness and strong generalization ability. Although the method relies on expert trajectories, it opens up a highly promising path toward building more efficient and capable LLM agents.