Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem
Alibaba Open-Sources ROME: 57.4% on SWE-bench Verified, Unveiling the “Rome” Infrastructure Behind Top-Tier Agents

The development of large models is undergoing a profound transformation from “conversationalist” to “actor.”
ArXiv URL: http://arxiv.org/abs/2512.24873v1
In the past, we were used to giving a model a prompt and expecting it to produce a perfect answer in one shot. But in real software engineering and other complex tasks, this one-shot approach often fails. True agentic crafting requires the model to act like a human engineer: plan a solution, write code, observe errors, self-correct, and ultimately solve the problem through multiple rounds of interaction.
However, the open-source community has long lacked a proper set of “infrastructure” to support this kind of complex Agent development. Everyone knows that “ROME wasn’t built in a day,” but how do you build it systematically?
The Alibaba team recently released a major paper that not only introduces a high-performance Agent model called ROME, but, more importantly, open-sources the entire Agentic Learning Ecosystem (ALE) behind it. This system helped ROME reach 57.4% accuracy on the SWE-bench Verified leaderboard, approaching the performance of models with hundreds of billions of parameters.
Today, let’s break down how Alibaba built this “Rome” city on top of “Rock and Roll” on the road to AGI.
“Rock and Roll” here is not just music
The “Rock and Roll” in the paper title is actually a clever pun, representing the two most core foundational components of the ALE ecosystem: ROLL and ROCK.
To train an Agent that can work in the real world, data alone is not enough. You need a training ground where the Agent can “get its hands dirty,” as well as an efficient training mechanism. ALE was built for exactly this purpose, and it includes three components that work together:
- ROLL (Reinforcement Learning Optimization for Large-Scale Learning):
This is a training framework designed specifically for large-scale RL. Its key highlight is dynamic GPU resource scheduling. In Agent training, the resource demands of data generation (Rollout) and model updates (Training) fluctuate. ROLL adopts a “time-division multiplexing” strategy: it generates data at full capacity during Rollout peaks, then quickly switches resources to training once enough data has been accumulated, greatly improving GPU utilization.
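The paper's actual scheduler is far more involved, but the "time-division multiplexing" idea can be sketched in a few lines: the GPU generates rollouts at full capacity until a data threshold is hit, then switches wholesale to a training step. All names below (`train_with_time_division`, the threshold value) are illustrative, not ROLL's API.

```python
from collections import deque

def train_with_time_division(rollout_step, train_step, batch_threshold, total_rounds):
    """Alternate the same resource pool between rollout and training.

    rollout_step() produces one trajectory; train_step(batch) consumes
    a batch. Returns a log of which phase owned the GPU at each tick.
    """
    buffer = deque()
    phases = []
    for _ in range(total_rounds):
        # Rollout phase: generate at full capacity until enough data accumulates.
        while len(buffer) < batch_threshold:
            buffer.append(rollout_step())
            phases.append("rollout")
        # Switch resources to training once the buffer is full.
        batch = [buffer.popleft() for _ in range(batch_threshold)]
        train_step(batch)
        phases.append("train")
    return phases

# Toy usage: lambdas stand in for real GPU work.
log = train_with_time_division(
    rollout_step=lambda: 1,
    train_step=lambda batch: None,
    batch_threshold=4,
    total_rounds=2,
)
```

In a real system the switch has a nontrivial cost (weights must be synced to the inference engine), which is why ROLL waits for a full buffer rather than interleaving per sample.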
- ROCK (Reinforcement Open Construction Kit):
This is the Agent’s “training room” — a secure sandbox environment manager. When the Agent writes code or executes commands, it may produce dangerous operations (such as accidental rm -rf or network attacks). ROCK provides strictly isolated container environments, supports fine-grained permission management such as file system and network controls, and ensures that when the Agent “makes mistakes,” it won’t blow up the server, while also keeping the training data clean and safe.
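ROCK's real interface is not shown in the article; the deny-by-default isolation it describes can be sketched with standard `docker run` flags. Everything here (the function name, the memory limit, the image) is an illustrative assumption.

```python
def sandboxed_command(image, cmd, allow_network=False, writable_paths=()):
    """Build a container invocation with deny-by-default isolation.

    Sketch only: these are stock Docker flags standing in for the kind of
    fine-grained file-system and network controls the article attributes
    to ROCK.
    """
    args = ["docker", "run", "--rm",
            "--read-only",       # file system locked down by default
            "--cap-drop=ALL",    # no extra Linux capabilities
            "--memory=512m"]     # bound resource usage
    if not allow_network:
        args.append("--network=none")  # no outbound access for the agent
    for path in writable_paths:
        # Only explicitly whitelisted paths get a writable tmpfs mount.
        args.append(f"--tmpfs={path}")
    return args + [image] + list(cmd)

cmd = sandboxed_command("python:3.11-slim", ["python", "-c", "print('hi')"],
                        writable_paths=["/tmp"])
```

With this shape, an agent's accidental `rm -rf /` hits a read-only root and a throwaway tmpfs, not the host.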
- iFlow CLI:
This is an Agent framework that connects the model and the environment. It manages complex context, allowing developers to define the Agent’s behavior flow through configuration rather than hard coding.
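iFlow CLI's actual configuration schema is not reproduced in the article; the sentence above can be made concrete with a toy example of driving an agent loop from a config object instead of hard-coded control flow. The config keys and handler names are hypothetical.

```python
# Hypothetical config: the real iFlow CLI schema is not shown in the article.
AGENT_FLOW = {
    "max_turns": 3,
    "steps": ["plan", "act", "observe"],
}

def run_agent(flow, handlers):
    """Drive the agent from configuration: changing the behavior flow means
    editing AGENT_FLOW, not rewriting the loop."""
    trace = []
    for turn in range(flow["max_turns"]):
        for step in flow["steps"]:
            trace.append(f"{turn}:{step}")
            handlers[step](turn)  # dispatch to the step's handler
    return trace

# Toy handlers; a real agent would call the model and the environment here.
handlers = {s: (lambda turn: None) for s in ("plan", "act", "observe")}
trace = run_agent(AGENT_FLOW, handlers)
```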

ROME: obviously an Agent model
Built on top of the powerful infrastructure above, Alibaba incubated ROME (ROME is Obviously an Agentic ModEl). This is not just a fine-tuned LLM; it went through a carefully designed three-stage training pipeline:
- Continual Pretraining (CPT):
In this stage, the model not only learns code, but also learns how to think like an Agent through roughly 300 billion tokens of trajectory data. These data include successful and failed interaction records generated by powerful teacher models (such as Claude), enabling ROME to learn “intent formation” and “error recovery.”
- Two-stage Supervised Fine-tuning (SFT):
To prevent the model from getting lost in complex Agent tasks, SFT is divided into two stages. The first stage uses heuristically filtered data for basic training; the second stage introduces adaptive value data revisiting, specifically reinforcing high-quality, high-difficulty Agentic tasks.
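The paper's exact "adaptive value" criterion is not given in the article, but the stage-two idea of revisiting high-quality, high-difficulty data can be sketched as a reweighting pass. The scoring rule below (quality times difficulty, with a quality floor) is a hypothetical stand-in.

```python
def revisit_weights(samples, quality_floor=0.7):
    """Assign sampling weights for the second SFT stage.

    Hypothetical rule: drop data below a quality floor, then weight the
    rest by quality * difficulty so hard-but-clean tasks are revisited most.
    """
    weights = []
    for s in samples:
        if s["quality"] < quality_floor:
            weights.append(0.0)  # exclude noisy data in stage two
        else:
            weights.append(s["quality"] * s["difficulty"])
    total = sum(weights) or 1.0
    return [w / total for w in weights]  # normalized sampling distribution

data = [
    {"quality": 0.90, "difficulty": 0.8},  # hard, clean task: revisited most
    {"quality": 0.95, "difficulty": 0.2},  # easy task: kept but down-weighted
    {"quality": 0.40, "difficulty": 0.9},  # noisy data: excluded
]
w = revisit_weights(data)
```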
- Reinforcement Learning (RL):
This is the key step in ROME’s “spiritual elevation.” But in long-horizon Agent tasks, traditional RL faces a major challenge: the credit assignment problem.
Core algorithmic innovation: IPA
Over dozens of rounds of interaction, the Agent may only succeed at the very last step. If you simply reward every token, or only reward the final outcome, the model has a hard time knowing which intermediate step was right and which was wrong.
To solve this problem, the paper proposes a new policy optimization algorithm: Interaction-Perceptive Agentic Policy Optimization, IPA.
IPA’s core insight is: the decision granularity of an Agent is not the token, but the “interaction chunk” (Chunk).
Traditional token-level RL (such as PPO or ReMax) is often too fine-grained, leading to unstable training. IPA models multi-turn conversations as a Chunked MDP, treating each complete “think-act-observe” loop as a semantic unit.
\[\nabla J_{\text{RL}}(\pi) = \underbrace{\sum_{\tau \in \mathcal{T}^{+}} \dots}_{\text{positive-sample weighted update}} + \underbrace{\sum_{\tau \in \mathcal{T}^{-}} \dots}_{\text{negative-sample truncated update}}\]

In simple terms, IPA achieves the following:
- Semantic-level credit assignment: It does not blindly reward every word, but evaluates the value of the entire interaction action.
- Long-range stability: By computing the advantage function at the semantic chunk level, IPA significantly improves training stability for long-sequence tasks.
- Balanced positive and negative samples: It not only learns from successful trajectories, but also uses failed trajectories (through importance sampling truncation) to make clear “what should not be done.”
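The three points above can be sketched numerically. IPA's exact estimator is not reproduced in the article; this toy assigns one advantage per "think-act-observe" chunk (return-to-go minus a baseline) and truncates the importance ratio only for negative-advantage chunks, so failures teach without destabilizing training. All function names are illustrative.

```python
def chunk_advantages(chunk_rewards, baseline, gamma=1.0):
    """One advantage per interaction chunk, not per token.

    Computes discounted return-to-go per chunk and subtracts a baseline.
    """
    returns = []
    g = 0.0
    for r in reversed(chunk_rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return [g - baseline for g in returns]

def ipa_chunk_weight(advantage, is_ratio, clip=0.2):
    """Per-chunk update weight: keep the full importance ratio for positive
    advantages, truncate it for negative ones (the 'truncated update' term)."""
    ratio = min(is_ratio, 1.0 + clip) if advantage < 0 else is_ratio
    return ratio * advantage

# Trajectory with a sparse success reward only on the final chunk:
# the credit flows back to every chunk that led there.
adv = chunk_advantages([0.0, 0.0, 1.0], baseline=0.5)
```

Note how the sparse final reward gives every preceding chunk the same positive advantage, which is exactly the credit-assignment behavior token-level schemes struggle to produce.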

Experimental results: a small model with a big breakout
With these techniques in place, ROME demonstrated astonishing capability.
On SWE-bench Verified (an authoritative benchmark for evaluating whether LLMs can solve real GitHub issues), ROME achieved a 57.4% solve rate. This result not only clearly outperforms open-source models of similar scale, but can even go head-to-head with closed-source models that have several times more parameters (such as the GPT-4 series).
In addition, Alibaba introduced a new benchmark, Terminal Bench Pro. Compared with the previous version, it is much stricter in terms of scale, domain coverage, and contamination control. Even on this “hell-level” test, ROME still delivered highly competitive performance.
Conclusion
The greatest value of this paper may not lie in the ROME model itself, but in the complete Agent production pipeline it presents to the community.
From the secure sandbox of ROCK, to the efficient training of ROLL, and then to the optimization of long-horizon interactions by the IPA algorithm, Alibaba has shown that in the Agent era, improving model capability no longer depends solely on stacking data and parameters; it depends even more on the deep synergy of Environment, Data Synthesis, and System.
As the paper says: “ROME wasn’t built in a day.” If we want to build general-purpose Agents, we first need to build the underlying “Rome infrastructure” well.