Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

For a long time, a common skepticism has circulated in the LLM community: do autoregressive models based on Next-Token Prediction really possess logical reasoning and planning abilities, or are they just "stochastic parrots" coasting on probabilistic inertia?
ArXiv: http://arxiv.org/abs/2512.15605v1
After all, autoregressive models (ARMs) seem extremely short-sighted: they only ever look at the next word. Another class of models, known as Energy-Based Models (EBMs), can evaluate the quality of an entire sequence from a global perspective and enjoy a natural "god's-eye view," but they have never become mainstream because they are computationally very hard to train and sample from.
Google DeepMind’s latest research has shattered this long-held view.
The paper, titled Autoregressive Language Models are Secretly Energy-Based Models, puts forward a disruptive claim: autoregressive models are, in essence, energy-based models. Although they appear to be doing next-token prediction, they are secretly learning a “soft value function” to achieve global planning through local prediction.
This discovery not only provides solid theoretical support for the “planning ability” of LLMs, but also unifies three major fields: supervised learning, reinforcement learning, and energy-based models.
The Opposition and Unity of Two Camps
Before diving into the technical details, we need to clarify the relationship between the two protagonists:
- Autoregressive Models (ARMs): the mainstream paradigm for current LLMs, such as the GPT series.
  - Features: use the chain rule to decompose sequence generation into step-by-step conditional probabilities.
  - Advantages: highly parallelizable training; simple sampling (Ancestral Sampling).
  - Disadvantages: appear to only be able to "take one step at a time."
- Energy-Based Models (EBMs):
  - Features: define an energy function (or reward function) $R(x, y)$ that scores the entire input-output pair. In this paper's sign convention, energy plays the role of reward: the higher the energy, the better the sequence.
  - Advantages: a naturally global view (Lookahead), because they directly model the full sequence.
  - Disadvantages: extremely hard to train and sample from, because normalizing the probabilities requires an intractable partition function.
Through mathematical derivation, the DeepMind researchers found that these two seemingly opposing models actually have an explicit bijection in function space. In other words, every ARM corresponds to a unique EBM, and vice versa.
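As a quick sanity check of this correspondence, here is a minimal sketch of the single-step special case (our illustration, not the paper's code; the logit values are arbitrary): an ARM's next-token logits, shifted by their log-sum-exp, become self-normalized energies that define exactly the same distribution.

```python
# One-step special case of the ARM <-> EBM correspondence (illustrative
# sketch).  Shifting logits by their log-sum-exp yields self-normalized
# energies (log-partition = 0) defining the same softmax distribution.
import math

logits = {"a": 0.3, "b": -1.2, "c": 2.0}                   # an ARM's next-token logits
lse = math.log(sum(math.exp(v) for v in logits.values()))  # soft value (LSE)
energies = {y: v - lse for y, v in logits.items()}         # induced EBM scores

def softmax(d):
    Z = sum(math.exp(v) for v in d.values())
    return {y: math.exp(v) / Z for y, v in d.items()}

# The induced EBM is already normalized (its log-partition is 0) and
# reproduces the ARM's next-token distribution exactly.
assert abs(math.log(sum(math.exp(v) for v in energies.values()))) < 1e-12
assert all(abs(softmax(logits)[y] - softmax(energies)[y]) < 1e-12 for y in logits)
```

The multi-step version of this identity is exactly what the soft value function handles, as the next section explains.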
Core Revelation: The Mathematical Bridge from Local to Global
The paper’s core contribution is the establishment of a transformation mechanism that converts the EBM’s global energy function $r$ into the ARM’s local prediction function $q$.
It is like having a global navigation map (EBM) that tells you which route will ultimately score the highest; DeepMind proved that you can losslessly transform this map into specific road signs at every intersection (ARM), and as long as you follow the signs, the path you take will be equivalent to the optimal route planned by the map.
The mathematical expression of this transformation involves a core concept from reinforcement learning—the Soft Bellman Equation.
Specifically, when an ARM predicts the next token $y_t$, its output logits $q(s_t, y_t)$ actually contain information composed of two parts:
- The immediate reward at the current step, $r(s_t, y_t)$.
- The expected future value, $V_q(s_t \oplus y_t)$.

The formula is as follows:

\[q(s_t, y_t) = r(s_t, y_t) + V_q(s_t \oplus y_t)\]

Here, $V_q$ is called the Soft Value Function. In essence, it is a Log-Sum-Exp (LSE) operation, $V_q(s) = \log \sum_{y} \exp\big(q(s, y)\big)$: the logarithm of the summed unnormalized probability over all possible future paths starting from the current state.
What does this mean?
It means that a perfectly trained autoregressive model, when predicting the next word, is not merely asking “what does the next word look like?” but rather computing “if I choose this word, what will the total energy (quality) of the entire future sentence be?” By learning this $V_q$ function, the ARM implicitly learns to look ahead.
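This soft Bellman relation can be verified end-to-end on a toy model. The sketch below is our illustration (the random reward table, two-token vocabulary, and fixed length are assumptions, not the paper's setup): it builds $V_q$ by the backward log-sum-exp recursion, sets $q = r + V_q$, and checks that the resulting ARM's ancestral-sampling probabilities reproduce the EBM's Boltzmann distribution exactly.

```python
# Toy check of the soft Bellman identity q(s, y) = r(s, y) + V(s + y).
# Names (q, r, V) follow the article; the setup is an illustrative assumption.
import itertools, math, random

random.seed(0)
VOCAB = ["a", "b"]   # tiny vocabulary (illustrative)
T = 3                # fixed sequence length

# 1) An arbitrary EBM: a random per-step reward r(s_t, y_t) for every
#    prefix/token pair; the sequence reward R is their sum.
r_table = {(prefix, y): random.uniform(-1, 1)
           for t in range(T)
           for prefix in itertools.product(VOCAB, repeat=t)
           for y in VOCAB}

def seq_reward(seq):
    return sum(r_table[(seq[:t], seq[t])] for t in range(T))

# 2) Backward soft Bellman recursion:
#    V(s) = log sum_y exp(r(s, y) + V(s + y)),  V(full sequence) = 0.
def V(prefix):
    if len(prefix) == T:
        return 0.0
    return math.log(sum(math.exp(r_table[(prefix, y)] + V(prefix + (y,)))
                        for y in VOCAB))

# 3) ARM logits q(s, y) = r(s, y) + V(s + y); ancestral sampling under
#    softmax(q) reproduces the EBM's Boltzmann distribution exactly.
def arm_log_prob(seq):
    lp = 0.0
    for t in range(T):
        prefix = seq[:t]
        logits = {y: r_table[(prefix, y)] + V(prefix + (y,)) for y in VOCAB}
        lse = math.log(sum(math.exp(v) for v in logits.values()))
        lp += logits[seq[t]] - lse
    return lp

logZ = V(())  # the root soft value is the EBM's log-partition function
for seq in itertools.product(VOCAB, repeat=T):
    assert abs(arm_log_prob(seq) - (seq_reward(seq) - logZ)) < 1e-9
print("ARM ancestral probabilities match the EBM distribution exactly")
```

The assertion holding for every sequence is the telescoping at the heart of the argument: the per-token normalizers cancel step by step, leaving a single global partition function $V(\emptyset)$.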
Why Is Teacher Forcing the Optimal Solution?
When training LLMs, we usually use Teacher Forcing: during training, the model is fed the ground-truth previous tokens as input rather than the tokens it generated itself. This approach is often criticized for creating a mismatch between training and inference (Exposure Bias).
However, based on the ARM-EBM equivalence above, the paper derives a surprising conclusion: ARM training under supervised learning is exactly equivalent to EBM training.
When we minimize the ARM’s negative log-likelihood loss (NLL), we are in fact distilling an optimal EBM through Teacher Forcing. This theoretically proves that, although Teacher Forcing may seem simple and brute-force, it is indeed searching for the optimal solution in function space.
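The converse direction of the equivalence can also be checked numerically: starting from an arbitrary ARM, the teacher-forced NLL of any sequence equals the NLL of the EBM that the ARM induces. A minimal sketch, assuming a random logit table `q` over a toy vocabulary (our setup, not the paper's):

```python
# Any ARM's teacher-forced NLL equals the NLL of the EBM it induces
# via r = q - V(s + y).  Illustrative toy setup, not the paper's code.
import itertools, math, random

random.seed(1)
VOCAB, T = ["a", "b", "c"], 2

# Arbitrary ARM: a random logit table q(s, y) over every prefix s.
q = {(prefix, y): random.uniform(-2, 2)
     for t in range(T)
     for prefix in itertools.product(VOCAB, repeat=t)
     for y in VOCAB}

def V(prefix):  # soft value of an ARM: log-sum-exp of its logits
    if len(prefix) == T:
        return 0.0
    return math.log(sum(math.exp(q[(prefix, y)]) for y in VOCAB))

def teacher_forced_nll(seq):  # sum of per-token cross-entropy terms
    return -sum(q[(seq[:t], seq[t])] - V(seq[:t]) for t in range(T))

def R(seq):  # induced per-step rewards r = q - V(s + y), summed
    return sum(q[(seq[:t], seq[t])] - V(seq[:t + 1]) for t in range(T))

# Brute-force partition function of the induced EBM.
logZ = math.log(sum(math.exp(R(s)) for s in itertools.product(VOCAB, repeat=T)))

for seq in itertools.product(VOCAB, repeat=T):
    assert abs(teacher_forced_nll(seq) - (logZ - R(seq))) < 1e-9
print("teacher-forced NLL equals induced-EBM NLL for every sequence")
```

Because the two losses agree sequence-by-sequence, minimizing one is minimizing the other: per-token cross-entropy under Teacher Forcing is sequence-level EBM training in disguise.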
The Essence of RLHF: Distilling EBM into ARM
The current LLM training pipeline usually includes “pretraining -> SFT -> RLHF.” This paper provides a very clear perspective on RLHF (Reinforcement Learning from Human Feedback).
In the RLHF stage, we usually want to maximize reward $R$ while maintaining a KL-divergence constraint with respect to the reference model (i.e., the MaxEnt RL framework). The paper points out that the optimal solution of MaxEnt RL is essentially an EBM.
However, directly using this EBM at inference time is unrealistic (it is too slow). So what we actually do is:
- Define an ideal EBM (specified by the reward model).
- Train an ARM (our policy model) to approximate this EBM.
This process is the process of distilling an EBM into an ARM. DeepMind further provides theoretical error bounds, proving that the ARM can indeed effectively approximate the EBM distribution.
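The distillation view can be made concrete on an enumerable toy problem. In the sketch below (the reward model, $\beta$, and the uniform reference policy are all illustrative assumptions), we build the MaxEnt-RL optimal EBM $\pi^*(y) \propto \pi_{\text{ref}}(y)\exp(R(y)/\beta)$ exactly, then distill it into an ARM by computing its next-token conditionals; with exact conditionals the distillation is lossless.

```python
# MaxEnt-RL target: pi*(y) ∝ pi_ref(y) * exp(R(y) / beta).  We enumerate a
# toy sequence space, build this tilted EBM exactly, then "distill" it into
# an ARM via prefix marginals.  All quantities here are illustrative.
import itertools, math, random

random.seed(2)
VOCAB, T, beta = ["a", "b"], 3, 0.5

seqs = list(itertools.product(VOCAB, repeat=T))
ref = {s: 1.0 / len(seqs) for s in seqs}          # uniform reference policy
reward = {s: random.uniform(0, 1) for s in seqs}  # toy reward model

# 1) The EBM that solves KL-regularized reward maximization.
w = {s: ref[s] * math.exp(reward[s] / beta) for s in seqs}
Z = sum(w.values())
target = {s: w[s] / Z for s in seqs}

# 2) Distill into an ARM: next-token conditionals from prefix marginals.
def marginal(prefix):
    return sum(p for s, p in target.items() if s[:len(prefix)] == prefix)

def arm_prob(seq):
    p = 1.0
    for t in range(T):
        p *= marginal(seq[:t + 1]) / marginal(seq[:t])
    return p

# 3) With exact conditionals the ARM matches the EBM: KL is (numerically) 0.
kl = sum(target[s] * math.log(target[s] / arm_prob(s)) for s in seqs)
print(f"KL(target || ARM) = {kl:.2e}")
```

In practice the conditionals cannot be computed by enumeration, which is why RLHF trains the ARM to approximate them; the paper's error bounds quantify how good that approximation can be.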
Summary and Takeaways
This paper uses elegant mathematical language to resolve the contradiction between “short-sighted prediction” and “global planning.”
- Unified view: autoregressive models (ARMs) and energy-based models (EBMs) are two sides of the same coin; an ARM is the manifestation of an EBM under temporal decomposition.
- Hidden planning ability: Next-Token Prediction is not just simple pattern matching. Given enough model capacity, the model learns a "soft value function" that encodes future information, achieving "long-term foresight" at each prediction step.
- Algorithmic confidence: this provides strong theoretical backing for the mainstream training paradigms we use today (Teacher Forcing, RLHF).
The next time you see GPT generate a brilliantly crafted long response, remember: it is not merely guessing the next word; at every step, it is weighing countless future possibilities and choosing the worldline with the highest energy.