First Try Matters: Revisiting the Role of Reflection in Reasoning Models


TL;DR

This paper reveals through large-scale quantitative analysis that the “reflection” step in current reasoning models mainly serves a confirmatory role rather than a corrective one. Performance gains come from higher first-attempt accuracy, and based on this finding, the paper proposes an early-stopping strategy that can significantly improve reasoning efficiency.

Key Definitions

The core analysis in this paper is built on a redefinition and quantification of “reflection.” As used throughout this summary, the key concepts are:

- Candidate answer: an answer to the question proposed at some point within the chain-of-thought, not only at its end.
- Reflection: the portion of the reasoning trace after the first candidate answer, in which the model re-examines, evaluates, and possibly revises that answer.
- Confirmatory vs. corrective reflection: a reflection step is confirmatory if it retains the current candidate answer, and corrective if it changes it.
- F→T (false-to-true): a corrective reflection that turns an incorrect candidate answer into a correct one; P(F→T) denotes how often this occurs.

Current state-of-the-art large language models (LLMs), especially reasoning models trained with reinforcement learning from verifiable rewards (RLVR), demonstrate strong reasoning capabilities. This is usually attributed to their ability to generate longer Chain-of-Thought (CoT) and perform so-called “reflective reasoning” — that is, after arriving at an initial answer, they continue to examine, evaluate, and revise their own reasoning path. The prevailing view in the field is that this reflection is the key mechanism by which models achieve self-correction and improve final answer accuracy.

However, existing research has not reached a consensus on the true role of reflection. Some studies argue that the reflection mechanism is complex and can prevent reasoning collapse, while others believe that reflection patterns are often superficial and do not improve outcomes. The key bottleneck in these studies is the lack of large-scale, systematic quantitative analysis of reflection behavior in reasoning models.

This paper aims to address this core question: Are the reflection steps in reasoning models actually performing effective self-correction, or are they merely confirming existing conclusions?

Method

This paper first designs an analytical framework to quantify reflection behavior, then investigates the role of reflection in training through controlled experiments, and finally proposes a method to improve reasoning efficiency based on the analysis results.

Quantitative Analysis of Reflection Behavior

To systematically study reflection, this paper designs an analysis framework: it locates every candidate answer within a model’s chain-of-thought, then classifies each subsequent reflection step by whether it confirms the current candidate answer or corrects it.

Figure: Distribution of first candidate answer positions across different LLMs and prompts.
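The position analysis above can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the step delimiter and the answer-detection pattern (a naive “the answer is …” / `\boxed{}` regex) are assumptions.

```python
import re

# Naive candidate-answer detector; the pattern is an illustrative assumption.
ANSWER_PAT = re.compile(r"(?:the answer is|\\boxed\{)\s*([^\s.}]+)")

def candidate_positions(cot: str):
    """Return (step_index, answer) for each step containing a candidate answer,
    plus the total number of steps. Steps are assumed to be blank-line separated."""
    steps = [s for s in cot.split("\n\n") if s.strip()]
    hits = []
    for i, step in enumerate(steps):
        m = ANSWER_PAT.search(step)
        if m:
            hits.append((i, m.group(1)))
    return hits, len(steps)

def first_candidate_fraction(cot: str):
    """Relative position of the first candidate answer in [0, 1], or None."""
    hits, n = candidate_positions(cot)
    if not hits:
        return None
    return hits[0][0] / max(n - 1, 1)

cot = "Compute 2+3.\n\nSo the answer is 5.\n\nWait, let me re-check.\n\nYes, the answer is 5."
hits, n = candidate_positions(cot)  # hits = [(1, '5'), (3, '5')], n = 4
```

Everything after the first hit is the “reflection” region; comparing consecutive extracted answers would then classify each later step as confirmatory (same answer) or corrective (changed answer).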

Investigating the Role of Reflection in Training

Based on the above analytical framework, this paper explores how the reflection characteristics in training data affect model performance through a series of supervised fine-tuning (SFT) experiments.

Figure: Comparison of performance and rollout length after SFT when training on rollouts cut at different positions.
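The cut-position setup can be sketched as follows, assuming each rollout is already split into steps with the first candidate answer located. The exact cutting rule is an assumption: here we keep the prefix up to the first candidate answer plus a fraction of the remaining (reflection) steps.

```python
def truncate_rollout(steps, first_answer_idx, keep: float):
    """Keep steps[:first_answer_idx + 1] plus a `keep` fraction of the
    post-answer (reflection) steps, to build SFT targets of varying length."""
    assert 0.0 <= keep <= 1.0
    tail = steps[first_answer_idx + 1:]
    n_keep = round(len(tail) * keep)
    return steps[:first_answer_idx + 1] + tail[:n_keep]

# Toy rollout: the first candidate answer appears at index 2.
steps = ["setup", "derive", "first answer",
         "recheck-1", "recheck-2", "recheck-3", "final answer"]
for keep in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(keep, truncate_rollout(steps, first_answer_idx=2, keep=keep))
```

Fine-tuning on each truncated variant then isolates how much of the reflection tail in the training data actually contributes to downstream accuracy.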

Early-Stopping Strategy for Efficient Reasoning

Based on the core finding that “reflection is mainly confirmatory,” this paper proposes a practical question-aware early-termination strategy: at inference time, generation can be stopped once a candidate answer has been produced, rather than continuing through largely confirmatory reflection.
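The basic stopping mechanism can be sketched as below. This is a simplified illustration, not the paper’s method: the streaming interface and the `\boxed{}` answer pattern are assumptions, and the paper’s full strategy is additionally question-aware (deciding per question whether early termination is safe), which this sketch omits.

```python
import re

# Stop once a complete candidate answer appears in the decoded text.
# The pattern is an illustrative assumption for math-style outputs.
ANSWER_PAT = re.compile(r"\\boxed\{[^}]*\}")

def generate_with_early_stop(token_stream, max_tokens: int = 4096) -> str:
    """Consume tokens from `token_stream`, stopping at the first candidate
    answer (or at `max_tokens`) instead of decoding the reflection tail."""
    out = []
    for tok in token_stream:
        out.append(tok)
        if ANSWER_PAT.search("".join(out)) or len(out) >= max_tokens:
            break
    return "".join(out)

# Stub stream standing in for a real decoder.
stream = iter(["Let's compute. ", "2+3=5, so ", r"\boxed{5}", ". Wait, ", "let me verify..."])
print(generate_with_early_stop(stream))
```

In a real serving stack this check would live inside the engine’s stopping criteria rather than re-scanning the full string per token; the point is only that decoding halts before the confirmatory reflection is generated.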

Experimental Conclusions

Reflection Behavior Analysis

Training Experiment Conclusions

| Model                | F→T Ratio | Average Tokens | Accuracy (%) | P(F→T) (%) |
|----------------------|-----------|----------------|--------------|------------|
| Llama3.1-8B-Instruct | 0%        | 7618           | 49.3         | 2.1        |
| Llama3.1-8B-Instruct | 25%       | 7512           | 48.7         | 2.2        |
| Llama3.1-8B-Instruct | 50%       | 7612           | 49.2         | 2.0        |
| Llama3.1-8B-Instruct | 75%       | 7500           | 48.2         | 1.8        |
| Llama3.1-8B-Instruct | 100%      | 7417           | 47.6         | 1.8        |
| Qwen2.5-7B-Instruct  | 0%        | 8391           | 54.4         | 1.9        |
| Qwen2.5-7B-Instruct  | 25%       | 8345           | 54.0         | 2.1        |
| Qwen2.5-7B-Instruct  | 50%       | 8452           | 53.9         | 2.0        |
| Qwen2.5-7B-Instruct  | 75%       | 8711           | 55.1         | 1.8        |
| Qwen2.5-7B-Instruct  | 100%      | 8421           | 53.4         | 1.9        |
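The table’s per-rollout metrics could be computed along the following lines. This is a sketch under stated assumptions: the `Rollout` record and its field names are hypothetical, and P(F→T) is taken here over all rollouts (whether the paper conditions it on a wrong first answer is not specified in this summary).

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    first_correct: bool   # was the first candidate answer correct?
    final_correct: bool   # was the final answer correct?
    n_tokens: int         # rollout length in tokens

def summarize(rollouts):
    """Aggregate accuracy, average length, and the false-to-true rate:
    the share of rollouts whose wrong first answer was later corrected."""
    n = len(rollouts)
    return {
        "accuracy": sum(r.final_correct for r in rollouts) / n,
        "avg_tokens": sum(r.n_tokens for r in rollouts) / n,
        "p_f_to_t": sum((not r.first_correct) and r.final_correct
                        for r in rollouts) / n,
    }
```

On toy data, `summarize([Rollout(True, True, 7000), Rollout(False, True, 8000), Rollout(False, False, 7600), Rollout(True, True, 7400)])` yields accuracy 0.75, 7500 average tokens, and P(F→T) of 0.25; in the table, the low P(F→T) values (≈2%) alongside flat accuracy are what support the “confirmatory, not corrective” conclusion.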

Final Conclusion

This systematic analysis overturns the common view that “reflection equals error correction.” The study shows that in current reasoning models, the core value of long-form reasoning lies in enhancing the model’s ability to “get it right on the first try” through the presentation of diverse reasoning paths, rather than in effective self-correction after making mistakes. Based on this insight, the question-aware early termination strategy proposed in this paper demonstrates that it is entirely feasible to greatly improve reasoning efficiency with almost no sacrifice in core reasoning ability. This points to a new direction for the design and optimization of future reasoning models: rather than pinning hopes on complex reflective error correction, it is better to focus on improving the accuracy and robustness of the model’s first reasoning attempt.