Gemma 2: Improving Open Language Models at a Practical Size


TL;DR

This article introduces the Gemma 2 family of open language models (2B, 9B, 27B). By interleaving local-global attention in the Transformer architecture, adopting grouped-query attention, and applying knowledge distillation to train the 2B and 9B models, these models achieve the best performance at their parameter scale and can even rival models 2-3 times larger.

Key Definitions

This article mainly combines and improves on existing techniques. Below are several core techniques that are crucial for understanding the method in this paper:

  1. Knowledge Distillation: A training strategy in which a smaller “student” model (such as Gemma 2’s 2B and 9B models) does not directly learn to predict the next token, but instead learns to imitate the output probability distribution of a larger, stronger “teacher” model. This provides the student model with richer gradient signals than standard one-hot labels, enabling better performance with the same amount of training data and simulating the effect of training on more data.
  2. Interleaving Local-Global Attention: A hybrid attention mechanism in which the Transformer layers alternate between two attention modes: one layer uses Sliding Window Attention, attending only to the most recent 4096 tokens, while the next uses Global Attention, which can attend to the entire 8192-token context. This design balances computational efficiency with the ability to capture long-range dependencies.
  3. Grouped-Query Attention (GQA): An attention variant that divides the query heads into groups, with each group sharing one set of Key and Value heads. In this paper the number of groups is set to 2 (\(\text{num\_groups} = 2\)), so the number of KV heads is half the number of query heads. This reduces memory usage and computation during inference while preserving model quality.
  4. Logit soft-capping: A training-stabilization technique that uses a \(\tanh\) function to constrain the logits of the attention layers and the final output layer within a preset threshold (50 for attention layers, 30 for the final layer): \(\text{logits} \leftarrow \text{soft\_cap} \times \tanh(\text{logits} / \text{soft\_cap})\).
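The soft-capping formula in definition 4 is simple enough to sketch directly. The following is a minimal illustration (not the authors' implementation) using numpy; the cap values 50 and 30 come from the paper, while the sample logits are made up for demonstration:

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    """Smoothly squash logits into the open interval (-cap, cap) via tanh."""
    return cap * np.tanh(logits / cap)

# Attention logits are capped at 50, final-layer logits at 30 (per the paper).
# Sample values are illustrative only.
attn_logits = np.array([-120.0, -10.0, 0.0, 10.0, 120.0])
capped = soft_cap(attn_logits, cap=50.0)
```

Note that for logits much smaller than the cap, \(\tanh(x/c) \approx x/c\), so the transformation is nearly the identity; only extreme logits are compressed, which is what stabilizes training without distorting typical scores.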

At present, performance improvements in small language models rely mainly on substantially increasing the amount of training data, but the returns diminish roughly logarithmically with data size. For example, the latest small models require as many as 15T tokens to achieve a modest 1-2% performance improvement, suggesting that existing small models are still under-trained.

The core problem this paper aims to solve is: how to find more effective ways to improve the performance of small language models without relying solely on massive increases in training data. The researchers explore replacing the traditional “next token prediction” task with richer training objectives, such as knowledge distillation, to provide the model with higher-quality information at each training step.

Method

The Gemma 2 model family is built on the decoder-only Transformer architecture of Gemma 1, but introduces several key architectural and training improvements.

Model Architecture

While retaining Gemma 1 features such as RoPE positional encoding and the GeGLU activation function, Gemma 2 introduces significant updates aimed at improving performance and efficiency.

The table below summarizes the key architectural parameters of Gemma 2 models at different sizes:

| Parameter              | 2B     | 9B     | 27B    |
|------------------------|--------|--------|--------|
| d_model                | 2304   | 3584   | 4608   |
| Number of layers       | 26     | 42     | 46     |
| Pre-norm               | Yes    | Yes    | Yes    |
| Post-norm              | Yes    | Yes    | Yes    |
| Nonlinearity           | GeGLU  | GeGLU  | GeGLU  |
| Feed-forward dimension | 18432  | 28672  | 73728  |
| Attention head type    | GQA    | GQA    | GQA    |
| Number of query heads  | 8      | 16     | 32     |
| Number of KV heads     | 4      | 8      | 16     |
| Head size              | 256    | 256    | 128    |
| Global attention range | 8192   | 8192   | 8192   |
| Sliding window size    | 4096   | 4096   | 4096   |
| Vocabulary size        | 256128 | 256128 | 256128 |
| Tied word embeddings   | Yes    | Yes    | Yes    |
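Two of the table's architectural choices, GQA (KV heads = half the query heads) and the sliding-window layers, can be sketched together. The following toy numpy example is an illustration only, not Gemma 2's actual implementation: it uses small stand-in dimensions (4 query heads, 2 KV heads, window 3) in place of the real values (e.g. 8/4 heads and a 4096-token window for the 2B model):

```python
import numpy as np

def repeat_kv(kv: np.ndarray, num_query_heads: int) -> np.ndarray:
    """Expand grouped KV heads so each query head sees its group's K/V.

    kv: (num_kv_heads, seq, head_dim). With half as many KV heads as
    query heads, each KV head serves a group of 2 query heads.
    """
    num_kv_heads = kv.shape[0]
    return np.repeat(kv, num_query_heads // num_kv_heads, axis=0)

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal mask where position i attends only to positions (i - window, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def attention(q, k, v, mask):
    """Plain scaled dot-product attention per head (toy sketch, no batching)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy dimensions standing in for the real ones in the table above.
rng = np.random.default_rng(0)
seq, head_dim, n_q, n_kv, window = 8, 4, 4, 2, 3
q = rng.normal(size=(n_q, seq, head_dim))
k = rng.normal(size=(n_kv, seq, head_dim))
v = rng.normal(size=(n_kv, seq, head_dim))

# A "local" layer restricts attention to the sliding window; a "global"
# layer is the same computation with a full causal mask (window = seq).
local_out = attention(q, repeat_kv(k, n_q), repeat_kv(v, n_q),
                      sliding_window_mask(seq, window))
global_out = attention(q, repeat_kv(k, n_q), repeat_kv(v, n_q),
                       sliding_window_mask(seq, seq))
```

Interleaving then simply alternates these two mask types across layers: local layers keep the KV cache and compute cost bounded by the window size, while the global layers preserve access to the full 8192-token context.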

Pretraining

Gemma 2’s pretraining differs from Gemma 1 in several key aspects. Most notably, the 2B and 9B models are trained via knowledge distillation from a larger teacher, minimizing the cross-entropy between the teacher’s and the student’s next-token distributions:

\[\min_{P_{S}}\sum_{x}-P_{T}(x \mid x_{c}) \log P_{S}(x \mid x_{c})\]

where $P_{S}$ is the student model’s probability distribution, $P_{T}$ is the teacher model’s probability distribution, and $x_c$ is the context. This method was used to “simulate training beyond the available number of tokens.” The 27B model still uses the traditional from-scratch training approach.
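The distillation objective above can be sketched in a few lines. The following numpy illustration is a simplification (assumed shapes, toy logits, no batching over sequences), not the paper's training code:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray) -> float:
    """Cross-entropy of the student against the teacher's full distribution:
    sum_x -P_T(x | x_c) * log P_S(x | x_c), averaged over positions."""
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits))
    return float(-(p_t * log_p_s).sum(axis=-1).mean())

# Toy logits over a 5-token vocabulary at 3 context positions (illustrative).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(3, 5))
student = rng.normal(size=(3, 5))

loss = distillation_loss(student, teacher)
# The loss is minimized when the student matches the teacher exactly,
# at which point it equals the teacher's entropy.
self_loss = distillation_loss(teacher, teacher)
```

This also shows why distillation is a richer signal than next-token prediction: the gradient at each position is informed by the teacher's entire probability distribution over the vocabulary, not a single one-hot target.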

Post-training

To obtain instruction-tuned models, the paper applies a series of post-training steps to the pretrained models.

Experimental Conclusions

Through extensive ablation studies and benchmark evaluations, this paper validates the advantages of Gemma 2 in both architecture and training methods.

Core Experimental Findings

The table below shows the performance of the pretrained models on several core benchmarks:

| Benchmark | LLaMA-3 70B | Qwen1.5 32B | Gemma-2 27B |
|-----------|-------------|-------------|-------------|
| MMLU      | 79.2        | 74.3        | 75.2        |
| GSM8K     | 76.9        | 61.1        | 74.0        |
| ARC-c     | 68.8        | 63.6        | 71.4        |
| HellaSwag | 88.0        | 85.0        | 86.4        |

Post-training Model Performance

(Figure omitted: comparison of post-trained model performance.)

Final Conclusion

Through architectural improvements (such as interleaved attention) and innovative training methods (especially the large-scale use of knowledge distillation), Gemma 2 successfully delivers a substantial boost in overall model capability without significantly increasing model size. The experimental results show that Gemma 2 not only leads comparable open models on automated benchmarks, but also demonstrates strong competitiveness in human evaluations that reflect real-world applications, providing a powerful new tool for building practical, efficient, and responsible AI applications.