BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data


TL;DR

This paper introduces BabyBabelLM, a multilingual benchmark of developmentally plausible training data covering 45 languages. By simulating the human language-acquisition environment, it aims to advance cross-lingual research on data-efficient and cognitively plausible language models.

Background and Motivation

The mainstream trend in current language model research is to pursue scale, which has led to two key problems: first, data efficiency is neglected, making model training expensive; second, the gap between how models learn and how humans acquire language is growing wider, as humans can master their native language with fewer than 100 million words, whereas large models require trillions of words.

In response, research such as the BabyLM Challenge has begun to focus on data efficiency and cognitive plausibility, but most of this work is limited to English. Although there have been scattered studies for languages such as French, German, and Japanese, they lack unified, comparable standards and datasets.

The core problem this paper addresses is that there is currently no standardized, multilingual, developmentally plausible training and evaluation framework. By building BabyBabelLM, the paper provides key infrastructure for studying how data-efficient language models that learn more like humans acquire typologically diverse languages.

Method

The core contribution of this paper is the creation of the BabyBabelLM benchmark, whose construction process and components are as follows.

Dataset Construction

Innovations

The innovation of this method lies in the systematic, principled, and scalable construction of a multilingual, developmentally plausible dataset, in contrast to earlier fragmented, single-language efforts.

Dataset Composition

  1. Data Categories: To simulate the diverse language input children receive, the dataset includes the following types:
    • Transcription: Mainly from the CHILDES database, consisting of child-directed speech (CDS), characterized by short sentences, simple structure, and high repetition. It also includes some adult-to-adult conversations.
    • Education: Materials from textbooks and exams, providing more direct instructional content.
    • Books, Wiki, News: Children’s books, children’s Wikipedia, and similar sources, providing longer, more complex sentences and richer vocabulary.
    • Subtitles: Subtitles from films and TV shows suitable for children, serving as an approximation of natural spoken language.
    • Padding: Filtered general corpora such as OpenSubtitles, used to top each language up to its tier’s target data volume.
  2. Language Coverage and Tiers:
    • Covers 45 languages across multiple language families, including Indo-European, Semitic, and Bantu, ensuring linguistic diversity.
    • The languages are divided into three tiers based on data volume (Tier 1/2/3), corresponding to about 100 million/10 million/1 million equivalent English words, enabling fair cross-lingual comparison.


  3. Data Preprocessing: This includes language-specific initial processing and a unified standardization pipeline (such as Unicode normalization and whitespace/punctuation normalization), along with language and script verification using GlotLID v3 to ensure data quality.
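As a sketch of what such a standardization pipeline might look like (the function names, cleanup rules, and score threshold below are illustrative assumptions, not the paper's exact pipeline; the GlotLID v3 classifier itself is not reproduced here, only its confidence score is consumed):

```python
import re
import unicodedata

def normalize_text(line: str) -> str:
    """Illustrative standardization step: Unicode NFC normalization plus
    whitespace and punctuation cleanup (rules are assumptions, not the
    paper's exact pipeline)."""
    line = unicodedata.normalize("NFC", line)
    line = re.sub(r"\s+", " ", line).strip()          # collapse all whitespace
    line = re.sub(r"([!?.,;:])\1{2,}", r"\1", line)   # squash runs like "!!!!"
    return line

def keep_line(line: str, lid_score: float, threshold: float = 0.9) -> bool:
    """Keep a line only if it is non-empty after normalization and the
    language-ID classifier (e.g. GlotLID v3) scored it confidently;
    `lid_score` stands in for the classifier's output."""
    return bool(normalize_text(line)) and lid_score >= threshold
```

Running every document through a shared normalizer like this, after any language-specific preprocessing, is what makes the per-language subcorpora comparable.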

Evaluation Suite

The paper builds a multilingual evaluation suite to assess both formal competence (grammatical knowledge, measured with minimal-pair and probing tasks such as MultiBLiMP and linguistic probes) and functional competence (understanding and reasoning, measured with tasks such as Belebele, XNLI, and XCOPA).
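Formal competence in BLiMP-style suites is typically scored with minimal pairs: the model should assign higher probability to the grammatical member of each pair. A minimal sketch of that protocol, where the unigram scorer is a hypothetical stand-in for a trained language model's log-probability:

```python
import math
from typing import Callable, Iterable, Tuple

def minimal_pair_accuracy(pairs: Iterable[Tuple[str, str]],
                          score: Callable[[str], float]) -> float:
    """Fraction of (grammatical, ungrammatical) pairs for which the
    scorer prefers the grammatical sentence; `score` would normally be
    a language model's total log-probability for the sentence."""
    pairs = list(pairs)
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy stand-in scorer: unigram log-probabilities estimated from a tiny
# "corpus". Purely illustrative -- not one of the paper's models.
corpus = "the cat sits on the mat the dog sits".split()
freq = {w: corpus.count(w) / len(corpus) for w in set(corpus)}

def unigram_logprob(sentence: str) -> float:
    return sum(math.log(freq.get(w, 1e-6)) for w in sentence.split())

pairs = [
    ("the cat sits", "the cat sit"),    # "sit" is out-of-vocabulary
    ("the dog sits", "the dog sitz"),   # "sitz" is out-of-vocabulary
]
acc = minimal_pair_accuracy(pairs, unigram_logprob)  # 1.0 on this toy data
```

Because the metric only compares scores within a pair, it needs no task-specific fine-tuning, which is what makes it suitable for small, developmentally plausible models.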

Baseline Models

To provide a starting point for subsequent research, this paper trained a series of baseline models on the BabyBabelLM data in monolingual, bilingual (paired with English), and multilingual configurations.
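As a hedged back-of-envelope for sizing models against these data budgets, one can relate each tier's budget (~100M/10M/1M English-equivalent words, as given above) to a parameter count via the Chinchilla-style ~20-tokens-per-parameter heuristic; the tokens-per-word ratio and the heuristic itself are illustrative assumptions, not the paper's actual baseline settings:

```python
# Tier data budgets from the dataset description (English-equivalent words).
TIER_WORDS = {1: 100_000_000, 2: 10_000_000, 3: 1_000_000}

def suggested_params(tier: int,
                     tokens_per_word: float = 1.3,
                     tokens_per_param: float = 20.0) -> int:
    """Rough compute-optimal parameter count for one tier's data budget,
    using the Chinchilla ~20-tokens-per-parameter rule of thumb. Both
    ratios are assumptions, not the paper's baseline configuration."""
    tokens = TIER_WORDS[tier] * tokens_per_word
    return int(tokens / tokens_per_param)
```

Under these assumptions even the Tier 1 budget supports only a few million compute-optimal parameters, underscoring how far these training regimes sit from trillion-word pretraining.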

Experimental Conclusions

This paper evaluated the trained baseline models; the main conclusions are summarized in the figures and table below.

Figure: Comparison of multilingual models, monolingual models, and Qwen3-0.6B on MultiBLiMP and Belebele.

Figure: Impact of bilingual training (with English added) on performance across evaluation tasks.

The table below shows the average accuracy of the monolingual models across tasks.

[Table: average accuracy of the monolingual models across tasks. Columns: formal competence (MultiBLiMP, linguistic probes) and functional competence, reported both after fine-tuning and zero-shot (Belebele, XNLI, MMLU, SIB-200, ARC-c, XCOPA, TQA, XStoryCloze, HellaSwag, Winogrande, XCOMPS). Random baselines: 50.0, 50.0, 25.0, 33.3, 25.0, 25.0, 25.0, 50.0, 50.0, 50.0, 25.0, 50.0, 50.0. Rows cover 45 languages by tier. Tier 1: Bulgarian, Chinese, Dutch, English, French, German, Indonesian, Persian, Ukrainian. Tier 2: Afrikaans, Arabic, Basque, Estonian, Greek, Hebrew, Italian, Japanese, Polish, Portuguese, Serbian, Spanish, Swedish, Welsh, Yue Chinese. Tier 3: Achinese, Balinese, Buginese, Croatian, Czech, Danish, Hungarian, Icelandic, Javanese, Korean, Makasar, Minangkabau, Norwegian, Sepedi, Romanian, Russian, Sesotho, Sundanese, Turkish, isiXhosa, isiZulu.]