Data-Efficient and Developmentally Plausible Language Models for Chinese
Over the past several years, large language models (LLMs) have achieved remarkable success, largely driven by scaling up both model parameters and training data. However, this data-hungry paradigm stands in stark contrast to human language acquisition: a typical child has been exposed to fewer than 100 million words of linguistic input by age 13, yet achieves robust linguistic competence that current LLMs struggle to match in many respects.
The BabyLM Challenge, first launched in 2023 and now in its fourth year at EMNLP 2026, has been a highly influential effort to incentivize research on sample-efficient pretraining under cognitively inspired data budgets. However, the challenge has predominantly focused on English, with multilingual evaluation only recently introduced. Chinese BabyLM is the first shared task dedicated to sample-efficient pretraining for Chinese, co-located with NLPCC 2026.
Chinese presents unique and compelling challenges for data-efficient language modeling. Its logographic writing system, lack of explicit word boundaries, rich morphological compounding, and flexible syntactic structures make it a particularly interesting testbed for studying how well models can learn from limited data.
Chinese BabyLM features three evaluation tracks, each targeting a different dimension of Chinese linguistic competence:

- **Linguistic evaluation.** Models are evaluated on natural language understanding tasks drawn from the CLUE benchmarks and ZhoBLiMP, assessing syntactic and semantic abilities in Chinese.
- **Cognitive alignment.** Models are evaluated with MulCogBench, measuring how well their representations align with human cognitive signals across behavioral and neural modalities, from the word to the discourse level.
- **Character-level knowledge.** Models are evaluated on PinyinBench and HanziBench, which test phonological and structural properties unique to the Chinese writing system via minimal-pair comparisons.
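Minimal-pair evaluation typically works by comparing the score a model assigns to an acceptable string against the score it assigns to a minimally perturbed counterpart, crediting the model when it prefers the acceptable form. A minimal sketch of that comparison, assuming a participant-supplied `logprob` function; the `toy_logprob` scorer and the example pair below are purely illustrative, not the actual PinyinBench/HanziBench format:

```python
import math
from collections import Counter

def minimal_pair_accuracy(logprob, pairs):
    """Fraction of pairs where the model assigns a higher
    log-probability to the acceptable member of the pair."""
    correct = sum(logprob(good) > logprob(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy add-one-smoothed character-unigram scorer, standing in for a
# participant's real model (illustrative only, not a baseline).
corpus = "你好 你们好 他们好 我们好"
freq = Counter(corpus)
total = sum(freq.values())

def toy_logprob(sentence):
    return sum(math.log((freq[ch] + 1) / (total + len(freq)))
               for ch in sentence)

# Hypothetical minimal pair: the "bad" variant repeats a character.
pairs = [("我们好", "我们们")]
accuracy = minimal_pair_accuracy(toy_logprob, pairs)
```

In practice `logprob` would be the submitted model's sentence-level log-likelihood; the accuracy over all pairs is then the track metric.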
There are no restrictions on model architecture: Transformer encoders, decoders, encoder-decoder models, state-space models, and novel designs are all welcome. There is no limit on the number of training epochs.
The organizers will provide an official training corpus building upon the Chinese portion of the BabyBabelLM dataset, composed of naturalistic and developmentally relevant sources. Participants may also construct their own corpus, provided it does not exceed the 100-million-word budget.
| Category | Description | Tokens | Sources |
|---|---|---|---|
| Child-Available Speech | Transcriptions of speech available to children in daily life | 7.4M | NaturalConv, ChildMandarin |
| Children's Books | Stories from children's storybooks and reading comprehension datasets | 16.0M | Quangushi, GlotStoryBooks, CFT, CMRC-2019 |
| Child-Directed Speech | Transcriptions of speech directed at children | 9.6M | CHILDES, ChildMandarin |
| Child Wiki | Age-appropriate non-fiction from WikiJunior and Wikibooks | 25k | WikiJunior, Wikibooks |
| Educational | Exam questions, grammar exercises, and student compositions | 13.5M | GAOKAO, CK-12, CSQ, FCGEC |
| Subtitles | Movie and TV subtitles reflecting everyday spoken language | 91.3M | WenetSpeech |
| Total | | 137.8M | |
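Assuming the per-category figures in the table are in millions of tokens, they can be cross-checked against the stated total with a few lines:

```python
# Per-category token counts, in millions, copied from the table.
token_counts_m = {
    "Child-Available Speech": 7.4,
    "Children's Books": 16.0,
    "Child-Directed Speech": 9.6,
    "Child Wiki": 0.025,  # 25k tokens
    "Educational": 13.5,
    "Subtitles": 91.3,
}
total_m = sum(token_counts_m.values())  # 137.825, i.e. the table's rounded 137.8M
```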
A two-phase evaluation protocol is designed to ensure transparency and robustness of results.

**Phase 1 (open evaluation).** An open-source evaluation pipeline and preliminary test data will be distributed when the task guidelines are released. Teams evaluate their models locally, and a public leaderboard on Hugging Face will track progress during development.

**Phase 2 (hidden evaluation).** After the model submission deadline, the organizers will release held-out test data not previously available to participants. Teams must evaluate their final submitted models on this hidden test set and submit the results.

A team's final score is the average of the Phase 1 (open) and Phase 2 (hidden) scores. This design discourages overfitting to the public benchmark while still providing meaningful development feedback.
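The combined score is a simple mean of the two phases; a one-line sketch with illustrative values (the 0.82/0.78 figures are hypothetical):

```python
def final_score(phase1_open: float, phase2_hidden: float) -> float:
    """Average of the open (Phase 1) and hidden (Phase 2) scores."""
    return (phase1_open + phase2_hidden) / 2

combined = final_score(0.82, 0.78)  # mean of the two phase scores
```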
The top 3 teams in each track must submit full training data, all code (preprocessing, training, evaluation), and trained model weights. Organizers will independently verify results.
All final models must be uploaded to Hugging Face by the model submission deadline. Detailed instructions for model formatting and upload will be provided in the task guidelines.
- Task announcement released; official website launched; registration opens.
- Detailed task guidelines released; training data and open evaluation tasks distributed.
- Baseline models released; Hugging Face leaderboard goes live.
- Last day to register your team.
- Model submission deadline; hidden test set and full evaluation pipeline released.
- Final results on the hidden test set due.
- Winners announced and final leaderboard published.
For questions and inquiries, please contact:
chinese.babylm@gmail.com