Data-Efficient and Developmentally Plausible Language Models for Chinese
Over the past several years, large language models (LLMs) have achieved remarkable success, largely driven by scaling up both model parameters and training data. However, this data-hungry paradigm stands in stark contrast to human language acquisition: a typical child has been exposed to fewer than 100 million words of linguistic input by age 13, yet achieves robust linguistic competence that current LLMs struggle to match in many respects.
The BabyLM Challenge, first launched in 2023 and now in its fourth year at EMNLP 2026, has been a highly influential effort to incentivize research on sample-efficient pretraining under cognitively inspired data budgets. However, the challenge has predominantly focused on English, with multilingual evaluation only recently introduced. Chinese BabyLM is the first shared task dedicated to sample-efficient pretraining for Chinese, co-located with NLPCC 2026.
Chinese presents unique and compelling challenges for data-efficient language modeling. Its logographic writing system, lack of explicit word boundaries, rich morphological compounding, and flexible syntactic structures make it a particularly interesting testbed for studying how well models can learn from limited data.
Chinese BabyLM features three evaluation tracks, each targeting a different dimension of Chinese linguistic competence:

- **Linguistic evaluation.** Models are evaluated on natural language understanding tasks drawn from the CLUE benchmarks and ZhoBLiMP, assessing syntactic and semantic abilities in Chinese.
- **Cognitive alignment.** Models are evaluated with MulCogBench, measuring how well their representations align with human cognitive signals across behavioral and neural modalities, from the word to the discourse level.
- **Character-level knowledge.** Models are evaluated on PinyinBench and HanziBench, which test phonological and structural properties unique to the Chinese writing system via minimal-pair comparisons.
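Minimal-pair evaluation typically works by comparing the score a model assigns to an acceptable string against the score it assigns to a minimally perturbed counterpart, crediting the model when it prefers the acceptable form. A minimal sketch of that comparison, assuming a participant-supplied `logprob` function; the `toy_logprob` scorer and the example pair below are purely illustrative, not the actual PinyinBench/HanziBench format:

```python
import math
from collections import Counter

def minimal_pair_accuracy(logprob, pairs):
    """Fraction of pairs where the model assigns a higher
    log-probability to the acceptable member of the pair."""
    correct = sum(logprob(good) > logprob(bad) for good, bad in pairs)
    return correct / len(pairs)

# Toy add-one-smoothed character-unigram scorer, standing in for a
# participant's real model (illustrative only, not a baseline).
corpus = "你好 你们好 他们好 我们好"
freq = Counter(corpus)
total = sum(freq.values())

def toy_logprob(sentence):
    return sum(math.log((freq[ch] + 1) / (total + len(freq)))
               for ch in sentence)

# Hypothetical minimal pair: the "bad" variant repeats a character.
pairs = [("我们好", "我们们")]
accuracy = minimal_pair_accuracy(toy_logprob, pairs)
```

In practice `logprob` would be the submitted model's sentence-level log-likelihood; the accuracy over all pairs is then the track metric.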
There are no restrictions on model architecture: Transformer encoders, decoders, encoder-decoder models, state-space models, and novel designs are all welcome. There is no limit on the number of training epochs.
The organizers will provide an official training corpus building upon the Chinese portion of the BabyBabelLM dataset, composed of naturalistic and developmentally relevant sources. Participants may also construct their own corpus, provided it does not exceed the 100-million-word budget.
| Category | Description | Tokens | Sources |
|---|---|---|---|
| Child-Available Speech | Transcriptions of speech available to children in daily life | 7.4M | NaturalConv, ChildMandarin |
| Children's Books | Stories from children's storybooks and reading comprehension datasets | 16.0M | Quangushi, GlotStoryBooks, CFT, CMRC-2019 |
| Child-Directed Speech | Transcriptions of speech directed at children | 9.6M | CHILDES, ChildMandarin |
| Child Wiki | Age-appropriate non-fiction from WikiJunior and Wikibooks | 25k | WikiJunior, Wikibooks |
| Educational | Exam questions, grammar exercises, and student compositions | 13.5M | GAOKAO, CK-12, CSQ, FCGEC |
| Subtitles | Movie and TV subtitles reflecting everyday spoken language | 91.3M | WenetSpeech |
| Total | | 137.8M | |
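Assuming the per-category figures in the table are in millions of tokens, they can be cross-checked against the stated total with a few lines:

```python
# Per-category token counts, in millions, copied from the table.
token_counts_m = {
    "Child-Available Speech": 7.4,
    "Children's Books": 16.0,
    "Child-Directed Speech": 9.6,
    "Child Wiki": 0.025,  # 25k tokens
    "Educational": 13.5,
    "Subtitles": 91.3,
}
total_m = sum(token_counts_m.values())  # 137.825, i.e. the table's rounded 137.8M
```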
A two-phase evaluation protocol is designed to ensure transparency and robustness of results.

**Phase 1 (open evaluation).** An open-source evaluation pipeline and preliminary test data will be distributed when the task guidelines are released. Teams evaluate their models locally, and a public leaderboard on Hugging Face will track progress during development.

**Phase 2 (hidden evaluation).** After the model submission deadline, the organizers will release held-out test data not previously available to participants. Teams must evaluate their final submitted models on this hidden test set and submit the results.

A team's final score is the average of the Phase 1 (open) and Phase 2 (hidden) scores. This design discourages overfitting to the public benchmark while still providing meaningful development feedback.
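The combined score is a simple mean of the two phases; a one-line sketch with illustrative values (the 0.82/0.78 figures are hypothetical):

```python
def final_score(phase1_open: float, phase2_hidden: float) -> float:
    """Average of the open (Phase 1) and hidden (Phase 2) scores."""
    return (phase1_open + phase2_hidden) / 2

combined = final_score(0.82, 0.78)  # mean of the two phase scores
```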
The top 3 teams in each track must submit full training data, all code (preprocessing, training, evaluation), and trained model weights. Organizers will independently verify results.
All final models must be uploaded to Hugging Face by the model submission deadline. Detailed instructions for model formatting and upload will be provided in the task guidelines.
- Task announcement released; official website launched; registration opens.
- Detailed task guidelines released; training data and open evaluation tasks distributed.
- Baseline models released; Hugging Face leaderboard goes live.
- Last day to register your team.
- Model submission deadline; hidden test set and full evaluation pipeline released.
- Final results on the hidden test set due.
- Winners announced and final leaderboard published.
For questions and inquiries, please contact:
chinese.babylm@gmail.com