Shared Task @ NLPCC 2026

Chinese BabyLM Challenge

Data-Efficient and Developmentally Plausible Language Models for Chinese

Overview

Over the past several years, large language models (LLMs) have achieved remarkable success, largely driven by scaling up both model parameters and training data. However, this data-hungry paradigm stands in stark contrast to human language acquisition: a typical child has been exposed to fewer than 100 million words of linguistic input by age 13, yet achieves robust linguistic competence that current LLMs struggle to match in many respects.

The BabyLM Challenge, first launched in 2023 and now in its fourth year at EMNLP 2026, has been a highly influential effort to incentivize research on sample-efficient pretraining under cognitively inspired data budgets. However, the challenge has predominantly focused on English, with multilingual evaluation only recently introduced. Chinese BabyLM is the first shared task dedicated to sample-efficient pretraining for Chinese, co-located with NLPCC 2026.

Chinese presents unique and compelling challenges for data-efficient language modeling. Its logographic writing system, lack of explicit word boundaries, rich morphological compounding, and flexible syntactic structures make it a particularly interesting testbed for studying how well models can learn from limited data.

Our Goals

  • 📚 Provide standardized resources and evaluation protocols for studying Chinese language learning under limited input, making pretraining research feasible on a university-level compute budget.
  • 🧠 Foster a research community centered on cognitive and computational models of Chinese language acquisition, bringing together researchers across NLP, linguistics, and cognitive science.
  • 🌐 Build strong connections with the global BabyLM community to enable cross-linguistic comparison and collaboration, contributing Chinese-specific insights to the broader understanding of data-efficient language modeling.
  • 🌏 Lay the groundwork for non-English monolingual BabyLM challenges, complementing the multilingual track by demonstrating the value of dedicated, language-specific shared tasks.

Evaluation Tracks

Chinese BabyLM features three evaluation tracks, each targeting a different dimension of Chinese linguistic competence.

📖 NLU Track

Models are evaluated on natural language understanding tasks drawn from the CLUE benchmark suite and ZhoBLiMP, assessing syntactic and semantic competence in Chinese.

🧠 Cognitive Modeling Track

Models are evaluated using MulCogBench, measuring how well model representations align with human cognitive signals across behavioral and neural modalities, from the word to the discourse level.
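
MulCogBench defines its own protocol, but the core idea of cognitive alignment can be illustrated with a generic encoding analysis: fit a cross-validated ridge regression from model hidden states to a per-word cognitive signal (e.g., gaze duration) and report held-out correlation. The sketch below is a minimal illustration with synthetic data, not the official pipeline; the array shapes and the use of scikit-learn are our assumptions.

```python
# Minimal sketch of an encoding-style alignment analysis (NOT the official
# MulCogBench pipeline): predict a per-word cognitive signal from model
# hidden states with cross-validated ridge regression.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))  # hypothetical: one hidden-state row per word
y = rng.standard_normal(500)         # hypothetical: e.g., per-word gaze durations

scores = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X[train], y[train])
    pred = model.predict(X[test])
    scores.append(np.corrcoef(pred, y[test])[0, 1])  # held-out Pearson r

print(f"mean held-out correlation: {np.mean(scores):.3f}")
```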

HANZI Track

Models are evaluated on character-level knowledge through PinyinBench and HanziBench, testing phonological and structural properties unique to the Chinese writing system via minimal pair comparisons.
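
PinyinBench and HanziBench have not yet been released, but minimal-pair evaluation (as in ZhoBLiMP) typically reduces to checking whether a model assigns a higher log-probability to the acceptable member of each pair. A minimal sketch with Hugging Face transformers, assuming a causal LM; the model name and the example pair are hypothetical placeholders:

```python
# Sketch of minimal-pair scoring with a causal LM: the model "passes" a pair
# if it assigns a higher total log-probability to the acceptable sentence.
# Model name and example pair are illustrative, not from the benchmarks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-team/chinese-babylm")  # hypothetical repo
lm = AutoModelForCausalLM.from_pretrained("your-team/chinese-babylm").eval()

@torch.no_grad()
def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # Shift so each token is predicted from its left context.
    logp = torch.log_softmax(lm(ids).logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return logp.gather(-1, target.unsqueeze(-1)).sum().item()

good, bad = "他把书放在桌子上。", "他把书放桌子在上。"  # illustrative pair
print(sentence_logprob(good) > sentence_logprob(bad))
```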

ℹ️ There are no restrictions on model architecture: Transformer encoders, decoders, encoder-decoder models, state-space models, and novel designs are all welcome. There is also no limit on the number of training epochs.

Training Data

The organizers will provide an official training corpus built on the Chinese portion of the BabyBabelLM dataset and drawn from naturalistic, developmentally relevant sources. Participants may also construct their own corpus, provided it does not exceed the 100-million-word budget.

Category               | Description                                                            | Tokens  | Sources
-----------------------|------------------------------------------------------------------------|---------|-------------------------------------------
Child-Available Speech | Transcriptions of speech available to children in daily life           | 7.4M    | NaturalConv, ChildMandarin
Children's Books       | Stories from children's storybooks and reading comprehension datasets  | 16.0M   | Quangushi, GlotStoryBooks, CFT, CMRC-2019
Child-Directed Speech  | Transcriptions of speech directed at children                          | 9.6M    | CHILDES, ChildMandarin
Child Wiki             | Age-appropriate non-fiction from WikiJunior and Wikibooks              | 25k     | WikiJunior, Wikibooks
Educational            | Exam questions, grammar exercises, and student compositions            | 13.5M   | GAOKAO, CK-12, CSQ, FCGEC
Subtitles              | Movie and TV subtitles reflecting everyday spoken language             | 91.3M   | WenetSpeech
Total                  |                                                                        | 137.8M  |
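
Whether the budget is counted in words, characters, or tokenizer tokens should be settled by the task guidelines; in the meantime, a rough budget check for a self-constructed corpus might count both segmented words and raw characters. The sketch below uses jieba for word segmentation (our choice, not an official requirement) and a hypothetical corpus layout:

```python
# Rough budget check for a custom corpus. Assumption: the guidelines will fix
# the exact counting unit; this counts both jieba-segmented words and characters.
import glob
import jieba

word_count = 0
char_count = 0
for path in glob.glob("corpus/*.txt"):  # hypothetical corpus layout
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            word_count += len(jieba.lcut(line))
            char_count += len(line)

print(f"words: {word_count:,}  characters: {char_count:,}")
assert word_count <= 100_000_000, "over the 100M-word budget"
```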

Evaluation Pipeline

A two-phase evaluation protocol designed to ensure transparency and robustness of results.

Phase 1: Open Evaluation

An open-source evaluation pipeline and preliminary test data will be distributed upon release of the task guidelines. Teams evaluate their models locally, and a public leaderboard on Hugging Face will track progress during development.

Phase 2: Hidden Evaluation

After the model submission deadline, the organizers will release held-out test data not previously available to participants. Teams must evaluate their final submitted models on this hidden test set and submit results.

Final Scoring

A team's final score is the average of Phase 1 (open) and Phase 2 (hidden) evaluation scores. This design discourages overfitting while providing meaningful development feedback.
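
In other words, assuming scores on a common scale, the combination is a plain unweighted mean per track (exact cross-track aggregation is presumably left to the guidelines). With illustrative numbers:

```python
# Final score = mean of open-phase and hidden-phase scores, per track.
# All numbers are made up for illustration.
phase1_open = {"NLU": 71.2, "Cognitive": 64.5, "HANZI": 58.9}
phase2_hidden = {"NLU": 68.4, "Cognitive": 61.0, "HANZI": 60.3}

final = {t: (phase1_open[t] + phase2_hidden[t]) / 2 for t in phase1_open}
print(final)  # e.g. {'NLU': 69.8, 'Cognitive': 62.75, 'HANZI': 59.6} (up to float rounding)
```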

Reproducibility

The top 3 teams in each track must submit full training data, all code (preprocessing, training, evaluation), and trained model weights. Organizers will independently verify results.

Model Submission

All final models must be uploaded to Hugging Face by the model submission deadline. Detailed instructions for model formatting and upload will be provided in the task guidelines.
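
Pending those instructions, uploads typically go through the huggingface_hub library; a minimal sketch, assuming a standard transformers checkpoint directory and a placeholder repository name:

```python
# Sketch of a Hugging Face model upload (details pending official guidelines).
# Repo name and local path are hypothetical placeholders.
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in, e.g. via `huggingface-cli login`
api.create_repo("your-team/chinese-babylm-submission", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="checkpoints/final",  # local model directory
    repo_id="your-team/chinese-babylm-submission",
    repo_type="model",
)
```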

Timeline

March 20, 2026

Task Announcement

Task announcement released; official website launched; registration opens.

April 15, 2026

Guidelines & Data Release

Detailed task guidelines released; training data and open evaluation tasks distributed.

April 22, 2026

Baselines & Leaderboard

Baseline models released; Hugging Face leaderboard goes live.

May 25, 2026

Registration Deadline

Last day to register your team.

June 11, 2026

Model Submission Deadline

Final models due on Hugging Face; hidden test set and full evaluation pipeline released.

June 20, 2026

Results Submission

Final results on the hidden test set due.

June 30, 2026

Results Announced

Winners announced and final leaderboard published.

Organizers

Hai Hu City University of Hong Kong
Siyuan Song The University of Texas at Austin
Linyang He Columbia University
Shaonan Wang The Hong Kong Polytechnic University
Yunhao Zhang Institute of Automation, Chinese Academy of Sciences
Rui Wang Shanghai Jiao Tong University
Luan Li Shanghai Jiao Tong University
Zhiheng Qian Shanghai Jiao Tong University
Hong'ao Zhu University of California San Diego
Renfen Hu Beijing Normal University
Xiaozhe Ji Beijing Normal University
Yingxin Lin Tsinghua University

For questions and inquiries, please contact:

chinese.babylm@gmail.com