Participation Guidelines

Rules for model training, data usage, evaluation, and submission.

Tracks

Each team submits a single model. The three tracks differ only in which evaluation tasks are run against that model — unlike the English BabyLM Challenge, there are no separate training-setting requirements per track.

The overall score is the total across all evaluation tasks (both open and hidden). Per-track scores are the total within that track's tasks.

Pretraining Data

  1. Pretrain from scratch. Models must be trained from randomly initialized weights. Loading pretrained checkpoints or distilling from existing large models is not allowed.
  2. Choose one of two data options:
  3. No evaluation leakage. Fine-tuning, validation, or test splits from any track's evaluation datasets must not appear in your pretraining corpus.
  4. Reproducibility for winners. Each track's first-place team and the overall first-place team must submit their full training data (if Option 2) and complete training code. Organizers will reproduce the pipeline to verify results.
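As a rough illustration of the no-leakage rule, the sketch below flags evaluation examples that also occur verbatim as lines of a pretraining corpus. It is a minimal sketch, not an official organizer tool: the function names, the whitespace normalization, and the toy sentences are all assumptions, and exact matching will not catch paraphrased or partially overlapping text.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text, cheap to store for large corpora."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaks(pretraining_lines, evaluation_examples):
    """Return evaluation examples whose normalized text also occurs as a pretraining line."""
    eval_index = {fingerprint(ex): ex for ex in evaluation_examples if normalize(ex)}
    leaks = []
    for line in pretraining_lines:
        hit = eval_index.get(fingerprint(line))
        if hit is not None and hit not in leaks:
            leaks.append(hit)
    return leaks


# Toy data (hypothetical): the second corpus line duplicates an evaluation item
# apart from extra whitespace.
corpus = ["今天天气很好。", "小猫  在睡觉。", "我喜欢读书。"]
eval_set = ["小猫 在睡觉。", "他昨天去了学校。"]
leaks = find_leaks(corpus, eval_set)
print(leaks)
```

Any non-empty result would mean the corpus needs filtering before pretraining. A stricter check (for example, n-gram overlap) may be appropriate for real data.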

Evaluation

Evaluation follows a two-phase protocol: an open phase during development and a hidden phase on held-out tasks released after model submission.

Open Evaluation Tasks

Released alongside the guidelines on April 15, 2026. Teams run the open-source evaluation pipeline locally and may report scores to the public Hugging Face leaderboard during development. Because open tasks are fully visible, teams must ensure that their training data does not include them.

Hidden Evaluation Tasks

Released after the model submission deadline (June 11, 2026). Teams evaluate their already-submitted, frozen models on the held-out test set and submit results by June 20, 2026. Hidden tasks test generalization beyond the open set.

Model Submission

All final models must be uploaded to Hugging Face as public repositories by June 11, 2026. No modifications to weights are permitted after this deadline — hidden evaluation uses exactly the submitted checkpoint.
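One way a team might confirm for itself that a checkpoint stayed frozen is to record a digest of the checkpoint directory at submission time and recompute it before hidden evaluation. The sketch below is an assumption about a convenient local workflow, not part of the official submission process; the function name and the file name `model.bin` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def checkpoint_digest(checkpoint_dir: str) -> str:
    """Hash every file under checkpoint_dir (relative path plus bytes) in sorted order."""
    root = Path(checkpoint_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large weight files do not need to fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary directory standing in for a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.bin"  # hypothetical file name
    weights.write_bytes(b"\x00\x01\x02")
    frozen = checkpoint_digest(tmp)           # recorded at submission time
    unchanged = checkpoint_digest(tmp) == frozen
    weights.write_bytes(b"\x00\x01\x03")      # simulate a post-deadline change
    modified = checkpoint_digest(tmp) != frozen

print(unchanged, modified)
```

Storing the digest alongside the submission notes gives a simple local record that the weights evaluated in the hidden phase are the ones that were uploaded.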

Final Scoring

A team's final score is the total across all open and hidden evaluation tasks. Per-track winners are decided by the total within that track; the overall winner is decided by the total across all tasks in all tracks.
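The aggregation described above can be sketched as follows. This assumes that "total" means a plain unweighted sum of task scores, which the guidelines do not spell out; the track names, task names, and score values are hypothetical placeholders.

```python
from collections import defaultdict


def aggregate_scores(task_scores, task_tracks):
    """Sum task scores per track and overall (assumes unweighted totals)."""
    per_track = defaultdict(float)
    for task, score in task_scores.items():
        per_track[task_tracks[task]] += score
    overall = sum(task_scores.values())
    return dict(per_track), overall


# Hypothetical tasks: each belongs to one track and is either open or hidden.
task_tracks = {
    "open_task_1": "track_a",
    "hidden_task_1": "track_a",
    "open_task_2": "track_b",
    "hidden_task_2": "track_b",
}
task_scores = {
    "open_task_1": 60.0,
    "hidden_task_1": 55.0,
    "open_task_2": 70.0,
    "hidden_task_2": 80.0,
}

per_track, overall = aggregate_scores(task_scores, task_tracks)
print(per_track)
print(overall)
```

Under these assumptions, a team is ranked within each track by its entry in `per_track` and in the overall ranking by `overall`.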

Ways to Improve Your Score

The following are suggestions, not requirements. Participants are free to explore other approaches within the data budget.

Paper Submissions

Chinese BabyLM is a shared task at NLPCC 2026. Teams with top-ranking systems will usually be invited to submit system reports. After review, the reports may be published as part of the NLPCC proceedings. See the 2025 proceedings: https://link.springer.com/book/10.1007/978-981-95-3352-7?page=2

You can also publish your paper on arXiv or submit it to other relevant workshops, such as the BabyLM Workshop @ EMNLP 2026.

Awards

To be determined. Details will be announced once finalized with NLPCC officials and sponsors.