
Training Language Models via Neural Cellular Automata

What if the path to smarter language models doesn't require more text — but synthetic data from abstract dynamical systems?

Key results: 6% perplexity gain · 1.6× faster convergence · 164M NCA tokens used

We're running out of text

Large language models are hungry. They require exponentially more data to keep improving, and high-quality natural language is projected to run out by 2028. Worse, internet text carries human biases and entangles knowledge with reasoning, making it hard to control what models actually learn.

This raises a radical question: Is natural language the only path to intelligence?

The core hypothesis: what makes language useful for pre-training is its structure, not its semantics. If so, richly structured, non-linguistic data could also work.

Neural Cellular Automata as synthetic fuel

Neural cellular automata (NCA) generalize systems like Conway's Game of Life by replacing fixed rules with neural networks. Each randomly sampled network defines a unique transition rule, producing diverse spatiotemporal dynamics on a grid. When unrolled over long horizons, these dynamics give rise to a rich spectrum of behaviors — from simple patterns that converge to a fixed attractor state to complex structures that emerge gradually over time.
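To make the setup concrete, here is a minimal sketch of an NCA step in NumPy. This is illustrative, not the paper's implementation: the tiny two-layer MLP, its width, and the toroidal 3×3 neighborhood are assumptions. The key property it demonstrates is that each randomly sampled network defines its own transition rule, and unrolling it yields a trajectory.

```python
import numpy as np

def sample_rule(rng, hidden=8):
    """Randomly sample a tiny MLP mapping a cell's 3x3 neighborhood to its next state."""
    w1 = rng.standard_normal((9, hidden)) * 0.5
    w2 = rng.standard_normal((hidden, 1)) * 0.5
    def rule(neigh):  # neigh: (num_cells, 9)
        return np.tanh(np.tanh(neigh @ w1) @ w2)
    return rule

def nca_step(grid, rule):
    """Apply the rule to every cell's 3x3 neighborhood (toroidal boundary)."""
    h, w = grid.shape
    shifts = [np.roll(np.roll(grid, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    neigh = np.stack(shifts, axis=-1).reshape(h * w, 9)
    return rule(neigh).reshape(h, w)

rng = np.random.default_rng(0)
rule = sample_rule(rng)             # a fresh latent rule per trajectory
grid = rng.standard_normal((16, 16))
trajectory = [grid]
for _ in range(32):                 # unroll the dynamics over a long horizon
    trajectory.append(nca_step(trajectory[-1], rule))
```

Resampling `rule` gives a new dynamical system each time, which is what makes every training sequence carry a distinct latent function.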

Interactive NCA Simulation

[Interactive figure: different rules produce different complexity levels (low, medium, high); click to randomize. The gzip compression ratio controls complexity — more compressible = simpler dynamics.]
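The compression-based complexity measure can be approximated in a few lines. A sketch under assumptions: trajectories are quantized to bytes before compression, and the exact discretization used in the paper may differ.

```python
import gzip
import numpy as np

def gzip_complexity(trajectory):
    """Compression ratio of a quantized rollout: higher = less compressible = more complex."""
    frames = np.stack(trajectory)                           # (T, H, W), values in [-1, 1]
    quantized = ((frames + 1.0) * 127.5).astype(np.uint8).tobytes()
    return len(gzip.compress(quantized)) / len(quantized)

# A constant rollout compresses almost perfectly; a noisy one barely at all.
rng = np.random.default_rng(0)
flat = [np.zeros((16, 16)) for _ in range(32)]
noisy = [rng.uniform(-1, 1, (16, 16)) for _ in range(32)]
print(gzip_complexity(flat) < gzip_complexity(noisy))  # → True
```

Sorting sampled rules by this ratio is one simple way to bin rollouts into the low/medium/high complexity bands discussed above.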

These NCA trajectories are tokenized into sequences (using 2×2 patches, similar to vision transformers) and fed to a standard transformer with next-token prediction. The key: since every sequence has a unique latent rule, the model must infer that rule in-context to predict what comes next. This in-context learning ability underpins many of the key reasoning capabilities observed in language models.
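The 2×2 patch tokenization can be sketched as follows. This is a simplified illustration: the quantization levels and the base-encoding of patches into token ids are assumptions, not the paper's exact vocabulary construction.

```python
import numpy as np

def tokenize_trajectory(trajectory, patch=2, levels=4):
    """Split each frame into non-overlapping patch x patch blocks (ViT-style)
    and map each quantized block to an integer token id."""
    tokens = []
    for frame in trajectory:                          # frame: (H, W), values in [-1, 1]
        q = np.clip(((frame + 1) / 2 * levels).astype(int), 0, levels - 1)
        h, w = q.shape
        blocks = q.reshape(h // patch, patch, w // patch, patch).transpose(0, 2, 1, 3)
        blocks = blocks.reshape(-1, patch * patch)    # one row per patch
        base = levels ** np.arange(patch * patch)     # base-`levels` positional encoding
        tokens.extend((blocks @ base).tolist())       # unique id per patch pattern
    return tokens

frame = np.linspace(-1, 1, 16).reshape(4, 4)
seq = tokenize_trajectory([frame])                    # 4x4 frame -> 4 patch tokens
print(len(seq))  # → 4
```

With `patch=2` and `levels=4` this yields a vocabulary of 4^4 = 256 possible patch tokens; the resulting sequences are then fed to a standard transformer with next-token prediction.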

Stage 1 (Pre-pre-train): 164M NCA tokens of synthetic dynamics
Stage 2 (Pre-train): natural language from web, math, and code (4-13B tokens)
Stage 3 (Fine-tune): task-specific instruction tuning (<1B tokens)
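The three-stage curriculum amounts to a simple token-budget schedule. A hypothetical sketch — the stage names and budgets come from the pipeline above, while `run_curriculum`, the loader keys, and the `train` callback are illustrative stand-ins for a standard next-token-prediction loop:

```python
# Hypothetical staged training schedule mirroring the three stages above.
STAGES = [
    ("pre-pre-train", "nca",           164_000_000),    # synthetic NCA dynamics
    ("pre-train",     "web+math+code", 4_000_000_000),  # natural language (4-13B)
    ("fine-tune",     "instructions",  1_000_000_000),  # task-specific (<1B)
]

def run_curriculum(model, data_loaders, train):
    """Run each stage until its token budget is exhausted, same objective throughout."""
    for name, source, budget in STAGES:
        seen = 0
        for batch in data_loaders[source]:
            train(model, batch)          # next-token prediction in every stage
            seen += batch["tokens"]
            if seen >= budget:
                break
    return model
```

Nothing about the objective changes between stages; only the data source does, which is what makes the ablations (NCA vs. C4 vs. Dyck in stage 1) clean comparisons.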

The surprising payoff

Under matched token budgets (164M tokens each), NCA pre-pre-training consistently outperforms from-scratch training, pre-pre-training on natural language (C4), and pre-pre-training on other synthetic data (Dyck) across web text, math, and code. The gains aren't just better convergence speed, but also better final perplexity.

Final Perplexity by Domain (↓ lower is better)

Domain        Scratch   C4      Dyck    NCA     NCA vs Scratch
OpenWebText   14.66     14.69   14.35   13.82   −5.7%
OpenWebMath   8.11      8.14    7.91    7.70    −5.2%
CodeParrot    1.92      1.88    1.85    1.84    −4.2%

(C4 = natural language; Dyck and NCA = synthetic.)

Validation Perplexity During Training

[Chart: validation perplexity curves on OpenWebText, OpenWebMath, and CodeParrot for the Scratch, C4, Dyck, and NCA runs.]

These language modeling gains transfer to real reasoning benchmarks:

Reasoning Benchmark Performance

Benchmark                                      Scratch   C4       Dyck     NCA
GSM8K (math, pass@1 accuracy)                  3.82%     3.81%    4.10%    4.36%
HumanEval (code, pass@1 accuracy)              6.75%     6.27%    6.90%    7.49%
BigBench-Lite (reasoning, normalized pass@2)   20.91%    22.76%   18.10%   26.51%

Surprisingly, our non-linguistic NCA data outperforms natural language at equal scale. So we investigate further: what happens if we give C4 roughly 10× more data? We scale C4 pre-pre-training to 1.6B tokens while keeping NCA at 164M. Even with this data advantage, NCA still converges 1.4× faster and achieves 5% better final perplexity.

OpenWebText: NCA (164M tokens) vs C4 (1.6B tokens)

[Chart: validation perplexity curves for Scratch, NCA (164M), Dyck (164M), C4 (1.6B), and C4 (1.6B, no reinit).]
164M tokens of automata beats 1.6B tokens of natural language. We believe the difference reflects what each data source teaches at each scale. At 1.6B tokens — far below the compute-optimal scale — C4 largely teaches shallow, local patterns, while each NCA sequence trains the model to infer a latent rule from context (i.e., in-context learning) and apply it consistently. This per-token diversity in functions, rather than redundant linguistic patterns, appears more efficient at building the general-purpose representations that transfer to language.

What drives the transfer?


Attention is the carrier

Re-initialization experiments show attention layers capture the most transferable computational primitives. MLPs encode domain-specific knowledge — transferable only when source and target align.
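The re-initialization probe can be sketched as: after stage 1, reset one module family (MLPs or attention) to fresh weights, continue to stage 2, and see which reset hurts transfer more. A hypothetical helper, assuming parameters are exported as a flat name-to-array dict with "attn"/"mlp" in the names, as in GPT-style checkpoints:

```python
import numpy as np

def selectively_reinit(params, keep, rng, init_scale=0.02):
    """Keep pre-pre-trained weights whose name contains `keep`; re-initialize the rest."""
    out = {}
    for name, w in params.items():
        if keep in name:
            out[name] = w.copy()                              # transfer this block
        else:
            out[name] = rng.standard_normal(w.shape) * init_scale  # fresh init
    return out

rng = np.random.default_rng(0)
params = {"block0.attn.qkv": np.ones((4, 4)), "block0.mlp.fc": np.ones((4, 4))}
# Transfer only attention — the setting the experiments find most effective.
transferred = selectively_reinit(params, keep="attn", rng=rng)
```

Comparing the "keep attention" and "keep MLP" variants after stage 2 isolates where the transferable computation lives.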


Complexity must match

The optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex ones. This opens a new lever for targeted training.

Structure, not semantics

NCA data has zero linguistic content — yet teaches models to track long-range dependencies and infer latent rules, the same capabilities needed for language.


Efficiency over scale

More synthetic data isn't always better. Calibrating the complexity of the data generator matters more than raw volume, enabling smarter training with less compute.

Optimal Complexity by Domain

[Interactive figure: which NCA complexity band transfers best to each target domain.]

A purer training signal

At small token budgets, natural language pre-training mostly teaches shallow patterns: models exploit semantic shortcuts and co-occurrence priors rather than learning to reason from structure. NCA sequences, in contrast, contain no semantic shortcuts at all.

Every NCA trajectory is generated by a hidden transition rule — a randomly sampled neural network — that the model must infer purely from context. Since there's no semantic content to fall back on, every token pushes the model toward in-context rule inference: observing a sequence, hypothesizing the underlying rule, and applying it consistently forward. This mirrors one of the core capabilities of language models (i.e., in-context learning).

Because NCA rules are drawn from a universal class of computable functions — some realizing Turing-complete systems — the distribution is too vast to memorize. The model is forced to learn a general mechanism for rule inference rather than memorizing specific rules. This is supported by our empirical findings: attention layers, not the MLPs, carry the most transferable structure. Prior work shows that in-context learning ability emerges with the formation of induction heads — attention circuits that copy and apply patterns from earlier in the sequence. NCA pre-pre-training exclusively rewards this behavior, likely inducing earlier and more robust formation of these circuits before language training begins.

Beyond one-size-fits-all

This work opens a fundamentally new axis of control for training language models. Instead of treating the training distribution as fixed, we can tune the structure of synthetic data to match target domains: simpler NCA rules for code, richer long-range dynamics for genomic sequence modeling.

The long-term vision: foundation models that acquire reasoning from fully synthetic data, then learn semantics from a small, curated corpus of natural language. Such models could reason without inheriting human biases from the start.

The question is no longer whether synthetic pre-training can work, but how far it can go.

Citation

If you find this work useful, please consider citing our paper:

@inproceedings{placeholder2026nca,
  title={Training Language Models via Neural Cellular Automata},
  author={Seungwook Han and Dan Lee and Akarsh Kumar and Pulkit Agrawal},
  booktitle={TBD},
  year={2026}
}