What if the path to smarter language models doesn't require more text — but synthetic data from abstract dynamical systems?
Large language models are hungry. They require exponentially more data to keep improving, and high-quality natural language is projected to run out by 2028. Worse, internet text carries human biases and entangles knowledge with reasoning, making it hard to control what models actually learn.
This raises a radical question: Is natural language the only path to intelligence?
Neural cellular automata (NCA) generalize systems like Conway's Game of Life by replacing fixed rules with neural networks. Each randomly-sampled network defines a unique transition rule, producing diverse spatiotemporal dynamics on a grid. When unrolled over long horizons, these dynamics give rise to a rich spectrum of behaviors — from simple patterns that converge to a fixed attractor state to complex structures that emerge gradually over time.
Watch how different rules produce different complexity levels. Click to randomize.
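To make the setup concrete, here is a minimal sketch of one NCA update in NumPy. It assumes continuous cell states, a 3×3 neighborhood, a tiny two-layer tanh MLP as the randomly sampled transition rule, and toroidal boundaries; these are illustrative choices, not the paper's exact configuration.

```python
# Minimal NCA sketch: a randomly sampled MLP maps each cell's 3x3
# neighborhood to its next state. Sizes and nonlinearities are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Randomly sampled transition rule: 9 neighborhood values -> 16 hidden -> 1 output.
W1 = rng.normal(scale=0.5, size=(9, 16))
b1 = rng.normal(scale=0.1, size=16)
W2 = rng.normal(scale=0.5, size=(16, 1))

def nca_step(grid):
    """Apply the neural transition rule to every cell in parallel."""
    H, W = grid.shape
    padded = np.pad(grid, 1, mode="wrap")  # toroidal boundary, like Life
    # Gather each cell's 3x3 neighborhood into a (H*W, 9) matrix.
    neigh = np.stack(
        [padded[i:i + H, j:j + W] for i in range(3) for j in range(3)],
        axis=-1,
    ).reshape(-1, 9)
    h = np.tanh(neigh @ W1 + b1)
    return np.tanh(h @ W2).reshape(H, W)

grid = rng.uniform(-1, 1, size=(32, 32))
for _ in range(100):   # unroll the dynamics over a long horizon
    grid = nca_step(grid)
```

Because every draw of `W1`, `b1`, `W2` defines a different rule, re-sampling the weights yields a fresh dynamical system each time.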
These NCA trajectories are tokenized into sequences (using 2×2 patches, similar to vision transformers) and fed to a standard transformer with next-token prediction. The key: since every sequence has a unique latent rule, the model must infer that rule in-context to predict what comes next. This in-context learning ability underpins many of the key reasoning capabilities observed in language models.
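A hedged sketch of the patch tokenization step: a trajectory of frames is cut into 2×2 spatial patches, and each patch becomes one discrete token. This version assumes binary cell states and a 16-token patch vocabulary, which may differ from the paper's exact scheme.

```python
# Sketch of 2x2 patch tokenization (vision-transformer style).
# Assumption: binary states, so each 2x2 patch has 2**4 = 16 token ids.
import numpy as np

def tokenize_trajectory(traj, patch=2):
    """Flatten a (T, H, W) binary trajectory into a 1-D token sequence."""
    T, H, W = traj.shape
    tokens = []
    for t in range(T):                       # frames in time order
        for i in range(0, H, patch):
            for j in range(0, W, patch):
                bits = traj[t, i:i + patch, j:j + patch].reshape(-1)
                # Read the 4 bits of the patch as a base-2 token id.
                tokens.append(int(np.dot(bits, 2 ** np.arange(bits.size))))
    return tokens

traj = (np.random.default_rng(0).random((4, 8, 8)) > 0.5).astype(int)
seq = tokenize_trajectory(traj)   # 4 frames x 16 patches = 64 tokens
```

The resulting sequence is what the transformer sees; predicting the next patch token requires inferring the latent rule from the earlier frames.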
Under matched token budgets (164M tokens each), NCA pre-pre-training consistently outperforms from-scratch training, pre-pre-training on natural language (C4), and pre-pre-training on other synthetic data (Dyck) across web text, math, and code. The gains are not merely faster convergence: final perplexity also improves.
These language modeling gains transfer to real reasoning benchmarks:
Surprisingly, we observe that our non-linguistic NCA data outperforms natural language at equal scale. So we investigate further: what happens if we give C4 roughly 10× more data? We scale C4 pre-pre-training to 1.6B tokens while keeping NCA at 164M. Even with this data advantage, NCA still converges 1.4× faster and achieves 5% better final perplexity.
Re-initialization experiments show that attention layers capture the most transferable computational primitives. MLPs encode domain-specific knowledge that transfers only when source and target domains align.
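The re-initialization probe can be sketched as follows: keep one layer type from the pre-pre-trained checkpoint and re-draw everything else before language training. The parameter names and the substring-based selection are hypothetical conveniences, not the paper's actual implementation.

```python
# Illustrative re-initialization probe: transfer only one layer type
# ("attn" or "mlp") from a checkpoint, re-draw the rest from scratch.
import numpy as np

rng = np.random.default_rng(0)

# Toy "checkpoint": flat dict of parameter arrays keyed by layer type.
checkpoint = {
    "layer0.attn.qkv": rng.normal(size=(8, 24)),
    "layer0.mlp.fc":   rng.normal(size=(8, 32)),
    "layer1.attn.qkv": rng.normal(size=(8, 24)),
    "layer1.mlp.fc":   rng.normal(size=(8, 32)),
}

def reinit_except(params, keep):
    """Re-draw every parameter whose name lacks the substring `keep`."""
    out = {}
    for name, w in params.items():
        if keep in name:
            out[name] = w.copy()                  # transferred as-is
        else:
            out[name] = rng.normal(size=w.shape)  # fresh initialization
    return out

keep_attn = reinit_except(checkpoint, keep="attn")  # MLPs re-initialized
keep_mlp  = reinit_except(checkpoint, keep="mlp")   # attention re-initialized
```

Comparing downstream performance of the `keep_attn` and `keep_mlp` variants is what isolates where the transferable structure lives.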
The optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text prefer more complex ones. This opens a new lever for targeted training.
NCA data has zero linguistic content — yet teaches models to track long-range dependencies and infer latent rules, the same capabilities needed for language.
More synthetic data isn't always better. Calibrating the complexity of the data generator matters more than raw volume, enabling smarter training with less compute.
Click a domain to see which NCA complexity band transfers best
At small token budgets, natural language pre-training mostly teaches shallow patterns: models exploit semantic shortcuts and co-occurrence priors rather than learning to reason from structure. NCA sequences, by contrast, offer no semantic shortcuts to exploit.
Every NCA trajectory is generated by a hidden transition rule — a randomly sampled neural network — that the model must infer purely from context. Since there's no semantic content to fall back on, every token pushes the model toward in-context rule inference: observing a sequence, hypothesizing the underlying rule, and applying it consistently forward. This mirrors one of the core capabilities of language models (i.e., in-context learning).
Because NCA rules are drawn from a universal class of computable functions — some realizing Turing-complete systems — the distribution is too vast to memorize. The model is forced to learn a general mechanism for rule inference rather than memorizing specific rules. This is supported by our empirical findings: attention layers, not the MLPs, carry the most transferable structure. Prior work shows that in-context learning ability emerges with the formation of induction heads, attention circuits that copy and apply patterns from earlier in the sequence. NCA pre-pre-training exclusively rewards this behavior, likely inducing earlier and more robust formation of these circuits before language training begins.
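The induction-head behavior itself is simple enough to state as code. This toy function is not the paper's mechanism; it only illustrates the copy-and-apply pattern that induction heads implement inside attention.

```python
# Toy illustration of induction-head behavior: on seeing a token that
# appeared earlier, predict the token that followed its last occurrence.
def induction_predict(seq):
    """Match the current token to its most recent earlier occurrence
    and copy the token that followed it; None if no match exists."""
    cur = seq[-1]
    for i in range(len(seq) - 2, -1, -1):
        if seq[i] == cur:
            return seq[i + 1]
    return None

induction_predict(list("abcab"))  # earlier "b" was followed by "c"
```

An NCA sequence rewards exactly this kind of lookup, since the only way to predict the next token is to reuse patterns established earlier in the same context.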
This work opens a fundamentally new axis of control for training language models. Instead of treating the training distribution as fixed, we can tune the structure of synthetic data to match target domains: simpler NCA rules for code, richer long-range dynamics for genomic sequence modeling.
The long-term vision: foundation models that acquire reasoning from fully synthetic data, then learn semantics from a small, curated corpus of natural language. Such models could reason without inheriting human biases from the start.
If you find this work useful, please consider citing our paper:
@inproceedings{placeholder2026nca,
title={Training Language Models via Neural Cellular Automata},
author={Seungwook Han and Dan Lee and Akarsh Kumar and Pulkit Agrawal},
booktitle={TBD},
year={2026}
}