SYNTHESIS NOTE

Can AI systems discover better neural architectures than humans?

Can multi-agent LLM systems, when structured with genetic programming, discover novel neural network designs that outperform human-engineered architectures? This matters because it could automate a critical bottleneck in AI research.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Genesys models the conventional stages of research (ideation, literature search, code generation, pretraining, evaluation) as a multi-agent LLM system. Its key innovation is the Ladder of Scales approach: new designs are proposed, adversarially reviewed, implemented, and selectively verified at progressively larger model scales (14M to 350M parameters), with a narrowing budget at each scale.
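The selective verification idea can be illustrated with a small sketch. This is not the Genesys implementation: the intermediate scale, the per-rung budgets, the `climb_ladder` name, and the `evaluate` callback are all assumptions made for illustration. Only the 14M and 350M endpoints come from the note.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical ladder: (parameter count, max designs verified at that scale).
# Only the 14M and 350M endpoints come from the note; the middle rung and
# all budget values are illustrative assumptions.
LADDER: List[Tuple[int, int]] = [
    (14_000_000, 8),
    (70_000_000, 4),
    (350_000_000, 2),
]


def climb_ladder(
    designs: List[str],
    evaluate: Callable[[str, int], float],
    ladder: List[Tuple[int, int]] = LADDER,
) -> Dict[int, List[str]]:
    """Return, per scale, the designs verified there, ordered best first."""
    survivors = list(designs)
    verified: Dict[int, List[str]] = {}
    for n_params, budget in ladder:
        # Spend this rung's budget on at most `budget` candidates;
        # anything beyond the budget is not verified at this scale.
        candidates = survivors[:budget]
        ranked = sorted(
            candidates,
            key=lambda d: evaluate(d, n_params),
            reverse=True,
        )
        verified[n_params] = ranked
        # Ranked best first, so the next (smaller) budget drops the weakest.
        survivors = ranked
    return verified
```

In this sketch `evaluate` stands in for pretraining and benchmarking a design at a given size. Because each rung's budget is smaller than the last, cheap small-scale runs filter the pool before any expensive large-scale run is paid for.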

The genetic programming (GP) backbone is critical. Rather than prompting an LLM to generate architectures directly (which has an ~86% failure rate), Genesys represents architectures as Generalized Autoregressive Blocks (GABs), a code construct that can be factorized into discrete tree representations. GP-style operations (crossover, mutation) on these trees produce meaningful architectural variations far more reliably than direct generation.
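To make the tree idea concrete, here is a minimal sketch of GP-style mutation and crossover on a toy block tree. The `Unit` class, the unit names, and both operators are hypothetical stand-ins, not the GAB format itself. In particular, mutation here only renames a node, where a real system would presumably rewrite that unit's code.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    """One node of a hypothetical block tree: a named unit with child units."""
    name: str
    children: List["Unit"] = field(default_factory=list)


def nodes(tree: Unit) -> List[Unit]:
    """Return every node in the tree in pre-order (root first)."""
    out = [tree]
    for child in tree.children:
        out.extend(nodes(child))
    return out


def mutate(tree: Unit, new_name: str, rng: random.Random) -> Unit:
    """Copy the tree and rename one randomly chosen node."""
    clone = copy.deepcopy(tree)
    rng.choice(nodes(clone)).name = new_name
    return clone


def crossover(a: Unit, b: Unit, rng: random.Random) -> Unit:
    """Copy `a` and swap one of its child subtrees for a subtree copied from `b`."""
    clone = copy.deepcopy(a)
    parents = [n for n in nodes(clone) if n.children]
    if not parents:
        # `a` is a single node, so there is no child slot to replace.
        return clone
    parent = rng.choice(parents)
    slot = rng.randrange(len(parent.children))
    parent.children[slot] = copy.deepcopy(rng.choice(nodes(b)))
    return clone
```

Both operators work on deep copies, so parent designs stay intact and can be reused. Every offspring is still a well-formed tree, which is the structural guarantee that freeform generation does not give.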

Results: 1,162 newly discovered designs (1,062 fully verified through pretraining). The best designs outperform GPT-2, Mamba-2, and other known architectures on 6/9 common benchmarks. This is achieved through a principled search process, not brute-force sampling.

The system architecture mirrors human research:

Designer agents: Propose research ideas and produce executable architecture designs
Verifier agents: Select designs and perform pretraining
Evolution tree: Stores seed designs and discovery artifacts, enabling cumulative progress
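
One way to picture the evolution tree is as a store of designs that records each design's parents and, once verified, its score. The sketch below is an assumption about how such a store could look. The `Design` and `EvolutionTree` names, their fields, and their methods are hypothetical and are not taken from Genesys.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Design:
    """A stored design: its name, its parent designs, and an optional score."""
    name: str
    parents: List[str] = field(default_factory=list)
    score: Optional[float] = None  # set once a verifier has evaluated it


class EvolutionTree:
    """Hypothetical store of seed designs and their descendants."""

    def __init__(self, seeds: List[str]) -> None:
        self.designs: Dict[str, Design] = {s: Design(s) for s in seeds}

    def add(self, name: str, parents: List[str]) -> Design:
        """Register a designer's proposal; all parents must already exist."""
        missing = [p for p in parents if p not in self.designs]
        if missing:
            raise KeyError(f"unknown parent designs: {missing}")
        design = Design(name, list(parents))
        self.designs[name] = design
        return design

    def record_score(self, name: str, score: float) -> None:
        """Attach a verifier's result to a stored design."""
        self.designs[name].score = score

    def unverified(self) -> List[Design]:
        """Designs a verifier could still pick up, in insertion order."""
        return [d for d in self.designs.values() if d.score is None]

    def lineage(self, name: str) -> List[str]:
        """All ancestors of a design, nearest first, without duplicates."""
        seen: List[str] = []
        queue = list(self.designs[name].parents)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self.designs[current].parents)
        return seen
```

In this picture, designer agents call `add`, verifier agents read `unverified` and write back through `record_score`, and `lineage` is what makes progress cumulative: any design can be traced back to the seeds it was derived from.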

Unlike traditional Neural Architecture Search (NAS), which searches within human-defined operation spaces (attention heads, convolution kernels), Genesys searches a wider space of operations and architectures while also modeling the broader scientific discovery process.

The factorization into GP-representable trees is the insight that makes this practical: it gives the search space a structure that direct LLM generation lacks. The ~86% improvement in successful design generation from GP over direct prompting suggests that current LLMs need structured representations to do creative design work reliably; they cannot yet generate novel working architectures from freeform description alone.

Inquiring lines that read this note

Do autonomous architecture discoveries follow predictable scaling laws?

Which computational strategies best support reasoning in language models?

How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?

What critical LLM failures do standard benchmarks hide?

Why does genetic programming outperform direct LLM generation by 86 percent?

How does objective evolution guide discovery better than fixed planning?

Can the same problem be solved by multiple evolutionary search strategies?

Related concepts in this collection

Can computational power accelerate scientific discovery itself?
Does the pace of research breakthroughs scale with computing resources, like model performance does? ASI-ARCH tested this by running thousands of autonomous experiments to discover neural architectures.
ASI-ARCH and Genesys are parallel demonstrations of the same principle: automated architecture discovery at scale.

Do language models generate more novel research ideas than experts?
Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.
Genesys addresses feasibility through GP structure; direct LLM generation fails 86% of the time.

Why do LLMs generate novel ideas from narrow ranges?
LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
GP backbone forces structural diversity that direct prompting cannot maintain.
