SYNTHESIS NOTE
Agentic Systems and Tool Use · Model Architecture and Internals · Training, RL, and Test-Time Scaling

Can AI systems discover better neural architectures than humans?

Can multi-agent LLM systems, when structured with genetic programming, discover novel neural network designs that outperform human-engineered architectures? This matters because it could automate a critical bottleneck in AI research.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Genesys models the conventional stages of research (ideation, literature search, code generation, pretraining, evaluation) as a multi-agent LLM system. The key innovation is the Ladder of Scales approach: new designs are proposed, adversarially reviewed, implemented, and selectively verified at progressively larger model scales (14M→350M parameters), with a narrowing budget at each scale.
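The narrowing-budget idea can be illustrated with a minimal Python sketch. This is not Genesys code: the intermediate scale labels, the per-scale budgets, and the `toy_score` stand-in for "pretrain and evaluate" are all hypothetical, chosen only to show how fewer designs survive to each larger scale.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical ladder: (scale label, number of designs that may be verified there).
# Only the 14M and 350M endpoints come from the note; everything else is illustrative.
LADDER: List[Tuple[str, int]] = [("14M", 8), ("70M", 4), ("125M", 2), ("350M", 1)]


def toy_score(design: str, scale: str) -> float:
    """Stand-in for pretraining a design at `scale` and evaluating it (fake score)."""
    return float(design.lstrip("d"))


def climb_ladder(
    designs: List[str],
    score: Callable[[str, str], float],
    ladder: List[Tuple[str, int]] = LADDER,
) -> Dict[str, List[str]]:
    """Verify designs rung by rung, keeping fewer candidates at each larger scale."""
    verified: Dict[str, List[str]] = {}
    candidates = list(designs)
    for scale, budget in ladder:
        # Only `budget` designs are trained at this scale.
        selected = candidates[:budget]
        # Rank them best-first by their score at this scale.
        ranked = sorted(selected, key=lambda d: score(d, scale), reverse=True)
        verified[scale] = ranked
        # The best-first ordering decides who gets a slot on the next rung.
        candidates = ranked
    return verified
```

Calling `climb_ladder([f"d{i}" for i in range(10)], toy_score)` returns a dictionary from scale label to the designs verified at that scale. Because each rung costs more to train than the last, the shrinking budget spends most of the compute on designs that have already held up at smaller scales.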

The genetic programming (GP) backbone is critical. Rather than prompting LLMs to generate architectures directly (which has an ~86% failure rate), Genesys represents architectures as Generalized Autoregressive Blocks (GABs), a code construct that factorizes into discrete tree representations. GP-style operations (crossover, mutation) on these trees produce meaningful architectural variations far more reliably than direct generation.
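To make the tree operations concrete, here is a minimal Python sketch of mutation and crossover on a toy block tree. It is an assumption-laden illustration, not the GAB representation itself: the `Unit` class, the unit names (`attention`, `mlp`, `ssm`, and so on), and both operators are hypothetical and model only the tree shape, not the underlying code.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    """One node of a hypothetical block tree: a named unit with child units."""
    name: str
    children: List["Unit"] = field(default_factory=list)


def nodes(tree: Unit) -> List[Unit]:
    """List every node in the tree in pre-order (root first)."""
    out = [tree]
    for child in tree.children:
        out.extend(nodes(child))
    return out


def mutate(tree: Unit, new_name: str, rng: random.Random) -> Unit:
    """Return a copy of `tree` with one randomly chosen unit renamed."""
    child = copy.deepcopy(tree)
    rng.choice(nodes(child)).name = new_name
    return child


def crossover(a: Unit, b: Unit, rng: random.Random) -> Unit:
    """Return a copy of `a` with one child subtree replaced by a subtree of `b`."""
    child = copy.deepcopy(a)
    parents = [n for n in nodes(child) if n.children]
    if not parents:
        # `a` is a single leaf, so there is no child slot to swap; return the copy.
        return child
    parent = rng.choice(parents)
    slot = rng.randrange(len(parent.children))
    parent.children[slot] = copy.deepcopy(rng.choice(nodes(b)))
    return child


# Two toy parent designs (names are illustrative only).
parent_a = Unit("block", [Unit("attention"), Unit("mlp")])
parent_b = Unit("block", [Unit("ssm", [Unit("conv")]), Unit("gated_mlp")])
```

For example, `crossover(parent_a, parent_b, random.Random(0))` yields a new tree that keeps the root of `parent_a` but carries one subtree from `parent_b`. Both operators work on deep copies, so the parents stay intact, and every offspring is still a well-formed tree, which is the structural guarantee that freeform generation does not provide.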

Results: Genesys produced 1,162 newly discovered designs, of which 1,062 were fully verified through pretraining. The best designs outperform GPT-2, Mamba-2, and other known architectures on 6/9 common benchmarks. These results come from a principled search process, not brute-force sampling.

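The system architecture mirrors human research: its agents cover the same stages a human team would, from ideation and literature search through code generation, pretraining, and evaluation.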

Unlike traditional Neural Architecture Search (NAS), which searches within human-defined operation spaces (attention heads, convolution kernels), Genesys searches a wider space of operations and architectures while also modeling the scientific discovery process itself.

The factorization into GP-representable trees is the insight that makes this practical: it gives the search space a structure that direct LLM generation lacks. The ~86% improvement in successful design generation from GP over direct prompting suggests that current LLMs need structured representations to do creative design work reliably; they cannot yet generate novel working architectures from freeform description alone.

Original note title

multi-agent LLM systems discover novel neural architectures competitive with human-designed ones through genetic programming