Can AI systems discover better neural architectures than humans?
Can multi-agent LLM systems, when structured with genetic programming, discover novel neural network designs that outperform human-engineered architectures? This matters because it could automate a critical bottleneck in AI research.
Genesys models the conventional stages of research (ideation, literature search, code generation, pretraining, evaluation) as a multi-agent LLM system. The key innovation is the Ladder of Scales approach: new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly large model scales (14M to 350M parameters), with a narrowing budget at each scale.
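The narrowing-budget idea can be sketched in a few lines. This is a minimal illustration, not the Genesys implementation: only the 14M and 350M endpoints come from the note, while the intermediate scale, the budgets, the `Rung` and `climb_ladder` names, and the simple top-k selection rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Rung:
    params_millions: int  # model size trained at this rung
    budget: int           # how many designs may be verified at this rung


# Hypothetical ladder: only the 14M and 350M endpoints appear in the note;
# the 70M rung and every budget value are invented for illustration.
LADDER = [Rung(14, 8), Rung(70, 4), Rung(350, 2)]


def climb_ladder(
    designs: Sequence[str],
    evaluate: Callable[[str, int], float],
    ladder: Sequence[Rung] = LADDER,
) -> Dict[int, List[str]]:
    """Keep the best `budget` designs at each rung; return survivors per scale."""
    survivors = list(designs)
    history: Dict[int, List[str]] = {}
    for rung in ladder:
        # Score every surviving design at this scale (stand-in for pretraining).
        ranked = sorted(
            survivors,
            key=lambda d: evaluate(d, rung.params_millions),
            reverse=True,
        )
        # Only the top `budget` designs earn compute at the next, larger scale.
        survivors = ranked[: rung.budget]
        history[rung.params_millions] = survivors
    return history
```

Because each rung sees fewer designs than the one before, most of the compute at the expensive scales goes to candidates that already looked promising at the cheap ones.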
The genetic programming (GP) backbone is critical. Rather than prompting LLMs to generate architectures directly (which has an ~86% failure rate), Genesys represents architectures as Generalized Autoregressive Blocks (GABs), a code construct that can be factorized into discrete tree representations. GP-style operations (crossover, mutation) on these trees produce meaningful architectural variations far more reliably than direct generation.
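To make the tree operations concrete, here is a toy sketch of subtree mutation and crossover. It assumes a simplified representation in which each node is just a named unit; the `Unit` class, the helper names, and the unit labels are hypothetical, and in the system described the nodes would carry executable code rather than plain labels.

```python
import copy
import random
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Path = Tuple[int, ...]


@dataclass
class Unit:
    """One node of a toy architecture tree (a stand-in for a unit of block code)."""
    name: str
    children: List["Unit"] = field(default_factory=list)


def walk(node: Unit, path: Path = ()) -> Iterator[Tuple[Path, Unit]]:
    """Yield (path, node) pairs in pre-order; a path is a tuple of child indices."""
    yield path, node
    for i, child in enumerate(node.children):
        yield from walk(child, path + (i,))


def replace_at(root: Unit, path: Path, subtree: Unit) -> Unit:
    """Return a copy of `root` with the subtree at `path` swapped for `subtree`."""
    if not path:
        return copy.deepcopy(subtree)
    new_root = copy.deepcopy(root)
    parent = new_root
    for i in path[:-1]:
        parent = parent.children[i]
    parent.children[path[-1]] = copy.deepcopy(subtree)
    return new_root


def mutate(root: Unit, new_unit: Unit, rng: random.Random) -> Unit:
    """Mutation: replace one randomly chosen subtree with a new unit."""
    path, _ = rng.choice(list(walk(root)))
    return replace_at(root, path, new_unit)


def crossover(a: Unit, b: Unit, rng: random.Random) -> Unit:
    """Crossover: graft a random subtree of `b` onto a random position in `a`."""
    path, _ = rng.choice(list(walk(a)))
    _, donor = rng.choice(list(walk(b)))
    return replace_at(a, path, donor)


def names(root: Unit) -> List[str]:
    """Flatten a tree to its unit names in pre-order."""
    return [node.name for _, node in walk(root)]
```

Both operations return new trees and leave the parents untouched, so every offspring is built only from whole, previously valid subtrees. That is the kind of structure freeform generation does not provide.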
Results: the system produced 1,162 newly discovered designs, 1,062 of them fully verified through pretraining. The best designs outperform GPT-2, Mamba-2, and other known architectures on 6/9 common benchmarks. This is achieved through a principled search process, not brute-force sampling.
The system architecture mirrors human research:
- Designer agents: Propose research ideas and produce executable architecture designs
- Verifier agents: Select designs and perform pretraining
- Evolution tree: Stores seed designs and discovery artifacts, enabling cumulative progress
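The three roles above can be sketched as a small loop around a shared store. This is an illustrative sketch under stated assumptions: `DesignRecord`, `EvolutionTree`, `discovery_step`, and the callable signatures for the designer and verifier are all hypothetical names, and in the real system the agents are LLM-driven and the verifier runs pretraining rather than returning a number directly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class DesignRecord:
    design_id: str
    parents: List[str]             # ids of the designs this one was derived from
    code: str                      # executable architecture code (opaque here)
    score: Optional[float] = None  # stays None until a verifier evaluates it


class EvolutionTree:
    """Stores seed designs and discovered designs with their ancestry."""

    def __init__(self) -> None:
        self.records: Dict[str, DesignRecord] = {}

    def add(self, record: DesignRecord) -> None:
        missing = [p for p in record.parents if p not in self.records]
        if missing:
            raise KeyError(f"unknown parent designs: {missing}")
        self.records[record.design_id] = record

    def unverified(self) -> List[DesignRecord]:
        return [r for r in self.records.values() if r.score is None]

    def lineage(self, design_id: str) -> List[str]:
        """Return the design followed by its ancestors, nearest first."""
        seen: List[str] = []
        queue = [design_id]
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            queue.extend(self.records[current].parents)
        return seen


def discovery_step(
    tree: EvolutionTree,
    designer: Callable[[EvolutionTree], DesignRecord],
    verifier: Callable[[DesignRecord], float],
) -> DesignRecord:
    """One round: a designer proposes from the tree, a verifier scores the backlog."""
    proposal = designer(tree)
    tree.add(proposal)
    for record in tree.unverified():
        record.score = verifier(record)
    return proposal
```

Keeping parent ids on every record is what makes progress cumulative: each new design can be traced back to the seeds it was built from.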
Unlike traditional Neural Architecture Search (NAS), which searches within human-defined operation spaces (attention heads, convolution kernels), Genesys searches a wider space of operations and architectures while also modeling the scientific discovery process as a whole.
The factorization into GP-representable trees is the insight that makes this practical: it gives the search space a structure that direct LLM generation lacks. The ~86 percentage point improvement in successful design generation from GP over direct prompting suggests that current LLMs need structured representations to do creative design work reliably: they cannot yet generate novel working architectures from freeform description alone.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do human-designed neural architectures eventually get replaced by learned ones?
- What makes AI-discovered architectures reveal design principles invisible to humans?
- Does architectural discovery follow an empirical scaling law like neural networks?
- How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?
- Why does genetic programming outperform direct LLM generation by 86 percent?
- Can the same problem be solved by multiple evolutionary search strategies?
Related concepts in this collection (3)
- Can computational power accelerate scientific discovery itself?
  Does the pace of research breakthroughs scale with computing resources, like model performance does? ASI-ARCH tested this by running thousands of autonomous experiments to discover neural architectures.
  ASI-ARCH and Genesys are parallel demonstrations of the same principle: automated architecture discovery at scale
- Do language models generate more novel research ideas than experts?
  Explores whether LLMs can break free from expert constraints to generate more novel research concepts. Matters because novelty is often thought to be AI's creative blind spot.
  Genesys addresses feasibility through GP structure; direct LLM generation fails 86% of the time
- Why do LLMs generate novel ideas from narrow ranges?
  LLM research agents produce individually novel ideas but cluster them in homogeneous sets. This explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation.
  GP backbone forces structural diversity that direct prompting cannot maintain
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Language Modeling by Language Models
- Large Language Model based Multi-Agents: A Survey of Progress and Challenges
- Scaling Behavior of Single LLM-Driven Multi-Agent Systems
- Fundamentals of Building Autonomous LLM Agents
- Textgrad: Automatic “Differentiation” via Text
- Survey on Evaluation of LLM-based Agents
- Dynamic LLM-Agent Network: An LLM-agent Collaboration Framework with Agent Team Optimization
- Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures
Original note title
multi-agent LLM systems discover novel neural architectures competitive with human-designed ones through genetic programming