Can asynchronous expert training beat synchronized distributed LLM training?
Can training domain-specialized LLM copies in parallel without synchronization, then merging their components into a routed mixture, achieve better efficiency and accuracy than keeping all copies synchronized?
The communication cost of keeping many GPU model copies synchronized is a major bottleneck in scaling LLM training, and synchronized training is fragile: one failed GPU halts the whole run. Branch-Train-MiX (BTX) sidesteps both problems in three steps. First, branch a seed model into copies. Second, train each copy as a domain expert in embarrassingly parallel fashion, with high throughput and no synchronization between copies. Third, combine the copies: the experts' feed-forward parameters become the experts in Mixture-of-Experts (MoE) layers, the remaining parameters are averaged, and a short MoE-finetuning stage learns token-level routing.
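The merge step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: parameters are plain lists of floats, and the `.ffn.` naming convention and the function names `is_ffn` and `btx_merge` are assumptions made for this example. The router is not produced here, since it is learned afterwards in the MoE-finetuning stage.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def is_ffn(name: str) -> bool:
    # Hypothetical naming convention: feed-forward weights contain ".ffn.".
    return ".ffn." in name


def btx_merge(experts: List[Params]) -> Dict[str, object]:
    """Combine independently trained expert checkpoints.

    Feed-forward parameters are kept side by side (one entry per expert),
    so each becomes a separate expert in an MoE layer.
    All other parameters are averaged element-wise across the experts.
    """
    merged: Dict[str, object] = {}
    n = len(experts)
    for name in experts[0]:
        tensors = [expert[name] for expert in experts]
        if is_ffn(name):
            # Keep every expert's feed-forward weights as its own MoE expert.
            merged[name] = [list(t) for t in tensors]
        else:
            # Average the remaining (e.g. attention, embedding) weights.
            merged[name] = [sum(vals) / n for vals in zip(*tensors)]
    return merged
```

For two toy experts whose attention weights are `[1.0, 3.0]` and `[3.0, 5.0]`, the merged attention weight is their element-wise mean, while both feed-forward weight lists survive unchanged as separate experts.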
The key point is that BTX generalizes two known special cases and outperforms both: Branch-Train-Merge, which skips MoE finetuning and so has no learned routing, and sparse upcycling, which skips asynchronous expert training. By keeping both the parallel expert training and the learned routing, BTX achieves the best accuracy-efficiency tradeoff of the three. It is a recipe for getting multi-domain capability (code, math, world knowledge) without the communication tax of monolithic synchronized training.
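To make "learned token-level routing" concrete, here is a small sketch of how a routed MoE feed-forward layer could mix experts for one token. It assumes top-k gating with a softmax over the selected logits, a common MoE choice that this note does not itself specify; `top_k_route` and `moe_ffn` are hypothetical names, and the router logits are passed in directly rather than computed by a trained router.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def top_k_route(logits: Sequence[float], k: int = 2) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_ffn(
    x: Vector,
    router_logits: Sequence[float],
    experts: Sequence[Callable[[Vector], Vector]],
    k: int = 2,
) -> Vector:
    """Weighted sum of the selected experts' outputs for a single token."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

In this picture, Branch-Train-Merge corresponds to having the experts but no finetuned router, while sparse upcycling corresponds to learning the router over experts that all start as identical copies of the seed model.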
This sits in the vault's MoE/specialization thread as a training-procedure contribution. It complements "Can routing mask future experts to prevent knowledge leakage?" (TiMoE partitions experts by time; BTX partitions by domain) and the broader move to obtain capability by composing independently trained parts rather than running one synchronized training job.
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- How do you partition LLM experts by domain versus by time?
- What makes mixture-of-experts routing learn token-level specialization effectively?
- Can you compose independent LLM experts without synchronization overhead?
- Why does Branch-Train-Merge fail without learned routing between experts?
Related concepts in this collection (2)
- Can routing mask future experts to prevent knowledge leakage?
Can models be built so that they respect query timestamps by selectively silencing experts trained on future data? This explores whether temporal causality can be enforced through architecture rather than external retrieval.
both build MoE from independently-scoped experts; BTX by domain, TiMoE by time slice
- Can brain structure guide how we design intelligent agents?
Does mapping agent capabilities onto human brain functions provide a useful organizing framework for understanding and comparing different agent architectures? This matters because agents need a shared vocabulary to advance beyond one-off designs.
modular composition of specialized parts, here at the parameter level
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
- Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
- Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL
- RouteLLM: Learning to Route LLMs with Preference Data
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
- QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
Original note title
training expert LLMs embarrassingly-parallel then merging their feed-forward layers into a routed mixture-of-experts beats synchronized training