Particle.news

MoE Research Delivers New Playbooks for Design, Routing, and Real-World Deployment

The releases point to simpler design rules and practical fixes that make sparse expert models easier to scale.

Overview

  • A wave of Mixture-of-Experts (MoE) papers, released Wednesday, maps fresh advances in model setup, token routing, and deployment.
  • A large sweep of more than 2,000 pretraining runs reports that performance rises with total MoE parameters, and that the best expert size depends on how many parameters are active per token rather than total model size.
  • A mechanistic study finds that router weights align with their paired experts, reports that common load‑balancing losses blur that alignment, and shows that a parameter‑free online K‑Means router keeps expert loads balanced at only a small perplexity cost.
  • Systems work identifies a bimodal token‑to‑expert pattern that leaves some experts overloaded and many underused, and a scheduler called Sieve improves simulated throughput and interactivity by 1.3x to 1.6x by splitting work between GPUs and processing‑in‑memory (PIM) units and overlapping compute with communication.
  • Deployment tests on analog compute‑in‑memory hardware show device noise distorts routing and breaks expert balance, and a post‑training calibration method called ROMER restores balance and cuts perplexity by up to about 59% on DeepSeek‑MoE, Qwen‑MoE, and OLMoE under real‑chip noise.
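The online K‑Means router mentioned above can be sketched roughly as follows: each expert keeps a running centroid, tokens are routed to the nearest centroid, and that centroid is nudged toward the tokens it receives. This is a minimal illustration of the general idea, not the paper's exact method; the function name, learning rate, and unbalanced nearest‑centroid assignment are assumptions for the sketch.

```python
import numpy as np

def online_kmeans_route(tokens, centroids, lr=0.05):
    """Illustrative online K-Means routing sketch.

    tokens:    (n, d) array of token embeddings
    centroids: (k, d) array, one centroid per expert
    Routes each token to its nearest expert centroid, then moves that
    centroid toward the token (an online mean update), so routing needs
    no learned router parameters.
    """
    assignments = []
    for x in tokens:
        dists = np.linalg.norm(centroids - x, axis=1)  # distance to each expert centroid
        e = int(np.argmin(dists))                      # route to nearest centroid
        centroids[e] += lr * (x - centroids[e])        # online centroid update
        assignments.append(e)
    return assignments, centroids
```

Because centroids drift toward the tokens they attract, clusters spread across the embedding space over time, which is one intuition for why such a router can keep expert loads more even than a learned gate pulled off‑balance by auxiliary losses.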