Overview
- A wave of Mixture-of-Experts (MoE) papers, released Wednesday, maps fresh advances in model scaling, token routing, and hardware deployment.
- A sweep of more than 2,000 pretraining runs reports that performance rises with total MoE parameter count, and that the optimal expert size is set by the number of parameters active per token rather than by total model size (toy parameter arithmetic after this list).
- A mechanistic study finds that router weights align with their paired experts, that common load-balancing losses blur this alignment, and that a parameter-free online K-Means router keeps expert loads even at only a small perplexity cost (minimal router sketch below).
- Systems work identifies a bimodal token-to-expert pattern that leaves a few experts overloaded and many underused; a scheduler called Sieve splits work between GPUs and processing-in-memory (PIM) units and overlaps compute with communication, improving simulated throughput and interactivity by 1.3x to 1.6x (load-splitting sketch below).
- Deployment tests on analog compute-in-memory hardware show that device noise distorts routing and breaks expert balance; a post-training calibration method called ROMER restores balance and cuts perplexity by up to about 59% on DeepSeek-MoE, Qwen-MoE, and OLMoE under real-chip noise (noise-and-calibration sketch below).
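
To make the scaling sweep's distinction concrete, the toy arithmetic below computes total versus active parameter counts for a hypothetical MoE layer. All sizes here are illustrative assumptions, not figures from the paper's runs.

```python
# Toy arithmetic for the total-vs-active parameter distinction in an MoE layer.
# All numbers below are illustrative, not taken from the paper's 2,000+ runs.

d_model = 1024    # hidden size (assumed)
d_ff = 4096       # expert FFN inner size (assumed)
n_experts = 64    # experts per MoE layer (assumed)
top_k = 2         # experts activated per token (assumed)

# A standard two-matrix FFN expert: d_model*d_ff (up) + d_ff*d_model (down).
params_per_expert = 2 * d_model * d_ff

total_moe_params = n_experts * params_per_expert      # what the sweep ties to quality
active_params_per_token = top_k * params_per_expert   # what the sweep ties to best expert size

print(f"total MoE params:  {total_moe_params:,}")         # 536,870,912
print(f"active per token:  {active_params_per_token:,}")  # 16,777,216
```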
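The routing bullet describes the K-Means router only at a high level, so the sketch below shows one minimal, assumed realization: each expert owns a centroid, tokens route to the nearest centroid, and centroids track a running mean, so no routing matrix is learned and no learning rate is tuned. The paper's exact update rule may differ.

```python
import numpy as np

class OnlineKMeansRouter:
    """Minimal sketch of a parameter-free online K-Means router (assumed
    interface). Each expert owns a centroid; a token routes to the nearest
    centroid, which is then nudged toward the token by a running-mean step."""

    def __init__(self, n_experts: int, d_model: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.standard_normal((n_experts, d_model))
        self.counts = np.ones(n_experts)  # start at 1 to avoid division by zero

    def route(self, x: np.ndarray) -> np.ndarray:
        """x: (batch, d_model). Returns an expert index per token and updates centroids."""
        # Squared Euclidean distance from every token to every centroid.
        d2 = ((x[:, None, :] - self.centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Online mean update: each centroid moves 1/count of the way toward its
        # new token, so there are no learned router weights and no tuned step size.
        for i, e in enumerate(assign):
            self.counts[e] += 1
            self.centroids[e] += (x[i] - self.centroids[e]) / self.counts[e]
        return assign

router = OnlineKMeansRouter(n_experts=8, d_model=16)
tokens = np.random.default_rng(1).standard_normal((32, 16))
print(router.route(tokens))  # expert id per token
```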
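The bimodal pattern and the split behind Sieve's reported gains can be illustrated with a toy simulation. The hot/cold threshold and the per-device token rates below are assumptions for illustration, not the paper's actual policy or measurements.

```python
import numpy as np

# Toy illustration of the bimodal token-to-expert pattern and a Sieve-style
# split (hypothetical policy): hot experts go to the GPU, the long tail of
# cold experts to PIM, so the two device classes can run concurrently.

rng = np.random.default_rng(0)
n_experts, n_tokens = 64, 10_000

# Bimodal popularity: a few hot experts absorb most tokens, the rest see a trickle.
popularity = np.where(np.arange(n_experts) < 6, 10.0, 0.3)
p = popularity / popularity.sum()
loads = np.bincount(rng.choice(n_experts, size=n_tokens, p=p), minlength=n_experts)

# Split by load: experts above the mean load are "hot" (GPU), the rest "cold" (PIM).
hot = loads > loads.mean()
gpu_tokens, pim_tokens = loads[hot].sum(), loads[~hot].sum()
print(f"hot experts: {hot.sum()}  tokens on GPU: {gpu_tokens}  tokens on PIM: {pim_tokens}")

# If GPU and PIM process their shards concurrently, step time is the max of the
# two rather than the sum -- the intuition behind the reported 1.3x-1.6x gains.
t_gpu, t_pim = gpu_tokens / 50_000, pim_tokens / 8_000  # assumed per-device token rates
print(f"serial: {t_gpu + t_pim:.3f}s  overlapped: {max(t_gpu, t_pim):.3f}s")
```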
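Finally, a generic sketch of how fixed per-expert logit offsets (a stand-in for analog device noise) skew routing, and how subtracting an offset estimated on a calibration batch pulls loads back toward balance. This illustrates only the failure mode and a naive bias correction; it is not ROMER's actual procedure.

```python
import numpy as np

# Toy sketch: device noise on router logits skews expert selection; a simple
# per-expert bias calibration (illustrative only, not ROMER) restores balance.

rng = np.random.default_rng(0)
n_tokens, n_experts = 5_000, 8
logits = rng.standard_normal((n_tokens, n_experts))

clean = np.bincount(logits.argmax(1), minlength=n_experts)

# Analog noise model (assumed): a fixed per-expert offset plus per-read jitter.
offset = rng.normal(0, 0.5, n_experts)
noisy_logits = logits + offset + rng.normal(0, 0.2, logits.shape)
noisy = np.bincount(noisy_logits.argmax(1), minlength=n_experts)

# Calibration: estimate the per-expert offset by comparing mean logits on a
# calibration batch (this assumes clean logit statistics are available from a
# reference pass) and subtract it before routing.
est_offset = noisy_logits.mean(0) - logits.mean(0)
calib = np.bincount((noisy_logits - est_offset).argmax(1), minlength=n_experts)

print("clean loads:     ", clean)   # roughly uniform
print("noisy loads:     ", noisy)   # skewed toward positively offset experts
print("calibrated loads:", calib)   # pulled back toward the clean distribution
```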