Overview
- Aegaeon pooled GPU resources at the token level during a multi‑month production beta, cutting the fleet for dozens of LLMs from 1,192 to 213 accelerators.
- The system raised effective output (goodput) by up to nine times versus Alibaba’s prior serverless setup by scheduling token‑sized slices of work across shared GPUs.
- Coverage cites Nvidia H20s as the test hardware, reflecting China’s constrained access to high‑end accelerators under U.S. export controls.
- The research was presented at ACM SOSP 2025 in Seoul by authors from Peking University and Alibaba, including Alibaba Cloud CTO Jingren Zhou.
- Analyses say results may hinge on Alibaba’s vertically integrated stack and network fabric, and some on Wall Street linked the reports to weakness in data‑center‑related stocks.
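The token‑level pooling described above can be illustrated in miniature: rather than pinning a GPU to a single model for an entire request, a scheduler hands out one‑token decode slices, so idle capacity from one model's traffic serves another's. The sketch below is a hypothetical toy, not Aegaeon's implementation; the `Request` and `token_level_schedule` names are invented here, and it assumes work can be preempted cleanly at token boundaries.

```python
import collections
from dataclasses import dataclass

@dataclass
class Request:
    model: str          # which LLM this request targets
    tokens_left: int    # decode steps still needed
    generated: int = 0  # tokens produced so far

def token_level_schedule(requests, num_gpus):
    """Toy round-robin scheduler: each tick, up to num_gpus requests
    each get one single-token decode slice, then yield the GPU.
    Returns the total number of GPU decode steps consumed."""
    queue = collections.deque(requests)
    steps = 0
    while queue:
        # Hand out at most num_gpus one-token slices this tick.
        for _ in range(min(num_gpus, len(queue))):
            req = queue.popleft()
            req.generated += 1      # one decode step on a shared GPU
            req.tokens_left -= 1
            steps += 1
            if req.tokens_left > 0:
                queue.append(req)   # preempt at the token boundary
    return steps

# Three models' requests share a pool of just two GPUs.
reqs = [Request("llm-a", 3), Request("llm-b", 2), Request("llm-c", 4)]
print(token_level_schedule(reqs, num_gpus=2))  # → 9 decode steps
```

The point of the toy is the preemption granularity: because requests yield after every token, no GPU sits pinned to an idle model, which is the mechanism the reported 1,192‑to‑213 consolidation depends on (alongside much harder engineering, such as fast model swapping and KV‑cache management, that this sketch omits).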