Overview
- OpenAI published the Multipath Reliable Connection spec through the Open Compute Project after two years of work with AMD, Broadcom, Intel, Microsoft and Nvidia.
- MRC splits each data transfer across many network paths and reroutes around congestion or failed links in milliseconds to keep GPUs working in sync.
- The spec is built into 800 Gb/s network cards that can break one port into eight 100 Gb/s links to different switches, which lets builders create larger fully connected GPU clusters with fewer switch layers.
- OpenAI says it has already deployed MRC on its supercomputers, including Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater systems, and used it to train multiple models on Nvidia and Broadcom hardware.
- MRC extends RDMA over RoCE, a way for chips to read and write memory across servers without the CPU, and OpenAI casts it as a foundation for its Stargate buildout and a shared industry standard.