Overview
- In March, Google told Meta it could not deliver the full Gemini computing capacity Meta sought, and those limits have stayed in place through late June.
- The caps have delayed several of Meta’s internal AI projects and prompted the company to tell staff to reduce use of AI tokens, the units that measure model running time and cost.
- Google has applied documented compute-based rate limits for Gemini, including rolling refresh windows and weekly caps, and other large customers have seen partial restrictions.
- To ease the shortage, Google and other model providers are signing large external leases for extra GPU capacity, shifting to consumption pricing and encouraging customers to adopt AI FinOps and multi-provider routing.
- The squeeze is driven by shortages of high-bandwidth memory and rented H100 GPUs, pushing firms toward smaller local models, decentralised GPU networks and a major global capex push for data centres and chips.