Microsoft has announced that the company has built a top 5 AI supercomputer for OpenAI, hosted in the Azure cloud. Microsoft invested a billion dollars in the OpenAI industry research group in 2019. The massive system is comprised of some 10,000 GPUs and over 285,000 CPU cores and will be used to advance the industry’s capabilities in dealing with very large AI models, which are doubling in size every 3.5 months, according to OpenAI. Microsoft’s Turing model for natural language generation contains some 17 Billion parameters, which is a 17-fold increase over last years largest model. So this supercomputer will be put to very good use.
Strangely, Microsoft failed to name the computer (an unheard-of omission in the Supercomputer world) and declined to communicate any system configuration details that a user must understand: which GPU’s development stack, which CPU and the number of cores & threads per socket, which networking interface, and the configuration of each node (#CPUs and #GPUs). While no spokesperson would confirm my suspicions on the record, I think I can shed some light on these important factors.
Figure 1: Microsoft’s blog on the announcement included this content-free image of their supercomputer. Source: Microsoft
Whose GPUs were used? NVIDIA V100
First, the GPU’s must be NVIDIA V100s because a) NVIDIA just announced the A100 and would be hard-pressed to deliver 10,000 units before they launched it last week. And b) the GPU’s cannot be AMD Radeons because those GPUs do not yet enjoy the ecosystem required to support the research underway at OpenAI. So, by the process of elimination, one must conclude that the GPU’s are in fact NVIDIA V100s. At 10,000 units, let’s assume that Microsoft got a seriously sweet deal and paid only $5,000 each, which would perhaps produce some $50M in revenue for NVIDIA, likely in the last quarter.
Whose CPUs were used? AMD EPYC Rome
As for the CPU sockets, the math says they are AMD EPYC Rome CPUs; there are just not enough cores in an Intel Xeon to make the numbers work unless Microsoft spent big bucks for the top-shelf 56-core Xeons. At 285,000 cores, let’s assume it is AMD’s 64-core CPU in a two-socket configuration. That would imply some 2220 nodes. At 4 GPUs each, that would connect to some 8800 GPUs, so at least we are in the 10,000 GPU Zip code. When asked, an informed source confirmed my logic here as well, confirming the use of AMD EPYC, but requested anonymity.
As for the interconnect, NVIDIA’s acquisition of Mellanox, and their leadership in the supercomputer space would favor InfiniBand, so that is my assumption.
While I understand that Microsoft and OpenAI wanted to focus the announcement on them and the great research they are undertaking, I find their approach quite old-school, and inconsistent with the culture transformed by Satya Nadella. In an open IT world, such facts are critical and should have been included in the announcement. The company used a comic-like drawing instead of a glamour shot as well, so we cannot tell which system was used (I assume it was an Open Compute HGX, but …). Oh, well. A little detective work leads me to conclude that AMD, NVIDIA, and probably Mellanox won the sockets. And the world of AI research will benefit from their leading technology and efforts; they deserve the credit.