Last fall, as Cray was being acquired by Hewlett Packard Enterprise for $1.6B, the company announced that it had been selected by the US DOE for two more exascale supercomputers based on the processor-agnostic Cray Shasta architecture. The Lawrence Livermore National Laboratory (LLNL) El Capitan supercomputer, costing some $600M and planned for 2023, will be based on the next-generation “Genoa” AMD EPYC CPU, each integrated with four next-gen AMD Radeon GPUs over AMD’s 3rd-generation Infinity fabric. Although details are still scarce, I believe the integrated four-GPU-plus-CPU approach could present a real challenge to NVIDIA, but note that it will take years for AMD to catch up to NVIDIA’s lead in AI and HPC software.
What was announced?
Spokespersons from LLNL, Cray, and AMD announced updated performance expectations for El Capitan, which will use simulation and machine learning to safeguard the USA’s nuclear stockpile: the system is now expected to exceed 2 exaflops, over 30% faster than previously disclosed. While the group did not disclose the node count or the core count of the Genoa EPYC CPUs, they did point to the 4-to-1 GPU/CPU architecture interconnected over AMD’s memory-coherent Infinity fabric, with each node then interconnected over the Cray Slingshot interconnect. Steve Scott, CTO of HPE Cray, said that the expected 2,000,000,000,000,000,000 operations per second is more than 10 times the performance of today’s #1 supercomputer, the ORNL Summit, which is powered by IBM POWER CPUs and NVIDIA GPUs, and is more powerful than the top 200 fastest computers in the world today combined.
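As a quick back-of-the-envelope check of that “more than 10 times” claim, the arithmetic works out. (The Summit figure below is the publicly reported Top500 Linpack Rmax of roughly 148.6 petaflops from the November 2019 list; that figure is my assumption, not something stated in the announcement.)

```python
# Sanity-checking the claimed speedup of El Capitan over Summit.
EXAFLOP = 1e18
PETAFLOP = 1e15

el_capitan_flops = 2 * EXAFLOP    # >2 exaflops, per the announcement
summit_flops = 148.6 * PETAFLOP   # Summit's reported Linpack Rmax (assumed figure)

speedup = el_capitan_flops / summit_flops
print(f"El Capitan vs. Summit: ~{speedup:.1f}x")  # ~13.5x, consistent with ">10x"
```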
AMD’s Forrest Norrod said that the AMD EPYC and Radeon GPUs would be standard products, not special SKUs built for LLNL. One of the key attributes the team featured, in addition to claiming industry-leading per-core and multi-core performance, was simplicity of programming, which is greatly enhanced by the shared memory coherency of the Infinity Fabric. This means that every software thread operating on a node can access all four HBM-3 stacks on the GPUs and the CPU’s memory as a single memory space. I suspect that AMD firmware will provide smart memory pre-fetching and management for the HBM memory.
Why is this important?
First and foremost, the selection of AMD in such a performance-critical system speaks highly of the company’s CPU and GPU roadmaps. This means that engineers and management from the DOE and Cray felt confident in AMD’s future and ability to execute.
Technically speaking, the combination of a CPU and four GPUs on an integrated cache-coherent fabric has tremendous potential to optimize performance and minimize programming headaches. Contrast this with NVIDIA, which does not have a data-center-grade CPU and uses its proprietary NVLink 2.0 to interconnect GPUs, relying on the far slower PCIe Gen 3 I/O interconnect to reach CPUs. IBM POWER does support NVLink 2.0, but unless NVIDIA were to begin to invest heavily in an Arm server CPU, or were to acquire IBM POWER, I don’t see how it will address this opportunity if it materializes as I expect.
This means that AMD may have an advantage at modest (4-GPU-node) scale, although it has a huge software challenge ahead of it. I’d also note that AMD’s approach is limited to a four-GPU fabric, by practical necessity given packaging and cache coherency, and will rely on Cray Slingshot to interconnect more GPUs at scale. But Slingshot is certainly no slouch, delivering extremely high bandwidth at an astonishing 12.8 Tb/s per direction over 64 200-Gbps ports. While AI can use thousands of GPUs, a four-GPU node is fairly ideal for HPC and can also run AI quite well for today’s DNN models.
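Those switch numbers are easy to verify: 64 ports at 200 Gb/s each do indeed total 12.8 Tb/s per direction.

```python
# Sanity-checking the quoted Slingshot switch bandwidth figure.
ports = 64
gbps_per_port = 200

total_tbps = ports * gbps_per_port / 1_000  # convert Gb/s to Tb/s
print(total_tbps)  # 12.8 Tb/s per direction
```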
NVIDIA, recognizing the coming battle for highly scaled fabrics, is engineering the acquisition of Mellanox, which conceivably could provide a solution for a highly scaled GPU fabric. But, of course, no CPU natively “speaks” Mellanox InfiniBand, and it remains unclear how NVIDIA will address the CPU-GPU bottleneck. Intel, for its part, has thrown in with Habana Labs, which uses 100Gb Ethernet with RDMA to address the same requirements. Note that Ethernet, like Slingshot, connects nodes, while Infinity Fabric connects the processors inside a node.
Finally, I would point to an advantage both AMD and NVIDIA (and Intel in the near future) enjoy over the startups building AI-specific ASICs: GPUs can handle the 64-bit floating-point-intensive workloads common in High-Performance Computing as well as the lower-precision AI workloads. That’s one reason NVIDIA is so prevalent in public clouds. All three US-based exascale supercomputers will use GPUs (from Intel and AMD) to handle the heavy lifting of HPC while also providing performance acceleration for AI. Some, like Argonne National Laboratory, which is experimenting with Cerebras, will then supplement these systems with specialized AI silicon adjacent to their primary HPC system.
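The precision gap behind that point is easy to see in practice. The sketch below is my own illustration, not from the announcement; it uses NumPy to contrast 64-bit floats, the workhorse of HPC, with the 16-bit floats commonly used for AI inference.

```python
import numpy as np

# FP64 carries a ~52-bit significand (~15-16 decimal digits of precision);
# FP16 carries only a 10-bit significand (~3-4 decimal digits).
print(np.finfo(np.float64).eps)  # ~2.2e-16
print(np.finfo(np.float16).eps)  # ~9.8e-04

# A concrete consequence: FP16 cannot even represent the integer 2049,
# so it rounds to the nearest representable value. Fine for a neural
# network's weights; fatal for many HPC simulations.
print(np.float64(2049.0))  # 2049.0 -- exact
print(np.float16(2049.0))  # 2048.0 -- rounded
```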
AMD is maneuvering itself into an attractive position in HPC by combining its CPU and GPU prowess into an integrated product, akin to the APU strategy the company has used to differentiate itself in laptops. AMD will have its work cut out for it to build the AI software ecosystem needed to penetrate the AI market, and we will all have to await next gen silicon from AMD, Intel, and NVIDIA to ascertain whether AMD can leverage this into a leadership market position. But there is definitely a new game in town.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.