When I was a VP at AMD in 2016, I worked with a talented team on a new supercomputing platform that married AMD EPYC CPUs with Radeon Instinct GPUs, interconnected over the AMD Infinity Fabric. While that product was shelved, a successor to that idea will power what will likely be the fastest supercomputer in the world, called Frontier. Frontier will be the second Exascale computer in the USA, built by supercomputing leader Cray and installed at the DOE’s Oak Ridge National Laboratory (ORNL) in 2021 at a cost of over $600M. Exascale computers, for the unfamiliar, are capable of executing a mind-numbing one quintillion (that’s a 1 followed by 18 zeros) floating-point operations per second. Frontier is expected to deliver 1.5 exaflops.
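To put those numbers in perspective, here is a rough back-of-the-envelope sketch. Only the 1.5 exaflops figure comes from the announcement; the workload size and the 1-teraflop comparison machine are hypothetical values I picked for illustration, and the calculation assumes each machine sustains its rated speed the whole time, which real applications rarely do.

```python
# Back-of-the-envelope arithmetic for the figures quoted above.
EXA = 10**18  # one quintillion: a 1 followed by 18 zeros

# Expected peak from the announcement: 1.5 exaflops.
FRONTIER_PEAK_FLOPS = 1.5 * EXA

# Hypothetical comparison machine (an assumption, not from the announcement):
# a workstation sustaining 1 teraflop.
WORKSTATION_FLOPS = 10**12

# Hypothetical workload (also an assumption): 10**21 floating-point operations.
WORKLOAD = 10**21


def ideal_runtime_seconds(total_operations, flops):
    """Idealized runtime, assuming the machine sustains `flops` throughout."""
    return total_operations / flops


frontier_seconds = ideal_runtime_seconds(WORKLOAD, FRONTIER_PEAK_FLOPS)
workstation_seconds = ideal_runtime_seconds(WORKLOAD, WORKSTATION_FLOPS)

print(f"Zeros in one quintillion: {len(str(EXA)) - 1}")
print(f"Frontier at peak: {frontier_seconds / 60:.1f} minutes")
print(f"1-teraflop workstation: {workstation_seconds / (3600 * 24 * 365):.1f} years")
```

Under these assumptions, a job that would keep the workstation busy for roughly three decades would finish on Frontier in about eleven minutes.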
This announcement marks a major milestone for both Cray and AMD. For AMD, it marks a reentry into the top echelon of computing and a path forward to create the AI ecosystem the company needs. AMD was an HPC powerhouse before it lost the recipe for Opteron CPUs, and there has been only one system on the Top500 that used AMD GPUs. AMD has been patiently building an AI software stack, called ROCm, but adoption stalled while awaiting a competitive GPU and a scalable system architecture. To this end, Frontier will form the foundation for AMD to develop the AI ecosystem it needs to compete with NVIDIA.
Cray now stands to regain the #1 supercomputer crown it lost in 2013, when the Chinese Tianhe-2 surpassed ORNL’s Titan, the first major GPU-equipped supercomputer. It must have rankled Cray that Intel acted as the prime contractor for Argonne National Laboratory’s Aurora (which Intel hopes will be the first Exascale supercomputer built in the USA), even though that system was also designed by Cray: it features the same processor-agnostic Shasta system and uses the Cray Slingshot interconnect. It is unusual for a chip supplier to bid on a Top500 supercomputer, but the DOE likes to spread the wealth and risk across multiple vendors.
These systems are part of the $1.8B of investments the DOE plans for the first wave of Exascale computers. A third US Exascale system, likely to be won by IBM with POWER10 and NVIDIA, is planned and the award process is working its way through the final phases now.
The next frontier
Let’s look at the few details the companies disclosed on the Frontier design. The server node will be based on a future, yet-unannounced version of the AMD EPYC CPU, with four yet-unannounced Radeon Instinct GPUs. No details on either chip were disclosed, other than to say that both are HPC custom designs and that the GPU will support mixed precision, a nod to the trend of running HPC codes in 64-bit floating point and AI codes in 8- and 16-bit math. As with the IBM/NVIDIA Summit and Sierra systems, as well as the Intel Aurora, the majority of the heavy lifting will be performed by the accelerators. I would assume that the GPU will also support some sort of native tensor operations, which NVIDIA supports today in its Tensor Core GPUs and which are beginning to be leveraged in HPC and AI applications.
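For readers wondering why HPC codes cling to 64-bit math while AI codes get away with 16 bits, here is a minimal sketch using only Python’s standard library. It emulates IEEE 754 half precision by round-tripping values through the `struct` module’s `"e"` format; the helper names are my own, and nothing here reflects the actual number formats of the unannounced AMD GPU, which have not been disclosed.

```python
import struct


def to_half(x):
    """Round a 64-bit Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def count_in_half(n):
    """Add 1.0 to a running total n times, rounding to 16 bits after each add."""
    total = 0.0
    for _ in range(n):
        total = to_half(total + 1.0)
    return total


def count_in_double(n):
    """Same loop in ordinary 64-bit floating point."""
    total = 0.0
    for _ in range(n):
        total += 1.0
    return total


# 0.1 cannot be stored exactly in either format, but the 16-bit error is far larger.
print(f"0.1 stored in 16 bits: {to_half(0.1)!r}")

# The 16-bit running total stops growing once adding 1.0 is below its resolution.
print(f"Counting to 4096 in 16 bits: {count_in_half(4096)}")
print(f"Counting to 4096 in 64 bits: {count_in_double(4096)}")
```

The 16-bit counter stalls at 2048 because half precision carries only 11 significant bits, so 2049 rounds back down to 2048. That kind of error is often tolerable when training a neural network, but not when a simulation has to accumulate many small contributions, which is why a GPU that handles both precisions well matters for a machine meant to serve HPC and AI alike.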
Interestingly, the server node is a single-socket design (one CPU per node), which is highly unusual since most Intel servers support at least two Xeon CPUs. However, the single-socket design saves a lot of development, testing, and material expenses, and is one of AMD’s key advantages over Intel Xeons.
The AMD Infinity Fabric is a high-bandwidth, low-latency CPU interconnect that is fully cache-coherent, a feature that enables multiple CPUs to cooperate when accessing shared memory. Extending this fabric to connect GPUs is a major plus and a natural move for AMD. GPUs today share data over the far slower PCIe interconnect or, in some cases such as the IBM Summit system, the NVIDIA NVLink 2.0 fabric. We will have to await more details to understand the scope of the cache coherency in this design, especially since the four-GPU blade will also support direct access to the Slingshot interconnect.
For me personally, it is great to see these two old HPC partners reconnect and co-design a massive supercomputer. Their earlier efforts produced such supercomputers as ORNL’s Titan, the system housed at NCSA at the University of Illinois, and others. I expect to see more “Mini-Frontier” systems pop up around the world, which will benefit Cray and AMD at Intel’s and NVIDIA’s expense. Importantly, this win also marks the beginning of AMD’s relevance in AI acceleration, as AI is no longer optional for HPC advancements.
Now the challenge for AMD is to build a competitive datacenter GPU and AI ecosystem. This will ensure that ORNL will not end up missing the “good old days” of having 27,000 NVIDIA GPUs they know how to use.