AMD Doubles Down On HPC With Milan-X And MI200

AMD introduced the Milan-X CPU and MI200 Series GPU to extend its high-performance computing (HPC) capabilities. Milan-X combines advanced packaging with unique 3D stacking to achieve what AMD claims are significant performance gains for HPC workloads. Meanwhile, the MI200 is a new GPU design that competes directly with NVIDIA’s A100 GPU, an HPC workhorse. What do these products mean for the company? And more importantly, what do these announcements mean for the IT consumer? I’ll try to cover this in the following paragraphs.

First, the setup

Supercomputing is the ultimate performance test for a server. It is a broad term for workloads and applications that draw on many resources for compute-intensive and data-intensive processing. Weather modeling, crash simulations, drug and vaccine development, and financial trading forecasting are all examples of HPC workloads that fall under supercomputing.

Ultimately, supercomputing tests the design limits of a system, and those limits show up as performance. So, when organizations like Top500 post lists of the world’s most powerful supercomputing environments, it’s a big deal for everyone involved, from the CPU to the GPU to the server platform. In the last couple of years, AMD has been disruptive in the Top500 and Green500 lists, which rank the top supercomputing environments around the globe by raw compute power and performance per watt, respectively.

On the Top500 list, AMD is the compute power behind 48 systems. Not fantastic, you think? Consider this: nearly all of these (45) have been introduced since 2020. And of the 48 new systems introduced in 2021, AMD powers 24. It’s fair to say that the company has reached cruising velocity.

The Green500 list is even more impressive for AMD and EPYC. On this list, EPYC powers 8 of the top 10 supercomputing environments, with each one also employing the NVIDIA A100 GPU (more on this later). EPYC also makes up 24 of the 48 systems rolled out in 2021 (compared with 22 for Intel Xeon). Further, nearly all of the EPYC environments employ 64-core CPUs. That’s a lot of raw power. For comparison purposes, Xeon-based systems are mainly in the 24-core range.

Milan-X – 3D chiplets and 3D V-Cache

With Milan-X, AMD has introduced a new 3D technology that layers cache directly on top of the compute complex, aptly named 3D V-Cache. This differs from existing 3D stacking technologies, which rely on soldering. The result? 15x better interconnect density and 3x better energy efficiency versus existing 3D technologies (and a whopping 200x interconnect density improvement versus 2D). The bottom line for organizations running compute-intensive workloads? AMD claims Milan-X will achieve a 50% performance uplift over existing Milan for targeted workloads.

[Figure: AMD claims an average 50% performance gain due to its 3D V-Cache. Source: AMD]

This is an impressive narrative that AMD has told. But for the organizations deploying such clusters and compute environments, the decision to deploy one CPU type over another comes down to performance. Often this is a raw performance measurement, and sometimes a performance-per-watt measurement. AMD has already demonstrated impressive numbers on this front with Milan, showing a performance advantage of up to 40% versus a similarly configured Intel CPU, a considerable advantage.

[Figure: Comparing Milan to the competition. Source: AMD]

Beyond the specifications and benchmark results, I find two things about AMD’s Milan-X announcement encouraging: ease of adoption and ecosystem enablement. Milan-X is a socket-compatible upgrade for existing Milan installs, meaning IT organizations can simply replace the CPU, flash the BIOS, and enjoy the benefits of Milan-X. Will there need to be a qualification for this new CPU? One would think so, but that process should be minimal, given the architectural consistency. Further, no software or OS change or refactoring is required to reap the benefits of Milan-X.

AMD also seems to have done a great job of ecosystem enablement with Milan-X. Ensuring that the leading ISV players in the respective workload categories are fully supported gives IT shops confidence that a speedy and frictionless path to significant performance gains is more than just a marketing promise.

[Figure: Milan-X ecosystem enablement. Source: AMD]

AMD has more than proven its chops in the datacenter since it launched EPYC. Organizations that have any hesitancy around deploying EPYC are organizations that are behind the times. For me, Milan-X shows the company is smartly investing in areas where it sees an opportunity to extend an already strong leadership position. 

Instinct MI200 Series GPU Overview

While I traditionally cover the CPU space, I admit that the MI200 caught my attention, as AMD appears to be putting a direct target on the NVIDIA A100 GPU. If one looks through the Top500 and Green500 lists, one will notice that the A100 enjoys a prominent standing in both. Of the 48 supercomputing systems deployed in 2021, 21 included an accelerator. Of these 21 systems, 19 utilized the NVIDIA A100. Digging a little deeper, 17 of the 24 AMD-based supercomputing environments rolled out in 2021 are powered by the A100. It seems fair to say that AMD has noticed these numbers and sees opportunity.
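For readers who want to see those counts as shares, here is a minimal Python sketch. It uses only the figures quoted above; the variable names and the `share` helper are my own, chosen purely for illustration.

```python
# Counts quoted in the article for supercomputing systems deployed in 2021.
new_systems_2021 = 48   # all new systems on the list
with_accelerator = 21   # new systems that include an accelerator
with_a100 = 19          # accelerated systems using the NVIDIA A100
amd_systems_2021 = 24   # new systems built on AMD CPUs
amd_with_a100 = 17      # AMD-based systems paired with the A100


def share(part, whole):
    """Return part/whole as a percentage (hypothetical helper)."""
    return 100 * part / whole


shares = {
    "new 2021 systems with an accelerator": share(with_accelerator, new_systems_2021),
    "accelerated systems using the A100": share(with_a100, with_accelerator),
    "AMD-based 2021 systems using the A100": share(amd_with_a100, amd_systems_2021),
}

for label, pct in shares.items():
    print(f"{label}: {pct:.0f}%")
```

In round numbers, that works out to roughly 44% of new systems using an accelerator, about 90% of those accelerated systems running the A100, and about 71% of the new AMD-based systems being paired with it.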

The numbers AMD shows certainly support its claims around performance advantages for HPC and AI (specifically, AI training). The floating-point and BFloat performance advantages are significant and can deliver excellent performance, especially when paired with a Milan-X CPU.

[Figure: Comparing Instinct MI200 to the competition. Source: AMD]

AMD credits key innovations in the MI200 design for the performance gains demonstrated. 2nd Gen Matrix cores across two AMD CDNA dies, fed by eight stacks of high bandwidth memory (HBM) and a coherent CPU-to-GPU interconnect, make this part tailor-made for memory-intensive and compute-intensive workloads.

[Figure: Detailing MI200 design innovations. Source: AMD]

When it comes to building and expanding the software ecosystem for Instinct, I think AMD is fighting against strong CUDA gravity. But the company is taking the right approach, executing appropriately, and demonstrating patience in a space that won’t turn overnight. Its latest reveal of ROCm 5.0 is evidence of this. Where ROCm 4.0 focused on delivering and solidifying production-ready HPC and ML stacks, ROCm 5.0 looks to build on that momentum. In my estimation, this is the right approach to securing a foothold in what will drive strong adoption of the MI200: an embrace by the software ecosystem and developer community.

What it all means 

AMD has (not so) quietly established its performance leadership position in the datacenter. Its first big successes were seen in the cloud service provider (CSP) space, followed by several wins in the HPC market.

AMD has made two significant moves that I believe will accelerate its server market share gains. The first was introducing the “F” series EPYC processors a few quarters back, targeting mainstream enterprise workloads such as databases. These CPUs should enable AMD and its hardware partners to drive significant gains if the positioning and go-to-market (GTM) efforts support them.

The second move is this announcement of Milan-X and the Instinct MI200 Series. The performance claims alone should drive even greater EPYC adoption, with an easy and compelling reason to attach the MI200.  

The HPC market is not only lucrative but also a proving ground for mainstream server adoption. Combined with the company’s success in the cloud, the sky could be the limit for EPYC if AMD plays its datacenter marketing cards right. 

Note: Moor Insights & Strategy writers and editors may have contributed to this article.