Advanced Micro Devices (AMD) has shared more details about its upcoming Vega family of GPUs, along with an update on the progress of its open-source ROCm software stack for HPC and Machine Learning (ML). The new Vega GPUs appear to perform in the same ballpark as NVIDIA’s highly acclaimed Pascal family of GPUs. Taken together with the optimized ROCm ML software AMD has developed, they form a solid first step for market entry. But it will all come down to whether the open-source ROCm software can transform the hardware’s potential performance into real value.
The New Vega GPUs
AMD announced the Radeon Instinct MI6, MI8, and the Vega-based MI25 for Machine Learning back in December, with Vega shipments expected mid-2017. Vega is being positioned for compute-intensive training jobs against the NVIDIA Pascal-based Tesla P100 with what looks like competitive performance potential.
AMD’s Vega-based Radeon Instinct MI25 appears to have a ~10% performance advantage over NVIDIA’s Pascal-based P100. It is now up to AMD’s software to deliver that performance to Machine Learning applications.
(Source: Moor Insights & Strategy)
Now, AMD has announced it is adding a workstation form-factor GPU to the Vega line, called the Radeon Vega Frontier Edition, targeting ML development environments. AMD said its Vega strategy is to deliver better performance than NVIDIA’s Pascal, and to price for a 2x advantage in price/performance. AMD intends to court developers and researchers who want to start evaluating its Vega products for Machine Learning and to help build out the company’s open-source suite of software and libraries. Note that AMD has a two-part story for reducing TCO: the lower cost and higher I/O capacity of the Zen-based Naples server CPU, combined with the price/performance advantage of the Vega GPU for Machine Learning.
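To make the price/performance claim concrete, here is a rough back-of-the-envelope sketch. It assumes the ~10% performance edge estimated above for the MI25 and takes the 2x price/performance target at face value; AMD has not announced pricing, so the resulting ratio is purely illustrative, and the function name is hypothetical.

```python
def implied_price_ratio(perf_ratio: float, price_perf_advantage: float) -> float:
    """Return the price ratio (challenger / incumbent) implied by a
    performance ratio and a target price/performance advantage.

    advantage = (perf_a / price_a) / (perf_b / price_b)
              = perf_ratio / price_ratio
    so price_ratio = perf_ratio / advantage.
    """
    if perf_ratio <= 0 or price_perf_advantage <= 0:
        raise ValueError("ratios must be positive")
    return perf_ratio / price_perf_advantage


# Assumed inputs (illustrative only, not announced pricing):
# a ~10% performance edge and a 2x price/performance target.
ratio = implied_price_ratio(perf_ratio=1.10, price_perf_advantage=2.0)
print(f"Implied price relative to the competing part: {ratio:.2f}x")
```

Under these assumptions, hitting a 2x price/performance advantage with only a ~10% performance lead would require pricing at roughly 55% of the competing part, which suggests that price, more than raw performance, would have to carry most of the claimed advantage.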
AMD began building the critical Machine Learning software ecosystem last year, with open-source Linux drivers, libraries and tools. The company showed that this effort, which is essential to turning theoretical hardware specs into application-level ML performance, is making good progress. At its Financial Analyst Event, AMD said it has achieved a 13- to 16-fold increase in software performance since November as it optimizes its libraries. Company executives also laid out a software roadmap, detailing the path to delivering Caffe2, TensorFlow, PyTorch and MXNet support for the Vega hardware later this summer, along with the optimized solvers in ROCm.
AMD recognizes that NVIDIA has a significant head start in the Machine Learning market, having spent years optimizing its hardware and software, and having just announced the new high-end Volta GPU. But the Machine Learning market is still quite young and is growing very fast; there’s room for an Avis to compete with NVIDIA’s Hertz. It appears that AMD is laying the foundation of hardware, software and developer relationships needed to build an attractive business. The company has completely revamped and optimized its software stack and has designed new GPUs tailored for Machine Learning algorithms, with features such as reduced-precision math and optimized solvers. Importantly, the decision to open-source AMD’s software lets the industry’s ecosystem add momentum to that development effort.
Now AMD needs to execute flawlessly, build out an ecosystem, and establish footholds at the key companies that are shaping this industry. It would be very interesting if the company could also leverage its Semi-Custom Business Unit to gain one or two partners in the autonomous robotics and automotive segments, much in the same way AMD became the leader in the gaming console business.
AMD recognizes that it will be a journey to build the hardware and software ecosystem to compete in this lucrative market, and it is taking great care to start off with the right first steps. AMD’s next steps should be interesting to watch.