NVIDIA’s New GPUs Set A High Bar For HPC And Deep Learning

By Karl Freund - June 20, 2016
With the winds of AI at its back, NVIDIA has been on a tear of late, capitalizing on its near-monopoly in GPUs used to train Deep Neural Networks (DNNs) and to power high performance computing (HPC). Now, as the annual International Supercomputing Conference (renamed “ISC High Performance”) gets underway in Frankfurt, Germany, the company has announced details of the PCIe-based Pascal Tesla P100 cards that join the flagship NVLink Tesla cards announced in April. Together, these products will set the gold standard by which many HPC and Deep Learning accelerators will be measured, including the new Intel Xeon Phi (aka “Knights Landing”). And given that NVIDIA grew its datacenter business last quarter by an eye-popping 63%, you can bet that its competitors are paying attention.

The new NVIDIA Tesla P100 PCIe-based accelerator will be available from the channel and OEMs by the end of this year (Source: NVIDIA)

What did NVIDIA announce?

NVIDIA launched the NVLink-equipped P100 accelerator in April, so this announcement focused on PCIe-based Tesla P100 cards one could buy in the channel or in an OEM system. The new card offers two options for GPU memory: 12 GB or 16 GB of High Bandwidth Memory (HBM2). The PCIe cards will be available later this year in systems from Cray, Dell, Hewlett Packard Enterprise (HPE) and IBM, presumably just in time for the Supercomputing ’16 event in Salt Lake City. As NVIDIA stated in April, the NVLink-enabled cards will ship in early 2017. It’s important to note that NVIDIA’s performance claims are all based on the faster NVLink package, which delivers about 10% more performance: a mind-blowing 21.2 trillion peak operations per second (21.2 TFLOPS) for Deep Learning workloads.
The NVLink version of the Tesla P100 is designed to be engineered onto a board with NVLink connectivity to adjacent P100 accelerators, yielding much higher bandwidth and lower latency than is possible with its PCIe brethren (Source: NVIDIA)

Different Cards for Different Workloads

While NVIDIA unfortunately chose to brand all these products as “Tesla P100”, they are very different, with unique form factors, varying levels of performance, wildly different interconnects and, of course, different prices. Since NVIDIA didn’t give them different names, I will. Let’s call them the PCIe Tesla P100 and the NVLink Tesla P100 for clarity.

NVIDIA’s suite of PCIe and NVLink cards has different specs and form factors. (Source: NVIDIA)

The new PCIe Tesla P100 products will be the favorite for many HPC workloads, since they provide acceleration for standard Intel Xeon-based servers and should be more affordable. While the company did not announce prices, which are set by its partners, it indicated that an entry P100 card should be in the same price range as the current Tesla K80 (dual-chip) card, which retails for about $4,000 on Amazon.com. That implies that even the low-end model will deliver 2.5 times better price performance than its predecessor, and 5 times better for Deep Learning workloads. This doubling for DNNs is due to the hardware’s support for half-precision (16-bit) operations, which are sufficient for the massive calculations required to train most neural networks. NVIDIA shows that it has increased performance in these workloads by over 60-fold since the K40 was introduced in November 2013. Note that it is comparing 8 P100 cards vs. a single K40, so this is not apples-to-apples. But even normalizing for that, it has indeed broken Moore’s Law with an ~8x performance boost (60 / 8 = 7.5) in 3 years.

NVIDIA compares 8 Tesla P100s in a DGX-1 to a single K40 card, claiming over 60-fold increase in less than 3 years. Note that optimized software also provides significant performance contributions (Source: NVIDIA)

The flagship NVLink Tesla P100 will appeal to a smaller audience willing to pay a premium for the performance and low-latency scaling it offers. In addition to higher costs for these packages, there are significant system design considerations that must be addressed to exploit NVLink. An Intel PCIe-based server does not natively support the proprietary NVLink protocol, so more engineering is needed to connect the CPUs to an array of Tesla P100s via a PCIe switch, and a mezzanine board will be needed to interconnect the Tesla P100s in a mesh or other topology. The $129K NVIDIA DGX-1 is an example of how this could work. Note that IBM’s OpenPOWER servers will offer the faster and lower-latency NVLink connection directly to the IBM POWER8 and upcoming POWER9 chips.

The layout of the DGX-1 from NVIDIA shows the complexity of interconnected GPUs using PCIe switches to talk to the Intel Xeon processors (Source: NVIDIA)

Deep Learning is a highly parallel workload that can scale to a large number of GPUs. An NVLink-enabled server will allow these GPUs to share data with one another more than twice as fast as over the slower PCIe interface. So, if you are training a very deep neural network, especially with convolutions and tensor calculations, this will be the Ferrari you always dreamed of. Importantly, it allows NVIDIA customers like Cray, Dell, HPE and IBM to design and sell optimized DNN systems to compete with NVIDIA’s own DGX-1 “Supercomputer in a box”.

So, where’s the competition?

This is NVIDIA’s house; there aren’t any credible competitors that can deliver the performance, efficiency, and software ecosystem that NVIDIA enjoys today. The next Intel Xeon Phi, code-named Knights Landing, will target the market for training neural networks, as will startups like Nervana, KnuEdge, Wave Computing and others.
However, they will all need more than good hardware to challenge NVIDIA; they will need optimized software and a community of developers in this open-source world. Advanced Micro Devices (AMD) has some of the capabilities to compete in this market; however, the company seems content to focus on GPUs for VR and gaming for now, which have very different requirements than HPC and Deep Learning. In the meantime, AMD is completely revamping its software stack for HPC and DNN applications, under the umbrella of the Radeon Open Compute platform (ROCm). ROCm brings some unique and very attractive capabilities to the table, so when AMD does come out with HPC-class GPUs, it will have a head start on a new ecosystem that can embrace them.

We are still in the early innings of the growth of accelerators, especially for Deep Learning, so stay tuned for more innovations and dynamics in this fascinating market.
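
For readers who want to check the scaling arithmetic behind the 8-card DGX-1 versus single-K40 comparison, here is a rough back-of-the-envelope sketch in Python. It is illustrative only: the function names are hypothetical, the normalization assumes performance scales linearly across the 8 cards, and the 18- and 24-month Moore's Law doubling periods are commonly quoted values, not figures from NVIDIA.

```python
# Back-of-the-envelope check of the scaling claim discussed in the article.
# Inputs are the article's own figures; the doubling periods are assumptions.


def per_card_speedup(system_speedup: float, num_cards: int) -> float:
    """Normalize a multi-card speedup to one card, assuming linear scaling."""
    return system_speedup / num_cards


def moores_law_factor(years: float, doubling_period_years: float) -> float:
    """Growth expected if performance doubles once per doubling period."""
    return 2.0 ** (years / doubling_period_years)


claimed_speedup = 60.0  # "over 60-fold": 8 P100s in a DGX-1 vs. one K40
cards = 8               # number of P100 cards in the comparison
years = 3.0             # the article's rounded K40-to-P100 interval

per_card = per_card_speedup(claimed_speedup, cards)
print(f"Per-card speedup: {per_card:.1f}x")

# Compare against two commonly quoted Moore's Law doubling periods (assumed).
for period in (1.5, 2.0):
    expected = moores_law_factor(years, period)
    print(f"Moore's Law, doubling every {period} years: {expected:.1f}x")
```

Under these assumptions the per-card gain works out to 7.5x, against roughly 2.8x to 4x expected from Moore's Law over the same three years, which is consistent with the ~8x figure cited above. As the figure caption notes, part of that gain comes from optimized software rather than hardware alone.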