While the AI market has attracted a large number of would-be NVIDIA competitors in acceleration chips, including Intel, AMD, and dozens of well-funded startups, until recently there was no set of benchmarks to enable apples-to-apples comparisons. The industry was simply too immature and fast-moving to take the time to develop one. That is about to change with the MLPerf benchmark suite, announced last May with the support of over thirty companies and seven research universities. The first results from NVIDIA, Intel, and Google were published this week, so let’s take a look at what they can tell us.
The results: it’s complicated
It turns out to be quite a challenge to get competitors and researchers to agree on a suite of benchmarks and rules to directly compare different chips and the systems they go into. The MLPerf v0.5 suite consists of seven benchmarks representing important AI workloads: image classification, object instance segmentation, object detection, non-recurrent translation, recurrent translation, recommendation systems, and reinforcement learning. Only Intel submitted results for the latter, since accelerators today are not very useful for reinforcement learning. Each vendor must follow stringent rules to ensure they are all running the same software stack, reporting the time it takes to train a neural network to perform the job to a specific accuracy level.
At first glance, the benchmarks make for difficult comparisons and may lead to confusion, since the server systems (nodes) in which they run may have 2, 4, 8, or 16 chips on board. Comparing a two-socket Intel Xeon server to an NVIDIA DGX-2h behemoth with 16 V100 accelerators is hardly fair or useful. Luckily, the MLPerf results specify how many chips were used, so one can normalize the results to allow for a chip-to-chip comparison—as NVIDIA has done in Figure 1. At least this works in theory; more on this later.
I’d also point out that Google did not compare to the newer DGX-2h, the higher-power, higher-performance server NVIDIA announced last month, which delivered a modest 3.9-minute advantage over the DGX-2. Not a huge difference, but when combined with the fact that Google used 25% more TPUs to get these results, one can see that NVIDIA’s is indeed the faster chip. Still, to Google’s credit, it is close for ResNet-50 (image classification).
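To make the per-chip normalization concrete, here is a minimal sketch. All system names and numbers below are illustrative, not actual MLPerf v0.5 submissions, and the math assumes roughly linear scaling with chip count, which real systems only approximate.

```python
def chip_minutes(chips, minutes):
    """Chip-minutes to reach the target accuracy: lower means better
    per-chip efficiency. Assumes roughly linear scaling, so treat the
    result as a rough comparison, not a precise one."""
    return chips * minutes

def per_chip_speed_ratio(chips_a, minutes_a, chips_b, minutes_b):
    """How much faster one chip of system A is than one chip of system B.
    Per-chip speed is proportional to 1 / (chips * minutes)."""
    return chip_minutes(chips_b, minutes_b) / chip_minutes(chips_a, minutes_a)

# Illustrative case: two systems finish in the same wall-clock time,
# but system B needs 25% more chips than system A.
ratio = per_chip_speed_ratio(chips_a=100, minutes_a=10.0,
                             chips_b=125, minutes_b=10.0)
print(ratio)  # 1.25 -> each system-A chip is ~25% faster per chip
```

This is the same arithmetic behind the observation above: matching wall-clock time while using 25% more chips implies roughly 25% lower per-chip performance.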
From these results, I conclude that:
- NVIDIA V100 remains the fastest AI accelerator, as measured by single chip and performance at scale across a wide range of important workloads, especially when one normalizes for constant accelerator count.
- Google’s results for image classification were nearly on par with NVIDIA’s, which is impressive. However, the TPU does not perform as well on other AI tasks and was notably MIA for four of the seven benchmarks the companies had agreed were important.
- Intel’s results were far behind both NVIDIA and Google; however, they were respectable and show that general-purpose CPUs can be used for training DNNs (without waiting days or weeks for results) if you don’t want to spend a lot of money on accelerators.
- The lack of any submissions from other accelerator providers such as AMD and a slew of startups indicates that their chips are all still works in progress. This story will get a lot more interesting once they deliver working silicon and systems, with benchmarks for both training and inference.
Finally, I would point out that NVIDIA is still the only game in town for accelerating neural network training using on-premises infrastructure, in any cloud service (AWS, Azure, Alibaba) other than GCP, or even in any GCP-hosted training other than for TensorFlow. Consequently, this is all a bit of a tempest in a teapot.
Nonetheless, NVIDIA’s breadth of AI hardware and software solutions, Tensor Core performance, ease of use with the NVIDIA GPU Cloud repository, and global ecosystem in AI production and research will keep NVIDIA in the lead for some time to come.