Can Intel’s New Knights Landing Chip Compete With NVIDIA For Deep Learning?

Intel finally launched the highly anticipated “Knights Landing” (KNL) version of their Many Integrated Core (MIC) Xeon Phi processor at this year’s international supercomputing event (now called ISC High Performance), targeting High Performance Computing (HPC) and the white-hot market for training Deep Neural Networks (DNNs). DNNs are used to power everything from the ads you see on Google to autonomous cars to natural language processing to image recognition. The announcement started a bit of a food fight: Intel claims that they can beat NVIDIA in Deep Learning by over 2x, to which NVIDIA responded that, no, actually their new accelerators beat Intel by, you guessed it, over 2x.

Practically every tech company has staked a claim in Deep Learning as The Next Big Thing, and Intel is now spending a great deal of energy trying to join the party; I don’t expect this to be all the ammunition they can muster. AI applications have helped boost the market for NVIDIA’s Tesla products, helping NVIDIA grow their Datacenter business by 63% to $142M in their latest quarter. Intel is now offering an attractive approach to this market, providing GPU-class performance on a host CPU that is compatible with the massive base of Xeon applications, potentially simplifying programming and reducing costly data movement. However, given the momentum NVIDIA has built for their GPUs, buyers will need time to validate Intel’s performance claims on their own compute-intensive workloads, especially in Deep Learning applications where NVIDIA dominates.

What did Intel announce?

This new Intel chip is the first full implementation of Intel’s Scalable System Framework and diverges from its predecessor in three critical areas.

  1. The chip is a bootable CPU, not a PCIe accelerator (the company also plans to offer an accelerator version for customers who prefer that model of compute).
  2. The chip has an on-die Omni-Path interconnect for high bandwidth, low latency scaling between nodes.
  3. The chip offers high bandwidth stacked die memory, the first CPU to do so, using Intel’s version of Micron Technology’s Hybrid Memory Cube, dubbed MCDRAM.


The new Intel Xeon Phi supports up to 72 bootable cores, and features integrated High Bandwidth Memory and the Intel Omni-Path interconnect (Source: Intel)

There are four different SKUs available now, ranging from 64 cores at 1.3GHz to 72 cores running at 1.5GHz. Intel has pre-sold some 100,000 Phi processors through over 30 OEM vendors in HPC for delivery this year, and has a blue-ribbon list of customers that includes five US national labs, the Texas Advanced Computing Center and many others. All versions come with 16GB of integrated MCDRAM, which delivers up to 7.2 giga-transfers per second. The memory can be used as a cache or as flat, directly addressable memory, as the programmer deems appropriate.


Intel provided some HPC benchmarks comparing the new Xeon Phi to NVIDIA GPUs (Source: Intel)

The results for traditional HPC workloads are impressive, and this is a large market that has begun experimenting with NVIDIA accelerators, which intrude on Intel’s space. Assuming they can deliver real application performance, Intel has a good story to tell and may be able to slow the penetration of GPU accelerators into their treasured installed base of Xeon processors. Note that the newly minted Top 500 list has a total of 52 systems with NVIDIA accelerators, while 35 systems use Intel Xeon Phi’s previous generation, Knights Corner, to accelerate their workloads. So Intel has a good upgrade story to tell as well.

But what about Deep Learning?

Unlike most tech companies, Intel has been relatively quiet on the topic of Deep Learning, but the company hopes that the new Xeon Phi may finally allow them to compete in this fast-growing market, which until now has been dominated by NVIDIA. To prove their point, they shared that four new Phi processors completed the training of the Caffe AlexNet imaging neural network with 1.33 billion images in 10.5 hours, compared to four Maxwell GPUs, which they pegged at 25 hours. Intel also showed how they have achieved an impressive 30x increase in performance for Intel Xeon CPUs through optimizations in their software stack for inference work. This is vital, since the volume market for Deep Learning chips will be in the use of the neural network, or inference, not the training of the network, which seems to get all the press.


Intel claims they are 2.3x faster than NVIDIA’s Maxwell M40 GPUs for Deep Learning applications using the Caffe software (Source: Intel)
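As a back-of-the-envelope check, the claimed speedup and implied throughput follow directly from the figures quoted above; this is just a sketch of the arithmetic using Intel's own numbers, not an independent measurement:

```python
# Back-of-the-envelope check on Intel's training claim, using only the
# numbers quoted in the article (1.33B images, 10.5 vs. 25 hours).
images = 1.33e9          # images processed during AlexNet training
phi_hours = 10.5         # four Xeon Phi processors (Intel's figure)
maxwell_hours = 25.0     # four Maxwell GPUs (Intel's figure)

speedup = maxwell_hours / phi_hours        # ratio of training times
phi_rate = images / (phi_hours * 3600)     # implied images per second

print(f"claimed speedup: {speedup:.2f}x")  # ~2.38x, i.e. Intel's "2.3x"
print(f"implied throughput: {phi_rate:,.0f} images/s")
```

The 25/10.5 ratio is where the headline "2.3x" comes from; everything else in the comparison hangs on which Caffe version and which GPU generation you plug into the denominator.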

Scaling, the next frontier

But what if you need to scale to a bigger cluster, which is often the case in computing deep neural networks? Here, Intel claims that 128 Xeon Phis will deliver 50 times the performance of a single Phi in training applications, thanks to the Omni-Path interconnect integrated onto the Xeon Phi die. That amounts to a 38% scaling efficiency advantage over NVIDIA GPUs. This area is of keen interest, since scaling leads to more bottlenecks outside the CPU or accelerator itself.
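Scaling efficiency is simply observed speedup divided by ideal linear speedup, so Intel's 50x-on-128-nodes figure implies roughly 39% absolute efficiency at that scale. A minimal sketch (note this absolute figure is a different quantity from the 38% relative advantage Intel claims over GPUs):

```python
# Scaling efficiency = observed speedup / ideal linear speedup.
nodes = 128
observed_speedup = 50.0                # Intel's claim for 128 Xeon Phis
efficiency = observed_speedup / nodes  # fraction of perfect scaling

print(f"absolute efficiency at {nodes} nodes: {efficiency:.0%}")  # 39%
```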

NVIDIA responds

While the new Intel Xeon Phi is a major step forward for Intel, it will not go unchallenged. NVIDIA has posted their (first?) response to Intel on this webpage. In addition, the company points out that Intel’s claim of 2.3x better performance of Xeon Phi over Maxwell GPUs for DNNs relies on a benchmark that is over a year old. NVIDIA ran the same benchmark on a more recent version of the Caffe software and concludes that a server with four Maxwell GPUs trains in only 10 hours, not 25, a tad faster than the 12.5 hours for four KNL processors. NVIDIA also claims that a single server with two P100 GPUs trains AlexNet in 8 hours, faster than four KNL processors, and that eight Pascal P100 GPUs do it in only 2 hours.


Figure 4: NVIDIA claims that Pascal P100 is over 6 times faster (normalizing for an equal number of GPUs) than the same Maxwell chip Intel used in Figure 2. (Source: NVIDIA)

Intel also claims that KNL has a scaling efficiency of 87% out to 32 nodes, while GPUs reach 62% efficiency at 32 nodes in training an imaging neural network. However, Baidu has just published a paper that shows near-linear scaling up to 128 GPUs for speech training, due in part to the “strong scaling” of multiple beefy GPUs in a node vs. a node with a single Xeon Phi. Note that Intel will be able to support strong scaling as well, once PCIe versions of KNL are available. We can expect a flurry of new benchmarks and comparisons between now and the next Supercomputing event in November in Salt Lake City, Utah.
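Taken at face value, those dueling efficiency figures translate into effective cluster speedups as follows; a quick sketch using the 87% and 62% numbers quoted above, with no claim about how either was measured:

```python
def effective_speedup(nodes: int, efficiency: float) -> float:
    """Observed speedup implied by a scaling-efficiency figure."""
    return nodes * efficiency

knl = effective_speedup(32, 0.87)  # ~27.8x on 32 KNL nodes (Intel's figure)
gpu = effective_speedup(32, 0.62)  # ~19.8x on 32 GPU nodes (Intel's figure)

print(f"KNL {knl:.1f}x vs GPU {gpu:.1f}x ({knl / gpu:.2f}x advantage)")
```

In other words, Intel's numbers imply a 32-node KNL cluster does about 1.4x the work of a 32-node GPU cluster on this benchmark, which is exactly the kind of claim Baidu's near-linear multi-GPU results push back against.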

In short, Intel has a new, more competitive chip, and has begun investing in the development of their ecosystem for Deep Learning. The new chip packs an impressive set of new technologies, and will do very well in traditional HPC environments. However, given what we know today, I do not think this fundamentally changes the calculus with respect to NVIDIA for Deep Learning; Intel has a long way to go to match NVIDIA’s market presence and preference. After all, of the major Internet players in Deep Learning, not one will yet say they will use Xeon Phi for training neural networks. But these are the early innings of a much more competitive landscape, so we will be watching this space closely for new data.