A Cambrian Explosion In Deep Learning, Part 3: NVIDIA

By Karl Freund - January 25, 2019

Now that I’ve scared everyone who holds NVIDIA stock, and/or given hope to those who spend a lot of money on NVIDIA GPUs, let’s take a realistic view of how NVIDIA might maintain its leadership in a far more crowded marketplace. We will need to look at the training and inference markets separately.

A history lesson from Nervana

First, let’s look at Intel’s experience with Nervana. Before being acquired by Intel, Nervana claimed it would outperform GPUs by at least 10X. Then a funny thing happened on the way to victory: NVIDIA surprised everyone with TensorCores, delivering not 2X over Pascal, but 5X. Then NVIDIA doubled down with NVSwitch, which allowed it to build an amazingly high-performance (and, at $400K, quite expensive) 16-GPU DGX-2 server that blows away most, if not all, competitors. Meanwhile, NVIDIA roughly doubled the performance of its cuDNN libraries and drivers. It also built the GPU Cloud to make using its GPUs as easy as clicking and downloading containers of optimized software stacks for some 30 Deep Learning and Scientific workloads. So, as I have shared in previous blogs, Intel’s promised 10X performance advantage evaporated, and Nervana-now-Intel had to go back to the drawing board, promising to deliver a new chip in late 2019. Basically, NVIDIA proved that 10,000-plus engineers with a solid track record and a warehouse full of technology can out-innovate 50 bright engineers in a virtual garage. Nobody should have been surprised, right?

Give 10,000 engineers a big sandbox

Now, fast-forward three years to 2019. Once again, competitors are claiming 10X or even 100X performance advantages for their silicon, all of which is still under development. NVIDIA still has its 10,000-engineer army, with collaborative technical relationships with the top researchers and end users across the globe. Now, they are all working on their best ideas for NVIDIA’s next-generation 7nm chip, which, in my opinion, will essentially transform the company’s products from “GPU chips with AI” into “AI chips with GPUs.”

Figure 1: NVIDIA's DGX-2 supercomputer-in-a-box delivers 2 peta-ops of AI performance across 16 V100 GPUs interconnected on NVSwitch. (Source: NVIDIA)

How much additional sand (logic area) might NVIDIA engineers have to play with for the company’s next-generation product? While the following analysis is simplistic, it can be useful, to within an order of magnitude, in framing the answer to that critical question.

Let’s start with the first ASIC that seems to have excellent performance, the Google TPU. I have seen analyses that estimate each Google TPU die at around 2 to 2.5 billion transistors. A Volta V100 has roughly 21 billion transistors in a 12nm manufacturing process. It is the largest chip TSMC can manufacture. As NVIDIA goes from 12nm to 7nm, the die can contain roughly 1.96 times (1.4 x 1.4) as many transistors. Therefore, in theory, if NVIDIA didn’t add any graphics logic (admittedly unlikely), it would have another 20 billion transistors to play with, roughly ten times the amount of logic in an entire Google TPU. Suppose my logic is off by a factor of 2. In that case, NVIDIA engineers still have 5 times the logic available for new AI features. Now, all this assumes that NVIDIA will want to go all out for performance, instead of lowering costs or power. In the training market, though, that’s what users need: shorter training times. There are a lot of ideas floating around about what NVIDIA might deliver, including memory in processor and more versions of TensorCores.
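For readers who want to check the arithmetic, the back-of-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the reasoning, not a model of any real chip: the inputs are the rough figures quoted in this article, and the function and variable names are made up for illustration.

```python
from typing import Tuple

# Rough inputs taken from the article; all are estimates, not measured values.
V100_TRANSISTORS_B = 21.0        # Volta V100 at 12nm, in billions of transistors
TPU_TRANSISTORS_B = (2.0, 2.5)   # estimated range for one Google TPU die, in billions
DENSITY_GAIN = 1.4 * 1.4         # assumed 12nm -> 7nm density factor (about 1.96)


def extra_transistors_b(base_b: float, density_gain: float) -> float:
    """Additional transistors (billions) on a same-size die after the shrink."""
    return base_b * density_gain - base_b


def tpu_equivalents(extra_b: float, tpu_range_b: Tuple[float, float]) -> Tuple[float, float]:
    """How many TPU-sized dies the extra budget covers, as a (low, high) range."""
    small_tpu, large_tpu = tpu_range_b
    return extra_b / large_tpu, extra_b / small_tpu


extra = extra_transistors_b(V100_TRANSISTORS_B, DENSITY_GAIN)
low, high = tpu_equivalents(extra, TPU_TRANSISTORS_B)

# Same estimate if the assumed budget is off by a factor of 2.
half_low, half_high = tpu_equivalents(extra / 2, TPU_TRANSISTORS_B)

print(f"Extra transistor budget: about {extra:.1f} billion")
print(f"TPU equivalents: about {low:.1f}x to {high:.1f}x")
print(f"If off by a factor of 2: about {half_low:.1f}x to {half_high:.1f}x")
```

Under these assumptions the extra budget comes to about 20 billion transistors, or roughly 8 to 10 TPU-sized dies, and still roughly 4 to 5 if the estimate is halved.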

My point here is that NVIDIA undoubtedly has the expertise and available silicon real estate to innovate, as it did with TensorCores. I’ve spoken to many interesting AI chip startups, but the ones I respect the most tell me they do not underestimate NVIDIA and do not think of it as being locked into a GPU mindset. NVIDIA’s DLA and Xavier, an ASIC and an SOC, respectively, demonstrated that NVIDIA can create all sorts of accelerators, not just GPUs. Many of these startup CEOs have decided, therefore, to stay out of NVIDIA’s way and focus initially on inference.

My view is that NVIDIA will not suffer from a long-term disadvantage in training performance. Its issue may be high silicon costs, but for training, customers will pay up. Additionally, when it comes to inference, NVIDIA’s Xavier is a very impressive chip.

The Cambrian explosion benefits programmability

Let’s get back to the Cambrian Explosion idea. NVIDIA correctly points out that we are in the early days of algorithmic research and experimentation. An ASIC that does a great job of processing, say, Convolutional Neural Networks for image processing may (and almost certainly will) do a lousy job of processing, say, GANs, RNNs, or Neural Networks that have yet to be invented. This is where GPU programmability pays off: together with the NVIDIA ecosystem of researchers, it allows a GPU to be adapted rather quickly to a new form of neural network processing, provided NVIDIA can solve the impending memory problems. The company has already significantly reduced the memory capacity problem, at a hefty price, by using NVLink to create a mesh of 8 GPUs and 256GB of High-Bandwidth Memory (HBM). We will have to await its next-generation GPU to understand if and how it can also solve the latency and bandwidth issues, which will require memory with roughly 10X the performance of HBM.

The Inference Wars

For inference, as I wrote in the first installment of this series, there is no 800-pound gorilla incumbent one would have to unseat. The edge and datacenter inference markets are diverse and poised for rapid growth, but I have to wonder whether the mass inference market will be particularly attractive from a margin standpoint. After all, with scores of companies vying for attention and sales, margins may be fairly thin in what will become a commodity market. Now, some inference is easy, and some is very, very hard. The latter market will maintain high margins, as only complex SOCs equipped with CPUs, parallel processing engines like Nervana, GPUs, DSPs, and ASICs can deliver the performance required for things like autonomous driving. Intel’s Naveen Rao has recently tweeted clues that the Nervana Inference processor will, in fact, be a 10nm SOC with Ice Lake CPU cores. NVIDIA already leads this approach with the Xavier SOC for autonomous driving, and Xilinx is taking a similar approach with its Versal chip later this year. Any startup heading down this path will need to have a) very good performance/watt, and b) a roadmap of innovations that will keep it ahead of the commodity masses.


In summary, I would reiterate the following:

  1. The future of AI is enabled by application-specific silicon, and the market for specialized chips will become huge.
  2. The world’s largest silicon companies intend to win in the AI chip wars of the future. While Intel is playing catch up, do not underestimate what it can do.
  3. There are scores of well-funded startups, and a few will be successful. If you want to invest in a venture-backed firm, be certain that they are not dismissive of NVIDIA’s firepower.
  4. China will largely wean itself off of US-based AI technology over the next 5 years.
  5. NVIDIA has over 10,000 engineers and will probably surprise us all with its next-generation high-end GPUs for AI.
  6. The inference market will grow rapidly, and there will be room for many application-specific devices. FPGAs, especially Xilinx’s next generation, may play a major role here.

Clearly, there’s a lot to cover on this topic, and I’ve only scratched the surface! Thank you for taking the time to read this series—I hope it has been illuminating and informative.  
