NVIDIA’s datacenter business has been on a tear lately, roughly doubling every year for the past several years. It hit $1.93 billion for the 2018 fiscal year, an increase of nearly 130% over the previous year. This growth has been largely driven by the pervasive use of NVIDIA GPUs in HPC and in neural network training for Artificial Intelligence research and development.
However, common sense says that at some point, the demand to run AI applications will become larger than the demand to build them (assuming these AI tools will indeed be useful). With this in mind, there are now scores of companies, large and small, designing silicon for inference processing, including Google, Intel, Wave Computing, and Graphcore (many of these firms will be presenting their technology on Sept. 18-19 at the inaugural AI HW Summit in Silicon Valley).
Enter the Turing-based Tesla T4 and TensorRT 5 software
When NVIDIA announced the Turing GPU, targeting visualization and real-time rendering, it included some very interesting specs indicating it could make a darned good inference engine. Industry observers have wondered whether NVIDIA GPUs are the right technology to lead this transition to “production AI,” so it was vital for Jensen Huang, NVIDIA’s CEO, to demonstrate the company’s place in inference processing. Not one to disappoint, Mr. Huang announced the new Turing-based Tesla T4 at the GTC-Japan keynote this week—the company’s first GPU to specifically target inference processing in the datacenter.
NVIDIA’s inference platforms to date have been focused on robotics and autonomous driving, such as the Xavier SoC used in DrivePX for autos and in Jetson for robotics (which I covered here). As far as inference processing in the datacenter goes, NVIDIA says its P4 and P40 GPUs have been very popular for AI in the cloud: image recognition in video, voice processing, recommendation engines for eCommerce, and natural language processing for analyzing and translating speech into text. One example NVIDIA shared was Microsoft Bing, which uses these GPUs to power its visual search capability 60 times faster than it could using CPUs. Additionally, each P4 GPU can process 30 simultaneous streams of video running at 30 frames per second.
The new NVIDIA Tesla T4 GPU will effectively replace the P4 and is packaged in a low-profile PCIe card, shown in Figure 1. Burning only 75 watts, the new chip features 320 “Turing Tensor Cores” optimized for the integer calculations popular in inferencing jobs. It can crank out 130 trillion 8-bit integer or 260 trillion 4-bit integer operations per second (TOPS). If you need floating point operations, such as those required in neural network training, the T4 can handle 65 TFLOPS for 16-bit calculations, which is about half the performance of the NVIDIA Volta GPU while burning only one-quarter of the power. The net result is a 2X speedup in processing the video streams I mentioned earlier; while the P4 could handle 30, the T4 can handle 60.
Figure 1: The new NVIDIA T4 GPU for datacenter AI is based on the same Turing architecture NVIDIA recently unveiled for real-time ray tracing using AI. (Image: NVIDIA)
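To put the T4 figures quoted above in perspective, here is a rough back-of-the-envelope sketch of the arithmetic. It uses only the numbers stated in this article; the helper names are my own, and the assumption that the T4's 60 streams run at the same 30 frames per second as the P4's is mine, not a published spec.

```python
# Back-of-the-envelope arithmetic using only figures quoted in the article.
# Function and constant names are illustrative, not from any NVIDIA tool.

T4_POWER_WATTS = 75      # board power quoted for the T4
T4_INT8_TOPS = 130       # trillion 8-bit integer ops per second
T4_INT4_TOPS = 260       # trillion 4-bit integer ops per second
T4_FP16_TFLOPS = 65      # trillion 16-bit floating point ops per second

P4_STREAMS = 30          # simultaneous video streams on a P4
T4_STREAMS = 60          # simultaneous video streams on a T4
FPS_PER_STREAM = 30      # quoted for the P4; ASSUMED to hold for the T4 too


def per_watt(throughput: float, watts: float) -> float:
    """Peak throughput per watt (tera-ops per second per watt)."""
    return throughput / watts


def frames_per_second(streams: int, fps: int = FPS_PER_STREAM) -> int:
    """Aggregate video frames processed per second across all streams."""
    return streams * fps


print(f"INT8: {per_watt(T4_INT8_TOPS, T4_POWER_WATTS):.2f} TOPS/W")
print(f"INT4: {per_watt(T4_INT4_TOPS, T4_POWER_WATTS):.2f} TOPS/W")
print(f"FP16: {per_watt(T4_FP16_TFLOPS, T4_POWER_WATTS):.2f} TFLOPS/W")
print(f"P4 video: {frames_per_second(P4_STREAMS)} frames/s")
print(f"T4 video: {frames_per_second(T4_STREAMS)} frames/s")
```

These are peak, vendor-quoted numbers, so the per-watt results are best read as upper bounds rather than what a real inference workload would sustain.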
The software side of the story is based on the 5th release of NVIDIA TensorRT, which preprocesses the neural network to optimize its execution on the new device (branch trimming, sparse matrix optimization, etc.) and provides the run-time libraries that support that execution. TensorRT 5 also supports Kubernetes containerization, load balancing, dynamic batching, and turnkey resource management to help cloud service providers put these new GPUs into their infrastructure, and it adds support for Google Neural Machine Translation (GNMT).
NVIDIA has been struggling to establish its place in AI inference processing in the datacenter for two reasons:
- Inference at scale is just getting started, and much or most of that processing can be handled today by Intel Xeon (or AMD EPYC) CPUs. The primary use case has been for low-resolution still images, such as those uploaded by Facebook users, so there has been little need for the power of a GPU in inference processing.
- NVIDIA does not break down its datacenter business by AI vs. HPC vs. Virtual Desktop Infrastructure, much less AI training vs. inference. It can’t or won’t say how many GPUs are already being used for inference.
As more applications are developed for processing streaming video for branding, security, and marketing, the first challenge should fade. Additionally, now that NVIDIA has a dedicated inference GPU, we can hopefully look forward to seeing concrete use cases. Perhaps we’ll even get an indication of the volume of inference processing the company is able to capture.
Finally, I would point out that there are dozens of startups targeting inference, with the potential to match (and maybe exceed) the performance and efficiency of the Tesla T4. Unlike AI training, this will not likely be a one-horse race. For now, though, most of these startups only have PowerPoint. NVIDIA now has a real dedicated inference engine to sell.