In its third round of submissions, MLCommons released results for MLPerf Inference v1.0. MLPerf is a set of standard AI inference benchmarking tests built around seven different applications. The seven tests cover a range of workloads, including computer vision, medical imaging, recommender systems, speech recognition, and natural language processing.
MLPerf benchmarking measures how fast a trained neural network can process data for each application and its form factor. The results allow unbiased comparison between systems.
Each application test is administered in its own environment and with its own accuracy requirements. The tests are performed on various form factors, covering data center servers as well as edge software and hardware. The complete data center results and the complete edge results are published by MLCommons.
Data Center Benchmarks
The data center tests use two scenarios: offline and server. The offline scenario runs inference on stored data, while the server scenario runs it on dynamic data. Unlike static stored data, dynamic data must be processed immediately, with inference performed as the data arrives.
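The difference between the two scenarios can be sketched in a few lines of Python. This is a toy model for illustration, not MLPerf's actual test harness: the function names, the single-worker queue, and the timing values are all assumptions. The offline case reports throughput over stored data, while the server case tracks the latency of each query as it arrives.

```python
def offline_throughput(num_samples, seconds_per_sample):
    """Samples per second when all stored data is available up front."""
    total_seconds = num_samples * seconds_per_sample
    return num_samples / total_seconds


def server_latencies(arrival_times, seconds_per_query):
    """Per-query latency for one worker that serves queries in arrival order."""
    latencies = []
    worker_free_at = 0.0
    for arrival in arrival_times:
        # A query waits if the worker is still busy with an earlier one.
        start = max(arrival, worker_free_at)
        finish = start + seconds_per_query
        latencies.append(finish - arrival)
        worker_free_at = finish
    return latencies


# Hypothetical numbers for illustration only (not MLPerf results).
print(offline_throughput(1000, 0.002))
print(server_latencies([0.0, 0.001, 0.010], 0.004))
```

In this sketch the second query arrives while the first is still being processed, so its latency includes queueing time. That queueing effect is why a system's server result can differ from its offline result.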
MLCommons made several changes in this round of testing. To measure steady-state performance, run times were increased from one minute to ten minutes. Additionally, MLCommons introduced new power measurement metrics to complement the existing performance benchmarks and to compare energy consumption across the submitted systems.
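One common way to relate performance to power is to divide throughput by average power draw. The sketch below assumes that simple ratio. It is not the official MLPerf power methodology, and the function name and the numbers are hypothetical.

```python
def inferences_per_joule(samples_per_second, average_watts):
    """Throughput divided by average power; one watt is one joule per second."""
    if average_watts <= 0:
        raise ValueError("average_watts must be positive")
    return samples_per_second / average_watts


# Hypothetical systems for illustration only (not MLPerf results).
system_a = inferences_per_joule(samples_per_second=1000, average_watts=250)
system_b = inferences_per_joule(samples_per_second=600, average_watts=100)
print(system_a, system_b)
```

With these made-up figures, the slower system does more work per unit of energy, which is the kind of trade-off a power metric is meant to expose.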
Results of the data center offline tests and server tests speak for themselves. The NVIDIA A100 GPU was the highest performing accelerator in each application. NVIDIA was the only company to submit to every offline and server scenario.
Intel’s submissions allowed a GPU vs. CPU comparison to be made. The A100 was between 17X and 314X faster than the CPU. Qualcomm submitted its AI 100 on only two tests: image classification with ResNet-50 and object detection with SSD-Large. Qualcomm’s AI 100 outperformed NVIDIA’s newly announced energy-efficient A10 and A30 in the offline tests, but it fell short of the NVIDIA A30 in the server SSD-Large scenario. The NVIDIA A100, however, outperformed the Qualcomm AI 100 in every instance and by a very wide margin.
The two edge scenarios are single-stream and multi-stream. NVIDIA supplied charts of the edge results normalized to Jetson Xavier NX. Multi-stream is the harder of the two tests. As its name implies, multi-stream requires many different streams to be processed simultaneously while maintaining latency and accuracy thresholds. It requires the processor to take as many requests as possible and run inference on them as they come in. NVIDIA had the only submissions for multi-stream.
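The multi-stream idea, handling as many streams as possible without breaking a latency threshold, can be illustrated with a small search. This is a sketch under stated assumptions: the linear latency model, the function names, and the numbers are hypothetical and are not taken from the MLPerf rules or results.

```python
def max_streams_within_latency(latency_for, latency_bound, max_candidates=1024):
    """Largest stream count n in 1..max_candidates with latency_for(n) <= latency_bound.

    Returns 0 if even one stream misses the bound. Assumes latency never
    decreases as streams are added, so the search stops at the first miss.
    """
    best = 0
    for n in range(1, max_candidates + 1):
        if latency_for(n) > latency_bound:
            break
        best = n
    return best


def linear_latency_ms(n_streams):
    # Hypothetical model: 5 ms fixed overhead plus 3 ms per stream.
    return 5 + 3 * n_streams


print(max_streams_within_latency(linear_latency_ms, latency_bound=50))
```

A faster processor lowers the per-stream cost in this model, so it can sustain more streams under the same bound.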
NVIDIA submitted its A100 PCIe and A100 SXM GPUs in the edge category. It also submitted the Jetson Xavier NX and the Jetson AGX Xavier, which offers higher performance than the NX version.
The competitive submissions were sparse and limited to the single-stream scenario. EdgeCortix used a Xilinx U50 accelerator card for a few single-stream submissions. Qualcomm and Centaur also had submissions.
Again, NVIDIA dominated the performance landscape in every case.
A100 performance as a Multi-Instance GPU (MIG)
The A100 has 80 GB of memory and delivers over two terabytes per second of memory bandwidth. With that much power, it is understandable that some applications can’t fully utilize the A100’s capabilities. To allow more workloads to be packed onto the A100, NVIDIA’s Multi-Instance GPU (MIG) technology partitions the GPU into seven isolated instances. Each instance has its own dedicated compute cores and 10 GB of dedicated memory.
To demonstrate the robustness of MIG, NVIDIA submitted results to MLPerf with all seven tests running simultaneously. The results showed that the A100 with MIG achieved 98% of the performance of a single instance running alone on the GPU. This indicates that the overhead of running seven different applications on MIG is extremely low. MIG gives users a cost-effective way to provision a GPU partition without compromising data center efficiency.
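The 98% figure is a ratio of concurrent performance to standalone performance. A minimal sketch of that arithmetic follows. It assumes the comparison averages throughput across the concurrently running instances, which the published summary does not spell out. The function name and throughput values are hypothetical placeholders, chosen only to reproduce a 0.98 ratio.

```python
def mig_relative_performance(alone_throughput, concurrent_throughputs):
    """Mean throughput of instances running together, as a fraction of
    the throughput of one instance running alone on the GPU."""
    mean_concurrent = sum(concurrent_throughputs) / len(concurrent_throughputs)
    return mean_concurrent / alone_throughput


# Placeholder values for illustration only (not MLPerf results).
alone = 1000.0
concurrent = [980.0] * 7  # seven MIG instances busy at the same time
ratio = mig_relative_performance(alone, concurrent)
print(f"{ratio:.0%} of standalone performance, {1 - ratio:.0%} overhead")
```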
Triton delivers nearly 100% of CPU inference performance
Triton is NVIDIA’s open-source inference server. It simplifies inference deployments by offering a standardized, flexible way to serve models. Testing shows that Triton’s standard inference server solution delivers over 90% of the performance of the most optimized solution on NVIDIA GPUs.
In addition to the GPU submissions, Triton results were also submitted in the CPU-only server category. Compared with similarly configured submissions from Intel, Triton’s performance was close to 100% of Intel’s results.
Triton offers several advantages for customers. Here are just a few:
- Allows standardizing on an open-source solution
- Simplifies the entire AI pipeline without performance compromise
- Supports almost all form factors
- Supports different query types, both batch and real-time
- Can be extended with any back end
- Works with models coming out of major frameworks
- MLPerf 0.7 results were released six months ago. Compared with those results, the A100’s performance in MLPerf 1.0 has increased by 45%.
- NVIDIA’s announcement of the new A10 and A30 is significant because it expands AI acceleration to mainstream data centers, which will promote further democratization of AI.
- Every major NVIDIA OEM had submissions to MLPerf on NVIDIA’s platform.
- NVIDIA has indicated strong customer momentum toward NVIDIA GPUs for running inference, where Triton, TensorRT, and the GPU-accelerated platform are more accurate. Improved accuracy results in better use case models and enables inference to be processed more efficiently.
- As more emphasis is placed on greening data centers, look for additional NVIDIA announcements about improved performance per watt.
- With MIG, a single A100 GPU acts as seven accelerators, which results in better optimization of resources. It also promotes more efficient use of power.
- Some have expressed concern that MLPerf is mostly about NVIDIA. There are few challengers because of NVIDIA's superior performance. However, AI is relatively new. At some point, new software, new architectures, and even new technologies will make the field interesting from a competitive standpoint. The lack of any benchmarking would be much worse. Quantum computing is a good example: no single standard metric allows a comparison between quantum computers. Quantum volume comes closest, but researchers argue about its validity. Unfortunately, there is nothing better on the horizon.
Note: Moor Insights & Strategy writers and editors may have contributed to this article.