NVIDIA H100 GPU Performance Shatters Machine Learning Benchmarks For Model Training

NVIDIA’s Hopper H100 Tensor Core GPU made its first benchmarking appearance earlier this year in MLPerf Inference 2.1. No one was surprised that the H100 and its predecessor, the A100, dominated every inference workload. The H100 set world records in all of them, and NVIDIA is the only company to have submitted to every workload for every MLPerf round.

A few weeks ago, a new set of MLCommons training results was released, this time for MLPerf Training 2.1, which the NVIDIA H100 and A100 also dominated.

Unfortunately, NVIDIA’s dominance of the MLPerf benchmarking suites for inference and training has discouraged submissions and reports by many important AI companies.

The industry would benefit from the participation of more organizations. As we have seen in other sectors such as CPUs, broad participation drives competition and innovation. Broad involvement in benchmarking suites is significant because machine learning is growing exponentially. Almost every industry segment uses machine learning for a wide range of applications. As usage increases, so does model size. Since 2018, MLCommons has held testing rounds that alternate between MLPerf Training and MLPerf Inference.

In the four years between the first MLPerf test in 2018 and this year’s results, machine learning model size has increased by five orders of magnitude. With the increased model size and larger data sets, standardized tools like MLPerf Training and MLPerf Inference are more crucial than ever. Machine learning model performance must be measured before it can be improved.

MLPerf 2.1 Training benchmarks

Summary of benchmarks used in MLPerf Training v2.1 MLCOMMONS.COM

MLPerf Training and MLPerf Inference use the same eight workloads shown in the above graphic. Mini Go is an exception because it is only used to evaluate reinforcement learning. Each benchmark test is defined by its own specific dataset and quality target. The key metric is how much time it takes to train the model on the specified dataset to the specified quality target.
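To make the metric concrete, here is a minimal Python sketch of a time-to-train measurement. It is not the MLPerf harness. The names time_to_train, train_one_epoch and make_toy_trainer are hypothetical, and the toy "model" simply gains a fixed amount of quality per epoch.

```python
import time
from typing import Callable


def time_to_train(train_one_epoch: Callable[[], float],
                  quality_target: float,
                  max_epochs: int = 100,
                  clock: Callable[[], float] = time.perf_counter) -> float:
    """Return elapsed time until the quality metric reaches the target.

    Raises RuntimeError if the target is never reached, because a run
    that does not converge has no valid time-to-train.
    """
    start = clock()
    for _ in range(max_epochs):
        # Hypothetical callback: trains for one epoch, then evaluates.
        quality = train_one_epoch()
        if quality >= quality_target:
            return clock() - start
    raise RuntimeError("quality target not reached")


def make_toy_trainer(step: float) -> Callable[[], float]:
    """Toy stand-in for a model: quality rises by `step` every epoch."""
    state = {"quality": 0.0}

    def train_one_epoch() -> float:
        state["quality"] += step
        return state["quality"]

    return train_one_epoch


if __name__ == "__main__":
    seconds = time_to_train(make_toy_trainer(0.2), quality_target=0.75)
    print(f"reached target after {seconds:.6f} s")
```

The clock stops only when the quality target is met, so a faster system that fails to reach the target gets no score at all.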

MLPerf is vital to AI and machine learning because it is an industry-standard benchmark with peer-reviewed results that provide valid comparisons for model training and inference. It is supported by Amazon, Arm, Baidu, Google, Harvard University, Intel, Meta, Microsoft, Stanford University, and the University of Toronto.

Multiple single models form high performance, multiple models

Real-world AI applications use multiple models NVIDIA

It is common for multiple AI models to be chained together to satisfy a single input. An example of multimodal networks is the verbal request in the above graphic. The question requires ten machine learning models to produce an answer. Not only must multiple models operate sequentially, but they must also deliver real-time solutions.

Some cloud services also use multiple networks to deliver services accelerated by NVIDIA GPUs. All of NVIDIA’s networks and application frameworks are available on its MLPerf repo, on NGC (NVIDIA’s online container repository), and its GitHub repo.

A100 and H100 benchmark training performance

MLPerf Training v2.1 Performance NVIDIA

As shown in the MLPerf Training 2.1 performance chart, the H100 provided up to 6.7x more performance on the BERT benchmark than the A100 delivered in its first MLPerf submission in 2019.

The A100 is still producing record results, with performance improved by up to 2.5X. This gain is the result of software optimization. The A100 will likely remain an NVIDIA offering for quite some time.

The H100’s superior performance on the BERT NLP model is attributed to its Transformer Engine. The A100 does not have a Transformer Engine. The new engine, combined with NVIDIA Hopper FP8 Tensor Cores, delivers up to 9x faster AI training and 30x faster AI inference on large language models than the A100. The H100 is based on the Hopper architecture and uses fourth-generation Tensor Cores.

Training speed is crucial because of AI model size. NVIDIA’s Transformer Engine achieves additional speed using 16-bit floating-point precision and a new 8-bit floating-point data format. This combination increases Tensor Core throughput by 2x and reduces memory requirements by 2x compared to 16-bit floating-point.
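The 2x memory figure follows directly from the width of each value. The sketch below works through the arithmetic in Python. The parameter count is an illustrative assumption (roughly BERT-Large scale), and the calculation covers only values stored in the given format. Real training keeps additional state, some of it at higher precision.

```python
# Bytes needed to store one value in each floating-point format.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "fp8": 1}


def tensor_bytes(num_values: int, fmt: str) -> int:
    """Storage for `num_values` numbers held in format `fmt`."""
    return num_values * BYTES_PER_VALUE[fmt]


if __name__ == "__main__":
    # Assumption for illustration only: a model with about 340 million values.
    num_values = 340_000_000
    for fmt in ("fp16", "fp8"):
        mib = tensor_bytes(num_values, fmt) / 2**20
        print(f"{fmt}: {mib:.0f} MiB")
    ratio = tensor_bytes(num_values, "fp16") / tensor_bytes(num_values, "fp8")
    print(f"fp16 / fp8 storage ratio: {ratio:.1f}x")
```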

Those improvements, plus advanced Hopper software algorithms, speed up AI performance, allowing the H100 to train models within days or hours instead of months. The faster a model can move into operation, the earlier its ROI can begin contributing to the bottom line.

The Hopper architecture can dynamically determine if FP8 or 16-bit calculations are needed for accuracy. As the transformer engine trains layer by layer, it analyzes the data to determine if reduced precision should be used. Depending on the degree of usage, reduced precision can cause rounding errors which affect model accuracy.
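The effect of reduced precision on rounding can be seen with a small Python experiment. Python's standard library has no FP8 type, so this sketch uses 16-bit versus 32-bit floats as a stand-in for the same trade-off. It does not reproduce how the Transformer Engine makes its decision. The helper names are hypothetical.

```python
import struct
from typing import Callable


def round_to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(start: float, update: float, steps: int,
               rounder: Callable[[float], float]) -> float:
    """Apply `update` repeatedly, rounding the result after every step."""
    value = rounder(start)
    for _ in range(steps):
        value = rounder(value + update)
    return value


if __name__ == "__main__":
    # Near 1.0, adjacent half-precision values are about 0.001 apart,
    # so an update of 0.0001 is rounded away on every step.
    print("fp16:", accumulate(1.0, 0.0001, 10, round_to_fp16))
    print("fp32:", accumulate(1.0, 0.0001, 10, round_to_fp32))
```

In the 16-bit run the small updates vanish entirely, which is the kind of error that has to be kept in check when precision is lowered for some layers.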

MLPerf Training tests measure the time to solution, so a model not only has to run fast, but it also has to converge. It is therefore essential to remember that accumulated rounding errors can prevent a model from converging.

NVIDIA’s Transformer Engine technology was designed for large transformer-based networks like BERT. However, it is not restricted to NLP. It can be applied to other areas, such as Stable Diffusion.

Stable Diffusion is a compute-intensive deep learning text-to-image model released this year. It can generate detailed images or videos conditioned on text descriptions. It can also be applied to tasks such as inpainting, outpainting, and image-to-image translation guided by a text prompt.

Time to train at scale

Time to train at scale NVIDIA

NVIDIA A100 was the only platform to run all workloads in the time to train at scale. NVIDIA was able to train every workload at scale in under 5 minutes except for Mini Go, which took about 17 minutes.

Mini Go uses reinforcement learning which is very compute-intensive. It takes longer to train the network because it requires playing Mini Go turn-by-turn, then rolling it back through the network after each turn.

Training at scale demonstrates that the A100 remains a solid platform for training. The H100 is a solution for the most advanced models, such as language models with massive datasets and billions of parameters.

While Intel and Habana didn’t turn in record-setting performances, their participation was nevertheless important for the ecosystem and the future of MLPerf.

H100 Sets New Per-Accelerator Records for AI Training NVIDIA

This graphic shows relative per-accelerator speedup normalized to the A100. The H100 (in preview) was submitted for every benchmark and delivered superior performance on each. It was 2.6X faster than the A100, which has itself made significant software gains.
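Normalizing to a baseline is simple arithmetic: divide the baseline's time to train by each system's time to train. The Python sketch below uses placeholder times, not MLPerf results, and it assumes each time was measured with the same number of accelerators so the ratio can be read per accelerator.

```python
from typing import Dict


def relative_speedup(minutes_to_train: Dict[str, float],
                     baseline: str) -> Dict[str, float]:
    """Speedup of each system relative to `baseline` (baseline = 1.0)."""
    base = minutes_to_train[baseline]
    return {name: base / minutes for name, minutes in minutes_to_train.items()}


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not published results.
    placeholder_minutes = {"A100": 10.0, "H100": 4.0}
    for name, speedup in relative_speedup(placeholder_minutes, "A100").items():
        print(f"{name}: {speedup:.2f}x")
```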

Habana Gaudi2 submitted for ResNet-50 and BERT, and Intel’s Sapphire Rapids submitted for DLRM, ResNet-50, and BERT.

Habana Gaudi2 performed marginally better than A100 on BERT and about 0.75 better than A100 for ResNet-50. Intel acquired Habana in late 2019 for $2 billion. Gaudi2 is Habana’s second-generation deep learning processor. It has 24 tensor cores and 96 GB of memory.

Dave Salvator, Director of AI, Benchmarking and Cloud for NVIDIA, is expecting higher performance from the H100 in the future.

“The H100 turned in a very compelling performance,” he said. “But in the future, we will make software gains with the H100 as we did with the A100. This is the first round we’re submitting H100 for training, and it won’t be the last.”

HPC MLPerf 2.0 Supercomputing benchmarking

Benchmarking information for MLPerf HPC 2.0 MLCOMMONS.COM

MLPerf HPC 2.0 measures the time to train supercomputer models for scientific applications. Additionally, there is an optional throughput measurement for multi-user supercomputing systems. This round was the third iteration of MLPerf HPC. Like MLPerf for training and inference, MLPerf HPC is considered an industry-standard system performance measure for workloads performed on supercomputers.

For this round, five of the world’s largest supercomputers submitted 20 results: Dell (first time for submission), Fujitsu/RIKEN, Helmholtz AI, NVIDIA, and Texas Advanced Computing Center (TACC).

MLPerf HPC v2.0 Benchmarks NVIDIA

This is version 2.0 of the benchmarks. However, there have been no major changes since these same three workloads were run in 1.0. MLPerf HPC benchmarks measure training time and throughput for three high-performance simulations that have adopted machine learning techniques: CosmoFlow, DeepCAM, and OpenCatalyst.

Because of climate change, a great deal of concentrated work is being done on weather and climate modeling. NVIDIA is also working on a digital twin of the planet called Earth-2. This giant climate model simulates the entire world.

NVIDIA HPC Platform Performance Leadership

MLPerf HPC 2.0 has two performance metrics:

  • Strong Scaling measures time and quality for training on the dataset. NVIDIA Selene had the lowest training time of all submissions for all three benchmarks.
  • Weak Scaling measures throughput and quality for simultaneously training multiple models on the dataset. Again, NVIDIA trained more models per minute than any other submission.

For CosmoFlow, NVIDIA has improved time to train by 9X over two years.
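The two metrics, and an improvement factor like the one quoted for CosmoFlow, reduce to a few ratios. This Python sketch shows the arithmetic with placeholder inputs. The function names are hypothetical, and none of the numbers are MLPerf HPC results.

```python
def weak_scaling_throughput(models_trained: int,
                            wall_clock_minutes: float) -> float:
    """Models trained per minute when many models train at once."""
    return models_trained / wall_clock_minutes


def improvement_factor(old_minutes: float, new_minutes: float) -> float:
    """How many times faster the new time to train is than the old one."""
    return old_minutes / new_minutes


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    print(weak_scaling_throughput(models_trained=8, wall_clock_minutes=4.0))
    print(improvement_factor(old_minutes=18.0, new_minutes=2.0))
```

Strong scaling is the time-to-train of a single model, so lower is better. Weak scaling is a throughput, so higher is better.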

Although the NVIDIA A100 Tensor Core GPU and the NVIDIA DGX-A100 SuperPOD are almost three years old, MLPerf HPC 2.0 performance shows that the A100 is still the highest-performing system for training HPC use cases.

HPC results are for NVIDIA Selene, an implementation of the DGX SuperPOD, and demonstrate the A100’s potential. Other supercomputing sites using NVIDIA technology are also delivering good performance.

Wrapping up 

It is important to mention that NVIDIA was the only organization to run all AI training workloads for this and all previous MLPerf Training and Inference rounds. It has delivered consistent leadership results from the first MLPerf Training 0.5 in December 2018 to the latest MLPerf Training 2.1, released a few weeks ago.

For training, inference, and HPC, MLPerf has shown that NVIDIA has the broadest ecosystem support across the deep learning frameworks. It is advantageous for customers that NVIDIA GPUs are available from all major cloud providers and in all major systems for on-prem solutions. NVIDIA’s application frameworks allow customers to deploy solutions rapidly.

NVIDIA has an end-to-end open platform with software that helps unlock the full potential of its hardware. NVIDIA’s full-stack solution includes application frameworks such as Merlin and NeMo. With the NeMo Megatron service, it is possible to leverage huge language models using custom datasets.

ANALYST NOTES 

  1. There are many reasons why model speed is so vital for inference and training. One overlooked reason relates to the necessity of multiple training runs. Building a model is an experimental process that involves trial and error to get the model properly tuned. The model must be rerun each time something is tweaked to see the results. The ability to run the model faster means more trials can be run in a given time. That allows a solution to be found and deployed more quickly. The faster a model can be deployed, the earlier its benefits can contribute to improved operations and its ROI can be generated.
  2. MLPerf provides peer-reviewed apples-to-apples comparisons for training and inference. It eliminates the need to rely on a company’s cherry-picked stats from its performance testing that may or may not be valid.
  3. NVIDIA works with many of the top AI researchers. I spend much time reviewing research papers on AI and quantum. A lot of AI research work uses NVIDIA platforms. So far this year, over 400 preprint research papers have been published on deep learning using NVIDIA technology. It will be interesting to see future research results using the H100 as its availability increases.
  4. Power consumption is a significant issue in the AI ecosystem. Although measurement of power consumption isn’t currently part of MLPerf Training, it is under consideration. However, power is a measurement in MLPerf Inference. For MLPerf Inference 2.1 in September, 2,400 power measurement results were submitted. For reference, A100 requires 400 watts, and H100 requires 700 watts. When juggling these two figures, consideration needs to be given to performance and speed. BERT is an excellent example because the H100 enjoys a 6X advantage in speed.
  5. Like the A100, the H100 can be partitioned into seven smaller accelerators that can independently run different networks. That is a way to get optimal utilization out of the part on the inference side and reduce the number of total GPUs needed to deploy multiple networks. Ideally, the feature is more useful for training advanced models, but it also has applications on the inference side.
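The power figures in note 4 can be weighed against speed with a short calculation. The Python sketch below assumes both GPUs draw their rated power for the whole run, which is a simplification because rated power is not measured draw. It uses the 6X BERT figure and the 400-watt and 700-watt ratings quoted above.

```python
def relative_perf_per_watt(speedup: float,
                           baseline_watts: float,
                           new_watts: float) -> float:
    """Performance per watt of the new part relative to the baseline."""
    return speedup / (new_watts / baseline_watts)


def relative_energy_per_run(speedup: float,
                            baseline_watts: float,
                            new_watts: float) -> float:
    """Energy for one run on the new part as a fraction of the baseline.

    Energy is power multiplied by time, and time shrinks by `speedup`.
    """
    return (new_watts / baseline_watts) / speedup


if __name__ == "__main__":
    # Figures quoted in the article: 6X on BERT, A100 400 W, H100 700 W.
    print(f"perf per watt: {relative_perf_per_watt(6.0, 400.0, 700.0):.2f}x")
    print(f"energy per run: {relative_energy_per_run(6.0, 400.0, 700.0):.2f}x")
```

Under these assumptions, a part that draws 1.75 times the power but finishes six times sooner does more work per watt and uses less energy per run.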