NVIDIA, Microsoft, and Ingrasys, a subsidiary of Foxconn, announced today their plans for HGX-1, a hyperscale GPU accelerator for AI and cloud computing. The open-source design is being released in conjunction with Microsoft’s Project Olympus initiative at the OCP (Open Compute Project) conference and is intended to give hyperscale datacenters a high-performance, flexible path to machine learning. I see this release as part of a larger narrative for NVIDIA, which has seen truly incredible momentum in cloud/hyperscale over the past year—its Tesla business tripled (yes, tripled) last year. We’re seeing NVIDIA in a lot of places right now, from Facebook’s Big Basin OCP design to Fujitsu’s 24-system DGX AI cluster. The computing model is changing, and machine learning training currently favors GPU computing—NVIDIA’s bread and butter.
HGX-1 is a component of Microsoft’s Project Olympus contribution to OCP which, for those unfamiliar, is a consortium seeking to speed the pace of innovation by applying the benefits of open source to hardware: designs are opened up to the OCP community so they can easily be built and deployed. Today, few actually deploy standard OCP designs as-is, but many take those designs and customize them. In addition to HGX-1, NVIDIA also announced today that it is joining OCP as a member and will continue to collaborate with these companies and other members of the project.
Introducing a new de facto standard
There are a few different kinds of standards out there: those driven by a standards committee, like the 5G standard governed by the 3GPP, and de facto (non-committee) standards, like VHS.
NVIDIA is drawing comparisons between the HGX-1 (and what it hopes to do for cloud-based AI workloads) and what ATX (Advanced Technology eXtended) accomplished for PC motherboards back in 1995. For some background, Microsoft and Intel worked together back in the day to create ATX, a new industry standard for motherboards—more or less the same version we’re using to this day, over 20 years later. Once the PC industry was able to focus on one standard, the market really began to take off. A single industry standard also helps affordability, freeing companies to focus on cost reduction. Microsoft and NVIDIA believe HGX-1 will be a similar game-changer, establishing a new standard that can ideally be embraced quickly as market demand for AI cloud computing surges. This is a big goal, but only if others adopt it can it be considered a standard.
It is also worth reinforcing that with Microsoft, NVIDIA, and Ingrasys all collaborating, there is a lot of muscle behind HGX-1. That is always key when attempting to establish a new standard: you need major industry leaders pushing for it before others will support it. It’s clear that the market for AI computing in the cloud is exploding, with the growth of autonomous cars, data and video analytics, personalized telemedicine, and the like. Thousands of companies are investing in AI, from industry stalwarts to a legion of new startups. The time seems ripe for a new AI standard in the cloud to help accelerate the industry even faster.
How does it work?
HGX-1 is powered by eight NVIDIA Tesla P100 GPUs per chassis. In my opinion, one of the things that really sets it apart is its switching design, which uses NVIDIA’s NVLink interconnect technology to let a CPU connect to a varying number of GPUs. This configurability allows service providers who standardize on HGX-1 to offer customers a range of CPU/GPU configurations to meet their specific workload needs—all at optimal performance. You could see 50 CPU/GPU combinations over time, and while specialization is good to a point, as we saw with minicomputers, consolidation is the natural progression as the need for efficiency and productivity takes over. I see this as very important given the increasing complexity and diversity of cloud workloads—a rigid, one-size-fits-all solution just wouldn’t cut it. For this reason, P100 with NVLink is rapidly becoming the preferred standard among hyperscalers for GPU acceleration, as evidenced by all the recent big wins.
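To make the configurability argument concrete, here is a minimal, purely illustrative sketch of how a provider standardizing on a flexible chassis like HGX-1 might map workload types to CPU/GPU instance shapes. The workload names, profiles, and ratios are my own hypothetical examples, not NVIDIA or Microsoft offerings; only the eight-GPUs-per-chassis figure comes from the announcement.

```python
# Illustrative sketch (not a real NVIDIA/Microsoft API): one chassis design,
# many instance shapes. Names and ratios below are hypothetical.
from dataclasses import dataclass

GPUS_PER_CHASSIS = 8  # eight Tesla P100s per HGX-1 chassis

@dataclass
class InstanceConfig:
    cpus: int
    gpus: int  # 0..8, since all GPUs share one switched NVLink fabric

# A provider could expose different CPU/GPU ratios from the same hardware.
WORKLOAD_PROFILES = {
    "web_serving":  InstanceConfig(cpus=2, gpus=0),  # CPU-only slice
    "ai_inference": InstanceConfig(cpus=1, gpus=1),  # light GPU attach
    "ai_training":  InstanceConfig(cpus=2, gpus=8),  # full chassis
}

def provision(workload: str) -> InstanceConfig:
    """Pick an instance shape for a workload type."""
    config = WORKLOAD_PROFILES.get(workload)
    if config is None:
        raise ValueError(f"unknown workload: {workload}")
    if config.gpus > GPUS_PER_CHASSIS:
        raise ValueError("cannot attach more GPUs than one chassis holds")
    return config

print(provision("ai_training"))  # full eight-GPU configuration
```

The point of the sketch is simply that a rigid, fixed CPU-to-GPU ratio would force providers to stock different SKUs per workload, whereas a switched design lets one SKU serve all three profiles.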
NVIDIA is making some very aggressive claims on the cost of ML cloud instances: 1/10th the cost for AI inferencing and 1/5th the cost for AI training. What NVIDIA considers the greatest advantage of HGX-1 is how easily and rapidly it can transform pre-existing datacenters and make them ready for AI. That could obviously play nicely into the goal of rapid adoption. I’m looking forward to seeing cloud instance pricing and performance.
Will HGX-1 achieve its stated goals and establish a new industry GPU standard for the hyperscale datacenter? Of course, others like Facebook or Baidu – as well as hundreds of other cloud service providers worldwide – will need to adopt it before it becomes a standard. Nothing is certain, but I remember when driving CUDA as a standard looked like a very ambitious goal, and look where we are now (over 1 million downloads in the last year!). What I do know is that the cloud giants need a better solution to maximize GPU ML performance and flexibility, and HGX-1 is the best thing I’ve seen so far. This is obviously what Microsoft saw, too. NVIDIA is riding quite a wave of momentum at the moment, striking while the iron is hot, and its prospects are looking pretty good.