Amazon And NVIDIA Simplify Machine Learning

NVIDIA and Amazon have announced, respectively, new Machine Learning software stacks in the NVIDIA GPU Cloud (NGC) and a new 8 Volta GPU EC2 instance that is available immediately. While this announcement was completely expected, it is an important milestone along the road to simplifying Machine Learning development and deployment for AI projects, and to lowering their cost. When NVIDIA announced the NVIDIA GPU Cloud last May at GTC, I explained in this blog that the purpose was to create a registry of compatible and optimized ML software containers which could then, in theory, run on the cloud of the user’s choice. That vision has now become a reality, at least for AWS customers. I expect other Cloud Service Providers to follow soon, given the momentum in the marketplace for the 120 TFLOPS Volta GPUs.

Why do you need NVIDIA's GPU Cloud for ML?

As anyone who has delved into Machine Learning can tell you, there are two hurdles that you must clear to build a useful neural network: hardware and software. Assuming you’ve already prepared a massive trove of tagged data to feed the training process, and have mastered the art of Deep Neural Network design, you’ll first need hardware. In fact, you’ll need lots of hardware: expensive hardware you’d have to buy, install, configure, power and maintain. This is where AWS comes in. Its new P3 GPU instances come with 1, 4, or 8 Volta GPUs connected by a fast (25 Gb/s) NVLink 2 scalable interconnect, delivering up to a stunning 960 trillion operations per second for serious ML work. That means your training runs can finish in hours instead of days or weeks, getting your AI ready much sooner. It is still not real-time training, but we are getting there.
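For readers who want to see where the 960 figure comes from, here is a minimal sketch of the arithmetic. It simply multiplies the 120 TFLOPS per Volta GPU quoted above by the number of GPUs in each P3 configuration. The function name `aggregate_peak_tflops` is my own, chosen for illustration, and the result is a theoretical peak; real training throughput will be lower.

```python
# Peak throughput quoted in this article for a single Volta GPU, in TFLOPS.
PEAK_TFLOPS_PER_GPU = 120


def aggregate_peak_tflops(gpu_count, per_gpu_tflops=PEAK_TFLOPS_PER_GPU):
    """Return the theoretical peak TFLOPS for an instance with gpu_count GPUs.

    Hypothetical helper for illustration only: it assumes perfect scaling
    across GPUs, which real workloads will not achieve.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return gpu_count * per_gpu_tflops


# The three P3 configurations mentioned above: 1, 4, or 8 Volta GPUs.
for gpus in (1, 4, 8):
    print(f"{gpus} GPU(s): up to {aggregate_peak_tflops(gpus)} TFLOPS peak")
```

The 8-GPU case is the one behind the headline number: 8 x 120 = 960.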

Figure 1: The Open Compute HGX design provides 8 NVIDIA Pascal or Volta GPUs, which connect to a 2-CPU server.

Great, so you figured out you should just rent the hardware; that’s smart. But now you need to select, find, and configure a lot of finicky software, and each software component has to play nice with a myriad of other pieces. So, start with the right Linux OS, configure the correct drivers, get the software framework from GitHub, and don’t forget to download the DNN libraries. NO! Not that version! It may not be compatible with everything else you just loaded, and isn’t optimized for the GPU you selected. You DID verify that the entire stack is inter-compatible, right? I mean, each component changes constantly; that’s the beauty and curse of open software!

You can see why NVIDIA has invested a lot of time and money to build, configure, optimize, and test all the ML software for each and every major ML Framework, ensuring that it is all self-consistent and optimized for each GPU. Just go to the NGC, create a free account, click on the framework you need and where you want to run it, and NGC will give you an ID which tells AWS what container to load on your shiny new P3 instance. Did I say “free”? Yes, use of the NGC services is free to all.

Figure 2: NVIDIA's GPU Cloud, or NGC, provides click and go access to all the popular ML Frameworks, from TensorFlow to Caffe2.

Figure 3: AWS provides easy access to the P3 instance and the ML Stack you selected from NGC.

It can't get much easier than this.

Conclusions

Machine Learning has gone from an esoteric branch of computer science to the most disruptive technology in the cloud, and Enterprise IT is next. However, it remains difficult to get started and obtain fast results. NVIDIA understands this technology better than most, and has realized that the complex and error-prone configuration process is a bottleneck it can remove. It also understands that a solution to this problem will yield consistently optimized performance, reduce confusion and improve satisfaction with its products. In the cloud, AWS has realized that it can get out in front with enterprises and ML startups by being the first to offer the fastest hardware for machine learning training. In addition, AWS has realized it can grease the adoption skids by collaborating with NVIDIA GPU Cloud to simplify software deployment. Others will surely follow, but for now it is clear who the leaders are.