Last year at AWS re:Invent, out of the hundreds of announcements, I chose the top five for overall, long-term impact. One of those was Amazon’s EC2 Inf1 instances, which use its new Inferentia machine learning inference chip.
I chose Inf1 instances as a top five pick for a few reasons. Machine learning is a killer app that, on the largest data sets, is more efficient than classic analytics, and a pay-as-you-go model was sorely needed. AWS’s claims were huge, too: that EC2 Inf1 instances would deliver 3x higher inference throughput and up to 40% lower cost-per-inference than the Amazon EC2 G4 instance family, which it said was already the fastest and lowest-cost instance for machine learning inference available in the cloud.
So, what has been going on since re:Invent? A lot. While the team has been busy working with customers, optimizing, and adding capabilities, some of the Inferentia and EC2 Inf1 instance leads had the chance to give me an update.
A high performing silicon team
Before I dive in, I wanted to give a tip of the hat to the silicon team behind Inferentia. This is the same team behind the very successful and mature AWS Nitro System (security, network, storage acceleration) and the new Graviton2 processor which I wrote about here. This is the core Annapurna team that came by way of a 2015 Amazon acquisition. The team focuses on the common themes of simplicity, ease of use and end-customer value. I have spent 30 years evaluating, managing, and marketing processors, and I’m impressed.
Inf1 now supported in SageMaker
The biggest update is that developers can now integrate EC2 Inf1 instances into their SageMaker workflows. If you are not familiar with SageMaker, it is an ML workflow tool and, with Studio, an integrated development environment. Machine learning is hard without an army of data scientists and DL/ML-savvy developers. The problem is that these skills are expensive and hard to attract and retain, not to mention the need for specialized infrastructure like GPUs, FPGAs and ASICs. Enter SageMaker, which provides one place to label, build, train, tune, deploy and manage ML workloads.
With Inf1 support for SageMaker, enterprises can now tap into the performance, low cost and low latency of Inf1 alongside CPUs and GPUs.
AWS Neuron performance improvements
While I have not talked much about Neuron previously, I think it is an important place to start, as it is core to the AWS ML value proposition. Neuron is an SDK that includes a compiler, runtime and profiling tools for AWS Inferentia, and it is integrated with popular ML frameworks like TensorFlow, MXNet, and PyTorch. Because of this integration, users can continue using these frameworks to compile their models and deploy them in production without writing custom Neuron-specific code. I believe integrating popular frameworks matches the overall AWS theme of giving choices: allowing customers to choose whatever software they like, and not locking them into a specific framework. More sophisticated customers do not like to get locked into one framework; less sophisticated customers will only realize it after the lock-in. The Model T saying, “you can get any color as long as it’s black,” doesn’t work here.
Customers take their trained models and compile with Neuron to get the best performance on Inferentia. The cool part about any abstraction layer is that if you put enough work into it, you can improve it over time. And that is exactly what the team has been doing since launch.
Since the December 4 launch, AWS has significantly improved Neuron, and hence workload performance, on models like BERT-Large and ResNet-50. The team said it had doubled ResNet-50 scores since launch. This matters a lot because, like a fine wine, performance gets better over time without the enterprise paying more for it. A plus is that we did not have to wait two years, but rather two months! We have seen this trend with G4 instances and it’s great to see it with Inf1, too.
New, projected cost gains for Alexa migration to EC2 Inf1
In our update, the Inf1 team also shared with me some interesting information about one of its current largest customers, the folks next door at Amazon Alexa. Unless you live under a rock, you know that Alexa is Amazon’s intelligent agent you can find on Echo devices, Fire tablets, TVs, phones and even cars.
Alexa is a unique challenge. It requires very low-latency performance, like any dialog system such as Apple’s Siri or Google Assistant. Given the speech output, Alexa also has a requirement for high throughput. The team also explained to me that the workload is bound by memory bandwidth, as context generation is a sequence-to-sequence auto-regressive model. It is compute-bound, too, requiring 90 GigaFLOPs (yes, 90 billion floating point operations!) for one second of output. Sound easy enough? (sarcasm mine)
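To get a feel for what that compute figure implies, here is a rough sketch of the arithmetic. It uses only the 90 GigaFLOPs-per-second-of-output number quoted above; the time budgets and the function name are hypothetical, chosen by me to show the scaling, and the sketch ignores the memory-bandwidth limit entirely.

```python
# Rough arithmetic on the compute figure quoted above.
# Assumption: 90 GigaFLOPs of work per second of generated speech.
FLOP_PER_OUTPUT_SECOND = 90e9


def required_gflops(output_seconds: float, time_budget_seconds: float) -> float:
    """Sustained GFLOP/s needed to generate the speech within the time budget."""
    if time_budget_seconds <= 0:
        raise ValueError("time budget must be positive")
    total_flop = FLOP_PER_OUTPUT_SECOND * output_seconds
    return total_flop / time_budget_seconds / 1e9


# Hypothetical budgets, used only to illustrate how the requirement scales.
for budget in (1.0, 0.5, 0.1):
    needed = required_gflops(1.0, budget)
    print(f"1 s of speech in {budget:.1f} s -> {needed:.0f} GFLOP/s sustained")
```

The point is simply that the tighter the latency budget, the more sustained compute the same second of speech demands, and that is before memory bandwidth enters the picture.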
In December, the folks at Alexa said that they were seeing 55% of the cost of a P3 instance for speech generation, with a 5% increase in latency. With the Neuron improvements I discussed above, a move to 16-bit operations and the ability to use Inferentia in “groups”, Alexa is now projecting 37% of the cost of a P3 instance with a 19% reduction in latency.
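To put the two Alexa data points side by side, here is a small back-of-the-envelope sketch. It uses only the percentages quoted above. I am assuming that both latency figures are measured against the same P3 baseline, which is normalized to 1.0 here; the variable and function names are mine, for illustration only.

```python
# Back-of-the-envelope comparison of the two Alexa data points quoted above.
# Assumption: cost and latency are both relative to a P3 baseline of 1.0.
P3_COST = 1.0
P3_LATENCY = 1.0

# December figures: 55% of P3 cost, 5% higher latency.
december = {"cost": 0.55 * P3_COST, "latency": 1.05 * P3_LATENCY}
# Current projection: 37% of P3 cost, 19% lower latency.
projected = {"cost": 0.37 * P3_COST, "latency": 0.81 * P3_LATENCY}


def relative_change(old: float, new: float) -> float:
    """Fractional change going from old to new (negative means lower)."""
    return (new - old) / old


cost_change = relative_change(december["cost"], projected["cost"])
latency_change = relative_change(december["latency"], projected["latency"])

print(f"Cost vs. December:    {cost_change:+.1%}")
print(f"Latency vs. December: {latency_change:+.1%}")
print(f"Savings vs. P3:       {1 - projected['cost']:.0%}")
```

Under that reading, the projection is roughly a third cheaper than the December result and about 63% cheaper than P3, with latency moving from slightly worse than P3 to noticeably better.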
Another interesting thing that surfaced about the Alexa migration is that the workload was originally developed to take advantage of FP32, which could run without changes on Inferentia, but the team migrated it to use Inferentia’s native support for FP16 and bfloat16. This is part of the team’s philosophy on Inferentia design: reduce migration friction by supporting FP32 even when it is not optimal.
Also interesting to me is Alexa’s use of MXNet, which is natively supported by Neuron. I think this shows the value in supporting many different frameworks, as MXNet is known for being good at audio while TensorFlow is better known for images.
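For readers less familiar with the 16-bit formats, here is a minimal sketch of what moving from FP32 to bfloat16 or FP16 does to a number. It uses only the Python standard library and says nothing about how Inferentia or Neuron perform the conversion. The helper names are mine, and the bfloat16 path uses plain truncation for simplicity, whereas real hardware typically rounds.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision bit pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by keeping the top 16 bits of FP32.

    bfloat16 keeps FP32's 8 exponent bits but only 7 mantissa bits.
    Truncation is an illustrative simplification, not a hardware model.
    """
    truncated = fp32_bits(x) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (5 exponent, 10 mantissa bits)."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except (OverflowError, struct.error):
        # The value is outside FP16's range (largest finite value is 65504).
        return float("inf") if x > 0 else float("-inf")


for value in (3.14159265, 1000.123, 70000.0):
    print(f"{value!r:>12}  bfloat16={to_bfloat16(value)!r}  fp16={to_fp16(value)!r}")
```

The trade-off shows up in the last value: bfloat16 keeps FP32's range but loses precision, while FP16 keeps more precision for small values but overflows much sooner. Either way, each value takes half the memory and bandwidth of FP32, which matters for a workload described as memory-bandwidth-bound.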
In the future, I am looking forward to disclosures from the Alexa team on what percent of its inference is done on EC2 Inf1. I am not questioning the numbers above, but shouldn’t Alexa be moving everything over now?
I always challenge companies on the notion of “lock-in” and I believe that successful companies like AWS always need to show how their solutions do not lock in customers. When asked, I thought their response on the lack of lock-in and the portability of workloads to other AI services and processors was good.
The Neuron SDK, which includes the compiler for Inferentia chips, is integrated with the most popular ML frameworks (TensorFlow, MXNet, and PyTorch), allowing customers to continue using whatever software they like and not be constrained to a specific framework or hardware architecture. I believe the Neuron team has done the heavy lifting of integrating the SDK into the different frameworks and optimizing it to offer customers the best possible inference performance. Customers can bring their pre-trained models and make only a few lines of code changes from within the framework to accelerate their inference with EC2 Inf1 instances. AWS says operating at the framework level allows customers the freedom to always choose the best solution for their needs, and not be tied to a specific hardware architecture or cloud provider. And I agree.
It was great to talk with the EC2 Inf1 and Inferentia team, particularly as Amazon keeps most folks busy developing outside of re:Invents and Summits. With each successive disclosure I get on Amazon’s home-grown silicon, be it the Nitro System, Graviton2 or Inferentia, the more impressed I am with Amazon’s silicon team. As for Inf1 instances based on Inferentia, I am very pleased to see the Neuron and Inf1 performance improvements and the ability to spread multiple parts of the inference pipeline across multiple Inferentia chips while they look like one to the programmer. I also like the lack of lock-in, as the AWS-proprietary parts sit below the framework rather than locking people into a framework.
I am hoping to see even more Inf1 customer disclosures and success stories in the future and expect many at re:Invent 2020.