AWS Inferentia Gathers More Steam With Alexa, Snap, Anthem, And Conde Nast

By Patrick Moorhead - December 3, 2020

There’s no way around it: as 2020 draws to a close, machine learning (ML) is hot and getting hotter. Heavy hitters and startups alike have gotten into the game to compete with NVIDIA and its formidable GPUs. The competition includes Google (with its TPUs), AMD (which recently announced its intention to acquire Xilinx and its FPGAs), Intel (with its newly acquired Habana Labs) and more. Since NVIDIA continues to rule the roost for ML training workloads, thanks to its product strength, installed base and vast software ecosystem, most of the chipmaker jockeying occurs in the less settled ML inference market. At AWS re:Invent in late 2018, Amazon Web Services (AWS) threw its hat into the ring with a custom ML chip it calls AWS Inferentia, developed to provide high-performance ML inference at a more affordable cost than the competition. AWS recently shared a status update on its Inferentia efforts, and I’d like to provide my takeaways, along with a little more background on Inferentia and what it does. Let’s jump in.

Training vs. inference

To make sure we’re all on the same page, it’s worth explaining the difference between the two stages of ML. Training, essentially, refers to the process of teaching a neural network how to perform a task, such as image recognition, by feeding it massive amounts of pertinent data. A successfully trained model can perform specific tasks without being explicitly programmed for each one. Inference, on the other hand, refers to using these trained models to predict outcomes on new data.
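To make the distinction concrete, here is a minimal sketch in plain Python. It uses a toy linear model rather than a neural network, and the data, function names and hyperparameters are all hypothetical, chosen only for illustration. The point is the split: `train` is the expensive, iterative stage, while `infer` is the cheap stage that runs on every request.

```python
# Toy illustration only: a two-parameter linear model, not a neural network.

def train(samples, epochs=5000, lr=0.01):
    """Training: repeatedly adjust weight and bias to reduce mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in samples:
            err = (w * x + b) - y
            grad_w += 2 * err * x / n
            grad_b += 2 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def infer(model, x):
    """Inference: apply the already-trained parameters to a new input."""
    w, b = model
    return w * x + b


# Hypothetical training data generated from y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]
model = train(data)           # done once, up front
prediction = infer(model, 20) # done for every incoming request
```

Training loops over the whole dataset thousands of times, while each inference call is a single multiply and add. That asymmetry per call, multiplied by a very large number of calls, is why inference hardware is judged mainly on latency, throughput and cost.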

AWS Inferentia


I have been bullish on the highly economical Inferentia from the get-go. The chip consists of four NeuronCores, each with a highly performant systolic array matrix multiply engine. The NeuronCores also feature a sizable on-chip cache, which reduces the external memory accesses that can bog down latency and throughput. There’s a lot to like about Inferentia, in addition to its high throughput (hundreds of TOPS) and low latency. The chip’s flexibility is also a crucial draw: it supports all the major ML frameworks, including TensorFlow, Apache MXNet and PyTorch, as well as ONNX-formatted models and mixed-precision workloads. This broad support is essential since some frameworks are better suited to certain workloads than others. Additionally, this means AWS’s customers can continue to use the frameworks with which they are already comfortable. A bonus is AWS’s Neuron SDK, which enables customers to run ML inference workloads utilizing Inferentia.
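For readers unfamiliar with the term, the sketch below simulates the general idea of a systolic array matrix multiply in plain Python. It is a textbook-style, output-stationary model and not a description of how a NeuronCore is actually built, which AWS has not detailed here; the function name and the grid layout are my own assumptions. Each processing element (PE) holds one accumulator, and operands arrive on a skewed schedule as they hop from neighbor to neighbor.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of PEs."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]  # one accumulator per processing element
    cycles = n + m + k - 2             # cycles until the far-corner PE finishes
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                step = t - i - j       # skew: operands reach PE(i, j) after i + j hops
                if 0 <= step < k:
                    acc[i][j] += a[i][step] * b[step][j]
    return acc, cycles


# Small hypothetical example.
result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In hardware, every PE performs its multiply-accumulate in the same clock cycle, so the full product takes roughly n + m + k cycles instead of n × m × k sequential steps. Operands are passed between neighbors rather than re-fetched from memory, which is the same concern the on-chip cache addresses.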

The low-cost element is the final kicker: Amazon itself says that 90% of its ML expenses stem from inference, as opposed to 10% for training. A good case study is Amazon’s migration of Alexa workloads to its Inferentia-powered Elastic Compute Cloud (EC2) Inf1 instances. Alexa presents a specific set of demands: extremely low latency (no different than Siri or Google Assistant), but also high throughput to handle speech output. See where I’m going with this? Around this time last year, the Alexa team shared some, frankly, staggering gains it had realized by migrating some text-to-speech workloads to EC2 Inf1. The team claimed it was able to perform speech generation at 55% of the cost of running it on a P3 instance and also said it had seen a 5% improvement in latency. With the adoption of 16-bit operations and the ability to leverage Inferentia in groups, AWS projected it could bring costs down to 37% of a P3 instance and latency down by 19%.
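Note that these figures are quoted as a share of the P3 cost, not as a reduction. A quick conversion, sketched below with a hypothetical helper name, makes them easier to compare with savings figures quoted as reductions.

```python
def savings_from_relative_cost(relative_cost_pct):
    """Convert 'X% of the baseline cost' into 'percent saved versus the baseline'."""
    return 100 - relative_cost_pct


reported_savings = savings_from_relative_cost(55)   # cost reported by the Alexa team
projected_savings = savings_from_relative_cost(37)  # cost AWS projected
```

In other words, the reported result works out to a 45% saving over P3, and the projection to a 63% saving.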

The new Inferentia news

The last progress report left me wondering what percentage of Alexa’s inference workloads Amazon had migrated to EC2 Inf1 so far. With the savings and reduced latency offered by the instances, it seemed to me that it would be wise to migrate all of Alexa’s inference tasks to EC2 Inf1.

My answer came last week when Amazon announced that the majority of Alexa now runs on EC2 Inf1 instances. As the blog cleverly put it, Alexa’s head is now, literally, “in the cloud.” AWS says these Inferentia-powered Inf1 instances now deliver 25% lower end-to-end latency and a 30% reduction in costs compared with the leading GPU-based (NVIDIA) instances. Those numbers are significant when you think about it: Alexa is one of the most popular hyperscale ML services in the world, with billions of inference requests every week. For that matter, these Inf1 instances purportedly offer 30% higher throughput and a 45% lower cost per inference than the previous lowest-cost option, the NVIDIA GPU-based G4 instances.

Amazon also shared that its computer vision platform, Rekognition, has started leveraging AWS Inferentia. According to the company, this has allowed it to perform object classification with eight times lower latency and twice the throughput of the same workloads running on GPU instances. Other Inferentia customers called out by name included Snap Inc. (maker of Snapchat), Conde Nast (global media company) and Anthem (healthcare). While not all these companies provided concrete numbers, Conde Nast shared that its recommendation engine achieved a jaw-dropping 72% reduction in its inference costs with EC2 Inf1.

Finally, AWS announced today that it widened its Inf1 availability to its US West (N. California), Canada (Central), Europe (London), Asia Pacific (Hong Kong, Seoul), and Middle East (Bahrain) regions. This brings the tally to 17 regions globally, joining US East (N. Virginia, Ohio), US West (Oregon), Europe (Frankfurt, Ireland, Paris), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), and South America (São Paulo).

We’ll see how Inferentia stacks up to its competitors in the by-and-by, but it’s clear now that AWS is solidly in the inference game. With what looks like the lowest price tag in town from a major cloud player, I expect many more customers will come running to embrace it. The fact that Amazon itself runs Alexa and Amazon Rekognition on Inferentia speaks volumes about the company’s confidence in the offering. AWS will obviously need to update and refresh the chip at least every 18 months to two years, but what a great start.

Note: Moor Insights & Strategy writers and editors may have contributed to this article. 

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq (now HP), and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.