The AWS re:Invent event marks the last and largest conference I will attend in 2020, and the year certainly ended with a bang. This event is always super-interesting, and the company continues to dial in on topics and trends relevant to enterprise IT and, increasingly, business leaders. The conference had hundreds of announcements, so the analyst's trick is identifying the most significant ones. I recently published the most critical non-compute topics for re:Invent, but now I'd like to focus on the top compute, EC2-related reveals, as I believe these have the biggest impact on IT.
I got the sense over the past few years that AWS competitors were waiting for compute to get commoditized, but based on this year's re:Invent, this is far from the case. In fact, I think EC2 is doing its best to decommoditize compute, using it as a strategic weapon to get customers and win business. You see, with stronger IaaS compute, you can deliver a stronger PaaS and SaaS. Of course, you need good networking, storage, security, and other basics, but it sure helps to have that strong base compute.
1/ EC2 C6gn Instances – Graviton with big networking equals affordable, big performance – C6gn is AWS's latest Graviton2 offering, focusing on compute-intensive workloads that require lots of networking bandwidth. Think massive compute capabilities (up to 64 Graviton2 vCPUs) supported by 100Gbps networking and 38Gbps of EBS bandwidth. This instance should be able to handle the most compute-intensive workloads an enterprise would require. And oh yeah, it does so at a 40% price-performance advantage over the current comparable x86-based (C5) instances. HPC, batch processing, distributed analytics, and video encoding are all good candidates for this Arm-based instance.
Equally compelling is the broad ISV ecosystem support for C6gn. In addition to AWS's own distributions, EC2 C6gn supports the major Linux distributions as well as applications and services from major players in their respective spaces, such as Datadog, Docker, GitLab, Jenkins, NGINX, Dynatrace, Rancher, and others.
While I am sure that some of the price-performance advantages result from cost advantages with home-grown silicon and pricing levers AWS can pull in favor of its own solution, this is still an incredible value for enterprise IT. I find equally compelling the amount of security AWS has built into C6gn without compromising performance, thanks to the instance being built on the AWS Nitro System. Look for C6gn instance availability toward the end of the month.
Why this matters
The big picture here is that AWS's Arm-based general-purpose compute is growing up and getting more capable, generation after generation. Graviton2 is currently a "lower price" play versus Intel and AMD, but don't confuse this with "low performance." The 40% comparison is to a custom, specially binned 3.0 GHz Intel Xeon Platinum 8000 series processor with all-core Turbo enabled.
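To put the 40% figure in perspective, here is a minimal sketch of the arithmetic. It assumes "price-performance" means performance per dollar, which is my reading rather than a definition AWS has published for this comparison, and the monthly bill is a made-up number used only for illustration.

```python
def relative_cost(price_performance_gain: float) -> float:
    """Cost of doing the same work, relative to the baseline instance.

    Assumes price-performance means performance per dollar, so a gain of
    0.40 means each dollar buys 1.4x the work of the baseline.
    """
    if price_performance_gain <= -1.0:
        raise ValueError("gain must be greater than -1.0")
    return 1.0 / (1.0 + price_performance_gain)


# Hypothetical monthly spend on the baseline (x86) instances.
baseline_bill = 1000.0
gain = 0.40  # the 40% price-performance advantage quoted for C6gn

new_bill = baseline_bill * relative_cost(gain)
savings_pct = (1.0 - relative_cost(gain)) * 100.0

print(f"Same work on the new instance: ${new_bill:.2f}")
print(f"Cost reduction: {savings_pct:.1f}%")
```

Under that assumption, a 40% price-performance advantage works out to roughly a 28.6% lower bill for the same work, not a 40% lower bill. Still a big number, but worth keeping straight when comparing vendor claims.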
2/ EC2 G4ad Instances – All-AMD solution for remote graphics needs – It is clear that AWS has found success with its AMD EPYC-based instances, as it is now rolling out this all-AMD instance focused on driving excellent price-performance for graphics-intensive workloads like remote workstations and graphics rendering apps. Think of this today as a 1:1 relationship between GPU and instance rather than a fully virtualized GPU, even though the hardware supports virtualization. While there was much discussion early on about workstation apps, I don't see why this couldn't also serve as a game streaming back end.
Technically, EC2 G4ad pairs the AMD Radeon Pro V520 GPU with EPYC, enabling what AWS claims is up to 45% better price-performance over comparable NVIDIA-based G4dn instances for workloads such as virtual workstations and graphics rendering. The V520 is based on "Navi 12" and is fanless, with 8GB of HBM.
Why this matters
First, this looks like a good solution for organizations wanting to use graphics-intensive workstation capabilities on a per-project or burst basis. Run out of on-prem capacity or want to share resources around the clock, around the world? The G4ad instance is an ideal solution.
Second, I believe this announcement demonstrates AWS's understanding of the market's needs and wants and its commitment to satisfying the needs of all its customers with silicon diversity. NVIDIA is a dominant force in the GPU space, and of course, if you want NVIDIA, AWS supports that with a full range of capabilities across the EC2 product portfolio. However, a growing set of AMD customers would like to take advantage of AMD-on-AMD technologies such as RDNA and RDNA 2.
3/ EC2 Habana – Machine Learning training from Intel – An excellent place to start is spending a few words on Habana. For readers who haven't followed the Artificial Intelligence (AI) market as closely, Habana Labs is an Intel company that designs and delivers AI processors for training and inference. Moor Insights & Strategy analyst Karl Freund has covered Intel's Habana acquisition.
ML is a workload composed of training and inference. These workloads have very distinct and demanding requirements. Out of these needs emerged companies like Habana Labs and products such as its Gaudi processor to drive more efficient Machine Learning. The Intel value proposition for Gaudi is a specialized compute capability that removes the need for a GPU while delivering better, faster, more efficient training in certain circumstances.
At re:Invent, AWS announced its intent to deliver Habana instances in 2021, with a goal of delivering up to 40% better single-node price-performance compared to its current GPU-based EC2 instances supporting ML. I haven't seen any detailed facts or figures on the comparison, but I think the AWS EC2 team has an excellent track record of delivering what is promised. This is based on my experience comparing what the AWS EC2 folks claimed with what they delivered on Graviton, Graviton2, and Inferentia. The team I talked with was very confident in image recognition and natural language processing (NLP) workloads and very confident in its ResNet-50 and BERT training performance. Note that this is for single-node performance, even though Habana Gaudi is more than capable of multi-node operation through RoCE.
Why this matters
NVIDIA currently has the only credible, broad-scale deployment of hardware-accelerated ML training solutions for the datacenter, and Intel's Habana Gaudi is the first alternative from a major chip manufacturer.
As AI becomes more and more critical in the enterprise, training becomes more complex as the amount of data feeding training models increases, model complexity increases, and organizations look to refine their models. While 90% of the overall ML compute cost is inference, the other 10% is training, and that is still a staggering sum. This leads to lots of cost in training compute and lots of cost in terms of time to intelligence. Habana aims to reduce both of these cost elements and do so with a toolset that enables developers to take existing GPU-based models and port them to its processor. Therefore, the TensorFlow and PyTorch frameworks that developers know and love can continue to be leveraged without a hiccup. No major lift-and-shift, no refactoring, just deploy and go.
I believe that this partnership between AWS and Intel is another example of AWS knowing its customers and the market and understanding where the market is going. AI is a workload that requires a large investment in both equipment and people. Further, standing up AI is a time-consuming effort. These factors make the adoption of AI extremely difficult for most enterprise organizations. AWS seems to understand this barrier to adoption and looks to remove it with the Habana-based EC2 instance and complete data management layers like SageMaker.
Long-term, I believe the significance of this partnership will be fully appreciated in the market as it should accelerate the adoption of AI across the industry.
4/ AWS Trainium – home-grown ML training – While AWS CEO Andy Jassy stunned the audience with the Habana announcement, he only had to flip one slide to hit the audience with an even bigger one: AWS was creating its own ML training silicon. As it has done previously, AWS follows a cadence for its new chip announcements. First, it announces the chip, then the instance, and finally, the customers that use it.
AWS says Trainium will provide "the most teraflops of any ML instance in the cloud" and "the most cost-effective training solution in the cloud." While I know this proclamation doesn't satisfy every pundit, and I too want more information, we are where we are on disclosure, and I wouldn't assume AWS is obfuscating anything. This is just how AWS releases information at this stage of the game. I watched the EC2 folks make claims on Graviton, Graviton2, and Inferentia, and they lived up to those promises, so I am expecting AWS to deliver on its Trainium promises. AWS will get the benefit of the doubt from me until it gives me a reason not to.
AWS also announced that Trainium will support the same frameworks it supports on NVIDIA-based training instances (TensorFlow, PyTorch, and MXNet), will use the Neuron SDK, and will be available via SageMaker. I think this is strong, as the company may be changing hardware and, yes, some of the lower-level compilers, but it is keeping as consistent an experience as it can deliver at this juncture for customers.
Why this matters
As I said above with Habana, NVIDIA is the only game in town right now for accelerated ML training at scale in the datacenter, having dominated the space for years. This is why Habana is a big deal and why Trainium is an even bigger deal. Like Habana, Trainium is focused on performance per dollar, but it is also claiming teraflop supremacy. This is huge and smart, as it gives AWS room to fulfill its promise if the NVIDIA V200 (my made-up product) outperforms expectations.
5/ AWS Outposts 1U and 2U configurations – A few years back, the industry was questioning the need for a hybrid cloud, one that extends the public cloud to the private. Even AWS was questioning it until last year when it announced, to everybody's surprise, Outposts. At this year's re:Invent, AWS announced that customers could get into Outposts with as little as a 1U or 2U rack-mountable server designed for space-constrained environments. As these form factors are smaller, they also require less power, and the company says they are optimal for branch offices, factories, hospitals, cell sites, or retail stores.
The 1U systems are suitable for 19-inch-wide, 24-inch-deep cabinets and use an AWS Graviton2 processor to provide up to 64 vCPUs, 128 GB of memory, and 4TB of local NVMe storage. Customers can get the 2U systems configured with an Intel processor, up to 128 vCPUs, 512 GB of memory, and 8TB of local NVMe storage, as well as special configurations supporting AWS Inferentia or GPUs.
The 1U and 2U Outposts form factors will be available in 2021.
Why this matters
So far, AWS's on-prem hybrid competitors have done a pretty good job depositioning AWS on the hybrid cloud front. As CEO Andy Jassy said, 96% of IT spend is still on-prem, which is a huge opportunity that AWS wants to tap further. It's not that AWS didn't have many hybrid options before Outposts; it did. It was just that the company wasn't actively talking about them or pushing them, as I believe it feared slowing the march from on-prem legacy to the public cloud. The AWS Outposts announcement last year sealed the deal for the industry that the hybrid cloud is here to stay, and now we can all debate multi-cloud until we're blue in the face.
The new 1U and 2U servers eliminate some Outposts objections, namely form factor and price. Big enterprises seemed hesitant to engage because they were looking for vendors that could provide for all their needs, in large, medium, and small sizes, and that were available in their regions. Most enterprises are looking for vendors that can also scale end to end, down to a tiny remote outpost (pun intended) like a gas station, retail store, or fast-food franchise where the server is nailed to the wall with no special cooling. This is where the new form factors come in.
Another side note is that the 1U configuration comes with Graviton2 instances, which is symbolic: these will be the first on-prem Arm-based general-purpose servers in the enterprise. Interesting, right?
I think AWS needs to answer the next objection, the "no connectivity option" or "how Outposts still functions without talking to the mother ship" objection.
I believe AWS is doing what many didn't expect, and that is trying to decommoditize compute. I heard from many cloud companies that they were going to wait it out until compute was commoditized and everybody got to the same level. Well, AWS isn't looking to do this anytime soon and wants to differentiate by delivering "the most powerful and cost-effective instances in the cloud." AWS has shown it is willing to spend billions on its own first-party silicon to do this, be it Nitro, Graviton, or now Trainium, while continuing to be first to market with much of the silicon from Intel, AMD, and NVIDIA.
As for the complexity objection? I think this is a red herring. IT's needs are very different, and if you have the scale AWS has, it even makes business sense to have so many options. It is incumbent on AWS, though, to simplify compute for its customers. The company will need to step up its sales and channel training and invest in more self-help tools to help determine the right EC2 instance. Even better, I think AWS should proactively let customers know when they can save money moving to a lower-priced tier based on historical actuals. Even better than that, it should set up rules, based on customer acceptance, of course, to downshift customers to lower-priced and lower-performance tiers.
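To make the "downshift" idea concrete, here is a minimal sketch of what such a rule could look like. Everything in it is hypothetical: the tier names, vCPU counts, prices, the 95th-percentile threshold, and the 30% headroom are my illustrative assumptions, not anything AWS has described or any real EC2 API.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Tier:
    name: str            # hypothetical tier label, not a real EC2 instance type
    vcpus: int
    hourly_price: float  # made-up price for illustration only


def recommend_tier(cpu_utilization, current, catalog, headroom=0.30):
    """Suggest the cheapest tier that still covers historical peak demand.

    cpu_utilization: historical samples as fractions (0.0 to 1.0) of the
    current tier's vCPUs. Returns the current tier if nothing cheaper fits.
    """
    # 95th percentile of observed utilization (last of 19 cut points).
    p95 = quantiles(cpu_utilization, n=20)[-1]
    needed_vcpus = current.vcpus * p95 * (1.0 + headroom)
    candidates = [
        tier for tier in catalog
        if tier.vcpus >= needed_vcpus and tier.hourly_price < current.hourly_price
    ]
    return min(candidates, key=lambda tier: tier.hourly_price, default=current)


# Hypothetical catalog and usage history.
small = Tier("small", vcpus=4, hourly_price=0.10)
medium = Tier("medium", vcpus=8, hourly_price=0.20)
large = Tier("large", vcpus=16, hourly_price=0.40)
catalog = [small, medium, large]

mostly_idle = [0.20] * 100   # customer rarely uses more than 20% of 16 vCPUs
suggestion = recommend_tier(mostly_idle, current=large, catalog=catalog)
print(f"Suggested tier: {suggestion.name}")
```

The point of the sketch is the shape of the rule, not the numbers: size to historical actuals plus headroom, only ever suggest something cheaper, and leave the customer where they are when nothing fits. Any real version would need the customer's opt-in and more signals than CPU alone.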
AWS also removed many objections about its hybrid-cloud Outposts with more form factors to choose from and more regions. The company looks all-in on hybrid.
To have the most competitive PaaS, you need a competitive IaaS, and the same goes for SaaS. From what I saw at re:Invent for EC2, AWS has raised the bar again for its entire cloud franchise.