The Six Five On the Road with Gadi Hutt of Annapurna Labs at AWS re:Invent 2022

The Six Five On the Road at AWS re:Invent 2022. Patrick Moorhead and Daniel Newman sit down with Gadi Hutt, Director, Business Development for Annapurna Labs, at AWS re:Invent 2022. Their discussion covers:

  • EC2 and silicon innovation including an overview of the news that was announced this week
  • Continued investment in Intel-based and AMD-based instance types
  • AWS Compute wherever it is needed

Be sure to subscribe to The Six Five Webcast so you never miss an episode.

Disclaimer: The Six Five On the Road is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we do not ask that you treat us as such.

Transcript:

Patrick Moorhead: Hi, this is Pat Moorhead and we are live in Las Vegas at AWS re:Invent 2022. The Six Five is here. We’re having incredible conversations about incredible technology, with great people. Dan, how are you?

Daniel Newman: I think we’re having incredible conversations with incredible people about very capable technology.

Patrick Moorhead: Gosh, I know. One day I’ll get it right. I’m glad this is not the last, oh, it is the last interview. How about that?

Daniel Newman: It is, but the people watching may or may not know that.

Patrick Moorhead: Yeah.

Daniel Newman: Because we did a lot of these and they were all really good, and because we’re The Six Five, people watch this stuff all the time. It’s not just live at the event. But Pat, what a great AWS re:Invent it has been.

Patrick Moorhead: No, totally. I have to bring in the maturity of the cloud. The cloud’s 15 years old and it’s walking, it’s getting into a little bit of trouble. Everybody hasn’t figured it out yet, isn’t a full adult yet. But isn’t it amazing we’re like 10% there? At least that’s what Andy Jassy said on CNBC. But we create so much. By the way, Compute has been one of the game changers that has made this happen. I have to be a little bit biased. I love compute.

Daniel Newman: You’re a silicon guy?

Patrick Moorhead: I once was, in a previous life, a silicon guy. And then even before that I was the OEM buying this stuff. So selecting it for my products. But I love chips. And by the way, we need to introduce our guest. Gadi, how are you my friend?

Gadi Hutt: I’m doing great. Thanks for having me. Super excited to be here.

Patrick Moorhead: I know. It’s so exciting. We’ve had folks, we have talked on The Six Five with Dave Brown, I think twice on The Six Five, did The Six Five summit. And those were some of our highest viewed episodes. People love chips. Maybe a great place to start is what do you do for AWS?

Gadi Hutt: Yeah. Nice to meet you guys. I’m running the business for Annapurna Labs. Annapurna Labs is the company that AWS acquired in 2015. And we are focused on serving AWS customers across network, storage, compute, and machine learning.

Patrick Moorhead: Huge success story. I remember seeing the press release when I think it was Amazon bought it, not Nest. And I’m thinking, “What are they going to do, their own chips or something?” There’s no way that’s going to work out. That never works out. And here we are now with just this incredible portfolio of multiple ways of doing it. It truly is a success story.

Gadi Hutt: We use the silicon innovation across these domains to help customers do more. And we invest, we have a very good advantage to know the customers really intimately and understand what they need and learn from them and also innovate on their behalf in some cases, just to make sure we get the right products at the right time. Silicon design times are, it’s not like a software product, you write something, you deploy, you test, you change. Silicon is a longer design cycle. So we actually have to predict the future quite a bit, in a lot of cases. So it’s a very engaging environment. Just make sure we kind of keep up with the demand and customer innovation.

Daniel Newman: Except for those fun programmable FPGAs.

Patrick Moorhead: Oh, yeah.

Daniel Newman: That’s not what we’re talking about here.

Patrick Moorhead: Exactly.

Daniel Newman: Silicon innovation has been a big part of this story for AWS over the past few years. Started with compute, traditional monolithic compute for the data center and then progressed rapidly into AI, which is what I think we’re going to really focus on here today.

Patrick Moorhead: Don’t forget about Nitro. I mean Nitro’s pretty awesome too. Let’s not cut it short.

Daniel Newman: All right, all right. We’re going to talk about Nitro?

Patrick Moorhead: No, we’re not.

Daniel Newman: All right. So Nitro, Graviton, Trainium, Inferentia. Talk to me a little bit about the customer response to, let’s start with Inferentia.

Gadi Hutt: Yeah, so Inferentia we launched in 2019. Back then, most of the usage, the models were relatively small, hundreds of millions of parameters for machine learning models. And customers were spending most of their money on inference workloads. So we started with an inference chip called Inferentia. Huge success with external customers. We just had, you saw in Adam’s keynote, Qualtrics. I was on stage today with Pinterest. Last week, ByteDance, the company that runs TikTok and Douyin in China, announced that they’re using Inferentia and they’re saving 80% on their cost and improving latency by 25%. And small startups and large organizations like Entity and Airbnb and Snap are all using Inferentia today. So huge success from external customers. Also, internal customers. We talked in the past about Alexa, still growing strong, but we also added a lot of other services. Amazon Search, Amazon product search, is using Inferentia, again saving 80% on the cost of running search. Other teams like robotics, ads, Amazon ads. So just huge adoption across multiple use cases: language, computer vision, and recommender models.

Daniel Newman: It’s great that you get to build so many of the models and stuff for your own businesses. That’s kind of cool.

Gadi Hutt: For us, they’re all our customers. So for our team that designs those products and kind of brings them to market, we don’t differentiate between internal customers or external customers. They are all demanding and they have huge commitments to their customers that we have to make sure we can help them meet.

Patrick Moorhead: Yeah. Really good traction with inference, with Inferentia, Inf1, new Inf2. But it doesn’t stop there, it’s training too. And quite frankly, there’s not a lot of choices when it comes to training. I was pretty excited that you brought out Trainium. How is that going?

Gadi Hutt: Like you said, we started with the success we had with Inferentia. What happened in the meantime is that the models grew in size. So for customers to get more accuracy out of their models, typically what happens is they increase the size of the model, which increases the compute requirements, which also increases the spend on those models. So in some cases, I’ll give you an example: a large language model that takes a few weeks to train to completion can cost millions of dollars to train. And what happens if you made a mistake? You need to start all over again.

So these are huge spends that our customers are facing today with GPUs. And this is why we decided, okay, let’s replicate the success we had with Inferentia in lowering cost and bring this into training. So today customers see up to 60% lower cost on their training jobs, which means two things. They can either train more for the same budget or use the budget for other things to serve the customers better. So it’s really giving customers more choices in what to do with the resources that they have.
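To make the "train more for the same budget" point concrete, here is a minimal arithmetic sketch in Python. The 60% figure is the upper bound quoted in the conversation; the budget and per-run cost are hypothetical placeholders chosen only for illustration, and the function name is not from any AWS tool.

```python
def runs_per_budget(budget, baseline_cost_per_run, cost_reduction):
    """Training runs a fixed budget covers after a fractional cost reduction.

    cost_reduction is a fraction, e.g. 0.60 for "60% lower cost".
    """
    reduced_cost = baseline_cost_per_run * (1.0 - cost_reduction)
    return budget / reduced_cost


# Hypothetical numbers: a 1,000,000-dollar budget and a run that
# costs 1,000,000 dollars at the baseline price.
baseline_runs = runs_per_budget(1_000_000, 1_000_000, 0.0)
reduced_runs = runs_per_budget(1_000_000, 1_000_000, 0.60)

print(f"baseline: {baseline_runs:.2f} runs, at 60% lower cost: {reduced_runs:.2f} runs")
```

Under these assumptions the same budget covers about two and a half times as many runs, or equivalently the same run leaves 60% of the budget free for other work.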

Daniel Newman: So as Pat alluded to, training was a bit of a greenfield because there’s only really one player that kind of really. And of course you have a great relationship as a company, but I think also opportunistically, you’re looking at things like power. You were looking at things like cost and saying, “Hey, with training only going one way, the amount required, we want to have another alternative.” And that’s been really a strong spot for AWS. But I’m imagining you’re not coming in saying, “We’re going to do all the training for all the workloads.” You have some types of workloads that you’re probably thinking Trainium1 is really suited for.

Gadi Hutt: Yeah, it’s a great point. So we are giving customers all the choices. We have accelerators from NVIDIA, we have accelerators from Intel, we have accelerators from AMD. All of those are available for customers to use. And basically, they should choose which is the accelerator that gives them the best performance, the best cost for their workloads. In the case of Trainium, what we are trying to do, we’re trying to focus on three areas, increase performance significantly. So today, Trainium, in a single node, we are up to 50% faster as compared to the latest GPU instances. When we go to a cluster, our scaling, because we have tightly coupled technology together with EFA, which is the Nitro chip, we can do some innovations for customers. So our scaling is almost linear and we get up to 2.6x better cost for large models as compared to the GPU instances. But customers also are used to using GPUs. We’re not taking that away in any way, shape, or form. We’re just giving customers more options.

Patrick Moorhead: One thing that, as a previous chip guy, I appreciate about your claims, is that they end up being true over time. And a lot of times what you see is there’ll be a claim out there and it’s true for a month. You go back and check, and I think you and I talked about all the work that you did on Inf1 with software and how you were able to get just so much more performance out of that. But that doesn’t mean that there’s not value for the next generation of chips. And now we have Inf2 with the second generation of Inferentia. What were some of the improvements and what are some of the benefits that that’s bringing to customers?

Gadi Hutt: Yeah. So like you mentioned, we invest a lot in software. We actually have more software people than silicon people in the organization just because we want to make sure the ease of use is there for our customers. So the Neuron SDK spans all of these product lines. It’s the same in [inaudible 00:10:02], natively integrated into popular frameworks like PyTorch and TensorFlow. Specifically for Inf2, what we have done, we have taken the same silicon architecture like we have in Trainium, but we cost reduced a lot of the elements that are needed for training on large instances: the training connectivity within the instance, and a lot of networking going out of the instance. All of those are not needed for inference.

What we did, which is unique, that doesn’t exist in other instances that are inference optimized, we did keep high speed connectivity between all of the accelerators. What that gives, it gives customers options to deploy these large models that they’re spending millions of dollars to train. The only option before Inf2 was to use the same training machine to deploy the model, because you cannot fit a single model into a single chip. So Inf2 is giving that capability, basically all the capabilities we have in the training machine, but in a cost reduced package that will give customers the ability to deploy these large models, up to 175 billion parameters, within a box, which is unheard of.

Patrick Moorhead: It really is. It’s nuts. Nuts in a great way.

Daniel Newman: Clarify, why is Inf2 so good for these larger models?

Gadi Hutt: Yeah.

Daniel Newman: What did you do?

Gadi Hutt: So again, what we have done, we kept high speed connectivity, which we call NeuronLink, between the accelerators, and now you can deploy a single model that doesn’t fit into the memory of a single accelerator. We can deploy it across multiple accelerators in a cost-effective way.
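A rough back-of-the-envelope sketch in Python of why a model at that scale has to be spread over several accelerators. The 175 billion parameter count comes from the conversation; the 2 bytes per parameter (16-bit weights) and the 32 GB of memory per accelerator are assumptions chosen for illustration, not figures stated in the interview. The estimate counts weights only and ignores activations and other runtime memory, so a real deployment would need more headroom.

```python
import math


def accelerators_needed(num_parameters, bytes_per_parameter, memory_per_accelerator_gb):
    """Minimum accelerators needed to hold the model weights alone."""
    weight_bytes = num_parameters * bytes_per_parameter
    memory_bytes = memory_per_accelerator_gb * 10**9
    return math.ceil(weight_bytes / memory_bytes)


# Assumed values: 16-bit weights (2 bytes each) and 32 GB per accelerator.
num_parameters = 175 * 10**9
weight_gb = num_parameters * 2 / 10**9
count = accelerators_needed(num_parameters, 2, 32)

print(f"weights: {weight_gb:.0f} GB, accelerators needed (weights only): {count}")
```

With these assumed numbers the weights alone exceed the memory of any single accelerator many times over, which is the situation the accelerator-to-accelerator connectivity is meant to address.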

Patrick Moorhead: And did you do something with HBM as well?

Gadi Hutt: Yeah, so the chips are integrated with HBM memory. We kept the same HBM memory. We didn’t cost reduce the chip itself. We kept the same configuration as we have on the training, just to make sure the customers that are training these large models can also deploy these large models in a cost effective way. The way we reduce the cost is on the system level, in the cloud itself. On the system level, like I said, we reduced the connectivity within the server and reduced the EFA and ENA bandwidth, which are not needed for inference workloads.

Patrick Moorhead: I really appreciate, it’s funny, it’s not even fair enough to say you’ve taken a systems level approach. You’ve taken an entire data center level approach.

Gadi Hutt: Yes.

Patrick Moorhead: And that’s something I find myself having to explain to people out in the marketplace, which is this isn’t just a chip that is uniform and fits into everything. This is custom tailored for the architecture inside of AWS’s data center. And I think as we saw Moore’s law, very hard to keep up with, we had to find more novel ways of hitting the higher levels of performance. And you go systems, but I feel like you took it one step further, which is with all your clusters and how do I connect that. But listen, I love monster performance and efficiency. But the other thing I think we have to talk about as we get into this accelerated computing is sustainability.

Gadi Hutt: Oh, yeah.

Patrick Moorhead: And what does all this efficiency mean for sustainability? What’s your take on that?

Gadi Hutt: Yeah, so we are a hundred percent focused on sustainability, and this is coming as a direct request from customers. Because these are compute-intensive workloads, it’s super important for customers to minimize the carbon footprint and the impact. For Inferentia, we are 53% lower power compared to the comparable GPU. With Trainium, we are 50% lower power. But it doesn’t end there. Because the performance and the scaling, like we discussed on the clusters, is higher, there is a compounded effect. So let’s take an example. If someone wants to deploy a million inferences in an hour for a document retrieval service, we actually tested it: if they use GPU, the G4 instances, they need 33 of those instances to run a million inferences per hour. If they use Inferentia, they need only six to run the same million inferences. What this means is that not only are we lower power when you compare instance to instance, we’re actually using fewer instances. So the compounded effect is 90% reduced energy.
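The compounding described above can be checked with a short Python sketch. The instance counts (33 versus 6) and the 53% figure are taken from the conversation; treating the 53% lower power as a per-instance ratio is an assumption made here for illustration, since the interview states it against "the comparable GPU".

```python
def relative_energy(instances_new, instances_old, power_reduction_per_instance):
    """Energy of the new fleet as a fraction of the old fleet's energy.

    Multiplies the instance-count ratio by the per-instance power ratio,
    which is where the compounding comes from.
    """
    instance_ratio = instances_new / instances_old
    power_ratio = 1.0 - power_reduction_per_instance
    return instance_ratio * power_ratio


# 6 instances instead of 33, each assumed to draw 53% less power.
ratio = relative_energy(6, 33, 0.53)
reduction_percent = (1.0 - ratio) * 100

print(f"energy vs. baseline: {ratio:.3f}, reduction: {reduction_percent:.1f}%")
```

Under that assumption the fleet uses a bit under a tenth of the baseline energy, which is consistent with the roughly 90% reduction quoted.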

Patrick Moorhead: Interesting. It’s compounded almost.

Gadi Hutt: Yeah.

Daniel Newman: So I can say though, because we had a conversation with the director of sustainability at AWS and we were talking about economical sustainability. And I think as the economy slows, a lot of these sort of pet project ESG stuff, it’s all important, but it’s been kind of like, okay, I can’t justify this to my shareholders as much right now. This just makes sense.

Gadi Hutt: Yeah.

Patrick Moorhead: Well, it’s a twofer, right?

Daniel Newman: Well, I’m saying you’re getting more performance, you’re lowering your footprint, you’re getting to that lower carbon number that everybody’s aspiring to, and you can show it on the income statement, the balance sheet, like we’re actually delivering value. It’s so smart you almost wonder what are the hurdles? Why are people not, and by the way they are, but why is it not like a-

Patrick Moorhead: We had this, we had a fun… Was it a year ago or 18 months ago? It’s just like, hey, it’s one thing to do it internally, who else… By the way you delivered. Okay?

Gadi Hutt: Yeah, thank you.

Patrick Moorhead: That slide’s getting bigger and bigger and bigger with some very credible customers that I know need your technology a lot like Qualtrics, as an example.

Gadi Hutt: Yep. Again, Trainium instances just launched recently. So we are still building the customer base. We have good references that I can talk to. We have a company called HeliXon that will do protein folding on Trn1. We are collaborating with PyTorch quite heavily. We have a startup, a really promising startup called Magic, that is doing code generation on Trainium, and a few other companies. But we are just getting started.

We have actually what we call the large language model companies. They’re just starting to do that on Trn1. We just recently launched. And so we are not announcing it yet. And it takes time to train these large models, it’s weeks to months. So you should expect more good announcements coming soon.

On the Inf2 side, Inf2 is also enabling innovation that doesn’t exist today on Inf1 because the field evolved. So we can support things like dynamic input shapes that are critical for voice type of workloads. We can support custom operators. One piece of feedback we got from customers is, it’s great that you have this compilation, but what happens if your compiler doesn’t know my operators? So we’re now giving customers basically a very simple C++ API. You just write your own custom operators and they run natively on the chip.

So there is a lot of innovation going on as well, just to make sure more and more customers will feel comfortable. And one key aspect of that is industry collaboration. Like I mentioned, we collaborate very closely with PyTorch, and we actually co-founded, with PyTorch and Meta and Google and NVIDIA and Apple, a foundation called OpenXLA, which will open up the front end to all sorts of compilers that can target different hardware machines. So for Apple, it will probably be something like the iPhone. For us, it’ll be Trainium and Inferentia; for NVIDIA, it’ll be their GPUs. And what that will help build is an open source community around compilers, making sure folks can continue to contribute to that, and we will all benefit from those contributions.

Daniel Newman: So you talked about PyTorch, we kind of jumped from the sustainability and now you’re talking developers.

Gadi Hutt: Yes.

Daniel Newman: And so developers that are watching this or that are paying attention, seeing the great advancements, seeing the sustainability benefits, seeing the larger models, they’re going to be saying, “Okay, I want to develop on Inferentia and I want to train on Trainium.” How do they get started?

Gadi Hutt: So it’s super easy to get started. One of the leading tenets for us is not asking our customers to change the models to fit our hardware. That’s a big no-no. We are investing heavily in the compiler technology to make sure that customers will bring the models, the compiler will do auto optimizations for them and lower the model into the hardware. So in practice, what happens is that you basically need to make sure you include our SDK, you write one line of code, which is the compilation phase, and that’s it. After that, there are no changes to the model, no changes to the training scripts, no changes to the deployment of your inference. All the applications that customers are using today to deploy the models apply as well.

Patrick Moorhead: Whenever I hear the one line of code, I just want to smile. My son could figure this out, right?

Daniel Newman: He could.

Patrick Moorhead: He could. Yeah, sorry to interrupt, but I just love the one line of code. Just so I understand, if I’ve been doing this work on a GPU for the last three or four years with one line of code, I can put that over?

Gadi Hutt: Yes. We are testing with open source models. So anything on HuggingFace will run on Inferentia and Trainium. Anything that is coming from Meta will run natively on Trainium. We actually have a public roadmap of features that we are adding over time. So customers are informed, well informed, on what we are still working on. And the only caveat will be if customers have CUDA related code that is very specific to GPUs, that’s something they will have to take out of their applications. But open source models usually don’t have those CUDA dependencies. So that’s a great starting point for everybody.

Daniel Newman: I just love talking silicon. I don’t know about you.

Patrick Moorhead: Oh, you know me, you know about me.

Daniel Newman: I don’t know about you, but I’m just saying this is exciting. By the way, really, it’s hitting on all the fronts. It’s hitting on performance, which that’s going to be an area that people are going to constantly be monitoring. Like you said, every claim has about a month shelf life in this era. But you’re hitting on some other things. You’re hitting on the economics and the cost and helping people. You’re democratizing, making it more available. You’re hitting on sustainability, which is really important as well. And then of course, the developer community, in the world of AI, it’s kind of like mobile apps. You can’t become successful if you don’t get the developer communities. You really got the quadfecta.

Gadi Hutt: Yeah.

Daniel Newman: You’ve got to be feeling pretty good about your situation.

Gadi Hutt: Yeah, we are super happy. Our customers are happy, we’re getting good feedback. There is a close relationship between all those vectors, right? Because if we increase performance and make the chip more efficient in doing what it needs to do, that also helps with sustainability because it improves the performance-per-watt aspect of the workload. When we improve usability, it makes it easier for customers to come and make use of those accelerators. So all of those are interconnected with each other.

We are doing a lot of work out in the open with PyTorch, contributing features into the PyTorch ecosystem just to make sure it’s really easy to use those instances and we’ll continue to contribute. We have more features coming really soon. So, that’s a really exciting time. And because the models are becoming so expensive to use, we really see more customers looking at these large language models and large computer vision models like Stable Diffusion and others just to kind of, before on GPUs, it was cost prohibitive to many customers. And now we are starting to get you with more and more developers, academia and independent researchers that never had this ability to run on these platforms cost efficiently.

Patrick Moorhead: I love the customer angle and I love that, but I also like the overall story about, I don’t think anybody thought you would be able to pull this off. And that’s a fun part. By the way, it’s a story that doesn’t get told a lot. I wish it did because it’s one thing, I think everybody said, okay, Nitro, internet working, okay, I get this. But then, hey, we’re going to do a general purpose CPU. But hey, it’s going to be limited to what it can do. And then Graviton2 and then Graviton3 for high performance computing. Okay? Now, inference, ah, there’s no way that’s going to work. Right? And then this happens. And then Trainium, which again, different stage. It’s the most recent architecture you brought… Yeah, yeah. No, essentially Nitro is their DPU.

Daniel Newman: I know. I said the DPU.

Patrick Moorhead: Yeah. So it’s been a really fun story and I just want to thank you for coming on and sharing this with us. I’m a little enamored with this whole thing. I’m not usually this positive-

Daniel Newman: No fanboy.

Patrick Moorhead: … On this. But I also know how hard this is. And I look back at the last 30 years of either systems companies who’ve been able to crank out really good silicon, and you can count them on two fingers. Okay? And you’re one of them.

Gadi Hutt: Appreciate it.

Patrick Moorhead: Congratulations on it. The other great part about it is you’re unrelenting. I mean, you’re just like, you keep coming out at this rate.

Gadi Hutt: We have to, Pat. Customers are asking us to do this. We just have to. There’s no other choice. This space is moving so fast, you should expect more soon. And we’ll keep improving what we have and we’ll keep improving inference software and of course, new silicon generations for years to come. This is a super exciting vertical for us and segment, and we just see a lot of good feedback from customers. And thank you for the kind words.

Daniel Newman: Sure. Gadi, thank you so much. Everyone out there, thanks so much for tuning in to this episode of The Six Five On the Road at AWS re:Invent 2022. Daniel Newman here, joined by Patrick Moorhead. Great event, great week. If you like what you saw, hit that subscribe button, watch all the other episodes here and of course, everywhere that Patrick and I are with The Six Five. But for this episode, it’s time to say goodbye. We’ll see you all later.