Networking the AI Future: Ethernet’s Role in Scaling AI Infrastructure – Six Five On the Road at Dell Technologies World

By Patrick Moorhead - May 28, 2024

On this episode of Six Five On the Road, hosts Dave Nicholson and Lisa Martin are joined by Broadcom’s Hasan Siraj, Head of Software Products / Ecosystem, for a conversation on how Ethernet is pivotal in scaling AI infrastructure. Broadcom, a global technology leader, is at the forefront of addressing the complexities and needs of AI networking to facilitate the growth and deployment of AI technologies.

Their discussion covers:

  • The crucial role of networking in AI deployments
  • Broadcom’s solutions to the key challenges in AI networking
  • Reasons for choosing Ethernet over InfiniBand for AI cluster deployments
  • Anticipated evolution of AI infrastructure over the next 3-5 years and upcoming challenges
  • The demographic of customers deploying AI and their current stage in the AI adoption journey

Learn more at Broadcom.

Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we ask that you do not treat us as such.

Transcript:

Lisa Martin: Hey, everyone. Welcome back to Six Five On the Road. Our day three coverage of Dell Technologies World 2024 from Las Vegas continues. I’m Lisa Martin, back with my esteemed co-host, Dave Nicholson. Dave, we’ve been talking about this thing, AI. Have you heard of it?

Dave Nicholson: No.

Lisa Martin: No? Okay. We’ll try to walk you through.

Dave Nicholson: Oh, artificial intelligence. Yeah, that’s right.

Lisa Martin: Yeah, same thing. Same thing. We’ve been having lots of conversations about it. We can’t not talk about it at any event, period. But Dell is calling this the AI edition of Dell Technologies World, and we’re really excited to have… They talk a lot about their partner ecosystem as well. It was very much featured, not just on Dell’s main stage and in its breakout sessions, but on our show as well, as you know if you’ve been following the last couple of days. We’re pleased to welcome Hasan Siraj, head of software products and ecosystem at Broadcom, to the program. And we’re going to be talking about networking the AI future. Hasan, it’s great to have you on the program.

Hasan Siraj: Great to be here, Lisa and Dave, thank you for having me.

Lisa Martin: Thanks for joining us. So we want to kind of dig into networking and why and how it plays such an important role in AI deployments.

Hasan Siraj: Yeah, so networking actually plays a very important role no matter which workload you’re trying to deploy, but AI is a slightly different beast. If you look at just the history of what has been happening in the industry over the last 10, 15 years, virtualization has been the focus, right? All the major hyperscalers have built their clouds based on this virtualization technology. And the goal really, is you have a bunch of compute, how do you carve it up? How do you run different workloads on this compute? Now, AI enters, especially all of this focus on generative AI. And we are talking about models which are like ChatGPT-3 I think was 70 billion parameters, GPT-4 is 1.2 trillion parameters. So these are extremely large models.

And when we look at compute, you can’t possibly fit any of these models on a CPU, or even on a single GPU, which has tens of thousands of cores. These models will take hundreds, maybe thousands, in certain cases tens of thousands of GPUs to fit. But now if you have these tens of thousands of GPUs running a workload, you have essentially created a large supercomputer. It’s a distributed computing problem. Now, what is the glue that binds this together? It is the network, right? And that network has to scale, it has to provide the performance, it has to make sure it deals with any issues, right? Whether it’s congestion, whether it’s load balancing, and that’s why it plays an extremely important role.

A couple of years ago, when generative AI and recommendation models were being tested, Meta talked about trying to roll out these recommendation models. Somebody like Meta is looking at what your trends are and recommending certain ads to you, but they were finding that up to 57% of the time spent deploying AI recommendation models was being spent in networking. So imagine spending billions of dollars on this infrastructure, buying these GPUs, and the network is taking up 30 to 50% of the time. It’s not the best use of capital. So even if the network can improve all of this by 10%, and networking is usually 10% of the cost when you’re deploying these AI clusters, the network essentially pays for itself. So hence it plays a critical role in these kinds of deployments.
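As a rough illustration of the "network pays for itself" arithmetic above, here is a minimal Python sketch. The function name, the $1 billion cluster cost, and the reading of "improve all of this by 10%" as a 10% gain in useful compute are assumptions made for illustration; only the two 10% figures come from the conversation.

```python
def network_payback(cluster_cost, network_share=0.10, improvement=0.10):
    """Compare what the network costs with the compute value it recovers.

    cluster_cost  : total spend on the AI cluster (hypothetical figure)
    network_share : fraction of that spend that goes to networking
    improvement   : fraction of compute time recovered by a better network
    """
    network_cost = cluster_cost * network_share
    compute_cost = cluster_cost - network_cost
    # GPU spend that would otherwise sit idle waiting on the network.
    recovered_value = compute_cost * improvement
    return network_cost, recovered_value


network_cost, recovered_value = network_payback(1_000_000_000)
print(f"network cost:            ${network_cost:,.0f}")
print(f"recovered compute value: ${recovered_value:,.0f}")
```

Under these assumptions the recovered compute value lands in the same range as the network spend, which is the sense in which the network "essentially pays for itself."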

Lisa Martin: Absolutely critical. Then let’s start zeroing in on Broadcom’s role here. What are some of those key challenges that you’re addressing where the network is concerned? I can imagine performance, scale, load balancing, but walk us through some of the extant challenges that Broadcom is helping customers dial down.

Hasan Siraj: So let me try to explain some of the challenges first, and maybe it’s a good idea to just understand how machine learning works, right, at a 30,000-foot level. You do a lot of computation, which is matrix multiplication. I hated that in college, but that’s what you’re doing. Once you’re done, you generate a lot of weights and gradients, which these GPUs are exchanging with each other. So there’s a lot of communication, a lot of chatter. Then you synchronize this information, and you iterate this process over and over again until you converge, until you’re getting the accuracy that you’re looking for. So in this process, the first challenge is the bandwidth. Bandwidth demands are X-fold, maybe 10X compared to what the regular workloads are.
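The compute, exchange, synchronize, and iterate cycle described above can be sketched in a few lines of Python. This is a toy, single-process stand-in, not how any real training framework is written: the four "workers", the one-parameter model `y = w * x`, and the helper names are all hypothetical, and `all_reduce_mean` simply averages a list where a real cluster would run a collective operation across the network.

```python
import random


def local_gradient(w, shard):
    """Compute phase: gradient of mean squared error on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Communication phase (stand-in): every worker ends up with the average."""
    return sum(values) / len(values)


def train(shards, lr=0.1, tol=1e-6, max_iters=10_000):
    w = 0.0
    for step in range(max_iters):
        grads = [local_gradient(w, shard) for shard in shards]  # compute
        g = all_reduce_mean(grads)                               # exchange
        w -= lr * g                                              # synchronized update
        if abs(g) < tol:                                         # converged
            break
    return w, step


random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(400))]
shards = [data[i::4] for i in range(4)]  # four hypothetical workers
w, steps = train(shards)
print(f"learned w = {w:.4f} after {steps + 1} iterations")
```

Every iteration includes one exchange step, which is why the traffic pattern repeats for the full length of the job.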

And this starts from the server, some of the servers that Dell talked about, right? You’re talking about 2 x 25, 2 x 50 gig interfaces. We announced a 400 gig NIC yesterday. We’re talking about a move to 800 gig, 1.6 terabit, so the performance demands are huge. These flows are very different from the flows that you saw on the internet, right, with TCP/IP, where you have hundreds of thousands of flows and they load balance across the network very well. Over here you have few flows. They’re very large flows. If you don’t load balance them correctly, you are going to have a lot of congestion. And that means that you’re spending that 50% of the time in networking that Meta was talking about. When these GPUs are communicating with each other, they will all start communicating at the same time and you fill up the links very quickly.
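To see why a few very large flows balance worse than many small ones, here is a small Python simulation. It assumes flows are placed on links by something that behaves like a random hash of packet headers, with no awareness of link load; the link count, flow counts, and bandwidth figures are made up for illustration.

```python
import random


def busiest_link_ratio(num_flows, total_gbps, num_links, seed=0):
    """Place equal-sized flows on links at random (a stand-in for header
    hashing) and return the busiest link's load divided by the ideal even share."""
    rng = random.Random(seed)
    per_flow = total_gbps / num_flows
    load = [0.0] * num_links
    for _ in range(num_flows):
        load[rng.randrange(num_links)] += per_flow
    return max(load) / (total_gbps / num_links)


# Same total traffic over the same 8 links, split two different ways.
many_small = busiest_link_ratio(num_flows=100_000, total_gbps=3200, num_links=8)
few_large = busiest_link_ratio(num_flows=12, total_gbps=3200, num_links=8)
print(f"many small flows: busiest link at {many_small:.2f}x the even share")
print(f"few large flows:  busiest link at {few_large:.2f}x the even share")
```

With 12 flows on 8 links, at least one link must carry two or more flows, so the busiest link always runs well above the even share. With a hundred thousand small flows the loads average out.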

So again, congestion management is extremely key. These AI workloads may take weeks, even months to complete. GPT-3 initially on a 100 GPUs, it took almost a month to complete a training job. So imagine if you have failures in this: you will always have to go back to a checkpoint, and you’ll start from there. So it’s very important to have resiliency in the infrastructure to deal with this. And the last thing: you hear a lot about latency, but the important factor here is something we call tail latency. Tail latency matters because you still have to receive the last message before you start the new iteration of the machine learning process. And I think the network has to make sure that we get the lowest tail latency so you get the lowest job completion time. And the job completion time is the ultimate outcome that anybody who’s deploying AI training clusters is looking for. So these are some of the challenges, and that is something that we’re focused on addressing.
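The tail latency point can be made concrete with a short Python sketch. The latency and compute numbers are invented for illustration; the only idea taken from the conversation is that an iteration cannot finish until the last message has arrived.

```python
import statistics


def iteration_time(compute_ms, message_latencies_ms):
    """The next iteration waits for the slowest message, not the typical one."""
    return compute_ms + max(message_latencies_ms)


# 99 fast messages and one straggler (made-up numbers).
latencies_ms = [1.0] * 99 + [20.0]

median_ms = statistics.median(latencies_ms)
step_ms = iteration_time(compute_ms=10.0, message_latencies_ms=latencies_ms)
job_ms = step_ms * 1_000  # a job of 1,000 identical iterations

print(f"median message latency: {median_ms} ms")
print(f"time per iteration:     {step_ms} ms")
print(f"job completion time:    {job_ms / 1000:.1f} s")
```

Here the median message takes 1 ms, yet each iteration pays for the 20 ms straggler, so the job completion time follows the tail and not the median.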

Dave Nicholson: I think it’s really, really valuable the way that you sort of set the table for the conversation. Because when we think about the era of virtualization, and we’re still of course, virtualizing things, but the idea that you have this single physical environment that you’re carving up and all of the activity is happening within four walls of sheet metal, if you will, in a server, that goes away in the era of AI, even with relatively small things, as you say, it’s about the network. We’ve had a chance to talk about ecosystem over the last few days to get in some pretty deep conversations with silicon providers in the memory space, in the storage device space. And I was actually surprised when I asked one of them, what’s the biggest challenge for AI?

And they didn’t talk about memory or storage, they said networking, because they feel that they’ve solved some of these issues around density and power, and let’s not have data traverse across a bus. Let’s put the memory on the GPU die itself. We had the pleasure of meeting with one of your colleagues at Broadcom, and there was a little bit of show-and-tell, and it was fascinating to hear the specs on power consumption on some of these devices. So you have HBAs where it’s like 17 watts of consumption as opposed to an alternative that might be double, and the packaging that you provide around switching. Long way to get to the question: talk to us about power consumption, and how critical that is as you scale out. If you’ve got a thousand nodes in a cluster, how much can you contribute to power savings in an environment like that?

Hasan Siraj: Very good question, Dave, and I think that’s a key consideration today. It’s going to become an even bigger consideration moving forward. There is talk that 20% of the world’s power by 2030 is going to be consumed by AI data centers. Again, I go back to GPT-3. If I recall, this is from MIT Lincoln Labs: it took about 1,350 megawatt hours of energy, which can power 1,450 homes, just to train GPT-3. And even if you look at it from a micro perspective, look at the AI servers. In your regular server you’ll have a NIC, you’ll have some storage, you’ll have some CPU. The AI servers today, they will have these eight GPUs, and they’re consuming 700 watts, 800 watts of power. You’ll have these eight NICs to go along with it. And you have other PCI and everything else. So power is a key consideration in all of this.

And we are very, very focused on solving that problem. And actually, as customers build these data centers out, they tell us we need multiple miracles. As we build these very large clusters, we are focusing on some of the miracles that we have to deliver. So one of the first things is, you’ll see we talked about Tomahawk 5. Charlie talked about Tomahawk 5 yesterday. We are at least a generation ahead on density and performance compared to anybody else. And when you are a generation ahead, you can actually replace six systems with a single system. So that’s about a 75% power reduction right there. The second thing that we are focused on is optics. Optics, like we talked about yesterday as well, is a key component when you’re rolling out this infrastructure. So I think Jas was here. He talked about our NIC, the 400 gig NIC.

So we use the SerDes between our NIC and the switches. This is the industry’s best SerDes. It gives you five-meter reach, which means that when you’re building out these racks, you don’t need to use optical cabling between the top of rack and the servers. So that’s a lot of cost saving, but you’re also saving a lot on power. The other technology that we are focused on is something called linear drive optics. Linear drive optics don’t have any DSPs, so that’s an active component that’s out. So you’re consuming less power, about 25% less power. It’s cheaper, and all indications are it’s more reliable.

And lastly, I think you saw the co-packaged optics yesterday. As we move forward, and this is with the Tomahawk 5, the 50 terabit chip co-packaged with 128, a 100, 400 gig ports. But as we move forward, we are talking about moving from 50 to 100 terabit, 200 terabits. So think of systems that may have all of these optics sticking out. How do you manage the thermals and power? And this is why this technology will become more and more relevant, like Charlie talked about, that this is a 70% power reduction in the system.

So it’s a key consideration, and this is how we are trying to solve this problem. And moving forward, I mean, we are talking about cluster sizes of 32,000 nodes and 64,000 nodes today, but there’s already talk about getting to a million nodes. So think of a data center: these clusters are going to be across data centers. So your supercomputer, this cluster, is going to span three or four data centers, which are kilometers apart. So power considerations, some of the things that we talked about, become extremely important in that case as well.

Lisa Martin: Definitely extremely important. You talked about a lot of opportunities, a lot of potential, some of the miracles that need to be done. But I’d love for you to walk us through in our final few minutes here, the customer types that are deploying AI and where are they on this journey? Because we know it’s a journey.

Hasan Siraj: No, absolutely. And we talked a lot about training. There’s this other aspect, inferencing, right. There’s a lot of focus on training these days, and I think the hyperscalers have been leading the way in this case, right. We’ve been working very closely with all of them over the last two, three years. And what you find is that they’re already reaching scales of the 32,000 to 64,000 node clusters that I talked about, and almost all of them are running on Ethernet. I think there was a perception that InfiniBand is the only technology that can get this done, but the world’s largest clusters are built on Ethernet.

Dave Nicholson: And just to double-click on that for a second, how much of that is RDMA over Converged Ethernet?

Hasan Siraj: All of this.

Dave Nicholson: By definition, you’re talking RoCE. When you say Ethernet, you’re talking RoCE.

Hasan Siraj: I’m talking RoCE. So RDMA is the protocol, and it’s running over Ethernet in all of these cases. So if we look at who is following, you will see a lot of what we call GPU-as-a-service companies that are coming up. I think they’re clearly seeing an ROI in doing this, and I think they are also deploying pretty rapidly. Then we have what we call the digital natives. They have a clear use case for why they want to invest in their own models, why they want to have clusters. And then we get to the enterprise, right. If you get to the enterprise, what you’ll find is there are enterprises that are seeing a clear use case. If you’re trying to do cancer research, if you’re doing oil exploration, for example. So these are use cases where having something very different matters, even in banking fraud.

So this is where it makes sense to make your investment and differentiate yourselves now. But I think there are enterprises who are still thinking. They’re asking, what are my use cases? There are a few use cases which are common, which are understandable, automating the day-to-day operations and everything else, but what will be my differentiator? Do I need to invest in a model? And these customers, I think over the next 12 months or so, will figure out what is the right model for them. Does it make sense for them to invest in their own models, or do they take something that’s open source and retrain, and then invest in just incremental training on top of it? I mean, at a very high level, this is where… And I think what you’ll see is, as we move three, four years down the road, a lot more focus will start coming on the inference side of the house. As these models are there, you have different use cases covered. Inferencing is where a lot of these guys will also try to monetize.

Lisa Martin: And last question, we’ve got about 30 seconds left, but are you seeing a sense of urgency from those organizations that are really just dipping their toe in the water? Because every conference we go to, Michael Dell said this the other day, if you’re not already playing with AI, I think Jeff Clarke said it, it’s too late. Is there a sense of urgency within the enterprise, for example?

Hasan Siraj: Absolutely. But I think this is where they have to think very carefully. There is some fear of missing out as well.

Lisa Martin: Yes.

Hasan Siraj: But there is absolutely urgency. Again, this will be your competitive differentiation moving forward because a lot of these guys, if they can automate a lot of these what I call everyday tasks, they can focus on innovation and differentiation. I think that is why there is a lot of focus, and this is why they’re trying to find what is that use case for me. And everybody has a different use case.

Lisa Martin: They do, but that prioritization is key. Hasan, thank you so much for joining Dave and me on the program talking about differentiation. I think we saw that clearly yesterday from the show and tell that Charlie did on stage and that Jas and Robert did when they were visiting us yesterday. So we’re going to keep our eyes on this space, and continued success with Broadcom in this ecosystem.

Hasan Siraj: Thank you for having me.

Lisa Martin: Our pleasure. For our guest and for Dave Nicholson, I’m Lisa Martin. You’re watching Six Five On the Road, day three coverage of Dell Technologies World 2024. We’ll see you soon.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.