Untether AI and the AI Inference Acceleration Opportunity — MI&S Insider Podcast

By Patrick Moorhead - April 11, 2024

On this episode of the Moor Insights & Strategy Insider Podcast, host Patrick Moorhead is joined by Chris Walker, the newly appointed CEO of Untether AI to talk about the inflection point we’re seeing with AI.

AI is experiencing a pivotal moment. With neural network compute loads increasing and companies now transitioning from training AI models to deploying them (inference), there is a need for an entirely new type of compute structure to deal with this seismic shift. Untether AI, born and based in an international hub of AI (Toronto), with a roster of Silicon Valley chip veterans, has purpose-built AI chips that not only drop the cost barrier but also reduce operational expenses. Untether AI’s ICs run AI workloads faster and more energy efficiently, so they consume significantly less power.


Patrick Moorhead: Hi, this is Pat Moorhead with Moor Insights & Strategy, and we are here for another podcast. What has been keeping analysts busy 24 by 7 the last two years? It has been AI. Surprise! And like with any inflection, it didn’t start this year, didn’t start last year, it started years ago, but we are progressing at delivering real business and consumer value. It has become a real thing. It’s a trend, not a fad. Part of the build-out, though, has been very challenging. Recent reports say 85 percent of cloud giants’ CapEx is GPUs, and the latest reports, and I can vouch for them, suggest that data center power is going to double by the year 2026. What are we going to do about this? We’ve definitely hit an inflection point. One of the companies trying to help out with this power and efficiency problem is Untether AI. I have the CEO, Chris Walker, with me. Chris, great to see you again.

Chris Walker: Great to be with you, Pat. We’ve done so many of these, and it’s so encouraging to be able to do this in new territory like AI. For us old-timers, it’s fascinating new stuff.

Patrick Moorhead: No, isn’t it crazy? The first AI algorithms were literally created in the 1960s. We didn’t have the acceleration. We didn’t have the storage. We didn’t have the memory to actually even start delivering on them until about 15 years ago. Then the University of Toronto, with a lot of research, found a way to leverage accelerators to do object recognition, and machine learning then led to deep learning. And here we are with generative AI. As we’ve seen, just because something new comes along doesn’t mean that all the other iterations aren’t in full use. And in fact, generative AI is actually a pretty small portion of it as machine learning moves on. So yeah, it’s fun and it’s great to see you. Back when you were at Intel, in my prior life, I don’t think we met, but I’m sure we intersected somehow. It’s great to see you, man.

Chris Walker: Yeah, it’s fantastic. Going from competing across PowerPoints and marketing messages to collaborating, and all the fun stuff that I did in my later years at Intel, you’ve always been a great voice with critical analysis of what’s happening in markets, and especially in semis. Who would have thought that chips and semis would be so hot now from a standpoint of attention and people trying to understand? So being with you to help explain that and extend that out to your audience is fantastic.

Patrick Moorhead: So Chris, you spent 30 years at Intel and now you’re at an AI startup, Untether AI. For those who may not be familiar with the company, first of all, why did you join and what does the company do? What problem are you incrementally solving out there?

Chris Walker: Yeah, so I was happily retired from Intel. And then by way of starting to work in consulting and working with different companies, Untether AI came to my attention as a really unique and cool opportunity. I would say, Pat, we saw multiple different big changes in the compute industry through our careers, right? Workloads moving into the desktop, into mobile, from text to GUI, right? We’re that old, right? At least I am.

Patrick Moorhead: 1990, Mark, sorry, Chris. First job. So I’m with you.

Chris Walker: The architecture battles and shifts that happen. It’s the rise of x86, the rise of Arm with mobility and the internet of things, the rise of GPUs, first in graphics, now in AI and training especially. And then the things that are happening in the ecosystem, how things get built, new technologies, chiplets, all that. It’s now actually in one place. And Untether, I think, was uniquely in a position to capitalize on all those big career generational changes that are actually stacked on top of each other. And that was super exciting, super compelling, to get me off the sailboat and back into business. It’s interesting, you mentioned Toronto. Actually, I’m speaking to you from Toronto today. It’s where our headquarters is, because there’s been such a hotbed of talent.

One of the things that’s interesting, as you said, is that 15 years ago the ML algorithms, the vision and object detection networks, were coming in, and they have all, even today, been running on a traditional GPU architecture that’s a couple of decades old. For us, we think the mandate, and why we were formed, is that we need to build the right architectures for the workload of the future, which is AI, and it needs to be done in a way that’s fast, accurate, and most importantly energy efficient. That’s been the core of our approach: taking an at-memory architecture approach to really change the bottlenecks, to really address the biggest impediment to us all enjoying and being able to achieve the promise of what AI has, which is just power.

Patrick Moorhead: Yeah. And as we’ve seen in different iterations, actually there are even sub-iterations of AI the last 15 years, but there are two worlds, right? There’s the training of the frameworks or the models, and then you have the inference, actually running it. And it becomes part of an application. AI isn’t an application. Applications leverage AI to do this. Can you maybe turn up the contrast ratio between the training world and the inference world and talk about where Untether AI plays?

Chris Walker: Yeah. So I think in the training world, which has been really the dominant story for the last couple of years with the CapEx build-out, it’s brute force: thousands upon thousands of servers, gigawatts of power, massive amounts of water to cool these things, months and months to train a model. That’s a certain capital, operational, and computational requirement. What inference is about is the next $100 billion market that builds on top of training as we deploy this in the real world, as we deploy this in applications that you see and experience, whether it’s autonomous driving from a car standpoint, your running language model queries, or things that are actually impacting industry and effectiveness behind the scenes.

The recycler using object detection to recycle materials faster, more efficiently, with higher accuracy; agriculture tech and farming, where it’s coming into play. So this is really where, with AI, the rubber hits the road, so to speak. What people care about as we deploy these models isn’t only what it enables from a cool-factor standpoint; it also has to be done effectively and efficiently to impact the bottom line and to actually do it at scale. And that’s where we feel that having the right architecture to process very fast throughput with energy efficiency comes into play, because people are form-factor constrained. Not every enterprise that runs on premises wants to infinitely build out data centers or go to liquid cooling, right? These are the things that matter as these things get deployed at real scale.

Patrick Moorhead: Yeah, I’m grinning when you talk about liquid cooling. To me, liquid cooling, in my experience, has always been the sign that you hit the end of the road. And whether that was water cooling a desktop, a gaming GPU, or even in data centers, that’s where your architecture has run into something. Typically it’s, hey, you add water to get something more out of it, not necessarily the baseline. But in a way, I feel like the industry is moving to where, just to get that baseline increase from, let’s say, the prior generation, I’m going to have to hit water.

That is an indication to me that something is amiss here. And I always like to think of these transitions in terms of a quadrangle, which is you have compute, memory, storage, and networking. What is the current bottleneck? I think we have a few of them, but certainly efficiency is there. Can you explain, though, technologically the difference between, let’s say, a training solution and an inference solution? Because some companies say, “Hey, you should do training and inference on the same giant GPU.”

Chris Walker: Really what we see is that in the traditional GPU-type architecture, as you move to inference, the energy is basically going to waste in the movement of the data, the movement from memory into the computation. When you deploy it, when you’re doing inference, it is about the speed of the computation, it is about the throughput, because that’s how you experience it. How long does it take to get the token back? How fast can I process the image? How many networks of signals or imagery can I do at the same time? So everything wants to be highly parallelized, which we are, right?

Over 1,400 custom RISC-V processors in our design, as an example. But what you have in a traditional architecture is this bottleneck of memory trying to come into the chip, going through a cache to then be distributed to all the different processing elements, or ALUs, depending on what you want to call them. What we’ve done is we’ve put the computation and the memory together, hence at-memory. What that does is allow you to more effectively map the algorithms or the network onto the chip and reduce the bottleneck, reduce the amount of energy that’s traditionally spent just moving the data into the chip to process. You can look at certain inference workloads where up to 90 percent of the energy consumed by the chip is just in data movement.

So we turn that on its head and say, let’s start by putting the data, the memory, and the computation together, and then more effectively and efficiently pull what we need for larger networks in from memory. And that’s a game changer, where you get the best of both worlds: that speed and throughput with lower power, because you’re utilizing the cores and you’re utilizing the memory at much higher, much more efficient rates. That saves a lot of time and removes a lot of the bottleneck. In some cases it also saves you on the amount of memory you actually need in the chip or the module as a starting point.
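
As a rough illustration of the data-movement point above, the following Python sketch models energy per inference as compute energy plus data-movement energy. Only the 90 percent share comes from the conversation; the 1-joule baseline, the function name `energy_per_inference`, and the reduction fractions are hypothetical values chosen for illustration, not Untether AI measurements.

```python
def energy_per_inference(compute_j: float, movement_j: float,
                         movement_reduction: float) -> float:
    """Return total energy (joules) after cutting data-movement energy.

    movement_reduction is the fraction of data-movement energy removed,
    between 0.0 (no change) and 1.0 (all movement energy eliminated).
    """
    if not 0.0 <= movement_reduction <= 1.0:
        raise ValueError("movement_reduction must be between 0 and 1")
    return compute_j + movement_j * (1.0 - movement_reduction)


# Hypothetical baseline: 1 J per inference, 90% of it spent moving data.
BASELINE_J = 1.0
MOVEMENT_SHARE = 0.9
compute_j = BASELINE_J * (1.0 - MOVEMENT_SHARE)
movement_j = BASELINE_J * MOVEMENT_SHARE

# Assumed reduction fractions, for illustration only.
for reduction in (0.0, 0.5, 0.9):
    total = energy_per_inference(compute_j, movement_j, reduction)
    print(f"movement cut by {reduction:.0%}: "
          f"{total:.2f} J ({BASELINE_J / total:.1f}x less energy)")
```

The sketch shows why the data-movement share matters: when movement dominates the budget, trimming it does far more for total energy than making the arithmetic itself cheaper.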

Patrick Moorhead: I appreciate that. And by the way, go onto Untether AI’s website. There are white papers and technological descriptions that do the double-click that, if you are an uber geek, uber nerd, you will love.

Chris Walker: Love to tell you about the data types and how we’re native in FP8 and FP16 support. All that stuff is in the white papers, and we’re happy to connect directly.

Patrick Moorhead: Yeah. Chris, just to make this, let’s say, more real, should we expect Untether AI solutions to be, let’s call it, at the commercial edge, right? Which could be a retail store or factory floor. Or is this in a traditional data center or somewhere else? What are you targeting from a location standpoint?

Chris Walker: From a market and location standpoint, we’re on our second-generation chip. Our speedAI family is particularly well suited for heavier edge vision networks. So you’ll see that family come into things like Industry 4.0 applications, with high-speed, very detailed work like metrology and inspection, and into areas like smart cities or ag tech, where we can replace what people are doing in multiple boards, multiple applications, and have the computational density to do it in our solution. And where that’s important is, it’s not just the density or reducing the footprint of how many different solutions you need. What we’re actually seeing from customers is they have form-factor constraints on the edge. I can only have so big of a box in the machine or the application, but they want to be able to upgrade the capability, run more cameras at higher resolution and process those in real time, integrate the autonomous driving of a tractor with the weeding mechanism, be it pesticide or actually lasers.

All that matters where they can run it at higher and higher speeds to get the efficiency out of their AI solution, get efficiency out of the automation. And that’s where we come in, where we can provide them five, six, eight X performance and performance-per-watt benefits that are real game changers for how they think about the models and what they can do with AI. It really unleashes, untethers if you will, their expectations of what AI can do for them in their workflows. So we see that today. I think the efficiency of at-memory, and this reduction of the memory bottleneck and the energy usage, will also apply in things enterprises will want, like natural language processing, chatbots, and financial fraud detection, speeding that up. And we see it in data center applications as well, for generative solutions, where that power, performance, and compute density ratio really matters as you’re deploying it. We think our spatial architecture and at-memory compute have a very good advantage in that as well. Not massive-scale training, but specifically inference acceleration as you’re deploying the models.

Patrick Moorhead: Yeah, tremendous need for that. I just got back from Mobile World Congress a couple of weeks ago. And aside from the typical “how do we monetize 5G?”, I spent most of my time talking about edge compute and the insatiable need. And we’ve both seen through history that compute naturally moves to the point of data collection. That’s just historically been true. That was true going from mainframes to minis. That was true going from minis to x86 servers, and then to PCs and smartphones. And not that we’ve gotten rid of any of them, right? Technology is always about ands. Maybe we got rid of minis.

But there are a lot of AS/400s still sitting out there, and Power computers. That’s just what happens with history, because it’s more economical, and if you can manage the distributed, the disaggregated compute, which we have the tools today to do (typically, we have a common container for applications), you can manage on the edge. There are companies that are doing edge rollouts where you don’t even need to have a technological person. They just need to know where to plug it in and how to put the networking together, and they turn it on. And it literally works out of the box. And these are applications like video analysis, people counting, security cameras, basically industrial IoT, web 4.0.

Chris Walker: And then there are the cases, especially autonomous vehicles, where we’re talking about higher levels, level four ADAS. The other thing is it’s more efficient, more cost effective. But in some cases you can’t stream it from the cloud. You can’t afford that latency. And that’s really the big thing when we’re talking about intelligence at the edge. We’re talking about a level of learning and decision making that can’t tolerate high latency. And so it requires that compute to be closer to the data, to be closer to the stream of information.
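
To put the latency point in concrete terms, this small Python sketch computes how far a vehicle travels while it waits for a response. The speed and the latency figures are assumed example values, not numbers from the conversation, and `distance_travelled_m` is a hypothetical helper.

```python
def distance_travelled_m(speed_kmh: float, latency_ms: float) -> float:
    """Distance in metres covered at speed_kmh during latency_ms."""
    if speed_kmh < 0 or latency_ms < 0:
        raise ValueError("speed and latency must be non-negative")
    metres_per_second = speed_kmh / 3.6
    return metres_per_second * (latency_ms / 1000.0)


# Assumed example values: a vehicle at 100 km/h, comparing a short
# on-device response time with a longer round trip to a remote server.
SPEED_KMH = 100.0
for label, latency_ms in (("on-device", 10.0), ("remote round trip", 150.0)):
    metres = distance_travelled_m(SPEED_KMH, latency_ms)
    print(f"{label}: {latency_ms:.0f} ms -> {metres:.2f} m travelled")
```

Whatever the real figures are for a given deployment, the relationship is linear: every extra millisecond of round-trip time is distance covered before a decision can take effect.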

Patrick Moorhead: I want to make sure that I’ve captured the uniqueness of the company and the advantages that it has out there. There are a lot of companies making promises out there. In fact, some companies started five or six years ago, have burned through hundreds of millions of dollars in funding, and have very little to show for it. You talked about an architectural advantage. Can you double-click on just what the advantages are to capitalize on this inference opportunity?

Chris Walker: So the key advantage is we put the data, the memory, together with the compute element. In our processing elements, we put the memory on chip with the processor, so we’ve got the data as close to the compute as possible. What that does is reduce the amount of energy you’re burning, whether it’s power, which matters in a battery-operated type of autonomous vehicle, or a form-factor constraint where the other side of power is the heat generated and people can’t tolerate that. It has the real advantage of reducing, by up towards 90 percent, just that data traffic movement. At the same time, the speed or throughput we talk about needs to be delivered with accuracy. So we do this at very high accuracy levels, based on our processing elements and based on the data types that we support. Smaller batch sizes.

Those are the types of things in these edge use cases that are highly valued. And the big thing that we’ve also innovated around is how we move data around the chip, right? We have a very innovative network on chip that doesn’t create the bottlenecks. It allows the banks, allows the processing elements, to shuffle data much more effectively than you see in other solutions. It’s great. You can visit the website and see the pictorial of all that. And that also translates into how we sell: at a chip level, at a card level, and in the future at a chiplet level, where we’re members of things like UCIe, standards-based, which I really think is the future of how we’re going to partner and develop more advanced solutions across industry players. So we’re super excited about that as well. And our spatial architecture is very well suited to configurability.

We are an inference accelerator working with different processor types in different applications. The other side of it is, Pat, you and I, we’re all chip heads and semi guys, but it’s the software. The big differentiation is, we’re a five-year-young company. We’re in our second generation of silicon. Most importantly, over half of my engineering is in software. It is our imAIgine SDK that enables people to take their trained models and move them into our architecture in an easy flow. If they want to do optimizations all the way down to the metal, they can, but most importantly, they get the performance by porting their model over in a smooth push-button flow. That’s a big investment in kernels and compilers and the whole toolchain to do that. So that’s a big thing as we talk with our customers, as we go to do this deployment of inference on a larger scale. The training is still going to happen in a big data center, still going to happen on a GPU architecture. We work very hard and are very focused on the software to enable people to then use the better and more optimal architecture to deploy those models. And that’s a big differentiator in our journey as well, doing those together as one product.

Patrick Moorhead: Yeah, I’m really glad you brought up the software piece. And it’s the software, not just to optimize what’s in front of you, but it also has to respect what’s coming in from the training that might be done on somebody else’s GPU. And it just makes sense that an accelerator would do this three times, four times, ten times better and more efficiently. It really makes sense. So, I have to ask. There’s been a lot of talk about funding, right? I mean, you see the massive run-up of NVIDIA’s market capitalization, and then there are players around that which either do something similar or they’re a networking company like a Broadcom and a Marvell. But then you’ve got Sam Altman; he walked back the $7 trillion fund for AI, but he didn’t put a number on it. Okay. And then you’ve got talks about SoftBank and a hundred-billion-dollar fund. I have to ask, what do you think it takes to field a market-leading AI accelerator? And where are you on this journey? Are you asking for seven trillion dollars?

Chris Walker: I think the common joke for everybody is, “We’ll do it for half.”

Patrick Moorhead: I’ve seen that and that is a funny meme, right? Yeah, it’s good.

Chris Walker: I think there’s been the hundred-odd billion spent to date on training infrastructure. Over the next couple of years, it’s going to shift, not shift away from training; there’s going to be an additive, upwards of a hundred billion, spent on the deployment, on the inference hardware. That’s just starting to be the consumption part on typical silicon hardware. Then you look at the economics of what you need, data center, infrastructure, power, et cetera, to go field all that, and you start getting the fabs to build it. You do start getting into those really big numbers. For us, and what I think it takes, the market opportunity is very clear. I don’t know that we would all look back and go, “Well, the web or the internet wasn’t worth it.” And a lot of it is going to be a case where, just like the internet allowed companies large and small to reach global, I think the AI tools are going to enable small companies to compete looking large.

And it’s going to allow people at the edge to be much more efficient and effective with their industrial products. For us, as a startup playing in the space, the opportunity is massive, and so is what it takes to compete. What it takes is you’re always designing, you’re always working with partners to bring in new IP, to tape out. Silicon itself is always, even as a startup designing in a fabless way, still very capital intensive. And what we see is the great pull for our products, with customers saying, “Well, can you do derivatives? Can you...” And that’s the very interesting thing about our architecture: it’s very configurable. So we’re able to do different spins of it for different applications, different memory types that people want, we think, very effectively. And so we, like others, are always in the market trying to continue to scale up.

And that’s really where we’re at the inflection point, with the second-generation product in hand, this year sampling to customers along with the imAIgine SDK to enable them to go into deployment, go into production this year and next year. And so that’s a very exciting time, where the inference part is now real for the market. It’s real for us. And there’s a huge pull from small companies all the way to very large automakers for this kind of solution. We’re engaged with them. It’s a fantastic pipeline. And so we’re really encouraged by the response to our architecture, the response to our tools and team. It’s a great team of industry veterans, startup veterans, people from academia and machine learning, people from machine learning practices, pure software consultancies. All that comes together, and it’s just an intense environment in how you work. Everybody’s here because they believe in the architecture. That’s what’s really interesting: it’s not just a startup addressing AI. It’s the, “Wow, there’s something different happening here.” And I think that’s actually what brought me in too.

Patrick Moorhead: I love it. Sounds like you’re having a lot of fun. I guess, what’s the right way to say this? As we both get more mature, you want to do something that is rewarding and invigorating, and it sounds like, Chris, at Untether AI you’ve found just a great opportunity. I’m excited for you, and the world needs more competition. If what we’re witnessing now isn’t an absolute exclamation point, whether it’s pricing, whether it’s efficiency, whether it’s total cost of ownership with electricity and form factor and cooling, the world needs more competition. It sounds like you’re poised to bring that to the edge.

Chris Walker: Yeah, the history of our participation in the ecosystem has been that the world needs and rewards open source software, innovative new ideas, and new ways of partnership to deliver it. And I think in the AI case, there is the added mandate that we do this responsibly from a sustainability and ecological standpoint too, to make it happen and make it real. I think that’s been a big motivation for why we were founded, why we exist today, and I think why we’ll continue to be growing.

Patrick Moorhead: Thanks for your time Chris. Hopefully we can do this again.

Chris Walker: Yeah, looking forward to many more.

Patrick Moorhead: So this is Pat Moorhead, Chris Walker. We’re talking efficient AI, less power on the edge. You heard it here first. Check out all of our content on our website about Untether AI. And hey, if you want to do the deep dive, go to the Untether AI website out there. Like what you heard? Hit that subscribe button, tell your friends, family, your pets, everybody about this show. Thanks a lot. And we’ll see you again.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.