At The Heart Of The AI PC Battle Lies The NPU

By Anshel Sag, Patrick Moorhead - May 21, 2024
NPUs will be a key battleground for AI chip vendors.

There is a clear battle underway among the major players in the PC market about the definition of what makes an AI PC. It’s a battle that extends to how Microsoft and other OEMs interpret that definition as well. The reality is that an AI PC needs to be able to run AI workloads locally, whether that’s using a CPU, GPU or neural processing unit. Microsoft has already introduced the Copilot key as part of its plans to combine GPUs, CPUs and NPUs with cloud-based functionality to enable Windows AI experiences.

The bigger reality is that AI developers and the PC industry at large cannot afford to run AI in the cloud in perpetuity. More to the point, local AI compute is necessary for sustainable growth. And while not all workloads are the same, the NPU has become a new and popular destination for many next-generation AI workloads.

What Is An NPU?

At its core, an NPU is a specialized accelerator for AI workloads. This means it is fundamentally different from a CPU or a GPU because it does not run the operating system or process graphics, but it can easily assist in doing both when those workloads are accelerated using neural networks. Neural networks are heavily dependent on matrix multiplication tasks, which means that most NPUs are designed to do matrix multiplication at extremely low power in a massively parallel way.
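To make the matrix-multiplication point concrete, here is a minimal sketch in plain Python of a single fully connected neural-network layer. The function names and toy numbers are illustrative assumptions and are not drawn from any vendor's NPU toolchain. An NPU performs the same multiply-accumulate steps, but across many hardware units at once rather than in a loop.

```python
def dense_layer(inputs, weights, bias):
    """One fully connected layer: outputs[j] = bias[j] + sum_i inputs[i] * weights[i][j]."""
    outputs = []
    for j in range(len(bias)):
        acc = bias[j]
        for i, x in enumerate(inputs):
            acc += x * weights[i][j]  # one multiply-accumulate (MAC) operation
        outputs.append(acc)
    return outputs


def mac_count(n_inputs, n_outputs):
    """Number of multiply-accumulates needed for one pass through the layer."""
    return n_inputs * n_outputs


# Toy example: 3 inputs feeding 2 outputs (illustrative integer values).
layer_inputs = [1, 2, 3]
layer_weights = [[1, 0], [0, 1], [2, 2]]  # shape: 3 inputs x 2 outputs
layer_bias = [0, 1]

print(dense_layer(layer_inputs, layer_weights, layer_bias))  # [7, 9]
print(mac_count(3, 2))  # 6
```

Even this tiny layer needs six multiply-accumulates. Real models chain many much larger layers, which is why hardware built around parallel, low-power matrix math pays off.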

GPUs can do the same, which is one reason they are very popular for neural network tasks in the cloud today. However, GPUs can be very power-hungry in accomplishing this task, whereas NPUs have proven themselves to be much more power-efficient. In short, NPUs can perform selected AI tasks quickly, efficiently and for more sustained workloads.

The NPU’s Evolution

Some of the earliest efforts in building NPUs came from the world of neuromorphic computing, where many different companies tried to build processors based on the architecture of the human brain and nervous system. However, most of those efforts never panned out, and many were pruned out of existence. Other efforts were born out of the evolution of digital signal processors, which were originally created to convert analog signals such as sound into digital signals. Companies including Xilinx (now part of AMD) and Qualcomm took this approach, repurposing some or all of their DSPs into AI engines. Ironically, Qualcomm already had an NPU in 2013 called the Zeroth, which was about a decade too early. I wrote about its transition from dedicated hardware to software in 2016.

One of the advantages of DSPs is that they have traditionally been highly programmable while also having very low power consumption. Combining these two benefits with matrix multiplication has led companies to the NPU in many cases. I learned about DSPs in my early days with an electronic prototype design firm that worked a lot with TI’s DSPs in the mid-2000s. In the past, Xilinx called its AI accelerator a DPU, while Intel called it a vision processing unit as a legacy from its acquisition of low-power AI accelerator maker Movidius. All of these have something in common, in that they all come from a processor designed to analyze analog signals (e.g., sound or imagery) and process those signals quickly and at extremely low power.

Qualcomm’s NPU

As for Qualcomm, I have personally witnessed its journey from the Hexagon DSP to the Hexagon NPU, during which the company has continually invested in incremental improvements for every generation. Now Qualcomm’s NPU is powerful enough to claim 45 TOPS of AI performance on its own. In fact, as far back as 2017, Qualcomm was talking about AI performance inside the Hexagon DSP, and about leveraging it alongside the GPU for AI workloads. While there were no performance claims for the Hexagon 682 inside the Snapdragon 835 SoC, which shipped that year, the Snapdragon 845 of 2018 included a Hexagon 685 capable of a whopping 3 TOPS thanks to a technology called HVX. By the time Qualcomm put the Hexagon 698 inside the Snapdragon 865 in 2019, the component was no longer being called a DSP; now it was a fifth-generation “AI engine,” which means that the current Snapdragon 8 Gen 3 and Snapdragon X Elite are Qualcomm’s ninth generation of AI engines.

The Rest Of The AI PC NPU Landscape

Not all NPUs are the same. In fact, we still don’t fully understand what everyone’s NPU architectures are, nor how fast they run, which keeps us from being able to fully compare them. That said, Intel has been very open about the NPU in the Intel Core Ultra model code-named Meteor Lake. Right now, Apple’s M3 Neural Engine ships with 18 TOPS of AI performance, while Intel’s NPU has 11 and the XDNA NPU in AMD’s Ryzen 8040 (a.k.a. Hawk Point) has 16 TOPS. These numbers all seem low when you compare them to Qualcomm’s Snapdragon X Elite, which has an NPU-only TOPS of 45 and a complete system TOPS of 75. In fact, Meteor Lake’s complete system TOPS is 34, while the Ryzen 8040 is 39—both of which are lower than Qualcomm’s NPU-only performance. While I expect Intel and AMD to downplay the role of the NPU initially and Qualcomm to play it up, it does seem that the landscape may become much more interesting at the end of this year moving into early next year.

Shifting Apps From The Cloud To The NPU

While the CPU and GPU are still extremely relevant for everyday use in PCs, the NPU has become the center of attention for many in the industry as an area for differentiation. One open question is whether the NPU is relevant enough to justify being a technology focus and, if so, how much performance is enough to deliver an adequate experience? In the bigger picture, I believe that NPUs and their TOPS performance have already become a major battlefield within the PC sector. This is especially true if you consider how many applications might target the NPU simultaneously—and possibly bog it down if there isn’t enough performance headroom.

With so much focus on the NPU inside the AI PC, it makes sense that there must be applications that take advantage of that NPU to justify its existence. Today, most AI applications live in the cloud because that’s where most AI compute resides. As more of these applications shift from the cloud to a hybrid model, there will be an increased dependency on local NPUs to offload AI functions from the cloud. Additionally, there will be applications that require higher levels of security for which IT simply won’t allow data to leave the local machine; these applications will be entirely dependent on local compute. Ironically, I believe that one of those key application areas will be security itself, given that security has traditionally been one of the biggest resource hogs for enterprise systems.

As time progresses, more LLMs and other models will be quantized in ways that will enable them to have a smaller footprint on the local device while also improving accuracy. This will enable more on-device AI that has a much better contextual understanding of the local device’s data, and that performs with lower latency. I also believe that while some AI applications will initially deploy as hybrid apps, there will still be some IT organizations that want to deploy on-device first; the earliest versions of those applications will likely not be as optimized as possible and will likely take up more compute, driving more demand for higher TOPS from AI chips.
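As a rough illustration of why quantization shrinks a model's footprint, the sketch below estimates weight storage at different bit widths and applies a simple symmetric int8 scheme to a few values. The 7-billion-parameter model size and the sample weights are assumptions chosen for illustration. Production quantizers use more sophisticated methods, and how much accuracy is retained depends on the model and the technique.

```python
def model_footprint_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical 7-billion-parameter model at three precisions.
for bits in (16, 8, 4):
    print(bits, "bits:", model_footprint_gb(7_000_000_000, bits), "GB")
# 16 bits: 14.0 GB, 8 bits: 7.0 GB, 4 bits: 3.5 GB

# Round trip on a few illustrative weights; each value comes back
# within half a quantization step of the original.
sample_weights = [0.5, -1.0, 0.25]
q_values, q_scale = quantize_int8(sample_weights)
print(q_values, dequantize(q_values, q_scale))
```

Halving the bits per weight halves the memory needed, which is what lets larger models fit within the limits of a laptop-class device.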

Increasing Momentum

The race for NPU dominance and relevance has only just begun. Qualcomm’s Snapdragon X Elite is expected to be the NPU TOPS leader when it launches in the middle of this year, but the company will not be alone. AMD has already committed to delivering 40 TOPS of NPU performance in its next-generation Strix Point Ryzen processors due early next year, while at its recent Vision 2024 conference Intel claimed 100 TOPS of platform-level AI performance for the Lunar Lake chips due in Q4 of 2024. (Recall that Qualcomm’s Snapdragon X Elite claims 75 TOPS across the GPU, CPU and NPU.) While it isn’t official, there is an understanding across the PC ecosystem that Microsoft put a requirement on its silicon vendor partners to deliver at least 40 TOPS of NPU AI performance for running Copilot locally.
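The comparison can be worked through with a short sketch that uses only the TOPS figures quoted in this article for currently announced chips. The 40 TOPS threshold is the unofficial requirement described above, not a published specification. The dictionary layout and function names are illustrative assumptions.

```python
# TOPS figures as quoted in this article; None means no figure was quoted.
CHIPS = {
    "Apple M3": {"npu": 18, "system": None},
    "Intel Core Ultra (Meteor Lake)": {"npu": 11, "system": 34},
    "AMD Ryzen 8040 (Hawk Point)": {"npu": 16, "system": 39},
    "Qualcomm Snapdragon X Elite": {"npu": 45, "system": 75},
}

# Unofficial NPU requirement for running Copilot locally, per the article.
REPORTED_NPU_REQUIREMENT_TOPS = 40


def meets_npu_requirement(chips, threshold):
    """Names of chips whose NPU-only TOPS reach the threshold."""
    return [name for name, tops in chips.items() if tops["npu"] >= threshold]


def npu_share_of_system(chips):
    """Fraction of quoted system TOPS that the NPU contributes, where both are known."""
    return {
        name: round(tops["npu"] / tops["system"], 2)
        for name, tops in chips.items()
        if tops["system"]
    }


print(meets_npu_requirement(CHIPS, REPORTED_NPU_REQUIREMENT_TOPS))
# ['Qualcomm Snapdragon X Elite']
print(npu_share_of_system(CHIPS))
```

On these numbers, the NPU supplies well under half of the quoted system TOPS for the Intel and AMD parts and 60% for the Snapdragon X Elite.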

One item of note is that most companies are apparently not scaling their NPU performance based on product tier; rather, NPU performance is the same across all platforms. This means that developers can target a single NPU per vendor, which is good news for the developers because optimizing for an NPU is still quite an undertaking. Thankfully, there are low-level APIs such as DirectML and frameworks including ONNX that will hopefully help reduce the burden on developers so they don’t have to target every type of NPU on their own. That said, I do believe that each chip vendor will also have its own set of APIs and SDKs that can help developers take even more advantage of the performance and power savings of their NPUs.
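One way developers avoid writing separate code for each NPU is to let the runtime choose a hardware back end from an ordered preference list. The sketch below assumes ONNX Runtime as the inference engine, which this article does not name specifically, and uses its execution-provider naming. The preference order and the `model.onnx` file name are hypothetical. The selection helper itself is plain Python, so it runs without any AI runtime installed.

```python
# Execution-provider names as used by ONNX Runtime. The ordering here is an
# assumption for illustration: vendor back ends first, then DirectML, then CPU.
PREFERRED_PROVIDERS = [
    "QNNExecutionProvider",       # Qualcomm back end
    "VitisAIExecutionProvider",   # AMD back end
    "OpenVINOExecutionProvider",  # Intel back end
    "DmlExecutionProvider",       # DirectML
    "CPUExecutionProvider",       # always-available fallback
]


def pick_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep the preferred providers that this machine offers, in preference order."""
    chosen = [name for name in preferred if name in available]
    return chosen or ["CPUExecutionProvider"]


# With onnxruntime installed, usage would look roughly like this:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)  # hypothetical model file

# Example: a machine that offers only DirectML and CPU.
print(pick_providers(["CPUExecutionProvider", "DmlExecutionProvider"]))
# ['DmlExecutionProvider', 'CPUExecutionProvider']
```

The same application code then runs on any vendor's hardware, falling back to the CPU when no accelerator back end is present.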

Wrapping Up

The NPU is quickly becoming the new focus for an industry looking for ways to address the costs and latency that come with cloud-based AI computing. While some companies already have high-performance NPUs, there is a clear and very pressing desire for OEMs to use processors that include NPUs with at least 40 TOPS. There will be an accelerated shift towards on-device AI, which will likely start with hybrid apps and models and in time shift towards mostly on-device computing. This does mean that the NPU will matter less early on for some platforms, but having a less powerful NPU may also translate to not delivering the best possible AI PC experiences.

There are still a lot of unknowns about the complete AI PC vision, especially considering how many different vendors are involved, but I hear that a lot of things will get cleared up at Microsoft’s Build conference in late May. That said, I believe the battle for the AI PC will likely drag on well into 2025 as more chip vendors and OEMs adopt faster and more capable NPUs.

Anshel Sag
VP & Principal Analyst

Anshel Sag is Moor Insights & Strategy’s in-house millennial with over 15 years of experience in the IT industry. Anshel has had extensive experience working with consumers and enterprises while interfacing with both B2B and B2C relationships, gaining empathy and understanding of what users really want. Some of his earliest experience goes back as far as his childhood, when he started PC gaming at the ripe old age of 5, before building his first PC at 11 and learning his first programming languages at 13.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.