Groq’s Record-Breaking Language Processor Hits 100 Tokens Per Second On A Massive AI Model

By Paul Smith-Goodson, Patrick Moorhead - August 25, 2023

Groq’s newly announced language processor, the Groq LPU, has demonstrated that it can run 70-billion-parameter enterprise-scale language models at a record speed of more than 100 tokens per second.

In a YouTube video, Mark Heaps, VP of Brand and Communications for Groq, uses a cell phone to show what 100 tokens per second looks like with the Groq LPU running Meta’s 70-billion-parameter Llama 2 model. At 100 tokens per second, Groq estimates that it has a 10x to 100x speed advantage compared to other systems.
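To put that rate in perspective, the arithmetic can be sketched in a few lines of Python. The 100 tokens per second figure and the 10x to 100x range are Groq's own numbers as reported above; the 500-token response length and the function name are assumptions chosen only for illustration, not measurements.

```python
def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Return the time needed to emit num_tokens at a steady generation rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Rate reported in the article for the Groq LPU running Llama 2 70B.
GROQ_TOKENS_PER_SECOND = 100.0

# Hypothetical response length, assumed purely for illustration.
RESPONSE_TOKENS = 500

# Compare the reported rate with systems 10x and 100x slower,
# the range Groq estimates for its speed advantage.
for slowdown in (1, 10, 100):
    rate = GROQ_TOKENS_PER_SECOND / slowdown
    seconds = generation_time_seconds(RESPONSE_TOKENS, rate)
    print(f"{rate:6.1f} tokens/s -> {seconds:6.1f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions, a response that takes a few seconds at the reported rate would take minutes on a system at the slow end of Groq's estimated range.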

Groq chips are purpose-built to function as dedicated language processors. Large language models such as Llama 2 work by analyzing a sequence of words; then, using those words, they predict the next word in the sequence. How accurately a model predicts the next word is a critical factor in determining which model is best.
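The next-word loop described above can be illustrated with a toy sketch. This is not how Llama 2 or the Groq LPU works internally: it swaps the neural network for simple word-pair counts, and the corpus and function names are made up for the example. What it does show is the sequential dependency, since each new word can only be chosen after the previous one is known.

```python
from collections import Counter, defaultdict


def train_bigram(words):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_tokens):
    """Greedily extend a sequence one word at a time.

    Each step looks only at the last word produced so far, so step N
    cannot begin until step N-1 has finished.
    """
    output = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:
            break  # no known continuation for this word
        next_word = followers.most_common(1)[0][0]
        output.append(next_word)
    return output


# Tiny made-up corpus, for illustration only.
corpus = "the chip runs the model and the model predicts the next word".split()
model = train_bigram(corpus)
print(" ".join(generate(model, "the", 5)))
```

A real LLM replaces the lookup table with a network that scores every word in its vocabulary, but the outer loop has the same one-token-at-a-time shape, which is why tokens per second is the natural speed metric.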

Groq chips are optimized for the sequential nature of natural language and other sequential data like DNA, music and code. Being so specific in their design leads to much better performance on language tasks than, for example, GPUs that are optimized for parallel graphics processing.

Groq has proven it is no stranger to large language models. It has experimented using its chips on various LLMs including Llama 1 and Vicuna, a Llama-based chat model from LMSYS. Its engineers are now running Llama 2 with model sizes from 7 billion to 70 billion parameters.

Unlike traditional compilers, Groq’s does not rely on kernels or manual intervention. Through a software-first co-design approach for the compiler and hardware, Groq built its compiler to map models directly to the underlying architecture automatically. The automated compilation process allows the compiler to optimize model execution on the hardware without requiring manual kernel development or tuning.

The compiler also makes it easy to add resources and scale up. So far, Groq has compiled more than 500 AI models for experimental purposes by using the automated process just described.

When Groq ports a customer’s workload from GPUs to the Groq LPU, its first step is to remove non-portable vendor-specific kernels targeted for GPUs, then any manual parallelism or memory semantics. The code that remains is much simpler and more elegant when all the non-essentials are stripped away.

Groq gives an excellent example of this efficiency on its website in the description of its first go-round with Llama 1. What would have normally required months of work from dozens of engineers took only a week for a small team of 10 people to get Llama up and running on a GroqNode server. Even though Llama was not explicitly built for Groq’s architecture, the compiler could automatically uncover parallelism and optimize data layouts for the model. This example demonstrates how the compiler can map models to Groq’s hardware even without hardware-aware model development.

Groq also has an easy-to-use software suite and a low-latency purpose-built AI hardware architecture that synchronously scales to obtain more value from trained models. As the company continues to expand the scale of systems that the compiler can support, training the models will likely also become easier using the Groq approach.

Wrap up

In the future, Groq’s ultra-low latency and ultra-fast language processor could have a major impact on how LLMs are run and used. Groq’s automatic capability to map models to hardware without manual intervention is not only a technical advantage, but also a way to increase ROI by reducing the time needed to move models through development and into operation.

Beyond that, Groq’s focus on sequential language processing provides better performance than general-purpose AI chips. The results speak for themselves: when dealing with massive LLMs, speed is a major factor for performance—and nothing yet can compare to 100 tokens per second.

Paul Smith-Goodson is the Moor Insights & Strategy Vice President and Principal Analyst for quantum computing and artificial intelligence. His early interest in quantum began while he was working on a joint AT&T and Bell Labs project, when, during 360 overviews of Murray Hill advanced projects, Peter Shor provided an overview of his ground-breaking research in quantum error correction.
Patrick founded the firm based on his real-world technology experiences and his understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq (now HP), and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.