Groq Milestone: Meta Llama 2 70B at More Than 100 Tokens per Second

By Patrick Moorhead - August 14, 2023

The Six Five team discusses Groq’s milestone of running Llama-2 70B at more than 100 tokens per second.

If you are interested in watching the full episode you can check it out here.

Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we ask that you do not treat us as such.


Daniel Newman: Groq made a pretty big announcement about a hundred tokens per user per second. Pat, what does that mean?

Patrick Moorhead: Yeah, so good question. So first of all, Groq is a company that was founded by the folks that did the Google TPU. So smart cookies. And in my vernacular, they’re creating an ASIC to tackle first inference and then training. As we talked about many times on this show, an ASIC is more efficient than a GPU at doing certain things. And then the challenge is putting a programmatic layer on top of the ASIC to make it programmable. And then there’s Llama 2. So Llama 2 is an open source model that came out of Meta that everybody but trillion-dollar companies can take advantage of for free. And essentially it’s all the rage, right? Open models, right? Because we don’t want one company to have their model.

And what do I mean by closed models, right? So OpenAI and ChatGPT is a closed model. Bard is a closed system as well. So now, in the enterprise world at least, everybody’s saying, “Hey, it’s about a combination of proprietary and open models distributed through somebody like Hugging Face.” And then there’s the 70-billion-parameter model, where, according to them, and I can’t find any data that says otherwise, it’s the fastest performance on Llama 2 70 billion parameters at over a hundred tokens per second per user. And the reason tokens are important is that tokens determine the amount of data that can go into the prompt or into the grounding.
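As a back-of-the-envelope illustration, per-user throughput translates directly into how long a user waits for a response. The 100 tokens/sec/user figure is Groq's claim; the response length and the slower comparison rate below are hypothetical:

```python
# Rough latency estimate: time for a user to receive a full generated
# response at a steady per-user decode throughput.

def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a given per-user throughput."""
    return num_tokens / tokens_per_second

# A ~500-token answer at the claimed 100 tokens/sec/user:
fast = generation_time_seconds(500, 100)   # 5.0 seconds
# The same answer at a hypothetical 25 tokens/sec/user:
slow = generation_time_seconds(500, 25)    # 20.0 seconds

print(f"100 tps: {fast:.1f}s, 25 tps: {slow:.1f}s")
```

The same arithmetic applies to prompt and grounding budgets: the more tokens per second the system sustains, the larger the context it can process within an acceptable wait.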

So this has a lot to do with the pricing as well. So the cool part is that the cost is just extraordinarily lower to do this. And Dan, you hit this on the NVIDIA piece. Groq says that on a workload like this you get 3X lower total cost of ownership from the inception, which is really great value, right? For comparison, an 80-node NVIDIA A100 SuperPOD is $27 million, an H100 SuperPOD is $39 million, and an 80-node Groq system is $18 million. So again, competition is good. Dan, that’s a theme on our show. We say it every day. Competition matters. And one final thing: the current silicon is 14 nanometer. Imagine when they get to four or five nanometer; performance and power should be amazing.
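Using the system prices quoted above (all figures are Groq's claims as relayed in the discussion), the acquisition-price arithmetic looks like this. Note that the 3X figure is a total-cost-of-ownership claim, which typically covers more than the purchase price, so it need not match these hardware-price ratios:

```python
# Price comparison for the 80-node systems quoted in the discussion
# (figures in millions of USD, as claimed by Groq).

systems_musd = {
    "NVIDIA A100 SuperPOD (80 nodes)": 27,
    "NVIDIA H100 SuperPOD (80 nodes)": 39,
    "Groq (80 nodes)": 18,
}

groq_price = systems_musd["Groq (80 nodes)"]
for name, price in systems_musd.items():
    # Ratio of each system's quoted price to the Groq system's price.
    print(f"{name}: ${price}M ({price / groq_price:.2f}x Groq)")
```

On these quoted prices alone, the A100 SuperPOD comes in at 1.5x and the H100 SuperPOD at roughly 2.2x the Groq system; the rest of the claimed 3X advantage would have to come from operating costs.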

Daniel Newman: Absolutely. So I’m going to keep us running. I’ll just say that in the press release I did comment on availability, Pat. I mean, you can actually buy these things. I just want to point that out. These are actually available, and I’d be surprised if people didn’t want to capitalize on that.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences with the understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.