The Six Five team discusses Groq’s milestone of running Llama-2 70B at more than 100 tokens per second.
If you are interested in watching the full episode, you can check it out here.
Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we ask that you do not treat us as such.
Daniel Newman: Groq made a pretty big announcement about a hundred tokens per user per second. Pat, what does that mean?
Patrick Moorhead: Yeah, so good question. So first of all, Groq is a company that was founded by the folks that did the Google TPU. So smart cookies. And in my vernacular, they’re creating an ASIC to tackle first inference and then training. As we talked about many times on this show, an ASIC is more efficient than a GPU at doing certain things. And then the challenge is putting a programmatic layer on top of the ASIC to make it programmable. And then there’s Llama 2. So Llama 2 is an open source model that came out of Meta that everybody but trillion-dollar companies can take advantage of for free. And essentially it’s all the rage, right? Open models, right? Because we don’t want one company to have their model.
And what do I mean by closed models, right? So OpenAI and ChatGPT is a closed model. Bard is a closed system as well. So now, you have in the enterprise world at least everybody’s saying, “Hey, it’s about a combination of proprietary and open models distributed through somebody like a Hugging Face.” And then there’s the 70 billion parameter model, where, according to them, and I can’t find any data that says otherwise, it’s the fastest performance on Llama 2 70 billion parameter at over a hundred tokens per second per user. And the reason tokens are important is that tokens determine the amount of data that can go into the prompt or that can go into the grounding.
So this has a lot to do with the pricing as well. So the cool part is that the cost is just extraordinarily lower to do this. And Dan, you hit this on the NVIDIA piece. Groq says that on a workload like this you get 3X lower total cost of ownership from the inception, which is really great value, right? Those are comparisons where an 80-node NVIDIA A100 SuperPOD is $27 million, an H100 SuperPOD is $39 million, and a Groq 80-node system is $18 million. So again, competition is good. Dan, that’s a theme on our show. We say it every day. Competition matters. And one final thing, current silicon is 14 nanometer. Imagine when they get to four nanometer or five nanometer, performance and power should be amazing.
Daniel Newman: Absolutely. So I’m going to keep running. I’ll just say in the press release I did comment on availability, Pat. I mean, you can actually buy these things. I just want to point that out. These are actually available, and I’d be surprised if people wouldn’t want to capitalize on that.