Databricks' New DBRX Model - Enterprise Worthy?

By Patrick Moorhead - April 2, 2024

The Six Five team discusses Databricks' new DBRX model - is it enterprise worthy?

If you are interested in watching the full episode you can check it out here.

Disclaimer: The Six Five Webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors and we ask that you do not treat us as such.

Transcript:

Patrick Moorhead: Databricks brought out a new model that, statistically from the benchmarks that are out there, looks pretty good, but is it enterprise worthy?

Daniel Newman: I think there are a couple of arguments you could make. Having an enterprise company that focuses on enterprise data building an enterprise LLM for enterprises could be pretty compelling. You have a bunch of companies building large models - the GPTs, the Claudes, Grok, and these different models - that have largely been trained on openly available internet data.

This one’s interesting though, Pat, so I’ll start off talking a little bit about its numbers, which by the way are very compelling. The numbers are very good, and they did some comparisons against the LLaMA model, Mixtral, and Grok-1. They didn’t talk so much about some of the more proprietary models like Google Gemini or OpenAI’s; they focused on what most people consider the openly available models. It’s got favorable results in three key categories: programming (HumanEval), math, and language understanding. And it actually outperformed in all of those categories. Now, again, I don’t see comparisons to some of the others, but from where it starts, that’s pretty compelling. The announcement does say that it surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro. So they noted that, they just didn’t show it in any sort of comparative data because, well, those models aren’t fully open source, would be my guess.

Pat, interesting. Very opaque on training data here. Very opaque on where the data came from. If you read the long-form post about what it is and how it was built, I think it said it was trained on 12 trillion tokens of text and code data. Okay, where’d it come from? I believe it’s data that Databricks has access to, Pat, but I do have to ask: okay, 12 trillion tokens, carefully curated data, a maximum context of 32,000 tokens. Is this basically being trained on Databricks enterprise customer data in some anonymized fashion? Where else did the data come from? I can’t fully pull that together. It said, “It used the full suite of Databricks tools, including Apache Spark and Databricks notebooks for data processing, Unity Catalog for data management, MLflow for experiment tracking, and curriculum learning for pre-training.” Interesting.

Remember the CTO of OpenAI being asked, what did you train this on? I’m just kind of curious. Are you training this really, really good enterprise data model using a lot of enterprises’ data? And is there some sort of use agreement you accept when you use Databricks that allows them to do this, so long as they do what? Anonymize it? What are they doing with the data to make this work? I imagine they’re also using some of the same data that other open source models are using, Pat.

I do think there’s an interesting opportunity as we see this pivot from open models built mostly on open internet training data toward more proprietary models that are more business use case specific, industry specific. I do think that’s where it’s going. I actually don’t think it’s just these mega large language models; I think it’s going to be big models coupled with very specific domain knowledge and data that create useful insights. Pat, I mean, look at the investments that have been made - Databricks is probably one of the most exciting IPOs still to come. This is very interesting. I do think with Databricks plus Mosaic, they have the tools and the capabilities. How does this get adopted? How does this get used? Does this build popularity? I see it living more inside the Databricks user community than being something that really drags them into this kind of broader OpenAI discussion. But I’ll kick this over to you to see what you think.

Patrick Moorhead: Yeah, so I’m going to do a Jon Fortt, “On the Other Hand” style here and kind of debate this with myself. So the notion of having a data management company and a model together is compelling, okay? There’s some efficiency there for sure, and that is what enterprises are looking for. But on the other hand, if you don’t know where the data comes from, and you’re an enterprise that starts using this model, you are likely going to get sued if the underlying training data was not licensed. And you had mentioned the Sora interview with the CTO who didn’t know what the training data was, which I find an absolutely impossible scenario.

Daniel Newman: I don’t think he didn’t know is the-

Patrick Moorhead: Right, probably knew, but didn’t want to say. And anyway, gosh, the internet seized on that one. The other thing is that 75% of enterprise data is on-prem, and Databricks is a data management company in the public cloud. So I guess their TAM is the 25% of enterprise data that’s up there. And, oh, by the way, that’s where they compete with Cloudera, which has extensions to AWS, GCP, and Azure, and where they also compete with Snowflake. In fact, I love to combine Snowflake and Databricks, and I’ve heard them referred to as SnowBricks out there. Here’s kind of a middle stance: let’s say a company like IBM integrates DBRX, and then under NDA, Databricks provides IBM the sources of data and how that data is pruned. Then IBM could make the decision to go in and indemnify people from lawsuits. So IBM brought Mistral in that doesn’t have a… I don’t know if they legally indemnify them, but that’s some research that I’m actually doing now; I’ve got an inquiry into IBM. If nobody indemnifies the enterprise, it’s a hard pass. They’re not going to use this, folks.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experience and an understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.