Enterprise AI Needs a Data Foundation You Can Trust – The Six Five, Insider Edition

By Patrick Moorhead - July 24, 2023

On this episode of The Six Five – Insider Edition, hosts Daniel Newman and Patrick Moorhead welcome Cloudera CTO Ram Venkatesh for a discussion on the importance of data in enabling enterprise AI and the role of trust in ensuring AI success amid the rapid evolution of the field. The discussion addresses the business benefits of Cloudera’s Open Data Lakehouse and Data Fabric, which provide secure, governed solutions for all types of data across multiple clouds, enabling customers to trust their data and AI.


Disclaimer: The Six Five webcast is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.


Patrick Moorhead: Hi, this is Pat Moorhead and we are here with another Six Five Insider talking about two of my favorite topics, and that’s data and AI. Daniel, how are you doing my friend?

Daniel Newman: It’s good to be back. And, yes, nothing right now is hotter than AI. And, of course, without data what is AI going to actually artificially “intelligize”?

Patrick Moorhead: Wow, Dan, that’s great. Gosh, I’m going to get that on this recording and just play that back to me a couple times. But hey, let’s stop messing around on the intro here and get right in. Introduce Ram, how are you my friend?

Ram Venkatesh: Doing well, thank you. Thanks for having me on the show.

Patrick Moorhead: Yeah. Thanks for coming back on the Six Five. We really appreciate that.

Ram Venkatesh: Absolutely. My pleasure.

Daniel Newman: Yeah, it’s not the first time here for you. You’ve been on our show before and it’s always good to have returning guests, we call them alumni here on the Six Five. And Pat, as we keep growing and growing and going, we have more and more alumni that joined the show.

Go ahead, though, and give us a little bit of the background. We’re at a very interesting inflection point, Ram, with all the hype around generative AI; every company is being put under a microscope, a little pressure: what’s your AI story? Cloudera’s been around a long time. Collectively, Pat and I have been watching the industry and the company for a decade plus-

Ram Venkatesh: A long time in this business.

Daniel Newman: The data management industry is rapidly changing, but just for those that maybe haven’t been on one of our shows with you or any of the other Cloudera leaders, give us a little bit of the kind of Cloudera background and your background.

Ram Venkatesh: Happy to. As you said, Cloudera is really one of the pioneers in the big data space, helping the open source ecosystem for processing and managing vast amounts of data come together and actually function. The customer adoption of our technologies has been phenomenal over the years; we are in over 90 countries at this point. We offer a hybrid, open, portable, secure platform for storing all of the data inside a company and processing it using pretty much every popular engine of choice, be it Spark, be it Hive, be it Impala, and so on and so forth.

So that sounds like a really broad statement. In the case of Cloudera, it is. The strength of Cloudera’s offering has always been the breadth of our product portfolio, and that really helps us address the needs of some of the largest customers on the planet. Just to give you a sense of the scale, because everybody talks about scale these days: at Cloudera, we manage somewhere north of 25 exabytes of data across our customer base. So that’s the history, the heritage of Cloudera.

I’ve been in the data space for a long time. I spent most of the last decade at Cloudera, and the two decades before that in databases; I worked on SQL Server. Whatever database technology you’re either happy or unhappy about, I probably had something to do with it at some point in my career. And as you say, even though this space has a deep legacy and is rich in the amount of impact it has globally, AI and generative AI have shaken up the entire ecosystem and the conversation to a degree that none of us could have imagined even six months ago.

Patrick Moorhead: Dan and I both covered one of your recent announcements, CDP Open Data Lakehouse. And it’s funny, it took me a while for the gears to lock in on how this relates to customers deploying LLMs and taking advantage of generative AI. But can you talk a little bit more, from an enterprise point of view, about how CDP Open Data Lakehouse helps them unlock the value of their data and do more with it?

Ram Venkatesh: Absolutely. Look, from the beginning, the premise of big data has not been let’s go build a SQL engine. We kind of got there incidentally along the way because that’s what customers are familiar with and that’s what they want. But the actual customer need is for them to be able to ask questions and gain insights over everything that’s happening in their enterprise. That’s the uber thing that they have in mind when they think of putting a data platform together, it’s to answer questions that they may not even have thought about when they actually collected the data in the first place.

So with CDP’s Open Data Lakehouse edition, what we are doing here is essentially enabling customers to perform structured analytics over all of the data in the enterprise. If you think of the LLM model, LLMs are amazing at what they do; they’re able to hold conversations in literally any natural language at this point. But a lot of a company’s valuable data is secure and not on the public internet, thankfully. So now if you want to go build a use case, it could be as simple as “show me my bill.” The “show me” part is a conversation, and you can have that with an LLM. Your bill, the fact that you are asking for it, that’s customer context. The bill is customer data. So even in a simple one-liner like this, you need the ability to bring together user context, enterprise data, and a large language model.

So that’s what we’ve enabled now with CDP, with the Open Data Lakehouse architecture. And I’ll talk a little bit more about one of our recent LLM applied machine learning prototypes and how somebody could actually do this in practice using the platform.
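
The “show me my bill” flow Ram describes boils down to three pieces: user context (who is asking), enterprise data (the bill, fetched from a governed platform), and an LLM that handles the conversational part. A minimal sketch of that pattern might look like the following; this is illustrative only, not a Cloudera API, and the data store, `fetch_bill`, and `build_prompt` names are hypothetical stand-ins, with the actual LLM call omitted.

```python
# Illustrative sketch of combining user context + enterprise data + an LLM.
# None of these names are Cloudera APIs; they are assumptions for this example.

# Stand-in for governed, secure enterprise data (e.g., in a lakehouse table).
BILLS = {"cust-42": {"amount": 129.95, "due": "2023-08-01"}}

def fetch_bill(customer_id: str) -> dict:
    """Retrieve the customer's bill from the enterprise data platform."""
    return BILLS[customer_id]

def build_prompt(customer_id: str, question: str) -> str:
    """Combine user context (who is asking) with enterprise data (the bill)
    into a grounded prompt that an LLM could answer conversationally."""
    bill = fetch_bill(customer_id)
    return (
        f"Customer {customer_id} asks: {question}\n"
        f"Their bill: ${bill['amount']:.2f}, due {bill['due']}.\n"
        "Answer using only the data above."
    )

# The resulting prompt would then be sent to a large language model.
prompt = build_prompt("cust-42", "Show me my bill")
print(prompt)
```

The point of the sketch is that the model never needs direct access to the data platform; the application retrieves only the governed records the user is entitled to and places them in the prompt.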

Daniel Newman: Yeah, I think that’s really interesting. Pat and I constantly talk about the integration of foundation models with companies’ proprietary data. It didn’t take very long for ChatGPT, Bard, PaLM, Hugging Face, Luminous, we can go down the line, I just wanted you to know that I know them all.

Ram Venkatesh: That’s an impressive vocabulary readout there.

Daniel Newman: Thank you.

Ram Venkatesh: Six months ago, I should have asked you this and recorded your answer.

Daniel Newman: Hey, we’ve been following this a minute, five and a half months now. But in all seriousness, bringing these things together is where value is going to be extracted. It’s not just about using these open source models, and it’s not just about using the open internet data. It’s really about your data, having it well organized, structured and unstructured, at scale. Philosophically, we’re talking about this. But how quickly are you seeing customers actually get there, and can you give any examples?

Ram Venkatesh: Oh, absolutely. I think that is the most fascinating part of where we are today as an industry, as an ecosystem. ChatGPT itself is probably the fastest application ever to get to 100 million users; apparently they did that in less than two months. The previous record was TikTok, which took eight to nine months. So this tells you how much consumer interest there is in that conversational interface.

And when we think about our customers, we serve customers across a wide variety of industries. It could be telco, hospitality, retail, manufacturing, the public sector, and so on and so forth. But they all have something in common: they are serving a very large set of consumers on the front end using our data platforms. So if you think of this conversational model, its applicability to use cases across our customer base is so horizontal. In fact, I cannot finish a customer conversation these days without generative AI showing up in it. For them it’s not a theoretical question, it’s not a future thing they’re thinking about. They’re saying, we have this use case today. We are going to go build it out. How can you help us get there faster? How can we bring our data that’s sitting in your platforms to be a part of this scenario?

For example, OCBC Bank recently did a meetup in Singapore where they talked about all of the things they’re accomplishing in and around machine learning with the Cloudera platform. One of the things that’s unique about our architecture has always been the ability to bring lots of different engines into our platform, including machine learning and the ability to run models. OCBC Bank has been using Cloudera Machine Learning for a few years now, but just within the last three months, they’ve rolled out the first of their use cases with LLMs. They call it Wingman, an internal co-pilot to help increase productivity for their users. And they’re talking about how this would not be possible without the large language models, of course, but also without the data and the CML platform that enables them to do this.

So that’s an example of how customers want to both experiment and deploy CML and LLM models and applications that use them in production today.

Patrick Moorhead: So first off, thanks for sharing a specific example with a specific customer, Dan and I love theory and we love PowerPoint and products, but talking about customers researching or operationalizing all of these goodies is a real value add.

Dan rattled off a lot of folks in the industry at the beginning, and Dan, I do think you’re smart because you rattled them all off-

Daniel Newman: I have a list on the screen, Pat.

Patrick Moorhead: Did you? Okay.

Ram Venkatesh: Well done.

Patrick Moorhead: I think it is important that this does take a village, and it does take partners in the ecosystem out there. I’m curious, who are some of the strategic partners in your AI ecosystem? You’ve got inputs, you have outputs, you have hardware that transforms this data into really cool stuff, there’s model providers. But hey, I’ll let you answer the question.

Ram Venkatesh: Sure. No, I think you’re spot on that this is an ecosystem play. While it’s very compelling for us as vendors to say we can do it all, soup to nuts, the reality is that the last six months have demonstrated that innovation, when it happens from multiple different perspectives, is so much more powerful when you put it together. And when you throw open source into the mix, you’ll see it over and over: one research unit at one company says, this is how we know how to fine-tune a model, and two weeks later there’s a different technique that improves upon that in some way and continues to enhance it. So I think the flexibility to consume all the parts of what you need from different providers is a huge strength.

So when you think of this ecosystem, for us, yes, the foundational element is hardware. So our very strong partnership with Nvidia, for example, is really important here. We’ve been working with them in the context of Spark for a long time on the Spark RAPIDS initiative, which is Nvidia’s hardware acceleration for Spark. And since we are probably among the largest places to run Spark, period, our customers are able to take advantage of that hardware acceleration, that integration with Nvidia. Now that is just logically extending into the GPU cluster management facilities and the libraries that they have. Nvidia is a really important… It’s the underpinning of this entire ecosystem, both when it comes to training the large language models as well as inferencing them in real time. For both of these use cases, GPU support is such a critical part. So they are a very important partner, a foundational partner for us on this journey.

Similarly, when you look at the big data platform providers, again, this is not about just doing something with Cloudera’s data. For example, look at our partnership with IBM. IBM’s watsonx family of announcements last month is very relevant to this conversation. Customers want to be able to use Watson Studio and its tools. They want to be able to use data that’s sitting in Db2 or in Netezza. They want to use all of that in conjunction with what they have in Cloudera and what they’re doing in CML, so they can put it all together cohesively as a solution. So IBM is another big strategic partner for us in this initiative.

As is AWS and the other hyperscalers, because given the popularity and the speed with which some of this evolution is happening, for many of our customers experimenting in the cloud, especially at small scale, is extremely attractive until they know whether a particular use case is going to pan out or not. So this is where the fact that we run on-premise and public cloud is one thing, but the fact that we also have these strong partnerships in the ecosystem that let us do this, that’s very enabling for the customer.

But of course, to put a bow on all of this, there’s the open source aspect, and that’s where the diversity of tool chains comes in. You have hardware and you have platforms and you have data, but then you’re actually going to assemble and build all of these applications using the modern tool chain. It would be LangChain, it would be LlamaIndex, it could be models from Hugging Face. See, Dan is starting to duck for cover now. It’s all of these new toolkits and frameworks that people are using to actually build out these applications; that’s the other part of the ecosystem. So it’s an ecosystem play at almost every level if you think about it. That’s all goodness, and the speed at which innovation is happening is actually because you have this multi-vendor approach, as opposed to any one approach or any one vendor dominating the landscape.

Daniel Newman: So Cloudera wants in on this goodness, and there’s so much goodness that is AI. Obviously, you have petabytes and petabytes of data under management; I think we’ve put the number out there, significantly larger than the big cloud data warehouses in terms of data under management. It sort of implies that this data is going to have to be acted upon by all these AI tools in order for companies to get that value; we’ve kind of hit on that. But customers right now are in the moment where they have to make a decision, they have to choose a platform. It can’t be everything, and yes, like hybrid cloud, there will be a little bit of this and a little bit of that. But ultimately, going all in on the Open Data Lakehouse approach is going to be important so they can streamline and maximize the value of the data. So in your case, why are your customers picking Cloudera? Why are they picking CDP Open Data Lakehouse versus the other options in the marketplace?

Ram Venkatesh: Yeah, I think a lot of it has to do with scale, flexibility, and openness. It’s really that open part of Open Data Lakehouse; that’s one of the things customers see a lot of value in. We’ve seen in the data space that every 18 to 24 months something significantly new happens, where you need to do something with your data that you did not anticipate. This is where, if you think of generative AI now and how it applies to Cloudera’s data estates: we have always stored a large amount of data, but there’s a sort of data pyramid that indirectly ends up happening, where all the raw data is at the base of the pyramid, you do a whole bunch of ETL, and you end up with a data set, a tiny 1% of that, which is what you do real-time SQL analytics on top of. That’s very powerful, don’t get me wrong. This is what we do at scale in production. And it’s not even one pyramid; for the same customer, every use case that needs another one of these data sets is going to create another pyramid. So it’s thousands of pipelines, thousands of jobs. That’s what their ecosystem looks like.

When they look at generative AI, instead of this way of processing data, imagine if you could process the data once and then you could answer questions over all of the data, and not just in SQL but in English. And that’s really the power that customers are expecting we can unlock with the platform. So this demonstrates to them that by going with this open, flexible architecture, they’re able to do things with their data. They’re able to run large language models against their data, which they could not, if they had actually gone for a proprietary approach. They would have to wait for that vendor to support LLMs in some particular way before they could take advantage of it. Whereas our open platform gives them this flexibility. That’s one.

The second is that everybody has a notion of hybrid in their mind. Our customers’ notion of hybrid is that they want a single pane of glass where they can see what’s going on, and then they want optionality: they want to be able to run the workload wherever it makes the most sense. I talked about customers adopting large language models. One of them, and this was from a conversation this morning, is spending $50,000 a day, a day, training a model, and today they’re doing this with the hyperscalers in the public cloud.

So their goal is to say: we want to identify which of these workloads are going to be activities that we do all the time, and instead of renting compute for that, we want to be able to take that use case and run it on premise with GPUs in an efficient way. Can we do that? And we are able to say, hell yeah. That’s the whole premise of this consistent hybrid architecture: it helps customers make these decisions in a manner that’s appropriate for their business. Whether it’s cost, whether it’s governance, they have the optionality to run where it makes the most sense for them. I think these are the two main reasons I see customers picking Cloudera as their data platform provider.

Patrick Moorhead: So, Ram, love data, love AI, and by the way, I also love the hybrid multi-cloud. You talked a little bit about the context of the Open Data Lakehouse, but I really want to drill into this. Hybrid cloud, I called this a decade ago and I still believe it, and I call it the hybrid multi-cloud, you don’t have to, that’s okay, is the de facto model. I mean, I’ve never talked to an enterprise in the last three years that didn’t have more than one IaaS provider. I’m curious, though, how does Cloudera define hybrid?

Ram Venkatesh: That’s a good question, and I should confess that we have also been on this hybrid journey for a while. Over the last year there’s a definition of hybrid that our customers have been expressing back to us that I find captures the ethos of what is valuable to them. So if it’s okay, I’m going to share that rather than Cloudera’s viewpoint on hybrid. Our customers’ viewpoint on hybrid is that they want three things from their experience. They want consistency, like a single pane of glass over their entire estate. One customer told me this is how they think about Datadog for observability, for example. They don’t think of Datadog as a hybrid platform; they say, this is a single pane of glass, I can now have observability over my entire estate. So that consistency of experience across the entire estate is something customers are looking for.

The second thing they’re looking for when it comes to hybrid is portability. This is where they want to be able to think of deployment as a secondary item on their checklist. They’re thinking about business value, they’re thinking about their use case, they’re thinking about the data they need to bring together to implement the use case. Those are the things they want to worry about. What they don’t want to worry about from day zero is: is this use case going to run on premise? How much hardware do I need to buy? Or, is this use case going to run on GCP, or use Spot Instances on AWS? No, these are deployment-time choices. They don’t want to make them at the start of the project; they want to make them at deployment time.

The third piece that they want is optionality. They want to be able to change their mind, for whatever reason. They want to say, Ram, I used to run in the clouds and that was great, but now the bill is too high and I want to actually run this use case on premise. Or, I started this use case on premise and now I want to share this data with a provider, they’re already in AWS. How do I get this use case over there so that I don’t need to let that provider into my data center in some complicated networking way?

So these are the three primary considerations customers have: consistency, portability, and optionality. I think that’s a great definition of hybrid, because if we can meet the customer where they are on these three fronts, then it turns out to be hybrid and multi-cloud, in your vernacular. These are enormous benefits that help them de-risk their entire investment in the portfolio and how they think about a data strategy for the company.

Daniel Newman: Yeah, I’m really glad you hit on that. You kind of covered a few things at once for me. And frankly I always want Pat to do the hybrid cloud conversation. His favorite victory lap goes back to his early call, an accurate call, to your credit, my friend, about hybrid. Remember that time Ram when everyone said it was all going to go to the public cloud?

Ram Venkatesh: Exactly. What a difference a couple years makes.

Daniel Newman: We’re like almost two decades later and we’re still like 10%. So this is why even me, and I’m a total AI sensationalist, right Pat?

Patrick Moorhead: I’m waiting for your victory lap, Dan.

Daniel Newman: I have to be grounded in the fact that even the fastest proliferating technology still takes time, largely because regulations, compliance and enterprise architectures are really complex and sometimes the tech just doesn’t make sense. It doesn’t make sense to do it even if it’s the new and the cool and the hip thing.

So clearly Cloudera gets the hybrid play, and has for a long time; we’re well documented on this. And it seems that AI is really a story that’s going to continue to evolve quickly for Cloudera. But data management is so much more. While we want to be on the hype cycle with you, talk about some of the other data-management-focused innovations you’re working on that are going to keep Cloudera an innovator, a continual disruptor, and highly relevant in a competitive space.

Ram Venkatesh: How much time do you have?

Daniel Newman: Do it in 60… No, do what you got to do. I mean we’ll tell you if-

Ram Venkatesh: I’m not kidding because this is literally-

Patrick Moorhead: We’ll cut you off if it goes too long.

Ram Venkatesh: Fair enough. This is obviously something that’s near and dear to our hearts at Cloudera: making sure that we are innovating in all the areas of the data platform, the data ecosystem, in the ways that customers want us to. That is our entire reason to exist. So if you think of some of the things that are top of mind for us right now, Apache Ozone is something we are super excited by. Look, the D in data is the ability to store and manage vast amounts of it; if we can’t do that, the rest of this is kind of moot. We have a pretty good file system in HDFS. Our design points for it 10 years ago were about 100 million objects and between 10 and 25 petabytes, which we thought was pretty good back in the day. With Ozone, we have now pushed that envelope: by design, it’s intended to support more than 10 billion objects, so 100 times the scale of what you could do with HDFS in terms of the number of objects you can store.

It’s also designed to take advantage of all the dense storage innovations that have come along. Today you can have a single machine with 500 terabytes, and by the end of the year we expect a single machine with a petabyte to be in that commodity price range, so you can actually manage close to an exabyte of data with an Ozone instance. Ozone is really going to help our customers capture and store even more kinds of data. Especially when you think of the generative AI world, unstructured data is going to be even more important than it was. You’re not talking about tables with large partitions and small numbers of files; we are talking about massive numbers of files, which might be email messages, text documents, Slack messages, things of that nature. So we think the cardinality of the data we are going to capture is going to go up significantly, which is why Ozone is going to be such a key part of the storage strategy for customers to store all this data.

On the processing side, we are doing a whole bunch with our data services initiative. This is essentially about making the platform more consumable, so our customers don’t have to stand up and manage a multi-thousand-node cluster to run individual workloads. These are containerized, more lightweight workloads that you can grow and shrink based on the needs of your application. These are our private cloud data services initiatives. Just as Ozone does for storage, this will make it 10X easier for our customers to run compute against all the data they have.

The last thing I want to touch upon here, before Pat cuts me off, is data observability. This is one of those core capabilities where, as we roll out this hybrid architecture, we want to do it in a way that keeps TCO in mind all the time. One of the key ways we can simplify our customers’ support experience, both when they contact Cloudera and when they’re doing self-service troubleshooting and diagnosis, is to make sure that all of this observability data, about what’s actually happening in their systems, is collected, stored, processed, and managed appropriately. That’s Cloudera Observability. We just went GA with it a month ago, and we are very excited about how it can help our customers dramatically reduce the cost of operations on the platform. So there’s a couple of things to touch upon.

Patrick Moorhead: No, it’s really exciting. And first of all, it’s amazing, these open source projects have the coolest names, Ozone. No, but seriously, the observability part of this I’m super excited about. We wrote about that and it’s pretty awesome. It’s just amazing, the importance observability has gained with the fractalization of not only infrastructure but also the way applications are developed, with APIs as opposed to monolithic infrastructure and monolithic applications. Super exciting.

So we have talked about a lot of different things, and I think it would help our audience, Ram, and I know you’re all over the place doing a ton of things, but can you crystallize the last 30 minutes? What are the key takeaways that you think are the most important?

Ram Venkatesh: So, three things. One is the customer’s ability to store and process all kinds of data; that is now even more important in the context of generative AI. This is fundamental. One of the things we believe is that for generative AI to work, customers have to collect, process, and manage all of the data in the enterprise.

The second is that for AI to unlock value in enterprises, it’s going to be really critical that they have a good handle not only on the AI technology, of course, but also on their data management, and that they bring the two together in a harmonious, secure, compliant way. We didn’t talk about all of the “-ilities” you would need to actually roll out production use cases for AI, but these are the foundational tenets.

And the third is to keep an eye on flexibility, openness, and portability. To net this all out, a hybrid strategy for data management continues to be really important and top of mind. So those are the three things I would want people to take away from the conversation.

Daniel Newman: Well, Ram, I want to say thank you for spending some time. Clearly you, as the technology leader at Cloudera, and the company are taking a strong position that you’re in a big moment, almost a seminal moment for the company, and it’s really important that you get this right. The importance of data management really hasn’t changed at all; the onset of generative AI and its tools has just given it new visibility and democratization, and enterprises and boards are more rapidly looking to implement these technologies to change their companies. But having the data, unstructured and structured, in a management system that’s able to support hybrid architectures, with a fabric and transport that can move data quickly from cloud to edge to prem and everywhere else, is paramount. This has been really enlightening, and I want to thank you so much for spending the time with Patrick and me here on the Six Five. This is the question so many companies are trying to answer right now.

Ram Venkatesh: Absolutely. It’s my pleasure. Thank you for having me.

Daniel Newman: All right everyone, there you have it. We are here on the Six Five. It’s an Insider edition. We just spoke to the CTO of Cloudera, talked data management, AI, hybrid cloud, Pat, we talked hybrid cloud, AI, data management, just going up and down one more time. Hit that subscribe button if you like what you heard. We appreciate you joining our community. Find us on Twitter, LinkedIn, anywhere else that social media is. But for now we got to say goodbye. See you later.

Patrick Moorhead

Patrick founded the firm based on his real-world technology experiences and an understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high-tech companies (NCR, AT&T, Compaq, now HP, and AMD) leading strategy, product management, product marketing, and corporate marketing, including three industry board appointments.