As an analyst, I’ve been watching Cloudera for several years and have opined at length. Cloudera solves data management challenges across public and private clouds, enabling customers to manage and unlock value from their data. The company has been a Big Data leader for over a decade, with 25 exabytes of data under management, and its platform is used by nine of the ten largest global companies in any given industry.
Cloudera has evolved over the years, with Cloudera 1.0 focused on building an open-source enterprise data platform, Cloudera 2.0 bringing Hortonworks and Cloudera together to accelerate the path to hybrid cloud, and Cloudera 3.0 creating the first true hybrid, multi-cloud data platform. In this article, I will explain why the Cloudera Data Platform (CDP) is well positioned for the new world of enterprise AI.
What could go wrong?
Generative AI uses algorithms called large language models (LLMs) to create new content in the form of text, imagery, audio or code using natural language instructions.
Generative AI tools such as the headline-grabbing ChatGPT train on large amounts of data from the internet, with dubious data quality, content, ownership and privacy. As many of you have experienced, including one unfortunate lawyer recently, ChatGPT will convincingly present truthful outputs alongside total misinformation, leaving the user to sort fact from fiction.
Clearly, in an enterprise setting, this is unacceptable. For enterprises, the success of generative AI and the associated LLMs depends on the quality and trustworthiness of the training data.
In CDP, Cloudera has delivered on the hybrid vision with a single control plane that manages a common security and governance framework across the platform and all data services. The CDP platform can move workloads, data and the associated metadata bi-directionally across public and private clouds.
CDP Open Data Lakehouse provides the “foundation data,” with the security, governance and enterprise context needed to deploy foundation models on-premises or in the cloud.
Trusting AI starts with trusting data
For enterprise AI to succeed, there must be trust in the results, and that confidence starts with trusting the underlying data used to train the models. As part of the CDP architecture, the Shared Data Experience (SDX) enables shared security, lineage and governance across all analytics, spanning public and private clouds.
SDX builds on two open-source projects: Apache Ranger, which defines, administers and manages security policies, and Apache Atlas, which handles metadata management and governance, building, classifying and governing a catalog of data assets.
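To make the Ranger side concrete, here is a minimal sketch of how a security policy can be defined programmatically through Apache Ranger's public REST API. The service, database, table and user names are hypothetical placeholders I've invented for illustration, not details from Cloudera's product; the JSON shape follows Ranger's `/service/public/v2/api/policy` endpoint.

```python
import json

RANGER_URL = "https://ranger.example.com:6182"  # hypothetical Ranger Admin host

def build_hive_policy(service, database, table, users, accesses):
    """Build a Ranger policy payload granting the given users the given
    access types on one Hive table; mirrors the JSON accepted by
    Ranger's /service/public/v2/api/policy endpoint."""
    return {
        "service": service,
        "name": f"{database}.{table}-read",
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": ["*"]},
        },
        "policyItems": [
            {
                "users": users,
                "accesses": [{"type": a, "isAllowed": True} for a in accesses],
            }
        ],
    }

# Hypothetical example: let one analyst SELECT from sales.orders.
policy = build_hive_policy("cm_hive", "sales", "orders",
                           users=["analyst1"], accesses=["select"])
# In practice this payload would be POSTed with admin credentials to:
#   f"{RANGER_URL}/service/public/v2/api/policy"
print(json.dumps(policy, indent=2))
```

Because SDX centralizes these policies, a rule defined once applies to every analytics engine reading that table, rather than being re-implemented per tool.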
SDX includes a Data Catalog to administer and discover all data assets. The data is profiled and enhanced with rich metadata—including operational, social, and business context—creating trusted and reusable data assets and making them discoverable.
CDP has the functionality to enable holistic security, governance and compliance across the entire data lifecycle, including machine learning models in production environments.
The key here is the ability to explain the model generation, the data used to train the model and that data's origins: an accurate and complete lineage from data source to production environment.
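Lineage of this kind is a graph walk. As a sketch of the idea, Apache Atlas exposes an entity's lineage as a set of from/to relations (via `GET /api/atlas/v2/lineage/{guid}`), and tracing a trained model back to its source data means collecting everything upstream. The entity names below are invented for illustration; a real response identifies entities by GUIDs.

```python
def upstream_guids(lineage, base_guid):
    """Walk an Atlas-style lineage response and collect every entity
    upstream of the base entity (i.e., everything it was derived from)."""
    edges = {}
    for rel in lineage.get("relations", []):
        edges.setdefault(rel["toEntityId"], []).append(rel["fromEntityId"])
    seen, stack = set(), [base_guid]
    while stack:
        for parent in edges.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical lineage: raw_table -> etl_process -> model_training_data
sample = {
    "baseEntityGuid": "model_training_data",
    "relations": [
        {"fromEntityId": "raw_table", "toEntityId": "etl_process"},
        {"fromEntityId": "etl_process", "toEntityId": "model_training_data"},
    ],
}
print(upstream_guids(sample, "model_training_data"))
# returns the set {"etl_process", "raw_table"}
```

The point of the exercise: if the walk terminates at governed, cataloged sources, the model's training data is explainable end to end.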
BYO version of GPT and foundation models
Many customers already use ML capabilities as part of CDP. Cloudera’s Machine Learning Service is well established and covers the entire ML Lifecycle from experimental data science to model training and deployment. Cloudera provides a library of end-to-end applied machine learning prototypes (AMPs) to help customers get started on developing applications.
At the recent Six Five Summit analyst event, Cloudera announced the LLM Chatbot Augmented with Enterprise Data, a blueprint for generative AI applications built on large language models, in response to customers who want to create their own version of GPT and foundation models in-house rather than exposing their data to public APIs and plug-ins.
With the Cloudera CDP LLM AMP, customers can build AI applications powered by any open-source LLM combined with proprietary data, all hosted internally in the enterprise. The AMP is free in both the CDP public and private clouds.
Under the covers, Cloudera manages the Python dependencies, pulls open-source models from partner Hugging Face, uses an open-source vector database for semantic search, injects the enterprise knowledge base into that vector database and runs a Python web application on top. In its reference build, Cloudera used H2O models, the Milvus vector database, CML docs as the knowledge base and Gradio for the UI. Everything is customizable and pluggable for a specific use case, using any model, data, database and application framework. With this AMP and CML, any developer now has the tools to build and host open-source LLM applications for the enterprise.
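The pattern described here, often called retrieval-augmented generation, can be sketched in a few lines: embed the enterprise documents, retrieve the passages most similar to a question, and inject them into the LLM's prompt as context. This toy version substitutes a bag-of-words cosine similarity for a real embedding model and vector database, and stops at the assembled prompt rather than calling a model; the documents and question are invented examples.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# "Vector database": pre-embedded enterprise documents (invented examples).
docs = [
    "Expense reports must be filed within 30 days of travel.",
    "The VPN client is required for all remote access to internal systems.",
    "Quarterly revenue figures are published on the finance portal.",
]
index = [(d, embed(d)) for d in docs]

def retrieve(question, k=1):
    """Semantic search: return the k documents closest to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def build_prompt(question):
    """Inject the retrieved enterprise context into the LLM prompt."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How many days do I have to file an expense report?")
# An internally hosted open-source LLM would now complete `prompt`,
# grounding its answer in enterprise data rather than internet training data.
print(prompt)
```

Swapping in real embeddings, a vector store and an LLM, plus a Gradio front end, yields essentially the architecture the AMP packages up.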
Cloudera is unique in offering a hybrid open data lakehouse across public and private clouds at scale. CDP is an integrated platform that provides the capabilities of both a data warehouse and a data lake.
This single platform provides the foundation for business intelligence, machine learning and AI solutions while leveraging open-source innovations such as Iceberg, Airflow and YuniKorn. CDP also provides the flexibility of a hybrid multi-cloud model to deploy across both public and private clouds.
In the new world of enterprise AI, CDP enables enterprise AI across all available data, using foundation models and LLMs for generative AI-based applications in a secure, trusted and responsible manner.
As a Chief Data Officer (CDO), you need full data lifecycle capability, which means storing data efficiently and resiliently, piping and aggregating data into data lakehouses, and applying ML algorithms and AI to uncover actionable insights for the business units. You could assemble a bevy of best-of-breed tools and struggle to cobble them together, but good luck achieving shared security, lineage and governance. Cloudera CDP gives you everything you need out of the box and should be on your shortlist.