In a recent episode of the Six Five Insider podcast, we spoke with Ram Venkatesh, CTO of Cloudera. Our discussion focused on hybrid multi-cloud deployments and associated data management challenges.
As many of you know, it can get tricky to stitch together analytic functions across multiple clouds; this is the sweet spot Cloudera addresses with hybrid data clouds. Hybrid data cloud technology is critical for moving data and workloads seamlessly between on-premises infrastructure and public clouds, and it handles both data at rest and data in motion.
The unique capability that Cloudera brings to the market is firmly grounded in Cloudera’s open-source approach. In Venkatesh’s words, “We ensure that any API or file format or engine conforms to an open standard with a community.” In this article, we’ll delve into how an open-source approach has enabled Cloudera to deliver data services with complete portability across all clouds.
Apache Iceberg: Build an open lakehouse anywhere
Apache Iceberg started life at Netflix to solve issues with sprawling, petabyte-scale tables; Netflix then donated it to the open-source community in 2018 as an Apache Incubator project. Cloudera has been pivotal in establishing Apache Iceberg, a high-performance format for huge analytic tables, as an industry standard.
Those conversant with traditional structured query language (SQL) will find the Iceberg table format familiar, because its tables behave like SQL tables. The format lets multiple engines such as Hive, Impala, Spark, Trino, Flink and Presto work simultaneously on the same data. It also tracks how a dataset's schema and contents change over time.
Iceberg is a core element of the Cloudera Data Platform (CDP). Iceberg enables users to build an open data lakehouse architecture to deliver multi-function analytics over large datasets of both streaming and stored data. It does this in a cloud-native object store that functions both on-premises and across multiple clouds.
By using the various CDP data services, including Cloudera Data Warehousing (CDW), Cloudera Data Engineering (CDE) and Cloudera Machine Learning (CML), users can define and manipulate datasets with SQL commands. Users can also build complex data pipelines using features like time travel, and deploy machine learning (ML) models made from data in Iceberg tables.
Through contributions back to the open-source community, Cloudera has extended Iceberg support to Hive and Impala. The result is a data architecture for multi-function analytics that handles everything from large-scale data engineering workloads to fast BI and querying, as well as ML.
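To make the idea of time travel concrete, here is a minimal sketch in plain Python. It is a toy model, not Iceberg's API or file layout: the names ToyTable, Snapshot, commit and as_of are hypothetical, the timestamps are simple integers, and each snapshot stores a full copy of the rows, whereas a real table format tracks data files through metadata. The point is only that every commit adds an immutable snapshot, so a reader can ask for the table as it stood at an earlier time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int   # logical timestamp (assumption: integers, always increasing)
    rows: tuple         # immutable copy of the table contents at commit time


@dataclass
class ToyTable:
    """Append-only log of snapshots; earlier snapshots are never modified."""
    snapshots: list = field(default_factory=list)

    def commit(self, rows, committed_at):
        """Record a new snapshot and return its id."""
        snap = Snapshot(len(self.snapshots) + 1, committed_at, tuple(rows))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def current(self):
        """Rows of the most recent snapshot."""
        return list(self.snapshots[-1].rows)

    def as_of(self, timestamp):
        """Rows of the latest snapshot committed at or before `timestamp`."""
        eligible = [s for s in self.snapshots if s.committed_at <= timestamp]
        if not eligible:
            raise LookupError("no snapshot exists at or before that time")
        return list(eligible[-1].rows)


table = ToyTable()
table.commit([("order-1", 10)], committed_at=100)
table.commit([("order-1", 10), ("order-2", 25)], committed_at=200)
table.commit([("order-2", 25)], committed_at=300)  # order-1 has been deleted

latest = table.current()    # state after the last commit
earlier = table.as_of(250)  # state as it was before the deletion
```

A pipeline could use a read like as_of to reproduce a report or retrain a model against exactly the data that existed at a given point, even after later commits have changed the table.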
Cloudera has integrated Iceberg into CDP’s Shared Data Experience (SDX) layer, so the productivity and performance benefits of the open table format arrive right out of the box. Also, the Iceberg native integration benefits from various enterprise-grade features of SDX such as data lineage, audit and security functionality.
Cloudera assures organizations that they can build an open lakehouse anywhere, on any public cloud or on-premises. Even better, the open approach leaves them free to choose their preferred analytics tool with no lock-in.
Apache Ranger: Policy administration across the hybrid estate
Apache Ranger is a software framework that enables, monitors and manages comprehensive data security across CDP. It is the tool for creating and managing the policies that govern access to data and services in the CDP stack. Security administrators can define security policies at the database, table, column and file levels and administer permissions for specific groups or individuals.
Ranger manages the whole process of authorization, that is, which access rights each authenticated user has for data resources. For example, a particular user might be allowed to create a policy and view reports but not allowed to edit users and groups.
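The policy model described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Ranger's API or policy format: the Policy class, the is_allowed function, the "*" wildcard and the sample database, table and user names are all hypothetical. It shows policies scoped to a database, table and column, granted to groups or individual users, with access denied unless some policy allows it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    database: str
    table: str          # "*" matches any table (assumed wildcard convention)
    column: str         # "*" matches any column
    groups: frozenset   # groups granted by this policy
    users: frozenset    # individual users granted by this policy
    actions: frozenset  # permitted actions, e.g. {"select"}


def _matches(pattern, value):
    return pattern == "*" or pattern == value


def is_allowed(policies, user, user_groups, action, database, table, column):
    """Default deny: allow only if a policy covers the resource, action and principal."""
    for p in policies:
        covers_resource = (
            _matches(p.database, database)
            and _matches(p.table, table)
            and _matches(p.column, column)
        )
        covers_principal = user in p.users or bool(set(user_groups) & p.groups)
        if covers_resource and covers_principal and action in p.actions:
            return True
    return False


policies = [
    # Analysts may read every column of sales.orders.
    Policy("sales", "orders", "*", frozenset({"analysts"}), frozenset(), frozenset({"select"})),
    # Only the individual user "dana" may read the email column of sales.customers.
    Policy("sales", "customers", "email", frozenset(), frozenset({"dana"}), frozenset({"select"})),
]

analyst_reads_orders = is_allowed(policies, "sam", {"analysts"}, "select", "sales", "orders", "amount")
analyst_reads_email = is_allowed(policies, "sam", {"analysts"}, "select", "sales", "customers", "email")
dana_reads_email = is_allowed(policies, "dana", set(), "select", "sales", "customers", "email")
analyst_updates_orders = is_allowed(policies, "sam", {"analysts"}, "update", "sales", "orders", "amount")
```

Here the analyst can read the orders table but not the customer email column, the named user can read that column, and an update is refused because no policy grants it.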
Apache Atlas: Metadata management and governance
Apache Atlas is a metadata management and governance system used to help find, organize and manage data assets. Essentially, it functions as the traffic cop within a data architecture. By creating metadata representations of objects and operations within the data lake, Atlas allows users to understand why models deliver specific results, going all the way back to the origin of the source data.
Using the metadata content it collects, Atlas builds relationships among data assets. When Atlas receives query information, it notes the input and output of the query and generates a lineage map that traces how data is used and transformed over time. This visualization of data transformations allows governance teams to quickly identify a data source and understand the impact of data and schema changes.
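The lineage idea can be illustrated with a small graph in plain Python. This is a conceptual sketch, not Atlas's API or data model: the LineageGraph class, its method names and the dataset names are hypothetical. Each recorded query contributes edges from its inputs to its output, and walking those edges answers two governance questions: where did this dataset originate, and what is affected if a source changes?

```python
from collections import defaultdict


class LineageGraph:
    def __init__(self):
        self.upstream = defaultdict(set)    # dataset -> datasets it was derived from
        self.downstream = defaultdict(set)  # dataset -> datasets derived from it

    def record_query(self, inputs, output):
        """Note the inputs and output of one query as lineage edges."""
        for source in inputs:
            self.upstream[output].add(source)
            self.downstream[source].add(output)

    def _walk(self, start, edges):
        """Collect every dataset reachable from `start` along `edges`."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def origins(self, dataset):
        """Upstream datasets that have no recorded inputs of their own."""
        return {d for d in self._walk(dataset, self.upstream) if not self.upstream.get(d)}

    def impact(self, dataset):
        """Every dataset derived, directly or indirectly, from `dataset`."""
        return self._walk(dataset, self.downstream)


graph = LineageGraph()
graph.record_query(["raw_orders", "raw_customers"], "orders_enriched")
graph.record_query(["orders_enriched"], "daily_revenue")
graph.record_query(["daily_revenue"], "churn_features")

sources = graph.origins("churn_features")
affected = graph.impact("raw_orders")
```

In this example the features feeding a model trace back to the two raw tables, and a schema change to raw_orders is shown to affect all three derived datasets.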
Apache Ozone: Open-source answer for dense storage on-premises
Separating compute and data resources in the cloud provides many advantages for a CDP deployment. It presents more options for allotting computational and storage resources and allows server clusters to be shut down to avoid unnecessary compute expense while leaving the data available for use by other applications. Additionally, resource-intensive workloads can be isolated on dedicated compute clusters, separate from other workloads.
To make these advantages consistent everywhere, including on-premises, CDP Private Cloud (the on-premises version of CDP) uses Apache Ozone to separate storage from compute. Apache Ozone is a distributed, scalable, high-performance on-premises object store that supports the same interaction model as AWS S3, Microsoft Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS).
Cloudera is the standard-bearer for the industrialization of open-source data management and analytics innovation, which I believe is a winning strategy. History has taught us that enterprises vote with their dollars and are unlikely to reward closed or proprietary platforms or those built by a single vendor without a broad ecosystem.
Cloudera is one of twenty vendors in the crowded cloud database management systems (CDMS) marketplace. Selecting the right vendor for your specific needs can be a daunting task. The vendor’s approach to openness must be a critical factor in your selection because, in any enterprise deployment, data will originate from many locations and must interoperate openly with both source and destination systems. Any software you use needs to be built with that in mind.
In this context, the Cloudera strategy to harness multiple open-source systems to deliver hybrid multi-cloud solutions and offer the most choice to customers is bound to enjoy a continuous advantage in terms of innovation and interoperability.
The biggest enterprises with large amounts of data see Cloudera as the right company to manage that data end to end, whether it sits on-premises or in the public cloud, or even arrives through a SaaS application. Cloudera is doing an excellent job of pulling it all together as a one-stop shop for large-scale data management.