Cloudera And Apache Iceberg – Collaborating On The Same Data

By Patrick Moorhead - August 1, 2022

Recently Cloudera announced the general availability of Apache Iceberg in the Cloudera Data Platform (CDP). This article provides some background on data lake storage, the challenges of organizing data within data lake storage, the emergence of Apache Iceberg as the standard for managing data in data lakes, and finally, the benefits for existing and potential users of Cloudera CDP.


Challenges of organizing data within data lake storage

Data lakes deliver virtually unlimited storage for structured and unstructured data. A data lake is a shared data repository that an organization's applications access for various tasks, including reporting, analytics, and processing.

The Apache Hadoop Distributed File System (HDFS), Cloudera's roots, formed the basis for traditional data lakes. Today, the trend is towards cloud data lakes that utilize object storage systems such as Amazon S3 and Microsoft Azure Data Lake Storage (ADLS).

Data is stored in the data lake precisely as it is collected. A structured dataset maintains the original structure without further indexing or metadata. Similarly, unstructured data such as social media posts, images, and MP3 files land in the original native format.

https://youtu.be/8zI6_v0jupw

A data lake can only work if data can be extracted and used for analysis, which requires data governance. Data catalogs, such as Hive Metastore (HMS), apply metadata and a hierarchical logic to incoming data, so datasets receive the necessary context and trackable lineage.

The limitations of a catalog

While catalogs provide a shared definition of the dataset structure within data lake storage, data changes or schema evolution between applications go untracked. For example, the structure of a large dataset, including column names and data types, can be cataloged by Hive, but the data files present as part of the dataset are unknown. As a result, applications must read file metadata to identify which files are part of a dataset at any given time.

Data integrity is not much of an issue if the dataset is static and does not change. When one application writes to and modifies the dataset, another application that reads from the same dataset must be in sync with the changes. For example, an ETL (Extract, Transform, Load) process updates the dataset by adding and removing several files from storage; another application that reads the dataset may process a partial or inconsistent view of the dataset and generate incorrect results.

What is Apache Iceberg?

Apache Iceberg is a new open table format that enables multiple applications to work together on the same data transactionally. It tracks the state of dataset evolution and changes over time.

Those conversant with traditional SQL tables will immediately recognize the Iceberg table format. It is open and accessible so multiple engines can operate on the same dataset.

HMS, for example, keeps track of data at the “folder” level, which requires file list operations when working with data in a table and can often degrade performance.

Iceberg avoids this by keeping track of a complete list of all files within a table using a persistent tree structure.
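To make the idea of tracking files in a persistent tree more concrete, here is a minimal in-memory sketch. It is a toy model, not Iceberg's actual metadata format or API: the `Manifest` and `Snapshot` classes, the `append` helper, and the file paths are all hypothetical names chosen for illustration. The point it tries to show is that each table version can list its files from metadata alone, and that a new version reuses the unchanged parts of the previous one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest: an immutable list of data file paths."""
    data_files: Tuple[str, ...]


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot: the manifests that make up one table version."""
    snapshot_id: int
    manifests: Tuple[Manifest, ...]

    def files(self) -> List[str]:
        # The complete file list comes from metadata, with no directory listing.
        return [path for m in self.manifests for path in m.data_files]


def append(parent: Snapshot, new_files: Tuple[str, ...]) -> Snapshot:
    # Reuse the parent's manifests untouched and add one new manifest.
    new_manifest = Manifest(new_files)
    return Snapshot(parent.snapshot_id + 1, parent.manifests + (new_manifest,))


s1 = Snapshot(1, (Manifest(("data/a.parquet", "data/b.parquet")),))
s2 = append(s1, ("data/c.parquet",))

print(s1.files())  # the older version still lists exactly its own files
print(s2.files())  # the newer version lists the old files plus the new one
```

Because the older snapshot is never modified, a reader that started with `s1` keeps a consistent view while a writer produces `s2`.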

Apache Iceberg was developed at Netflix to solve issues with huge, petabyte-scale tables and was donated to the open-source community in 2018 as an Apache Incubator project.

The benefits for Cloudera CDP users

General availability covers Iceberg running within essential data services in the Cloudera Data Platform (CDP)—including Cloudera Data Warehousing (CDW), Cloudera Data Engineering (CDE), and Cloudera Machine Learning (CML).

Cloudera has integrated Iceberg into the CDP’s SDX (Shared Data Experience) layer, so the productivity and performance benefits of the open table format are available right out of the box. Also, the Iceberg native integration benefits from enterprise-grade features of SDX such as data lineage, audit, and security.

The Iceberg tables in CDP integrate within the SDX Metastore for table structure and access validation, allowing for the creation of auditing and fine-grained policies. Iceberg enables CDP to expose the same dataset to multiple analytical engines, including Spark, Hive, Impala, and Presto.

There are four other benefits from the CDP Iceberg integration, which users will like:

In-place table evolution saves time

Users can evolve a table schema or change the partition layout with a single command, much as they would with SQL. Iceberg does not require laborious, costly processes, like rewriting table data or migrating to a new table.
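One way to see why a schema change need not rewrite data is to store values by a stable column ID and keep the column names only in metadata, which is the general approach Iceberg takes. The sketch below is a simplified illustration under that assumption, not Iceberg code: the row layout, the `read` function, and the column names are hypothetical.

```python
# Rows as they might sit in an existing data file, keyed by column ID
# rather than by name (hypothetical layout for illustration).
data_file = [
    {1: "alice", 2: 30},
    {1: "bob", 2: 25},
]

# The schema is metadata only: it maps each column ID to its current name.
schema_v1 = {1: "name", 2: "age"}


def read(rows, schema):
    # Project stored values through the given schema. A column that did not
    # exist when the file was written simply reads as None.
    return [
        {name: row.get(col_id) for col_id, name in schema.items()}
        for row in rows
    ]


# Rename column 1 and add column 3. The data file is left untouched.
schema_v2 = {1: "full_name", 2: "age", 3: "country"}

print(read(data_file, schema_v1))
print(read(data_file, schema_v2))
```

The same stored rows are readable under both schemas, so the change is a metadata update, not a data migration.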

Time travel for forensic visibility and regulatory compliance

Iceberg logs previous table snapshots, allowing the generation of time travel queries or table rollbacks.
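A time travel query can be thought of as a lookup in the snapshot log: find the latest snapshot committed at or before the requested time and read the files it references. The following sketch assumes a simple in-memory log; the `history` list, the millisecond timestamps, and the `snapshot_as_of` function are hypothetical and do not reflect Iceberg's real API.

```python
from bisect import bisect_right

# Hypothetical snapshot log, ordered by commit time:
# (commit_timestamp_ms, snapshot_id, files in that table version)
history = [
    (1000, 1, ["a.parquet"]),
    (2000, 2, ["a.parquet", "b.parquet"]),
    (3000, 3, ["b.parquet", "c.parquet"]),
]


def snapshot_as_of(log, ts_ms):
    """Return the latest snapshot committed at or before ts_ms."""
    timestamps = [entry[0] for entry in log]
    idx = bisect_right(timestamps, ts_ms) - 1
    if idx < 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return log[idx]


_, snapshot_id, files = snapshot_as_of(history, 2500)
print(snapshot_id, files)
```

A rollback follows the same idea: the table's current pointer is set back to an earlier entry in the log, and the older files become the live table again.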

Multi-function analytics from the edge to AI

Iceberg enables seamless integration between different streaming and processing engines while maintaining data integrity between them. Multiple engines can concurrently change the table, even with partial writes, without correctness issues and the need for expensive read locks.
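Concurrent writers without read locks are commonly achieved with optimistic concurrency: each writer prepares a new table version from the one it read, then publishes it with an atomic compare-and-swap, retrying if another writer got there first. The sketch below is a toy illustration of that pattern, not Iceberg's implementation; the `TablePointer` class and `commit_append` function are hypothetical, and the `threading.Lock` merely stands in for whatever atomic swap a real catalog provides.

```python
import threading


class TablePointer:
    """Hypothetical pointer to the current table version."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the catalog's atomic swap
        self._version = 0
        self._files = ()  # immutable tuple, so readers never see a partial write

    def current(self):
        with self._lock:
            return self._version, self._files

    def compare_and_swap(self, expected_version, new_files):
        # Publish new_files only if nobody committed since expected_version.
        with self._lock:
            if self._version != expected_version:
                return False
            self._version += 1
            self._files = new_files
            return True


def commit_append(table, new_file):
    # Optimistic loop: read, build a candidate version, try to publish, retry.
    while True:
        base_version, base_files = table.current()
        candidate = base_files + (new_file,)
        if table.compare_and_swap(base_version, candidate):
            return


table = TablePointer()
writers = [
    threading.Thread(target=commit_append, args=(table, f"data/file-{i}.parquet"))
    for i in range(8)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

version, files = table.current()
print(version, len(files))
```

A writer that loses the race does not corrupt anything; it simply rebuilds its candidate on top of the newer version and tries again, so every append ends up in the table exactly once.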

Improved performance with very large-scale data sets

Partitioning makes queries faster by grouping similar rows together at write time, dividing a table into parts based on selected attributes.

Iceberg simplifies partitioning by implementing hidden partitioning, which handles the details of partitioning and querying without requiring users to know the partition layout.
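Hidden partitioning can be pictured as a transform that the table applies on both the write and the read path: writers supply ordinary column values, and readers filter on those same columns, while the table derives the partition behind the scenes. The sketch below assumes a per-day transform on a timestamp column. It is a toy in-memory model, and the names `day_transform`, `write`, `scan`, and `event_ts` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def day_transform(ts):
    # Partition value derived from the timestamp; users never supply it.
    return ts.date()


partitions = defaultdict(list)  # partition value -> rows stored in it


def write(row):
    # The writer passes a plain row; the partition is computed for it.
    partitions[day_transform(row["event_ts"])].append(row)


def scan(start, end):
    """Return rows with start <= event_ts < end and the partitions read."""
    first_day, last_day = day_transform(start), day_transform(end)
    # Prune: only partitions that can contain matching rows are read.
    scanned = sorted(d for d in partitions if first_day <= d <= last_day)
    rows = [
        r for d in scanned for r in partitions[d]
        if start <= r["event_ts"] < end
    ]
    return rows, scanned


write({"id": 1, "event_ts": datetime(2022, 8, 1, 9, 0)})
write({"id": 2, "event_ts": datetime(2022, 8, 2, 10, 0)})
write({"id": 3, "event_ts": datetime(2022, 8, 2, 15, 0)})
write({"id": 4, "event_ts": datetime(2022, 8, 3, 8, 0)})

rows, scanned = scan(datetime(2022, 8, 2, 0, 0), datetime(2022, 8, 2, 12, 0))
print([r["id"] for r in rows], scanned)
```

The query mentions only `event_ts`, yet just one of the three day partitions is read, which is the effect the hidden partitioning description points to.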

Wrapping up

I like what Cloudera has done here. Analysts and data scientists can easily collaborate on the same data using their tools and analytic engines. Users get the benefits of Iceberg as part of CDP with no extra effort. No more lock-in, unnecessary data transformations, or data movement across tools and clouds to extract insights from the data.

It is true to the Cloudera strategy: to take open-source technologies and add enterprise-grade quality and stability. The biggest enterprises with large amounts of data see Cloudera as the company to manage that data end to end, whether it sits on-premises, in the public cloud, or comes through a SaaS application. Cloudera is doing an excellent job in pulling it all together as a one-stop shop for data management.

Note: Moor Insights & Strategy writers and editors may have contributed to this article.


Patrick founded the firm based on his real-world technology experiences and his understanding of what he wasn’t getting from analysts and consultants. Ten years later, Patrick is ranked #1 among technology industry analysts in terms of “power” (ARInsights) and in “press citations” (Apollo Research). Moorhead is a contributor at Forbes and frequently appears on CNBC. He is a broad-based analyst covering a wide variety of topics including the cloud, enterprise SaaS, collaboration, client computing, and semiconductors. He has 30 years of experience, including 15 years of executive experience at high-tech companies (NCR, AT&T, Compaq (now HP), and AMD) leading strategy, product management, product marketing, and corporate marketing, and three industry board appointments.