The data mesh is the hottest topic in data management circles these days. At a recent data management webinar I attended, the closing Q&A session was inundated with data mesh questions, even though the topic wasn’t mentioned in the webinar at all! (If you’re the presenter, that is when you know you have misread your audience.)
In this article, I explain what a data mesh is, introduce the concept of data as a product, discuss the required organizational changes that go along with these ideas and map the unique features of the Cloudera Data Platform (CDP) to provide the foundation for a data mesh across a hybrid cloud.
Data mesh: the decentralization of data
At a high level, the idea behind the data mesh is to decentralize data and workloads. As opposed to the traditional way of moving data from the source and copying it into a central location owned by IT, a data mesh keeps the data and its ownership in the domain where it originated. A data mesh keeps data in the hands of the people who understand it and allows companies to avoid an IT bottleneck by scaling out with domains.
A data mesh approach mainly benefits large companies with a domain-oriented structure; it is generally not a fit for small or medium-sized companies. But for those big companies struggling with large amounts of data, the benefits of a data mesh can be worth the investment.
Naturally, technology plays a vital role in enabling a data mesh. What follows are the four principles of the data mesh and how the service components within CDP support each of them.
Principle One: Domain ownership
In a data mesh, the data is decentralized and stays within separate business domains. For example, a manufacturing group would maintain ownership of the manufacturing-specific data—with the clear responsibility to maintain and clean that data.
Because authentication and authorization of users must be applied consistently across all domains, data ownership also comes with the responsibility to manage metadata, access and use policies in a way that matches companywide standards.
In a data mesh, data-sharing between data producers and consumers across domains throughout the organization requires data streaming.
Cloudera DataFlow (CDF) is a real-time streaming data platform that collects, curates, analyzes and acts on data-in-motion across the edge, data center and multiple public clouds. CDF offers capabilities such as edge and flow management, stream messaging, and stream processing and analytics by leveraging open-source projects such as Apache NiFi, Apache Kafka and Apache Flink.
Apache NiFi, in particular, is an essential service for data mesh implementations because it offers centralized management, end-to-end traceability throughout the data lifecycle and interactive command and control, thereby providing real-time operational visibility.
Principle Two: Data as a product
Each domain must now treat data as a product. For example, the manufacturing domain must create the infrastructure and a mini-IT team to convert the source data into an analytical format, place it in a data lake or data warehouse and make it available for others to access.
The data product is the data plus the metadata surrounding it, including dynamic information such as freshness, statistics, access controls, owners, best uses of the data and lineage. The domain owner maintains the data products according to a contract that would make that data available to others outside the domain—for example, in other business operating units.
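To make the "data plus metadata" idea concrete, here is a minimal Python sketch of what a data product descriptor might look like. The class name, field names and the example values are all hypothetical; this is not a CDP or Data Catalog API, only an illustration of the kind of contract a domain owner could publish alongside the data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Set


@dataclass
class DataProduct:
    """Hypothetical descriptor: the data's location plus its surrounding metadata."""
    name: str
    domain: str                      # owning business domain
    location: str                    # where consumers read the data from
    owners: List[str]                # people accountable for the product
    last_refreshed: datetime         # dynamic metadata: when data was last updated
    freshness_sla: timedelta         # contract: maximum acceptable staleness
    allowed_roles: Set[str] = field(default_factory=set)   # access controls
    lineage: List[str] = field(default_factory=list)       # upstream sources
    best_uses: str = ""              # guidance for consumers

    def is_fresh(self, now: datetime) -> bool:
        """True if the product still meets its freshness contract at `now`."""
        return now - self.last_refreshed <= self.freshness_sla

    def can_read(self, role: str) -> bool:
        """True if the given role is allowed to consume the product."""
        return role in self.allowed_roles


# Example values are made up for illustration.
product = DataProduct(
    name="line_yield_daily",
    domain="manufacturing",
    location="s3://example-bucket/manufacturing/line_yield_daily/",
    owners=["manufacturing-data-team@example.com"],
    last_refreshed=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    freshness_sla=timedelta(hours=24),
    allowed_roles={"analyst", "data_scientist"},
    lineage=["mes.raw_line_events"],
    best_uses="Daily yield reporting; not suitable for real-time alerting.",
)

now = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
print(product.is_fresh(now), product.can_read("analyst"))
```

The point of the sketch is that freshness, ownership, access and lineage travel with the data, so a consumer in another operating unit can decide whether the product is usable without calling the manufacturing team.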
Data Catalog is a service within CDP. The Data Catalog enables data to be discoverable across the mesh domains and captures user-curated technical and business metadata describing the data product. All data is secured at rest and in transit by FIPS 140-2 level encryption and stored in open formats and standards.
Principle Three: Self-serve data platform
Based on the first two principles, imagine we have organized and funded teams around data products. Those teams have been delegated ownership to manage and build the data products and serve them to other consumers across the data mesh. Now what?
Domain teams will need an easy and manageable way to select the data infrastructure services required and analytical tools to build high-quality, reusable data products. Data owners will want to instantiate continuous integration and continuous deployment (CI/CD) services and discover and integrate new data products.
Cloudera Data Hub is an analytics service on CDP that allows users to deploy analytical clusters across the entire data lifecycle using an infrastructure-as-a-service (IaaS) approach. This cloud-based architecture supports the separation of compute and storage and enables the deployment of highly customized analytics workloads.
Principle Four: Federated computational governance
Although data mesh architecture and operations are decentralized, global and regulatory constraints and companywide policies must be consistent across each domain—and managed by IT.
The Common Control Plane in CDP provides a ubiquitous service that is consistent and spans the data mesh. The control plane spans multiple public and private clouds as a federated service that centrally manages metadata, security, encryption and governance.
Cloudera Shared Data Experience (SDX) combines federated security, governance and management capabilities with shared metadata and a data catalog. SDX is a shared service on the control plane that provides a governance layer to assign ownership, capture audits and apply global policies.
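As a rough illustration of what "federated computational governance" means in practice, the sketch below defines companywide policies once and evaluates them against products owned by different domains. The policy names, the dictionary keys and the two example products are assumptions made up for this example; this is not how SDX is implemented or configured, only the general shape of the idea.

```python
from typing import Callable, Dict, List

# A policy is a named check that a product's metadata must pass.
Policy = Callable[[dict], bool]

# Hypothetical global policies, defined centrally and applied to every domain.
GLOBAL_POLICIES: Dict[str, Policy] = {
    "has_owner": lambda p: bool(p.get("owners")),
    "encrypted_at_rest": lambda p: p.get("encrypted_at_rest") is True,
    "pii_is_classified": lambda p: not p.get("contains_pii") or "classification" in p,
}


def violations(product: dict) -> List[str]:
    """Return the names of the global policies this product fails."""
    return [name for name, check in GLOBAL_POLICIES.items() if not check(product)]


# Two made-up products from different domains.
manufacturing_product = {
    "name": "line_yield_daily",
    "owners": ["manufacturing-data-team@example.com"],
    "encrypted_at_rest": True,
    "contains_pii": False,
}
sales_product = {
    "name": "customer_contacts",
    "owners": ["sales-ops@example.com"],
    "encrypted_at_rest": False,
    "contains_pii": True,
}

for product in (manufacturing_product, sales_product):
    print(product["name"], violations(product))
```

Each domain still builds and runs its own products, but none of them gets to choose which rules apply: the checks live in one place and run the same way everywhere.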
Data mesh implies organizational change
As mentioned above, a data mesh requires each domain to have its own mini-IT team, so instead of scaling up centrally, organizational scaling is achieved by adding new technical staff domain by domain. Data mesh and domain creation can become very complicated, and domain design can take time. Indeed, teams may not be open to the radical change the data mesh brings.
When a data mesh is introduced, a domain currently sending data to central IT must hire a mini-IT team, administer a budget and follow a contract to become a working piece of the data mesh. Faced with this, the response from the people within the domain may be, “I don't have time, and I can barely keep things afloat now. You want me to do all this for the greater good?”
All it takes is one domain to refuse to comply, and the data mesh is in jeopardy. In the short run, the workaround is to adopt a hybrid approach where most data is centralized, with some domains beginning to operate in a data mesh. In the future, the aim is to convince more of the company to adopt a data mesh—a change-management issue more than a technology issue.
Of the four principles of the data mesh, principles one and two decentralize data and shift ownership to the domain. Principle three still requires a central team to enable the domains to be created and deployed at the click of a button. Principle four ensures centralized governance because nobody can operate independently regarding security or regulatory compliance.
Data products are the foundation of the approach. Data products need to be available and consumable across the mesh. Central governance must exist at the enterprise level, with control over analytical tools and applications delegated down to the domains so that those tools and applications can be used on a self-service basis.
The Cloudera Data Platform (CDP) can manage the deployment of a data architecture across national boundaries, using multiple public clouds and on-premises infrastructure, which is a crucial foundation for building a data mesh. Indeed, the CDP hybrid deployment capability with centralized governance is fundamental to enabling the data mesh architecture.
Cloudera's hybrid data platform should undoubtedly be on your shortlist if a data mesh is in your future.