When we discuss data, we often approach it from a bottom-up technology perspective, focusing on things like data plumbing, pipelines and centralization in data lakes. A more productive approach is to first understand why companies invest in the technology at all. The reason should be to democratize data and make it available to as many users in the company as possible so they can make better data-driven decisions.
An enterprise data strategy must ensure that users can understand and visualize data in the context of their specific use cases. More than that, data must be represented in context regardless of data location and decoupled from the underlying data infrastructure—while also conforming to open standards. For example, a factory floor analyst, typically not skilled in data plumbing, must be able to use the relevant data to make quick decisions around supply chain resiliency or the impact of a particular product recall.
Self-service is essential so citizen data users can explore, visualize and search for information without relying on highly skilled data engineers or data scientists who are time-constrained and in high demand. That said, self-service for citizen data users must also be balanced with IT’s need to maintain control of the enterprise data platform to ensure high security, appropriate data governance and operational reliability.
You need a data strategy more than a cloud strategy
Most companies will eventually settle on a hybrid cloud strategy, deploying workloads in locations that make the most sense, whether on-premises, across multiple clouds, or in edge environments. The data management challenge is to store, process, secure and govern all the data distributed across those environments.
A modern data architecture should provide a layer of data services to enable the movement of data, metadata and workloads across the hybrid cloud with full access controls, data lineage and audit logs.
In the bigger picture, the data strategy should encompass the cloud strategy so that the two are aligned and reinforce one another.
A modern data architecture maximizes the hybrid cloud
Data originates at the edge, in the cloud or on-premises. The goal is to capture, store and process this data in its original format without losing any contextual information about the data and its source of truth.
The data management architecture should provide a consolidated view of all data assets regardless of location with consistent security and governance. The most critical features in modern data architectures for the hybrid world are listed below.
Data fabric is the essential element
The data fabric manages the entire lifecycle of storing, processing, securing and analyzing data no matter where it resides. A data fabric connects diverse data repositories and provides consistency in security, governance and data management capabilities across on-premises infrastructure and multiple clouds.
For example, by utilizing a data fabric, companies can collate customer data from various touchpoints such as CRM systems, social media and websites to create a 360-degree customer profile. Marketing can then use customer sentiment analysis to segment customers or launch targeted campaigns that match consumer preferences.
Another example of data fabric architecture in a multi-cloud environment may involve AWS for customer data, Microsoft Azure for advertising data and Cloudera providing analytical services on the Cloudera Data Platform. The data fabric architecture ties these environments together to create a unified data view.
The data fabric will provide a holistic view using data services and APIs to pull together data from legacy systems, data lakes, data warehouses and different enterprise applications. The problem of data gravity—that data becomes more challenging to move as it grows in size—is alleviated by a data fabric as it abstracts the technological complexities associated with data movement, transformation and integration.
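As a rough illustration of the unified view described above, the following Python sketch merges records from several systems into one profile per customer. The source names (`crm`, `social`, `web`), the field names and the `unified_profile` helper are all hypothetical; a real data fabric would expose this capability through its own data services and APIs rather than in-memory lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical per-source records keyed by a shared customer_id.
# In practice each list would come from a different system or cloud.
SOURCES: Dict[str, List[dict]] = {
    "crm": [{"customer_id": 1, "name": "Ada", "tier": "gold"}],
    "social": [{"customer_id": 1, "sentiment": "positive"}],
    "web": [
        {"customer_id": 1, "last_visit": "2023-01-15"},
        {"customer_id": 2, "last_visit": "2023-02-01"},
    ],
}


def unified_profile(sources: Dict[str, Iterable[dict]]) -> Dict[int, dict]:
    """Merge records from every source into one profile per customer."""
    profiles: Dict[int, dict] = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            profile = profiles[record["customer_id"]]
            for field_name, value in record.items():
                if field_name != "customer_id":
                    profile[field_name] = value
            # Remember which systems contributed, as simple provenance.
            profile.setdefault("sources", []).append(source_name)
    return dict(profiles)


profiles = unified_profile(SOURCES)
print(profiles[1])
```

The caller asks for a customer profile and never needs to know which system holds which attribute, which is the point of the abstraction.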
A data fabric will have six fundamental components common to most vendors.
- Data management provides data governance and security.
- Data ingestion combines cloud data and establishes connections between structured and unstructured data.
- Data processing refines the data to ensure that data extraction uses only relevant data.
- Data orchestration transforms, integrates and cleanses the data to make it fully usable.
- Data discovery surfaces new opportunities to integrate disparate data sources.
- Data access ensures the proper permissions and surfaces relevant data through dashboards and other data visualization tools.
These elements should enable the portability of applications, data and metadata across on-premises and cloud boundaries while keeping track of data’s movement and, most importantly, without requiring any changes to the application code.
Managing metadata is more complex with a hybrid cloud
A hybrid cloud will contain data and its associated metadata distributed across different clouds. Metadata management includes collecting and storing metadata related to various asset types and making it available as needed for downstream applications.
An effective metadata management system provides the flexibility to move data and associated metadata across the hybrid cloud without losing sight of the data assets or their context. In fact, consistent data context simplifies the delivery of data and analytics with a multi-tenant data access model defined once and then seamlessly applied everywhere.
Metadata should contain information about database schemas, security policies, audit logs, data lineage and provenance, and the management system should enable all of this metadata to be viewed and managed centrally. This means that users who need to explore data on demand via different analytics engines must have a layer of shared services that provide all the necessary metadata, context and state information for a consistent view of data assets.
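To make the idea of centrally managed metadata more concrete, here is a minimal Python sketch of a catalog that keeps schema, security policies, lineage and an audit trail attached to each asset as it moves. The class names (`AssetMetadata`, `MetadataCatalog`), field names and location labels are assumptions made for illustration, not the API of any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssetMetadata:
    """Metadata for one data asset (field names are illustrative)."""
    name: str
    location: str                      # e.g. "on-prem", "aws", "azure"
    schema: Dict[str, str]             # column name -> type
    security_policies: List[str] = field(default_factory=list)
    lineage: List[str] = field(default_factory=list)    # direct upstream assets
    audit_log: List[str] = field(default_factory=list)


class MetadataCatalog:
    """One place to register and look up metadata, wherever the data lives."""

    def __init__(self) -> None:
        self._assets: Dict[str, AssetMetadata] = {}

    def register(self, asset: AssetMetadata) -> None:
        self._assets[asset.name] = asset

    def get(self, name: str) -> AssetMetadata:
        return self._assets[name]

    def move(self, name: str, new_location: str) -> None:
        """Relocate an asset; its metadata stays attached and the move is audited."""
        asset = self._assets[name]
        asset.audit_log.append(f"moved {asset.location} -> {new_location}")
        asset.location = new_location

    def upstream(self, name: str) -> List[str]:
        """Return every transitive upstream asset of `name`."""
        seen: List[str] = []
        stack = list(self._assets[name].lineage)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                if parent in self._assets:
                    stack.extend(self._assets[parent].lineage)
        return seen


catalog = MetadataCatalog()
catalog.register(AssetMetadata("raw_orders", "on-prem", {"id": "int"}))
catalog.register(AssetMetadata("clean_orders", "aws", {"id": "int"},
                               lineage=["raw_orders"]))
catalog.register(AssetMetadata("sales_report", "azure", {"total": "float"},
                               security_policies=["finance-only"],
                               lineage=["clean_orders"]))
catalog.move("clean_orders", "azure")
print(catalog.upstream("sales_report"))
```

Because lineage and policies live in the catalog rather than with any one storage system, moving `clean_orders` between clouds changes its location but not what downstream users can learn about it.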
Decouple storage and compute for efficiency
The separation of storage and compute allows companies to pay less overall by temporarily shutting down compute clusters to avoid unnecessary expenses. It also allows storage and compute resources to be scaled independently to match business needs.
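A back-of-the-envelope calculation shows why this matters. The Python sketch below compares an always-on cluster with one that runs only during working hours while storage stays constant. All prices and volumes are made-up assumptions chosen for round numbers, not quotes from any provider.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

# Hypothetical prices and volumes, for illustration only.
COMPUTE_RATE = 4.00    # dollars per cluster-hour
STORAGE_RATE = 23.00   # dollars per TB per month
STORAGE_TB = 100


def monthly_cost(compute_hours: float) -> float:
    """Storage is billed for the whole month; compute only for hours used."""
    return COMPUTE_RATE * compute_hours + STORAGE_RATE * STORAGE_TB


# Coupled design: compute must stay up for the data to remain available.
always_on = monthly_cost(HOURS_PER_MONTH)   # 5220.0

# Decoupled design: the cluster runs 8 hours a day on 22 working days.
on_demand = monthly_cost(8 * 22)            # 3004.0

savings = always_on - on_demand             # 2216.0
print(f"always on: ${always_on:,.2f}, on demand: ${on_demand:,.2f}, "
      f"saved: ${savings:,.2f}")
```

Under these assumed figures the storage bill is identical in both cases; the entire saving comes from compute hours that are no longer paid for.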
The data access layer, application APIs and metadata repositories must provide the abstraction necessary to decouple the application code from the underlying infrastructure. Data can then move freely across cloud boundaries through object stores such as Apache Ozone for on-premises deployments, Amazon S3, Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS).
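One way to picture this abstraction is a thin access layer that routes a URI to the right backend, so application code never names a specific cloud. The Python sketch below is a simplified assumption of how such a layer could look: `ObjectStore`, `InMemoryStore` and `DataAccessLayer` are hypothetical names, and the in-memory backend stands in for real object store clients.

```python
from typing import Dict, Protocol, Tuple
from urllib.parse import urlparse


class ObjectStore(Protocol):
    """Minimal interface every storage backend is assumed to offer."""

    def read(self, bucket: str, key: str) -> bytes: ...

    def write(self, bucket: str, key: str, data: bytes) -> None: ...


class InMemoryStore:
    """Stand-in for a real object store client; keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[Tuple[str, str], bytes] = {}

    def read(self, bucket: str, key: str) -> bytes:
        return self._objects[(bucket, key)]

    def write(self, bucket: str, key: str, data: bytes) -> None:
        self._objects[(bucket, key)] = data


class DataAccessLayer:
    """Routes URIs to a backend by scheme so callers never name a cloud."""

    def __init__(self, backends: Dict[str, ObjectStore]) -> None:
        self._backends = backends

    def _resolve(self, uri: str) -> Tuple[ObjectStore, str, str]:
        parsed = urlparse(uri)
        return self._backends[parsed.scheme], parsed.netloc, parsed.path.lstrip("/")

    def read(self, uri: str) -> bytes:
        store, bucket, key = self._resolve(uri)
        return store.read(bucket, key)

    def write(self, uri: str, data: bytes) -> None:
        store, bucket, key = self._resolve(uri)
        store.write(bucket, key, data)

    def copy(self, src_uri: str, dst_uri: str) -> None:
        """Move data across backends without the caller changing any code."""
        self.write(dst_uri, self.read(src_uri))


# The scheme-to-backend mapping is configuration, not application code.
dal = DataAccessLayer({"s3": InMemoryStore(), "gs": InMemoryStore()})
dal.write("s3://sales/2023/report.csv", b"region,total\nEMEA,42\n")
dal.copy("s3://sales/2023/report.csv", "gs://archive/report.csv")
print(dal.read("gs://archive/report.csv"))
```

Pointing a scheme at a different backend is a configuration change in this design; the calls to `read`, `write` and `copy` stay the same.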
Ensure data is protected, trusted and compliant at all times
In a hybrid cloud architecture, data access policies and lineage must be consistent across private and public clouds; otherwise, gaps will exist in audit logs, leading to a compliance nightmare. The challenge is that each cloud in use (US public clouds, EU public clouds or private clouds) can have different governance rules around access and control.
A hybrid data platform must also provide cross-platform security and governance. Consistent security and governance based on metadata across all clouds are vital for hybrid cloud success and a requirement for the ongoing mobility of data and services.
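The sketch below illustrates, in simplified form, what "consistent" can mean in practice: a single policy definition is evaluated the same way in every environment, and every decision is written to one audit log. The `PolicyEngine` class, the role and dataset names and the environment labels are hypothetical, and the example deliberately ignores region-specific rules that a real deployment would layer on top.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class PolicyEngine:
    """One policy definition, enforced identically in every environment."""
    # dataset -> roles allowed to read it (defined once, applied everywhere)
    read_policy: Dict[str, Set[str]]
    # (environment, role, dataset, decision) for every access check
    audit_log: List[Tuple[str, str, str, str]] = field(default_factory=list)

    def can_read(self, role: str, dataset: str, environment: str) -> bool:
        allowed = role in self.read_policy.get(dataset, set())
        # Log every decision with its environment so no cloud is a blind spot.
        self.audit_log.append(
            (environment, role, dataset, "ALLOW" if allowed else "DENY"))
        return allowed


engine = PolicyEngine({"customer_pii": {"compliance", "support"}})
environments = ["private-cloud", "us-public-cloud", "eu-public-cloud"]

# The same request yields the same answer wherever it is made.
decisions = {env: engine.can_read("marketing", "customer_pii", env)
             for env in environments}
support_ok = engine.can_read("support", "customer_pii", "eu-public-cloud")

for entry in engine.audit_log:
    print(entry)
```

Because the policy and the log are shared, an auditor can reconstruct who was allowed or denied access in any environment from a single record.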
Selecting the right vendor in the crowded cloud database management systems (CDMS) market can be daunting. To help simplify the process, I offer three solution attributes that, in my view, must be present for a successful vendor partnership.
First, there must be support for all the tools your team uses to access data. To avoid a considerable training effort, the vendor must offer self-service analytics across a range of tools such as Tableau, Power BI and Jupyter Notebook, whether on-premises or on Google Cloud, AWS or Azure.
Second, anything that touches the end user in terms of an API, file format or engine should run on community-supported open-source software. Consider an open-source data platform that has cloud-agnostic capabilities and enables easy migration of data assets, metadata and workloads.
The third attribute is interoperability. You must be able to securely move data, applications and users bidirectionally among on-premises infrastructure and multiple clouds without changing a single line of code, regardless of where the data resides. Connected to this, the platform should also support the deployment of containerized workloads built on technologies such as Docker and Kubernetes.
Data democratization is now possible in a hybrid world. The power of data previously kept in the hands of a few data scientists is now available to non-data experts via a hybrid data platform. Because data democratization gives data access to every employee who needs it, it will—if done right—catapult your company to new heights of performance.