The future is hybrid and multi-cloud. If you subscribe to that future, and most now do, then it follows that data management must be hybrid as well. Effective data management is vital for successful artificial intelligence (AI) projects and digital transformation. The myriad point solutions we have collected over the years must go, as integration costs and the lack of hybrid support will make them cost-prohibitive.
Cloudera is a leading provider of a data platform that spans hybrid and multi-cloud environments. Cloudera enables cloud-native data analytics across the whole data lifecycle – data distribution, data engineering, data warehousing, transactional data, streaming data, data science, and machine learning – analytics that can be written once and run anywhere.
In this article, we dive deeply into stream processing, specifically Cloudera Stream Processing (CSP), which provides advanced messaging, stream processing, and analytics capabilities within the Cloudera DataFlow (CDF) platform.
Processing data in motion
Once upon a time, applications would perform computations on the data stored in a database. Stream processing is a data management technique that computes the data directly as it is produced or received. The idea of stream processing has been around for decades. Still, it is now easier to implement with the introduction of open-source tools such as Apache Kafka and no code/low code interfaces, such as those provided by Cloudera.
The example of bank fraud best demonstrates the value. The bank data analyst must stop fraud as soon as it happens. Streaming data on transactions moving at high speed, combined with data at rest such as customer account information, contributes to defining the characteristics of a suspicious transaction. A suspicious transaction might involve multiple access attempts or new device registrations, transactions from different locations too far apart, and transactions happening at or near the customer credit limit. The goal is to monitor in real time for red flags and freeze the transaction attempt immediately. Later, the data can populate a fraud dashboard and be scored against an AI model. When the customer confirms the transaction was authorized or unauthorized, that then becomes high-quality training data for future AI models.
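To make the "locations too far apart" red flag concrete, here is a minimal sketch of a stateful rule applied to a stream of transactions as they arrive. This is an illustration of the technique only, not CSP's implementation; the `Txn` record, field names, and the 900 km/h threshold are assumptions chosen for the example.

```python
# Illustrative "impossible travel" fraud rule over an in-memory stream.
from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class Txn:
    customer: str
    ts: float    # seconds since epoch
    lat: float
    lon: float
    amount: float

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def detect_impossible_travel(stream, max_speed_kmh=900):
    """Yield each transaction whose implied travel speed from the customer's
    previous transaction exceeds max_speed_kmh (faster than a jetliner)."""
    last = {}  # per-customer state: the previous transaction seen
    for txn in stream:
        prev = last.get(txn.customer)
        if prev is not None:
            hours = max((txn.ts - prev.ts) / 3600, 1e-9)
            speed = haversine_km(prev.lat, prev.lon, txn.lat, txn.lon) / hours
            if speed > max_speed_kmh:
                yield txn  # freeze candidate: flag as it happens
        last[txn.customer] = txn

stream = [
    Txn("alice", 0,    40.7, -74.0, 25.0),   # New York
    Txn("alice", 1800, 51.5,  -0.1, 900.0),  # London, 30 minutes later
    Txn("bob",   0,    48.9,   2.3, 12.0),   # Paris
    Txn("bob",   7200, 50.8,   4.3, 30.0),   # Brussels, 2 hours later
]
flagged = list(detect_impossible_travel(stream))
```

The key property, which carries over to real stream processors, is that the rule keeps a small amount of per-customer state and decides on each event immediately, rather than waiting for the data to land in a warehouse.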
Other stream processing use cases include network threat analysis, manufacturing intelligence, commerce optimization, real-time offers, and instantaneous loan approvals.
Latency is the enemy of fast decision making
In business, there is a limited time window to act on opportunities. The trigger might be any of a variety of business events, such as a customer inquiry, a network alert, or a fraudulent transaction. In every case, the ability to positively impact the outcome diminishes rapidly with time.
Between a business event and the moment of action, there are multiple sources of latency acting as headwinds. With too much latency, the data can be completely cold by the time action is taken. The first source of latency comes from data capture. A vibration sensor, for example, will have latency between the reading and the capture of the data. The second source of latency is introduced during raw data processing, integration, and aggregation to derive contextualized, valuable information for a business process. The extent of latency from this source varies wildly by use case and is the primary target of real-time stream processing.
The final source of latency is the human decision-maker acting on the data. While automation is an option, some use cases require that a human is in the loop.
Bring the processing to the data
Why do we have all this latency? Slow processing erodes the value of high-speed data. Streaming architectures and analytical processing systems were built for entirely different purposes in a different era, and today they essentially remain siloed. The goal of streaming architectures was primarily to ingest, store, and make new data types available to various applications, organizing them by topics in a publish-subscribe model to maximize flexibility. Streaming architectures, while robust, tend to be code-heavy and lack analytics accessible to the average business user. Data processing occurs after the data lands somewhere, and often after the data has been moved a second time to a staging area.
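The publish-subscribe topic model described above can be sketched in a few lines. This toy broker is purely illustrative (the pattern Kafka implements durably and at scale); `MiniBroker` and its methods are invented for this example and are not Kafka APIs.

```python
# Minimal in-memory sketch of the publish-subscribe topic model:
# producers publish to named topics; any number of consumers subscribe
# independently, which is what maximizes flexibility.
from collections import defaultdict

class MiniBroker:
    def __init__(self):
        self.log = defaultdict(list)          # topic -> append-only message log
        self.subscribers = defaultdict(list)  # topic -> consumer callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        self.log[topic].append(message)       # messages are stored first...
        for callback in self.subscribers[topic]:
            callback(message)                 # ...then fanned out to consumers

broker = MiniBroker()
seen = []
broker.subscribe("transactions", seen.append)      # e.g., a fraud application
broker.subscribe("transactions", lambda m: None)   # e.g., a dashboard, independently
broker.publish("transactions", {"customer": "alice", "amount": 25.0})
```

Note that the broker itself does no analysis; it only stores and distributes. That is exactly the gap the article identifies: the analytics happen elsewhere, after the data has landed.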
Streaming data through a traditional analytics stack adds tremendous processing and decision latency. Often the data is cold by the time some action is required. The entire system's performance determines the speed at which action can happen, not just the rate at which the data moves.
The Cloudera solution is to stop bringing the data to the processing and get the processing to the data.
A comprehensive solution for stream processing
Apache Flink and Apache Kafka power Cloudera Stream Processing (CSP) to provide a complete stream management and stateful processing solution. In CSP, Kafka serves as the storage streaming substrate, and Flink is the core in-stream processing engine that supports SQL and REST interfaces.
There are two stream management services for Kafka monitoring and replication. Streams Messaging Manager (SMM) provides a single monitoring dashboard for a Kafka cluster. Streams Replication Manager (SRM) allows enterprises to implement cross-cluster Kafka topic replication.
CSP’s unified processing takes streaming data or data at rest and immediately makes that data available as virtual tables with an enforceable schema. CSP then uses a powerful in-stream processing engine to process the raw data, aggregating it as needed. Data analysts can define business logic via a no code/low code SQL Stream Builder interface that lowers the barriers imposed by code-heavy technologies. CSP allows companies to decouple business logic from coding, creating real-time intelligent applications with domain experts as a critical part of the process.
Domain experts can make changes without rewriting the whole application. Data analysts can also use SQL Stream Builder to publish rolling snapshots of the results of the continuous queries available via API for any event-driven application visualization tool or even real-time scoring for AI models.
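To illustrate what a "rolling snapshot of a continuous query" means, here is a sketch of a tumbling-window aggregation whose latest totals are materialized into a structure an API could serve. The window size, field names, and function are assumptions for the example, not SQL Stream Builder's actual interface.

```python
# Sketch of a continuous query's rolling snapshot: per-customer totals
# over tumbling (non-overlapping, fixed-size) time windows.
from collections import defaultdict

def tumbling_window_totals(events, window_secs=60):
    """Aggregate amount per (customer, window start); the returned dict is
    the materialized snapshot a REST endpoint could serve to dashboards
    or model-scoring services."""
    snapshot = defaultdict(float)
    for event in events:
        window_start = int(event["ts"] // window_secs) * window_secs
        snapshot[(event["customer"], window_start)] += event["amount"]
    return dict(snapshot)

events = [
    {"customer": "alice", "ts": 5,  "amount": 10.0},
    {"customer": "alice", "ts": 42, "amount": 15.0},
    {"customer": "alice", "ts": 75, "amount": 7.0},  # falls in the next window
]
snapshot = tumbling_window_totals(events)
```

In a real deployment the query runs continuously and the snapshot is refreshed as each window closes; the point is that downstream consumers poll a simple, always-current result set instead of reprocessing the stream themselves.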
There is untapped data potential everywhere. Wherever there is a combination of data ingestion, data silos, and data that decays rapidly, that is where stream processing maximizes utilization of the data and removes the friction that makes innovation challenging.
CSP delivers speed over three different time horizons. In the short term, there is the speed of operations, such as moving fast to stop fraud. Over the mid-term, CSP also has a significant speed advantage in the pace of development and deployment of new solutions, enabled by significantly reducing technical barriers and empowering domain experts to be critical contributors. The long-term advantage is the speed of digital transformation: reimagining business processes from the ground up, and sometimes reimagining entirely new business models. CSP will improve the speed of digital transformation by providing the high-quality data needed for AI-powered automation and establishing the robust data infrastructure required to execute any of those efforts in real time.
I continue to believe Cloudera is proving to be one of the world's top enterprise data platform companies. I base that conclusion on the fact that, in my estimation, it is the only game in town for end-to-end data management across hybrid, multi-cloud, and on-premises.