By Chris Latimer, vice president, product management, DataStax
There’s a lot of talk about the importance of streaming data and event-driven architectures right now. You might have heard of it, but do you really know why it’s so important to a lot of enterprises? Streaming technologies unlock the ability to capture insights and take instant action on data that’s flowing into your organization; they’re a critical building block for developing applications that can respond in real-time to user actions, security threats, or other events. In other words, they’re a key part of building great customer experiences and driving revenue.
Here’s a quick breakdown of what streaming technologies do, and why they’re so important to enterprises.
Data in motion
Organizations have gotten pretty good at creating a relatively complete view of so-called “data at rest” — the kind of information that’s often captured in databases, data warehouses, and even data lakes to be used immediately (in “real time”) or to fuel applications and analysis later.
Increasingly, though, data driven by activities, actions, and events that happen in real time across an organization pours in from mobile devices, retail systems, sensor networks, and telecommunications call-routing systems.
While this “data in motion” might ultimately get captured in a database or other store, it’s extremely valuable while it’s still on the move. For a bank, data in motion might enable it to detect fraud in real time and act upon it instantly. Retailers can make product recommendations based on a consumer’s searching or purchasing history, the instant someone visits a web page or clicks on a particular item.
Consider Overstock, a U.S. online retailer. It must consistently deliver engaging customer experiences and derive revenue from in-the-moment monetization opportunities. To do that, Overstock needed the ability to make lightning-fast decisions based on data arriving in real time (generally, brands have 20 seconds to connect with customers before they move on to another website).
“It’s like a self-driving car,” says Thor Sigurjonsson, Overstock’s head of data engineering. “If you wait for feedback, you’re going to drive off the road.”
The event-driven architecture
To maximize the value of its data as it’s created, instead of waiting hours, days, or even longer to analyze it once it’s at rest, Overstock needed a streaming and messaging platform. Such a platform would enable the company to employ real-time decision-making to deliver personalized experiences and recommend products likely to be well received by customers at the perfect time (really fast, in other words).
Data messaging and streaming are key parts of an event-driven architecture, which is a software architecture or programming approach built around the capture, communication, processing, and persistence of events: mouse clicks, sensor outputs, and the like.
Processing streams of data involves taking actions on a series of data that originates from a system that continuously creates “events.” The ability to query this non-stop stream and find anomalies, recognize that something important has happened, and act on it quickly and in a meaningful way, is what streaming technology enables.
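To make that concrete, here is a minimal Python sketch of querying a non-stop stream for anomalies. It is an illustration only and is not tied to any particular streaming platform: the `sensor_stream` source, the 20-reading window, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from __future__ import annotations

from collections import deque
from statistics import mean, stdev
from typing import Iterable, Iterator


def detect_anomalies(events: Iterable[float], window: int = 20,
                     threshold: float = 3.0) -> Iterator[tuple[int, float]]:
    """Yield (index, value) for readings far from the recent rolling mean."""
    recent: deque[float] = deque(maxlen=window)
    for i, value in enumerate(events):
        if len(recent) >= 5:  # wait for a little history before judging
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value  # react now, while the stream keeps flowing
                continue        # keep the outlier out of the baseline
        recent.append(value)


def sensor_stream() -> Iterator[float]:
    """Hypothetical source: steady readings with one spike at position 30."""
    for i in range(50):
        yield 500.0 if i == 30 else 20.0 + (i % 5) * 0.1


alerts = list(detect_anomalies(sensor_stream()))
print(alerts)  # [(30, 500.0)]
```

Each reading is judged the moment it arrives, so the alert fires when the spike appears rather than after a collection period ends.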
Stream processing stands in contrast to batch processing, where an application stores data after ingesting it, processes it, and then stores the processed result or forwards it to another application or tool. Processing might not start until after, say, 1,000 data points have been collected. That’s too slow for the kind of applications that require reactive engagement at the point of interaction.
It’s worth pausing to break that idea down:
- The point of interaction could be a system making an API call, or a mobile app.
- Engagement is defined as adding value to the interaction. It could be giving a tracking number to a customer after they place an order, a product recommendation based on a user’s browsing history, or a billing authorization or service upgrade.
- Reactive means the engagement action happens in real-time or near-real-time; this translates to hundreds of milliseconds for human interactions, while machine-to-machine interactions that occur in an energy utility’s sensor network, for example, might not require such a near-real-time response.
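A rough back-of-the-envelope sketch in Python shows why a 1,000-event batch struggles to be "reactive." Everything here is an assumption made for illustration: events arrive every 10 milliseconds, the reactive budget is 300 milliseconds, a streaming handler takes 5 milliseconds per event, and the batch job's own processing time is ignored (which flatters the batch side).

```python
BUDGET_MS = 300  # assumed "reactive" budget for a human-facing interaction


def batch_waits_ms(arrivals_ms: list[int], batch_size: int) -> list[int]:
    """Wait per event when processing starts only once a full batch exists."""
    waits: list[int] = []
    for start in range(0, len(arrivals_ms) - batch_size + 1, batch_size):
        batch = arrivals_ms[start:start + batch_size]
        run_at = batch[-1]  # the batch runs when its last event lands
        waits.extend(run_at - arrived for arrived in batch)
    return waits


def stream_waits_ms(arrivals_ms: list[int], handler_ms: int) -> list[int]:
    """Wait per event when each one is handled as soon as it arrives."""
    return [handler_ms for _ in arrivals_ms]


arrivals = list(range(0, 20_000, 10))  # 2,000 events, one every 10 ms
batch = batch_waits_ms(arrivals, batch_size=1000)
stream = stream_waits_ms(arrivals, handler_ms=5)

print("worst batch wait (ms):", max(batch))    # 9990
print("worst stream wait (ms):", max(stream))  # 5
print("batch events within budget:", sum(w <= BUDGET_MS for w in batch))    # 62
print("stream events within budget:", sum(w <= BUDGET_MS for w in stream))  # 2000
```

Under these assumptions, the first event in each batch waits nearly ten seconds for a response, and only the last few dozen events in a batch land inside the budget.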
When a message queue isn’t enough
Some enterprises have recognized that they need to derive value from their data-in-motion and have assembled their own event-driven architectures from a variety of technologies, including message-oriented middleware systems like Java Message Service (JMS) or message queue (MQ) platforms.
But these platforms were built on a fundamental premise that the data they processed was transient and should be immediately discarded once each message had been delivered. This essentially throws away a highly valuable asset: data that’s identifiable as arriving at a particular point in time. Time-series information is critical for applications that involve asynchronous analysis, like machine learning. Data scientists can’t build machine learning models without it. A modern streaming system needs to not only pass events along from one service to another, but also store them in a way that preserves their value for later use.
The system also needs to be able to scale to manage terabytes of data and millions of messages per second. The old MQ systems are not designed to do either of these.
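The difference between "deliver and discard" and "deliver and retain" can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real MQ or streaming product is implemented: `TransientQueue`, `RetainedLog`, and the payment events are hypothetical names invented for the example.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    offset: int       # position in the log
    timestamp: float  # when the event arrived
    payload: str


class TransientQueue:
    """Classic MQ behavior: a message is gone once it has been delivered."""

    def __init__(self) -> None:
        self._items: deque[str] = deque()

    def publish(self, payload: str) -> None:
        self._items.append(payload)

    def receive(self) -> str | None:
        return self._items.popleft() if self._items else None


class RetainedLog:
    """Streaming behavior: events are appended, timestamped, and kept."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def publish(self, payload: str, timestamp: float) -> int:
        event = Event(len(self._events), timestamp, payload)
        self._events.append(event)
        return event.offset

    def read_from(self, offset: int) -> list[Event]:
        return self._events[offset:]  # any reader can replay from any point

    def between(self, start: float, end: float) -> list[Event]:
        return [e for e in self._events if start <= e.timestamp < end]


queue, log = TransientQueue(), RetainedLog()
for ts, payload in [(100.0, "pay-1"), (101.0, "pay-2"), (105.0, "pay-3")]:
    queue.publish(payload)
    log.publish(payload, ts)

delivered = [queue.receive() for _ in range(3)]
late_reader = queue.receive()             # nothing left for a later consumer
replayed = log.read_from(0)               # the full history is still there
training_slice = log.between(100.0, 102.0)  # time-series view for analysis
print(delivered, late_reader, len(replayed), len(training_slice))
```

A machine learning job that shows up after the fact finds the queue empty, while the log can still hand it every event along with the time it arrived.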
Pulsar and Kafka: The old guard and the unified, next-gen challenger
As I touched upon above, there are a lot of choices available when it comes to messaging and streaming technology.
They include various open-source projects like RabbitMQ, ActiveMQ, and NATS, along with proprietary solutions such as IBM MQ or Red Hat AMQ. Then there are the two well-known platforms for handling real-time data: Apache Kafka, a very popular technology that has become almost synonymous with streaming; and Apache Pulsar, a newer streaming and message queuing platform.
Both of these technologies were designed to handle the high throughput and scalability that many data-driven applications require.
Kafka was developed by LinkedIn to facilitate data communication between different services at the job networking company and became an open source project in 2011. Over the years it’s become a standard for many enterprises looking for ways to derive value from real-time data.
Pulsar was developed by Yahoo! to solve messaging and data problems faced by applications like Yahoo! Mail; it became a top-level Apache open source project in 2018. While still catching up to Kafka in popularity, it has more features and functionality. And it carries a very important distinction: MQ solutions are solely messaging platforms, and Kafka handles only an organization’s streaming needs, whereas Pulsar handles both, making it the only unified platform available.
Pulsar can handle real-time, high-rate use cases like Kafka, but it’s also a more complete, durable, and reliable solution when compared to the older platform. To have streaming and queuing (an asynchronous communications protocol that enables applications to talk to one another), for example, a Kafka user would need to bolt on something like RabbitMQ or other solutions. Pulsar, on the other hand, can handle many of the use cases of a traditional queuing system without add-ons.
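Here is a small Python sketch of what "streaming and queuing on one topic" means in practice. It is an in-memory toy, not the Pulsar client API: the `Topic` class, the subscription names, and the round-robin dispatch are assumptions made for illustration. In Pulsar itself, this behavior is governed by named subscriptions and their subscription type (for example, Exclusive or Shared).

```python
from __future__ import annotations


class Topic:
    """Toy topic: every subscription sees every message; consumers that
    share one subscription split the messages between them."""

    def __init__(self) -> None:
        self._subscriptions: dict[str, list[list[str]]] = {}
        self._next: dict[str, int] = {}

    def subscribe(self, subscription: str) -> list[str]:
        """Attach a consumer to a named subscription and return its inbox."""
        inbox: list[str] = []
        self._subscriptions.setdefault(subscription, []).append(inbox)
        self._next.setdefault(subscription, 0)
        return inbox

    def publish(self, message: str) -> None:
        for name, inboxes in self._subscriptions.items():
            # Round-robin within a subscription (a simplifying assumption).
            turn = self._next[name]
            inboxes[turn % len(inboxes)].append(message)
            self._next[name] = turn + 1


orders = Topic()
analytics = orders.subscribe("analytics")    # streaming: one ordered reader
worker_a = orders.subscribe("order-workers")  # queuing: two workers
worker_b = orders.subscribe("order-workers")  # sharing one subscription

for n in range(1, 5):
    orders.publish(f"order-{n}")

print(analytics)  # ['order-1', 'order-2', 'order-3', 'order-4']
print(worker_a)   # ['order-1', 'order-3']
print(worker_b)   # ['order-2', 'order-4']
```

The same published events feed both patterns: the analytics reader gets the full ordered stream, while the two workers divide the load the way consumers of a traditional queue would.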
Pulsar carries other advantages over Kafka, including higher throughput, better scalability, and geo-replication, which is particularly important when a data center or cloud region fails. Geo-replication enables an application to publish events to another data center without interruption, which keeps the app from going down and keeps an outage from affecting end users.
In the case of Overstock, Pulsar was chosen as the retailer’s streaming platform. With it, the company built what its head of data engineering, Sigurjonsson, describes as an “integrated layer of data and connected processes governed by a metadata layer supporting deployment and utilization of integrated reusable data across all environments.”
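The failover idea behind geo-replication can be sketched in Python as follows. This is a deliberately simplified model and not Pulsar's actual replication mechanism: the `Cluster` class, the region names, and the `publish_with_failover` helper are hypothetical, replication here is immediate, and a cluster that was down is not backfilled when it returns.

```python
from __future__ import annotations


class Cluster:
    """Toy data center that copies each published event to its peers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.log: list[str] = []
        self.peers: list[Cluster] = []

    def publish(self, event: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.log.append(event)
        for peer in self.peers:
            if peer.up:
                peer.log.append(event)  # replicate to the other region


def publish_with_failover(clusters: list[Cluster], event: str) -> str:
    """Try each cluster in order; return the name of the one that accepted."""
    for cluster in clusters:
        try:
            cluster.publish(event)
            return cluster.name
        except ConnectionError:
            continue
    raise ConnectionError("no cluster available")


east, west = Cluster("us-east"), Cluster("us-west")
east.peers, west.peers = [west], [east]

first = publish_with_failover([east, west], "event-1")
east.up = False  # simulate a regional outage
second = publish_with_failover([east, west], "event-2")

print(first, second)  # us-east us-west
print(west.log)       # ['event-1', 'event-2']
```

Because the surviving region already holds the replicated events, the application keeps publishing and consumers there see an unbroken history.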
In other words, Overstock now has a way to understand and act upon real-time data organization-wide, enabling the company to impress its customers with magically fast, relevant offers and personalized experiences.
As a result, teams can reliably transform data in flight in a way that is easy to use and requires less data engineering. This makes it that much easier to delight their customers—and ultimately drive more revenue.
About Chris Latimer:
Chris is a technology executive whose career spans over twenty years in a variety of roles including enterprise architecture, technical presales, and product management. He is currently Vice President of Product Management at DataStax where he is focused on building the company’s product strategy around cloud messaging and event streaming. Prior to joining DataStax, Chris was a senior product manager at Google where he focused on APIs and API Management in Google Cloud. Chris is based near Boulder, CO, and when not working, he is an avid skier and musician and enjoys the never-ending variety of outdoor activities that Colorado has to offer with his family.