BrandPosts are written and edited by members of our sponsor community. BrandPosts create an opportunity for an individual sponsor to provide insight and commentary from their point-of-view directly to our audience. The editorial team does not participate in the writing or editing of BrandPosts.
By Sunil Jain
Since the advent of Hadoop, enterprises have made remarkable progress on big data analytics. Massive amounts of stored data are “batch” processed to extract hidden value. More recently, we are witnessing a surge in demand for real-time analytics on continuously streaming data. Enterprises are finding “stream” processing very useful in applications such as detecting anomalies and urgent conditions, automating timely actions, uncovering hidden business value, preventing losses and attacks, and improving revenue through new products and services. The logical next step is to combine the two methodologies, running batch and stream processing concurrently to analyze static historical data and dynamic streaming data together.
However, several challenges come into play when putting such a solution into practice:
a) Programming models for computation constructs that can combine batch and stream processing are still in their infancy;
b) Applications must run in a highly distributed manner in order to read and write large data sets across a variety of storage systems, while maintaining a real-time, in-memory data store;
c) Data must be aggregated from disparate sources that are scattered in space and time, such as IoT edge devices; and
d) The skill set needed to cobble together such applications reliably remains a tribal art.
The main ingredient in Nautilus is Pravega, an ordered real-time data store with exactly-once semantics that has been created from the ground up. The throughput capacity of this data store scales automatically and intelligently to match the incoming traffic. Once the buffering capacity of Pravega has been reached, it tiers data (for both reads and writes) into colder storage via S3 or HDFS, with no fixed limit on how much can be retained. Pravega uses Apache ZooKeeper and Apache BookKeeper, and has been contributed back to the community as open source code.
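To make the tiering idea concrete, here is a minimal conceptual sketch in Python. It is not Pravega code and does not use any Pravega API: the `TieredLog` class, its `hot_capacity` parameter, and the in-memory `cold` list standing in for S3 or HDFS are all assumptions made for illustration. It only shows how a reader could replay events in write order without knowing which tier holds them. Exactly-once semantics and automatic scaling are not modeled.

```python
from collections import deque


class TieredLog:
    """Toy append-only log (hypothetical, for illustration only).

    Recent events stay in a bounded in-memory "hot" buffer; older events
    are moved to a "cold" list that stands in for S3 or HDFS.
    """

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = deque()  # in-memory tier holding the most recent events
        self.cold = []      # stand-in for long-term storage

    def append(self, event):
        self.hot.append(event)
        # Spill the oldest events to the cold tier once the hot tier is full.
        while len(self.hot) > self.hot_capacity:
            self.cold.append(self.hot.popleft())

    def read_from(self, offset):
        """Yield events in write order from a global offset,
        regardless of which tier currently holds them."""
        total = len(self.cold) + len(self.hot)
        for i in range(offset, total):
            if i < len(self.cold):
                yield self.cold[i]
            else:
                yield self.hot[i - len(self.cold)]


log = TieredLog(hot_capacity=3)
for n in range(7):
    log.append(f"event-{n}")

# A reader replays the full history in order, across both tiers.
replayed = list(log.read_from(0))
```

In this toy version, the four oldest events end up in the cold tier and the three newest stay hot, yet `read_from` presents them as one ordered sequence.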
Apache Flink is the other integral ingredient of the Nautilus stack. Unlike many other analytics processors on the market, Flink is a general-purpose compute engine that combines real-time and batch processing in a single programming model. Users can write and run anything from simple to very sophisticated code through interactive environments such as Apache Zeppelin. Users more familiar with SQL can work with Flink's Table APIs, which are built on top of Apache Calcite, a SQL parser and optimizer framework. A special connector to sit between Pravega and Flink is in the works. We are also ironing out the HDFS/HCFS interfacing to make buffering, savepointing, and recovery of Flink jobs easier and more reliable.
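The idea of a single programming model for batch and stream processing can be illustrated with a small Python sketch. This is not Flink code and makes no claim about how Flink works internally. The `rolling_average` function, the `sensor_feed` generator, and the sample values are hypothetical. The point is only that one piece of logic can run unchanged over a bounded data set and over an unbounded source.

```python
import itertools
import random


def rolling_average(readings, window=3):
    """Yield the mean of the last `window` readings for each new reading.

    Works on any iterable, whether finite (batch) or endless (stream).
    """
    recent = []
    for value in readings:
        recent.append(value)
        if len(recent) > window:
            recent.pop(0)
        yield sum(recent) / len(recent)


# "Batch": a bounded data set that is already stored.
historic = [10.0, 12.0, 14.0, 16.0]
batch_result = list(rolling_average(historic))


# "Stream": an unbounded source, here a hypothetical sensor simulated
# with seeded random numbers.
def sensor_feed(seed=42):
    rng = random.Random(seed)
    while True:
        yield rng.uniform(10.0, 20.0)


# The same function consumes the endless feed; only the first five
# results are taken so that the example terminates.
stream_result = list(itertools.islice(rolling_average(sensor_feed()), 5))
```

Because the computation is written against a generic sequence of events, nothing in `rolling_average` needs to know whether its input ever ends.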
Nautilus is built on an Apache Mesos-based, easy-to-scale private cloud environment that tightly integrates Pravega and Flink along with all their dependencies. The solution stack is fully containerized, deployable on commodity hardware, and built for ease of use, high availability, performance, and fault tolerance in globally distributed settings.
Potential use cases for combined batch and stream processing are plentiful: smart meters, satellite imagery feeds, weather forecasting, price fluctuation forecasting, energy trading insights, power outage detection and prediction, load shedding, predictive maintenance, cargo protection, insurance customer profiling, call center customizations, spatial-temporal targeting, competitive intelligence, automated trading, risk management, personalized care, and smarter surveillance, to mention just a few. Unprecedented value starts to emerge as soon as any of these use cases is jazzed up with pattern detection, artificial intelligence, and/or machine learning algorithms.
However, without a plan, these use cases can be fraught with nuances. For example, streaming data can at times be considered perishable. Be it live bidding for stocks or ads, the choice of an answer on an online test, the glimpse of a wanted terrorist in a webcam frame, or the arrhythmic heartbeat of a cardiac patient, if the useful information is not extracted and acted upon within seconds, it might no longer matter. The key is to ingest the data (e.g. using Pravega), compute actionable insights from both the perishable data and the accumulated wisdom (e.g. using Flink), and react to them in real time. Time is money, and timeliness is of the essence for business.
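The ingest, compute, and react loop described above can be sketched in a few lines of Python. This is a conceptual illustration only, using neither Pravega nor Flink. The heart-rate values, the `MAX_AGE_SECONDS` freshness limit, the `THRESHOLD` of three standard deviations, and the `react` function are all assumptions chosen for the example. Accumulated history supplies a baseline, live events are compared against it, and events that have already "perished" are skipped.

```python
from statistics import mean, stdev

# Accumulated wisdom: historical resting heart-rate samples
# (made-up values for illustration).
history = [62, 64, 61, 63, 65, 62, 64, 63]
baseline, spread = mean(history), stdev(history)

MAX_AGE_SECONDS = 5.0  # assumed freshness limit; older events have perished
THRESHOLD = 3.0        # assumed alert threshold, in standard deviations


def react(events, now):
    """Return alerts for fresh events that deviate strongly from the baseline."""
    alerts = []
    for timestamp, value in events:
        if now - timestamp > MAX_AGE_SECONDS:
            continue  # too old to act on, so skip it
        score = abs(value - baseline) / spread
        if score > THRESHOLD:
            alerts.append((timestamp, value))
    return alerts


# Live events as (timestamp in seconds, beats per minute).
live = [(100.0, 64), (101.0, 110), (90.0, 120), (102.0, 63)]
alerts = react(live, now=103.0)
```

Here the reading of 110 at time 101.0 is flagged. The reading of 120 at time 90.0 is just as abnormal but arrived too late to matter under the assumed five-second limit.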
The industry is about to hit yet another inflection point. The way we think about data analytics is about to change. It won’t be subtle. I bet that by the end of this year, car companies will be able to run a query such as: “show me the real-time locations of all xyz model cars in the Portland metro area, with >90% probability of breakdown in the next 48 hours”, and monetize the extracted business value! Thought leaders are already preparing their enterprises for a smooth upward ride on the streaming analytics wave. Is your company getting ready?