by Raman Mehta

My Big Data Has Schizophrenia

Nov 03, 2014 | 4 mins
Analytics, Big Data, Hadoop

There are many business models that lend themselves to monetizing the convergence of the Internet of Things (IoT) and big data. The critical success factor is generating context-sensitive, truly actionable alerts.


In my previous blog post, we discussed the hype vs. reality of the Internet of Things (IoT) and introduced the TaaS (Internet of Things as a Service) framework. On the industrial side of the IoT, some business models lend themselves to monetizing IoT. Recently, Kaggle partnered with a major industrial conglomerate to run a public quest in which developers and data scientists competed to create the best new algorithms for reducing air travel delays.

As a frequent flier, I am always amazed by how often flights have to go into holding patterns and just circle the airport. Many factors cause these delays, including weather patterns, traffic congestion, and gate availability. One of the interesting statistics that came out of this quest was that even a 10-mile reduction in the average flight can save airlines millions of dollars in fuel costs.

This is a great example of where IoT and big data converge. The algorithms involved holistic analysis of flight history events, flight plans, flight tracks (actual GPS information), weather, and FAA programs. The real benefit of IoT is that a flight's course can be corrected in real time, based on sensor data and on prior patterns unearthed by big data analytics.

This brings up an interesting paradox: the schizophrenic nature of big data. To provide actionable insight, analysis of real-time streaming data is a must. However, point-in-time streaming sensor data is not of much use unless we know the historical interdependencies and patterns of interaction in the data.

Traditionally, big data solutions like Hadoop rely on batch processing, using a variety of MapReduce architectures built on HDFS. Recently, stream processing systems such as Apache Spark have been getting a lot of attention. We need a unified solution that relies on both batch and stream processing. As an IT leader, the last thing I want is an architecture where I have to maintain multiple code bases to solve a single business problem. One approach could be to build stream processing applications on top of MapReduce and Storm or similar systems.
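One way to picture the single-code-base goal is to keep the business rule in one plain function and call it from both a batch job and a streaming loop. The sketch below is only an illustration in plain Python, not Hadoop, Spark, or Storm code. The names is_holding, batch_count, and stream_alerts, the 15-minute threshold, and the sample flight records are all assumptions made up for the example.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical threshold: minutes of circling before a flight counts as "holding".
HOLDING_THRESHOLD_MIN = 15


def is_holding(event: Dict) -> bool:
    """Shared business rule, written once and reused by both paths."""
    return event["circling_minutes"] >= HOLDING_THRESHOLD_MIN


def batch_count(history: List[Dict]) -> int:
    """Batch path: scan the full history and count holding events."""
    return sum(1 for event in history if is_holding(event))


def stream_alerts(events: Iterable[Dict]) -> Iterator[str]:
    """Streaming path: yield a flight id as soon as a holding event arrives."""
    for event in events:
        if is_holding(event):
            yield event["flight"]


# Made-up sample data, for illustration only.
history = [
    {"flight": "AA10", "circling_minutes": 5},
    {"flight": "UA22", "circling_minutes": 20},
    {"flight": "DL31", "circling_minutes": 15},
]
live = [
    {"flight": "SW44", "circling_minutes": 2},
    {"flight": "BA55", "circling_minutes": 30},
]

print(batch_count(history))        # 2
print(list(stream_alerts(live)))   # ['BA55']
```

If the definition of a holding event changes, only is_holding changes, and both the batch count and the live alerts pick it up.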

In the IoT world, one of the most critical success factors will be how we generate context-sensitive, truly actionable alerts. It is an age-old problem: people have stopped paying attention to car alarms going off in parking lots. IoT solutions need to crunch millions and millions of sensor data elements and find actionable patterns. For example, what really constitutes a fraudulent transaction on a credit card purchase? What weather patterns, along with crew skills and airport equipment, would actually cause a delay? The proposed architecture below addresses this dilemma.

Big data decision making pattern

A dashboard provides the real-time actionable alerts. The dashboard gets its calibrated feeds from a rule engine. Operating in batch mode, the rule engine constantly updates itself through data mining algorithms and machine learning. The real-time streaming data is continuously evaluated against the dynamic rules that the batch system generates. This ensures that alerts are raised only when real action is required.
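The batch-then-stream loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual rule engine. A simple "mean plus k standard deviations" threshold stands in for the data mining and machine learning step, and the names learn_rules and evaluate, the sensor name, and the readings are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, Iterable, Iterator, List, Tuple


def learn_rules(history: Dict[str, List[float]], k: float = 3.0) -> Dict[str, float]:
    """Batch step: derive one upper threshold per sensor from historical readings.

    Assumption: mean + k * sample standard deviation stands in for whatever
    model the real batch system would train.
    """
    return {sensor: mean(values) + k * stdev(values)
            for sensor, values in history.items()}


def evaluate(stream: Iterable[Tuple[str, float]],
             rules: Dict[str, float]) -> Iterator[Tuple[str, float]]:
    """Streaming step: pass through only the readings that break the current rule."""
    for sensor, value in stream:
        limit = rules.get(sensor)
        # Sensors without a learned rule are skipped rather than alerted on.
        if limit is not None and value > limit:
            yield sensor, value


# Made-up historical and live readings, for illustration only.
history = {"engine_temp": [90.0, 92.0, 91.0, 89.0, 93.0]}
rules = learn_rules(history)

live = [("engine_temp", 92.5), ("engine_temp", 120.0), ("unknown", 999.0)]
alerts = list(evaluate(live, rules))
print(alerts)   # [('engine_temp', 120.0)]
```

Re-running learn_rules on fresh history and swapping in the new dictionary is what keeps the rules dynamic. The streaming side never needs to change.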

A number of open source big data technologies come together to achieve this architecture. One of the highlights is the use of Apache Kafka, a high-throughput distributed messaging system. Kafka allows listening on multiple sensor topics and provides streaming data to Apache Storm. Apache Flume plays the role of the data transport channel that feeds both the batch and the streaming data repositories.
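To make the data flow concrete, here is a toy, in-memory sketch of the two ideas in this paragraph: a consumer listening on several sensor topics, and a transport step that feeds both a batch repository and a streaming path. It deliberately does not use the Kafka, Storm, or Flume APIs. InMemoryBroker, transport, the topic names, and the readings are all hypothetical stand-ins.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple

Record = Dict[str, float]
Handler = Callable[[str, Record], None]


class InMemoryBroker:
    """Toy stand-in for a topic-based messaging system (not the Kafka API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, record: Record) -> None:
        # Deliver the record to every handler registered for this topic.
        for handler in self._subscribers[topic]:
            handler(topic, record)


batch_store: List[Tuple[str, Record]] = []   # stand-in for the batch repository
stream_seen: List[Tuple[str, Record]] = []   # what the stream processor receives


def transport(broker: InMemoryBroker, topic: str, record: Record) -> None:
    """Toy transport channel: every record is stored for batch and published for streaming."""
    batch_store.append((topic, record))
    broker.publish(topic, record)


broker = InMemoryBroker()
# The stream processor listens on two sensor topics, but not on "humidity".
for name in ("temperature", "vibration"):
    broker.subscribe(name, lambda t, r: stream_seen.append((t, r)))

transport(broker, "temperature", {"value": 71.5})
transport(broker, "vibration", {"value": 0.02})
transport(broker, "humidity", {"value": 40.0})

print(len(batch_store), len(stream_seen))   # 3 2
```

The batch side keeps everything for later mining, while the streaming side sees only the topics it subscribed to.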

In my next blog post, I will address how this architecture can be leveraged to build sophisticated predictive intelligence systems that optimize asset performance on plant floors.