by Kris Applegate

The Evolution of Data Processing Frameworks and Using the Right Tool for the Right Job

May 05, 2016
Data Center

Leveraging the best of both batch and real-time analytics


Almost every enterprise customer we engage with in our Customer Solution Centers is embarking on a journey towards using data analytics, either to optimize the things they already do or to transform their organizations by extracting previously buried insights from their data. The common tools of the past included Relational Database Management Systems (RDBMS) queried with Structured Query Language (SQL), and these are still very much in use today where they fit. Over the last decade we’ve seen the rise of “Big Data” frameworks such as Hadoop, Cassandra, and MongoDB, which bring greater speed, capacity, and efficiency to the process of sorting through larger volumes of data. The next few years should give way to even more new tool combinations used in concert to form a fully modern data pipeline leveraging the best of both batch and real-time analytics. Lambda, SMACK stack, and PANCAKE STACK may sound like names for children’s television shows; however, they are in fact complex toolchains for the emerging trends centered around orchestrating a data pipeline.

Lambda’s basic premise is that most organizations can benefit from both a (near) real-time streaming engine and a persistent layer that is more batch focused. This pipeline design allows speedy technologies like Spark, Storm, or Impala to deliver the most up-to-date information, while Hadoop’s MapReduce delivers the persistent, long-term historical information. Unifying both technologies behind a single query layer allows your data consumers to request data from a single endpoint and to have that data joined into a single aggregate result from both layers, real time and persistent.
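The single-endpoint idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product implements it: the names `batch_view`, `speed_view`, `query`, and `query_all` are hypothetical, and the two in-memory counters stand in for results that a batch job and a streaming engine would produce in a real deployment.

```python
from collections import Counter

# Hypothetical batch view: page-view counts precomputed by a batch job
# (for example, a nightly MapReduce run) up to some cutoff time.
batch_view = Counter({"home": 1000, "pricing": 250})

# Hypothetical speed view: counts for events that arrived after the cutoff,
# maintained incrementally by a streaming engine.
speed_view = Counter({"home": 12, "signup": 3})


def query(page: str) -> int:
    """Single query endpoint: combine the batch and speed views for one key."""
    # A Counter returns 0 for keys it has never seen, so a page that is
    # missing from one (or both) views is handled without special cases.
    return batch_view[page] + speed_view[page]


def query_all() -> dict:
    """Merge both views into one aggregate result."""
    return dict(batch_view + speed_view)


if __name__ == "__main__":
    print(query("home"))
    print(query_all())
```

The caller never needs to know which layer an answer came from. When the next batch run completes, the batch view absorbs the recent events and the speed view can be reset for that period.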

The industry term “SMACK” stack refers to a toolchain that favors treating every element of data as an event and processing it in real time through distributed, low-latency tools. This toolchain combines Spark (data processing), Mesos (cluster resource management), Akka (message-driven app toolkit), Cassandra (storage engine), and Kafka (event streaming and messaging). Because it favors high write volumes, this toolchain succeeds where others struggle. Technologies like Hadoop, for example, were built with an eye towards capacity rather than performance. Using the high-speed, low-latency ingest storage layer provided by Cassandra helps alleviate the concern about “real-time” not being “fast enough”.
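To make "every element of data is an event" concrete, here is a minimal Python sketch. It does not use Spark, Akka, Cassandra, or Kafka; the `Event` and `RunningAverage` names and the sensor data are hypothetical, and a plain list stands in for an event stream. The point is only that state is updated once per event as it arrives, rather than recomputed over a stored dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event: one reading from one sensor."""
    sensor_id: str
    value: float


class RunningAverage:
    """Keeps a per-sensor average that is updated one event at a time."""

    def __init__(self) -> None:
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def handle(self, event: Event) -> None:
        # Constant work per event: no rescan of historical data.
        self._count[event.sensor_id] += 1
        self._total[event.sensor_id] += event.value

    def average(self, sensor_id: str) -> float:
        if sensor_id not in self._count:
            raise KeyError(sensor_id)
        return self._total[sensor_id] / self._count[sensor_id]


# A list standing in for an event stream.
stream = [Event("a", 1.0), Event("a", 3.0), Event("b", 10.0)]

processor = RunningAverage()
for event in stream:
    processor.handle(event)
```

In a distributed setting, the same per-event update would be spread across many workers and the state persisted to a write-optimized store, which is the role the paragraph above assigns to Cassandra.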

The acronym “PANCAKE STACK” combines a very long list of tools into an end-to-end analytics and recommendation pipeline. Combining technologies whose pedigrees descend from the likes of Facebook and Twitter, this toolchain illustrates the beauty of the open source community. Presto, Arrow, NiFi, Cassandra, Airflow, Kafka, Elasticsearch, Spark, TensorFlow, Algebird, CoreNLP, and Kibana is quite the mouthful. The true power of this toolchain comes from its ability to use machine learning and natural language processing to make the types of recommendations we are used to seeing online. Possible use cases range from recommending which customers might be impacted by a problem that another customer has identified to delivering the overall positive or negative sentiment of a repository of unstructured data sources.

We could go on for days listing interesting and leading-edge combinations of tools that customers and industry experts are using to innovate for themselves. Our customers want us to produce the best, most efficient, and most powerful hardware at a competitive price point. In turn, they want us to deliver solutions that allow them to rapidly stand up turn-key combinations of hardware, software, and services that deliver value from day one and serve as the foundation of the next great data processing toolchain. By providing venues such as Dell’s Customer Solution Centers, we can engage with customers to help demonstrate the value of the Dell portfolio. Whether through a simple briefing, a more complex whiteboarding session, or even a fully supported proof-of-concept, our team of solution experts builds our customers’ confidence in Dell as the correct partner for them as they move toward becoming an analytics-driven organization. This modern data economy will require us to adopt more and more tools to do things we never thought possible, and it illustrates how dramatically Big Data has grown as it passes its 10th birthday.

Kris Applegate is the Solution Architect, Cloud & Big Data Solutions – Dell Customer Solution Centers.

©2016 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.