by Craig Brown PhD

Transforming data with Apache Spark

Jun 03, 2019

Spark is the ideal big data tool for data-driven enterprises because of its speed, ease of use and versatility. It will help you understand your data quickly and help you make informed decisions faster.

[Image: Apache Spark. Credit: Leo Cheung]

Apache Spark is a fast data processing framework built for big data. It processes large data sets in a distributed manner across a cluster of machines. Popular for several years now, it is often presented as a successor to Hadoop. Its main advantages are its speed, ease of use, and versatility.

Apache Spark is an open source big data processing framework that enables large-scale analysis across clusters of machines. Written in Scala, Spark can process data from sources such as the Hadoop Distributed File System (HDFS), NoSQL databases, or data warehouses like Apache Hive. It also supports in-memory processing, which improves the performance of big data analytics applications, and it can fall back to conventional disk-based processing when data sets are too large for system memory.

Apache Spark Definition: Big data as the main application

Apache Spark is an open source big data processing framework built to perform sophisticated analysis and designed for speed and ease of use. It was originally developed in 2009 at UC Berkeley's AMPLab, was open-sourced in 2010, and later became an Apache Software Foundation project.

Spark has several advantages over other big data and MapReduce technologies such as Hadoop and Storm. First, Spark offers a comprehensive and unified framework for big data processing across data sets that vary both in nature (text, graph, etc.) and in type of source (batch or real-time stream). Second, Spark allows applications on Hadoop clusters to run up to 100 times faster in memory and 10 times faster on disk. It lets you quickly write applications in Java, Scala, or Python and includes a set of over 80 high-level operators. Furthermore, it can be used interactively to query data from a shell.

In addition to Map and Reduce operations, Spark supports SQL queries and data streaming and offers machine learning and graph-oriented processing capabilities. Developers can use these capabilities on their own or combine them into a complex processing pipeline.
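As a rough illustration of how a handful of high-level operators can be chained into one job, here is a minimal sketch in plain Python rather than the Spark API. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins named after Spark's `flatMap` and `reduceByKey` operators, and the two input lines are made-up sample data.

```python
from collections import defaultdict
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator


def flat_map(func: Callable, records: Iterable) -> Iterator:
    """Apply func to each record and flatten the resulting iterables."""
    for record in records:
        yield from func(record)


def reduce_by_key(func: Callable, pairs: Iterable) -> Dict:
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Classic word count expressed as a chain of three operators."""
    words = flat_map(lambda line: line.lower().split(), lines)
    pairs = map(lambda word: (word, 1), words)
    return reduce_by_key(lambda a, b: a + b, pairs)


# Made-up sample input for illustration only.
print(word_count(["Spark is fast", "Spark is versatile"]))
```

In a real Spark job the same chain would run in parallel across the partitions of a distributed data set; the sketch only shows the shape of the program.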

The Apache Spark framework can run on Hadoop 2 clusters managed by the YARN resource manager, or on Mesos. It can also be launched in standalone mode or in the cloud on Amazon's Elastic Compute Cloud (EC2) service. It provides access to various data sources such as HDFS, Cassandra, HBase, and S3.

The other key point of this framework is its massive community. Apache Spark is used by a large number of companies for big data processing. As an open source platform, it is developed by contributors from more than 200 companies, and since 2009 more than 1000 developers have contributed to the project.

With its speed of data processing, its ability to combine many types of data sources, and its support for a variety of analytical applications, Spark can unify big data applications on a single platform. This is the reason why this framework could replace Hadoop.

Features that make Apache Spark a better fit for most businesses

Spark improves on MapReduce with cheaper shuffle steps. With in-memory storage and near real-time processing, many business workloads can run many times faster than with other big data technologies. Spark also supports lazy evaluation of queries, which helps optimize the processing steps. It offers a high-level API for improved productivity and a consistent architecture model for big data solutions.
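Lazy evaluation is easier to see in a small example. The following is a minimal sketch in plain Python, not Spark's implementation: the `LazyDataset` class is hypothetical, and it simply records `map` and `filter` steps as a plan and executes nothing until the `collect` action is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source: Iterable[Any], plan: Tuple = ()):
        self._source = source
        self._plan = plan  # ordered tuple of ("map" | "filter", function)

    def map(self, func: Callable) -> "LazyDataset":
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._plan + (("map", func),))

    def filter(self, predicate: Callable) -> "LazyDataset":
        # Transformation: extend the plan, do no work yet.
        return LazyDataset(self._source, self._plan + (("filter", predicate),))

    def collect(self) -> List[Any]:
        """Action: execute the whole recorded plan in a single pass."""
        results = []
        for item in self._source:
            keep = True
            for kind, func in self._plan:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def traced_square(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * x


pipeline = LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(traced_square)
print(calls)               # still empty: only the plan exists so far
print(pipeline.collect())  # the plan executes here
```

Because the whole plan is known before any work starts, the engine is free to combine steps; here the filter and the map run in one pass over the data instead of two.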

Spark keeps intermediate results in memory rather than on disk, which is especially useful when a business needs to work repeatedly on the same dataset. The runtime engine is designed to work both in memory and on disk. Operators perform external (disk-based) operations when the data does not fit in memory, making it possible to handle data sets larger than the aggregate memory of a cluster. Spark tries to store as much as possible in memory before switching to disk, and it can work with part of the data in memory and the rest on disk.
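The memory-first, disk-second behavior can be sketched in a few lines. This is a simplified illustration in plain Python under stated assumptions, not Spark's storage engine: the `SpillingStore` class is hypothetical, the memory budget is counted in partitions rather than bytes, and each partition id is assumed to be written only once.

```python
import os
import pickle
import shutil
import tempfile
from typing import Any, Dict


class SpillingStore:
    """Keeps up to max_in_memory partitions in RAM and spills the rest to disk."""

    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory
        self._memory: Dict[str, Any] = {}
        self._disk: Dict[str, str] = {}  # partition id -> file path
        self._dir = tempfile.mkdtemp(prefix="spill_sketch_")

    def put(self, partition_id: str, data: Any) -> None:
        if len(self._memory) < self.max_in_memory:
            self._memory[partition_id] = data  # fast path: keep in memory
        else:
            path = os.path.join(self._dir, f"{partition_id}.pkl")
            with open(path, "wb") as handle:
                pickle.dump(data, handle)      # memory is full: spill to disk
            self._disk[partition_id] = path

    def get(self, partition_id: str) -> Any:
        if partition_id in self._memory:
            return self._memory[partition_id]
        with open(self._disk[partition_id], "rb") as handle:
            return pickle.load(handle)

    def location(self, partition_id: str) -> str:
        if partition_id in self._memory:
            return "memory"
        if partition_id in self._disk:
            return "disk"
        raise KeyError(partition_id)

    def close(self) -> None:
        """Remove the temporary spill directory."""
        shutil.rmtree(self._dir, ignore_errors=True)


store = SpillingStore(max_in_memory=2)
for i in range(4):
    store.put(f"p{i}", list(range(i * 3, i * 3 + 3)))

print([store.location(f"p{i}") for i in range(4)])
print(sum(sum(store.get(f"p{i}")) for i in range(4)))  # reads from both tiers
store.close()
```

Callers read every partition through the same `get` call, so code that works on the data does not need to know which partitions were spilled.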

Businesses should review their data and use cases to assess their memory requirements because the more of the work that can be done in memory, the greater the performance benefits Spark can deliver. Other features that make Spark stand out include:

  • Functions other than Map and Reduce
  • Optimization of arbitrary operator graphs
  • Lazy evaluation of queries, which helps optimize the overall processing workflow
  • Concise and consistent APIs in Scala, Java, and Python
  • An interactive shell for Scala and Python (not yet available for Java)

Spark vs Hadoop

Hadoop has been positioned as a data processing technology for 10 years and has proved to be the solution of choice for many businesses processing large volumes of data. MapReduce is a very good solution for single-pass processing but is not the most effective for use cases requiring multi-pass processing and algorithms. Since each stage of a processing workflow consists of a Map phase and a Reduce phase, every use case must be expressed as MapReduce patterns to take advantage of this solution. The output of each step must be stored on a distributed file system before the next step begins. This approach tends to be slow because of replication and disk storage.

In addition, Hadoop solutions typically rely on clusters that are difficult for companies to set up and administer. They also require the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for stream processing).

If a company wants to set up something more complex, it has to chain a series of MapReduce jobs and execute them sequentially. Each of these jobs has high latency, and none can start before the previous one has fully completed.
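The cost of chaining jobs this way can be made concrete with a small simulation. The following is a minimal sketch in plain Python, not Hadoop code: `run_mapreduce_job` and `run_two_stage_pipeline` are hypothetical helpers, local JSON files stand in for the distributed file system, and the input lines are made-up sample data.

```python
import json
import os
import tempfile
from collections import defaultdict


def run_mapreduce_job(input_path, output_path, mapper, reducer):
    """Run one job: read input from disk, map, shuffle, reduce, write to disk."""
    with open(input_path) as handle:
        records = json.load(handle)
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)  # shuffle: group values by key
    output = [reducer(key, values) for key, values in sorted(shuffled.items())]
    with open(output_path, "w") as handle:
        json.dump(output, handle)


def run_two_stage_pipeline(lines):
    """Chain two jobs; job 2 cannot start until job 1 has written its file."""
    with tempfile.TemporaryDirectory(prefix="mapreduce_sketch_") as workdir:
        raw = os.path.join(workdir, "raw.json")
        counts = os.path.join(workdir, "counts.json")
        by_count = os.path.join(workdir, "by_count.json")
        with open(raw, "w") as handle:
            json.dump(list(lines), handle)

        # Job 1: count occurrences of each word.
        run_mapreduce_job(
            raw, counts,
            mapper=lambda line: [(word, 1) for word in line.split()],
            reducer=lambda word, ones: [word, sum(ones)],
        )
        # Job 2: group words by frequency, reading job 1's output from disk.
        run_mapreduce_job(
            counts, by_count,
            mapper=lambda record: [(record[1], record[0])],
            reducer=lambda count, words: [count, sorted(words)],
        )
        with open(by_count) as handle:
            return json.load(handle)


print(run_two_stage_pipeline(["spark is fast", "spark is versatile"]))
```

Every stage boundary in this sketch is a full write followed by a full read, which is the overhead the paragraph above describes.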

The advantage of Spark over Hadoop for businesses is that it enables the development of complex multi-step data processing pipelines using directed acyclic graphs (DAGs). Spark allows businesses to share data in memory between DAGs so that multiple jobs can work on the same dataset. Spark runs on the Hadoop Distributed File System (HDFS) infrastructure and offers additional features. It is possible to deploy Spark applications on an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), on a Hadoop v2 YARN cluster, or even on Apache Mesos. Rather than seeing Spark as a replacement for Hadoop, it is more correct to see it as an alternative to Hadoop's MapReduce. Spark was not intended to replace Hadoop but to provide a complete and unified solution that supports the different use cases and needs of data-driven businesses.
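To show what a multi-step pipeline expressed as an acyclic graph looks like, here is a minimal sketch in plain Python. It is an illustration under assumptions rather than Spark's scheduler: `run_dag` is a hypothetical helper, the stage names and sample rows are invented, and the graph is assumed to contain no cycles. Each stage is computed once and its result is kept in memory, so two downstream stages can share the same cleaned dataset.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Each stage maps a name to (upstream stage names, function of upstream results).
Stage = Tuple[List[str], Callable[..., Any]]


def run_dag(stages: Dict[str, Stage], targets: Iterable[str]) -> Dict[str, Any]:
    """Evaluate the requested stages, computing every dependency only once."""
    cache: Dict[str, Any] = {}  # in-memory results shared between stages

    def evaluate(name: str) -> Any:
        # Assumes the graph is acyclic; a cycle would recurse without end.
        if name not in cache:
            dependencies, func = stages[name]
            cache[name] = func(*(evaluate(dep) for dep in dependencies))
        return cache[name]

    return {target: evaluate(target) for target in targets}


load_calls = []


def load():
    load_calls.append(1)       # lets us check the source is read only once
    return [3, -1, 4, -1, 5]   # made-up sample rows


stages = {
    "load": ([], load),
    "clean": (["load"], lambda rows: [r for r in rows if r >= 0]),
    "total": (["clean"], sum),
    "largest": (["clean"], max),
    "report": (
        ["total", "largest"],
        lambda total, largest: {"total": total, "largest": largest},
    ),
}

print(run_dag(stages, ["report"]))
print(len(load_calls))  # how many times the source was loaded
```

Both `total` and `largest` read the cached output of `clean`, which is the in-memory sharing that a chain of separate MapReduce jobs cannot offer without writing to disk.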

Common use cases of Spark

A lot of data-driven companies depend on different data sources for their analytical products. Transforming, cleaning, and unifying unstructured data from external sources with internal sources all make up data processing workflows. Spark is proving extremely useful here, especially for new businesses. Some companies have also built a simple user interface to make batch data processing tasks more accessible.

  • Stream processing: many companies have started using Spark because of this feature, Spark Streaming. It covers applications such as real-time scoring of analytic models, stream mining, and network optimization. Data-driven companies reportedly favor Apache Spark for real-time streaming.
  • Advanced analytics: because of its speed and suitability for iterative computations, many companies prefer it to Hadoop. Early on, many businesses started writing their own Spark libraries for regression, classification, and clustering. Present-day problems in marketing and online advertising, fraud detection, and scientific research are now being solved with Spark tools and libraries. About 64% of businesses using Apache Spark rely on it for advanced analytics.
  • Business intelligence and visual analytics: these are among the most important aspects of any business. Over 91% of businesses using Apache Spark cite its performance gains as a reason.
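The stream processing use case in the list above can be illustrated with the micro-batch idea that Spark Streaming is built on: incoming events are grouped into small time windows, and each window is processed like a tiny batch job. The following is a minimal sketch in plain Python, not the Spark Streaming API; `micro_batches`, `score_batches`, the threshold, and the sample events are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, measured value)


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[float]]:
    """Group events into consecutive time windows of `interval` seconds."""
    batches: Dict[int, List[float]] = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // interval), []).append(value)
    return batches


def score_batches(
    events: Iterable[Event], interval: float, threshold: float
) -> List[Tuple[int, float, bool]]:
    """Return (batch index, mean value, alert flag) per batch, in time order."""
    results = []
    for index, values in sorted(micro_batches(events, interval).items()):
        mean = sum(values) / len(values)
        results.append((index, mean, mean > threshold))
    return results


# Made-up sensor readings for illustration only.
events = [(0.2, 10.0), (0.9, 14.0), (1.1, 30.0), (1.7, 50.0), (2.5, 8.0)]
for index, mean, alert in score_batches(events, interval=1.0, threshold=20.0):
    print(index, mean, "ALERT" if alert else "ok")
```

A shorter interval lowers latency at the cost of more, smaller batches, which is the main tuning decision in this style of streaming.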

Yahoo is one example of a large, data-driven company already using Apache Spark. The web search company is successfully running projects with Spark.


Databricks allows companies to schedule and execute routine analysis tasks without human intervention. The vendor, which provides a commercial version of the Apache Spark open source data processing platform, now offers a tool to automatically configure and run analysis tasks on top of it.

Databricks was founded by the developers behind Apache Spark, and the commercial version of the platform was designed to run on the Amazon Web Services cloud. Spark allows you to analyze very large data sets across several servers, for example to generate recommendations for an online service or to forecast a company's revenues. As companies become accustomed to processing very large volumes of data, they are increasingly doing so on a regular basis. Without automation, this requires an administrator to log in to a console to coordinate the steps necessary to complete these tasks.

The new Databricks cloud feature, called Jobs, allows administrators to prepare this scheduling in advance so that each task runs autonomously and at specified intervals in Spark. For example, a company can schedule a Spark application to run on a specific Databricks cloud cluster at a specific time. It can also decide whether to use a dedicated cluster for maximum performance or a cluster shared with other users to save money. The service notifies the user when the task is complete. It also keeps a log that shows whether the task was executed successfully.
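The general idea of running a task at fixed intervals and keeping a log of outcomes can be sketched as follows. This is a plain Python simulation under assumptions and does not use or reflect the Databricks Jobs interface: the `ScheduledJob` class, the `nightly_report` task, and its simulated failure are all hypothetical, and the scheduler steps through due times instead of waiting for the clock.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List, Tuple


@dataclass
class ScheduledJob:
    """A task that should run at a fixed interval, with a log of outcomes."""

    name: str
    task: Callable[[], None]
    first_run: datetime
    interval: timedelta
    journal: List[Tuple[datetime, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.interval <= timedelta(0):
            raise ValueError("interval must be positive")

    def due_times(self, until: datetime) -> List[datetime]:
        """List every scheduled run time from first_run up to and including until."""
        times = []
        current = self.first_run
        while current <= until:
            times.append(current)
            current += self.interval
        return times

    def run_until(self, until: datetime) -> None:
        """Simulate the scheduler: run the task once per due time and log the result."""
        for when in self.due_times(until):
            try:
                self.task()
                self.journal.append((when, "succeeded"))
            except Exception as error:
                self.journal.append((when, f"failed: {error}"))


attempts = []


def nightly_report():
    attempts.append(1)
    if len(attempts) == 2:  # simulated failure on the second run
        raise RuntimeError("cluster unavailable")


job = ScheduledJob(
    "nightly-report", nightly_report, datetime(2019, 6, 3, 2, 0), timedelta(days=1)
)
job.run_until(datetime(2019, 6, 5, 2, 0))
for when, status in job.journal:
    print(when.isoformat(), status)
```

The log records failures as well as successes, so an administrator can check after the fact whether each scheduled run completed.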

Spark is the ideal big data tool for data-driven enterprises because of its speed, ease of use, and versatility. If your company has large volumes of data, Spark is the tool that will help you understand that data quickly and make informed decisions faster. It can also make transforming your company's data much easier.