Raghavendra Pratap Singh
Raghavendra is the Assistant Marketing Manager at Sigmoid. He specializes in content marketing, digital marketing, and social media marketing.
Apache Spark for Real-time Analytics

Apache Spark is the hottest analytical engine in the world of Big Data. In our previous post, Hadoop and Data Analytics, we spoke about Hadoop, data analytics, and their associated benefits. Today, we will cover Apache Spark and its importance in real-time analytics.

Apache Spark is a fast, open-source engine for large-scale data processing on a distributed computing cluster. It was initially designed at UC Berkeley and later donated to the Apache Software Foundation. Spark can be used interactively from Java, Scala, Python, and R, among others, and is also capable of reading from HBase, Hive, Cassandra, and any HDFS data source. Its interoperability and versatility make it one of the most flexible and powerful data processing tools available today.

Spark’s fast processing speed makes it a suitable fit for data cleaning, data wrangling, and ETL. It has an advanced DAG execution engine that supports acyclic data flow and in-memory computing, which helps programs run up to 100x faster than Hadoop MapReduce when data fits in memory. Spark is a multi-stage, RAM-capable cluster-computing framework that can perform both batch processing and stream processing. It has libraries for machine learning, interactive queries, and graph analytics, and it can run in Hadoop clusters through YARN, on Mesos, or on EC2, as well as in its own standalone mode. Spark batch processing applications provide high throughput, whereas real-time stream processing provides low latency.

While using Hadoop for data analytics, many organizations ran into the following concerns:
1) MapReduce programming is not a good match for all analytics problems, as it is inefficient for iterative and interactive analytics tasks.
2) It was getting increasingly difficult to find entry-level programmers with the Java skills needed to be productive with MapReduce.
3) With the emergence of new tools and technologies, fragmented data security became an issue, which led to the adoption of the Kerberos authentication protocol.
4) Hadoop lacked full-featured tools for data management, data cleaning, governance, and metadata.


Apache Spark fills in the missing pieces of the puzzle described above:
1) Spark uses Hadoop HDFS, as it doesn’t have its own distributed file system. Hadoop MapReduce is strictly disk-based, whereas Spark can use memory as well as disk for processing.
2) MapReduce uses persistent storage, whereas Spark uses Resilient Distributed Datasets (RDDs), which can be created in three ways: by parallelizing an in-memory collection, by reading a stable external data source such as an HDFS file, or by applying transformations to existing RDDs.

We can process these RDDs using operations such as map, reduce, reduceByKey, join, and window. The results are stored in a data store for further analytics and are used for generating reports and dashboards. A transformation is applied to every element in an RDD, and RDDs are distributed among the participating machines. Partitioning in an RDD is generally determined by the locality of the stable source and can be controlled by the user through repartitioning.
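To make the semantics of these operations concrete, here is a minimal plain-Python sketch (not the actual Spark API) of how map and reduceByKey behave on a partitioned word-count dataset. The partition-local combine step mirrors Spark's map-side aggregation; real Spark code would use `SparkContext.parallelize` and the RDD API instead:

```python
from collections import defaultdict

def rdd_map(partitions, fn):
    # Apply fn to every element of every partition, like RDD.map.
    return [[fn(x) for x in part] for part in partitions]

def reduce_by_key(partitions, fn):
    # Combine values within each partition first (Spark's map-side
    # combine), then merge the partial results across partitions.
    merged = defaultdict(list)
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = fn(local[k], v) if k in local else v
        for k, v in local.items():
            merged[k].append(v)
    out = {}
    for k, vs in merged.items():
        acc = vs[0]
        for v in vs[1:]:
            acc = fn(acc, v)
        out[k] = acc
    return out

# Two "partitions" of words, as if created by parallelizing a collection.
partitions = [["spark", "hadoop", "spark"], ["hadoop", "spark"]]
pairs = rdd_map(partitions, lambda w: (w, 1))      # map: word -> (word, 1)
counts = reduce_by_key(pairs, lambda a, b: a + b)  # reduceByKey: sum counts
print(counts)  # {'spark': 3, 'hadoop': 2}
```

Because values are pre-aggregated inside each partition before merging, only one partial result per key crosses partition boundaries, which is exactly why reduceByKey shuffles far less data than a naive group-then-reduce.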

To make relevant business decisions, Big Data must be ingested in real time and insightful value extracted upon its arrival. Several streaming data processing frameworks address this, including Apache Samza, Storm, Flink, and Spark Streaming.

Apache Spark Streaming is well suited to high-speed, real-time information, which makes it a current trend in the Big Data world. Complex machine learning algorithms are built and applied to different streaming data sources to extract insights and help detect anomalous patterns in real time. Through the Spark Streaming library, it is possible to process these streams and apply complex business logic to them.

Applications of Apache Spark Streaming are as follows:
1) Real-Time Online Recommendation
2) Event Processing Solutions
3) Fraud Detection
4) Live Dashboards
5) Log Processing in Live Streams

Spark Streaming processes a continuous stream of data by dividing it into micro-batches called a Discretized Stream, or DStream, which is exposed as a high-level API. A DStream is a sequence of RDDs created from input data received from sources such as Kafka or Flume, or by applying operations to other DStreams. Input data from these sources is received by the Spark Streaming application and grouped into sub-second batches, which are then processed by the Spark core engine.

The RDDs generated can be converted into DataFrames and queried using Spark SQL. A DStream can be exposed to any application that can query RDDs through Spark’s JDBC driver, and it can be cached in Spark’s working memory so that it can be queried later on demand through Spark’s API.
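As a rough analogy for this pattern, the sketch below uses Python's built-in sqlite3 module (not Spark) to load one micro-batch of rows into an in-memory SQL table and query it, much as Spark SQL lets you register a DataFrame built from an RDD as a temporary view and query it on demand. The table name, schema, and sample rows are invented for illustration:

```python
import sqlite3

# One micro-batch of (word, count) results, e.g. from a streaming job.
batch = [("spark", 3), ("hadoop", 2), ("flink", 1)]

# Load the batch into an in-memory table, analogous to registering a
# DataFrame as a temporary view in Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_counts (word TEXT, count INTEGER)")
conn.executemany("INSERT INTO word_counts VALUES (?, ?)", batch)

# Query the batch with plain SQL, as a dashboard or JDBC client might.
rows = conn.execute(
    "SELECT word FROM word_counts WHERE count >= 2 ORDER BY count DESC"
).fetchall()
print(rows)  # [('spark',), ('hadoop',)]
```

The point of the analogy is the workflow, not the engine: once streaming results are exposed behind a SQL interface, existing reporting tools can query live data without knowing anything about RDDs or DStreams.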

This shows how the Spark Streaming library can be used to process real-time data, ultimately providing us with real-time insights.

February 13th, 2019 | Analytics, Open Source, Real Time, Spark, Spark Ecosystem, Tech