Akhil Das
Akhil, a Software Developer at Sigmoid, focuses on distributed computing, big data analytics, scaling, and optimising performance.

Mesos High Availability Cluster:

Apache Mesos is a high availability cluster operating system: it runs several masters, one of which is elected leader, while the other (standby) masters serve as backups in case the leader fails. ZooKeeper handles leader election among the masters and detects failures.
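For illustration, a ZooKeeper-backed master setup might look like the following launch commands (the hostnames, quorum size, and paths here are hypothetical):

```shell
# Each Mesos master registers with the same ZooKeeper ensemble;
# ZooKeeper elects one leader and the rest remain on standby.
mesos-master --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
             --quorum=2 \
             --work_dir=/var/lib/mesos

# Agents discover the current leader through the same ZooKeeper path,
# so a leader failover is transparent to them.
mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
            --work_dir=/var/lib/mesos
```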

High Availability Mesos Cluster

Mesos is framework independent and can intelligently schedule and run Spark, Hadoop, and other frameworks concurrently on the same cluster.

Before diving deeper into the topic, let's take a brief look at Spark and Spark Streaming.

Spark Streaming

Spark powers a stack of frameworks including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.


Resilient Distributed Datasets (RDDs) are large collections of data that are immutable (cannot be changed once created), distributed (across the cluster), lazily evaluated (nothing is computed until an action is triggered), type inferred (the compiler figures out the object type), and cacheable (the whole dataset can be loaded into memory).
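As a minimal pure-Scala analogy (not Spark itself), lazy evaluation can be illustrated with a view: the `map` is only recorded as a pending transformation, and nothing runs until an action such as `sum` forces it.

```scala
// Pure-Scala sketch of lazy evaluation, as an analogy for RDD
// transformations vs. actions. No Spark required.
object LazyDemo {
  var evaluated = 0

  def run(): Int = {
    evaluated = 0
    // "Transformation": recorded lazily, nothing is computed yet.
    val doubled = (1 to 5).view.map { x => evaluated += 1; x * 2 }
    require(evaluated == 0) // still lazy at this point
    // "Action": forces evaluation of the pipeline.
    val total = doubled.sum
    require(evaluated == 5) // every element was evaluated exactly once
    total
  }
}
```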

Many big-data applications, such as monitoring, alerting, and computing systems, need to process large data streams in near real time. Spark Streaming handles this by running streaming computation as a series of small batch jobs: the incoming data is split into batches at a fixed time interval (200 ms by default), each batch is treated as an RDD and processed with RDD operations, and the results are returned in batches after processing.
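That discretization step can be sketched in plain Scala (a simplified model, not Spark's implementation): timestamped events are bucketed into fixed-width intervals, and each bucket plays the role of one RDD in the micro-batch series.

```scala
// Simplified model of micro-batching: bucket timestamped events into
// fixed-width intervals (Spark Streaming's default interval is 200 ms).
object MicroBatch {
  def batches[A](events: Seq[(Long, A)], intervalMs: Long): Map[Long, Seq[A]] =
    events
      .groupBy { case (ts, _) => ts / intervalMs }       // batch index
      .map { case (idx, evs) => idx -> evs.map(_._2) }   // drop timestamps
}
```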

Spark Streaming over a HA Mesos Cluster

To use Mesos from Spark, you need a Spark binary package in a place accessible to Mesos (via HTTP, S3, or HDFS), and a Spark driver program configured to connect to Mesos.

Configuring the driver program to connect to Mesos:

val sconf = new SparkConf()
  .set("spark.mesos.coarse", "true")
  .set("spark.cores.max", "30")
  .set("spark.executor.memory", "10g")
val sc = new SparkContext(sconf)
val ssc = new StreamingContext(sc, Seconds(1))
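The snippet above omits the master URL. Against a highly available Mesos cluster, the driver typically points at the ZooKeeper ensemble rather than a single master, and `spark.executor.uri` tells Mesos where to fetch the Spark binary package. A hedged spark-defaults-style fragment (the hostnames and path below are hypothetical):

```
spark.master        mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos
spark.executor.uri  hdfs://namenode/dist/spark-1.3.0-bin-hadoop2.4.tgz
```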

Failure in simple streaming pipeline

Point of Failures for Spark Streaming

With a simple Spark streaming pipeline as shown in the figure, there are certain failure scenarios that may occur. Since Kafka and HDFS are typically configured in high-availability mode, we will focus on Spark Streaming's fault tolerance.

Real-time stream processing systems must be operational 24/7, which requires them to recover from all kinds of failures in the system.

  • Spark and its RDD abstraction are designed to seamlessly handle failures of any worker node in the cluster.
  • In streaming, driver failure can be recovered from by checkpointing the application state.
  • Write-Ahead Logs (WAL) and acknowledgements can ensure zero data loss.


The image on the right represents the processing of received data in a fault-tolerant system. When a Spark Streaming application starts (i.e., the driver starts), the associated StreamingContext (the starting point of all streaming functionality) uses the SparkContext to launch receivers as long-running tasks. These receivers receive data from source systems such as Kafka and Flume, and buffer the streaming data in executor memory for processing.

The streaming data is also periodically checkpointed to another set of files in a fault-tolerant file system.

In a fault-tolerant system, when the failed driver restarts, the checkpointed data, along with the data from the WAL, is used to restore the driver, the receivers, and the contexts.
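The recovery idea can be sketched in plain Scala (a simplified model, not Spark's actual implementation): state is restored from the last checkpoint, and any records that reached the WAL after that checkpoint are replayed on top, so no acknowledged record is lost.

```scala
// Simplified model of driver recovery: restore the last checkpointed
// count, then replay WAL records received after that checkpoint.
object Recovery {
  // lastSeq: sequence number of the last record covered by the checkpoint
  case class Checkpoint(processedCount: Int, lastSeq: Long)

  def recover(cp: Checkpoint, wal: Seq[(Long, String)]): Int = {
    // Records already reflected in the checkpoint are skipped;
    // only the tail of the WAL is reprocessed.
    val replayed = wal.filter { case (seq, _) => seq > cp.lastSeq }
    cp.processedCount + replayed.size
  }
}
```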

Simple Fault-tolerant Streaming Infra

Fault Tolerant Spark Streaming over Mesos Cluster

The Spark Streaming cluster and the storage system are placed inside the Mesos cluster, but source systems such as Kafka sit outside Mesos, since the source can be anywhere. By running Spark Streaming on Mesos, even if a driver fails, standby drivers take over the streaming process, thereby ensuring fault tolerance.

Goal: Receive & process data at 1M events/sec

Understanding the bottlenecks: 

  • Network: 1 Gbps
  • Cores per slave: 4
  • Disk IO: 100 MB/s on SSD

Choosing the correct number of resources

  • Since a single slave can handle up to 100 MB/s of network and disk IO, a minimum of 6 slaves is needed to reach ~600 MB/s.
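The sizing arithmetic can be made explicit as a small helper (assuming, as above, 600-byte events and ~100 MB/s of sustained network/disk throughput per slave):

```scala
// Back-of-envelope cluster sizing: aggregate throughput is
// events/sec * bytes/event, divided by per-slave throughput (ceiling).
object Sizing {
  def slavesNeeded(eventsPerSec: Long,
                   bytesPerEvent: Long,
                   perSlaveBytesPerSec: Long): Long =
    math.ceil(eventsPerSec.toDouble * bytesPerEvent / perSlaveBytesPerSec).toLong
}
```

For 1M events/sec at 600 bytes each against 100 MB/s per slave, this gives 6 slaves, matching the estimate above.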

Scaling the pipeline

One way to obtain high throughput is to launch more receivers across the machines, which lets you consume more data at once. Other techniques include running add-ons on top of Mesos, such as Marathon, to add or remove machines (scaling the cluster up and down automatically). Before scaling any pipeline, you should always know your goals and the limits of the available resources. In a basic data centre these days, machines are typically equipped with around 4 CPU cores, a 1 Gbit network card, and 100 MB/s SSD storage. So if you want to process around 1 million events per second (600 bytes each), you will need a minimum of 6 machines to reach that scale, since each machine can handle around 100 MB/s of network and disk IO.


April 9th, 2015 | Spark, Streaming, Technology