Arush Kharbanda

Arush was a technical team member at Sigmoid. He was involved in multiple projects including building data pipelines and real-time processing frameworks.
Spark Streaming in Production

This is the next post in our series on Spark Streaming.

After covering what Spark Streaming is and how it works, we now look at how to implement it in production. At Sigmoid we have implemented Spark Streaming in production for several customers and achieved great results by improving the design and architecture of the system. This post covers the design considerations and key learnings from implementing Spark Streaming in a production environment.

Most real-time analytics systems can be broken down into a receiver system, a stream-processing system, and a storage system.

[Figure: Spark Streaming architecture – receiver, stream processing, and storage]

A good design and architecture are essential to the success of any application. A well-designed system is stable, maintainable, and scalable. When the design is done right, maintenance and upgrade efforts are minimized, which in turn keeps costs low. Below we share the design practices we have learnt while working on Spark Streaming at Sigmoid, and how they help in building stable and maintainable systems.

Stability

The system must be stable enough to withstand any unplanned outage; such outages make clients furious and leave the user base frustrated.

Fault tolerance – Any production system should be capable of recovering from failures, and fault tolerance is even more important for a real-time analytics system. Apart from making the stream-processing system fault tolerant, your receiver needs to be fault tolerant too.

  • Developing Akka-based custom receivers with a supervisor allows you to create self-healing receivers. To create a supervisor for your actor, the actor needs to define a SupervisorStrategy. You can look at the Akka word count example for implementing your custom Akka receiver.

The code below depicts this:

import akka.actor.OneForOneStrategy;
import akka.actor.SupervisorStrategy;
import akka.actor.SupervisorStrategy.Directive;
import static akka.actor.SupervisorStrategy.escalate;
import static akka.actor.SupervisorStrategy.restart;
import akka.japi.Function;
import scala.concurrent.duration.Duration;

// Restart the storage child when a StorageException is thrown.
// After 3 restarts within 5 seconds it will be stopped.
// StorageException is the application-specific failure we recover from.
private static SupervisorStrategy strategy =
    new OneForOneStrategy(3, Duration.create("5 seconds"),
        new Function<Throwable, Directive>() {
          @Override
          public Directive apply(Throwable t) {
            if (t instanceof StorageException) {
              return restart();
            } else {
              return escalate();
            }
          }
        });
  • Using auto-healing connection pools avoids exhausting the connection pool.
  • Track your output – The time interval and size of your output are the best-suited parameters for alerting in your system. Alerting on these parameters covers failure scenarios at any point in your pipeline.
    1. Insert timestamps in your data to track the latest updates (see the sketch below)
    2. Set up alerts to track large changes in the output size
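A minimal sketch of the first point, in Scala: each record carries a processing-time timestamp so an alert can check how stale the latest output is. TaggedRecord and the (key, value) input shape are illustrative assumptions, not part of any library.

import org.apache.spark.streaming.dstream.DStream

// Tag each output record with a processing-time timestamp so alerting
// can detect when the pipeline stops producing fresh data.
// TaggedRecord and the input shape are illustrative.
case class TaggedRecord(key: String, value: Long, processedAtMillis: Long)

def tagWithTimestamp(records: DStream[(String, Long)]): DStream[TaggedRecord] =
  records.map { case (k, v) => TaggedRecord(k, v, System.currentTimeMillis()) }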

Maintainability

Systems need to be designed in a modular way; this helps bring down development and maintenance costs. Modular systems are easy to customize or update, and bugs can be fixed without creating unwanted side effects. A good modular approach also creates reusable components. We can safely say that modularity brings maintainability.

The stream-processing system can be further divided into three subparts – Map, Aggregate, and Store.

  • Map – Avoid anonymous functions, as they are difficult to test and even more difficult to maintain: they cannot be tested without initializing a SparkContext. Move your logic out of your Spark code to make it more modular; this lets you write better unit test cases and test your functions without initializing a SparkContext, as sketched below.
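A minimal sketch of this idea, assuming a hypothetical EventParser and a comma-separated (key, value) input format: the parsing logic lives in a plain object with no Spark dependency, and the streaming job only wires it in.

// Plain business logic: no Spark dependency, unit testable in isolation.
// EventParser and the input format are illustrative assumptions.
object EventParser {
  def parseEvent(line: String): (String, Long) = {
    val fields = line.split(",")
    (fields(0), fields(1).toLong)
  }
}

// In the streaming job, pass the named function instead of an anonymous one:
// lines.map(EventParser.parseEvent) rather than lines.map { l => ... }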
  • Aggregation – In the aggregation layer, monoids let you move the logic away from the Spark code. Put crudely, a monoid is a class that implements an associative operation with an identity element, and it can be plugged into Spark code to perform streaming aggregations.
import com.twitter.algebird.Monoid

object LongMonoid extends Monoid[(Long, Long, Long)] {
  def zero = (0L, 0L, 0L)
  def plus(l: (Long, Long, Long), r: (Long, Long, Long)) =
    (l._1 + r._1, l._2 + r._2, l._3 + r._3)
}

Twitter's Algebird provides a library of pre-tested monoids, which saves a lot of development and testing effort.
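As a sketch of how such a monoid plugs in (the metrics stream of (Long, Long, Long) tuples is an assumed input), the plus operation can be passed straight to a Spark reduce:

import org.apache.spark.streaming.dstream.DStream

// Fold a stream of metric tuples with the monoid instead of an
// inline anonymous function; `metrics` is an assumed input stream.
def totals(metrics: DStream[(Long, Long, Long)]): DStream[(Long, Long, Long)] =
  metrics.reduce(LongMonoid.plus)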

[Figure: Spark Streaming maintainability]

  • Store – It is preferable to use NoSQL databases over HDFS for streaming applications because NoSQL databases allow the application to make incremental updates. NoSQL databases also let you query the data, which is essential for verifying the output of your job and testing the system; this is not possible with plain HDFS files. A sketch of the incremental-update pattern follows below.
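A minimal sketch of incremental updates from a stream, assuming a hypothetical pooled key-value client: KVClient, fromPool(), increment(), and release() stand in for whatever your store's driver (Cassandra, HBase, etc.) actually provides.

import org.apache.spark.streaming.dstream.DStream

// Push incremental counts to a NoSQL store from each partition.
// KVClient and its methods are illustrative stand-ins, not a real API.
def store(counts: DStream[(String, Long)]): Unit =
  counts.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val client = KVClient.fromPool()   // one pooled connection per partition
      partition.foreach { case (key, delta) =>
        client.increment(key, delta)     // incremental update, not a full rewrite
      }
      client.release()
    }
  }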


Testing your System

If the system is well designed, testing requires less effort and fewer resources.

A good set of automated test cases is essential for making enhancements and improvements to your system, and the test cases need as much coverage as possible. For functional testing, however, you don't want to spin up a cluster, and you should avoid writing integration test cases where you can. It is better to write unit tests: you can test your map functions and monoids independently, as sketched below.
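A minimal sketch of such Spark-free unit tests, assuming ScalaTest and the EventParser and LongMonoid objects sketched earlier; note that no SparkContext is needed.

import org.scalatest.funsuite.AnyFunSuite

// Unit tests for the extracted logic; no SparkContext required.
class PipelineSpec extends AnyFunSuite {
  test("parseEvent extracts key and value") {
    assert(EventParser.parseEvent("user42,7") == ("user42", 7L))
  }
  test("LongMonoid adds component-wise") {
    assert(LongMonoid.plus((1L, 2L, 3L), (4L, 5L, 6L)) == (5L, 7L, 9L))
  }
}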

If you have come across other good design practices, please share with us in the comments section below.

April 22nd, 2015 | Spark, Streaming, Technology