This is the first blog in a series about Spark Streaming.

Before getting into Spark Streaming, a brief note on Apache Spark: it is an open-source cluster computing framework. In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications.

Spark Streaming was launched as part of Spark in version 0.7, came out of alpha in Spark 0.9, and has been quite stable ever since. It can be used in many near-real-time use cases, such as monitoring the flow of users on your website or detecting fraudulent transactions in real time.

But what was the need for Spark Streaming?

Let us first understand the general requirements for a stream processing system. A stream processing system can be stateful: a stateful stream processing system is one that needs to update its state as data streams in. For such a system, latency should be low, and the state should not be lost even if a node fails.

For example: computing the distance covered by a vehicle based on a stream of its GPS locations, or, a simpler one, counting the occurrences of the word ‘spark’ in a stream of data.

[Image: Stateful Stream Processing]

Batch processing systems like Hadoop have high latency and are not suitable for near-real-time processing requirements. Storm guarantees processing of a record if it hasn’t been processed, but this can lead to inconsistency, as a record could be processed twice. If a node running Storm goes down, the state is lost. In most environments, Hadoop and Storm (or other stream processing systems) have been used for batch processing and stream processing respectively. Using two different programming models increases code size, the number of bugs to fix, and the development effort, and introduces a learning curve, among other issues.

Spark Streaming helps fix these issues and provides a scalable, efficient, resilient system that integrates with batch processing. The discussion below backs up these claims.

In Spark Streaming, the incoming data is divided into batches of Resilient Distributed Datasets (RDDs). Spark Streaming processes these batches using the Spark engine and returns a processed stream of batches, which can be written to a file system. The batch size can be as low as 0.5 seconds, leading to an end-to-end latency of less than 1 second.
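To make this concrete, here is a minimal word-count sketch in Scala. The host, port and batch interval are illustrative assumptions, not values from this post:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
// The incoming stream is cut into batches every second
// (Milliseconds(500) would give 0.5 second batches)
val ssc = new StreamingContext(conf, Seconds(1))

// Each batch of lines arrives as an RDD, wrapped in a DStream
val lines = ssc.socketTextStream("localhost", 9999) // placeholder host and port
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()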

Spark Streaming provides APIs in Scala, Java and Python. The Python API was introduced only in Spark 1.2 and still lacks many features. Spark Streaming allows stateful computations, i.e. maintaining a state based on the data coming in a stream. It also allows window operations, letting the developer specify a time frame and perform operations on the data flowing within that time window. The window has a sliding interval, that is, the interval at which the window is updated. If I define a time window of 10 seconds with a sliding interval of 2 seconds, I would be performing my computation on the data that arrived on the stream in the past 10 seconds, and the window would update every 2 seconds.
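Continuing the word-count sketch above, here is a hedged example of both ideas: a running per-word total via updateStateByKey (which requires checkpointing; the path is a placeholder) and the 10-second window with a 2-second slide described above:

// Stateful computation: maintain a running count per word across batches
ssc.checkpoint("/tmp/streaming-checkpoint") // placeholder path; required for stateful ops
val runningCounts = wordCounts.updateStateByKey[Int] { (newValues, state) =>
  Some(newValues.sum + state.getOrElse(0))
}

// Window operation: counts over the last 10 seconds, refreshed every 2 seconds
val windowedCounts = wordCounts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // combine counts inside the window
  Seconds(10),               // window length
  Seconds(2)                 // sliding interval
)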

Apart from these features, the strength of Spark Streaming lies in its ability to combine with batch processing. It is possible to create an RDD using normal Spark programming and join it with a Spark stream. Moreover, the code base is very similar, allowing easy migration if required, and there is little to no learning curve coming from Spark.

In the example below, wordCounts is a DStream. Using the DStream’s transform operation, it can be joined with another RDD, spamInfoRDD, which is generated as part of a regular Spark job.

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(…) // RDD containing spam information

val cleanedDStream = wordCounts.transform { rdd =>
  rdd.join(spamInfoRDD).filter(…) // join data stream with spam information to do data cleaning
}

Spark Streaming can ingest data from Kafka, Flume, HDFS, or a raw TCP stream. It also lets you create a stream out of RDDs: you can provide your own RDDs and Spark will treat them as a stream of RDDs. It even allows you to create your own receiver. The Scala and Java APIs can read data from all of these sources, whereas the Python API can read only from a TCP network socket.
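As an illustration of the RDD-as-stream feature, here is a sketch using queueStream, reusing the ssc context from the earlier example; the queue contents are made up for the example:

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// Any RDD pushed into this queue is treated as one batch of the stream
val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(_ * 2).print()

rddQueue += ssc.sparkContext.makeRDD(1 to 100) // one batch of illustrative data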

Fault tolerance is the capability of a system to overcome failure. Fault tolerance in Spark Streaming is similar to fault tolerance in Spark: like RDD partitions, DStream data is recomputed in case of a failure. The raw input is replicated in memory across the cluster, so in case of a node failure the lost data can be reproduced using the lineage. The system can recover from a failure in less than a second.
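Lineage-based recovery can be complemented with periodic checkpointing, which bounds how far back data has to be recomputed after a failure. A one-line sketch, assuming a hypothetical HDFS checkpoint directory:

// Persist DStream metadata and generated RDDs periodically; the path is a placeholder
ssc.checkpoint("hdfs:///checkpoints/streaming-app")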

Spark Streaming is capable of processing 100-500K records/node/sec. This is much faster than Storm and comparable to other stream processing systems. At Sigmoid we have been able to consume 480K records per second per node using Kafka as a source. We are constantly pushing the limits of performance, trying to get the most out of the same infrastructure.

Sigmoid worked with a leading supply chain analytics firm to help them process real-time factory shop floor data using Spark Streaming. The key problem was to digest batch and streaming data from various sources, all within a turnaround time of seconds. They had the challenging requirement of ensuring easy scaling, high availability, and low latency for the system to be used in production. We overcame these challenges to implement a stable and maintainable real-time streaming system.

Spark Streaming is the best available streaming platform and allows you to reach sub-second latency. Its processing capability scales linearly with the size of the cluster, and it is being used in production by many organizations.


Thanks to Matei Zaharia, Tathagata Das and the other committers for open sourcing Spark under the Apache License.

We will be happy to hear your insights on Spark Streaming and how you could benefit from using it.

Stay tuned for more! In our next blog, we will discuss the different source systems used for getting data into Spark Streaming.

References:

  1. Introduction to using Spark Streaming
  2. Image Courtesy – Stateful Stream Processing: http://www.bigdata.fi/node/971