Arush Kharbanda
Arush was a technical team member at Sigmoid. He was involved in multiple projects, including building data pipelines and real-time processing frameworks.
Apache Spark: A Look under the Hood

Before diving deep into how Apache Spark works, let's understand the jargon of Apache Spark:

  • Job: A piece of code which reads some input from HDFS or the local filesystem, performs some computation on the data, and writes some output data.
  • Stages: Jobs are divided into stages. Stages are classified as map or reduce stages (it's easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; all computations (operators) cannot be executed in a single stage, so the work happens over many stages.
  • Tasks: Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine); see the short sketch after this list.
  • DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
  • Executor: The process responsible for executing a task.
  • Driver: The program/process responsible for running the Job over the Spark engine.
  • Master: The machine on which the Driver program runs.
  • Slave: The machine on which the Executor program runs.
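
To make the "one task per partition" point concrete, here is a minimal sketch you could type into the spark-shell (the file path is illustrative, not from the article):

    val lines = sc.textFile("hdfs:///data/input.txt")  // hypothetical input path
    lines.partitions.length                            // number of partitions = number of tasks per stage over this RDD

    val repartitioned = lines.repartition(8)           // changing the partitioning changes the task count
    repartitioned.partitions.length                    // 8 partitions -> 8 tasks per downstream stage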

All jobs in Spark comprise a series of operators and run on a set of data. All the operators in a job are used to construct a DAG (Directed Acyclic Graph). The DAG is optimized by rearranging and combining operators where possible. For instance, let's assume that you have to submit a Spark job that contains a map operation followed by a filter operation. The Spark DAG optimizer would rearrange the order of these operators, since filtering first reduces the number of records that undergo the map operation.

DAG Execution
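
As a minimal sketch of that map-then-filter job (spark-shell; the data set and operations are illustrative, not from the article):

    val nums = sc.parallelize(1 to 1000)

    // Written as map followed by filter; conceptually, applying the filter first
    // would leave fewer records for the map step to process.
    val result = nums.map(n => n * 2).filter(n => n > 100)

    result.collect()   // nothing runs until this action is invoked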

How Spark Works

Spark has a small code base, and the system is divided into various layers. Each layer has its own responsibilities, and the layers are independent of each other.

  1. The first layer is the interpreter; Spark uses a Scala interpreter with some modifications.
  2. As you enter your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph.
  3. When the user runs an action (like collect), the graph is submitted to the DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages.
  4. A stage comprises tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages.
  5. The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
  6. The worker executes the tasks. A new JVM is started per job, and the worker knows only about the code that is passed to it.
How Apache Spark works
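
A minimal end-to-end sketch of the steps above, as it might look in the spark-shell (the path and field positions are illustrative, not from the article):

    val logs   = sc.textFile("hdfs:///logs/access.log")        // step 2: only builds the operator graph
    val errors = logs.filter(_.contains("ERROR"))               //         still just extending the graph
    val counts = errors.map(line => (line.split(" ")(0), 1))
                       .reduceByKey(_ + _)                      //         reduceByKey introduces a shuffle, i.e. a stage boundary

    counts.collect()                                            // step 3: the action hands the graph to the DAG scheduler,
                                                                //         which produces stages and tasks for the executors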

Spark caches the data to be processed, allowing it to be up to 100 times faster than Hadoop. Spark uses Akka for multithreading, managing executor state, and scheduling tasks.
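
For example, a small caching sketch (the data set is illustrative, not from the article):

    val events = sc.textFile("hdfs:///data/events.txt")
    events.cache()                       // mark the RDD to be kept in memory once computed

    events.count()                       // first action computes the partitions and caches them
    events.filter(_.nonEmpty).count()    // subsequent actions reuse the cached data instead of re-reading HDFS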

It uses Jetty to share files (JARs and other files), for HTTP broadcast, and to run the Spark Web UI. Spark is highly configurable and capable of utilizing the components already existing in the Hadoop ecosystem. This has allowed Spark to grow rapidly, and in a short time many organisations are already using it in production.
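
As one illustration of that configurability, here is a minimal SparkConf sketch for a job submitted to an existing Hadoop/YARN cluster (the application name and values are assumptions; the property names are standard Spark settings):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("under-the-hood-demo")        // hypothetical application name
      .setMaster("yarn-client")                 // reuse an existing YARN cluster manager
      .set("spark.executor.memory", "2g")       // memory per executor
      .set("spark.executor.cores", "2")         // cores per executor

    val sc = new SparkContext(conf)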
