Markdown Content - Containerization of PySpark using Kubernetes

---
# Containerization of PySpark using Kubernetes

**URL:** https://www.sigmoid.com/blogs/containerization-of-pyspark-using-kubernetes/
Date: 2020-07-23
Author: Sigmoid
Post Type: post
Summary: Containerization technology is widely used by data scientists and machine learning practitioners to promote the continuous deployment of models and test the...Read More...
Categories: Data Management
Tags: AI/ML, Cloud Transformation
Featured Image: https://www.sigmoid.com/wp-content/uploads/2023/09/Containerization-of-PySpark-using-Kubernetes-banner-opt.jpg
---

Containerization technology is widely used by [data scientists](/data-science-services/) and machine learning practitioners to promote the continuous [deployment of models](/blogs/5-best-practices-for-putting-ml-models-into-production/) and test the models frequently while carrying out multiple iterations. In this blog, we have detailed the approach of how to use Spark on Kubernetes and also a brief comparison between various cluster managers available for [Spark](/blogs/optimize-nested-queries-using-apache-spark/).

## Introduction to Spark

[Spark](/blogs/spark-streaming-internals/) is a general-purpose distributed [data processing](/blogs/improving-data-processing-with-spark-3-0-delta-lake/) engine designed for fast computation. The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application. It supports workloads such as batch applications, iterative algorithms, interactive queries, and [streaming](/blogs/fault-tolerant-stream-processing/). During execution, it creates the following components:

 

 	- Driver

 	- Executor

 
To manage these components, there is a cluster manager that takes care of resource allocation. Following are the various options available for cluster managers:

 

 	- Standalone

 	- Apache Mesos

 	- Hadoop YARN

 	- Kubernetes

 

**Standalone Cluster Manager**
To use Spark Standalone Cluster manager and execute code, there is no default high availability mode available, so we need additional components like Zookeeper installed and configured.

 

**Hadoop YARN**
YARN (“Yet Another Resource Negotiator”) focuses on distributing MapReduce workloads and it is majorly used for Spark workloads.

 

**Apache Mesos**
The Mesos kernel runs on every machine and provides applications with APIs for resource management, scheduling across the entire data center, and [cloud environments](/blogs/cloud-data-warehouse-is-the-future-of-data-storage/). It provides a cluster manager which can execute the Spark code.

 

**Kubernetes**
It uses the kube-api server as a cluster manager and handles execution.

## Comparison between Hadoop YARN and Kubernetes – as a cluster manager

Features
Hadoop YARN
Kubernetes

Cost
For High Availability requirements that demand more than 1 nodes, minimum 3 nodes need to be running. These nodes aren’t shared for other workloads.
Here we need a running cluster which can be shared for various workloads.

Support
Can opt for the Cloudera (aka Hortonworks) distribution for support
Kubernetes community support.

Ease of
setup
There is a need to install various components on multiple nodes and these components are needed for High Availability configurations
Only Kubernetes cluster is to be up and running, code has to be part of Docker image.

Integrations
Relatively limited support
Supports a rich set of tools integrations.

Ability to
isolate jobs
To reuse the same cluster for concurrent requirements Spark apps need to compromise on isolation.
Kubernetes provides cost benefits of a shared infrastructure and full isolation.

## Brief introduction Kubernetes and its component

Kubernetes is a container orchestration engine which ensures there is always a high availability of resources. Apart from that it also has below features.

 

 	- Self-healing

 	- Automatic rolling updates and rollback

 	- Resource management

 	- Service Discovery

 	- Load Balancing

 	- Service Discovery

 
Architecture of Kubernetes

 
![Architecture of Kubernetes](/wp-content/uploads/2023/09/Architecture-of-Kubernetes-img-opt.jpg)
 
The architecture of Kubernetes has 2 major components, they are:

 

 	** Master Components **
All the requests from the user using API, kubectl are sent to master component that is the API Server

 	- API Server converts json or yaml requests to http call.

 	- ETCD contains the details of the cluster and its components and current state.

 	- Controller ensures that the cluster is always in the desired state.

 	- Scheduler takes care of object creation based on resource availability.

 
 	** Worker Components **

The following are the components that are come under the nodes:

 

 	- Kubelet is the agent which takes care of the execution of the tasks which have been assigned to it and reports back the status to the API server.

 	- Kube Proxy is the networking component that takes care of networking related tasks.

 	- Container Runtime it provides an environment on the nodes for container execution.

 
**Other components –** 
Kubectl: is a utility used to communicate with the Kubernetes cluster.

## Spark Execution on Kubernetes

Here we have explained how containerization technology with [Spark](/blogs/spark-streaming-production/) is used. Below is the pictorial representation of spark-submit to API server.

 
![Spark Execution on Kubernetes](/wp-content/uploads/2022/12/Spark-Execution-on-Kubernetes-img.png)
 
We can use spark-submit directly to submit a Spark application to a Kubernetes cluster. Once submitted, the following events occur:

 	- Creation of a Spark driver running as a [Kubernetes pod](https://kubernetes.io/docs/concepts/workloads/pods/).

 	- Creation of executors which are also run within Kubernetes pods connects to them and executes the application code.

 	- Termination and clean-up of executor pods occur when the application completes. But the driver pod persists, logs, and remains in a “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

There are 2 options available for [executing Spark](/blogs/apache-spark-on-dataproc-vs-google-bigquery/) on an EKS cluster

 
Option 1: Using Kubernetes Master as Scheduler

Option 2: Using Spark Operator

## Option 1: Using Kubernetes Master as Scheduler

Below are the prerequisites for executing spark-submit using:

 
A. Docker image with code for execution

B. Service account with access for the creation of pods, services, secrets

C. Spark-submit binary in local machine

**A. Creating Docker image for Java and Py-Spark execution**

 	- Download Spark binary in the local machine using this link [https://archive.apache.org/dist/spark/](https://archive.apache.org/dist/spark/)

 	- In this path **spark/kubernetes/dockerfiles/spark** there is Dockerfile which can be used to build a docker image for jar execution.

 	- Ensure that you are in the Spark directory as it needs jars and other binaries to be copied. So it uses all the directories as context.

![Creating Docker image](/wp-content/uploads/2022/12/Creating-Docker-image-Py-Spark-execution-copy-clip-1.png)
 
Once this image is built, it can be used as a base image for the other code execution.

 
Docker image creation for Py Spark code execution:

 
In this path **spark/kubernetes/dockerfiles/spark/bindings/python** there is a ready Docker file which will be used for py [spark execution](/blogs/getting-data-into-spark-streaming/). Ensure that you are in the Spark directory as it needs jars and other binaries to be copied so it will use all the directories as context.

 
![Docker image creation for Py Spark](/wp-content/uploads/2022/12/Creating-Docker-image-for-Py-Spark-execution-copy-clip-2.png)
 

**B. Creating Kubernetes service account and cluster-role binding**
Components that will be used:

 

 	- Service account: an account which will be used for authentication of processes running inside the pods.

 	- Cluster-Role: defines the access for the service account across the cluster.

 	- Cluster-Role Binding: binds and creates the role with the service account.

 
Create a yaml file with below contents

 
![Creation of Kubernetes service account](/wp-content/uploads/2022/12/Create-a-yaml-file-with-below-contents-copy-clip-img.png)
 
![Creation of yaml file](/wp-content/uploads/2022/12/docker_img.gif)
 
Getting cluster information:

 
![cluster information result](/wp-content/uploads/2022/12/cluter-information-img.png)
 
This command gives the master url it and it will look as shown below

 
Kubernetes master is running at https://ABCDZZZZZZZZZZZZZZZ.sk1.region.eks.amazonaws.com

 
Make a note of this URL

**C. Executing Spark Submit:**
Now go to the directory which has Spark binary and use the below command

![Use of spark binary command](/wp-content/uploads/2022/12/Executing-spark-submit-img.png)
 
![Execution of spark submit](/wp-content/uploads/2022/12/spark_submit.gif)
 
The driver and executor will be created with whatever name is specified as the app name. Using the driver pod we can view logs, access URL below commands can be used for it.

 

 	- To view logs

 
![View of logs](/wp-content/uploads/2022/12/view-blog-img.png)
 

 	- To view spark ui

 
![View of spark ui](/wp-content/uploads/2022/12/view-spark-ui-Img.png)
 
Additional useful options that can be used with Spark-Submit

 
Environment Variables:

 
When the script requires any environment variable that needs to be passed, it can be done using Kubernetes secret and referred to it. Details of achieving this are given below.

 
“spark.kubernetes.executor.secretKeyRef.DB_PASS”: “snowsec:db_pass”,

 
Create a secret with the name snowsec and in that db_pass is the key and which will be referred to the spark environment using DB_PASS.

 
Adding labels to the pod:

 
When we want to add additional labels to pod we can use below options

 
For driver pod

 
spark.kubernetes.driver.label.[LabelName]

 
For executor pod

 
spark.kubernetes.executor.label.[LabelName]

 
Using node affinity:

 
We can control the scheduling of pods on nodes using selector for which options are available in Spark that is

 
spark.kubernetes.node.selector.[labelKey]

## Option 2: Using Spark Operator on Kubernetes

Operators

 
Operator is a method of packaging, deploying and managing a Kubernetes application. Kubernetes application is one that is both deployed on Kubernetes, managed using the Kubernetes APIs and kubectl tooling.

 
Using Spark Operator on Kubernetes

 
Official link:[ https://operatorhub.io/operator/spark-gcp](https://operatorhub.io/operator/spark-gcp)

Installing Kubernetes Operator on EKS

 
Prerequisites:

 

 	- Helm needs to be installed and configured

 	- Verify Helm installation using below command

 
helm version –short

 

 	- Adding Helm repo to running EKS

 	- Helm repo add incubator

 	- Installing Helm repo on EKS :

 
![Installing Kubernetes Operator](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img.png)
 
Verify Installation:

![Verification of Kubernetes Installation](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img-2.png)
 
Once applied, the below mentioned components will be created:

 

 	- Spark-operator-deployment

 	- Spark-serviceaccount

 	- Spark-rbac

 	- Spark-operator-serviceaccount

 	- Spark-operator-rbac

 	- Webhook-init-job

 	- Webhook-service

 
Here we need the job in yaml. Once we have yaml file, we can submit the job using the below command:

 
![submit job command code](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img-3.png)
 
The Yaml file looks as follows:

 
![Result of the Yaml file](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img-4.png)
 
Once we submit the job, it will create 2 pods:

 

 	- Executor

 	- Driver

We can verify these using

 
![get the logs using command](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img-5.png)
 
We can get the logs by using below command

 
![submitted job using command](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img-6.png)
 
Details of submitted job using below command

![Kubernetes Operator on EKS](/wp-content/uploads/2022/12/Installing-Kubernetes-Operator-on-EKS-img-7.png)
 
Contains details about Web UI, service and events that occurred during creation.

 
Access Web UI:

 
To access the Web UI for a long running job can be done using port forwarding, using the below mentioned command

![Web UI Image](/wp-content/uploads/2022/12/Access-Web-UI-img.png)

## Conclusion

Key considerations for Production Spark code on Kubernetes

 
Cost-Effective

 
No requirement of up and running infrastructure to use Spark on EKS.

 
Build and Deployments

 
As we deploy the Docker image with the Spark submit, so when we have code changes we need to pass the docker image with Spark submit. These images can be tagged to track the changes.

 
Ideal Use Cases

 
When workload is less (e.g. 8-10 hr job executions per day) and as batch processing.

 
Availability/Fault Tolerance

 
Kubernetes has the scheduler which manages the pods created as driver and executor. This enables the usage of pods based on resource availability. Quotas for a namespace can be assigned for better resource management.

 
Resource Tracking

 
We have used node selectors in Spark submit which allows us to run specific workloads on a specific node. This in turn allows us to track the usage of resources.

 
Monitoring

 
We have integrated Spark workloads monitoring with Prometheus and Grafana, by using Kube-state-metrics and creating a dashboard. We do a Spark submit by assigning pod labels which allow us to create custom dashboards for specific labels.

 
Logging

 
We have used the ELK stack for visualizing the logs.

 
The approach we have detailed is suitable for [pipelines](/etl-and-data-pipeline/) which use spark as a containerized service. It also ensures optimal utilization of all the resources as there is no requirement for any component, up and running before doing Spark-submit. Additionally, Spark can utilize features like namespace, quotas along with other features of Kubernetes.

## About the Author

Ajaykumar Baljoshi is a Senior Devops Engineer at Sigmoid, who currently works on Containerization, Kubernetes, DevOps and Infrastructure as Code.

[lc_the_tags]

## Featured blogs

[lc_get_posts post_type="post"
posts_per_page="4" orderby="date" output_view="lc_get_posts_mycustom_view" output_number_of_columns="4"
output_wrapper_class="row" output_article_class="shadow" output_hide_elements="Excerpt"
output_excerpt_length="0" output_excerpt_text="Read More" output_heading_tag="span"
output_featured_image_format="thumbnail" output_featured_image_class="card-img-left" ]

## Share

[addtoany]

## Subscribe to get latest insights

## Talk to our experts

Get the best ROI with Sigmoid’s services in data engineering and AI

## Suggested readings

[View all](/blogs/)

![](/wp-content/uploads/2022/12/apache-spark-for-real-time-analytics-thumbnail.png)

#### [Apache Spark for Real-time Analytics](/blogs/apache-spark-for-real-time-analytics/)

[Read blog](/blogs/apache-spark-for-real-time-analytics/)

![Apache Spark on DataProc](/wp-content/uploads/2022/12/Apache-Spark-on-DataProc-vs-Google-BigQuery-thmubnail.png)

#### [Apache Spark on DataProc vs Google BigQuery](/blogs/apache-spark-on-dataproc-vs-google-bigquery/)

[Read blog](/blogs/apache-spark-on-dataproc-vs-google-bigquery/)

![How to Optimize Nested Queries using Apache Spark](https://www.sigmoid.com/wp-content/uploads/2023/01/How-to-Optimize-Nested-Queries-using-Apache-Spark-thumnail.png)

#### [How to Optimize Nested Queries using Apache Spark](/blogs/optimize-nested-queries-using-apache-spark/)

[Read blog](/blogs/optimize-nested-queries-using-apache-spark/)

---

## Categories

- Data Management

---

## Navigation

- [Company](/about-sigmoid)
- [Newsroom](/newsroom)
- [Life at Sigmoid](/careers)
- [Takshashila](/takshashila)
- [Contact Us](/contact-us)
- [AI Strategy Blueprint your AI advantage](/enterprise-ai-strategy/)
- [Generative AI Drive innovation with Generative AI](/generative-ai/)
- [Responsible AI Build trust with ethical AI practices](/responsible-ai-in-enterprise/)
- [Agentic AI Reshape business with scalable agentic systems](/agentic-ai-solutions/)
- [AI Managed Services Ensure reliable AI performance](/ai-managed-services/)
- [Advanced Analytics Transform your business with data-driven insights](/advanced-data-analytics-solutions/)
- [Start Assessment](/agentic-ai-readiness-index/)
- [Data Strategy Strong data foundations for scalable AI](/data-analytics-strategy/)
- [Data Management Leverage data as a strategic asset](/ai-data-management-services/)
- [Data Ops Automate data for speed and quality](/data-devops/)
- [Data Engineering Deliver insights faster with scalable pipelines](/data-engineering/)
- [Cloud Transformation Modernize data to maximise efficiency](/cloud-migration/)
- [Download Whitepaper](/ebooks-whitepapers/building-data-products-in-a-data-mesh-to-drive-business-value/)
- [Data Modeling Structure data for better decisions](/data-modeling-services/)
- [Data Visualization Transform data into actionable stories](/data-visualization-service/)
- [BI Migration Enhance decision making with modern BI tools](/bi-migration/)
- [Data Observability Build trust with healthy, accurate data](/data-observability/)
- [Automated Insights Make smarter decisions with auto-generated insights](/automated-insights/)
- [Download Whitepaper](/ebooks-whitepapers/power-bi-hacks/)
- [CPG & Retail End-to-end analytics for planning, operations, and commercial excellence](/industries/cpg-analytics/)
- [Life Sciences Trusted intelligence across clinical, commercial, and operational workflows](/industries/life-sciences/)
- [Financial Services AI-powered analytics for risk, compliance and customer experience](/industries/banking-financial-analytics-services/)
- [Read case study](/case-studies/data-clean-room-enables-real-time-insights-to-improve-operational-efficiency/)
- [MediaIQ Advanced platform for in-flight marketing measurement](/accelerators/sigmoid-mediaiq-multi-touch-attribution-tool/)
- [CampaignIQ AI-driven platform for optimized campaign budget allocation](/accelerators/sigmoid-campaigniq/)
- [AssistBot GenAI email assistant that automates human-like responses](/accelerators/sigmoid-assistbot-for-ai-email-assistant/)
- [CreativeBot GenAI tool for personalized and brand-aligned creative design](/accelerators/sigmoid-creativebot/)
- [SocialBot GenAI platform to analyze digital conversations and trends](/accelerators/#marketing|socialbot)
- [DemandIQ Predict trends accurately and optimize inventory management](/accelerators/sigmoid-demandiq/)
- [NetworkIQ Track and optimize logistics operations in real-time to quickly address disruptions](/accelerators/sigmoid-networkiq/)
- [SupplyIQ End-to-end platform to optimize supply chain operations](/accelerators/sigmoid-supplyiq/)
- [ProcurementIQ Automated procurement operations for maximum savings, compliance and efficiency](/accelerators/sigmoid-procurementiq/)
- [RapidML Accelerated deployment for machine learning models](/accelerators/sigmoid-rapidml/)
- [DataGuard Comprehensive platform for proactive data quality management](/accelerators/data-quality-tool-sigmoid-dataguard/)
- [CloudPulse Cloud cost optimization platform with multi-cloud management](/accelerators/sigmoid-cloudpulse/)
- [RAPID GenAI foundation with built-in governance and cost clarity](/accelerators/sigmoid-rapid/)
- [AnalyticsBot GenAI based platform to streamline decision-making in analytics](/accelerators/sigmoid-analyticsbot/)
- [DataConnect Seamlessly ingest, integrate and harmonize data from diverse sources](/accelerators/sigmoid-dataconnect/)
- [Reconica AI-powered data harmonization and reconciliation engine](/accelerators/sigmoid-reconica/)
- [ConverseBot GenAI driven insights generation for automated insights from reports](/accelerators/#sales|conversebot)
- [iNRM Cross-lever revenue growth optimization platform](/accelerators/sigmoid-inrm/)
- [AssortmentIQ Optimize shelf layouts and assortment mix at scale with AI-based insights](/accelerators/sigmoid-assortmentiq/)
- [Read Whitepaper](/ebooks-whitepapers/building-agentic-ai-chatbots-for-business-process-transformation/)
- [Listen Podcast](/events/podcast/how-jack-in-the-box-is-redefining-personalization-and-supply-chain-with-ai/)
- [Blogs](/blogs/)
- [White Papers](/ebooks-whitepapers/)
- [Case Studies](/case-studies/)
- [Podcast](/events/podcast/#Podcasts)
- [Read Blog](/blogs/the-genai-adoption-triad-responsibility-ethics-and-explainability/)
- [ConverseBot](/accelerators/#sales|conversebot/)

## Tags

- AI/ML
- Cloud Transformation

---

## Footer Links

- [Talk to our AI experts](/contact-us/)
- [AI Strategy](/enterprise-ai-strategy/)
- [Agentic AI](/agentic-ai-solutions/)
- [Generative AI](/generative-ai/)
- [AI Managed Services](/ai-managed-services/)
- [Responsible AI](/responsible-ai-in-enterprise/)
- [Advanced Analytics](/advanced-data-analytics-solutions/)
- [Data Strategy](/data-analytics-strategy//)
- [Data Engineering](/data-engineering/)
- [Data Management](/ai-data-management-services/)
- [Cloud Transformation](/cloud-transformation/)
- [Data Ops](/data-devops/)
- [Data Visualization](/data-visualization-service/)
- [Automated Insights](/automated-insights/)
- [BI Migration](/bi-migration/)
- [Data Modeling](/data-modeling-services/)
- [Data Observability](/data-observability/)
- [CPG & Retail](/industries/cpg-analytics/)
- [Financial Services](/industries/banking-financial-analytics-services/)
- [Life Sciences](/industries/life-sciences/)
- [Case Studies](/case-studies/)
- [Thought Leadership](/ebooks-whitepapers/)
- [Blogs](/blogs/)
- [Company](/about-sigmoid/)
- [Newsroom](/newsroom/)
- [Accelerators](/accelerators/)
- [Careers](/careers/)
- [Privacy Policy |](/privacy-policy/)
- [Cookie Policy](/cookie-policy/)