---
# Getting data into Spark Streaming

**URL:** https://www.sigmoid.com/blogs/getting-data-into-spark-streaming/
Date: 2015-03-17
Author: Sigmoid
Post Type: post
Summary: In the previous blog post we talked about an overview of Spark Streaming, and now let us take a look on different...Read More...
Categories: Data Management
Tags: Cloud Transformation
Featured Image: https://www.sigmoid.com/wp-content/uploads/2023/01/Getting-Data-into-Spark-Streaming-banner.png
---

In the previous blog post we gave an overview of [Spark Streaming](/blogs/spark-streaming-internals/). Now let us take a look at the different source systems that can be used for creating [Spark Streams](/blogs/spark-streaming-internals/).

[Spark Streaming](/blogs/spark-streaming-production/) provides out-of-the-box connectivity for various source systems. It has built-in support for Kafka, Flume, Twitter, ZeroMQ, Kinesis and raw TCP.

This blog serves as a guide for using and choosing the appropriate source for a Spark Streaming application. It also shares the steps needed to connect to the required system. The selection of the source system depends on the use case/application.

We need a basic machine setup to connect to any of the source systems. You can follow our Spark installation guide for the setup.

We will look into various source systems in detail, starting with Kafka as it is the most commonly used for Spark streams.

## Sources for Spark Streams

### Kafka

Kafka is available under the Apache License. It provides a distributed, reliable, topic-based publisher-subscriber messaging system. If you are just starting with streaming, that sentence may seem complicated, so let’s break it down.

**Distributed** – Kafka can run on a set of servers called brokers. The brokers replicate data among themselves. This distributed architecture allows Kafka to be scalable.

**Publisher-Subscriber System** – Producers publish messages to topics, and [consumers](/customer-analytics/) subscribe to a topic to receive its messages. This gives Apache Spark Streaming a topic-based stream to process.

**Reliable** – Messages passed to Kafka are persisted on disk and replicated to prevent data loss.

Kafka also preserves the order of messages received from a producer: within a topic partition, messages are logged and passed downstream in the order in which they were received.
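To make the ordering and subscription ideas concrete, here is a minimal sketch in plain Python. It is a toy in-memory model of a single topic partition, not the Kafka API: the `TopicLog` class, its methods, and the consumer names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class TopicLog:
    """Toy in-memory stand-in for one topic partition (not the Kafka API)."""

    def __init__(self):
        self._messages = []               # append-only log of messages
        self._offsets = defaultdict(int)  # next offset to read, per consumer

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer_id, max_messages=10):
        """Return this consumer's unread messages, in publish order."""
        start = self._offsets[consumer_id]
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer_id] = start + len(batch)
        return batch


log = TopicLog()
for event in ["click:ad1", "view:ad2", "click:ad2"]:
    log.publish(event)

# Each consumer keeps its own position and sees messages in publish order.
first_batch = log.poll("spark-streaming", max_messages=2)
second_batch = log.poll("spark-streaming")
other_consumer = log.poll("audit-job")
```

Because every consumer tracks its own offset, the hypothetical `audit-job` consumer still receives all three messages even though `spark-streaming` has already read them.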

Once you have Kafka running on a cluster, you can use the `createStream` method of `KafkaUtils` to access data from Kafka.

Below is the API reference for KafkaUtils[1].

![](/wp-content/uploads/2022/12/kafka-createStream-opti.jpg)

**Application**

At Sigmoid, we have used Kafka with Spark Streaming to build an online advertising optimization system. The requirement is to pick the most suitable ad for a user, that is, the one the user is most likely to click. Many websites that are clients of this advertisement [optimization](/events/in-flight-campaign-optimization-using-mta-for-cpg/) system push messages to the Kafka cluster via a Kafka producer. The output of the stream is written to HBase or any other distributed database, to be read by the client. The system is stable and scales well as the load grows. Such a system also makes it possible to measure the effectiveness of a campaign in real time.

![](/wp-content/uploads/2022/12/Spark-Streaming-with-Kafka-2-opti.jpg)

The code below depicts how you can use Kafka to create a Stream of Data, and generate the probabilities for various advertisements being clicked.

![](/wp-content/uploads/2022/12/flume-examples-6-opti.jpg)
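As a rough illustration of the per-batch logic such a job might run, the sketch below estimates a click probability for each ad from a stream of events. It is plain Python rather than Spark code, and the `(ad_id, clicked)` event shape, the function names, and the simple clicks-over-impressions estimate are all assumptions made for this example, not the production system described above.

```python
from collections import Counter


def update_counts(impressions, clicks, batch):
    """Fold one micro-batch of (ad_id, clicked) events into running counters."""
    for ad_id, clicked in batch:
        impressions[ad_id] += 1
        if clicked:
            clicks[ad_id] += 1


def click_probabilities(impressions, clicks):
    """Estimate P(click) per ad as clicks divided by impressions."""
    return {ad: clicks[ad] / shown for ad, shown in impressions.items() if shown > 0}


impressions, clicks = Counter(), Counter()

# Two hypothetical micro-batches, as a streaming job would receive them.
batches = [
    [("ad1", True), ("ad2", False), ("ad1", False)],
    [("ad2", True), ("ad1", True), ("ad2", False), ("ad1", False)],
]
for batch in batches:
    update_counts(impressions, clicks, batch)

probabilities = click_probabilities(impressions, clicks)
best_ad = max(probabilities, key=probabilities.get)
```

In a streaming job the counters would live in shared state (for example the distributed database mentioned above) rather than in local variables, so that every batch updates the same totals.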

Flume is designed to pull data from various sources and push it into HDFS, whereas Kafka is designed to provide data to many systems, of which HDFS could be one.

### Flume

Like Kafka, Flume is available under the Apache License. It is a distributed, reliable system for collecting, aggregating, and moving large amounts of log data from various sources to a centralized data store. It is similar to Apache Kafka in being a distributed, reliable, ready-to-use messaging system, but Kafka fits a wider range of use cases. Flume was initially designed for log processing but can be used in any real-time event processing system.

**Distributed** – Flume runs on a cluster, which allows it to scale.

**Reliable** – An event is deleted from an agent's channel only after it has been stored in the channel of the next agent.

**Recoverability** – Events are staged in a channel that is backed by the local file system. Events generated during a system outage cannot be recovered, but Flume can restore all the events it had already received and have them processed in Spark Streaming.
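The hand-off rule behind the reliability guarantee can be sketched in a few lines of plain Python. This is a toy model and not Flume code: the `Channel` class, its `capacity` limit, and the `forward` function are hypothetical names chosen for illustration.

```python
class Channel:
    """Toy bounded channel that stages events for one agent."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def put(self, event):
        """Store the event if there is room; report whether it was stored."""
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True


def forward(source, sink):
    """Move events downstream, deleting each one only after the sink stored it."""
    moved = 0
    while source.events:
        event = source.events[0]   # peek at the oldest event, do not remove yet
        if not sink.put(event):    # the next channel could not store it
            break                  # the event stays staged in the source
        source.events.pop(0)       # confirmed downstream, now safe to delete
        moved += 1
    return moved


upstream = Channel(capacity=10)
downstream = Channel(capacity=2)
for name in ["e1", "e2", "e3"]:
    upstream.put(name)

moved = forward(upstream, downstream)
```

Here the downstream channel only has room for two events, so the third stays staged upstream instead of being lost.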

If the Flume cluster is already running, you need to connect to it from your [Spark Streaming code](/blogs/fault-tolerant-stream-processing/). `FlumeUtils` can be used to create a stream of data.

Below is the API reference for FlumeUtils[2].

![](/wp-content/uploads/2022/12/flume-createStream-4-opti.jpg)

**Application**

Log analysis helps detect system faults, failures, and security attacks early, so that remedial action can be taken before service quality degrades. Spark Streaming makes it possible to analyze logs as they arrive and so to implement [real-time log analytics](/blogs/apache-spark-for-real-time-analytics/). Flume is often used to collect logs for such analysis.

![](/wp-content/uploads/2022/12/Flume-5-opti.jpg)

The code below depicts how you can create a Stream of data from Flume and create a stream of alerts if anything suspicious is found in the logs.

![](/wp-content/uploads/2022/12/flume-examples-6-opti-1.jpg)
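The alerting step itself can be approximated in plain Python. The sketch below is not the Spark job from the post: the rule names, regular expressions, and sample log lines are assumptions for illustration, and real rules would depend on the log format in use.

```python
import re

# Hypothetical alert rules; real rules depend on the log format in use.
RULES = {
    "auth_failure": re.compile(r"failed password|authentication failure", re.IGNORECASE),
    "server_error": re.compile(r"\bERROR\b|\bFATAL\b"),
    "out_of_memory": re.compile(r"out of memory", re.IGNORECASE),
}


def to_alert(line):
    """Return an alert dict for the first rule the line matches, else None."""
    for rule_name, pattern in RULES.items():
        if pattern.search(line):
            return {"rule": rule_name, "line": line}
    return None


def alerts_for_batch(lines):
    """Turn one micro-batch of log lines into a list of alerts."""
    return [alert for alert in map(to_alert, lines) if alert is not None]


batch = [
    "2015-03-17 10:00:01 INFO request served in 12 ms",
    "2015-03-17 10:00:02 sshd: Failed password for root from 10.0.0.5",
    "2015-03-17 10:00:03 ERROR worker 3 crashed",
]
alerts = alerts_for_batch(batch)
```

A function like `to_alert` is the kind of per-record logic that would be applied to every event in the stream, with non-matching lines filtered out.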


### MQTT (Message Queue Telemetry Transport)

MQTT is a widely used, simple, and lightweight messaging protocol that implements a publisher-subscriber messaging system. It is designed for constrained environments: small devices with limited memory, such as mobile devices, and unreliable, low-bandwidth networks. This makes it suitable for Internet of Things (IoT) use cases.

As with Flume and Kafka, Spark Streaming provides a library for MQTT connectivity. You can use `MQTTUtils` to connect to an existing, running MQTT stream.

Below is the API reference for MQTTUtils[3].

![](/wp-content/uploads/2022/12/MQTT-createStream-7-opti.jpg)

**Application**

Vehicle driver monitoring applications analyze various data points and suggest ways to improve your driving and get better mileage from your vehicle. Such systems record the vital parameters of the vehicle and transmit them to a streaming system, which performs computations on the data to generate usable data points. A lightweight queue can be used on such a device to transmit events to Spark streams.

![](/wp-content/uploads/2022/12/Spark-Streaming-with-MQTT-8-opti.jpg)

The code below depicts how you can create a Stream using MQTT utils and use that stream to check if the gear used is optimal.

![](/wp-content/uploads/2022/12/MQTT-example-9-opti.jpg)
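A gear check of this kind might look like the following plain-Python sketch. The message format (`vehicle_id,speed_kmh,gear`), the speed bands, and the function names are all hypothetical; actual thresholds would come from the vehicle's specification.

```python
# Hypothetical speed bands in km/h, each mapped to the gear assumed optimal for it.
GEAR_BANDS = [
    (0, 15, 1),
    (15, 30, 2),
    (30, 50, 3),
    (50, 70, 4),
    (70, float("inf"), 5),
]


def optimal_gear(speed_kmh):
    """Look up the gear assumed optimal for the given speed."""
    for low, high, gear in GEAR_BANDS:
        if low <= speed_kmh < high:
            return gear
    raise ValueError(f"speed out of range: {speed_kmh}")


def check_gear(payload):
    """Parse a 'vehicle_id,speed_kmh,gear' message and compare gear to the optimum."""
    vehicle_id, speed, gear = payload.split(",")
    expected = optimal_gear(float(speed))
    return {
        "vehicle": vehicle_id,
        "gear": int(gear),
        "expected": expected,
        "optimal": int(gear) == expected,
    }


ok = check_gear("car-7,42.0,3")
too_low = check_gear("car-7,80,3")
```

Each message from the device is checked on its own, so the same function could be applied to every event arriving on the stream.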

### ZeroMQ

ZeroMQ is an open-source, lightweight messaging library rather than a complete messaging system, which gives the user the flexibility to implement any mechanism for the messaging queue.

### Kinesis

Kinesis is a managed service, so the user does not have to think about how to process data in parallel and can concentrate on the processing logic. Plug in your code for processing the data and let Kinesis handle the rest. Because it is managed by [Amazon](/blogs/proven-methods-to-reduce-aws-cloud-infrastructure-cost/), it can be more stable and scalable than the other systems discussed here, but it does not offer the flexibility you get with Kafka or the other systems.

To connect to a Kinesis stream from Spark Streaming, use [KinesisUtils](https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/streaming/).

**Applications**

The applications discussed for Flume and Kafka also apply to Kinesis:

- Click streams from websites: [customers](/customer-analytics/) can be provided with real-time [analytics](/blogs/top-data-analytics-trends-to-watch-out-in-2021/) of their websites.
- Stock market firms: can use the incoming real-time data to follow the trends of the industry.

Spark Streaming can connect to any other source system using a custom receiver and an API to communicate with that system.

The table below summarizes the features covered in this blog.

| Feature/Source | Kafka | Flume | MQTT | ZeroMQ | Kinesis |
|---|---|---|---|---|---|
| Open source/Proprietary | Open source (Apache) | Open source (Apache) | Open source (Eclipse) | Open source (Mozilla Public License) | Proprietary (Amazon) |
| Lightweight | No | No | Yes | Yes | No |
| Distributed | Yes | Yes | No | N/A | Yes |
| Messaging component | Messaging system | Messaging system | Messaging system | Messaging library | Messaging system |
| Advantages | Widely accepted architecture, suitable for any source type | Suitable for log processing | Fits the IoT use case, can be used for many IoT applications | Provides a messaging library, giving the user the flexibility to implement any mechanism for the messaging queue | Provides a scalable queue for event processing |
| Application | General scalable stream processing | Log processing | IoT, mobile devices, lightweight event processing | IoT, mobile devices, lightweight event processing | General scalable stream processing |

## About the Author

Arush was a technical team member at Sigmoid. He was involved in multiple projects including building data pipelines and real time processing frameworks.
