Blog 2018-06-05T15:39:37+00:00
An extensive list of technical and business topics, discussed in detail by experts in the field. We’ll help you gain a clear understanding of analytics.

Business
Technology

Trending Now

Cloud Computing

Integrating Spark, Kafka & HBase to Power a Real-Time Dashboard

By Arush Kharbanda | June 9th, 2015

Industries are increasingly leveraging Big Data for analytics and for making key decisions to optimize their existing businesses. Traditionally, dashboards were updated by batch jobs, so there has always been a lag of several minutes [...]

Why Apache Arrow is the Future for Open Source Columnar In-Memory Analytics

By Akhil Das | March 29th, 2016

Akhil Das was a Software Developer at Sigmoid, focusing on distributed computing, big data analytics, scaling, and performance optimisation. Why Apache Arrow is the Future for Open Source Columnar In-Memory Analytics: Performance gets redefined when data is in memory. Apache Arrow is the de-facto standard for columnar in-memory analytics, and engineers from across the top-level Apache projects are contributing to it. In the coming years we can expect all the major big data platforms to adopt Apache Arrow as their columnar in-memory layer. [...]
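
The columnar idea behind Arrow can be sketched in plain Python. This is a conceptual illustration only, not the Arrow API: one contiguous array per field, so scanning a single column never touches the rest of each record.

```python
# Conceptual sketch (plain Python, NOT the Apache Arrow API) of why a
# columnar in-memory layout speeds up analytics.

rows = [
    {"user": "a", "clicks": 3, "revenue": 1.5},
    {"user": "b", "clicks": 7, "revenue": 0.0},
    {"user": "c", "clicks": 2, "revenue": 4.2},
]

# Columnar layout: one contiguous array per field. Aggregating a single
# column reads only that array -- cache-friendly and easy to vectorise,
# which is the property Arrow standardises across engines.
columns = {
    "user":    [r["user"] for r in rows],
    "clicks":  [r["clicks"] for r in rows],
    "revenue": [r["revenue"] for r in rows],
}

total_clicks = sum(columns["clicks"])  # touches one array, not every row
print(total_clicks)  # 12
```

In a row-oriented layout the same aggregation would stride over every field of every record; the columnar form is what lets engines share data without serialisation, which is Arrow's core pitch.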

Implementing a Real-Time Multi-dimensional Dashboard

By Arush Kharbanda | July 13th, 2015

Arush Kharbanda was a technical team member at Sigmoid, involved in multiple projects including building data pipelines and real-time processing frameworks. Implementing a Real-Time Multi-dimensional Dashboard: The Problem Statement. An analytics dashboard must be able to highlight to its users the areas needing their attention. This needs to happen in real time, with results displayed within an acceptable lag; any screen must render within the industry-standard time of 3 seconds. You would need to [...]

[How-To] Run SparkR with RStudio

By Pragith Prakash | July 3rd, 2015

Pragith Prakash was a part of the Data Science Team; his areas of expertise include mathematical modeling and statistical analysis. [How-To] Run SparkR with RStudio: With the latest release of Apache Spark 1.4.0, SparkR, previously a third-party package by AMP Labs, was officially integrated into the main distribution. This update is a delight for data scientists and analysts who are comfortable with their R ecosystem but still want the speed and performance of Spark. In this article, I'll walk you through creating an Ubuntu instance from scratch and installing R, RStudio, Spark, [...]

Spark Streaming in Production

By Arush Kharbanda | April 22nd, 2015

Arush Kharbanda was a technical team member at Sigmoid, involved in multiple projects including building data pipelines and real-time processing frameworks. Spark Streaming in Production: This is the next post in our series on Spark Streaming. After discussing what Spark Streaming is and how it works, we will now look at how to implement it in production. At Sigmoid we have implemented Spark Streaming in production for several customers and achieved great results by improving the design [...]

Fault Tolerant Stream Processing with Spark Streaming

By Arush Kharbanda | April 19th, 2015

Arush Kharbanda was a technical team member at Sigmoid, involved in multiple projects including building data pipelines and real-time processing frameworks. Fault Tolerant Stream Processing with Spark Streaming: Introduction. After looking at how Spark Streaming works and discussing good production practices for it, this post is about making your Spark Streaming implementation fault tolerant and highly available. Fault tolerance: if you plan to use Spark Streaming in a production environment, it's essential that your system be fault tolerant. [...]
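
The core mechanism behind fault-tolerant stream processing is checkpoint-and-replay: persist the processing state together with the input offset, and on restart resume from the last checkpoint instead of reprocessing (or double-counting) the whole stream. A minimal plain-Python sketch of that idea, not Spark's actual checkpointing API:

```python
# Minimal checkpoint-and-replay sketch (plain Python, NOT the Spark API).
# State and the input offset are persisted together, so a restarted job
# skips records it already processed instead of counting them twice.
import json
import os
import tempfile

def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)          # {"offset": int, "total": int}
    return {"offset": 0, "total": 0}

def save_checkpoint(path, state):
    with open(path, "w") as f:
        json.dump(state, f)

def process(stream, ckpt_path):
    state = load_checkpoint(ckpt_path)
    for offset, value in enumerate(stream):
        if offset < state["offset"]:
            continue                     # already processed before the crash
        state["total"] += value
        state["offset"] = offset + 1
        save_checkpoint(ckpt_path, state)  # real systems checkpoint per batch
    return state["total"]

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
data = [1, 2, 3, 4]
process(data[:2], ckpt)       # simulate a crash after two records
total = process(data, ckpt)   # restart: replays the feed, skips offsets 0-1
print(total)                  # 10, each record counted exactly once
```

Spark Streaming achieves the same exactly-once accounting with write-ahead logs and RDD lineage rather than a JSON file, but the contract is identical: state plus input position survive a failure together.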

Fault Tolerant Streaming Workflows with Apache Mesos

By Akhil Das | April 9th, 2015

Akhil Das, a Software Developer at Sigmoid, focuses on distributed computing, big data analytics, scaling, and performance optimisation. Fault Tolerant Streaming Workflows with Apache Mesos: Mesos High Availability Cluster. Apache Mesos is a high-availability cluster operating system: it runs several masters, with one leader, and the other (standby) masters serve as backups in case the leader fails. ZooKeeper elects the leader and handles failures. Mesos is framework independent and can intelligently schedule and run Spark, Hadoop, and other frameworks concurrently on the same cluster. [...]
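
The leader/standby failover described above can be illustrated with a toy election function. This is plain Python, not ZooKeeper or the Mesos API; in a real cluster ZooKeeper performs the election with quorum and sessions, which this sketch deliberately omits:

```python
# Toy sketch (plain Python, NOT ZooKeeper/Mesos) of leader failover:
# one master leads; when it fails, a standby is promoted.

def elect_leader(masters, failed):
    """Return the first healthy master, mimicking a fresh election."""
    healthy = [m for m in masters if m not in failed]
    if not healthy:
        raise RuntimeError("no masters available")
    return healthy[0]

masters = ["master-1", "master-2", "master-3"]
leader = elect_leader(masters, failed=set())         # master-1 leads
leader = elect_leader(masters, failed={"master-1"})  # standby promoted
print(leader)  # master-2
```

The point of the standby masters is exactly this: framework schedulers reconnect to whichever master wins the election, so a single master failure does not stop running workflows.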

Spark Streaming: A Look under the Hood

By Arush Kharbanda | March 31st, 2015

Arush Kharbanda was a technical team member at Sigmoid, involved in multiple projects including building data pipelines and real-time processing frameworks. Spark Streaming: A Look under the Hood. Spark Streaming is designed to provide window-based and stateful stream processing for any real-time analytics application. It allows users to do complex processing, such as running machine learning and graph processing algorithms on streaming data. This is possible because Spark Streaming uses the Spark processing engine underneath the DStream API [...]
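
Window-based stream processing, the feature the excerpt highlights, can be sketched in a few lines of plain Python (a conceptual stand-in for the DStream API, not Spark itself): keep the most recent batches and re-aggregate over them each time a new batch arrives.

```python
# Conceptual sketch (plain Python, NOT the DStream API) of window-based
# stream processing: aggregate over a sliding window of recent batches.
from collections import Counter, deque

def windowed_counts(batches, window=3):
    """Yield word counts aggregated over a sliding window of batches."""
    recent = deque(maxlen=window)     # old batches fall out automatically
    for batch in batches:
        recent.append(Counter(batch))
        yield sum(recent, Counter())  # combined counts across the window

batches = [["a", "b"], ["b"], ["c"], ["a"]]
results = list(windowed_counts(batches, window=2))
print(results[-1])  # counts over the last two batches, ["c"] and ["a"]
```

In Spark Streaming the window is defined in units of time (window length and slide interval over micro-batches) rather than batch count, but the aggregation pattern is the same.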

Getting Data into Spark Streaming

By Arush Kharbanda | March 17th, 2015

Arush Kharbanda was a technical team member at Sigmoid, involved in multiple projects including building data pipelines and real-time processing frameworks. Getting Data into Spark Streaming: In the previous blog post we gave an overview of Spark Streaming; now let us take a look at the different source systems that can feed it. Spark Streaming provides out-of-the-box connectivity for various source systems, with built-in support for Kafka, Flume, Twitter, ZeroMQ, Kinesis and [...]
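
Whatever the source system, Spark Streaming consumes records by cutting the incoming feed into micro-batches. A plain-Python sketch of that ingestion pattern (grouping by record count here, where Spark's receivers group by a fixed time interval):

```python
# Sketch (plain Python, NOT Spark's Receiver API) of micro-batching:
# an incoming record stream is cut into fixed-size batches for processing.
import itertools

def micro_batches(records, batch_size):
    """Group an incoming record stream into fixed-size batches
    (a stand-in for Spark's fixed *time* intervals)."""
    it = iter(records)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch

incoming = range(7)   # stand-in for records from Kafka, Flume, a socket...
batches = list(micro_batches(incoming, batch_size=3))
print(batches)        # [[0, 1, 2], [3, 4, 5], [6]]
```

Each yielded batch corresponds to one RDD in a DStream; the built-in connectors (Kafka, Flume, Kinesis, and the rest) differ only in how records arrive, not in this batching model.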

Apache Spark: A Look under the Hood

By Arush Kharbanda | January 24th, 2015

Arush Kharbanda was a technical team member at Sigmoid, involved in multiple projects including building data pipelines and real-time processing frameworks. Apache Spark: A Look under the Hood. Before diving deep into how Apache Spark works, let's understand the jargon of Apache Spark. Job: a piece of code that reads some input from HDFS or the local filesystem, performs some computation on the data, and writes some output. Stages: jobs are divided into stages, and stages are classified as a Map or [...]
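
The job/stage jargon from the excerpt can be made concrete with a tiny plain-Python analogue (illustrative only, not the Spark API): a "job" that runs a map stage over its input and then a reduce stage over the mapped results.

```python
# Illustrative sketch (plain Python, NOT the Spark API) of the jargon above:
# a "job" reads input, runs a map stage, then a reduce stage, and produces
# output.
from functools import reduce

def run_job(lines):
    # Map stage: transform each input record independently -- this is the
    # part Spark can parallelise across partitions.
    mapped = [len(line.split()) for line in lines]
    # Reduce stage: combine the mapped records into one result. In Spark,
    # the boundary between stages is where a shuffle would occur.
    return reduce(lambda a, b: a + b, mapped, 0)

total_words = run_job(["spark under the hood", "jobs and stages"])
print(total_words)  # 7
```

In real Spark the stages operate on distributed RDD partitions and the scheduler builds them from the lineage graph, but the map-then-reduce division of a job is the same shape.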
