akhil Akhil Das
Akhil, a Software Developer at Sigmoid focuses on distributed computing, big data analytics, scaling and optimising performance.
Akhil Das
He was a Software Developer at Sigmoid.
Why Apache Arrow is the Future for Open Source Columnar In-Memory Analytics

Performance gets redefined when the data is in memory, Apache Arrow is a de-facto standard for columnar in-memory analytics, Engineers from across the top level Apache projects are contributing towards to create Apache Arrow. In the coming years we can expect all the big data platforms adopting Apache Arrow as its columnar in-memory layer.

Data digital flow

What can we expect from an in-memory system like Apache Arrow:

  • Columnar: Over the past few years, big data is all about columnar. It was primarily inspired by the creation and adoption of Apache Parquet and other columnar data storage technologies.
  • In-memory: SAP HANA was the first one to accelerate the analytical workloads with its in-memory component and then Apache Spark came into the picture in the open source world which accelerates the workloads by holding the data in memory.
  • Complex data and dynamic schemas: Solving business problems are much easier when we represent the data through hierarchical and nested data structures. This was the primary reason for the adoption of JSON and document based databases.

At this point most of the systems out there hardly supports two of the above concepts. Many exists which supports one of them. That’s where the Apache Arrow kicks in, which supports all three of them seamlessly.

Arrow is being designed in a way to supports complex data and dynamic schema and in terms of performance, it is totally based on in-memory and is columnar storage.

Without Arrow

The bottleneck with any typical system comes when the data is moved across machines, Serialization is an overhead in many cases, Arrow improves the performance for the data movement within a cluster without any serialization or deserialization. Another important aspect of Arrow is when two systems use arrow as their in-memory storage, for example Kudu could send Arrow data to Impala for analytics purposes since both of them are Arrow-enabled without involving any costly deserialization on the receipt. Inter Process Communication is mostly happening through shared memory, TCP/IP and RDMA with Arrow. It also supports a wide variety of data types which includes both the SQL and JSON types, such as Int, BigInt, Decimal, VarChar, Map, Struct and Array.

Nobody wants to wait longer to get their answers from the data. The faster they gets the answer the faster they can ask other questions or solve their business problems. CPUs these days become faster and more sophisticated in design, the key challenge in any system is making sure the CPU Utilization is at ~100% and is using it efficiently. When the data is in columnar structure, it is much easier to use SIMD instructions over it.

SIMD is short for Single Instruction/Multiple Data, while the term SIMD operations refers to a computing method that enables processing of multiple data with a single instruction. In contrast, the conventional sequential approach using one instruction to process each individual data is called scalar operations. In some cases, when using AVX instructions, these optimizations can increase performance by two orders of magnitude.

increase performance

Arrow is designed to maximize the cache locality, pipelining and SIMD instructions. Cache locality, pipelining and super-word operations frequently provide 10-100x faster execution performance. Since many analytical workloads are CPU bound, these benefits translate into dramatic end-user performance gains. These gains result in faster answers and higher levels of user concurrency.

Recommended for you

Why Apache Arrow is the Future for Open Source Columnar In-Memory Analytics

March 29th, 2016|

Akhil Das Akhil, a Software Developer at Sigmoid focuses on distributed computing, big data analytics, scaling and optimising performance. Akhil Das He was a Software Developer at Sigmoid. Why Apache Arrow is the Future for Open Source Columnar In-Memory Analytics Performance gets redefined when the data is in memory, Apache Arrow is a de-facto standard for columnar in-memory analytics, Engineers from across the top level Apache projects are contributing towards to create Apache Arrow. In the coming years we

Implementing a Real-Time Multi- dimensional Dashboard

July 13th, 2015|

Arush Kharbanda Arush was a technical team member at Sigmoid. He was involved in multiple projects including building data pipelines and real time processing frameworks. Arush Kharbanda He was a technical team member at Sigmoid. Implementing a Real-Time Multi- dimensional Dashboard The Problem Statement An analytics dashboard must be capable enough to highlight to its users areas needing their attention. This Rolex Replica needs to be done in real time and displayed within acceptable display time lag to the

[How-To] Run SparkR with RStudio

July 3rd, 2015|

Pragith Prakash Pragith was a part of the Data Science Team. His areas of expertise being mathematical modeling, statistical analysis etc. [How-To] Run SparkR with RStudio Your private vip singapore escort has elite call girls ready to provide social services for any of your demands. With the latest release of Apache Spark 1.4.0, SparkR which was a third-party package by AMP Labs, got integrated officially with the main distribution. This update is a delight for Data Scientists and

By |2019-03-11T06:36:02+00:00March 29th, 2016|Streaming, Technology|