Spark has a great query optimization capability that can significantly improve the execution time of queries and ensure cost reduction. However, when it comes to nested queries, there is a need to further advance optimization techniques in Spark. Nested queries typically include multiple dimensions and metrics such as “OR” and “AND” operations along with millions of values that need to be filtered out.
For instance, when Spark is executed with a query of, say 100 filters, with nested operations between them, the execution process slows down. Since there are multiple dimensions involved, it takes a huge amount of time for the optimization process. Depending upon the cluster size, it may take hours to process a dataFrame with those filters.
In this blog, we will walk you through common challenges faced while working with nested filter queries and how we were able to solve it using logical tree optimization. We found that upon tweaking some of the codes and applying a common algorithm, we were able to reduce the total cost and execution time of the query.
We could further optimize the logical plan and send processed data to apply filters, thus reducing the total load on processing. The logical query is optimized in such a way that there’s always a predicate pushdown for optimal execution of the next part of the query. We used Apache Spark with scala API for this use case.
The user needed to filter out all values from the millions of single columns of records from a file. For this filter file, the supported filters are “include” and “exclude”. The user can select multiple dimensions for applying filters and can combine them with a filter operator which is either “AND” or “OR” or both. So, the user could now create a nested filter expression containing multiple dimensions with both “OR” and “AND” operations between them.
This requirement was to be implemented in our product SigView. For this use case, we developed a feature called Bulk Filter (as it includes “OR” operation between filter dimensions). All the filters inside the Bulk Filter are rearranged on the basis of cost, forming an optimized group of filters for processing.