Let us begin with the exploration of a use case: A Real-time transaction monitoring service for an online financial firm that deals with products such as “Pay Later and Personal Loan”. This firm needs:
- An alert mechanism to flag off fraud transactions – If a customer finds a small loophole in the underwriting rules then he can exploit the system by taking multiple PLs and online purchases through the Pay Later option which is very difficult and sometimes impossible to recover.
- Speeding up of troubleshooting and research in case of system failure or slowdown
- Tracking and evaluation of responses to Marketing campaigns, instantaneously
To achieve the above they want to build a near real-time (NRT) data lake:
- To store ~400TB – last 2 years of historical transaction data
- Handle ~10k transaction records every 5 minutes results of various campaigns.
A typical transaction goes through multiple steps,
- Capturing the transaction details
- Encryption of the transaction information
- Routing to the payment processor
- Return of either an approval or a decline notice.
And the data lake should have a single record for each transaction and it should be the latest state.
Approach 1: Create a Data Pipeline using Apache Spark – Structured Streaming (with data deduped)
A three steps process can be:
- Read the transaction data from Kafka every 5 minutes as micro-batches and store them as small parquet files
- Preparing the consolidated data every 3 hours becomes challenging when the dataset size increases dramatically.
- If we increase the batch execution interval from 3 hours to more, say 6 or 12 hours then this isn’t NRT data lake,
- Any bug in the system if identified by the opportunists, can be exploited and can’t be tracked by IT teams immediately. By the time they see this on the dashboard (after 6 or 12 hours), the business would have already lost a significant amount of money.
- It’s also not very useful for monitoring specific event based campaign, e.g. 5% cashback on food delivery, on the day of “World Cup – Semi-final match”.
Approach 2: Create a Data Pipeline using Apache Spark – Structured Streaming (with duplicate data)
A two steps process can be
- Read the transaction data from Kafka every 5 minutes as micro-batches and store them as small parquet files without any data deduplication,
Adding this additional “where” condition adds extra latency to each of the queries and it would soon become an extra overhead when the data reaches petabytes scale.
Is there any way we can maintain only one copy of the transaction base with the latest transaction state and can provide an easy means to traverse through different snapshots?
Can we add the ACID properties to that single copy of the transaction base parquet table?
Delta Lake by Databricks addresses the above issues when used along with Apache Spark for not just Structured Streaming, but also for use with DataFrame (batch based application).