Enterprises have been spending millions of dollars getting data into data lakes with Apache Spark. The aspiration is to do Machine Learning on all that data – Recommendation Engines, Forecasting, Fraud/Risk Detection, IoT & Predictive Maintenance, Genomics, DNA Sequencing, and more. But the majority of these projects fail to come to fruition because the data is unreliable and not ready for ML.
~60% of big data projects fail each year according to Gartner.
These failures often stem from data reliability challenges with data lakes, such as:
- Failed production jobs that leave data in a corrupt state requiring tedious recovery
- Lack of schema enforcement creating inconsistent and low-quality data
- Lack of consistency making it almost impossible to mix appends and reads, batch and streaming
That’s where Delta Lake comes in. Some salient features (illustrated in the sketch after this list) are:
- Open format based on Parquet
- ACID transactions
- Full compatibility with Apache Spark’s APIs
- Time travel / data snapshots
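As a quick illustration, here is a minimal PySpark sketch of these features: writing a small Delta table through the standard Spark DataFrame API, then reading an earlier snapshot back with time travel. The path `/tmp/delta/demo` and the session configuration are assumptions for this example, not part of the original use case.

```python
from pyspark.sql import SparkSession

# Assumed local session configured with the Delta Lake extensions.
spark = (
    SparkSession.builder
    .appName("delta-features-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a DataFrame as a Delta table through the regular Spark API;
# the write is an ACID transaction, so readers never see partial results.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# Time travel: read the table as it looked at an earlier version (snapshot).
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/delta/demo")
)
old_snapshot.show()
```

On disk, the table is just Parquet data files plus a `_delta_log` transaction log directory, which is what makes the format open and earlier snapshots queryable.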
Our use case follows a three-step process, sketched in code after this list:
1. Create a Delta table by reading historical transaction data,
2. Read the transaction data from Kafka every 5 minutes as micro-batches,
3. Merge each micro-batch into the existing Delta table.
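Here is a minimal PySpark sketch of those three steps. The paths, the Kafka broker and topic names, the transaction schema, and the join key `id` are all assumptions made for illustration; the details of a real pipeline will differ.

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("txn-merge-demo").getOrCreate()

# Step 1: create the Delta table from historical transaction data
# (assumed to live as Parquet files under /data/transactions/historical).
(spark.read.parquet("/data/transactions/historical")
      .write.format("delta").mode("overwrite").save("/delta/transactions"))

# Step 2: read the transaction data from Kafka as a structured stream.
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed broker
    .option("subscribe", "transactions")                # assumed topic
    .load()
)

# Parse the Kafka message value as JSON into columns (assumed schema).
txn_schema = "id STRING, amount DOUBLE, event_time TIMESTAMP"
transactions = (
    raw_stream
    .select(F.from_json(F.col("value").cast("string"), txn_schema).alias("t"))
    .select("t.*")
)

# Step 3: merge (upsert) each micro-batch into the existing Delta table.
def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/delta/transactions")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

# Trigger a micro-batch every 5 minutes.
(transactions.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/delta/transactions/_checkpoints")
    .trigger(processingTime="5 minutes")
    .start())
```

Using `foreachBatch` with a merge is the usual pattern for upserting streaming data into a Delta table: each 5-minute micro-batch is applied as a single ACID transaction, so concurrent readers always see a consistent view of the table.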