7. Building a Cloud Data Warehouse
Cloud data storage is expected to account for 53% of all enterprise data by the second quarter of 2021. Going forward, cloud storage or cloud data warehouse is clearly going to play a critical role in enterprise cloud transformation strategy. While the move to cloud sure looks like a safe bet, business intelligence teams must keep a few important things in mind before jumping on the bandwagon, such as:
Fig. 2: Consideration Before Moving to the Cloud
i) Closely understand the data modeling process
As a process, data modeling requires the business intelligence team to closely evaluate data process and storage. Data modeling techniques ideally include methods that govern data representation patterns, the preparation of data models, and communication of business requirements. The techniques essentially depend on some formalized workflows that include a sequence of objectives to be performed in an iterative manner, such as:
Fig. 3: Key Steps in Data Modeling Process
In cloud data warehouse architecture optimizing data retrieval speeds for generating analytics is the key. And this is where dimensional data models help create a difference. While Entity-relationship (ER) data models and relational data models emphasized efficient data storage practices, dimensional models focus on enhancing redundancy in order to make it easier for business intelligence teams to easily locate relevant data for retrieval and reporting purposes.
The two most popular dimensional data models include the star schema and the snowflake schema. In a star schema, data is cataloged as dimensions (reference information) and facts (measurable items). Here each fact is mapped with its related dimensions in a pattern that represents a star. The snowflake schema is similar to the star schema but it involves multiple layers of associated dimensions that make the entire pattern quite complex.
What are the benefits of star schema?
Star schema model offers the following benefits:
Simpler queries: In this model, all the data is linked through the fact table and the multiple dimension tables serve as a single table for information. This makes queries comparatively simpler and easier to perform.
Seamless business data reporting: Star Schema makes the process of retrieving business reports simple through period-over-period and as-of-as reports.
High-performing queries: Star Schema removes the barriers in a normalized schema which enhances query speed and performance of read-only commands.
The challenges of star schema include:
Star schema model can fail to enforce robust data integrity simply because of its denormalized architecture. Although there are countermeasures to stop anomalies from taking place, a quick update or insert command is enough to cause small data incongruities.
This data model is developed and optimized to cater to specific analytical needs and as denormalized data sets, star schema is suited for only simple queries.
What are the benefits of snowflake schema?
Snowflake schemas deliver the following benefits:
Compatibility with OLAP database modeling tools: Some OLAP database management tools that data scientists leverage for data modeling and analysis are designed to be compatible with snowflake schema.
Reduces disk space requirements: Data normalization in snowflake schema significantly reduces disk space requirements. In this data model, data scientists essentially convert strings of non-numerical data into keys that are comparatively less taxing from a storage standpoint.
The challenges of star schema include:
Snowflake data schemas create multiple layers of complexity in an attempt to normalize the star schema attributes. This complexity often results in complicated source queries. In this model, complex joins lower the processing speed of cube data which is not the case for star schema.
ii) Ingestion and processing – ELT vs ETL
Data transmission from multiple disparate sources to a cloud data warehouse is done through an extract-transform-load or an extract-load-transform workflow. Selecting the right workflow is extremely critical in cloud data warehouse design. In the ETL workflow, data transformation is done before loading considering that no further transformation is required for analyzing and reporting. ETL served as the preferred workflow until the proliferation of cloud data warehouses with better data processing capabilities. Cloud data warehouses do not need to completely transform data before loading and data can be transformed eventually on the basis of requirements.
This data warehousing model offers the following benefits:
- While developing the data flow structure there’s no need to define the transformation logic.
- Only the relevant data can be transformed, unlike the ETL flow where the entire data needs to be transformed.
- ELT serves as an ideal workflow to manage unstructured data sets as the purpose of the data is not known beforehand.
iii) Separating storage and compute
While cloud data warehousing allows companies to collect and store large amounts of raw, structured and unstructured data, the majority only manage to process a small portion of the data they collate. Given this, cloud deployments that combine compute and storage often end up investing in additional compute capacity which in most cases stay underutilized.
While mass storage has become cheaper with time, compute has remained expensive. By separating storage and compute, data teams will have the opportunity to scale storage capacity economically depending on the data load while also scaling distributed computing depending on the data processing needs.
iv) Following the data management best practices
Business intelligence teams need to implement stringent data governance measures while building cloud data warehouses in order to ensure data availability, integrity, security and usability. A meticulous data governance strategy can help data teams ensure that data is trustworthy and reliable across channels and is not mistreated. Over the years, efficient data governance has evolved as a critical requirement in enterprise cloud migration initiatives, given the need for organizations to comply with stringent data control and privacy regulations.
As a part of the data governance process, creating a strong data lineage can help BI teams better understand, document and visualize data from its origin to the consumption stage. Data lineage can also help in identifying errors, implementing quick changes within the data management process and carrying out systems migration to save time. Other best practices that can help BI teams reinforce their data lineage process are data cataloging and data archival. While the former helps data engineers capture data asset metadata through a data mapping framework, the latter helps in storing data through indexes and metadata, irrespective of how the data may have originally been stored during active use. Data archival also helps companies reduce the cost to a significant extent by shifting cold or infrequently used data in lower-tier storage systems such as AWS Glacier, Azure Archive Storage and so on.
v) Determining where the company stands in terms of organizational maturity
The majority of organizations today are investing in cloud storage solutions building a cloud data warehouse or data lake on Azure, AWS, Databricks or Snowflakes. However, not all of them stand at a similar point in the overall organizational maturity curve. Whether a company is attempting a cloud shift for the first time or whether it is looking to simply modernize the data storage infrastructure in the cloud, the journey can raise several questions.
Similar to forming a team or developing a process, a company may need to closely understand how well its cloud migration efforts are progressing. It needs to religiously adhere to the best practices, follow the established standards and procedures, review performance against those standards and implement corrective measures if there are deviations. A meticulously developed maturity model from cloud experts can help a company understand where it stands in terms of implementation and the measures needed in order to improve.