3. The data engineering best practices
To make best use of Snowflake and optimize data, here are seven important considerations:
- Evaluate virtual warehouse options
- Select the right warehouse sizing
- Optimize cost for resources, compute & storage
- Set up robust network security policy
- Split user authentication based on system defined roles
- Develop watertight security & access controls
- Streamline data loading & enhance query performance
Let us take a detailed look at each of the considerations and how the data engineering best practices can lead to Snowflake success.
3.1- Evaluate virtual warehouse options
A virtual warehouse is a cluster of compute resources in Snowflake that provides the resources required to execute SQL SELECT statements and perform DML operations. It is important to execute relatively homogeneous queries (in size, complexity, data sets, etc.) in the same warehouse. Best practices also include experimenting with different warehouse sizes before settling on the size that suits the requirements, and using the “auto-suspend” and “auto-resume” features, depending on the workloads, to save costs. Warehouse suspension and resumption take time, which is noticeable for larger warehouses; it is therefore advisable to keep the warehouse running if the enterprise needs an immediate response.
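As a sketch, a warehouse using auto-suspend and auto-resume might be configured as follows (the warehouse name and the suspend threshold are illustrative, not prescribed values):

```sql
-- Create a medium warehouse that suspends after 5 minutes of
-- inactivity and resumes automatically when a query arrives
CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE      = 'MEDIUM'
  AUTO_SUSPEND        = 300      -- seconds of inactivity before suspending
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;
```

For latency-sensitive workloads, a longer AUTO_SUSPEND value (or none at all) keeps the warehouse warm at the cost of extra credits.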
Some of the other considerations are:
Fig 3 Evaluation criteria for selecting the right type of data warehouse
3.2- Select the right warehouse sizing
For a virtual warehouse to function at an optimal level and deliver the desired outcomes, its size must be intelligently ascertained. For that, the warehouse must be aligned with the workload. Here is a look at the various warehouse sizes:
Table 1: Snowflake warehouse sizing
For warehouse sizing success, here are the three things that enterprises must do:
- Right sizing: Recognizing workload patterns, expecting near-linear performance improvements, and increasing warehouse size gradually
- Scaling up and out: Improving individual query performance and data load performance, improving session/query concurrency, and identifying the right cluster size to avoid queuing
- Automating suspend/resume: Applying auto-suspend idle-time settings for on-demand, end-user workloads, and programmatically suspending/resuming warehouses for scheduled jobs where process orchestration is controlled
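The scaling options above can be sketched in SQL; warehouse names and cluster counts are illustrative, and multi-cluster warehouses require the Enterprise edition or higher:

```sql
-- Scale up: resize an existing warehouse for heavier individual queries
-- or faster data loads
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale out: a multi-cluster warehouse that adds clusters under
-- concurrency pressure to avoid query queuing
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD';
```

Scaling up helps a single large query; scaling out helps many concurrent sessions.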
Here is a snapshot of warehouse alignment with workloads:
Table 2: Snowflake warehouse alignment with workloads
3.3- Optimize cost for resources, compute & storage
In order to optimize costs, here are the key areas of consideration:
Resources: Align resource monitors with team-by-team warehouse separation for granular cost governance, set them at the account level if team-by-team quotas are not needed, leverage tiered triggers with escalating actions, and enable proactive notifications.
Compute: Track warehouse consumption through resource monitors and usage views.
Storage: There are two types of storage:
- Time-travel storage: The focus areas include dimensional tables, persistent staging areas, materialised relationships, derivations, and other business rules. The action points include detection of high churn, verification of the retention period on all large or high-churn tables, reduction of the retention period if data can be regenerated/reloaded and the time/effort to do so is within acceptable boundaries/SLAs, and use of periodic zero-copy cloning.
- Failsafe storage: The focus areas include staging tables, intermediate result tables, increased responsibilities for developers, analysts and data scientists, and tool-materialised results. The action points include reserving permanent tables for data that needs the full CDP (Continuous Data Protection) lifecycle, avoiding Fail-safe charges by using temporary/transient tables where appropriate, utilizing temporary tables for session-specific intermediate results in complex data processing workflows, dropping temporary tables as soon as the session ends, utilizing transient tables for staging where truncate/reload operations occur frequently, and considering designating databases/schemas as transient to simplify table creation.
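The cost controls above can be sketched as follows; quotas, warehouse, schema, and table names are illustrative:

```sql
-- Resource monitor with tiered, escalating triggers for a team warehouse
CREATE RESOURCE MONITOR team_a_rm WITH
  CREDIT_QUOTA = 100
  TRIGGERS ON 75  PERCENT DO NOTIFY
           ON 90  PERCENT DO SUSPEND
           ON 100 PERCENT DO SUSPEND_IMMEDIATE;
ALTER WAREHOUSE team_a_wh SET RESOURCE_MONITOR = team_a_rm;

-- Reduce Time Travel retention on a high-churn table whose data
-- can be regenerated/reloaded within acceptable SLAs
ALTER TABLE staging.events SET DATA_RETENTION_TIME_IN_DAYS = 0;

-- Transient table for frequent truncate/reload staging: no Fail-safe storage
CREATE TRANSIENT TABLE staging.customer_stg (
  id        NUMBER,
  loaded_at TIMESTAMP_NTZ
);
```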
3.4- Set up robust network security policy
Network security policies play a crucial role in enterprise success. For robust security, here are the four areas to consider:
Fig 4: Key components of network security
3.5- Split user authentication based on system defined roles
User authentication relies on role management for administrators, developers and DevOps flows, end users, and service accounts. To make the process seamless and extensible, user authentication is demarcated according to the following system-defined roles:
Fig 5: Role based access controls
3.6- Develop watertight security & access controls
For watertight security and access controls, the four key concepts are securable objects, privileges, roles, and users. The aligned best practices include use of schema-level grants, creation of current and future grants, and use of managed access schemas. Here is a snapshot of the role-based access architecture:
Fig 6: Role based access architecture
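The schema-level grants, current-and-future grants, and managed access schemas mentioned above can be sketched as follows (the role, database, and schema names are illustrative):

```sql
-- Managed access schema: only the schema owner (or a higher role)
-- can manage grants on objects inside it
CREATE SCHEMA sales_db.reporting WITH MANAGED ACCESS;

-- Current grants at schema level
GRANT USAGE  ON SCHEMA sales_db.reporting TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_db.reporting TO ROLE analyst_role;

-- Future grants so newly created tables inherit the same privilege
GRANT SELECT ON FUTURE TABLES IN SCHEMA sales_db.reporting TO ROLE analyst_role;
```

Combining current and future grants avoids re-granting after every new table is created.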
The securable objects sequence looks like this:
The system-defined roles that aid role-based access controls are as follows:
Account Administrator: This administrator owns the Snowflake account and can operate on all account objects. This role is granted only to a limited number of users.
Security Administrator: This administrator can manage, modify, and monitor any user, role, or session, and can modify or revoke any grant.
System Administrator: This administrator has the privileges to create warehouses, databases, and other objects, and to grant those privileges to custom roles. Snowflake recommends granting custom roles, and their descendants, to the SYSADMIN role.
User Administrator: This administrator can create users and roles in the account and grant those same privileges to other custom roles.
Public: This pseudo-role is automatically granted to every user and every role in the account, and it can own securable objects. Objects owned by PUBLIC are available to every other user and role.
Fig 8: System defined roles (Adapted from Snowflake documentation)
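Rolling a custom role up under SYSADMIN, as recommended above, might look like this (role and user names are illustrative):

```sql
-- Create a custom role and grant it to SYSADMIN so system administrators
-- retain control over the objects the role will own
USE ROLE SECURITYADMIN;
CREATE ROLE IF NOT EXISTS analyst_role;
GRANT ROLE analyst_role TO ROLE SYSADMIN;

-- Grant the custom role to an end user
GRANT ROLE analyst_role TO USER jane_doe;
```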
3.7- Streamline data loading & enhance query performance
3.7.1- Data Loading
Data loading is a multi-stage process: data files first land in cloud storage, are then loaded into a landing table, and are finally transformed. Breaking the overall process into predefined steps makes it easier to orchestrate and test. Here are some of the notable best practices:
Focus on the Ingestion Pattern:
- Use bulk loading to get the data into tables in Snowflake. Leverage the scalable compute layer to do the bulk of the data processing
- Delete files from internal stages once they are no longer needed, to improve performance in addition to saving on costs
- Use Snowpipe for continuous, micro-batching scenarios
- Isolate load and transform jobs from queries to prevent resource contention. Dedicate separate warehouses for loading and querying operations to optimize performance for each
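A bulk load isolated on its own warehouse can be sketched as follows, with illustrative warehouse, stage, and table names:

```sql
-- Dedicated loading warehouse keeps ingestion off the query warehouse
USE WAREHOUSE load_wh;

-- Bulk load from an internal stage into a landing table
COPY INTO raw.orders
  FROM @raw.order_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  PURGE = TRUE;  -- remove staged files after a successful load
```

The PURGE option implements the practice of clearing stage files that are no longer needed.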
Prioritize File Sizing:
- It is important to optimize the number of parallel loads into Snowflake. It is recommended to create compressed data files that are roughly 10 MB to 100 MB in size.
- Aggregate the smaller files to reduce processing overhead.
- Split the large files into a number of smaller files for faster load. This allows you to distribute the load between servers in the active Snowflake warehouse.
Prepare Delimited Text Files:
- The default character set for delimited text files is UTF-8. Additional encodings can be specified using the ENCODING file format option.
- Within the files, records and fields should be delimited by different characters. Note that each delimiter should be a single character, and the two should not be the same. Pipe (|), caret (^), comma (,), and tilde (~) are common field delimiters. Often the line feed (\n) is used as a row delimiter.
- Fields that contain the delimiter character should be enclosed in single or double quotes. If the data being loaded contains quotes, those must be escaped.
- Fields that contain carriage returns (\r\n) should also be enclosed in single or double quotes. On Windows systems, carriage returns are commonly introduced along with a line feed character to mark the end of a line.
- Each row should have the same number of columns.
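The delimited-file conventions above can be captured in a named file format; the format name and delimiter choices here are illustrative:

```sql
CREATE FILE FORMAT my_csv_format
  TYPE = 'CSV'
  FIELD_DELIMITER  = '|'               -- single-character field delimiter
  RECORD_DELIMITER = '\n'              -- line feed as row delimiter
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'   -- quote fields containing delimiters
  ESCAPE_UNENCLOSED_FIELD = '\\'
  ENCODING = 'UTF8'                    -- the default character set
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE;  -- enforce same column count per row
```

Referencing a named format in COPY INTO statements keeps loading jobs consistent across tables.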
3.7.2- Query Performance
For enhanced query performance, here are the key considerations and best practices:
Fig 9: Best practices for query performance