
Snowflake Cloud Data Warehouse Architecture

Navdeep Singh Gill | 14 February 2025


Introduction to Snowflake Data Warehouse

Snowflake is a cloud-based analytical data warehouse provided as Software-as-a-Service (SaaS). It is a ready-to-use solution: users can work with it directly, without worrying about installation, deployment, or startup. Snowflake cannot be run on private infrastructure; it runs only on public cloud infrastructure. It manages computation through compute instances and persists data through a storage service.


Because it requires no installation or deployment, users need not worry about updates and maintenance; Snowflake handles these automatically for all types of users, whether they are end users, business analysts, or data scientists.

What is the Architecture of Snowflake Data Warehouse?

Snowflake is a pay-as-you-go service: it charges users on a per-second basis for computation and query processing. Most data warehouses use either a shared-disk or a shared-nothing architecture, but Snowflake uses a hybrid architecture that combines both.

 

In a traditional shared-disk architecture, multiple cluster nodes all access a single shared storage node. In contrast, a traditional shared-nothing architecture has nothing in common across nodes, i.e., each cluster node (CPU) has its own storage. Snowflake's hybrid architecture has three layers -

  • Cloud Services Layer

  • Query Processing Layer

  • Database Storage Layer

Database Storage Layer

  • It uses scalable cloud blob storage to store structured and semi-structured data (JSON, Avro, and Parquet).

  • The storage layer contains tables, schemas, databases, and diverse data. Tables can store petabytes of data and manage partitions effectively.

  • Snowflake internally optimizes and compresses data to make metadata extraction and query processing efficient.

  • Data objects are hidden from users and are accessible only via SQL queries run through the compute layer, as the sketch after this list shows.

  • Storage consists of many micro-partitions that scale automatically when required.

  • All micro-partitions are encrypted, and metadata is generated for them to make data retrieval more efficient and performant.
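
Below is a minimal sketch of this SQL-only access path using the snowflake-connector-python package; the account, credentials, and table and column names are hypothetical placeholders, not values from this article.

import snowflake.connector

# Connect to a (hypothetical) Snowflake account; compute runs in a virtual
# warehouse, while the storage layer itself is never accessed directly.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="DEMO_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Read a field from a hypothetical VARIANT column holding JSON documents;
# micro-partition metadata lets Snowflake prune the data it scans.
cur.execute("SELECT raw:customer.name::string FROM events LIMIT 10")
for (name,) in cur:
    print(name)

conn.close()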

Query Processing Layer

  • This layer is the compute layer of the architecture. It contains multiple virtual warehouses, and every query runs on one virtual warehouse.

  • Virtual warehouses are the compute engines that provide scalable computing capacity.

  • The compute layer retrieves data from the storage layer and caches it locally on its compute resources to improve future query responses, i.e., each virtual warehouse has its own cache.

  • Multiple virtual warehouses can operate simultaneously, maintain ACID properties, and use massively parallel processing on the data.

  • Read and write query integrity is maintained by the cloud services layer, which routes requests to the required virtual warehouse and compute nodes.

  • Virtual warehouses can be auto-resumed and auto-suspended, are easily scalable, and have built-in auto-scaling (when configured), as the sketch after this list illustrates.
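
A minimal sketch of configuring these behaviors with Snowflake DDL through the Python connector; the warehouse name and settings are illustrative assumptions (multi-cluster auto-scaling requires Enterprise edition or above).

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Create a warehouse that suspends after 5 idle minutes, resumes
# automatically when a query arrives, and scales between 1 and 3 clusters.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS demo_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 3
""")
conn.close()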

Cloud Services Layer

  • The services layer is the "brain" of Snowflake. It manages transactions, optimizes queries, provides data security, manages metadata, and enables data sharing.

  • Cloud services run on stateless compute resources across multiple availability zones and rely on a highly available metadata store.

  • The services layer provides the SQL client interface for DDL and DML operations on data.

  • Like the other layers, this layer can scale independently.

Why is Snowflake Data Warehouse essential?

The capabilities that make Snowflake important are described below:

Warehouse Usages

Snowflake virtual warehouses come in different sizes, from X-Small to 4X-Large. Each warehouse size provides a different number of cluster nodes and is billed differently: the bigger the warehouse, the more servers it runs. When choosing a warehouse, make sure it is capable of handling the intended query load, with enough clusters or servers provisioned; the sketch below shows how a warehouse can be resized.
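
A hedged resizing example, assuming the hypothetical demo_wh warehouse from the earlier sketch:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Scale up for a heavy workload; per-second billing changes with the size.
cur.execute("ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'LARGE'")

# Scale back down afterwards to control cost.
cur.execute("ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'XSMALL'")
conn.close()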

 

Increasing warehouse size is not the only factor in query performance enhancement. Data must also be stored in a fully optimized, normalized form of partitions so that fewer files have to be processed to produce results.

Roles to Access Data

Data access is controlled by defining roles on data. Database admins can create as many roles as needed, each granted access to particular data objects such as tables, schemas, and databases. DBAs can also provide read and write access to specific roles. One role can be granted to another, which makes it possible to maintain a hierarchy of privileges for data access control, as the sketch below shows.
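
A minimal sketch of such a hierarchy, with hypothetical role, database, and schema names:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Analysts get read-only access to one schema.
cur.execute("CREATE ROLE IF NOT EXISTS analyst")
cur.execute("GRANT USAGE ON DATABASE demo_db TO ROLE analyst")
cur.execute("GRANT USAGE ON SCHEMA demo_db.public TO ROLE analyst")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA demo_db.public TO ROLE analyst")

# Granting one role to another builds the privilege hierarchy:
# managers inherit everything analysts can do.
cur.execute("CREATE ROLE IF NOT EXISTS manager")
cur.execute("GRANT ROLE analyst TO ROLE manager")
conn.close()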

Partition Blocks and Updations

Data stored in Snowflake takes the form of micro-partitions, whose size varies between 50 and 500 MB of uncompressed data. Snowflake automates micro-partitioning, and users can influence it by defining clustering keys and reclustering tables. The effectiveness of partitioning shows up at retrieval time: when partitions are well managed and tables carry useful clustering information, query performance increases and queries take less time to generate results from stored data.

 

When no clustering information is saved with a table and queries access the data by columns, performance is downgraded: queries may cost more and process irrelevant data. Suppose a table contains terabytes of data and we then recluster it; the data is rearranged according to the newly defined clustering key so that the required clustering information can be used. The sketch below shows a clustering key being defined and inspected.
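
A hedged sketch with hypothetical table and column names:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Define a clustering key so queries filtering on event_date can prune
# micro-partitions, then inspect how well the table is clustered.
cur.execute("ALTER TABLE demo_db.public.events CLUSTER BY (event_date)")
cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('demo_db.public.events', '(event_date)')")
print(cur.fetchone()[0])   # JSON report on clustering depth and overlap
conn.close()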

Time Travel

Snowflake provides a Time Travel feature for querying data as it existed in the past, within a specified retention interval. Every new table has one day of Time Travel by default. This is most valuable when data must be restored from a specific point in time, or when changes must be mapped to measure what data changed and how much data needs to be operated on or processed again.

 

Another benefit to enterprises is that they can evaluate how data has been updated over time. Time Travel helps restore tables, schemas, and databases within the configured retention period (from 1 day up to 90 days); a sketch follows below.
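
A minimal sketch of Time Travel operations on a hypothetical events table:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Query the table as it looked one hour (3600 seconds) ago.
cur.execute("SELECT COUNT(*) FROM events AT(OFFSET => -3600)")
print(cur.fetchone()[0])

# Extend the retention period, then recover from an accidental drop.
cur.execute("ALTER TABLE events SET DATA_RETENTION_TIME_IN_DAYS = 30")
cur.execute("DROP TABLE events")
cur.execute("UNDROP TABLE events")   # restored via Time Travel
conn.close()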

Worksheets

Snowflake's web interface provides worksheets for managing work on data. Users can create as many worksheets as they require, with commenting enabled, so they can compute statistics on data and keep a worksheet of those statistics. This helps users work on the go without managing and saving local reports; worksheets can be kept as long as needed and updated at any time.


Analysis of Queries in Snowflake

The performance of queries in Snowflake is available as a Query Profile in its web interface. When a user runs a query and selects it, the interface displays graphical and statistical information about that query. The interface also provides the history of queries run in the past; selecting a query from the history and moving to its Profile tab displays the query profile, which contains a DAG (directed acyclic graph) of the query plan.

 

Reviewing this query plan makes it easier to eliminate unused parts and write productive queries that process less data and generate only what is required. Queries containing joins and aggregations are costly, as they are not meant to be performed on the fly; in some cases, though, Snowflake tracks basic aggregates internally while saving data. Query history can also be inspected programmatically, as the sketch below shows.
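
A hedged sketch using the INFORMATION_SCHEMA.QUERY_HISTORY table function; connection values are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password", database="DEMO_DB")
cur = conn.cursor()

# List the ten most recent queries with their elapsed time in milliseconds.
cur.execute("""
    SELECT query_id, query_text, total_elapsed_time
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 10))
    ORDER BY start_time DESC
""")
for query_id, text, elapsed_ms in cur:
    print(query_id, elapsed_ms, text[:60])
conn.close()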

Best practices for Snowflake Data Warehouse

The best practices for Snowflake are outlined below:

Best Practices for Schema Design

  • Most of the time, tables contain a timestamp or date field at the required granularity. Storing these values as TIMESTAMP or DATE types gives more efficient results than storing them as VARCHAR.

  • Using a clustering key is good practice; however, if a table contains a small amount of data, or aggregate data is required all the time, clustering may generate many small cluster files.

  • Always keep column data lengths in mind, as mismatched lengths in storage may produce erroneous data on output.

  • When the VARIANT data type is used, timestamp and date values are stored in string format, so it is recommended to flatten the structure and use the appropriate types; retrieving a string is slower than retrieving a timestamp. See the sketch after this list.
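
A minimal flattening sketch; the table and field names are hypothetical:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password")
cur = conn.cursor()

# Flatten a VARIANT column into typed columns so dates and timestamps are
# stored and retrieved as native types rather than strings.
cur.execute("""
    CREATE OR REPLACE TABLE events_flat AS
    SELECT
        raw:event_id::number          AS event_id,
        raw:created_at::timestamp_ntz AS created_at,
        raw:city::string              AS city
    FROM events_raw
""")
conn.close()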

Best Practices for Query Optimisation

  • Clustering helps store data in cluster storage files with named partitions. Tables with effective, active clustering allow queries to be bounded to those clusters, for example by using WHERE conditions on the clustering columns.

  • Warehouse suspension and auto-resume settings also affect queries. By default, the warehouse suspension time is 10 minutes. A segmented workload should be handled by a dedicated virtual warehouse shared by multiple users (if required); this helps reuse the result cache the warehouse generates and provides instant results for repeated queries on the same data.

  • If concurrency is needed, the warehouse can auto-scale, and this can be configured from the user interface. For best results, combine this feature with optimized queries.

Best Practices for Efficient Reads

Accessing data with aggregations directly from Snowflake through the connector's reader is relatively slow and expensive. So, if the data has limits and filter conditions but no aggregation, use the "query" option instead of the "dbtable" option in the Spark-Snowflake connector, as in the sketch below.
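
A hedged PySpark sketch; the connection options and table are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-read").getOrCreate()

# Hypothetical Spark-Snowflake connector options.
sf_options = {
    "sfURL": "my_account.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "DEMO_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "DEMO_WH",
}

# Push limits and conditions down with "query" instead of pulling the
# whole table through "dbtable".
df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("query", "SELECT id, amount FROM orders WHERE amount > 100")
      .load())
df.show()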

Best Practices for Efficient Writes

If you are writing data to a Snowflake table from Spark and that table already exists, try to select the data in the same sequence/order of columns as in the target table. When the orders differ, the workaround is to set the "column_mapping" option to "name" while writing; note that the "column_mapping" -> "name" approach is CASE SENSITIVE. A sketch follows below.
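
A hedged continuation of the read sketch above (it reuses the df and sf_options defined there; the target table name is hypothetical):

# Map columns by name instead of position; names must match the target
# table's column names exactly, since this mapping is case sensitive.
(df.write
   .format("net.snowflake.spark.snowflake")
   .options(**sf_options)
   .option("dbtable", "ORDERS_COPY")
   .option("column_mapping", "name")
   .mode("append")
   .save())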

A cloud-based Infrastructure as a Service model designed by Google which is used for storing and processing massive data sets with the help of several SQL queries. Click to explore about, Google BigQuery

Snowflake Pricing and Best Practices for Optimizing Cost

Snowflake charges users for average monthly storage per TB, plus per-second billing for compute. Compute cost depends on three further factors -
  • The number of clusters used in a multi-cluster environment

  • The number of servers used in each cluster

  • The running time of each server in each cluster

Optimizing and segmenting workloads along these three factors helps reduce costs, for example by using a specific warehouse size for a particular workload and set of users, with a suitable caching retention time (or warehouse suspension time) so the cache can be reused and results retrieved faster. Less complex queries keep billing low, and query complexity can be reduced by applying cluster-based access conditions and removing massive joins.

 

Joins cause more shuffle operations than direct loading, so it is generally recommended to load the data into tools such as Spark or Looker and perform joins and aggregations there. Cost can also be reduced by identifying long-running queries in the Query Profile and then simplifying the schema, and roles can be defined for users to prevent irrelevant queries and unauthorized access. Snowflake also allows the creation of transient tables, which do not preserve history (no Fail-safe period), so the average monthly storage cost can be reduced on the go; see the sketch below.
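
A minimal transient-table sketch with a hypothetical staging table:

import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="my_password", database="DEMO_DB", schema="PUBLIC")
cur = conn.cursor()

# Transient tables skip the Fail-safe period, which lowers storage cost
# for staging or easily reproducible data.
cur.execute("""
    CREATE TRANSIENT TABLE IF NOT EXISTS staging_events (
        event_id NUMBER,
        payload  VARIANT
    ) DATA_RETENTION_TIME_IN_DAYS = 0
""")
conn.close()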


With Snowflake Cloud Data Warehouse offerings, enterprises can enable elasticity, reduce I/O for better query performance, and achieve faster results. Learn more about how these solutions are transforming industries through Data Warehouse Managed Services.

Next Steps with Snowflake Cloud Data Warehouse Architecture

Talk to our experts about implementing Snowflake Cloud Data Warehouse Architecture and how industries and different departments leverage scalable, cloud-native data management to drive data-driven decision-making. Discover how Snowflake’s architecture enhances data sharing, performance, and security, optimizing analytics and operations for improved efficiency and responsiveness.



Navdeep Singh Gill

Global CEO and Founder of XenonStack

Navdeep Singh Gill serves as Chief Executive Officer and Product Architect at XenonStack. He holds expertise in building SaaS platforms for decentralised big data management and governance, and an AI marketplace for operationalising and scaling. His experience in AI technologies and big data engineering drives him to write about different use cases and their solution approaches.
