What are micro-partitions and data clustering?
In Snowflake, all data in tables is automatically divided into micro-partitions, which are contiguous units of storage. Snowflake is columnar-based and horizontally partitioned, meaning a row of data is stored in the same micro-partition.
To allow you more control over clustering, Snowflake supports explicitly choosing the columns on which a table is clustered. These columns are called clustering keys and they enable Snowflake to maintain clustering according to the selected columns, as well as enable you to recluster on command. Reclustering a table impacts its physical organization by rearranging the data in a subset of micro-partitions for the table.
Using clustering keys
You can specify the clustering keys when creating a table using CREATE TABLE, or when modifying a table using ALTER TABLE. In either case, include the CLUSTER BY clause, e.g.:
CREATE OR REPLACE TABLE TEST (C1 INT, C2 INT) CLUSTER BY (C1, C2);
ALTER TABLE TEST CLUSTER BY (C1, C2);
The column(s) selected as clustering keys and their order matter. And you can modify the clustering keys at any time using the ALTER TABLE command:
ALTER TABLE TEST CLUSTER BY (C1);
You can see the clustering keys for a table in the cluster_by column in the SHOW TABLES command output:
SHOW TABLES LIKE 'TEST';
created_on | name | database_name | schema_name | kind | comment | cluster_by | rows | bytes | owner |
retention_time
|
2016-12-09 23:35:10.393 +0000 | TEST | MINDB | PUBLIC | TABLE | | (C1, C2) | 0 | SYSADMIN | 3 | 1 |
The SYSTEM$CLUSTERING_RATIO function returns the clustering ratio on a specified table column. If no column is specified, the function uses the clustering keys for the table. You can also specify a subset of the table by providing a predicate:
SELECT SYSTEM$CLUSTERING_RATIO('TEST'),
SYSTEM$CLUSTERING_RATIO('TEST',('C1')),
SYSTEM$CLUSTERING_RATIO('TEST',('C2'));
70 80 50
For more comprehensive examples, see the documentation here.
The clustering ratio is a number between 0 and 100. A clustering ratio of 100 means the table is perfectly clustered and all data is physically ordered. If a clustering ratio for two columns is 100%, there is no overlapping among the micro-partitions for the columns of data, and each partition stores a unique range of data for the columns.
You can also get more clustering information using SYSTEM$CLUSTERING_DEPTH and SYSTEM$CLUSTERING_INFORMATION.
SELECT SYSTEM$CLUSTERING_DEPTH('TEST');
SELECT SYSTEM$CLUSTERING_INFORMATION('TEST');
CLUSTERING_DEPTH computes the average depth of the table according to the clustering keys defined for the table or the clustering keys specified in the function arguments. The CLUSTERING_INFORMATION function returns a JSON object containing name/value pairs including cluster_by_keys,total_partition_count, total_constant_partition_count, average_overlaps, average_depth, and partition_depth_histogram depicting the distribution of overlap depth for each micro-partition in the table.
How does the organization of data into micro-partitions impact query performance?
How data is organized into micro-partitions has a significant impact on pruning performance. The micro-partition metadata collected transparently by Snowflake enables precise pruning of columns into micro-partitions at query run-time, including columns containing semi-structured data.
The following example illustrates how reclustering works on micro-partitions. Let's say we have 12 rows of data, and each micro-partition stores 4 rows:
CREATE OR REPLACE TABLE TEST (C1 INT, C2 INT);
INSERT INTO TEST VALUES (1,1), (3,5), (5,9), (6,12), (4,7),
(1,2), (3,6), (6,11), (2,4), (5,10),
(2,3), (4,8);
The following tables show the arrangement of data with and without reclustering.
In the first table, the data is arranged into micro-partitions by order of population. Each micro-partition is color coded: red, green, and purple respectively.
Before reclustering | C1 | C2 |
MP1
|
1
|
1
|
MP1
|
3
|
5
|
MP1
|
5
|
9
|
MP1
|
6
|
12
|
MP2
|
4
|
7
|
MP2
|
1
|
2
|
MP2
|
3
|
6
|
MP2
|
6
|
11
|
MP3
|
2
|
4
|
MP3
|
5
|
10
|
MP3
|
2
|
3
|
MP3
|
4
|
8
|
Without reclustering, each micro-partition has the following arrangement, by micro-partition:
Micro-partition 1 | | | | |
C1
|
1
|
3
|
5
|
6
|
C2
|
1
|
5
|
9
|
12
|
Micro-partition 2 | | | | |
C1
|
4
|
1
|
3
|
6
|
C2
|
7
|
2
|
6
|
11
|
Micro-partition 3 | | | | |
C1
|
2
|
5
|
2
|
4
|
C2
|
4
|
10
|
3
|
8
|
Next, we will recluster the data as follows:
ALTER TABLE TEST CLUSTER BY (C1, C2);
ALTER TABLE TEST RECLUSTER;
With reclustering, the data is reorganized as follows:
After reclustering | C1 | C2 |
MP1
|
1
|
1
|
MP1
|
1
|
2
|
MP1
|
2
|
3
|
MP1
|
2
|
4
|
MP2
|
3
|
5
|
MP2
|
3
|
6
|
MP2
|
4
|
7
|
MP2
|
4
|
8
|
MP3
|
5
|
9
|
MP3
|
5
|
10
|
MP3
|
6
|
11
|
MP3
|
6
|
12
|
With reclustering, each micro-partition has the following arrangement, by column:
Micro-partition 1 | | | | |
C1
|
1
|
1
|
2
|
2
|
C2
|
1
|
2
|
3
|
4
|
Micro-partition 2 | | | | |
C1
|
3
|
3
|
4
|
4
|
C2
|
5
|
6
|
7
|
8
|
Micro-partition 3 | | | | |
C1
|
5
|
5
|
6
|
6
|
C2
|
9
|
10
|
11
|
12
|
Now, let's determine how the data in each micro-partition overlaps, with and without reclustering, revealing how well a table is clustered on certain columns. The minimum and maximum value for each column in each micro-partition determines the overlapping region.
MP: Micro-Partition
|
Before Reclustering
|
After Reclustering
|
MP 1: C1 [Min:Max]
|
[1:6]
|
[1:2]
|
MP 1: C2 [Min:Max]
|
[1:12]
|
[1:4]
|
MP 2: C1 [Min:Max]
|
[1:6]
|
[3:4]
|
MP 2: C2 [Min:Max]
|
[2:11]
|
[5:8]
|
MP 3: C1 [Min:Max]
|
[2:5]
|
[5:6]
|
MP 3: C2 [Min:Max]
|
[3:10]
|
[9:12]
|
Overlapping Summary
|
Different micro-partitions overlap on both columns.
|
There is no overlap between different micro-partitions on both columns. C2 is unique and in order.
|
How does reclustering impact queries?
Let's see how reclustering impacts querying a table. Consider the following query:
SELECT C1, C2 FROM TEST
WHERE C1 IN [2, 5, 9]
AND C2 BETWEEN 3 AND 6;
The chart below indicates whether the query required the scanning of a particular micro-partition based on the minimum and maximum values for each column and predicates specified in the query:
MP: Micro-Partition
|
Before Reclustering
|
After Reclustering
|
MP 1: C1 [Min:Max]
|
[1:6] Yes
|
[1:2] Yes
|
MP 1: C2 [Min:Max]
|
[1:12] Yes
|
[1:4] Yes
|
MP 2: C1 [Min:Max]
|
[1:6] Yes
|
[3:4] No
|
MP 2: C2 [Min:Max]
|
[2:11] Yes
|
[5:8] Yes
|
MP 3: C1 [Min:Max]
|
[2:5] Yes
|
[5:6] Yes
|
MP 3: C2 [Min:Max]
|
[3:10] Yes
|
[9:12] No
|
MP scanned (need a YES on both C1 and C2)
|
MP1, MP2, MP3
|
MP 1
|
For this query, all three micro-partitions were scanned before the table was reclustered. By contrast, only one micro-partitioned was scanned after reclustering.
How are DML operations impacted?
If you perform continual DML operations on a table, the operations alter the layout of micro-partitions on the table. Changes to the layout ultimately impact query performance. If the layout strategy of micro-partitions is specified (by specifying clustering keys), additional maintenance must be performed to keep up with the regular DMLs on the table.
A recluster operation is performed using the ALTER TABLE command:
ALTER TABLE TEST RECLUSTER;
ALTER TABLE TEST RECLUSTER MAX_SIZE = 100000;
You can periodically use the SYSTEM$CLUSTERING_RATIO function to determine the clustering ratio for a table, and manually run recluster statements when you discover the clustering ratio has fallen below a desired level.
Related documentation
For more detailed information, see:
https://docs.snowflake.net/manuals/user-guide/tables-micro-partitions.html
Other related reference documents:
https://docs.snowflake.net/manuals/sql-reference/sql/create-table.html
https://docs.snowflake.net/manuals/sql-reference/sql/alter-table.html
https://docs.snowflake.net/manuals/sql-reference/sql/show-tables.html
https://docs.snowflake.net/manuals/sql-reference/functions/system_clustering_ratio.html
http://docs-master.int.snowflakecomputing.com/documentation/preview/sql-reference/functions/system_clustering_depth.html
http://docs-master.int.snowflakecomputing.com/documentation/preview/sql-reference/functions/system_clustering_information.html