Snowflake: Performance is slow with millions of rows (answers)

【Question title】: Snowflake: Performance is slow with millions of rows
【Posted】: 2021-11-06 11:50:19
【Question description】:

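Requirement: speed up performance in Snowflake.

Problem: even just reading the data takes a long time, although a clustering key has already been defined on the table's columns.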

create or replace TABLE table_A cluster by (ID, yyyymm)(
YYYYMM NUMBER(38,0),
ID NUMBER(38,0),
.....(lot of other columns)
......
SURROGATE_KEY VARCHAR(16777216)

);

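The table has 70,825,139,352 rows.

If an ID was inserted into the table within the last 60 minutes, we want to delete any previous versions of that ID (provided they fall within the last 3 months).

The query is below: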

select
  surrogate_key,
  SUBSTR(surrogate_key, 1, CHARINDEX('|', surrogate_key) - 1)::bigint as original_id,
  array_agg(distinct yyyymm) as yyyymms,
  max(extraction_ts) as max_extraction_ts
from table_A
where (ID, surrogate_key) IN (
  select ID, surrogate_key from table_A where create_time >= dateadd(minute, -60, current_timestamp)
)
and yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
group by surrogate_key
;

Then I tried fetching only the rows from the last 3 months, and even that takes a long time:

select yyyymm, ID,
  surrogate_key, create_time, extraction_ts
from table_A
where yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
  and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint

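When I check the query's explain plan, it looks like it scans the entire table instead of only the filtered data.

I'm not sure how to optimize the query performance; I must be missing something here.

I also found that most of the time is spent scanning partitions, as shown below: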

Pruning
275,445 Partitions scanned
945,526 Partitions total

Edit: update

I have now tried a WITH clause, which is somewhat faster than the original query, but it still takes 9 minutes to fetch the data.

with tbl as (select ID, surrogate_key from table_A where create_time >= dateadd(minute, -60, current_timestamp))
select
      surrogate_key,
      SUBSTR(surrogate_key, 1, CHARINDEX('|', surrogate_key) - 1)::bigint as original_id,
      array_agg(distinct yyyymm) as yyyymms,
      max(extraction_ts) as max_extraction_ts
    from table_A
    where (ID, surrogate_key) IN (select ID, surrogate_key from tbl)
    and yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
    and yyyymm <= to_char(dateadd(month, -0, current_timestamp), 'YYYYMM')::bigint
    group by surrogate_key
    ;

I tried changing the clustering key as Eric Lin suggested, but the query takes the same amount of time.

> Edit: output of system$clustering_information

Original: (ID, yyyymm)

 {
  "cluster_by_keys" : "LINEAR(ID, yyyymm)",
  "total_partition_count" : 946321,
  "total_constant_partition_count" : 766438,
  "average_overlaps" : 57.6508,
  "average_depth" : 30.1231,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 764362,
    "00002" : 0,
    "00003" : 0,
    "00004" : 0,
    "00005" : 0,
    "00006" : 0,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 1,
    "00015" : 3,
    "00016" : 1,
    "00032" : 17,
    "00064" : 32263,
    "00128" : 43131,
    "00256" : 88619,
    "00512" : 17449,
    "01024" : 475
  }
}

After changing the clustering to (yyyymm, ID):

{
  "cluster_by_keys" : "LINEAR(yyyymm,ID)",
  "total_partition_count" : 953033,
  "total_constant_partition_count" : 769276,
  "average_overlaps" : 33.2017,
  "average_depth" : 18.5576,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 768630,
    "00002" : 0,
    "00003" : 15,
    "00004" : 129,
    "00005" : 611,
    "00006" : 1589,
    "00007" : 3128,
    "00008" : 4235,
    "00009" : 5374,
    "00010" : 6404,
    "00011" : 6176,
    "00012" : 5809,
    "00013" : 5397,
    "00014" : 4034,
    "00015" : 3007,
    "00016" : 2287,
    "00032" : 18517,
    "00064" : 18992,
    "00128" : 43803,
    "00256" : 43519,
    "00512" : 11377
  }
}

Distinct values

yyymmd      1076 distinct
ID          179030 distinct

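【Question comments】:

Judging by the column names, it looks like there are multiple rows with the same ID value, each distinguished by a timestamp, a version, or something similar. Is that right? Is ID unique? If it is not unique, is its cardinality still very high relative to the whole table?
There are multiple versions of each row for the same ID, distinguished by timestamp; the other columns have different values for the same ID and YYYYMM.
I found the following: Pruning 275,445 Partitions scanned 945,526 Partitions total
Have you tried "cluster by (yyyymm, ID)" instead of "cluster by (ID, yyyymm)"? ID is never a good candidate for a clustering key, but a date is. Because ID is the first clustering key, its uniqueness can cause all the data to be spread evenly across all partitions, even though the date is also in the clustering key. Reversing the keys might help.
@EricLin Checking now. After altering the table to change the clustering, do I need to recluster the data myself, or is that automatic?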

Tags: snowflake-cloud-data-platform

【Solution 1】:

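Sometimes this depends on the cardinality of the clustered columns, which I think was already pointed out in the earlier comments.

Clustering keys act like partitioning variables, so ideally they should be defined on columns with lower cardinality.

See: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html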
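As a rough illustration of that ordering, here is a minimal sketch that assumes the table and column names from the question. Snowflake's guidance is generally to list clustering columns from lowest to highest cardinality, which here would put yyyymm ahead of ID. Whether the data is then reclustered in the background depends on Automatic Clustering being active for the table, which is an assumption in this sketch.

```sql
-- Sketch only: table_A, yyyymm and ID are the names used in the question.
-- Lower-cardinality column first, higher-cardinality column second.
ALTER TABLE table_A CLUSTER BY (yyyymm, ID);

-- Assumption: Automatic Clustering is available on the account.
-- If reclustering was suspended for this table, this resumes it.
ALTER TABLE table_A RESUME RECLUSTER;
```

On a table of this size, reclustering can take a long time and consume credits, so the statistics may keep changing for a while after the key is altered.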

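What you can do is check the clustering depth and overlaps, as described at the link above. The closer the depth is to 1 and the overlaps are to 0, the better.

Use this function to check the clustering columns; see: https://docs.snowflake.com/en/sql-reference/functions/system_clustering_information.html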
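A minimal sketch of those checks, assuming the table name table_A from the question. The second and third statements pass a candidate key as a string, so a different column order can be evaluated without altering the table.

```sql
-- Clustering statistics for the clustering key currently defined on the table.
SELECT SYSTEM$CLUSTERING_INFORMATION('table_A');

-- The same statistics for a candidate key, evaluated without changing the table.
SELECT SYSTEM$CLUSTERING_INFORMATION('table_A', '(yyyymm, ID)');

-- Average clustering depth only, for the same candidate key.
SELECT SYSTEM$CLUSTERING_DEPTH('table_A', '(yyyymm, ID)');
```

The fields to compare between runs are average_depth, average_overlaps, and partition_depth_histogram.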

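Always look at the table structure first! When Snowflake analyzes a query to minimize table scans, there are two types of pruning (judging by your screenshot, the table scan is where most of the time in your query is spent):

Static pruning - filters: make sure you do not apply functions to the column itself, although you can apply functions to the static values in the query.
Dynamic pruning - joins: try to use equi-joins and join the queries. Joining on explicit columns improves performance.

Next comes right-sizing the virtual warehouse. On the right-hand side of the query profile, look for things such as spilling and caching. Spilling indicates that the warehouse is not sized appropriately: it is too small, so data spills to remote storage.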
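The two points above can be sketched against the question's query as follows. This is one possible rewrite, not a guaranteed fix: it assumes the table and column names from the question, and the CTE name recent_rows and the aliases a and r are made up for illustration. The yyyymm column is compared directly to computed constants (functions only on the static side), and the IN subquery becomes an explicit equi-join.

```sql
-- ID / surrogate_key pairs inserted in the last 60 minutes
-- (same filter as the subquery in the question; DISTINCT avoids duplicate join matches).
with recent_rows as (
  select distinct ID, surrogate_key
  from table_A
  where create_time >= dateadd(minute, -60, current_timestamp)
)
select
  a.surrogate_key,
  SUBSTR(a.surrogate_key, 1, CHARINDEX('|', a.surrogate_key) - 1)::bigint as original_id,
  array_agg(distinct a.yyyymm) as yyyymms,
  max(a.extraction_ts) as max_extraction_ts
from table_A a
join recent_rows r
  on a.ID = r.ID                        -- explicit equi-join columns
 and a.surrogate_key = r.surrogate_key
-- Functions are applied only to the constant side, never to a.yyyymm itself.
where a.yyyymm >= to_char(dateadd(month, -3, current_timestamp), 'YYYYMM')::bigint
  and a.yyyymm <= to_char(current_timestamp, 'YYYYMM')::bigint
group by a.surrogate_key;
```

Note that the recent_rows filter is on create_time, which is not part of the clustering key, so that part of the query may still scan many partitions regardless of how the main filter prunes.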
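To look for spilling outside the query profile UI, something along these lines may help. It is a sketch that assumes your role can read the SNOWFLAKE.ACCOUNT_USAGE share (which lags behind real time), and my_wh is a hypothetical warehouse name.

```sql
-- Recent queries that spilled to remote storage, with their pruning figures.
SELECT query_id,
       partitions_scanned,
       partitions_total,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time >= dateadd(day, -1, current_timestamp)
  AND bytes_spilled_to_remote_storage > 0
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 20;

-- If spilling is significant, try a larger warehouse (my_wh is a placeholder name).
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
```

A larger warehouse addresses spilling, but it does not improve pruning: the ratio of partitions scanned to partitions total is governed by the clustering and the filters.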

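【Comments】:

The clustering depth is 30
Updated the question with the clustering information
You can see that the clustering for LINEAR(yyyymm,ID) is better than the other way around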