Optimising Hash and Hash Joins in an SQL Query

【Question title】: Optimising Hash And Hash Joins in SQL Query
【Posted】: 2015-10-17 13:42:02
【Question】:

I have a bunch of tables in PostgreSQL, and I run a query like the following:

SELECT DISTINCT ON ...some stuff... 
FROM "rent_flats" 
INNER JOIN "rent_flats_linked_users" 
  ON "rent_flats_linked_users"."rent_flat_id" = "rent_flats"."id" 
INNER JOIN "users" 
  ON "users"."id" = "rent_flats_linked_users"."user_id" 
INNER JOIN "owners" 
  ON "owners"."id" = "users"."profile_id" AND "users"."profile_type" = 'Owner' 
INNER JOIN "phone_numbers" 
  ON "phone_numbers"."person_id" = "owners"."id" AND "phone_numbers"."person_type" = 'Owner' 
INNER JOIN "phone_number_categories" 
  ON "phone_number_categories"."id" = "phone_numbers"."phone_number_category_id" 
INNER JOIN "localities" 
  ON "localities"."id" = "rent_flats"."locality_id" 
INNER JOIN "regions" 
  ON "regions"."id" = "localities"."region_id" 
INNER JOIN "cities" 
  ON "cities"."id" = "regions"."city_id" 
INNER JOIN "property_types" 
  ON "property_types"."id" = "rent_flats"."property_type_id" 
INNER JOIN "apartment_types" 
  ON "apartment_types"."id" = "rent_flats"."apartment_type_id" 
WHERE "rent_flats"."status" = 3 
  AND (((extract(epoch from age(current_date,rent_flats.date_added))/86400)::int) IN (cities.short_period,cities.long_period)) 
  AND (phone_number_categories.name IN ('SMS','SMS & Mobile')) 
ORDER BY rf_id, phone_numbers.priority ASC

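Note: the rent_flats table contains about 5 million rows, rent_flats_linked_users about 600k rows, and users 350k rows. The other tables are smaller.

The query takes about 6.8 seconds to execute, and EXPLAIN ANALYZE shows that about 99% of the total time is spent on Hash and Hash Join nodes.

Turning sequential scans off (enable_seqscan = off) makes the query take even longer, about 11 seconds.

Here is the EXPLAIN ANALYZE output for the query. I have already put indexes on the columns involved in the inner joins and on the columns involved in filters (such as phone_numbers.priority, cities.short_period and cities.long_period). How can I optimise this further and reduce the time spent on Hash and Hash Join?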

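【Question comments】:

@DrewPierce No, it's in Postgres... but aren't Hash and Hash Join in MySQL as well? Isn't the concept the same?
The best solution depends on the missing parts. My educated guess is that you have DISTINCT ON (rent_flats.id) rent_flats.id AS rf_id, ... But why guess when you can tell us? Please also provide the other missing information as instructed in the tag info for [postgresql-performance]. Most importantly: your Postgres version and basic table definitions for the three big tables. Also clarify whether any of the joins can find zero or multiple matches on the left side.
(((extract(epoch from age(current_date,rent_flats.date_added))/86400)::int) IN (cities.short_period,cities.long_period)) cannot use any available index (and causes a seqscan on rent_flats). It should probably be rewritten (by casting to date and subtracting).

Tags: sql postgresql join postgresql-performance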

【Solution 1】:

Your second WHERE clause is not sargable:

 AND (((extract(epoch from age(current_date,rent_flats.date_added))/86400)::int) IN (cities.short_period,cities.long_period))

If the columns involved are of type date and integer (which we would see in the table definitions), you can rewrite it as:

-- date minus date yields whole days in PostgreSQL, so "N days old" means date_added = current_date - N
AND rent_flats.date_added IN (current_date - cities.short_period
                            , current_date - cities.long_period)

That's an odd predicate. Are you sure you don't mean this?

-- assumes short_period <= long_period: BETWEEN needs the lower (older) bound first
AND rent_flats.date_added BETWEEN current_date - cities.long_period
                              AND current_date - cities.short_period
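
To see why the sargable form matters, here is a minimal sketch that compares a wrapped and a bare version of the IN predicate. It is not PostgreSQL: it uses SQLite through Python's standard `sqlite3` module as a stand-in so that it runs without a server, and the planner output only illustrates the principle. The table contents, the index name `rent_flats_date_added_idx`, the fixed date, and the periods 7 and 30 are assumptions made for illustration, and `julianday()` stands in for the original `age()` arithmetic.

```python
import sqlite3
from datetime import date, timedelta

TODAY = date(2015, 10, 17)  # fixed stand-in for current_date, for reproducibility
PERIODS = (7, 30)           # hypothetical short_period / long_period, in days

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rent_flats (id INTEGER PRIMARY KEY, date_added TEXT)")
conn.execute("CREATE INDEX rent_flats_date_added_idx ON rent_flats (date_added)")
# One row per day for the 60 days up to and including TODAY.
conn.executemany(
    "INSERT INTO rent_flats (date_added) VALUES (?)",
    [((TODAY - timedelta(days=n)).isoformat(),) for n in range(60)],
)

# Non-sargable form: the indexed column is buried inside an expression.
wrapped = (
    "SELECT id FROM rent_flats "
    "WHERE CAST(julianday(?) - julianday(date_added) AS INTEGER) IN (?, ?)"
)
wrapped_args = (TODAY.isoformat(), *PERIODS)

# Sargable form: the bare column is compared with precomputed dates.
bare = "SELECT id FROM rent_flats WHERE date_added IN (?, ?)"
bare_args = tuple((TODAY - timedelta(days=p)).isoformat() for p in PERIODS)


def rows(sql, args):
    """Return the matching ids in sorted order."""
    return sorted(r[0] for r in conn.execute(sql, args))


def plan(sql, args):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql, args)]


same_rows = rows(wrapped, wrapped_args) == rows(bare, bare_args)
wrapped_plan = plan(wrapped, wrapped_args)
bare_plan = plan(bare, bare_args)

print("same rows:", same_rows)
print("wrapped  :", wrapped_plan)
print("bare     :", bare_plan)
```

Both forms select the same rows, but only the bare comparison gives the planner a chance to look values up in the index instead of evaluating the expression for every row. The exact plan text depends on the SQLite version, and PostgreSQL's EXPLAIN output looks different, so treat this as an illustration of the idea and check the real plan with EXPLAIN ANALYZE on the actual tables.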

You can probably do a lot more, pending missing information. Probably much like this:

【Comments】:

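@joop: You had already hinted in the same direction while I was still working on rewriting the condition.
@ErwinBrandstetter Nice... what if date_added is a timestamp?
@aceBox: That can be done easily. It also depends on the other columns. Please provide the complete picture in the question. Going through bits in the comments is too tedious.
@ErwinBrandstetter Never mind... I got it done... but the overall query time has not improved. About 99% of the time is still spent on Hash & Hash Joins.
@aceBox: As I said: You can probably do a lot more, pending missing information.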