How to optimize a SQL SELECT query with multiple left joins? Answers

【Question title】: How to optimize multiple left-joins SQL SELECT-query?
【Posted】: 2019-09-03 08:48:27
【Question description】:

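Situation:

We have a table "base1" with ~6 million rows, which holds the actual customer purchases: the purchase date plus the parameters of each purchase.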

CREATE TABLE base1 (
Users_id NOT NULL PRIMARY KEY ,
PurchaseDate date,
Parameter1 int,
Parameter2 int,
...
ParameterK int );

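There is also another table "base2" with ~90 million rows, which shows the same thing, but instead of the purchase date it uses a weekly breakdown (for example: all weeks over 4 years for each client; if there was no purchase in week N, the client is still shown).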

CREATE TABLE base2 (
Users_id NOT NULL PRIMARY KEY ,
Week_start date ,
Week_end date,
Parameter1 int,
Parameter2 int,
...
ParameterN int );

The task is to run the following query:

-- a = base1 , b , wb%% = base2
--create index idx_uid_purch_date on base1(Users_ID,Purchasedate);
SELECT a.Users_id
-- Checking whether the client will make a purchase in next week and the purchase will be bought on condition
,iif(b.Users_id is not null,1,0) as User_will_buy_next_week
,iif(b.Users_id is not null and b.Parameter1 = 1,1,0) as User_will_buy_on_Condition1
--   about 12 similar iif-conditions
,iif(b.Users_id is not null and (b.Parameter1 = 1 and b.Parameter12 = 1),1,0) 
as User_will_buy_on_Condition13

-- checking on the fact of purchase in the past month, 2 months ago, 2.5 months, etc.
,iif(wb1m.Users_id is null,0,1) as was_buy_1_month_ago
,iif(wb2m.Users_id is null,0,1) as was_buy_2_month_ago
,iif(wb25m.Users_id is null,0,1) as was_buy_25_month_ago
,iif(wb3m.Users_id is null,0,1) as was_buy_3_month_ago
,iif(wb6m.Users_id is null,0,1) as was_buy_6_month_ago
,iif(wb1y.Users_id is null,0,1) as was_buy_1_year_ago

 ,a.[Week_start]
 ,a.[Week_end]

 into base3
 FROM base2 a 

 -- Join for User_will_buy
 left join base1 b
 on a.Users_id =b.Users_id and 
 cast(b.[PurchaseDate] as date)>=DATEADD(dd,7,cast(a.[Week_end] as date)) 
 and cast(b.[PurchaseDate] as date)<=DATEADD(dd,14,cast(a.[Week_end] as date))

 -- Joins for was_buy
 left join base1  wb1m
 on a.Users_id =wb1m.Users_id 
 and cast(wb1m.[PurchaseDate] as date)>=DATEADD(dd,-30-4,cast(a.[Week_end] as date)) 
 and cast(wb1m.[PurchaseDate] as date)<=DATEADD(dd,-30+4,cast(a.[Week_end] as date))

/* 4 more similar joins where different values are added in 
DATEADD (dd, %%, cast (a. [Week_end] as date))
to check on the fact of purchase for a certain period */

 left outer join base1 wb1y
 on a.Users_id =wb1y.Users_id and 
 cast(wb1y.[PurchaseDate] as date)>=DATEADD(dd,-365-4,cast(a.[Week_end] as date)) 
 and cast(wb1y.[PurchaseDate] as date)<=DATEADD(dd,-365+5,cast(a.[Week_end] as date))

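Because of the large number of joins and the fairly large tables, this script runs for about 24 hours, which is far too long.

As the execution plan shows, most of the time is spent on the "Merge Join" operators, on reading the rows of base1 and base2, and on inserting the data into the new table base3.

Question: is it possible to optimize this query so that it runs faster?

Perhaps by using a single join instead, or something like that.

Please help, I'm not smart enough :(

Thanks to everyone for your answers!

UPD: Maybe using a different join type (merge, loop, or hash) could help me, but I can't really test this theory. Maybe someone can tell me whether it is right or wrong ;)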

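【Comments on the question】:

The "problem" is that syntax like DATEADD(dd, 7, CAST(a.[Week_end] AS date)) in the ON clause isn't SARGable, which means indexes can't be used to help the data engine, and it has to perform a full scan of the tables.
To expand on what @Larnu said, your first step should be to rewrite the joins so that they don't use functions. The reason is that the function has to be run against every row in the table before SQL can compare and filter. Instead of comparing only the rows that match the JOIN criteria, it does this every time. And it can't use the indexes that would speed the process up. Think of it this way: you have a book with millions of rows of dates. Do you a) translate the one date and then compare it against the book, or b) translate the entire book first?
Thanks a lot for your answers, now I even know what SARGable means. But I still don't understand how to rewrite them without using any functions at all (I have removed all the casts, but the DATEADD is still there).
Are you sure the query does what you want it to do? What exactly are you trying to achieve? You select 90 million rows from base2 without any filter and then outer join date ranges of base1, so you end up with somewhere between 90 million and 4.4 billion result rows, or so I have calculated.
@ThorstenKettner The output of this query is per-client data for every week, with additional information on whether the client bought last month or last year, plus whether the client will buy something next week, based on the purchase history (base1).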

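Tags: sql sql-server join select query-optimization

【Solution 1】:

I assume that the base1 table stores information about the current week's purchases.

If so, you can drop the [PurchaseDate] parameter from the join conditions and replace it with the current date as a constant. Your DATEADD functions are then applied to the current date and become constants in the join condition: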

left join base1 b
on a.Users_id =b.Users_id and 
DATEADD(day,-7,GETDATE())>=a.[Week_end] 
and DATEADD(day,-14,GETDATE())<=a.[Week_end]

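For the query above to work correctly, you should restrict b.[PurchaseDate] to the current day.

Then you can run another query for the purchases made yesterday, with all the DATEADD constants in the join conditions shifted by -1.

And so on, up to 7 queries, or however many days the base1 table covers.

You could also group the [PurchaseDate] values by day, recompute the constants, and do it all in a single query, but I'm not prepared to spend the time building that myself. :)

【Discussion】:

Sorry for the late reply. Unfortunately, your assumption about base1 is wrong, since it stores historical information for the past several years. But it is still very useful for what I have been working on recently, so thank you very much!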

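【Solution 2】:

If you have recurring expressions such as DATEADD(dd,-30-4,cast(a.[Week_end] as date)), one way to make them SARGable is to create an index on the expression itself. SQL Server cannot do this directly; Postgres can: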

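-- Illustration only: the expression is written with SQL Server's DATEADD for comparison.
-- In PostgreSQL the indexed expression would use PostgreSQL syntax, e.g. ((week_end - 34)).
create index ix_base2__34_days_ago on base2(DATEADD(dd,-30-4, cast([Week_end] as date)))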

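An expression like the one below is then SARGable, because the database will use the index on DATEADD(dd,-30-4, cast([Week_end] as date)). With the index from the example above in place, a condition like the following will be fast.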

and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))

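Note that casting PurchaseDate to date still yields a SARGable expression, even though cast looks like a function, because SQL Server has special handling for datetime: an index can be used even when you search on only part of a datetime field (the date part). This is similar to a partial match with LIKE: where lastname LIKE 'Mc%' is SARGable even though the index is on the whole lastname field. But I digress.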
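
To make the distinction concrete, here is a small sketch contrasting the three kinds of predicate. It assumes a single-column index on base1.PurchaseDate (the index name is hypothetical and not part of the answer) and uses arbitrary example dates; the CAST case only matters if PurchaseDate is stored as datetime rather than date.

```sql
-- Hypothetical index, for illustration only.
CREATE INDEX ix_base1_purchasedate ON base1 (PurchaseDate);

-- Not SARGable: YEAR() has to be evaluated for every row before the comparison.
SELECT Users_id
FROM base1
WHERE YEAR(PurchaseDate) = 2019;

-- SARGable: the column is left bare, so a range seek on the index is possible.
SELECT Users_id
FROM base1
WHERE PurchaseDate >= '20190101'
  AND PurchaseDate <  '20200101';

-- SARGable despite the CAST, thanks to the special handling described above.
SELECT Users_id
FROM base1
WHERE CAST(PurchaseDate AS date) = '20190903';
```

Comparing the execution plans of these statements (index scan versus index seek) is a simple way to check the behavior on your own SQL Server version.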

To get an index on an expression in SQL Server, you can create a computed column on that expression, for example:

CREATE TABLE base2 (
  Users_id NOT NULL PRIMARY KEY ,
  Week_start date ,
  Week_end date,
  Parameter1 int,
  Parameter2 int,
  Thirty4DaysAgo as DATEADD(dd,-30-4, cast([Week_end] as date))
)

...and then create an index on that column:

create index ix_base2_34_days_ago on base2(Thirty4DaysAgo)

Then change your expression to:

and cast(wb1m.[PurchaseDate] as date) >= a.Thirty4DaysAgo

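That was my earlier advice: change the old expression to use the computed column. After some further searching, however, it seems you can simply keep the original code, because SQL Server can match the expression to the computed column, and if that column is indexed, your expression will be SARGable. So your DBA can optimize things behind the scenes, and your original code will run optimized without any changes. There is therefore no need to change the following; it will be SARGable (provided your DBA has created a computed column for the dateadd(recurring parameters here) expression and put an index on it):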

and cast(wb1m.[PurchaseDate] as date) >= DATEADD(dd,-30-4,cast(a.[Week_end] as date))

The only downside (compared with Postgres) is that with SQL Server you still have a dangling computed column on your table :)
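
Since base2 already exists and holds about 90 million rows, the computed column would more likely be added with ALTER TABLE than by recreating the table. The following is a minimal sketch under that assumption; it reuses the column and index names from this answer, and the GO batch separators assume a client such as SSMS or sqlcmd.

```sql
-- Add the computed column to the existing table (same expression as above).
ALTER TABLE base2
    ADD Thirty4DaysAgo AS DATEADD(dd, -30-4, CAST([Week_end] AS date));
GO

-- Optional check: returns 1 if the computed column can be indexed.
SELECT COLUMNPROPERTY(OBJECT_ID('base2'), 'Thirty4DaysAgo', 'IsIndexable') AS is_indexable;
GO

-- Index the computed column.
CREATE INDEX ix_base2_34_days_ago ON base2 (Thirty4DaysAgo);
GO
```

Each distinct offset used in the joins (-30-4, -30+4, -365-4, and so on) would need its own computed column and index under this approach.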

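Good read: https://littlekendra.com/2016/03/01/sql-servers-year-function-and-index-performance/

【Discussion】:

Thank you so much! That should solve the problem!

【Solution 3】:

You want all 90 million base2 rows in the result, each with additional information taken from base1. So what the DBMS has to do is a full table scan of base2 and, for each row, quickly find the related rows in base1.

A query with EXISTS clauses would look like this: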

select
  b2.users_id,
  b2.week_start,
  b2.week_end,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_next_week,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.parameter1 = 1
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_on_condition1,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.parameter1 = 1
    and b1.parameter2 = 1
    and b1.purchasedate between dateadd(day, 7, cast(b2.week_end as date))
                            and dateadd(day, 14, cast(b2.week_end as date))
  ) then 1 else 0 end as user_will_buy_on_condition13,
  case when exists
  (
    select *
    from base1 b1 
    where b1.users_id = b2.users_id
    and b1.purchasedate between dateadd(day, -30-4, cast(b2.week_end as date))
                            and dateadd(day, -30+4, cast(b2.week_end as date))
  ) then 1 else 0 end as was_buy_1_month_ago,
  ...
from base2 b2;

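It is easy to see that this will take a long time, because all the conditions have to be checked for every base2 row. That is 90 million times 7 lookups. The only thing we can do is provide an index and hope that the query benefits from it.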

create index idx1 on base1 (users_id, purchasedate, parameter1, parameter2);

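We can add more indexes, so that the DBMS can choose between them based on selectivity. Later we can check whether they are actually used, and drop the ones that are not.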

create index idx2 on base1 (users_id, parameter1, purchasedate);
create index idx3 on base1 (users_id, parameter1, parameter2, purchasedate);
create index idx4 on base1 (users_id, parameter2, parameter1, purchasedate);
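
A possible alternative to one EXISTS subquery (or one join) per flag is a single join with conditional aggregation. This is a different technique from the query above and is shown only as a sketch: base1 is joined once over the widest date window that any flag needs, and each flag is derived with MAX(CASE ...). It assumes that (Users_id, Week_start, Week_end) identifies one base2 row and that PurchaseDate and Week_end are date columns as in the table definitions. Only four of the flags are written out; the others would follow the same pattern. Whether this is actually faster on 90 million rows would have to be measured.

```sql
-- Sketch: one join over the widest window, one MAX(CASE ...) per flag.
SELECT
    b2.Users_id,
    b2.Week_start,
    b2.Week_end,
    -- 1 if any purchase falls 7 to 14 days after the week end, else 0
    MAX(CASE WHEN b1.PurchaseDate BETWEEN DATEADD(day, 7, b2.Week_end)
                                      AND DATEADD(day, 14, b2.Week_end)
             THEN 1 ELSE 0 END) AS user_will_buy_next_week,
    -- same window, restricted to purchases with Parameter1 = 1
    MAX(CASE WHEN b1.Parameter1 = 1
              AND b1.PurchaseDate BETWEEN DATEADD(day, 7, b2.Week_end)
                                      AND DATEADD(day, 14, b2.Week_end)
             THEN 1 ELSE 0 END) AS user_will_buy_on_condition1,
    -- purchase roughly one month before the week end (30 days +/- 4)
    MAX(CASE WHEN b1.PurchaseDate BETWEEN DATEADD(day, -30-4, b2.Week_end)
                                      AND DATEADD(day, -30+4, b2.Week_end)
             THEN 1 ELSE 0 END) AS was_buy_1_month_ago,
    -- purchase roughly one year before the week end (offsets as in the question)
    MAX(CASE WHEN b1.PurchaseDate BETWEEN DATEADD(day, -365-4, b2.Week_end)
                                      AND DATEADD(day, -365+5, b2.Week_end)
             THEN 1 ELSE 0 END) AS was_buy_1_year_ago
FROM base2 AS b2
LEFT JOIN base1 AS b1
       ON b1.Users_id = b2.Users_id
      -- widest window needed by any flag: one year back to two weeks ahead
      AND b1.PurchaseDate >= DATEADD(day, -365-4, b2.Week_end)
      AND b1.PurchaseDate <= DATEADD(day, 14, b2.Week_end)
GROUP BY b2.Users_id, b2.Week_start, b2.Week_end;
```

When a base2 row has no matching purchase, the b1 columns are NULL, every CASE falls through to 0, and the row still appears once in the result. Because PurchaseDate is compared without a cast, an index that starts with (users_id, purchasedate), such as idx1 above, can be used to locate the rows for each user and date window.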

【Discussion】: