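Answers to: Teradata performance issues with left join on a big table

【Question title】: Teradata performance issues with left join on a big table
【Posted】: 2021-10-13 19:06:03
【Question description】:

I am running into a SQL performance problem in Teradata SQL Assistant when joining two tables. One of them (table B) contains more than 3 billion rows, so the join takes more than 2 hours.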

Table A contains these columns:

name|id_number|id_product|creation_date|cp_date|amount|rang

Table B contains these columns:

name|id_number|id_product_cp|creation_date|cp_date|amount|year_|month_

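So I am trying to get the amount for each name/id_number/id_product.

--> If the amount in table A = 0, we take the amount from table B (if it is not null); otherwise we take the amount from table A.

My query is: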

select
    a.name,
    a.id_number,
    a.id_product,
    a.creation_date,
    case
        when
            sum(a.amount) = 0 and sum(net.amount) is not null then
                sum(net.amount)
            else
                sum(a.amount)
        end
    as amount
from 
    A a    
        left join (
            select
                a.name,
                a.id_number,
                a.cp_date(date) as cp_date,
                a.year_,
                a.month_,
                cp.id_product,
                sum(a.amount) as amount
            from
                B a 
                    join C cp
                    on cp.id_product_cp = a.id_product_cp
            group by 1,2,3,4,5,6
        ) net
        on
            a.name= net.name
            and a.id_number= net.id_number
            and a.id_product = net.id_product 
            and a.cp_date= net.cp_date
            and (
                        extract(year from a.cp_date) < net.year_
                    or (
                                extract(year from a.cp_date) = net.year_
                            and net.month_ >= extract(month from a.cp_date)
                        )
                ) 
    where a.rang <> 1
    group by 1,2,3,4

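Below is the result of querying dbc.QryLogStepsV for this query (posted as an image).

I think the subquery inside the left join is what causes the performance problem.

Is there any way to make this query run faster?

Thanks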

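【Question discussion】:

I noticed some syntax errors: (a) the cp alias is not defined; (b) I could not find a function named "cp_date". Also, the select clause does not make explicit which table each column comes from...
Thanks for the comment, @MarcusViniciusPompeu. I have corrected the query.
I made another subtle edit to your query to highlight how a.cp_date relates to net.year_ and net.month_. I do not fully understand the data, but I think I can offer you a solution.
Hard to tell without details. What does Explain show for this query, or even better, what is the data from dbc.QryLogStepsV?
Hi @dnoeth, I have added the result from dbc.QryLogStepsV. Sorry it took me a while to reply.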

Tags: sql query-optimization teradata teradata-sql-assistant

【Solution 1】:

Since you did not provide a fiddle or a data sample, I will make my best guess, OK?

:-)

Is the query below faster, and is it correct?

select
    a.name,
    a.id_number,
    a.id_product,
    a.creation_date,
    case
        when
            sum(a.amount) = 0 and sum(net.amount) is not null then
                sum(net.amount)
            else
                sum(a.amount)
        end
    as amount
from
    A a
        left join (
            select
                b.name,
                b.id_number,
                b.cp_date(date) as cp_date,
                --
                cast(
                    -- https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/7x~o21pczkQDFDjAA_5Olg
                    (b.year_ - 1900) * 10000 + b.month_ * 100 + 1 as date
                ) as cp_date_to_join,
                --
                cp.id_product,
                sum(b.amount) as amount
            from
                B b
                    join C cp
                    on cp.id_product_cp = b.id_product_cp
            group by
                1, 2, 3,
                4,
                5
        ) net
        on
                a.name       = net.name
            and a.id_number  = net.id_number
            and a.id_product = net.id_product
            and a.cp_date    = net.cp_date
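            -- month-level comparison: first day of a.cp_date's month
            -- against the first day of the (year_, month_) month
            and a.cp_date - (extract(day from a.cp_date) - 1) <= net.cp_date_to_join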
    where a.rang <> 1
    group by 1,2,3,4
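
The cast in the derived table relies on Teradata's internal integer representation of a DATE, which is (year - 1900) * 10000 + month * 100 + day. As a minimal sketch of that arithmetic, the statement below uses literal values chosen purely for illustration (October 2021 is an assumption, not data from the question) and can be run on its own:

```sql
-- Illustrative literals only: year_ = 2021, month_ = 10 (not taken from the question's data).
select
    (2021 - 1900) * 10000 + 10 * 100 + 1               as int_date,        -- the integer 1211001
    cast((2021 - 1900) * 10000 + 10 * 100 + 1 as date) as first_of_month;  -- first day of October 2021
```

Because the day component is fixed at 1, cp_date_to_join is always the first day of the month described by year_ and month_. A comparison against it only matches the original year/month test if the other side is also reduced to the first day of its month.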

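【Discussion】:

Hi @Marcus Vinicius Pompeu (you missed the + 01 as date). Because of the conversion of the year_ and month_ columns to a date, I get errors on some of the data, so it does not work correctly.
Sorry, @mado. If you do not tell me which errors you ran into, and on which data, I cannot help you...
By the way, @mado, thanks for the suggestion. I am fixing the code for the + 01 as date :-)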
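
The errors mentioned in this discussion were never specified, so their cause is unknown. One possibility, offered only as an assumption, is that some rows in B hold year_ or month_ values that cannot form a valid date (for example a month_ outside 1 to 12, or a null). A sketch for listing such combinations, assuming both columns are numeric as the arithmetic in the answer implies:

```sql
-- Assumption: year_ and month_ are numeric columns of B, as the answer's arithmetic implies.
-- Lists (year_, month_) combinations that cannot be turned into a first-of-month date.
select
    b.year_,
    b.month_,
    count(*) as row_count
from
    B b
where
       b.year_  is null
    or b.month_ is null
    or b.month_ not between 1 and 12
group by 1, 2
order by 1, 2;
```

If this returns rows, those are the candidates to clean up or exclude before applying the cast. Note that it scans B once, which is itself expensive on a table of this size.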