【Question title】: Aggregation by timestamp
【Posted】: 2014-05-19 15:44:04
【Question】:

I have a query whose results contain a customer ID number, a marketing channel, a timestamp, and a purchase date. So the results might look something like this.

id marketingChannel TimeStamp      Transaction_date
1  SEO              5/18 23:11:43  5/18
1  SEO              5/18 24:12:43  5/18
1  Paid             5/18 24:13:43  5/18
2  Paid             5/18 24:12:43  5/18
2  Paid             5/18 24:14:43  5/18
2  Affiliate        5/18 24:20:43  5/18
2  Paid             5/18 24:22:43  5/18
3  SEO              5/18 24:10:43  5/18
3  Affiliate        5/18 24:11:43  5/18

I would like to know whether there is a query that summarizes this information by showing a count per marketing path.

For example:

Marketing Path                  Count
SEO > SEO > Paid                  1
Paid > Paid > Affiliate > Paid    1
SEO > Affiliate                   1

I was thinking about writing a Python script to get this information, but I wonder whether there is a simple solution in SQL, since I'm not very familiar with SQL.
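For comparison, the Python fallback mentioned above could be sketched roughly as follows. This is only an illustration: the function name `count_marketing_paths` and the sample timestamps are hypothetical (simplified to sortable strings that keep the same per-customer order as the table), and the path is built simply by sorting each customer's rows by timestamp.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows (id, marketing_channel, timestamp) mirroring the
# question's table. The timestamps are made-up sortable strings, not real data.
rows = [
    (1, "SEO", "2014-05-18 23:11:43"),
    (1, "SEO", "2014-05-18 23:12:43"),
    (1, "Paid", "2014-05-18 23:13:43"),
    (2, "Paid", "2014-05-18 23:12:43"),
    (2, "Paid", "2014-05-18 23:14:43"),
    (2, "Affiliate", "2014-05-18 23:20:43"),
    (2, "Paid", "2014-05-18 23:22:43"),
    (3, "SEO", "2014-05-18 23:10:43"),
    (3, "Affiliate", "2014-05-18 23:11:43"),
]


def count_marketing_paths(rows, separator=" > "):
    """Count how many customers share each timestamp-ordered channel path."""
    events_by_id = defaultdict(list)
    for customer_id, channel, ts in rows:
        events_by_id[customer_id].append((ts, channel))

    path_counts = Counter()
    for events in events_by_id.values():
        events.sort(key=lambda event: event[0])  # order by timestamp only
        path = separator.join(channel for _, channel in events)
        path_counts[path] += 1
    return path_counts


path_counts = count_marketing_paths(rows)
for path, count in sorted(path_counts.items()):
    print(path, count)
```

This pulls every row to the client, so it only makes sense when the result set is small enough to fetch; the SQL answers below keep the work inside the database.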

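【Comments】:

  • How do you get the path from the marketing channels?
  • So, I added some more data. To get the marketing path, basically you just order by the timestamp. So the path for id #1 is SEO > SEO > Paid. Id #2 would be Paid > Paid > Affiliate > Paid.
  • I just don't know how to aggregate this information so that it results in a smaller table.
  • I'm actually using Teradata, and I don't think it has a group_concat function.
  • Is there a known maximum path length? And what is your Teradata version?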

Tags: sql aggregate-functions teradata


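【Solution 1】:

A few years ago I needed a similar result and tested different approaches to building concatenated strings in Teradata. By the way, all of them may fail if there are too many rows and the concatenated string exceeds 64000 characters.

The most efficient was a user-defined function (written in C):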

SELECT
   PATH
  ,COUNT(*)
FROM
 (
   SELECT 
      DelimitedBuildSorted(MARKETINGCHANNEL
                          ,CAST(CAST(ts AS FORMAT 'yyyymmddhhmiss') AS VARCHAR(14))
                          ,'>') AS PATH
   FROM t
   GROUP BY id
 ) AS dt
GROUP BY 1;

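If you need to run this query frequently and/or on a large table, you might talk to your DBA about whether a UDF is an option (most DBAs don't like them because they are written in a language they don't know, C).

If the average number of rows per id is low, recursion might be fine. Joseph B's version can be simplified a bit, but most importantly, create a temporary table instead of using a View or Derived Table for the ROW_NUMBER calculation. This results in a better plan (in SQL Server, too):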

CREATE VOLATILE TABLE vt AS 
 (
   SELECT
      id
     ,MarketingChannel
     ,ROW_NUMBER() OVER (PARTITION BY id ORDER BY TS DESC) AS rn
     ,COUNT(*) OVER (PARTITION BY id) AS max_rn
   FROM t
 ) WITH DATA 
PRIMARY INDEX (id) 
ON COMMIT PRESERVE ROWS;

WITH RECURSIVE cte(id, path, rn) AS
 (
   SELECT 
      id, 

      -- modify VARCHAR size to fit your maximum number of rows, that's better than VARCHAR(64000)
      CAST(MarketingChannel AS VARCHAR(10000)) AS PATH, 
      rn
   FROM vt
   WHERE rn = max_rn
   UNION ALL
   SELECT 
      cte.ID, 
      cte.PATH || '>' || vt.MarketingChannel, 
      cte.rn-1
   FROM vt JOIN cte
     ON vt.id = cte.id
    AND vt.rn = cte.rn - 1
 )
SELECT 
   PATH, 
   COUNT(*) 
FROM cte
WHERE rn = 1
GROUP BY path
ORDER BY PATH
;
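To try the recursive approach without a Teradata system, the same logic can be run against SQLite through Python's `sqlite3` module. This is only a sketch under several assumptions: SQLite stands in for Teradata, a `TEMP TABLE` replaces the `VOLATILE TABLE`, the sample timestamps are hypothetical, and the bundled SQLite must be 3.25 or newer for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, MarketingChannel TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [  # hypothetical timestamps, same per-id order as the question's table
        (1, "SEO", "2014-05-18 23:11:43"),
        (1, "SEO", "2014-05-18 23:12:43"),
        (1, "Paid", "2014-05-18 23:13:43"),
        (2, "Paid", "2014-05-18 23:12:43"),
        (2, "Paid", "2014-05-18 23:14:43"),
        (2, "Affiliate", "2014-05-18 23:20:43"),
        (2, "Paid", "2014-05-18 23:22:43"),
        (3, "SEO", "2014-05-18 23:10:43"),
        (3, "Affiliate", "2014-05-18 23:11:43"),
    ],
)

# Materialize the row numbers first instead of computing them inside the CTE.
conn.execute(
    """
    CREATE TEMP TABLE vt AS
    SELECT id,
           MarketingChannel,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn,
           COUNT(*) OVER (PARTITION BY id) AS max_rn
    FROM t
    """
)

# Anchor on the oldest row (rn = max_rn), then append newer rows until rn = 1.
recursive_paths = conn.execute(
    """
    WITH RECURSIVE cte(id, path, rn) AS (
        SELECT id, MarketingChannel, rn
        FROM vt
        WHERE rn = max_rn
        UNION ALL
        SELECT cte.id, cte.path || '>' || vt.MarketingChannel, cte.rn - 1
        FROM vt JOIN cte
          ON vt.id = cte.id
         AND vt.rn = cte.rn - 1
    )
    SELECT path, COUNT(*)
    FROM cte
    WHERE rn = 1
    GROUP BY path
    ORDER BY path
    """
).fetchall()

for path, count in recursive_paths:
    print(path, count)
```

Because the rows are numbered in descending timestamp order, the recursion starts at the oldest row and finishes at `rn = 1`, so the final filter needs no extra join to find each id's last step.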

You could also try the old-school MAX(CASE):

SELECT
   PATH
  ,COUNT(*)
FROM
 (
   SELECT
      id
     ,MAX(CASE WHEN rnk =  0 THEN MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  1 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  2 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  3 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  4 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  5 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  6 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  7 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  8 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk =  9 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 10 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 11 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 12 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 13 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 14 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 15 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 16 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 17 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 18 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 19 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 20 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 21 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 22 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 23 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 24 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 25 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 26 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 27 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 28 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 29 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 30 THEN '>' || MarketingChannel ELSE '' END) ||
      MAX(CASE WHEN rnk = 31 THEN '>' || MarketingChannel ELSE '' END) AS PATH
   FROM
    (
     SELECT
        id
       ,TRIM(MarketingChannel) AS MarketingChannel
       ,RANK() OVER (PARTITION BY id
                     ORDER BY TS) -1 AS rnk
     FROM t
    ) dt
   GROUP BY 1
 ) AS dt
GROUP BY 1;
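Since the MAX(CASE) list is purely repetitive, it can be generated rather than typed. The sketch below builds the expression in Python and runs it against SQLite as a stand-in for Teradata. The names `MAX_STEPS` and `slot` and the sample timestamps are hypothetical, and SQLite 3.25 or newer is assumed for `RANK()`.

```python
import sqlite3

MAX_STEPS = 4  # hypothetical cap on path length; the query above hard-codes 32 slots


def slot(i):
    """One MAX(CASE) term; every slot except the first prepends the delimiter."""
    prefix = "" if i == 0 else "'>' || "
    return f"MAX(CASE WHEN rnk = {i} THEN {prefix}MarketingChannel ELSE '' END)"


path_expr = " ||\n           ".join(slot(i) for i in range(MAX_STEPS))

pivot_sql = f"""
SELECT path, COUNT(*)
FROM (
    SELECT id,
           {path_expr} AS path
    FROM (
        SELECT id,
               TRIM(MarketingChannel) AS MarketingChannel,
               RANK() OVER (PARTITION BY id ORDER BY ts) - 1 AS rnk
        FROM t
    ) AS ranked
    GROUP BY id
) AS paths
GROUP BY path
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, MarketingChannel TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [  # hypothetical timestamps, same per-id order as the question's table
        (1, "SEO", "2014-05-18 23:11:43"),
        (1, "SEO", "2014-05-18 23:12:43"),
        (1, "Paid", "2014-05-18 23:13:43"),
        (2, "Paid", "2014-05-18 23:12:43"),
        (2, "Paid", "2014-05-18 23:14:43"),
        (2, "Affiliate", "2014-05-18 23:20:43"),
        (2, "Paid", "2014-05-18 23:22:43"),
        (3, "SEO", "2014-05-18 23:10:43"),
        (3, "Affiliate", "2014-05-18 23:11:43"),
    ],
)
pivot_paths = conn.execute(pivot_sql).fetchall()
print(pivot_paths)
```

Rows whose rank is at or beyond `MAX_STEPS` match no slot and are silently dropped from the path, so the cap has to be at least the longest path in the data.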

I was able to concatenate up to 2048 rows of 30 characters each :-)

SELECT
   PATH
  ,COUNT(*)
FROM
 (
   SELECT
      id
     ,MAX(CASE WHEN rnk MOD 16 = 0 THEN path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 1 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 2 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 3 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 4 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 5 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 6 THEN '>' || path ELSE '' END) ||
      MAX(CASE WHEN rnk MOD 16 = 7 THEN '>' || path ELSE '' END) AS PATH
   FROM
    (
     SELECT
        id
       ,rnk / 16 AS rnk
       ,MAX(CASE WHEN rnk MOD 16 =  0 THEN path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  1 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  2 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  3 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  4 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  5 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  6 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  7 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  8 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 =  9 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 10 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 11 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 12 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 13 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 14 THEN '>' || path ELSE '' END) ||
        MAX(CASE WHEN rnk MOD 16 = 15 THEN '>' || path ELSE '' END) AS path
     FROM
      (
       SELECT
          id
         ,rnk / 16 AS rnk
         ,MAX(CASE WHEN rnk MOD 16 =  0 THEN path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  1 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  2 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  3 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  4 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  5 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  6 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  7 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  8 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 =  9 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 10 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 11 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 12 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 13 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 14 THEN '>' || path ELSE '' END) ||
          MAX(CASE WHEN rnk MOD 16 = 15 THEN '>' || path ELSE '' END) AS path
       FROM
        (
         SELECT
            id
           ,TRIM(MarketingChannel) AS PATH
           ,RANK() OVER (PARTITION BY id
                         ORDER BY TS) -1 AS rnk
         FROM t
        ) dt
       GROUP BY 1,2
      ) dt
     GROUP BY 1,2
    ) dt
   GROUP BY 1
 ) dt
GROUP BY 1
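The 2048-row limit follows from the nesting: two levels with 16 slots each and a top level with 8 slots give 16 × 16 × 8 = 2048. The idea of concatenating in buckets, level by level, can be illustrated with a small Python sketch. The function name `concat_in_levels` is hypothetical, and unlike the SQL, which would silently drop rows beyond the capacity, this version raises an error.

```python
def concat_in_levels(channels, fanouts=(16, 16, 8), sep=">"):
    """Join strings bucket by bucket, mirroring the three nested GROUP BY levels."""
    parts = list(channels)
    for fanout in fanouts:
        # Each pass merges up to `fanout` neighbouring pieces into one string,
        # like one level of MAX(CASE WHEN rnk MOD 16 = n ...) grouped by rnk / 16.
        parts = [sep.join(parts[i:i + fanout]) for i in range(0, len(parts), fanout)]
    if len(parts) > 1:
        raise ValueError("more rows than the nesting levels can hold")
    return parts[0] if parts else ""


capacity = 16 * 16 * 8
sample = [f"ch{i}" for i in range(capacity)]
nested_result = concat_in_levels(sample)
flat_result = ">".join(sample)
print(capacity, nested_result == flat_result)
```

Each level only ever concatenates a handful of short strings, which is what keeps the individual MAX(CASE) lists manageable while the overall capacity multiplies.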

【Comments】:

    【Solution 2】:

    Here is a query that has been tested with SQL Server. The same syntax should work for Teradata too:

    EDIT

    Converted the multiple CTEs into a single CTE:

    WITH RECURSIVE Single_Path (CURRENT_ID, CURRENT_PATH, CURRENT_TS, rn) AS
    (
      -- Anchor: each id's most recent row (rn = 1 when ordered by TimeStamp DESC)
      SELECT 
        ID CURRENT_ID, 
        CAST(MARKETINGCHANNEL AS VARCHAR(MAX)) CURRENT_PATH, 
        TIMESTAMP CURRENT_TS, 
        1 RN
      FROM 
      (
        SELECT 
          id, 
          marketingChannel, 
          TimeStamp, 
          ROW_NUMBER() OVER (PARTITION BY id ORDER BY TimeStamp DESC) rn
        FROM T
      ) Ordered_Data
      WHERE RN = 1
      UNION ALL
      -- Recursive step: prepend the next-older channel to the path
      SELECT 
        od.id, 
        CAST(od.marketingChannel + ' > ' + sp.CURRENT_PATH AS VARCHAR(MAX)), 
        od.TimeStamp, 
        sp.rn+1
      FROM 
      (
        SELECT 
          id, 
          marketingChannel, 
          TimeStamp, 
          ROW_NUMBER() OVER (PARTITION BY id ORDER BY TimeStamp DESC) rn
        FROM T
      ) od, Single_Path sp
      WHERE od.id = sp.Current_id
      AND od.rn = sp.rn + 1
    )
    SELECT 
      sp2.CURRENT_PATH MARKETING_PATH, 
      COUNT(*) COUNT
    FROM Single_Path sp2
    INNER JOIN 
    (
      -- The Ordered_Data alias is not visible outside the CTE,
      -- so the number of rows per id is counted directly from T
      SELECT 
        ID, 
        COUNT(*) max_rn
      FROM T
      GROUP BY ID
    ) MR
    ON SP2.CURRENT_ID = MR.ID AND SP2.RN = MR.MAX_RN
    GROUP BY SP2.CURRENT_PATH
    ORDER BY sp2.CURRENT_PATH;
    

    SQL Fiddle demo

    References

    Fun with Recursive SQL (Part 1) on Sharpening Stones blog

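    【Comments】:

    • Try removing the RECURSIVE keyword. I can't test it since I don't have a TD environment.
    • Once I remove it, I get a 3707 syntax error, expected an integer between '(' and the 'MAX' keyword.. Thanks for your help, too.
    • I just replaced max with a large number.. then I get [6932] Multiple WITH definitions are not supported.
    • Are you using a version older than TD14?
    【Solution 3】:

    Assuming MySQL: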

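    select
    path, count(*) from (
       select
       id,
       -- order by the timestamp inside group_concat; otherwise the order of channels is not guaranteed
       group_concat(marketingChannel order by `TimeStamp` separator ' > ') as path
       from
       t
       group by id
    ) sq 
    group by path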
    

    【Comments】:
