[Question title]: Correlated subqueries in Snowflake don't work
[Posted]: 2019-08-27 17:22:07
[Question]:

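I am trying to run the following query in Snowflake, but it fails with "Unsupported subquery type cannot be evaluated". The query is valid in other SQL engines such as PostgreSQL and Presto, so it looks like Snowflake does not support this type of query.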

SELECT first_action.date, 
  DATEDIFF('day', first_action.date, returning_action.date) - 1 as diff, 
  APPROXIMATE_SIMILARITY(select MINHASH_COMBINE(value) from (select first_action.user_id_set as value union all select returning_action.user_id_set)) _set
  FROM (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as first_action
  JOIN (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as returning_action 
ON (first_action.date < returning_action.date AND dateadd(day, 14, first_action.date) >= returning_action.date)
group by 1,2

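The query is a typical cohort query using MinHash. We compute the MinHash for each day, join each day to the following 14 days, merge the two MinHash states, and then compute the similarity of the merged result.

Since MinHash has no linear (two-argument) MINHASH_COMBINE function, we have to use a subquery with UNION ALL to feed it both states, but that did not work either. :/

We are stuck now because we do not know of any workaround. Any help is appreciated!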

[Comments]:

    Tags: snowflake-cloud-data-platform


    [Answer 1]:

    Not sure whether this will work, but try using a WITH statement to separate first_action and returning_action:

    WITH 
    first_action as (
        SELECT 
            CAST(_time AS DATE) as date,  -- CAST as in the question; TRY_CAST accepts only string input
            MINHASH(100, _user) as user_id_set 
        FROM events 
        GROUP BY 1
    ),
    returning_action as (
        SELECT 
            CAST(_time AS DATE) as date,  -- CAST as in the question; TRY_CAST accepts only string input
            MINHASH(100, _user) as user_id_set 
        FROM events 
        GROUP BY 1
    )
    SELECT 
      fa.date, 
      DATEDIFF('day', fa.date, ra.date) - 1 as diff, 
      APPROXIMATE_SIMILARITY(
          SELECT MINHASH_COMBINE(value) 
          FROM (
              SELECT fa.user_id_set AS value FROM first_action fa
              UNION ALL  
              SELECT ra.user_id_set AS value FROM returning_action ra
          )
      ) _set
    FROM first_action fa
    JOIN returning_action ra
    ON (fa.date < ra.date AND DATEADD(day, 14, fa.date) >= ra.date)
    GROUP BY 1,2
    

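    [Comments]:

      [Answer 2]:

      The main trick is that all of these MINHASH_ functions are aggregate functions (they can also be used as window functions), so they operate on a group of rows and you need to build a grouping key on the data.

      Using this as my sample data: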

      CREATE TABLE events(_user number, _time timestamp_ntz);
      INSERT INTO events VALUES (1,'2019-03-01'),(1,'2019-03-05'),(1,'2019-03-10'),
          (1,'2019-03-14'),(1,'2019-03-15'),(1,'2019-03-16'),
          (2,'2019-03-01'),(2,'2019-03-05'),(2,'2019-03-11'),
          (2,'2019-03-15');
      

      The first step is to collect the 14 days of data to COMBINE:

      WITH actions AS (
          SELECT _time::date as date
              ,dateadd(day, 14, date) as date14
              ,minhash(100, _user) as user_id_set
          FROM events
          GROUP BY 1
      )
      SELECT fa.date
          ,ARRAY_AGG(ra.date) WITHIN GROUP (ORDER BY ra.date)
          ,MINHASH_COMBINE(ra.user_id_set) AS sets
      FROM actions AS fa
      JOIN actions AS ra 
          ON (fa.date <= ra.date AND fa.date14 > ra.date) 
      GROUP BY 1
      ORDER BY 1; 
      

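      This is similar to your code, but here RA includes the same day as FA, so I can group by FA.date and still include FA's own data. For the date range, I am not sure whether you want dates up to and including 14 days later, or strictly less than 14 days later. I have assumed the latter and changed the end of the range accordingly.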
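If the inclusive reading is the one you want, the change is confined to the join condition. The following is a minimal sketch, assuming the events sample table above; it drops the date14 helper column and writes the upper bound inline so the inclusive comparison is visible.

```sql
WITH actions AS (
    SELECT _time::date AS date
        ,MINHASH(100, _user) AS user_id_set
    FROM events
    GROUP BY 1
)
SELECT fa.date
    ,MINHASH_COMBINE(ra.user_id_set) AS sets
FROM actions AS fa
JOIN actions AS ra
    -- inclusive upper bound: ra.date may be exactly 14 days after fa.date
    ON fa.date <= ra.date AND ra.date <= DATEADD(day, 14, fa.date)
GROUP BY 1
ORDER BY 1;
```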

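      Now that every day has the combined data for its following 14 days, we want to compare those combined sets in pairs (in my code I have not set a maximum number of days between the two sides, so all pairs of days are included). Again, APPROXIMATE_SIMILARITY aggregates over rows, so I build an array of the two sets and immediately flatten it again, which pivots the two columns into two rows. That is what you were trying to do with the UNION ALL (see the pairs and unrolled CTEs).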

      WITH actions AS (
          SELECT _time::date AS date
              ,dateadd(day, 14, date) as date14
              ,minhash(100, _user) AS user_id_set
          FROM events
          GROUP BY 1
      ), combined AS (
          SELECT fa.date
              ,MINHASH_COMBINE(ra.user_id_set) AS sets
          FROM actions AS fa
          JOIN actions AS ra 
              ON fa.date <= ra.date AND fa.date14 > ra.date
          GROUP BY 1
      ), pairs AS (
          SELECT fa.date
              ,DATEDIFF('day', fa.date, ra.date) AS diff
              ,ARRAY_CONSTRUCT(fa.sets,ra.sets) AS comp_set
          FROM combined AS fa
          JOIN combined AS ra 
              ON fa.date < ra.date
      ), unrolled AS (
          SELECT date
              ,diff
              ,f.value AS sets
          FROM pairs p,
          LATERAL FLATTEN(input => p.comp_set) f
      )
      SELECT date
          ,diff
          ,APPROXIMATE_SIMILARITY(sets)
      FROM unrolled
      GROUP BY 1,2
      ORDER BY 1,2;
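The array-then-flatten pivot used in the pairs and unrolled CTEs can be looked at in isolation. The sketch below uses made-up literal values (the id column and the strings are hypothetical stand-ins for the dates and MinHash states) and only shows how one row holding a two-element array becomes two rows sharing the same key.

```sql
WITH pairs AS (
    -- hypothetical stand-in data: one array of two values per id
    SELECT 1 AS id, ARRAY_CONSTRUCT('x', 'y') AS comp_set
    UNION ALL
    SELECT 2 AS id, ARRAY_CONSTRUCT('p', 'q') AS comp_set
)
SELECT p.id
    ,f.index   -- position of the element inside the array
    ,f.value   -- the element itself, returned as a VARIANT
FROM pairs p,
LATERAL FLATTEN(input => p.comp_set) f
ORDER BY p.id, f.index;
```

Each id comes back on two rows, so a later GROUP BY on id hands both elements to an aggregate such as APPROXIMATE_SIMILARITY.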
      

      This gives you results for all pairs of days:

      DATE        DIFF  APPROXIMATE_SIMILARITY(SETS)
      2019-03-01  4     1
      2019-03-01  9     1
      2019-03-01  10    1
      2019-03-01  13    1
      2019-03-01  14    1
      2019-03-01  15    0.51
      2019-03-05  5     1
      2019-03-05  6     1
      2019-03-05  9     1
      2019-03-05  10    1
      2019-03-05  11    0.51
      2019-03-10  1     1
      2019-03-10  4     1
      2019-03-10  5     1
      2019-03-10  6     0.51
      2019-03-11  3     1
      2019-03-11  4     1
      2019-03-11  5     0.51
      2019-03-14  1     1
      2019-03-14  2     0.51
      2019-03-15  1     0.51
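The results above cover every pair of days, while the question only compared days at most 14 days apart and reported the difference minus one. A possible variant is sketched below, assuming the same events sample table; the only changes are the extra bound in the pairs join (copied from the question's join condition) and the "- 1" on diff. Whether the combined sets or the single-day sets should be compared is left as in the answer.

```sql
WITH actions AS (
    SELECT _time::date AS date
        ,DATEADD(day, 14, date) AS date14
        ,MINHASH(100, _user) AS user_id_set
    FROM events
    GROUP BY 1
), combined AS (
    SELECT fa.date
        ,MINHASH_COMBINE(ra.user_id_set) AS sets
    FROM actions AS fa
    JOIN actions AS ra
        ON fa.date <= ra.date AND fa.date14 > ra.date
    GROUP BY 1
), pairs AS (
    SELECT fa.date
        ,DATEDIFF('day', fa.date, ra.date) - 1 AS diff  -- offset as in the question
        ,ARRAY_CONSTRUCT(fa.sets, ra.sets) AS comp_set
    FROM combined AS fa
    JOIN combined AS ra
        ON fa.date < ra.date
        -- bound taken from the question: at most 14 days apart
        AND DATEADD(day, 14, fa.date) >= ra.date
), unrolled AS (
    SELECT date
        ,diff
        ,f.value AS sets
    FROM pairs p,
    LATERAL FLATTEN(input => p.comp_set) f
)
SELECT date
    ,diff
    ,APPROXIMATE_SIMILARITY(sets)
FROM unrolled
GROUP BY 1,2
ORDER BY 1,2;
```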
      

      [Comments]:
