The main trick is that all of these MINHASH_ functions are aggregate functions, so you need to build a grouping key on your data.
Using this as my sample data:
CREATE TABLE events(_user number, _time timestamp_ntz);
INSERT INTO events VALUES (1,'2019-03-01'),(1,'2019-03-05'),(1,'2019-03-10'),
(1,'2019-03-14'),(1,'2019-03-15'),(1,'2019-03-16'),
(2,'2019-03-01'),(2,'2019-03-05'),(2,'2019-03-11'),
(2,'2019-03-15');
The first step is to get each day's 14 days of data COMBINEd:
WITH actions AS (
    -- one MinHash state per calendar day, built from that day's users
    SELECT _time::date AS date
        ,dateadd(day, 14, date) AS date14
        ,minhash(100, _user) AS user_id_set
    FROM events
    GROUP BY 1
)
SELECT fa.date
    ,ARRAY_AGG(ra.date) WITHIN GROUP (ORDER BY ra.date)
    ,MINHASH_COMBINE(ra.user_id_set) AS sets
FROM actions AS fa
JOIN actions AS ra
    -- ra covers fa's own day up to, but not including, the day 14 days later
    ON (fa.date <= ra.date AND fa.date14 > ra.date)
GROUP BY 1
ORDER BY 1;
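This is similar to your code, but here RA includes the same day as FA, so I can group by FA.date and still include FA's own data. For the date range, I was not sure whether you wanted the window to include the date exactly 14 days later or to stop just before it. I assumed the latter, and so changed how the end of the range terminates (fa.date14 > ra.date).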
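To see why combining per-day states over the join yields a state for the whole window, here is a toy MinHash sketch in Python. It is not Snowflake's implementation: the function names only mirror the SQL functions, and the choice of seeded BLAKE2b as the hash family is an assumption made for illustration. The point it demonstrates is that the element-wise minimum of two signatures equals the signature of the union of the underlying sets.

```python
import hashlib


def minhash(items, k=100):
    """Toy MinHash: for each of k seeded hashes, keep the minimum over the items.

    Hypothetical stand-in for MINHASH(100, col); items must be non-empty.
    """
    signature = []
    for seed in range(k):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
                "big",
            )
            for x in items
        ))
    return signature


def minhash_combine(*signatures):
    """Signature of the union of the inputs: element-wise minimum."""
    return [min(values) for values in zip(*signatures)]


def approximate_similarity(a, b):
    """Estimated Jaccard index: fraction of positions where the signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# In the sample data, 2019-03-10 has only user 1 and 2019-03-11 has only user 2.
day_10 = minhash({1})
day_11 = minhash({2})
combined = minhash_combine(day_10, day_11)

# Combining the two per-day states is the same as hashing both users together.
assert combined == minhash({1, 2})
print(approximate_similarity(combined, minhash({1, 2})))  # 1.0
```

Because the combine step only needs the per-day signatures, the raw events are hashed once per day and the 14-day windows are then built from those states alone.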
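Now that each day has the combined data for its next 14 days, we want to get the pairs of days (in my code I did not set a maximum number of days for the comparison, and simply include all pairs). Again, APPROXIMATE_SIMILARITY is an aggregate function, so I build an array of the two states and immediately tear it apart again with FLATTEN, which pivots the data into rows. That is what you were trying to do with UNION ALL (you can see this in the pairs and unrolled CTEs).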
WITH actions AS (
    -- one MinHash state per calendar day
    SELECT _time::date AS date
        ,dateadd(day, 14, date) AS date14
        ,minhash(100, _user) AS user_id_set
    FROM events
    GROUP BY 1
), combined AS (
    -- one combined state per day, covering that day's 14-day window
    SELECT fa.date
        ,MINHASH_COMBINE(ra.user_id_set) AS sets
    FROM actions AS fa
    JOIN actions AS ra
        ON fa.date <= ra.date AND fa.date14 > ra.date
    GROUP BY 1
), pairs AS (
    -- every earlier/later pair of days, with both states packed into an array
    SELECT fa.date
        ,DATEDIFF('day', fa.date, ra.date) AS diff
        ,ARRAY_CONSTRUCT(fa.sets, ra.sets) AS comp_set
    FROM combined AS fa
    JOIN combined AS ra
        ON fa.date < ra.date
), unrolled AS (
    -- unpack the array so each pair becomes two rows sharing (date, diff)
    SELECT date
        ,diff
        ,f.value AS sets
    FROM pairs p,
        LATERAL FLATTEN(input => p.comp_set) f
)
SELECT date
    ,diff
    ,APPROXIMATE_SIMILARITY(sets)
FROM unrolled
GROUP BY 1,2
ORDER BY 1,2;
So you get results for all pairs of days:
DATE DIFF APPROXIMATE_SIMILARITY(SETS)
2019-03-01 4 1
2019-03-01 9 1
2019-03-01 10 1
2019-03-01 13 1
2019-03-01 14 1
2019-03-01 15 0.51
2019-03-05 5 1
2019-03-05 6 1
2019-03-05 9 1
2019-03-05 10 1
2019-03-05 11 0.51
2019-03-10 1 1
2019-03-10 4 1
2019-03-10 5 1
2019-03-10 6 0.51
2019-03-11 3 1
2019-03-11 4 1
2019-03-11 5 0.51
2019-03-14 1 1
2019-03-14 2 0.51
2019-03-15 1 0.51
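As a sanity check on these numbers, the exact Jaccard similarity can be computed for the same sample data with plain sets. This is a minimal Python sketch, assuming the same half-open 14-day window as the query; window_users and pairwise_jaccard are hypothetical helper names, not part of Snowflake. On this data the window starting 2019-03-16 contains only user 1 while every earlier window contains both users, so the exact value for those pairs is 0.5, which the MinHash estimate reports as 0.51.

```python
from datetime import date, timedelta
from itertools import combinations

# Same rows as the INSERT INTO events statement: (_user, _time as a date).
events = [
    (1, date(2019, 3, 1)), (1, date(2019, 3, 5)), (1, date(2019, 3, 10)),
    (1, date(2019, 3, 14)), (1, date(2019, 3, 15)), (1, date(2019, 3, 16)),
    (2, date(2019, 3, 1)), (2, date(2019, 3, 5)), (2, date(2019, 3, 11)),
    (2, date(2019, 3, 15)),
]


def window_users(events, days=14):
    """Exact user set per event date d over the half-open window [d, d + days)."""
    starts = sorted({d for _, d in events})
    return {
        start: {u for u, d in events if start <= d < start + timedelta(days=days)}
        for start in starts
    }


def pairwise_jaccard(windows):
    """(earlier date, day difference, exact Jaccard) for every pair of dates."""
    rows = []
    for a, b in combinations(sorted(windows), 2):
        users_a, users_b = windows[a], windows[b]
        rows.append((a, (b - a).days, len(users_a & users_b) / len(users_a | users_b)))
    return rows


rows = pairwise_jaccard(window_users(events))
for start, diff, similarity in rows:
    print(start, diff, similarity)
```

The loop prints 21 rows in the same order as the query output, with 0.5 wherever the table shows 0.51 and 1.0 everywhere else.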