【问题标题】:Find unique entities with multiple UUID identifiers in redshift在 redshift 中查找具有多个 UUID 标识符的唯一实体
【发布时间】:2018-01-23 10:06:22
【问题描述】:

有一个事件表,每个用户都有多种类型的 UUID,我们想想出一种方法将所有这些 UUID 拼接在一起,以获得单个用户的最高定义。

例如:

   UUID1 | UUID2
   1      a
   1      a
   2      a
   2      b
   3      c
   4      c

这里有 2 个用户,第一个用户 uuid1={1,2} 和 uuid2={a,b},第二个用户 uuid1={3,4} 和 uuid2={c}。这些链可能很长。没有交叉点(即 1c 不存在)并且所有行都按时间戳排序。

在 redshift 中有没有一种方法可以生成这些唯一的“访客”标识符,而无需创建包含许多连接的巨大查询?

提前致谢!

【问题讨论】:

  • 链可能相交吗?像第二个用户的“1c”条目......并且是否有时间戳列来对链中的事件进行排序?
  • 感谢您的提问:分别是否和是。编辑了问题以进行澄清。

标签: amazon-redshift uuid uniqueidentifier bigdata


【解决方案1】:

创建测试数据表

-- DROP TABLE uuid_test;
CREATE TEMP TABLE uuid_test AS
          SELECT   1 row_id, 1::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT   2 row_id, 1::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT   3 row_id, 2::int uuid1, 'a'::char(1) uuid2
UNION ALL SELECT   4 row_id, 2::int uuid1, 'b'::char(1) uuid2
UNION ALL SELECT   5 row_id, 3::int uuid1, 'c'::char(1) uuid2
UNION ALL SELECT   6 row_id, 4::int uuid1, 'c'::char(1) uuid2
UNION ALL SELECT   7 row_id, 4::int uuid1, 'd'::char(1) uuid2
UNION ALL SELECT   8 row_id, 5::int uuid1, 'e'::char(1) uuid2
UNION ALL SELECT   9 row_id, 6::int uuid1, 'e'::char(1) uuid2
UNION ALL SELECT  10 row_id, 6::int uuid1, 'f'::char(1) uuid2
UNION ALL SELECT  11 row_id, 7::int uuid1, 'f'::char(1) uuid2
UNION ALL SELECT  12 row_id, 8::int uuid1, 'g'::char(1) uuid2
UNION ALL SELECT  13 row_id, 8::int uuid1, 'h'::char(1) uuid2
;

解决实际问题的方法是使用严格的排序来查找唯一用户更改的每个地方,将其捕获为查找表,然后将其应用于原始数据。

-- Create lookup table with a from-to range of IDs for each unique user
WITH unique_user AS (

-- Calculate the end of the id range using LEAD() to look ahead
-- Use an inline MAX() to find the ending ID for the last entry

SELECT row_id AS from_id
     , NVL(LEAD(row_id,1) OVER (ORDER BY row_id)-1, (SELECT MAX(row_id) FROM uuid_test) ) AS to_id
     , unique_uuid

-- Mark unique user change when there is discontinuity in either UUID
FROM (SELECT row_id
            ,CASE WHEN NVL(LAG(uuid1,1) OVER (ORDER BY row_id),  0) <> uuid1
                   AND NVL(LAG(uuid2,1) OVER (ORDER BY row_id), '') <> uuid2
             THEN MD5(uuid1||uuid2)
             ELSE NULL END unique_uuid
      FROM uuid_test) t
WHERE unique_uuid IS NOT NULL
ORDER BY row_id
)

-- Apply the unique user value to each row using a range join to the lookup table
SELECT a.row_id, a.uuid1, a.uuid2, b.unique_uuid
FROM uuid_test   AS a
JOIN unique_user AS b
  ON a.row_id BETWEEN b.from_id AND b.to_id
ORDER BY a.row_id
;

这是输出

 row_id | uuid1 | uuid2 |           unique_uuid
--------+-------+-------+----------------------------------
      1 |     1 | a     | efaa153b0f682ae5170a3184fa0df28c
      2 |     1 | a     | efaa153b0f682ae5170a3184fa0df28c
      3 |     2 | a     | efaa153b0f682ae5170a3184fa0df28c
      4 |     2 | b     | efaa153b0f682ae5170a3184fa0df28c
      5 |     3 | c     | 5fcfcb7df376059d0075cb892b2cc37f
      6 |     4 | c     | 5fcfcb7df376059d0075cb892b2cc37f
      7 |     4 | d     | 5fcfcb7df376059d0075cb892b2cc37f
      8 |     5 | e     | 18a368e1052b5aa0388ef020dd9a1e20
      9 |     6 | e     | 18a368e1052b5aa0388ef020dd9a1e20
     10 |     6 | f     | 18a368e1052b5aa0388ef020dd9a1e20
     11 |     7 | f     | 18a368e1052b5aa0388ef020dd9a1e20
     12 |     8 | g     | 321fcc2447163a81d470b9353e394121
     13 |     8 | h     | 321fcc2447163a81d470b9353e394121

【讨论】:

    猜你喜欢
    • 2021-02-26
    • 2011-06-14
    • 1970-01-01
    • 2018-03-13
    • 2018-03-01
    • 1970-01-01
    • 1970-01-01
    • 2021-12-16
    • 2019-10-26
    相关资源
    最近更新 更多