【Question title】: BigQuery: compare all the columns (100+) from two rows in a single table
【Posted】: 2023-02-08 09:59:08
【Question description】:

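I have the following input table:

id  col1  col2  time
01  abc   001   12:00
01  def   002   12:10

Desired output table:

id  col1  col2  time   diff_field
01  abc   001   12:00  null
01  def   002   12:10  col1,col2

I need to compare the two rows, find every column whose value differs, and store the names of those columns in a new column, diff_field.

I need an optimized solution, because my table has more than 100 columns (all of which need to be compared).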

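【Question discussion】:

  • Could you make the description clearer? Are you comparing each row with the previous one and recording which columns have a different value in the later row? If so, why does diff_field in the second row of your output table not include "time", given that the time values differ between row 1 and row 2?

Tags: google-cloud-platform google-bigquery bigdata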


【Solution 1】:

You might consider the following approach:

WITH sample_table AS (
  SELECT '01' id, 'abc' col1, '001' col2, '12:00' time UNION ALL
  SELECT '01' id, 'def' col1, '002' col2, '12:10' time UNION ALL
  SELECT '01' id, 'def' col1, '002' col2, '12:20' time UNION ALL
  SELECT '01' id, 'ddf' col1, '002' col2, '12:30' time
)
SELECT * EXCEPT(curr, prev),
       -- Split both row strings on ',' and compare the pieces position by position.
       -- The last piece (time) is skipped; a differing piece at position N is reported as 'colN'.
       (SELECT STRING_AGG('col' || offset)
          FROM UNNEST(SPLIT(curr)) c WITH offset
          JOIN UNNEST(SPLIT(prev)) p WITH offset USING (offset)
         WHERE c <> p AND offset < ARRAY_LENGTH(SPLIT(curr)) - 1
       ) diff_field
  FROM (
    -- curr is the whole row rendered as one string; prev is the same string for the previous row of this id.
    SELECT *, FORMAT('%t', t) AS curr, LAG(FORMAT('%t', t)) OVER w AS prev
      FROM sample_table t
    WINDOW w AS (PARTITION BY id ORDER BY time)
  );

Query results: (image not included)
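
The positional comparison in the query above can be tried outside BigQuery with a minimal JavaScript sketch. It rests on assumptions: `formatRow` is a hypothetical stand-in that renders a row as `(v1, v2, ...)`, which is how `FORMAT('%t', t)` is assumed to print a STRUCT of strings, and `positionalDiff` is a hypothetical helper that mirrors the subquery. It is meant to show the mechanics and one caveat, not to reproduce BigQuery exactly.

```javascript
// Hypothetical stand-in for FORMAT('%t', t) on a STRUCT of strings (assumed rendering).
function formatRow(values) {
  return '(' + values.join(', ') + ')';
}

// Hypothetical helper mirroring the correlated subquery:
// split on ',', pair pieces by offset, skip the last piece of curr, name hits 'col' + offset.
function positionalDiff(curr, prev) {
  if (prev === null) return null;          // LAG yields NULL for the first row of a partition
  const c = curr.split(',');
  const p = prev.split(',');
  const n = Math.min(c.length, p.length);  // inner join on offset
  const changed = [];
  for (let offset = 0; offset < n; offset++) {
    if (c[offset] !== p[offset] && offset < c.length - 1) {
      changed.push('col' + offset);
    }
  }
  return changed.length > 0 ? changed.join(',') : null;
}

const row1 = formatRow(['01', 'abc', '001', '12:00']);
const row2 = formatRow(['01', 'def', '002', '12:10']);
const plain = positionalDiff(row2, row1);
console.log(plain);   // col1,col2

// Caveat: a value that itself contains a comma shifts every later position.
const row3 = formatRow(['01', 'd,ef', '002', '12:20']);
const shifted = positionalDiff(row3, row2);
console.log(shifted); // col1,col2,col3 although only col1 really changed
```

Two things stand out in this sketch. The names `col1` and `col2` come out right only because the sample columns happen to be called `col` plus their position, so a real table would need a lookup from position to column name. And splitting on commas appears to be safe only when no value contains a comma.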


    【Solution 2】:

    The approach below does not depend on the actual column names or on any naming convention, only on the id and time columns.

    -- Return the column names of a row that was serialized with to_json_string.
    create temp function extract_keys(input string) returns array<string> language js as """
      return Object.keys(JSON.parse(input));
      """;
    -- Return the column values of that row, in the same order as the names.
    create temp function extract_values(input string) returns array<string> language js as """
      return Object.values(JSON.parse(input));
      """;
    select t.*, 
      -- Pair names, current values and previous values by offset; keep the names whose values differ.
      ( select string_agg(col)
        from unnest(extract_keys(cur)) as col with offset
        join unnest(extract_values(cur)) as cur_val with offset using(offset)
        join unnest(extract_values(prev)) as prev_val with offset using(offset)
        where cur_val != prev_val and col != 'time'
      ) as diff_field
    from (
      -- For the first row of each id there is no previous row, so it is compared with itself.
      select t, to_json_string(t) cur, to_json_string(ifnull(lag(t) over(win), t)) prev
      from your_table t
      window win as (partition by id order by time)
    )
    

    Applied to the sample data in your question (or rather to the extended version I borrowed from Jaytiger's answer), the output is: (image not included)
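
Because the two UDF bodies in Solution 2 are plain JavaScript, the comparison can be checked locally before running it in BigQuery. The sketch below is an illustration under assumptions: `diffField` is a hypothetical helper that imitates the correlated subquery, `JSON.stringify` stands in for `to_json_string`, and the rows are the four sample rows used in Solution 1. NULL handling in BigQuery differs from JavaScript and is not modeled here.

```javascript
// Same bodies as the extract_keys / extract_values UDFs.
function extractKeys(input) {
  return Object.keys(JSON.parse(input));
}
function extractValues(input) {
  return Object.values(JSON.parse(input));
}

// Hypothetical helper imitating the subquery: pair names and values by offset,
// keep the names whose values differ, and skip the ordering column.
function diffField(cur, prev, orderColumn = 'time') {
  const keys = extractKeys(cur);
  const curVals = extractValues(cur);
  const prevVals = extractValues(prev);
  const changed = keys.filter(
    (col, i) => curVals[i] !== prevVals[i] && col !== orderColumn
  );
  // string_agg over zero rows returns NULL in BigQuery; null stands in for that.
  return changed.length > 0 ? changed.join(',') : null;
}

// JSON.stringify is used here as a stand-in for to_json_string (assumed equivalent shape).
const rows = [
  { id: '01', col1: 'abc', col2: '001', time: '12:00' },
  { id: '01', col1: 'def', col2: '002', time: '12:10' },
  { id: '01', col1: 'def', col2: '002', time: '12:20' },
  { id: '01', col1: 'ddf', col2: '002', time: '12:30' },
].map((r) => JSON.stringify(r));

// The first row has no predecessor, so it is compared with itself (ifnull(lag(t), t)).
const result = rows.map((cur, i) => diffField(cur, i === 0 ? cur : rows[i - 1]));
console.log(result); // [ null, 'col1,col2', null, 'col1' ]
```

Since the names come from the JSON keys rather than from positions, this version reports real column names and is not affected by commas inside values.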
