[Posted]: 2020-07-30 05:32:56
[Question]:
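I have a table in BigQuery, although the question applies to any database.
I want to delete rows based on a time condition. When a user clicks too quickly, a duplicate entry is created and needs to be removed. In some cases, however, two valid leads arrive very close together because the IP address or advertiser differs, and those need to be kept. A row should be treated as a duplicate when it falls within 4 seconds of an earlier row.
I also need to make sure that once a row is flagged as a duplicate, the rows after it do not use the duplicate row's timestamp to derive the 4-second flag.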
(The comments use spreadsheet cell references: the header is row 1, so the first data row is B2 and the last is B16.)
ip_address  datetime                     advertiser  order_number  Comment
34.195.131  2020-07-03 22:45:02.585 UTC  homepage    5678          KEEP
34.195.131  2020-07-03 22:45:05.593 UTC  homepage    5678          REMOVE - WITHIN 4 SECONDS OF B2
34.195.131  2020-07-03 22:45:08.923 UTC  homepage    5678          KEEP - SINCE B3 WAS REMOVED, B4 IS NOW MORE THAN 4 SECONDS FROM B2
34.195.131  2020-07-03 22:45:13.788 UTC  homepage    5678          KEEP
34.195.131  2020-07-03 22:45:16.523 UTC  homepage    5678          REMOVE - WITHIN 4 SECONDS OF B5
34.195.131  2020-07-03 22:45:20.393 UTC  homepage    5678          KEEP - SINCE B6 WAS REMOVED AND MORE THAN 4 SECONDS FROM B5
34.195.131  2020-07-03 22:45:21.247 UTC  homepage    5678          REMOVE - WITHIN 4 SECONDS OF B7
34.195.131  2020-07-03 22:45:24.924 UTC  homepage    5678          KEEP - SINCE B8 WAS REMOVED AND MORE THAN 4 SECONDS FROM B7
34.195.131  2020-07-03 22:45:27.443 UTC  homepage    5678          REMOVE - WITHIN 4 SECONDS OF B9
34.195.131  2020-07-03 22:45:30.561 UTC  homepage    5678          KEEP - SINCE B10 WAS REMOVED AND MORE THAN 4 SECONDS FROM B9
34.195.131  2020-07-03 22:45:32.561 UTC  homepage    5678          REMOVE - WITHIN 4 SECONDS OF B11
34.195.131  2020-07-03 22:45:33.935 UTC  homepage    5678          REMOVE - WITHIN 4 SECONDS OF B11
34.195.131  2020-07-03 22:45:36.083 UTC  homepage    5678          KEEP - SINCE B12 AND B13 WERE REMOVED AND MORE THAN 4 SECONDS FROM B11
34.195.132  2020-07-03 22:45:38.849 UTC  homepage    5678          KEEP - EVEN THOUGH WITHIN 4 SECONDS OF B14, THIS IS A DIFFERENT IP_ADDRESS
34.195.132  2020-07-03 22:45:38.949 UTC  homepage    1234          KEEP - EVEN THOUGH WITHIN 4 SECONDS OF B15, THIS IS A NEW ORDER_NUMBER
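To make the intended rule concrete, here is a minimal Python sketch of the logic (not BigQuery SQL, and not a solution to the SQL problem itself). It assumes that rows are only compared within the same ip_address, advertiser and order_number, that each row is compared against the last row that was kept rather than the row immediately before it, and that a gap of exactly 4 seconds still counts as a duplicate. The names `Lead`, `partition_key`, `dedupe` and `at` are made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby
from typing import Iterable, List, NamedTuple


class Lead(NamedTuple):
    ip_address: str
    ts: datetime
    advertiser: str
    order_number: int


def partition_key(lead: Lead):
    # Rows are only compared with rows that share all three of these fields.
    return (lead.ip_address, lead.advertiser, lead.order_number)


def dedupe(leads: Iterable[Lead],
           window: timedelta = timedelta(seconds=4)) -> List[Lead]:
    """Keep a row only if it is more than `window` after the last KEPT row
    of its partition (assumption: a gap of exactly `window` is a duplicate)."""
    kept: List[Lead] = []
    ordered = sorted(leads, key=lambda lead: (partition_key(lead), lead.ts))
    for _, group in groupby(ordered, key=partition_key):
        last_kept_ts = None
        for lead in group:
            # Compare against the last kept row, never against a removed one.
            if last_kept_ts is None or lead.ts - last_kept_ts > window:
                kept.append(lead)
                last_kept_ts = lead.ts
    return kept


def at(clock: str) -> datetime:
    # Hypothetical helper: every sample row falls on 2020-07-03.
    return datetime.strptime("2020-07-03 " + clock, "%Y-%m-%d %H:%M:%S.%f")


sample = [
    Lead("34.195.131", at("22:45:02.585"), "homepage", 5678),
    Lead("34.195.131", at("22:45:05.593"), "homepage", 5678),
    Lead("34.195.131", at("22:45:08.923"), "homepage", 5678),
    Lead("34.195.132", at("22:45:38.849"), "homepage", 5678),
    Lead("34.195.132", at("22:45:38.949"), "homepage", 1234),
]

for lead in dedupe(sample):
    print(lead.ip_address, lead.ts.time(), lead.order_number)
```

The dependency on the last kept row is what makes this awkward: comparing each row only with its immediate predecessor would wrongly remove the 22:45:36.083 row, which is 2.148 seconds after a removed row but 5.522 seconds after the last kept one.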
I have tried using CTEs and self joins, but without any success so far. Can anyone tell me how to do this, or point me toward how to proceed?
[Comments]:
Tags: google-bigquery