【Question title】: BigQuery - remove rows based on time difference
【Posted】: 2020-07-30 05:32:56
【Question】:

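I have a table in BigQuery. It could be any database.

I want to remove rows based on a time condition. If a user clicks too quickly, a duplicate entry is created that needs to be removed. In some cases, however, two valid leads arrive very close together because the IP address or advertiser differs, and those need to be kept. We need to deduplicate when rows are within 4 seconds of each other.

I also need to make sure that if a row is flagged as a duplicate, the following rows do not use the duplicate row's timestamp to derive the 4-second flag.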


ip_address  datetime                    advertiser  order_number    Comment
34.195.131  2020-07-03 22:45:02.585 UTC homepage    5678            KEEP
34.195.131  2020-07-03 22:45:05.593 UTC homepage    5678            REMOVE - WITHIN 4 SECONDS OF B2
34.195.131  2020-07-03 22:45:08.923 UTC homepage    5678            KEEP - SINCE B3 WAS REMOVED, C4 IS NOW MORE THAN 4 SECONDS FROM B2
34.195.131  2020-07-03 22:45:13.788 UTC homepage    5678            KEEP
34.195.131  2020-07-03 22:45:16.523 UTC homepage    5678            REMOVE - WITHIN 4 SECONDS OF B5
34.195.131  2020-07-03 22:45:20.393 UTC homepage    5678            KEEP - SINCE B6 WAS REMOVED, LESS THAN 4 SECONDS OF B4
34.195.131  2020-07-03 22:45:21.247 UTC homepage    5678            REMOVE - WITHIN 4 SECONDS OF B7
34.195.131  2020-07-03 22:45:24.924 UTC homepage    5678            KEEP - SINCE B8 WAS REMOVED AND MORE THAN 4 SECONDS OF B7
34.195.131  2020-07-03 22:45:27.443 UTC homepage    5678            REMOVE - WITHIN 4 SECONDS OF B9
34.195.131  2020-07-03 22:45:30.561 UTC homepage    5678            KEEP - SINCE B10 WAS REMOVED AND MORE THAN 4 SECONDS OF B9
34.195.131  2020-07-03 22:45:32.561 UTC homepage    5678            REMOVE - WITHIN 4 SECONDS OF B11
34.195.131  2020-07-03 22:45:33.935 UTC homepage    5678            REMOVE - WITHIN 4 SECONDS OF B11
34.195.131  2020-07-03 22:45:36.083 UTC homepage    5678            KEEP - SINCE B12 AND B13 WERE REMOVED AND MORE THAN 4 SECONDS OF B11
34.195.132  2020-07-03 22:45:38.849 UTC homepage    5678            KEEP - EVEN THOUGH WITHIN 4 SECONDS OF B14, THIS IS A DIFFERENT IP_ADDRESS
34.195.132  2020-07-03 22:45:38.949 UTC homepage    1234            KEEP - EVEN THOUGH WITHIN 4 SECONDS OF B15 THIS IS A NEW ORDER_NUMBER
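
The rule illustrated by the table can be sketched procedurally. The following is a minimal Python sketch, not a BigQuery query: it walks the rows in time order and compares each row with the last row that was kept for the same ip_address, advertiser and order_number, so a removed row never becomes the reference point. The names `flag_duplicates`, `parse` and `sample` are made up for this illustration, and treating a gap of exactly 4 seconds as a duplicate is an assumption.

```python
from datetime import datetime, timedelta


def flag_duplicates(rows, window=timedelta(seconds=4)):
    """Return (row, keep) pairs sorted by timestamp.

    rows: iterable of (ip_address, timestamp, advertiser, order_number).
    A row is dropped when it falls within `window` of the last KEPT row
    that shares its ip_address, advertiser and order_number.
    """
    last_kept = {}  # (ip, advertiser, order_number) -> timestamp of last kept row
    flagged = []
    for row in sorted(rows, key=lambda r: r[1]):
        ip, ts, advertiser, order_number = row
        key = (ip, advertiser, order_number)
        anchor = last_kept.get(key)
        # Assumption: a gap of exactly 4 seconds still counts as a duplicate.
        keep = anchor is None or ts - anchor > window
        if keep:
            last_kept[key] = ts  # only kept rows move the reference point
        flagged.append((row, keep))
    return flagged


def parse(ts):
    """Parse the sample timestamps (the trailing ' UTC' is left out)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


# The sample rows from the table above, without the Comment column.
times = ["22:45:02.585", "22:45:05.593", "22:45:08.923", "22:45:13.788",
         "22:45:16.523", "22:45:20.393", "22:45:21.247", "22:45:24.924",
         "22:45:27.443", "22:45:30.561", "22:45:32.561", "22:45:33.935",
         "22:45:36.083"]
sample = [("34.195.131", parse("2020-07-03 " + t), "homepage", "5678") for t in times]
sample.append(("34.195.132", parse("2020-07-03 22:45:38.849"), "homepage", "5678"))
sample.append(("34.195.132", parse("2020-07-03 22:45:38.949"), "homepage", "1234"))

flags = [keep for _, keep in flag_duplicates(sample)]
for (ip, ts, advertiser, order_number), keep in flag_duplicates(sample):
    print(ip, ts.time(), order_number, "KEEP" if keep else "REMOVE")
```

On the sample data the flags line up with the KEEP/REMOVE column of the table, which makes the sketch usable as a reference when checking the output of a SQL solution.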
 

I have tried using CTEs and self joins, but without any success so far. Can anyone tell me how to do this, or point me to how to proceed?

【Comments】:

    Tags: google-bigquery


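    【Solution 1】:

    I am not entirely sure about the requirement. If you could add a description of what B2, B3, etc. refer to, the comments would make the required logic easier to decode. Anyway, based on my understanding, I implemented the following logic:

    Create a dummy table: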

    WITH 
    data as
    
    (
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:02.585 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:05.593 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:08.923 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:13.788 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:16.523 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:20.393 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:21.247 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:24.924 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:27.443 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:30.561 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:32.561 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:33.935 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.131' ip_address,'2020-07-03 22:45:36.083 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.132' ip_address,'2020-07-03 22:45:38.849 UTC' datetime,'homepage' advertiser,'5678' order_number
    UNION ALL
    SELECT '34.195.132' ip_address,'2020-07-03 22:45:38.949 UTC' datetime,'homepage' advertiser,'1234' order_number
    ),
    
    data_corrected
    as
    (
    SELECT ip_address,CAST(datetime As Timestamp) datetime,advertiser,order_number
    From data
    )
    

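    Now the logic: I use the LAG and LEAD window functions to get the preceding and following values, partitioning the records by ip_address, advertiser, order_number and ordering them by datetime, and then compute the time delta.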

    SELECT d.*,
        -- Next click in the same ip_address / advertiser / order_number group
        LEAD(datetime)
        OVER (PARTITION BY d.ip_address,advertiser,order_number ORDER BY datetime ASC) AS followed_by_click,
        -- Compare in milliseconds: SECOND granularity only counts whole seconds,
        -- so gaps close to the 4-second threshold can be misclassified
        CASE WHEN TIMESTAMP_DIFF(LEAD(datetime)
        OVER (PARTITION BY d.ip_address,advertiser,order_number ORDER BY datetime ASC),d.datetime  , MILLISECOND)<=4000 THEN 'Duplicate' ELSE 'Keep' END delta_followed_by_click,
        -- Previous click in the same group
        LAG(datetime)
        OVER (PARTITION BY d.ip_address,advertiser,order_number ORDER BY datetime ASC) AS preceding_click,
        CASE WHEN TIMESTAMP_DIFF(d.datetime  , LAG(datetime)
        OVER (PARTITION BY d.ip_address,advertiser,order_number ORDER BY datetime ASC), MILLISECOND)<=4000 THEN 'Duplicate' ELSE 'Keep' END delta_preceding_click
        FROM data_corrected d
    ORDER BY d.datetime desc
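
    A comparison with the immediately preceding row, as LAG does, is not quite the same as the chained rule in the question. The small Python sketch below mimics a LAG-style gap on the first four sample timestamps. The names `gaps_to_previous` and `clicks` are hypothetical, and a gap of exactly 4 seconds is assumed to count as a duplicate.

```python
from datetime import datetime


def gaps_to_previous(timestamps):
    """Seconds between each timestamp and the one before it (None for the first)."""
    ordered = sorted(timestamps)
    return [None] + [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


# First four sample rows for ip_address 34.195.131, order_number 5678.
clicks = [
    datetime(2020, 7, 3, 22, 45, 2, 585000),
    datetime(2020, 7, 3, 22, 45, 5, 593000),
    datetime(2020, 7, 3, 22, 45, 8, 923000),
    datetime(2020, 7, 3, 22, 45, 13, 788000),
]

gaps = gaps_to_previous(clicks)
# Flag a row as a duplicate when it is within 4 seconds of its predecessor,
# whether or not that predecessor was itself flagged.
neighbour_flags = ["Keep" if g is None or g > 4 else "Duplicate" for g in gaps]
print(list(zip(gaps, neighbour_flags)))
```

    Here the third click (22:45:08.923) is flagged as a duplicate because it is about 3.3 seconds after the second one, whereas the question wants it kept since the second click was itself removed. The flags from the query may therefore need a further pass that compares each row with the last kept row rather than with its direct neighbour.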
    

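    Hope this helps you get to the result.

    【Discussion】:

    • A, B, C are Excel columns and 1, 2, 3 are Excel rows. The comments are not part of the data; they only illustrate what to keep and what to remove. Column B is the datetime.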