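【Question title】: How to join by closest timestamp in BigQuery?
【Posted】: 2019-12-09 05:32:41
【Question】:

Suppose I have a table of class start times and a table of students with their desired start times. I want to join the two tables by matching each Student.DesiredStartTime to the closest Class.StartTime (see the example below). How would you do this? I have seen this question asked and answered often, but only for other databases (not BigQuery). Since BigQuery has some unique properties, I was wondering whether it has any special functionality that helps with this. Thanks!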

Class
+-----------------------------------+------------+
|               StartTime           |    Class   |
+-----------------------------------+------------+
| 07/01/19 08:00                    | English    |
| 07/01/19 09:00                    | Chemistry  |
| 07/01/19 10:30                    | Math       |
+-----------------------------------+------------+

Student
+-----------------------------------+------------+
|               DesiredStartTime    |    Student |
+-----------------------------------+------------+
| 07/01/19 08:45                    | Jimmy      |
| 07/01/19 09:15                    | Bobby      |
| 07/01/19 10:00                    | Buddy      |
+-----------------------------------+------------+

[Query Results]
+-----------------------------------+------------+------------+
|               StartTime           |    Class   |  Student   |
+-----------------------------------+------------+------------+
| 07/01/19 09:00                    | Chemistry  | Jimmy      |
| 07/01/19 09:00                    | Chemistry  | Bobby      |
| 07/01/19 10:30                    | Math       | Buddy      |
+-----------------------------------+------------+------------+

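【Comments】:

    Tags: sql google-bigquery


    【Solution 1】:

    Unlike in many other databases, this is a good time to use a cross join in BQ. The following query computes the absolute difference (in minutes) between each student's desired start time and every class start time, ranks the differences per student, and then picks the closest class.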

    with joined as (
      select 
        Student, 
        Class,
        StartTime,
        DesiredStartTime, 
        ABS(TIMESTAMP_DIFF(StartTime,DesiredStartTime, MINUTE)) as abs_difference_mins
      from <dataset>.Class
      cross join <dataset>.Student
    ),
    ranked as (
      select
        StartTime,
        Class,
        Student,
        row_number() over(partition by Student order by abs_difference_mins asc) as ranked_by_mins_diff
      from joined
    )
    select * except(ranked_by_mins_diff)
    from ranked
    where ranked_by_mins_diff = 1
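
To check the logic outside BigQuery, here is a minimal Python sketch of the same cross-join-and-rank idea, using the sample rows from the question. The function name `closest_class` and the tie-break on the earlier start time are assumptions of this sketch, not part of the query above (ROW_NUMBER picks arbitrarily between equal differences).

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M"  # layout of the timestamps shown in the question

classes = [
    ("07/01/19 08:00", "English"),
    ("07/01/19 09:00", "Chemistry"),
    ("07/01/19 10:30", "Math"),
]
students = [
    ("07/01/19 08:45", "Jimmy"),
    ("07/01/19 09:15", "Bobby"),
    ("07/01/19 10:00", "Buddy"),
]


def closest_class(desired, class_rows):
    """Return the (start_time, class_name) row nearest to `desired`.

    Mirrors the cross join: every class is compared with the student.
    Ties go to the earlier class (an assumption of this sketch).
    """
    desired_ts = datetime.strptime(desired, FMT)

    def rank_key(row):
        start_ts = datetime.strptime(row[0], FMT)
        return (abs((start_ts - desired_ts).total_seconds()), start_ts)

    return min(class_rows, key=rank_key)


# One output row per student: (StartTime, Class, Student)
result = [closest_class(desired, classes) + (name,) for desired, name in students]
for start_time, class_name, student in result:
    print(start_time, class_name, student)
```

Like the query, this compares every student with every class, so the work grows with the product of the two table sizes.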
    

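    【Comments】:

    • I prefer Mikhail Berlyant's answer, which avoids the cross join, because it is more performant and scalable.
    • Fair enough; it obviously depends on your data. BigQuery can handle a lot (including cross joins) very efficiently, so I'm not sure this should be your main concern. However, I find my code more readable and easier to understand. If you are writing code that will be shared with or inherited by colleagues, my solution is easy to follow (even without comments) and uses basic SQL principles. As compute power grows, I find myself worrying less about efficiency and more about coming up with simple solutions.
    【Solution 2】:

    Below is for BigQuery Standard SQL and is somewhat unorthodox compared with the first (very good) answer, which uses CROSS JOIN. A cross join is most likely fine for a student-scale use case, but it can be a killer in more generic cases involving really big data. The query below therefore uses UNION ALL, which works with N+M intermediate rows instead of NxM.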

    #standardSQL
    SELECT * FROM (
      SELECT IF(
        ts - LAST_VALUE(ts IGNORE NULLS) OVER(prev_win) < FIRST_VALUE(ts IGNORE NULLS) OVER(next_win) - ts, 
        LAST_VALUE(StartTime IGNORE NULLS) OVER(prev_win), FIRST_VALUE(StartTime IGNORE NULLS) OVER(next_win)
        ) StartTime, IF(
        ts - LAST_VALUE(ts IGNORE NULLS) OVER(prev_win) < FIRST_VALUE(ts IGNORE NULLS) OVER(next_win) - ts, 
        LAST_VALUE(Class IGNORE NULLS) OVER(prev_win), FIRST_VALUE(Class IGNORE NULLS) OVER(next_win)
        ) Class, Student 
      FROM (
        SELECT StartTime, UNIX_SECONDS(StartTime) ts, Class, '' Student FROM `project.dataset.class` 
        UNION ALL
        SELECT DesiredStartTime, UNIX_SECONDS(DesiredStartTime), NULL, Student FROM `project.dataset.student` 
      )
      WINDOW 
        prev_win AS (ORDER BY StartTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
        next_win AS (ORDER BY StartTime ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
    )
    WHERE Student != ''
    

    You can test and play with the above using the dummy data from your question:

    #standardSQL
    WITH `project.dataset.class` AS (
      SELECT TIMESTAMP '2019-07-01 08:00:00' StartTime, 'English' Class UNION ALL
      SELECT '2019-07-01 09:00:00', 'Chemistry' UNION ALL
      SELECT '2019-07-01 10:30:00', 'Math' 
    ), `project.dataset.student` AS (
      SELECT TIMESTAMP '2019-07-01 08:45:00' DesiredStartTime, 'Jimmy' Student UNION ALL
      SELECT '2019-07-01 09:15:00', 'Bobby' UNION ALL
      SELECT '2019-07-01 10:00:00', 'Buddy' 
    )
    SELECT * FROM (
      SELECT IF(
        ts - LAST_VALUE(ts IGNORE NULLS) OVER(prev_win) < FIRST_VALUE(ts IGNORE NULLS) OVER(next_win) - ts, 
        LAST_VALUE(StartTime IGNORE NULLS) OVER(prev_win), FIRST_VALUE(StartTime IGNORE NULLS) OVER(next_win)
        ) StartTime, IF(
        ts - LAST_VALUE(ts IGNORE NULLS) OVER(prev_win) < FIRST_VALUE(ts IGNORE NULLS) OVER(next_win) - ts, 
        LAST_VALUE(Class IGNORE NULLS) OVER(prev_win), FIRST_VALUE(Class IGNORE NULLS) OVER(next_win)
        ) Class, Student 
      FROM (
        SELECT StartTime, UNIX_SECONDS(StartTime) ts, Class, '' Student FROM `project.dataset.class` 
        UNION ALL
        SELECT DesiredStartTime, UNIX_SECONDS(DesiredStartTime), NULL, Student FROM `project.dataset.student` 
      )
      WINDOW 
        prev_win AS (ORDER BY StartTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
        next_win AS (ORDER BY StartTime ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
    )
    WHERE Student != ''   
    

    The result is:

    Row StartTime                   Class       Student  
    1   2019-07-01 09:00:00 UTC     Chemistry   Jimmy    
    2   2019-07-01 09:00:00 UTC     Chemistry   Bobby    
    3   2019-07-01 10:30:00 UTC     Math        Buddy     
    

    If StartTime and DesiredStartTime are strings, as in the example in your question, you obviously need to parse them into TIMESTAMP first, as in the example below:

    #standardSQL
    WITH `project.dataset.class` AS (
      SELECT '07/01/19 08:00' StartTime, 'English' Class UNION ALL
      SELECT '07/01/19 09:00', 'Chemistry' UNION ALL
      SELECT '07/01/19 10:30', 'Math' 
    ), `project.dataset.student` AS (
      SELECT '07/01/19 08:45' DesiredStartTime, 'Jimmy' Student UNION ALL
      SELECT '07/01/19 09:15', 'Bobby' UNION ALL
      SELECT '07/01/19 10:00', 'Buddy' 
    )
    SELECT * FROM (
      SELECT IF(
        ts - LAST_VALUE(ts IGNORE NULLS) OVER(prev_win) < FIRST_VALUE(ts IGNORE NULLS) OVER(next_win) - ts, 
        LAST_VALUE(StartTime IGNORE NULLS) OVER(prev_win), FIRST_VALUE(StartTime IGNORE NULLS) OVER(next_win)
        ) StartTime, IF(
        ts - LAST_VALUE(ts IGNORE NULLS) OVER(prev_win) < FIRST_VALUE(ts IGNORE NULLS) OVER(next_win) - ts, 
        LAST_VALUE(Class IGNORE NULLS) OVER(prev_win), FIRST_VALUE(Class IGNORE NULLS) OVER(next_win)
        ) Class, Student 
      FROM (
        SELECT PARSE_TIMESTAMP('%D %R', StartTime) StartTime, UNIX_SECONDS(PARSE_TIMESTAMP('%D %R', StartTime)) ts, Class, '' Student FROM `project.dataset.class` 
        UNION ALL
        SELECT PARSE_TIMESTAMP('%D %R', DesiredStartTime), UNIX_SECONDS(PARSE_TIMESTAMP('%D %R', DesiredStartTime)), NULL, Student FROM `project.dataset.student` 
      )
      WINDOW 
        prev_win AS (ORDER BY StartTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
        next_win AS (ORDER BY StartTime ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
    )
    WHERE Student != ''
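
The window functions above look, for each student row, at the nearest row before it and the nearest row after it in time order. A rough Python equivalent of that idea is sketched below, assuming the classes are sorted once and then searched with `bisect`; the helper name `nearest_by_bisect` is made up for this sketch. It also spells out the `'%D %R'` layout as an explicit month/day/year hour:minute format. Note that the sketch only ever considers class rows as neighbours.

```python
from bisect import bisect_left
from datetime import datetime

FMT = "%m/%d/%y %H:%M"  # same layout as BigQuery's '%D %R'


def nearest_by_bisect(class_rows, desired):
    """Pick the class whose start time is closest to `desired`.

    `class_rows` must be sorted by start time. Only the class just
    before and the class at or just after the student are inspected.
    Returns None when there are no classes at all.
    """
    starts = [start for start, _ in class_rows]
    pos = bisect_left(starts, desired)
    candidates = class_rows[max(pos - 1, 0):pos + 1]
    if not candidates:
        return None
    return min(candidates, key=lambda row: abs(row[0] - desired))


classes = sorted(
    (datetime.strptime(text, FMT), name)
    for text, name in [
        ("07/01/19 08:00", "English"),
        ("07/01/19 09:00", "Chemistry"),
        ("07/01/19 10:30", "Math"),
    ]
)
students = [
    ("07/01/19 08:45", "Jimmy"),
    ("07/01/19 09:15", "Bobby"),
    ("07/01/19 11:00", "Buddy"),  # later than every class
]
matches = {
    name: nearest_by_bisect(classes, datetime.strptime(text, FMT))[1]
    for text, name in students
}
print(matches)
```

Two behaviours differ from the query as posted: a student later than every class still gets the last class here, and on an exact tie this sketch returns the earlier class, whereas the query's strict `<` falls through to the later one.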
    

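    【Comments】:

    • This solution returns NULL for any student whose DesiredStartTime is later than the largest StartTime in Class. Try changing Buddy's DesiredStartTime to 11:00. Is there a fix?
    • I will check this.
    • The context of this post/question is such that if DesiredStartTime is later than the start time of every class, that student is out of luck, so nothing is matched for them. If you still want such students in the final output, you can join the current result back to the original list of students. It is a very simple adjustment.
    • Thanks for the reply! I think the goal of the question is to find, for each record in the student table, the record in the class table with the closest timestamp. It does not specify any time limit. I am posting my own answer based on yours, fixing two issues I found. By the way, this answer is great; thanks for the inspiration!
    【Solution 3】:

    This should do the trick for you, with a slight "BQ" flavor. :-)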

    SELECT Student, item.StartTime, item.Class FROM (
        SELECT s.Student as Student, 
             ARRAY_AGG(
               STRUCT(
                 c.StartTime as StartTime,
                 c.Class AS Class, 
                 ABS(UNIX_SECONDS(s.DesiredStartTime) - UNIX_SECONDS(c.StartTime)) AS Delta 
               ) 
               ORDER BY ABS(UNIX_SECONDS(s.DesiredStartTime) - UNIX_SECONDS(c.StartTime))
          )[SAFE_OFFSET(0)] AS item
        FROM student s
        LEFT JOIN class c ON 1 = 1
        GROUP BY 1
    )
    

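    【Comments】:

      【Solution 4】:

      Modified from Mikhail Berlyant's answer. I think that answer has two issues:

      1. The StartTime in the final result should come only from the class table, so I added a filter inside every LAST_VALUE and FIRST_VALUE expression.

      2. The original solution returns NULL for every student whose DesiredStartTime is later than the largest StartTime in Class. Note that I changed Buddy to 11:00.

        The reason is that both the last preceding record, LAST_VALUE(...), and the first following record, FIRST_VALUE(...), can be NULL. For student records that sort last, FIRST_VALUE(...) is NULL, which makes the comparison return NULL. IF treats a NULL condition as False and returns its else branch. In these cases the condition should be True, because we need to return LAST_VALUE(...).

        To fix this, I convert those possibly NULL differences to infinity, so the comparison always selects the side that has a value.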

      WITH
        `project.dataset.class` AS (
            SELECT '07/01/19 08:00' StartTime, 'English' Class UNION ALL
            SELECT '07/01/19 09:00', 'Chemistry' UNION ALL
            SELECT '07/01/19 10:30', 'Math' ),
        `project.dataset.student` AS (
            SELECT '07/01/19 6:45' DesiredStartTime, 'Jimmy' Student UNION ALL
            SELECT '07/01/19 09:29', 'Bobby' UNION ALL
            SELECT '07/01/19 11:00', 'Buddy' )
      SELECT * FROM (
        SELECT
          IF(
            IFNULL(ts - LAST_VALUE(IF(Student = '', ts, NULL) IGNORE NULLS) OVER(prev_win), CAST('inf' AS FLOAT64)) <
            IFNULL(FIRST_VALUE(IF(Student = '', ts, NULL) IGNORE NULLS) OVER(next_win) - ts, CAST('inf' AS FLOAT64)),
            LAST_VALUE(IF(Student = '', StartTime, NULL) IGNORE NULLS) OVER(prev_win),
            FIRST_VALUE(IF(Student = '', StartTime, NULL) IGNORE NULLS) OVER(next_win)
          ) StartTime,
          IF(
            IFNULL(ts - LAST_VALUE(IF(Student = '', ts, NULL) IGNORE NULLS) OVER(prev_win), CAST('inf' AS FLOAT64)) <
            IFNULL(FIRST_VALUE(IF(Student = '', ts, NULL) IGNORE NULLS) OVER(next_win) - ts, CAST('inf' AS FLOAT64)),
            LAST_VALUE(IF(Student = '', Class, NULL) IGNORE NULLS) OVER(prev_win),
            FIRST_VALUE(IF(Student = '', Class, NULL) IGNORE NULLS) OVER(next_win)
          ) Class,
          Student
        FROM (
          SELECT PARSE_TIMESTAMP('%D %R', StartTime) StartTime, UNIX_SECONDS(PARSE_TIMESTAMP('%D %R', StartTime)) ts, Class, '' Student FROM `project.dataset.class` 
          UNION ALL
          SELECT PARSE_TIMESTAMP('%D %R', DesiredStartTime), UNIX_SECONDS(PARSE_TIMESTAMP('%D %R', DesiredStartTime)), NULL, Student FROM `project.dataset.student`
          )
        WINDOW
          prev_win AS (ORDER BY StartTime ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
          next_win AS (ORDER BY StartTime ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING)
      )
      WHERE
        Student != ''
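
The NULL handling described in point 2 can be checked with a small Python model of SQL's three-valued logic. This is only a sketch: `sql_less`, `sql_if`, and `ifnull` are stand-ins written for this example that imitate the SQL operators, with `None` playing the role of NULL.

```python
import math


def sql_less(a, b):
    """SQL-style `a < b`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a < b


def sql_if(cond, true_result, else_result):
    """SQL-style IF: the else branch is taken for both FALSE and NULL."""
    return true_result if cond is True else else_result


def ifnull(value, default):
    """SQL-style IFNULL: replace NULL with `default`."""
    return default if value is None else value


# Buddy at 11:00: the previous class (Math, 10:30) is 1800 seconds away
# and there is no following class, so the "next" side is NULL.
prev_diff, next_diff = 1800, None
prev_class, next_class = "Math", None

# Without the fix the comparison is NULL, so IF returns the (NULL) next class.
without_fix = sql_if(sql_less(prev_diff, next_diff), prev_class, next_class)

# With the fix the missing side becomes infinity and the previous class wins.
with_fix = sql_if(
    sql_less(ifnull(prev_diff, math.inf), ifnull(next_diff, math.inf)),
    prev_class,
    next_class,
)
print(without_fix, with_fix)  # None Math
```

The mirror case, a student earlier than every class, already works without the fix: the NULL comparison falls through to the else branch, which is the following class.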
      

      【Comments】:
