PySpark join with functions and difference between timestamps

【Question title】: PySpark join with functions and difference between timestamps
【Posted】: 2020-04-13 16:16:49
【Question】:

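I'm trying to join 2 tables of user events. I want to join table_a with table_b by user_id (id), and only when the difference between the timestamps is less than 5 s (5000 ms).

Here is what I'm doing: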

table_a = (
  table_a
  .join(
  table_b,
    table_a.uid == table_b.uid 
     & abs(table_b.b_timestamp - table_a.a_timestamp) < 5000 
     & table_a.a_timestamp.isNotNull()
  ,
  how = 'left'
  )
)

I get 2 errors:

Error 1) ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Error 2) If I remove the second condition of the join and keep only the first and third: org.apache.spark.sql.AnalysisException: cannot resolve '(`uid` AND (`a_timestamp` IS NOT NULL))' due to data type mismatch: differing types in '(`uid` AND (`a_timestamp` IS NOT NULL))' (string and boolean).;;

Any help is much appreciated!

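【Comments】:

Your conditions need parentheses, e.g. ((condition1) & (condition2) & ...). This is a common gotcha in PySpark. For the second error, consider first filling the NA values with 0 and, in the second condition, casting both timestamp values to the same type (double, decimal, integer, or whatever). You can then use "not equal to zero" in the last condition.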

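Tags: apache-spark join pyspark databricks pyspark-dataframes

【Solution 1】:

You just need parentheses around each filter condition. Python's & binds more tightly than == and <, so without them the expression is not grouped the way it reads. For example, the following works: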

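# Spark's column-wise abs (shadows the Python builtin); `spark` is an existing SparkSession
from pyspark.sql.functions import abs

df1 = spark.createDataFrame([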
    (1, 20),
    (1, 21),
    (1, 25),
    (1, 30),
    (2, 21),
], ['id', 'val'])

df2 = spark.createDataFrame([
    (1, 21),
    (2, 30),
], ['id', 'val'])

df1.join(
    df2, 
    (df1.id == df2.id) 
    & (abs(df1.val - df2.val) < 5)
).show()
# +---+---+---+---+
# | id|val| id|val|
# +---+---+---+---+
# |  1| 20|  1| 21|
# |  1| 21|  1| 21|
# |  1| 25|  1| 21|
# +---+---+---+---+

But without the parentheses:

df1.join(
    df2, 
    df1.id == df2.id
    & abs(df1.val - df2.val) < 5
).show()
# ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
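
Why the unparenthesized version fails can be checked without Spark at all. The sketch below uses only the standard library's ast module to show how Python groups such an expression; the names a_id, b_id and diff are hypothetical stand-ins for df1.id, df2.id and the abs(...) term.

```python
import ast

# The unparenthesized condition, with plain names standing in for the columns.
expr = "a_id == b_id & diff < 5"

tree = ast.parse(expr, mode="eval").body

# The top-level node is a chained comparison, not a bitwise AND.
node_type = type(tree).__name__
op_names = [type(op).__name__ for op in tree.ops]

# The & is nested inside the comparison and only sees b_id and diff.
inner = ast.unparse(tree.comparators[0])

print(node_type)  # Compare
print(op_names)   # ['Eq', 'Lt']
print(inner)      # b_id & diff
```

So Python reads the condition as a_id == (b_id & diff) < 5, a chained comparison that is evaluated like (a_id == X) and (X < 5). The implicit "and" asks the first Column for a truth value, which is what raises the ValueError. With only two terms, as in the question's second attempt, there is no chaining; Spark instead receives uid AND (a_timestamp IS NOT NULL), which is the string-versus-boolean type mismatch reported in the AnalysisException.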

【Comments】:

That worked! I would surely have lost hours before finding this. Thanks a lot.