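Possible duplicate of "Spark: subtract two DataFrames" if both datasets have exactly the same columns.
If you want a custom join condition, you can use an "anti" join instead. Here is the PySpark version.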
Create two DataFrames:
DataFrame 1:
l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)]
df1 = spark.createDataFrame(l1).toDF('col1','col2')
df1.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row3|  30|
+---------+----+
DataFrame 2:
l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]
df2 = spark.createDataFrame(l2).toDF('col1','col2')
df2.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row1|  10|
|col1_row2|  20|
|col1_row4|  40|
+---------+----+
Using the subtract API:
df_final = df1.subtract(df2)
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+
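For intuition, `subtract` behaves like SQL `EXCEPT DISTINCT`: it compares whole rows and returns each surviving row only once. The sketch below models that behaviour on plain Python tuples. It only illustrates the semantics and is not how Spark executes the operation; `subtract_rows` is a hypothetical helper name, and the duplicated row in `l1` is added here purely to show the de-duplication.

```python
def subtract_rows(left, right):
    """Plain-Python model of DataFrame.subtract:
    distinct rows of `left` that do not appear in `right`."""
    right_set = set(right)
    seen = set()
    result = []
    for row in left:
        # Keep a row only if it is absent from `right` and not already emitted.
        if row not in right_set and row not in seen:
            seen.add(row)
            result.append(row)
    return result


# Same data as above, plus one duplicated row (an assumption for illustration).
l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30), ('col1_row3', 30)]
l2 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row4', 40)]

print(subtract_rows(l1, l2))  # [('col1_row3', 30)]
```

The model keeps first-seen order for readability; Spark makes no ordering guarantee for the result.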
Using left_anti:
Join condition:
join_condition = [df1["col1"] == df2["col1"], df1["col2"] == df2["col2"]]
Finally, the join:
df_final = df1.join(df2, join_condition, 'left_anti')
df_final.show()
+---------+----+
|     col1|col2|
+---------+----+
|col1_row3|  30|
+---------+----+
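To see why a custom condition matters, here is a minimal plain-Python sketch of left-anti semantics: a left row is kept when no right row satisfies the condition. It is a model on tuples, not Spark code; `left_anti` and `l2_changed` are hypothetical names introduced for illustration, and the changed value 99 is made up to show the difference between matching on every column and matching on `col1` only.

```python
def left_anti(left, right, condition):
    """Plain-Python model of a left anti join: keep each row of `left`
    for which no row of `right` satisfies `condition(left_row, right_row)`."""
    return [l for l in left if not any(condition(l, r) for r in right)]


l1 = [('col1_row1', 10), ('col1_row2', 20), ('col1_row3', 30)]
# Hypothetical variant of the second data set: col2 differs for col1_row1.
l2_changed = [('col1_row1', 99), ('col1_row2', 20), ('col1_row4', 40)]

# Condition on both columns, like join_condition above.
by_row = left_anti(l1, l2_changed, lambda a, b: a[0] == b[0] and a[1] == b[1])

# Custom condition on col1 only.
by_key = left_anti(l1, l2_changed, lambda a, b: a[0] == b[0])

print(by_row)  # [('col1_row1', 10), ('col1_row3', 30)]
print(by_key)  # [('col1_row3', 30)]
```

In PySpark, the key-only variant would correspond to passing `df1["col1"] == df2["col1"]` as the join condition with `'left_anti'`. Unlike `subtract`, a left anti join does not de-duplicate: repeated rows in the left DataFrame that have no match are all kept.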