【Title】: Comparing two Identically Structured Dataframes in Spark
【Posted】: 2020-02-26 21:42:46
【Description】:
val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")

val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")

Both dataframes above have the same table structure, and I want to find the ids whose values have changed in the second dataframe (changedDF). I tried Spark's except() function, but it gives me two rows. Id is the common column between the two dataframes.

changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id|  name|  city|creditscore|credit_limit|
+---+------+------+-----------+------------+
|  4|Joshua|cochin|        612|       85000|
|  2| sunil| noida|        650|       90000|
+---+------+------+-----------+------------+

Whereas I only want the common ids that have any change, like this ->

+---+------+------+-----------+------------+
| id|  name|  city|creditscore|credit_limit|
+---+------+------+-----------+------------+
|  2| sunil| noida|        650|       90000|
+---+------+------+-----------+------------+

Is there a way to find only the common ids whose data has changed? Can anyone suggest an approach I could follow to achieve this?

【Question discussion】:

    Tags: scala sqlite apache-spark apache-spark-sql


    【Solution 1】:

    You can do an inner join between the two dataframes, which restricts the result to the common ids, and then use except to keep only the rows that changed.

    import org.apache.spark.sql.functions.col

    originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
      .select("a.*")       // keep originalDF's columns for the common ids
      .except(changedDF)   // drop rows that also appear unchanged in changedDF
      .show
    

    Then you get your expected result:

    +---+-----+-----+------------+------------+
    | id| name| city|credit_score|credit_limit|
    +---+-----+-----+------------+------------+
    |  2|sunil|noida|         600|       80000|
    +---+-----+-----+------------+------------+
    
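    Note that this shows the *original* values for the changed id. If, as in the expected output in the question, you want the *new* values from changedDF instead, a minimal variation (untested sketch, same idea) is to keep changedDF's side of the join and subtract originalDF:

    import org.apache.spark.sql.functions.col

    // Keep changedDF's columns for the common ids, then remove rows
    // that are identical to originalDF (except compares by column position).
    changedDF.alias("b").join(originalDF.alias("a"), col("a.id") === col("b.id"), "inner")
      .select("b.*")
      .except(originalDF)
      .show

    For the sample data this should leave only id 2 with its updated creditscore and credit_limit.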

    【Discussion】:
