【Posted】: 2020-02-26 21:42:46
【Question】:
val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","creditscore","credit_limit")
val changedDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")
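The two DataFrames above have the same schema, and I want to find the ids whose values have changed in the second DataFrame (changedDF). I tried Spark's except() function, but it gives me two rows. id is the common column between the two DataFrames.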
changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id|  name|  city|creditscore|credit_limit|
+---+------+------+-----------+------------+
|  4|Joshua|cochin|        612|       85000|
|  2| sunil| noida|        650|       90000|
+---+------+------+-----------+------------+
What I want instead is only the common ids that have any change, like this:
+---+------+------+-----------+------------+
| id|  name|  city|creditscore|credit_limit|
+---+------+------+-----------+------------+
|  2| sunil| noida|        650|       90000|
+---+------+------+-----------+------------+
Is there a way to get only the common ids whose data has changed? Can anyone suggest an approach I could follow to achieve this?
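For illustration, here is a minimal sketch of one possible approach, not a definitive answer. It takes the except() result, which contains both new and modified rows, and keeps only the rows whose id also exists in originalDF by using a left-semi join. It assumes both DataFrames use identical column names (creditscore here) and that id uniquely identifies a row. The object name ChangedCommonIds and the local SparkSession settings are hypothetical and only make the example self-contained; in spark-shell the session and implicits already exist.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object so the sketch can run outside spark-shell.
object ChangedCommonIds {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("changed-common-ids")
      .getOrCreate()
    import spark.implicits._

    // Assumption: both DataFrames share exactly these column names.
    val cols = Seq("id", "name", "city", "creditscore", "credit_limit")

    val originalDF = Seq(
      (1, "gaurav", "jaipur", 550, 70000),
      (2, "sunil", "noida", 600, 80000),
      (3, "rishi", "ahmedabad", 510, 65000)
    ).toDF(cols: _*)

    val changedDF = Seq(
      (1, "gaurav", "jaipur", 550, 70000),
      (2, "sunil", "noida", 650, 90000),
      (4, "Joshua", "cochin", 612, 85000)
    ).toDF(cols: _*)

    // Rows of changedDF with no identical row in originalDF:
    // this includes brand-new ids (4) as well as modified ones (2).
    val newOrModified = changedDF.except(originalDF)

    // Left-semi join keeps only rows whose id also appears in originalDF,
    // without adding any columns from the right side.
    val changedCommon =
      newOrModified.join(originalDF.select("id"), Seq("id"), "left_semi")

    changedCommon.show()

    spark.stop()
  }
}
```

With the sample data above, only the row for id 2 should remain, because id 4 has no counterpart in originalDF and is dropped by the semi join.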
【Comments】:
Tags: scala sqlite apache-spark apache-spark-sql