【问题标题】:How to combine rows in two data frames given a certain condition in Scala?给定Scala中的特定条件,如何组合两个数据框中的行?
【发布时间】:2025-11-25 22:55:01
【问题描述】:

我有这两个数据框叫:

表1

+---------+------------+------+
|  Animal |    Owner   |count1|
+---------+------------+------+
|    Cat  |     Bob    |   3  |
|    Fish |    Jerry   |   2  |
|    Dog  |     Bob    |   2  |
|  Turtle |     Joe    |   5  |
+---------+------------+------+

表2

+---------+------------+------+
|  Animal |    Owner   |count2|
+---------+------------+------+
|    Cat  |     Bob    |   2  |
|    Fish |     Jerry  |   1  |
|    Dog  |     Bob    |   3  |
|   Snake |     Kim    |   6  |
+---------+------------+------+

我正在尝试以某种方式组合这两个数据框,以便下面的新数据框包含行

  • 出现在“table1”或“table2”中
  • 在两个表中找到的不同行包含的计数值在“table2”中大于在“table1”中

这是我想要生成的预期输出数据帧。

+---------+------------+------+------+
|  Animal |   Owner    |count1|count2|
+---------+------------+------+------+
|   Dog   |     Bob    |   2  |   3  |
|  Turtle |     Joe    |   5  | null |
|   Snake |     Kim    | null |   6  |
+---------+------------+------+------+

出现在“table1”中而不出现在“table2”中(或出现在“table2”中而不出现在“table1”中)的行的计数值可以为“null”。

【问题讨论】:

    标签: scala apache-spark apache-spark-sql


    【解决方案1】:

    在 Spark 中尝试完全加入 filter 条件

    scala> var t1 = Seq(("Cat","Bob",3),  ("Fish" ,"Jerry" ,2),  ("Dog"  ,    "Bob",2),  ("Turtle" ,"Joe",5)).toDF("Animal","Owner","count1")
    
    scala> var t2 = Seq(("Cat", "Bob",2),("Fish","Jerry",1),("Dog" ,"Bob",3),("Snake","Kim",6)).toDF("Animal","Owner","count2")
    

    在数据帧 t1(table1) 和 t2(table2) 中应用 full join 并保留表中两个计数列的空行。

    scala> t2.join(t1,Seq("Animal","Owner"),"full").filter(col("count2")>col("count1") || col("count2").isNull || col("count1").isNull).show
    +------+-----+------+------+
    |Animal|Owner|count2|count1|
    +------+-----+------+------+
    |   Dog|  Bob|     3|     2|
    | Snake|  Kim|     6|  null|
    |Turtle|  Joe|  null|     5|
    +------+-----+------+------+
    

    【讨论】: