Drop duplicates except null in Spark: answers

【Question title】: Drop duplicates except null in Spark
【Posted】: 2021-01-05 20:16:03
【Question description】:

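I have seen that pandas has a way to drop duplicates while ignoring nulls (see "Drop duplicates, but ignore nulls"). Is there a way in Spark to drop duplicates while ignoring nulls, that is, without dropping the rows that contain them?

For example: I want to drop the duplicate values of "animal".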

val columns=Array("id", "color", "animal")
val df1=sc.parallelize(Seq(
  (1, "Blue", null ), // don't drop this
  (4, "yellow", null ), // don't drop this
  (2, "Red", "Fish"),
  (5, "green", "panda"), // one of the two panda rows should be dropped
  (6, "red", "panda"), // one of the two panda rows should be dropped
  (7, "Blue", "koala")
)).toDF(columns: _*)


df1.show()

val dropped = df1.dropDuplicates("animal") 

dropped.show()

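I see that dropDuplicates also accepts other columns. I tried that approach, but it introduces another problem: the duplicated animals that are not null are then not dropped.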
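
To make the problem concrete, here is a minimal sketch, assuming a local SparkSession (the names `spark`, `rows`, `df` and `naive` are illustrative and not from the original post). dropDuplicates groups rows by the key column, and null values fall into the same group, so only one of the two null rows survives:

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("dedupe-nulls").getOrCreate()
import spark.implicits._

// Explicit element type so the null entries are typed as String.
val rows: Seq[(Int, String, String)] = Seq(
  (1, "Blue", null),
  (4, "yellow", null),
  (2, "Red", "Fish"),
  (5, "green", "panda"),
  (6, "red", "panda"),
  (7, "Blue", "koala")
)
val df = rows.toDF("id", "color", "animal")

// Plain dropDuplicates treats null as just another key value.
val naive = df.dropDuplicates("animal")
val nullsKept = naive.filter($"animal".isNull).count()
println(s"rows: ${naive.count()}, null rows kept: $nullsKept")
```

With the sample data this leaves four rows, one each for null, Fish, panda and koala, rather than the five rows wanted.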

【Question comments】:

Tags: scala dataframe apache-spark

【Solution 1】:

Using a window function:

The following approach is claimed to perform better than distinct/dropDuplicates, although no measurements are given.

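import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Number the rows within each animal group; null animals form one group.
val w = Window.partitionBy("animal").orderBy("animal")
df1.withColumn("rn", row_number().over(w))
  // Keep every null row, and only the first row of each non-null group.
  .where(col("animal").isNull || col("rn") === 1)
  .show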

+---+------+------+---+
| id| color|animal| rn|
+---+------+------+---+
|  5| green| panda|  1|
|  7|  Blue| koala|  1|
|  1|  Blue|  null|  1|
|  4|yellow|  null|  2|
|  2|   Red|  Fish|  1|
+---+------+------+---+
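
Because the window above orders by the same column it partitions by, which of the two panda rows gets rn = 1 is not guaranteed. A possible variation, sketched under the assumption that `id` is a suitable tie-breaker and that no column named `rn` already exists, wraps the idea in a helper and drops the temporary column afterwards. The function name `dropDuplicatesKeepNulls` is hypothetical and not part of Spark:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper (not a Spark API): keep every row where `key` is null,
// and for each non-null `key` keep the row with the smallest `tieBreak` value.
def dropDuplicatesKeepNulls(df: DataFrame, key: String, tieBreak: String): DataFrame = {
  val w = Window.partitionBy(col(key)).orderBy(col(tieBreak))
  df.withColumn("rn", row_number().over(w))
    .where(col(key).isNull || col("rn") === 1)
    .drop("rn")
}

// df1 is the DataFrame built in the question.
val deduped = dropDuplicatesKeepNulls(df1, "animal", "id")
deduped.show()
```

For the sample data this keeps ids 1, 4, 2, 5 and 7; the row order printed by show may differ.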

【Comments】:

【Solution 2】:

One approach is as follows (full code shown):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema2 = StructType(List(StructField("id", IntegerType, true), StructField("color",StringType, true), StructField("animal",StringType, true)))
val data = sc.parallelize(Seq(
        (1, "Blue", null ), // don't drop this
        (4, "yellow", null ), // don't drop this
        (2, "Red", "Fish"),
        (5, "green", "panda"), // one of the two panda rows should be dropped
        (6, "red", "panda"), // one of the two panda rows should be dropped
        (7, "Blue", "koala")
      )).map(t => Row(t._1,t._2,t._3))
val df2 = spark.createDataFrame(data, schema2)

df2.show()
/*
+---+------+------+
| id| color|animal|
+---+------+------+
|  1|  Blue|  null|
|  4|yellow|  null|
|  2|   Red|  Fish|
|  5| green| panda|
|  6|   red| panda|
|  7|  Blue| koala|
+---+------+------+
*/
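// dropping duplicates except nulls
val dropped2 = df2
    .filter(df2("animal").isNull) // keep every row whose animal is null
    // drop only rows with a null animal (not a null in any column), then deduplicate
    .union(df2.na.drop(Seq("animal")).dropDuplicates("animal"))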

dropped2.show()
/*
+---+------+------+
| id| color|animal|
+---+------+------+
|  1|  Blue|  null|
|  4|yellow|  null|
|  2|   Red|  Fish|
|  7|  Blue| koala|
|  5| green| panda|
+---+------+------+
*/

【Comments】: