【Question title】: Drop duplicates except null in Spark
【Posted】: 2021-01-05 20:16:03
【Question】:

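I saw that pandas has a way to drop duplicates while ignoring null values ("Drop duplicates, but ignore nulls"). Is there a way in Spark to drop duplicates while ignoring nulls, that is, without dropping the rows that contain them?

For example, I want to drop rows with a duplicated "animal":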

val columns=Array("id", "color", "animal")
val df1=sc.parallelize(Seq(
  (1, "Blue", null ), // don't drop this
  (4, "yellow", null ), // don't drop this
  (2, "Red", "Fish"),
  (5, "green", "panda"), // one of the two panda rows should be dropped
  (6, "red", "panda"), // one of the two panda rows should be dropped
  (7, "Blue", "koala")
)).toDF(columns: _*)


df1.show()

val dropped = df1.dropDuplicates("animal") 

dropped.show()

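I see that dropDuplicates accepts additional columns. I tried that approach, but it introduces another problem: duplicated animals that are not null are no longer dropped.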
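
To make the problem concrete, here is a minimal sketch, assuming a local SparkSession and the same six sample rows; the names `spark`, `animals`, `byAnimal` and `byAnimalAndId` are chosen for illustration only. `dropDuplicates("animal")` groups all nulls together, so one of the two null rows is lost, while adding `id` to the key keeps the nulls but no longer removes the extra panda row.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("dedup-problem").getOrCreate()
import spark.implicits._

// Same sample rows as in the question; the explicit element type lets `null` be a String.
val animals = Seq[(Int, String, String)](
  (1, "Blue", null),
  (4, "yellow", null),
  (2, "Red", "Fish"),
  (5, "green", "panda"),
  (6, "red", "panda"),
  (7, "Blue", "koala")
).toDF("id", "color", "animal")

// Nulls are treated as one group, so only one of the two null rows survives.
val byAnimal = animals.dropDuplicates("animal")

// Every (animal, id) pair is unique, so nothing is dropped, not even the extra panda.
val byAnimalAndId = animals.dropDuplicates("animal", "id")

byAnimal.show()
byAnimalAndId.show()
```

Neither call gives the desired five rows, which is why the null rows need to be handled separately from the non-null ones.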

【Discussion】:

    Tags: scala dataframe apache-spark


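    【Solution 1】:

    Use a window function:

    The following approach is claimed to perform better than the distinct/dropDuplicates methods, although no benchmark is given to support this.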

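    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Number rows per animal (ordered by id so the kept row is deterministic),
    // then keep the first row of each group plus every row with a null animal.
    df1.withColumn("rn", row_number().over(Window.partitionBy("animal").orderBy("id")))
      .where(col("rn") === 1 || col("animal").isNull)
      .show()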
    
    +---+------+------+---+
    | id| color|animal| rn|
    +---+------+------+---+
    |  5| green| panda|  1|
    |  7|  Blue| koala|  1|
    |  1|  Blue|  null|  1|
    |  4|yellow|  null|  2|
    |  2|   Red|  Fish|  1|
    +---+------+------+---+
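
The window approach above can be wrapped into a reusable function that also removes the helper column afterwards. This is only a sketch: `dropDuplicatesKeepNulls` is a hypothetical helper, not a Spark API, and it assumes the input has no existing column named `rn` and that a column such as `id` is available to decide which duplicate is kept.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical helper (not part of Spark): keeps the first row per non-null
// value of keyCol, ordered by orderCol, and keeps every row where keyCol is null.
// Assumes the input has no column called "rn".
def dropDuplicatesKeepNulls(df: DataFrame, keyCol: String, orderCol: String): DataFrame = {
  val w = Window.partitionBy(col(keyCol)).orderBy(col(orderCol))
  df.withColumn("rn", row_number().over(w))
    .where(col("rn") === 1 || col(keyCol).isNull)
    .drop("rn")
}

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("dedup-window").getOrCreate()
import spark.implicits._

val animals = Seq[(Int, String, String)](
  (1, "Blue", null),
  (4, "yellow", null),
  (2, "Red", "Fish"),
  (5, "green", "panda"),
  (6, "red", "panda"),
  (7, "Blue", "koala")
).toDF("id", "color", "animal")

val result = dropDuplicatesKeepNulls(animals, "animal", "id")
result.show()
```

Ordering the window by `id` rather than by the partition column itself makes the surviving row predictable: here the panda with the smaller `id` is the one that stays.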
    


      【Solution 2】:

      One approach is as follows (the full code is shown):

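      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

      val schema2 = StructType(List(
        StructField("id", IntegerType, true),
        StructField("color", StringType, true),
        StructField("animal", StringType, true)
      ))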
      val data = sc.parallelize(Seq(
              (1, "Blue", null ), // don't drop this
              (4, "yellow", null ), // don't drop this
              (2, "Red", "Fish"),
              (5, "green", "panda"), // one of the two panda rows should be dropped
              (6, "red", "panda"), // one of the two panda rows should be dropped
              (7, "Blue", "koala")
            )).map(t => Row(t._1,t._2,t._3))
      val df2 = spark.createDataFrame(data, schema2)
      
      df2.show()
      /*
      +---+------+------+
      | id| color|animal|
      +---+------+------+
      |  1|  Blue|  null|
      |  4|yellow|  null|
      |  2|   Red|  Fish|
      |  5| green| panda|
      |  6|   red| panda|
      |  7|  Blue| koala|
      +---+------+------+
      */
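      // Drop duplicates except nulls: keep every row whose animal is null as is,
      // and deduplicate only the rows whose animal is non-null.
      // na.drop is limited to the "animal" column so that nulls elsewhere do not remove rows.
      val dropped2 = df2
          .filter("animal IS NULL")
          .union(df2.na.drop(Seq("animal")).dropDuplicates("animal"))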
      
      dropped2.show()
      /*
      +---+------+------+
      | id| color|animal|
      +---+------+------+
      |  1|  Blue|  null|
      |  4|yellow|  null|
      |  2|   Red|  Fish|
      |  7|  Blue| koala|
      |  5| green| panda|
      +---+------+------+
      */
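
One detail of the union approach is worth checking on your own data: `na.drop("any")` without a column list removes a row if any of its columns is null, not only `animal`. The sketch below uses made-up sample rows (the null `color` for Fish is an assumption added for illustration) to compare that call with `na.drop(Seq("animal"))`, which only looks at the `animal` column.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder().master("local[*]").appName("dedup-union").getOrCreate()
import spark.implicits._

// Illustrative rows: Fish has a null color but a non-null animal.
val sample = Seq[(Int, String, String)](
  (1, "Blue", null),
  (2, null, "Fish"),
  (5, "green", "panda"),
  (6, "red", "panda")
).toDF("id", "color", "animal")

val nullAnimals = sample.filter("animal IS NULL")

// Drops any row containing a null in any column, so the Fish row is lost as well.
val anyColumn = nullAnimals.union(sample.na.drop("any").dropDuplicates("animal"))

// Considers only the animal column, so the Fish row is kept.
val animalOnly = nullAnimals.union(sample.na.drop(Seq("animal")).dropDuplicates("animal"))

anyColumn.show()
animalOnly.show()
```

With the question's original data the two variants give the same result, because `animal` is the only column that contains nulls there.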
      
