【Question title】: How to count the number of missing values in each row of a data frame - Spark Scala?
【Posted】: 2019-04-21 04:40:14
【Question】:

I want to count the number of missing values in each row of a data frame in Spark Scala.

Code:

val samplesqlDF = spark.sql("SELECT * FROM sampletable")

samplesqlDF.show()

Input data frame:

   -----------------------------------
   | name | age | degree | Place     |
   -----------------------------------
   | Ram  |     | MCA    | Bangalore |
   |      | 25  |        |           |
   |      | 26  | BE     |           |
   | Raju | 21  | Btech  | Chennai   |
   -----------------------------------

The output data frame (row-level count) should look like this:

   ----------------------------------------------
   | name | age | degree | Place     | rowcount |
   ----------------------------------------------
   | Ram  |     | MCA    | Bangalore | 1        |
   |      | 25  |        |           | 3        |
   |      | 26  | BE     |           | 2        |
   | Raju | 21  | Btech  | Chennai   | 0        |
   ----------------------------------------------

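I am a beginner in Scala and Spark. Thanks in advance.

【Comments on the question】:

  • Hi, welcome to StackOverflow. You can check this link - how to ask - to improve future questions. In particular, you should show some research effort and/or some code to demonstrate that you have already tried to solve the problem yourself.
  • Hi, how about taking a look at the solutions?

Tags: scala apache-spark apache-spark-sql spark-streaming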


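【Solution 1】:

It looks like you want to get the null count dynamically, without hard-coding the column names. Take a look at this: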

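import org.apache.spark.sql.functions.{col, when}
// Outside spark-shell, also import spark.implicits._ so that toDF is available.

val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","Place")
df.show(false)
// Add one 0/1 flag column per original column: 1 where the value is null.
val df2 = df.columns.foldLeft(df)( (acc,c) => acc.withColumn(c+"_null", when(col(c).isNull,1).otherwise(0) ) )
df2.createOrReplaceTempView("student")
// Build "name_null+age_null+... as null_count", then the full SELECT around it.
val sql_str_null = df.columns.map( x => x+"_null").mkString(" ","+"," as null_count ")
val sql_str_full = df.columns.mkString( "select ", ",", " , " + sql_str_null + " from student")
spark.sql(sql_str_full).show(false)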

Output:

+----+----+------+---------+----------+
|name|age |degree|Place    |null_count|
+----+----+------+---------+----------+
|Ram |null|MCA   |Bangalore|1         |
|null|25  |null  |null     |3         |
|null|26  |BE    |null     |2         |
|Raju|21  |Btech |Chennai  |0         |
+----+----+------+---------+----------+
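
To see what the two `mkString` calls actually hand to `spark.sql`, the query text can be built on its own, without a Spark session. This is only a sketch: it reuses the column names and the `student` view name from the answer above, and the camel-case variable names are illustrative.

```scala
// Column names copied from the answer's DataFrame; no Spark is needed to build the string.
val columns = Array("name", "age", "degree", "Place")

// " name_null+age_null+degree_null+Place_null as null_count "
val sqlStrNull = columns.map(_ + "_null").mkString(" ", "+", " as null_count ")

// Prefix "select ", join the original columns with ",", then append the sum and the FROM clause.
val sqlStrFull = columns.mkString("select ", ",", " , " + sqlStrNull + " from student")

println(sqlStrFull)
// select name,age,degree,Place ,  name_null+age_null+degree_null+Place_null as null_count  from student
```

The doubled spaces are harmless to the SQL parser; they come from the padding in both `mkString` calls.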


    【Solution 2】:

    It is also possible to check for "" as well as null, here without foldLeft, just to demonstrate the point:

    import org.apache.spark.sql.functions._
    
    val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,""),(null,"26","BE",null),("Raju","21","Btech","Chennai")).toDF("name","age","degree","place")
    
    // Count per row the null or "" columns! 
    val null_counter = Seq("name", "age", "degree", "place").map(x => when(col(x) === "" || col(x).isNull , 1).otherwise(0)).reduce(_ + _)  
    
    val df2 = df.withColumn("nulls_cnt", null_counter)
    
    df2.show(false)
    

    Returns:

     +----+----+------+---------+---------+
     |name|age |degree|place    |nulls_cnt|
     +----+----+------+---------+---------+
     |Ram |null|MCA   |Bangalore|1        |
     |null|25  |null  |         |3        |
     |null|26  |BE    |null     |2        |
     |Raju|21  |Btech |Chennai  |0        |
     +----+----+------+---------+---------+
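
The column list above is hard-coded. A possible variation, sketched here under the assumption that every column should be checked, takes the names from `df.columns` and also treats whitespace-only values as missing by casting to string and trimming. The `blank_cnt` column name, the `blankCounter` value, and the local `SparkSession` setup are illustrative choices, not part of the original answer; in spark-shell the provided `spark` session can be used and the builder line skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim, when}

// Local session for a standalone run (assumed setup, adjust master/appName as needed).
val spark = SparkSession.builder().master("local[*]").appName("blank-count-sketch").getOrCreate()
import spark.implicits._

val df = Seq[(String, String, String, String)](
  ("Ram", null, "MCA", "Bangalore"),
  (null, "25", null, ""),
  (null, "26", "BE", null),
  ("Raju", "21", "Btech", "Chennai")
).toDF("name", "age", "degree", "place")

// One 0/1 flag per column, summed into a single Column expression.
// The cast lets the same check run on non-string columns.
val blankCounter = df.columns
  .map(c => when(col(c).isNull || trim(col(c).cast("string")) === "", 1).otherwise(0))
  .reduce(_ + _)

val df2 = df.withColumn("blank_cnt", blankCounter)
df2.show(false)
```

Because the whole count is one expression, only a single `withColumn` call is added to the query plan, regardless of how many columns the DataFrame has.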
    

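    【Comments】:

    • You can also translate thanks into an upvote as a sign of appreciation; it also signals to others that this is a valid option...
    【Solution 3】:

    A simplified version of what @stack0114106 suggested is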

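    import org.apache.spark.sql.functions.{col, lit, when}

    // Start every row with a null_count of 0.
    val df = Seq(("Ram",null,"MCA","Bangalore"),(null,"25",null,null), 
                 (null,"26","BE",null),("Raju","21","Btech","Chennai"))
            .toDF("name","age","degree","Place")
            .withColumn("null_count", lit(0))
    
    // df.columns includes null_count itself; it is never null, so it adds nothing.
    val df2 = df.columns.foldLeft(df)((acc,c) => 
                acc.withColumn("null_count", 
                    when(col(c).isNull,$"null_count" + 1).otherwise($"null_count")
                )
            )
    df2.show(false)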
    

    The output is

    +----+----+------+---------+----------+
    |name|age |degree|Place    |null_count|
    +----+----+------+---------+----------+
    |Ram |null|MCA   |Bangalore|1         |
    |null|25  |null  |null     |3         |
    |null|26  |BE    |null     |2         |
    |Raju|21  |Btech |Chennai  |0         |
    +----+----+------+---------+----------+
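
For readers new to `foldLeft`, the same accumulation can be traced on a single row in plain Scala, without Spark. This is only an analogy: the `row` map and the `nullCount` name are made up for illustration, with values taken from the third input row and `Option` standing in for nullable cells.

```scala
val columns = Seq("name", "age", "degree", "Place")

// Third row of the sample data; None marks a missing cell.
val row: Map[String, Option[String]] = Map(
  "name" -> None, "age" -> Some("26"), "degree" -> Some("BE"), "Place" -> None)

// Start from 0 and visit each column in turn, adding 1 whenever the cell is empty.
val nullCount = columns.foldLeft(0) { (count, c) =>
  if (row(c).isEmpty) count + 1 else count
}

println(nullCount) // prints 2
```

In the Spark version the accumulator is the DataFrame itself, and each step rewrites its `null_count` column instead of an integer.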
    
