[Posted]: 2018-09-19 06:30:00
[Question]:
How should null values be handled when accessing `Row` values in a DataFrame? Does the NullPointerException really have to be handled manually? Surely there is a better solution.
case class FirstThing(id:Int, thing:String, other:Option[Double])
val df = Seq(FirstThing(1, "first", None), FirstThing(1, "second", Some(2)), FirstThing(1, "third", Some(3))).toDS
df.show
val list = df.groupBy("id").agg(collect_list(struct("thing", "other")).alias("mylist"))
list.show(false)
This fails with an NPE:
val xxxx = udf((t:Seq[Row])=> t.map(elem => elem.getDouble(1)))
list.withColumn("aa", xxxx(col("mylist"))).show(false)
This, oddly, yields 0:
val xxxx = udf((t:Seq[Row])=> t.map(elem => elem.getAs[Double]("other")))
list.withColumn("aa", xxxx(col("mylist"))).show(false)
+---+-----------------------------------------+---------------+
|id |mylist |aa |
+---+-----------------------------------------+---------------+
|1 |[[first,null], [second,2.0], [third,3.0]]|[0.0, 2.0, 3.0]|
+---+-----------------------------------------+---------------+
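A plausible explanation for the 0.0 (a sketch of the mechanism, not confirmed by the asker): `Row.getAs[T]` is generic and erased at runtime, so the cast does not actually run and the underlying `null` slips through; when the value is later unboxed to the primitive `Double`, Scala's `asInstanceOf` on `null` with a primitive target produces the type's default value rather than throwing:

```scala
// asInstanceOf on null with a primitive target type yields the
// default value for that type (0.0 for Double), not an exception.
val d = null.asInstanceOf[Double]
println(d) // 0.0
```

This is specified Scala behaviour, which is why the null in `other` silently becomes `0.0` in the result array.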
Unfortunately, the approach that works for DataFrames/Datasets also fails here:
val xxxx = udf((t:Seq[Row])=> t.map(elem => elem.getAs[Option[Double]]("other")))
list.withColumn("aa", xxxx(col("mylist"))).show(false)
ClassCastException: java.lang.Double cannot be cast to scala.Option
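One null-safe variant (a sketch against the `list` DataFrame built above, not the asker's own solution) is to test the field with `Row.isNullAt` before reading the primitive, and return an `Option[Double]`, which Spark encodes back as a nullable double column:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Check for null explicitly before calling the primitive getter;
// Some/None round-trips to a nullable element in the result array.
val safeGet = udf { (t: Seq[Row]) =>
  t.map { r =>
    val i = r.fieldIndex("other")
    if (r.isNullAt(i)) None else Some(r.getDouble(i))
  }
}

list.withColumn("aa", safeGet(col("mylist"))).show(false)
```

The key point is that `isNullAt` is the supported way to probe a `Row` for null, whereas `getDouble` on a null slot throws and `getAs[Double]` silently unboxes to 0.0.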
[Discussion]:
Tags: scala apache-spark null apache-spark-sql user-defined-functions