【Posted】: 2016-07-28 08:17:14
【Question】:
Suppose I have a DataFrame like the following:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is:
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
| | |-- useless: string (nullable = true)
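I am looking for a way to select only a subset of the fields, id and size, of the array column subClasss, while keeping the nested array structure.
The resulting schema would be: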
root
|-- subClasss: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- size: integer (nullable = false)
I have already tried
df.select("subClasss.id","subClasss.size")
but this splits the array subClasss into two separate arrays:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- size: array (nullable = true)
| |-- element: integer (containsNull = true)
Is there a way to keep the original structure and simply drop the useless field? Something like:
df.select("subClasss.[id,size]")
Thanks for your time.
【Discussion】:
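One possible approach, offered as a sketch rather than anything stated in the post: assuming Spark 2.4 or later (newer than this question), the SQL higher-order function transform can rebuild each array element as a struct holding only id and size, which keeps the array-of-structs shape. The object name KeepNestedSubset and the helper keepIdAndSize are hypothetical names chosen for illustration; SparkSession replaces the sqlContext used in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Same case classes as in the question.
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

object KeepNestedSubset {
  // Hypothetical helper: rewrites every element of the subClasss array
  // as a struct with only the id and size fields.
  // Assumes Spark 2.4+, where the SQL function `transform` is available.
  def keepIdAndSize(df: DataFrame): DataFrame =
    df.withColumn(
      "subClasss",
      expr("transform(subClasss, x -> named_struct('id', x.id, 'size', x.size))")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("keep-nested-subset")
      .getOrCreate()

    val df = spark.createDataFrame(List(
      MotherClass(Array(
        SubClass("1", 1, "thisIsUseless"),
        SubClass("2", 2, "thisIsUseless")
      )),
      MotherClass(Array(
        SubClass("4", 4, "thisIsUseless")
      ))
    ))

    // The useless field should no longer appear under element.
    keepIdAndSize(df).printSchema()

    spark.stop()
  }
}
```

On Spark versions older than 2.4 this function does not exist, so a UDF that maps each element to a smaller case class would be the usual fallback.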
Tags: scala apache-spark dataframe apache-spark-sql