【Posted】: 2024-05-02 12:40:02
【Question】:
import org.apache.spark.sql.Row
import scala.xml.XML

// Labels of the child elements to keep
val wanted = Set("name1", "name2", "name3", "name4")

// Parse the XML fragment in column 1 and collect (label -> text) pairs
val rdd = df.rdd.map(line => Row.fromSeq(
  XML.loadString("<?xml version='1.0' encoding='utf-8'?>" + line(1)).child
    .filter(elem => wanted(elem.label))
    .map(elem => elem.label -> elem.text)
    .toList
))
Running rdd.take(10).foreach(println) on this RDD[Row] produces the following output:
[(name1, value1), (name2, value2),(name3, value3)]
[(name1, value11), (name2, value22),(name3, value33)]
[(name1, value111), (name2, value222),(name4, value444)]
I want to save this to a CSV file in which name1 through name4 are the header columns. Could anyone help me achieve this with Apache Spark 2.4.0? The desired output is:
name1    | name2    | name3   | name4
value1   | value2   | value3  | null
value11  | value22  | value33 | null
value111 | value222 | null    | value444
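
The core of the desired output is aligning each record's (label, text) pairs to a fixed column order and filling missing labels with null. The following is a minimal plain-Scala sketch of that step only, not a full Spark job. The names `CsvAlign`, `align`, and `render` are hypothetical helpers introduced for illustration, and the sample records are taken from the output shown above.

```scala
object CsvAlign {
  // Column order taken from the desired output in the question.
  val header: Seq[String] = Seq("name1", "name2", "name3", "name4")

  // Hypothetical helper: look up each header name among the extracted
  // (label, text) pairs; a label that is absent becomes null.
  def align(pairs: Seq[(String, String)]): Seq[String] = {
    val byLabel = pairs.toMap // if a label repeats, the last value wins
    header.map(name => byLabel.get(name).orNull)
  }

  // Renders one aligned record in the "a | b | c | d" layout used above.
  def render(values: Seq[String]): String =
    values.map(v => if (v == null) "null" else v).mkString(" | ")

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Seq("name1" -> "value1", "name2" -> "value2", "name3" -> "value3"),
      Seq("name1" -> "value111", "name2" -> "value222", "name4" -> "value444")
    )
    println(header.mkString(" | "))
    records.map(align).map(render).foreach(println)
  }
}
```

In Spark, each aligned sequence could be passed to `Row.fromSeq`, and the resulting RDD[Row] combined with a `StructType` of four nullable `StringType` fields through `createDataFrame` on a SparkSession (assumed here to be available as `spark`). Writing with `df.write.option("header", "true").csv(path)` then emits the header line. Spark writes nulls as empty fields by default; the `nullValue` write option can change that representation.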
【Discussion】:
Tags: scala apache-spark scala-xml