[Posted]: 2018-04-15 22:54:08
[Question]:
I am working with Spark SQL on Spark 2.0, reading a CSV through the Java API.
The CSV contains double-quoted, /-delimited columns, for example: "Express Air/Delivery Truck".
The code that reads the CSV and returns a Dataset:
Dataset<Row> df = spark.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(filename);
The result:
+-----+-----------------------+--------------------------+
|Year | State | Ship Mode |...
+-----+-----------------------+--------------------------+
|2012 |New York/California |Express Air/Delivery Truck|...
|2013 |Nevada/Texas |Delivery Truck |...
|2014 |North Carolina/Kentucky|Regular Air/Delivery Truck|...
+-----+-----------------------+--------------------------+
However, I want to split State and Ship Mode into a single Mode column and return it as a Dataset, preserving the pairwise order, e.g. {New York, Express Air}, {California, Delivery Truck}:
+-----+--------------------------+
|Year | Mode |
+-----+--------------------------+
|2012 |New York,Express Air |
|2012 |California,Delivery Truck |
|2013 |Nevada,Delivery Truck |
|2013 |Texas,Delivery Truck |
|2014 |North Carolina,Regular Air|
|2014 |Kentucky,Delivery Truck |
+-----+--------------------------+
Is there any way to do this with Spark's Java API?
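A minimal sketch of the pairing logic being asked for (not a confirmed solution from the thread): split both /-delimited columns positionally and zip them into one output row per pair. In Spark this logic would sit inside a `flatMap` over each input `Row`; the sketch below is plain Java so the zip itself is clear. The class and method names are hypothetical, and it assumes that when the Ship Mode list is shorter than the State list (as in the 2013 row above), the last mode is reused for the remaining states.

```java
import java.util.ArrayList;
import java.util.List;

public class ZipExplode {
    // Returns one {year, "state,shipMode"} pair per state, in input order.
    // Hypothetical helper: in Spark, this body would run inside a flatMap
    // over each Row of the Dataset read from the CSV.
    static List<String[]> explode(String year, String states, String modes) {
        String[] stateArr = states.split("/");
        String[] modeArr = modes.split("/");
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < stateArr.length; i++) {
            // Positional zip keeps the original pairing order; assumption:
            // reuse the last mode when the mode list is shorter (2013 row).
            String mode = i < modeArr.length ? modeArr[i] : modeArr[modeArr.length - 1];
            out.add(new String[] { year, stateArr[i] + "," + mode });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] row : explode("2012", "New York/California", "Express Air/Delivery Truck")) {
            System.out.println(row[0] + " | " + row[1]);
        }
        // 2012 | New York,Express Air
        // 2012 | California,Delivery Truck
    }
}
```

Applied row by row via `Dataset.flatMap` with an `Encoders.bean`/`Encoders.tuple` encoder, this would produce the Year/Mode table shown above while keeping the pairwise order.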
[Discussion]:
Tags: java sql apache-spark dataset apache-spark-sql