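【Posted】: 2018-04-11 18:38:00
【Question】:
I have a simple Spark program that reads a JSON file and writes a CSV file. In the JSON data, the values contain leading and trailing whitespace, but when I write the CSV, that whitespace is gone. Is there a way to preserve it? I tried several options, such as ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but with no luck.
input.json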
{"key" : "k1", "value1": "Good String", "value2": "Good String"}
{"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
{"key" : "k1", "value1": "with tab\t", "value2": "with tab\t"}
output.csv
_corrupt_record,key,value1,value2
,k1,Good String,Good String
,k1,With Spaces,With Spaces
,k1,with tab,with tab
expected.csv
_corrupt_record,key,value1,value2
,k1,Good String,Good String
,k1,With Spaces ,With Spaces
,k1,with tab\t,with tab\t
My code:
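import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TestSpark {
    public static void main(String[] args) {
        SparkSession sparkSession = SparkSession
                .builder()
                .appName(TestSpark.class.getName())
                .master("local[1]").getOrCreate();
        SparkContext context = sparkSession.sparkContext();
        context.setLogLevel("ERROR");
        SQLContext sqlCtx = sparkSession.sqlContext();
        System.out.println("Spark context established");
        // Explicit schema: every column is a nullable string.
        List<StructField> kvFields = new ArrayList<>();
        kvFields.add(DataTypes.createStructField("_corrupt_record", DataTypes.StringType, true));
        kvFields.add(DataTypes.createStructField("key", DataTypes.StringType, true));
        kvFields.add(DataTypes.createStructField("value1", DataTypes.StringType, true));
        kvFields.add(DataTypes.createStructField("value2", DataTypes.StringType, true));
        StructType employeeSchema = DataTypes.createStructType(kvFields);
        // Read the JSON input using the schema above.
        Dataset<Row> dataset =
                sparkSession.read()
                        .option("inferSchema", false)
                        .format("json")
                        .schema(employeeSchema)
                        .load("D:\\dev\\workspace\\java\\simple-kafka\\key_value.json");
        dataset.createOrReplaceTempView("sourceView");
        // Write every row back out as CSV into a fresh, randomly named directory.
        sqlCtx.sql("select * from sourceView")
                .write()
                .option("header", true)
                .format("csv")
                .save("D:\\dev\\workspace\\java\\simple-kafka\\output\\" + UUID.randomUUID().toString());
        sparkSession.close();
    }
}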
Update
Added the POM dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>2.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.22</version>
    </dependency>
</dependencies>
【Discussion】:
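One thing that may be worth checking, offered as a sketch rather than a confirmed fix: as far as I know, the CSV writer only honors ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace from Spark 2.2.0 onward, and on the write path both default to true, whereas the POM in the question pins 2.1.0. Assuming an upgrade is possible, both options would be set to false on the writer (not the reader). The class name CsvWhitespaceOptions and the method preserveWhitespace below are made up for illustration. The map they produce is meant to be passed to DataFrameWriter.options(java.util.Map).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Spark): collects the CSV writer options
// that are assumed to keep leading/trailing whitespace on Spark 2.2.0+.
class CsvWhitespaceOptions {

    static Map<String, String> preserveWhitespace() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("header", "true");
        // Assumption: on the write path these are honored from Spark 2.2.0
        // and default to "true", which trims the values being written.
        options.put("ignoreLeadingWhiteSpace", "false");
        options.put("ignoreTrailingWhiteSpace", "false");
        return options;
    }

    public static void main(String[] args) {
        // Intended use with the question's code (requires Spark on the classpath):
        //   sqlCtx.sql("select * from sourceView")
        //         .write()
        //         .options(CsvWhitespaceOptions.preserveWhitespace())
        //         .format("csv")
        //         .save(outputPath);
        preserveWhitespace().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the options in one map makes it easy to apply the same settings to every CSV write in the job. Whether the tab in "with tab\t" survives as a literal tab character would still need to be verified against the actual output.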
Tags: apache-spark apache-spark-sql spark-dataframe spark-streaming apache-spark-mllib