Posted: 2021-04-26 22:57:37
Question:
How can I read a file that uses a multi-character delimiter with the multiLine option in Spark 3.0.1?
Input file:
company||street||city
Test1 company||1st street||city1
Test2 company||2nd street||city2
Test3 company||"3rd
street"||city3
spark.read
  .option("delimiter", "||")
  .option("header", "true")
  .option("multiLine", "true")
  .option("inferSchema", "false")
  .csv(transformedFile)
When I print the DataFrame, it reports a total of 4 records instead of 3.
records count :4
+-------------+
|company |
+-------------+
|Test1 company|
|Test2 company|
|Test3 company|
|street" |
+-------------+
Expected output:
+-------------+-----------+-----+
|company |street |city |
+-------------+-----------+-----+
|Test1 company|1st street |city1|
|Test2 company|2nd street |city2|
|Test3 company|3rd
street|city3|
+-------------+-----------+-----+
With a single-character delimiter, the result matches the expected output.
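Since the single-character case behaves as expected, one possible workaround is to rewrite the `||` delimiter to a single character before the file reaches Spark. The sketch below is plain Scala with no Spark dependency. The names `DelimiterRewrite` and `rewriteDelimiter` are hypothetical, and the default replacement character `\u0001` is an assumption that only holds if that character never occurs in the data. The code also assumes double quotes are balanced and does not handle backslash-escaped quotes.

```scala
object DelimiterRewrite {
  // Hypothetical helper: replaces each occurrence of the multi-character
  // delimiter that lies outside double quotes with a single character.
  // Delimiters and newlines inside a quoted field are left untouched.
  // Assumption: quotes are balanced; backslash-escaped quotes are not handled.
  def rewriteDelimiter(text: String,
                       multi: String = "||",
                       single: Char = '\u0001'): String = {
    val out = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (c == '"') {
        // Toggle quoted state; a doubled quote ("") toggles twice and cancels out.
        inQuotes = !inQuotes
        out.append(c)
        i += 1
      } else if (!inQuotes && text.startsWith(multi, i)) {
        // Unquoted delimiter: emit the single-character replacement.
        out.append(single)
        i += multi.length
      } else {
        out.append(c)
        i += 1
      }
    }
    out.toString
  }
}
```

The rewritten text could then be saved to a new file and read with the same options as above, changing only `delimiter` to the chosen single character. Whether this fully avoids the extra record in Spark 3.0.1 has not been verified here.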
Comments:
Tags: scala apache-spark