[Title]: Spark read for csv file with quotes on one column in file
[Posted]: 2022-01-16 19:43:54
[Description]:

My HDFS location has a csv file that has quotes on one column. The file is records.csv, and this is its data:

 100,"surender,CHN",IND
 101,"ajay,HYD",IND



 scala> val schema = StructType(
 | Array(
 | StructField("emp_id", StringType, true),
 | StructField("emp_name", StringType, true),
 | StructField("emp_city", StringType, true),
 | StructField("emp_country", StringType, true)
 | )
 | )
 schema: org.apache.spark.sql.types.StructType = StructType(StructField(emp_id,StringType,true), StructField(emp_name,StringType,true), StructField(emp_city,StringType,true), StructField(emp_country,StringType,true))

 scala>

 scala> val loc = "/user/omega/records.csv"
 loc: String = /user/omega/records.csv


 scala> val df = spark.read.option("delimiter", ",").option("quote", "\"").option("escape", "\"").schema(schema).csv(loc)
 df: org.apache.spark.sql.DataFrame = [emp_id: string, emp_name: string ... 2 more fields]

  scala> df.show(10,false)
  +------+------------+--------+-----------+
  |emp_id|emp_name    |emp_city|emp_country|
  +------+------------+--------+-----------+
  |100   |surender,CHN|IND     |null       |
  |101   |ajay,HYD    |IND     |null       |
  +------+------------+--------+-----------+

But my expected output is:

  +------+------------+--------+-----------+
  |emp_id|emp_name    |emp_city|emp_country|
  +------+------------+--------+-----------+
  |100   |surender    |CHN     |IND        |
  |101   |ajay        |HYD     |IND        |
  +------+------------+--------+-----------+

How do I get the expected output?

I tried another approach, shown below:

  val df1 = spark.read.option("delimiter", ",").option("quote", "").option("escape quote", "").schema(schema).csv(loc)

The df1 above gives the following result:

 +------+---------+--------+-----------+
 |emp_id| emp_name|emp_city|emp_country|
 +------+---------+--------+-----------+
 |   100|"surender|    CHN"|        IND|
 |   101|    "ajay|    HYD"|        IND|
 +------+---------+--------+-----------+
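Both results are consistent with quote-aware CSV tokenization: with quoting enabled, `"surender,CHN"` is a single field, so the line produces only three values for the four-column schema (leaving emp_country null); with quoting disabled, a plain comma split yields four tokens that still carry the quote characters. A minimal sketch (illustrative only, NOT Spark's actual parser):

```scala
// Minimal quote-aware CSV tokenizer sketch (not Spark's real parser).
// quoting = true  -> "surender,CHN" stays one field (3 fields total)
// quoting = false -> plain comma split leaves the quote characters behind
def tokenize(line: String, quoting: Boolean): List[String] = {
  val fields = scala.collection.mutable.ListBuffer.empty[String]
  val cur = new StringBuilder
  var inQuotes = false
  for (c <- line) c match {
    case '"' if quoting   => inQuotes = !inQuotes // toggle and drop the quote
    case ',' if !inQuotes => fields += cur.result(); cur.clear()
    case other            => cur += other
  }
  fields += cur.result()
  fields.toList
}
```

With `quoting = true` the sample row yields three fields, matching the first read; with `quoting = false` it yields the quote-polluted tokens seen in the df1 output.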

[Discussion]:

Tags: dataframe scala apache-spark


[Solution 1]:

A simple solution is to clean the data after reading the CSV (here column_with_quotes stands for the single column that received the quoted value, e.g. "surender,CHN"):

    import org.apache.spark.sql.functions.{col, split}

    df
      .withColumn("emp_name", split(col("column_with_quotes"), ",").getItem(0))
      .withColumn("emp_city", split(col("column_with_quotes"), ",").getItem(1))
      .drop("column_with_quotes")
    
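The string logic behind that `split(...).getItem(n)` cleanup can be sketched in plain Scala (illustrative only; it splits on the first comma, which is all the packed values here contain):

```scala
// Plain-Scala sketch of the per-value cleanup Solution 1 performs with
// Spark's split(): "surender,CHN" -> ("surender", "CHN"). Illustrative only.
def splitPacked(value: String): (String, String) = {
  val parts = value.split(",", 2) // limit 2: split on the first comma only
  (parts(0), if (parts.length > 1) parts(1) else null)
}
```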

Update:

I looked through the CSV options. Have you checked this one?

    .option("unescapedQuoteHandling", "BACK_TO_DELIMITER") // defines how the CsvParser will handle values with unescaped quotes
    

[Comments]:

    • @gater, this would work, but it doesn't help me, because there are 100+ columns after column_with_quotes and I don't want to write that many .withColumn calls
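If writing one `.withColumn` per field is the objection, the cleanup expressions can be generated instead of hand-written. A plain-Scala sketch that builds Spark SQL expression strings (column names are hypothetical; the result could be passed to something like `df.selectExpr` together with the untouched column names):

```scala
// Generate `split(col, ',')[i] AS name` Spark SQL expression strings for
// the packed column instead of writing one withColumn call per field.
// (Hypothetical column names; adapt to the real schema.)
def splitExprs(packedCol: String, targets: Seq[String]): Seq[String] =
  targets.zipWithIndex.map { case (name, i) =>
    s"split($packedCol, ',')[$i] AS $name"
  }
```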