[Title]: Spark read for csv file with quotes on one column in file
[Posted]: 2022-01-16 19:43:54
[Description]:

My HDFS location has a csv file that has quotes on one column. The file is records.csv, and this is its data:

 100,"surender,CHN",IND
 101,"ajay,HYD",IND



 scala> val schema = StructType(
 | Array(
 | StructField("emp_id", StringType, true),
 | StructField("emp_name", StringType, true),
 | StructField("emp_city", StringType, true),
 | StructField("emp_country", StringType, true)
 | )
 | )
 schema: org.apache.spark.sql.types.StructType = StructType(StructField(emp_id,StringType,true), StructField(emp_name,StringType,true), StructField(emp_city,StringType,true), StructField(emp_country,StringType,true))

 scala>

 scala> val loc = "/user/omega/records.csv"
 loc: String = /user/omega/records.csv


 scala> val df = spark.read.option("delimiter", ",").option("quote", "\"").option("escape", "\"").schema(schema).csv(loc)
 df: org.apache.spark.sql.DataFrame = [emp_id: string, emp_name: string ... 2 more fields]

  scala> df.show(10,false)
  +------+------------+--------+-----------+
  |emp_id|emp_name    |emp_city|emp_country|
  +------+------------+--------+-----------+
  |100   |surender,CHN|IND     |null       |
  |101   |ajay,HYD    |IND     |null       |
  +------+------------+--------+-----------+

But my expected output is:

  +------+------------+--------+-----------+
  |emp_id|emp_name    |emp_city|emp_country|
  +------+------------+--------+-----------+
  |100   |surender    |CHN     |IND        |
  |101   |ajay        |HYD     |IND        |
  +------+------------+--------+-----------+

How do I get the expected output?

I tried another approach, shown below:

  val df1 = spark.read.option("delimiter", ",").option("quote", "").option("escape quote", "").schema(schema).csv(loc)

The df1 above gives the following result:

 +------+---------+--------+-----------+
 |emp_id| emp_name|emp_city|emp_country|
 +------+---------+--------+-----------+
 |   100|"surender|    CHN"|        IND|
 |   101|    "ajay|    HYD"|        IND|
 +------+---------+--------+-----------+
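Both results are consistent with quote-aware CSV tokenization: with quoting enabled, `"surender,CHN"` is a single field, so the line produces only three values for the four-column schema (leaving emp_country null); with quoting disabled, a plain comma split yields four tokens that still carry the quote characters. A minimal sketch (illustrative only, NOT Spark's actual parser):

```scala
// Minimal quote-aware CSV tokenizer sketch (not Spark's real parser).
// quoting = true  -> "surender,CHN" stays one field (3 fields total)
// quoting = false -> plain comma split leaves the quote characters behind
def tokenize(line: String, quoting: Boolean): List[String] = {
  val fields = scala.collection.mutable.ListBuffer.empty[String]
  val cur = new StringBuilder
  var inQuotes = false
  for (c <- line) c match {
    case '"' if quoting   => inQuotes = !inQuotes // toggle and drop the quote
    case ',' if !inQuotes => fields += cur.result(); cur.clear()
    case other            => cur += other
  }
  fields += cur.result()
  fields.toList
}
```

With `quoting = true` the sample row yields three fields, matching the first read; with `quoting = false` it yields the quote-polluted tokens seen in the df1 output.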

[Discussion]:

Tags: dataframe scala apache-spark


[Solution 1]:

A simple solution is to clean the data after reading the CSV (here column_with_quotes stands for the single column that received the quoted value, e.g. "surender,CHN"):

    import org.apache.spark.sql.functions.{col, split}

    df
      .withColumn("emp_name", split(col("column_with_quotes"), ",").getItem(0))
      .withColumn("emp_city", split(col("column_with_quotes"), ",").getItem(1))
      .drop("column_with_quotes")
    
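The string logic behind that `split(...).getItem(n)` cleanup can be sketched in plain Scala (illustrative only; it splits on the first comma, which is all the packed values here contain):

```scala
// Plain-Scala sketch of the per-value cleanup Solution 1 performs with
// Spark's split(): "surender,CHN" -> ("surender", "CHN"). Illustrative only.
def splitPacked(value: String): (String, String) = {
  val parts = value.split(",", 2) // limit 2: split on the first comma only
  (parts(0), if (parts.length > 1) parts(1) else null)
}
```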

Update:

I looked through the CSV options. Have you checked this one?

    .option("unescapedQuoteHandling", "BACK_TO_DELIMITER") // defines how the CsvParser will handle values with unescaped quotes
    

[Comments]:

    • @gater, this would work, but it doesn't help me, because there are 100+ columns after column_with_quotes and I don't want to write that many .withColumn calls
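If writing one `.withColumn` per field is the objection, the cleanup expressions can be generated instead of hand-written. A plain-Scala sketch that builds Spark SQL expression strings (column names are hypothetical; the result could be passed to something like `df.selectExpr` together with the untouched column names):

```scala
// Generate `split(col, ',')[i] AS name` Spark SQL expression strings for
// the packed column instead of writing one withColumn call per field.
// (Hypothetical column names; adapt to the real schema.)
def splitExprs(packedCol: String, targets: Seq[String]): Seq[String] =
  targets.zipWithIndex.map { case (name, i) =>
    s"split($packedCol, ',')[$i] AS $name"
  }
```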