【Posted】: 2022-01-16 19:43:54
【Question】:
I have a csv file in my HDFS location, and it has quotes on one column. My file is records.csv, and this is its data:
100,"surender,CHN",IND
101,"ajay,HYD",IND
scala> val schema = StructType(
| Array(
| StructField("emp_id", StringType, true),
| StructField("emp_name", StringType, true),
| StructField("emp_city", StringType, true),
| StructField("emp_country", StringType, true)
| )
| )
schema: org.apache.spark.sql.types.StructType = StructType(StructField(emp_id,StringType,true), StructField(emp_name,StringType,true), StructField(emp_city,StringType,true), StructField(emp_country,StringType,true))
scala> val loc = "/user/omega/records.csv"
loc: String = /user/omega/records.csv
scala> val df = spark.read.option("delimiter", ",").option("quote", "\"").option("escape", "\"").schema(schema).csv(loc)
df: org.apache.spark.sql.DataFrame = [emp_id: string, emp_name: string ... 2 more fields]
scala> df.show(10,false)
+------+------------+--------+-----------+
|emp_id|emp_name |emp_city|emp_country|
+------+------------+--------+-----------+
|100 |surender,CHN|IND |null |
|101 |ajay,HYD |IND |null |
+------+------------+--------+-----------+
But my expected output is:
+------+------------+--------+-----------+
|emp_id|emp_name |emp_city|emp_country|
+------+------------+--------+-----------+
|100 |surender |CHN |IND |
|101 |ajay |HYD |IND |
+------+------------+--------+-----------+
How can I get the expected output?
I tried another piece of code, shown below:
val df1 = spark.read.option("delimiter", ",").option("quote", "").option("escape quote", "").schema(schema).csv(loc)
The df1 above gives the following result:
+------+---------+--------+-----------+
|emp_id| emp_name|emp_city|emp_country|
+------+---------+--------+-----------+
| 100|"surender| CHN"| IND|
| 101| "ajay| HYD"| IND|
+------+---------+--------+-----------+
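(For context, the quirk can be reproduced without Spark. The sketch below is a hypothetical illustration, not code from the question: it splits one raw line from records.csv both ways, showing why a quote-aware read keeps "surender,CHN" as a single field (3 fields total, so emp_country becomes null), while the expected output needs a plain split on every comma with the leftover quote characters stripped.)

```scala
// Hypothetical sketch (plain Scala, no Spark): one raw line from records.csv.
val line = "100,\"surender,CHN\",IND"

// Quote-aware parsing (what option("quote", "\"") does): split only on commas
// followed by an even number of quotes, so the quoted section stays together.
// Only 3 fields come out, which is why emp_country ends up null.
val quoteAware = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")
println(quoteAware.length)          // 3

// A plain split on every comma yields the 4 fields the expected output wants,
// but leaves the quote characters behind, so they must be stripped afterwards.
val plain = line.split(",").map(_.replace("\"", ""))
println(plain.mkString("|"))        // 100|surender|CHN|IND
```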
【Discussion】:
Tags: dataframe scala apache-spark