Read multiple CSV files with header in only first file - Spark

【Question title】: Read multiple CSV files with header in only first file - Spark
【Posted】: 2021-10-22 01:26:25
【Question description】:

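I have 5 CSV files, and the header is present only in the first file. I want to read them with Spark and create a DataFrame. My code below works, but I lose 4 rows of data with this approach because header is set to true in the final read. If I set header to false, I get the 4 rows back, but I also get the actual header from the first file as a row in my data.

Is there a more efficient way to do this, so that the header does not show up as a row in my dataset?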

header = spark.read \
  .format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("path/file-1") 

schema = header.schema 

df = spark.read \
  .format("csv") \
  .option("header", "true") \
  .schema(schema) \
  .load("path")

【Question comments】:

Tags: python apache-spark pyspark

【Solution 1】:

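Unfortunately, I don't think there is an easy way to do what you want. There is a workaround that resembles what you are already doing: read the first file to get the schema, read all the files except the first one with option("header", "false"), and then union the first file with the rest.

In Python, it looks like this: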

first_file = "path/file-1"

# Read the first file on its own; it is the only one with a header row.
header = spark.read.option("header", "true") \
  .option("inferSchema", "true").csv(first_file)
schema = header.schema

# I use binaryFiles simply to get the list of the files in the folder.
# Note that the files are not read.
# Any other means of listing the files in a directory would do the trick as well.
all_files = spark.sparkContext.binaryFiles("path") \
  .map(lambda x: x[0]).collect()
all_files_but_first = [f for f in all_files if not f.endswith(first_file)]

# Read the remaining, header-less files with the schema from the first file,
# then append the rows of the first file.
df = spark.read.option("header", "false") \
  .schema(schema).csv(all_files_but_first) \
  .union(header)
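
As the code comment says, any other way of listing the files works too. Here is a minimal sketch that uses only the Python standard library instead of binaryFiles. It assumes the CSV files sit on a local filesystem visible to the driver, so it would not apply to HDFS or object storage as written. The helper name list_files_except is hypothetical and not part of Spark.

```python
from pathlib import Path


def list_files_except(directory, first_file):
    """Return the sorted paths of the data files in `directory`, excluding `first_file`.

    Hypothetical helper for illustration; it only handles local paths.
    """
    first = Path(first_file).resolve()
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        # Skip sub-directories and names starting with "." or "_", which are
        # typically marker or checksum files rather than data.
        if p.is_file()
        and not p.name.startswith((".", "_"))
        and p.resolve() != first
    )
```

The returned list could then stand in for all_files_but_first, for example spark.read.option("header", "false").schema(schema).csv(list_files_except("path", first_file)).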

【Comments】: