Read files with different column order: answers

[Question title]: Read files with different column order
[Posted]: 2020-08-21 23:38:43
[Question description]:

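I have several CSV files with headers, but I found that some of the files have a different column order. Is there a way to handle this with Spark, where I can define the select order for each file, so that the master DF doesn't end up with a mismatch where col x holds col y's values?

My current read: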

 val masterDF = spark.read.option("header", "true").csv(allFiles:_*)

[Question discussion]:

Tags: scala apache-spark pyspark

[Solution 1]:

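Extract all the file names and store them in a list variable.

Then define the schema with all the required columns.
Iterate over the files and read each one separately with header set to true.
unionAll each new dataframe with the existing dataframe.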

Example:

file_lst = ['<path1>', '<path2>']

from pyspark.sql.types import StructType, StructField, StringType

# define schema for the required columns
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True),
])

# create an empty dataframe with that schema
df = spark.createDataFrame([], schema)

for path in file_lst:
    # read each file on its own so its header is used, then select by name
    # to put the columns in one fixed order
    tmp_df = spark.read.option("header", "true").csv(path).select("column1", "column2")
    # unionAll matches columns by position, hence the select above
    df = df.unionAll(tmp_df)

# display results
df.show()

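[Discussion]:

Isn't this expensive? Creating a df and then unioning 100K files? Is there a better way to do this?
One issue worth pointing out here is that unionAll adds up the partition counts of its inputs. When you finally push the output downstream you end up with many small files, which hurts the performance of downstream applications. Make sure to repartition before pushing the data downstream.
@Hello.World, I can't think of any other option if the column names are in a different order, even in the latest Spark version.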
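
One way to cut down the number of unions, offered only as a sketch and not taken from the answer above: group the files by their header line first, read each group with a single csv() call, and union the groups instead of the individual files. The helper names group_by_header and read_aligned are hypothetical. The sketch assumes the files are comma-delimited, reachable with Python's open() (so local paths, not HDFS or S3), that every file has a header line, and that at least one file is passed in.

```python
import csv
from collections import defaultdict
from functools import reduce


def group_by_header(paths):
    """Map each distinct header (a tuple of column names) to the files that use it."""
    groups = defaultdict(list)
    for path in paths:
        # Only the first line of each file is parsed.
        # Assumption: local, comma-delimited files that all start with a header.
        with open(path, newline="") as handle:
            header = tuple(next(csv.reader(handle)))
        groups[header].append(path)
    return dict(groups)


def read_aligned(spark, paths, columns):
    """Read each header group in one csv() call, then union the groups.

    `spark` is an existing SparkSession; `columns` is the fixed output order.
    """
    frames = [
        # Files in one group share a column order, so one read covers them all.
        spark.read.option("header", "true").csv(group).select(*columns)
        for group in group_by_header(paths).values()
    ]
    # One union per distinct header order, not one per file.
    return reduce(lambda left, right: left.union(right), frames)
```

With this, read_aligned(spark, file_lst, ["column1", "column2"]) performs as many unions as there are distinct column orders, which is usually far fewer than the number of files. The partition concern raised in the discussion still applies, so a repartition before writing remains a sensible last step.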