One approach you can try is to first create a Spark dataframe and then make a table out of it. Below is an example for a hypothetical case using PySpark, where the delimiters are | and -.
Note: We are using split, which means it splits on every occurrence, e.g. 2021-12-31 is a single value, yet it would get split. So we should be very sure this cannot happen in the data. As general advice, never accept these kinds of files, because surprises will happen.
How the sample data looks: in this case we have 2 files in our directory, with | and - occurring randomly as delimiters.
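A hypothetical illustration of what one of these files could contain (the column names below are assumptions for this answer; the row A|33-Mech is the example reused in the code comments):

Name|Age-Dept
A|33-Mech
B-45|Elec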
# Create RDD. Basically read as simple text file.
# sc is spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/")
rddRead.collect() # For debugging
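# With the hypothetical file contents shown above, collect() returns each
# physical line as one unsplit string:
#   ['Name|Age-Dept', 'A|33-Mech', 'B-45|Elec']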
import re # Import Python's standard regex module
# Create another RDD using simple string operations. The result is similar to a list of lists.
# Give a regex expression to split each string on the anticipated delimiters (this could be dangerous
# if those delimiters occur as part of a value, e.g. 2021-12-31 is a single value in reality,
# but this is the price we have to pay for not having good data).
# On each iteration, k represents 1 element, which eventually becomes 1 row (e.g. A|33-Mech)
rddSplit = rddRead.map(lambda k: re.split("[|-]+", k)) # Anticipated delimiters are | OR - in this case.
rddSplit.collect() # For debugging
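# For the hypothetical rows above, the split yields (a quick sanity check you
# can run in plain Python):
#   re.split("[|-]+", "Name|Age-Dept") -> ['Name', 'Age', 'Dept']
#   re.split("[|-]+", "A|33-Mech")     -> ['A', '33', 'Mech']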
# This block is applicable only if you have headers
lsHeader = rddSplit.first() # Get the first element from the RDD as the header.
print(lsHeader) # For debugging
print()
# Remove the rows representing headers. (Note: this assumes the column names in
# all files are the same. If not, we would have to filter by manually specifying
# all of them, which would be a nightmare from the point of view of both good code and maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader)
rddData.collect() # For debugging
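# With the hypothetical data above, only the data rows remain; the header line
# of every file is filtered out (they are all equal to lsHeader):
#   [['A', '33', 'Mech'], ['B', '45', 'Elec']]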
# Convert the RDD to a Spark dataframe.
# Utilise the header we got in the earlier step. Alternatively, we could supply our own headers.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display() # For debugging (display() is Databricks-specific; use dfSpark.show() elsewhere)
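As a side note, the same idea can be expressed with the DataFrame API instead of raw RDDs. This is a minimal sketch, not part of the original approach; it assumes the same hypothetical path, the same [|-]+ delimiter pattern, and that every file starts with an identical header line:

import re
from pyspark.sql import functions as F

dfRaw = spark.read.text("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/")
headerLine = dfRaw.first()["value"]                # Same assumption as rddSplit.first() above.
lsHeader = re.split("[|-]+", headerLine)           # Column names parsed from the header line.
parts = F.split(F.col("value"), "[|-]+")           # Spark-side split using the same pattern.
dfParsed = (dfRaw
            .filter(F.col("value") != headerLine)  # Drop the header row of every file.
            .select(*[parts.getItem(i).alias(name) for i, name in enumerate(lsHeader)]))
dfParsed.show()                                    # For debugging

The same caveat applies here as in the RDD version: the split pattern will also break apart values that legitimately contain | or -.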