Different delimiters on different lines in the same file for Databricks Spark: answers

[Question title]: Different delimiters on different lines in the same file for Databricks Spark
[Posted]: 2021-12-09 15:51:20
[Question]:

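I have a file that mixes comma-delimited lines and pipe-delimited lines, and I need to import it into Databricks.

Is it possible to specify two or more different delimiters when creating a SQL table in Databricks/Spark?

I have seen plenty of posts about multi-character delimiters and similar topics, but nothing about different delimiters on different lines.

I am currently using something like this: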

create table myschema.mytable (
  foo string,
  bar string
)
using csv
options (
  header = "true",
  delimiter = ","
);

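[Comments]:

What is using cvs? I think you have two options: 1) filter the lines for each delimiter into their own dataframes and then join them; 2) use a regular expression to split the lines and create a single unified dataframe.
Sample data and expected result, please.
That should read csv, not cvs (corrected in the original post).
I can think of several ways to do this myself (most of which would probably work 99.999% of the time), but I was hoping for something built into Databricks/Spark so that I would not have to build anything custom. In other words, I want to hand the file I am given to my Databricks solution as is, rather than pre-process it myself.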
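
As an illustration of the kind of custom per-line handling discussed in these comments, here is a minimal sketch in plain Python. It is not a built-in Databricks/Spark feature. The names split_mixed_line and CANDIDATE_DELIMITERS are hypothetical, and the rule "use whichever candidate delimiter occurs most often in the line" is an assumed heuristic that fails when the counts tie or when the other delimiter appears frequently inside quoted values.

```python
import csv

# Assumption: only these two delimiters occur, one kind per line.
CANDIDATE_DELIMITERS = (",", "|")


def split_mixed_line(line, delimiters=CANDIDATE_DELIMITERS):
    """Split one line on whichever candidate delimiter it contains most often."""
    # Ties (including lines with no delimiter at all) fall back to the first candidate.
    delimiter = max(delimiters, key=line.count)
    # csv.reader handles quoted fields, which a plain str.split would not.
    rows = list(csv.reader([line], delimiter=delimiter))
    return rows[0] if rows else []


lines = ["foo,bar", "a|b", '"Smith, John",42']
parsed = [split_mixed_line(line) for line in lines]
print(parsed)
```

With Spark, a function like this could be passed to RDD.map after reading the file as plain text with sc.textFile, so that each line is split independently of the others.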

Tags: apache-spark databricks

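[Solution 1]:

One approach you can try is to create a Spark dataframe first and then make a table out of it. Below is an example for a hypothetical case, using pyspark, where the delimiters are | and -

Note: we are using split, which splits on every occurrence of a delimiter. For example, 2000-12-31 is a single value, but it will be split into three. So we need to be very sure that this never happens in the data. As general advice, never accept this kind of file, because there will be surprises.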
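
The caveat in this note can be checked without Spark. The snippet below is a small sketch using only Python's re module with the same character class as the answer's pattern. The name DELIMS and the second sample row are made up for illustration.

```python
import re

# Same character class as the answer's pattern: split on "|" or "-".
DELIMS = re.compile(r"[|-]+")

# The example row from the answer's comments splits as intended.
row = "A|33-Mech"
fields = DELIMS.split(row)
print(fields)  # ['A', '33', 'Mech']

# A date also contains "-", so it is broken into three separate fields.
row_with_date = "A|2000-12-31"  # hypothetical row
fields_with_date = DELIMS.split(row_with_date)
print(fields_with_date)  # ['A', '2000', '12', '31']
```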

What the sample data looks like: in this case we have 2 files in our directory, with | and - occurring randomly as delimiters.

# Create RDD. Basically read as simple text file. 
# sc is spark context
rddRead = sc.textFile("/mnt/adls/RI_Validation/ReadMulktipleDelimerFile/Sample1/") 
rddRead.collect() # For debugging

import re # Import for usual python regex 

# Create another RDD using simple string operations. The result is similar to a list of lists.
# The regex splits each line on the anticipated delimiters. This is dangerous if a
# delimiter also occurs inside a value (e.g. 2021-12-31 is really a single value),
# but that is the price we pay for not having good data.
# In each call, k is one element that eventually becomes one row (e.g. A|33-Mech).

rddSplit = rddRead.map(lambda k: re.split("[|-]+", k)) # Anticipated delimiters are | OR - in this case.
rddSplit.collect() # For debugging

# This block is applicable only if you have headers
lsHeader = rddSplit.first()  # Get First element from rdd as header.
print(lsHeader) # For debugging
print()
# Remove the rows that repeat the header. (Note: this assumes the column names are the
# same in all files. If not, each header would have to be filtered out by specifying it
# manually, which would be a nightmare for both code quality and maintenance.)
rddData = rddSplit.filter(lambda x: x != lsHeader) 
rddData.collect() # For debugging

# Convert rdd to spark dataframe
# Utilise the header we got in earlier step. Else can give our own headers.
dfSpark = rddData.toDF(lsHeader)
dfSpark.display() # For debugging
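
Because a delimiter inside a value changes the number of fields in a row, one possible safeguard is to compare each row's width with the header before building the dataframe. This is a plain-Python sketch and not part of the original answer. The helper split_by_width, the column names, and the sample rows are all hypothetical.

```python
def split_by_width(rows, expected_width):
    """Separate rows with the expected number of fields from all other rows."""
    good, bad = [], []
    for row in rows:
        (good if len(row) == expected_width else bad).append(row)
    return good, bad


header = ["name", "value"]  # hypothetical column names
rows = [
    ["A", "33"],                # matches the header width
    ["B", "2000", "12", "31"],  # a date that was split into three fields
]

good, bad = split_by_width(rows, len(header))
print(good)  # rows that are safe to load
print(bad)   # rows to inspect by hand
```

On the RDD itself, the same check could be written as a filter that compares len(x) with len(lsHeader), applied before the call to toDF.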

[Comments]:

Updated the regex pattern to include + in case a delimiter is repeated consecutively.
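
The effect of the + can be seen with a short plain-Python comparison. The sample line is hypothetical. Note the side effect: with +, a run of delimiters counts as a single separator, so an empty field between two delimiters disappears and later values shift one column to the left.

```python
import re

line = "A||33-Mech"  # hypothetical line with an empty second field

# Without "+": every delimiter separates, so the empty field is kept.
without_plus = re.split(r"[|-]", line)
print(without_plus)  # ['A', '', '33', 'Mech']

# With "+": consecutive delimiters collapse into one separator.
with_plus = re.split(r"[|-]+", line)
print(with_plus)  # ['A', '33', 'Mech']
```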