Uploading XML data from a Google Earth KML file to DataBricks

[Question title]: Uploading XML data from Google Earth KML file to DataBricks
[Posted]: 2019-04-17 20:34:36
[Question description]:

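I am setting up DataBricks to compare and contrast data from multiple sources. Some of the data is in CSV files, some is in JSON format, and some is in Google Earth KML files. The last one is proving to be a real challenge. I am trying to upload the XML data with the data upload feature, but DataBricks cannot create a table from the XML string. What is the process for getting XML into a DataBricks table?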

[Comments]:

See this thread: stackoverflow.com/questions/52758704/…

Tags: python xml kml google-earth databricks

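[Solution 1]:

The best way is to use the spark-xml library in your workspace.

Search for spark-xml in the Maven/Spark Packages section and add it as a library by following these steps: https://docs.databricks.com/user-guide/libraries.html#create-a-library

Then attach the library to your cluster:

https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster

Finally, use the following code to read the XML data in Databricks: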

# rowTag names the XML element that becomes one row (rootTag only applies when writing)
xmldata = spark.read.format('xml').option("rowTag", "note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')
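For the KML files from the question, the same read can be pointed at KML's `Placemark` element. The following is only a sketch: it assumes spark-xml is attached to the cluster, that its `rowTag` read option is used to pick the element treated as one row, and that the file stores its features as `<Placemark>` elements. The function name `read_kml_placemarks` and the path in the usage comment are made up for illustration.

```python
def read_kml_placemarks(spark, path):
    """Load a KML file with one DataFrame row per <Placemark> element.

    Assumes the spark-xml library is attached to the cluster; `spark` is the
    SparkSession that Databricks notebooks provide.
    """
    return (
        spark.read.format("xml")
        .option("rowTag", "Placemark")  # each <Placemark> becomes one row
        .load(path)
    )


# Hypothetical usage in a notebook (the path is a placeholder):
# placemarks = read_kml_placemarks(spark, "dbfs:/mnt/mydatafolder/kml/places.kml")
# placemarks.printSchema()
```

Nested elements such as `Point` would show up as struct columns, so it is worth checking the inferred schema before saving the result as a table.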

Here is also some plain Python code that does the same thing:

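import re
import xml.etree.ElementTree as ET

import pandas as pd

# storage_mount_name is the DBFS folder that holds the XML files (defined elsewhere)
xmlfiles = dbutils.fs.ls(storage_mount_name)
# ElementTree needs local paths, so map dbfs:/... to the /dbfs/... mount
local_paths = [xf.path.replace('dbfs:', '/dbfs') for xf in xmlfiles]

# Get attribute names (for now, all leaves of the first file's XML structure)
root = ET.parse(local_paths[0]).getroot()
attributes = [node.tag for node in root.iter() if len(node) == 0]
# Strip the "{namespace}" prefix that ElementTree puts in front of each tag
clean_attribute_names = [re.sub(r'\{.*\}', '', a) for a in attributes]

# Create a DataFrame with one row per file
# (assumes every file has the same leaf elements as the first one)
df = pd.DataFrame(columns=clean_attribute_names, index=local_paths)
for afile in local_paths:
    root = ET.parse(afile).getroot()
    df.loc[afile] = [node.text for node in root.iter() if node.tag in attributes]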
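Since the question is about KML specifically, here is a small self-contained sketch of the same ElementTree approach applied to a KML document. It assumes the KML 2.2 namespace and simple `Point` placemarks; the sample document, the `placemarks_to_rows` helper, and the column names are invented for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# KML 2.2 namespace; ElementTree expands every tag to "{namespace}name"
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Made-up sample document, used only for illustration
SAMPLE_KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Site A</name>
      <Point><coordinates>-122.08,37.42,0</coordinates></Point>
    </Placemark>
    <Placemark>
      <name>Site B</name>
      <Point><coordinates>2.29,48.86,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>
"""


def placemarks_to_rows(kml_text):
    """Return one dict per Point placemark: name, longitude, latitude."""
    root = ET.fromstring(kml_text.encode("utf-8"))
    rows = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        coords = pm.findtext(
            "kml:Point/kml:coordinates", default="", namespaces=KML_NS
        )
        if not coords.strip():
            continue  # skip placemarks without a Point geometry
        # KML stores coordinates as "longitude,latitude[,altitude]"
        lon, lat = coords.strip().split(",")[:2]
        rows.append({
            "name": pm.findtext("kml:name", default="", namespaces=KML_NS),
            "longitude": float(lon),
            "latitude": float(lat),
        })
    return rows


rows = placemarks_to_rows(SAMPLE_KML)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`, and from there converted to a Spark DataFrame and saved as a table. Lines and polygons would need their own handling, because their coordinates hold several points.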

[Comments]: