Pandas DataFrame to XML with added data: answers

【Question title】: Pandas DataFrame to XML with added data
【Posted】: 2021-01-18 23:13:30
【Question】:

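I'm having a hard time generating an .xml file from a Pandas DataFrame. I'm using this solution (How do convert a pandas/dataframe to XML?) (sorry, for some reason Stack won't let me link the words to the site), but I'm trying to add an extra field. The original solution works if I don't include the shape parameters, but I really need those values added to the .xml file. I don't understand why I can't call the function with arguments. Besides calling the function, I'm also struggling to write the result out as XML. I searched some other Stack questions and found the code block below, which works, but when I open the .xml file I only get four numbers (30, 1, 67, 44). If I open it in PyCharm, though, I get the "desired" view.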

file_handle = open("output.xml", "w")
Q.writexml(file_handle)
file_handle.close()

Code:

print(image_x.shape)
output: (185, 186, 3)

width = image_x.shape[0]
height = image_x.shape[1]
depth = image_x.shape[2]

def func(row, width, height, depth):
    xml = ['<item>']
    shape = [f'<width>{width}</width>\n<height>{height}</height>\n<depth>{depth}</depth>']
    for field in row.index:
        xml.append('  <{0}>{1}</{0}>'.format(field, row[field]))
    xml.append('</item>')
    xml.append(shape)
    return '\n'.join(xml)

xml_file = func(df, width, height, depth)

df:

   xmin  ymin  xmax  ymax
0    30     1    67    44
1    39   136    73   176

Error:

Traceback (most recent call last):
  File "D:\PyCharmEnvironments\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:/PycharmProjects/Augmentation/random_shit.py", line 100, in <module>
    Q = func(df, width, height, depth)
  File "D:/PycharmProjects/Augmentation/random_shit.py", line 95, in func
    xml.append('  <{0}>{1}</{0}>'.format(field, row[field]))
  File "D:\PyCharmEnvironments\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "D:\PyCharmEnvironments\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 0

Desired output:

<annotations>
  <size>
    <width>185</width>
    <height>186</height>
    <depth>3</depth>
  </size>
  <item>
    <xmin>30</xmin>
    <ymin>1</ymin>
    <xmax>67</xmax>
    <ymax>44</ymax>
  </item>
  <item>
    <xmin>39</xmin>
    <ymin>136</ymin>
    <xmax>73</xmax>
    <ymax>176</ymax>
  </item>
</annotations>

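【Comments】:

It looks like you're passing the whole df to your function. But judging from your link, it should be applied to each row of the df.
@KJDII OK, I think I get it. Is that why there's '\n'.join(df.apply(func, axis=1))? Do I need to create a function within the function?
You can probably just do '\n'.join(df.apply(func, axis=1)). Maybe: xml_file = '\n'.join(df.apply(func, axis=1))
I can use '\n'.join(df.apply(func, axis=1)) and assign it to a variable, like you have. But that only works if I don't include shape. Also, the output .xml isn't saved correctly (as described in my question).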
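The comments point at the likely cause of the KeyError: 0. func was written for a single row, but it received the whole DataFrame, so row.index yielded the row labels (0, 1) and row[0] was then looked up as a column name. Below is a minimal sketch of one way to keep the per-row function and still add the size block once. It assumes the df and shape values shown in the question; the helper names row_to_xml and frame_to_xml are hypothetical. It is still plain string building, so it does no escaping.

```python
import pandas as pd


def row_to_xml(row):
    """Render one DataFrame row (a Series) as an <item> element."""
    lines = ["  <item>"]
    # For a row Series, the index holds the column names (xmin, ymin, ...).
    for field in row.index:
        lines.append(f"    <{field}>{row[field]}</{field}>")
    lines.append("  </item>")
    return "\n".join(lines)


def frame_to_xml(df, width, height, depth):
    """Wrap one <size> block and the per-row items in <annotations>."""
    size = "\n".join([
        "  <size>",
        f"    <width>{width}</width>",
        f"    <height>{height}</height>",
        f"    <depth>{depth}</depth>",
        "  </size>",
    ])
    # apply(..., axis=1) calls row_to_xml once per row and returns a Series of strings.
    items = "\n".join(df.apply(row_to_xml, axis=1))
    return "\n".join(["<annotations>", size, items, "</annotations>"])


# Sample data copied from the question.
df = pd.DataFrame({
    "xmin": [30, 39],
    "ymin": [1, 136],
    "xmax": [67, 73],
    "ymax": [44, 176],
})
xml_text = frame_to_xml(df, 185, 186, 3)
```

Because xml_text is an ordinary string, it can be saved with a plain f.write(xml_text) on a file opened in text mode; writexml is a minidom node method and is not needed here.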

Tags: python xml pandas dataframe

【Solution 1】:

One-liner function:

def func(df, width, height, depth):
    return '<annotations>\n'+f'<width>{width}</width>\n<height>{height}</height>\n<depth>{depth}</depth>\n'+df.apply(lambda row:f'<item>\n<xmin>{row.xmin}</xmin>\n<ymin>{row.ymin}</ymin>\n<xmax>{row.xmax}</xmax>\n<ymax>{row.ymax}</ymax>\n</item>\n',axis=1).str.cat()+'\n</annotations>'

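Strings are concatenated with +, and apply and cat give a map-reduce approach over the DataFrame: apply turns each row into a string equivalent to an <item> tag, and str.cat() joins the rows together. (The input parameter row was also renamed to df.)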

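【Comments】:

Thank you so much. Can you explain what the lambda is doing? I've run into it before but didn't understand the documentation.
Generally speaking, a lambda lets you write a function inline, instead of referencing a named function that you declared elsewhere with def.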
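To make the lambda comment concrete, here is a small sketch comparing a named function with an equivalent inline lambda. Plain dicts stand in for DataFrame rows, and the names rows and row_to_item are hypothetical, chosen only for illustration.

```python
# Plain dicts stand in for DataFrame rows in this illustration.
rows = [
    {"xmin": 30, "ymin": 1, "xmax": 67, "ymax": 44},
    {"xmin": 39, "ymin": 136, "xmax": 73, "ymax": 176},
]


def row_to_item(row):
    """Named function declared with def."""
    return f"<item><xmin>{row['xmin']}</xmin></item>"


# Passing the named function by reference.
with_def = list(map(row_to_item, rows))

# The same logic written inline as an anonymous function.
with_lambda = list(map(lambda row: f"<item><xmin>{row['xmin']}</xmin></item>", rows))
```

Both lists hold the same strings; the lambda simply avoids declaring a separate named function for a one-off expression.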

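【Solution 2】:

Because XML is not just a plain text file, avoid the common pitfalls of building XML with string concatenation. In particular, avoid the solution in the linked post, which may not handle data encoding correctly. Recall that XML stands for Extensible Markup Language, which defines a set of rules for encoding documents.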
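As a small illustration of the encoding concern, the sketch below compares string formatting with ElementTree serialization for a value that contains XML special characters. The label value and the is_well_formed helper are hypothetical and exist only for this demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical value containing characters that are special in XML.
label = "cats & dogs <small>"

# String formatting copies the value verbatim, so the markup is malformed.
concatenated = f"<name>{label}</name>"

# ElementTree escapes the special characters when it serializes the element.
elem = ET.Element("name")
elem.text = label
serialized = ET.tostring(elem, encoding="unicode")


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(concatenated))  # string-built version
print(is_well_formed(serialized))    # ElementTree version
```

Parsing the serialized form back yields the original text, which is the round-trip guarantee that manual concatenation does not provide.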

Therefore, consider using a compliant DOM library, such as Python's built-in etree or the feature-rich third-party lxml:

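import xml.etree.ElementTree as et
# import lxml.etree as et

# image_x is the image array and df the bounding-box DataFrame from the question
root = et.Element("annotations")

# <size> block built from the image shape (same mapping as in the question)
size = et.SubElement(root, "size")
et.SubElement(size, "width").text = str(image_x.shape[0])
et.SubElement(size, "height").text = str(image_x.shape[1])
et.SubElement(size, "depth").text = str(image_x.shape[2])

# one dict per row, e.g. {"xmin": 30, "ymin": 1, "xmax": 67, "ymax": 44}
data = df.to_dict(orient='records')

for d in data:
    item = et.SubElement(root, "item")

    # one child element per column
    for k, v in d.items():
        et.SubElement(item, k).text = str(v)

with open("output.xml", "wb") as f:
    f.write(et.tostring(root, encoding="utf8"))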

To pretty-print the output with line breaks and indentation, use toprettyxml from the built-in minidom. Note: lxml.etree has a pretty_print argument in its tostring call.

from xml.dom.minidom import parseString

# ...same code as above except write output

# serialize the tree built above, then re-parse it with minidom for pretty printing
dom = parseString(et.tostring(root).decode("utf-8"))

with open("output.xml", "wb") as f:
    f.write(dom.toprettyxml(encoding="utf8"))
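As an alternative to the minidom round trip, and not part of the original answer, Python 3.9 and later provide xml.etree.ElementTree.indent, which adds indentation whitespace to a tree in place. A minimal sketch, assuming Python 3.9+ and using the size values from the question as sample data:

```python
import xml.etree.ElementTree as ET

# Build a small tree using sample values from the question.
root = ET.Element("annotations")
size = ET.SubElement(root, "size")
ET.SubElement(size, "width").text = "185"
ET.SubElement(size, "height").text = "186"
ET.SubElement(size, "depth").text = "3"

# indent() modifies the tree in place (available from Python 3.9).
ET.indent(root, space="  ")

# Serialize to a str; pass encoding="utf8" instead to get bytes for a "wb" file.
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

This avoids re-parsing the document, at the cost of requiring a newer Python version.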

【Comments】: