How can I convert XML into a Python object? Answers

【Question title】: How can I convert XML into a Python object?
【Posted】: 2010-09-29 22:19:03
【Question description】:

I need to load an XML file and convert its contents into an object-oriented Python structure. I want to take this:

<main>
    <object1 attr="name">content</object1>
</main>

and turn it into something like this:

main
main.object1 = "content"
main.object1.attr = "name"

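The XML data will have a more complicated structure than this, and I can't hard-code the element names. The attribute names need to be collected at parse time and used as object properties.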

How can I convert XML data into a Python object?

【Comments】:

Tags: python xml

【Solution 1】:

It's worth taking a look at lxml.objectify.

xml = """<main>
<object1 attr="name">content</object1>
<object1 attr="foo">contenbar</object1>
<test>me</test>
</main>"""

from lxml import objectify

main = objectify.fromstring(xml)
main.object1[0]             # content
main.object1[1]             # contenbar
main.object1[0].get("attr") # name
main.test                   # me

Or the other way around, to build XML structures:

item = objectify.Element("item")
item.title = "Best of python"
item.price = 17.98
item.price.set("currency", "EUR")

order = objectify.Element("order")
order.append(item)
order.item.quantity = 3
order.price = sum(item.price * item.quantity for item in order.item)

import lxml.etree
# tostring() returns bytes; decode so Python 3 prints readable text
print(lxml.etree.tostring(order, pretty_print=True).decode())

Output:

<order>
  <item>
    <title>Best of python</title>
    <price currency="EUR">17.98</price>
    <quantity>3</quantity>
  </item>
  <price>53.94</price>
</order>

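【Comments】:

When I run your generation example with lxml version 2.2 beta1, my XML is full of type annotations ("..."). Is there a way to suppress them?
You can use lxml.etree.cleanup_namespaces(order).
You actually want to use both lxml.objectify.deannotate(order) and lxml.etree.cleanup_namespaces(order).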

【Solution 2】:

I've already recommended this more than once today, but try Beautiful Soup (easy_install BeautifulSoup).

from BeautifulSoup import BeautifulSoup

xml = """
<main>
    <object attr="name">content</object>
</main>
"""

soup = BeautifulSoup(xml)
# look in the main node for objects with attr=name; attrs can optionally be matched with a regex
my_objects = soup.main.findAll("object", attrs={'attr':'name'})
for my_object in my_objects:
    # this will print a list of the contents of the tag
    print my_object.contents
    # if only text is inside the tag you can use this
    # print my_object.string

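【Comments】:

main.findAll needs to be soup.findAll, but this helps a bit. It's still not exactly what I wanted, but I think I might know how to make it work. It's going to be used for external py files that will be interpreted by an application, so I could probably remap them before they're executed.
I fixed the error in the code and updated the XML. I had just copied the original code given in the question.
BeautifulSoup (BeautifulStoneSoup) breaks on empty tags such as <element />, for example <icon data="/ig/images/weather/partly_cloudy.gif"/>, and there are a lot of these in XML :(
This should be updated to use BeautifulSoup4. The old version is no longer maintained and is not compatible with Python 3.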

【Solution 3】:

David Mertz's gnosis.xml.objectify seems to do this for you. The documentation is a bit hard to come by, but there are a few IBM articles on it, including this one (text only version).

from gnosis.xml import objectify

xml = "<root><nodes><node>node 1</node><node>node 2</node></nodes></root>"
root = objectify.make_instance(xml)

print root.nodes.node[0].PCDATA # node 1
print root.nodes.node[1].PCDATA # node 2

Creating XML from objects in this way is a different matter, though.

【Comments】:

【Solution 4】:

How about this:

http://evanjones.ca/software/simplexmlparse.html

【Comments】:

【Solution 5】:

# @Stephen:
# "can't hardcode the element names, so I need to collect them
# at parse and use them somehow as the object names."

# I don't think that's possible. Instead, you can do this.
# This will help you get any object by a required name.

import BeautifulSoup


class Coll(object):
    """A class which can hold your Foo class objects
    and retrieve them easily when you want,
    abstracting the storage and retrieval logic.
    """
    def __init__(self):
        self.foos={}        

    def add(self, fooobj):
        self.foos[fooobj.name]=fooobj

    def get(self, name):
        return self.foos[name]

class Foo(object):
    """The required class
    """
    def __init__(self, name, attr1=None, attr2=None):
        self.name=name
        self.attr1=attr1
        self.attr2=attr2

s="""<main>
         <object name="somename">
             <attr name="attr1">value1</attr>
             <attr name="attr2">value2</attr>
         </object>
         <object name="someothername">
             <attr name="attr1">value3</attr>
             <attr name="attr2">value4</attr>
         </object>
     </main>
"""

#

soup=BeautifulSoup.BeautifulSoup(s)


bars=Coll()
for each in soup.findAll('object'):
    bar=Foo(each['name'])
    attrs=each.findAll('attr')
    for attr in attrs:
        setattr(bar, attr['name'], attr.renderContents())
    bars.add(bar)


#retrieve objects by name
print bars.get('somename').__dict__

print '\n\n', bars.get('someothername').__dict__

Output:

{'attr2': 'value2', 'name': u'somename', 'attr1': 'value1'}


{'attr2': 'value4', 'name': u'someothername', 'attr1': 'value3'}
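
For readers on Python 3 without BeautifulSoup installed, the same setattr idea can be sketched with the standard library alone. This is only an illustrative sketch, not the answer author's code: it assumes xml.etree.ElementTree in place of BeautifulSoup, and uses types.SimpleNamespace and a plain dict (named objects here, a hypothetical name) in place of the Foo and Coll classes.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace

s = """<main>
    <object name="somename">
        <attr name="attr1">value1</attr>
        <attr name="attr2">value2</attr>
    </object>
    <object name="someothername">
        <attr name="attr1">value3</attr>
        <attr name="attr2">value4</attr>
    </object>
</main>"""

# Map each object's name to a namespace holding its attributes.
objects = {}
for obj in ET.fromstring(s).findall("object"):
    item = SimpleNamespace(name=obj.get("name"))
    for attr in obj.findall("attr"):
        # The attribute name comes from the XML, not from the code.
        setattr(item, attr.get("name"), attr.text)
    objects[item.name] = item

# Retrieve objects by name.
print(vars(objects["somename"]))
print(objects["someothername"].attr2)
```

Because the attribute names are taken from the data, nothing about attr1 or attr2 is hard-coded; an XML name that is not a valid Python identifier would still be stored, but could then only be read back with getattr().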

【Comments】:

【Solution 6】:

Python has three common XML parsers: xml.dom.minidom, ElementTree, and BeautifulSoup.

IMO, BeautifulSoup is by far the best.

http://www.crummy.com/software/BeautifulSoup/
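
For comparison, here is roughly what the attribute-style access from the question could look like on top of the standard-library ElementTree. This is a minimal sketch under stated assumptions: the Node class is hypothetical (it is not part of any library), repeated child elements are returned as a list, and child elements that happen to be named get or text would be shadowed by the helper members.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical wrapper giving attribute-style access to child elements."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal lookup fails. Refuse private names so a
        # missing _element can never cause infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        children = [Node(child) for child in self._element.findall(name)]
        if not children:
            raise AttributeError(name)
        # One match: return it directly. Several matches: return a list.
        return children[0] if len(children) == 1 else children

    def get(self, attr_name, default=None):
        """Return an XML attribute of this element."""
        return self._element.get(attr_name, default)

    @property
    def text(self):
        """Return the text content of this element."""
        return self._element.text


xml = """<main>
    <object1 attr="name">content</object1>
    <test>me</test>
</main>"""

main = Node(ET.fromstring(xml))
print(main.object1.text)         # content
print(main.object1.get("attr"))  # name
print(main.test.text)            # me
```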

【Comments】:

BeautifulSoup doesn't work well with XML: it has problems with empty tags such as <element/>. That's fine for HTML, where such tags are uncommon.

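【Solution 7】:

If googling for a code generator doesn't turn anything up, you can write your own that takes XML as input and outputs objects in the language of your choice.

It isn't very hard, but the three-step process of parsing the XML, generating code, and compiling/executing the script does make debugging a bit harder.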
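
The three steps might be sketched as follows in Python. This is only an illustration under stated assumptions, not a full generator: generate_source is a hypothetical helper, it handles a single element with no children, and it assumes the tag and attribute names are valid Python identifiers.

```python
import xml.etree.ElementTree as ET


def generate_source(element):
    """Return Python source for a class mirroring one XML element."""
    class_name = element.tag.capitalize()
    lines = [f"class {class_name}:"]
    # The element's text becomes a class attribute named "text".
    lines.append(f"    text = {(element.text or '').strip()!r}")
    # Each XML attribute becomes a class attribute of the same name.
    for key, value in element.attrib.items():
        lines.append(f"    {key} = {value!r}")
    return "\n".join(lines) + "\n"


xml = '<object1 attr="name">content</object1>'

element = ET.fromstring(xml)        # step 1: parse the XML
source = generate_source(element)   # step 2: generate code
namespace = {}
exec(source, namespace)             # step 3: execute the generated code

Object1 = namespace["Object1"]
print(source)
print(Object1.text, Object1.attr)   # content name
```

Printing the generated source, as done here, is one way to ease the debugging difficulty mentioned above. Note that exec should only be used on XML from a trusted source.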

【Comments】:

Can you give an example?