lxml parse xsd file without Schema URL

【Question title】: lxml parse xsd file without Schema URL
【Posted】: 2011-10-06 22:59:57
【Question】:

I am parsing an xsd file with lxml and am looking for an easy way to remove the URL namespace attached to each element name. Here is the xsd file:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element minOccurs="1" maxOccurs="1" name="element1">
          <xs:complexType>
            <xs:all>
              <xs:element name="subelement1" type="xs:string" />
              <xs:element name="subelement2" type="xs:integer" />
              <xs:element name="subelement3" type="xs:dateTime" />
            </xs:all>
            <xs:attribute name="id" type="xs:integer" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:choice>
      <xs:attribute fixed="2.0" name="version" type="xs:decimal" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>

And I am using this code:

from lxml import etree

parser = etree.XMLParser()
# lxml accepts a filename directly, so no file handle is left open
data = etree.parse("testschema.xsd", parser)
root = data.getroot()
rootelement = root.getchildren()[0]
rootelementattribute = rootelement.getchildren()[0].getchildren()[1]
print("root element tags")
print(rootelement[0].tag)
print(rootelementattribute.tag)
elements = rootelement.getchildren()[0].getchildren()[0].getchildren()
elements_attribute = elements[0].getchildren()[0].getchildren()[1]
print("element tags")
print(elements[0].tag)
print(elements_attribute.tag)
subelements = elements[0].getchildren()[0].getchildren()[0].getchildren()
print("subelements")
print(subelements)

I get the following output:

root element tags
{http://www.w3.org/2001/XMLSchema}complexType
{http://www.w3.org/2001/XMLSchema}attribute
element tags
{http://www.w3.org/2001/XMLSchema}element
{http://www.w3.org/2001/XMLSchema}attribute
subelements
[<Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb16e0>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb1780>, <Element {http://www.w3.org/2001/XMLSchema}element at 0x7f2998fb17d0>]

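I don't want "{http://www.w3.org/2001/XMLSchema}" to show up when I extract the tag data (and I can't change the xsd file). I need the xsd tag information because I use it to validate column names in a series of flat files. At the "element" level I pull out multiple elements along with their sub-elements, and I use a dictionary to validate the columns. Any suggestions for improving the code above would also be very helpful, such as using fewer "getchildren" calls or simply making it more organized.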

【Question comments】:

That is not "URL data", it is a namespace.

Tags: python xml xsd lxml

【Solution 1】:

I would use:

print(elem.tag.split('}')[-1])

But you can also use the XPath function local-name():

print(elem.xpath('local-name()'))

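As for fewer getchildren() calls: just leave them out. getchildren() is a deprecated way of building a list of the direct children (if you really want that, use list(elem) instead).

You can iterate over an element, or index it directly. For example, rootelement[0] gives you the first child of rootelement, and it is more efficient than rootelement.getchildren()[0], because the latter, like list(rootelement), first creates a new list.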
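
Putting the two suggestions together, a minimal sketch might look like the following. It assumes Python 3 and inlines the schema from the question as a byte string so that it is self-contained; the helper name local_name and the layout of the columns dictionary are illustrative choices, not part of lxml. Because it only uses .tag, .get(), indexing and iteration, it falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # stdlib ElementTree uses the same "{uri}local" tag notation
    import xml.etree.ElementTree as etree

XSD = b"""<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="rootelement">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <xs:element minOccurs="1" maxOccurs="1" name="element1">
          <xs:complexType>
            <xs:all>
              <xs:element name="subelement1" type="xs:string" />
              <xs:element name="subelement2" type="xs:integer" />
              <xs:element name="subelement3" type="xs:dateTime" />
            </xs:all>
            <xs:attribute name="id" type="xs:integer" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:choice>
      <xs:attribute fixed="2.0" name="version" type="xs:decimal" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def local_name(tag):
    """Return the part of a '{namespace}name' tag after the closing brace."""
    return tag.split('}')[-1]


root = etree.fromstring(XSD)
rootelement = root[0]          # first child of <xs:schema>, no getchildren()
complex_type = rootelement[0]  # <xs:complexType>
choice = complex_type[0]       # <xs:choice>

# Map each element under <xs:choice> to the names of its sub-elements.
columns = {}
for element in choice:
    all_group = element[0][0]  # <xs:complexType>, then <xs:all>
    columns[element.get('name')] = [sub.get('name') for sub in all_group]

print(local_name(complex_type.tag))
print(columns)
```

The resulting dictionary can then be compared against the column names read from a flat file; how that comparison should treat missing or extra columns is left open here.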

【Discussion】:

elem.tag.xpath('local-name()') does not work. It should be elem.xpath('local-name()').
Thanks, this is exactly the answer I was looking for.

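【Solution 2】:

I wonder why etree.XMLParser(ns_clean=True) doesn't do this. It didn't work for me, so I ended up taking the namespace from root.nsmap, wrapping it in braces, and replacing it with an empty string:

print(rootelement[0].tag.replace('{%s}' % root.nsmap['xs'], ''))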
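
As far as I know, ns_clean=True only asks the parser to drop redundant namespace declarations; it does not rename elements, which would explain why the tags still carry the namespace. The replace call above also hard-codes the xs prefix. A hedged variant that tries every URI in the mapping is sketched below. The helper strip_namespaces is hypothetical, and the nsmap dictionary is written out by hand to mirror what root.nsmap holds for the schema in the question, so the snippet runs without parsing anything.

```python
def strip_namespaces(tag, nsmap):
    """Remove a leading '{uri}' from tag if the URI appears in nsmap."""
    for uri in nsmap.values():
        qualifier = '{%s}' % uri
        if tag.startswith(qualifier):
            return tag[len(qualifier):]
    return tag  # unqualified, or qualified with a namespace not in nsmap


# Hand-written stand-in for root.nsmap of the schema in the question.
nsmap = {'xs': 'http://www.w3.org/2001/XMLSchema'}
print(strip_namespaces('{http://www.w3.org/2001/XMLSchema}complexType', nsmap))
```

Using startswith rather than replace means a URI that happens to appear later in the string is left alone.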

【Discussion】:

【Solution 3】:

The simplest thing to do is to use string slicing to remove the namespace prefix:

>>> print(rootelement[0].tag[34:])
complexType

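【Discussion】:

Yes, I considered that, but I am looking for something more elegant that can cope with future changes to the namespace without needing regular expressions, substrings, and the like.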

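【Solution 4】:

If the URI might change in the future (for some unknown reason, or because you are really paranoid), consider the following:

print("root element tags")
tag, nsmap, prefix = rootelement[0].tag, rootelement[0].nsmap, rootelement[0].prefix
tag = tag[len(nsmap[prefix]) + 2:]  # skip '{' + URI + '}'
print(tag)

It's a very unlikely scenario, but who knows?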
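
lxml also provides a small helper class, etree.QName, that splits a qualified tag into its namespace and local name, which avoids computing offsets by hand. The following is only a sketch, assuming a reasonably recent lxml; the split_tag name and the plain-string fallback branch are illustrative additions so that the snippet still runs where lxml is not installed.

```python
try:
    from lxml import etree

    def split_tag(tag):
        """Return (namespace, localname) using lxml's QName."""
        qname = etree.QName(tag)
        return qname.namespace, qname.localname
except ImportError:
    def split_tag(tag):
        """Plain-string fallback giving the same result for '{uri}local' tags."""
        if tag.startswith('{'):
            uri, _, local = tag[1:].partition('}')
            return uri, local
        return None, tag


namespace, name = split_tag('{http://www.w3.org/2001/XMLSchema}attribute')
print(namespace)
print(name)
```

With a parsed element at hand, split_tag(elem.tag) gives the same pair regardless of which URI the schema declares.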

【Discussion】: