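[Posted]: 2020-11-24 19:27:08
[Question]:
I have an XML document like the one below. A few of its tags carry the ce prefix, for example <ce:title>. When I run the code below using XPath, <ce:title> is replaced by <title> in the output. I did see other links on SO, such as "How to preserve namespace information when parsing HTML with lxml?", but I am not sure where or how to add the namespace details.
Can anyone advise? How can I preserve <ce:title> for the XML below?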
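from lxml import html
from lxml.etree import tostring

with open('102277033304.xml', encoding='utf-8') as file_object:
    xml = file_object.read().strip()

# Parsed with the HTML parser; this is the run in which <ce:title> comes out as <title>
root = html.fromstring(xml)
for element in root.xpath('//item/book/pages/*'):
    # Named `serialized` rather than `html` so the imported module is not shadowed
    serialized = tostring(element, encoding='utf-8')
    print(serialized)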
XML:
<item>
  <book>
    <pages>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 1</page-fulltext>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 2</page-fulltext>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
        <page>
          <ce:title>Chapter 2</ce:title>
          <content>Welcome to Chapter 2</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 3</page-fulltext>
    </pages>
  </book>
</item>
[Comments]:
- The "XML" in the question is not really XML, because it has no namespace declaration for the ce prefix (for example, xmlns:ce="http://example.com").
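Following up on that comment, here is a minimal sketch of one way the prefix might be kept: declare the ce namespace on the root element and parse with lxml.etree (the XML parser) instead of lxml.html. The URI http://example.com is only the placeholder from the comment above, the sample document is shortened, and the names CE_NS, serialize_children, parts and titles are hypothetical, introduced just for this illustration.

```python
from lxml import etree

# Assumption: placeholder URI taken from the comment above; use the URI
# that your real data is supposed to declare for the ce prefix.
CE_NS = "http://example.com"

# Shortened sample; the only change from the question's data is the
# xmlns:ce declaration on the root element.
xml = f"""
<item xmlns:ce="{CE_NS}">
  <book>
    <pages>
      <page-info>
        <page>
          <ce:title>Chapter 1</ce:title>
          <content>Welcome to Chapter 1</content>
        </page>
      </page-info>
      <page-fulltext>Published in page 1</page-fulltext>
    </pages>
  </book>
</item>
""".strip()


def serialize_children(root):
    """Serialize every direct child of <pages> to a string (hypothetical helper)."""
    return [
        etree.tostring(element, encoding="unicode")
        for element in root.xpath("//item/book/pages/*")
    ]


# The XML parser keeps namespace prefixes, unlike the HTML parser.
root = etree.fromstring(xml)

parts = serialize_children(root)
for part in parts:
    print(part)

# Prefixed elements can also be addressed directly by mapping the prefix in XPath.
titles = root.xpath("//ce:title/text()", namespaces={"ce": CE_NS})
print(titles)
```

When a sub-element is serialized this way, lxml repeats the xmlns:ce declaration on the serialized fragment's top-level element so that the fragment stays well-formed on its own. If the source file cannot be edited to include the declaration, it would have to be added to the string before parsing; that step is not shown here.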
Tags: python-3.x lxml