Getting the whole content within the tag, including HTML tags

【Question title】: Getting the whole content within the tag including html tags
【Posted】: 2014-02-18 11:21:28
【Question】:

import lxml.html as PARSER

data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""
root = PARSER.fromstring(data)

# The HTML parser lowercases tag names, so <Text> is matched as 'text'
for ele in root.iter():
    if ele.tag == 'text':
        print(ele.text_content())

This is what I get now -> Ducdame was John Cowper Powysother text

But I need the whole content inside the "Text" tag. This is the result I expect:

<![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]>

I tried lxml and BeautifulSoup, but did not get the result I expected. I really need help with this.

Thanks

【Question comments】:

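It does not work because your data is not encoded properly. You cannot use a string containing XML syntax elements as a string inside XML. Encode < as &lt; and > as &gt; and so on, and it will work.
Actually sir, this input comes from the .onx file format, and I don't know how I should parse it, so I tried the lxml library. This is exactly the input I get from the input file.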
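
The entity encoding suggested in the first comment can be sketched with the standard library. This is only an illustration, not lxml: the `markup` string is a shortened stand-in for the real content, and it assumes you control how the `Text` element is written. `xml.sax.saxutils.escape` replaces `&`, `<` and `>`, and the parser turns them back into the original markup when reading the element text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Shortened stand-in for the embedded HTML (an assumption for this sketch)
markup = "<html><body><p>Ducdame was John Cowper Powys</p></body></html>"

# Encode the markup as entities before placing it inside the element
encoded = "<Text>" + escape(markup) + "</Text>"

# On parsing, the entities are decoded again, so .text holds the markup
element = ET.fromstring(encoded)
print(encoded)
print(element.text)
```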

Tags: python

【Solution 1】:

Here is an example with lxml. To find the right tag, use XPath, in this case .//text:

from lxml import html
from lxml import etree

text = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body>  </html>]]></Text>"""

# Parse with the lenient HTML parser; tag names are lowercased
tree = html.fromstring(text)
tags = tree.xpath('.//text')

# Serialise the last matching element, markup included
text_tag = tags[-1]
print(etree.tostring(text_tag, encoding='unicode'))

Output

'<text><p>Ducdame was John Cowper Powys</p><p>other text</p></text>'

If you also need the CDATA, you may find the following post useful: How to output CDATA using ElementTree
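
For the snippet from the question, one possible way to get the CDATA back is sketched below. It swaps in the standard library's `xml.etree.ElementTree` instead of lxml, and it rests on two assumptions: the two top-level elements are wrapped in a made-up `<root>` element so the input is well-formed XML, and the CDATA markers are re-added by hand, because the parser returns the CDATA content as ordinary text.

```python
import xml.etree.ElementTree as ET

data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""

# Assumption: wrap the fragment in a hypothetical <root> so it has one root element
root = ET.fromstring("<root>" + data + "</root>")

# The XML parser hands back the CDATA content as the element's text
inner = root.find("Text").text

# Re-add the markers by hand (only safe if the content does not contain "]]>")
cdata = "<![CDATA[" + inner + "]]>"
print(cdata)
```

Because the content is read with an XML parser rather than an HTML parser, the embedded markup is left untouched, including the nested `<p>` tags.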

【Comments】:

Sir, if possible, could you show me how to get the CDATA for just this example?

【Solution 2】:

The example below uses the minidom module.

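import xml.dom.minidom

data = """<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""

doc = xml.dom.minidom.parseString(data)
text_element = doc.childNodes[0]         # the <Text> element
cdata_node = text_element.childNodes[0]  # its CDATA section
print(cdata_node.toxml())                # serialised with the CDATA markers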

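【Comments】:

Thank you sir, this is exactly what I was looking for. But how should I iterate over the "Text" tags? Suppose my file has two tags. 02
Ducdame was John Cowper Powys

other text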
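
The follow-up about iterating over several "Text" tags could be handled with minidom's `getElementsByTagName`. The sketch below is a guess at the file layout, not the actual file: the `<Records>` wrapper, the two records and their contents are made up for illustration.

```python
import xml.dom.minidom

# Hypothetical input: two records inside a made-up <Records> root element
data = """<Records>
<TextFormat>02</TextFormat>
<Text><![CDATA[<html><body><p>first</p></body></html>]]></Text>
<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>second</p></body></html>]]></Text>
</Records>"""

doc = xml.dom.minidom.parseString(data)

results = []
for text_el in doc.getElementsByTagName("Text"):
    # Keep only CDATA children and serialise them with their markers
    for node in text_el.childNodes:
        if node.nodeType == node.CDATA_SECTION_NODE:
            results.append(node.toxml())

for item in results:
    print(item)
```

Checking `nodeType` instead of taking `childNodes[0]` keeps the loop from tripping over whitespace text nodes that may sit next to the CDATA section.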