【Question title】: Remove all XML tags, only keep the text between the tags
【Posted】: 2012-12-23 23:54:32
【Question】:

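I have an XML file (actually an XML stylesheet). Using Python, I want to strip all the tags from it and keep only the text between the tags.

What is the simplest solution for this? I saw a similar question here: How to remove all html tags from downloaded page

But for some reason that doesn't seem to work in this case. Note that I don't want to keep the quote-delimited text inside the tags - I really want to remove everything that starts with "<" and ends with ">".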

【Comments】:

  • Do you have a sample of the file/document you want to strip?

Tags: python xml


【Solution 1】:

You can use xml.parsers.expat:

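from xml.parsers.expat import ParserCreate

def char_data(data):
    # Called by expat for each chunk of character data between tags
    if data.strip():  # skip whitespace-only text if you want
        print(data)

parser = ParserCreate()
parser.CharacterDataHandler = char_data
parser.Parse(doc, True)  # doc holds the XML document as a string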

Or xml.sax:

from xml.sax import make_parser, handler

class extract_text(handler.ContentHandler):
    def characters(self, data):
        if data.strip():
            print(data)

parser = make_parser()
parser.setContentHandler(extract_text())
parser.feed(doc)
parser.close()  # signal end of input

If it is not well-formed XML, you can also try HTMLParser:

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class extract_text(HTMLParser):
    def handle_data(self, data):
        if data.strip():
            print(data)

parser = extract_text()
parser.feed(doc)
parser.close()  # flush any buffered text
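
If you would rather get the text back as a value than print it, the HTMLParser approach can be wrapped up roughly like this. This is only a sketch for Python 3; the names TextCollector and collect_text are made up for illustration and are not part of the standard library.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects non-blank text chunks instead of printing them."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &amp; before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def collect_text(markup):
    """Return the non-blank text chunks found between tags in markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks
```

Calling collect_text(doc) returns a list of strings that you can join with whatever separator suits you.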

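【Comments】:

  • Thanks - seems like a nice solution. However, this chokes on the first occurrence of the character "&" in the text (not inside a tag), with the following error: xml.parsers.expat.ExpatError: not well-formed (invalid token): line 701, column 778
  • @calvintiger HTMLParser is less strict and can handle &. Give it a try; it may work for your (malformed) XML document. Alternatively, you could try to repair the XML before passing it to a strict parser.
【Solution 2】: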
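
One way to do the "repair it first" option, assuming the only problem is unescaped "&" characters in the text, is to escape them with a regular expression before parsing. This is a heuristic sketch, not a general XML fixer: the helper names are invented for illustration, it will also rewrite "&" inside CDATA sections, and undefined named entities such as &nbsp; will still be rejected by a strict parser.

```python
import re
from xml.etree import ElementTree as ET

# Matches "&" that does NOT start an entity or character reference
# (for example &amp; or &#38; or &#x26;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace stray '&' characters with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)


def text_of(xml_string):
    """Repair stray ampersands, parse strictly, and return the text content."""
    root = ET.fromstring(escape_bare_ampersands(xml_string))
    return "".join(root.itertext())
```

If the document is broken in other ways too, the lenient HTMLParser route is probably the easier one.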

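Use the ElementTree API (or lxml, a faster API-compatible equivalent) to parse the document, then use the ET.tostring(tree, method='text') function to serialize the tree back to its text content: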

>>> from xml.etree import ElementTree as ET
>>> doc='''\
... <?xml-stylesheet href="common.css"?>
... <?xml-stylesheet href="modern.css"
...   title="Modern" media="screen"
...   type="text/css"?>
... <?xml-stylesheet href="classic.css"
...   alternate="yes" title="Classic"
...   media="screen, print" type="text/css"?>
... <ARTICLE>
...   <HEADLINE>Fredrick the Great meets
...     Bach</HEADLINE>
...   <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
...   <PARA>
...     One evening, just as he was
...     getting his
...     <INSTRUMENT>flute</INSTRUMENT>
...     ready and his musicians were
...     assembled, an officer brought him a
...     list of the strangers who had arrived.
...   </PARA>
... </ARTICLE>
... '''
>>> tree = ET.fromstring(doc)
>>> ET.tostring(tree, encoding='unicode', method='text')
'\n  Fredrick the Great meets\n    Bach\n  Johann Nikolaus Forkel\n  \n    One evening, just as he was\n    getting his\n    flute\n    ready and his musicians were\n    assembled, an officer brought him a\n    list of the strangers who had arrived.\n  \n'
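
The result keeps all of the indentation and newlines from the source. If you want a single clean line of text instead, something like the following may do; it is a small sketch, and the function name xml_to_text is just an illustrative choice.

```python
from xml.etree import ElementTree as ET


def xml_to_text(xml_string):
    """Return the text content of an XML document with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    # itertext() yields the text of the element and of all its descendants
    # (including the text that follows child elements) in document order.
    raw = "".join(root.itertext())
    # split() with no arguments drops all runs of whitespace, newlines included.
    return " ".join(raw.split())
```

With the doc above, xml_to_text(doc) gives the headline, author and paragraph as one space-separated string.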

【Comments】:

    【Solution 3】:

    lxml can be problematic; you can do what Martijn Pieters described with ElementTree, or with cElementTree, the C version in the standard library.

    >>> from xml.etree import ElementTree
    >>> doc='''
    ...  <?xml-stylesheet href="common.css"?>
    ...  <?xml-stylesheet href="modern.css"
    ...    title="Modern" media="screen"
    ...    type="text/css"?>
    ...  <?xml-stylesheet href="classic.css"
    ...    alternate="yes" title="Classic"
    ...    media="screen, print" type="text/css"?>
    ...  <ARTICLE>
    ...    <HEADLINE>Fredrick the Great meets
    ...      Bach</HEADLINE>
    ...    <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
    ...    <PARA>
    ...      One evening, just as he was
    ...      getting his
    ...      <INSTRUMENT>flute</INSTRUMENT>
    ...      ready and his musicians were
    ...      assembled, an officer brought him a
    ...      list of the strangers who had arrived.
    ...    </PARA>
    ...  </ARTICLE>
    ...  '''
    
    >>> xml = ElementTree.fromstring(doc)
    >>> xml
    <Element 'ARTICLE' at 0x9295e6c>
    >>> ElementTree.tostring(xml, encoding='unicode', method='text')
    '\n   Fredrick the Great meets\n     Bach\n   Johann Nikolaus Forkel\n   \n     One evening, just as he was\n     getting his\n     flute\n     ready and his musicians were\n     assembled, an officer brought him a\n     list of the strangers who had arrived.\n   \n '
    

    Note that cElementTree is faster and is in the standard library, but I think it has some problems with UTF-8, so if you need UTF-8, use ElementTree.

    【Comments】:
