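Get all text inside a tag in lxml: answers

【Question title】: Get all text inside a tag in lxml
【Posted】: 2011-01-07 09:24:36
【Question description】:

I'd like to write a code snippet that grabs all of the text inside the <content> tag, in lxml, in all three instances below, including the code tags. I've tried tostring(getchildren()), but that misses the text between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?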

<!--1-->
<content>
<div>Text inside tag</div>
</content>
#should return "<div>Text inside tag</div>"

<!--2-->
<content>
Text with no tag
</content>
#should return "Text with no tag"


<!--3-->
<content>
Text outside tag <div>Text inside tag</div>
</content>
#should return "Text outside tag <div>Text inside tag</div>"

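【Comments】:

Thanks - I'm trying to write an RSS feed parser and display everything inside the tag, including the HTML markup from the feed provider.

Tags: python parsing lxml

【Solution 1】:

Just use the node.itertext() method, as in: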

 ''.join(node.itertext())

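【Discussion】:

This works well, but it strips out any tags you might want to keep.
Shouldn't there be a space in the string? Or am I missing something?
@Private That depends on your specific needs. For example, I could use markup such as <word><pre>con</pre>gregate</word> to mark a prefix within a word. Suppose I want to extract the word without the markup. If I use .join with a space, I get "con gregate", whereas without a space I get "congregate".
Although the answer above was accepted, this is what I actually wanted.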
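
The separator trade-off discussed in these comments can be seen in a short sketch. The <word> markup is the one from the comment above; the variable names are only illustrative.

```python
from lxml import etree

# Markup taken from the comment above: a prefix marked up inside a word.
node = etree.fromstring("<word><pre>con</pre>gregate</word>")

# itertext() yields each text chunk in document order, with the tags dropped.
chunks = list(node.itertext())

joined_plain = "".join(chunks)    # chunks glued together exactly as they appeared
joined_spaced = " ".join(chunks)  # a space inserted at every tag boundary

print(chunks)
print(joined_plain)
print(joined_spaced)
```

Which separator is right depends on whether tag boundaries in your data coincide with word boundaries.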

【Solution 2】:

Does text_content() do what you need?

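【Discussion】:

text_content() removes all markup, and the OP wants to keep the tags inside the element.
@benselme Why is it that when I use text_content, I get AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'?
@roger text_content() is only available if your tree is HTML (i.e. if it was parsed with the methods in lxml.html).
@EdSummers Thanks a lot! This is useful when parsing <p> tags. I was missing text (such as nested links) when using text() in XPath, but your method worked for me!
As Louis pointed out, this only works for trees parsed with lxml.html. Arthur Debert's itertext() solution is generic.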
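
To illustrate the point made in these comments, here is a minimal sketch that parses the same markup once with lxml.etree and once with lxml.html. The <p>/<em> sample markup is made up for the example and is not from the question.

```python
from lxml import etree
import lxml.html

markup = "<p>Text outside tag <em>Text inside tag</em></p>"

xml_node = etree.fromstring(markup)                # plain lxml.etree element
html_node = lxml.html.fragment_fromstring(markup)  # lxml.html element

# Only the element produced by lxml.html offers text_content().
print(hasattr(xml_node, "text_content"))
print(hasattr(html_node, "text_content"))

html_text = html_node.text_content()
# itertext() is available on both kinds of element.
xml_text = "".join(xml_node.itertext())

print(html_text)
print(xml_text)
```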

【Solution 3】:

Try:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    parts = ([node.text] +
            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
            [node.tail])
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))

Example:

from lxml import etree
node = etree.fromstring("""<content>
Text outside tag <div>Text <em>inside</em> tag</div>
</content>""")
stringify_children(node)

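Produces: '\nText outside tag <div>Text <em>inside</em> tag</div>\n'

【Discussion】:

@delnan. No need, tostring already handles the recursive case. You made me doubt myself, so I tried it on real code and updated the answer with an example. Thanks for pointing it out.
The code is broken and produces duplicated content: >>> stringify_children(lxmlhtml.fromstring('A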
B
C')) 'A
A
B
B
CC'
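To fix the bug reported by @hoju, add with_tail=False as a parameter to tostring(), i.e. tostring(c, with_tail=False). This fixes the problem with the tail text (C). As for the problem with the prefix text (A), that appears to be a bug in tostring(), which adds the <p> tag, so it is not a bug in the OP's code.
The second bug can be fixed by removing c.text from the parts list. I have submitted a new answer that fixes these bugs.
tostring(c, encoding=str) should be added for this to run on Python 3.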
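
The tail duplication discussed in these comments comes from how tostring() treats tail text. A minimal sketch, using a made-up <content> element with two <br/> children rather than the exact input from the comment above:

```python
from lxml import etree

node = etree.fromstring("<content>A<br/>B<br/>C</content>")
first_br = node[0]

# By default tostring() serializes the element AND the text that follows it.
with_tail = etree.tostring(first_br, encoding="unicode")

# with_tail=False stops at the end of the element itself.
without_tail = etree.tostring(first_br, encoding="unicode", with_tail=False)

print(repr(with_tail))
print(repr(without_tail))
print(repr(first_br.tail))
```

So adding c.tail next to a default tostring(c) emits the tail twice; either pass with_tail=False or drop the explicit c.tail.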

【Solution 4】:

One of the simplest code snippets that actually worked for me, and that follows the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

etree.tostring(html, method="text")

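where html is the node/tag whose complete text you are trying to read. Note that this does not get rid of the contents of script and style tags.

【Discussion】:

This strips the HTML tags.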
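
As the comment says, method="text" drops the markup. A small sketch on the third case from the question; the encoding="unicode" argument is an addition so that the result is a str rather than bytes.

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# method="text" keeps only the text nodes; every tag is dropped.
text_only = etree.tostring(node, method="text", encoding="unicode")

print(text_only)
```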

【Solution 5】:

A version of albertov's stringify-content that fixes the bugs reported by hoju:

def stringify_children(node):
    from lxml.etree import tostring
    from itertools import chain
    return ''.join(
        chunk for chunk in chain(
            (node.text,),
            chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
            (node.tail,)) if chunk)

【Discussion】:

【Solution 6】:

The answers have already been given; this is just a quick improvement. If you want to clean up the text inside:

clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()
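
One thing to watch with this one-liner: whitespace-only chunks (typically the indentation between tags) become empty strings after strip(), and joining them with ' ' leaves runs of spaces. A sketch of a variant that drops the empty chunks first; the sample markup and the name compact_string are made up for illustration.

```python
from lxml import etree

node = etree.fromstring("""<content>
    <div>Text <em>inside</em></div>
    <div>tag</div>
</content>""")

# The one-liner from the answer: empty chunks survive as extra separators.
clean_string = ' '.join([n.strip() for n in node.itertext()]).strip()

# Variant: strip each chunk and skip the ones that end up empty.
compact_string = ' '.join(s for s in (n.strip() for n in node.itertext()) if s)

print(repr(clean_string))
print(repr(compact_string))
```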

【Discussion】:

【Solution 7】:

The following snippet, which uses a Python generator, works perfectly and is very efficient.

''.join(node.itertext()).strip()

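【Discussion】:

If the node was obtained from indented text then, depending on the parser, it will usually contain indentation text, which itertext() interleaves with the normal text snippets. Depending on the actual setup, the following may be useful: ' '.join(node.itertext('span', 'b')) - this uses only the text from <span> and <b> tags, discarding the chunks containing "\n" that come from the indentation.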
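
A sketch of the tag filter mentioned in this comment, assuming a reasonably recent lxml in which itertext() accepts several tag names. The sample markup is made up. Note that the tail text of the matched tags is still yielded by default, so some indentation can slip through; with_tail=False suppresses it.

```python
from lxml import etree

node = etree.fromstring("""<p>
    <span>one</span>
    <b>two</b>
    <i>three</i>
</p>""")

# Only <span> and <b> are visited; their tail text is yielded as well by default.
filtered = list(node.itertext('span', 'b'))

# with_tail=False keeps just the text inside the matched tags.
filtered_no_tail = list(node.itertext('span', 'b', with_tail=False))

print(filtered)
print(filtered_no_tail)
```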

【Solution 8】:

import urllib2
from lxml import etree
url = 'some_url'

Get the URL

test = urllib2.urlopen(url)
page = test.read()

Get all the HTML code, including the table tag

tree = etree.HTML(page)

XPath selector

table = tree.xpath("xpath_here")[0]  # xpath() returns a list; take the first match
res = etree.tostring(table)

res is the HTML code of the table. This worked for me.

So you can extract the tag's content with xpath_text(), and the tag together with its content with tostring()

div = tree.xpath("//div")[0]
div_res = etree.tostring(div)

text = tree.xpath_text("//content")

or text = tree.xpath("//content/text()")

div_3 = tree.xpath("//content")[0]
div_3_res = etree.tostring(div_3).strip('<content>').rstrip('</')

The last line, with the strip method, is not great, but it works

【Discussion】:

For me, this is good enough and certainly much simpler. I know I have a tag (every time) and I can strip it off.
Has xpath_text been removed from lxml? It says AttributeError: 'lxml.etree._Element' object has no attribute 'xpath_text'

【Solution 9】:

Defining stringify_children this way may be less complicated:

from lxml import etree

def stringify_children(node):
    s = node.text
    if s is None:
        s = ''
    for child in node:
        s += etree.tostring(child, encoding='unicode')
    return s

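Or in one line

return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

The rationale is the same as in this answer: leave the serialization of the child nodes to lxml. The tail part of node is not interesting in this case, since it lies "behind" the end tag. Note that the encoding argument can be changed according to your needs.

Another possible solution is to serialize the node itself and then strip away the start and end tags: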

def stringify_children(node):
    s = etree.tostring(node, encoding='unicode', with_tail=False)
    return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

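This is somewhat horrible. This code is correct only if node has no attributes, and I don't think anyone would want to use it even then.

【Discussion】:

node.text if node.text is not None else '' can just be node.text or ''
Playing Lazarus here (resurrection joke... no pun intended), but I have looked at this post many times when I couldn't remember exactly what I had done. Given that node.text only returns the text that is not considered part of the iterator (when iterating directly over the node, which I believe is the same as node.getchildren()), it seems the solution could easily be simplified from here to: ''.join([node.text or ''] + [etree.tostring(e) for e in node])
This one actually works on Python 3, while the top-voted answer doesn't.

【Solution 10】: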

import re
from lxml import etree

node = etree.fromstring("""
<content>Text before inner tag
    <div>Text
        <em>inside</em>
        tag
    </div>
    Text after inner tag
</content>""")

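# Raw-string pattern; encoding="unicode" makes tostring() return str so the regex can match it.
print(re.search(r"\A<[^<>]*>(.*)</[^<>]*>\Z", etree.tostring(node, encoding="unicode"), re.DOTALL).group(1))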

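【Discussion】:

【Solution 11】:

I know this is an old question, but it is a common problem and I have a solution that seems simpler than the ones suggested so far:

import lxml.html

def stringify_children(node):
    """Given an lxml element, return its contents as a string

       >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
       >>> node = lxml.html.fragment_fromstring(html)
       >>> stringify_children(node)
       '<strong>Sample sentence</strong> with tags.'
    """
    if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
        return ""
    node.attrib.clear()  # attributes would change the length of the opening tag
    opening_tag = len(node.tag) + 2
    closing_tag = -(len(node.tag) + 3)
    # encoding='unicode' returns str rather than bytes on Python 3
    return lxml.html.tostring(node, encoding='unicode')[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all the tags contained within the element, and it approaches the problem from a different angle than the other working solutions.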

【Discussion】:

【Solution 12】:

lxml has a method for this:

node.text_content()

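【Discussion】:

This answer doesn't add anything new. It is the same as stackoverflow.com/a/11963661/407651.
The lxml documentation also seems to be wrong: AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'

【Solution 13】:

In response to @Richard's comment above, if you patch stringify_children to read: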

 parts = ([node.text] +
--            list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
++            list(chain(*([tostring(c)] for c in node.getchildren()))) +
           [node.tail])

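it seems to avoid the duplication he was referring to.

【Discussion】:

【Solution 14】:

Here is a working solution. We can get the content together with the parent tag and then cut the parent tag out of the output.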

import re
from lxml import etree

def _tostr_with_tags(parent_element, html_entities=False):
    RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$' 
    content_with_parent = etree.tostring(parent_element)    

    def _replace_html_entities(s):
        RE_ENTITY = r'&#(\d+);'

        def repl(m):
            return unichr(int(m.group(1)))

        replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)

        return replaced

    if not html_entities:
        content_with_parent = _replace_html_entities(content_with_parent)

    content_with_parent = content_with_parent.strip() # remove 'white' characters on margins

    start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

    if start_tag != end_tag:
        raise Exception('Start tag does not match to end tag while getting content with tags.')

    return content_without_parent

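parent_element must be of type Element.

Note that if you want the text content (rather than HTML entities in the text), leave the html_entities parameter as False.

【Discussion】:

【Solution 15】:

If it is an a tag, you can try: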

node.values()

【Discussion】:

This does not get the text inside the tag; it gets the values of the tag's attributes.
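
The difference the comment points out, in a short sketch. The <a> element and its attributes are made up for the example.

```python
from lxml import etree

node = etree.fromstring('<a href="http://lxml.de/" title="docs">lxml site</a>')

attribute_values = node.values()  # values of the attributes, not the text
link_text = node.text             # text directly inside the tag
href = node.get("href")           # a single attribute, looked up by name

print(attribute_values)
print(link_text)
print(href)
```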