How to delete a tag from a node in lxml without its tail? Answers

【Question title】: How to delete a tag from a node in lxml without its tail?
【Posted】: 2017-08-13 10:57:11
【Question description】:

Example:

html = <a><b>Text</b>Text2</a>

BeautifulSoup code:

[x.extract() for x in html.findAll('b')]

The output is:

html = <a>Text2</a>

lxml code:

[bad.getparent().remove(bad) for bad in html.xpath(".//b")]

The output is:

html = <a></a>

This happens because lxml treats "Text2" as the tail of <b></b>, so removing the element removes its tail as well.
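
To make the tail behaviour concrete, here is a minimal sketch (not part of the original post) that inspects where lxml stores each piece of text. It assumes plain lxml.etree rather than lxml.html, so the emptied element is serialized in XML style.

```python
from lxml import etree

root = etree.fromstring("<a><b>Text</b>Text2</a>")
b = root.find("b")

print(repr(b.text))     # 'Text'  -> text inside <b>
print(repr(b.tail))     # 'Text2' -> text after </b>, stored on <b> itself
print(repr(root.text))  # None    -> <a> has no text of its own before <b>

# Removing <b> from its parent takes the tail along with it,
# so "Text2" no longer appears in the serialized parent.
root.remove(b)
result = etree.tostring(root, encoding="unicode")
print(result)
```

Because "Text2" belongs to <b> as its tail rather than to <a>, nothing is left in <a> once <b> is removed.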

If we only need the joined text of the tags, we can use:

for bad in raw.xpath(xpath_search):
    bad.text = ''

But how can we delete the tag without its tail, leaving the remaining text unchanged?

【Comments】:

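Not sure I understood your question correctly, but maybe drop_tag could help?
@phoibos, thanks for your answer, but no: drop_tag only removes the tag itself and keeps the text inside it, whereas we need something else. Given foobar, using drop_tag yields foo bar in the result, but the result needs to be bar.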

Tags: python beautifulsoup html-parsing lxml

【Solution 1】:

While the accepted answer from phlou will work, there are easier ways to remove tags without also removing their tails.

If you want to remove one specific element, the lxml method you are looking for is drop_tree.

From the docs:

Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.

If you want to remove all instances of a specific tag, you can use lxml.etree.strip_elements or lxml.html.etree.strip_elements with with_tail=False.

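Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.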

So, for the example in the original post:

>>> from lxml.html import fragment_fromstring, tostring
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> for bad in html.xpath('.//b'):
...    bad.drop_tree()
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'

or

>>> from lxml.html import fragment_fromstring, tostring, etree
>>>
>>> html = fragment_fromstring('<a><b>Text</b>Text2</a>')
>>> etree.strip_elements(html, 'b', with_tail=False)
>>> tostring(html, encoding="unicode")
'<a>Text2</a>'

【Comments】:

According to the help for drop_tag, it also keeps the text. The result would be: TextText2
@TMikonos Thanks, I meant to type drop_tree rather than drop_tag. I have updated the example.

【Solution 2】:

Edit:

Please see @Joshmakers' answer https://stackoverflow.com/a/47946748/8055036, which is clearly the better answer.

I do the following to preserve the tail text by moving it to the previous sibling or, if there is none, to the parent.

def remove_keeping_tail(self, element):
    """Save the tail text and then delete the element"""
    self._preserve_tail_before_delete(element)
    element.getparent().remove(element)

def _preserve_tail_before_delete(self, node):
    if node.tail: # preserve the tail
        previous = node.getprevious()
        if previous is not None: # if there is a previous sibling it will get the tail
            if previous.tail is None:
                previous.tail = node.tail
            else:
                previous.tail = previous.tail + node.tail
        else: # The parent get the tail as text
            parent = node.getparent()
            if parent.text is None:
                parent.text = node.tail
            else:
                parent.text = parent.text + node.tail
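
The two methods above take self, so they are presumably part of a class that is not shown. As a rough standalone sketch of the same idea, assuming plain lxml.etree and a module-level function (the function layout here is an adaptation for illustration, not the answer author's exact code), it could be applied to the question's example like this:

```python
from lxml import etree


def remove_keeping_tail(element):
    """Remove element from its parent, handing its tail to a neighbour first."""
    parent = element.getparent()
    if element.tail:
        previous = element.getprevious()
        if previous is not None:
            # A previous sibling exists: append the tail to that sibling's tail.
            previous.tail = (previous.tail or "") + element.tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + element.tail
    parent.remove(element)


root = etree.fromstring("<a><b>Text</b>Text2</a>")
for bad in root.xpath(".//b"):
    remove_keeping_tail(bad)

result = etree.tostring(root, encoding="unicode")
print(result)
```

This mirrors what drop_tree does for lxml.html elements, which is why it is mainly useful when working with plain lxml.etree elements that do not offer drop_tree.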

HTH

【Comments】: