【Question Title】: Python XPath: how to remove empty tags but keep sibling trailing text?
【Posted】: 2020-12-15 21:39:40
【Question】:
<div>
   1
   <br/>
   5
   <p> </p>
   2
</div>

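Suppose I need to remove empty tags. In this example, the empty tag is <p> </p>. I wrote the function below to do this, but it also removes the 2 that follows the <p> tag. How can I keep that text?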


import re

from lxml import etree


def reformat_article(text):
    tree = etree.fromstring(text, parser=etree.HTMLParser(encoding='utf-8'))
    # etree.strip_attributes(tree, 'style')
    etree.strip_tags(tree, 'span', 'font')

    for script in tree.xpath('//script'):
        script.getparent().remove(script)

    for empty in tree.xpath('//*[text() and not(*)]'):
        if re.match(r'^\s+$', ''.join(empty.xpath('./text()'))):
            empty.getparent().remove(empty)

    for empty in tree.xpath('//*[not(self::br) and not(*) and not(normalize-space()) and not(self::text())]'):
        empty.getparent().remove(empty)

    for align in tree.xpath('//*[text()]'):
        s_s = re.compile(r'\s{20,}')
        for line in align.xpath('./text()'):
            if s_s.search(line):
                align.attrib['align'] = 'right'

    text = etree.tostring(tree, encoding='utf-8').decode()
    return text

【Comments】:

Tags: python xpath lxml


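【Solution 1】:

lxml stores the text that follows an element in that element's tail, so removing the element discards the tail as well. To remove an element while keeping its tail text, use the following function: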

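def remove_element(el):
    """Remove el from its parent but keep any non-blank tail text."""
    parent = el.getparent()
    tail = el.tail
    # Only a tail with non-whitespace content is preserved.
    if tail is not None and tail.strip():
        prev = el.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail.
            prev.tail = (prev.tail or '') + tail
        else:
            # No previous sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or '') + tail
    parent.remove(el)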
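
For context, here is a minimal sketch of how lxml lays out text and tail for the markup in the question, and what a plain remove() does to it. The variable names are illustrative only; the markup is the question's example with the whitespace between tags dropped.

```python
from lxml import etree

div = etree.fromstring('<div>1<br/>5<p> </p>2</div>')
br, p = div  # the two child elements of <div>

print(repr(div.text))  # text before the first child: '1'
print(repr(br.tail))   # text after <br/>: '5'
print(repr(p.tail))    # text after </p>: '2'

# A plain remove() drops the element together with its tail,
# which is why the 2 disappears.
div.remove(p)
print(etree.tostring(div, encoding='unicode'))  # <div>1<br/>5</div>
```

This is why remove_element re-attaches the tail to the previous sibling (or to the parent's text) before calling remove().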

I tested remove_element as follows:

from lxml import etree as et

parser = et.XMLParser(remove_blank_text=True)
txt = '<div>1<br/>5<p> </p>2</div>'
tree = et.XML(txt, parser)
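# Select leaf elements (other than <br>) that contain only whitespace.
# The looser test '//*[text() and not(*)]' would also match non-empty leaves.
for emp in tree.xpath('//*[not(*) and not(self::br) and not(normalize-space())]'):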
    remove_element(emp)
print(et.tostring(tree, method='xml', encoding='unicode',
    pretty_print=True).strip())

The result I got:

<div>1<br/>52</div>
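
The test above uses the XML parser, while the question parses HTML. The following is a rough sketch of how the same idea might be applied with etree.HTMLParser. The names remove_keep_tail and strip_empty_tags are hypothetical, the list of excluded void elements (br, img, hr, input) is an assumption to adjust for your documents, and unlike remove_element this variant also keeps whitespace-only tails.

```python
from lxml import etree


def remove_keep_tail(el):
    """Remove el from its parent, re-attaching its tail text (hypothetical helper)."""
    parent = el.getparent()
    if el.tail:
        prev = el.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)


def strip_empty_tags(html):
    """Drop whitespace-only leaf elements from an HTML string, keeping trailing text."""
    tree = etree.fromstring(html, parser=etree.HTMLParser())
    # Leaf elements with no visible text; void elements that are
    # legitimately empty are excluded (assumed list, extend as needed).
    xpath = ('//body//*[not(*) and not(normalize-space())'
             ' and not(self::br or self::img or self::hr or self::input)]')
    # xpath() returns a list, so removing elements while looping is safe.
    for el in tree.xpath(xpath):
        remove_keep_tail(el)
    return etree.tostring(tree, encoding='unicode', method='html')


result = strip_empty_tags('<div>1<br/>5<p> </p>2</div>')
print(result)
```

With the question's input, the <p> element is gone and the 2 ends up directly after the 5. Note that this is a single pass: a parent that becomes empty only after its children are removed is left in place, so rerun the loop if cascading removal is needed.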

【Comments】:
