【Posted】: 2020-12-15 21:39:40
【Question】:
<div>
1
<br/>
5
<p> </p>
2
</div>
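Suppose I need to remove empty tags. In this example the empty tag is <p> </p>. I wrote the function below to do that for me, but it also removes the 2 that follows the <p> tag. What should I do?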
import re

from lxml import etree


def reformat_article(text):
    tree = etree.fromstring(text, parser=etree.HTMLParser(encoding='utf-8'))
    # etree.strip_attributes(tree, 'style')
    etree.strip_tags(tree, 'span', 'font')
    # Drop every <script> element.
    for script in tree.xpath('//script'):
        script.getparent().remove(script)
    # Drop leaf elements whose text is whitespace only.
    for empty in tree.xpath('//*[text() and not(*)]'):
        if re.match(r'^\s+$', ''.join(empty.xpath('./text()'))):
            empty.getparent().remove(empty)
    # Drop the remaining empty leaf elements, except <br>.
    for empty in tree.xpath('//*[not(self::br) and not(*) and not(normalize-space()) and not(self::text())]'):
        empty.getparent().remove(empty)
    # Right-align elements whose text contains a run of 20 or more whitespace characters.
    for align in tree.xpath('//*[text()]'):
        s_s = re.compile(r'\s{20,}')
        for line in align.xpath('./text()'):
            if s_s.search(line):
                align.attrib['align'] = 'right'
    text = etree.tostring(tree, encoding='utf-8').decode()
    return text
【Comments】:
-
The trailing text after <p> is the element's tail. See stackoverflow.com/a/47946748/407651
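
To illustrate the comment above: in lxml, the text that follows an element's closing tag is stored on that element as its `tail`, so `getparent().remove(element)` discards it together with the element. One possible workaround is to move the tail onto the previous sibling (or onto the parent's `text` when there is no previous sibling) before removing. The following is a minimal sketch under that assumption; `remove_preserving_tail` and `drop_empty_elements` are hypothetical helper names introduced here, not part of lxml, and the XPath is a simplified version of the one in the question.

```python
from lxml import etree


def remove_preserving_tail(element):
    """Remove `element` from its parent but keep the text that follows it."""
    parent = element.getparent()
    if parent is None:
        # The root element has no parent, so there is nothing to detach from.
        return
    tail = element.tail
    if tail:
        previous = element.getprevious()
        if previous is not None:
            # The trailing text now follows the preceding sibling instead.
            previous.tail = (previous.tail or '') + tail
        else:
            # No preceding sibling: the trailing text becomes part of the parent's text.
            parent.text = (parent.text or '') + tail
    parent.remove(element)


def drop_empty_elements(html):
    """Hypothetical helper: remove empty leaf elements (except <br>), keeping tails."""
    tree = etree.fromstring(html, parser=etree.HTMLParser())
    xpath = '//*[not(self::br) and not(*) and not(normalize-space())]'
    for empty in tree.xpath(xpath):
        remove_preserving_tail(empty)
    return etree.tostring(tree, encoding='unicode')


sample = '<div>\n1\n<br/>\n5\n<p> </p>\n2\n</div>'
print(drop_empty_elements(sample))
```

With the sample from the question, the `<p> </p>` element should disappear while the `2` stays inside the `<div>`, because it is appended to the tail of the preceding `<br/>`. The same helper could replace the two `empty.getparent().remove(empty)` calls in `reformat_article`.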