【Question title】:How do I remove tags in between other HTML tags using Beautiful Soup
【Posted】:2015-04-29 18:32:03
【Question description】:
【Discussion】:
Tags:
python
python-3.x
beautifulsoup
【Solution 1】:
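from bs4 import BeautifulSoup  # Beautiful Soup 4; the old BeautifulSoup 3 package is Python 2 only

VALID_TAGS = ['td']

def sanitize_html(value):
    soup = BeautifulSoup(value, 'html.parser')
    # True matches every tag in the document.
    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            # A hidden tag is not rendered, but its children still are.
            tag.hidden = True
    return str(soup)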
This drops the invalid tags themselves but keeps their contents.
See also: Python HTML sanitizer / scrubber / filter.
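The same keep-the-contents behaviour can also be sketched with Beautiful Soup 4's documented unwrap() method, which replaces a tag with its children. The function name strip_disallowed_tags and its allowed parameter are illustrative only and do not come from the answer above; html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup


def strip_disallowed_tags(html, allowed=("td",)):
    """Remove every tag not listed in `allowed`, keeping its contents.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of all tags, so the tree can be
    # modified safely while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()  # drop the tag, keep its children in place
    return str(soup)


cleaned = strip_disallowed_tags("<td><b>Hello</b> <i>world</i></td>")
print(cleaned)
```

For this input the <b> and <i> wrappers are removed, while the <td> and its text are left intact.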
【Solution 2】:
You can simply use getText() to get a tag's text content without its child tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<td><script class="blah">a</script>baba<script id="blahhhh">b</script></td>')
td = soup.td
#update content of <td> to concatenation of all inner text nodes
td.string = td.getText()
print(soup)
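Output (the <html><body> wrapper is added by the lxml parser):
<html><body><td>ababab</td></body></html>
Note that recent Beautiful Soup 4 releases no longer treat the contents of <script> as text when getText() is called on a parent tag, so with a newer version the <td> contains only baba.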
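If the child tags should disappear together with their contents, which is often what is wanted for <script>, a different technique is decompose(). The sketch below is an alternative to the getText() approach rather than a restatement of it, and it assumes html.parser so that no <html><body> wrapper is added.

```python
from bs4 import BeautifulSoup

html = '<td><script class="blah">a</script>baba<script id="blahhhh">b</script></td>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so removing tags inside the loop is safe.
for script in soup.td.find_all("script"):
    script.decompose()  # remove the tag and everything inside it

result = str(soup)
print(result)
```

Here only the text that sat directly inside the <td> survives, regardless of how the installed version treats script contents in getText().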