How to add space around removed tags in BeautifulSoup: answers

[Question title]: How to add space around removed tags in BeautifulSoup
[Posted]: 2015-09-17 08:38:57
[Question description]:

from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)

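I have this sample code, but I can't find how to add spaces around the removed tags, so that the text inside <a href...> stays readable once it is extracted and does not come out like this:

Poem The RavenOnce upon a midnight dreary, while I pondered, weak and weary...

In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace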

[Comments]:

Well, your original HTML contains link text that runs directly into the adjacent words.

Tags: python html beautifulsoup html-parsing

[Solution 1]:

get_text() in beautifulsoup4 takes an optional argument called separator. You can use it as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
text = soup.get_text(separator=' ')
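
To make the effect of separator concrete, here is a minimal, self-contained sketch using the first div from the question. It assumes beautifulsoup4 is installed and uses the built-in "html.parser". The strip=True argument is an addition not mentioned in the answer; it trims each text node so the result has no doubled or trailing spaces.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "thisText"})

# separator=" " places a space between adjacent text nodes;
# strip=True trims each node and drops whitespace-only ones.
text = div.get_text(separator=" ", strip=True)
print(text)
```

Without strip=True the separator is still inserted, but the whitespace already present in the HTML is kept, so "Poem " followed by the link text ends up with two spaces in between.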

[Comments]:

[Solution 2]:

One option is to find all text nodes and join them with a space:

" ".join(item.strip() for item in poems.find_all(text=True))
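
As a runnable sketch of the same idea, the snippet below applies the join to the second div from the question. It assumes beautifulsoup4 and uses find_all(string=True), the spelling that newer bs4 releases prefer over text=True. Filtering out empty strings is an extra step, added here so whitespace-only nodes do not leave stray spaces.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
poem = soup.find("div", {"class": "thisText"})

# Collect every text node, trim it, and skip nodes that are only
# whitespace so the join does not leave a trailing space.
parts = (node.strip() for node in poem.find_all(string=True))
joined = " ".join(part for part in parts if part)
print(joined)
```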

Also, you are using the beautifulsoup3 package, which is outdated and no longer maintained. Upgrade to beautifulsoup4:

pip install beautifulsoup4

and replace:

from BeautifulSoup import BeautifulSoup

with:

from bs4 import BeautifulSoup

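[Comments]:

Upgrade to BS4. It is better: simpler and faster.
bs4 alone did not solve this for me; I had to add the separator argument, as recommended in the answer that uses get_text().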

[Solution 3]:

Here is an alternative that uses lxml and its xpath function to search for all text nodes:

from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())
print(' '.join(root.xpath("//text()")))

It yields:

Poem  The Raven Once upon a midnight dreary, while I pondered, weak and weary...  


In the greenest of our valleys By good angels tenanted..., part of The Haunted Palace
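
The output above still carries the doubled spaces and blank lines from the source HTML. One possible way to tidy it, sketched here as an addition to the answer rather than part of it, is to collapse every run of whitespace with str.split() after joining. The example assumes lxml is installed and uses only the first div from the question.

```python
from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>'''

root = etree.fromstring(html, etree.HTMLParser())

# //text() returns every text node in document order.
texts = root.xpath("//text()")

# Join with spaces, then split on any whitespace and re-join so that
# newlines and repeated spaces collapse into single spaces.
normalized = " ".join(" ".join(texts).split())
print(normalized)
```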

[Comments]: