【Question title】: How to add space around removed tags in BeautifulSoup
【Posted】: 2015-09-17 08:38:57
【Question description】:
from BeautifulSoup import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''


soup = BeautifulSoup(html)
all_poems = soup.findAll("div", {"class": "thisText"})
for poems in all_poems:
    print(poems.text)

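I have this sample code, but I can't find how to add spaces around the removed tags, so that the text inside <a href...> stays readable once the tags are stripped and does not run into the neighbouring words like this:

Poem The RavenOnce upon a midnight dreary, while I pondered, weak and weary...

In the greenest of our valleys By good angels tenanted..., part ofThe Haunted Palace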

【Comments】:

  • Uh, your original HTML contains link text that runs together with the adjacent words.

Tags: python html beautifulsoup html-parsing


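【Solution 1】:

In beautifulsoup4, get_text() accepts an optional argument named separator. The separator string is placed between the text fragments when they are joined. You can use it as follows: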

soup = BeautifulSoup(html)
text = soup.get_text(separator=' ')
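
As a minimal sketch of how this might be applied to each poem separately, the snippet below calls get_text() on every matching div instead of on the whole document. It assumes beautifulsoup4 is installed and uses the built-in "html.parser" backend. The strip=True argument and the poem_texts name are additions for illustration and not part of the answer above. strip=True trims each text fragment and drops whitespace-only ones, so the joined result has single spaces.

```python
from bs4 import BeautifulSoup

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")

# separator=" " puts a space between adjacent text fragments;
# strip=True trims each fragment and skips fragments that are only whitespace.
poem_texts = [
    div.get_text(separator=" ", strip=True)
    for div in soup.find_all("div", class_="thisText")
]

for text in poem_texts:
    print(text)
```

Without strip=True the separator is still inserted, but the newlines and trailing spaces from the original markup are kept, so some gaps may contain more than one whitespace character.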

【Comments】:

    【Solution 2】:

    One option is to find all the text nodes and join them with spaces:

    " ".join(item.strip() for item in poems.find_all(text=True))
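
One detail of the one-liner above: whitespace-only text nodes become empty strings after strip(), and joining them can leave doubled spaces. The sketch below filters those out. It assumes beautifulsoup4 version 4.4 or later, where string=True is the current spelling of text=True. The helper name join_text_nodes and the short sample markup are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup


def join_text_nodes(tag, separator=" "):
    """Join the stripped text nodes under `tag`, skipping empty ones."""
    # find_all(string=True) returns every text node below the tag.
    pieces = (node.strip() for node in tag.find_all(string=True))
    return separator.join(piece for piece in pieces if piece)


# Shortened sample markup; the trailing newline is a whitespace-only node.
soup = BeautifulSoup(
    '<div>part of<a href="#">The Haunted Palace</a>\n</div>', "html.parser"
)
result = join_text_nodes(soup.div)
print(result)
```

In bs4 the expression " ".join(tag.stripped_strings) should give the same result with less code.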
    

    Also, you are using the beautifulsoup3 package, which is outdated and no longer maintained. Upgrade to beautifulsoup4:

    pip install beautifulsoup4
    

    and replace:

    from BeautifulSoup import BeautifulSoup
    

    with:

    from bs4 import BeautifulSoup
    

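    【Comments】:

    • Upgrading to BS4 is better: simpler and faster.
    • bs4 alone did not solve this for me; I had to add the separator argument, as the other answer recommends.
    【Solution 3】:

    Here is an alternative that uses lxml and its xpath function to search for all the text nodes: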

    from lxml import etree
    
    html = '''<div class="thisText">
    Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>
    
    <div class="thisText">
    In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
    </div>'''
    
    root = etree.fromstring(html, etree.HTMLParser())
    print(' '.join(root.xpath("//text()")))
    

    It produces:

    Poem  The Raven Once upon a midnight dreary, while I pondered, weak and weary...  
    
    
    In the greenest of our valleys By good angels tenanted..., part of The Haunted Palace
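
The output above keeps the newlines and doubled spaces from the source markup. A possible refinement, sketched here assuming lxml is installed, is to run the XPath query once per div and collapse whitespace with str.split(). The per-div loop and the poem_texts name are illustrative additions, not part of the answer.

```python
from lxml import etree

html = '''<div class="thisText">
Poem <a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Raven</a>Once upon a midnight dreary, while I pondered, weak and weary... </div>

<div class="thisText">
In the greenest of our valleys By good angels tenanted..., part of<a href="http://famouspoetsandpoems.com/poets/edgar_allan_poe/poems/18848">The Haunted Palace</a>
</div>'''

root = etree.fromstring(html, etree.HTMLParser())

poem_texts = []
for div in root.xpath('//div[@class="thisText"]'):
    # ".//text()" selects only the text nodes below this div.
    raw = " ".join(div.xpath(".//text()"))
    # split() with no arguments breaks on any run of whitespace,
    # so re-joining with one space normalizes newlines and double spaces.
    poem_texts.append(" ".join(raw.split()))

for text in poem_texts:
    print(text)
```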
    

    【Comments】:
