Parsing <br> tags with beautifulsoup: answers

【Question title】: Parsing <br> tags with beautifulsoup
【Posted】: 2017-09-24 11:46:06
【Question description】:

I am scraping a website, and the markup is structured like this:

<div class="content">
    <p>
        "C Space"
        <br>
        "802 white avenue"
        <br>
        "xyz 123"
        <br>
        "Lima"
    </p>
</div>

When I extract the text with beautifulsoup, using the following commands:

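from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("something")
bsObj = BeautifulSoup(html, "html5lib")
# Find the first <div class="content"> and print all of its text
templist = bsObj.find("div", {"class": "content"})
print(templist.get_text())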

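I get the following output: C Space802 white avenuexyz 123Lima

whereas I want the output to be: C Space 802 white avenue xyz 123 Lima

How can I add an extra space between the pieces of text that are separated by br tags?

Thanks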

【Question discussion】:

Tags: html web-scraping beautifulsoup tags web-crawler

【Solution 1】:

You can use split and join here:

>>> ' '.join(templist.get_text().split())
'"C Space" "802 white avenue" "xyz 123" "Lima"'
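
One caveat with this approach, shown as a minimal sketch: split and join can only normalize whitespace that is already present in the text nodes. The two sample strings below are hypothetical (modeled on the question, without the literal quotes), and "html.parser" is used instead of html5lib only so the snippet needs nothing beyond bs4. If the page has no whitespace around the br tags, which is what the question's output suggests, the pieces stay glued together.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: text nodes are separated by newlines and <br>
HTML_WITH_NEWLINES = """
<div class="content">
    <p>
        C Space
        <br>
        802 white avenue
        <br>
        xyz 123
        <br>
        Lima
    </p>
</div>
"""

# Hypothetical sample markup: same content, but no whitespace around <br>
HTML_COMPACT = (
    '<div class="content"><p>C Space<br>802 white avenue'
    "<br>xyz 123<br>Lima</p></div>"
)


def normalized_text(html):
    """Return the text of div.content with runs of whitespace collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", {"class": "content"})
    # split() with no arguments splits on any whitespace run
    return " ".join(block.get_text().split())


spaced = normalized_text(HTML_WITH_NEWLINES)
compact = normalized_text(HTML_COMPACT)
print(spaced)   # C Space 802 white avenue xyz 123 Lima
print(compact)  # C Space802 white avenuexyz 123Lima
```

So this works when the source contains line breaks or spaces next to each br tag, but not when the strings sit directly against the tags.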

【Discussion】:

【Solution 2】:

You can use the separator and strip arguments of .get_text():

In [4]: elm = soup.select_one(".content")

In [5]: print(elm.get_text(strip=True, separator=" "))
"C Space" "802 white avenue" "xyz 123" "Lima"

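For anyone who wants to try this without fetching a page, here is a self-contained sketch. The markup is a hypothetical compact version of the question's HTML (no whitespace around the br tags, no literal quotes), and "html.parser" is an assumption made only to avoid the html5lib dependency. Because the separator is inserted between every text node, this also handles the case where the source has no whitespace at all. The stripped_strings generator is a related option if you would rather have the pieces as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with no whitespace around the <br> tags
HTML_COMPACT = (
    '<div class="content"><p>C Space<br>802 white avenue'
    "<br>xyz 123<br>Lima</p></div>"
)

soup = BeautifulSoup(HTML_COMPACT, "html.parser")
elm = soup.select_one(".content")

# separator is placed between text nodes; strip=True trims each node
# and drops nodes that are whitespace only
joined = elm.get_text(separator=" ", strip=True)
print(joined)  # C Space 802 white avenue xyz 123 Lima

# The same pieces as individual strings
pieces = list(elm.stripped_strings)
print(pieces)  # ['C Space', '802 white avenue', 'xyz 123', 'Lima']
```
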
【Discussion】: