Extracting a string in a div tag with BeautifulSoup [duplicate]: answers

【Question title】: BeautifulSoup Extract String in div tag [duplicate]
【Posted】: 2021-01-29 14:07:49
【Question description】:

I have the following HTML:

<div class="interesting"><span>a</span>&nbsp;&nbsp;<span>b</span>&nbsp;&nbsp;c</div><div>d</div>

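I am trying to use BeautifulSoup to extract the string c.

However, soup.div.string is None. I could call get_text() to get a b c and then parse that text again, but that feels like it defeats the purpose of using BeautifulSoup.

Any suggestions?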

======================

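Update:

I added &nbsp;&nbsp; to the example string above because I noticed that it is actually what makes soup.div.find(text=True, recursive=False) fail to return the text in the div. So this question is no longer a duplicate.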

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="interesting"><span>a</span>&nbsp;&nbsp;<span>b</span>&nbsp;&nbsp;c</div><div>d</div>', 'html.parser')
div = soup.find('div', class_='interesting')
print(div.find_all_next(text=True)[-1])

The code above prints d.

【Question discussion】:

Tags: html python-3.x beautifulsoup

【Solution 1】:

This should help you:

div = soup.find('div', class_="interesting")

# Print the last text node inside the div tag
print(div.find_all(text=True)[-1].strip())

Output:

c

Here is the complete code:

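from bs4 import BeautifulSoup

html = '<div class="interesting"><span>a</span>&nbsp;&nbsp;<span>b</span>&nbsp;&nbsp;c</div><div>d</div>'

# html5lib is a separate package and must be installed (pip install html5lib)
soup = BeautifulSoup(html, 'html5lib')

# Select only the div of interest
div = soup.find('div', class_="interesting")

# find_all searches only inside this div; take its last text node
print(div.find_all(text=True)[-1].strip())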
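
If the div might also contain nested tags whose text should be ignored, one option is to look only at the div's own direct text children. The sketch below is an illustration rather than part of the answer above: the helper name last_direct_text is made up here, it uses string=True (the newer spelling of text=True in recent bs4 releases), and it assumes the wanted text sits directly inside the div rather than inside a child tag.

```python
from bs4 import BeautifulSoup

html = '<div class="interesting"><span>a</span>&nbsp;&nbsp;<span>b</span>&nbsp;&nbsp;c</div><div>d</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='interesting')


def last_direct_text(tag):
    """Hypothetical helper: return the last non-blank text node that is a
    direct child of `tag`, or None if there is none."""
    # recursive=False restricts the search to direct children of the tag
    direct_strings = tag.find_all(string=True, recursive=False)
    # str.strip() also removes the non-breaking spaces produced by &nbsp;
    non_blank = [s.strip() for s in direct_strings if s.strip()]
    return non_blank[-1] if non_blank else None


print(last_direct_text(div))  # prints: c
```

Filtering out blank strings matters here because the first direct text child of the div is just the two non-breaking spaces between the spans, which is why find(text=True, recursive=False) appears to return nothing.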

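【Discussion】:

My example was oversimplified: the HTML string actually contains several sibling div tags. I locate the one I'm interested in with soup.find('div', class_='interesting'). But if I chain find_all_next(text=True)[-1] onto it, it jumps to the last div in the HTML string and pulls out that text.
Oh... then please provide the actual HTML.
The full HTML is 61 KB, which is a bit too large. I trimmed it down and updated my question.
OK... see my latest edit.
Interesting. With the html5lib parser, it works. html.parser behaves differently. Thanks!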
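
For anyone comparing the two snippets in this thread: on this sample, the different results appear to come from the method being called rather than from the parser. find_all_next walks everything that follows the div in the document, including later sibling divs, while find_all only looks inside the div. A minimal sketch using html.parser and the sample HTML from the question (the variable names inside and after are chosen here for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="interesting"><span>a</span>&nbsp;&nbsp;<span>b</span>&nbsp;&nbsp;c</div><div>d</div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='interesting')

# Text nodes that are descendants of the selected div only
inside = div.find_all(string=True)

# Text nodes anywhere after the div's start tag, including later divs
after = div.find_all_next(string=True)

print(inside[-1].strip())  # prints: c
print(after[-1].strip())   # prints: d
```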