Get contents of div by id with BeautifulSoup: answers

【Question title】: Get contents of div by id with BeautifulSoup
【Posted】: 2014-10-26 04:56:28
【Question】:

I am using Python 2.7.6, urllib2, and BeautifulSoup

to fetch HTML from a website and store it in a variable.

How can I use BeautifulSoup to show only the HTML inside the div with a given id?

<div id='theDiv'>
<p>div content</p>
<p>div stuff</p>
<p>div thing</p>
</div>

would become

<p>div content</p>
<p>div stuff</p>
<p>div thing</p>

【Question comments】:

Tags: python html python-2.7 beautifulsoup html-parsing

【Solution 1】:

Join the elements of the div tag's .contents:

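from bs4 import BeautifulSoup

data = """
<div id='theDiv'>
    <p>div content</p>
    <p>div stuff</p>
    <p>div thing</p>
</div>
"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', id='theDiv')
# .contents is a list of the div's children; convert each one to a string and join them.
# The parenthesized print works on both Python 2.7 and Python 3.
print(''.join(map(str, div.contents)))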

Prints (apart from the whitespace carried over from the markup):

<p>div content</p>
<p>div stuff</p>
<p>div thing</p>
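
As a rough illustration of why the join is needed, the sketch below lists what `.contents` holds for a small div. It assumes Python 3 and the built-in `html.parser`; the sample markup and the names `kinds` and `inner_html` are only illustrative.

```python
from bs4 import BeautifulSoup

html = "<div id='theDiv'>\n<p>div content</p>\n<p>div stuff</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="theDiv")

# Each child is either a Tag (an element) or a NavigableString (text,
# including the newlines between the <p> elements).
kinds = [type(child).__name__ for child in div.contents]
for kind, child in zip(kinds, div.contents):
    print(kind, repr(str(child)))

# str() turns every child back into markup, so joining the pieces
# reconstructs the inner HTML of the div.
inner_html = "".join(str(child) for child in div.contents)
print(inner_html)
```

The whitespace between the paragraphs shows up as separate string children, which is why the joined result keeps the line breaks of the original markup.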

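【Comments】:

This seems to work! Can you explain what is going on in print ''.join(map(str, div.contents))?
@user8028 Sure. contents holds all of the tag's children, each of which is either a string or a Tag instance. Applying map(str, ...) converts every child to a string. Hope that helps.
I have a special character (€) in the contents of the div. How do I encode it to ASCII so it can be printed to the terminal or written to a file? I always get the error UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 31: ordinal not in range(128)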
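
One possible way to handle the non-ASCII character raised in the last comment is to encode explicitly instead of relying on the ASCII default. The sketch below uses only the standard library; the sample string stands in for the div's inner HTML, the file name `div.html` is made up, and UTF-8 is an assumed target encoding.

```python
import io
import os
import tempfile

text = u"<p>price: 5 \u20ac</p>"  # hypothetical stand-in for the div's inner HTML

# The euro sign has no ASCII representation, so a plain ASCII encode fails.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# Option 1: encode to UTF-8 bytes, which can represent any character.
utf8_bytes = text.encode("utf-8")

# Option 2: stay within ASCII by replacing the character with an HTML
# numeric character reference.
ascii_with_refs = text.encode("ascii", "xmlcharrefreplace")

# Option 3: write to a file opened with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), "div.html")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with io.open(path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

Which option fits depends on the consumer: UTF-8 keeps the character itself, while the character-reference form is plain ASCII that a browser still renders as the euro sign.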

【Solution 2】:

Since version 4.0.1 there is a method for this, decode_contents():

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <div id='theDiv'>
... <p>div content</p>
... <p>div stuff</p>
... <p>div thing</p>
... </div>
... """, "html.parser")

>>> print(soup.div.decode_contents())

<p>div content</p>
<p>div stuff</p>
<p>div thing</p>

More details in this answer: https://stackoverflow.com/a/18602241/237105
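
A small sketch comparing decode_contents() with its bytes counterpart encode_contents() and with the .contents join shown earlier. It assumes Python 3 and the built-in `html.parser`; the sample markup is invented for the comparison.

```python
from bs4 import BeautifulSoup

html = "<div id='theDiv'><p>div content</p><p>caf\u00e9</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="theDiv")

# decode_contents() returns the inner HTML as a text string.
inner_text = div.decode_contents()

# encode_contents() returns the same markup as bytes, UTF-8 by default.
inner_bytes = div.encode_contents()

# Joining the stringified children gives the same text as decode_contents().
joined = "".join(str(child) for child in div.contents)

print(inner_text)
```

The bytes form can be handy when writing straight to a file opened in binary mode, since no implicit encoding step is involved.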

【Comments】: