【Question Title】: BeautifulSoup innerhtml?
【Posted】: 2011-12-28 02:45:10
【Question Description】:

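Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the whole innerHTML of that div: that is, I need a string with all the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?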

【Question Comments】:

【Solution 1】:

One option is to use something like this:

 innerhtml = "".join([str(x) for x in div_element.contents])

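【Comments】:

There are a few other problems with this. First, it doesn't escape HTML entities (such as greater than and less than) within string elements. Second, it writes the content of comments but not the comment tags themselves.
Adding another reason to @ChrisD's comments not to use this: it will throw a UnicodeDecodeError on content that includes non-ASCII characters.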
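
To see the first two caveats concretely, here is a small sketch with Beautiful Soup 4 on Python 3. The markup and variable names are made up for illustration, and it assumes the `bs4` package is installed and the built-in `html.parser` is used.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an escaped "<", a nested tag, and an HTML comment.
html = "<div>1 &lt; 2 <b>bold</b><!-- note --></div>"
div_element = BeautifulSoup(html, "html.parser").find("div")

# Joining str() of each child: text nodes come back unescaped and the
# comment loses its <!-- --> delimiters.
joined = "".join(str(x) for x in div_element.contents)

# decode_contents() re-serializes the children as markup instead.
rendered = div_element.decode_contents()

print(repr(joined))
print(repr(rendered))
```

The joined string is no longer valid markup for the same content, while the rendered one keeps the entity and the comment delimiters.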

【Solution 2】:

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()
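
As a quick usage sketch (assuming Beautiful Soup 4 on Python 3 with the built-in `html.parser`; the sample markup and variable names are invented for illustration), the two methods differ only in the type they return:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with one non-ASCII character.
soup = BeautifulSoup('<div id="box"><p>caf\u00e9 <b>bold</b></p></div>', "html.parser")
element = soup.find("div", id="box")

inner_bytes = element.encode_contents()  # bytes, UTF-8 by default
inner_text = element.decode_contents()   # str

print(inner_bytes)
print(inner_text)
```

If the result is going to be written to a file or a socket, the bytes form is usually handier; for further string processing in Python, the str form is.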

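These functions aren't currently in the online documentation, so I'll quote the current function definitions and the docstrings from the code.

`encode_contents` - since 4.0.4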

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

See also the documentation on formatters; you'll most likely use either formatter="minimal" (the default) or formatter="html" (for HTML entities), unless you want to process the text manually in some way.
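
A rough sketch of the difference between the two formatters, assuming Beautiful Soup 4 on Python 3 with `html.parser`. The sample markup is made up, and the exact entity names emitted may depend on the installed version:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a non-ASCII character and an escaped ampersand.
soup = BeautifulSoup("<div><p>caf\u00e9 &amp; tea</p></div>", "html.parser")
div = soup.div

# "minimal" only escapes what is needed to keep the markup valid.
minimal = div.decode_contents(formatter="minimal")

# "html" additionally turns characters into named HTML entities where it can.
as_entities = div.decode_contents(formatter="html")

print(minimal)
print(as_entities)
```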

encode_contents returns an encoded bytestring. If you want a Python Unicode string, use decode_contents instead.

`decode_contents` - since 4.0.1

decode_contents does the same thing as encode_contents, but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above functions; instead it has renderContents:

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

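This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

【Comments】:

This is the correct answer. @peewhy's answer doesn't work, for the reasons ChrisD outlined.
Does anyone know why this is undocumented? It seems like it would be a common use case.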

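【Solution 3】:

How about just unicode(x)? Seems to work for me.

Edit: This will give you the outer HTML, not the inner.

【Comments】:

This returns the div including the outer element, not just its contents.
You're right. Leaving this here for now in case it helps someone else.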

【Solution 4】:

If you need only the text (no HTML tags), you can use .text:

soup.select_one("div").text

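【Comments】:

This removes the inner tags.
Maybe you missed the part of the question that says "I'd need a string with all the html tags".

【Solution 5】:

For just text, Beautiful Soup 4 `get_text()`

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string: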

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'

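You can specify a string to be used to join the bits of text together:

soup.get_text("|")
# '\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
# 'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself: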

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

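As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be 'text', since those tags are not part of the human-visible content of the page.

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text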

【Comments】:

【Solution 6】:

str(element) gives you the outerHTML; you can then remove the outer tag from that outer HTML string.
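
For comparison, a minimal sketch of outer versus inner HTML that avoids trimming the outer tag by hand by using decode_contents() from Solution 2. It assumes Beautiful Soup 4 on Python 3 with `html.parser`; the markup and names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one wrapping div.
soup = BeautifulSoup('<div class="a"><p>hi</p></div>', "html.parser")
element = soup.div

outer_html = str(element)               # includes the <div ...> wrapper
inner_html = element.decode_contents()  # only what is inside the div

print(outer_html)
print(inner_html)
```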

【Comments】:

【Solution 7】:

The easiest way is to use the children property.

inner_html = soup.find('body').children

It returns an iterator over the tag's direct children (not a list), so you can get the full markup with a simple for loop:

for html in inner_html:
    print(html)
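
If a single string is wanted rather than printed pieces, the children can be joined. This is only a sketch under the assumption of Beautiful Soup 4 on Python 3 with `html.parser` (which does not add a missing <body>, so the sample markup includes one). The names are illustrative, and the escaping caveats raised in the comments on Solution 1 apply to any text children.

```python
from bs4 import BeautifulSoup

# Hypothetical document that already contains a <body>.
soup = BeautifulSoup("<html><body><h1>Title</h1><p>text</p></body></html>", "html.parser")
body = soup.find("body")

children = body.children               # an iterator, consumed once
is_list = isinstance(children, list)   # False: use body.contents for a list

# Serialize each direct child and concatenate the pieces.
inner_html = "".join(str(child) for child in children)

print(is_list)
print(inner_html)
```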

【Comments】:
