Parsing a document with BeautifulSoup without parsing the contents of <code> tags

【Question title】: Parsing a document with BeautifulSoup while not-parsing the contents of <code> tags
【Posted】: 2011-04-29 18:38:41
【Question】:

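I'm writing a blog application in Django. I want to let comment authors use a few tags (such as <strong>, <a>, and so on) but disable all others.

In addition, I want to let them put code inside <code> tags and have pygments parse it.

For example, someone might write a comment like this: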

I like this article, but the third code example <em>could have been simpler</em>:

<code lang="c">
#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code>

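The problem is that when I use BeautifulSoup to parse the comment and strip the disallowed HTML tags, it also parses the inside of the <code> block and treats <stdbool.h> and <stdio.h> as if they were HTML tags.

How can I tell BeautifulSoup not to parse <code> blocks? Or is there another HTML parser better suited to this job?

【Comments】:

See the reference below. It may solve the same problem you are facing.

Tags: python html django beautifulsoup pygments

【Solution 1】:

Edit:

Use python-markdown2 to process the input, and have users indent their code regions.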

>>> print html
I like this article, but the third code example <em>could have been simpler</em>:

    #include <stdbool.h>
    #include <stdio.h>

    int main()
    {
        printf("Hello World\n");
    }

>>> import markdown2
>>> marked = markdown2.markdown(html)
>>> marked
u'<p>I like this article, but the third code example <em>could have been simpler</em>:</p>\n\n<pre><code>#include &lt;stdbool.h&gt;\n#include &lt;stdio.h&gt;\n\nint main()\n{\n    printf("Hello World\\n");\n}\n</code></pre>\n'
>>> print marked
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

int main()
{
    printf("Hello World\n");
}
</code></pre>

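If you still need to navigate and edit with BeautifulSoup, do the following. Include the entity conversion if you need the '<' and '>' to be reinserted (instead of '&lt;' and '&gt;').

>>> soup = BeautifulSoup(marked,
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)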
>>> soup
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>
<pre><code>#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code></pre>


import cgi  # Python 2; cgi.escape was removed in Python 3.8 (html.escape replaces it)

def thickened(soup):
    """
    Escape the contents of every <code> tag, e.g.

    <code>
    blah blah <entity> blah
        blah
    </code>
    """
    codez = soup.findAll('code') # get the code tags
    for code in codez:
        # take all the contents inside of the code tags and convert
        # them into a single string
        escape_me = ''.join([k.__str__() for k in code.contents])
        escaped = cgi.escape(escape_me) # escape them with cgi
        code.replaceWith('<code>%s</code>' % escaped) # replace Tag objects with escaped string
    return soup

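【Discussion】:

It creates artifacts such as </stdbool.h> and </stdio.h>.
@J.F.Sebastian: You're right. It worked for me, and I just realized the difference: I had already passed it through markdown. Rewriting my answer.

【Solution 2】:

If the <code> element contains unescaped <, &, > characters in the code, then it is not valid html. BeautifulSoup will try to convert it into valid html, which is probably not what you want.

To convert the text into valid html, you can adapt a regex that strips tags from an html so that it extracts the text from each <code> block and replaces it with its cgi.escape() version. It should work fine as long as there are no nested <code> tags. After that, you can feed the sanitized html to BeautifulSoup.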
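A minimal sketch of that regex pre-processing step, written for Python 3. It is an illustration, not code from the answer: the names `CODE_BLOCK` and `escape_code_blocks` are made up here, and `html.escape` is swapped in for `cgi.escape`, which no longer exists in current Python 3 releases. Like the answer, it assumes <code> blocks are never nested.

```python
import html
import re

# One <code ...>...</code> block. The body match is non-greedy, so this
# assumes <code> blocks are not nested.
CODE_BLOCK = re.compile(r"(<code\b[^>]*>)(.*?)(</code>)", re.DOTALL | re.IGNORECASE)


def escape_code_blocks(markup):
    """Escape <, > and & inside every <code> block; leave everything else alone."""
    def _escape(match):
        open_tag, body, close_tag = match.groups()
        # quote=False: only &, < and > are replaced, quotes stay readable.
        return open_tag + html.escape(body, quote=False) + close_tag

    return CODE_BLOCK.sub(_escape, markup)


comment = 'Nice <em>post</em>:\n<code lang="c">\n#include <stdio.h>\n</code>'
print(escape_code_blocks(comment))
```

The escaped string can then be handed to BeautifulSoup, which now sees `&lt;stdio.h&gt;` as text rather than as a tag. Note that a body which is already escaped would be escaped a second time.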

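【Discussion】:

【Solution 3】:

The problem is that <code> is handled according to the normal rules of HTML markup, so the content inside <code> tags is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).

What you are trying to do is create a different markup language that is very similar to HTML but not quite the same. The simple solution is to assume certain rules, such as "<code> and </code> must each appear on a line by themselves", and do some pre-processing yourself.

A very simple, though not 100% reliable, technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It is not completely reliable because if the code block contains ]]>, things will go horribly wrong.
A safer option is to replace the dangerous characters in the code block (<, > and & are probably enough) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each code block you identify to cgi.escape(code_block).

Once the pre-processing is done, submit the result to BeautifulSoup as usual.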
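The line-based rule above could be sketched roughly as follows. This is an assumption-laden illustration rather than the answerer's code: the function name `escape_code_lines` is hypothetical, the opening line is allowed to carry attributes (as in `<code lang="c">`), and Python 3's `html.escape` stands in for `cgi.escape`.

```python
import html
import re

# An opening tag that sits alone on its line, optionally with attributes.
OPEN_LINE = re.compile(r"<code(\s[^<>]*)?>")


def escape_code_lines(text):
    """Escape every line between a lone <code ...> line and a lone </code> line."""
    out = []
    in_code = False
    for line in text.split("\n"):
        stripped = line.strip()
        if not in_code and OPEN_LINE.fullmatch(stripped):
            in_code = True
            out.append(line)               # keep the opening tag as is
        elif in_code and stripped == "</code>":
            in_code = False
            out.append(line)               # keep the closing tag as is
        elif in_code:
            out.append(html.escape(line, quote=False))  # &, <, > become entities
        else:
            out.append(line)               # ordinary comment text, untouched
    return "\n".join(out)


sample = 'I like <em>this</em>:\n<code lang="c">\n#include <stdio.h>\nif (a && b) {}\n</code>\nbye'
print(escape_code_lines(sample))
```

Only after this step would the whole string be passed to BeautifulSoup. A <code> tag that shares its line with other text is deliberately ignored, which is the price of the "on a line by itself" rule.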

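【Discussion】:

Option #2 seems to be the winner. How would I do it? With a regex, or with some complex string-processing algorithm?
@Dor: I've amended my answer to cover this.
I've tried that, but apparently cgi.escape expects a string, not a BeautifulSoup tag object :) How do I escape the contents of the tag before parsing?
You should extract the text between the <code> and </code> lines as per my original answer, pass it to cgi.escape, and join the pieces back together. Then (and only then) pass the whole thing to BeautifulSoup.
@Marcelo Cantos: That's the main part of the question - *how?* (@Dor, Oct 24 '10 at 15:47)

【Solution 4】:

Unfortunately, BeautifulSoup cannot be prevented from parsing the code blocks.

One way to achieve what you want is to:

1) Remove the code blocks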

soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')

2) Do the usual parsing to strip the disallowed tags.

3) Re-insert the code blocks and re-generate the html.

stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code

I would have answered with some code, but I recently read a blog post that does this elegantly.

http://iboris.com/page/add-source-code-syntax-highlighting-your-django-content-pygments.html
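For illustration only, here is one way the three steps might be wired together in Python 3. It is a variant, not the blog's or the answerer's code: the code bodies are pulled out with a regex before any HTML parser sees the text, `sanitize` is a hypothetical callable standing in for whatever tag-stripping routine is used in step 2, and the sketch assumes that routine leaves the placeholder element untouched.

```python
import html
import re

CODE_BLOCK = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)
PLACEHOLDER = re.compile(r'<code class="removed" id="code-(\d+)"></code>')


def sanitize_with_code(markup, sanitize):
    """Hide <code> bodies from `sanitize`, then put them back escaped."""
    bodies = []

    def _stash(match):
        # Step 1: remember the raw body and leave an empty marker behind.
        bodies.append(match.group(1))
        return '<code class="removed" id="code-%d"></code>' % (len(bodies) - 1)

    # Step 2: the caller-supplied sanitizer only ever sees the markers.
    cleaned = sanitize(CODE_BLOCK.sub(_stash, markup))

    def _restore(match):
        # Step 3: re-insert the body, escaped so it renders as text.
        body = bodies[int(match.group(1))]
        return "<code>" + html.escape(body, quote=False) + "</code>"

    return PLACEHOLDER.sub(_restore, cleaned)


def drop_script_tags(text):
    """Toy stand-in for a real whitelist sanitizer (assumption for the demo)."""
    return text.replace("<script>", "").replace("</script>", "")


raw = '<script>x</script>\n<code lang="c">\n#include <stdio.h>\n</code>'
print(sanitize_with_code(raw, drop_script_tags))
```

The sketch drops the attributes of the original <code> tag, so a real version would need to carry a validated `lang` value through to pygments, and step 3 is where the highlighted output would replace the plain escaped text.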

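【Discussion】:

When I parse the string the first time, BeautifulSoup inserts the closing </stdbool.h> and </stdio.h> tags. So even if I use this technique, I still get those closing tags in my code block.

【Solution 5】:

From the Python wiki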

>>> import cgi
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'

>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)

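【Discussion】:

That way I would have to write out every possible tag, wouldn't I?
@Dor: Why? Just pass everything inside <code> to cgi.escape
That's the main part of the question - how?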