Tags are converted to HTML entities?

【Question title】: Tags are converted to HTML entities?
【Posted】: 2015-06-23 21:30:20
【Question description】:

I am trying to use BeautifulSoup to parse some dirty HTML. One such page is http://f10.5post.com/forums/showthread.php?t=1142017

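Two things go wrong. First, the tree misses a large chunk of the page. Second, tostring(tree) converts tags such as <div> on about half of the page into HTML entities such as &lt;/div&gt;. For example: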

Original:

<div class="smallfont" align="centre">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>

tostring(tree) gives:

&lt;div class="smallfont" align="center"&gt;All times are GMT -4. The time now is &lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;

Here is my code:

from BeautifulSoup import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)

print soup

Thanks

【Question discussion】:

Tags: python html parsing beautifulsoup html-parsing

【Solution 1】:

Use beautifulsoup4 together with the extremely lenient html5lib parser:

import urllib2
from bs4 import BeautifulSoup  # NOTE: importing beautifulsoup4 here

page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")

print soup
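
On Python 3, where urllib2 no longer exists, the same approach might look like the sketch below. The helper names parse_lenient and fetch_and_parse are hypothetical, and the fallback to the built-in html.parser when html5lib is not installed is an addition that is not part of the original answer.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup, FeatureNotFound


def parse_lenient(markup):
    """Parse markup with html5lib, falling back to the built-in html.parser.

    Assumption: the fallback is only a convenience for machines without
    html5lib; html.parser may handle broken markup differently.
    """
    try:
        return BeautifulSoup(markup, "html5lib")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


def fetch_and_parse(url):
    """Download a page and parse it leniently (hypothetical helper)."""
    with urlopen(url) as response:
        return parse_lenient(response.read())
```

Calling fetch_and_parse("http://f10.5post.com/forums/showthread.php?t=1142017") and printing the result would correspond to the Python 2 snippet above, assuming the page is still reachable.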

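【Discussion】:

Do you know what the downsides of using html5lib are? Is it slower than the standard parser?
@Kar Different parsers parse HTML in different ways. In theory, you may run into cases where html.parser or lxml produces more accurate results than html5lib. Also, in terms of performance, BeautifulSoup(page, "lxml") is usually the best choice.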
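
Since the parsers can disagree, one rough way to compare them on a given page is to count how many escaped tags show up in the serialized tree. The sketch below uses only the standard library; count_escaped_tags and ESCAPED_TAG are hypothetical names, and the check is a heuristic, since a page that legitimately displays escaped markup (for example, quoted code) will also be counted.

```python
import re

# Matches the start of an escaped opening or closing tag,
# such as "&lt;div" or "&lt;/span".
ESCAPED_TAG = re.compile(r"&lt;/?[A-Za-z][A-Za-z0-9]*")


def count_escaped_tags(serialized):
    """Return how many escaped tag starts appear in serialized markup."""
    return len(ESCAPED_TAG.findall(serialized))


# The two snippets from the question: the intact markup and the escaped output.
good = '<div class="smallfont">The time now is <span class="time">02:12 PM</span>.</div>'
bad = ('&lt;div class="smallfont"&gt;The time now is '
       '&lt;span class="time"&gt;02:12 PM&lt;/span&gt;.&lt;/div&gt;')

print(count_escaped_tags(good))
print(count_escaped_tags(bad))
```

Running the same count on str(soup) for each of "html.parser", "lxml", and "html5lib" would give a quick indication of which parser keeps the tags intact on a particular page.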