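Issues with BeautifulSoup parsing: question and answers

【Question title】: Issues with BeautifulSoup parsing
【Posted】: 2010-10-10 17:05:54
【Question】:

I am trying to parse an HTML page with BeautifulSoup, but BeautifulSoup does not seem to like the HTML of this page at all. When I run the code below, prettify() returns only the page's script block (see below). Does anyone know why this happens?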

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()

Here is the output produced by BeautifulSoup.

-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
 <!--
     function highlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
     }

     function unhighlight(img) {
       document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
     }
//-->
</script>

Thanks!

Update: I am using the following version, which appears to be the latest.

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"

【Question comments】:

Tags: python beautifulsoup

【Answer 1】:

Samj: If I get something like HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>", I simply remove the culprit from the markup before handing it to BeautifulSoup, and everything works fine:

html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)
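
The same idea can be generalized a little. The following is only a sketch and not part of the original answer: `strip_split_script_tags` is a hypothetical helper that removes closing script tags that have been split by JavaScript string concatenation (for example `</scr' + 'ipt>` or `</scr"+"ipt>`) before the markup is passed to BeautifulSoup. It uses only the standard `re` module and assumes the split always falls between `scr` and `ipt`.

```python
import re

# Hypothetical helper (not part of BeautifulSoup). The pattern matches a
# closing script tag split in two by JavaScript string concatenation, e.g.
#   </scr' + 'ipt>   or   </scr"+"ipt>
SPLIT_END_TAG = re.compile(r"""</scr['"]\s*\+\s*['"]ipt>""", re.IGNORECASE)


def strip_split_script_tags(html):
    """Return html with split '</scr' + 'ipt>' fragments removed."""
    return SPLIT_END_TAG.sub("", html)


# Illustrative input only; the URL-less sample below is made up.
sample = """<script>document.write("<scr" + "ipt src='x.js'></scr" + "ipt>");</script>"""
cleaned = strip_split_script_tags(sample)
```

The cleaned string can then be passed to `BeautifulSoup(cleaned)`. Removing the fragment changes what the embedded JavaScript would write, which does not matter here because the script is never executed during parsing.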

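【Comments】:

【Answer 2】:

Try lxml. Despite its name, it is also meant for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does, so it may work better for you. If you don't want to learn the lxml API, it also has a compatibility API for BeautifulSoup.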

Ian Bicking agrees.

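There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or some other platform that does not allow anything that is not pure Python.

【Comments】:

【Answer 3】:

I also ran into problems parsing the following code: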

<script>
        function show_ads() {
          document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
        }
</script>

HTMLParseError: bad end tag: u'', at line 26, column 127

Sam

【Comments】:

【Answer 4】:

>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()

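In my case, executing the statements above returns the entire HTML page.

【Comments】:

Give a proper reason before downvoting anyone. That would be the ethical thing to do. Oh, and if you don't understand my answer, then may God help you.

【Answer 5】:

Try version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so its parser had to be changed from SGMLParser to HTMLParser, which appears to be less tolerant of bad HTML.

From the changelog for BeautifulSoup 3.1:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't."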
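
To make the version distinction concrete, here is a small sketch. The helper name `uses_htmlparser` is hypothetical, and the rule "3.1.x means HTMLParser-based" is taken only from the changelog quote above, not from inspecting the library. It checks a BeautifulSoup 3 version string, such as the module's `__version__` shown in the question, before you decide whether to downgrade.

```python
import re


def uses_htmlparser(version):
    """Guess whether a BeautifulSoup 3.x version string belongs to the
    HTMLParser-based 3.1 series (assumption based on the 3.1 changelog)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (3, 1)


# The two versions mentioned in this thread.
affected = uses_htmlparser("3.1.0.1")
unaffected = uses_htmlparser("3.0.7a")
```

In a real script the argument would be `BeautifulSoup.__version__` after `import BeautifulSoup`; a true result suggests trying 3.0.7a as described above.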

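【Comments】:

More information here: crummy.com/software/BeautifulSoup/3.1-problems.html

【Answer 6】:

I tested this script with BeautifulSoup version '3.0.7a' and it returned what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1', but give it a try.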

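【Comments】:

【Answer 7】:

BeautifulSoup isn't magic: if the incoming HTML is too badly broken, it will not work.

In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For example, it contains markup like this: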

SCRIPT type=""javascript""

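(Note the doubled quotes.)

The BeautifulSoup documentation contains a section on what you can do if BeautifulSoup cannot parse your markup. You will need to investigate those alternatives.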
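
One possible pre-processing step, offered here as an assumption and not as what the documentation prescribes, is to collapse the doubled quotes with a regular expression before parsing. `fix_doubled_quotes` is a hypothetical helper that uses only the standard `re` module. To avoid damaging legitimate empty attributes such as `alt=""`, it deliberately handles only values that contain no whitespace, `=`, quotes, or angle brackets.

```python
import re

# Matches attr=""value"" where value has no whitespace, '=', quotes or
# angle brackets, so empty attributes such as alt="" are left untouched.
DOUBLED_QUOTES = re.compile(r'=\s*""([^"<>\s=]+)""')


def fix_doubled_quotes(html):
    """Rewrite attr=""value"" as attr="value"."""
    return DOUBLED_QUOTES.sub(r'="\1"', html)


fixed = fix_doubled_quotes('<SCRIPT type=""javascript""></SCRIPT>')
```

The result can then be passed to the parser in place of the raw page. A page with other kinds of damage would need its own clean-up rules.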

【Comments】: