BeautifulSoup doesn't find correctly parsed elements: answers

【Question title】: BeautifulSoup doesn't find correctly parsed elements
【Posted】: 2015-01-09 20:54:14
【Question】:

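I am using BeautifulSoup to parse a bunch of possibly very dirty HTML documents, and I stumbled upon something very odd.

The HTML comes from this page: http://www.wvdnr.gov/

It contains multiple errors, such as several <html></html> blocks, a <title> outside <head>, and so on.

However, html5lib usually copes even in cases like these. In fact, when I do: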

soup = BeautifulSoup(document, "html5lib")

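and pretty-print the soup, I see the following output: http://pastebin.com/8BKapx88

which contains plenty of <a> tags.

However, when I call soup.find_all("a") I get an empty list. I get the same result with lxml.

So: has anyone run into this before? What is going on? How do I get the links that html5lib found but find_all does not return?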

【Question discussion】:

Tags: python html beautifulsoup html-parsing html5lib

【Solution 1】:

Even though the correct answer is "use another parser" (thanks @alecxe), I have another workaround. For some reason, this also works:

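# First pass: let html5lib repair the broken markup.
soup = BeautifulSoup(document, "html5lib")
# Second pass: re-parse the prettified output of the first pass.
soup = BeautifulSoup(soup.prettify(), "html5lib")
print(soup.find_all('a'))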

It returns the same list of links as:

soup = BeautifulSoup(document, "html.parser")
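
To sanity-check a list of links without going through a BeautifulSoup tree at all, the anchors can also be counted straight from the token stream. This is only a rough sketch using the standard library's html.parser (Python 3); AnchorCounter, count_anchors and SAMPLE are made-up names, and SAMPLE is an invented stand-in for a dirty page, not the actual markup from wvdnr.gov.

```python
from html.parser import HTMLParser


class AnchorCounter(HTMLParser):
    """Collect the href of every <a> start tag, ignoring document structure."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; attrs is a list of (name, value) pairs.
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def count_anchors(markup):
    """Return the href (or None) of each <a> tag found in a str of markup."""
    parser = AnchorCounter()
    parser.feed(markup)  # expects str; decode bytes (e.g. requests' .content) first
    parser.close()
    return parser.hrefs


# Hypothetical dirty sample: two <html> documents glued together,
# with a <title> sitting outside any <head>.
SAMPLE = """<html><title>One</title><body><a href="/a">A</a></body></html>
<html><body><a href="/b">B</a><a name="top">no href</a></body></html>"""

if __name__ == "__main__":
    print(count_anchors(SAMPLE))
```

If the number of entries differs from len(soup.find_all('a')), that points at the tree builder rather than at the markup itself.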

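【Discussion】:

【Solution 2】:

When parsing non-well-formed and tricky HTML, the parser choice is very important:

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won't matter. One parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results.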
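
A small sketch of that last point, assuming Python 3 with bs4 installed. parse_with_each is a hypothetical helper name; lxml and html5lib are optional packages and are simply skipped if they are missing. The input "<a></p>" is the invalid snippet the Beautiful Soup documentation uses in its parser comparison.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: serialized tree} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The requested parser is not installed; skip it.
            continue
    return results


if __name__ == "__main__":
    # "<a></p>" is invalid: the closing </p> has no matching opening tag.
    for name, tree in parse_with_each("<a></p>").items():
        print(f"{name:12} {tree}")
```

According to the Beautiful Soup documentation, html.parser drops the stray </p>, lxml does the same but wraps the result in <html><body>, and html5lib additionally pairs the </p> with an empty <p> and adds a <head>. Same bytes in, three different trees out.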

html.parser worked for me:

from bs4 import BeautifulSoup
import requests

# Fetch the raw bytes of the page and parse them with the built-in parser.
document = requests.get('http://www.wvdnr.gov/').content
soup = BeautifulSoup(document, "html.parser")
print(soup.find_all('a'))

Demo:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> document = requests.get('http://www.wvdnr.gov/').content
>>>
>>> soup = BeautifulSoup(document, "html5lib")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "lxml")
>>> len(soup.find_all('a'))
0
>>> soup = BeautifulSoup(document, "html.parser")
>>> len(soup.find_all('a'))
147
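
Given a result like this, one defensive pattern is to try several parsers in order and keep the first one that yields any links. This is a heuristic sketch rather than part of the original answer: find_links is a hypothetical helper, the parser order is an assumption, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def find_links(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Return (parser_name, links) from the first parser that yields any <a> tags.

    Returns (None, []) when no installed parser finds a link.
    """
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Optional parser not installed; move on to the next one.
            continue
        links = soup.find_all("a")
        if links:
            return name, links
    return None, []


if __name__ == "__main__":
    name, links = find_links('<p><a href="/x">x</a></p>')
    print(name, [a.get("href") for a in links])
```

Note that "first non-empty result" is a crude test: a parser that returns only some of the links would still be accepted, so it is worth comparing counts across parsers when the input is known to be messy.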

See also:

Differences between parsers.

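【Discussion】:

Very interesting. I had assumed html5lib and lxml were better than html.parser. Oh well, now I know better. Thanks!
html5lib is supposed to match what the HTML spec says and what web browsers do. If it doesn't, something is wrong.
Hmm, html5lib on its own seems to find the 147 a elements, but they can't be found through BeautifulSoup. This looks like a problem on the BeautifulSoup side, not in html5lib.
I've reported this bug against BS4 here, FWIW.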