BeautifulSoup's findAll function fails to get all the desired parts

【Question title】: BeautifulSoup's findAll function fails to get all the desired parts
【Posted】: 2013-09-13 15:52:50
【Question description】:

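I am currently using BeautifulSoup's findAll function to extract the attributes I need from a web page. However, it does not return all of the desired parts, and for some of them it returns None. My Python code looks like this: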

from bs4 import BeautifulSoup
import urllib

url = 'http://code.google.com/p/android/issues/detail?id=1060&colspec=ID Type Status Owner Summary Stars Opened Closed Modified Reporter Cc Project Reportedby Priority Version Target Milestone Component MergedInto BlockedOn Blocking Blocked Subcomponent Attachments'
issue_page = urllib.urlopen(url).read()

soup = BeautifulSoup(issue_page)
comment_parts = soup.findAll(name='div', attrs={'class': 'cursor_off vt issuecomment'})
for comment_part in comment_parts:
    print str(comment_part)+'\n'

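Only the first 48 are returned; the 49th and all later ones are missing. I checked the HTML source of the page, and the 49th looks the same as the 48th and the ones before it. I really cannot figure out why this happens. Can anyone help me? Thank you very much!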

【Question comments】:

Tags: python beautifulsoup

【Solution 1】:

When I run your code, I get 58 results.

... Your code ...
print len(comment_parts)

... and also,

print comment_parts[-1]

prints the last item on the page. Are you doing anything differently?

【Comments】:

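Thank you very much for your quick reply. I made a mistake in the question and have just edited it. In fact, I only get 48 results, and about 10 more should be returned. The output of comment_parts[-1] is "
". Also, I am running this on Ubuntu 13.04 with Python 2.7.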
It seems the problem is related to the BeautifulSoup version. It occurs with version 4.3.1. When I switched to version 3.2.1, it worked fine!
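I just ran your code with bs4 4.3.1 and got the same result as before, i.e. 58 results, with the last one being the final comment. Could you try a virtual environment? Which version of Python are you using?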
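I am testing with Python 2.7 on Ubuntu 13.04. With bs4 I cannot get all the results, but BeautifulSoup 3.2.1 works fine. Just now I tried the same code with bs4 and Python 2.7 on Windows 7, and it works fine there. Really strange.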
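Since the same code behaves differently on two machines, one way to narrow it down is to compare what is installed on each. The following is only a minimal sketch, not something posted in the thread: it is written for Python 3.8+ (the thread used Python 2.7), the helper name `environment_report` is made up for illustration, and the package list assumes the PyPI distribution names `beautifulsoup4`, `lxml`, and `html5lib`.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names (an assumption of this sketch). lxml and html5lib
# are optional third-party parsers that BeautifulSoup can use if installed.
PACKAGES = ("beautifulsoup4", "lxml", "html5lib")


def environment_report(packages=PACKAGES):
    """Return a dict describing the interpreter and the installed packages."""
    report = {
        "python": platform.python_version(),
        "platform": sys.platform,
    }
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value or 'not installed'}")
```

Running this on both machines and comparing the output line by line shows whether they differ in the BeautifulSoup version or in which optional parsers are available.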
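Thank you very much for your help. I found the answer here: "BeautifulSoup findAll does not find them all". The problem is related to the HTML parser being used. I have lxml installed, and BeautifulSoup uses it by default; it does not handle broken HTML well. I set the parser to html.parser, like this: soup = BeautifulSoup(issue_page, 'html.parser'), and now it works!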
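To check whether the parser is what changes the result, the same markup can be parsed with each available parser and the matches counted. This is a rough sketch under stated assumptions: it targets Python 3 with bs4 installed, `SAMPLE_HTML` is a made-up, well-formed stand-in for the real issue page (so every parser should agree on it), and `count_comments` is a hypothetical helper name. It uses `find_all`, the bs4 spelling of `findAll`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The class string from the question; matched as the exact attribute value.
COMMENT_CLASS = "cursor_off vt issuecomment"

# Made-up stand-in for the downloaded page (an assumption of this sketch).
SAMPLE_HTML = """
<html><body>
<div class="cursor_off vt issuecomment">first</div>
<div class="cursor_off vt issuecomment">second</div>
<div class="cursor_off vt issuecomment">third</div>
</body></html>
"""


def count_comments(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to its match count, or None if not installed."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            counts[parser] = None  # this parser library is not installed
            continue
        matches = soup.find_all("div", attrs={"class": COMMENT_CLASS})
        counts[parser] = len(matches)
    return counts


if __name__ == "__main__":
    for parser, count in count_comments(SAMPLE_HTML).items():
        print(parser, "not installed" if count is None else count)
```

Passing the real page content instead of `SAMPLE_HTML` would show whether the counts diverge between parsers on that page. Naming the parser explicitly, as in the fix above, also keeps the result from depending on which optional parsers happen to be installed.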