BeautifulSoup / lxml: Are there problems with large elements?

【Question Title】: BeautifulSoup / lxml: Are there problems with large elements?
【Posted】: 2013-04-08 09:49:22
【Question】:

import os, re, sys, urllib2
from bs4 import BeautifulSoup
import lxml

html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "lxml")
divs = soup.find_all("div", {"class":"block"})
print len(divs)

Output:

ActivePython 2.7.2.5 (ActiveState Software Inc.) based on
Python 2.7.2 (default, Jun 24 2011, 12:21:10) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, re, sys, urllib2
>>> from bs4 import BeautifulSoup
>>> import lxml
>>>
>>> html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
>>> soup = BeautifulSoup(html, "lxml")
>>> divs = soup.find_all("div", {"class":"block"})
>>> print len(divs)
2

I also tried:

divs = soup.find_all(class_="block")

with the same result...

But 11 elements on the page match this condition. Is there some kind of limit, such as a maximum element size? How can I get all of the elements?

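【Comments】:

I guess you want to get some sleep...
Your code gives me 11 divs.
Same as @JosefAssad: I get 11 with your first code.
So could this be a Python version problem or something? With ActivePython 2.7.2.5 I only ever get 2, and I really don't know how to solve this :(
I have now also tried it with Python 3.2.2.2 --> same result... and on a different computer --> same result... Can you tell me your trick?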

Tags: python python-2.7 beautifulsoup lxml activepython

【Solution 1】:

The simplest approach is probably to use "html.parser" instead of "lxml":

import urllib2
from bs4 import BeautifulSoup

# Same request as in the question; only the parser name changes.
html = urllib2.urlopen("http://www.hoerzu.de/tv-programm/jetzt/")
soup = BeautifulSoup(html, "html.parser")  # Python's built-in parser instead of lxml
divs = soup.find_all("div", {"class":"block"})
print len(divs)

With your original code (using lxml) it printed 1 for me, whereas this prints 11. lxml is lenient with this page, but not as lenient as html.parser.
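To see how far the parsers disagree on a given document, one option is to feed the same markup to each of them and compare the counts. The following is only a minimal sketch, written for Python 3 (the posts here use Python 2). `count_blocks` and `sample` are hypothetical names, and the small well-formed string stands in for the downloaded page, which you would read into a bytes object once and then reuse.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_blocks(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <div class="block"> elements found}.

    A parser that is not installed is reported as None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            counts[name] = None  # bs4 could not find this parser
            continue
        counts[name] = len(soup.find_all("div", class_="block"))
    return counts


# Well-formed stand-in markup; on a badly broken page the counts may differ.
sample = (
    '<div class="block">a</div>'
    '<div class="block">b</div>'
    '<div class="other">c</div>'
)
counts = count_blocks(sample)
```

If the counts differ between parsers for the real page, that points at how each parser repairs the broken markup rather than at any size limit.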

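Note that the page produces more than a thousand warnings if you run it through tidy, including invalid character codes, unclosed <div>s, and characters such as < and / in places where they are not allowed.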
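With markup this broken, it can help to check how many matching opening tags the raw source contains, independent of how any tree builder repairs the document. Here is a rough sketch in Python 3 using only the standard library's `html.parser.HTMLParser`, which reports tags as it sees them without building a tree. `BlockDivCounter`, `count_block_divs` and the `broken` sample string are made-up names for illustration, not part of the original answer.

```python
from html.parser import HTMLParser


class BlockDivCounter(HTMLParser):
    """Count <div> start tags whose class attribute contains "block"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "block" in classes:
            self.count += 1


def count_block_divs(markup):
    """Return the number of <div class="... block ..."> start tags in a str."""
    counter = BlockDivCounter()
    counter.feed(markup)  # expects str; decode downloaded bytes first
    counter.close()
    return counter.count


# Deliberately broken stand-in markup: the first <div> is never closed.
broken = (
    '<div class="block"><p>one'
    '<div class="block big">two</div>'
    '<div class="blocked">x</div>'
)
raw_count = count_block_divs(broken)
```

A raw count like this only tells you how many opening tags exist; it says nothing about nesting, so it is a sanity check rather than a replacement for a real parser.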

【Comments】:

This seems to be more of a BeautifulSoup problem, because plain lxml also finds all the divs, so I recently switched to plain lxml.
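The comment does not show how plain lxml was used, so the following is just one possible way to do it, assuming `lxml.html` and an XPath class test. It is written for Python 3; `find_block_divs` and `sample` are hypothetical names, and the sample string stands in for the downloaded page.

```python
import lxml.html


def find_block_divs(markup):
    """Return all <div> elements whose class list contains "block"."""
    doc = lxml.html.fromstring(markup)
    # Pad the class attribute with spaces so that "block" only matches
    # as a whole class name, not as part of e.g. "blocked".
    return doc.xpath(
        '//div[contains(concat(" ", normalize-space(@class), " "), " block ")]'
    )


sample = (
    '<html><body>'
    '<div class="block">a</div>'
    '<div class="block wide">b</div>'
    '<div class="other">c</div>'
    '</body></html>'
)
divs = find_block_divs(sample)
```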