【Posted】: 2012-05-18 23:03:23
【Question】:
I have this script:
import urllib2
from BeautifulSoup import BeautifulSoup
import html5lib
import lxml
soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read())
But it gives me the following error:
Traceback (most recent call last):
  File "akaConnection.py", line 59, in <module>
    soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read())
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/lib/python2.6/HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 56, column 872
Then I tried this code:
soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read(),"lxml")
or
soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read(),"html5lib")
which gives me this error:
Traceback (most recent call last):
  File "akaConnection.py", line 59, in <module>
    soup = BeautifulSoup(urllib2.urlopen("http://www.hitmeister.de").read(),"lxml")
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.6/HTMLParser.py", line 156, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1112, in parse_declaration
    j = HTMLParser.parse_declaration(self, i)
  File "/usr/lib/python2.6/markupbase.py", line 109, in parse_declaration
    self.handle_decl(data)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1097, in handle_decl
    self._toStringSubclass(data, Declaration)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1030, in _toStringSubclass
    self.soup.endData(subclass)
  File "/usr/lib/pymodules/python2.6/BeautifulSoup.py", line 1318, in endData
    (not self.parseOnlyThese.text or \
AttributeError: 'str' object has no attribute 'text'
I'm running Ubuntu Linux 10.04 with Python 2.6.5, and my BeautifulSoup version is '3.1.0.1'. How can I fix my code, or what am I missing?
【Comments】:
-
Your initial script seems to work for me... Which version of BeautifulSoup do you have? Mine is 3.0.8.1.
-
For really broken HTML, another option is to run it through Tidy first. Something like countergram.com/open-source/pytidylib
-
The second error comes from copying a BeautifulSoup 4 example and trying to use it with BeautifulSoup 3. BS3 doesn't use lxml or html5lib.
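The AttributeError in the second traceback fits this explanation: in BeautifulSoup 3 the second positional parameter is `parseOnlyThese` (a filter object), not a parser name, so the string "lxml" lands where an object with a `.text` attribute is expected. A minimal stand-in sketch of that mismatch follows. The class `FakeBS3Soup` and the helper `try_parse` are hypothetical names made up for illustration; they only mimic the attribute access visible in the traceback and are not BeautifulSoup's actual code.

```python
class FakeBS3Soup(object):
    """Hypothetical stand-in that mimics only the BS3 argument order."""

    def __init__(self, markup="", parseOnlyThese=None):
        # In BS3 the second positional slot is a filter object,
        # not the name of a parser backend.
        self.markup = markup
        self.parseOnlyThese = parseOnlyThese

    def end_data(self):
        # Mirrors the attribute access from the traceback:
        # the filter object is expected to expose `.text`.
        if self.parseOnlyThese is not None and not self.parseOnlyThese.text:
            return "filtered"
        return "kept"


def try_parse(second_arg):
    """Return the AttributeError message for `second_arg`, or None."""
    soup = FakeBS3Soup("<html></html>", second_arg)
    try:
        soup.end_data()
    except AttributeError as exc:
        return str(exc)
    return None


error_with_parser_name = try_parse("lxml")   # BS4-style call
error_without_second_arg = try_parse(None)   # BS3-style call
```

With a string in the second slot the stand-in fails the same way the real call does, while leaving the slot empty does not. Under BS3, then, the parser-name argument has to be dropped rather than swapped for another name.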
-
My BeautifulSoup version is: '3.1.0.1'
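The version difference between the two posters probably matters here. The first traceback shows 3.1.0.1 going through the standard library's `HTMLParser`, which rejects the malformed tag, while the commenter's 3.0.8.1 parses the same page. As far as I know, the 3.0.x and 3.2.x releases are built on the more forgiving `SGMLParser`, and only 4.x accepts a parser name such as "lxml". A rough sketch of that mapping, where the function `bs_parser_backend` and its return strings are hypothetical and the version boundaries are my assumption rather than something confirmed in this thread:

```python
def bs_parser_backend(version):
    """Map a BeautifulSoup version string to the parser it relies on.

    The boundaries are an assumption based on the traceback and on the
    BeautifulSoup 3 release notes as I remember them.
    """
    nums = [int(p) for p in version.split(".")[:2]]
    major, minor = (nums + [0, 0])[:2]  # pad short strings like "4"
    if major >= 4:
        return "pluggable (html.parser, lxml, html5lib)"
    if (major, minor) == (3, 1):
        return "HTMLParser (strict)"
    return "SGMLParser (lenient)"


asker_backend = bs_parser_backend("3.1.0.1")
commenter_backend = bs_parser_backend("3.0.8.1")
```

If that mapping holds, switching to a 3.0.x or 3.2.x release of BeautifulSoup on the same Python 2.6 interpreter may be enough to parse the page, without installing another Python version.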
-
"BS3 doesn't use lxml or html5lib". So how do I fix it? I can't install another Python version because this runs on a server, and I know BS4 doesn't support python
Tags: python web-crawler beautifulsoup lxml html5lib