【Posted】:2021-01-19 20:41:46
【Question】:
I'm using Beautiful Soup 4 with Python 3.8. I only want to parse certain elements of an HTML page, so I decided to use a filter like this:
req = urllib2.Request(full_url, headers=settings.HDR)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
,,,
@staticmethod
def idiom_match_strainer(elem, attrs):
    if elem == 'ul' and 'class' in attrs and attrs['class'] == 'idiKw':
        return True
    return False
Unfortunately, when I try to parse any URL (https://idioms.thefreedictionary.com/testing is one example), I get the following error:
Internal Server Error: /ajax/get_hints
Traceback (most recent call last):
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 126, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 124, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/Users/davea/Documents/workspace/dictionary_project/dictionary/views.py", line 194, in get_hints
objects = s.get_hints(article)
File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/article_service.py", line 398, in get_hints
idioms = DictionaryService.get_idioms(word)
File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/dictionary_service.py", line 75, in get_idioms
soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 281, in __init__
self._feed()
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 342, in _feed
self.builder.feed(self.markup)
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 287, in feed
self.parser.feed(markup)
File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
File "src/lxml/parser.pxi", line 1364, in lxml.etree._FeedParser.feed
File "src/lxml/parsertarget.pxi", line 148, in lxml.etree._TargetParserContext._handleParseResult
File "src/lxml/parsertarget.pxi", line 136, in lxml.etree._TargetParserContext._handleParseResult
File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
File "src/lxml/saxparser.pxi", line 389, in lxml.etree._handleSaxTargetStartNoNs
File "src/lxml/saxparser.pxi", line 404, in lxml.etree._callTargetSaxStart
File "src/lxml/parsertarget.pxi", line 80, in lxml.etree._PythonSaxParserTarget._handleSaxStart
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 220, in start
self.soup.handle_starttag(name, namespace, nsprefix, attrs)
File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 582, in handle_starttag
and (self.parse_only.text
AttributeError: 'function' object has no attribute 'text'
Should I be using the filter differently?
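The last frame of the traceback reads `self.parse_only.text`, which suggests that `parse_only` is treated as an object with attributes rather than as a plain callable. The Beautiful Soup documentation describes `parse_only` as taking a `SoupStrainer` instance. Below is a minimal sketch of that usage, not a confirmed fix for the project above. The `SAMPLE_HTML` string is hypothetical markup standing in for the fetched page, and `html.parser` is used here only so the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical markup standing in for the page fetched in the question.
SAMPLE_HTML = """
<html><body>
  <div class="other"><ul class="nav"><li>Home</li></ul></div>
  <ul class="idiKw"><li>testing the waters</li><li>put to the test</li></ul>
</body></html>
"""

# Wrap the filter criteria in a SoupStrainer instead of passing a bare function.
only_idiom_lists = SoupStrainer("ul", attrs={"class": "idiKw"})

# Assumption: html.parser is used here; the question uses features="lxml".
soup = BeautifulSoup(SAMPLE_HTML, features="html.parser", parse_only=only_idiom_lists)

# Only the matching <ul> elements and their contents end up in the tree.
idioms = [li.get_text(strip=True) for li in soup.find_all("li")]
print(idioms)
```

The documentation also describes passing a function to `SoupStrainer`. Which arguments that function receives is worth checking against the installed bs4 version before relying on an `(elem, attrs)` signature.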
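【Comments】:
- What are the arguments passed to idiom_match_strainer?
- Their documentation (crummy.com/software/BeautifulSoup/bs4/doc) gave me the impression that the arguments must always be the element and the attributes, which is why I included those two in the filter.
- @QHarr Is this a duplicate?
- @αԋɱҽԃαмєяιcαη I think it might be
Tags: python-3.x beautifulsoup html-parsing lxml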