Possible to use multiple strainers with one BeautifulSoup document? (Answers)

【Question title】: Possible to use multiple strainers with one BeautifulSoup document?
【Posted】: 2019-02-22 20:08:17
【Question】:

I'm using Django and Python 3.7. I want to speed up my HTML parsing. Currently I'm looking for three types of elements in my document, like so:

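import re
from urllib.request import Request, urlopen  # urllib2 does not exist in Python 3

from bs4 import BeautifulSoup

# fullurl and settings.HDR come from the surrounding Django project
req = Request(fullurl, headers=settings.HDR)
html = urlopen(req).read()
comments_soup = BeautifulSoup(html, features="html.parser")

score_elts = comments_soup.find_all("div", {"class": "score"})

comments_elts = comments_soup.find_all("a", attrs={'class': 'comments'})

bad_elts = comments_soup.find_all("span", string=re.compile("low score"))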

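I read that SoupStrainer is one way to improve performance: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document. However, all the examples only cover parsing an HTML document with a single strainer. In my case I have three. How do I pass three strainers into my parsing, or would that actually perform worse than what I'm doing now?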

【Comments】:

Tags: django python-3.x performance parsing beautifulsoup

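【Answer 1】:

I don't think you can pass multiple strainers to the BeautifulSoup constructor. What you can do is wrap all your conditions into a single strainer and pass that to the BeautifulSoup constructor.

For simple cases, such as tag names, you can pass a list to SoupStrainer: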

html="""
<a>yes</a>
<p>yes</p>
<span>no</span>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
custom_strainer = SoupStrainer(["a","p"])
soup=BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)

Output:

<a>yes</a><p>yes</p>
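
Applied to the three searches in the question, the same idea might look like the sketch below. The sample HTML is hypothetical and stands in for the fetched page, and `html.parser` is used as in the question. The strainer only narrows the tree by tag name; the class and text conditions are still applied afterwards by the usual `find_all` calls.

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample page standing in for the HTML fetched in the question.
html = """
<html><body>
<div class="score">42</div>
<div class="sidebar">ignored by the searches</div>
<p>dropped <a class="comments" href="/c/1">12 comments</a></p>
<span>low score warning</span>
<table><tr><td>dropped entirely</td></tr></table>
</body></html>
"""

# One strainer that keeps every tag name any of the three searches needs.
wanted_tags = SoupStrainer(["div", "a", "span"])
soup = BeautifulSoup(html, "html.parser", parse_only=wanted_tags)

# The searches from the question then run on the smaller tree.
score_elts = soup.find_all("div", {"class": "score"})
comments_elts = soup.find_all("a", attrs={"class": "comments"})
bad_elts = soup.find_all("span", string=re.compile("low score"))

print(len(score_elts), len(comments_elts), len(bad_elts))
```

Tags that match the strainer are kept together with their contents, while non-matching tags such as `<p>` and `<table>` are left out of the tree unless they sit inside a kept tag.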

To specify more logic, you can also pass in a custom function (you will probably have to):

html="""
<html class="test">
<a class="wanted">yes</a>
<a class="not-wanted">no</a>
<p>yes</p>
<span>no</span>
</html>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
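def my_function(elem, attrs):
    # Keep <a class="wanted"> and every <p>;
    # .get() avoids a KeyError on <a> tags that have no class attribute.
    if elem == 'a' and attrs.get('class') == "wanted":
        return True
    return elem == 'p'
custom_strainer = SoupStrainer(my_function)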
soup=BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)

Output:

<a class="wanted">yes</a><p>yes</p>
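
If you would rather keep the strainer itself simple, one possible alternative is to strain by tag name only and apply the extra logic afterwards with a function passed to `find_all`. This is a sketch using `html.parser` rather than `lxml`, and `keep` is a hypothetical helper name. Note that on a parsed tag, `class` is a list of class names rather than a single string.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html class="test">
<a class="wanted">yes</a>
<a class="not-wanted">no</a>
<p>yes</p>
<span>no</span>
</html>
"""

# Coarse filter at parse time: keep only <a> and <p> tags.
soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer(["a", "p"]))


def keep(tag):
    # On a parsed Tag, "class" is a list of class names, not a single string.
    if tag.name == "a":
        return "wanted" in tag.get("class", [])
    return tag.name == "p"


# Fine-grained filter after parsing, on the already reduced tree.
kept = soup.find_all(keep)
print([str(tag) for tag in kept])
```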

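As the documentation says:

Parsing only part of a document won't save you much time parsing the document, but it can save a lot of memory, and it will make searching the document much faster.

I think you should take a look at the Improving performance section of the documentation.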
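
To see how this trade-off plays out on your own pages, a rough measurement along the following lines may help. The synthetic HTML and the row count are assumptions made purely for illustration, and the timings will differ by machine and parser, so treat the numbers as an indication only.

```python
import timeit

from bs4 import BeautifulSoup, SoupStrainer

# Synthetic page (an assumption for illustration): many rows, one div per row.
row = '<tr><td>filler</td><td><div class="score">1</div></td></tr>'
html = "<html><body><table>" + row * 2000 + "</table></body></html>"

strainer = SoupStrainer("div")


def parse_full():
    return BeautifulSoup(html, "html.parser")


def parse_strained():
    return BeautifulSoup(html, "html.parser", parse_only=strainer)


full, strained = parse_full(), parse_strained()

# Both trees yield the same matches, but the strained tree holds fewer nodes.
full_hits = full.find_all("div", {"class": "score"})
strained_hits = strained.find_all("div", {"class": "score"})
full_nodes = sum(1 for _ in full.descendants)
strained_nodes = sum(1 for _ in strained.descendants)
print("matches:", len(full_hits), len(strained_hits))
print("nodes:  ", full_nodes, strained_nodes)

# Rough timings for parsing and for searching each tree.
print("parse full:     ", timeit.timeit(parse_full, number=3))
print("parse strained: ", timeit.timeit(parse_strained, number=3))
print("search full:    ", timeit.timeit(lambda: full.find_all("div"), number=3))
print("search strained:", timeit.timeit(lambda: strained.find_all("div"), number=3))
```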

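【Comments】:

In your code example above, did you mean "parse_only=custom_strainer" or "parse_only=my_strainer_function", since that is what you named the function?
@Dave You pass the function to the strainer, and then pass the strainer to the BeautifulSoup constructor. I have renamed the function to make this clearer.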