Python high memory usage with BeautifulSoup: can't delete object (answers)

【Question title】: Python high memory usage with BeautifulSoup: can't delete object
【Posted】: 2015-04-24 15:33:49
【Question】:

I basically have the same problem as the person here: Python high memory usage with BeautifulSoup

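My BeautifulSoup objects are not garbage collected, which leads to heavy RAM consumption. Here is the code I use ("entry" is an object I get from an RSS feed; it is basically one RSS article).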

import arrow
import requests
from bs4 import BeautifulSoup

title = entry.title
date = arrow.get(entry.updated).format('YYYY-MM-DD')

try:
    url = entry.feedburner_origlink
except AttributeError:
    url = entry.link

abstract = None
graphical_abstract = None
author = None

soup = BeautifulSoup(entry.summary)

r = soup("img", align="center")
print(r)
if r:
    graphical_abstract = r[0]['src']

if response.status_code == requests.codes.ok:
    soup = BeautifulSoup(response.text)

    # Get the title (w/ html)
    title = soup("h2", attrs={"class": "alpH1"})
    if title:
        title = title[0].renderContents().decode().strip()

    # Get the abstract (w/ html)
    r = soup("p", xmlns="http://www.rsc.org/schema/rscart38")
    if r:
        abstract = r[0].renderContents().decode()
        if abstract == "":
            abstract = None

    r = soup("meta", attrs={"name": "citation_author"})
    if r:
        author = [tag['content'] for tag in r]
        author = ", ".join(author)

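The documentation (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Improving%20Memory%20Usage%20with%20extract) says the problem may come from the fact that, as long as you are using a tag contained in the soup object, the soup object stays in memory. So I tried something like this (everywhere I used a soup object in the example above):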

    r = soup("img", align="center")[0].extract()
    graphical_abstract = r['src']

However, the memory is still not freed when the program leaves the scope.

So I am looking for an effective way to delete a soup object from memory. Do you have any ideas?
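
The mechanism the documentation describes can be reproduced without BeautifulSoup at all. The following is a toy sketch, not BeautifulSoup's actual implementation: `Node` and `build_tree` are hypothetical names invented for illustration. It suggests why holding a single child node, which points back to its parent, is enough to keep the whole tree alive, and why copying out a plain string and dropping the node lets the tree be collected (behaviour shown is for CPython).

```python
import gc
import weakref


class Node:
    """Toy tree node that, like a parsed tag, points back to its parent."""

    def __init__(self, name, parent=None, attrs=None):
        self.name = name
        self.attrs = attrs or {}
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def build_tree():
    """Build a two-node tree and return (root, child)."""
    root = Node("document")
    img = Node("img", parent=root, attrs={"src": "a.png"})
    return root, img


root, img = build_tree()
root_ref = weakref.ref(root)  # lets us observe whether the root is still alive

# Dropping our own name for the root is not enough: img.parent still refers to it.
del root
gc.collect()
alive_while_child_held = root_ref() is not None

# Copy out the plain value we need, then drop the node itself.
src = str(img.attrs["src"])
del img
gc.collect()  # the parent/child cycle is now unreachable and can be collected
alive_after_release = root_ref() is not None

print(alive_while_child_held, alive_after_release, src)  # True False a.png
```

The same reasoning applies to any object that keeps a back-reference into a parse tree: what matters is not whether the soup variable went out of scope, but whether anything reachable still points into the tree.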

【Comments on the question】:

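Have you tried lxml? Its iterparse is very efficient at parsing large documents; take a look here.
I know about lxml, but I prefer BeautifulSoup. I have a whole module written with BS. It works, except for the memory leak.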

Tags: python memory-leaks beautifulsoup

【Answer 1】:

To avoid large memory leaks from BeautifulSoup objects, try using the SoupStrainer class.

It works very well for me.

from bs4 import BeautifulSoup, SoupStrainer

only_span = SoupStrainer('span')
only_div = SoupStrainer('div')
only_h1 = SoupStrainer('h1')

soup_h1 = BeautifulSoup(response.text, 'lxml', parse_only=only_h1)
soup_span = BeautifulSoup(response.text, 'lxml', parse_only=only_span)
soup_div = BeautifulSoup(response.text, 'lxml', parse_only=only_div)


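try:
    name = soup_h1.find('h1', id='itemTitle').find(text=True, recursive=False)
except AttributeError:  # find() returned None: no matching <h1>
    name = 'Noname'

try:
    price = soup_span.find('span', id='prcIsum').text.strip()
except AttributeError:  # assumed fallback; the original snippet is cut off here
    price = None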

And so on...

Even though we create three BeautifulSoup objects with SoupStrainer, this consumes far less RAM than using a single BeautifulSoup object without SoupStrainer.
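
A rough way to see the effect is to compare how many nodes end up in the tree with and without a strainer. This is only a sketch under assumptions: the `html` string is made-up sample markup standing in for `response.text`, and it uses the built-in `html.parser` instead of `lxml` so that it runs without extra dependencies. Node count is a proxy for memory use here, not a measurement of it.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, standing in for response.text.
html = """
<html><body>
  <h1 id="itemTitle">Example item</h1>
  <div><p>Long description</p><p>More text</p></div>
  <span id="prcIsum">US $10.00</span>
</body></html>
"""

# Full parse: every element and text node becomes an object in the tree.
full_soup = BeautifulSoup(html, "html.parser")

# Filtered parse: only <h1> elements and their contents are kept.
h1_soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("h1"))

full_count = len(list(full_soup.descendants))
h1_count = len(list(h1_soup.descendants))

# Keep only a plain string from the filtered tree.
name = str(h1_soup.find("h1", id="itemTitle").get_text(strip=True))

print(full_count, h1_count, name)
```

On a real page the gap between the two counts is typically much larger than in this small sample, which is consistent with the savings described above.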

【Comments】:

【Answer 2】:

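I ran into a similar problem and found that, despite being careful, I was still storing some BS NavigableString and/or ResultSet objects, which, as you know, keeps the soup in memory. I am not sure whether both lines are needed (I'll let you try), but I remember that extracting the text this way solved the problem: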

# Python 2 syntax; on Python 3, use str() in place of unicode().
ls_result = [unicode(x) for x in soup_bloc.findAll(text=True)]
str_result = unicode(soup_bloc.text)
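
For Python 3, the same idea might look like the sketch below. It assumes bs4 with the built-in `html.parser`; `extract_title` is a hypothetical helper written for this illustration, not part of the answer's code. The point it tries to show is that a `NavigableString` stays attached to the tree through `.parent`, while `str()` gives an independent plain string.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def extract_title(html):
    """Parse html and return the <h1> text as a plain str, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("h1")
    if tag is None:
        return None
    # Only a plain str leaves this function, so nothing returned points at soup.
    return str(tag.get_text(strip=True))


soup = BeautifulSoup("<h1>Hello</h1>", "html.parser")
nav = soup.h1.string  # NavigableString: still linked to the tree via .parent
plain = str(nav)      # plain str: a copy with no link back to the tree

print(type(nav).__name__, nav.parent.name)
print(type(plain).__name__, hasattr(plain, "parent"))
print(extract_title("<h1> Hi </h1>"))
```

Keeping the parsing inside a function that returns only plain values makes it easier to be sure that no tree object escapes by accident.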

【Comments】:

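So basically, every time I need a string from the soup object, I just call unicode() on it, right? I don't need to do anything special when navigating/searching the tree?
In my case that was enough. I also used gc and decompose() as suggested in the SO question you mention, but it did not help. In the end I found the problem by methodically checking the type of everything I was storing (including things I thought were lists but that turned out to be BS ResultSets, and list items I thought were strings but that turned out to be BS NavigableStrings). I guess your problem may be different. I don't mind checking if you post a snippet I can run.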
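
The type check described in that last comment could be automated along these lines. This is a sketch only: `find_soup_references` is a hypothetical helper, it walks just dicts, lists, and tuples, and it assumes bs4 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup
from bs4.element import PageElement, ResultSet


def find_soup_references(obj, path="obj"):
    """Return (path, type name) pairs for BeautifulSoup objects found inside obj."""
    # ResultSet subclasses list, so test for it before the generic containers.
    if isinstance(obj, (PageElement, ResultSet)):
        return [(path, type(obj).__name__)]
    hits = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            hits.extend(find_soup_references(value, f"{path}[{key!r}]"))
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            hits.extend(find_soup_references(value, f"{path}[{index}]"))
    return hits


soup = BeautifulSoup('<h1>Title</h1><a href="/a">A</a><a href="/b">B</a>', "html.parser")

# Hypothetical record of scraped values, some of which still reference the tree.
record = {
    "title": soup.h1.string,                            # looks like a str
    "links": soup.find_all("a"),                        # looks like a list
    "hrefs": [a["href"] for a in soup.find_all("a")],   # plain strings
    "clean_title": str(soup.h1.string),                 # plain str
}

for path, type_name in find_soup_references(record, "record"):
    print(path, type_name)
```

Running a check like this over whatever is stored after parsing should flag the entries that need converting with `str()` or rebuilding as plain lists.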