BeautifulSoup MemoryError When Opening Several Files in Directory

【Question title】: BeautifulSoup MemoryError When Opening Several Files in Directory
【Posted】: 2015-07-06 09:40:29
【Question description】:

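Background: Every week I receive a list of lab results in the form of an HTML file. Each week there are about 3,000 results, and each set of results has between two and four tables associated with it. For each result/trial, I only care about some standard information that is stored in one of these tables. That table can be uniquely identified because its first cell (first row, first column) always contains the text "Clinical Results".

Problem: The code below works well when I process one file at a time. That is, instead of running a for loop over the directory, I point get_data = open() at a specific file. However, I want to grab the data from the past few years without handling each file individually. So I used the glob module and a for loop to cycle through all the files in the directory. The issue is that I get a MemoryError by the time I reach the third file in the directory.

Question: Is there a way to clear/reset the memory between files? That way, I could loop over all the files in the directory instead of pasting in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.

Thank you.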

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

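【Comments】:

Have you tried get_data.close() instead of del?
@Anzel I tried that, but I get the following error before it runs: AttributeError: 'str' object has no attribute 'close'. I believe that because it is get_data = open(FileName,'r').read(), the .read() opens the file and then closes it once it has been read.
Sorry @JohnR4785, it should be f = open(...), get_data = f.read(), then f.close()...
@Anzel Got it. Unfortunately, no luck. It fails in the same place: when it reads the third large file in the directory (the third loop iteration). I have been looking for ways to clear the memory. That is why I added gc.collect(); I don't fully understand it, but it sounded like it should make sure Python clears the information out of the variables. I would assume there is a way to do this, since I can process one file at a time. Thanks for the idea!
Did you ever figure out how to get BSoup to work, or did you go with a different solution?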
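
The open/read/close sequence suggested in the comments can also be written with a with block, which closes the handle even if reading fails. This is only a minimal sketch: the helper name read_text is hypothetical, and, as the asker reports, closing the file on its own did not make the MemoryError go away.

```python
import os
import tempfile


def read_text(path):
    """Return the file's contents; the handle is closed when the with block exits."""
    with open(path, "r") as f:
        return f.read()


if __name__ == "__main__":
    # Throwaway file so the sketch runs anywhere (not one of the asker's files).
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
        tmp.write("<table><tr><td>Clinical Results</td></tr></table>")
    print(len(read_text(tmp.name)))
    os.remove(tmp.name)
```

The returned value is an ordinary str, which is why calling close() on it raises the AttributeError quoted above.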

Tags: python memory web-scraping beautifulsoup scraper

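【Solution 1】:

I am very much a beginner programmer, and I ran into the same problem. I did three things that seem to have solved it:

Also call garbage collection (gc.collect()) at the start of each iteration
Move the parsing for one iteration into a function, so that all the global variables become local variables and are deleted when the function ends.
Use soup.decompose()

I think the second change is probably what fixed it, but I did not have time to check, and I did not want to change working code.

For this code, the solution would look like this: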

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
    gc.collect()

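    # Read the whole file; the with block closes the handle. get_data is a
    # plain str, so it has no close() method of its own.
    with open(file, 'r') as f:
        get_data = f.read()

    soup = BeautifulSoup(get_data)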
    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            # The with block closes the file on exit, so no explicit close() is needed.
            with open("Results_File.txt", "a") as out_file:
                out_file.write(complete_row + "\n")

    soup.decompose()
    gc.collect()
    return None


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print ("done")
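
To illustrate why wrapping the work in a function and calling gc.collect() might help, here is a small standard-library sketch. Node is a hypothetical stand-in for a parsed tag, not BeautifulSoup's actual class; the assumption is only that a parsed tree, like this one, has parents and children that reference each other. Such reference cycles are not freed the moment the local variables go away, but the cycle collector can reclaim them once nothing outside the function refers to the tree.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a parsed tag: parent and children reference each other."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def build_and_drop():
    """Build a small tree in local variables and return only a weak reference to its root."""
    root = Node()
    for _ in range(3):
        Node(parent=root)
    return weakref.ref(root)


gc.disable()                         # keep the automatic collector out of the demo
probe = build_and_drop()
alive_before = probe() is not None   # the parent/child cycle keeps the tree alive
gc.collect()                         # explicit collection, as in parser() above
alive_after = probe() is not None    # the unreachable cycle has been reclaimed
gc.enable()
print(alive_before, alive_after)
```

If the tree were still referenced by a module-level name, as soup is in the original loop until it is rebound or deleted, the collector could not reclaim it, which may be part of why moving the work into a function made a difference here.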

【Discussion】: