[Posted]: 2013-05-21 19:36:55
[Question]:
I want to generate a "bag of words" matrix containing the documents and the corresponding counts of the words in each document. To do this, I run the code below to initialize the bag-of-words matrix. Unfortunately, after some number of documents I get a memory error on the line where I read a document. Is there a better way to avoid the memory error? Note that I want to process a large number of documents, roughly 2,000,000, with only 8 GB of RAM.
def __init__(self, paths, words_count, normalize_matrix = False, trainingset_size = None, validation_set_words_list = None):
    '''
    Open all documents from the given paths.
    Initialize the variables needed in order
    to construct the word matrix.

    Parameters
    ----------
    paths: paths to the documents.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should be assigned to the training set.
    validation_set_words_list: the attributes for validation.
    '''
    print '################ Data Processing Started ################'

    self.max_words_matrix = words_count

    print '________________ Reading Docs From File System ________________'
    timer = time()

    for folder in paths:
        self.class_names.append(folder.split('/')[-1])
        print '____ data processing for category ' + folder
        if trainingset_size is None:
            docs = os.listdir(folder)
        elif validation_set_words_list is None:
            docs = os.listdir(folder)[:int(len(os.listdir(folder)) * trainingset_size - 1)]
        else:
            docs = os.listdir(folder)[int(len(os.listdir(folder)) * trainingset_size + 1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                d = open(folder + '/' + doc).read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list containing document names.
                self.docs_names.append(doc)
                # Increase the class indices counter.
                self.class_indices.append(len(self.class_names) - 1)
                print 'Processed ' + str(count) + ' of ' + str(length) + ' in category ' + folder
                count += 1
[Discussion]:
- This might be useful: en.wikipedia.org/wiki/…
- For each word you could increment a dictionary value (you can use a defaultdict), e.g. words_count[word] = words_count[word] + 1, and save the dictionary once you reach the end of each file.
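A minimal sketch of the streaming approach this comment suggests: read each file line by line so the full document never has to sit in memory at once. The function name `count_words` and the whitespace tokenization are my own illustration, not part of the question's code, which presumably tokenizes inside `__filter__`.

```python
from collections import defaultdict

def count_words(path):
    """Stream a document line by line and accumulate word counts,
    so only one line of text is held in memory at a time.
    Tokenization here is a plain whitespace split (an assumption)."""
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            for word in line.split():
                counts[word] += 1
    return dict(counts)
```

The per-document dictionary can then be saved (or folded into a global vocabulary) before moving on to the next file, instead of appending whole filtered documents to a list.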
- self.docs_list.append(self.__filter__(d)) - what does __filter__ do? Aren't you keeping 2M documents in memory?
- It's worth noting that the main space optimizations of the BoW model generally don't apply in Python: after all, one more reference to the number 2 costs as much as one more reference to the string "likes". The only way to get the optimization is to use array.array, numpy.ndarray, or similar for each document vector instead of a list.
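A rough sketch of what that comment means by packing a document vector into a typed array instead of a Python list. The helper name `doc_vector` and the fixed `vocabulary` argument are hypothetical; the point is that array.array stores raw unsigned integers contiguously, rather than one pointer per element to a boxed int object.

```python
from array import array

def doc_vector(counts, vocabulary):
    """Pack per-document word counts into a compact typed array.

    counts:     a {word: count} dict for one document.
    vocabulary: an ordered list of the words kept in the
                bag-of-words model (both names are illustrative).
    'I' stores each count as a raw unsigned int, far smaller
    than a Python list of int objects."""
    return array('I', (counts.get(w, 0) for w in vocabulary))
```

With a fixed vocabulary, 2M such vectors can also be stacked into a single numpy.ndarray (or a scipy sparse matrix, since most counts are zero) for a further reduction.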
Tags: python memory memory-management python-2.7 machine-learning