[Posted]: 2013-05-21 19:36:55
[Question]:
I want to generate a "bag of words" matrix containing the documents and the corresponding counts of the words in each document. To do this, I run the code below to initialize the bag-of-words matrix. Unfortunately, after some number of documents I get a memory error on the line where I read a document. Is there a better way to avoid the memory error? Note that I want to process a large number of documents, roughly 2,000,000, with only 8 GB of RAM.
def __init__(self, paths, words_count, normalize_matrix = False, trainingset_size = None, validation_set_words_list = None):
    '''
    Open all documents from the given paths.
    Initialize the variables needed in order
    to construct the word matrix.

    Parameters
    ----------
    paths: paths to the documents.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should be assigned to the training set.
    validation_set_words_list: the attributes for validation.
    '''
    print '################ Data Processing Started ################'

    self.max_words_matrix = words_count

    print '________________ Reading Docs From File System ________________'
    timer = time()

    for folder in paths:
        self.class_names.append(folder.split('/')[-1])
        print '____ data processing for category ' + folder
        if trainingset_size is None:
            docs = os.listdir(folder)
        elif validation_set_words_list is None:
            docs = os.listdir(folder)[:int(len(os.listdir(folder)) * trainingset_size - 1)]
        else:
            docs = os.listdir(folder)[int(len(os.listdir(folder)) * trainingset_size + 1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                d = open(folder + '/' + doc).read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list containing document names.
                self.docs_names.append(doc)
                # Increase the class indices counter.
                self.class_indices.append(len(self.class_names) - 1)
                print 'Processed ' + str(count) + ' of ' + str(length) + ' in category ' + folder
                count += 1
[Discussion]:
- This might be useful: en.wikipedia.org/wiki/…
- For each word you could increment a dictionary value (you can use a defaultdict), e.g. words_count[word] = words_count[word] + 1, and save the dictionary once you reach the end of each file.
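A minimal sketch of the streaming approach this comment suggests: read each file line by line so the full document never has to sit in memory at once. The function name `count_words` and the whitespace tokenization are my own illustration, not part of the question's code, which presumably tokenizes inside `__filter__`.

```python
from collections import defaultdict

def count_words(path):
    """Stream a document line by line and accumulate word counts,
    so only one line of text is held in memory at a time.
    Tokenization here is a plain whitespace split (an assumption)."""
    counts = defaultdict(int)
    with open(path) as f:
        for line in f:
            for word in line.split():
                counts[word] += 1
    return dict(counts)
```

The per-document dictionary can then be saved (or folded into a global vocabulary) before moving on to the next file, instead of appending whole filtered documents to a list.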
- self.docs_list.append(self.__filter__(d)) - what does __filter__ do? Aren't you keeping 2M documents in memory?
- It's worth noting that the main space optimizations of the BoW model generally don't apply in Python: after all, one more reference to the number 2 costs as much as one more reference to the string "likes". The only way to get the optimization is to use array.array, numpy.ndarray, or similar for each document vector instead of a list.
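A rough sketch of what that comment means by packing a document vector into a typed array instead of a Python list. The helper name `doc_vector` and the fixed `vocabulary` argument are hypothetical; the point is that array.array stores raw unsigned integers contiguously, rather than one pointer per element to a boxed int object.

```python
from array import array

def doc_vector(counts, vocabulary):
    """Pack per-document word counts into a compact typed array.

    counts:     a {word: count} dict for one document.
    vocabulary: an ordered list of the words kept in the
                bag-of-words model (both names are illustrative).
    'I' stores each count as a raw unsigned int, far smaller
    than a Python list of int objects."""
    return array('I', (counts.get(w, 0) for w in vocabulary))
```

With a fixed vocabulary, 2M such vectors can also be stacked into a single numpy.ndarray (or a scipy sparse matrix, since most counts are zero) for a further reduction.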
Tags: python memory memory-management python-2.7 machine-learning