[Posted]: 2019-01-16 08:37:46
[Question]:
I have a large language corpus, and I use sklearn's tfidf vectorizer and gensim's Doc2Vec to build language models. My full corpus has about 100,000 documents, and I noticed that once I cross a certain threshold my Jupyter notebook stops computing. My guess is that memory fills up once the grid-search and cross-validation steps are applied.
Even the example script below already stops at some point for Doc2Vec:
%%time
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess

np.random.seed(1)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(
    [simple_preprocess(doc) for doc in data.text],
    data.label, random_state=1)

model_names = [
    'TfidfVectorizer',
    'Doc2Vec_PVDM',
]
models = [
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None, min_df=5),
    D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=1),
]
parameters = [
    {
        'model__smooth_idf': (True, False),
        'model__norm': ('l1', 'l2', None)
    },
    {
        'model__size': [200],
        'model__window': [4]
    }
]

for params, model, name in zip(parameters, models, model_names):
    pipeline = Pipeline([
        ('model', model),
        ('clf', LogisticRegression())
    ])
    grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)
    cval = cross_val_score(grid.best_estimator_, X_train, y_train,
                           scoring='accuracy', cv=5, n_jobs=-1)
    print("Cross-Validation (Train):", np.mean(cval))
print("Finished.")
Is there a way to "stream" each line of the documents instead of loading the full data into memory? Or some other way to make this more memory-efficient? I have read several articles on the topic but could not find any that included a pipeline example.
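For reference, the "streaming" pattern being asked about is usually a re-iterable corpus class that reads one line of a file at a time, so only the file path (not the corpus) lives in memory; gensim models accept any re-iterable in place of an in-memory list. This is a minimal sketch under that assumption; the file contents and whitespace tokenization are illustrative, not part of the original script:

```python
import tempfile

class LineCorpus:
    """Yield one tokenized document per line instead of materialising
    the whole corpus in memory. Because __iter__ reopens the file each
    time, the object can be iterated repeatedly, which Doc2Vec-style
    training (multiple passes over the data) requires."""

    def __init__(self, path):
        self.path = path  # only the path is held in memory

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.strip().split()  # naive whitespace tokenizer

# Tiny demo with a throwaway two-document "corpus" file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hello world\nfoo bar baz\n')
path = tmp.name

corpus = LineCorpus(path)
print(list(corpus))                  # [['hello', 'world'], ['foo', 'bar', 'baz']]
print(list(corpus) == list(corpus))  # True: safely re-iterable
```

Note this pattern alone does not drop into the pipeline above unchanged, since GridSearchCV needs to index into X for its CV splits; it only illustrates the streaming idea the question refers to.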
[Discussion]:
Tags: scikit-learn streaming gensim corpus