【Posted】: 2017-12-06 18:27:35
【Problem description】:
I am trying to get this code to run in parallel.
Right now each model takes about 6 minutes to run, which is far too slow.
d_mtry = {}
# ------------------------------------------------------------------------
count = 0
itr = COUNT  # COUNT is set elsewhere in the script
while count < itr:  # note: the colon was missing here
    kf = KFold(n_splits=10)
    # --------------------------------------------------------------------
    for j in range(2, 25):
        avg_acc = 0
        for train_index, test_index in kf.split(X_train):
            X_train_K, X_test_K = X_train[train_index], X_train[test_index]
            y_train_K, y_test_K = y_train[train_index], y_train[test_index]
            rotf = RRForestClassifier(n_estimators=30,
                                      criterion='entropy',
                                      max_features=j,
                                      n_jobs=-1,
                                      random_state=1)
            rotf.fit(X_train_K, y_train_K)
            y_predict_K = rotf.predict(X_test_K)
            y_prob = rotf.predict_proba(X_test_K)
            acc_score = accuracy_score(y_test_K, y_predict_K)
            avg_acc += acc_score
        d_mtry[str(j)] = avg_acc / 10
    # --------------------------------------------------------------------
    # iteritems() is Python 2 only; items() works in both
    best_mtry = max(d_mtry.items(), key=operator.itemgetter(1))[0]
    f.write("\n" + "Iteration: " + str(count + 1) + " Best M_Try: " + str(best_mtry) + "\n")
    f.write(str(d_mtry))
    rotf = RRForestClassifier(n_estimators=30,
                              criterion='entropy',
                              max_features=int(best_mtry),
                              n_jobs=-1,
                              random_state=1)
    # there is more code after this; I don't think it is relevant,
    # it does calculations on the model rotf
    # --------------------------------------------------------------------
    count += 1
# ------------------------------------------------------------------------
The main problem I am having is that my attempts do not update the dictionary correctly when run concurrently. I also found from another post that executing the method through a multiprocessing.Pool() instance did not seem to make it run any faster.
The goal here is to find the best value of j (the max_features constructor argument) based on the mean accuracy across the folds, then use that value when I build the final model and run it on the test set.
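One sketch of how this search could be parallelized, assuming the usual reason the dictionary "does not update": worker processes receive a copy of the dict, so in-place writes are lost. Instead, each worker can evaluate one max_features candidate and return a (j, mean accuracy) pair, and the parent builds the dict from the returned results. This is a hedged example, not the original setup: RandomForestClassifier stands in for RRForestClassifier, and X_train/y_train are synthetic stand-ins.

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the real training data.
X_train, y_train = make_classification(n_samples=200, n_features=25,
                                       random_state=1)

def evaluate(j):
    # One full 10-fold CV run for a single max_features candidate.
    # Keep the estimator single-threaded (n_jobs=1): the outer Parallel
    # call already uses all cores, and nesting n_jobs=-1 at both levels
    # oversubscribes them.
    clf = RandomForestClassifier(n_estimators=30, criterion='entropy',
                                 max_features=j, n_jobs=1, random_state=1)
    scores = cross_val_score(clf, X_train, y_train, cv=10)
    return j, scores.mean()

# Run the candidates in parallel; results come back to the parent process,
# so no shared dictionary is needed.
results = Parallel(n_jobs=-1)(delayed(evaluate)(j) for j in range(2, 25))
d_mtry = dict(results)
best_mtry = max(d_mtry, key=d_mtry.get)
```

Returning results rather than mutating shared state is the simplest way around the copy-on-fork problem; multiprocessing.Manager().dict() would also work but adds IPC overhead for no benefit here.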
Originally I tried GridSearchCV(), but I had problems fitting it and it never finished running, even on an AWS-hosted setup with 36 cores.
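For reference, the same search expressed with GridSearchCV would look roughly like the sketch below. A common cause of it never finishing is nested parallelism (n_jobs=-1 on both the estimator and the search); pinning the estimator to a single thread avoids that. Again RandomForestClassifier stands in for RRForestClassifier, and X_train/y_train are synthetic stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-ins for the real training data.
X_train, y_train = make_classification(n_samples=200, n_features=25,
                                       random_state=1)

# Let GridSearchCV own the parallelism (n_jobs=-1) and keep the
# estimator itself single-threaded (n_jobs=1).
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, criterion='entropy',
                           n_jobs=1, random_state=1),
    param_grid={'max_features': list(range(2, 25))},
    cv=10,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_mtry = search.best_params_['max_features']
```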
Any help is appreciated.
【Comments】:
Tags: python dictionary parallel-processing scikit-learn cross-validation