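【Posted】: 2021-02-22 06:53:52
【Question】:
I am currently working on a binary classification problem with about 2000 data points in the training set, and I am wondering whether I should use the entire training set for the grid search or split it first to set aside validation data. I have the following two variants to choose from. The first uses a train/val split; the second uses no split (GridSearchCV on the entire training set).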
Variant 1
Train/validation split
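from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=rs)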
GridSearchCV on SVC (using only X_train and y_train)
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=rs)
params = {"C": [0.12, 0.13, 0.14, 0.15]}
clf = GridSearchCV(SVC(random_state=rs), params, cv=skf, n_jobs=-1, scoring=monetary_score)
clf.fit(X_train, y_train)
# The fitted search object is named clf, not grid
print(clf.best_params_)
print(clf.best_score_, "\n")
print(clf.best_estimator_)
Using the validation set
final_clf = clf.best_estimator_
y_pred = final_clf.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
print(cm)
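For reference, variant 1 can be sketched end to end as below. This is only an illustration under assumptions: the data comes from make_classification as a stand-in for the real ~2000 points, rs and splits are given arbitrary values, and monetary_value is a hypothetical per-sample payoff, since the post does not show how monetary_score is defined. n_jobs=-1 is left out to keep the sketch single-process.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

rs = 42      # assumed seed (not given in the post)
splits = 5   # assumed number of folds (not given in the post)

# Synthetic stand-in for the real binary dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=rs)


def monetary_value(y_true, y_pred):
    """Hypothetical payoff per sample: +10 per TP, -5 per FP, -20 per FN."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (10 * tp - 5 * fp - 20 * fn) / len(y_true)


# Wrap the payoff so GridSearchCV maximizes it.
monetary_score = make_scorer(monetary_value, greater_is_better=True)

# Hold out 20% as validation data before any tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=rs
)

# Tune C with stratified k-fold CV on the training part only.
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=rs)
params = {"C": [0.12, 0.13, 0.14, 0.15]}
clf = GridSearchCV(SVC(random_state=rs), params, cv=skf, scoring=monetary_score)
clf.fit(X_train, y_train)

# Evaluate the refit best estimator on the untouched validation data.
final_clf = clf.best_estimator_
y_pred = final_clf.predict(X_val)
cm = confusion_matrix(y_val, y_pred)
val_payoff = monetary_value(y_val, y_pred)

print(clf.best_params_, clf.best_score_)
print(cm)
print(val_payoff)
```

Here clf.best_score_ is the mean cross-validated payoff on X_train, while val_payoff is computed on data the search never saw, so the two numbers can be compared directly.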
Variant 2
GridSearchCV on SVC (using the entire X and y)
skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=rs)
params = {"C": [0.12, 0.13, 0.14, 0.15]}
clf = GridSearchCV(SVC(random_state=rs), params, cv=skf, n_jobs=-1, scoring=monetary_score)
clf.fit(X, y)
# The fitted search object is named clf, not grid
print(clf.best_params_)
print(clf.best_score_, "\n")
print(clf.best_estimator_)
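In variant 2 nothing is held out, so the only performance figure available is clf.best_score_, the mean of the per-fold test scores for the selected C. The sketch below makes that explicit by reading the per-fold scores from cv_results_. It rests on assumptions: synthetic data from make_classification, arbitrary values for rs and splits, and the built-in "balanced_accuracy" scorer standing in for monetary_score, which the post does not define.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rs = 42      # assumed seed (not given in the post)
splits = 5   # assumed number of folds (not given in the post)

# Synthetic stand-in for the real binary dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=rs)

skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=rs)
params = {"C": [0.12, 0.13, 0.14, 0.15]}

# "balanced_accuracy" is a stand-in for the post's monetary_score.
clf = GridSearchCV(SVC(random_state=rs), params, cv=skf, scoring="balanced_accuracy")
clf.fit(X, y)  # the search sees every sample

# Per-fold test scores of the winning parameter setting.
best = clf.best_index_
fold_scores = np.array(
    [clf.cv_results_[f"split{i}_test_score"][best] for i in range(splits)]
)

print(clf.best_params_)
print(fold_scores)
print(fold_scores.mean(), clf.best_score_)
```

With the default refit=True, clf.best_estimator_ is retrained on all of X and y using the selected C, so in this variant the same cross-validated number serves both to choose C and to report performance.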
【Comments】:
Tags: python training-data grid-search gridsearchcv train-test-split