How to implement SMOTE in cross validation and GridSearchCV: answers

【Question title】: How to implement SMOTE in cross validation and GridSearchCV
【Posted】: 2018-06-30 09:49:33
【Question description】:

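I am relatively new to Python. Could you help me turn my SMOTE implementation into a proper pipeline? What I want is to apply over-sampling and under-sampling to the training set of every k-fold iteration, so that the model is trained on a balanced dataset and evaluated on the imbalanced held-out part. The problem is that when I do this, I cannot use the familiar sklearn interface for evaluation and grid search.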

Is it possible to build something similar to model_selection.RandomizedSearchCV? My take on this:

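import numpy as np
import pandas as pd
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("Imbalanced_data.csv")  # Load the data set
X = df.iloc[:, 0:64].values
y = df.iloc[:, 64].values
n_splits = 2
n_measures = 2  # Recall and AUC
kf = StratifiedKFold(n_splits=n_splits)  # Stratified so every fold keeps the class proportions
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
scores = np.zeros((n_splits, n_measures))
for fold, (train_index, test_index) in enumerate(kf.split(X, y)):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Resample the training fold only; the test fold stays imbalanced
    sm = SMOTE(sampling_strategy='auto', k_neighbors=5)  # 'ratio' in older imbalanced-learn
    smote_enn = SMOTEENN(smote=sm)
    X_train_res, y_train_res = smote_enn.fit_resample(X_train, y_train)  # formerly fit_sample
    clf_rf.fit(X_train_res, y_train_res)
    y_pred = clf_rf.predict(X_test)  # predict takes only X
    y_score = clf_rf.predict_proba(X_test)[:, 1]  # positive-class probability (binary target assumed)
    scores[fold, 0] = recall_score(y_test, y_pred)  # index by fold number, not by test_index
    scores[fold, 1] = roc_auc_score(y_test, y_score)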

【Question comments】:

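Did you find a solution to your problem?
Yes, actually your comment helped me a lot. Thank you very much!
Hi @VivekKumar, does this approach ensure that the validation set does not contain over-sampled observations when running K-Fold CV? I am trying to find a way so that, after I do a train/test split and then over-sample my training set, the validation set of each CV fold within the training set does not contain bias from the over-sampling. Thanks!
@thePurplePython Yes, you are right. The imblearn pipeline calls the sample() method (fit_resample() in recent imbalanced-learn releases) only on the training data, not on the test data. The test data passes through unchanged.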

Tags: python scikit-learn pipeline cross-validation grid-search

【Solution 1】:

This looks like a good fit for http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html

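You will want to create your own transformer (http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) that returns a balanced dataset when fit is called (presumably the one obtained from StratifiedKFold), but that calls into SMOTE when predict is called, which is what will happen for the test data.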

【Comments】:

【Solution 2】:

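You need to look at the pipeline object. imbalanced-learn has a Pipeline that extends the scikit-learn Pipeline so that, in addition to scikit-learn's fit_predict(), fit_transform() and predict() methods, it also handles the fit_sample() and sample() methods of samplers (fit_resample() in recent imbalanced-learn releases).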

Have a look at this example:

https://imbalanced-learn.org/stable/auto_examples/pipeline/plot_pipeline_classification.html

For your code, you would do this:

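from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestClassifier

sm = SMOTE(k_neighbors=5)  # the SMOTE instance from the question
smote_enn = SMOTEENN(smote=sm)
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)

# Either let make_pipeline name the steps automatically ...
pipeline = make_pipeline(smote_enn, clf_rf)
# ... OR name the steps explicitly:
pipeline = Pipeline([('smote_enn', smote_enn),
                     ('clf_rf', clf_rf)])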

You can then pass this pipeline object, like any regular estimator, to GridSearchCV, RandomizedSearchCV, or other cross-validation tools in scikit-learn.

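from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

kf = StratifiedKFold(n_splits=n_splits)
# param_dist is yours to define; its keys take the form '<step name>__<parameter>'
random_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                   n_iter=1000,
                                   cv=kf)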

【Comments】:

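I tried to open the links in this answer and got a 404 error.
@MarianeReis Thanks for letting me know. I have updated the links now.
Both links still take me to a 404 page.
@agent18 The links have been updated again. Please check now.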