[Question title]: How to apply oversampling when doing Leave-One-Group-Out cross validation?
[Posted]: 2019-11-19 17:12:06
[Question]:

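I am working with imbalanced data for classification, and I have previously oversampled the training data with the Synthetic Minority Over-sampling Technique (SMOTE). This time, however, I think I also need Leave-One-Group-Out (LOGO) cross-validation, because I want to leave one subject out on each CV iteration.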

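I'm not sure I can explain this well, but as I understand it, to use SMOTE with k-fold CV we can apply SMOTE inside the loop over the folds, as I saw in this code on another post. Below is an example of SMOTE applied within k-fold CV.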

from sklearn.model_selection import KFold
from imblearn.over_sampling import SMOTE
from sklearn.metrics import f1_score

kf = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf.split(X), 1):
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]
    # Oversample the training fold only; the test fold keeps its original distribution.
    sm = SMOTE()
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases removed.
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model = ...  # classification model example
    # Fit on the oversampled data (fitting on X_train, y_train would ignore SMOTE).
    model.fit(X_train_oversampled, y_train_oversampled)
    y_pred = model.predict(X_test)
    print(f'For fold {fold}:')
    print(f'Accuracy: {model.score(X_test, y_test)}')
    print(f'f-score: {f1_score(y_test, y_pred)}')

Without SMOTE, I tried the following to do LOGO CV. Done this way, though, I would be training on a highly imbalanced dataset.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# X is the feature array (the original post also refers to it as X_std).
y = np.array(df.loc[:, df.columns == 'label'])
groups = df["cow_id"].values  # leave out all rows of one cow ID on each run
logo = LeaveOneGroupOut()

scores = []
# Iterate over logo.split(...) directly: it returns a one-shot generator.
for train_index, test_index in logo.split(X, y, groups):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    model.fit(X_train, y_train.ravel())
    scores.append(model.score(X_test, y_test.ravel()))
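
For reference, here is a minimal sketch of how LeaveOneGroupOut partitions data by group. The toy arrays and cow IDs (`X_toy`, `y_toy`, `groups_toy`) are made up for illustration and are not the data from the post. Each iteration holds out every row of exactly one group, and the printed class counts show how imbalanced each training split remains before any oversampling.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy data: 6 samples from 3 cows, 2 samples per cow.
X_toy = np.arange(12).reshape(6, 2)
y_toy = np.array([0, 0, 0, 1, 0, 1])
groups_toy = np.array([101, 101, 102, 102, 103, 103])

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X_toy, y_toy, groups_toy):
    test_groups = np.unique(groups_toy[test_idx])
    train_groups = np.unique(groups_toy[train_idx])
    # The held-out group must never appear among the training groups.
    assert not set(test_groups) & set(train_groups)
    held_out.append(int(test_groups[0]))
    print(f"test cow: {test_groups}, train cows: {train_groups}, "
          f"train class counts: {np.bincount(y_toy[train_idx])}")
```

The number of splits equals the number of distinct groups, so a dataset with many cows yields many small test sets.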

How should I implement SMOTE inside the leave-one-group-out CV loop? I am confused about how to define the group list for the synthetic training data.

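[Comments on the question]:

  • This question is unclear to me. Could you give an example with a toy dataset and tell us exactly how you would like to split it?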

Tags: python machine-learning scikit-learn cross-validation imblearn


[Solution 1]:

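The LOOCV approach suggested here makes more sense for leave-out cross-validation. Hold out one group as the test set and oversample only the remaining groups. Train your classifier on all of the oversampled data and evaluate it on the held-out test set. Because the split is made on the original samples before SMOTE runs, the synthetic samples never need group labels of their own.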

In your case, the following code would be the correct way to implement SMOTE inside the LOGO CV loop.

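from imblearn.over_sampling import SMOTE

scores = []
# Call logo.split(...) again: a generator consumed by an earlier loop yields nothing.
for train_index, test_index in logo.split(X, y, groups):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index].ravel(), y[test_index].ravel()
    # Oversample the training groups only; the held-out group stays untouched.
    sm = SMOTE()
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases removed.
    X_train_oversampled, y_train_oversampled = sm.fit_resample(X_train, y_train)
    model.fit(X_train_oversampled, y_train_oversampled)
    scores.append(model.score(X_test, y_test))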

