How to do sklearn KFold with custom class strata?

【Question title】: How to do sklearn KFold with custom class strata?
【Posted】: 2020-07-28 18:40:23
【Question description】:

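I'm reading this article to understand how to do a proper KFold on a very imbalanced dataset. Its last example shows how to split the dataset into 2 folds, 50/50 train/test. All very cool and interesting. However, I'd like to know how to do a split where I can also control the class distribution in each fold, e.g. 50/50 class0/class1 (a.k.a. undersampling/oversampling). So, given the data below and assuming I want 4 folds, I'm looking for the following result: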

>Train: 0=8, 1=8, 
>Train: 0=8, 1=8, 
>Train: 0=8, 1=8, 
>Train: 0=8, 1=8,

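Is there any way to achieve this with the sklearn.model_selection methods? I have looked everywhere for this without luck. Could that be because this approach shouldn't be used with KFold?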

# example of stratified train/test split with an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)

# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# summarize
train_0, train_1 = len(trainy[trainy==0]), len(trainy[trainy==1])
test_0, test_1 = len(testy[testy==0]), len(testy[testy==1])
print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

>Train: 0=495, 1=5, Test: 0=495, 1=5

【Question comments】:

Tags: python scikit-learn cross-validation k-fold

【Solution 1】:

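If a 50/50 distribution between classes is your goal, note that sklearn's StratifiedKFold does not get you there by itself. It does no undersampling; it keeps the class proportions of the full dataset in every fold, so a 99/1 dataset yields roughly 99/1 folds. To balance the classes you have to resample the training part of each fold.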
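As a quick check of what StratifiedKFold does to the class counts, here is a minimal sketch that reuses the dataset from the question and splits it into 4 folds. The variable names (`skf`, `fold_counts`) and the `shuffle`/`random_state` settings are illustrative choices, not something taken from the original post.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Same imbalanced 2-class dataset as in the question (about 99% class 0).
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01],
                           flip_y=0, random_state=1)

# 4 stratified folds; shuffle and random_state are arbitrary illustrative values.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=2)

fold_counts = []  # one (train_counts, test_counts) pair per fold
for train_idx, test_idx in skf.split(X, y):
    train_counts = Counter(y[train_idx].tolist())
    test_counts = Counter(y[test_idx].tolist())
    fold_counts.append((train_counts, test_counts))
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d'
          % (train_counts[0], train_counts[1], test_counts[0], test_counts[1]))
```

Each printed line should show the minority class at around 1% of both the training and the test part, i.e. the original imbalance is preserved rather than balanced out.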

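If you want control over the ratio, say a 30/70 distribution, sklearn alone is not enough and you need the imbalanced-learn library (imported as imblearn). For example, RandomUnderSampler lets you control the resulting distribution through its sampling_strategy parameter. In fact, if you work with very imbalanced datasets in Python, you should probably get familiar with that library and its algorithms to some extent, not just RandomUnderSampler.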

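【Comments】:

OK, but how do I adjust the class split?
The documentation page for RandomUnderSampler, specifically the part describing the sampling_strategy parameter, explains it well, so I decided not to repeat it in this answer.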