How to split data on balanced training set and test set on sklearn: answers

【Question title】: How to split data on balanced training set and test set on sklearn
【Posted】: 2016-05-30 02:27:19
【Question description】:

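I am using sklearn for a multi-class classification task. I need to split all the data into train_set and test_set, and I want to randomly take the same number of samples from each class. At the moment I am using this function: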

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)

But it gives an unbalanced dataset! Any suggestions?

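【Question discussion】:

If you still want to use cross_validation.train_test_split and you are on sklearn 0.17, you can balance the training and test sets; see my answer.
As a side note, for an unbalanced training set with, for example, sklearn.ensemble.RandomForestClassifier, you can use class_weight="balanced".
@Shadi: Please note that balancing your training set is something different; class_weight affects your cost minimization.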
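
To make the class_weight comment concrete: the scikit-learn documentation describes the "balanced" mode as weighting each class by n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula with plain numpy on made-up labels. The helper name balanced_class_weights and the data are illustrative assumptions, not part of sklearn.

```python
import numpy as np

def balanced_class_weights(y):
    """Hypothetical helper: weight per class = n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# made-up labels: 8 samples of class 0 and 2 samples of class 1
y = np.array([0] * 8 + [1] * 2)
weights = balanced_class_weights(y)
print(weights)  # {0: 0.625, 1: 2.5}
```

The rare class gets the larger weight, so its errors count more in the loss. The data itself is left untouched, which is why this is not the same as a balanced split.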

Tags: machine-learning scikit-learn svm cross-validation

【Solution 1】:

Although Christian's suggestion is correct, technically train_test_split should give you stratified results if you use the stratify parameter.

So you could do:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)

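The catch is that the stratify parameter is only available starting from version 0.17 of sklearn.

From the documentation about the parameter stratify:

stratify : array-like or None (default is None). If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting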

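【Discussion】:

But if the classes aren't balanced in the data (class1=200 samples, class2=250 samples, ...) and I need to take (100, 100) for training and (50, 50) for testing, how can I do that?
There are two more parameters in train_test_split: train_size and test_size (besides a float representing a proportion, they can also be an int). I never tried it, but I think train_size=100, test_size=50 combined with the stratify parameter should work.
I didn't try it, but if you do that you should get 100 training samples that follow the original distribution and 50 test samples that follow the original distribution too. (I will change the example a bit to clarify: suppose class1=200 samples and class2=400 samples.) Then your training set will have 33 examples from class1 and 67 from class2, and your test set will have roughly 18 examples from class1 and 32 from class2. As far as I understand, the original question is trying to get a training set with 50 examples from class1 and 50 from class2, but a test set with 18 examples from class1 and 32 from class2.
To clarify, a stratified split creates samples of the data in the same proportions as the original. For example, if the classes in your data are split 70/30, then a stratified split will create samples with a 70/30 split.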
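
The point made in these comments can be checked with a small experiment. The sketch below uses the class1=200, class2=400 example from the discussion with integer train_size and test_size; the data is synthetic, and train_test_split is imported from sklearn.model_selection because the older sklearn.cross_validation module was removed in version 0.20.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic data: 200 samples of class 1 and 400 samples of class 2
y = np.array([1] * 200 + [2] * 400)
X = np.arange(len(y)).reshape(-1, 1)  # one feature that doubles as a sample id

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, test_size=50, random_state=0, stratify=y)

# count how many samples of each class ended up in each part
train_counts = dict(zip(*np.unique(y_train, return_counts=True)))
test_counts = dict(zip(*np.unique(y_test, return_counts=True)))
print(train_counts, test_counts)
```

Both parts keep roughly the original 1:2 class ratio, so stratification preserves the imbalance rather than removing it.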

【Solution 2】:

You can use StratifiedShuffleSplit to create datasets with the same percentage of each class as the original dataset:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])
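# model_selection API: n_splits replaces the old n_iter, and the labels are
# passed to split() instead of the constructor
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train = X[train_idx]
    y_train = y[train_idx]

print(X_train)
# two rows, one drawn from each class
print(y_train)
# one label 0 and one label 1 (the order depends on the shuffle)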

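【Discussion】:

Note from the documentation: sklearn.cross_validation.StratifiedShuffleSplit is deprecated since version 0.18, and that module will be removed in 0.20. Use sklearn.model_selection.StratifiedShuffleSplit instead.
"to create datasets with the same percentage of each class as the original dataset": according to github.com/scikit-learn/scikit-learn/issues/8913 this is not always the case.
I suppose the code is untested, because I get an error saying that stratSplit is not iterable.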

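【Solution 3】:

If the classes are not balanced but you want the split to be balanced, then stratifying isn't going to help. There doesn't seem to be a method for balanced sampling in sklearn, but it is fairly easy with basic numpy. For example, a function like this might help you: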

import numpy as np

def split_balanced(data, target, test_size=0.2):

    classes = np.unique(target)
    # test_size can be a fraction of the input data size or a number of samples
    if test_size<1:
        n_test = int(np.round(len(target)*test_size))
    else:
        n_test = test_size
    n_train = max(0,len(target)-n_test)
    n_train_per_class = max(1,int(np.floor(n_train/len(classes))))
    n_test_per_class = max(1,int(np.floor(n_test/len(classes))))

    ixs = []
    for cl in classes:
        if (n_train_per_class+n_test_per_class) > np.sum(target==cl):
            # if data has too few samples for this class, do upsampling
            # split the data to training and testing before sampling so data points won't be
            #  shared among training and test data
            splitix = int(np.ceil(n_train_per_class/(n_train_per_class+n_test_per_class)*np.sum(target==cl)))
            ixs.append(np.r_[np.random.choice(np.nonzero(target==cl)[0][:splitix], n_train_per_class),
                np.random.choice(np.nonzero(target==cl)[0][splitix:], n_test_per_class)])
        else:
            ixs.append(np.random.choice(np.nonzero(target==cl)[0], n_train_per_class+n_test_per_class,
                replace=False))

    # take same num of samples from all classes
    ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
    ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class+n_test_per_class)] for x in ixs])

    X_train = data[ix_train,:]
    X_test = data[ix_test,:]
    y_train = target[ix_train]
    y_test = target[ix_test]

    return X_train, X_test, y_train, y_test

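Note that if you use this and sample more points per class than the input data contains, those points will be upsampled (sampled with replacement). As a result, some data points will appear multiple times, which may affect accuracy measures and so on. If some class has only one data point, there will be an error. You can easily check the number of points per class, for example with np.unique(target, return_counts=True).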

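【Discussion】:

I like the principle, but I think the current implementation has a problem: the random sampling may assign identical samples to the training and test sets. The sampling should probably collect training and test indices from separate pools.
You are absolutely right. I tried to mention this by saying "you may have replicated points in your training and test data, which can cause your model performance to look overly optimistic", but I now understand the wording might not have been perfect, sorry about that. I'll edit the code so that there are no shared data points anymore.
I am not sure your post is accurate. When you say "balanced", do you mean that the proportion of each class is roughly equal? Or do you mean that the test set has roughly the same class distribution as the training set? Stratified sampling can achieve the latter.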
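
The "separate pools" idea raised in this discussion can be sketched as follows. This is a minimal illustration and not the answer author's code: the function name balanced_split_indices and its parameters are hypothetical, it samples without replacement only, and it raises an error instead of upsampling when a class is too small.

```python
import numpy as np

def balanced_split_indices(target, n_train_per_class, n_test_per_class, seed=None):
    """Hypothetical helper: equal per-class counts, disjoint train/test indices."""
    rng = np.random.default_rng(seed)
    target = np.asarray(target)
    needed = n_train_per_class + n_test_per_class
    train_idx, test_idx = [], []
    for cl in np.unique(target):
        # shuffle this class's indices once, then cut two non-overlapping slices
        pool = rng.permutation(np.flatnonzero(target == cl))
        if len(pool) < needed:
            raise ValueError(f"class {cl!r} has {len(pool)} samples, need {needed}")
        train_idx.append(pool[:n_train_per_class])
        test_idx.append(pool[n_train_per_class:needed])
    return np.concatenate(train_idx), np.concatenate(test_idx)

# class sizes taken from the discussion under solution 1: class1=200, class2=250
target = np.array([1] * 200 + [2] * 250)
train_idx, test_idx = balanced_split_indices(target, 100, 50, seed=0)
print(np.unique(target[train_idx], return_counts=True))  # 100 samples per class
print(np.unique(target[test_idx], return_counts=True))   # 50 samples per class
```

Because each class pool is permuted once and then sliced, no index can land in both sets. The cost is that samples beyond the requested counts are simply left unused.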

【Solution 4】:

Here is the implementation I use to get train/test data indexes:

import numpy as np

def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
    classes, counts = np.unique(target, return_counts=True)
    nPerClass = float(len(target))*float(trainSize)/float(len(classes))
    if nPerClass > np.min(counts):
        print("Insufficient data to produce a balanced training data split.")
        print("Classes found %s"%classes)
        print("Classes count %s"%counts)
        ts = float(trainSize*np.min(counts)*len(classes)) / float(len(target))
        print("trainSize is reset from %s to %s"%(trainSize, ts))
        trainSize = ts
        nPerClass = float(len(target))*float(trainSize)/float(len(classes))
    # round the per-class sample count down to an integer
    nPerClass = int(nPerClass)
    print("Data splitting on %i classes and returning %i per class"%(len(classes),nPerClass ))
    # get indexes
    trainIndexes = []
    for c in classes:
        if seed is not None:
            np.random.seed(seed)
        cIdxs = np.where(target==c)[0]
        cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
        trainIndexes.extend(cIdxs)
    # get test indexes
    testIndexes = None
    if getTestIndexes:
        testIndexes = list(set(range(len(target))) - set(trainIndexes))
    # shuffle in place (np.random.shuffle returns None, so do not reassign)
    if shuffle:
        np.random.shuffle(trainIndexes)
        if testIndexes is not None:
            np.random.shuffle(testIndexes)
    # return indexes
    return trainIndexes, testIndexes

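【Discussion】:

【Solution 5】:

Another approach is to over- or under-sample from your stratified test/train split. The imbalanced-learn library is quite handy for this, and it is especially useful if you are doing online learning and want to guarantee balanced training data within your pipelines.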

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbalancePipeline
from sklearn.svm import SVC

model = ImbalancePipeline(steps=[
  ('data_balancer', RandomOverSampler()),
  ('classifier', SVC()),
])
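
For readers who want to see what the under-sampling variant does to a training set, here is a numpy-only sketch. It is not imbalanced-learn's implementation (that library provides its own samplers, such as RandomOverSampler used above); the function name undersample_to_minority and the data are assumptions made for illustration.

```python
import numpy as np

def undersample_to_minority(X, y, seed=None):
    """Hypothetical helper: keep as many samples per class as the rarest class has."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # draw n_keep indices per class without replacement
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cl), size=n_keep, replace=False)
        for cl in classes
    ])
    rng.shuffle(keep)  # mix the classes so they are not grouped together
    return X[keep], y[keep]

# made-up training portion with a 70/30 class imbalance
y_train = np.array([0] * 70 + [1] * 30)
X_train = np.arange(100).reshape(-1, 1)
X_bal, y_bal = undersample_to_minority(X_train, y_train, seed=0)
print(np.unique(y_bal, return_counts=True))  # 30 samples of each class
```

Only the training portion is resampled here. The test set keeps its original distribution, so the evaluation still reflects the real class frequencies.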

【Discussion】: