How to split/partition a dataset into training and test datasets for, e.g., cross validation? Answers

【Question title】: How to split/partition a dataset into training and test datasets for, e.g., cross validation?
【Posted】: 2011-04-10 02:51:34
【Question description】:

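What is a good way to randomly split a NumPy array into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.

【Question comments】:

Tags: python arrays optimization numpy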

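【Solution 1】:

If you want to split the dataset once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):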

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
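
Neither snippet above actually fixes the seed. A minimal sketch of a reproducible version of the index-based split, assuming NumPy's Generator API (numpy.random.default_rng); the seed value 0 and the stand-in dataset are arbitrary choices for illustration:

```python
import numpy as np

# Assumption: a local Generator with a fixed seed (the value 0 is arbitrary).
rng = np.random.default_rng(0)

x = np.arange(500).reshape(100, 5)   # stand-in dataset: 100 rows, 5 columns
indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx, :], x[test_idx, :]

# Re-creating the generator with the same seed reproduces the same permutation.
indices_again = np.random.default_rng(0).permutation(x.shape[0])
```

Using a local generator rather than the global numpy.random state keeps the split reproducible even if other code draws random numbers in between.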

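There are many other ways to repeatedly partition the same dataset for cross validation. Many of them are available in the sklearn library (k-fold, leave-n-out, ...). sklearn also includes more advanced "stratified sampling" methods that create partitions of the data that are balanced with respect to some feature, for example to make sure that the training and test sets contain the same proportion of positive and negative examples.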
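
To illustrate what a k-fold partition does, here is a NumPy-only sketch. It is not sklearn's implementation; the helper name kfold_indices and its parameters are hypothetical, and it assumes n_folds is at least 2:

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold.

    Hypothetical helper for illustration; requires n_folds >= 2.
    """
    rng = np.random.default_rng(seed)
    # Shuffle all indices once, then cut them into n_folds nearly equal pieces.
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

x = np.random.rand(100, 5)
for train_idx, test_idx in kfold_indices(x.shape[0], n_folds=5):
    training, test = x[train_idx], x[test_idx]
```

Because the indices are shuffled only once, the test folds never overlap, which addresses the resampling concern that comes up with independently drawn splits.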

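【Discussion】:

Thanks for the solutions. But doesn't the last method, using randint, have a good chance of giving the same indices for both the test and training sets?
The second solution is a valid answer, while the first and third are not. For the first solution, shuffling the dataset is not always an option; there are many cases where you have to keep the order of the data inputs. And the third one could very well produce the same indices for test and training (as pointed out by @ggauravr).
You should not resample for your cross validation set. The whole idea is that the CV set has never been seen by your algorithm before. The training and test sets are used to fit the data, so of course you will get good results if you include those in your CV set. I want to upvote this answer because the second solution is what I needed, but this answer has problems.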

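【Solution 2】:

I'm aware that my solution is not the best, but it comes in handy when you want to split data in a simple way, especially when teaching data science to newcomers!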

def simple_split(descriptors, targets):
    testX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 0]
    validX_indices = [i for i in range(descriptors.shape[0]) if i % 4 == 1]
    trainX_indices = [i for i in range(descriptors.shape[0]) if i % 4 >= 2]

    TrainX = descriptors[trainX_indices, :]
    ValidX = descriptors[validX_indices, :]
    TestX = descriptors[testX_indices, :]

    TrainY = targets[trainX_indices]
    ValidY = targets[validX_indices]
    TestY = targets[testX_indices]

    return TrainX, ValidX, TestX, TrainY, ValidY, TestY

According to this code, the data is split into three parts: 1/4 for the test set, another 1/4 for the validation set, and 2/4 for the training set.
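
The same modulo-based split can also be written with boolean masks instead of index lists. This is only a sketch of an equivalent formulation; the stand-in arrays (12 rows) are made up for illustration:

```python
import numpy as np

descriptors = np.arange(24).reshape(12, 2)   # stand-in features, 12 rows
targets = np.arange(12)                      # stand-in labels

# Row position modulo 4 decides the subset: 0 -> test, 1 -> validation, 2 or 3 -> training.
remainder = np.arange(descriptors.shape[0]) % 4
test_mask, valid_mask, train_mask = remainder == 0, remainder == 1, remainder >= 2

TrainX, ValidX, TestX = descriptors[train_mask], descriptors[valid_mask], descriptors[test_mask]
TrainY, ValidY, TestY = targets[train_mask], targets[valid_mask], targets[test_mask]

print(TrainX.shape[0], ValidX.shape[0], TestX.shape[0])  # 6 3 3
```

Note that this split is deterministic and depends on row order, so it only gives representative subsets if the rows are not sorted in some systematic way.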

【Discussion】:

【Solution 3】:

Split into train, test and valid

import numpy as np

x = np.expand_dims(np.arange(100), -1)
print(x)

n = x.shape[0]
indices = np.random.permutation(n)

# 90% training, 5% test, 5% validation; the three index ranges must not overlap
training_idx, test_idx, val_idx = indices[:int(n*.9)], indices[int(n*.9):int(n*.95)], indices[int(n*.95):]

training, test, val = x[training_idx,:], x[test_idx,:], x[val_idx,:]

print(training, test, val)

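【Discussion】:

【Solution 4】:

You may need to split not only into train and test sets, but also a cross validation set, to make sure your model generalizes. Here I am assuming 70% training data, 20% validation data and 10% holdout/test data.

Check out np.split:

If indices_or_sections is a 1-D array of sorted integers, the entries indicate where along axis the array is split. For example, [2, 3] would, for axis=0, result in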

ary[:2] ary[2:3] ary[3:]

t, v, h = np.split(df.sample(frac=1, random_state=1), [int(0.7*len(df)), int(0.9*len(df))])
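
The one-liner above expects df to be a pandas DataFrame. As a sketch of the same 70/20/10 idea on a plain NumPy array (the array contents and the seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.arange(200).reshape(100, 2)   # stand-in dataset: 100 rows

# permutation() on a 2-D array shuffles its rows and returns a copy.
shuffled = rng.permutation(data)
n = len(shuffled)

# Cut points at 70% and 90% give train / validation / holdout pieces.
train, valid, holdout = np.split(shuffled, [int(0.7 * n), int(0.9 * n)])
```

The two cut points are positions, not sizes: the second piece runs from the 70% mark to the 90% mark, and the last piece takes whatever remains.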

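【Discussion】:

【Solution 5】:

After doing some reading and taking into account the (many..) different ways of splitting the data into train and test, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):

1. Shuffle the whole matrix arr and then split the data into train and test
2. Shuffle the indices and then assign x and y to split the data
3. Same as method 2, but done in a more efficient way
4. Use a pandas dataframe to split

Method 3 won by far with the shortest time, followed by method 1; methods 2 and 4 turned out to be really inefficient.

The code for the 4 different methods I timed: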

import numpy as np
arr = np.random.rand(100, 3)
X = arr[:,:2]
Y = arr[:,2]
spl = 0.7
N = len(arr)
sample = int(spl*N)

#%% Method 1:  shuffle the whole matrix arr and then split
np.random.shuffle(arr)
x_train, x_test, y_train, y_test = X[:sample,:], X[sample:, :], Y[:sample, ], Y[sample:,]

#%% Method 2: sample the training indices, then apply them to X and Y
train_idx = np.random.choice(N, sample, replace=False)  # replace=False avoids duplicate training rows
Xtrain = X[train_idx]
Ytrain = Y[train_idx]

test_idx = [idx for idx in range(N) if idx not in train_idx]
Xtest = X[test_idx]
Ytest = Y[test_idx]

#%% Method 3: shuffle indices without a for loop
idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
train_idx, test_idx = idx[:sample], idx[sample:]
x_train, x_test, y_train, y_test = X[train_idx,:], X[test_idx,:], Y[train_idx,], Y[test_idx,]

#%% Method 4: using pandas dataframe to split
import pandas as pd
df = pd.read_csv(file_path, header=None) # Some csv file (I used some file with 3 columns)

train = df.sample(frac=0.7, random_state=200)
test = df.drop(train.index)

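As for the timings, the minimum execution time out of 3 repetitions of 1000 loops is:

Method 1: 0.35883826200006297 seconds
Method 2: 1.7157016959999964 seconds
Method 3: 1.7876616719995582 seconds
Method 4: 0.07562861499991413 seconds

Hope that helps!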
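
The answer does not show how the timings were taken. One way to get a "best of 3 repetitions of 1000 loops" figure is the standard timeit module; the sketch below is an assumption about the setup, not the author's script, and wraps method 3 in a hypothetical function method_3:

```python
import timeit
import numpy as np

arr = np.random.rand(100, 3)
X, Y = arr[:, :2], arr[:, 2]
sample = int(0.7 * len(arr))

def method_3():
    # Shuffle the indices once, then slice them into train and test parts.
    idx = np.random.permutation(arr.shape[0])
    train_idx, test_idx = idx[:sample], idx[sample:]
    return X[train_idx, :], X[test_idx, :], Y[train_idx], Y[test_idx]

# Each repetition runs the function 1000 times; keep the fastest of 3 repetitions.
best = min(timeit.repeat(method_3, number=1000, repeat=3))
print(f"Method 3: {best} seconds")
```

Absolute numbers depend heavily on the machine and on the array size, so it is worth rerunning such a comparison on data of realistic size before drawing conclusions.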

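【Discussion】:

Great share, I wonder why there are no upvotes :)

【Solution 6】:

There is another option that only requires scikit-learn. As scikit's wiki describes, you can just use the following instructions: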

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you are splitting into training and test sets.
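
To show what "in sync" means, here is a NumPy-only sketch that applies one shared permutation to both arrays. It is not how train_test_split is implemented; the helper name split_in_sync and its defaults are hypothetical:

```python
import numpy as np

def split_in_sync(data, labels, test_size=0.2, seed=42):
    """Split data and labels with the same shuffled indices (hypothetical helper)."""
    data, labels = np.asarray(data), np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_test = int(np.ceil(test_size * len(data)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Indexing both arrays with the same index arrays keeps rows and labels paired.
    return data[train_idx], data[test_idx], labels[train_idx], labels[test_idx]

data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = split_in_sync(data, labels)
```

Because labels is converted with np.asarray, this sketch always returns arrays, even when the labels are passed as a range or a list.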

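【Discussion】:

This is a very practical answer, because it handles both the training set and the labels realistically.
It returns a list, not an array.

【Solution 7】:

Since the sklearn.cross_validation module has been deprecated, you can use: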

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

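【Discussion】:

【Solution 8】:

Thanks pberkes for your answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing: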

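import numpy as np
# X is your dataset; pick 80% of the row indices without replacement
training_idx = np.random.choice(X.shape[0], int(np.round(X.shape[0] * 0.8)), replace=False)
# equivalent alternative: take the first 80% of a random permutation
# training_idx = np.random.permutation(X.shape[0])[:int(np.round(X.shape[0] * 0.8))]
test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)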

【Discussion】:

【Solution 9】:

Just a note. In case you want train, test, AND validation sets, you can do this:

from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

X = get_my_X()
y = get_my_y()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

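These parameters give 70% of the data to training, and 15% each to the test and validation sets. Hope this helps.

【Discussion】:

You should probably add this to your code: from sklearn.cross_validation import train_test_split, to make it clear which module you are using
Does this have to be random?
i.e. is it possible to split according to the given order of X and y?
@liang No, it doesn't have to be random. You could just say that the train, test and validation set sizes will be a, b and c percent of the total dataset size. Say a=0.7, b=0.15, c=0.15, d = dataset and N=len(dataset); then x_train = dataset[0:int(a*N)], x_test = dataset[int(a*N):int((a+b)*N)] and x_val = dataset[int((a+b)*N):].
Deprecated: stackoverflow.com/a/34844352/4237080, use from sklearn.model_selection import train_test_split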
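
The order-preserving split described in the reply to @liang can be sketched as follows; the stand-in dataset is an assumption for illustration:

```python
import numpy as np

dataset = np.arange(20)        # stand-in data, kept in its original order
a, b = 0.7, 0.15               # train and test fractions; the remainder is validation
N = len(dataset)

# Two cut points computed once, so the three slices are contiguous and non-overlapping.
first_cut, second_cut = int(a * N), int((a + b) * N)
x_train = dataset[:first_cut]
x_test = dataset[first_cut:second_cut]
x_val = dataset[second_cut:]
```

This keeps the original ordering, which matters for time series and other data where shuffling would leak future information into the training set.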

【Solution 10】:

Here is code that splits the data into n=5 folds in a stratified manner

# X = data array
# y = class labels
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
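
For readers who want to see the idea without sklearn, here is a NumPy-only sketch of stratified fold assignment. It is not sklearn's algorithm; the helper name stratified_fold_ids is hypothetical, and it simply deals each class's samples round-robin across the folds:

```python
import numpy as np

def stratified_fold_ids(y, n_folds=5, seed=0):
    """Return an array giving each sample's fold number, assigned class by class."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for value in np.unique(y):
        # Shuffle the positions of this class, then deal them round-robin to folds.
        members = rng.permutation(np.flatnonzero(y == value))
        fold_ids[members] = np.arange(len(members)) % n_folds
    return fold_ids

y = np.array([0] * 10 + [1] * 5)
fold_ids = stratified_fold_ids(y, n_folds=5)
for k in range(5):
    test_index = np.flatnonzero(fold_ids == k)
    train_index = np.flatnonzero(fold_ids != k)
```

With 10 samples of class 0 and 5 of class 1, every fold receives two samples of class 0 and one of class 1, so the 2:1 class ratio is preserved in each test fold.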

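【Discussion】:

【Solution 11】:

You may also consider a stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.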

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]

【Discussion】:

Thanks! The naming is somewhat misleading: value_inds are truly indices, but the outputs are not indices, only masks.
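
If integer indices are needed rather than boolean masks, the masks can be converted with np.flatnonzero. A small sketch with a hand-written example mask (the mask values are made up, not the output of a particular run):

```python
import numpy as np

y = np.array([1, 1, 2, 2, 3, 3])
train_mask = np.array([True, False, False, True, True, False])  # example mask
test_mask = ~train_mask   # every sample is in exactly one of the two sets

# flatnonzero returns the positions where the mask is True.
train_idx = np.flatnonzero(train_mask)
test_idx = np.flatnonzero(test_mask)
print(train_idx)  # [0 3 4]
print(test_idx)   # [1 2 5]
```

Both forms select the same elements, but masks always return them in their original order, while index arrays can also encode a shuffled order.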

【Solution 12】:

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into chunks of (nearly) equal size and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
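
A short usage sketch of the shuffle-then-partition idea, using only the standard library. The partition function is restated here in a compact form so the example runs on its own; the seed value is arbitrary:

```python
import random

def partition(seq, chunks):
    """Same striding scheme as above: chunk i takes every chunks-th element starting at i."""
    return [list(seq[i::chunks]) for i in range(chunks)]

items = list(range(10))
random.Random(0).shuffle(items)   # seeded shuffle so the result is reproducible
folds = partition(items, 3)

print([len(fold) for fold in folds])  # [4, 3, 3]
```

Each fold can then serve once as the test set while the remaining folds are concatenated into the training set.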

【Discussion】: