【Question title】: Select a random subset of data
【Posted】: 2020-06-12 13:00:39
【Question description】:

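I was given a dataset that had already been split into training and validation (test) data. I need to split the training data further into a proper training set and a calibration set, and I don't want to touch my current validation (test) set. I don't have access to the original dataset.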

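I want to do this randomly, so that every time I run my script I get a different training set and calibration set. I know about the .sample() function, but my training dataset has 44000 rows.

Original dataset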

training = dataset.loc[dataset['split']== 'train']
print("Training Created")
#print(training.head())

validation = dataset.loc[dataset['split']== 'valid']
print("Validation Created")
#print(validation.head())

I need something like this:

# proper training set
x_train = breast_cancer.values[:-100, :-1]
y_train = breast_cancer.values[:-100, -1]
# calibration set
x_cal = breast_cancer.values[-100:-1, :-1]
y_cal = breast_cancer.values[-100:-1, -1]
# (x_k+1, y_k+1)
x_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]

I'm not sure how to handle the second split.

Example dataset

Object  | Variable | Split
Cancer1 | 55       | Train
Cancer5 | 45       | Train
Cancer2 | 56       | Valid
Cancer3 | 68       | Valid
Cancer4 | 75       | Valid

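【Discussion】:

Tags: python pandas machine-learning scikit-learn training-data

【Solution 1】:

It looks like you already have a column that assigns each row to the train or validation set. The usual approach is sklearn.model_selection.train_test_split. So, to split your training data further into training and "calibration" sets, just apply it to the training set (note that you have to split into X and y first):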

from sklearn.model_selection import train_test_split

# initial split into train/test (match the column name and labels to your data)
train = df.loc[df['Split'] == 'train']
test = df.loc[df['Split'] == 'valid']

# split the test set into features and target
# (positional slicing needs .iloc; the target is assumed to be the last column)
x_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]

# same with the train set
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]

# split into train and calibration sets
X_train, X_calib, y_train, y_calib = train_test_split(X_train, y_train)
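
For a quick check of the idea, here is a minimal self-contained sketch. The toy DataFrame, its column names (`feature`, `target`, `Split`) and the 20% calibration share are assumptions made up for illustration, not taken from the asker's data. Leaving `random_state` unset should give a different train/calibration partition on each run, which is what the question asks for.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows labelled "train" and 4 labelled "valid".
df = pd.DataFrame({
    "feature": np.arange(14, dtype=float),
    "target": [0, 1] * 7,
    "Split": ["train"] * 10 + ["valid"] * 4,
})

train = df.loc[df["Split"] == "train"]
test = df.loc[df["Split"] == "valid"]  # held out, never resplit

X = train[["feature"]]
y = train["target"]

# No random_state: rows are shuffled differently on every run.
# test_size=0.2 sends 20% of the training rows to the calibration set.
X_train, X_calib, y_train, y_calib = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_calib), len(test))
```

Passing a fixed `random_state` instead would make the partition reproducible, which can help while debugging.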

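【Comments】:

Thanks @yatu, this seems like a logical approach. Can it also be done on my current test data? I know it's confusing because my data has already been split and labelled. Since I don't have access to the original data the model was trained on, it's a bit difficult.
So the way to handle this is to split the training data into training and validation. You train, then use the validation set to get accuracy metrics. Then you hold the unseen test set back until the end, usually for comparison purposes @bio
Does that make sense @Biohacker :) ?
Sorry for my ignorance. I've updated the question with a sample dataframe showing roughly what my data looks like. In the code you posted, what you split off is the validation set, is that right?
OK, updated again, see whether this makes more sense to you @bio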

【Solution 2】:

1. Separate the test set from the full dataset.

2. Then split the remaining data into training and calibration sets.

from sklearn.model_selection import train_test_split

# define the test set
X_test = breast_cancer.values[-1, :-1]
y_test = breast_cancer.values[-1, -1]

# Get the remaining dataset 
X = breast_cancer.values[:-1, :-1]
y = breast_cancer.values[:-1, -1]

# Split the remaining dataset into train and calibration sets.
X_train, X_calib, y_train, y_calib = train_test_split(X, y)
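
The question mentions `.sample()`, and the same second split can also be sketched with pandas alone. The frame below, its column names and the 25% calibration fraction are hypothetical stand-ins for the asker's training rows. With no `random_state`, each run should draw a different calibration subset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the rows already labelled as training data.
training = pd.DataFrame({
    "feature": np.arange(20, dtype=float),
    "target": [0, 1] * 10,
})

# Draw 25% of the rows at random (without replacement) for calibration.
calibration = training.sample(frac=0.25)

# Every row that was not drawn stays in the proper training set.
proper_training = training.drop(calibration.index)

print(len(proper_training), len(calibration))
```

This relies on the index being unique, since `drop` removes rows by index label.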

【Comments】: