If we inspect the relevant part of the source code, we find that the first int(validation_split * num_samples) samples in X (and y) are used for validation, and the remaining samples are used for training:
split_idx = int(len(x) * image_data_generator._validation_split)
# ...
if subset == 'validation':
    x = x[:split_idx]
    x_misc = [np.asarray(xx[:split_idx]) for xx in x_misc]
    if y is not None:
        y = y[:split_idx]
else:
    x = x[split_idx:]
    x_misc = [np.asarray(xx[split_idx:]) for xx in x_misc]
    if y is not None:
        y = y[split_idx:]
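To see which samples land in which subset, the slicing can be reproduced on a toy array. This is only a minimal sketch in plain NumPy: `split_by_fraction` is a hypothetical helper name, not part of Keras, and it mirrors nothing more than the slicing quoted above.

```python
import numpy as np

def split_by_fraction(x, validation_split):
    """Hypothetical helper mirroring the slicing above (not a Keras API)."""
    split_idx = int(len(x) * validation_split)
    # First `split_idx` samples -> validation, the remainder -> training.
    return x[split_idx:], x[:split_idx]

x = np.arange(10)
x_train, x_val = split_by_fraction(x, 0.2)
print(x_val)    # [0 1]
print(x_train)  # [2 3 4 5 6 7 8 9]
```

No shuffling happens before the slice, so the order of the input arrays fully determines which samples are held out.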
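So if you want the class proportions to be the same in the training and validation subsets, that is your responsibility; Keras does not guarantee it when you use this feature. The only thing Keras verifies is that both subsets contain the same set of classes, i.e. that at least one sample of every class is present in each of them: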
if not np.array_equal(
        np.unique(y[:split_idx]),
        np.unique(y[split_idx:])):
    raise ValueError('Training and validation subsets '
                     'have different number of classes after '
                     'the split. If your numpy arrays are '
                     'sorted by the label, you might want '
                     'to shuffle them.')
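The difference between "same set of classes" and "same class proportions" can be illustrated with a small sketch. `same_classes_after_split` is a hypothetical helper that reproduces the condition above on label arrays; it is not a Keras function.

```python
import numpy as np

def same_classes_after_split(y, validation_split):
    """Hypothetical helper reproducing the check above (not a Keras API)."""
    split_idx = int(len(y) * validation_split)
    return np.array_equal(np.unique(y[:split_idx]), np.unique(y[split_idx:]))

# Labels sorted by class: the validation slice contains only class 0.
y_sorted = np.array([0, 0, 0, 0, 1, 1, 1, 1])
sorted_ok = same_classes_after_split(y_sorted, 0.25)

# Both classes appear on each side, so the check passes,
# yet the proportions of class 1 differ between the two subsets.
y_skewed = np.array([1, 0, 0, 0, 0, 0, 0, 1])
skewed_ok = same_classes_after_split(y_skewed, 0.25)
val_frac = y_skewed[:2].mean()    # share of class 1 in validation
train_frac = y_skewed[2:].mean()  # share of class 1 in training

print(sorted_ok, skewed_ok)       # False True
print(val_frac, train_frac)
```

In the second case Keras would raise no error even though half of the validation labels are class 1 while only one in six training labels is.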
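Therefore, the solution for a stratified split (i.e. one that preserves the per-class sample proportions in both the training and validation splits) is to use sklearn.model_selection.train_test_split with the stratify argument set: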
import numpy as np
from sklearn.model_selection import train_test_split

val_split = 0.25
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=val_split, stratify=y)
# Validation samples go first: the generator slices x[:split_idx] for validation.
X = np.concatenate((X_val, X_train))
y = np.concatenate((y_val, y_train))
Now you can pass validation_split=val_split to ImageDataGenerator, and the training and validation subsets will have matching class proportions.
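As a sanity check, the class proportions on each side of split_idx can be compared after reordering. The sketch below uses made-up data (200 all-zero "images" with an 80/20 class imbalance, both assumptions chosen purely for illustration), places the validation samples first to match the x[:split_idx] slicing in the source code quoted earlier, and recomputes split_idx the same way that code does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 160 samples of class 0, 40 of class 1.
X = np.zeros((200, 4, 4, 3))
y = np.array([0] * 160 + [1] * 40)

val_split = 0.25
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=val_split, stratify=y, random_state=0)

# Validation samples first, matching the x[:split_idx] slicing.
X = np.concatenate((X_val, X_train))
y = np.concatenate((y_val, y_train))

# Same index computation as in the quoted source code.
split_idx = int(len(X) * val_split)
val_frac = np.mean(y[:split_idx] == 1)
train_frac = np.mean(y[split_idx:] == 1)
print(split_idx, val_frac, train_frac)  # 50 0.2 0.2
```

One caveat: train_test_split rounds the test size up, while the generator truncates with int(), so when len(X) * val_split is not a whole number the two sizes can differ by one sample and the proportions only match approximately.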