Cross validation for MNIST dataset with pytorch and sklearn: answers

【Question title】: Cross validation for MNIST dataset with pytorch and sklearn
【Posted】: 2020-03-18 15:27:22
【Question】:

I'm new to pytorch and am trying to implement a feed-forward neural network to classify the MNIST dataset. I'm having some problems when trying to use cross-validation. My data has the following shapes: x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])

I tried to use KFold from sklearn.

from sklearn.model_selection import KFold

kfold = KFold(n_splits=10)

This is the first part of my training method, where I split the data into folds:

for  train_index, test_index in kfold.split(x_train, y_train): 
        x_train_fold = x_train[train_index]
        x_test_fold = x_test[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_test[test_index]
        print(x_train_fold.shape)
        for epoch in range(epochs):
         ...

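The indices for the y_train_fold variable are right, they are simply [ 0 1 2 ... 4497 4498 4499], but not for x_train_fold, which gets [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.

For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784])

Any tips on how to get this right?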

【Question comments】:

Tags: scikit-learn pytorch cross-validation mnist k-fold

【Answer 1】:

You messed up the indices.

x_train = x[train_index]
x_test = x[test_index]
y_train = y[train_index]
y_test = y[test_index]

    x_fold = x_train[train_index]
    y_fold = y_train[test_index]

It should be:

x_fold = x_train[train_index]
y_fold = y_train[train_index]

【Comments】:

You are right! I have updated the code and the question now, but I still have a problem with my x_train_fold

【Answer 2】:

I think you're confused!

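Ignore the second dimension for a moment. When you have 45000 points and use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.

This means each of your folds will contain 4500 data points; one of those folds is used for testing and the remaining ones for training, i.e.

For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500

So, for the first iteration, the first 4500 data points (by index) are used for testing and the rest for training. (See the picture below.)

Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), your code should look like this: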

for train_index, test_index in kfold.split(x_train, y_train):  
    print(train_index, test_index)

    x_train_fold = x_train[train_index] 
    y_train_fold = y_train[train_index] 
    x_test_fold = x_train[test_index] 
    y_test_fold = y_train[test_index] 

    print(x_train_fold.shape, y_train_fold.shape) 
    print(x_test_fold.shape, y_test_fold.shape) 
    break 

[ 4500  4501  4502 ... 44997 44998 44999] [   0    1    2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])

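So, when you say

I want the variable x_train_fold to be the first 4500 pictures... shape torch.Size([4500, 784]).

you are wrong. That size corresponds to x_test_fold. In the first iteration, with 10 folds, x_train_fold has 40500 points, so its size should be torch.Size([40500, 784]).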
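To check the fold arithmetic for every split, not just the first one, here is a minimal sketch. It assumes only the sample count from the question; `np.zeros(n_samples)` is a placeholder standing in for the real x_train, and `fold_sizes` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 45000                  # same number of rows as x_train in the question
kfold = KFold(n_splits=10)         # shuffle=False by default, so folds are contiguous

fold_sizes = []
for fold, (train_index, test_index) in enumerate(kfold.split(np.zeros(n_samples))):
    # Each split holds out one fold for testing and trains on the other nine.
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"fold {fold + 1}: train={len(train_index)}, test={len(test_index)}, "
          f"test range={test_index[0]}..{test_index[-1]}")
```

Every split should report 40500 training indices and 4500 test indices, with the held-out block moving through the data one fold at a time.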

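【Comments】:

Would be glad if you could take a look at my code below!
@kHarshit Is 1 iteration the same as 1 epoch here?
@helperFunction Iteration here refers to the KFold iteration, not the epoch/iteration in the training loop.

【Answer 3】:

I think I have it right now, but the code feels a bit messy with 3 nested loops. Is there a simpler way to do it, or is this approach okay?

Here is my code for training with cross-validation: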

def train(network, epochs, save_Model = False):
    total_acc = 0
    for fold, (train_index, test_index) in enumerate(kfold.split(x_train, y_train)):
        ### Dividing data into folds
        x_train_fold = x_train[train_index]
        x_test_fold = x_train[test_index]
        y_train_fold = y_train[train_index]
        y_test_fold = y_train[test_index]

        train = torch.utils.data.TensorDataset(x_train_fold, y_train_fold)
        test = torch.utils.data.TensorDataset(x_test_fold, y_test_fold)
        train_loader = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = False)
        test_loader = torch.utils.data.DataLoader(test, batch_size = batch_size, shuffle = False)

        for epoch in range(epochs):
            print('\nEpoch {} / {} \nFold number {} / {}'.format(epoch + 1, epochs, fold + 1 , kfold.get_n_splits()))
            correct = 0
            network.train()
            for batch_index, (x_batch, y_batch) in enumerate(train_loader):
                optimizer.zero_grad()
                out = network(x_batch)
                loss = loss_f(out, y_batch)
                loss.backward()
                optimizer.step()
                pred = torch.max(out.data, dim=1)[1]
                correct += (pred == y_batch).sum()
                if (batch_index + 1) % 32 == 0:
                    print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy:{:.3f}%'.format(
                        (batch_index + 1)*len(x_batch), len(train_loader.dataset),
                        100.*batch_index / len(train_loader), loss.data, float(correct*100) / float(batch_size*(batch_index+1))))
        total_acc += float(correct*100) / float(batch_size*(batch_index+1))
    total_acc = (total_acc / kfold.get_n_splits())
    print('\n\nTotal accuracy cross validation: {:.3f}%'.format(total_acc))

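【Comments】:

I think it's okay. Training always has 2 loops, and one more for KFold is fine. You may want to check out skorch, a sklearn wrapper for pytorch, though I haven't used it.
Nice one! But just to call attention to it: this is not the cross-validation accuracy we expect. What we need is the average accuracy on test_loader, not the accuracy on train_loader. (Or did I miss something?)
@kHarshit Shouldn't the model's weights be reinitialized after each fold? Also, since the optimizer uses the model's parameters, does a new optimizer instance need to be created for each fold?
@Kimmen Shouldn't the model be reset after each fold?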
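On the point about measuring accuracy on test_loader rather than train_loader, one possible way to do it is a small evaluation helper like the sketch below. The `evaluate` function, the synthetic tensors and the `nn.Linear` placeholder model are assumptions made for illustration; they are not part of the code in the answer.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(network, loader):
    """Return the accuracy (in percent) of `network` over all samples in `loader`."""
    network.eval()                      # switch off dropout / batch-norm updates
    correct, total = 0, 0
    with torch.no_grad():               # gradients are not needed for evaluation
        for x_batch, y_batch in loader:
            pred = network(x_batch).argmax(dim=1)
            correct += (pred == y_batch).sum().item()
            total += y_batch.size(0)    # count real samples; the last batch may be smaller
    return 100.0 * correct / total


# Tiny synthetic stand-in for one held-out fold (hypothetical data, not MNIST).
torch.manual_seed(0)
x_test_fold = torch.randn(50, 784)
y_test_fold = torch.randint(0, 10, (50,))
test_loader = DataLoader(TensorDataset(x_test_fold, y_test_fold), batch_size=16)

network = nn.Linear(784, 10)            # placeholder model for illustration only
fold_acc = evaluate(network, test_loader)
print(f"held-out accuracy: {fold_acc:.3f}%")
```

In the training function above, something like `total_acc += evaluate(network, test_loader)` after the epoch loop of each fold would then average the held-out accuracy instead of the training accuracy. Dividing by the counted number of samples also avoids the small error that `batch_size*(batch_index+1)` introduces when the last batch is not full.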

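【Answer 4】:

While all the answers above give a good example of how to split the dataset, I am curious about the way the K-fold cross-validation is implemented. K-fold aims to estimate the skill of a machine learning model on unseen data: it uses a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training. (See the concept and explanation on Wikipedia: https://en.wikipedia.org/wiki/Cross-validation_(statistics)) It is therefore necessary to initialize the parameters of the model to be trained at the start of each fold. Otherwise, after K folds the model will have seen every sample in the dataset, and there is no such thing as validation (everything is a training sample).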
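A minimal sketch of one way to do that reset: take a snapshot of the untrained weights once, restore it at the start of every fold, and create a new optimizer each time. The synthetic data, the small `nn.Sequential` model, the SGD settings and the `starts_from_scratch` check are assumptions for illustration, not the setup from the question.

```python
import copy

import torch
from sklearn.model_selection import KFold
from torch import nn

torch.manual_seed(0)
# Small synthetic data standing in for x_train / y_train (hypothetical, not MNIST).
x_train = torch.randn(100, 784)
y_train = torch.randint(0, 10, (100,))

network = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))
initial_state = copy.deepcopy(network.state_dict())   # snapshot of the untrained weights
loss_f = nn.CrossEntropyLoss()
kfold = KFold(n_splits=5)

starts_from_scratch = []
for fold, (train_index, test_index) in enumerate(kfold.split(x_train)):
    # Restore the untrained weights so no fold benefits from earlier folds.
    network.load_state_dict(initial_state)
    starts_from_scratch.append(all(
        torch.equal(p, initial_state[name]) for name, p in network.state_dict().items()))
    # A fresh optimizer drops any state (e.g. momentum) left over from the previous fold.
    optimizer = torch.optim.SGD(network.parameters(), lr=0.1)

    network.train()
    for epoch in range(2):              # two full-batch steps, just to change the weights
        optimizer.zero_grad()
        loss = loss_f(network(x_train[train_index]), y_train[train_index])
        loss.backward()
        optimizer.step()
```

The `copy.deepcopy` matters here: `state_dict()` returns tensors that share memory with the live parameters, so without the copy the snapshot would be overwritten during training. Building a brand-new model inside the loop is an alternative, at the cost of each fold starting from a different random initialization.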

【Comments】: