How to adapt the GPU batch size during training?

【Question title】: How to adapt the GPU batch size during training?
【Posted】: 2020-03-17 03:52:15
【Question description】:

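To my surprise, I could not find any resource online on how to adapt the GPU batch size dynamically without halting training.

The idea is the following:

1) Have a training script that is (almost) agnostic to the GPU in use. The batch size would adapt dynamically, with no user intervention or tuning.

2) Still be able to specify the desired training batch size, even if it is too large to fit in the largest known GPU.

For example, let us say I want to train a model using a batch size of 4096 images, each of size 1024x1024. Let us also assume I have access to a server with different NVIDIA GPUs, but I do not know in advance which one will be assigned to me. (Or everybody wants to use the largest GPU, and I would have to wait a long time for my turn.)

I want my training script to find the maximum GPU batch size (let us say it is 32 images per GPU batch) and to update the optimizer only once all 4096 images have been processed (one training batch = 128 GPU batches).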
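
The arithmetic in this example can be sketched as follows. The helper name `accumulation_steps` is hypothetical and only illustrates the bookkeeping; it assumes the training batch size is a multiple of the GPU batch size.

```python
def accumulation_steps(train_batch_size: int, gpu_batch_size: int) -> int:
    """Number of GPU batches to process before one optimizer update."""
    if gpu_batch_size <= 0:
        raise ValueError("gpu_batch_size must be positive")
    # If one GPU batch already covers the training batch, update every step.
    return max(1, train_batch_size // gpu_batch_size)


if __name__ == "__main__":
    # 4096 images per training batch, 32 images per GPU batch.
    print(accumulation_steps(4096, 32))  # prints 128
```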

【Question discussion】:

Tags: python tensorflow neural-network gpu pytorch

【Solution 1】:

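There are different ways to solve this problem. However, if specifying a GPU that can do the job, or using multiple GPUs, is not an option, it is handy to adapt the GPU batch size dynamically.

I prepared this repo with an illustrative training example in PyTorch (it should work similarly in TensorFlow).

In the code below, try/except is used to try different GPU batch sizes without halting the training. When the batch becomes too large, it is downsized and the adaptation is turned off. Please check the repo for implementation details and possible bug fixes.

It also implements a technique called Batch Spoofing, which runs several forward and backward passes, accumulating their gradients, before the optimizer updates the weights (this is more commonly known as gradient accumulation). In PyTorch, it only requires moving the optimizer.zero_grad() call.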
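
Why Batch Spoofing can stand in for one large batch: a minimal pure-Python sketch, assuming a one-parameter linear model with a mean squared error loss and equally sized GPU batches. The function names and the data are illustrative and do not come from the repo. Each small-batch gradient is divided by the number of accumulated batches; without that scaling the accumulated gradient would be the sum of the batch means rather than the overall mean.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, gpu_batch_size):
    """Accumulate scaled gradients over consecutive GPU batches."""
    repeat = len(xs) // gpu_batch_size  # GPU batches per training batch
    total = 0.0
    for k in range(repeat):
        lo, hi = k * gpu_batch_size, (k + 1) * gpu_batch_size
        # Scaling by 1 / repeat turns the sum of batch means into the overall mean.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / repeat
    return total


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
    full = mse_grad(0.5, xs, ys)
    accumulated = accumulated_grad(0.5, xs, ys, gpu_batch_size=2)
    # The two gradients agree up to floating-point rounding.
    print(math.isclose(full, accumulated))
```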

import torch
import torchvision
import torch.optim as optim
import torch.nn as nn

# Example of how to use it with PyTorch
if __name__ == "__main__":

    # Run on the GPU when one is available, otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # #############################################################
    # 1) Initialize the dataset, model, optimizer and loss as usual.
    # Initialize a fake dataset. ToTensor() is needed because FakeData yields
    # PIL images, which the DataLoader cannot collate into batches.
    trainset = torchvision.datasets.FakeData(size=1_000_000,
                                             image_size=(3, 224, 224),
                                             num_classes=1000,
                                             transform=torchvision.transforms.ToTensor())

    # Initialize the model, loss and SGD-based optimizer.
    # Recent torchvision releases prefer the `weights=` argument over `pretrained=`.
    resnet = torchvision.models.resnet152(pretrained=True,
                                          progress=True).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(resnet.parameters(), lr=0.01)

    continue_training = True  # criterion to stop the training

    # #############################################################
    # 2) Set parameters for the adaptive batch size
    adapt = True  # while this is true, the algorithm will perform batch adaptation
    gpu_batch_size = 2  # initial GPU batch size, it can be very small
    train_batch_size = 2048  # the desired training batch size

    # Modified training loop to allow for adaptive batch size
    while continue_training:

        # #############################################################
        # 3) Initialize dataloader and batch spoofing parameter
        # The dataloader has to be reinitialized for each new batch size.
        trainloader = torch.utils.data.DataLoader(trainset,
                                                  batch_size=gpu_batch_size,
                                                  shuffle=True)

        # Number of GPU batches per optimizer update (batch spoofing)
        repeat = max(1, train_batch_size // gpu_batch_size)

        try:  # This makes sure that training is not halted when the batch size is too large

            # #############################################################
            # 4) Epoch loop with batch spoofing
            optimizer.zero_grad()  # done before training because of batch spoofing.

            for i, (x, y) in enumerate(trainloader):

                x, y = x.to(device), y.to(device)
                y_pred = resnet(x)
                # Scale the loss so that the accumulated gradient matches the
                # mean over the whole training batch.
                loss = criterion(y_pred, y) / repeat
                loss.backward()

                # batch spoofing: update only after `repeat` GPU batches
                if (i + 1) % repeat == 0:
                    optimizer.step()
                    optimizer.zero_grad()

                # #############################################################
                # 5) Adapt batch size while no RuntimeError is raised.
                # Increase batch size and get out of the loop
                if adapt:
                    if gpu_batch_size * 2 > train_batch_size:
                        adapt = False  # never exceed the desired training batch size
                    else:
                        gpu_batch_size *= 2
                        break

                # Stopping criterion for training
                if i > 100:
                    continue_training = False
                    break

        # #############################################################
        # 6) Once the largest batch size is found, training continues with that fixed batch size.
        # CUDA out of memory is raised as a RuntimeError; we hit it when the batch size is too large.
        except RuntimeError as run_error:
            # Re-raise anything that is not an out-of-memory error, or when the batch cannot shrink further.
            if "out of memory" not in str(run_error) or gpu_batch_size == 1:
                raise

            gpu_batch_size //= 2  # go back to the largest batch size that fits in memory
            adapt = False  # turn off the batch adaptation

            print(f"---\nRuntimeError: \n{run_error}\n---\nGPU batch size reduced to {gpu_batch_size}")

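If you have code that does something similar in TensorFlow, Caffe or another framework, please share!

【Discussion】:

Hi @Victor. I tried your method. However, when training resumes with the new batch size, the model complains that it is out of memory, even though I have reduced the value from the one that caused it to run out of memory. Do you have any suggestions on how to modify the code so that it does not run out of memory when resuming? I am a bit confused, because it manages to train with a large batch, but then fails after the training loop is restarted. Thanks!
@jonathanking I am not sure what the problem could be. Maybe an exception is raised somewhere in your code and triggers the error?
Thanks for the reply. Unfortunately, the only exception raised is the RuntimeError (CUDA out of memory), even though I have reduced the batch size. PyTorch does not seem to clear the GPU memory effectively, even when torch.cuda.empty_cache() is called. I will try running this loop in a script completely separate from my training script to avoid this.
torch.cuda.empty_cache() does not free memory that is still held by live tensors; it only releases unused cached GPU memory so that other applications can use it. Maybe you could create a question on Stack Overflow with more information (e.g., minimal reproducible code) and link it here?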
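
One possible cause of the behavior described in this thread, offered as a guess rather than a diagnosis: while the except block is running, the exception object still references the stack frames in which the failed batch was allocated, so those tensors cannot be freed yet. The PyTorch FAQ suggests moving the recovery code outside the except clause. Below is a minimal sketch of that pattern; `train_step`, `cleanup` and `fake_step` are hypothetical placeholders, and in a real script `cleanup` could be `torch.cuda.empty_cache`.

```python
import gc


def step_with_oom_recovery(train_step, batch_size, cleanup=None):
    """Run one step; halve the batch size if it runs out of memory.

    Returns (batch_size_to_use_next, step_succeeded).
    """
    oom = False
    try:
        train_step(batch_size)
    except RuntimeError as err:
        # Only treat out-of-memory errors as recoverable.
        if "out of memory" not in str(err):
            raise
        oom = True  # just record the failure; recover after the handler exits

    if oom:
        # The exception and its traceback are gone here, so objects from the
        # failed step are no longer kept alive by them.
        gc.collect()
        if cleanup is not None:
            cleanup()  # e.g. torch.cuda.empty_cache in a real script
        return max(1, batch_size // 2), False
    return batch_size, True


if __name__ == "__main__":
    def fake_step(batch_size):
        # Stand-in for a forward/backward pass that fits at most 16 samples.
        if batch_size > 16:
            raise RuntimeError("CUDA out of memory (simulated)")

    batch_size, ok = 64, False
    while not ok:
        batch_size, ok = step_with_oom_recovery(fake_step, batch_size)
    print(batch_size)  # prints 16
```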

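【Solution 2】:

"how to dynamically adapt the GPU batch size without halting training"

There is a very similar question that uses a random sampler to do the job.

I just want to add another option: DataLoader has a collate_fn that you can use to change the batch size.

collate_fn (callable, optional): merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.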
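
One way to read this suggestion, sketched under assumptions: let the DataLoader always fetch the largest batch of interest and have a callable collate_fn split it into chunks of the current GPU batch size. The class `ChunkingCollate` and its `chunk_size` attribute are hypothetical. The default `collate` below just keeps Python lists so that the sketch runs without PyTorch; a real script would pass a tensor-building function such as `torch.utils.data.default_collate`.

```python
class ChunkingCollate:
    """Split each loaded list of samples into chunks of `chunk_size`."""

    def __init__(self, chunk_size, collate=list):
        self.chunk_size = chunk_size  # can be changed between batches
        self.collate = collate        # how to merge one chunk of samples

    def __call__(self, samples):
        size = max(1, int(self.chunk_size))
        return [self.collate(samples[i:i + size])
                for i in range(0, len(samples), size)]


if __name__ == "__main__":
    collate_fn = ChunkingCollate(chunk_size=4)
    samples = list(range(8))  # stand-in for the samples of one loaded batch

    print(collate_fn(samples))  # two chunks of four samples
    collate_fn.chunk_size = 2   # e.g. after an out-of-memory error
    print(collate_fn(samples))  # four chunks of two samples
```

With PyTorch, such an object would be passed as the collate_fn argument of the DataLoader, and the training loop would iterate over the returned chunks. When num_workers is greater than 0, the collate function is copied into the worker processes, so a change to chunk_size may not be seen until a new iterator is created.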

【Discussion】: