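There are different ways to solve this problem. However, if you cannot pick a GPU that is large enough for the job, or cannot use multiple GPUs, it is handy to adapt the GPU batch size dynamically.
I prepared this repo with an illustrative training example in PyTorch (it should work similarly in TensorFlow).
In the code below, try/except is used to try different GPU batch sizes without halting training. When the batch becomes too large, it is reduced and the adaptation is turned off. See the repo for implementation details and possible bug fixes.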
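It also implements a technique called Batch Spoofing, which runs several forward and backward passes, accumulating their gradients, before each optimizer step. In PyTorch, this only requires moving the call to optimizer.zero_grad().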
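Before the full listing, here is a minimal, framework-free sketch of the arithmetic behind Batch Spoofing (commonly known as gradient accumulation). It is not taken from the repo: the helper names `mse_grad` and `accumulated_grad` and the toy data are assumptions made for illustration, and the sketch assumes equal-sized micro-batches. It shows that micro-batch gradients, each scaled by one over the number of micro-batches, add up to the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient w.r.t. w of mean((w * x - y) ** 2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1 / number of micro-batches."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal-sized micro-batches"
    repeat = len(xs) // micro_batch_size
    grad = 0.0  # plays the role of the .grad buffer that zero_grad() resets
    for k in range(repeat):
        lo, hi = k * micro_batch_size, (k + 1) * micro_batch_size
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / repeat
    return grad


# Toy data, invented for this sketch
xs = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5, 4.0, -2.0]
ys = [1.0, -2.5, 3.5, 6.5, -0.5, 3.0, 8.5, -4.0]

full = mse_grad(0.3, xs, ys)                                # one big batch of 8
accum = accumulated_grad(0.3, xs, ys, micro_batch_size=2)   # four micro-batches of 2
print(abs(full - accum) < 1e-9)
```

Note that the listing below calls loss.backward() on the unscaled loss, so its accumulated gradient is the sum of the micro-batch gradients rather than their mean. Dividing the loss by the number of repetitions would be one way to match the full-batch gradient, if that is what you want.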
import torch
import torchvision
import torch.optim as optim
import torch.nn as nn

# Example of how to use it with PyTorch
if __name__ == "__main__":
    # #############################################################
    # 1) Initialize the dataset, model, optimizer and loss as usual.
    # Use the GPU when available; CUDA out-of-memory errors can only occur there.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Initialize a fake dataset.
    # ToTensor is needed so the DataLoader can collate the generated PIL images.
    trainset = torchvision.datasets.FakeData(size=1_000_000,
                                             image_size=(3, 224, 224),
                                             num_classes=1000,
                                             transform=torchvision.transforms.ToTensor())

    # Initialize the model, loss and SGD-based optimizer.
    # Note: newer torchvision releases deprecate pretrained= in favor of weights=.
    resnet = torchvision.models.resnet152(pretrained=True,
                                          progress=True).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(resnet.parameters(), lr=0.01)

    continue_training = True  # criterion to stop the training

    # #############################################################
    # 2) Set parameters for the adaptive batch size
    adapt = True  # while this is true, the algorithm will perform batch adaptation
    gpu_batch_size = 2  # initial GPU batch size, it can be very small
    train_batch_size = 2048  # the desired training batch size

    # Modified training loop to allow for adaptive batch size
    while continue_training:
        # #############################################################
        # 3) Initialize dataloader and batch spoofing parameter
        # The dataloader has to be reinitialized for each new batch size.
        trainloader = torch.utils.data.DataLoader(trainset,
                                                  batch_size=gpu_batch_size,
                                                  shuffle=True)

        # Number of repetitions for batch spoofing
        repeat = max(1, train_batch_size // gpu_batch_size)

        try:  # Makes sure training is not halted when the batch size is too large
            # #############################################################
            # 4) Epoch loop with batch spoofing
            optimizer.zero_grad()  # done before training because of batch spoofing
            for i, (x, y) in enumerate(trainloader):
                x, y = x.to(device), y.to(device)
                y_pred = resnet(x)
                loss = criterion(y_pred, y)
                loss.backward()  # gradients accumulate (sum) across micro-batches

                # Batch spoofing: step once every `repeat` micro-batches
                if (i + 1) % repeat == 0:
                    optimizer.step()
                    optimizer.zero_grad()

                # #############################################################
                # 5) Adapt batch size while no RuntimeError is raised.
                # Increase batch size and get out of the loop
                if adapt:
                    gpu_batch_size *= 2
                    break

                # Stopping criterion for training
                if i > 100:
                    continue_training = False
                    break

        # #############################################################
        # 6) After the largest batch size is found, training proceeds with that fixed batch size.
        # CUDA out of memory is a RuntimeError; we hit it once the batch size is too large.
        except RuntimeError as run_error:
            # Heuristic: PyTorch's CUDA OOM message contains "out of memory".
            # Anything else is re-raised so unrelated errors do not loop forever.
            if "out of memory" not in str(run_error):
                raise
            # Fall back to the largest batch size that fit in memory (kept an integer)
            gpu_batch_size = max(1, gpu_batch_size // 2)
            adapt = False  # turn off the batch adaptation

            # Number of repetitions for batch spoofing
            repeat = max(1, train_batch_size // gpu_batch_size)

            print(f"---\nRuntimeError: \n{run_error}\n---\nReduced GPU batch size to {gpu_batch_size}")
If you have code that does something similar in TensorFlow, Caffe, or another framework, please share it!