Can't train ResNet using GPU with PyTorch

【Question title】: Can't train ResNet using gpu with pytorch
【Posted】: 2020-04-17 20:20:23
【Question】:

I am trying to train a ResNet architecture on the CIFAR10 dataset using a GPU. Here is my ResNet code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNetBlock(nn.Module):

    def __init__(self, in_planes, planes, stride=1):
        super(ResNetBlock, self).__init__()
        self.stride = stride
        self.in_planes=in_planes
        self.planes = planes
        if stride!=1:

          self.fx = nn.Sequential(nn.Conv2d(in_planes, planes, 3, stride=2, 
                                            padding=1),
                                  nn.ReLU(), 
                                  nn.Conv2d(planes, planes,3, padding=1))


        else:
          self.fx = nn.Sequential(nn.Conv2d(planes, planes, 3, padding = 1),
                                  nn.ReLU(), 
                                  nn.Conv2d(planes, planes,3, padding=1))



    def forward(self, x):

      if self.stride ==1:
        fx = self.fx(x)
        id = nn.Sequential()
        out = fx + id(x)
        relu = nn.ReLU()
        return relu(out)

      else:

        fx = self.fx(x)
        id = nn.Conv2d(self.in_planes, self.planes, 2, stride = 2)
        out = fx + id(x)
        relu = nn.ReLU()
        return relu(out)

class ResNet(nn.Module):

  def __init__(self, block, num_blocks, num_classes=10, num_filters=16, input_dim=3):
      super(ResNet, self).__init__()
      self.in_planes = num_filters

      self.conv1 = nn.Conv2d(input_dim, num_filters, kernel_size=3, stride=1, padding=1, bias=False)
      self.bn1 = nn.BatchNorm2d(num_filters)

      layers = []
      plane = num_filters 

      for nb in num_blocks:

        layer = self._make_layer(block,plane ,nb,2)
        layers.append(layer)
        plane*=2

      self.layers = nn.Sequential(*layers)




      self.linear = nn.Linear(2304, num_classes)

  def _make_layer(self, block, planes, num_blocks, stride):
      layers = []
      block1 = ResNetBlock(planes, 2*planes, stride = 2)
      planes *=2
      layers.append(block1)

      for i in range(1,num_blocks):
        block = ResNetBlock(planes, planes, stride =1)
        layers.append(block)

      return nn.Sequential(*layers)

  def forward(self, x):
      out = F.relu(self.bn1(self.conv1(x)))
      out = self.layers(out)

      out = F.avg_pool2d(out, 4)
      out = out.view(out.size(0), -1)
      out = self.linear(out)
      return out



# (1 + 2*(1 + 1) + 2*(1 + 1) + 2*(1 + 1) + 2*(1 + 1)) + 1 = 18
def ResNet18():
    return ResNet(ResNetBlock, [2,2,2,2])

Then I train the network on the GPU:


net = ResNet18()
net = net.to('cuda')
train2(net, torch.optim.Adam(net.parameters(), lr=0.001), trainloader, criterion, n_ep=3)

I get this error:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

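This is annoying, because my weights should be on CUDA as well, since I called resnet.cuda().

The train function works fine with another network, so the problem must come from the classes shown above.

Also, next(resnet.parameters()).is_cuda returns True.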

Update: here is my training function.


def train(net, optimizer, trainload, criterion, n_ep=10, cuda = True):
  if cuda:
    net = net.to('cuda')


  for epoch in range(n_ep):
    for data in trainload:

      inputs, labels = data
      if cuda:
        inputs = inputs.type(torch.cuda.FloatTensor)
        labels = labels.type(torch.cuda.LongTensor)


      optimizer.zero_grad()

      print(next(net.parameters()).is_cuda)
      ## this actually prints "True" ! 



      outputs = net.forward(inputs)
      loss = criterion(outputs, labels)
      loss.backward()
      optimizer.step()

  return net

The thing is, this training function works fine with another type of network. For example, with this one (AlexNet):

class AlexNet(nn.Module):

    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(nn.Conv2d(3,64,11), nn.ReLU(),nn.MaxPool2d(2, stride = 2), nn.Conv2d(64,192,5),
                                     nn.ReLU(), nn.MaxPool2d(2, stride = 2), nn.Conv2d(192,384,3),
                                     nn.ReLU(),nn.Conv2d(384,256,3), nn.ReLU(), nn.Conv2d(256,256,3), nn.ReLU())
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x

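With this network, GPU training works fine.

There is something else I don't understand. I tried training the network that I had moved to the GPU (using .cuda()) with training data that I had deliberately not moved to the GPU. This time I got an error saying that the weight type is torch.cuda while the data type is not.

Edit: I thought this had to do with using nn.ModuleList instead of a regular Python list. I tried that, but it did not fix the problem.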
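To illustrate the nn.ModuleList idea mentioned in the edit above, here is a minimal sketch of how the two containers differ. The class names PlainListNet and ModuleListNet are hypothetical and not part of the question's code. Note that the ResNet class above already wraps its layers in nn.Sequential, which registers them, so this is consistent with the switch to nn.ModuleList not changing the outcome.

```python
import torch.nn as nn


class PlainListNet(nn.Module):
    """Hypothetical example: layers kept in a regular Python list."""

    def __init__(self):
        super().__init__()
        # A plain list is invisible to nn.Module bookkeeping, so these
        # layers are skipped by .parameters(), .to() and .cuda().
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]


class ModuleListNet(nn.Module):
    """Hypothetical example: the same layers kept in an nn.ModuleList."""

    def __init__(self):
        super().__init__()
        # nn.ModuleList registers every layer as a submodule.
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])


# Count the parameter tensors each model exposes.
plain_count = len(list(PlainListNet().parameters()))
registered_count = len(list(ModuleListNet().parameters()))
print(plain_count, registered_count)  # 0 4 (two weights and two biases)
```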

【Comments】:

Tags: python pytorch gpu resnet

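【Answer 1】:

We need a snippet of your training loop to better pinpoint your error.

I assume that somewhere in that loop you have some lines of code that do the following: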

for data, label in CifarDataLoader:
     data, label = data.to('cuda'), label.to('cuda')

My first guess is to add this line before the for loop:

resnet = resnet.to('cuda')

Let me know if this works; otherwise I will need more code to find the error.
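Putting the two fragments from this answer together, a training loop could look roughly like the sketch below. It is only an illustration: the tiny model, the random dataset, and the names device, model and loader are stand-ins chosen here, not the asker's ResNet or CIFAR10 loader. It falls back to the CPU when no GPU is available so that it runs anywhere.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and data (assumptions for illustration only).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
dataset = TensorDataset(torch.randn(16, 3, 8, 8), torch.randint(0, 10, (16,)))
loader = DataLoader(dataset, batch_size=4)

# Move the model once, before the loop, then build the optimizer.
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for inputs, labels in loader:
    # Move every batch to the same device as the model.
    inputs, labels = inputs.to(device), labels.to(device)

    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
```

This only helps when every layer is registered on the model, because .to(device) moves registered parameters and buffers and nothing else.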

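【Comments】:

【Answer 2】:

OK, I finally figured it out.

I had defined some nn.Module objects inside the forward function of the ResNetBlock class. I guess those could not be moved to the GPU because PyTorch only looks for such objects in the __init__ function. I changed the implementation to define the objects in __init__, and it worked.

Thanks for your help :)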
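The answer does not include the changed code, so the following is a hedged reconstruction of what such a block might look like, not the author's actual implementation. The attribute name shortcut is an assumption made here. More precisely, PyTorch registers a submodule when it is assigned as an attribute of an nn.Module (usually in __init__), and .to(), .cuda() and .parameters() only reach registered submodules. The nn.Conv2d created inside forward in the question is a new, randomly initialized CPU layer on every call, which explains both the error and why next(net.parameters()).is_cuda still printed True.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResNetBlock(nn.Module):
    """Residual block with every layer created and registered in __init__."""

    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        if stride != 1:
            # Downsampling block: halve the spatial size, change the channels.
            self.fx = nn.Sequential(
                nn.Conv2d(in_planes, planes, 3, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(planes, planes, 3, padding=1),
            )
            # The projection shortcut is now a registered submodule, so it
            # follows the model to the GPU and is seen by the optimizer.
            self.shortcut = nn.Conv2d(in_planes, planes, 2, stride=2)
        else:
            self.fx = nn.Sequential(
                nn.Conv2d(planes, planes, 3, padding=1),
                nn.ReLU(),
                nn.Conv2d(planes, planes, 3, padding=1),
            )
            # Identity shortcut: no parameters are needed.
            self.shortcut = nn.Identity()

    def forward(self, x):
        # forward only uses layers that were built in __init__.
        return F.relu(self.fx(x) + self.shortcut(x))


# Quick check on the CPU with an even-sized input.
block = ResNetBlock(16, 32, stride=2)
out = block(torch.randn(2, 16, 8, 8))
print(out.shape)  # torch.Size([2, 32, 4, 4])
```

Because the shortcut convolution is now part of the module, its weights are also trained, which was not the case when it was recreated on every forward pass.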

【Comments】:

Thanks for posting your answer. I didn't know this myself, so it may come in handy in the future.