【Question title】:Frequently getting CUDA error out of memory?
【Posted】:2019-03-27 15:41:39
【Question description】:

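I wrote code to train a deep learning model. After every batch I delete the CUDA tensors and then call torch.cuda.empty_cache(). I am fairly sure the batch size is not large enough to cause this error. What could be the reason?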

    for epoch in range(1+last_epoch, self.num_epochs+1):
        for phase in ['train', 'val']:
            loss_arr = []
            if phase == 'train':
                model.train()                    
                scheduler.step()
                was_training = True
            else:
                model.eval()
                was_training = False
            for i_batch, sample_batched in enumerate(dataloaders[phase]):
                X = sample_batched[0]
                y = sample_batched[1].type(torch.LongTensor)
                w = sample_batched[2]

                if model.is_cuda:
                    X, y, w = X.cuda(non_blocking=True), y.cuda(non_blocking=True),  w.cuda(non_blocking=True)

                output = model(X)
                loss = self.loss_func(output, y, w)
                if phase == 'train':
                    curr_iteration+=1
                    optim.zero_grad()                        
                    loss.backward()
                    optim.step()
                    if (curr_iteration % log_nth == 0):
                        self.logWriter.loss_per_iter(loss.item(), curr_iteration)

                loss_arr.append(loss.item())

                with torch.no_grad():
                    self.logWriter.update_cm_per_iter(output, y, self.labels, phase)

                del X, y, w, output, loss
                torch.cuda.empty_cache()

            self.logWriter.loss_per_epoch(loss_arr, phase, epoch)

            epoch_output, epoch_labels = model.predict(dataloaders[phase].dataset.X), dataloaders[phase].dataset.y
            self.logWriter.dice_score_per_epoch(epoch_output, epoch_labels, phase, epoch)
            index = np.random.choice(len(dataloaders[phase].dataset), 3, replace=False)
            self.logWriter.image_per_epoch(epoch_output[index], epoch_labels[index], phase, epoch)
            self.logWriter.cm_per_epoch(self.labels, phase, epoch, i_batch)
            del epoch_output, epoch_labels

        print("==== Epoch ["+str(epoch)+" / "+str(self.num_epochs)+"] done ====")        
        model.save('models/' + self.exp_dir_name + '/quicknat_epoch' + str(epoch) + '.model')

The predict function inside the model is as follows:

    def predict(self, X, enable_dropout = False):
        """
        Predicts the output after the model is trained.
        Inputs:
        - X: Volume to be predicted
        """
        self.eval()

        if type(X) is np.ndarray:
            X = torch.tensor(X, requires_grad = False).cuda(non_blocking=True)
        elif type(X) is torch.Tensor and not X.is_cuda:
            X = X.cuda(non_blocking=True)

        if enable_dropout:
            self.enable_test_dropout()

        with torch.no_grad():
            out = self.forward(X)

        max_val, idx = torch.max(out,1)
        idx = idx.data.cpu().numpy()
        prediction = np.squeeze(idx)
        del X, out, idx, max_val
        return prediction

【Comments】:

  • Please add the relevant code.
  • @TimH I have added the code.

Tags: python-3.x deep-learning pytorch


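【Solution 1】:

I realized that after every epoch I was passing the entire dataset to predict in a single call, so the whole dataset was moved to the GPU at once. Running the prediction in batches solved the problem.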
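The answer does not show the batched version, so the following is only a minimal sketch of one way to do it, not the author's actual fix. The helper name predict_in_batches, its predict_fn parameter, and the default batch_size are hypothetical; it assumes the data can be sliced along its first axis and that the prediction function returns one NumPy result per sample, as model.predict in the question does.

```python
import numpy as np


def predict_in_batches(predict_fn, X, batch_size=8):
    """Run predict_fn on consecutive slices of X and join the results.

    predict_fn: any callable mapping a batch to a NumPy array with one
                prediction per sample (for example, model.predict).
    X:          array-like that supports len() and slicing along axis 0.
    batch_size: hypothetical default; tune it to the available GPU memory.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    chunks = []
    for start in range(0, len(X), batch_size):
        # Only this slice is handed to the model, so at most batch_size
        # samples need to be on the GPU at any one time.
        batch = X[start:start + batch_size]
        chunks.append(np.asarray(predict_fn(batch)))
    # Stitch the per-batch predictions back together along the sample axis.
    return np.concatenate(chunks, axis=0)


if __name__ == "__main__":
    # Stand-in for model.predict: argmax over the channel axis, on the CPU.
    def fake_predict(batch):
        return np.argmax(batch, axis=1)

    X = np.random.rand(10, 3, 4, 4).astype(np.float32)
    out = predict_in_batches(fake_predict, X, batch_size=4)
    print(out.shape)  # (10, 4, 4)
```

In the training loop from the question, the call model.predict(dataloaders[phase].dataset.X) would then become something like predict_in_batches(model.predict, dataloaders[phase].dataset.X, batch_size=8). One caveat: the question's predict applies np.squeeze, which drops the batch axis when a slice holds a single sample, so a dataset whose size leaves a remainder of one may need that squeeze restricted or removed before the results can be concatenated.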
