Title: PyTorch: "one of the variables needed for gradient computation has been modified by an inplace operation"
Posted: 2021-11-04 14:08:44
Question:

I am training a PyTorch RNN on a text file of song lyrics to predict the next character given a character.

Here is how my RNN is defined:


import torch
import torch.nn as nn
import torch.optim

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        
        self.hidden_size = hidden_size
        
        # from input, previous hidden state to new hidden state
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        
        # from input, previous hidden state to output
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        
        # softmax on output
        self.softmax = nn.LogSoftmax(dim = 1)
    
    def forward(self, input, hidden):
        
        combined = torch.cat((input, hidden), 1)
        
        #get new hidden state
        hidden = self.i2h(combined)
        
        #get output
        output = self.i2o(combined)
        
        #apply softmax
        output = self.softmax(output)
        return output, hidden
    
    def initHidden(self): 
        return torch.zeros(1, self.hidden_size)

rnn = RNN(input_size = num_chars, hidden_size = 200, output_size = num_chars)
criterion = nn.NLLLoss()

lr = 0.01
optimizer = torch.optim.AdamW(rnn.parameters(), lr = lr)
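For reference, the `train` and `target` tensors in the training function below are assumed to be one-hot encodings of characters, along these lines (a sketch; the vocabulary and text here are made up for illustration):

```python
import torch

# hypothetical vocabulary and text -- not from the original post
all_chars = sorted(set("hello world"))
num_chars = len(all_chars)
char_to_idx = {c: i for i, c in enumerate(all_chars)}

def one_hot_encode(text):
    # one row per character, one column per vocabulary entry
    out = torch.zeros(len(text), num_chars)
    for i, c in enumerate(text):
        out[i, char_to_idx[c]] = 1
    return out

train_tensor = one_hot_encode("hello")   # shape: (5, num_chars)
target_tensor = one_hot_encode("ello ")  # next-character targets
```

With this encoding, `(target[i] == 1).nonzero(as_tuple=True)[0]` in the question recovers the class index of the correct next character.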

Here is my training function:

def train(train, target):
    
    hidden = rnn.initHidden()
    
    loss = 0
    
    for i in range(len(train)):
        
        optimizer.zero_grad()

        # get output, hidden state from rnn given input char, hidden state
        output, hidden = rnn(train[i].unsqueeze(0), hidden)

        #returns the index with '1' - identifying the index of the right character
        target_class = (target[i] == 1).nonzero(as_tuple=True)[0]
        
        loss += criterion(output, target_class)
        
    
        loss.backward(retain_graph = True)
        optimizer.step()
        
        print("done " + str(i) + " loop")
    
    return output, loss.item() / train.size(0)

When I run my training function, I get this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [274, 74]], which is output 0 of TBackward, is at version 5; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Interestingly, it gets through two full loops of the training function before giving me this error.

Now, when I remove retain_graph = True from loss.backward(), I get this error instead:

RuntimeError: Trying to backward through the graph a second time (or directly access saved variables after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved variables after calling backward.

There shouldn't be any attempt to backward through the graph more than once here. Perhaps the graph is not being cleared between training loops?

Tags: python pytorch recurrent-neural-network


    Solution 1:

    The problem is that you are accumulating your loss values (and, along with them, their associated computation graphs) on the variable loss, here:

        loss += criterion(output, target_class)
    

    In turn, this means that at every iteration you are trying to backpropagate through the current *and* all previous loss values that were computed during earlier inferences. In this particular instance of looping over a dataset, this is not the right thing to do.
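The effect can be reproduced with a toy example, independent of the RNN (a sketch with a single scalar parameter):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

loss = 0
failed_steps = []
for step in range(2):
    # each iteration appends a fresh subgraph to the accumulated loss
    loss = loss + (w * 2.0) ** 2
    try:
        # without retain_graph=True, backward() frees the graph's saved
        # tensors; the second call re-traverses the first iteration's
        # (already freed) subgraph and raises the familiar RuntimeError
        loss.backward()
    except RuntimeError:
        failed_steps.append(step)
```

The first `backward()` succeeds; the second fails, exactly as in the question once `retain_graph=True` is removed.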

    A simple fix is to accumulate the underlying value of loss with item, i.e. the scalar value, rather than the tensor itself, and to backpropagate on the current loss tensor only:

    total_loss = 0

    for i in range(len(train)):
        optimizer.zero_grad()
        output, hidden = rnn(train[i].unsqueeze(0), hidden)
        target_class = (target[i] == 1).nonzero(as_tuple=True)[0]

        loss = criterion(output, target_class)
        loss.backward()
        optimizer.step()

        # detach the hidden state so the next backward() does not
        # reach back into this iteration's (already freed) graph
        hidden = hidden.detach()

        total_loss += loss.item()

    Since you update the model's parameters directly after having performed the backpropagation, you do not need to keep the graph in memory.
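Putting the pieces together, a self-contained version of the corrected training loop might look like the following sketch (the tiny hidden size and the one-hot data are made up for illustration; detaching the hidden state between steps keeps each backward() from re-entering the previous, already-freed graph):

```python
import torch
import torch.nn as nn

num_chars = 8  # toy vocabulary size

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        return self.softmax(self.i2o(combined)), self.i2h(combined)

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

rnn = RNN(num_chars, 16, num_chars)
criterion = nn.NLLLoss()
optimizer = torch.optim.AdamW(rnn.parameters(), lr=0.01)

def train(train_seq, target_seq):
    hidden = rnn.initHidden()
    total_loss = 0
    for i in range(len(train_seq)):
        optimizer.zero_grad()
        output, hidden = rnn(train_seq[i].unsqueeze(0), hidden)
        target_class = (target_seq[i] == 1).nonzero(as_tuple=True)[0]
        loss = criterion(output, target_class)
        loss.backward()           # backprop through this step only
        optimizer.step()
        hidden = hidden.detach()  # cut the link to the freed graph
        total_loss += loss.item()
    return output, total_loss / train_seq.size(0)

# made-up one-hot data: 5 steps over an 8-character vocabulary
x = torch.eye(num_chars)[:5]
y = torch.eye(num_chars)[1:6]
out, avg_loss = train(x, y)
```

This runs through the whole sequence without either of the two RuntimeErrors from the question.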

    Comments:

    • Thanks! That worked! The explanation makes sense.