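Cannot find in-place operation causing "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:"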

【Question title】: Cannot find in-place operation causing "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:"
【Posted】: 2020-08-02 02:46:57
【Question】:

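I am relatively new to PyTorch and I am trying to reproduce an algorithm from an academic paper that approximates a term using the Hessian matrix. I have set up a toy problem so that I can compare the results of the full Hessian with the approximation. I found this gist and have been using it to compute the full Hessian part of the algorithm.

I am getting the error: "RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation."

I have gone through simple example code, the documentation, and many forum posts about this error, and I cannot find any in-place operation. Any help would be greatly appreciated!

Here is my code: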

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

torch.set_printoptions(precision=20, linewidth=180)

def jacobian(y, x, create_graph=False):
    jac = []
    flat_y = y.reshape(-1)     
    grad_y = torch.zeros_like(flat_y)     

    for i in range(len(flat_y)):         
        grad_y[i] = 1.
        grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
        jac.append(grad_x.reshape(x.shape))
        grad_y[i] = 0.
    return torch.stack(jac).reshape(y.shape + x.shape)           

def hessian(y, x):
    return jacobian(jacobian(y, x, create_graph=True), x)                                             

def f(x):                                                                                             
    return x * x

np.random.seed(435537698)

num_dims = 2
num_samples = 3

X = [np.random.uniform(size=num_dims) for i in range(num_samples)]
print('X: \n{}\n\n'.format(X))

mean = torch.Tensor(np.mean(X, axis=0))
mean.requires_grad = True
print('mean: \n{}\n\n'.format(mean))

cov = torch.Tensor(np.cov(X, rowvar=False))
print('cov: \n{}\n\n'.format(cov))

with autograd.detect_anomaly():
    hessian_matrices = hessian(f(mean), mean)
    print('hessian: \n{}\n\n'.format(hessian_matrices))

Here is the output with the stack trace:

X: 
[array([0.81700949, 0.17141617]), array([0.53579366, 0.31141496]), array([0.49756485, 0.97495776])]


mean: 
tensor([0.61678934097290039062, 0.48592963814735412598], requires_grad=True)


cov: 
tensor([[ 0.03043144382536411285, -0.05357056483626365662],
        [-0.05357056483626365662,  0.18426130712032318115]])


---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-5a1c492d2873> in <module>()
     42 
     43 with autograd.detect_anomaly():
---> 44     hessian_matrices = hessian(f(mean), mean)
     45     print('hessian: \n{}\n\n'.format(hessian_matrices))

2 frames
<ipython-input-3-5a1c492d2873> in hessian(y, x)
     21 
     22 def hessian(y, x):
---> 23     return jacobian(jacobian(y, x, create_graph=True), x)
     24 
     25 def f(x):

<ipython-input-3-5a1c492d2873> in jacobian(y, x, create_graph)
     15     for i in range(len(flat_y)):
     16         grad_y[i] = 1.
---> 17         grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
     18         jac.append(grad_x.reshape(x.shape))
     19         grad_y[i] = 0.

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
    155     return Variable._execution_engine.run_backward(
    156         outputs, grad_outputs, retain_graph, create_graph,
--> 157         inputs, allow_unused)
    158 
    159 

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [2]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

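【Comments】:

It seems some magic is happening in the C code of torch.autograd.grad... Changing the definition of f(x) from x*x to x*x*torch.ones_like(x) fixes the problem. I have no idea why... It looks like a bug in PyTorch to me...
That does seem to make it work, as if by magic. It would be good if someone could add an explanation of why.

Tags: pytorch autograd hessian-matrix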

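【Solution 1】:

I sincerely thought this was a bug in PyTorch, but after filing a bug report I got a good answer from albanD: https://github.com/pytorch/pytorch/issues/36903#issuecomment-616671247 He also pointed out that questions can be asked at https://discuss.pytorch.org/.

The problem arises because we traverse the computation graph again and again. Exactly what happens here is beyond me, though...

The in-place edits your error message refers to are the obvious ones: grad_y[i] = 1. and grad_y[i] = 0. Reusing the same grad_y tensor over and over in the computation is what causes the trouble. Redefining jacobian(...) as follows works for me.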

def jacobian(y, x, create_graph=False):
    jac = []
    flat_y = y.reshape(-1)
    for i in range(len(flat_y)):
        grad_y = torch.zeros_like(flat_y)
        grad_y[i] = 1.
        grad_x, = torch.autograd.grad(flat_y, x, grad_y, retain_graph=True, create_graph=create_graph)
        jac.append(grad_x.reshape(x.shape))
    return torch.stack(jac).reshape(y.shape + x.shape)
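A minimal sketch of the mechanism behind this error message, under the assumption that it is autograd's version-counter check that fires here. The tensors x and w are made up for illustration, and _version is an internal attribute that is read only to make the counter visible. The message in the question ("is at version 4; expected version 3") reports this kind of mismatch.

```python
import torch

# Hypothetical toy tensors, not taken from the question.
x = torch.tensor([1.0, 2.0], requires_grad=True)
w = torch.zeros(2)

w[0] = 1.0                       # in-place write: bumps w's version counter
y = (x * w).sum()                # autograd saves w, since dy/dx = w
version_when_saved = w._version

w[0] = 0.0                       # second in-place write, after w was saved
version_now = w._version

failed = False
message = ""
try:
    y.backward()                 # the saved w no longer matches its recorded version
except RuntimeError as err:
    failed = True
    message = str(err)

print(version_when_saved, version_now, failed)
```

Allocating a fresh grad_y inside the loop, as in the jacobian above, means that no tensor is written to after autograd has saved it.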

Another approach that works, although it feels more like black magic to me, is to leave jacobian(...) as it is and instead redefine f(x) as

def f(x):
    return x * x * 1

This also works.
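A third variant, not from the original answer and offered only as a sketch: take each one-hot grad_outputs vector from a row of torch.eye, so that nothing is ever written in place. The name basis and the test input x are assumptions made for illustration.

```python
import torch

def jacobian(y, x, create_graph=False):
    flat_y = y.reshape(-1)
    # Each row of the identity matrix is a one-hot vector; the rows are only
    # read, never written, so no in-place operation takes place.
    basis = torch.eye(flat_y.numel(), dtype=flat_y.dtype, device=flat_y.device)
    rows = []
    for one_hot in basis:
        grad_x, = torch.autograd.grad(
            flat_y, x, one_hot, retain_graph=True, create_graph=create_graph
        )
        rows.append(grad_x.reshape(x.shape))
    return torch.stack(rows).reshape(y.shape + x.shape)

def hessian(y, x):
    return jacobian(jacobian(y, x, create_graph=True), x)

# Hypothetical input; f(x) = x * x as in the question.
x = torch.tensor([0.5, -1.5], requires_grad=True)
H = hessian(x * x, x)

# For f_i(x) = x_i ** 2 the only nonzero second derivatives are H[i, i, i] = 2.
expected = torch.zeros(2, 2, 2)
expected[0, 0, 0] = 2.0
expected[1, 1, 1] = 2.0
print(H.shape, torch.allclose(H, expected))
```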

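【Comments】:

【Solution 2】:

For future readers: the RuntimeError mentioned in the title can occur in more general settings than the original poster's, for example when moving tensor slices around and/or building tensors from list comprehensions. That is the context that brought me here (this page was the first link my search engine returned for the RuntimeError).

To prevent this RuntimeError and make sure gradients can flow smoothly, the advice that helped me most is mentioned in the link above (but is missing from the solution post): call the .clone() method on torch.Tensors when moving them, or slices of them, around.

For example: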

some_container[slice_indices] = original_tensor[slice_indices].clone()

where only original_tensor has requires_grad=True, and subsequent (possibly batched) operations are performed on the tensor some_container.

Or:

some_container = [
    tensor.clone() 
    for tensor in some_tensor_list if some_condition_fn(tensor)
]
new_composed_tensor = torch.cat(some_container, dim=0)
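A small sketch of why .clone() can help, under the assumption that the slice is later modified in place. The function run and the tensors x, h and part are hypothetical names used only for this illustration.

```python
import torch

def run(use_clone):
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    h = x * 2
    y = (h ** 2).sum()                 # autograd saves h to compute the gradient
    # Without clone(), part is a view that shares memory with h.
    part = h[:2].clone() if use_clone else h[:2]
    part.mul_(10)                      # in-place edit of the slice
    try:
        y.backward()
    except RuntimeError:
        return None                    # h was changed through the view
    return x.grad

grad_without_clone = run(use_clone=False)
grad_with_clone = run(use_clone=True)
print(grad_without_clone, grad_with_clone)
```

With the clone, the in-place multiplication touches only the copy, so h is still intact when y.backward() runs and the gradient is 8 * x.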

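【Comments】:

Thanks for the guidance @Yunnosch. I have tried to give a more thorough explanation and to emphasize how I intend this to add to the previous solution post. I hope it is suitable? Please tell me if I am still missing something...
Now it looks more like an answer. I am not entirely happy with "widening" the scope of the actual question. On the other hand, people who come here because of the title may find help in this post. It is not my idea of how a Q/A pair should look, but I now accept this as an answer. (Also, strictly speaking, this is not my technical field, so I will not judge, because I may be too far off...) Have fun.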