【Question Title】: Understanding the backward mechanism of LSTMCell in PyTorch
【Posted】: 2019-05-07 03:47:10
【Question Description】:

I want to hook the backward pass of the LSTMCell function in PyTorch, so during initialization I do the following (num_layers=4, hidden_size=1000, input_size=1000):

self.layers = nn.ModuleList([
        LSTMCell(
            input_size=input_size,
            hidden_size=hidden_size,
        )
        for layer in range(num_layers)
    ])

for l in self.layers:
    l.register_backward_hook(backward_hook)

In the forward pass I simply iterate the LSTMCell over the sequence length and over num_layers, like this:

for j in range(seqlen):            
    input = #some tensor of size (batch_size, input_size)
    for i, rnn in enumerate(self.layers):
        # recurrent cell
        hidden, cell = rnn(input, (prev_hiddens[i], prev_cells[i]))

The input has size (batch_size, input_size), prev_hiddens[i] has size (batch_size, hidden_size), and prev_cells[i] has size (batch_size, hidden_size).

backward_hook 中,我打印输入到此函数的张量的大小:

def backward_hook(module, grad_input, grad_output):
    for grad in grad_output:
        print("grad_output {}".format(grad))

    for grad in grad_input:
        print("grad_input.size () {}".format(grad.size()))

As a result, on the first call of backward_hook I get, for example:

[A] For grad_output I get 2 tensors, the second of which is None. This makes sense, because in the backward pass we have the gradient for the internal state (c) and the gradient for the output (h). The last iteration along the time dimension has no hidden state coming from the future, so its gradient is None.

[B] For grad_input I get 5 tensors (batch_size=9):

grad_input.size () torch.Size([9, 4000])
grad_input.size () torch.Size([9, 4000])
grad_input.size () torch.Size([9, 1000])
grad_input.size () torch.Size([4000])
grad_input.size () torch.Size([4000])

My questions are:

(1) Is my understanding of [A] correct?

(2) How do I interpret the 5 tensors in the grad_input tuple? I thought there should be only 3, since LSTMCell's forward() takes only 3 tensors as input?

Thanks

【Discussion】:

    Tags: neural-network lstm pytorch recurrent-neural-network


    【Solution 1】:

    Your understanding of grad_input and grad_output is wrong. Let me try to explain it with a simpler example.

    def backward_hook(module, grad_input, grad_output):
        for grad in grad_output:
            print ("grad_output.size {}".format(grad.size()))
    
        for grad in grad_input:
            if grad is None:
                print('None')
            else:
                print ("grad_input.size: {}".format(grad.size()))
        print()
    
    model = nn.Linear(10, 20)
    model.register_backward_hook(backward_hook)
    
    input = torch.randn(8, 3, 10)
    Y = torch.randn(8, 3, 20)
    
    Y_pred = []
    for i in range(input.size(1)):
        out = model(input[:, i])
        Y_pred.append(out)
    
    loss = torch.norm(Y - torch.stack(Y_pred, dim=1), 2)
    loss.backward()
    

    The output is:

    grad_output.size torch.Size([8, 20])
    grad_input.size: torch.Size([8, 20])
    None
    grad_input.size: torch.Size([10, 20])
    
    grad_output.size torch.Size([8, 20])
    grad_input.size: torch.Size([8, 20])
    None
    grad_input.size: torch.Size([10, 20])
    
    grad_output.size torch.Size([8, 20])
    grad_input.size: torch.Size([8, 20])
    None
    grad_input.size: torch.Size([10, 20])
    

    Explanation

    • grad_output: the gradient of the loss w.r.t. the layer output, Y_pred.

    • grad_input: the gradient of the loss w.r.t. the layer inputs. For the Linear layer, the inputs are the input tensor together with the weight and bias parameters.

    So, in the output you see:

    grad_input.size: torch.Size([8, 20])  # for the `bias`
    None                                  # for the `input`
    grad_input.size: torch.Size([10, 20]) # for the `weight`
    
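    A quick cross-check (a sketch, not part of the original answer): after a backward pass, the parameters' own .grad attributes carry the shapes you would naively expect. Note that these need not match the hook's grad_input shapes exactly, because the legacy register_backward_hook reports gradients w.r.t. the inputs of the module's last internal operation; for Linear that operation sees the bias broadcast to (8, 20) and the weight transposed to (10, 20), which is why those sizes appear above.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 20)
x = torch.randn(8, 10, requires_grad=True)

# One forward/backward pass with a scalar loss.
loss = model(x).pow(2).sum()
loss.backward()

# Parameter gradients have the parameters' own shapes...
print(tuple(model.bias.grad.shape))    # (20,)
print(tuple(model.weight.grad.shape))  # (20, 10)
# ...and x.grad matches the input shape.
print(tuple(x.grad.shape))             # (8, 10)
```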

    The Linear layer in PyTorch uses LinearFunction, shown below.

    class LinearFunction(Function):
    
        # Note that both forward and backward are @staticmethods
        @staticmethod
        # bias is an optional argument
        def forward(ctx, input, weight, bias=None):
            ctx.save_for_backward(input, weight, bias)
            output = input.mm(weight.t())
            if bias is not None:
                output += bias.unsqueeze(0).expand_as(output)
            return output
    
        # This function has only a single output, so it gets only one gradient
        @staticmethod
        def backward(ctx, grad_output):
            # This is a pattern that is very convenient - at the top of backward
            # unpack saved_tensors and initialize all gradients w.r.t. inputs to
            # None. Thanks to the fact that additional trailing Nones are
            # ignored, the return statement is simple even when the function has
            # optional inputs.
            input, weight, bias = ctx.saved_tensors
            grad_input = grad_weight = grad_bias = None
    
            # These needs_input_grad checks are optional and there only to
            # improve efficiency. If you want to make your code simpler, you can
            # skip them. Returning gradients for inputs that don't require it is
            # not an error.
            if ctx.needs_input_grad[0]:
                grad_input = grad_output.mm(weight)
            if ctx.needs_input_grad[1]:
                grad_weight = grad_output.t().mm(input)
            if bias is not None and ctx.needs_input_grad[2]:
                grad_bias = grad_output.sum(0).squeeze(0)
    
            return grad_input, grad_weight, grad_bias
    
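    The correctness of such a hand-written backward can be checked numerically with torch.autograd.gradcheck (a self-contained sketch; the Function above is repeated in compact form so the snippet runs on its own):

```python
import torch
from torch.autograd import Function

class LinearFunction(Function):
    @staticmethod
    def forward(ctx, input, weight, bias=None):
        ctx.save_for_backward(input, weight, bias)
        output = input.mm(weight.t())
        if bias is not None:
            output += bias.unsqueeze(0).expand_as(output)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_weight = grad_bias = None
        if ctx.needs_input_grad[0]:
            grad_input = grad_output.mm(weight)
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.t().mm(input)
        if bias is not None and ctx.needs_input_grad[2]:
            grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias

# gradcheck needs double precision for stable numerical derivatives.
x = torch.randn(8, 10, dtype=torch.double, requires_grad=True)
w = torch.randn(20, 10, dtype=torch.double, requires_grad=True)
b = torch.randn(20, dtype=torch.double, requires_grad=True)

# Compares the analytic backward against finite differences.
print(torch.autograd.gradcheck(LinearFunction.apply, (x, w, b)))  # True
```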

    For an LSTMCell there are likewise four weight and bias parameters (a multi-layer nn.LSTM names them with an _l0 suffix per layer):

    weight_ih (4*hidden_size, input_size)
    weight_hh (4*hidden_size, hidden_size)
    bias_ih   (4*hidden_size,)
    bias_hh   (4*hidden_size,)
    
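    You can check those parameter names and sizes directly; note that nn.LSTMCell itself names them without the _l0 suffix used by multi-layer nn.LSTM (a quick sketch with the question's sizes, input_size = hidden_size = 1000):

```python
import torch.nn as nn

cell = nn.LSTMCell(input_size=1000, hidden_size=1000)

# The four gates (i, f, g, o) are stacked along dim 0,
# hence the leading 4 * hidden_size = 4000.
for name, param in cell.named_parameters():
    print(name, tuple(param.shape))
# weight_ih (4000, 1000)
# weight_hh (4000, 1000)
# bias_ih (4000,)
# bias_hh (4000,)
```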

    So in your case grad_input will be a tuple of 5 tensors, and, as you mentioned, grad_output is two tensors.

    【Discussion】:

    • Why, in the Linear case (your example), is the gradient of the input None, and why are there two gradients for the bias? (Is there a typo?)
    • I still don't understand why the tensors in grad_input have the sizes they do in my output. If the weights were included, there should be something like (1000x1000) or (1000x4000); none of the tensors I get have such a size.