Bahdanau's attention in neural machine translation: answers

【Question title】: Bahdanaus attention in Neural machine translation with attention
【Posted】: 2020-08-05 15:34:10
【Question】:

I am trying to understand Bahdanau's attention using the following tutorial: https://www.tensorflow.org/tutorials/text/nmt_with_attention

The computation is as follows:

self.attention_units = attention_units
self.W1 = Dense(self.attention_units)
self.W2 = Dense(self.attention_units)
self.V = Dense(1)

score = self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)))

I have two questions:

1. I don't understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units).

Using the rules of matrix multiplication I got the following results:

a) Shape of self.W1(last_inp_dec) -> (1, hidden_units_dec) * (hidden_units_dec, attention_units) = (1, attention_units)

b) Shape of self.W2(last_inp_enc) -> (max_len, hidden_units_dec) * (hidden_units_dec, attention_units) = (max_len, attention_units)

We then add the quantities from a) and b). How do we end up with the dimensions (max_len, attention_units) or (batch_size, max_len, attention_units)? How can we add tensors whose second dimension differs in size (1 vs. max_len)?

2. Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Is it because we want alpha to be a scalar?

【Comments】:

Tags: tensorflow deep-learning attention-model

【Answer 1】:

1) I don't understand why the shape of tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) is (batch_size, max_len, attention_units)?

From the comments in the code of class BahdanauAttention:

query_with_time_axis shape = (batch_size, 1, hidden size)

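Note that the dimension of size 1 is added with tf.expand_dims to make the shape compatible with values for the addition. This added dimension is broadcast during the addition. Without it, the incoming shape would be (batch_size, hidden size), which would not be compatible.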

values shape = (batch_size, max_len, hidden size)

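self.W1 and self.W2 project the last axis of both tensors to attention_units, giving (batch_size, 1, attention_units) and (batch_size, max_len, attention_units). Adding the two broadcasts the size-1 axis, so the sum, and the tanh of it, has shape (batch_size, max_len, attention_units).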
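
The broadcasting step can be checked with a minimal NumPy sketch. The sizes below are arbitrary values chosen for illustration, and the random arrays are only stand-ins for the outputs of self.W1 and self.W2, not the tutorial's code.

```python
import numpy as np

batch_size, max_len, attention_units = 4, 7, 10  # illustrative sizes, not from the tutorial

rng = np.random.default_rng(0)
# Stand-in for self.W1(query_with_time_axis): one vector per batch element
w1_out = rng.normal(size=(batch_size, 1, attention_units))
# Stand-in for self.W2(values): one vector per encoder time step
w2_out = rng.normal(size=(batch_size, max_len, attention_units))

# The size-1 time axis of w1_out is broadcast across all max_len steps
summed = np.tanh(w1_out + w2_out)
print(summed.shape)  # (4, 7, 10)

# Without the extra axis, (4, 10) and (4, 7, 10) cannot be broadcast together
try:
    w1_out[:, 0, :] + w2_out
except ValueError as err:
    print("incompatible shapes:", err)
```

The same decoder vector is added to every encoder time step, which is why the time axis of the result has length max_len.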

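2) Why do we multiply tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V? Because we want alpha to be a scalar?

self.V is the final layer, and its output gives us the score. The random weight initialization of the self.V layer is handled by Keras behind the scenes in the line self.V = tf.keras.layers.Dense(1).

We are not multiplying tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) by self.V as if self.V were a tensor. self.V is a layer that is called on that tensor; internally, Dense(1) multiplies its input by its own kernel of shape (attention_units, 1) and adds a bias.

The construct self.V(tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc))) means that the tanh activations produced by tf.nn.tanh(self.W1(last_inp_dec) + self.W2(input_enc)) form the input matrix to the single-output output layer represented by self.V.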
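
What a Dense(1) layer does to the tanh output can be sketched in NumPy roughly as follows. The names v_kernel and v_bias and all sizes are hypothetical stand-ins for the weights Keras creates internally; this is an illustration under those assumptions, not the tutorial's implementation.

```python
import numpy as np

batch_size, max_len, attention_units = 4, 7, 10  # illustrative sizes

rng = np.random.default_rng(1)
# Stand-in for tf.nn.tanh(self.W1(...) + self.W2(...))
tanh_out = np.tanh(rng.normal(size=(batch_size, max_len, attention_units)))

# Hypothetical stand-ins for the kernel and bias of Dense(1)
v_kernel = rng.normal(size=(attention_units, 1))
v_bias = np.zeros(1)

# Dense acts on the last axis: each attention_units-vector becomes one number
score = tanh_out @ v_kernel + v_bias
print(score.shape)  # (4, 7, 1)
```

So the layer yields one scalar score per encoder time step, which is the input the softmax needs.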

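【Comments】:

【Answer 2】:

The shapes are slightly different from the ones you have given. Perhaps a direct example makes this easier to understand.

Say the alignment layer has 10 units, the decoder has 128 embedding dimensions, and the encoder has 256 dimensions and 19 time steps. Then:

The shapes of last_inp_dec and input_enc will be (?,128) and (?,19,256). We now need to expand last_inp_dec over the time axis to make it (?,1,128) so that the addition is possible.

The layer weights (kernels) for w1, w2 and v will be (128,10), (256,10) and (10,1) respectively; the weights themselves have no batch dimension. Notice how self.w1(last_inp_dec) works out to (?,1,10). This is added to each time step of self.w2(input_enc), which has shape (?,19,10), and the sum has shape (?,19,10) as well. The result is fed to self.v and the output is (?,19,1), which is the shape we want - a set of 19 scores. Applying a softmax over the time axis gives the attention weights.

Multiplying each encoder hidden state by its attention weight and summing over the time steps returns the context.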
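
The shapes in this example can be traced end to end with a small NumPy sketch. The batch size of 2, the random weights, and the omission of biases are assumptions made for illustration; the variable names mirror the ones used above but this is not the tutorial's code.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# batch=2 is an assumed value standing in for "?"
batch, steps, dec_dim, enc_dim, units = 2, 19, 128, 256, 10

rng = np.random.default_rng(0)
last_inp_dec = rng.normal(size=(batch, dec_dim))         # (2, 128)
input_enc = rng.normal(size=(batch, steps, enc_dim))     # (2, 19, 256)

# Kernels only (biases omitted for brevity); note there is no batch axis
w1 = rng.normal(size=(dec_dim, units)) * 0.1             # (128, 10)
w2 = rng.normal(size=(enc_dim, units)) * 0.1             # (256, 10)
v = rng.normal(size=(units, 1))                          # (10, 1)

query = last_inp_dec[:, None, :]                         # (2, 1, 128)
score = np.tanh(query @ w1 + input_enc @ w2) @ v         # (2, 19, 1)
attention_weights = softmax(score, axis=1)               # (2, 19, 1)
context = (attention_weights * input_enc).sum(axis=1)    # (2, 256)

print(score.shape, attention_weights.shape, context.shape)
```

For each batch element the 19 attention weights sum to 1, and the context vector has the encoder's dimensionality (256), since it is a weighted average of the encoder hidden states.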

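As for your question of why "v" is needed: it is needed because Bahdanau provides the option of using "n" units in the alignment layer (which determines w1 and w2), so we need one more layer on top to massage the tensor back into the shape we want - a set of attention weights, one for each time step.

I just posted an answer at Understanding Bahdanau's Attention Linear Algebra that lists all the shapes of the tensors and weights involved.

【Comments】: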