Output shapes of the Keras AdditiveAttention layer

[Question title]: Output shapes of Keras AdditiveAttention Layer
[Posted]: 2021-05-02 06:42:31
[Question]:

I am trying to use the AdditiveAttention layer in Keras. Here is the hand-written layer from the TensorFlow tutorial https://www.tensorflow.org/tutorials/text/nmt_with_attention:

import tensorflow as tf 

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    query_with_time_axis = tf.expand_dims(query, 1)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    return context_vector, attention_weights

The shape of context_vector is (batch_size, units),

whereas with the AdditiveAttention layer built into Keras, used for the same purpose,

from tensorflow.keras.layers import AdditiveAttention

the shape of context_vector is [batch_size, Tq, dim].

Any suggestions about what causes this difference in output shape would be helpful.

[Comments]:

Tags: tensorflow keras deep-learning neural-network attention-model

[Answer 1]:

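Apart from a few differences, the two implementations are similar. The BahdanauAttention in the tutorial is a simplified and adapted version that adds a few linear transformations (the Dense layers W1, W2, and V). The shape of the returned context_vector that you are asking about simply depends on the shape of the input data. Here is a demonstration, starting with the tutorial implementation: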

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V  = tf.keras.layers.Dense(1)

  def call(self, query, values):
    query_with_time_axis = tf.expand_dims(query, 1)
    score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
    attention_weights = tf.nn.softmax(score, axis=1)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)
    return context_vector, attention_weights

Now we pass it some inputs, both 3D and 2D.

attention_layer = BahdanauAttention(10)

y = tf.random.uniform((2, 60, 512))  
out, attn = attention_layer(y, y)
out.shape , attn.shape
(TensorShape([2, 60, 512]), TensorShape([2, 2, 60, 1]))

y = tf.random.uniform((2, 512))  
out, attn = attention_layer(y, y)
out.shape , attn.shape
(TensorShape([2, 512]), TensorShape([2, 2, 1]))
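For comparison, the tutorial layer appears to be designed for a 2D query (a single decoder state) attending over 3D values (the encoder outputs), which is where the (batch_size, hidden_size) context vector in the question comes from. The sketch below assumes that calling convention; the sizes and the names query, values, and Tv are illustrative choices, not taken from the tutorial.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super().__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # (batch, hidden) -> (batch, 1, hidden) so the query broadcasts over time.
    query_with_time_axis = tf.expand_dims(query, 1)
    # score: (batch, Tv, 1), one scalar per value time step.
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))
    # Normalize over the time axis of the values.
    attention_weights = tf.nn.softmax(score, axis=1)
    # Weighted sum over time: (batch, hidden).
    context_vector = tf.reduce_sum(attention_weights * values, axis=1)
    return context_vector, attention_weights

# Hypothetical sizes, chosen to mirror the demo above.
batch_size, Tv, hidden = 2, 60, 512
query = tf.random.uniform((batch_size, hidden))       # one state per example
values = tf.random.uniform((batch_size, Tv, hidden))  # a sequence per example

context, weights = BahdanauAttention(10)(query, values)
context_shape = tuple(context.shape)
weights_shape = tuple(weights.shape)
print(context_shape, weights_shape)  # (2, 512) (2, 60, 1)
```

With these inputs there is exactly one query per example, so the time axis is summed away and the context vector is 2D. Note that its last dimension follows values (hidden), not the units argument.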

Now pass the same inputs to the built-in AdditiveAttention and see what we get:

buit_attn = tf.keras.layers.AdditiveAttention()

y = tf.random.uniform((2, 60, 512))  
out, attn = buit_attn([y, y], return_attention_scores=True)
out.shape , attn.shape
(TensorShape([2, 60, 512]), TensorShape([2, 60, 60]))

y = tf.random.uniform((2, 512))  
out, attn = buit_attn([y, y], return_attention_scores=True)
out.shape , attn.shape
(TensorShape([2, 512]), TensorShape([2, 2]))

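So the shapes of context_vector are comparable here, but the shapes of attention_weights are not. The reason, as mentioned above, is that I believe the tutorial implementation is somewhat modified and adapted. If we look at how BahdanauAttention or AdditiveAttention is computed, we get: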

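1. Reshape query and value into shapes [batch_size, Tq, 1, dim] and [batch_size, 1, Tv, dim] respectively.
2. Calculate scores with shape [batch_size, Tq, Tv] as a non-linear sum: scores = tf.reduce_sum(tf.tanh(query + value), axis=-1)
3. Use scores to calculate a distribution with shape [batch_size, Tq, Tv]: distribution = tf.nn.softmax(scores).
4. Use distribution to create a linear combination of value with shape [batch_size, Tq, dim]: return tf.matmul(distribution, value).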
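The four steps above can be traced with plain TensorFlow ops. The sketch below is a minimal illustration, not the library's source: the sizes are made up, and use_scale=False is passed to the built-in layer on the assumption that this disables its learned scale vector, so that the layer has no weights and the two results can be compared directly.

```python
import tensorflow as tf

# Hypothetical sizes; Tq and Tv differ so the axes are easy to tell apart.
batch_size, Tq, Tv, dim = 2, 5, 7, 8
query = tf.random.uniform((batch_size, Tq, dim))
value = tf.random.uniform((batch_size, Tv, dim))

# Step 1: add broadcast axes.
q = tf.expand_dims(query, 2)  # [batch_size, Tq, 1, dim]
v = tf.expand_dims(value, 1)  # [batch_size, 1, Tv, dim]

# Step 2: non-linear sum over the feature axis.
scores = tf.reduce_sum(tf.tanh(q + v), axis=-1)  # [batch_size, Tq, Tv]

# Step 3: softmax over the last axis (Tv).
distribution = tf.nn.softmax(scores)  # [batch_size, Tq, Tv]

# Step 4: linear combination of the values.
manual_out = tf.matmul(distribution, value)  # [batch_size, Tq, dim]

# Built-in layer without the learned scale (assumption: use_scale=False).
builtin = tf.keras.layers.AdditiveAttention(use_scale=False)
builtin_out, builtin_attn = builtin(
    [query, value], return_attention_scores=True)

manual_shape = tuple(manual_out.shape)
attn_shape = tuple(distribution.shape)
max_diff = float(tf.reduce_max(tf.abs(manual_out - builtin_out)))
print(manual_shape, attn_shape, max_diff)
```

The attention weights carry one row per query position (Tq) and one column per value position (Tv), which is why the built-in layer returns [batch_size, Tq, Tv] rather than the tutorial's [batch_size, Tv, 1].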

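I also think the tutorial implementation computes the attention weights a bit differently. If we follow the approach above (steps 1 to 4), we get the same output shape for attention_weights as well. Here is how (note that this is for demonstration purposes only and is not general):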

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
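    # W1, W2 and V are kept from the tutorial version for comparison only;
    # call() below does not use them.
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)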

  def call(self, query, values):
    query_with_time_axis = tf.expand_dims(query, 2)  # [batch_size, Tq, 1, dim]
    value_with_time_axis = tf.expand_dims(values, 1) # [batch_size, 1, Tv, dim]
    scores = tf.reduce_sum(tf.tanh(query_with_time_axis + 
                                   value_with_time_axis), axis=-1)
    distribution = tf.nn.softmax(scores)
    return tf.matmul(distribution, values), distribution

Now, if we pass the same input, both implementations give the same output shapes. In general, however, you should prefer the built-in implementation.

attention_layer = BahdanauAttention(10)

y = tf.random.uniform((2, 60, 512))  
out, attn = attention_layer(y, y)
out.shape , attn.shape
(TensorShape([2, 60, 512]), TensorShape([2, 60, 60]))

buit_attn = tf.keras.layers.AdditiveAttention()
y = tf.random.uniform((2, 60, 512))  
out, attn = buit_attn([y, y], return_attention_scores=True)
out.shape , attn.shape
(TensorShape([2, 60, 512]), TensorShape([2, 60, 60]))
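To get back the tutorial-style (batch_size, dim) context vector from the built-in layer, one option is to give the query a time axis of length 1 and squeeze it out afterwards. This is a sketch under assumptions: decoder_state and encoder_outputs are hypothetical names, and because the built-in layer has no W1/W2 projections, the query and the values must already share the same last dimension.

```python
import tensorflow as tf

# Hypothetical sizes and tensors standing in for a decoder state and encoder outputs.
batch_size, Tv, dim = 2, 60, 512
decoder_state = tf.random.uniform((batch_size, dim))
encoder_outputs = tf.random.uniform((batch_size, Tv, dim))

# Treat the single state as a query sequence of length Tq = 1.
query = tf.expand_dims(decoder_state, 1)  # (batch_size, 1, dim)

attn_layer = tf.keras.layers.AdditiveAttention()
context, weights = attn_layer(
    [query, encoder_outputs], return_attention_scores=True)

# Drop the length-1 query axis to recover a 2D context vector.
context_vector = tf.squeeze(context, axis=1)

context_shape = tuple(context.shape)
weights_shape = tuple(weights.shape)
context_vector_shape = tuple(context_vector.shape)
print(context_shape, weights_shape, context_vector_shape)
# (2, 1, 512) (2, 1, 60) (2, 512)
```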

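[Comments]:

Thank you for the answer. But how can the built-in attention layer be used for problems that need an input of shape (batch_size, units), such as text classification? When I pass the output of the built-in attention layer through Flatten before a Dense layer, it raises an error.
Do you mean something like y = tf.random.uniform((2, 512)) ; out, attn = buit_attn([y, y], return_attention_scores=True)? I'm not quite sure what you are doing, but this should work.
Sorry if I was unclear. I am confused about using the output of the built-in attention layer for text classification, because its output is 3D. When I pass the output of the attention layer through Flatten and then on to Dense, it raises an error.
This shows how to use the built-in layer. Otherwise, if you are just following the tutorial, I think you should go with that implementation. I wish I could be of more help.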
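The error from the comments is not shown, so its cause is unknown. As one possible way to feed the 3D attention output into a Dense classifier, the sketch below pools over the time axis with GlobalAveragePooling1D. The vocabulary size, sequence length, embedding size, and class count are all hypothetical, and self-attention over the embeddings is an assumption made for illustration.

```python
import tensorflow as tf

# Hypothetical hyperparameters for a toy text classifier.
vocab_size, seq_len, embed_dim, num_classes = 1000, 20, 32, 3

tokens = tf.keras.Input(shape=(seq_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)  # (batch, seq_len, embed_dim)

# Self-attention: the embedded sequence serves as both query and value.
attended = tf.keras.layers.AdditiveAttention()([x, x])  # (batch, seq_len, embed_dim)

# Collapse the time axis so the Dense layer sees a 2D tensor.
pooled = tf.keras.layers.GlobalAveragePooling1D()(attended)  # (batch, embed_dim)
logits = tf.keras.layers.Dense(num_classes)(pooled)          # (batch, num_classes)

model = tf.keras.Model(tokens, logits)

# Random token ids, only to check the shapes.
batch = tf.random.uniform((4, seq_len), maxval=vocab_size, dtype=tf.int32)
out_shape = tuple(model(batch).shape)
print(out_shape)  # (4, 3)
```

Pooling yields a (batch_size, embed_dim) tensor regardless of sequence length, whereas Flatten needs a fixed sequence length to produce a known feature size.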