TensorFlow: _variable_with_weight_decay(...) explanation

[Question title]: TensorFlow: _variable_with_weight_decay(...) explanation
[Posted]: 2023-04-01 02:15:01
[Question description]:

I am currently looking at the cifar10 example and noticed the function _variable_with_weight_decay(...) in the file cifar10.py. The code is as follows:

def _variable_with_weight_decay(name, shape, stddev, wd):
  """Helper to create an initialized Variable with weight decay.
  Note that the Variable is initialized with a truncated normal distribution.
  A weight decay is added only if one is specified.
  Args:
    name: name of the variable
    shape: list of ints
    stddev: standard deviation of a truncated Gaussian
    wd: add L2Loss weight decay multiplied by this float. If None, weight
        decay is not added for this Variable.
  Returns:
    Variable Tensor
  """
  dtype = tf.float16 if FLAGS.use_fp16 else tf.float32
  var = _variable_on_cpu(
      name,
      shape,
      tf.truncated_normal_initializer(stddev=stddev, dtype=dtype))
  if wd is not None:
    weight_decay = tf.mul(tf.nn.l2_loss(var), wd, name='weight_loss')
    tf.add_to_collection('losses', weight_decay)
  return var

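I am wondering whether this function does what it says. Clearly, when a weight decay factor is given (wd is not None), the decay value (weight_decay) is computed. But is it ever applied? At the end the unmodified variable (var) is returned, or am I missing something?

My second question is how to fix this. As I understand it, the value of the scalar weight_decay must be subtracted from each element of the weight matrix, but I cannot find a TensorFlow op that does that (adding or subtracting a single value from every element of a tensor). Is there such an op? As a workaround, I thought it might be possible to create a new tensor initialized with the value of weight_decay and use tf.subtract(...) to get the same result. Or is that the right way to go anyway?

Thanks in advance.

[Question comments]:

Tags: python tensorflow neural-network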

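[Solution 1]:

The code does what it says. You are supposed to sum everything in the 'losses' collection (the weight decay term is added to it in the second-to-last line of the function) to get the loss that you pass to the optimizer. In the loss() function of that example: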

tf.add_to_collection('losses', cross_entropy_mean)
[...]
return tf.add_n(tf.get_collection('losses'), name='total_loss')

So what the loss() function returns is the classification loss plus everything that was already in the 'losses' collection.
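To make the pattern concrete, here is a minimal plain-Python sketch of the same idea. It is not TensorFlow code: the `_collections` dict and the helpers `add_to_collection`, `get_collection`, and `variable_with_weight_decay` are hypothetical stand-ins written for this illustration, and the loss value 0.25 is made up. It only shows that the helper registers a penalty as a side effect and returns the weights untouched, and that the penalty enters the total when the collection is summed.

```python
# Toy stand-in for a graph collection: a dict mapping a name to a list.
# This is NOT TensorFlow's implementation, only an illustration of the pattern.
_collections = {}


def add_to_collection(name, value):
    """Append a value to the named collection (hypothetical helper)."""
    _collections.setdefault(name, []).append(value)


def get_collection(name):
    """Return a copy of the values stored under the name (hypothetical helper)."""
    return list(_collections.get(name, []))


def l2_loss(values):
    """sum(t_i ** 2) / 2, the formula quoted in the answer."""
    return sum(v * v for v in values) / 2


def variable_with_weight_decay(values, wd):
    """Register wd * l2_loss(values) as a loss term; return values unchanged."""
    if wd is not None:
        add_to_collection('losses', wd * l2_loss(values))
    return values


# The weights come back unmodified; only the collection has changed.
weights = variable_with_weight_decay([1.0, -2.0, 3.0], wd=0.5)

cross_entropy_mean = 0.25  # made-up classification loss for the sketch
add_to_collection('losses', cross_entropy_mean)

# Analogue of tf.add_n(tf.get_collection('losses')): sum every registered term.
total_loss = sum(get_collection('losses'))
print(weights, total_loss)
```

The weight decay term therefore never touches the variable directly; it only influences the weights through the gradient of the summed loss.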

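As a side note, weight decay does not mean that you subtract the value of wd from every value in the tensor during the update step. Instead, in plain SGD, each update multiplies the value by (1 - learning_rate*wd). To see why, recall that l2_loss computes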

output = sum(t_i ** 2) / 2

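where t_i are the elements of the tensor. This means that the derivative of l2_loss with respect to each tensor element is the value of that element itself, and since you scaled l2_loss by wd, the derivative is scaled by wd as well.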
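The derivative claim can be checked numerically. The sketch below uses plain Python and a central finite difference; the helper names (`scaled_loss`, `numeric_grad`) and the sample values are assumptions chosen for this illustration, not part of the cifar10 example.

```python
def l2_loss(t):
    """sum(t_i ** 2) / 2, as quoted above."""
    return sum(x * x for x in t) / 2


def scaled_loss(t, wd):
    """The weight decay term: wd * l2_loss(t)."""
    return wd * l2_loss(t)


def numeric_grad(f, t, eps=1e-6):
    """Central finite-difference estimate of df/dt_i for each element."""
    grads = []
    for i in range(len(t)):
        up = list(t)
        up[i] += eps
        down = list(t)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


t = [0.5, -1.5, 2.0]   # sample tensor elements (made up)
wd = 0.004             # sample weight decay factor (made up)

grad = numeric_grad(lambda v: scaled_loss(v, wd), t)
expected = [wd * x for x in t]  # analytic result: wd * t_i

for g, e in zip(grad, expected):
    print(g, e)
```

Up to floating-point rounding, each estimated derivative matches wd * t_i.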

The update step (again, in plain SGD) is, omitting the time step indices,

w := w - learning_rate * dL/dw

so if the weight decay term were the only term in the loss, you would get

w := w - learning_rate * wd * w

or

w := w * (1 - learning_rate * wd)
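As a quick sanity check of this equivalence, the following plain-Python sketch runs 100 SGD steps on a loss consisting only of the weight decay term and compares the result with repeated multiplication by (1 - learning_rate * wd). The helper `sgd_step`, the learning rate, the wd value, and the starting weights are all assumptions picked for the illustration.

```python
def sgd_step(w, grad, learning_rate):
    """Plain SGD: w := w - learning_rate * dL/dw, element by element."""
    return [wi - learning_rate * gi for wi, gi in zip(w, grad)]


learning_rate = 0.1        # made-up value
wd = 0.004                 # made-up value
w = [0.5, -1.5, 2.0]       # made-up starting weights
steps = 100

w_sgd = list(w)
w_scaled = list(w)
for _ in range(steps):
    # Gradient of wd * l2_loss(w) with respect to each element is wd * w_i.
    grad = [wd * wi for wi in w_sgd]
    w_sgd = sgd_step(w_sgd, grad, learning_rate)
    # Equivalent multiplicative form of the same update.
    w_scaled = [wi * (1 - learning_rate * wd) for wi in w_scaled]

# Closed form after `steps` updates: w_i * (1 - learning_rate * wd) ** steps
w_closed = [wi * (1 - learning_rate * wd) ** steps for wi in w]

print(w_sgd)
print(w_scaled)
print(w_closed)
```

All three lists agree up to floating-point rounding, and every weight has shrunk toward zero by the same factor, which is why the penalty is called weight decay.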

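[Comments]:

Thanks for the quick answer. You are right. I was confused by the complex structure of the code and forgot that weight decay does not affect the structure of the graph; it only comes into play during the weight updates.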