Why no weight decay on the convolutional layers in the cifar10 example of tensorflow? (Answers)

【Question title】: Why no weight decay on the convolutional layers in the cifar10 example of tensorflow?
【Posted】: 2016-03-05 22:54:40
【Question】:

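In the cifar10 example on tensorflow, there seems to be no weight decay on the convolutional layers. In fact, no layer has weight decay except the two fully connected layers. Is this common practice? I thought weight decay was applied to all weights (except biases).

For reference, here is the relevant code (wd is the weight decay factor):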

  # conv1
  with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64],
                                         stddev=1e-4, wd=0.0)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    bias = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)

  # pool1
  pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                         padding='SAME', name='pool1')
  # norm1
  norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm1')

  # conv2
  with tf.variable_scope('conv2') as scope:
    kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64],
                                         stddev=1e-4, wd=0.0)
    conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
    bias = tf.nn.bias_add(conv, biases)
    conv2 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv2)

  # norm2
  norm2 = tf.nn.lrn(conv2, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75,
                    name='norm2')
  # pool2
  pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1],
                         strides=[1, 2, 2, 1], padding='SAME', name='pool2')

  # local3
  with tf.variable_scope('local3') as scope:
    # Move everything into depth so we can perform a single matrix multiply.
    dim = 1
    for d in pool2.get_shape()[1:].as_list():
      dim *= d
    reshape = tf.reshape(pool2, [FLAGS.batch_size, dim])

    weights = _variable_with_weight_decay('weights', shape=[dim, 384],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [384], tf.constant_initializer(0.1))
    local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)
    _activation_summary(local3)

  # local4
  with tf.variable_scope('local4') as scope:
    weights = _variable_with_weight_decay('weights', shape=[384, 192],
                                          stddev=0.04, wd=0.004)
    biases = _variable_on_cpu('biases', [192], tf.constant_initializer(0.1))
    local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name=scope.name)
    _activation_summary(local4)

  # softmax, i.e. softmax(WX + b)
  with tf.variable_scope('softmax_linear') as scope:
    weights = _variable_with_weight_decay('weights', [192, NUM_CLASSES],
                                          stddev=1/192.0, wd=0.0)
    biases = _variable_on_cpu('biases', [NUM_CLASSES],
                              tf.constant_initializer(0.0))
    softmax_linear = tf.add(tf.matmul(local4, weights), biases, name=scope.name)
    _activation_summary(softmax_linear)

  return softmax_linear
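
The body of _variable_with_weight_decay is not shown above. As a rough sketch of what wd is assumed to do (scale an L2 penalty on that layer's weights and add it to the total loss), the plain-Python snippet below mimics the idea. The function names, the tiny weight lists, and the layers dict are hypothetical and only illustrate that wd=0.0 contributes nothing while wd=0.004 does.

```python
# Sketch only: this mimics the assumed role of `wd`; it is not the
# TensorFlow helper from the example.

def l2_loss(values):
    """Half the sum of squares (the convention tf.nn.l2_loss uses)."""
    return 0.5 * sum(v * v for v in values)


def weight_decay_term(weights, wd):
    """Penalty one layer adds to the loss; zero when wd is 0.0."""
    return wd * l2_loss(weights)


# Hypothetical flattened weights per layer, paired with the wd values
# that appear in the question's code.
layers = {
    "conv1": ([0.5, -0.5], 0.0),
    "conv2": ([0.3, 0.7], 0.0),
    "local3": ([1.0, 2.0], 0.004),
    "local4": ([2.0, 0.0], 0.004),
    "softmax_linear": ([0.1, 0.2], 0.0),
}

penalties = {name: weight_decay_term(w, wd) for name, (w, wd) in layers.items()}
total_penalty = sum(penalties.values())

for name, value in penalties.items():
    print(name, value)
print("total penalty:", total_penalty)
```

Under this assumption, only local3 and local4 are pulled toward smaller weights during training; the other layers are constrained by the data loss alone.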

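【Comments】:

It is strange indeed. You can change it through the wd value if you want, but the example hard-codes it, even though the _variable_with_weight_decay function takes it as a parameter.

Tags: tensorflow conv-neural-network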

【Answer 1】:

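Weight decay does not necessarily improve performance. In my own experience, I have fairly often found that my models perform worse (as measured by some metric on a held-out set) with any significant amount of weight decay. It is a useful form of regularization to be aware of, but you should not add it to every model without considering whether it is needed, or without comparing performance with and without it.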
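
A minimal sketch of that kind of with/without comparison, assuming a toy one-weight linear model and made-up data (neither comes from the cifar10 example). The names fit_slope and mse and the wd values are hypothetical. Which setting wins on the held-out set depends entirely on the data, so the point is the procedure, not the numbers.

```python
def fit_slope(xs, ys, wd):
    """Closed-form minimizer of sum((w*x - y)**2) + wd * w**2 for one weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + wd)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up training and held-out data.
train_x, train_y = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
held_x, held_y = [4.0, 5.0], [4.1, 4.8]

results = {}
for wd in (0.0, 1.0):  # without and with weight decay
    w = fit_slope(train_x, train_y, wd)
    results[wd] = {"w": w, "held_out_mse": mse(w, held_x, held_y)}
    print("wd =", wd, "w =", w, "held-out MSE =", results[wd]["held_out_mse"])
```

The fit with wd > 0 always has the smaller weight here; whether that helps or hurts on held-out data is exactly what has to be measured.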

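As for whether weight decay on only part of the model can be better than weight decay on the whole model: regularizing only some of the weights this way does seem to be less common. I do not know of a theoretical reason for that, however. In general, neural networks already have too many hyperparameters to configure. Whether to use weight decay at all is already one question, and how strongly to regularize the weights if you do is another. If you also ask which layers to regularize this way, you will quickly run out of time to test every combination of turning it on and off per layer.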
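
To put a number on that, here is a small sketch that counts the configurations for the five weight-bearing layers in the question's code. The grid of candidate strengths is a hypothetical choice for illustration.

```python
from itertools import product

# Layer names taken from the question's code.
layers = ["conv1", "conv2", "local3", "local4", "softmax_linear"]

# Hypothetical grid of weight decay strengths to try.
strengths = [0.0004, 0.004, 0.04]

# Every way to switch weight decay on or off per layer.
on_off_configs = list(product([False, True], repeat=len(layers)))

# Every way to give each layer either no decay (None) or one strength.
per_layer_choices = [None] + strengths
full_grid = list(product(per_layer_choices, repeat=len(layers)))

print("on/off configurations:", len(on_off_configs))
print("configurations with a strength grid:", len(full_grid))
```

Each configuration means a full training run, so even this small network is expensive to search exhaustively.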

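I imagine some models would benefit from weight decay on only part of the model; I do not think it is done often, because it is hard to test all the possibilities and find out which one works best.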
