[Question Title]: TensorFlow CNN: Why is the validation loss significantly different from the start, and why does it keep increasing?
[Posted]: 2017-08-30 06:54:42
[Question]:

This is a classification model for ten classes of images. My code is split across three files: the CNN model in convNet.py, read_TFRecord.py for reading the data, and train.py for training and evaluating the model. The training set has 80,000 samples and the validation set has 20,000 samples.

The problem:

In the first epoch:

training loss = 2.11, training accuracy = 25.61%

validation loss = 3.05, validation accuracy = 8.29%

Why is the validation loss so different right from the start? And why does the validation accuracy stay below 10%?

After training for 10 epochs:

The training process keeps learning normally, but the validation loss is slowly increasing and the validation accuracy keeps oscillating around 10%. Is it overfitting? I have already taken some measures, such as adding regularization losses and dropout, but I don't know where the problem is. I hope you can help me.

convNet.py:

import tensorflow as tf
from tensorflow.contrib import learn


def convNet(features, mode):
    input_layer = tf.reshape(features, [-1, 100, 100, 3])
    tf.summary.image('input', input_layer)

    # conv1
    with tf.name_scope('conv1'):
         conv1 = tf.layers.conv2d(
             inputs=input_layer,
             filters=32,
             kernel_size=5,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv1'
         )
         conv1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv1')
         tf.summary.histogram('kernel', conv1_vars[0])
         tf.summary.histogram('bias', conv1_vars[1])
         tf.summary.histogram('act', conv1)

    # pool1  100->50
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2, name='pool1')

    # dropout
    pool1_dropout = tf.layers.dropout(
        inputs=pool1, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool1_dropout')

    # conv2
    with tf.name_scope('conv2'):
         conv2 = tf.layers.conv2d(
             inputs=pool1_dropout,
             filters=64,
             kernel_size=5,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv2'
         )
         conv2_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv2')
         tf.summary.histogram('kernel', conv2_vars[0])
         tf.summary.histogram('bias', conv2_vars[1])
         tf.summary.histogram('act', conv2)

    # pool2  50->25
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2, name='pool2')

    # dropout
    pool2_dropout = tf.layers.dropout(
        inputs=pool2, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool2_dropout')

    # conv3
    with tf.name_scope('conv3'):
         conv3 = tf.layers.conv2d(
             inputs=pool2_dropout,
             filters=128,
             kernel_size=3,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv3'
         )
         conv3_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv3')
         tf.summary.histogram('kernel', conv3_vars[0])
         tf.summary.histogram('bias', conv3_vars[1])
         tf.summary.histogram('act', conv3)

    # pool3  25->12
    pool3 = tf.layers.max_pooling2d(inputs=conv3, pool_size=[2, 2], strides=2, name='pool3')

    # dropout
    pool3_dropout = tf.layers.dropout(
        inputs=pool3, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool3_dropout')

    # conv4
    with tf.name_scope('conv4'):
         conv4 = tf.layers.conv2d(
             inputs=pool3_dropout,
             filters=128,
             kernel_size=3,
             padding="same",
             activation=tf.nn.relu,
             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
             name='conv4'
         )
         conv4_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'conv4')
         tf.summary.histogram('kernel', conv4_vars[0])
         tf.summary.histogram('bias', conv4_vars[1])
         tf.summary.histogram('act', conv4)

    # pool4  12->6
    pool4 = tf.layers.max_pooling2d(inputs=conv4, pool_size=[2, 2], strides=2, name='pool4')

    # dropout
    pool4_dropout = tf.layers.dropout(
        inputs=pool4, rate=0.5, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='pool4_dropout')

    pool4_flat = tf.reshape(pool4_dropout, [-1, 6 * 6 * 128])

    # fc1
    with tf.name_scope('fc1'):
         fc1 = tf.layers.dense(inputs=pool4_flat, units=1024, activation=tf.nn.relu,
                          kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01),
                          name='fc1')
         fc1_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'fc1')
         tf.summary.histogram('kernel', fc1_vars[0])
         tf.summary.histogram('bias', fc1_vars[1])
         tf.summary.histogram('act', fc1)

    # dropout
    fc1_dropout = tf.layers.dropout(
        inputs=fc1, rate=0.3, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='fc1_dropout')

    # fc2
    with tf.name_scope('fc2'):
         fc2 = tf.layers.dense(inputs=fc1_dropout, units=512, activation=tf.nn.relu,
                          kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
                          kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01),
                          name='fc2')
         fc2_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'fc2')
         tf.summary.histogram('kernel', fc2_vars[0])
         tf.summary.histogram('bias', fc2_vars[1])
         tf.summary.histogram('act', fc2)

    # dropout
    fc2_dropout = tf.layers.dropout(
        inputs=fc2, rate=0.3, training=tf.equal(mode, learn.ModeKeys.TRAIN), name='fc2_dropout')

    # logits
    with tf.name_scope('out'):
         logits = tf.layers.dense(inputs=fc2_dropout, units=10, activation=None,
                             kernel_initializer=tf.truncated_normal_initializer(stddev=0.01),
                             kernel_regularizer=tf.contrib.layers.l2_regularizer(0.01),
                             name='out')
         out_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 'out')
         tf.summary.histogram('kernel', out_vars[0])
         tf.summary.histogram('bias', out_vars[1])
         tf.summary.histogram('act', logits)

    return logits

read_TFRecord.py:

import tensorflow as tf


def read_and_decode(filename, width, height, channel):
    filename_queue = tf.train.string_input_producer([filename])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(serialized_example,
                                       features={
                                           'label': tf.FixedLenFeature([], tf.int64),
                                           'img_raw': tf.FixedLenFeature([], tf.string),
                                       })
    img = tf.decode_raw(features['img_raw'], tf.uint8)
    img = tf.reshape(img, [width, height, channel])
    img = tf.cast(img, tf.float16) * (1. / 255) - 0.5
    label = tf.cast(features['label'], tf.int16)
    return img, label
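The final cast above maps uint8 pixel values from [0, 255] into [-0.5, 0.5]. A plain-Python equivalent of that per-pixel transform, for checking the range (the helper name is mine, not TF code):

```python
def normalize_pixel(v):
    """Mimic `img * (1. / 255) - 0.5` from read_and_decode for one uint8 value."""
    return v * (1.0 / 255.0) - 0.5

print(normalize_pixel(0))    # -0.5
print(normalize_pixel(255))  # 0.5
```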

train.py:

import tensorflow as tf
from tensorflow.contrib import learn

from convNet import convNet
from read_TFRecord import read_and_decode

# step 1
TRAIN_TFRECORD = 'F:/10-image-set2/train.tfrecords'  # train data set
VAL_TFRECORD = 'F:/10-image-set2/val.tfrecords'  # validation data set
WIDTH = 100  # image width
HEIGHT = 100  # image height
CHANNEL = 3  # image channel
TRAIN_BATCH_SIZE = 64
VAL_BATCH_SIZE = 16
train_img, train_label = read_and_decode(TRAIN_TFRECORD, WIDTH, HEIGHT, 
                         CHANNEL)
val_img, val_label = read_and_decode(VAL_TFRECORD, WIDTH, HEIGHT, CHANNEL)
x_train_batch, y_train_batch = tf.train.shuffle_batch(
    [train_img, train_label], batch_size=TRAIN_BATCH_SIZE,
    capacity=80000, min_after_dequeue=79999,
    num_threads=64, name='train_shuffle_batch')
x_val_batch, y_val_batch = tf.train.shuffle_batch(
    [val_img, val_label], batch_size=VAL_BATCH_SIZE,
    capacity=20000, min_after_dequeue=19999,
    num_threads=64, name='val_shuffle_batch')

# step 2
x = tf.placeholder(tf.float32, shape=[None, WIDTH, HEIGHT, CHANNEL], 
                   name='x')
y_ = tf.placeholder(tf.int32, shape=[None, ], name='y_')
mode = tf.placeholder(tf.string, name='mode')
step = tf.get_variable(shape=(), dtype=tf.int32,
                       initializer=tf.zeros_initializer(), name='step')
tf.add_to_collection(tf.GraphKeys.GLOBAL_STEP, step)
logits = convNet(x, mode) 
with tf.name_scope('Reg_losses'):
     reg_losses = tf.cond(tf.equal(mode, learn.ModeKeys.TRAIN),
                     lambda: tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)),
                     lambda: tf.constant(0, dtype=tf.float32))
with tf.name_scope('Loss'):
     loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=logits) + reg_losses
train_op = tf.train.AdamOptimizer().minimize(loss, step)
correct_prediction = tf.equal(tf.cast(tf.argmax(logits, 1), tf.int32), y_)
with tf.name_scope('Accuracy'):
     acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# step 3
tf.summary.scalar("reg_losses", reg_losses)
tf.summary.scalar("loss", loss)
tf.summary.scalar("accuracy", acc)
merged = tf.summary.merge_all()

# step 4
with tf.Session() as sess:
     summary_dir = './logs/summary/'

     sess.run(tf.global_variables_initializer())
     saver = tf.train.Saver(max_to_keep=1)

     train_writer = tf.summary.FileWriter(summary_dir + 'train',
                                     sess.graph)
     valid_writer = tf.summary.FileWriter(summary_dir + 'valid')

     coord = tf.train.Coordinator()
     threads = tf.train.start_queue_runners(sess=sess, coord=coord) 
     max_acc = 0
     MAX_EPOCH = 10
     for epoch in range(MAX_EPOCH):
         # training
         train_step = int(80000 / TRAIN_BATCH_SIZE)
         train_loss, train_acc = 0, 0
         for step in range(epoch * train_step, (epoch + 1) * train_step):
             x_train, y_train = sess.run([x_train_batch, y_train_batch])
             train_summary, _, err, ac = sess.run([merged, train_op, loss, acc],
                                                  feed_dict={x: x_train, y_: y_train,
                                                             mode: learn.ModeKeys.TRAIN,
                                                             global_step: step})
             train_loss += err
             train_acc += ac
             if (step + 1) % 50 == 0:
                 train_writer.add_summary(train_summary, step)
         print("Epoch %d, train loss = %.2f, train accuracy = %.2f%%" % (
             epoch, (train_loss / train_step), (train_acc / train_step * 100.0)))

         # validation
         val_step = int(20000 / VAL_BATCH_SIZE)
         val_loss, val_acc = 0, 0
         for step in range(epoch * val_step, (epoch + 1) * val_step):
             x_val, y_val = sess.run([x_val_batch, y_val_batch])
             val_summary, err, ac = sess.run([merged, loss, acc],
                                             feed_dict={x: x_val, y_: y_val,
                                                        mode: learn.ModeKeys.EVAL,
                                                        global_step: step})
             val_loss += err
             val_acc += ac
             if (step + 1) % 50 == 0:
                 valid_writer.add_summary(val_summary, step)
         print("Epoch %d, validation loss = %.2f, validation accuracy = %.2f%%" % (
             epoch, (val_loss / val_step), (val_acc / val_step * 100.0)))

         # save model
         if val_acc > max_acc:
             max_acc = val_acc
             saver.save(sess, summary_dir + '/10-image.ckpt', epoch)
             print("model saved")
     coord.request_stop()
     coord.join(threads)

TensorBoard results:

(Orange is training. Blue is validation.)

accuracy-loss-reg_losses-conv1-conv2-conv3-conv4-fc1-fc2-output

My data:

train-val

[Comments]:

  • I have reorganized the code with images; all of my code and screenshots of my dataset are shown above. I have searched online for a long time, but nothing worked. Please help, or try to offer some ideas on how to solve this. Thanks in advance.

Tags: tensorflow deep-learning conv-neural-network


[Solution 1]:

I doubt this is an overfitting problem: the losses differ significantly right from the start, and they diverge further well before you get through your first epoch (~500 batches). Without seeing your dataset it's difficult to say more, but as a first step I'd encourage you to visualize both the training and evaluation input data to make sure there is no problem there. The fact that you score noticeably below 10% on a 10-class classification problem suggests you almost certainly do have a problem here.
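For context on those numbers: random guessing over 10 classes gives 10% accuracy and a cross-entropy of -ln(1/10) ≈ 2.30. A quick plain-Python check (not part of the original code):

```python
import math

num_classes = 10

# A uniformly random classifier is correct 1/num_classes of the time.
chance_accuracy = 1.0 / num_classes            # 0.10 -> 10%

# Cross-entropy when the model assigns probability 1/10 to the true class.
chance_loss = -math.log(1.0 / num_classes)     # ~2.30

print(chance_accuracy, round(chance_loss, 2))
```

The reported validation loss (3.05) sits above this baseline and the validation accuracy (8.29%) below it, i.e. worse than chance, which points at a problem on the validation path rather than ordinary overfitting.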

That said, you may well run into overfitting with this model, because, despite what you might think, you are not actually using dropout or regularization.

Dropout: if mode is a tensor, the Python expression mode == learn.ModeKeys.TRAIN evaluates to False, so you never apply dropout. You could use tf.equal(mode, learn.ModeKeys.TRAIN), but I think you would be better off passing a training bool tensor into your convNet and feeding in the appropriate value.
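To make the train/eval distinction concrete, here is a minimal plain-Python sketch of inverted dropout, the scheme tf.layers.dropout uses (function name and values are illustrative, not from the original code):

```python
import random

def dropout(xs, rate, training):
    """Inverted dropout: at train time, zero each unit with probability
    `rate` and scale survivors by 1/(1-rate) so the expected value is
    unchanged; at eval time, pass the input through untouched."""
    if not training:
        return list(xs)
    keep = 1.0 - rate
    return [x / keep if random.random() < keep else 0.0 for x in xs]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, training=False))  # unchanged
print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, training=True))   # some zeros, rest doubled
```

If `training` ends up as a plain Python False (which is what a tensor-vs-string `==` comparison gives you), you always take the eval branch and no unit is ever dropped.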

Regularization: you are creating the regularization loss terms and adding them to the tf.GraphKeys.REGULARIZATION_LOSSES collection, but the loss you are minimizing does not use them. Add the following:

loss += tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))

before optimizing.
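For a sense of scale: as far as I recall its use of tf.nn.l2_loss, tf.contrib.layers.l2_regularizer(scale) contributes scale * sum(w**2) / 2 per weight tensor. A plain-Python sketch with made-up weights:

```python
def l2_penalty(weights, scale):
    """scale * sum(w**2) / 2 -- the per-tensor term l2_regularizer adds to
    tf.GraphKeys.REGULARIZATION_LOSSES (the factor 1/2 comes from tf.nn.l2_loss)."""
    return scale * sum(w * w for w in weights) / 2.0

data_loss = 2.11                     # e.g. a cross-entropy value
weights = [0.3, -0.4, 0.5]           # made-up weights
total_loss = data_loss + l2_penalty(weights, scale=0.01)
print(l2_penalty(weights, 0.01))     # 0.0025
```

With scale=0.01 and small weights the penalty is tiny next to the data loss, so forgetting to add it mostly means the optimizer never shrinks the weights, not that the reported loss jumps.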

A note on the optimization step: you should not feed it a value in the session run the way you currently do. Every time the optimization op runs, it updates the variable passed as step when it was created, so just create it as an int variable and leave it alone. See the following example code:

import tensorflow as tf

x = tf.get_variable(shape=(4, 3), dtype=tf.float32,
                    initializer=tf.random_normal_initializer(), name='x')
loss = tf.nn.l2_loss(x)
step = tf.get_variable(shape=(), dtype=tf.int32,
                       initializer=tf.zeros_initializer(), name='step')
tf.add_to_collection(tf.GraphKeys.GLOBAL_STEP, step)  # good practice

opt = tf.train.AdamOptimizer().minimize(loss, step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    step_val = sess.run(step)
    print(step_val)  # 0
    sess.run(opt)
    step_val = sess.run(step)
    print(step_val)  # 1

[Discussion]:

  • Sincere thanks for your answer. You were right about the dropout and regularization, but after I fixed both I got similar results. Here, to make train_step and val_step equal, I set TRAIN_BATCH_SIZE to 160 and VAL_BATCH_SIZE to 40. Could that be the reason? If train_step and val_step are not equal, TensorBoard shows two curves whose x-axes cannot be compared.
  • Accuracy is an average, so it shouldn't be affected; your loss will differ, though. You are running into problems because you feed an artificial step value from the for loop - you should let the optimization op handle the step update. See the updated answer.
  • I'm really sorry - my computer's GPU burned out and was being repaired. Thank you very much for your patient answers; you have taught me a lot. I changed the code again as you said but still get the same results. Could you help me once more? Could the problem be in the third-to-last line of read_TFRecord.py: img = tf.cast(img, tf.float16) * (1. / 255) - 0.5? Here I do a normalization-like step while reading the data, but I apply the same method when reading the validation set.
  • You said you suspect a problem with my data. How can I show you my data with TensorBoard - do you mean the IMAGES tab?
  • That's one way. You would also have to add the summary ops to the graph and write them to file in your training loop. As a simpler first step, I would just run the input tensor ops in a session and visualize the resulting numpy arrays with matplotlib.pyplot.imshow.