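【Question title】: Tensorflow Queues - Switching between train and validation data
【Posted】: 2017-04-30 23:37:58
【Question】:

I am trying to use queues to load data from files in Tensorflow.

I would like to run the graph with validation data at the end of each epoch to get a better feel for how training is going.

That is where I am running into problems. I cannot figure out how to switch between training data and validation data when using queues.

I have stripped my code down to a bare-minimum toy example to make it easier to get help. Instead of including all the code that loads the image files and performs inference and training, I have cut it off at the point where the filenames are loaded into the queue.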

import tensorflow as tf

#  DATA
train_items = ["train_file_{}".format(i) for i in range(6)]
valid_items = ["valid_file_{}".format(i) for i in range(3)]

# SETTINGS
batch_size = 3
batches_per_epoch = 2
epochs = 2

# CREATE GRAPH
graph = tf.Graph()
with graph.as_default():
    file_list = tf.placeholder(dtype=tf.string, shape=None)
    
    # Create a queue consisting of the strings in `file_list`
    q = tf.train.string_input_producer(train_items, shuffle=False, num_epochs=None)
    
    # Create batch of items.
    x = q.dequeue_many(batch_size)
    
    # Inference, train op, and accuracy calculation after this point
    # ...


# RUN SESSION
with tf.Session(graph=graph) as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    
    # Start populating the queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    
    try:
        for epoch in range(epochs):
            print("-"*60)
            for step in range(batches_per_epoch):
                if coord.should_stop():
                    break
                train_batch = sess.run(x, feed_dict={file_list: train_items})
                print("TRAIN_BATCH: {}".format(train_batch))
    
            valid_batch = sess.run(x, feed_dict={file_list: valid_items})
            print("\nVALID_BATCH : {} \n".format(valid_batch))
    
    except Exception as e:
        coord.request_stop(e)
    finally:
        coord.request_stop()
        coord.join(threads)

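Variations and experiments

Trying different values for `num_epochs`

num_epochs=None

If I set the num_epochs argument of tf.train.string_input_producer() to None, it gives the following output, which shows that it runs two epochs as intended, but it uses data from the training set when running the evaluation.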

------------------------------------------------------------
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

VALID_BATCH : ['train_file_0' 'train_file_1' 'train_file_2']

------------------------------------------------------------
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']

VALID_BATCH : ['train_file_3' 'train_file_4' 'train_file_5']

num_epochs=2

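If I set the num_epochs argument of tf.train.string_input_producer() to 2, it gives the following output, which shows that it does not even run the full two batches of the second epoch (and the evaluation still uses training data).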

------------------------------------------------------------
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

VALID_BATCH : ['train_file_0' 'train_file_1' 'train_file_2']

------------------------------------------------------------
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']

num_epochs=1

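If I set the num_epochs argument of tf.train.string_input_producer() to 1, in the hope that it would flush any remaining training data out of the queue so that the validation data could be used, I get the following output, which shows that it terminates as soon as it has gone through one epoch of training data and never gets to loading the evaluation data.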

------------------------------------------------------------
TRAIN_BATCH: ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN_BATCH: ['train_file_3' 'train_file_4' 'train_file_5']
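
The outputs for num_epochs=2 and num_epochs=1 are consistent with simple batch arithmetic. The sketch below is plain Python, not TensorFlow. It assumes that the queue serves exactly num_epochs passes over the six training file names and that every sess.run(x) call, whether meant for training or validation, consumes one batch. The function names are hypothetical.

```python
def batches_available(num_items, num_epochs, batch_size):
    """Full batches a bounded filename queue can serve before it runs dry."""
    return (num_items * num_epochs) // batch_size


def batches_requested(epochs, batches_per_epoch):
    """Dequeues issued by the loop: training batches plus one validation batch per epoch."""
    return epochs * (batches_per_epoch + 1)


# Values from the toy example: 6 training files, batch_size=3,
# 2 epochs with 2 training batches each.
requested = batches_requested(epochs=2, batches_per_epoch=2)
for num_epochs in (1, 2):
    available = batches_available(num_items=6, num_epochs=num_epochs, batch_size=3)
    print("num_epochs={}: {} of {} requested batches can be served".format(
        num_epochs, available, requested))
```

Under these assumptions num_epochs=1 allows two dequeues and num_epochs=2 allows four, which matches the number of batches printed in each case. The next dequeue finds the queue exhausted, and the loop presumably ends in the except branch.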

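Setting the `capacity` argument to various values

I have also tried setting the capacity argument of tf.train.string_input_producer() to small values, such as 3 and 1, but these had no effect on the results.

What other approach should I take?

What other approach could I take to switch between training and validation data? Would I have to create separate queues? I am at a loss as to how I would get that to work. Would I also have to create additional coordinators and queue runners?

【Comments】:

Isn't your queue always created with the training list? "q = tf.train.string_input_producer(train_items, shuffle=False, num_epochs=None)"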
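
The point raised in this comment can be illustrated without TensorFlow. The sketch below is a plain-Python analogy in which ToyFilenameQueue and run_x are hypothetical stand-ins: the queue is built from train_items only, and the value "fed" for file_list is never read, just as the file_list placeholder in the question is not connected to the queue.

```python
from itertools import cycle, islice

train_items = ["train_file_{}".format(i) for i in range(6)]
valid_items = ["valid_file_{}".format(i) for i in range(3)]


class ToyFilenameQueue(object):
    """Hypothetical stand-in for a filename queue that repeats its items forever."""

    def __init__(self, items):
        self._source = cycle(items)

    def dequeue_many(self, n):
        # Take the next n items; the position is remembered between calls.
        return list(islice(self._source, n))


# As in the question, the queue is built from train_items only.
q = ToyFilenameQueue(train_items)


def run_x(file_list):
    # file_list plays the role of the fed placeholder. Nothing reads it,
    # so the returned batch depends on the queue alone.
    return q.dequeue_many(3)


batches = [run_x(train_items), run_x(train_items), run_x(valid_items)]
for batch in batches:
    print(batch)
```

The third call is "fed" valid_items but still returns train_file_0 to train_file_2, which mirrors the first VALID_BATCH line in the question's output.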

Tags: python queue tensorflow

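【Solution 1】:

I am compiling a list of potential approaches that might solve this problem. Most of them are only vague suggestions, with no actual code examples showing how to use them.

Placeholder with default

Suggested here

Using tf.cond()

Suggested here

Also suggested by sygi on this very stackoverflow thread. link

Using tf.group() and tf.cond()

Suggested here

make_template() method

Suggested here and here

Shared weights method

Suggested by sygi on this very stackoverflow thread (link). This may be the same as the make_template() method.

QueueBase() method.

Suggested here, with sample code here. The code adapted to my problem is on this thread. link

Training bucket method

Suggested here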

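【Comments】:

OP, did you find the best solution? I have been stuck on this problem for the past few days.
Here is another approach, which uses dequeue inside a tf.cond statement: groups.google.com/a/tensorflow.org/d/msg/discuss/mLrt5qc9_uU/… Not sure whether it actually works.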

【Solution 2】:

First of all, you can read the examples manually in your code (into numpy arrays) and pass them in any way you want:

# DATA_SHAPE, num_epochs, train_op, eval_op and the two read_* helpers
# stand for your own code; they are not defined here.
data = tf.placeholder(tf.float32, [None, DATA_SHAPE])
for _ in range(num_epochs):
  some_training = read_some_data()
  sess.run(train_op, feed_dict={data: some_training})
  some_testing = read_some_test_data()
  sess.run(eval_op, feed_dict={data: some_testing})
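
The helpers read_some_data() and read_some_test_data() are left undefined in this answer. As a rough sketch of how the surrounding loop could be organised, here is a plain-Python version with no TensorFlow. The names iter_batches, run_epochs, train_step and eval_step are hypothetical. In real code the two callbacks would presumably wrap sess.run(train_op, feed_dict={data: batch}) and sess.run(eval_op, feed_dict={data: batch}).

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_epochs(train_items, valid_items, batch_size, epochs, train_step, eval_step):
    """Alternate a full pass over the training data with a full pass over the validation data."""
    for _ in range(epochs):
        for batch in iter_batches(train_items, batch_size):
            train_step(batch)
        for batch in iter_batches(valid_items, batch_size):
            eval_step(batch)


# Record which batches each callback receives instead of running a model.
log = []
run_epochs(
    train_items=["train_file_{}".format(i) for i in range(6)],
    valid_items=["valid_file_{}".format(i) for i in range(3)],
    batch_size=3,
    epochs=2,
    train_step=lambda batch: log.append(("train", batch)),
    eval_step=lambda batch: log.append(("valid", batch)),
)
for kind, batch in log:
    print(kind, batch)
```

Because the Python loop decides which data goes into each call, the training and validation batches cannot be mixed up. The cost is that loading and preprocessing happen outside the graph.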

If you need to use queues, you can try to conditionally switch the queue from the "training" one to the "testing" one:

# some_reader() and model() stand for your own code.
train_filenames = tf.train.string_input_producer(["training_file"])
train_q = some_reader(train_filenames)
test_filenames = tf.train.string_input_producer(["testing_file"])
test_q = some_reader(test_filenames)

# Boolean scalar that selects which input feeds the model.
am_testing = tf.placeholder(dtype=bool,shape=())
data = tf.cond(am_testing, lambda:test_q, lambda:train_q)
train_op, accuracy = model(data)

for _ in range(num_epochs):
  sess.run(train_op, feed_dict={am_testing: False})
  sess.run(accuracy, feed_dict={am_testing: True})

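However, the second approach is considered unsafe. The linked post encourages building two separate graphs for training and testing (with shared weights), which is another way to achieve what you want.

【Comments】:

Thanks sygi. Yes, I would prefer to move away from placeholders for my current project. I am dealing with image files of all shapes and sizes, so I cannot easily load them into numpy arrays; I have to preprocess and resize them. Using queues to prefetch the images and using tensorflow's image preprocessing functions makes this more manageable.
For some reason I could not get the tf.cond() approach to work for me, though I am sure it is a silly mistake in my code. I will definitely look into the shared-weights approach, although changing the rest of my code to use shared weights properly might open a whole new can of worms that I am not prepared to deal with at the moment. For now I have a solution that uses QueueBase.from_list(), although I suspect your shared-weights suggestion may be the better solution.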

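【Solution 3】:

OK, so I have a solution that works for me. It is based on code taken from this post in the issues section of the tensorflow github. It makes use of the QueueBase.from_list() function. It feels very hacky and I am not entirely happy with it, but at least I got it working.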

import tensorflow as tf

# DATA
train_items = ["train_file_{}".format(i) for i in range(6)]
valid_items = ["valid_file_{}".format(i) for i in range(3)]

# SETTINGS
batch_size = 3
batches_per_epoch = 2
epochs = 2

# ------------------------------------------------
#                                            GRAPH
# ------------------------------------------------
graph = tf.Graph()
with graph.as_default():
    # TRAIN QUEUE
    train_q = tf.train.string_input_producer(train_items, shuffle=False)

    # VALID/TEST QUEUE
    test_q = tf.train.string_input_producer(valid_items, shuffle=False)

    # SELECT QUEUE
    is_training = tf.placeholder(tf.bool, shape=None, name="is_training")
    q_selector = tf.cond(is_training,
                         lambda: tf.constant(0),
                         lambda: tf.constant(1))

    # select_q = tf.placeholder(tf.int32, [])
    q = tf.QueueBase.from_list(q_selector, [train_q, test_q])

    # # Create batch of items.
    data = q.dequeue_many(batch_size)


# ------------------------------------------------
#                                          SESSION
# ------------------------------------------------
with tf.Session(graph=graph) as sess:
    # Initialize variables
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())

    # Start populating the queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)


    try:
        for epoch in range(epochs):
            print("-" * 60)
            # TRAIN
            for step in range(batches_per_epoch):
                if coord.should_stop():
                    break
                print("TRAIN.dequeue = " + str(sess.run(data, {is_training: True})))

            # VALIDATION
            print("\nVALID.dequeue = " + str(sess.run(data, {is_training: False})))

    except Exception as e:
        coord.request_stop(e)

    finally:
        coord.request_stop()
        coord.join(threads)

This gives the following output, which is what I expected.

------------------------------------------------------------
TRAIN.dequeue = ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN.dequeue = ['train_file_3' 'train_file_4' 'train_file_5']

VALID.dequeue = ['valid_file_0' 'valid_file_1' 'valid_file_2']
------------------------------------------------------------
TRAIN.dequeue = ['train_file_0' 'train_file_1' 'train_file_2']
TRAIN.dequeue = ['train_file_3' 'train_file_4' 'train_file_5']

VALID.dequeue = ['valid_file_0' 'valid_file_1' 'valid_file_2']

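I will leave this thread open in the hope that a better solution comes along.

【Comments】:

Do you know of a better way to handle this? It has been a year since you posted this and I still cannot find a decent way to do it...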

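【Solution 4】:

Creating two different queues is discouraged.

If you have two different machines, I recommend using separate machines for training and validation (if not, you can use two different processes). For the two-machine case:

The first machine has only the training data. It uses queues to pass the data in batches to the graph model and has a GPU for training. After each step, it saves the new model (model_iteration) somewhere the second machine can access.
The second machine (which has only the validation data) periodically polls the location where the models are stored and checks whether a new model is available. If so, it runs inference with the new model and checks its performance. Since validation data is in most cases much smaller than training data, you can even keep all of it in memory.

This approach has a few advantages. The training and validation data are kept separate, so you cannot mix them up. You can use a weaker machine for validation, because even if validation falls behind training (an unlikely scenario), that is not a problem, since the two are independent.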
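
The polling step on the validation side could look roughly like the sketch below. It uses only the Python standard library, and everything in it is an assumption made for illustration: the training process is taken to write one file per saved model into a shared directory, with zero-padded names such as step_0001.model, and evaluate is a hypothetical callback that loads a model and returns a validation metric.

```python
import os
import time


def find_new_models(model_dir, seen, suffix=".model"):
    """Return model file names in model_dir that are not in `seen`, sorted by name."""
    names = sorted(n for n in os.listdir(model_dir) if n.endswith(suffix))
    return [n for n in names if n not in seen]


def poll_and_validate(model_dir, evaluate, max_polls, poll_seconds=1.0):
    """Poll model_dir max_polls times and evaluate each new model exactly once."""
    seen = set()
    results = {}
    for _ in range(max_polls):
        for name in find_new_models(model_dir, seen):
            # evaluate() is supplied by the caller (hypothetical): it loads
            # the model at the given path and returns a validation metric.
            results[name] = evaluate(os.path.join(model_dir, name))
            seen.add(name)
        time.sleep(poll_seconds)
    return results
```

One caveat with this kind of polling: the trainer should write each model under a temporary name and rename it once it is complete, so that the validator never picks up a half-written file.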

【Comments】:

Is there no better way to run CV? I do not like running it on a separate machine/session... Why is such a simple thing so complicated?!
