【Question Title】: Tensorflow -- Iterating over training and validation sequentially
【Posted】: 2019-06-05 18:41:59
【Question Description】:

I have been using TensorFlow's Dataset API to feed different datasets into an RNN model with ease.

After going through more than a few blog posts and the documentation on the TensorFlow site, I got everything working. My working example does the following:

--- Train for X epochs on the training dataset -> validate on the validation dataset once all the training has finished.

However, I am unable to develop the following example:

--- Train for X epochs on the training dataset -> validate the model being trained on the validation dataset at every epoch (a bit like Keras does).

The problematic code is the following:

train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()

val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE_VAL, drop_remainder=True).repeat()

itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)

When I create the iterator from_structure, I need to specify an output shape. Obviously, the training dataset and the validation dataset do not have the same output shape, since they have different batch sizes. However, the validation_init_op throws the following error, which seems counterintuitive given that validation sets have always had a different batch size:

TypeError: Expected output shapes compatible with (TensorShape([Dimension(256), Dimension(11), Dimension(74)]), TensorShape([Dimension(256), Dimension(3)])) but got dataset with output shapes (TensorShape([Dimension(28), Dimension(11), Dimension(74)]), TensorShape([Dimension(28), Dimension(3)])).

I would like to use this second approach to evaluate my model and watch the usual training and validation curves develop together, to see where I could improve it (early stopping, etc.). With the first, simpler approach, I don't get any of that.

So, the questions are: Am I doing something wrong? Does my second approach have to be tackled differently? I could think of creating two iterators, but I don't know whether that is the right approach. Also, this answer by @MatthewScarpino points to a feedable iterator, because switching between reinitializable ones makes them start over; however, the error above does not concern that part of the code -- perhaps the reinitializable iterator is not intended to take a different batch size for the validation set and to iterate over it just once after training, without setting it in .batch()?

Any help is much appreciated.

Full code for reference:

N_TIMESTEPS_X = xt.shape[0] ## The stack number
BATCH_SIZE = 256
#N_OBSERVATIONS = xt.shape[1]
N_FEATURES = xt.shape[2]
N_OUTPUTS = yt.shape[1]
N_NEURONS_LSTM = 128 ## Number of units in the LSTMCell 
N_EPOCHS = 350
LEARNING_RATE = 0.001

### Define the placeholders and gather the data.
xt = xt.transpose([1,0,2])
xval = xval.transpose([1,0,2])

train_data = (xt, yt)
validation_data = (xval, yval)

N_BATCHES = train_data[0].shape[0] // BATCH_SIZE
print('The number of batches is: {}'.format(N_BATCHES))
BATCH_SIZE_VAL = validation_data[0].shape[0] // N_BATCHES
print('The validation batch size is: {}'.format(BATCH_SIZE_VAL))

## We define the placeholders as a trick so that we do not run into memory problems associated with feeding the data directly.
'''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
batch_size = tf.placeholder(tf.int64)
x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')

# Creating the two different dataset objects.
train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE_VAL, drop_remainder=True).repeat()

# Creating the Iterator type that permits to switch between datasets.
itr = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
train_init_op = itr.make_initializer(train_dataset)
validation_init_op = itr.make_initializer(val_dataset)

next_features, next_labels = itr.get_next()

【Question Comments】:

  • Could you explain why the feedable iterator doesn't work, since it looks like the solution? Right now it is hard to say what the problem is without seeing more code, in particular the training and validation code that uses these datasets throughout.
  • I think using 2 iterators might be best in this case, since it gives you the flexibility to define the data in whatever shape suits your memory requirements.
  • If you define 2 iterators, you can make both of them initializable and run their initializers before using them. That still works fine. The complication would arise if you checkpoint somewhere during training and have to restore from there; in that case you would need to save the iterator state to resume where you left off. Otherwise, you can just initialize the iterators again and again before using them, right?
  • Regarding the batch-size requirement, you would be comfortable with either size for training and validation. That is because you are only evaluating loss and accuracy, which do not differ with the batch size. This lets you use either a reinitializable or a feedable iterator without worrying about how to handle the graph.
  • Regarding shapes, the shapes in a dataset include the first dimension, i.e. the batch size. So in this case the shapes themselves differ, because the number of elements per batch differs. As you said, the other dimensions have the same shape.

Tags: python-3.x tensorflow


【Solution 1】:

After researching the best way to do this, I came across this final implementation, which works well for me. It is surely not the best one. To keep the state, I used a feedable iterator.

Aim: use this code when you want to train and validate at the same time while preserving the state of each iterator (i.e. validating with the most recent model parameters). Besides that, the code also saves the model and other things, like some information about the hyperparameters and the summaries to visualize the training and validation in TensorBoard.

Also, don't get confused: you don't need to have a different batch size for the training set and the validation set. This was a misconception of mine. The batch sizes must be the same, and you have to deal with the different numbers of batches, simply passing when no more batches are left. This is a requirement so that you can create the iterator, given that both datasets have the same data type and shape.

Hope it helps others. Just ignore the code that is not relevant for your purposes. Many thanks to @kvish for all the help and time.

Code:

import os
from datetime import datetime

import numpy as np
import tensorflow as tf  # TensorFlow 1.x
from sklearn.metrics import confusion_matrix

def RNNmodelTF(xt, yt, xval, yval, xtest, ytest):

    N_TIMESTEPS_X = xt.shape[0] ## The stack number
    BATCH_SIZE = 256
    #N_OBSERVATIONS = xt.shape[1]
    N_FEATURES = xt.shape[2]
    N_OUTPUTS = yt.shape[1]
    N_NEURONS_LSTM = 128 ## Number of units in the LSTMCell
    N_EPOCHS = 350
    LEARNING_RATE = 0.001

    ### Define the placeholders and gather the data.
    xt = xt.transpose([1,0,2])
    xval = xval.transpose([1,0,2])

    train_data = (xt, yt)
    validation_data = (xval, yval)

    N_BATCHES = train_data[0].shape[0] // BATCH_SIZE

    ## We define the placeholders as a trick so that we do not run into memory problems associated with feeding the data directly.
    '''As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an Iterator over the dataset.'''
    batch_size = tf.placeholder(tf.int64)
    x = tf.placeholder(tf.float32, shape=[None, N_TIMESTEPS_X, N_FEATURES], name='XPlaceholder')
    y = tf.placeholder(tf.float32, shape=[None, N_OUTPUTS], name='YPlaceholder')

    # Creating the two different dataset objects.
    train_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()
    val_dataset = tf.data.Dataset.from_tensor_slices((x,y)).batch(BATCH_SIZE, drop_remainder=True).repeat()

    #################### Creating the Iterator type that permits to switch between datasets.

    handle = tf.placeholder(tf.string, shape = [])
    iterator = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
    next_features, next_labels = iterator.get_next()

    train_val_iterator = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
    train_iterator = train_val_iterator.make_initializer(train_dataset)
    val_iterator = train_val_iterator.make_initializer(val_dataset)

    ###########################

    ### Create the graph
    cellType = tf.nn.rnn_cell.LSTMCell(num_units=N_NEURONS_LSTM, name='LSTMCell')
    inputs = tf.unstack(next_features, axis=1)
    '''inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size]'''
    RNNOutputs, _ = tf.nn.static_rnn(cell=cellType, inputs=inputs, dtype=tf.float32)
    out_weights = tf.get_variable("out_weights", shape=[N_NEURONS_LSTM, N_OUTPUTS], dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer())
    out_bias = tf.get_variable("out_bias", shape=[N_OUTPUTS], dtype=tf.float32, initializer=tf.zeros_initializer())
    predictionsLayer = tf.matmul(RNNOutputs[-1], out_weights) + out_bias

    ### Define the cost function, that will be optimized by the optimizer.
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=predictionsLayer, labels=next_labels, name='Softmax_plus_Cross_Entropy'))
    optimizer_type = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, name='AdamOptimizer')
    optimizer = optimizer_type.minimize(cost)

    ### Model evaluation
    correctPrediction = tf.equal(tf.argmax(predictionsLayer,1), tf.argmax(next_labels,1))
    accuracy = tf.reduce_mean(tf.cast(correctPrediction,tf.float32))

    confusionMatrix1 = tf.confusion_matrix(tf.argmax(next_labels,1), tf.argmax(predictionsLayer,1), num_classes=3, name='ConfMatrix')

    ## Saving variables so that we can restore them afterwards.
    saver = tf.train.Saver()
    save_dir = '/media/SecondDiskHDD/8classModels/DLmodels/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
    #save_dir = '/home/Desktop/tfModels/{}_{}'.format(cellType.__class__.__name__, datetime.now().strftime("%Y%m%d%H%M%S"))
    os.mkdir(save_dir)
    varDict = {'nTimeSteps': N_TIMESTEPS_X, 'BatchSize': BATCH_SIZE, 'nFeatures': N_FEATURES,
               'nNeuronsLSTM': N_NEURONS_LSTM, 'nEpochs': N_EPOCHS,
               'learningRate': LEARNING_RATE, 'optimizerType': optimizer_type.__class__.__name__}
    varDicSavingTxt = save_dir + '/varDict.txt'
    modelFilesDir = save_dir + '/modelFiles'
    os.mkdir(modelFilesDir)

    logDir = save_dir + '/TBoardLogs'
    os.mkdir(logDir)

    acc_summary = tf.summary.scalar('Accuracy', accuracy)
    loss_summary = tf.summary.scalar('Cost_CrossEntropy', cost)
    summary_merged = tf.summary.merge_all()

    with open(varDicSavingTxt, 'w') as outfile:
        outfile.write(repr(varDict))

    with tf.Session() as sess:

        tf.set_random_seed(2)
        sess.run(tf.global_variables_initializer())
        train_writer = tf.summary.FileWriter(logDir + '/train', sess.graph)
        validation_writer = tf.summary.FileWriter(logDir + '/validation')

        # initialise iterator with data
        train_val_string = sess.run(train_val_iterator.string_handle())

        cm1Total = None
        cm2Total = None

        print('¡Training starts!')
        for epoch in range(N_EPOCHS):

            batchAccList = []
            batchAccListVal = []
            tot_loss_train = 0
            tot_loss_validation = 0

            for batch in range(N_BATCHES):

                sess.run(train_iterator, feed_dict = {x : train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
                optimizer_output, loss_value, summary, accBatch, cm1 = sess.run([optimizer, cost, summary_merged, accuracy, confusionMatrix1], feed_dict = {handle: train_val_string})

                npArrayPred = predictionsLayer.eval(feed_dict= {handle: train_val_string})
                predLabEnc = np.apply_along_axis(thresholdSet, 1, npArrayPred, value=0.5)

                npArrayLab = next_labels.eval(feed_dict= {handle: train_val_string})
                labLabEnc = np.argmax(npArrayLab, 1)

                cm2 = confusion_matrix(labLabEnc, predLabEnc)
                tot_loss_train += loss_value
                batchAccList.append(accBatch)

                try:
                    sess.run(val_iterator, feed_dict = {x: validation_data[0], y: validation_data[1], batch_size: BATCH_SIZE})
                    valLoss, valAcc, summary_val = sess.run([cost, accuracy, summary_merged], feed_dict = {handle: train_val_string})
                    tot_loss_validation += valLoss
                    batchAccListVal.append(valAcc)

                except tf.errors.OutOfRangeError:
                    pass

                if cm1Total is None and cm2Total is None:

                    cm1Total = cm1
                    cm2Total = cm2
                else:

                    cm1Total += cm1
                    cm2Total += cm2

                if batch % 10 == 0:

                    train_writer.add_summary(summary, batch)
                    validation_writer.add_summary(summary_val, batch)

            epochAcc = tf.reduce_mean(batchAccList)
            sess.run(train_iterator, feed_dict = {x : train_data[0], y: train_data[1], batch_size: BATCH_SIZE})
            epochAcc_num = sess.run(epochAcc, feed_dict = {handle: train_val_string})

            epochAccVal = tf.reduce_mean(batchAccListVal)
            sess.run(val_iterator, feed_dict = {x: validation_data[0], y: validation_data[1], batch_size: BATCH_SIZE})
            epochAcc_num_Val = sess.run(epochAccVal, feed_dict = {handle: train_val_string})

            if epoch % 10 == 0:

                print("Epoch: {}, Loss: {:.4f}, Accuracy: {:.3f}".format(epoch, tot_loss_train / N_BATCHES, epochAcc_num))
                print('Validation Loss: {:.4f}, Validation Accuracy: {:.3f}'.format(tot_loss_validation / N_BATCHES, epochAcc_num_Val))

        cmLogFile1 = save_dir + '/cm1File.txt'
        with open(cmLogFile1, 'w') as outfile:
            outfile.write(repr(cm1Total))

        cmLogFile2 = save_dir + '/cm2File.txt'
        with open(cmLogFile2, 'w') as outfile:
            outfile.write(repr(cm2Total))

        saver.save(sess, modelFilesDir + '/model.ckpt')

【Comments】:
