How to average summaries over multiple batches? Answers

【Question title】: How to average summaries over multiple batches?
【Posted】: 2016-11-24 14:22:10
【Question description】:

Suppose I have a bunch of summaries defined like this:

loss = ...
tf.scalar_summary("loss", loss)
# ...
summaries = tf.merge_all_summaries()

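On the training data, I can evaluate the summaries tensor every few steps and pass the result to a SummaryWriter. The resulting summaries are noisy, because each one is computed on a single batch.

However, I would like to compute the summaries over the entire validation dataset. Of course, I cannot pass the validation dataset as a single batch, because it is too large. So I end up with one summary output per validation batch.

Is there a way to average those summaries so that they look as if they had been computed on the whole validation set?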

【Question discussion】:

Tags: tensorflow

【Solution 1】:

Average your measure in Python and create a new Summary object for each mean. Here is what I do:

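import numpy as np
import tensorflow as tf

accuracies = []

# Calculate your measure over as many batches as you need.
# accuracy_op stands for the tensor holding your per-batch measure;
# evaluate the measure itself here, not the training op.
for batch in validation_set:
  accuracies.append(sess.run(accuracy_op))  # feed `batch` as your graph requires

# Take the mean of your measure
accuracy = np.mean(accuracies)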

# Create a new Summary object with your measure
summary = tf.Summary()
summary.value.add(tag="%sAccuracy" % prefix, simple_value=accuracy)

# Add it to the Tensorboard summary writer
# Make sure to specify a step parameter to get nice graphs over time
summary_writer.add_summary(summary, global_step)
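
One caveat with taking np.mean over per-batch values: it matches the full-dataset mean only when every batch has the same size. If the last validation batch is smaller, weighting each batch by its size recovers the exact value. A minimal sketch in plain Python; the helper weighted_mean and the (batch_mean, batch_size) pairs are hypothetical names for illustration, not part of the answer above:

```python
def weighted_mean(batch_results):
    """Average per-batch means, weighting each one by its batch size.

    batch_results: iterable of (batch_mean, batch_size) pairs.
    """
    total = 0.0
    count = 0
    for batch_mean, batch_size in batch_results:
        total += batch_mean * batch_size
        count += batch_size
    if count == 0:
        raise ValueError("no batches were provided")
    return total / count


# Two full batches of 100 examples and a final partial batch of 20.
results = [(0.90, 100), (0.80, 100), (0.50, 20)]
naive = sum(m for m, _ in results) / len(results)  # treats all batches alike
exact = weighted_mean(results)                     # matches the per-example mean
print(naive, exact)
```

The weighted result is the value you would then pass as simple_value when building the Summary object.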

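【Discussion】:

Great, I did not know there was an API for building summary objects directly in Python code. It makes sense, though, since it is just a protocol buffer.
What does prefix refer to here?

【Solution 2】:

I would avoid computing the average outside the graph.

You can use tf.train.ExponentialMovingAverage: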

ema = tf.train.ExponentialMovingAverage(decay=my_decay_value, zero_debias=True)
maintain_ema_op = ema.apply(your_losses_list)

# Create an op that will update the moving averages after each training step.
with tf.control_dependencies([your_original_train_op]):
    train_op = tf.group(maintain_ema_op)

Then, use:

sess.run(train_op)

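This runs both your_original_train_op (the control dependency) and maintain_ema_op (the op grouped into train_op).

To get the exponential moving average, use: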

moving_average = ema.average(an_item_from_your_losses_list_above)

and retrieve its value with:

value = sess.run(moving_average)

This computes the moving average inside your computation graph.
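
To make the arithmetic concrete, here is a minimal plain-Python sketch of an exponential moving average that starts at zero, with the standard bias correction. The class name DebiasedEMA is hypothetical, and whether TensorFlow's zero_debias option uses exactly this form internally is an assumption here. Note that an exponential moving average weights recent batches more heavily, so it approximates the mean over a validation set rather than reproducing it exactly.

```python
class DebiasedEMA:
    """Exponential moving average with bias correction for a zero start."""

    def __init__(self, decay):
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        self.shadow = 0.0  # biased running average, starts at zero
        self.steps = 0

    def update(self, value):
        # Blend the new value into the running average.
        self.shadow = self.decay * self.shadow + (1.0 - self.decay) * value
        self.steps += 1
        return self.average()

    def average(self):
        if self.steps == 0:
            return 0.0
        # Divide out the weight still held by the zero initial value.
        return self.shadow / (1.0 - self.decay ** self.steps)


ema = DebiasedEMA(decay=0.9)
for loss in [2.0, 2.0, 2.0]:
    ema.update(loss)
print(ema.shadow)     # biased value, still well below 2.0
print(ema.average())  # debiased value, close to 2.0
```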

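【Discussion】:

@MZHn Why is it better to let TF do the computation internally? Is it much faster? (It seems Python can handle an averaging operation just fine.)
@DankMasterDan When everything is handled by TensorFlow, every operation runs on the GPU side (assuming you are using TensorFlow GPU in this case). Moving the point of execution from the GPU to the CPU (where the Python code runs) therefore takes some clock cycles. That time can become a bottleneck, with the GPU waiting for the handler to continue processing large data arrays. When TensorFlow handles this simple task, however, it can be parallelized like any other node in the computation graph, without switching back and forth between CPU and GPU.
@MZHm What is an_item_from_your_losses_list_above supposed to be? Can you show how your_losses_list is created? I tried creating an empty np.array() and appending tensors to it as part of the control dependencies. When I call ema.average(list_of_loss_tensors), I get: TypeError: unhashable type: 'numpy.ndarray'. What should be used instead of a list or numpy.ndarray as the container for the losses? A tf.placeholder of size num_losses?

【Solution 3】:

I think it is always better to let TensorFlow do the computation.

Take a look at the streaming metrics. They have an update function to feed in the information from your current batch, and a function to get the averaged summary. It looks somewhat like this: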

accuracy = ... 
streaming_accuracy, streaming_accuracy_update = tf.contrib.metrics.streaming_mean(accuracy)
streaming_accuracy_scalar = tf.summary.scalar('streaming_accuracy', streaming_accuracy)

# set up your session etc. 

for i in iterations:
    for b in batches:
        sess.run([streaming_accuracy_update], feed_dict={...})

    streaming_summary = sess.run(streaming_accuracy_scalar)
    writer.add_summary(streaming_summary, i)

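See also the TensorFlow documentation: https://www.tensorflow.org/versions/master/api_guides/python/contrib.metrics

And this question: How to accumulate summary statistics in tensorflow

【Discussion】:

You also need `sess.run(tf.local_variables_initializer())` before accumulating.
If only tf.metrics worked with eager execution, but it does not seem to. Is there a similar approach for eager execution? Otherwise, if we want to stay flexible between compiled and eager, it seems that in some cases we need to do the computation ourselves...

【Solution 4】:

You can store the running sum and recompute the average after each batch, for example: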

loss_sum = tf.Variable(0.)
inc_op = tf.assign_add(loss_sum, loss)
clear_op = tf.assign(loss_sum, 0.)
average = loss_sum / batches
tf.scalar_summary("average_loss", average)

sess.run(clear_op)
for i in range(batches):
    sess.run([loss, inc_op])

sess.run(average)

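【Discussion】:

Wouldn't this add a new scalar summary to the graph every time the code is executed?
Yes, I assumed you would only execute it once (although now I see that does not make sense). Editing...
This seems reasonable. However, I was hoping for a more elegant solution... but could not come up with one (see my hacky alternative).

【Solution 5】:

For future reference, the TensorFlow metrics API now supports this by default. For example, take a look at tf.metrics.mean_squared_error:

For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the mean_squared_error. Internally, a squared_error operation computes the element-wise square of the difference between predictions and labels. Then update_op increments total with the reduced sum of the product of weights and squared_error, and it increments count with the reduced sum of weights.

These total and count variables are added to the set of metric variables, so in practice what you would do is something like this: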

x_batch = tf.placeholder(...)
y_batch = tf.placeholder(...)
model_output = ...
mse, mse_update = tf.metrics.mean_squared_error(y_batch, model_output)
# This operation resets the metric internal variables to zero
metrics_init = tf.variables_initializer(
    tf.get_default_graph().get_collection(tf.GraphKeys.METRIC_VARIABLES))
with tf.Session() as sess:
    # Train...
    # On evaluation step
    sess.run(metrics_init)
    for x_eval_batch, y_eval_batch in ...:
        # mse_update returns the running MSE; a separate name keeps the mse tensor usable
        mse_value = sess.run(mse_update, feed_dict={x_batch: x_eval_batch, y_batch: y_eval_batch})
    print('Evaluation MSE:', mse_value)
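
The total and count bookkeeping described above can be sketched in plain Python without TensorFlow. The class StreamingMSE and its method names are hypothetical; this illustrates the accumulation scheme and is not the library's implementation. Returning 0.0 for an empty metric is a choice made here for the sketch.

```python
class StreamingMSE:
    """Accumulate a weighted mean squared error across batches."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Plays the role of re-running the metric variables' initializer.
        self.total = 0.0
        self.count = 0.0

    def update(self, labels, predictions, weights=None):
        if weights is None:
            weights = [1.0] * len(labels)
        for y, p, w in zip(labels, predictions, weights):
            self.total += w * (y - p) ** 2  # weighted squared error
            self.count += w                 # sum of weights
        return self.result()

    def result(self):
        return self.total / self.count if self.count > 0 else 0.0


metric = StreamingMSE()
metric.update([1.0, 2.0], [1.0, 4.0])  # squared errors 0 and 4
metric.update([3.0], [2.0])            # squared error 1
print(metric.result())                 # mean over all three examples
```

Because the state is a sum and a count rather than a list of per-batch means, batches of different sizes are handled correctly.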

【Discussion】:

【Solution 6】:

I found one solution myself. I think it is a bit hacky, and I hope there is a more elegant one.

During setup:

valid_loss_placeholder = tf.placeholder(dtype=tf.float32, shape=[])
valid_loss_summary = tf.scalar_summary("valid loss", valid_loss_placeholder)

Or, for TensorFlow versions after 0.12 (where tf.scalar_summary was renamed):

valid_loss_placeholder = tf.placeholder(dtype=tf.float32, shape=[])
valid_loss_summary = tf.summary.scalar("valid loss", valid_loss_placeholder)

Inside the training loop:

# Compute valid loss in python by doing sess.run() for each batch
# and averaging
valid_loss = ...

summary = sess.run(valid_loss_summary, {valid_loss_placeholder: valid_loss})
summary_writer.add_summary(summary, step)

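【Discussion】:

【Solution 7】:

As of August 2018, streaming metrics have been deprecated. However, unintuitively, all metrics are streaming. So, use tf.metrics.accuracy.

However, if you only want the accuracy (or another metric) over a subset of batches, you can use an exponential moving average, as in @MZHm's answer, or reset any tf.metric by following this very informative blog post.

【Discussion】:

【Solution 8】:

For quite a long time I saved summaries only once per epoch. I never knew that TensorFlow's summaries only save the summary of the last batch that was run.

Shocked, I looked into the problem. Here is the solution I came up with (using the dataset API):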

loss = ...
train_op = ...

loss_metric, loss_metric_update = tf.metrics.mean(loss)
tf.summary.scalar('loss', loss_metric)

merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(os.path.join(res_dir, 'train'))
test_writer = tf.summary.FileWriter(os.path.join(res_dir, 'test'))

init_local = tf.initializers.local_variables()
init_global = tf.initializers.global_variables()

sess.run(init_global)

def train_run(epoch):
    sess.run([dataset.train_init_op, init_local]) # train_init_op is the operation that switches to training data
    for i in range(dataset.num_train_batches): # num_train_batches is the number of batches that should be run for the training set
        sess.run([train_op, loss_metric_update])

    summary, cur_loss = sess.run([merged, loss_metric])
    train_writer.add_summary(summary, epoch)

    return cur_loss

def test_run(epoch):
    sess.run([dataset.test_init_op, init_local]) # test_init_op is the operation that switches to test data
    for i in range(dataset.num_test_batches): # num_test_batches is the number of batches that should be run for the test set
        sess.run(loss_metric_update)

    summary, cur_loss = sess.run([merged, loss_metric])
    test_writer.add_summary(summary, epoch)

    return cur_loss

for epoch in range(epochs):
    train_loss = train_run(epoch+1)
    test_loss = test_run(epoch+1)
    print("Epoch: {0:3}, loss: (train: {1:10.10f}, test: {2:10.10f})".format(epoch+1, train_loss, test_loss))

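For the summaries, I simply wrap the tensor I am interested in with tf.metrics.mean(). For every batch run I call the metric's update operation. At the end of each epoch, the metric tensor returns the correct mean over all batch results.

Do not forget to initialize the local variables every time you switch between training and test data. Otherwise your training and test metrics will be nearly identical.

【Discussion】:

【Solution 9】:

I ran into the same problem when I realized that I had to iterate over my validation data because memory was tight and OOM errors were piling up.

As several of these answers say, tf.metrics has this built in, but I am not using tf.metrics in my project. Inspired by that, I made this: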
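
The effect of skipping that reset can be shown with a small plain-Python sketch. The function run_epoch and the shared state dictionary are hypothetical stand-ins for the metric's update op and its local variables:

```python
def run_epoch(state, batch_losses, reset):
    """Feed batch losses into a shared (total, count) state and return the mean."""
    if reset:
        state["total"], state["count"] = 0.0, 0
    for loss in batch_losses:
        state["total"] += loss
        state["count"] += 1
    return state["total"] / state["count"]


train_losses = [1.0, 1.0, 1.0, 1.0]
test_losses = [3.0, 3.0]

# Without a reset, the test mean still contains the training batches.
shared = {"total": 0.0, "count": 0}
run_epoch(shared, train_losses, reset=True)
stale_test = run_epoch(shared, test_losses, reset=False)

# With a reset before each pass, each mean covers only its own data.
shared = {"total": 0.0, "count": 0}
run_epoch(shared, train_losses, reset=True)
clean_test = run_epoch(shared, test_losses, reset=True)

print(stale_test, clean_test)
```

In the stale case the reported test loss is pulled toward the training loss, which is why the two curves end up looking almost the same.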

import tensorflow as tf
import numpy as np


def batch_persistent_mean(tensor):
    # Make a variable that keeps track of the sum
    accumulator = tf.Variable(initial_value=tf.zeros_like(tensor), dtype=tf.float32)
    # Keep count of batches in accumulator (needed to estimate mean)
    batch_nums = tf.Variable(initial_value=tf.zeros_like(tensor), dtype=tf.float32)
    # Make an operation for accumulating, increasing batch count
    accumulate_op = tf.assign_add(accumulator, tensor)
    step_batch = tf.assign_add(batch_nums, 1)
    update_op = tf.group([step_batch, accumulate_op])
    eps = 1e-5
    output_tensor = accumulator / (tf.nn.relu(batch_nums - eps) + eps)
    # In regards to the tf.nn.relu, it's a hacky zero_guard:
    # if batch_nums are zero then return eps, else it'll be batch_nums
    # Make an operation to reset
    flush_op = tf.group([tf.assign(accumulator, 0), tf.assign(batch_nums, 0)])
    return output_tensor, update_op, flush_op

# Make a variable that we want to accumulate
X = tf.Variable(0., dtype=tf.float32)
# Make our persistent mean operations
Xbar, upd, flush = batch_persistent_mean(X)

Now you send Xbar to your summaries, e.g. tf.scalar_summary("mean_of_x", Xbar), and wherever you previously did sess.run(X), you do sess.run(upd) instead. Between epochs, you do sess.run(flush).

Testing the behavior:

### INSERT ABOVE CODE CHUNK IN S.O. ANSWER HERE ###
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    # Calculate the mean of 0, 1, ..., 19
    for i in range(20):
        sess.run(upd, {X: i})
    print(sess.run(Xbar), "=", np.mean(np.arange(20)))
    for i in range(40):
        sess.run(upd, {X: i})
    # Now Xbar is the mean of (0, 1, ..., 19, 0, 1, ..., 39):
    print(sess.run(Xbar), "=", np.mean(np.concatenate([np.arange(20), np.arange(40)])))
    # Now flush it
    sess.run(flush)
    print("flushed. Xbar=", sess.run(Xbar))
    for i in range(40):
        sess.run(upd, {X: i})
    print(sess.run(Xbar), "=", np.mean(np.arange(40)))
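
The relu-based zero guard in batch_persistent_mean can be checked in isolation. Below is a minimal plain-Python sketch in which max(x, 0) plays the role of tf.nn.relu; the function names guarded_count and persistent_mean are hypothetical. The trick relies on batch_nums being a non-negative whole number, which holds here because it only ever increases by 1 or is reset to 0.

```python
def guarded_count(batch_nums, eps=1e-5):
    """Return eps when batch_nums is 0, and (approximately) batch_nums otherwise."""
    return max(batch_nums - eps, 0.0) + eps  # max(x, 0) stands in for relu


def persistent_mean(accumulator, batch_nums):
    # Safe division: the denominator is never zero.
    return accumulator / guarded_count(batch_nums)


print(persistent_mean(0.0, 0))     # flushed state: 0 divided by eps
print(persistent_mean(190.0, 20))  # sum of 0..19 over 20 updates
```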

【Discussion】: