Why does distributed tensorflow timeline trace mark the QueueDequeue operation as a PS operation?

[Question title]: Why does distributed tensorflow timeline trace mark the QueueDequeue operation as a PS operation?
[Posted]: 2017-02-11 23:49:10
[Question]:

I am running the distributed TensorFlow Inception model on a cluster of AWS Ubuntu machines and writing out a timeline trace with:

import tensorflow as tf
from tensorflow.python.client import timeline

# sess, train_op, global_step and FLAGS come from the surrounding training script.

# Track statistics of the run using Timeline
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step while collecting the trace
loss_value, step = sess.run([train_op, global_step], options=run_options, run_metadata=run_metadata)

# Create timeline and write it to a json file (one file per task)
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('timeline%d.json' % FLAGS.task_id, 'w') as f:
    f.write(ctf)

When I look at the timeline generated by a worker machine, I see this (image: Timeline Trace for Worker Machine).

Note the QueueDequeue operation on the right: the timeline says it belongs to the parameter server, /job:ps/replica:0/task:0/cpu:0.
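
To double-check which device the trace attributes each op to without reading it off the picture, the JSON file can be inspected directly. The following is only a minimal sketch: the sample events are hand-written and hypothetical (including the node names), and it assumes the file follows the usual Chrome trace layout, with "process_name" metadata events naming each pid and "X" events for executed ops. The helper name ops_by_device is my own.

```python
import json
from collections import defaultdict

# Hypothetical, hand-written sample in Chrome trace event format.
# A real file would be read with: open("timeline0.json").read()
sample_trace = json.dumps({
    "traceEvents": [
        {"ph": "M", "name": "process_name", "pid": 0,
         "args": {"name": "/job:worker/replica:0/task:0/cpu:0 Compute"}},
        {"ph": "M", "name": "process_name", "pid": 1,
         "args": {"name": "/job:ps/replica:0/task:0/cpu:0 Compute"}},
        {"ph": "X", "name": "Conv2D", "pid": 0, "tid": 0, "ts": 0, "dur": 50,
         "args": {"name": "conv0/Conv2D", "op": "Conv2D"}},
        {"ph": "X", "name": "QueueDequeue", "pid": 1, "tid": 0, "ts": 60, "dur": 900,
         "args": {"name": "sync_token_q_Dequeue", "op": "QueueDequeue"}},
    ]
})


def ops_by_device(trace_json):
    """Group the names of executed ops by the device label of their pid."""
    events = json.loads(trace_json)["traceEvents"]
    # Metadata events map each pid to a human-readable device label.
    pid_to_device = {
        e["pid"]: e["args"]["name"]
        for e in events
        if e.get("ph") == "M" and e.get("name") == "process_name"
    }
    grouped = defaultdict(list)
    for e in events:
        if e.get("ph") == "X":  # complete events, i.e. executed ops
            grouped[pid_to_device.get(e["pid"], "unknown")].append(e["name"])
    return dict(grouped)


result = ops_by_device(sample_trace)
for device, ops in sorted(result.items()):
    print(device, ops)
```

Run against a real timeline file, this should list QueueDequeue under whichever device label the trace assigned it to, which is the same information the Chrome trace viewer shows.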

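Since a ScatterUpdate comes right after the QueueDequeue, as shown, I believe this operation corresponds to the sync replicas optimizer step in which a worker tries to dequeue a token and then performs a scatter update: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/sync_replicas_optimizer.py#L412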

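But if that is the case, a worker should be executing this operation, not a parameter server. Why does the timeline say the parameter server is executing it?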

I am using TensorFlow 0.11, CPU only.

[Comments]:

Tags: tensorflow tensorflow-serving

[Solution 1]:

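It seems the trace is correct: the dequeue operation really is executed on the PS. The worker only has a dependency on that operation, which means the worker is essentially waiting for a successful dequeue.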
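
As a rough illustration of "executed on the PS, awaited by the worker", here is a plain-Python analogy. It is not TensorFlow code and not how the sync replicas optimizer is implemented: a single-thread executor stands in for the PS device, a queue.Queue stands in for the token queue, and all names (token_queue, dequeue_token, worker_step) are made up for the sketch.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the token queue that lives on the parameter server.
token_queue = queue.Queue()

# A single-thread executor plays the role of the PS device.
ps = ThreadPoolExecutor(max_workers=1, thread_name_prefix="ps")


def dequeue_token():
    """Runs on the PS thread and blocks until a token is available."""
    token = token_queue.get()
    return token, threading.current_thread().name


def worker_step():
    """Runs on the calling (worker) thread and waits for the PS-side dequeue."""
    future = ps.submit(dequeue_token)  # the dequeue itself is scheduled on the PS
    return future.result()             # the worker only blocks on its result


# Nothing can proceed until a token is enqueued; here a timer adds one shortly.
threading.Timer(0.1, token_queue.put, args=(7,)).start()
token, ran_on = worker_step()
print("token", token, "was dequeued on thread", ran_on)
ps.shutdown()
```

The worker's step takes as long as the dequeue does, yet the thread that performed the dequeue is the PS one. A per-device trace of this toy program would show the same picture as the timeline in the question.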

[Comments]: