Recommended way of profiling distributed TensorFlow: answers

【Question title】: Recommended way of profiling distributed TensorFlow
【Posted】: 2018-06-26 20:59:59
【Question】:

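I am currently using the TensorFlow Estimator API to train my TF model. Training is distributed across roughly 20-50 workers and 5-30 parameter servers, depending on the size of the training data. Since I do not have access to the session, I cannot use run metadata with a full trace to look at the Chrome trace. I see there are two other approaches: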

1) tf.profiler.profile
2) tf.train.ProfilerHook

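Specifically, I am using tf.estimator.train_and_evaluate(estimator, train_spec, test_spec).

My estimator is a prebuilt estimator.

Can someone give me some guidance on the recommended way to profile an estimator? Concrete code samples and code pointers would be really helpful, since I am very new to TensorFlow. Do the two approaches provide different information, or do they serve the same purpose? And is one recommended over the other?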

【Comments】:

Tags: tensorflow tensorflow-serving tensorflow-datasets tensorflow-estimator

【Answer 1】:

There are two things you can try:

ProfileContext

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/profile_context.py Example usage:

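import tensorflow as tf  # TF 1.x only: tf.contrib was removed in TF 2.x
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
  train_loop()  # placeholder for your own training code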

ProfilerService

https://www.tensorflow.org/tensorboard/r2/tensorboard_profiling_keras

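You can start a ProfilerServer on all workers and parameter servers via tf.python.eager.profiler.start_profiler_server(port), and then use TensorBoard to capture the profile.

Note that this is a very new feature, so you may need to use tf-nightly.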
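
As a rough sketch of the server side: in later TensorFlow 2.x releases the same server is started through tf.profiler.experimental.server.start(port). The code below assumes such a release (2.2 or newer); the helper name and the default port 6009 are illustrative and not part of TensorFlow.

```python
def start_profiler_server(port=6009):
    """Start the profiler gRPC server in the current process.

    Call this once near the top of the program on every worker and
    parameter server, before training starts.
    """
    # Imported lazily so this module can be loaded without TensorFlow.
    import tensorflow as tf

    # Public name of the profiler server API in TF 2.2+ (assumed version).
    tf.profiler.experimental.server.start(port)
    # Address that a profiling client on the same machine would use.
    return "grpc://localhost:{}".format(port)
```

Each task keeps training as usual; the server only listens until a client such as TensorBoard requests a capture.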

【Comments】:

【Answer 2】:

TensorFlow recently added a way to sample multiple workers.

See the API: https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/client/trace?version=nightly

The parameter of that API that matters in this context is:

service_addr: A comma-delimited string of gRPC addresses of the workers to profile, e.g. service_addr='grpc://localhost:6009', service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466', or service_addr='grpc://localhost:12345,grpc://localhost:23456'.
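
A minimal sketch of building that string and passing it to the trace call, assuming TF 2.3 or newer and that every worker runs its profiler server on the same port. The helper names, the default duration, and the worker IPs are illustrative only.

```python
def build_service_addr(hosts, port):
    """Join worker hosts into the comma-delimited string trace() expects."""
    return ",".join("grpc://{}:{}".format(host, port) for host in hosts)


def trace_workers(hosts, port, logdir, duration_ms=2000):
    """Ask every listed profiler server to capture a trace into logdir."""
    import tensorflow as tf  # lazy import; TF 2.3+ assumed

    tf.profiler.experimental.client.trace(
        build_service_addr(hosts, port), logdir, duration_ms)


# Hypothetical worker IPs, matching the second service_addr example above.
addr = build_service_addr(["10.0.0.2", "10.0.0.3"], 8466)
print(addr)
```

Point TensorBoard at the same logdir afterwards to inspect the captured profiles.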

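Also see the API: https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/ProfilerOptions?version=nightly

The parameter of that API that matters in this context is:

delay_ms: Requests that all hosts start profiling at a timestamp that is delay_ms away from the current time. delay_ms is in milliseconds. If zero, each host starts profiling immediately upon receiving the request. The default value is None, which lets the profiler guess the best value.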
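
To show where delay_ms plugs in, here is a hedged sketch that forwards it through ProfilerOptions to the trace call. It assumes TF 2.3 or newer, where trace() accepts an options argument; the wrapper name is hypothetical.

```python
def trace_with_delay(service_addr, logdir, duration_ms, delay_ms=None):
    """Capture a trace in which all hosts start at a common timestamp."""
    import tensorflow as tf  # lazy import; TF 2.3+ assumed

    # delay_ms=None lets the profiler choose; 0 means start on receipt.
    options = tf.profiler.experimental.ProfilerOptions(delay_ms=delay_ms)
    tf.profiler.experimental.client.trace(
        service_addr, logdir, duration_ms, options=options)
```

For example, trace_with_delay('grpc://10.0.0.2:8466,grpc://10.0.0.3:8466', '/tmp/logdir', 2000, delay_ms=1000) would ask both hosts to begin a 2-second capture one second after the request is sent ('/tmp/logdir' is a placeholder path).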

【Comments】: