How to train TensorFlow on multiple devices which don't have GPUs? Answers

【Question title】: How to train TensorFlow on multiple devices which don't have GPUs?
【Posted】: 2020-11-28 08:09:12
【Question description】:

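Suppose we have a simple TensorFlow model with a few convolutional layers. We would like to train this model on a cluster of computers that are not equipped with GPUs. Each compute node in the cluster may have one or more cores. Is this possible out of the box? If not, which packages can do it? Can those packages perform both data parallelism and model parallelism?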

【Question discussion】:

Tags: tensorflow deep-learning distributed-computing distributed

【Solution 1】:

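According to the TensorFlow documentation:

tf.distribute.Strategy is a TensorFlow API to distribute training across multiple GPUs, multiple machines, or TPUs.

As the quote says, training can be distributed across multiple machines, so CPU-only distributed training is supported, provided that all the devices are on the same network.

So yes, you can train the model on multiple devices. Each device needs a cluster and worker configuration like the one shown below.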

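import json
import os

# Cluster layout: two workers; this process is worker 0.
tf_config = {
    'cluster': {
        'worker': ['localhost:1234', 'localhost:6789']
    },
    'task': {'type': 'worker', 'index': 0}
}
# TensorFlow reads the cluster spec from the TF_CONFIG environment variable.
os.environ['TF_CONFIG'] = json.dumps(tf_config)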
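Every worker needs the same 'cluster' section but its own 'task' index, exported as TF_CONFIG before TensorFlow starts. The following is a minimal sketch of generating those per-worker values. The helper name make_tf_config and the host names are hypothetical and are not part of TensorFlow.

```python
import json


def make_tf_config(worker_hosts, index):
    """Return the TF_CONFIG JSON string for the worker at position `index`."""
    if not 0 <= index < len(worker_hosts):
        raise ValueError("index must refer to an entry in worker_hosts")
    return json.dumps({
        'cluster': {'worker': list(worker_hosts)},
        'task': {'type': 'worker', 'index': index},
    })


# Hypothetical two-node CPU cluster; replace with the real host:port pairs.
hosts = ['node-a.example:2222', 'node-b.example:2222']

# One JSON string per worker: same cluster, different task index.
configs = [make_tf_config(hosts, i) for i in range(len(hosts))]
```

On each node you would export the matching string, for example with os.environ['TF_CONFIG'] = configs[i], before the training script creates its strategy.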

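To learn how to set up the configuration and train the model, see Multi-worker training with Keras.

According to this SO answer:

tf.distribute.Strategy is integrated into tf.keras, so model.fit can be used with a tf.distribute.Strategy instance. Building your model inside strategy.scope() lets it create distributed variables, which in turn lets training run across your devices.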
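As a rough illustration of how strategy.scope() and model.fit fit together, here is a minimal sketch. It assumes TensorFlow 2.4 or later, where tf.distribute.MultiWorkerMirroredStrategy is available under that name. The layer sizes and the random data are placeholders, not taken from the question. When TF_CONFIG is not set, the strategy runs as a single worker, which is convenient for a dry run on one CPU machine.

```python
import numpy as np
import tensorflow as tf

# Create the strategy first; on a real cluster TF_CONFIG must already be set.
strategy = tf.distribute.MultiWorkerMirroredStrategy()


def build_model():
    # Placeholder model with a couple of convolutional layers.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(8, 3, activation='relu'),
        tf.keras.layers.Conv2D(8, 3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    return model


# Variables created inside the scope become distributed variables.
with strategy.scope():
    model = build_model()

# Random stand-in data; a real job would load its own dataset here.
x = np.random.rand(64, 28, 28, 1).astype('float32')
y = np.random.randint(0, 10, size=(64,))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

history = model.fit(dataset, epochs=1, verbose=0)
```

In a multi-node run, every node would execute this same script, each with its own TF_CONFIG.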

Note: distributed training pays off in performance mainly when you are working with large amounts of data and complex models.

【Discussion】: