[Title]: TensorFlow on multiple GPUs
[Posted]: 2025-11-30 19:05:02
[Question]:

Recently, I have been trying to learn how to use TensorFlow on multiple GPUs by reading the official tutorial. However, something confuses me. The following code, taken from the official tutorial, computes the loss on a single GPU.

def tower_loss(scope, images, labels):

  # Build inference Graph.
  logits = cifar10.inference(images)

  # Build the portion of the Graph calculating the losses. Note that we will
  # assemble the total_loss using a custom function below.
  _ = cifar10.loss(logits, labels)

  # Assemble all of the losses for the current tower only.
  losses = tf.get_collection('losses', scope)

  # Calculate the total loss for the current tower.
  total_loss = tf.add_n(losses, name='total_loss')

  # Attach a scalar summary to all individual losses and the total loss; do the
  # same for the averaged version of the losses.
  for l in losses + [total_loss]:
    # Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
    # session. This helps the clarity of presentation on tensorboard.
    loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
    tf.summary.scalar(loss_name, l)

  return total_loss

The training procedure is as follows.

def train():
  with tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)

    # Calculate the learning rate schedule.
    num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
                             FLAGS.batch_size / FLAGS.num_gpus)
    decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)

    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
                                    global_step,
                                    decay_steps,
                                    cifar10.LEARNING_RATE_DECAY_FACTOR,
                                    staircase=True)

    # Create an optimizer that performs gradient descent.
    opt = tf.train.GradientDescentOptimizer(lr)

    # Get images and labels for CIFAR-10.
    images, labels = cifar10.distorted_inputs()
    batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
        [images, labels], capacity=2 * FLAGS.num_gpus)

    # Calculate the gradients for each model tower.
    tower_grads = []
    with tf.variable_scope(tf.get_variable_scope()):
      for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeue one batch for the GPU.
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)

            # Reuse variables for the next tower.
            tf.get_variable_scope().reuse_variables()

            # Retain the summaries from the final tower.
            summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused by the for loop "for i in xrange(FLAGS.num_gpus)". It looks as if a new batch of images has to be fetched from batch_queue and its gradients computed for each GPU in turn, so I would think this process is serialized rather than parallel. Is there something wrong with my understanding? By the way, I could also use an iterator to feed images to my model instead of dequeuing, right?

Thanks, everyone!

[Comments on the question]:

    Tags: python tensorflow distributed-computing multiple-gpu


    [Solution 1]:

    This is a common misunderstanding of TensorFlow's programming model. What you are showing here is the construction of the computation graph, not its actual execution.

    The block:

    for i in xrange(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
          with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
            # Dequeues one batch for the GPU
            image_batch, label_batch = batch_queue.dequeue()
            # Calculate the loss for one tower of the CIFAR model. This function
            # constructs the entire CIFAR model but shares the variables across
            # all towers.
            loss = tower_loss(scope, image_batch, label_batch)
    

    translates to:

    For each GPU device (`for i in range..` & `with device...`):
        - build operations needed to dequeue a batch
        - build operations needed to run the batch through the network and compute the loss
    

    Note how, through tf.get_variable_scope().reuse_variables(), you tell the graph that the variables used by the per-GPU graphs must be shared among all of them (i.e., all of the graphs on the multiple devices "reuse" the same variables).
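    As a toy illustration of that sharing (not taken from the tutorial): once reuse is enabled on a variable scope, a second tf.get_variable() call with the same name returns the existing variable instead of creating a new one, which is exactly what lets every tower train the same weights.

    import tensorflow as tf  # TF 1.x graph-mode API

    with tf.variable_scope('model') as vs:
        w1 = tf.get_variable('w', shape=[2, 2])  # first call creates the variable
        vs.reuse_variables()                     # from here on, reuse instead of create
        w2 = tf.get_variable('w', shape=[2, 2])  # returns the SAME variable

    print(w1 is w2)  # True -- all "towers" would update one shared set of weights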

    None of this actually runs the network even once (note that there is no sess.run()): you are only declaring how the data must flow.
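    A tiny, self-contained sketch of that construction-versus-execution split (again, not part of the tutorial): the Python loop below only adds nodes to the graph, and nothing is computed until sess.run() is called.

    import tensorflow as tf  # TF 1.x graph-mode API

    outputs = []
    for i in range(4):
        with tf.device('/cpu:0'):            # placement is recorded; nothing runs yet
            outputs.append(tf.constant(i) * 2)

    total = tf.add_n(outputs)                # still just another graph node

    with tf.Session() as sess:
        print(sess.run(total))               # only now does anything execute -> 12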

    Then, when you start the actual training (I guess you missed that piece when copying the code here), each GPU will pull its own batch and produce a per-tower loss. I guess those losses are averaged somewhere in the subsequent code, and the average is the loss that gets passed to the optimizer.

    Up to the point where the tower losses are averaged, everything is independent of the other devices, so fetching a batch and computing a loss can happen in parallel. After that, the gradients and parameter updates are computed only once, the variables are updated, and the cycle repeats.

    So, to answer your question: no, the per-batch loss computation is not serialized. But since this is synchronous distributed computation, you do need to collect all the losses from all the GPUs before you can proceed with the gradient computation and parameter update, so there is still some part of the graph that cannot be independent.
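    For reference, here is a minimal sketch of what that missing continuation typically looks like in the multi-GPU CIFAR-10 tutorial: each tower appends its gradients to tower_grads, the CPU averages them and applies a single update, and only then does a sess.run() loop drive the actual execution. The exact lines (in particular the average_gradients helper and FLAGS.max_steps) follow the tutorial script's structure rather than the code posted above, so treat this as an assumption, not the asker's code.

    # (inside the per-GPU loop, right after `loss = tower_loss(...)`)
    grads = opt.compute_gradients(loss)       # gradients for this tower only
    tower_grads.append(grads)

    # (back on the CPU, after the loop over GPUs)
    # Average the per-tower gradients so every variable gets exactly one
    # synchronized update per step -- this is the "synchronous" part.
    grads = average_gradients(tower_grads)    # helper defined in the tutorial script
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

    # Only here does the graph actually run: each sess.run() call dequeues one
    # batch per GPU, evaluates all towers in parallel, and applies one update.
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    sess.run(tf.global_variables_initializer())
    tf.train.start_queue_runners(sess=sess)   # start the input pipeline threads
    for step in range(FLAGS.max_steps):
        _, loss_value = sess.run([apply_gradient_op, loss])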

    [Comments]:

    • OK, thank you for your patient explanation; I understand the points you made above. Indeed, after the graph is built, the CPU collects the gradients from every GPU and averages them to update the variables. By the way, in this case the author uses batch_queue.dequeue() to feed each GPU a new batch. However, I suppose I could just create an iterator directly in my graph instead, right?
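    A minimal sketch of that iterator-based alternative, assuming the tf.data API (TensorFlow 1.4+); the dataset construction and the train_images / train_labels arrays below are illustrative placeholders, not part of the tutorial:

    import tensorflow as tf

    # Illustrative only: assumes `train_images` / `train_labels` are in-memory
    # NumPy arrays (they are NOT defined in the tutorial code above).
    dataset = (tf.data.Dataset.from_tensor_slices((train_images, train_labels))
               .shuffle(buffer_size=10000)
               .repeat()
               .batch(FLAGS.batch_size)
               .prefetch(2 * FLAGS.num_gpus))
    iterator = dataset.make_one_shot_iterator()

    for i in range(FLAGS.num_gpus):
        with tf.device('/gpu:%d' % i):
            # get_next() plays the role of batch_queue.dequeue(): each tower
            # pulls its own (different) batch from the shared pipeline.
            image_batch, label_batch = iterator.get_next()
            # ... build the tower with tower_loss(scope, image_batch, label_batch)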