【问题标题】:Tensorflow Adam Multigpu GradientTensorflow Adam Multigpu 梯度
【发布时间】:2016-04-01 12:24:07
【问题描述】:

我正在尝试使用 ADAM 优化在 tensorflow 上实现网络多 GPU。

我正在处理来自 Cifar10_multigpu 的代码,但看起来当梯度调用第二个塔时,它调用了第一个的梯度,并在两个塔的平均值上产生了误差。 两塔的代码是这样的

  for d in devs:
        with tf.device(d):
            with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope:
                loss = tower_loss(scope)
                tf.get_variable_scope().reuse_variables()
                summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                grads = opt.compute_gradients(loss)
                print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads)))
                tower_grads.append(grads)
        i +=1

这会生成每个塔:

stream, target= placeholder_inputs(FLAGS.batch_size*tf_model.ANGLES/FLAGS.num_gpus)
    logits = tf_model.inference_noisy_simulate(stream)
    _ = tf_model.loss(logits, target)
    losses = tf.get_collection('losses', scope)
    total_loss = tf.add_n(losses, name='total_loss')

查看第一个塔的渐变效果:

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>)
1: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>)
2: (<tf.Tensor 'tower_0/gradients/tower_0/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>)
3: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>)
4: (<tf.Tensor 'tower_0/gradients/tower_0/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>)
5: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>)
6: (<tf.Tensor 'tower_0/gradients/tower_0/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>)
7: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>)
8: (<tf.Tensor 'tower_0/gradients/tower_0/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>)
9: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>)
10: (<tf.Tensor 'tower_0/gradients/tower_0/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>)
11: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>)
12: (<tf.Tensor 'tower_0/gradients/tower_0/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>)
13: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>)
14: (<tf.Tensor 'tower_0/gradients/tower_0/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>)
15: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>)
16: (<tf.Tensor 'tower_0/gradients/tower_0/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>)
17: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>)
18: (<tf.Tensor 'tower_0/gradients/tower_0/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>)

第二个生成这个;

0: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca11d0ae10>)
1: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351b10>)
2: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c380dd0>)
3: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351a10>)
4: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6dd0>)
5: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3a6490>)
6: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351990>)
7: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c351890>)
8: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c3b7790>)
9: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2d9110>)
10: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2849d0>)
11: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2e6f10>)
12: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c2afed0>)
13: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1f9550>)
14: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c214a10>)
15: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c23dfd0>)
16: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c269bd0>)
17: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1d1a50>)
18: (None, <tensorflow.python.ops.variables.Variable object at 0x7fca0c1def50>)
19: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/Conv2D_grad/tuple/control_dependency_1:0' shape=(1, 1, 8, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0c178c50>)
20: (<tf.Tensor 'tower_1/gradients/tower_1/conv1/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfbb490>)
21: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 16) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfda950>)
22: (<tf.Tensor 'tower_1/gradients/tower_1/conv2/BiasAdd_grad/tuple/control_dependency_1:0' shape=(16,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf91bd0>)
23: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 16, 32) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bfcb590>)
24: (<tf.Tensor 'tower_1/gradients/tower_1/conv3/BiasAdd_grad/tuple/control_dependency_1:0' shape=(32,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39e90>)
25: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 32, 64) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf499d0>)
26: (<tf.Tensor 'tower_1/gradients/tower_1/conv4/BiasAdd_grad/tuple/control_dependency_1:0' shape=(64,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf14fd0>)
27: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 64, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf39150>)
28: (<tf.Tensor 'tower_1/gradients/tower_1/conv5/BiasAdd_grad/tuple/control_dependency_1:0' shape=(128,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd8d0>)
29: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/Conv2D_grad/tuple/control_dependency_1:0' shape=(45, 4, 128, 256) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf23110>)
30: (<tf.Tensor 'tower_1/gradients/tower_1/conv6/BiasAdd_grad/tuple/control_dependency_1:0' shape=(256,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf04610>)
31: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/MatMul_grad/tuple/control_dependency_1:0' shape=(18944, 4096) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebdc50>)
32: (<tf.Tensor 'tower_1/gradients/tower_1/fc1/add_grad/tuple/control_dependency_1:0' shape=(4096,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bebd310>)
33: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/MatMul_grad/tuple/control_dependency_1:0' shape=(4096, 1024) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96e10>)
34: (<tf.Tensor 'tower_1/gradients/tower_1/fc1_1/add_grad/tuple/control_dependency_1:0' shape=(1024,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be96990>)
35: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/MatMul_grad/tuple/control_dependency_1:0' shape=(1024, 360) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0be52c90>)
36: (<tf.Tensor 'tower_1/gradients/tower_1/softmax_linear/softmax_linear_grad/tuple/control_dependency_1:0' shape=(360,) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7fca0bf56f50>)

我想知道如何从第二个中删除第一个 None,但没有定位索引,以便我可以制作更多的塔。

【问题讨论】:

    标签: python tensorflow adam


    【解决方案1】:

    我已经找到了错误。我使用了一个可训练变量作为学习率(我想跟踪 lr,但它看起来不可能),并且还添加了要由 adam 处的操作计算的变量列表。我不确定这是否正确,但看起来可行。

    with tf.Graph().as_default(), tf.device('/cpu:0'):
            devs = ['/job:prs/task:0/gpu:0','/job:worker/task:0/gpu:0'] #
            global_step = tf.get_variable('global_step', [], initializer=tf.constant_initializer(0), trainable=False)
            num_batches_per_epoch = dt_fdr.FLS_PER_ANGLE/ FLAGS.batch_size
            #lr = tf.Variable(tf.constant(FLAGS.learning_rate, dtype=tf.float32))
            opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
            tower_grads = []
            for i in xrange(FLAGS.num_gpus):
                with tf.device(devs[i]):
                    with tf.name_scope('%s_%d' % (tf_model.TOWER_NAME, i)) as scope:
                        loss = tower_loss(scope)
                        tf.get_variable_scope().reuse_variables()
                        summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
                        #"print('\n'.join('{}: {}'.format(*k) for k in enumerate(summaries)))
                        grads = opt.compute_gradients(loss, tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope))
                        #print('\n'.join('{}: {}'.format(*k) for k in enumerate(grads)))
                        tower_grads.append(grads)
            grads = average_gradients(tower_grads)
            #summaries.append(tf.scalar_summary('learning_rate', lr))
            for grad, var in grads:
                if grad:
                    summaries.append(
                        tf.histogram_summary(var.op.name + '/gradients', grad))
            apply_gradient_op = opt.apply_gradients(grads, global_step=global_step) 
            for var in tf.trainable_variables():
                summaries.append(tf.histogram_summary(var.op.name, var))
    
            train_op = apply_gradient_op
    
            saver = tf.train.Saver(tf.all_variables())
    
            summary_op = tf.merge_summary(summaries)
    
            init = tf.initialize_all_variables()
    
            sess = tf.Session("grpc://nelson-lab:2500",config=tf.ConfigProto(
                allow_soft_placement=True,
                log_device_placement=FLAGS.log_device_placement))
            sess.run(init)
    

    我想知道是否有人也尝试过使用 adam 进行双 GPU 训练。

    问候

    【讨论】:

    • 感谢分享您的实施。我还希望跨多个 GPU 并行化训练。您的实施成功了吗?
    【解决方案2】:

    附加

    如果您有计划在训练阶段更新学习率,请声明如下。

    lr = tf.Variable(FLAGS.learning_rate, trainable=False)
    opt = tf.train.AdamOptimizer(lr)
    

    之后

    sess.run(tf.assign(lr, new_lr))
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 2018-10-02
      • 2018-11-04
      • 2016-06-23
      • 2022-01-04
      • 2019-11-16
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多