What is the difference between Model.train_on_batch from keras and Session.run([train_optimizer]) from tensorflow?

[Question title]: What is the difference between Model.train_on_batch from keras and Session.run([train_optimizer]) from tensorflow?
[Posted]: 2018-11-20 15:19:08
[Question]:

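In the following Keras and Tensorflow implementations of training a neural network, how does model.train_on_batch([x], [y]) in the Keras implementation differ from sess.run([train_optimizer, cross_entropy, accuracy_op], feed_dict=feed_dict) in the Tensorflow implementation? In particular: how can these two lines lead to different computations during training?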

keras_version.py

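# Assumed imports (standalone Keras); input_shape, num_classes, lr,
# batch_size, epochs, x_train and y_train are defined elsewhere.
from keras.layers import Input, Dense
from keras.models import Model
from keras.optimizers import Adam

input_x = Input(shape=input_shape, name="x")
c = Dense(num_classes, activation="softmax")(input_x)

model = Model([input_x], [c])
opt = Adam(lr)
# metrics=['accuracy'] makes train_on_batch return [loss, accuracy],
# which the unpacking in the loop below relies on.
model.compile(loss=['categorical_crossentropy'], optimizer=opt, metrics=['accuracy'])

nb_batchs = len(x_train) // batch_size

for epoch in range(epochs):
    loss = 0.0
    for batch in range(nb_batchs):
        x = x_train[batch*batch_size:(batch+1)*batch_size]
        y = y_train[batch*batch_size:(batch+1)*batch_size]

        # One gradient step on this batch
        loss_batch, acc_batch = model.train_on_batch([x], [y])

        loss += loss_batch
    print(epoch, loss / nb_batchs)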

tensorflow_version.py

# Assumed imports (TF 1.x API with standalone Keras layers); input_shape,
# num_classes, lr, batch_size, epochs, x_train and y_train are defined elsewhere.
import tensorflow as tf
from keras.layers import Input, Dense

input_x = Input(shape=input_shape, name="x")
# No softmax here: the loss below expects raw logits
c = Dense(num_classes)(input_x)

input_y = tf.placeholder(tf.float32, shape=[None, num_classes], name="label")
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=input_y, logits=c, name="xentropy"),
    name="xentropy_mean"
)
train_optimizer = tf.train.AdamOptimizer(learning_rate=lr).minimize(cross_entropy)

nb_batchs = len(x_train) // batch_size

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(epochs):
        loss = 0.0
        acc = 0.0

        for batch in range(nb_batchs):
            x = x_train[batch*batch_size:(batch+1)*batch_size]
            y = y_train[batch*batch_size:(batch+1)*batch_size]

            feed_dict = {input_x: x,
                         input_y: y}
            # One gradient step on this batch, also fetching the batch loss
            _, loss_batch = sess.run([train_optimizer, cross_entropy], feed_dict=feed_dict)

            loss += loss_batch
        print(epoch, loss / nb_batchs)

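Note: this question follows up on Same (?) model converges in Keras but not in Tensorflow, which was considered too broad, but in which I showed exactly why I think these two statements differ in some way and lead to different computations.

[Question discussion]:

Tags: python tensorflow machine-learning keras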

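[Solution 1]:

Yes, the results can differ. That should not be surprising if you know the following in advance:

1. Cross-entropy is implemented differently in Tensorflow and Keras. Tensorflow assumes the input to tf.nn.softmax_cross_entropy_with_logits_v2 is raw, unnormalized logits, while Keras accepts its input as probabilities.
2. The optimizers are implemented differently in Keras and Tensorflow.
3. You may be shuffling the data, so the batches are passed in a different order. This does not matter if you run the model for a long time, but the first few epochs can be completely different. Make sure the same batches are passed to both, then compare the results.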
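
To make the cross-entropy point concrete, here is a minimal sketch in plain Python (no Tensorflow or Keras involved). It assumes the Keras loss renormalizes and clips probabilities to [eps, 1 - eps] with eps = 1e-7 before taking the log, while the Tensorflow op works on logits in a log-sum-exp form. The function names xent_from_logits and xent_from_probs are hypothetical and exist only for this illustration.

```python
import math

EPS = 1e-7  # assumed clipping constant, mirroring Keras' default backend epsilon


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def xent_from_logits(labels, logits):
    """Cross-entropy computed directly from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_z) for y, z in zip(labels, logits))


def xent_from_probs(labels, probs, eps=EPS):
    """Cross-entropy on probabilities, with renormalization and clipping."""
    total = sum(probs)
    clipped = [min(max(p / total, eps), 1.0 - eps) for p in probs]
    return -sum(y * math.log(p) for y, p in zip(labels, clipped))


labels = [0.0, 1.0, 0.0]
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

# Used correctly, the two formulations agree on ordinary inputs.
loss_logits = xent_from_logits(labels, logits)
loss_probs = xent_from_probs(labels, probs)

# Passing probabilities where logits are expected applies softmax twice.
loss_double_softmax = xent_from_logits(labels, probs)

# With a saturated prediction, clipping caps the probability-based loss.
big_logits = [40.0, 0.0, 0.0]
loss_logits_sat = xent_from_logits(labels, big_logits)
loss_probs_sat = xent_from_probs(labels, softmax(big_logits))

print(loss_logits, loss_probs, loss_double_softmax)
print(loss_logits_sat, loss_probs_sat)
```

Under these assumptions the two losses match when each receives the input it expects, so a softmax layer plus a probability-based loss is equivalent to a linear layer plus a logits-based loss on well-behaved inputs. They diverge only when the wrong kind of input is passed, or when a prediction is saturated enough for the clipping to kick in.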

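[Comments]:

Could you elaborate on how the optimizer implementations differ? I have already tried computing and applying my own gradients in the tensorflow version, which did not give better results, but I was still using the optimizer class. Items 1 and 3 are not satisfying answers in this case: for 1, the softmax is handled differently on the tf side than on the keras side, and for 3, the tf model never converges while the keras one always does.
Compare the source code. Also, 1) and 3) are entirely relevant. I wonder what makes you think they are insignificant.
Yes, they are relevant. I meant that they are not relevant in my particular case: for 1, I feed logits to the tf loss computation and probabilities to the keras loss, and for 3, that would explain a single run, but in my case the keras code always converges while the tf code never does. And yes, thank you for the "compare the source code" suggestion. The whole question is about comparing the source code, and that is the point: I am not yet competent enough to understand the differences.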
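
On the optimizer question raised in these comments, one concrete difference I am aware of is the default epsilon: tf.train.AdamOptimizer documents epsilon=1e-08, while Keras 2.x falls back to its backend epsilon of 1e-7 when none is given. The sketch below is a plain-Python, single-parameter Adam step written from the documented update rule, not code taken from either library. adam_step and run are hypothetical helper names. It only shows that the epsilon choice matters when gradients are tiny, and it is not offered as the cause of the non-convergence described above.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (bias correction folded into lr_t)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    param = param - lr_t * m / (math.sqrt(v) + eps)
    return param, m, v


def run(eps, grads, start=1.0):
    """Apply Adam over a sequence of gradients and return the final parameter."""
    p, m, v = start, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        p, m, v = adam_step(p, g, m, v, t, eps=eps)
    return p


# Tiny gradients: epsilon is comparable to sqrt(v), so the step size changes a lot.
tiny_grads = [1e-7] * 100
p_tf_like = run(1e-8, tiny_grads)     # assumed Tensorflow-style default
p_keras_like = run(1e-7, tiny_grads)  # assumed Keras-style default

# Ordinary gradients: epsilon is negligible next to sqrt(v).
unit_grads = [1.0] * 100
q_tf_like = run(1e-8, unit_grads)
q_keras_like = run(1e-7, unit_grads)

print(p_tf_like, p_keras_like)
print(q_tf_like, q_keras_like)
```

If epsilon is suspected, passing the same value explicitly to both optimizers removes it as a variable before comparing anything else.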