TensorFlow: completely different results from the same initial guess

[Question title]: TensorFlow: completely different results from the same initial guess
[Posted]: 2019-06-27 18:39:38
[Question]:

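I am new to TensorFlow. I built a simple multivariate regression using the plain gradient descent optimizer. However, even with the same initial guess, I get completely different results depending on which of two variable definitions I use.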

What is the difference between these two computations?

When I define the variable directly:

tau = tf.Variable([0.25, 0.25, 0.25, 0.25], name='parameter', dtype=tf.float64)
tau = tf.clip_by_value(tau, 0.1, 5.)

I get the following result after 10000 epochs.

tau= [0.28396885 0.24675105 0.26584612 1.37071573]

However, when I define it through normalized values:

tau_norm = tf.Variable([0.025, 0.025, 0.025, 0.025], name='parameter', dtype=tf.float64)
tau_norm = tf.clip_by_value(tau_norm, 0.01, 0.5)
tau_max = 10
tau = tau_norm*tau_max

after the same 10000 epochs I get a completely different result:

tau= [ nan 0.22451382 2.70862284 1.46199275]

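I expected the two computations to give the same (or sufficiently similar) results, because the initial guess is the same. That is not what I see, and I would like to know what causes the difference.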

I am using tensorflow-gpu 1.14.0, but the GPU is not used for this computation:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="-1"

Update

OK, let me explain with an example adapted from the code here. I think what I am seeing is essentially the same as the following.

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
import numpy as np

x = tf.placeholder("float")
y = tf.placeholder("float")
w = tf.Variable([1.0, 2.0], name="w")

y_model = tf.multiply(x, w[0]) + w[1]
error = tf.square(y - y_model)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    print("Initial guess: ", session.run(w))
    np.random.seed(seed=100)
    for i in range(1000):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        session.run(train_op, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))

This code gives me Predicted model: 2.221x + 5.882. However, when I replace w with

w_norm = tf.Variable([0.5, 1.0], name = 'w_norm')
w = w_norm*2.0

the result is Predicted model: 2.004x + 5.998, even though it has the same initial guess ([1. 2.]) and the same number of epochs. I do not know what causes this difference.

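[Question comments]:

Hi 崇光, welcome to Stack Overflow! To make your question easier to answer, it would help if you provided a minimal reproducible example. If you do, you will probably get an answer quickly :)
Thank you for the comment! I will add an example.

Tags: python tensorflow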

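[Answer 1]:

The difference arises because GradientDescentOptimizer.minimize optimizes with respect to the tf.Variable objects, so gradient descent is not being applied to the same equation in the two cases.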

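In the first case you minimize the error (y - (x*w[0] + w[1]))^2 with respect to the parameters w. In the second case you minimize (y - (x*2*w_norm[0] + 2*w_norm[1]))^2 with respect to w_norm.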
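The effect of the rescaling on a single update can be written out with the chain rule. This is a sketch rather than part of the original answer; the symbols $\eta$ (learning rate), $E$ (the squared error) and $v$ (standing in for w_norm) are notation introduced here for illustration.

```latex
\begin{align*}
E &= \bigl(y - (x\,w_0 + w_1)\bigr)^2 \\[4pt]
\text{direct:}\quad w &\leftarrow w - \eta\,\frac{\partial E}{\partial w} \\[4pt]
\text{scaled, } w = 2v:\quad
\frac{\partial E}{\partial v} &= 2\,\frac{\partial E}{\partial w},
\qquad v \leftarrow v - 2\eta\,\frac{\partial E}{\partial w} \\[4pt]
\Rightarrow\quad w = 2v &\leftarrow w - 4\eta\,\frac{\partial E}{\partial w}
\end{align*}
```

Under these assumptions, training w_norm with rate $\eta$ moves w exactly as training w directly with rate $4\eta$ would.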

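If you change the learning rate in your code, both versions end up with the same result. The scale factor 2 enters the update of w twice: once in the gradient with respect to w_norm, and once more when the change in w_norm is scaled back to w. The effective learning rate on w is therefore 4 times larger in the normalized version. If you use 0.04 instead of 0.01 as the rate in the direct version, train_op = tf.train.GradientDescentOptimizer(0.04).minimize(error), you should get the same result.

So: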

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
import numpy as np

x = tf.placeholder("float")
y = tf.placeholder("float")
w = tf.Variable([1.0, 2.0], name="w")

y_model = tf.multiply(x, w[0]) + w[1]
error = tf.square(y - y_model)
train_op = tf.train.GradientDescentOptimizer(0.04).minimize(error)

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    print("Initial guess: ", session.run(w))
    np.random.seed(seed=100)
    for i in range(1000):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        session.run(train_op, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))

prints the same result as

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
import numpy as np

x = tf.placeholder("float")
y = tf.placeholder("float")
w_norm = tf.Variable([0.5, 1.0], name = 'w_norm')
w = w_norm*2.0

y_model = tf.multiply(x, w[0]) + w[1]
error = tf.square(y - y_model)
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(error)

model = tf.global_variables_initializer()

with tf.Session() as session:
    session.run(model)
    print("Initial guess: ", session.run(w))
    np.random.seed(seed=100)
    for i in range(1000):
        x_value = np.random.rand()
        y_value = x_value * 2 + 6
        session.run(train_op, feed_dict={x: x_value, y: y_value})

    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}".format(a=w_value[0], b=w_value[1]))
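The equivalence of the two scripts above can also be checked without TensorFlow. The following is a minimal NumPy sketch, not the answer's own code: the helper `sgd_fit` and its `scale` parameter are hypothetical names introduced here, and the gradient is written out by hand for the same squared error and the same random inputs (seed 100). It uses float64, so the numbers need not match the float32 TensorFlow output digit for digit.

```python
import numpy as np

def sgd_fit(lr, scale, steps=1000, seed=100):
    """Plain SGD on (y - (w[0]*x + w[1]))**2, where w = scale * v
    and v is the variable that is actually updated (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    v = np.array([1.0, 2.0]) / scale  # initial w is [1, 2] in every case
    for _ in range(steps):
        x = rng.rand()
        y = 2.0 * x + 6.0
        w = scale * v
        residual = y - (w[0] * x + w[1])
        grad_w = -2.0 * residual * np.array([x, 1.0])  # d(error)/dw
        grad_v = scale * grad_w                        # chain rule: d(error)/dv
        v = v - lr * grad_v                            # the step is taken on v, not w
    return scale * v

w_direct = sgd_fit(lr=0.04, scale=1.0)  # w is the trained variable
w_scaled = sgd_fit(lr=0.01, scale=2.0)  # w = 2 * w_norm, w_norm is trained
w_slow = sgd_fit(lr=0.01, scale=1.0)    # the question's first script
print("direct, lr=0.04:", w_direct)
print("scaled, lr=0.01:", w_scaled)
print("direct, lr=0.01:", w_slow)
```

The first two runs should agree with each other, while the third is still some distance from the target [2, 6] after the same number of steps.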

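[Comments]:

Oh, I see. I had somewhat misunderstood the algorithm. Basically, the learning rate is multiplied by the gradient of the loss function with respect to the variables defined with tf.Variable, and the product is subtracted from the current variables at each epoch. So if you normalize the variables by N, the learning rate needs to be N^2 times smaller to get the same result. Thank you for your answer!
In this particular case, yes, the learning rate multiplies the gradient of the loss function. The thing to remember is that, as you pointed out, gradient descent is applied to the tf.Variable. Having an intermediate variable x = 2*x_norm is like having a different model. Glad I could help, cheers :)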
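The N^2 rule from the comments can be illustrated with a single hand-computed gradient step. This is a sketch under assumptions: `effective_step` is a hypothetical helper, and the sample point (x = 0.5, y = 7.0) is arbitrary. It compares how far w moves when the trained variable is w itself (N = 1) and when it is w / N with N = 10, the value of tau_max in the question's first example.

```python
import numpy as np

def effective_step(n, lr, w, x, y):
    """Change in w after one gradient step on v, where w = n * v
    (hypothetical helper, squared error as in the example above)."""
    v = w / n
    residual = y - (w[0] * x + w[1])
    grad_w = -2.0 * residual * np.array([x, 1.0])  # d(error)/dw
    grad_v = n * grad_w                            # chain rule through w = n * v
    v_new = v - lr * grad_v
    return n * v_new - w                           # resulting change in w

w0 = np.array([1.0, 2.0])
step_direct = effective_step(1.0, 0.01, w0, x=0.5, y=7.0)
step_n10 = effective_step(10.0, 0.01, w0, x=0.5, y=7.0)
ratio = step_n10 / step_direct
print(ratio)  # each component is the factor by which the step on w grew
```

If the same reasoning carries over to the question's tau = tau_norm * tau_max setup, an unchanged learning rate would act on tau as if it were about 100 times larger, which might be related to the nan reported there. The full model is not shown in the question, so this remains a guess.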