Tensorflow gradient with respect to a matrix: answers

【Question title】: Tensorflow gradient with respect to matrix
【Posted】: 2018-07-30 09:21:18
【Question】:

For context, I'm trying to implement a gradient descent algorithm using Tensorflow.

I have a matrix X

[ x1 x2 x3 x4 ]
[ x5 x6 x7 x8 ]

which I multiply by some feature vector Y to get Z

      [ y1 ]
Z = X [ y2 ]  = [ z1 ]
      [ y3 ]    [ z2 ]
      [ y4 ]

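I then pass Z through a softmax function and take the log. I'll refer to the output matrix as W.

All of this is implemented as follows (with a little boilerplate added so that it runs)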

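import tensorflow as tf  # TensorFlow 1.x graph-mode API

sess = tf.Session()
num_features = 4
num_actions = 2

# X: the (num_actions, num_features) parameter matrix
policy_matrix = tf.get_variable("params", (num_actions, num_features))
# Y: the feature vector, fed in as a column
state_ph = tf.placeholder("float", (num_features, 1))
# Z = X Y
action_linear = tf.matmul(policy_matrix, state_ph)
# W = log(softmax(Z)), with the softmax taken over the action axis
action_probs = tf.nn.softmax(action_linear, axis=0)
action_problogs = tf.log(action_probs)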

W (which corresponds to action_problogs) looks like

[ w1 ]
[ w2 ]

I'd like to find the gradient of w1 with respect to the matrix X. That is, I'd like to calculate

          [ d/dx1 w1 ]
d/dX w1 =      .
               .
          [ d/dx8 w1 ]

(preferably still shaped like a matrix so that I can add it to X, but I'm not too concerned about that)
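For reference, this gradient has a closed form. Writing p = softmax(Z), we have w_i = z_i - log(sum_k exp(z_k)), so the derivative of w_i with respect to the entry of X in row k, column j is (delta_ik - p_k) * y_j. In other words, d/dX w_i is the outer product (e_i - p) Y^T, which has the same (2, 4) shape as X and covers all eight entries. The plain-Python sketch below illustrates this; the helper names (`matvec`, `softmax`, `log_softmax`, `log_softmax_grad`) and the sample values are made up for illustration and are not part of TensorFlow.

```python
import math

def matvec(X, y):
    """Multiply matrix X (a list of rows) by vector y."""
    return [sum(a * b for a, b in zip(row, y)) for row in X]

def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(X, y):
    """W = log(softmax(X y)), returned as a list with one entry per row of X."""
    return [math.log(p) for p in softmax(matvec(X, y))]

def log_softmax_grad(X, y, i):
    """Gradient of W[i] with respect to X: the outer product (e_i - p) y^T."""
    p = softmax(matvec(X, y))
    return [[((1.0 if k == i else 0.0) - p[k]) * yj for yj in y]
            for k in range(len(X))]

# Arbitrary sample values for X (2 x 4) and Y (length 4).
X = [[0.1, -0.2, 0.3, 0.4],
     [0.5, 0.6, -0.7, 0.8]]
Y = [1.0, 2.0, 3.0, 4.0]

grad_w1 = log_softmax_grad(X, Y, 0)  # 2 x 4: one entry per element of X
```

Because the result has the shape of X, it can be added to X directly, as a gradient step would require.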

I was hoping that tf.gradients would do the trick. I calculated the "gradient" like so

problog_gradient = tf.gradients(action_problogs, policy_matrix)

However, when I inspect problog_gradient, here's what I get

[<tf.Tensor 'foo_4/gradients/foo_4/MatMul_grad/MatMul:0' shape=(2, 4) dtype=float32>]

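Note that this has exactly the same shape as X, but it really shouldn't. I was hoping to get a list of two gradients, each with respect to 8 elements. I suspect that I'm instead getting two gradients, each with respect to four elements.

I'm very new to tensorflow, so I'd appreciate an explanation of what's going on and how I might achieve the behavior I want.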

【Comments】:

Tags: python matrix tensorflow gradient-descent reinforcement-learning

【Solution 1】:

tf.gradients actually sums the ys and computes the gradient of that sum, which is why you are seeing this.
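Applied to the question's example, this means the single (2, 4) result is the gradient of w1 + w2, which equals the elementwise sum of the two per-output gradients. The sketch below illustrates that relationship numerically in plain Python using central finite differences. It does not call TensorFlow, and the helper names (`log_probs`, `numeric_grad`) and sample values are hypothetical. In TensorFlow itself, passing a single scalar entry, for example `tf.gradients(action_problogs[0, 0], policy_matrix)`, should give the gradient of w1 alone.

```python
import math

def log_probs(X, y):
    """W = log(softmax(X y)) for a list-of-rows matrix X and a vector y."""
    z = [sum(a * b for a, b in zip(row, y)) for row in X]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_norm for v in z]

def numeric_grad(f, X, eps=1e-6):
    """Central-difference gradient of the scalar function f with respect to X."""
    grad = [[0.0] * len(X[0]) for _ in X]
    for k in range(len(X)):
        for j in range(len(X[0])):
            up = [row[:] for row in X]
            down = [row[:] for row in X]
            up[k][j] += eps
            down[k][j] -= eps
            grad[k][j] = (f(up) - f(down)) / (2 * eps)
    return grad

# Arbitrary sample values for X (2 x 4) and Y (length 4).
X = [[0.1, -0.2, 0.3, 0.4],
     [0.5, 0.6, -0.7, 0.8]]
Y = [1.0, 2.0, 3.0, 4.0]

# Per-output gradients: what the question is asking for.
grad_w1 = numeric_grad(lambda M: log_probs(M, Y)[0], X)
grad_w2 = numeric_grad(lambda M: log_probs(M, Y)[1], X)

# Gradient of the summed outputs: a single (2, 4) array.
grad_sum = numeric_grad(lambda M: sum(log_probs(M, Y)), X)
```

Each of the three results has the shape of X, so the shape of the output alone cannot tell you whether the outputs were summed before differentiation.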

【Discussion】:

【Solution 2】:

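A gradient is defined for a scalar function, so by default tf.gradients sums up the entries. That is the default behavior simply because all of the gradient descent algorithms need that type of functionality, and stochastic gradient descent (or one of its variants) is the preferred method in Tensorflow. You won't find any of the more advanced algorithms (like BFGS) because they simply haven't been implemented yet (and they would require a true Jacobian, which hasn't been implemented either). For what it's worth, here is a working Jacobian implementation that I wrote: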

import tensorflow as tf  # written against the TensorFlow 1.x graph API

def map(f, x, dtype=None, parallel_iterations=10):
    '''
    Apply f to each of the elements in x using the specified number of parallel iterations.

    Important points:
    1. By "elements in x", we mean that we will be applying f to x[0],...x[tf.shape(x)[0]-1].
    2. The output size of f(x[i]) can be arbitrary. However, if the dtype of that output
       is different than the dtype of x, then you need to specify that as an additional argument.
    '''
    if dtype is None:
        dtype = x.dtype

    n = tf.shape(x)[0]
    loop_vars = [
        tf.constant(0, n.dtype),
        tf.TensorArray(dtype, size=n),
    ]
    _, fx = tf.while_loop(
        lambda j, _: j < n,
        lambda j, result: (j + 1, result.write(j, f(x[j]))),
        loop_vars,
        parallel_iterations=parallel_iterations
    )
    return fx.stack()

def jacobian(fx, x, parallel_iterations=10):
    '''
    Given a tensor fx, which is a function of x, vectorize fx (via tf.reshape(fx, [-1])),
    and then compute the jacobian of each entry of fx with respect to x.
    Specifically, if x has shape (m,n,...,p), and fx has L entries (tf.size(fx)=L), then
    the output will have shape (L,m,n,...,p), where output[i] has shape (m,n,...,p) and each of its
    entries is the gradient of the i-th entry of the flattened fx wrt the corresponding element of x.
    '''
    return map(lambda fxi: tf.gradients(fxi, x)[0],
               tf.reshape(fx, [-1]),
               dtype=x.dtype,
               parallel_iterations=parallel_iterations)

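While this implementation works, it does not work when you try to nest it. For instance, if you try to compute the Hessian using jacobian( jacobian( ... )), you get some strange errors. This is being tracked as Issue 675. I am still awaiting a response on why this throws an error. I believe there is a deep-seated bug in either the while-loop implementation or the gradient implementation, but I really have no idea.

Anyway, if you just need a Jacobian, try the code above.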

【Discussion】: