TensorFlow dense gradient explanation? Answers

【Question title】: TensorFlow dense gradient explanation?
【Posted】: 2016-06-23 21:01:49
【Question description】:

I recently implemented a model, and when I ran it I received this warning:

UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. 
This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

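With some otherwise similar parameter settings (embedding dimensionalities), the model has suddenly become ridiculously slow.

What does this warning imply? It appears that something I've done has caused all of the gradients to be dense, so backprop is doing dense matrix computations.
If an issue with the model is causing this, how can I identify and fix it?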

【Discussion】:

Tags: tensorflow

【Solution 1】:

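This warning is printed when a sparse tf.IndexedSlices object is implicitly converted to a dense tf.Tensor. This typically happens when one op (usually tf.gather()) backpropagates a sparse gradient, but the op that receives it does not have a specialized gradient function that can handle sparse gradients. As a result, TensorFlow automatically densifies the tf.IndexedSlices, which can have a devastating effect on performance if the tensor is large.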

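To fix this problem, you should try to ensure that the params input to tf.gather() (or the params inputs to tf.nn.embedding_lookup()) is a tf.Variable. Variables can receive sparse updates directly, so no conversion is needed. Although tf.gather() (and tf.nn.embedding_lookup()) accept arbitrary tensors as inputs, this may lead to a more complicated backpropagation graph, resulting in implicit conversion.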
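
To make the cost of densification concrete, here is a minimal sketch in plain Python rather than TensorFlow. It models the gradient of a row lookup as (row, gradient) pairs and compares updating only the looked-up rows with first expanding the gradient to the full table size. All function names, the table size, and the learning rate are made up for illustration; this is not how TensorFlow implements it.

```python
# Plain-Python model of a sparse vs. densified gradient for a row lookup.
# All names here are hypothetical; this is not TensorFlow's implementation.

def gather_grad(indices, upstream):
    """Gradient of a row lookup w.r.t. the table, kept as (row, grad) pairs."""
    return list(zip(indices, upstream))

def densify(sparse_grad, num_rows, dim):
    """Expand the sparse gradient to a full num_rows x dim matrix of floats."""
    dense = [[0.0] * dim for _ in range(num_rows)]
    for row, grad in sparse_grad:
        for j, g in enumerate(grad):
            dense[row][j] += g  # repeated indices accumulate
    return dense

def apply_sparse(table, sparse_grad, lr):
    """Update only the rows that were looked up."""
    for row, grad in sparse_grad:
        for j, g in enumerate(grad):
            table[row][j] -= lr * g

def apply_dense(table, dense_grad, lr):
    """Update every row, including the ones whose gradient is all zeros."""
    for row, grad in zip(table, dense_grad):
        for j, g in enumerate(grad):
            row[j] -= lr * g

num_rows, dim, lr = 1000, 4, 0.5
indices = [3, 7, 3]                       # row 3 is looked up twice
upstream = [[1.0] * dim for _ in indices]

sparse_grad = gather_grad(indices, upstream)
dense_grad = densify(sparse_grad, num_rows, dim)

table_a = [[1.0] * dim for _ in range(num_rows)]
table_b = [[1.0] * dim for _ in range(num_rows)]
apply_sparse(table_a, sparse_grad, lr)
apply_dense(table_b, dense_grad, lr)

sparse_floats = sum(len(g) for _, g in sparse_grad)  # 3 rows * 4 floats
dense_floats = sum(len(r) for r in dense_grad)       # 1000 rows * 4 floats
print(table_a == table_b, sparse_floats, dense_floats)
```

Both updates leave the table in the same state, but the densified path stores and walks every row of the table, which is the overhead the warning is about when the table is large.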

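【Discussion】:

Thanks for the clarification. How can I determine which op is causing this?
The easiest way is to look through your code for calls to tf.gather() or tf.nn.embedding_lookup(), find the tensor t that is the params (first) argument, and print t.op. In general you'll get the best performance if t is a tf.Variable, but some ops (such as tf.concat()) have specializations that make the gradient efficient.
It seems to be a boolean_mask being fed a reshape. This is used for the loss calculation far along in the graph, after multiple reshapes, packs, tiles, expand_dims, squeezes, batch_matmuls, etc. Is there a way to determine which ops can't accept sparse gradients?
I have the same problem. My input to tf.gather is the output of a reshape. How can I convert it to a Variable? Thanks.
I'm also seeing this warning with boolean_mask, but it is only being fed normal variables; nothing is being reshaped.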

【Solution 2】:

A dense tensor can be thought of as a standard Python array. A sparse one can be thought of as a collection of indices and values, e.g.

# dense
array = ['a', None, None, 'c']

# sparse
array = [(0, 'a'), (3, 'c')]

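So, as you can see, if you have a lot of empty entries, a sparse array will be much more efficient than a dense one. But if all entries are filled in, dense is far more efficient. In your case, somewhere in the TensorFlow graph a sparse array is being converted to a dense one of indeterminate size. The warning is just saying that you may waste a lot of memory this way. It might not be a problem at all if the sparse array is not too big or is already quite dense.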
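
As a rough illustration of that trade-off, the sketch below converts between the two list representations used above and counts how many items each one stores. The helper names and the "two items per stored entry" cost model are assumptions made for this example only.

```python
def to_sparse(dense):
    """Keep only the filled entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(dense) if v is not None]

def to_dense(sparse, length):
    """Expand (index, value) pairs into a list of the given length."""
    dense = [None] * length
    for i, v in sparse:
        dense[i] = v
    return dense

def stored_items(sparse):
    """Assumed cost model: each sparse entry stores an index and a value."""
    return 2 * len(sparse)

# Mostly empty: 2 filled entries out of 1000.
mostly_empty = to_dense([(0, 'a'), (999, 'c')], 1000)
# Completely filled: 4 entries out of 4.
full = ['a', 'b', 'c', 'd']

for name, dense in [('mostly_empty', mostly_empty), ('full', full)]:
    print(name, 'dense:', len(dense), 'sparse:', stored_items(to_sparse(dense)))
```

Under this cost model the mostly empty array needs 4 items in sparse form against 1000 in dense form, while the full array needs 8 items in sparse form against 4 in dense form.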

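If you want to diagnose it, I would advise naming your various tensor objects. The warning will then print exactly which ones are involved in this conversion, and you can work out what you might be able to adjust to remove it.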

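【Discussion】:

【Solution 3】:

I totally agree with mrry's answer.

I will post another solution to this problem.

You can use tf.dynamic_partition() instead of tf.gather() to eliminate the warning.

The example code is below: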

# Create the cells for the RNN network
lstm = tf.nn.rnn_cell.BasicLSTMCell(128)

# Get the output and state from dynamic rnn
output, state = tf.nn.dynamic_rnn(lstm, sequence, dtype=tf.float32, sequence_length=seqlen)

# Convert output to a tensor and reshape it
# (tf.pack was renamed tf.stack in TensorFlow 1.0)
outputs = tf.reshape(tf.pack(output), [-1, lstm.output_size])

# Set the number of partitions to 2
num_partitions = 2

# The partitions argument is a tensor which is already fed to a placeholder.
# It is a 1-D tensor with the length of batch_size * max_sequence_length.
# In this partitions tensor, you need to set the last output idx for each seq to 1 and 
# others remain 0, so that the result could be separated to two parts,
# one is the last outputs and the other one is the non-last outputs.
res_out = tf.dynamic_partition(outputs, partitions, num_partitions)

# prediction
preds = tf.matmul(res_out[1], weights) + bias
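
The partitions tensor described in the comments above could be built along these lines. This is a plain-Python sketch, not part of the original answer: the helper names are hypothetical, it assumes the reshaped outputs are laid out batch-major (row b * max_len + t holds step t of sequence b), and it assumes every sequence length is at least 1. The second function only mimics the row-splitting behaviour of tf.dynamic_partition on Python lists.

```python
def last_output_partitions(seq_lengths, max_len):
    """Return a flat 0/1 list of length batch_size * max_len.

    The entry for the last valid step of each sequence is 1, all others 0.
    Assumes batch-major layout and 1 <= length <= max_len for every sequence.
    """
    partitions = [0] * (len(seq_lengths) * max_len)
    for b, length in enumerate(seq_lengths):
        partitions[b * max_len + length - 1] = 1
    return partitions

def partition_rows(data, partitions, num_partitions):
    """List-based stand-in for tf.dynamic_partition: route each row to the
    output list selected by its partition id, preserving order."""
    out = [[] for _ in range(num_partitions)]
    for row, p in zip(data, partitions):
        out[p].append(row)
    return out

# Two sequences padded to max_len = 3, with true lengths 2 and 3.
seq_lengths, max_len = [2, 3], 3
partitions = last_output_partitions(seq_lengths, max_len)

# Labels stand in for the rows of the reshaped RNN output.
rows = ['b0t0', 'b0t1', 'b0t2', 'b1t0', 'b1t1', 'b1t2']
non_last, last = partition_rows(rows, partitions, 2)
print(partitions)
print(last)
```

Here `last` plays the role of res_out[1] in the snippet above: one row per sequence, taken at that sequence's final valid step.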

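Hope this helps.

【Discussion】:

dynamic_partition can be used in place of tf.gather(), but what can be used in place of tf.nn.embedding_lookup()?
I guess this doesn't really solve the problem; it just silences the warning, since it looks like tf.dynamic_partition produces dense gradients?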