Using sparse matrices with Keras and Tensorflow

【Question title】: Using sparse matrices with Keras and Tensorflow
【Posted】: 2017-05-23 03:55:37
【Question】:

My data can be viewed as a matrix with 10B entries (100M x 100), and it is very sparse (

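My first idea was to expand the data to dense form, that is, to write all 10B entries out to a series of CSVs in which most entries are zero. This quickly overwhelmed my resources, however (even the ETL step overwhelmed pandas and made postgres struggle). So I need to use true sparse matrices.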
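For a sense of scale, here is a rough back-of-the-envelope sketch of the dense versus sparse memory cost. The float32 dtype, the 1% density, and the 10,000-row stand-in matrix are assumptions chosen for illustration only; the actual sparsity of the data is not given above.

```python
import numpy as np
import scipy.sparse as sp

# Dense cost of the full 100M x 100 matrix, computed arithmetically
# (nothing of that size is allocated here). float32 is an assumption.
n_rows, n_cols = 100_000_000, 100
dense_bytes = n_rows * n_cols * np.dtype(np.float32).itemsize
print(f"dense float32: {dense_bytes / 1e9:.0f} GB")  # dense float32: 40 GB

# Small stand-in with a hypothetical 1% density, stored as CSR.
small = sp.random(10_000, n_cols, density=0.01, format="csr", dtype=np.float32)
csr_bytes = small.data.nbytes + small.indices.nbytes + small.indptr.nbytes
small_dense_bytes = small.shape[0] * small.shape[1] * small.dtype.itemsize
print(f"stand-in: CSR {csr_bytes} bytes vs dense {small_dense_bytes} bytes")
```

CSR stores one value and one column index per non-zero entry plus one row pointer per row, so the saving depends directly on how sparse the data really is.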

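How can I do this with Keras (and Tensorflow)? numpy does not support sparse matrices, but scipy and tensorflow both do. There is a lot of discussion about this idea (e.g. https://github.com/fchollet/keras/pull/1886 https://github.com/fchollet/keras/pull/3695/files https://github.com/pplonski/keras-sparse-check https://groups.google.com/forum/#!topic/keras-users/odsQBcNCdZg ), using either scipy's sparse matrices or going directly to Tensorflow's sparse tensors. But I cannot find a clear conclusion, and I have not been able to get anything to work (or even to know clearly which way to go!).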

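How can I do this?

I believe there are two possible approaches:

1. Keep the data as a scipy sparse matrix, and make each minibatch dense when handing it to Keras
2. Keep it sparse all the way through, and use Tensorflow sparse tensors

I also think #2 is preferable, because you would get better performance all the way through (I believe), but #1 is probably easier and would be adequate. I would be happy with either.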
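As a minimal sketch of what approach #2 involves at the data level: a Tensorflow sparse tensor is built from an (indices, values, dense_shape) triple, which can be derived from a scipy matrix. The helper names `coo_components` and `to_sparse_tensor` are hypothetical, introduced here for illustration; only `tf.SparseTensor` is a Tensorflow API, and it is imported lazily so the rest runs without Tensorflow installed.

```python
import numpy as np
import scipy.sparse as sp

def coo_components(batch):
    """Return (indices, values, dense_shape) for a scipy sparse batch."""
    coo = batch.tocoo()
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    # Sort row-major (by row, then column), the canonical ordering
    # that Tensorflow sparse ops expect.
    order = np.lexsort((indices[:, 1], indices[:, 0]))
    dense_shape = np.array(coo.shape, dtype=np.int64)
    return indices[order], coo.data[order], dense_shape

def to_sparse_tensor(batch):
    """Hypothetical wrapper: build a tf.SparseTensor from a scipy batch."""
    import tensorflow as tf  # lazy import; only needed when this is called
    indices, values, dense_shape = coo_components(batch)
    return tf.SparseTensor(indices=indices, values=values,
                           dense_shape=dense_shape)

# Tiny example: a 2 x 3 matrix with three non-zero entries.
X = sp.csr_matrix(np.array([[0, 2, 0], [1, 0, 3]], dtype=np.float32))
indices, values, dense_shape = coo_components(X[0:2])
```

Slicing rows out of the scipy CSR matrix first and converting only that slice keeps the per-batch conversion cost proportional to the number of non-zeros in the batch.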

How can this be achieved?

【Comments on the question】:

Tags: tensorflow sparse-matrix keras

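【Solution 1】:

Sorry, I don't have the reputation to comment, but I think you should take a look at the answers here: Keras, sparse matrix issue. I have tried it and it works correctly. Just one note, though: at least in my case, shuffling led to really bad results, so I used this slightly modified non-shuffled alternative: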

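import numpy as np

def nn_batch_generator(X_data, y_data, batch_size):
    # Yields dense mini-batches, in order (no shuffling), from a scipy
    # sparse matrix that supports row indexing (e.g. CSR).
    n_samples = X_data.shape[0]
    # Round up so the last, possibly smaller, batch is included
    number_of_batches = int(np.ceil(n_samples / batch_size))
    counter = 0
    index = np.arange(n_samples)
    while True:
        index_batch = index[batch_size*counter:batch_size*(counter+1)]
        # Only the current batch is densified; toarray() returns an ndarray
        X_batch = X_data[index_batch, :].toarray()
        y_batch = y_data[index_batch]
        counter += 1
        yield X_batch, y_batch
        # Wrap around before an empty batch could be produced
        if counter >= number_of_batches:
            counter = 0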

It produces accuracies comparable to those achieved by keras's shuffled implementation (setting shuffle=True in fit).

【Comments】:
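For comparison, a shuffled variant can be sketched as follows. This is not the code from the linked answer; it is a minimal sketch that draws a fresh permutation at the start of every epoch, and the name `shuffled_batch_generator` and its `seed` parameter are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp

def shuffled_batch_generator(X_data, y_data, batch_size, seed=0):
    """Yield dense mini-batches from a CSR matrix in a new random order each epoch."""
    rng = np.random.default_rng(seed)
    n_samples = X_data.shape[0]
    while True:
        order = rng.permutation(n_samples)  # fresh order for every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Only the rows of this batch are densified
            yield X_data[idx, :].toarray(), y_data[idx]

# Tiny demonstration: 10 samples, batch size 4, so one epoch is 3 batches.
X = sp.random(10, 4, density=0.5, format="csr")
y = np.arange(10)
gen = shuffled_batch_generator(X, y, batch_size=4)
epoch = [next(gen) for _ in range(3)]
batch_shapes = [xb.shape for xb, _ in epoch]
seen_labels = np.sort(np.concatenate([yb for _, yb in epoch]))
```

Because the same index array selects the rows of X and the entries of y, features and labels stay aligned within each batch, and every sample is visited exactly once per epoch.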

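【Solution 2】:

This answer addresses the second approach mentioned in the question. If you write a custom training loop, you can use sparse matrices as inputs to a Keras model with the Tensorflow backend. In the example below, the model takes a sparse matrix as input and outputs a dense matrix.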

from keras.layers import Dense, Input
from keras.models import Model
import scipy.sparse  # a bare "import scipy" does not guarantee that scipy.sparse is loaded
import numpy as np

trainX = scipy.sparse.random(1024, 1024)
trainY = np.random.rand(1024, 1024)

inputs = Input(shape=(trainX.shape[1],), sparse=True)
outputs = Dense(trainY.shape[1], activation='softmax')(inputs)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

steps = 10
for i in range(steps):
  # For simplicity, we directly use trainX and trainY in this example
  # Usually, this is where batches are prepared
  print(model.train_on_batch(trainX, trainY))
# [3549.2546, 0.0]
# ...
# [3545.6448, 0.0009765625]

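However, the usefulness of this approach depends on whether your model needs to densify the sparse matrix. Indeed, the model above has one layer that turns the sparse matrix into a dense one. This can be a problem if your sparse matrix does not fit in memory.

【Comments】: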
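To illustrate where the dense data appears, here is a NumPy/SciPy analogy of a single dense layer applied to a sparse input. It is only a sketch of the arithmetic, not what Keras executes internally; the 1% density, the layer width of 8, and the weight values are assumptions for illustration.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Sparse input batch (hypothetical 1% density) and dense layer parameters.
X = sp.random(1024, 1024, density=0.01, format="csr")
W = rng.standard_normal((1024, 8))  # hypothetical weights for 8 output units
b = np.zeros(8)

# Sparse-dense product: X itself is never expanded to a dense array,
# but the result is a dense ndarray of shape (rows, units).
Z = X @ W + b

print(type(Z).__name__, Z.shape)
```

In this sketch the dense memory cost is set by the number of rows in the batch times the layer width, which is one reason to feed the model in batches rather than passing the whole matrix at once.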

Attempt with an unsupported type ()
TypeError: 'SparseTensor' object is not subscriptable