Is there an efficient way of concatenating scipy.sparse matrices? Answers

【Question title】: Is there an efficient way of concatenating scipy.sparse matrices?
【Posted】: 2011-10-14 06:31:38
【Question】:

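I'm working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need an efficient, flexible way to concatenate matrices in order to construct a stochastic matrix from separate parts.

Right now I'm concatenating four matrices in the following way, but it is extremely inefficient. Is there a better way to do this that doesn't involve converting to a dense matrix?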

rmat[0:m1.shape[0],0:m1.shape[1]] = m1
rmat[m1.shape[0]:rmat.shape[0],m1.shape[1]:rmat.shape[1]] = m2
rmat[0:m1.shape[0],m1.shape[1]:rmat.shape[1]] = bridge
rmat[m1.shape[0]:rmat.shape[0],0:m1.shape[1]] = bridge.transpose()

【Question comments】:

Tags: python concatenation scipy sparse-matrix

【Answer 1】:

The scipy.sparse library now has hstack and vstack for concatenating matrices horizontally and vertically, respectively.
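As a minimal sketch of how this applies to the four-block layout in the question: the small random matrices below are hypothetical stand-ins for the question's m1, m2 and bridge, and the variable names top and bottom are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical small stand-ins for the question's m1, m2 and bridge.
m1 = sp.random(4, 4, density=0.5, format="csr", random_state=0)
m2 = sp.random(3, 3, density=0.5, format="csr", random_state=1)
bridge = sp.random(4, 3, density=0.5, format="csr", random_state=2)

# Build each block row with hstack, then stack the two rows with vstack.
top = sp.hstack([m1, bridge])
bottom = sp.hstack([bridge.transpose(), m2])
rmat = sp.vstack([top, bottom])

print(rmat.shape)  # (7, 7)
```

No dense intermediate is created here; only the final check with toarray() in a test or debugger would densify the result.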

【Comments】:

Make sure you use scipy.sparse.hstack rather than numpy.hstack.

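【Answer 2】:

Using hstack, vstack, or concatenate is much slower than concatenating the internal data arrays directly. The reason is that hstack/vstack convert the sparse matrices to coo format, which can be very slow when the matrices are very large and not already in coo format. Here is code for concatenating csc matrices by columns; a similar approach can be used for csr matrices: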

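import numpy as np
from scipy.sparse import csc_matrix

def concatenate_csc_matrices_by_columns(matrix1, matrix2):
    # Both inputs must be csc_matrix objects with the same number of rows.
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # Shift matrix2's column pointers past matrix1's entries and drop its leading 0.
    new_ind_ptr = matrix2.indptr + len(matrix1.data)
    new_ind_ptr = new_ind_ptr[1:]
    new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))

    # No shape is passed, so the row count is inferred from the largest row index.
    return csc_matrix((new_data, new_indices, new_ind_ptr))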

【Comments】:

I was just looking for a fast way to append new rows to a CSR matrix. This is exactly what I needed. Thanks @amos.
If you use this approach, you need to specify the shape in 'return csc_matrix((new_data, new_indices, new_ind_ptr))', i.e.: 'return csc_matrix((new_data, new_indices, new_ind_ptr), shape=(matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1]))'
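What is the code for csr matrices? Is the native scipy implementation really faster now? I have to concatenate four sub-matrices (top-left, top-right, bottom-left, bottom-right) and I am not happy with the result. Even though I only need to compute the top-right and bottom-left blocks, recomputing the whole matrix takes less time. So this slowness basically makes tabulation useless in my case. It annoys me, because I would expect that if the matrices and operations were implemented optimally, you would only need to change a few pointers in C.
Though I'm not sure whether the index pointers are stored in a list or an array in C. If it is a list, wouldn't you only need to reset one pointer at the end of the list? As it stands, the larger the matrix, the longer the stacking takes...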
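For the csr case asked about in the comments, a row-wise analogue might look like the sketch below. The function name concatenate_csr_matrices_by_rows is hypothetical (it is not part of scipy), it assumes both inputs are csr_matrix objects with the same number of columns, and it passes an explicit shape as the earlier comment recommends.

```python
import numpy as np
from scipy.sparse import csr_matrix

def concatenate_csr_matrices_by_rows(matrix1, matrix2):
    """Stack two csr_matrix objects vertically (hypothetical helper)."""
    if matrix1.shape[1] != matrix2.shape[1]:
        raise ValueError("column counts differ")
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # Shift matrix2's row pointers past matrix1's entries; drop its leading 0.
    shifted = matrix2.indptr[1:] + matrix1.indptr[-1]
    new_indptr = np.concatenate((matrix1.indptr, shifted))
    shape = (matrix1.shape[0] + matrix2.shape[0], matrix1.shape[1])
    return csr_matrix((new_data, new_indices, new_indptr), shape=shape)

a = csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))
b = csr_matrix(np.array([[0, 4, 0]]))
stacked = concatenate_csr_matrices_by_rows(a, b)
print(stacked.toarray())
```

Note that the three arrays are still copied by np.concatenate, so the cost grows with the number of stored entries; this avoids the format conversion, not the copy.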

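【Answer 3】:

Okay, I found the answer. Using scipy.sparse.coo_matrix is much faster than using lil_matrix. I converted the matrices to coo (painless and fast) and then concatenated the data, rows and columns after adding the right offsets.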

import numpy as np
import scipy.sparse

# m1S, bridgeS, bridgeTS and m2S are the four blocks, already converted to coo format.
data = np.concatenate((m1S.data, bridgeS.data, bridgeTS.data, m2S.data))
rows = np.concatenate((m1S.row, bridgeS.row, bridgeTS.row + m1S.shape[0], m2S.row + m1S.shape[0]))
cols = np.concatenate((m1S.col, bridgeS.col + m1S.shape[1], bridgeTS.col, m2S.col + m1S.shape[1]))

scipy.sparse.coo_matrix((data, (rows, cols)), shape=(m1S.shape[0] + m2S.shape[0], m1S.shape[1] + m2S.shape[1]))

【Comments】:

Thanks for coming back and explaining how you made this fast. I needed it for my NLP class.

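【Answer 4】:

Amos's answer is no longer necessary. Scipy now does something similar internally if the input matrices are in csr or csc format and the desired output format is set to None or to the same format as the inputs. Vertically stacking csr matrices with scipy.sparse.vstack, or horizontally stacking csc matrices with scipy.sparse.hstack, is efficient.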
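A small sketch of the two cases described here, using tiny made-up matrices. It assumes a scipy version recent enough to include the internal fast path this answer refers to; it only inspects the result format and does not measure speed.

```python
import numpy as np
import scipy.sparse as sp

# Vertical stacking of CSR inputs; format is left at its default (None).
a = sp.csr_matrix(np.array([[1, 0], [0, 2]]))
b = sp.csr_matrix(np.array([[0, 3]]))
v = sp.vstack([a, b])

# Horizontal stacking of CSC inputs; format is again left at None.
c = sp.csc_matrix(np.array([[1, 0], [0, 2]]))
d = sp.csc_matrix(np.array([[5], [0]]))
h = sp.hstack([c, d])

print(v.format, h.format)
```

If the printed formats match the input formats, the result was produced without requesting a different output format, which is the condition this answer describes.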

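【Comments】:

Which version does "now" refer to? Do you have a reference for this?
The relevant code is in scipy.sparse.bmat, which both vstack and hstack use. This hack was originally added in 2013. It looks like it was first included in scipy 1.0.0.
Actually, I was wrong. It was first included in 0.14.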
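Since scipy.sparse.bmat is the function underlying both vstack and hstack, the 2x2 block layout from the question can also be passed to it directly. The following is only a sketch with small made-up blocks standing in for m1, m2 and bridge.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical small stand-ins for the question's m1, m2 and bridge.
m1 = sp.identity(2, format="csr")
m2 = sp.identity(3, format="csr") * 2
bridge = sp.csr_matrix(np.array([[0, 1, 0], [1, 0, 0]]))

# One call assembles the full block matrix from its four parts.
rmat = sp.bmat([[m1, bridge], [bridge.transpose(), m2]], format="csr")
print(rmat.toarray())
```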