Constructing Indptr for SciPy Sparse CSC Matrix

【Question title】: Constructing Indptr for SciPy Sparse CSC Matrix
【Posted】: 2020-05-31 08:43:42
【Question description】:

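I have many lists that each represent a sparse matrix (that is, the columns with non-zero entries), and I need to represent them as SciPy sparse csc_matrix objects. Note, however, that my sparse matrix has only one row, so each list simply points to the columns in that row that have non-zero entries. For example: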

sparse_input = [4, 10, 21]  # My lists are much, much longer but very sparse

This list tells me which columns of my one-row sparse matrix hold non-zero values. This is what the dense matrix would look like:

x = np.array([[0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1]])

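I could use the (data, (row, col)) syntax, but since my lists are very long, the csc_matrix takes a lot of time and memory to build. So I was thinking about using the indptr interface, but I can't figure out how to quickly and automatically build the indptr directly from a given list of non-zero column entries. I tried looking at csc_matrix(x).indptr and found that the indptr looks like: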

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3], dtype=int32)

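I have read the SciPy docs and the Sparse Matrix Wikipedia page, but I can't seem to come up with an efficient way to construct indptr directly from the list of non-zero columns. Given that the sparse matrix has only three non-zero entries, it feels like indptr shouldn't be this long.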

【Question discussion】:

Tags: python arrays numpy scipy sparse-matrix

【Solution 1】:

How about making the matrices and exploring their attributes?

In [144]: from scipy import sparse                                                             
In [145]: x = np.array([[0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1]])                        

In [146]: M = sparse.coo_matrix(x)                                                             
In [147]: M                                                                                    
Out[147]: 
<1x22 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in COOrdinate format>
In [148]: M.row                                                                                
Out[148]: array([0, 0, 0], dtype=int32)
In [149]: M.col                                                                                
Out[149]: array([ 4, 10, 21], dtype=int32)
In [150]: M.data                                                                               
Out[150]: array([1, 1, 1])

csr:

In [152]: Mr = M.tocsr()                                                                       
In [153]: Mr.indptr                                                                            
Out[153]: array([0, 3], dtype=int32)
In [155]: Mr.indices                                                                           
Out[155]: array([ 4, 10, 21], dtype=int32)
In [156]: Mr.data                                                                              
Out[156]: array([1, 1, 1], dtype=int64)

csc:

In [157]: Mc = M.tocsc()                                                                       
In [158]: Mc.indptr                                                                            
Out[158]: 
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3], dtype=int32)
In [159]: Mc.indices                                                                           
Out[159]: array([0, 0, 0], dtype=int32)
In [160]: Mc.data                                                                              
Out[160]: array([1, 1, 1], dtype=int64)
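The csc attributes above suggest how indptr can be built without going through a dense array. In CSC format indptr always has one entry per column plus one, regardless of how few values are stored, and indptr[j] is the number of stored values in columns before j. The following is a minimal sketch of that idea, not a tuned implementation. The helper name csc_indptr_from_columns and the variable n_cols are hypothetical (the column list alone does not say how wide the matrix is), and the column list is assumed to contain no duplicates.

```python
import numpy as np
from scipy import sparse


def csc_indptr_from_columns(cols, n_cols):
    """Build a CSC indptr for a one-row matrix (hypothetical helper).

    cols   : column indices that hold a non-zero value (assumed unique)
    n_cols : total number of columns (assumed known by the caller)
    """
    # Number of stored values in each column (0 or 1 for a one-row matrix).
    counts = np.bincount(cols, minlength=n_cols)
    # indptr[j] = number of stored values in columns 0..j-1.
    indptr = np.zeros(n_cols + 1, dtype=np.int64)
    indptr[1:] = np.cumsum(counts)
    return indptr


cols = np.array([4, 10, 21])
n_cols = 22

indptr = csc_indptr_from_columns(cols, n_cols)
indices = np.zeros(len(cols), dtype=np.int64)  # every value sits in row 0
data = np.ones(len(cols), dtype=np.int64)

Mc = sparse.csc_matrix((data, indices, indptr), shape=(1, n_cols))
print(indptr)  # same values as the Mc.indptr output shown above
```

Because indptr grows with the number of columns rather than the number of stored values, CSC is a comparatively expensive layout for a single very wide row; CSR needs only a two-element indptr for the same data.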

And the direct nonzero on x:

In [161]: np.nonzero(x)                                                                        
Out[161]: (array([0, 0, 0]), array([ 4, 10, 21]))

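For a 1-row matrix like this, I doubt that creating the csr indptr directly will save much time. Most of the work will be in the nonzero step. But feel free to experiment.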
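If the non-zero columns are already available as a list, as in the question, the nonzero step can be skipped altogether. Here is a rough sketch of building the one-row csr matrix straight from that list. The name sparse_input comes from the question, while n_cols is an assumption, since the total width has to be supplied separately.

```python
import numpy as np
from scipy import sparse

sparse_input = [4, 10, 21]  # non-zero columns, as in the question
n_cols = 22                 # assumed total number of columns

# For a single row, the column list is exactly the CSR indices array.
indices = np.asarray(sparse_input, dtype=np.int32)
data = np.ones(len(indices), dtype=np.int64)
# One row only, so indptr is just [start, end] of that row's entries.
indptr = np.array([0, len(indices)], dtype=np.int32)

Mr = sparse.csr_matrix((data, indices, indptr), shape=(1, n_cols))
print(Mr.toarray())
```

If a csc_matrix is required afterwards, Mr.tocsc() performs the conversion, at the cost of allocating the longer indptr.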

===

Some timings

In [162]: timeit sparse.coo_matrix(x)                                                          
95.8 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [163]: timeit sparse.csr_matrix(x)                                                          
335 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [164]: timeit M.tocsr()                                                                     
115 µs ± 948 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [165]: timeit M.tocsc()                                                                     
117 µs ± 90.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [166]: sparse.csr_matrix?                                                                   
In [167]: timeit M.tocsc()                                                                     
117 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [168]: timeit sparse.csc_matrix(x)                                                          
335 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: timeit sparse.coo_matrix(x).tocsr()                                                  
219 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'm a bit surprised that csr_matrix is slower than coo followed by a conversion.

Now let's try making the matrix with indptr etc.

In [170]: timeit np.nonzero(x)                                                                 
2.52 µs ± 65.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [173]: timeit sparse.csr_matrix((Mr.data, Mr.indices, Mr.indptr))                           
92.5 µs ± 79.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [174]: %%timeit 
     ...: indices = np.nonzero(x)[1] 
     ...: data = np.ones_like(indices) 
     ...: indptr = np.array([0,len(indices)]) 
     ...: sparse.csr_matrix((data, indices, indptr)) 
     ...:  
     ...:                                                                                      
161 µs ± 605 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
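The timings above use a 22-column example, where fixed overhead dominates. To see how the two construction routes compare on longer inputs, something like the following sketch could be used outside IPython. The sizes, the random column list, and the function names via_coo and via_indptr are illustrative assumptions, and results will vary by machine and SciPy version.

```python
import timeit

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_cols = 1_000_000  # assumed matrix width
# Assumed input: 1,000 unique, sorted non-zero column indices.
cols = np.sort(rng.choice(n_cols, size=1_000, replace=False))
data = np.ones(len(cols), dtype=np.int64)


def via_coo():
    # (data, (row, col)) syntax: SciPy builds a COO matrix and converts it.
    rows = np.zeros(len(cols), dtype=np.int64)
    return sparse.csr_matrix((data, (rows, cols)), shape=(1, n_cols))


def via_indptr():
    # (data, indices, indptr) syntax: the arrays are used almost as given.
    indptr = np.array([0, len(cols)])
    return sparse.csr_matrix((data, cols, indptr), shape=(1, n_cols))


t_coo = timeit.timeit(via_coo, number=100)
t_indptr = timeit.timeit(via_indptr, number=100)
print(f"(data, (row, col)):      {t_coo / 100 * 1e6:.1f} µs per call")
print(f"(data, indices, indptr): {t_indptr / 100 * 1e6:.1f} µs per call")
```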

【Discussion】: