Sparse matrix slicing using list of int (answers)

【Question title】: Sparse matrix slicing using list of int
【Posted】: 2017-01-22 20:41:58
【Question description】:

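I'm writing a machine learning algorithm on huge, sparse data: my matrix has shape (347, 5 416 812 801) but is very sparse, with only 0.13% of the data being nonzero.

My sparse matrix occupies 105 000 bytes and is of csr type.

I'm trying to separate train/test sets by choosing a list of example indices for each set. So I want to split my dataset in two: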

training_set = matrix[train_indices]

of shape (len(training_indices), 5 416 812 801), still sparse

testing_set = matrix[test_indices]

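of shape (347-len(training_indices), 5 416 812 801), also sparse

where training_indices and testing_indices are two lists of int.

But training_set = matrix[train_indices] seems to fail and returns Segmentation fault (core dumped).

It might not be a memory problem, since I'm running this code on a server with 64GB of RAM.

Any clue what the cause could be?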
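
For reference, the split described above can be sketched on a small random csr matrix, where row indexing with a list of int behaves as expected. The matrix size, density, and the two index lists below are made-up stand-ins, not the asker's data:

```python
import numpy as np
from scipy import sparse

# Small stand-in for the real data: 10 samples, 50 features, about 10% nonzero.
matrix = sparse.random(10, 50, density=0.1, format="csr", random_state=0)

# Hypothetical index lists; in the question these come from the user's own split.
train_indices = [0, 2, 3, 5, 7, 8]
test_indices = [1, 4, 6, 9]

# Row selection with a list of int returns a new sparse matrix.
training_set = matrix[train_indices]
testing_set = matrix[test_indices]

print(training_set.shape, testing_set.shape)
```

At this scale both results stay sparse and together hold every stored value of the original, so the failure in the question appears to be tied to the size of the real matrix rather than to the indexing syntax.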

【Question discussion】:

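My guess is that a MemoryError isn't being caught cleanly. You may need to study matrix.__getitem__ (the indexing method) to see how it does the selection. Each sparse format has its own indexing. lil and csr should handle row indexing fine. coo doesn't handle indexing at all. Indexing a sparse matrix isn't as hidden away in compiled code as it is for arrays (and it isn't as fast either).
I'll check that, but since I'm using csr and trying to take rows, it should be fine.
Which version of scipy are you using? You can check with import scipy; print(scipy.__version__)
It may be worth searching SO or the scipy github for "sparse" and "segmentation fault".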

Tags: python scipy segmentation-fault sparse-matrix

【Solution 1】:

I think I've recreated the csr row indexing with:

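import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Build a (len(indices), N) csr matrix with a single 1 in row i, at column indices[i].
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    shape = (len(indices), N)
    return sparse.csr_matrix((data, indices, indptr), shape=shape)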

Testing on a csr I had hanging around:

In [185]: M
Out[185]: 
<30x40 sparse matrix of type '<class 'numpy.float64'>'
    with 76 stored elements in Compressed Sparse Row format>

In [186]: indices=np.r_[0:20]

In [187]: M[indices,:]
Out[187]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

In [188]: extractor(indices, M.shape[0])*M
Out[188]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

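Like many other csr methods, it uses matrix multiplication to produce the final values. In this case the multiplier is a sparse matrix with a single 1 in each row, placed at the column of the row being selected. Timing is actually a bit better.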

In [189]: timeit M[indices,:]
1000 loops, best of 3: 515 µs per loop
In [190]: timeit extractor(indices, M.shape[0])*M
1000 loops, best of 3: 399 µs per loop
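
The claim that the selector product matches plain row indexing can be checked directly. This is a minimal sketch on a random 30x40 matrix (an assumed stand-in for the M used above); it uses the `@` operator for the product, which for scipy sparse matrices performs the same matrix multiplication as the `*` in the session above:

```python
import numpy as np
from scipy import sparse

def extractor(indices, N):
    # Selector matrix: row i has a single 1 at column indices[i].
    indptr = np.arange(len(indices) + 1)
    data = np.ones(len(indices))
    return sparse.csr_matrix((data, indices, indptr), shape=(len(indices), N))

# Random stand-in for M; the density is an arbitrary choice.
M = sparse.random(30, 40, density=0.06, format="csr", random_state=0)
indices = np.arange(20)

by_index = M[indices, :]
by_product = extractor(indices, M.shape[0]) @ M

# Compare the dense forms; fine for a matrix this small.
same = np.array_equal(by_index.toarray(), by_product.toarray())
print(same)
```

The dense comparison is only practical for small test matrices; it is here to confirm the equivalence, not as something to run on the full data.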

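In your case the extractor matrix has shape (len(training_indices), 347), with only len(training_indices) values. So it isn't big.

But if matrix is so large (or at least its second dimension is so big) that it produces some error in the matrix multiplication routines, it could give rise to a segmentation fault without python/numpy trapping it.

Does matrix.sum(axis=1) work? That too uses a matrix multiplication, though with a dense matrix of 1s. Or sparse.eye(347)*M, a similar sized matrix multiplication?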
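
One more diagnostic that may be worth running alongside those two: the column count in the question, 5 416 812 801, is larger than the biggest value a 32-bit signed integer can hold (2 147 483 647), so the column indices have to be stored as 64-bit integers. The sketch below only reports the index dtypes and the largest stored column index; `describe_indices` is a hypothetical helper written for this example, and whether the index width is actually what triggers the crash is an assumption, not something confirmed in this thread:

```python
import numpy as np
from scipy import sparse

n_cols = 5_416_812_801              # column count from the question
int32_max = np.iinfo(np.int32).max  # 2147483647

# True: column indices this large cannot all be represented as int32.
print(n_cols > int32_max)

def describe_indices(m):
    """Report the index dtypes and largest stored column index of a csr matrix."""
    largest = int(m.indices.max()) if m.nnz else -1
    return {
        "indices_dtype": m.indices.dtype,
        "indptr_dtype": m.indptr.dtype,
        "max_column_index": largest,
        "fits_int32": largest <= int32_max,
    }

# Small stand-in; on the real data, call describe_indices(matrix) instead.
small = sparse.random(5, 100, density=0.1, format="csr", random_state=0)
print(describe_indices(small))
```

If the real matrix reports a max_column_index above the int32 limit, that would be consistent with the idea that some compiled routine is not coping with the index values.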

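【Discussion】:

Indeed, neither works: sum returns IndexError: index 21870 out-of-bounds in add.reduceat [0, 20815), and the matrix multiplication returns a segmentation fault. So is my only solution to write my own slower but more memory-efficient code to split my matrix?
While my first guess was a memory error, I now suspect the problem is the large number of columns, or rather the large value of some of the column indices. If it can't sum or do matrix multiplication, it will probably run into problems in the learning code as well.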
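
A hand-written row split of the kind asked about in the comment above could look roughly like the following. This is a minimal sketch, not a tested fix: `take_rows` is a hypothetical helper that copies slices of the csr `data`, `indices` and `indptr` arrays instead of going through a matrix multiplication, and it has only been tried here on a small random matrix, so whether it avoids the crash on the asker's data is unknown:

```python
import numpy as np
from scipy import sparse

def take_rows(m, rows):
    """Select rows of a csr matrix by copying slices of its data/indices arrays."""
    rows = np.asarray(rows, dtype=np.intp)
    starts = m.indptr[rows]
    stops = m.indptr[rows + 1]
    lengths = stops - starts
    # New row pointer: running total of stored values per selected row.
    indptr = np.concatenate(([0], np.cumsum(lengths)))
    if len(rows) and lengths.sum():
        # Positions in m.data / m.indices that belong to the selected rows.
        take = np.concatenate([np.arange(a, b) for a, b in zip(starts, stops)])
    else:
        take = np.array([], dtype=np.intp)
    return sparse.csr_matrix(
        (m.data[take], m.indices[take], indptr),
        shape=(len(rows), m.shape[1]),
    )

# Check against ordinary row indexing on a small stand-in matrix.
M = sparse.random(30, 40, density=0.1, format="csr", random_state=0)
rows = [0, 5, 7, 29]
print(np.array_equal(take_rows(M, rows).toarray(), M[rows].toarray()))
```

The Python-level loop over rows is slower than the built-in indexing, but with only 347 rows that cost should be small. As the last comment points out, a split that succeeds this way may still leave later steps (sums, products) failing on the same matrix.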