Remove nan rows in a scipy sparse matrix: answers

【Question title】: Remove nan rows in a scipy sparse matrix
【Posted】: 2017-01-15 15:27:39
【Question】:

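I have a (normalized) sparse adjacency matrix and a list of labels for the respective matrix rows. Because some nodes have been removed by another sanitization function, the matrix contains some rows of NaNs. I want to find these rows and remove them together with their respective labels. Here is the function I wrote: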

def sanitize_nans(adj, labels):
    # convert to numpy array and keep dimension
    adj = np.array(adj, ndmin=2)

    for i, row in enumerate(adj):
        # check if row all nans
        if np.all(np.isnan(row)):
            # print("Removing nan row label in %s" % i)
            # remove row index from labels
            del labels[i]
    # remove all nan rows
    adj = adj[~np.all(np.isnan(adj), axis=1)]
    # return sanitized adj and labels_clean
    return adj, labels

labels is a plain Python list and adj has type <class 'scipy.sparse.lil.lil_matrix'> (containing elements of type <class 'numpy.float64'>). Both are the result of

adj, labels = nx.attr_sparse_matrix(infected, normalized=True)

Running it raises the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-503-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-502-ead99efec677> in sanitize_nans(adj, labels)
      6     for i, row in enumerate(adj):
      7         # check if row all nans
----> 8         if np.all(np.isnan(row)):
      9             print("Removing nan row label in %s" % i)
     10             # remove row index from labels

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

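So I assumed that SciPy's NaN differs from numpy's NaN. After that, I tried converting the sparse matrix into a numpy array (at the risk of flooding my RAM, since the matrix has about 40k rows and columns). When I run it, however, the error stays the same. It seems that the np.array() call only wraps the sparse matrix without converting it, because type(row) inside the for loop still outputs <class 'scipy.sparse.lil.lil_matrix'>.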

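So my question is how to resolve this issue and whether there is a better approach to get the job done. I am fairly new to numpy and scipy (as used in networkx), so I'd appreciate an explanation. Thank you!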

EDIT: After changing the conversion to the one proposed by hpaulj, I get a MemoryError:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-519-8a404b58eaa9> in <module>()
----> 1 adj, labels = sanitize_nans(adj, labels)

<ipython-input-518-44201f4ff35c> in sanitize_nans(adj, labels)
      1 def sanitize_nans(adj, labels):
----> 2     adj = adj.toarray()
      3 
      4     for i, row in enumerate(adj):
      5         # check if row all nans

/usr/lib/python3/dist-packages/scipy/sparse/lil.py in toarray(self, order, out)
    348     def toarray(self, order=None, out=None):
    349         """See the docstring for `spmatrix.toarray`."""
--> 350         d = self._process_toarray_args(order, out)
    351         for i, row in enumerate(self.rows):
    352             for pos, j in enumerate(row):

/usr/lib/python3/dist-packages/scipy/sparse/base.py in _process_toarray_args(self, order, out)
    697             return out
    698         else:
--> 699             return np.zeros(self.shape, dtype=self.dtype, order=order)
    700 
    701 

MemoryError:

So apparently I have to stick with the sparse matrix to save RAM.
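As a rough check of that conclusion, the dense size can be estimated from the figures given above (about 40k rows and columns, float64 elements). The exact shape of 40,000 x 40,000 is an assumption, since only an approximate size is stated:

```python
# Back-of-the-envelope estimate of what adj.toarray() would have to allocate.
# n is assumed from "about 40k rows and columns"; the real shape may differ.
n = 40_000
itemsize = 8  # bytes per numpy.float64 element

dense_bytes = n * n * itemsize
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes:,} bytes, roughly {dense_gib:.1f} GiB")
```

Under that assumption the dense array alone needs well over 10 GiB, which is consistent with np.zeros raising a MemoryError.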

【Comments】:

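A sparse matrix is not a dense array. Look at adj.data and adj.rows. For a lil matrix these are object arrays of lists, one pair of sublists per row of the matrix.
adj.A or adj.toarray() produces an array.
Thanks for the fast reply! I edited the question with your proposed change and my results. (I just changed the conversion line to adj = adj.toarray().)
Is the whole row of adj NaN, or just the nonzero values of that row? A large sparse matrix may have thousands of columns but only hundreds of nonzero entries per row. Most values in a row will be 0 (and are not stored in the sparse representation).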

Tags: python numpy scipy sparse-matrix networkx

【Solution 1】:

If I make a sample array:

In [328]: A=np.array([[1,0,0,np.nan],[0,np.nan,np.nan,0],[1,0,1,0]])
In [329]: A
Out[329]: 
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])

In [331]: M=sparse.lil_matrix(A)

This lil sparse matrix is stored in 2 arrays:

In [332]: M.data
Out[332]: array([[1.0, nan], [nan, nan], [1.0, 1.0]], dtype=object)
In [333]: M.rows
Out[333]: array([[0, 3], [1, 2], [0, 2]], dtype=object)

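With your row test applied to the dense array, no rows are removed, because no row is entirely nan, even though the middle row of the sparse matrix stores only nan values: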

In [334]: A[~np.all(np.isnan(A), axis=1)]
Out[334]: 
array([[  1.,   0.,   0.,  nan],
       [  0.,  nan,  nan,   0.],
       [  1.,   0.,   1.,   0.]])

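I can test the rows of M for nan and identify the ones that contain only nan (apart from the 0s). But it may be easier to collect the rows we want to keep: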

In [346]: ll = [i for i,row in enumerate(M.data) if not np.all(np.isnan(row))]
In [347]: ll
Out[347]: [0, 2]
In [348]: M[ll,:]
Out[348]: 
<2x4 sparse matrix of type '<class 'numpy.float64'>'
    with 4 stored elements in LInked List format>
In [349]: _.A
Out[349]: 
array([[  1.,   0.,   0.,  nan],
       [  1.,   0.,   1.,   0.]])

A row of M.data is a list, but np.isnan(row) converts it to an array and does the array test.
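Putting the pieces above together with the label list from the question, a minimal sketch could look like the following. The helper name drop_all_nan_rows and the sample labels are hypothetical. One addition that is not in the session above: a row with no stored entries (all zeros) is kept explicitly, because np.all of an empty array is True and such a row would otherwise be treated as all-nan.

```python
import numpy as np
from scipy import sparse


def drop_all_nan_rows(adj, labels):
    """Drop rows whose stored entries are all NaN, along with their labels.

    Hypothetical helper; `adj` is assumed to be convertible to a LIL matrix.
    """
    adj = sparse.lil_matrix(adj)
    keep = [
        i for i, row in enumerate(adj.data)
        # Keep empty rows explicitly: np.all(np.isnan([])) evaluates to True.
        if len(row) == 0 or not np.all(np.isnan(row))
    ]
    # Build a new label list instead of deleting in place while iterating,
    # which would shift the remaining indices.
    return adj[keep, :], [labels[i] for i in keep]


A = np.array([[1, 0, 0, np.nan],
              [0, np.nan, np.nan, 0],
              [0, 0, 0, 0],
              [1, 0, 1, 0]])
clean, clean_labels = drop_all_nan_rows(A, ["a", "b", "c", "d"])
print(clean.shape, clean_labels)
```

Only row "b" is dropped here. Row "a" still contains one nan because it also has a real value, and row "c" survives because it stores nothing at all.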

【Comments】:

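This helped a lot! I've adjusted my code accordingly. I also slice with adj = adj[ll,:][:,ll] to keep the adjacency matrix symmetric. However, either this or the ll index list seems to flood my RAM. Is there a more efficient way to get this done? The matrix has about 40k rows and columns.
Indexing rows of a lil is efficient, but removing columns will be messier, requiring changes to each of the sublists. I'd have to look at its code to see how columns are handled.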
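One possible way to approach the follow-up, offered as a sketch rather than as the answerer's method: convert to CSR, where all stored values sit in one flat data array, find the all-nan rows with vectorized operations, and then index rows and columns with the same index array. The helper name drop_nan_nodes and the sample data are hypothetical, and whether this fits in memory for a 40k x 40k matrix depends on how many entries are actually stored.

```python
import numpy as np
from scipy import sparse


def drop_nan_nodes(adj, labels):
    """Remove all-NaN rows and the matching columns without densifying.

    Hypothetical helper; assumes a square adjacency matrix.
    """
    csr = sparse.csr_matrix(adj)
    n_rows = csr.shape[0]

    stored = np.diff(csr.indptr)  # number of stored entries in each row
    row_of_entry = np.repeat(np.arange(n_rows), stored)
    # Count the NaN entries per row; rows without stored entries count 0.
    nan_per_row = np.bincount(row_of_entry,
                              weights=np.isnan(csr.data).astype(float),
                              minlength=n_rows)

    # A row is dropped only if it stores something and all of it is NaN.
    all_nan = (stored > 0) & (nan_per_row == stored)
    keep = np.flatnonzero(~all_nan)

    cleaned = csr[keep][:, keep]  # same indices for rows and columns
    return cleaned, [labels[i] for i in keep]


nan = np.nan
A = np.array([[0.0, 0.5, nan, 0.5],
              [1.0, 0.0, nan, 0.0],
              [nan, nan, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])
cleaned, kept = drop_nan_nodes(A, ["a", "b", "c", "d"])
print(cleaned.toarray(), kept)
```

In this example the third node is removed from both axes, so the nan values in its column disappear as well. The result is a CSR matrix; call .tolil() on it if later code expects the lil format.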