Construct sparse matrix on disk on the fly in Python: answers

【Question title】: Construct sparse matrix on disk on the fly in Python
【Posted】: 2015-09-10 22:52:33
【Question description】:

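I am currently doing some memory-intensive text processing, for which I have to construct a sparse matrix of float32s with dimensions of ~ (2M, 5M). I am building this matrix column by column while reading a corpus of 5M documents. For this purpose I use the sparse dok_matrix data structure from SciPy. However, on reaching the 500,000th document my memory is full (about 30GB is used) and the program crashes. What I eventually want to do is run a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and construct the entire matrix in memory. I have looked into numpy.memmap, since sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get this to work.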

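It is impossible for me to save the entire matrix in a dense format, since that would require 40TB of disk space. So I think that HDF5 and PyTables are not an option for me (?).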
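The 40TB figure follows directly from the stated shape and dtype. A quick check, assuming exactly 2M x 5M float32 values and decimal terabytes:

```python
# Shape and dtype as stated in the question: ~(2M, 5M) float32 values.
n_rows, n_cols = 2_000_000, 5_000_000
bytes_per_value = 4  # one float32

dense_bytes = n_rows * n_cols * bytes_per_value
dense_terabytes = dense_bytes / 10**12  # decimal terabytes
print(dense_terabytes)  # 40.0
```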

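My question now is: how can I construct a sparse matrix on the fly, writing directly to disk instead of memory, so that I can use it afterwards in sklearn?

Thanks!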

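【Comments】:

For a dok format matrix, the underlying data structure is a Python dictionary. In fact the matrix is a dictionary subclass. So the numpy version of mmap does not help. Most likely the Python mmap will not help either, since the dictionary's data is not contiguous. Even if you managed to construct it, converting it to another sparse format for calculations might not be possible.
Yes, indeed, I had got that far... I looked into other kinds of sparse matrices, such as lil_matrix, but memmapping their internal data structures is far from trivial...
As discussed here: stackoverflow.com/a/30023214/901925, a sparse matrix can be created from arrays that you have built some other way. The data, i, j arrays of a coo matrix are not changed when assigned, and likewise if you assign the data, indptr, indices arrays of a csr matrix. Conceivably these input arrays could be memmaps. But you may well have to build those arrays in full before creating the sparse matrix.
I did think about that, but I would need to know the number of data points beforehand, because memmap needs to know the dimensions. So I could work in two stages: first I compute, count and write all of my data to a plain text file, and then I build the memmap arrays for the sparse matrix myself. At the moment this seems like the best solution to me. Thanks!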
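The two-stage idea from the comments could look roughly like the sketch below. It is only an illustration under assumptions: `triplets()` is a hypothetical stand-in for the corpus reader, the file names and the (3, 4) shape are made up, and the first pass simply re-iterates the generator instead of writing a text file.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def triplets():
    """Hypothetical stand-in for the corpus reader: yields (row, col, value)."""
    yield 0, 0, 1.0
    yield 2, 1, 2.5
    yield 1, 3, -1.0


# Pass 1: count the non-zero entries so the memmap sizes are known up front.
nnz = sum(1 for _ in triplets())

tmpdir = tempfile.mkdtemp()
rows = np.memmap(os.path.join(tmpdir, "rows.dat"), dtype=np.int32, mode="w+", shape=(nnz,))
cols = np.memmap(os.path.join(tmpdir, "cols.dat"), dtype=np.int32, mode="w+", shape=(nnz,))
vals = np.memmap(os.path.join(tmpdir, "vals.dat"), dtype=np.float32, mode="w+", shape=(nnz,))

# Pass 2: fill the disk-backed arrays entry by entry.
for k, (r, c, v) in enumerate(triplets()):
    rows[k], cols[k], vals[k] = r, c, v
for arr in (rows, cols, vals):
    arr.flush()

# Build a COO matrix from the memmapped arrays.
coo = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 4))
```

Whether SciPy keeps referring to the memmapped buffers or copies them depends on the dtypes and the SciPy version, so memory use should be measured rather than assumed.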

Tags: python memory numpy matrix scipy

【Solution 1】:

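It would be great if you could provide a minimal working example. I can't tell whether your matrix gets too big because of how it is constructed (1) or simply because you have too much data (2). If you don't really care about building this matrix yourself, you can go straight to remark 2.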

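For problem (1), in the example code below I made a wrapper class that builds a csr_matrix chunk by chunk. The idea is to append (row, column, data) arrays to lists until a buffer limit is reached (see remark 1), and only then actually update the matrix. When the limit is reached, the data held in memory shrinks, because the csr_matrix constructor sums entries that share the same (row, column) pair. This lets you build the sparse matrix quickly (much faster than creating a sparse matrix for every row) and avoids memory errors caused by redundant (row, column) pairs when a word appears several times in a document.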

import numpy as np
import scipy.sparse


class SparseMatrixBuilder:
    def __init__(self, shape, build_size_limit):
        self.sparse_matrix = scipy.sparse.csr_matrix(shape)
        self.shape = shape
        # Maximum number of buffered chunks (calls to add) before a flush.
        self.build_size_limit = build_size_limit
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def add(self, data, col_indices, row_indices):
        # Each argument is a 1-D array; all three must have the same length.
        self.data_temp.append(data)
        self.col_indices_temp.append(col_indices)
        self.row_indices_temp.append(row_indices)
        if len(self.data_temp) >= self.build_size_limit:
            self._flush()

    def _flush(self):
        # np.concatenate raises on an empty list, so skip empty buffers.
        if not self.data_temp:
            return
        # The constructor expects (data, (row, col)) and sums duplicate pairs.
        self.sparse_matrix += scipy.sparse.csr_matrix(
            (np.concatenate(self.data_temp),
             (np.concatenate(self.row_indices_temp),
              np.concatenate(self.col_indices_temp))),
            shape=self.shape
        )
        self.data_temp = []
        self.col_indices_temp = []
        self.row_indices_temp = []

    def get_matrix(self):
        # Merge whatever is still buffered before returning the result.
        self._flush()
        return self.sparse_matrix
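The duplicate-summing behaviour that the wrapper relies on can be checked in isolation. A minimal sketch with made-up values, where two entries target the same (row, column) pair, as would happen when a word occurs twice in one document:

```python
import numpy as np
import scipy.sparse

# Two entries target (row 0, col 1); the third is unrelated.
data = np.array([1.0, 2.0, 5.0])
rows = np.array([0, 0, 2])
cols = np.array([1, 1, 0])

# The (data, (row, col)) constructor sums entries with identical coordinates.
m = scipy.sparse.csr_matrix((data, (rows, cols)), shape=(3, 2))
print(m[0, 1])  # 3.0
print(m.nnz)    # 2
```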

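For problem (2), you can easily extend this class with a save method that stores the matrix on disk once a limit (or a second limit) is reached. You then end up with several sparse matrix chunks on disk, and you need a dimensionality reduction algorithm that can handle chunked matrices (see remark 2).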
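One possible way to store such chunks is `scipy.sparse.save_npz` and `scipy.sparse.load_npz`. The sketch below is an assumption about how a save step might look, not part of the class above; the chunk contents, shape and file names are hypothetical.

```python
import os
import tempfile

import scipy.sparse

shape = (4, 3)
tmpdir = tempfile.mkdtemp()

# Two hypothetical partial matrices, standing in for flushed buffers.
chunk_a = scipy.sparse.csr_matrix(([1.0, 2.0], ([0, 1], [0, 2])), shape=shape)
chunk_b = scipy.sparse.csr_matrix(([3.0, 4.0], ([1, 3], [2, 1])), shape=shape)

# Write each chunk to its own .npz file.
paths = []
for i, chunk in enumerate((chunk_a, chunk_b)):
    path = os.path.join(tmpdir, f"chunk_{i}.npz")
    scipy.sparse.save_npz(path, chunk)
    paths.append(path)

# Later: stream the chunks back one at a time.
total = scipy.sparse.csr_matrix(shape)
for path in paths:
    total = total + scipy.sparse.load_npz(path)
```

Summing the chunks here only verifies the round trip. A chunk-aware algorithm would process each loaded chunk in turn instead of rebuilding the full matrix in memory.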

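Remark 1: the buffer limit here is not well defined. It would be better to compare the actual size of the numpy arrays in data_temp, col_indices_temp and row_indices_temp against the RAM available on the machine (this is easy to automate in Python).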
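A byte-based limit could be sketched as follows. The helper name `buffered_nbytes` and the fixed budget `MAX_BUFFER_BYTES` are assumptions for illustration; in practice the budget would be derived from the memory actually available.

```python
import numpy as np


def buffered_nbytes(*chunk_lists):
    """Total bytes held by the numpy arrays in the given lists of chunks."""
    return sum(arr.nbytes for chunks in chunk_lists for arr in chunks)


# Hypothetical buffer contents: two chunks of 1000 and 500 entries.
data_temp = [np.ones(1000, dtype=np.float32), np.ones(500, dtype=np.float32)]
row_indices_temp = [np.zeros(1000, dtype=np.int32), np.zeros(500, dtype=np.int32)]
col_indices_temp = [np.zeros(1000, dtype=np.int32), np.zeros(500, dtype=np.int32)]

MAX_BUFFER_BYTES = 16_000  # hypothetical budget

used = buffered_nbytes(data_temp, row_indices_temp, col_indices_temp)
should_flush = used >= MAX_BUFFER_BYTES
print(used, should_flush)  # 18000 True
```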

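Remark 2: gensim is a Python library that is well suited to building NLP models from chunked files. You could therefore build a dictionary, construct a sparse matrix and reduce its dimensionality with that library without needing much RAM.

【Discussion】:

【Solution 2】: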

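I am assuming that all your data can fit in memory using a more memory-friendly sparse matrix format, such as COO. If not, you have little hope of carrying on with sklearn, even with mmap. Indeed, sklearn will probably create subsequent objects whose memory requirements are of the same order of magnitude as your input.

SciPy's dok_matrix is actually a subclass of the plain dict. It stores data using individual Python objects and lots of pointers, so it is not very memory-efficient. The most compact representation is the coo_matrix format. You can incrementally build the data needed to create a COO matrix by preallocating arrays for the coordinates (rows and columns) and the data, and growing those buffers if your initial guess turns out to be too small.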


import numpy
from scipy.sparse import coo_matrix


def get_coo_from_iter(iterable, n_data_hint=1<<20, idx_dtype='uint32', data_dtype='float32'):
    counter = 0
    rows = numpy.empty(n_data_hint, dtype=idx_dtype)
    cols = numpy.empty(n_data_hint, dtype=idx_dtype)
    data = numpy.empty(n_data_hint, dtype=data_dtype)
    for row, col, value in iterable:
        if counter >= n_data_hint:
            # Buffers are full: double their capacity and copy the old contents.
            n_data_hint *= 2
            rows, cols, data = _reallocate(rows, cols, data, n_data_hint)
        rows[counter] = row
        cols[counter] = col
        data[counter] = value
        counter += 1
    # Trim the unused tail of each buffer.
    rows = rows[:counter]
    cols = cols[:counter]
    data = data[:counter]
    # No shape is given, so it is inferred from the largest row and column index.
    return coo_matrix((data, (rows, cols)))


def _reallocate(rows, cols, data, n):
    new_rows = numpy.empty(n, dtype=rows.dtype)
    new_cols = numpy.empty(n, dtype=cols.dtype)
    new_data = numpy.empty(n, dtype=data.dtype)
    new_rows[:rows.size] = rows
    new_cols[:cols.size] = cols
    new_data[:data.size] = data
    return new_rows, new_cols, new_data

You can test it with randomly generated data like this:

def get_random_data(n, max_row=2000, max_col=5000):
    for _ in range(n):
        row = numpy.random.choice(max_row)
        col = numpy.random.choice(max_col)
        val = numpy.random.randn()
        yield row, col, val

# test when initial hint is good
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=10000)
print(coo.shape)

# or to test when initial hint was too tiny
coo = get_coo_from_iter(get_random_data(10000), n_data_hint=1111)
print(coo.shape)

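Once you have a COO matrix, you may want to convert it to CSR using coo.tocsr(). CSR matrices are better optimized for common operations such as the dot product. CSR can take more memory when many rows are empty, because it stores a pointer for every row, including the empty ones.

【Discussion】:

【Solution 3】: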
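The effect of empty rows on CSR storage can be seen by comparing the sizes of the underlying arrays. This is a small sketch with made-up numbers (3 stored values in a matrix with 1000 rows), chosen so that almost all rows are empty:

```python
import numpy as np
from scipy.sparse import coo_matrix

n_rows, n_cols = 1000, 10
rows = np.array([0, 1, 2], dtype=np.int32)
cols = np.array([0, 1, 2], dtype=np.int32)
data = np.array([1.0, 2.0, 3.0], dtype=np.float32)

coo = coo_matrix((data, (rows, cols)), shape=(n_rows, n_cols))
csr = coo.tocsr()

# COO stores three arrays of length nnz.
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes
# CSR stores data and indices of length nnz, plus indptr of length n_rows + 1.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(csr.indptr.size)        # 1001
print(csr_bytes > coo_bytes)  # True
```

With many stored values per row the comparison goes the other way, since CSR then replaces one index per value with one pointer per row.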

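We ran into similar problems in the field of single-cell genomics, dealing with large sparse datasets on disk. I'll show you a small, simple example of how I would handle this. My assumption is that you are very memory constrained and probably cannot fit multiple copies of the sparse matrix into memory at once. This approach works even if you cannot fit one entire copy.

I would construct an on-disk sparse CSC matrix column by column. A sparse CSC matrix uses three underlying arrays:

data: the values stored in the matrix
indices: the row index of each value in the matrix
indptr: an array of length n_cols + 1 that divides indices and data by the column they belong to

As an explanatory example, the values for column i are stored in data[indptr[i]:indptr[i+1]]. Similarly, the row indices for these values can be found with indices[indptr[i]:indptr[i+1]].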
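The layout described above can be inspected on a tiny matrix. The 3 x 3 values below are made up for illustration; note that column 1 is empty, which shows up as two equal consecutive entries in indptr.

```python
import numpy as np
from scipy import sparse

dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 0, 0]], dtype=np.float32)
mtx = sparse.csc_matrix(dense)

print(mtx.data)     # [1. 4. 2. 3.]
print(mtx.indices)  # [0 2 0 1]
print(mtx.indptr)   # [0 2 2 4]

# Slice out column 2 using indptr.
i = 2
col_values = mtx.data[mtx.indptr[i]:mtx.indptr[i + 1]]   # values 2 and 3
col_rows = mtx.indices[mtx.indptr[i]:mtx.indptr[i + 1]]  # rows 0 and 1
```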

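To simulate your data generating process (parsing a document, I assume), I'll define a function process_document which returns the values of indices and data for the relevant document.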

import numpy as np
import h5py
from scipy import sparse

from tqdm import tqdm  # For monitoring the writing process
from typing import Tuple, Union  # Just for argument annotation

def process_document():
    """
    Simulate processing a document. Results in sparse vector representation.
    """
    n_items = np.random.negative_binomial(2, .0001)
    indices = np.random.choice(2_000_000, n_items, replace=False)
    indices.sort()
    data = np.random.random(n_items).astype(np.float32)
    return indices, data

def data_generator(n):
    """Iterator which yields simulated data."""
    for i in range(n):
        yield process_document()

Now I'll create a group in an hdf5 file which will store the constituent arrays of the sparse matrix.

def make_sparse_csc_group(f: Union[h5py.File, h5py.Group], groupname: str, shape: Tuple[int, int]):
    """
    Create a group in an hdf5 file that can store a CSC sparse matrix.
    """
    g = f.create_group(groupname)
    g.attrs["shape"] = shape
    g.create_dataset("indices", shape=(1,), dtype=np.int64, chunks=True, maxshape=(None,))
    g["indptr"] = np.zeros(shape[1] + 1, dtype=int) # We want this to have a zero for the first value
    g.create_dataset("data", shape=(1,), dtype=np.float32, chunks=True, maxshape=(None,))
    return g

And finally a function for reading this group back as a sparse matrix (this one is pretty simple).

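def read_sparse_csc_group(g: Union[h5py.File, h5py.Group]):
    # [:] loads each dataset into memory as a numpy array before scipy sees it.
    arrays = (g["data"][:], g["indices"][:], g["indptr"][:])
    return sparse.csc_matrix(arrays, shape=tuple(g.attrs["shape"]))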

Now we'll create the on-disk sparse matrix and write it one column at a time (I'm using fewer columns since this can be kind of slow).

N_COLS = 10

def make_disk_matrix(f, groupname, data_iter, shape):
    group = make_sparse_csc_group(f, groupname, shape)

    indptr = group["indptr"]
    data = group["data"]
    indices = group["indices"]
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(tqdm(data_iter)):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices.resize((n_total,))
        data.resize((n_total,))
        indices[n_prev:] = cur_indices
        data[n_prev:] = cur_data
        indptr[doc_num+1] = n_total

# Writing
with h5py.File("data.h5", "w") as f:
    make_disk_matrix(f, "mtx", data_generator(N_COLS), (2_000_000, N_COLS))

# Reading
with h5py.File("data.h5", "r") as f:
    mtx = read_sparse_csc_group(f["mtx"])
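Because the columns are stored contiguously, a range of columns can also be read without loading the whole matrix, which matters when even one full copy does not fit in memory. The sketch below assumes the same data/indices/indptr group layout as above; `read_csc_columns` is a hypothetical helper, and the demo file and its contents are made up.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse


def read_csc_columns(g, start, stop):
    """Hypothetical helper: load only columns [start, stop) of an on-disk CSC group."""
    indptr = g["indptr"][start:stop + 1]
    lo, hi = int(indptr[0]), int(indptr[-1])
    data = g["data"][lo:hi]
    indices = g["indices"][lo:hi]
    n_rows = int(g.attrs["shape"][0])
    # Shift indptr so that it starts at zero for the extracted block.
    return sparse.csc_matrix((data, indices, indptr - lo), shape=(n_rows, stop - start))


# Write a small demo matrix using the same group layout.
full = sparse.csc_matrix(np.array([[1, 0, 2, 0],
                                   [0, 0, 3, 0],
                                   [4, 5, 0, 6]], dtype=np.float32))
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    g = f.create_group("mtx")
    g.attrs["shape"] = full.shape
    g["data"], g["indices"], g["indptr"] = full.data, full.indices, full.indptr

# Read back only columns 1 and 2.
with h5py.File(path, "r") as f:
    sub = read_csc_columns(f["mtx"], 1, 3)
```

Only the requested slice of data and indices is read from disk, so peak memory scales with the size of the block rather than with the whole matrix.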

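Again, this assumes a very memory-constrained situation in which you may not be able to fit the entire sparse matrix in memory when creating it. If you can handle the entire sparse matrix plus at least one copy, a much faster way is to not bother with on-disk storage (similar to the other suggestions). A slight modification of this code should then give you better performance: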

def make_memory_mtx(data_iter, shape):
    indices_list = []
    data_list = []
    indptr = np.zeros(shape[1]+1, dtype=int)
    n_total = 0

    for doc_num, (cur_indices, cur_data) in enumerate(data_iter):
        n_cur = len(cur_indices)
        n_prev = n_total
        n_total += n_cur
        indices_list.append(cur_indices)
        data_list.append(cur_data)
        indptr[doc_num+1] = n_total

    indices = np.concatenate(indices_list)
    data = np.concatenate(data_list)

    return sparse.csc_matrix((data, indices, indptr), shape=shape)

mtx = make_memory_mtx(data_generator(10), shape=(2_000_000, 10))

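This should be fairly fast, since it only copies the data once, when you concatenate the arrays. The other solutions currently posted reallocate the arrays as they process the data, making many copies of large arrays.

【Discussion】: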