How to append chunks of a 2D numpy array to a binary file as the chunks are created? (Answers)

【Question title】: How to append chunks of 2D numpy array to binary file as the chunks are created?
【Posted】: 2019-07-22 22:34:45
【Question】:

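I have a large input file made up of data frames: one data series (complex64) split across frames, each frame carrying an identifying header. The file is larger than my available memory. Headers repeat but arrive in random order, so the input file might look like this: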

<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...

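I want to sort the data into a new file holding a numpy array of shape (len(headers), len(data_series)). The output file has to be built up as the frames are read, because I cannot fit everything in memory.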

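I looked at numpy.savetxt and the Python csv package, but for reasons of disk size, precision and speed I would prefer a binary output file. numpy.save is fine, except that I don't know how to make it append to an array of unknown size.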

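I have to work in Python 2.7 because of some dependencies needed to read these frames. What I have so far is a function that writes all frames with a matching header to a single binary file: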

import numpy as np

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)

with open("singleFrameHeader", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    if current_data.header == 0:
        # Reinterpret the frame's bytes as float64 and append them to the file
        float_arr = np.array(current_data.data).view(float)
        float_arr.tofile(f)

This works well, but I need to extend it to two dimensions. I have started looking at h5py as an option, but was hoping for a simpler solution.

Ideally I would have something like:

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)

with open("bigMatrix", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    index = current_data.header
    float_arr = np.array(current_data.data).view(float)
    float_arr.tofile(f, index)  # wished-for call: tofile() has no row-index argument

Any help is appreciated. I would have thought that reading and writing a 2D binary file in append mode was a more common use case.

【Question comments】:

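tofile writes a flat binary array: just the contents of the data buffer. Array attributes such as shape and dtype are not saved. So whether the array is 2D or raveled, it writes exactly the same thing.
So are all the data series the same length?
@MadPhysicist Yes, they are all the same length.
@nicholas. I have updated my answer to include that information. It is standard procedure to remove your question from the unanswered queue by clicking the check mark next to the answer that solved it.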

Tags: python python-2.7 numpy

【Solution 1】:

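You have two problems: one is that the file contains sequential data, and the other is that numpy's raw binary files (as written by tofile) do not store shape information.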
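To make the second problem concrete, here is a minimal sketch of a tofile round trip. The file name and the tiny 2x3 array are made up for illustration; the point is only that the shape and dtype have to be supplied again by hand when reading.

```python
import os
import tempfile

import numpy as np

# A small 2D array standing in for the real data (illustrative only).
original = np.arange(6, dtype=np.float64).reshape(2, 3)

path = os.path.join(tempfile.mkdtemp(), "raw.bin")
original.tofile(path)  # writes only the raw buffer: no shape, no dtype

# Reading back requires the dtype to be given and yields a flat array.
flat = np.fromfile(path, dtype=np.float64)

# The shape has to come from somewhere else; here we simply know it.
restored = flat.reshape(2, 3)
```

This is why the combined file in the solution further down stores the two dimensions itself in a small preamble.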

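A simple way to start is to follow through on your original idea of writing the data to one file per header, and then combine all the binary files into one large product (if you still feel the need to).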

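You can maintain a map from the headers found so far to their output files, data sizes and so on. This lets you combine the data more intelligently, for example when chunks or whole headers are missing.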

# Note: ExitStack and int.to_bytes (used below) require Python 3
from contextlib import ExitStack
from os import remove
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
import sys

import numpy as np

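class Header:
    """Per-header bookkeeping: a temporary file plus a running element count."""
    __slots__ = ('id', 'count', 'file', 'name')
    def __init__(self, id):
        self.id = id
        self.count = 0  # float64 elements written so far
        self.file = NamedTemporaryFile(delete=False)  # kept on disk until merged
        self.name = self.file.name
    def write_frame(self, frame):
        # Reinterpret the frame's bytes as float64 and append them
        data = np.array(frame.data).view(float)
        self.count += data.size
        data.tofile(self.file)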

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}

with ExitStack() as stack:
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break  # recast this loop as necessary
        if frame.header not in file_map:
            header = Header(frame.header)
            stack.enter_context(header.file)
            file_map[frame.header] = header
        else:
            header = file_map[frame.header]
        header.write_frame(frame)

num_headers = max(file_map) + 1  # rows 0..largest header id, gaps included
max_count = max(h.count for h in file_map.values())

with open('singleFrameHeader', 'wb') as output:
    # 16-byte preamble: row count, then elements per row, as native-order int64
    output.write(num_headers.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    for i in range(num_headers):
        if i in file_map:
            h = file_map[i]
            with open(h.name, 'rb') as src:
                copyfileobj(src, output)
            remove(h.name)
            if h.count < max_count:
                # Pad short rows with NaN
                np.full(max_count - h.count, np.nan, dtype=np.float64).tofile(output)
        else:
            # Header never seen: write a full row of NaN
            np.full(max_count, np.nan, dtype=np.float64).tofile(output)

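The first 16 bytes will be the int64 number of headers and the number of elements per header, respectively. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.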
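Reading such a file back could look roughly like the sketch below. The helper name read_matrix and the demo file are assumptions made for illustration; the layout it expects is the one described above (two native-order int64 values, then float64 rows). It uses numpy.memmap so the matrix is not loaded into memory all at once.

```python
import os
import sys
import tempfile

import numpy as np


def read_matrix(path):
    """Map the padded matrix from disk (hypothetical helper)."""
    with open(path, "rb") as f:
        n_headers = int.from_bytes(f.read(8), sys.byteorder)
        n_elements = int.from_bytes(f.read(8), sys.byteorder)
    # Skip the 16-byte preamble and view the rest as a 2D float64 array.
    return np.memmap(path, dtype=np.float64, mode="r",
                     offset=16, shape=(n_headers, n_elements))


# Build a tiny file in the same layout to exercise the reader.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
demo = np.array([[1.0, 2.0, 3.0], [4.0, np.nan, np.nan]])
with open(demo_path, "wb") as f:
    f.write((2).to_bytes(8, sys.byteorder))  # number of headers (rows)
    f.write((3).to_bytes(8, sys.byteorder))  # elements per header
    demo.tofile(f)

matrix = read_matrix(demo_path)
```

If the rows really hold complex64 data viewed as float64, the result can be reinterpreted with matrix.view(np.complex64); the NaN padding will not survive that reinterpretation as a clean complex NaN.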

Alternative

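If (and only if) you know the exact size of each header's dataset in advance, you can do this in a single pass, with no temporary files. It also helps if the headers are contiguous; otherwise the missing swaths will be zero-filled. You still need to maintain a dictionary of your current position within each header, but you no longer need to keep a separate file pointer for each one. All in all, this is a better option than the original solution if your use case allows it: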

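import sys

import numpy as np

# Both sizes must be known up front. N and num_headers are placeholders for
# values from your own data (assuming N is the number of 500-element frames
# per header).
max_count = 500 * N          # float64 elements per header
header_size = max_count * 8  # bytes per header row
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)

header_map = {}  # header id -> bytes already written to that row
with open('singleFrameHeader', 'wb') as output:
    output.write(num_headers.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break
        if frame.header not in header_map:
            header_map[frame.header] = 0
        data = np.array(frame.data).view(float)
        # Jump to this row's next free byte, past the 16-byte preamble
        output.seek(16 + frame.header * header_size + header_map[frame.header])
        data.tofile(output)
        header_map[frame.header] += data.size * data.dtype.itemsize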
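A variation on the same one-pass idea, not part of the approach above, is to preallocate a .npy file with numpy.lib.format.open_memmap, which also records shape and dtype so the result loads with numpy.load. The sketch below assumes both dimensions are known in advance; the sizes and the frames list are made-up stand-ins for the real reader.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes; both must be known before writing starts.
n_headers, row_len = 3, 6

path = os.path.join(tempfile.mkdtemp(), "bigMatrix.npy")
out = np.lib.format.open_memmap(path, mode="w+", dtype=np.complex64,
                                shape=(n_headers, row_len))

# Stand-in for the frame reader: (header, data) pairs in shuffled order.
frames = [(0, [1, 2]), (2, [5, 6]), (0, [3, 4]), (2, [7, 8])]

position = {}  # next free column for each header
for header, data in frames:
    start = position.get(header, 0)
    out[header, start:start + len(data)] = data  # write into the row in place
    position[header] = start + len(data)

out.flush()
del out  # release the memory map

# Shape and dtype come back from the .npy header; nothing is loaded eagerly.
result = np.load(path, mmap_mode="r")
```

Rows and columns that are never written stay zero, which matches the zero-filling behaviour of the seek-based version.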

Because of this answer, I asked a question about this out-of-order write pattern: What happens when you seek past the end of a file opened for writing?
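A quick way to see the behaviour for yourself is the small experiment below. The file name is arbitrary. On POSIX systems the skipped region reads back as zero bytes; other platforms may behave differently, so treat this as a check to run rather than a guarantee.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sparse.bin")

with open(path, "wb") as f:
    f.seek(16)            # jump past the current end of the empty file
    f.write(b"\xff" * 4)  # the write extends the file to 20 bytes

with open(path, "rb") as f:
    contents = f.read()

gap = contents[:16]       # the region that was skipped over
payload = contents[16:]   # the bytes that were actually written
```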

【Comments】: