Improve pandas (PyTables?) HDF5 table write performance

【Question title】: Improve pandas (PyTables?) HDF5 table write performance
【Posted】: 2013-12-03 16:30:08
【Question】:

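I have been using pandas for research for about two months now, to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) does a tremendous job of letting me process heterogeneous data with all the Python tools I know and love.

Generally speaking, I use the Fixed (formerly "Storer") format in PyTables, because my workflow is write-once, read-many, and many of my datasets are small enough that I can load 50-100 of them into memory at a time with no serious drawbacks. (Note: I do most of my work on Opteron server-class machines with 128GB+ of system memory.)

However, for large datasets (500MB and greater), I would like to use the more scalable random-access and query abilities of the PyTables "Table" format, so that I can run my queries out-of-memory and then load the much smaller result set into memory for processing. The big hurdle here is write performance. Yes, as I said, my workflow is write-once, read-many, but the relative times are still unacceptable.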
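As a rough illustration of the query workflow described above, the sketch below writes a small frame in Table format and reads back only the rows that match a condition. The file name, the toy columns, and the choice of `node_id` as a `data_columns` entry are assumptions made for illustration, and the calls follow a recent pandas API rather than the 0.12 calls used elsewhere in this question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for a trace dataset (hypothetical columns and sizes).
events = pd.DataFrame({
    "node_id": np.arange(10_000) % 4,
    "duration": np.arange(10_000, dtype="float64"),
})

path = os.path.join(tempfile.mkdtemp(), "events_table.h5")

# Table format; data_columns stores node_id as its own queryable column.
events.to_hdf(path, key="events", mode="w", format="table",
              data_columns=["node_id"])

# The where clause is evaluated against the file; only matching rows
# come back as a DataFrame.
subset = pd.read_hdf(path, key="events", where="node_id == 2")
print(len(subset))
```

The Fixed format does not support `where` queries of this kind, which is the trade-off the rest of the question is about.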

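As an example, I recently ran a large Cholesky factorization that took 3 minutes, 8 seconds (188 seconds) on my 48-core machine. This generated a trace file of about 2.2 GB - the trace is generated in parallel with the program, so there is no additional "trace creation time."

The initial conversion of my binary trace file into the pandas/PyTables format takes a fair amount of time, but mostly because the binary format is deliberately out-of-order, to reduce the performance impact of the trace generator itself. This is also unrelated to the performance loss when moving from the Storer format to the Table format.

My tests were initially run with pandas 0.12, numpy 1.7.1, PyTables 2.4.0, and numexpr 0.20.1. My 48-core machine runs at 2.8GHz per core, and I am writing to an ext3 filesystem that is probably (but not certainly) on an SSD.

I can write the entire dataset to a Storer format HDF5 file (resulting file size: 3.3GB) in 7.1 seconds. The same dataset, written to the Table format (resulting file size is also 3.3GB), takes 178.7 seconds to write.

The code is as follows: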

with Timer() as t:
    store = pd.HDFStore('test_storer.h5', 'w')
    store.put('events', events_dataset, table=False, append=False)
print('Fixed format write took ' + str(t.interval))
with Timer() as t:
    store = pd.HDFStore('test_table.h5', 'w')
    store.put('events', events_dataset, table=True, append=False)
print('Table format write took ' + str(t.interval))

The output is simply:

Fixed format write took 7.1
Table format write took 178.7
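The `Timer` helper used in the timing code above is not shown in the question. A minimal sketch of such a context manager, assuming it only needs to expose the elapsed wall-clock time as `interval` (this is a guess at the helper, not the original code):

```python
import time


class Timer:
    """Context manager that stores elapsed wall-clock seconds in .interval."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.interval = time.perf_counter() - self.start
        return False  # do not swallow exceptions raised inside the block


with Timer() as t:
    sum(range(100_000))
print('Example block took ' + str(t.interval))
```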

My dataset has 28,880,943 rows, and the columns are basic data types:

node_id           int64
thread_id         int64
handle_id         int64
type              int64
begin             int64
end               int64
duration          int64
flags             int64
unique_id         int64
id                int64
DSTL_LS_FULL    float64
L2_DMISS        float64
L3_MISS         float64
kernel_type     float64
dtype: object

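...so I don't think there should be any data-specific issues with the write speed.

I have also tried adding BLOSC compression, to rule out any strange I/O issues that might affect one scenario or the other, but compression seems to reduce the performance of both equally.

Now, I realize that the pandas documentation says the Storer format offers significantly faster writes and slightly faster reads. (I do see the faster reads: reading the Storer format takes around 2.5 seconds, while reading the Table format takes around 10 seconds.) But it really seems excessive that a Table format write should take 25 times as long as a Storer format write.

Can anyone involved with PyTables or pandas explain the architectural (or other) reasons why writing to the queryable format (which clearly requires very little extra data) should take an order of magnitude longer? And is there any hope of improving this in the future? I would love to contribute to one project or the other, since my field is high-performance computing and I see a significant use case for both projects in this domain... but it would be helpful to first get some clarification on the issues involved, and/or some advice on how to speed things up, from those who know how the system is built.

EDIT:

Running the earlier tests with %prun in IPython gives the following profile output (somewhat reduced for readability) for the Storer/Fixed format: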

%prun -l 20 profile.events.to_hdf('test.h5', 'events', table=False, append=False)

3223 function calls (3222 primitive calls) in 7.385 seconds

Ordered by: internal time
List reduced from 208 to 20 due to restriction <20>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    6    7.127    1.188    7.128    1.188 {method '_createArray' of 'tables.hdf5Extension.Array' objects}
    1    0.242    0.242    0.242    0.242 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
    1    0.003    0.003    0.003    0.003 {method '_g_new' of 'tables.hdf5Extension.File' objects}
   46    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}

The Table format gives the following profile:

   %prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)

   499082 function calls (499040 primitive calls) in 188.981 seconds

   Ordered by: internal time
   List reduced from 526 to 40 due to restriction <40>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       29   92.018    3.173   92.018    3.173 {pandas.lib.create_hdf_rows_2d}
      640   20.987    0.033   20.987    0.033 {method '_append' of 'tables.hdf5Extension.Array' objects}
       29   19.256    0.664   19.256    0.664 {method '_append_records' of 'tables.tableExtension.Table' objects}
      406   19.182    0.047   19.182    0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
    14244   10.646    0.001   10.646    0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
      472   10.359    0.022   10.359    0.022 {method 'copy' of 'numpy.ndarray' objects}
       80    3.409    0.043    3.409    0.043 {tables.indexesExtension.keysort}
        2    3.023    1.512    3.023    1.512 common.py:134(_isnull_ndarraylike)
       41    2.489    0.061    2.533    0.062 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       87    2.401    0.028    2.401    0.028 {method 'astype' of 'numpy.ndarray' objects}
       30    1.880    0.063    1.880    0.063 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
      282    0.824    0.003    0.824    0.003 {method 'reduce' of 'numpy.ufunc' objects}
       41    0.537    0.013    0.668    0.016 index.py:607(final_idx32)
    14490    0.385    0.000    0.712    0.000 array.py:342(_interpret_indexing)
       39    0.279    0.007   19.635    0.503 index.py:1219(reorder_slice)
        2    0.256    0.128   10.063    5.031 index.py:1099(get_neworder)
        1    0.090    0.090  119.392  119.392 pytables.py:3016(write_data)
    57842    0.087    0.000    0.087    0.000 {numpy.core.multiarray.empty}
    28570    0.062    0.000    0.107    0.000 utils.py:42(is_idx)
    14164    0.062    0.000    7.181    0.001 array.py:711(_readSlice)
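Outside IPython, a report in the same shape as the `%prun -l N` output above (sorted by internal time, truncated to N rows) can be produced with the standard library. The sketch below is only an approximation of what `%prun` does; `profile_call` and `busy_work` are hypothetical names, and `busy_work` stands in for the real `to_hdf` call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, limit=20, **kwargs):
    """Run func under cProfile; return a report sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is the internal-time ordering shown in the reports above.
    stats.sort_stats("tottime").print_stats(limit)
    return buffer.getvalue()


def busy_work(n):
    # Placeholder workload (hypothetical); replace with the call to profile.
    return sum(i * i for i in range(n))


report = profile_call(busy_work, 200_000, limit=5)
print(report)
```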

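EDIT 2:

Running again with a pre-release copy of pandas 0.13 (from 20 Nov 2013, around 11:00 EST), write times for the Table format improve significantly, but still don't compare "reasonably" with the write speed of the Storer/Fixed format.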

%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)

         499748 function calls (499720 primitive calls) in 117.187 seconds

   Ordered by: internal time
   List reduced from 539 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      640   22.010    0.034   22.010    0.034 {method '_append' of 'tables.hdf5Extension.Array' objects}
       29   20.782    0.717   20.782    0.717 {method '_append_records' of 'tables.tableExtension.Table' objects}
      406   19.248    0.047   19.248    0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
    14244   10.685    0.001   10.685    0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
      472   10.439    0.022   10.439    0.022 {method 'copy' of 'numpy.ndarray' objects}
       30    7.356    0.245    7.356    0.245 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
       29    7.161    0.247   37.609    1.297 pytables.py:3498(write_data_chunk)
        2    3.888    1.944    3.888    1.944 common.py:197(_isnull_ndarraylike)
       80    3.581    0.045    3.581    0.045 {tables.indexesExtension.keysort}
       41    3.248    0.079    3.294    0.080 {method '_fillCol' of 'tables.tableExtension.Row' objects}
       34    2.744    0.081    2.744    0.081 {method 'ravel' of 'numpy.ndarray' objects}
      115    2.591    0.023    2.591    0.023 {method 'astype' of 'numpy.ndarray' objects}
      270    0.875    0.003    0.875    0.003 {method 'reduce' of 'numpy.ufunc' objects}
       41    0.560    0.014    0.732    0.018 index.py:607(final_idx32)
    14490    0.387    0.000    0.712    0.000 array.py:342(_interpret_indexing)
       39    0.303    0.008   19.617    0.503 index.py:1219(reorder_slice)
        2    0.288    0.144   10.299    5.149 index.py:1099(get_neworder)
    57871    0.087    0.000    0.087    0.000 {numpy.core.multiarray.empty}
        1    0.084    0.084   45.266   45.266 pytables.py:3424(write_data)
        1    0.080    0.080   55.542   55.542 pytables.py:3385(write)

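While running these tests I noticed that there are long periods where writing seems to "pause" (the file on disk is not actively growing), and yet CPU usage is also low during some of these periods.

I am beginning to suspect that some known ext3 limitations may interact badly with either pandas or PyTables. Ext3 and other non-extent-based filesystems sometimes struggle to unlink large files promptly, and similar system behavior (low CPU usage, but long wait times) is apparent even during a simple 'rm' of a 1GB file.

To clarify, in each test case I made sure to remove the existing file, if any, before starting the test, so as not to incur any ext3 file removal/overwrite penalty.

However, when re-running this test with index=None, performance improves drastically (about 50 s, versus about 120 s when indexing). So it would seem that either this process remains CPU-bound (my system has relatively old AMD Opteron Istanbul CPUs running at 2.8GHz, though it also has 8 sockets with a 6-core CPU in each, all but one core of which, of course, sit idle during the write), or there is some conflict in the way PyTables or pandas manipulates/reads/analyzes the file once it is already partially or fully on the filesystem, causing pathologically bad I/O behavior while the indexing is occurring.

EDIT 3:

Running @Jeff's suggested tests on a smaller dataset (1.3 GB on disk), after upgrading PyTables from 2.4 to 3.0.0, gives me the following: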

In [7]: %timeit f(df)
1 loops, best of 3: 3.7 s per loop

In [8]: %timeit f2(df) # where chunksize= 2 000 000
1 loops, best of 3: 13.8 s per loop

In [9]: %timeit f3(df) # where chunksize= 2 000 000
1 loops, best of 3: 43.4 s per loop

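In fact, my performance seems to beat his in all scenarios except when indexing is turned on (the default). Indexing still seems to be a killer, though, and if I am interpreting the output of top and ls correctly as I run these tests, there remain periods when neither significant processing nor any file writing is happening (i.e., CPU usage for the Python process is near 0 and the file size stays constant). I can only assume these are file reads. Why file reads would cause a slowdown is hard for me to understand, since I can reliably load an entire 3+ GB file from this disk into memory in under 3 seconds. If they are not file reads, then what is the system "waiting" on? (No one else is logged into the machine, and there is no other filesystem activity.)

At this point, with upgraded versions of the relevant Python modules, the performance for my original dataset is down to the following figures. Of special interest are the system time, which I assume is at least an upper bound on the time spent performing I/O, and the wall time, which perhaps accounts for these mysterious periods of no write/no CPU activity.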

In [28]: %time f(profile.events)
CPU times: user 0 ns, sys: 7.16 s, total: 7.16 s
Wall time: 7.51 s

In [29]: %time f2(profile.events)
CPU times: user 18.7 s, sys: 14 s, total: 32.7 s
Wall time: 47.2 s

In [31]: %time f3(profile.events)
CPU times: user 1min 18s, sys: 14.4 s, total: 1min 32s
Wall time: 2min 5s
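To make the gap between wall time and CPU time in the figures above explicit, the sketch below subtracts the reported CPU totals from the reported wall times, and shows one way to take the same measurement outside IPython. `measure` is a hypothetical helper, and the subtraction is only meaningful under the assumption that the write runs on a single thread.

```python
import os
import time

# (wall seconds, total CPU seconds) as reported by %time above.
reported = {"f": (7.51, 7.16), "f2": (47.2, 32.7), "f3": (125.0, 92.0)}
gaps = {name: wall - cpu for name, (wall, cpu) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} s of wall time not spent on the CPU")


def measure(func, *args, **kwargs):
    """Return wall, user, system and unaccounted seconds for one call."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    func(*args, **kwargs)
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    user = cpu_after.user - cpu_before.user
    system = cpu_after.system - cpu_before.system
    return {"wall": wall, "user": user, "system": system,
            "unaccounted": wall - user - system}


# time.sleep stands in for a call that mostly waits instead of computing.
result = measure(time.sleep, 0.3)
print(result)
```

A large "unaccounted" value means the process was blocked (for example on I/O) rather than computing, which matches the idle periods described above.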

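Nevertheless, indexing appears to cause a significant slowdown for my use case. Perhaps I should try limiting the fields that are indexed, instead of simply using the default (which may very well be indexing all of the fields in the DataFrame)? I am not sure how this would affect query times, especially when a query selects on a non-indexed field.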
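One way to experiment with limiting the indexed fields is to append without an index and then index only chosen columns afterwards. The sketch below assumes a recent pandas with PyTables installed; the toy frame, the file name, and the choice of `node_id` as the only indexed column are illustrative assumptions, and whether this helps for the dataset in this question is untested.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_with_selective_index(df, path, indexed_columns):
    """Append df as a table without indexing, then index chosen columns."""
    with pd.HDFStore(path, mode="w") as store:
        # data_columns makes these columns individually queryable;
        # index=False skips index creation during the write itself.
        store.append("df", df, data_columns=indexed_columns, index=False)
        store.create_table_index("df", columns=indexed_columns, kind="full")


# Toy data (hypothetical columns).
df = pd.DataFrame({
    "node_id": np.arange(1000) % 8,
    "duration": np.arange(1000, dtype="float64"),
})
path = os.path.join(tempfile.mkdtemp(), "selective.h5")
write_with_selective_index(df, path, ["node_id"])

with pd.HDFStore(path, mode="r") as store:
    cols = store.get_storer("df").table.cols
    node_id_indexed = bool(cols.node_id.is_indexed)
    row_index_indexed = bool(cols.index.is_indexed)
    subset = store.select("df", where="node_id == 3")

print(node_id_indexed, row_index_indexed, len(subset))
```

Columns that are not listed in `data_columns` are stored in shared value blocks and cannot be used in a `where` clause at all, so the choice of data columns affects which queries are possible, not just how fast they run.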

At Jeff's request, here is a ptdump of the resulting file.

ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := 
    [(1, ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id', 'DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1']]
/df/table (Table(28880943,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(10,), dflt=0, pos=1),
  "values_block_1": Float64Col(shape=(4,), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (4369,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 15 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'values_block_1',
    NROWS := 28880943,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'int64',
    values_block_0_kind := ['node_id', 'thread_id', 'handle_id', 'type', 'begin', 'end', 'duration', 'flags', 'unique_id', 'id'],
    values_block_1_dtype := 'float64',
    values_block_1_kind := ['DSTL_LS_FULL', 'L2_DMISS', 'L3_MISS', 'kernel_type']]

Another %prun, with the updated modules and the full dataset:

%prun -l 25  %time f3(profile.events)
CPU times: user 1min 14s, sys: 16.2 s, total: 1min 30s
Wall time: 1min 48s

        542678 function calls (542650 primitive calls) in 108.678 seconds

   Ordered by: internal time
   List reduced from 629 to 25 due to restriction <25>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      640   23.633    0.037   23.633    0.037 {method '_append' of 'tables.hdf5extension.Array' objects}
       15   20.852    1.390   20.852    1.390 {method '_append_records' of 'tables.tableextension.Table' objects}
      406   19.584    0.048   19.584    0.048 {method '_g_write_slice' of 'tables.hdf5extension.Array' objects}
    14244   10.591    0.001   10.591    0.001 {method '_g_read_slice' of 'tables.hdf5extension.Array' objects}
      458    9.693    0.021    9.693    0.021 {method 'copy' of 'numpy.ndarray' objects}
       15    6.350    0.423   30.989    2.066 pytables.py:3498(write_data_chunk)
       80    3.496    0.044    3.496    0.044 {tables.indexesextension.keysort}
       41    3.335    0.081    3.376    0.082 {method '_fill_col' of 'tables.tableextension.Row' objects}
       20    2.551    0.128    2.551    0.128 {method 'ravel' of 'numpy.ndarray' objects}
      101    2.449    0.024    2.449    0.024 {method 'astype' of 'numpy.ndarray' objects}
       16    1.789    0.112    1.789    0.112 {method '_g_flush' of 'tables.hdf5extension.Leaf' objects}
        2    1.728    0.864    1.728    0.864 common.py:197(_isnull_ndarraylike)
       41    0.586    0.014    0.842    0.021 index.py:637(final_idx32)
    14490    0.292    0.000    0.616    0.000 array.py:368(_interpret_indexing)
        2    0.283    0.142   10.267    5.134 index.py:1158(get_neworder)
      274    0.251    0.001    0.251    0.001 {method 'reduce' of 'numpy.ufunc' objects}
       39    0.174    0.004   19.373    0.497 index.py:1280(reorder_slice)
    57857    0.085    0.000    0.085    0.000 {numpy.core.multiarray.empty}
        1    0.083    0.083   35.657   35.657 pytables.py:3424(write_data)
        1    0.065    0.065   45.338   45.338 pytables.py:3385(write)
    14164    0.065    0.000    7.831    0.001 array.py:615(__getitem__)
    28570    0.062    0.000    0.108    0.000 utils.py:47(is_idx)
       47    0.055    0.001    0.055    0.001 {numpy.core.multiarray.arange}
    28570    0.050    0.000    0.090    0.000 leaf.py:397(_process_range)
    87797    0.048    0.000    0.048    0.000 {isinstance}

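【Question comments】:

Maybe so. I have seen so many pandas questions answered here that I decided it was worth seeing whether anyone would jump in with "this has an obvious answer, and it's XYZ!" But I may post over there soon.

Tags: python performance pandas hdf5 pytables

【Answer 1】:

Here is a similar comparison I just did. It is about 1/3 of the data, at 10M rows. The final size is about 1.3GB.

I define 3 timing functions:

Test the Fixed format (called Storer in 0.12). This writes in the PyTables Array format.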

def f(df):
    store = pd.HDFStore('test.h5','w')
    store['df'] = df
    store.close()

Write in the Table format, using the PyTables Table format. Do not create an index.

def f2(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df,index=False)
    store.close()

Same as f2, but create an index (which is normally done).

def f3(df):
    store = pd.HDFStore('test.h5','w')
    store.append('df',df)
    store.close()

Create the frame

In [25]: df = concat([DataFrame(np.random.randn(10000000,10)),DataFrame(np.random.randint(0,10,size=50000000).reshape(10000000,5))],axis=1)

In [26]: df
Out[26]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 4
dtypes: float64(10), int64(5)


v0.12.0

In [27]: %timeit f(df)
1 loops, best of 3: 14.7 s per loop

In [28]: %timeit f2(df)
1 loops, best of 3: 32 s per loop

In [29]: %timeit f3(df)
1 loops, best of 3: 40.1 s per loop

master/v0.13.0

In [5]: %timeit f(df)
1 loops, best of 3: 12.9 s per loop

In [6]: %timeit f2(df)
1 loops, best of 3: 17.5 s per loop

In [7]: %timeit f3(df)
1 loops, best of 3: 24.3 s per loop

Timings with the same file as provided by the OP (link is below)

In [4]: df = pd.read_hdf('test.h5','df')

In [5]: df
Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28880943 entries, 0 to 28880942
Columns: 14 entries, node_id to kernel_type
dtypes: float64(4), int64(10)

Same as f, Fixed format

In [6]: %timeit df.to_hdf('test.hdf','df',mode='w')
1 loops, best of 3: 36.2 s per loop

Same as f2, Table format, no index

In [7]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False)
1 loops, best of 3: 45 s per loop

In [8]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',index=False,chunksize=2000000)
1 loops, best of 3: 44.5 s per loop

Same as f3, Table format with index

In [9]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000)
1 loops, best of 3: 1min 36s per loop

Same as f3, Table format with index, compressed with blosc

In [10]: %timeit df.to_hdf('test.hdf','df',mode='w',format='table',chunksize=2000000,complib='blosc')
1 loops, best of 3: 46.5 s per loop

In [11]: %timeit pd.read_hdf('test.hdf','df')
1 loops, best of 3: 10.8 s per loop

Show the original file (test.h5) and the compressed one (test.hdf)

In [13]: !ls -ltr test.h*
-rw-r--r-- 1 jreback users 3471518282 Nov 20 18:20 test.h5
-rw-rw-r-- 1 jreback users  649327780 Nov 20 21:17 test.hdf

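Several points to note.

Not creating an index can make a non-trivial difference in time. I also believe that a string-based index can substantially worsen write time. That said, you always want to create an index to make retrieval fast.

You didn't say what your index is, nor whether it is sorted (though I think this makes only a small difference).
The write penalty in my examples is roughly 2x (though I have seen it be somewhat bigger when the index time is included). Thus your 7 s (half of my time) for 3x the amount of data I am writing is quite suspect. I am using a reasonably fast disk array. If you are using a flash-based disk, though, this is possible.
master/v0.13.0 (to be released very soon) substantially improves write times for tables.
You can try setting the chunksize parameter to a bigger number when you write the data (the default is 100000). The purpose of the "relatively" low number is to keep memory usage constant (i.e., a bigger value uses more memory, though in theory it should write faster).
Tables offer 2 advantages over the Fixed format: 1) query retrieval, and 2) appendability. Reading the entire table takes advantage of neither, so if you ONLY want to read the entire table, the Fixed format is recommended. (In my experience the flexibility of Tables greatly outweighs the write penalty, but YMMV.)

The bottom line is to repeat the timings (use IPython, as it will run multiple tests). If you can reproduce your results, please post a %prun and I'll take a look.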
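For repeating timings outside IPython, the "best of 3" figure that `%timeit` reports can be approximated with the standard library. This is a minimal sketch; `best_of` and `workload` are hypothetical names, and `workload` stands in for a call such as `f(df)`.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the fastest per-call time over `repeat` runs of `number` calls."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def workload():
    # Placeholder for the write being timed (hypothetical).
    return sorted(range(100_000, 0, -1))


best = best_of(workload)
print(f"1 loop, best of 3: {best:.4f} s per loop")
```

Taking the minimum rather than the mean follows the `%timeit` output shown in this thread, and reduces the influence of one-off disturbances such as cache effects on the first run.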

Update:

So the recommended approach for a table of this size is to compress with blosc, and to use pandas master/0.13.0 together with PyTables 3.0.0.
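A small sketch of what the blosc recommendation looks like with a recent pandas API is given below. The toy data and file names are assumptions, and `complevel=9` is passed explicitly on the assumption that current pandas applies no compression unless a level is given; the sketch only compares file sizes and says nothing about write speed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Highly compressible toy data (hypothetical; small repeating integers).
df = pd.DataFrame({
    "node_id": np.arange(200_000) % 8,
    "flags": np.zeros(200_000, dtype="int64"),
})

workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.h5")
blosc_path = os.path.join(workdir, "blosc.h5")

# Same Table-format write, without and with blosc compression.
df.to_hdf(plain_path, key="df", mode="w", format="table")
df.to_hdf(blosc_path, key="df", mode="w", format="table",
          complib="blosc", complevel=9)

plain_size = os.path.getsize(plain_path)
blosc_size = os.path.getsize(blosc_path)
print(plain_size, blosc_size)
```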

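【Comments】:

Thanks for the response! I know the advantages of Tables versus Fixed/Storer, and the ones you listed are certainly the main reasons I am interested in getting this to work better. Sadly, the performance of your test code reflects my previous results quite well: 3.53 s per loop for f(), 51 s per loop for f2(), and 1 min 14 s per loop for f3(). I honestly don't know what the backing store is on this machine (it is not mine to administer), but it seems to me that if f() is faster on my machine than on yours, the others shouldn't lag so far behind your results...
If you can try with master and post a %prun of the Table format, that would be great.
Peter, I did my timings with your file. I don't see much difference between the Fixed and Table formats (when writing with an index and using blosc). Your data is highly compressible (which is good when you have a lot of processors; it means reading and writing the data is faster than for uncompressed data). Why your Fixed table is so fast, I have no idea. HTH
I believe my results are limited by my disk I/O rate (a NAS with multiple disks that writes at 100MB/s). A local disk would be much faster. That is why compression helps me so much: writing less data lets me make better use of the processors.
I am not using any alternative filesystem, so no help there. blosc does use multiple processors. I am not sure whether indexing can, but I don't think so. In your case the local disk seems to be so fast that it is not worth the time to create a Table, unless you are running a lot of queries (and not appending).

【Answer 2】:

That's an interesting discussion. I think Peter gets such great performance with the Fixed format because that format writes in a single shot, and because he has a really good SSD (it can write at more than 450 MB/s).

Appending to a table is a more complex operation (the dataset has to be enlarged, and new records have to be checked so that we can ensure they follow the schema of the table). This is why appending rows to tables is generally slower (but Jeff is still getting ~ 70 MB/s, which is pretty good). That Jeff gets more speed than Peter is probably because he has a better processor.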
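The throughput figures quoted above can be roughly checked against the sizes and timings posted earlier in this thread. The sketch below uses decimal megabytes and treats "3.3GB" as exactly 3.3e9 bytes, which is an approximation; which of Jeff's timings the "~ 70 MB/s" refers to is not stated, so the pairing with his 45 s no-index Table write is an assumption.

```python
def throughput_mb_per_s(size_bytes, seconds):
    """Average write rate in decimal megabytes per second."""
    return size_bytes / seconds / 1e6


# Peter: 3.3GB file, Fixed in 7.1 s, Table in 178.7 s.
peter_fixed = throughput_mb_per_s(3.3e9, 7.1)
peter_table = throughput_mb_per_s(3.3e9, 178.7)

# Jeff: test.h5 is 3471518282 bytes; Fixed in 36.2 s, Table (no index) in 45 s.
jeff_fixed = throughput_mb_per_s(3471518282, 36.2)
jeff_table_no_index = throughput_mb_per_s(3471518282, 45.0)

for label, value in [
    ("Peter, Fixed", peter_fixed),
    ("Peter, Table", peter_table),
    ("Jeff, Fixed", jeff_fixed),
    ("Jeff, Table without index", jeff_table_no_index),
]:
    print(f"{label}: {value:.1f} MB/s")
```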

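Finally, yes, indexing in PyTables uses a single processor, and it is normally an expensive operation, so you should really disable it if you are not going to query the data on disk.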
