Using blosc compression in pandas causes heap corruption

【Question title】: Using blosc compression in pandas causes heap corruption
【Posted】: 2015-12-30 02:45:24
【Question description】:

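I have been using pandas for a while, but I am new to HDF5, so I am trying to learn it and convert some of my research data files to HDF5. I have gone through a number of SO posts about Python and HDF5, and I am interested in the BLOSC compression algorithm (we do a lot of computation on our datasets, so read/write speed matters more than storage size).

I am running into a problem with the blosc compression library when using pandas.to_hdf. When I use blosc, Python crashes, and when I open debugging in Visual Studio 2010 I get:

Unhandled exception at 0x00007ffcd59fa28c in python.exe: 0xC0000374: A heap has been corrupted.

I set up a standalone example in a script and hit the same problem: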

import numpy as np
import pandas as pd

test = pd.DataFrame()
test['random1'] = np.random.randn(1000000)
test['random2'] = np.random.randn(1000000)
test['random3'] = np.random.randn(1000000)

# Write out a csv first to compare file sizes
test.to_csv('./examples/data/random_3c.csv')

# Write out using different compression algorithms to compare
test.to_hdf('./examples/data/random_3c_zlib.h5',
            key='Random_3Col', mode='w', format='table', 
            append=False, complevel=9, complib='zlib', fletcher32=True)

test.to_hdf('./examples/data/random_3c_blosc.h5',
            key='Random_3Col', mode='w', format='table', 
            append=False, complevel=9, complib='blosc', fletcher32=True)

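The csv writes fine (file size is 65,217 kb)
The zlib compression writes fine (file size is 21,719 kb)
The blosc compression crashes the kernel, and I get the heap corruption message when I open debugging in VS
My pandas version is 0.16.2
My PyTables version is 3.2.0
I also installed HDF5 from hdfgroup
I am using a Windows machine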
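
When reporting a crash like this, it helps to collect the installed versions in one go. The following is a minimal sketch, not part of the original post: the helper name `report_versions` is hypothetical, and it assumes the packages were installed with metadata that `importlib.metadata` can see (PyTables is published under the distribution name `tables`). Packages that are missing are reported as `None` rather than raising.

```python
from importlib import metadata
import platform


def report_versions(packages=("pandas", "tables", "numpy", "blosc")):
    """Return {distribution name: installed version, or None if absent}.

    Hypothetical helper for bug reports; "tables" is the distribution
    name of PyTables.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Print the platform first, since the crash was seen on Windows.
    print(platform.platform(), "Python", platform.python_version())
    for name, version in report_versions().items():
        print(name, version or "not installed")
```

This only shows the Python package versions. PyTables also provides `tables.print_versions()`, which should additionally list the HDF5 and Blosc libraries it was built against.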

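At this point I am not even sure how to start tracking down what is causing the crash. Any suggestions, or has anyone seen this before? I found some SO questions from people having problems trying to use an external blosc library, but I have not gotten anywhere near that yet. I figured I would get the basics working first! As far as I know, pandas uses PyTables, which comes bundled with a version of blosc.

Thanks!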
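
One way to start narrowing down a crash that takes the whole interpreter with it is to run the failing call in a child process, so the parent session survives and can inspect the exit code. This is only a sketch under assumptions: `run_isolated` and `PROBE` are hypothetical names, the probe file name is made up, and the probe uses a smaller frame than the example above.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run `snippet` in a fresh interpreter; return (exit code, stderr text)."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stderr


# Hypothetical probe: a small blosc write, kept as a string so that it
# only ever executes inside the child process.
PROBE = """
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.random.randn(1000)})
df.to_hdf('probe_blosc.h5', key='probe', mode='w', format='table',
          complevel=9, complib='blosc')
"""
```

Calling `run_isolated(PROBE)` returns 0 when the write succeeds. An ordinary Python exception gives exit code 1 with a traceback in the stderr text, while a native crash such as heap corruption should show up as some other nonzero code with little or no traceback. Varying one argument at a time in the probe (for example `complib` or `fletcher32`) can then show which combination triggers the crash.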

【Question comments】:

Same problem here; I reported a bug at github.com/pydata/pandas/issues/11266

Tags: python pandas hdf5 pytables

【Solution 1】:

If you are using the Anaconda distribution, this is a package build problem: "Pytables 3.2, python 3.4 under windows x64 · Issue #458 · ContinuumIO/anaconda-issues". You can watch that issue and wait for a fix.

【Comments】:

Thanks @xgdgsc! I downgraded PyTables to 3.1.1 and blosc compression works. I am following the issue and waiting for a fix.
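
If downgrading is not an option, a script could instead avoid blosc on the combination reported in this thread. The sketch below is an illustration, not a confirmed fix: `parse_version` and `choose_complib` are hypothetical helpers, and treating "PyTables 3.2.0 on Windows" as the affected case is an assumption drawn only from the versions mentioned here (3.2.0 crashed, 3.1.1 worked). The actual cause was described as an Anaconda package build problem, so this check may be broader or narrower than necessary.

```python
import sys


def parse_version(text):
    """Turn '3.2.0' into (3, 2, 0), reading only the leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '3.2' to (3, 2, 0)
    return tuple(parts)


def choose_complib(pytables_version, platform=sys.platform):
    """Prefer blosc, but fall back to zlib on the build reported to crash.

    Assumption: only PyTables 3.2.0 on Windows is treated as affected.
    """
    affected = (
        platform.startswith("win")
        and parse_version(pytables_version) == (3, 2, 0)
    )
    return "zlib" if affected else "blosc"
```

The result can be passed straight to the `complib` argument of `to_hdf`, for example `complib=choose_complib(tables.__version__)` after `import tables`.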