Store ndarray in a PyTable (and how to define the Col()-type): answers

【Question title】: Store ndarray in a PyTable (and how to define the Col()-type)
【Posted】: 2020-10-26 05:21:32
【Question】:

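TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy float32 array into it. (How) can I store a numpy array (float32) in a column of a PyTables table?

I'm new to PyTables. Following the recommendation of TFtables (a library for using HDF5 in TensorFlow), I'm using it to store all my HDF5 data (currently distributed in batches over several files, with three datasets each) in a table within a single HDF5 file. The datasets are: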

'data' : (n_elements, 1024, 1024, 4)@float32
'label' : (n_elements, 1024, 1024, 1)@uint8
'weights' : (n_elements, 1024, 1024, 1)@float32

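The n_elements are distributed over several files, which I now want to merge into one file (to allow unordered access).

So when I constructed the table, I figured that each dataset represents one column. I built everything in a generic way so that it works for an arbitrary number of datasets: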

import tables

# get dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights'])
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)

# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict follows the form given in the docs: { 'col_name' : Col()-class-descendant }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}

# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(dset_keys)))

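The dtype I get for each dset (read with h5py) is obviously that of the numpy array (ndarray) returned by reading the dset: float32 or uint8. So the Col() types are Float32Col and UInt8Col. I naively assumed that I could now write a float32 array into this column, but filling in data with: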

dummy_data = np.zeros([1024,1024,3], np.float32) # normally data read from other files

sample = table.row
sample['data'] = dummy_data

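This results in an error. So it seems I was wrong to assume I could write an array into the column, yet no "ArrayCol()" type is offered, and the PyTables docs give no hint as to whether or how arrays can be written into columns. How do I do this?

The Col() class and its descendants have a "shape" parameter, so it should be possible; otherwise, what would that parameter be for?!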

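【Comments】:

I wouldn't mind some constructive criticism on why this question was downvoted... I put quite some work into it.
Yes, I don't understand why this question was downvoted either, so I upvoted it, because I have the same problem and it is well described.

Tags: python arrays numpy pytables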

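【Answer 1】:

I know this is a bit late, but I think the answer to your question lies in the shape parameter of Float32Col.

Here is how it is used in the documentation: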


from tables import *
from numpy import *

# Describe a particle record
class Particle(IsDescription):
    name        = StringCol(itemsize=16)  # 16-character string
    lati        = Int32Col()              # integer
    longi       = Int32Col()              # integer
    pressure    = Float32Col(shape=(2,3)) # array of floats (single-precision)
    temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)

# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode = "w")

# Get the HDF5 root group
root = fileh.root

# Create the groups:
for groupname in ("Particles", "Events"):
    group = fileh.create_group(root, groupname)

# Now, create and fill the tables in Particles group
gparticles = root.Particles

# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
    # Create a table
    table = fileh.create_table("/Particles", tablename, Particle, "Particles: "+tablename)

    # Get the record object associated with the table:
    particle = table.row

    # Fill the table with 257 particles
    for i in range(257):
        # First, assign the values to the Particle record
        particle['name'] = 'Particle: %6d' % (i)
        particle['lati'] = i
        particle['longi'] = 10 - i

        ########### Detectable errors start here. Play with them!
        #particle['pressure'] = array(i*arange(2*3)).reshape((2,4)) # Incorrect: 6 elements cannot be reshaped to (2,4)
        particle['pressure'] = array(i*arange(2*3)).reshape((2,3))  # Correct: matches shape=(2,3)
        ########### End of errors

        particle['temperature'] = (i**2)     # Broadcasting

        # This injects the Record values
        particle.append()

    # Flush the table buffers
    table.flush()

Here is the link to the part of the documentation I'm referring to: https://www.pytables.org/usersguide/tutorials.html
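Applied to the question's dict-based description, the same shape parameter can be sketched as below. This is a minimal illustration rather than the asker's actual code: the element shape (8, 8, 4), the file name shape_demo.h5, and the temporary directory are assumptions chosen to keep the example small.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical small element shape; the question uses (1024, 1024, 4).
ELEMENT_SHAPE = (8, 8, 4)

# Dict-based description: each cell of the 'data' column holds one whole array.
description = {"data": tables.Float32Col(shape=ELEMENT_SHAPE)}

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "shape_demo.h5")

with tables.open_file(path, mode="w", title="shape demo") as h5file:
    table = h5file.create_table("/", "data_table", description)

    # Assign a full array to the cell, then append the row.
    row = table.row
    row["data"] = np.ones(ELEMENT_SHAPE, dtype=np.float32)
    row.append()
    table.flush()

    # Read the whole column back: one row, each cell of ELEMENT_SHAPE.
    stored = table.read(field="data")

print(stored.shape, stored.dtype)
```

The array assigned to the row has to match the column's declared shape (or be broadcastable to it, as with the temperature column in the tutorial code).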

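【Comments】:

Thanks for your contribution :) It is definitely too late for me, and I don't know when I will next use PyTables. I might check and verify your suggestion. If anyone reads this and finds this answer better than what I came up with, please leave a comment and I will mark it as the correct answer.

【Answer 2】: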

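EDIT: I just saw that tables.Col.from_type(type, shape) allows using the type including its precision (float32 instead of just float). The rest stays the same (it takes a string and a shape).

The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col type that supports ndarrays. What a "kind" is and how to use it is not documented anywhere I could find; but by trial and error I found that the allowed "kinds" are strings naming the basic data types, i.e. 'float', 'uint', ... without the precision (not 'float64').

Since I get numpy.dtypes from reading the datasets with h5py (dset.dtype), these have to be converted to str and the precision has to be removed. In the end, the relevant lines look like this: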
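A minimal sketch of the from_type route, assuming a float32 source dtype and a hypothetical element shape of (8, 8, 4). The second call is not from this answer: it passes a NumPy sub-array dtype to Col.from_dtype, which, as far as I can tell from the PyTables docs for Atom.from_dtype, carries the shape along as well.

```python
import numpy as np
import tables

# Hypothetical per-element shape; the question uses (1024, 1024, 4).
element_shape = (8, 8, 4)
dtype = np.dtype("float32")  # e.g. what h5py reports as dset.dtype

# from_type takes the type name *with* precision, plus the cell shape.
col_from_type = tables.Col.from_type(dtype.name, shape=element_shape)

# Assumed alternative: a sub-array dtype bundles base type and shape together.
col_from_dtype = tables.Col.from_dtype(np.dtype((dtype, element_shape)))

print(col_from_type)
print(col_from_dtype)
```

Both calls should yield a Float32Col whose cells hold arrays of the given shape, so no string manipulation of the dtype is needed.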

# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)

# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing 
# the dtype w/o precision, e.g. 'float' or 'uint' 
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]

# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
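One caveat with stripping the digits: the precision is lost, so from_kind falls back to the default size for that kind. The sketch below illustrates this for a float32 dtype and shows one way to keep the precision by passing the itemsize parameter of from_kind. The element shape (8, 8, 4) is a hypothetical stand-in.

```python
import numpy as np
import tables

dtype = np.dtype("float32")   # e.g. what h5py reports as dset.dtype
element_shape = (8, 8, 4)     # hypothetical per-element shape

# Same digit-stripping as above: 'float32' becomes 'float'.
kind = "".join(ch for ch in str(dtype) if not ch.isdigit())

# Without itemsize, the default precision for the kind is used.
default_col = tables.Col.from_kind(kind, shape=element_shape)

# Passing the itemsize in bytes keeps the original precision.
sized_col = tables.Col.from_kind(kind, itemsize=dtype.itemsize, shape=element_shape)

print(default_col.type, sized_col.type)
```

If the stored data should stay float32 (for file size or for consistency with the source files), either itemsize or the from_type variant from the edit note is the safer choice.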

Writing data to the table then finally works as intended:

sample = table.row
sample[key] = my_array
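Put together, the merge could look roughly like the following sketch. The in-memory `batch` dict is a hypothetical stand-in for the datasets read from the source files, the shapes are shrunk to (8, 8, ...) for speed, and the output file name is made up; the group and table names follow the question.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical stand-ins for the datasets read from the source files.
n_elements = 5
batch = {
    "data": np.random.rand(n_elements, 8, 8, 4).astype(np.float32),
    "label": np.random.randint(0, 2, (n_elements, 8, 8, 1)).astype(np.uint8),
    "weights": np.ones((n_elements, 8, 8, 1), dtype=np.float32),
}

# One column per dataset; the per-element shape drops the leading axis.
description = {
    key: tables.Col.from_type(arr.dtype.name, shape=arr.shape[1:])
    for key, arr in batch.items()
}

path = os.path.join(tempfile.mkdtemp(), "merged.h5")  # hypothetical file name
with tables.open_file(path, mode="w", title="merged") as h5file:
    group = h5file.create_group("/", "main", "Node for data table")
    table = h5file.create_table(group, "data_table", description)

    # One table row per element, each cell holding one array.
    sample = table.row
    for i in range(n_elements):
        for key, arr in batch.items():
            sample[key] = arr[i]
        sample.append()
    table.flush()

    n_rows = table.nrows
    third_data = table[2]["data"].copy()  # random access to a single row

print(n_rows, third_data.shape)
```

Indexing the table by row number is what gives the unordered access the question asks for.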

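Since all of this feels a bit "hacky" and is not well documented, I still wonder whether this is really the intended use of PyTables, and I will leave the question open to see if someone knows more...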
