Suppress the numpy array creation protocol for numpy arrays of objects: answers

【Question title】: Suppress numpy array creation protocol for numpy arrays of objects
【Posted】: 2016-10-17 02:14:47
【Question】:

I'm trying to build a library that reads complex HDF5 data files in Python.

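I'm running into a problem where an HDF5 Dataset (sometimes) implements the default array protocol, so that when a numpy array is created from it, it is cast to a particular array type.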

In [8]: ds
Out[8]: <HDF5 dataset "two_by_zero_empty_matrix": shape (2,), type "<u8">

In [9]: ds.value
Out[9]: array([2, 0], dtype=uint64)

This Dataset object implements the numpy array protocol, and when the dataset consists of numbers it supplies a default array type.

In [10]: np.array(ds)
Out[10]: array([2, 0], dtype=uint64)
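
The conversion above is driven by the `__array__` hook: when `np.array` receives an object that defines it, NumPy calls that method and uses the array it returns. A minimal sketch of the mechanism follows. It uses a hypothetical `FakeDataset` stand-in and not a real h5py Dataset, so its name and internals are assumptions made purely for illustration.

```python
import numpy as np


class FakeDataset:
    """Hypothetical stand-in for a dataset-like object (not part of h5py)."""

    def __init__(self, values, dtype):
        self._data = np.array(values, dtype=dtype)
        self.calls = 0  # how many times NumPy asked for an array

    def __array__(self, dtype=None, copy=None):
        # NumPy invokes this hook when the object is passed to np.array().
        self.calls += 1
        return np.array(self._data, dtype=dtype)  # always a fresh copy

    def __repr__(self):
        return f"<FakeDataset shape {self._data.shape}, type {self._data.dtype}>"


ds = FakeDataset([2, 0], np.uint64)
arr = np.array(ds)  # goes through FakeDataset.__array__
print(repr(ds))
print(arr.dtype, arr)
```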

However, if the dataset consists not of numbers but of some other kind of object, then, as you would expect, it just produces a numpy array with dtype=object:

In [43]: ds2
Out[43]: <HDF5 dataset "somecells": shape (2, 3), type "|O8">

In [44]: np.array(ds2)
Out[44]: 
array([[<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>],
       [<HDF5 object reference>, <HDF5 object reference>,
        <HDF5 object reference>]], dtype=object)

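This behavior might seem convenient, but in my case it is actually inconvenient, because it interferes with my recursive traversal of the data file. Working around it is proving difficult, since there are many possible data types whose special cases differ slightly depending on whether they are children of objects or arrays of numbers.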

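My question is: is there a way to suppress the default array creation protocol, so that I can create an object array out of Dataset objects that would otherwise be cast to their natural duck types?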

That is, I want something like np.array(ds, dtype=object), which would produce an array of [<Dataset object of type int>, dtype=object], and not an array of [3 4 5, dtype=int].

But np.array(ds, dtype=np.object) throws IOError: Can't read data (No appropriate function for conversion path).

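I tried in earnest to find documentation on how the numpy array protocol works, and found plenty, but it doesn't appear that anyone has considered that someone might want this behavior.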

Tags: python numpy h5py

【Answer 1】:

I can understand where Out[44] comes from. It is an array containing pointers to objects, in this case h5py references to objects in the file (I think).

With np.array(ds, dtype=object), are you trying to create something more like that, instead of the "normal" array that np.array(ds) gives you, i.e. array([2, 0], dtype=uint64)?

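But what would the parallel object array be? A single-element array with a pointer to ds? Or a 2-element array with pointers to the 2 and the 0 somewhere in the file? And what if it isn't an <HDF5 object reference>?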

In plain numpy, without any h5py involved, I can create an object array from a list of values:

In [104]: np.array([2,0], dtype=object)
Out[104]: array([2, 0], dtype=object)

Or I can start with an empty array (filled with None) and assign values:

In [105]: x=np.empty((2,), dtype=object)
In [106]: x[0]=2
In [107]: x[1]=0
In [108]: x
Out[108]: array([2, 0], dtype=object)

I think you could try:

x[0] = ds[0]
or
x[:] = ds[:]

or make a single-element object array:

x = np.empty((), dtype=object)
x[()] = ds

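I don't have an h5py test file open in my IPython session to test this. But I can do odd things like make an object array that contains itself. I can work with it, but I can't display it without a recursion error.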

In [118]: x=np.empty((),dtype=object)
In [119]: x[()]=x
In [120]: x1=x[()]
In [121]: x1==x
Out[121]: True

I had a small h5py file open in another terminal:

In [315]: list(f.keys())
Out[315]: ['d', 'x', 'y']
In [317]: f['d']    # the group
Out[317]: <HDF5 group "/d" (2 members)>

x is a string:

In [318]: f['x']    # a single element (a string)
Out[318]: <HDF5 dataset "x": shape (), type "|O4">
In [330]: f['x'].value
Out[330]: 'astring'
In [331]: np.array(f['x'])
Out[331]: array('astring', dtype=object)

y is an array:

In [320]: f['y'][:]
Out[320]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [321]: f['y'].value
Out[321]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [322]: np.array(f['y'])
Out[322]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [323]: timeit np.array(f['y'])
1000 loops, best of 3: 364 µs per loop
In [324]: timeit f['y'].value
1000 loops, best of 3: 380 µs per loop

So accessing the data through value and through np.array is equivalent.

Accessing it as an object array gives the same kind of error that you got:

In [325]: np.array(f['y'],dtype=object)
...
OSError: can't read data (Dataset: Read failed)

Converting to float works fine:

In [326]: np.array(f['y'],dtype=float)
Out[326]: array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
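
The `dtype` argument of `np.array` is handed on to the object's `__array__` method, so a conversion such as `dtype=float` is carried out by the dataset-like object itself. The sketch below records what a hypothetical `RecordingDataset` (an illustrative stand-in, not an h5py class) is asked for. If h5py forwards a requested `object` dtype to the HDF5 read in the same way, that would account for the read error above, but that is an assumption and is not verified here.

```python
import numpy as np


class RecordingDataset:
    """Hypothetical stand-in that logs the dtype NumPy requests."""

    def __init__(self, values):
        self._data = np.array(values, dtype=np.int32)
        self.requested = []  # dtypes seen by __array__

    def __array__(self, dtype=None, copy=None):
        self.requested.append(dtype)
        # Convert on the way out, as a real dataset would have to.
        return np.array(self._data, dtype=dtype)


y = RecordingDataset(range(10))
as_float = np.array(y, dtype=float)
print(as_float.dtype)
print(y.requested)
```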

And assigning to a pre-allocated object array works:

In [327]: x=np.empty((),dtype=object)
In [328]: x[()]=f['y']
In [329]: x
Out[329]: array(<HDF5 dataset "y": shape (10,), type "<i4">, dtype=object)
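
Assigning through `x[()]` stores a reference to the object itself, so the array protocol is not consulted. A small sketch of that idea follows. The `Loud` class and the `box` helper are hypothetical names introduced here; `Loud.__array__` raises on purpose so that any accidental conversion would be obvious.

```python
import numpy as np


class Loud:
    """Hypothetical array-like whose conversion always fails."""

    def __array__(self, dtype=None, copy=None):
        raise OSError("conversion attempted")


def box(obj):
    """Wrap obj, unconverted, in a 0-d object array (illustrative helper)."""
    cell = np.empty((), dtype=object)
    cell[()] = obj  # single-element assignment keeps the object as-is
    return cell


item = Loud()
boxed = box(item)
print(boxed.shape, boxed.dtype)
print(boxed[()] is item)
```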

Trying to create a 10-element array to hold y:

In [332]: y1=np.empty((10,),dtype=object)
In [333]: y1[:]=f['y']
...
OSError: can't read data (Dataset: Read failed)
In [334]: y1[:]=f['y'].value
In [335]: y1
Out[335]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)

y1[:]=f['y'][:] also works.

I can't assign the dataset to y1 (same error as when I tried np.array(f['y'],dtype=object)), but I can assign its values. I can even assign the dataset to a single element of y1:

In [338]: y1[-1]=f['y']
In [339]: y1
Out[339]: 
array([0, 1, 2, 3, 4, 5, 6, 7, 8,
       <HDF5 dataset "y": shape (10,), type "<i4">], dtype=object)
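
Element-wise assignment therefore offers a way to build the kind of object array the question asks for: allocate it with `dtype=object` first, then fill it one integer index at a time so that each array-like item is stored as a reference. A hedged sketch follows. `pack_objects` and `Node` are hypothetical names introduced here, and `Node` merely imitates an object that supports `__array__`.

```python
import numpy as np


class Node:
    """Hypothetical array-like; counts how often NumPy converts it."""

    def __init__(self, values):
        self.values = list(values)
        self.conversions = 0

    def __array__(self, dtype=None, copy=None):
        self.conversions += 1
        return np.array(self.values, dtype=dtype)


def pack_objects(items):
    """Return a 1-D object array holding the items themselves."""
    items = list(items)
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item  # integer index: stores the reference, no conversion
    return out


nodes = [Node([3, 4, 5]), Node([1.5]), Node([])]
packed = pack_objects(nodes)
print(packed.shape, packed.dtype)
print(all(p is n for p, n in zip(packed, nodes)))
print([n.conversions for n in nodes])
```

With real h5py objects the same loop could presumably be fed the datasets of a group, although that has not been tested here.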

I keep coming back to the basic idea that an object array is just a collection of pointers, essentially a list in an array wrapper.
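
That pointer view can be checked directly in plain numpy: an object array and a list built from the same items refer to the very same Python objects, so a change made through one is visible through the other. A minimal illustration, with variable names chosen only for this example:

```python
import numpy as np

payload = {"name": "y"}  # any mutable Python object
as_list = [payload, 2, "text"]

as_array = np.empty(len(as_list), dtype=object)
as_array[:] = as_list  # copies references, not the objects

# Both containers point at the same dict.
print(as_array[0] is as_list[0])

# Mutating through the array shows up in the list.
as_array[0]["name"] = "renamed"
print(as_list[0])

# tolist() hands the same references back in a plain list.
round_trip = as_array.tolist()
print(round_trip[0] is payload)
```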
