【Question title】: Extending numpy mask
【Posted】: 2016-02-17 16:20:15
【Question】:

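I want to mask a numpy array a with mask. The mask does not have exactly the same shape as a, but it is possible to mask a anyway (I guess because the extra dimension has length 1 (broadcasting?)).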

a.shape
>>> (3, 9, 31, 2, 1)
mask.shape
>>> (3, 9, 31, 2)
masked_a = ma.masked_array(a, mask)

However, the same logic does not work for array b, whose last dimension has 5 elements.

ext_mask = mask[..., np.newaxis] # extending or not extending has same effect
ext_mask.shape
>>> (3, 9, 31, 2, 1)

b.shape
>>> (3, 9, 31, 2, 5)
masked_b = ma.masked_array(b, ext_mask)
>>> numpy.ma.core.MaskError: Mask and data not compatible: data size is 8370, mask size is 1674.

How can I create a (3, 9, 31, 2, 5) mask from the (3, 9, 31, 2) mask by expanding each True value in the last dimension of the (3, 9, 31, 2) mask to [True, True, True, True, True] (and likewise for False)?

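【Comments】:

  • This works: masked_b = ma.masked_array(*np.broadcast(b, ext_mask)), but I don't know why ma.masked_array does not broadcast automatically. Edit: maybe because it only wants to store views into two equally sized arrays, for efficiency?
  • That gives TypeError: __new__() takes at most 11 arguments (8371 given)
  • Ugh, sorry, my mistake! broadcast is the wrong function. You need to use broadcast_arrays.
  • The docs say broadcast_arrays returns views into the original arrays, which means no allocation is performed.
  • Yes, I'll write an answer, but first I want to look into the topic a bit more :)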

Tags: python arrays numpy


【Solution 1】:

This gives the expected result:

masked_b = ma.masked_array(*np.broadcast_arrays(b, ext_mask))
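
For readers who want to try this end to end, here is a minimal self-contained sketch. The random data and the seed are assumptions made for illustration; only the shapes come from the question. The error message in the question compares sizes rather than shapes, which suggests that masked_array does not broadcast the mask on its own, so the broadcasting is done explicitly with np.broadcast_arrays.

```python
import numpy as np
from numpy import ma

# Illustrative data: the values and the seed are arbitrary assumptions,
# only the shapes are taken from the question.
rng = np.random.default_rng(0)
b = rng.standard_normal((3, 9, 31, 2, 5))
mask = rng.standard_normal((3, 9, 31, 2)) > 0

ext_mask = mask[..., np.newaxis]            # shape (3, 9, 31, 2, 1)

# Broadcast data and mask against each other; both results have b's shape.
bb, mb = np.broadcast_arrays(b, ext_mask)
masked_b = ma.masked_array(bb, mask=mb)

full_mask = ma.getmaskarray(masked_b)
print(full_mask.shape)                      # (3, 9, 31, 2, 5)

# Each flag of the original mask is repeated 5 times along the last axis.
print(np.array_equal(full_mask, np.repeat(ext_mask, 5, axis=-1)))  # True
```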

I haven't profiled this method, but it should be faster than allocating a new mask. According to the documentation, no data is copied:

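These arrays are views on the original arrays. They are typically not contiguous. Furthermore, more than one element of a broadcasted array may refer to a single memory location. If you need to write to the arrays, make copies first.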

The no-copy behavior can be verified:

bb, mb = np.broadcast_arrays(b, ext_mask)
print(mb.shape)       # (3, 9, 31, 2, 5) - same shape as b
print(mb.base.shape)  # (3, 9, 31, 2) - the shape of the original mask
print(mb.strides)     # (558, 62, 2, 1, 0) - that's how it works: 0 stride

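It is impressive how the numpy developers implemented broadcasting. The values are repeated by using a stride of 0 along the last dimension. Wow!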
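
The zero-stride trick can be seen on a toy example. This is only a sketch: the 2x2 mask is made up for illustration, and it uses np.broadcast_to (a related NumPy function that returns a read-only broadcast view) instead of np.broadcast_arrays.

```python
import numpy as np

# Toy mask, invented for illustration.
mask = np.array([[True, False], [False, True]])   # shape (2, 2)
ext_mask = mask[..., np.newaxis]                  # shape (2, 2, 1)

# Read-only view with the target shape; no mask data is copied.
view = np.broadcast_to(ext_mask, (2, 2, 5))
print(view.strides[-1])                # 0: stepping along the last axis stays in place
print(np.shares_memory(view, mask))    # True: still backed by the original mask
print(view.flags.writeable)            # False: the view cannot be written to

# An explicit copy allocates all 2*2*5 elements and can be modified safely.
owned = view.copy()
print(np.shares_memory(owned, mask))   # False
```

If the mask has to be modified later, copying first (as the quoted documentation advises) avoids touching several logical elements through one memory location.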

Edit

I compared the speed of broadcasting and allocating with this code:

import numpy as np
from numpy import ma

a = np.random.randn(30, 90, 31, 2, 1)
b = np.random.randn(30, 90, 31, 2, 5)

mask = np.random.randn(30, 90, 31, 2) > 0
ext_mask = mask[..., np.newaxis]

def broadcasting(a=a, b=b, ext_mask=ext_mask):
    mb1 = ma.masked_array(*np.broadcast_arrays(b, ext_mask))

def allocating(a=a, b=b, ext_mask=ext_mask):
    m2 = np.empty(b.shape, dtype=bool)
    m2[:] = ext_mask
    mb2 = ma.masked_array(b, m2)

Broadcasting is clearly faster than allocating here:

    # array size: (30, 90, 31, 2, 5)

In [23]: %timeit broadcasting()
The slowest run took 10.39 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39.4 µs per loop

In [24]: %timeit allocating()
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 982 µs per loop

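Note that I had to increase the array size for the speed difference to become apparent. With the original array dimensions, allocating is slightly faster than broadcasting: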

    # array size: (3, 9, 31, 2, 5)

In [28]: %timeit broadcasting()
The slowest run took 9.36 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39 µs per loop

In [29]: %timeit allocating()
The slowest run took 9.22 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.6 µs per loop

The runtime of the broadcasting solution does not seem to depend on the array size.
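
The comparison can also be repeated outside IPython with the standard timeit module. The sketch below is an assumption-laden variant of the benchmark above: the two functions take their arrays as parameters and return the result, the repetition count of 200 is arbitrary, and it also checks that both approaches produce the same mask. Absolute timings will vary by machine and NumPy version, so none are shown.

```python
import timeit

import numpy as np
from numpy import ma

def broadcasting(b, ext_mask):
    # Build the masked array from zero-copy broadcast views.
    return ma.masked_array(*np.broadcast_arrays(b, ext_mask))

def allocating(b, ext_mask):
    # Allocate a full-size boolean mask and fill it by broadcasting assignment.
    m2 = np.empty(b.shape, dtype=bool)
    m2[:] = ext_mask
    return ma.masked_array(b, m2)

results = []
for lead in [(3, 9), (30, 90)]:          # the two sizes used in the answer
    b = np.random.randn(*lead, 31, 2, 5)
    ext_mask = (np.random.randn(*lead, 31, 2) > 0)[..., np.newaxis]

    # Sanity check: both approaches must yield identical masks.
    same = np.array_equal(ma.getmaskarray(broadcasting(b, ext_mask)),
                          ma.getmaskarray(allocating(b, ext_mask)))

    # Total seconds for 200 calls each (repetition count chosen arbitrarily).
    t_b = timeit.timeit(lambda: broadcasting(b, ext_mask), number=200)
    t_a = timeit.timeit(lambda: allocating(b, ext_mask), number=200)
    results.append((b.shape, same, t_b, t_a))
    print(b.shape, "same mask:", same, "broadcasting:", t_b, "allocating:", t_a)
```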

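【Comments】:

  • Which array sizes did the two tests use?
  • (30, 90, 31, 2, x) and (3, 9, 31, 2, x)
  • Interesting. Broadcasting seems to depend very little on the array size (if at all). Definitely the better choice - thanks again.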