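【Question title】: Replace zeros in an array with a continuous sequence of integers
【Posted】: 2019-09-17 00:56:09
【Question】:

I have an array containing NaN values or zeros, as shown below. I want to iterate over the array and replace each 0 with an integer in increasing order, i.e. the first zero becomes 1, the next zero becomes 2, then 3, and so on.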

Input:

arrayOfZeros = 

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 0., nan, nan, nan, nan],
       [ 0., nan,  0., nan,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [nan,  0.,  0.,  0.,  0.],
       [nan,  0., nan, nan, nan],
       [nan, nan,  0., nan, nan],
       [ 0., nan,  0., nan,  0.],
       [ 0., nan,  0., nan,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [nan, nan,  0.,  0.,  0.],
       [nan, nan, nan, nan,  0.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Desired output:

array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [ 1., nan, nan, nan, nan],
       [ 2., nan, 19., nan, 39.],
       [ 3., 11., 20., 31., 40.],
       [ 4., 12., 21., 32., 41.],
       [nan, 13., 22., 33., 42.],
       [nan, 14., nan, nan, nan],
       [nan, nan, 23., nan, nan],
       [ 5., nan, 24., nan, 43.],
       [ 6., nan, 25., nan, 44.],
       [ 7., 15., 26., 34., 45.],
       [ 8., 16., 27., 35., 46.],
       [ 9., 17., 28., 36., 47.],
       [10., 18., 29., 37., 48.],
       [nan, nan, 30., 38., 49.],
       [nan, nan, nan, nan, 50.],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])

Currently, I can almost do what I want with the following code:

    with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y:
        preference = 1
        for x in y:
            if x == 0:
                x[...] = preference
                preference += 1

However, if I run this code outside of the Python console, I get the following error message:

TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled

Is there another way to achieve this in NumPy?

【Question comments】:

    Tags: python arrays python-3.x pandas numpy


    【Solution 1】:

    Why does everyone insist on using cumsum here? It is wasteful. Better:

    out = arrayOfZeros.copy()
    z = out==out
    out.T[z.T] = np.arange(1,1+np.count_nonzero(z))
    

    Timings (µs per call, from the benchmark code below):

    5.025142431259155   # PP
    38.67108239792287   # cumsum 1   rafaelc
    9.263199986889958   # cumsum 2   Derek Eden
    9.044178808107972   # cumsum 3   Onyambu
    10.640528565272689  # cumsum 4   Andy L.
    

    Code:

    import numpy as np
    
    array,nan = np.array,np.nan
    
    x = \
    array([[nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [ 0., nan, nan, nan, nan],
           [ 0., nan,  0., nan,  0.],
           [ 0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.],
           [nan,  0.,  0.,  0.,  0.],
           [nan,  0., nan, nan, nan],
           [nan, nan,  0., nan, nan],
           [ 0., nan,  0., nan,  0.],
           [ 0., nan,  0., nan,  0.],
           [ 0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.],
           [nan, nan,  0.,  0.,  0.],
           [nan, nan, nan, nan,  0.],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan]])
    
    from timeit import timeit
    
    def f_pp():
        out = x.copy()
        z = out==out
        out.T[z.T] = np.arange(1,1+np.count_nonzero(z))
        return out
    
    def f_cumsum():
        arr = x.copy()
        mask = ~np.isnan(arr)
        arr[mask] = np.nan_to_num(arr + 1).ravel('F').cumsum().reshape(arr.shape, order='F')[mask]
        return arr
    
    def f_cumsum_2():
        arr = x.copy()
        in_arr = arr.T
        fill = (in_arr==0).cumsum().reshape(in_arr.shape)
        return (in_arr + fill).T
    
    def f_cumsum_3():
        arrayOfZeros = x.copy()
        mask = arrayOfZeros==0
        arrayOfZeros.T[mask.T] = mask.T.cumsum()[mask.T.flatten()]
        return arrayOfZeros
    
    def f_cumsum_4():
        arrayOfZeros = x.copy()
        m = (arrayOfZeros == 0)
        a = (arrayOfZeros.T == 0).cumsum().reshape(-1, arrayOfZeros.shape[0]).T
        arrayOfZeros[m] = a[m]
        return arrayOfZeros
    
    assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum())).all()
    assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum_2())).all()
    assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum_3())).all()
    assert(np.nan_to_num(f_pp()) == np.nan_to_num(f_cumsum_4())).all()
    
    for f in (f_pp,f_cumsum,f_cumsum_2,f_cumsum_3,f_cumsum_4):
        print(timeit(f,number=10000)*100)
    

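    【Comments】:

    • I was starting to think everyone reaching for cumsum had gone mad. Cute trick with out == out, which exploits np.nan != np.nan.
    • @DanielF Interestingly, it is faster than simply comparing with zero.
    • @Paul Panzer Could you elaborate on what out == out does here? Why is it necessary?
    • @CharlesHerbertChadwellV It is not necessary. As Daniel says, it exploits a special property of nan: nan != nan. Since we expect only 0s or nans here, out == 0, np.isfinite(out), ~np.isnan(out), and out == out are all equivalent; out == out just happens to be the fastest option here.
    • Ah, I was not aware of that peculiarity; I was wondering why out == out returns the same output as ~np.isnan(out).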
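To make the comment thread concrete, here is a minimal sketch of the mask equivalence on a small stand-in array (the 3x2 `out` array is an illustrative assumption, not the question's data). It only holds because the data contains nothing but zeros and NaNs.

```python
import numpy as np

# Small stand-in for the question's array: only zeros and NaNs.
out = np.array([[np.nan, 0.0],
                [0.0, np.nan],
                [0.0, 0.0]])

# NaN compares unequal to everything, including itself,
# so out == out is False exactly where out is NaN.
self_equal = out == out

# For data holding only 0s and NaNs, these masks all coincide.
assert np.array_equal(self_equal, ~np.isnan(out))
assert np.array_equal(self_equal, np.isfinite(out))
assert np.array_equal(self_equal, out == 0)

# Number the selected cells column by column, as in the answer above.
numbered = out.copy()
numbered.T[self_equal.T] = np.arange(1, 1 + np.count_nonzero(self_equal))
print(numbered)
```

If the array could also hold other finite values, the masks would no longer agree, and `out == 0` would be the one that matches the question's intent.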
    【Solution 2】:

    Use broadcasting. Save a mask with isnan, then use ravel() with 'F' ordering plus cumsum for a vectorized sum.

    mask = ~np.isnan(arr)
    arr[mask] = np.nan_to_num(arr + 1).ravel('F').cumsum().reshape(arr.shape, order='F')[mask]
    

    Since you tagged pandas: if you have a df, you can call cumsum directly, because it skips NaN.

    pd.DataFrame(arr.ravel('F')).add(1).cumsum().to_numpy().reshape(arr.shape, order='F')
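
The pandas one-liner above can be unpacked step by step. This is a minimal sketch on a small stand-in array (the 3x2 `arr` is an assumption for illustration), using a Series rather than a one-column DataFrame to keep the shapes simple.

```python
import numpy as np
import pandas as pd

arr = np.array([[np.nan, 0.0],
                [0.0, np.nan],
                [0.0, 0.0]])

# Flatten column by column, turn each 0 into 1, and let cumsum count them.
# pandas' cumsum uses skipna=True by default, so NaN entries stay NaN
# and do not disturb the running total.
flat = pd.Series(arr.ravel('F'))
counted = flat.add(1).cumsum()

# Undo the column-major flattening to get back the original shape.
result = counted.to_numpy().reshape(arr.shape, order='F')
print(result)
```

Unlike the NumPy version, no explicit mask is needed here, because the NaN positions survive the cumulative sum untouched.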
    

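    【Comments】:

    • This does exactly what I wanted! Thanks for the information. I am new to NumPy, and broadcasting is not something I have had to use before, so I have some homework to do.
    【Solution 3】:

    You can also do it this way: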

    arr #just for example
    
    array([[ 0., nan,  0., nan, nan,  0.,  0.],
           [ 0.,  0.,  0., nan, nan, nan,  0.]])
    
    in_arr = arr.T
    fill = (in_arr==0).cumsum().reshape(in_arr.shape)
    out_arr = (in_arr + fill).T
    

    Output:

    array([[ 1., nan,  4., nan, nan,  6.,  7.],
           [ 2.,  3.,  5., nan, nan, nan,  8.]])
    

    【Comments】:

    • Close, but the cumsum should run from top to bottom, not to the right.
    【Solution 4】:
    mask = arrayOfZeros==0
    arrayOfZeros.T[mask.T] = mask.T.cumsum()[mask.T.flatten()]
    
    array([[nan, nan, nan, nan, nan],
           [nan, nan, nan, nan, nan],
           [ 1., nan, nan, nan, nan],
           [ 2., nan, 19., nan, 39.],
           [ 3., 11., 20., 31., 40.],
           [ 4., 12., 21., 32., 41.],
           [nan, 13., 22., 33., 42.],
           [nan, 14., nan, nan, nan],
           [nan, nan, 23., nan, nan],.....
    

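    【Comments】:

      【Solution 5】:

      Create a mask m that is True where the array is 0. Use transpose, cumsum, and reshape to build an array of running counts over the zeros. Finally, assign through the mask m.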

      m = (arrayOfZeros == 0)
      a = (arrayOfZeros.T == 0).cumsum().reshape(-1, arrayOfZeros.shape[0]).T
      arrayOfZeros[m] = a[m]
      
      Out[353]:
      array([[nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [ 1., nan, nan, nan, nan],
             [ 2., nan, 19., nan, 39.],
             [ 3., 11., 20., 31., 40.],
             [ 4., 12., 21., 32., 41.],
             [nan, 13., 22., 33., 42.],
             [nan, 14., nan, nan, nan],
             [nan, nan, 23., nan, nan],
             [ 5., nan, 24., nan, 43.],
             [ 6., nan, 25., nan, 44.],
             [ 7., 15., 26., 34., 45.],
             [ 8., 16., 27., 35., 46.],
             [ 9., 17., 28., 36., 47.],
             [10., 18., 29., 37., 48.],
             [nan, nan, 30., 38., 49.],
             [nan, nan, nan, nan, 50.],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan],
             [nan, nan, nan, nan, nan]])
      

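      【Comments】:

        【Solution 6】:

        Why are you using nditer? You basically got it working, which is not a trivial task. But somehow you missed the message that it is not a speed tool, at least not when used from Python code. Plain iteration is usually just as good, unless you are doing some fancy broadcasting. But as the other answers show, non-iterative approaches are better.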
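
For comparison, here is a minimal sketch of what "plain iteration" could look like for this task. The helper name `number_zeros_by_column` and the small sample array are assumptions for illustration; it is not meant to be fast, only to show that nditer is not required.

```python
import numpy as np

def number_zeros_by_column(a):
    """Return a copy of `a` with zeros replaced by 1, 2, 3, ... column by column."""
    out = a.copy()
    counter = 1
    n_rows, n_cols = out.shape
    for col in range(n_cols):          # outer loop over columns
        for row in range(n_rows):      # inner loop runs down each column
            if out[row, col] == 0:     # NaN == 0 is False, so NaNs are skipped
                out[row, col] = counter
                counter += 1
    return out

sample = np.array([[np.nan, 0.0],
                   [0.0, np.nan],
                   [0.0, 0.0]])
print(number_zeros_by_column(sample))
```

Swapping the two loops would number the zeros row by row instead.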

        But let's focus on nditer.

        https://numpy.org/devdocs/reference/arrays.nditer.html

        Recreating your array:

        In [1]: nan=np.nan                                                                     
        In [2]: arr = np.array([[nan, nan, nan, nan, nan], 
           ...:        [nan, nan, nan, nan, nan], 
           ...:        [ 0., nan, nan, nan, nan], 
           ...:        [ 0., nan,  0., nan,  0.], 
           ...:        [ 0.,  0.,  0.,  0.,  0.], 
           ...:        [ 0.,  0.,  0.,  0.,  0.], 
           ...:        [nan,  0.,  0.,  0.,  0.], 
           ...:        [nan,  0., nan, nan, nan], 
        ...
        
        In [3]: arrayOfZeros = arr.copy()                                                      
        In [4]: arr.dtype                                                                      
        Out[4]: dtype('float64')
        In [5]: with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y: 
           ...:         preference = 1 
           ...:         for x in y: 
           ...:             if x == 0: 
           ...:                 x[...] = preference 
           ...:                 preference += 1 
           ...:                                                                                
        In [6]: arrayOfZeros                                                                   
        Out[6]: 
        array([[nan, nan, nan, nan, nan],
               [nan, nan, nan, nan, nan],
               [ 1., nan, nan, nan, nan],
               [ 2., nan,  3., nan,  4.],
               [ 5.,  6.,  7.,  8.,  9.],
               [10., 11., 12., 13., 14.],
               [nan, 15., 16., 17., 18.],
               [nan, 19., nan, nan, nan],
        ...
        

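        OK, it works, but the layout of the consecutive numbers does not match your display. Your display is forcing all the other answers into contortions with transposes.

        If I change the array's dtype to object, I get your error: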

        In [7]: arrayOfZeros = arr.astype(object)                                              
        In [8]: with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y: 
           ...:         preference = 1 
           ...:         for x in y: 
           ...:             if x == 0: 
           ...:                 x[...] = preference 
           ...:                 preference += 1 
           ...:                                                                                
        ---------------------------------------------------------------------------
        TypeError                                 Traceback (most recent call last)
        <ipython-input-8-7dd225a24a36> in <module>
        ----> 1 with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y:
              2         preference = 1
              3         for x in y:
              4             if x == 0:
              5                 x[...] = preference
        
        TypeError: Iterator operand or requested dtype holds references, but the REFS_OK flag was not enabled
        

        Applying the suggested fix: https://docs.scipy.org/doc/numpy/reference/generated/numpy.nditer.html

        In [10]: with np.nditer(arrayOfZeros, flags=['refs_ok'], op_flags=['readwrite']) as y: 
            ...:         preference = 1 
            ...:         for x in y: 
            ...:             if x == 0: 
            ...:                 x[...] = preference 
            ...:                 preference += 1 
            ...:                                                                               
        In [11]: arrayOfZeros                                                                  
        Out[11]: 
        array([[nan, nan, nan, nan, nan],
               [nan, nan, nan, nan, nan],
               [1, nan, nan, nan, nan],
               [2, nan, 3, nan, 4],
               [5, 6, 7, 8, 9],
               [10, 11, 12, 13, 14],
               [nan, 15, 16, 17, 18],
               [nan, 19, nan, nan, nan],
        

        Because of the object dtype, it does not display in neat columns.

        If I change the array to order='F', we get consecutive numbers running down the columns:

        In [12]: arrayOfZeros = arr.copy(order='F') 
        In [14]: with np.nditer(arrayOfZeros, op_flags=['readwrite']) as y: 
            ...:                                                                               
        In [15]: arrayOfZeros                                                                  
        Out[15]: 
        array([[nan, nan, nan, nan, nan],
               [nan, nan, nan, nan, nan],
               [ 1., nan, nan, nan, nan],
               [ 2., nan, 19., nan, 39.],
               [ 3., 11., 20., 31., 40.],
               [ 4., 12., 21., 32., 41.],
               [nan, 13., 22., 33., 42.],
               [nan, 14., nan, nan, nan],
        ....
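
As an aside to the order='F' copy above, np.nditer also accepts an `order` argument of its own, so the column-major walk can be requested without changing the array's memory layout. This is a minimal sketch on a small stand-in array (the 3x2 `arr` is an assumption for illustration), not a claim about what the question's author should have written.

```python
import numpy as np

arr = np.array([[np.nan, 0.0],
                [0.0, np.nan],
                [0.0, 0.0]])   # default C (row-major) layout

# order='F' asks nditer to visit elements in column-major order,
# regardless of how the array is laid out in memory.
with np.nditer(arr, op_flags=['readwrite'], order='F') as it:
    preference = 1
    for x in it:
        if x == 0:
            x[...] = preference
            preference += 1

print(arr)
```

With the default order='K', the iterator instead follows the memory layout, which is why the C-ordered array earlier in this answer was numbered row by row.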
        

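        The order 'F' and the object dtype make me wonder: is the source of this array a pandas DataFrame?

        【Comments】:

        • As for why I used nditer: I am fairly new to NumPy (and to programming in general), and my rough understanding of the module led me to that code after some digging. I suspected it was not the fastest approach. As for your last question: yes, the source is a pandas DataFrame. Why does that lead to this error message? Thanks for the detailed reply!
        • The default numpy array constructor uses order C (row-major) and a numeric dtype. But I have seen in other SO questions that pandas readily produces object dtype Series (for example, anything containing strings). A DataFrame, being built from Series, is column-oriented by default, i.e. 'F'.
        • I linked the two main pages that discuss nditer, the formal documentation page and the tutorial. Both need a better disclaimer about speed. The tutorial page ends with a cython example that makes full use of its C-API speed and flexibility. But when used in pure Python code it brings no speed-up, and in most cases it is also hard to use. Only a few high-level numpy functions use it (e.g. np.ndindex).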
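
The dtype point in these comments can be checked directly. Below is a minimal sketch with a hypothetical two-column frame `df` built with object dtype (an assumption standing in for the asker's real DataFrame); it shows the error appearing for the object array and disappearing once a float dtype is requested during conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with object dtype columns, as can happen when the
# data was mixed with strings at some point upstream.
df = pd.DataFrame({'a': [np.nan, 0, 0], 'b': [0, np.nan, 0]}, dtype=object)

as_object = df.to_numpy()
print(as_object.dtype)     # object

# nditer refuses object arrays unless flags=['refs_ok'] is passed.
try:
    np.nditer(as_object, op_flags=['readwrite'])
except TypeError as exc:
    print('object dtype:', exc)

# Requesting a float dtype (and an explicit copy) yields a plain numeric array.
as_float = df.to_numpy(dtype=float, copy=True)
print(as_float.dtype)      # float64

# The same iterator now constructs without complaint.
with np.nditer(as_float, op_flags=['readwrite']) as it:
    pass
```

Converting once at the boundary keeps the rest of the code on a numeric dtype, so the vectorized answers above also apply unchanged.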