【Question title】: Forward fill column on condition [closed]
【Posted】: 2025-12-18 06:40:01
【Question description】:

My DataFrame looks like this:

df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
                   ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})

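Wherever Col1 contains the value 1, I want to forward-fill 1 into Col2 n times, counting the row that holds the 1 itself. For example, with n = 4 I need this result: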

df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
                   ,'Col2':[0,1,1,1,1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1]})

I think I could do this with a for loop and a counter that resets every time the condition occurs, but is there a faster way to produce the same result?

Thanks!
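For reference, the counter-based loop mentioned in the question could look like the sketch below. The function name `fill_after_ones` is made up for illustration; it works on a plain list and is only meant as a slow but obviously correct baseline to compare the faster answers against.

```python
def fill_after_ones(flags, n):
    """Return a list with 1 at every trigger row and at the n - 1 rows after it."""
    out = []
    remaining = 0  # how many more rows still have to be filled with 1
    for value in flags:
        if value == 1:
            remaining = n  # a new trigger resets the counter
        if remaining > 0:
            out.append(1)
            remaining -= 1
        else:
            out.append(0)
    return out


col1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
col2 = fill_after_ones(col1, 4)
```

With the sample data and n = 4, `col2` matches the Col2 column requested in the question.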

【Question comments】:

    Tags: python pandas numpy conditional-statements fill


    【Solution 1】:

    Approach #1: NumPy-based, using 1D convolution -

    N = 4 # window size
    K = np.ones(N,dtype=bool)
    df['Col2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
    

    A more compact one-liner -

    df['Col2'] = (np.convolve(df.Col1,[1]*N)[:-N+1]>0).view('i1')
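Why the convolution works: in full mode, convolving with N ones gives, at position i, the number of 1s in the N-row window ending at i, so any positive count means a trigger occurred within the last N rows. The sketch below wraps this in a hypothetical helper, `fill_convolve`, and slices with `[:len(flags)]` rather than `[:-N+1]`; that is my own variation, chosen because `[:-N+1]` becomes `[:0]` (an empty result) when N = 1.

```python
import numpy as np


def fill_convolve(flags, n):
    """Mark each row that has a 1 in flags within the n rows ending at that row."""
    flags = np.asarray(flags)
    # Full convolution with n ones: counts[i] = flags[i-n+1] + ... + flags[i]
    counts = np.convolve(flags, np.ones(n, dtype=int))
    # The full output has len(flags) + n - 1 entries; keep the first len(flags)
    return (counts[:len(flags)] > 0).astype(np.int8)


col1 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
col2 = fill_convolve(col1, 4)
```

The result can be assigned straight back with `df['Col2'] = fill_convolve(df['Col1'], N)`.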
    

    Approach #2: using SciPy's binary_dilation -

    from scipy.ndimage import binary_dilation  # the scipy.ndimage.morphology namespace is deprecated
    
    N = 4 # window size
    K = np.ones(N,dtype=bool)
    df['Col2'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
    

    Approach #3: squeezing the most out of NumPy with a strided-view-based tool -

    from skimage.util.shape import view_as_windows
    
    N = 4 # window size
    mask = df.Col1.values==1
    w = view_as_windows(mask,N)
    idx = len(df)-(N-mask[-N:].argmax())
    if mask[-N:].any():
        mask[idx:idx+N-1] = 1
    w[mask[:-N+1]] = 1
    df['Col2'] = mask.view('i1')
    

    Benchmarking

    Setup: the given sample scaled up 10,000x -

    In [67]: df = pd.DataFrame({'Col1':[0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0]
        ...:                    ,'Col2':[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]})
        ...: 
        ...: df = pd.concat([df]*10000)
        ...: df.index = range(len(df.index))
    

    Timings

    # @jezrael's soln
    In [68]: %%timeit
        ...: n = 3
        ...: df['Col2_1'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
    5.15 ms ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # App-1 from this post
    In [72]: %%timeit
        ...: N = 4 # window size
        ...: K = np.ones(N,dtype=bool)
        ...: df['Col2_2'] = (np.convolve(df.Col1,K)[:-N+1]>0).view('i1')
    1.41 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # App-2 from this post
    In [70]: %%timeit
        ...: N = 4 # window size
        ...: K = np.ones(N,dtype=bool)
        ...: df['Col2_3'] = binary_dilation(df.Col1,K,origin=-(N//2)).view('i1')
    2.92 ms ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # App-3 from this post
    In [35]: %%timeit
        ...: N = 4 # window size
        ...: mask = df.Col1.values==1
        ...: w = view_as_windows(mask,N)
        ...: idx = len(df)-(N-mask[-N:].argmax())
        ...: if mask[-N:].any():
        ...:     mask[idx:idx+N-1] = 1
        ...: w[mask[:-N+1]] = 1
        ...: df['Col2_4'] = mask.view('i1')
    1.22 ms ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    # @yatu's soln
    In [71]: %%timeit
        ...: n = 4
        ...: ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
        ...: df.loc[ix, 'Col2_5'] = 1
    7.55 ms ± 32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
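One thing to keep in mind when reading the timings: the ffill version uses n = 3 while the others use N = 4, because `limit` counts only the extra rows after each 1. Before comparing speed it may be worth confirming that the approaches agree. A rough check, using a smaller 100x copy of the sample (my choice, not the benchmark setup) and clipping the index-based version so it cannot run past the last row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]})
df = pd.concat([df] * 100, ignore_index=True)
N = 4  # total number of 1s per run, including the trigger row
col1 = df['Col1'].to_numpy()

# ffill-based: limit counts only the rows after the trigger, hence N - 1
by_ffill = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=N - 1).fillna(df['Col1']).astype(int)

# convolution-based
by_convolve = (np.convolve(col1, np.ones(N, dtype=int))[:len(col1)] > 0).astype(int)

# index-based: trigger positions plus offsets 0 .. N-1, clipped to the array length
ix = (np.flatnonzero(col1 == 1) + np.arange(N)[:, None]).ravel('F')
by_index = np.zeros(len(col1), dtype=int)
by_index[ix[ix < len(col1)]] = 1

all_agree = (np.array_equal(by_ffill.to_numpy(), by_convolve)
             and np.array_equal(by_convolve, by_index))
```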
    

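    【Comments】:

      【Solution 2】:

      For a general solution, replace the non-1 values with missing values using Series.where, forward-fill the 1 values with the limit parameter, and finally replace the remaining missing values with the original values. Note that limit=n fills n rows after each 1, so n = 3 gives the four 1s per run shown in the question: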

      n = 3
      df['Col2'] = df['Col1'].where(df['Col1'].eq(1)).ffill(limit=n).fillna(df['Col1']).astype(int)
      
      print (df)
          Col1  Col2
      0      0     0
      1      1     1
      2      0     1
      3      0     1
      4      0     1
      5      0     0
      6      0     0
      7      0     0
      8      1     1
      9      0     1
      10     0     1
      11     0     1
      12     0     0
      13     0     0
      14     0     0
      15     0     0
      16     0     0
      17     1     1
      18     0     1
      19     0     1
      20     0     1
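The one-line chain above can be unpacked to see what each step contributes. The intermediate names `only_ones` and `filled` are mine, introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]})
n = 3  # rows to fill after each 1, so every run ends up with n + 1 ones

# Step 1: keep the 1s, turn everything else into NaN
only_ones = df['Col1'].where(df['Col1'].eq(1))

# Step 2: propagate each 1 downward over at most n consecutive NaN rows
filled = only_ones.ffill(limit=n)

# Step 3: rows that are still NaN get their original Col1 value back
df['Col2'] = filled.fillna(df['Col1']).astype(int)
```

Because step 3 restores the original values, any value in Col1 other than 1 is kept wherever it lies outside a filled run, and overwritten by 1 where it falls inside one.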
      

      【Comments】:

        【Solution 3】:

        Here is a NumPy-based approach: use np.flatnonzero to get the indices where Col1 is 1, then add a range up to n via a broadcast sum:

        n = 4
        ix = (np.flatnonzero(df.Col1 == 1) + np.arange(n)[:,None]).ravel('F')
        df.loc[ix, 'Col2'] = 1
        

        print(df)
        
             Col1  Col2
        0      0     0
        1      1     1
        2      0     1
        3      0     1
        4      0     1
        5      0     0
        6      0     0
        7      0     0
        8      1     1
        9      0     1
        10     0     1
        11     0     1
        12     0     0
        13     0     0
        14     0     0
        15     0     0
        16     0     0
        17     1     1
        18     0     1
        19     0     1
        20     0     1
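To make the broadcasting step concrete, here is a small sketch on a made-up input where the last 1 sits close to the end. It works on positions in a plain NumPy array instead of `df.loc`, and it drops positions past the last row; that clipping is my addition. The `df.loc` version relies on a default RangeIndex (labels equal to positions) and, as far as I know, does not silently ignore labels that are missing from the index.

```python
import numpy as np

col1 = np.array([0, 0, 0, 1, 0, 0, 1, 0])  # hypothetical input, not the question's data
n = 4

starts = np.flatnonzero(col1 == 1)   # positions of the 1s: [3, 6]
offsets = np.arange(n)[:, None]      # column vector [[0], [1], [2], [3]]
grid = starts + offsets              # shape (n, len(starts)); column j is starts[j] + 0 .. n-1
ix = grid.ravel('F')                 # column-major flattening: 3, 4, 5, 6, 6, 7, 8, 9
ix = ix[ix < len(col1)]              # drop positions that fall past the last row

col2 = np.zeros(len(col1), dtype=int)
col2[ix] = 1                         # col2 is now [0, 0, 0, 1, 1, 1, 1, 1]
```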
        

        【Comments】:

          【Solution 4】:

          Something with reindex:

          N=4
          s=df.loc[df.Col1==1,'Col1']
          idx=s.index
          s=s.reindex(idx.repeat(N))
          s.index=(idx.values+np.arange(N)[:,None]).ravel('F')
          
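          # update a copy and assign it back: a chained df.Col2.update(s) may not write through under copy-on-write
          col2 = df['Col2'].copy()
          col2.update(s)
          df['Col2'] = col2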
          df
              Col1  Col2
          0      0     0
          1      1     1
          2      0     1
          3      0     1
          4      0     1
          5      0     0
          6      0     0
          7      0     0
          8      1     1
          9      0     1
          10     0     1
          11     0     1
          12     0     0
          13     0     0
          14     0     0
          15     0     0
          16     0     0
          17     1     1
          18     0     1
          19     0     1
          20     0     1
          

          【Comments】:
