【Question title】: Count consecutive zeros over pandas rows
【Posted】: 2023-03-10 13:13:01
【Question description】:

Given the following pd.DataFrame:

pd.DataFrame({'2010':[0, 45, 5], '2011': [12, 56, 0], '2012': [11, 22, 0], '2013': [0, 5, 0], '2014': [0, 0, 0]})

  2010 2011 2012 2013 2014
1  0    12   11   0    0
2  45   56   22   5    0
3  5    0    0    0    0

I want to count the lengths of the runs of consecutive zeros in each row:

1 [1, 2]
2 [1]
3 [4]

I'm looking for different efficient ways to do this.

【Comments】:

    Tags: python pandas


    【Solution 1】:

    For efficiency, I'd suggest a pure NumPy approach -

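    import numpy as np

    def islandlen_perrow(df, trigger_val=0):
        # Boolean mask of the cells equal to trigger_val
        a = df.values == trigger_val
        # Pad both sides with False so runs touching an edge still have a start and an end
        pad = np.zeros((a.shape[0], 1), dtype=bool)
        mask = np.hstack((pad, a, pad))
        # True wherever the mask flips; flips alternate between run start and run end
        mask_step = mask[:, 1:] != mask[:, :-1]
        idx = np.flatnonzero(mask_step)
        island_lens = idx[1::2] - idx[::2]
        # Two flips per run, so this is the number of runs in each row
        n_islands_perrow = mask_step.sum(1) // 2
        # Split the flat array of lengths back into one array per row
        out = np.split(island_lens, n_islands_perrow[:-1].cumsum())
        return out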
    

    Sample run -

    In [69]: df
    Out[69]: 
       2010  2011  2012  2013  2014
    0     0    12    11     0     0
    1    45    56    22     5     0
    2     5     0     0     0     0
    
    In [70]: islandlen_perrow(df, trigger_val=0)
    Out[70]: [array([1, 2], dtype=int64), array([1], dtype=int64), array([4], dtype=int64)]
    
    In [76]: pd.Series(islandlen_perrow(df, trigger_val=0))
    Out[76]: 
    0    [1, 2]
    1       [1]
    2       [4]
    dtype: object
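
To make the index arithmetic above easier to follow, here is a small trace of the same steps on two rows. This is only an illustrative sketch: the first row is row 0 of the sample frame, and the second row (with no zeros) is an added assumption to show the empty case.

```python
import numpy as np

# Row 0 of the sample frame, plus a row with no zeros (an illustrative addition).
a = np.array([[0, 12, 11, 0, 0],
              [1, 2, 3, 4, 5]]) == 0

# Pad each row with False on both sides so every run has a start and an end.
pad = np.zeros((a.shape[0], 1), dtype=bool)
mask = np.hstack((pad, a, pad))

# True wherever the mask flips between neighbouring columns.
mask_step = mask[:, 1:] != mask[:, :-1]

# Flat positions of the flips: even entries open a run, odd entries close it.
idx = np.flatnonzero(mask_step)
island_lens = idx[1::2] - idx[::2]

# Each run contributes two flips, so halving the per-row flip count gives
# the number of runs per row, which is where the flat lengths get split.
n_islands_perrow = mask_step.sum(1) // 2
out = np.split(island_lens, n_islands_perrow[:-1].cumsum())

print(mask_step[0].tolist())      # [True, True, False, True, False, True]
print(idx.tolist())               # [0, 1, 3, 5]
print([o.tolist() for o in out])  # [[1, 2], []]
```

Subtracting flat indices is safe because the start and end of a run always sit in the same row, so the row offset cancels out.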
    

    Timings on a large array -

    In [77]: df = pd.DataFrame(np.random.randint(0,4,(1000,1000)))
    
    In [78]: from itertools import groupby
    
    # @Daniel Mesejo's soln
    In [79]: def count_zeros(x):
        ...:     return [sum(1 for _ in group) for key, group in groupby(x, key=lambda i: i == 0) if key]
    
    In [80]: %timeit df.apply(count_zeros, axis=1)
    1 loop, best of 3: 228 ms per loop
    
    # @coldspeed's soln-1
    In [84]: %%timeit
        ...: v = df.stack()
        ...: m = v.eq(0)
        ...: 
        ...: (m.ne(m.shift())
        ...:   .cumsum()
        ...:   .where(m)
        ...:   .dropna()
        ...:   .groupby(level=0)
        ...:   .apply(lambda x: x.value_counts(sort=False).tolist()))
    1 loop, best of 3: 516 ms per loop
    
    # @coldspeed's soln-2
    In [88]: %%timeit
        ...: v = df.stack()
        ...: m = v.eq(0)
        ...: (m.ne(m.shift())
        ...:   .cumsum()
        ...:   .where(m)
        ...:   .dropna()
        ...:   .groupby(level=0)
        ...:   .value_counts(sort=False)
        ...:   .groupby(level=0)
        ...:   .apply(list))
    1 loop, best of 3: 343 ms per loop
    
    # @jpp's soln
    In [90]: %timeit [[len(list(grp)) for flag, grp in groupby(row, key=bool) if not flag] \
        ...:                 for row in df.values]
    1 loop, best of 3: 334 ms per loop
    
    # @J. Doe's soln
    In [94]: %%timeit
        ...: data = df
        ...: data_transformed = np.equal(data.astype(int).values.tolist(), 0).astype(str)
        ...: pd.DataFrame(data_transformed).apply(lambda x: [i.count('True') for i in ''.join(list(x)).split('False') if i], axis=1)
    1 loop, best of 3: 519 ms per loop
    
    # From this post
    In [89]: %timeit pd.Series(islandlen_perrow(df, trigger_val=0))
    100 loops, best of 3: 9.8 ms per loop
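
Timings like these only mean something if the approaches return the same answer, so a quick agreement check may be worth running first. The sketch below compares the NumPy function with the `groupby`-based `count_zeros` on random data; the seed and the 200x50 shape are arbitrary choices for illustration and are not taken from the answer.

```python
from itertools import groupby

import numpy as np
import pandas as pd


def islandlen_perrow(df, trigger_val=0):
    # Same steps as the NumPy function in this answer.
    a = df.values == trigger_val
    pad = np.zeros((a.shape[0], 1), dtype=bool)
    mask = np.hstack((pad, a, pad))
    mask_step = mask[:, 1:] != mask[:, :-1]
    idx = np.flatnonzero(mask_step)
    island_lens = idx[1::2] - idx[::2]
    n_islands_perrow = mask_step.sum(1) // 2
    return np.split(island_lens, n_islands_perrow[:-1].cumsum())


def count_zeros(x):
    # Pure-Python reference: length of every run where the key (i == 0) is True.
    return [sum(1 for _ in group)
            for key, group in groupby(x, key=lambda i: i == 0) if key]


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df = pd.DataFrame(rng.integers(0, 4, (200, 50)))

fast = [arr.tolist() for arr in islandlen_perrow(df)]
slow = [count_zeros(row) for row in df.to_numpy()]
agree = fast == slow
print(agree)  # True
```

Absolute timings depend on the machine and on the pandas and NumPy versions, so the numbers above are best read as relative.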
    

    【Comments】:

      【Solution 2】:

      Use itertools.groupby with a list comprehension:

      from itertools import groupby
      
      df['counts'] = [[len(list(grp)) for flag, grp in groupby(row, key=bool) if not flag] \
                      for row in df.values]
      
      print(df)
      
         2010  2011  2012  2013  2014  counts
      0     0    12    11     0     0  [1, 2]
      1    45    56    22     5     0     [1]
      2     5     0     0     0     0     [4]
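
As a rough illustration of what the comprehension does for a single row, the sketch below lists every group that `groupby` produces with `key=bool` and then keeps only the falsy ones. The variable names `runs` and `zero_runs` are introduced here for illustration.

```python
from itertools import groupby

row = [0, 12, 11, 0, 0]  # row 0 of the sample frame

# Each group is a maximal run of neighbours that share the same key.
runs = [(flag, len(list(grp))) for flag, grp in groupby(row, key=bool)]
print(runs)  # [(False, 1), (True, 2), (False, 2)]

# Keeping only the falsy runs leaves the lengths of the zero runs.
zero_runs = [length for flag, length in runs if not flag]
print(zero_runs)  # [1, 2]
```

Note that `key=bool` counts any falsy value as part of a run, which matches "equal to zero" for purely numeric data.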
      

      【Comments】:

        【Solution 3】:

        If you're interested in a pure pandas solution, you can use groupby and value_counts:

        v = df.stack()
        m = v.eq(0)
        
        (m.ne(m.shift())
          .cumsum()
          .where(m)
          .dropna()
          .groupby(level=0)
          .apply(lambda x: x.value_counts(sort=False).tolist()))
        
        0    [1, 2]
        1       [1]
        2       [4]
        dtype: object
        

        Or, avoiding the lambda:

        (m.ne(m.shift())
          .cumsum()
          .where(m)
          .dropna()
          .groupby(level=0)
          .value_counts(sort=False)
          .groupby(level=0)
          .apply(list))
        
        0    [1, 2]
        1       [1]
        2       [4]
        dtype: object
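
The chained expressions can be hard to read in one go, so here is a sketch that names the intermediate steps. It is a variant rather than the answer's exact code: it counts run sizes with `groupby(...).size()` instead of `value_counts`, and the names `run_id`, `zero_runs`, and `sizes` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'2010': [0, 45, 5], '2011': [12, 56, 0], '2012': [11, 22, 0],
                   '2013': [0, 5, 0], '2014': [0, 0, 0]})

v = df.stack()   # one long Series indexed by (row, column)
m = v.eq(0)      # True where the value is zero

# A new id starts every time the zero/non-zero state changes.
run_id = m.ne(m.shift()).cumsum()
print(run_id.tolist())  # [1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 7]

# Keep only the ids that belong to zeros, then count cells per (row, run id).
zero_runs = run_id[m]
row_labels = zero_runs.index.get_level_values(0)
sizes = zero_runs.groupby([row_labels, zero_runs]).size()

# Collect the run lengths of each row into a list.
result = sizes.groupby(level=0).apply(list)
print(result.tolist())  # [[1, 2], [1], [4]]
```

Because the Series is stacked, a run id can continue across a row boundary when one row ends in a zero and the next begins with one; grouping on the row label together with the run id keeps such runs separate. Rows without any zeros drop out of the result instead of appearing as empty lists.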
        

        【Comments】:

          【Solution 4】:

          You can use itertools.groupby:

          import pandas as pd
          
          from itertools import groupby
          
          
          def count_zeros(x):
              return [sum(1 for _ in group) for key, group in groupby(x, key=lambda i: i == 0) if key]
          
          
          df = pd.DataFrame({'2010':[0, 45, 5], '2011': [12, 56, 0], '2012': [11, 22, 0], '2013': [0, 5, 0], '2014': [0, 0, 0]})
          
          result = df.apply(count_zeros, axis=1)
          print(result)
          

          Output

          0    [1, 2]
          1       [1]
          2       [4]
          dtype: object
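
The same idea extends to other target values and to edge cases such as rows with no zeros. The sketch below uses a hypothetical helper, `count_runs`, which is not part of the answer; it only parameterizes the value compared against in `count_zeros`.

```python
from itertools import groupby


def count_runs(row, value=0):
    # Hypothetical generalization of count_zeros: run lengths of `value` in `row`.
    return [sum(1 for _ in grp)
            for hit, grp in groupby(row, key=lambda x: x == value) if hit]


examples = {
    "mixed": count_runs([0, 12, 11, 0, 0]),
    "no zeros": count_runs([45, 56, 22, 5]),
    "all zeros": count_runs([0, 0, 0]),
    "other value": count_runs([5, 5, 1, 5], value=5),
}
print(examples)
# {'mixed': [1, 2], 'no zeros': [], 'all zeros': [3], 'other value': [2, 1]}
```

A row without the target value yields an empty list, so every row keeps an entry when this is used with `df.apply(..., axis=1)`.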
          

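          【Comments】:

            【Solution 5】:

            One approach is to convert the values to booleans, join them into a string, and split that string on the False values: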

            data = df  # the question's DataFrame; assumes numpy is imported as np and pandas as pd
            data_transformed = np.equal(data.astype(int).values.tolist(), 0).astype(str)
            pd.DataFrame(data_transformed).apply(lambda x: [i.count('True') for i in ''.join(list(x)).split('False') if i], axis=1)
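
To see why splitting on 'False' works, the sketch below traces the string steps for one row in plain Python. The `flags` list is what row 0 of the sample frame looks like after the boolean-to-string conversion; the variable names are introduced here for illustration.

```python
# Row 0 of the sample frame ([0, 12, 11, 0, 0]) after comparing with 0 and casting to str.
flags = ['True', 'False', 'False', 'True', 'True']

joined = ''.join(flags)
print(joined)  # TrueFalseFalseTrueTrue

# Splitting on 'False' leaves one piece per gap between non-zero cells;
# empty pieces appear where two non-zero cells are adjacent.
pieces = joined.split('False')
print(pieces)  # ['True', '', 'TrueTrue']

# Each non-empty piece is a run of zeros; its length is the number of 'True's.
counts = [p.count('True') for p in pieces if p]
print(counts)  # [1, 2]
```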
            

            【Comments】:
