[Question Title]: Pandas: Drop consecutive duplicates
[Posted]: 2013-10-28 03:25:31
[Question]:

What is the most efficient way to drop only consecutive duplicates in pandas?

drop_duplicates gives this:

In [3]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])

In [4]: a.drop_duplicates()
Out[4]: 
1    1
2    2
4    3
dtype: int64

But I want this:

In [4]: a.something()
Out[4]: 
1    1
2    2
4    3
5    2
dtype: int64

[Question Comments]:

    Tags: python pandas


    [Solution 1]:

    Use shift:

    a.loc[a.shift(-1) != a]
    
    Out[3]:
    
    1    1
    3    2
    4    3
    5    2
    dtype: int64
    

    So the above uses a boolean criterion: we compare the data against itself shifted by -1 rows to create the mask.

    Another method is to use diff:

    In [82]:
    
    a.loc[a.diff() != 0]
    Out[82]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64
    

    But this is slower than the original method if you have a large number of rows.

    Update

    Thanks to Bjarke Ebert for pointing out a subtle error: I should actually use shift(1), or just shift() since the default period is 1. This returns the first of each run of consecutive values:

    In [87]:
    
    a.loc[a.shift() != a]
    Out[87]:
    1    1
    2    2
    4    3
    5    2
    dtype: int64
    

    Note the difference in index values, thanks @BjarkeEbert!

    [Comments]:

    • What should we do if we want to do a groupby first and then drop consecutive duplicates? E.g. df.groupby(['Col1','Col2']) and then save it as a dataframe again?
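The groupby case asked about in the comment can be sketched as follows (an assumed example, not from the original answer: the column names and data are hypothetical; the idea is that `groupby(...).shift()` compares each row only against the previous row of its own group):

```python
import pandas as pd

# Hypothetical data: drop consecutive duplicates of "Value" within each
# (Col1, Col2) group, so runs are only compared inside a group.
df = pd.DataFrame({
    "Col1":  ["a", "a", "a", "b", "b"],
    "Col2":  ["x", "x", "x", "x", "x"],
    "Value": [1, 1, 2, 2, 2],
})

# Within each group, keep rows whose "Value" differs from the previous row;
# the first row of every group compares against NaN and is kept.
mask = df.groupby(["Col1", "Col2"])["Value"].shift() != df["Value"]
result = df[mask]
print(result)
```

Note that row 3 is kept even though its value equals the value of row 2, because the two rows belong to different groups.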
    [Solution 2]:

    Here is an update that makes it work for multiple columns. Use ".any(axis=1)" to combine the results from each column:

    cols = ["col1","col2","col3"]
    de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
    
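As a worked example (with assumed sample data, not from the original answer), a row survives when any of the listed columns changed from the row above it; it is dropped only when all of them repeat:

```python
import pandas as pd

# Hypothetical sample: row 1 repeats row 0 in all three columns,
# so only row 1 is dropped; rows 2 and 3 each change at least one column.
a = pd.DataFrame({
    "col1": [1, 1, 1, 2],
    "col2": [5, 5, 5, 5],
    "col3": [9, 9, 8, 8],
})

cols = ["col1", "col2", "col3"]
# Keep a row when ANY column differs from the previous row.
de_dup = a[cols].loc[(a[cols].shift() != a[cols]).any(axis=1)]
print(de_dup)
```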

    [Comments]:

      [Solution 3]:

      Since we are going for the most efficient way, i.e. performance, let's use array data to leverage NumPy. We will slice once and compare, similar to the shifting method discussed earlier in @EdChum's post. But with NumPy slicing we end up with one-less array, so we need to concatenate a True element at the start to select the first element, and thus we would have an implementation like so -

      def drop_consecutive_duplicates(a):
          ar = a.values
          return a[np.concatenate(([True],ar[:-1]!= ar[1:]))]
      

      Sample run -

      In [149]: a
      Out[149]: 
      1    1
      2    2
      3    2
      4    3
      5    2
      dtype: int64
      
      In [150]: drop_consecutive_duplicates(a)
      Out[150]: 
      1    1
      2    2
      4    3
      5    2
      dtype: int64
      

      Timings comparing against @EdChum's solution on large arrays -

      In [142]: a = pd.Series(np.random.randint(1,5,(1000000)))
      
      In [143]: %timeit a.loc[a.shift() != a]
      100 loops, best of 3: 12.1 ms per loop
      
      In [144]: %timeit drop_consecutive_duplicates(a)
      100 loops, best of 3: 11 ms per loop
      
      In [145]: a = pd.Series(np.random.randint(1,5,(10000000)))
      
      In [146]: %timeit a.loc[a.shift() != a]
      10 loops, best of 3: 136 ms per loop
      
      In [147]: %timeit drop_consecutive_duplicates(a)
      10 loops, best of 3: 114 ms per loop
      

      So, some improvement there!

      Get major boost for values only!

      If only the values are needed, we can get a major boost by simply indexing into the array data, like so -

      def drop_consecutive_duplicates(a):
          ar = a.values
          return ar[np.concatenate(([True],ar[:-1]!= ar[1:]))]
      

      Sample run -

      In [170]: a = pandas.Series([1,2,2,3,2], index=[1,2,3,4,5])
      
      In [171]: drop_consecutive_duplicates(a)
      Out[171]: array([1, 2, 3, 2])
      

      Timings -

      In [173]: a = pd.Series(np.random.randint(1,5,(10000000)))
      
      In [174]: %timeit a.loc[a.shift() != a]
      10 loops, best of 3: 137 ms per loop
      
      In [175]: %timeit drop_consecutive_duplicates(a)
      10 loops, best of 3: 61.3 ms per loop
      

      [Comments]:

      • I don't understand why the timings in [147] and [175] are different? Can you explain what change you made, because I don't see any? Maybe a typo?
      • @Biarys [175] is the modified version from the "Get major boost for values only!" section, hence the timing difference. The original works on a pandas Series, while the modified one works on arrays, as listed in the post.
      • Oh I see. It's hard to notice the change from return a[...] to return ar[...]. Does your function work on dataframes?
      • @Biarys For dataframes, if you are looking for duplicate rows, we just need to use slicing: ar[:,:-1]!= ar[:,1:], along with an ALL reduction.
      • Thanks. I will give it a try.
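Extending the exchange above, one way the NumPy approach can be adapted to DataFrames is the following sketch (my own hedged variant, not code from the answer: it drops consecutive duplicate rows, keeping a row when it differs from the previous row in at least one column):

```python
import numpy as np
import pandas as pd

def drop_consecutive_duplicate_rows(df):
    ar = df.values
    # First row is always kept; a later row is kept if any column changed
    # relative to the row directly above it.
    mask = np.concatenate(([True], (ar[1:] != ar[:-1]).any(axis=1)))
    return df[mask]

# Hypothetical sample: row 1 exactly repeats row 0 and is dropped.
df = pd.DataFrame({"x": [1, 1, 1, 2], "y": [7, 7, 8, 8]})
print(drop_consecutive_duplicate_rows(df))
```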
      [Solution 4]:

      Here is a function that handles both pd.Series and pd.DataFrame. You can mask/drop, choose the axis, and finally choose whether to drop rows with 'any' or 'all' NaN. It is not optimized for computation time, but it has the advantage of being robust and quite clear.

      import numpy as np
      import pandas as pd
      
      # To mask/drop successive values in pandas
      def Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                   keep_first=True,
                                                   axis=0, how='all'):
      
          '''
          #Function built with the help of:
          # 1) https://stackoverflow.com/questions/48428173/how-to-change-consecutive-repeating-values-in-pandas-dataframe-series-to-nan-or
          # 2) https://stackoverflow.com/questions/19463985/pandas-drop-consecutive-duplicates
          
          Input:
          df should be a pandas.DataFrame or a pandas.Series
          Output:
          df or ts with masked or dropped values
          '''
          
          # Mask keeping the first occurrence
          if keep_first:
              df = df.mask(df.shift(1) == df)
          # Mask including the first occurrence
          else:
              df = df.mask((df.shift(1) == df) | (df.shift(-1) == df))
      
          # Drop the values (e.g. rows are deleted)    
          if drop:
              return df.dropna(axis=axis, how=how)        
          # Only mask the values (e.g. become 'NaN')
          else:
              return df   
      

      Here is the test code included in the script:

      
      if __name__ == "__main__":
          
          # With time series
          print("With time series:\n")
          ts = pd.Series([1,1,2,2,3,2,6,6,float('nan'), 6,6,float('nan'),float('nan')], 
                          index=[0,1,2,3,4,5,6,7,8,9,10,11,12])
          
          print("#Original ts:")    
          print(ts)
      
          print("\n## 1) Mask keeping the first occurrence:")    
          print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False, 
                                                         keep_first=True))
      
          print("\n## 2) Mask including the first occurrence:")    
          print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=False, 
                                                         keep_first=False))
          
          print("\n## 3) Drop keeping the first occurrence:")    
          print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True, 
                                                         keep_first=True))
          
          print("\n## 4) Drop including the first occurrence:")        
          print(Mask_Or_Drop_Successive_Identical_Values(ts, drop=True, 
                                                         keep_first=False))
          
          
          # With dataframes
          print("With dataframe:\n")
          df = pd.DataFrame(np.random.randn(15, 3))
          df.iloc[4:9,0]=40
          df.iloc[8:15,1]=22
          df.iloc[8:12,2]=0.23
              
          print("#Original df:")
          print(df)
      
          print("\n## 5) Mask keeping the first occurrence:") 
          print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                         keep_first=True))
      
          print("\n## 6) Mask including the first occurrence:")    
          print(Mask_Or_Drop_Successive_Identical_Values(df, drop=False, 
                                                         keep_first=False))
          
          print("\n## 7) Drop 'any' keeping the first occurrence:")    
          print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                         keep_first=True,
                                                         how='any'))
          
          print("\n## 8) Drop 'all' keeping the first occurrence:")    
          print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                         keep_first=True,
                                                         how='all'))
          
          print("\n## 9) Drop 'any' including the first occurrence:")        
          print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                         keep_first=False,
                                                         how='any'))
      
          print("\n## 10) Drop 'all' including the first occurrence:")        
          print(Mask_Or_Drop_Successive_Identical_Values(df, drop=True, 
                                                         keep_first=False,
                                                         how='all'))
      

      Here is the expected result:

      With time series:
      
      #Original ts:
      0     1.0
      1     1.0
      2     2.0
      3     2.0
      4     3.0
      5     2.0
      6     6.0
      7     6.0
      8     NaN
      9     6.0
      10    6.0
      11    NaN
      12    NaN
      dtype: float64
      
      ## 1) Mask keeping the first occurrence:
      0     1.0
      1     NaN
      2     2.0
      3     NaN
      4     3.0
      5     2.0
      6     6.0
      7     NaN
      8     NaN
      9     6.0
      10    NaN
      11    NaN
      12    NaN
      dtype: float64
      
      ## 2) Mask including the first occurrence:
      0     NaN
      1     NaN
      2     NaN
      3     NaN
      4     3.0
      5     2.0
      6     NaN
      7     NaN
      8     NaN
      9     NaN
      10    NaN
      11    NaN
      12    NaN
      dtype: float64
      
      ## 3) Drop keeping the first occurrence:
      0    1.0
      2    2.0
      4    3.0
      5    2.0
      6    6.0
      9    6.0
      dtype: float64
      
      ## 4) Drop including the first occurrence:
      4    3.0
      5    2.0
      dtype: float64
      With dataframe:
      
      #Original df:
                  0          1         2
      0   -1.890137  -3.125224 -1.029065
      1   -0.224712  -0.194742  1.891365
      2    1.009388   0.589445  0.927405
      3    0.212746  -0.392314 -0.781851
      4   40.000000   1.889781 -1.394573
      5   40.000000  -0.470958 -0.339213
      6   40.000000   1.613524  0.271641
      7   40.000000  -1.810958 -1.568372
      8   40.000000  22.000000  0.230000
      9   -0.296557  22.000000  0.230000
      10  -0.921238  22.000000  0.230000
      11  -0.170195  22.000000  0.230000
      12   1.460457  22.000000 -0.295418
      13   0.307825  22.000000 -0.759131
      14   0.287392  22.000000  0.378315
      
      ## 5) Mask keeping the first occurrence:
                  0          1         2
      0   -1.890137  -3.125224 -1.029065
      1   -0.224712  -0.194742  1.891365
      2    1.009388   0.589445  0.927405
      3    0.212746  -0.392314 -0.781851
      4   40.000000   1.889781 -1.394573
      5         NaN  -0.470958 -0.339213
      6         NaN   1.613524  0.271641
      7         NaN  -1.810958 -1.568372
      8         NaN  22.000000  0.230000
      9   -0.296557        NaN       NaN
      10  -0.921238        NaN       NaN
      11  -0.170195        NaN       NaN
      12   1.460457        NaN -0.295418
      13   0.307825        NaN -0.759131
      14   0.287392        NaN  0.378315
      
      ## 6) Mask including the first occurrence:
                 0         1         2
      0  -1.890137 -3.125224 -1.029065
      1  -0.224712 -0.194742  1.891365
      2   1.009388  0.589445  0.927405
      3   0.212746 -0.392314 -0.781851
      4        NaN  1.889781 -1.394573
      5        NaN -0.470958 -0.339213
      6        NaN  1.613524  0.271641
      7        NaN -1.810958 -1.568372
      8        NaN       NaN       NaN
      9  -0.296557       NaN       NaN
      10 -0.921238       NaN       NaN
      11 -0.170195       NaN       NaN
      12  1.460457       NaN -0.295418
      13  0.307825       NaN -0.759131
      14  0.287392       NaN  0.378315
      
      ## 7) Drop 'any' keeping the first occurrence:
                 0         1         2
      0  -1.890137 -3.125224 -1.029065
      1  -0.224712 -0.194742  1.891365
      2   1.009388  0.589445  0.927405
      3   0.212746 -0.392314 -0.781851
      4  40.000000  1.889781 -1.394573
      
      ## 8) Drop 'all' keeping the first occurrence:
                  0          1         2
      0   -1.890137  -3.125224 -1.029065
      1   -0.224712  -0.194742  1.891365
      2    1.009388   0.589445  0.927405
      3    0.212746  -0.392314 -0.781851
      4   40.000000   1.889781 -1.394573
      5         NaN  -0.470958 -0.339213
      6         NaN   1.613524  0.271641
      7         NaN  -1.810958 -1.568372
      8         NaN  22.000000  0.230000
      9   -0.296557        NaN       NaN
      10  -0.921238        NaN       NaN
      11  -0.170195        NaN       NaN
      12   1.460457        NaN -0.295418
      13   0.307825        NaN -0.759131
      14   0.287392        NaN  0.378315
      
      ## 9) Drop 'any' including the first occurrence:
                0         1         2
      0 -1.890137 -3.125224 -1.029065
      1 -0.224712 -0.194742  1.891365
      2  1.009388  0.589445  0.927405
      3  0.212746 -0.392314 -0.781851
      
      ## 10) Drop 'all' including the first occurrence:
                 0         1         2
      0  -1.890137 -3.125224 -1.029065
      1  -0.224712 -0.194742  1.891365
      2   1.009388  0.589445  0.927405
      3   0.212746 -0.392314 -0.781851
      4        NaN  1.889781 -1.394573
      5        NaN -0.470958 -0.339213
      6        NaN  1.613524  0.271641
      7        NaN -1.810958 -1.568372
      9  -0.296557       NaN       NaN
      10 -0.921238       NaN       NaN
      11 -0.170195       NaN       NaN
      12  1.460457       NaN -0.295418
      13  0.307825       NaN -0.759131
      14  0.287392       NaN  0.378315
      
      

      [Comments]:

      • You can also avoid explicitly checking the value: if keep_first: is enough (and better style)
      [Solution 5]:

      For other Stack explorers, building on johnml1135's answer above. This will remove the next duplicate across multiple columns, but without dropping all the columns. When the dataframe is sorted it will keep the first row but drop the second row if the "cols" match, even if there are more columns with non-matching information.

      cols = ["col1","col2","col3"]
      df = df.loc[(df[cols].shift() != df[cols]).any(axis=1)]
      

      [Comments]:

        [Solution 6]:

        Just another way of doing it:

        a.loc[a.ne(a.shift())]
        

        The method pandas.Series.ne is the not-equal operator, so a.ne(a.shift()) is equivalent to a != a.shift(). Documentation here.
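To illustrate the equivalence stated above (on the question's own sample data), the two masks are identical, and either one selects the first element of each run:

```python
import pandas as pd

a = pd.Series([1, 2, 2, 3, 2], index=[1, 2, 3, 4, 5])

# .ne is the element-wise "not equal" operator, so both masks match;
# the first element compares against NaN and is therefore kept.
mask_method = a.ne(a.shift())
mask_operator = a != a.shift()
assert mask_method.equals(mask_operator)
print(a.loc[mask_method])
```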

        [Comments]:

          [Solution 7]:

          Here is a variant of EdChum's answer that also treats consecutive NaNs as duplicates:

          def remove_consecutive_duplicates_and_nans(s):
              # By default, `shift` uses NaN as a fill value, which breaks our
              # removal of consecutive NaNs. Hence we use a different sentinel
              # object instead.
              shifted = s.astype(object).shift(-1, fill_value=object())
              return s.loc[
                  (shifted != s)
                  & ~(shifted.isna() & s.isna())
              ]
          
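A quick check of the NaN behavior (an assumed example, not from the original answer; the function is repeated so the snippet runs standalone). Because the function shifts by -1, it keeps the last element of each run, and the run of NaNs is collapsed to a single NaN:

```python
import numpy as np
import pandas as pd

def remove_consecutive_duplicates_and_nans(s):
    # Same function as above: a fresh object() sentinel fills the shifted-in
    # slot so it never compares equal to (or as-NaN with) a real value.
    shifted = s.astype(object).shift(-1, fill_value=object())
    return s.loc[(shifted != s) & ~(shifted.isna() & s.isna())]

s = pd.Series([1, 1, np.nan, np.nan, 2])
print(remove_consecutive_duplicates_and_nans(s))
```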

          [Comments]:

            [Solution 8]:

            Create a new column.

            df['match'] = df.col1.eq(df.col1.shift())
            

            Then:

            df = df[df['match']==False]
            
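The two steps can also be collapsed into one, since the helper column is just a shift-based mask (a sketch on assumed sample data, inverting the comparison directly instead of testing ==False):

```python
import pandas as pd

# Hypothetical sample frame with one consecutive duplicate in col1.
df = pd.DataFrame({"col1": [1, 2, 2, 3, 2]})

# Keep rows where col1 differs from the previous row; the first row
# compares against NaN, so eq() is False there and the row is kept.
result = df[~df.col1.eq(df.col1.shift())]
print(result)
```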

            [Comments]:
