Removing duplicate rows in a pandas DataFrame based on a condition (answers)

【Question title】: Removing duplicate rows in pandas DataFrame based on a condition
【Posted】: 2015-10-07 14:53:57
【Question】:

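I want to drop the rows of a DataFrame that are duplicates with respect to column 'a', using the argument take_last=True, unless some condition holds. For example, given the following DataFrame: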

 a | b | c
 1 | S | Blue 
 2 | M | Black
 2 | L | Blue
 1 | L | Green

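I want to drop the rows that are duplicates with respect to column 'a'. The general rule is take_last=True, unless some condition holds, say c = 'Blue', in which case I want take_last=False.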

So I would get this as my resulting df:

 a | b | c
 1 | L | Green
 2 | M | Black
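
For reference, here is a minimal sketch of what a plain drop_duplicates call returns on the example above. It assumes a current pandas release, where keep='last' and keep='first' take the place of the older take_last=True and take_last=False. The DataFrame construction is reconstructed from the table in the question. Neither call alone produces the desired result, which is why some extra condition is needed.

```python
import pandas as pd

# Example data reconstructed from the table in the question.
df = pd.DataFrame({
    "a": [1, 2, 2, 1],
    "b": ["S", "M", "L", "L"],
    "c": ["Blue", "Black", "Blue", "Green"],
})

# Keep the last row for each value of 'a' (formerly take_last=True).
last_kept = df.drop_duplicates(subset="a", keep="last")

# Keep the first row for each value of 'a' (formerly take_last=False).
first_kept = df.drop_duplicates(subset="a", keep="first")

print(last_kept)   # rows (2, L, Blue) and (1, L, Green)
print(first_kept)  # rows (1, S, Blue) and (2, M, Black)
```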

【Question comments】:

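I don't understand. Why does take_last=True mean taking only Green?
That is not what I am trying to do. I have edited the question now. I just wanted to illustrate my situation: I want to keep the last duplicate row unless some condition is true.
Yes, with respect to column 'a'.
OK, now you need to spell out your condition. "c = 'Blue'" is a condition that applies to a row, not to a group. If I gather all rows with a = 1 into one group, how do I decide whether you want the last one or the first? Do you want the last one unless any row in the group has c = Blue?
I don't understand the example resulting df. From your description, I would assume you want to keep the last row for a given value of a, unless there are rows where c is 'Blue', in which case you want to keep the first of those rows. But that is not what the example df shows.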

Tags: python pandas dataframe

【Solution 1】:

import pandas as pd

# example data from the question
df = pd.DataFrame({'a': [1, 2, 2, 1],
                   'b': ['S', 'M', 'L', 'L'],
                   'c': ['Blue', 'Black', 'Blue', 'Green']})
#   a  b      c
#0  1  S   Blue
#1  2  M  Black
#2  2  L   Blue
#3  1  L  Green

# first row of each group, sorted by 'a'; drop=True discards the old index
# (sort_values replaces the DataFrame.sort method removed from pandas)
df1 = df.groupby('a').head(1).sort_values('a').reset_index(drop=True)

# last row of each group, sorted by 'a'; drop=True discards the old index
df2 = df.groupby('a').tail(1).sort_values('a').reset_index(drop=True)
print(df1)
#   a  b      c
#0  1  S   Blue
#1  2  M  Black
print(df2)
#   a  b      c
#0  1  L  Green
#1  2  L   Blue

# where column c of df1 is 'Blue', replace that row with the row from df2
# (both frames share the same index, so the rows align)
df1.loc[df1['c'].isin(['Blue'])] = df2
print(df1)
#   a  b      c
#0  1  L  Green
#1  2  M  Black
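
As an alternative, here is a sketch that avoids building two intermediate frames. It assumes one particular reading of the question, which the comments point out is ambiguous: for each value of a, keep the last row whose c is not 'Blue', and fall back to the last 'Blue' row only when the group contains nothing else. The function name dedupe_prefer_non_blue, its parameters and the helper column _preferred are hypothetical, introduced only for illustration.

```python
import pandas as pd


def dedupe_prefer_non_blue(df, key="a", color_col="c", avoid="Blue"):
    """Keep one row per `key`: the last row whose `color_col` is not `avoid`,
    or the last row overall if every row in the group equals `avoid`."""
    # True for rows we would rather keep (anything that is not `avoid`).
    ranked = df.assign(_preferred=df[color_col].ne(avoid))
    # Stable sort: `avoid` rows first, preferred rows last, original order
    # preserved inside each of the two blocks.
    ranked = ranked.sort_values("_preferred", kind="mergesort")
    # After the sort, the last row per key is a preferred one when available.
    kept = ranked.drop_duplicates(subset=key, keep="last")
    return (
        kept.drop(columns="_preferred")
        .sort_values(key)
        .reset_index(drop=True)
    )


# Example data reconstructed from the question.
df = pd.DataFrame({
    "a": [1, 2, 2, 1],
    "b": ["S", "M", "L", "L"],
    "c": ["Blue", "Black", "Blue", "Green"],
})

result = dedupe_prefer_non_blue(df)
print(result)
#    a  b      c
# 0  1  L  Green
# 1  2  M  Black
```

On the example data this gives the same two rows as the answer above. The two approaches can differ on other inputs, for instance when the first row of a group is not 'Blue' but a later one is, so the rule should be checked against the intended behaviour before use.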

【Comments】: