【Question title】:How to perform drop_duplicates with multiple conditions in a pandas DataFrame
【Posted】:2018-06-28 08:24:19
【Question】:

I have a DataFrame df:

    Sr.No   Name    Class   Data
0   1       Sri     1       sri is a good player
1   ''      Sri     2       sri is good in cricket
2   ''      Sri     3       sri went out
3   2       Ram     1       Ram is a good player
4   ''      Ram     2       sri is good in cricket
5   ''      Ram     3       Ram went out
6   3       Sri     1       sri is a good player
7   ''      Sri     2       sri is good in cricket
8   ''      Sri     3       sri went out
9   4       Sri     1       sri is a good player
10  ''      Sri     2       sri is good in cricket
11  ''      Sri     3       sri went out
12  ''      Sri     4       sri came back

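I am trying to drop duplicates based on ["Name", "Class", "Data"]. The goal is to drop duplicates based on all the sentences under each Sr.No: a block of rows belonging to one Sr.No should be removed only if the whole block repeats an earlier one.

My expected output is: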

out_df


    Sr.No   Name    Class   Data
0   1       Sri     1       sri is a good player
1           Sri     2       sri is good in cricket
2           Sri     3       sri went out
3   2       Ram     1       Ram is a good player
4           Ram     2       sri is good in cricket
5           Ram     3       Ram went out
9   4       Sri     1       sri is a good player
10          Sri     2       sri is good in cricket
11          Sri     3       sri went out
12          Sri     4       sri came back
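
To make the requirement concrete, here is a minimal sketch of why a plain drop_duplicates on the three columns is not enough. The DataFrame is a reconstruction of the table above (the question's own to_dict() output is not shown here), and blank Sr.No cells are assumed to be empty strings.

```python
import pandas as pd

# Assumed reconstruction of the question's table; blank Sr.No cells
# are represented as empty strings (an assumption, not the original dict).
base = ["sri is a good player", "sri is good in cricket", "sri went out"]
df = pd.DataFrame({
    "Sr.No": [1, "", "", 2, "", "", 3, "", "", 4, "", "", ""],
    "Name": ["Sri"] * 3 + ["Ram"] * 3 + ["Sri"] * 7,
    "Class": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "Data": base
            + ["Ram is a good player", "sri is good in cricket", "Ram went out"]
            + base
            + base + ["sri came back"],
})

# Row-wise de-duplication ignores which Sr.No block a row belongs to.
naive = df.drop_duplicates(["Name", "Class", "Data"])
print(naive.index.tolist())  # [0, 1, 2, 3, 4, 5, 12]
```

Rows 9 to 11 disappear because each one matches a row in the first block, even though the Sr.No 4 block as a whole is different (it has an extra sentence). The expected output keeps them.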

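【Question comments】:

  • Could you print df.to_dict() and paste the output into your question? Your DataFrame is hard to reproduce.
  • Your to_dict output differs from the DataFrame you posted. Please make them consistent so that your expected output is clear ;)
  • @cᴏʟᴅsᴘᴇᴇᴅ, I have edited my question with the correct df.to_dict(), please check.
  • I don't follow; when I run pd.DataFrame(my_dict) it correctly gives me my actual df.
  • Never mind, I misread the question at first.

Tags: python pandas dataframe group-by duplicates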


【Solution 1】:

Use a groupby + transform operation to create a dummy column.

v = df.groupby(df['Class'].diff().le(0).cumsum())['Data'].transform(' '.join)

Or, equivalently,

v = df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join) 

This dummy column then becomes one of the keys used to decide which rows to drop.

m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"])    
df[~m]

    Class                    Data Name Sr.No
0       1   sri is  a good player  Sri     1
1       2  sri is good in cricket  Sri      
2       3            sri went out  Sri      
3       1    Ram is a good player  Ram     2
4       2  sri is good in cricket  Ram      
5       3            Ram went out  Ram      
9       1   sri is  a good player  Sri     4
10      2  sri is good in cricket  Sri      
11      3            sri went out  Sri      
12      4           sri came back  Sri      

Details

Form groups from runs of increasing Class values; a new group starts wherever Class does not increase:

i = df['Class'].diff().le(0).cumsum()
i

0     0
1     0
2     0
3     1
4     1
5     1
6     2
7     2
8     2
9     3
10    3
11    3
12    3
Name: Class, dtype: int64
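
The chained expression can be unpacked one step at a time. This is a small sketch on the Class column alone; the intermediate names (step, starts, block_id) are illustrative and not part of the answer.

```python
import pandas as pd

cls = pd.Series([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4], name="Class")

# Difference from the previous row; the first row has no predecessor, so it is NaN.
step = cls.diff()

# True where Class did not increase, i.e. where a new block begins.
# NaN <= 0 evaluates to False, so the first row stays in block 0.
starts = step.le(0)

# A running count of block starts gives every row its block number.
block_id = starts.cumsum()

print(block_id.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
```

This relies on Class increasing within each Sr.No block and dropping back at the start of the next one, as it does in the sample data.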

Group by it, and transform Data with a str.join operation:

v = df.groupby(i)['Data'].transform(' '.join)

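This is just a column of concatenated strings: every row carries the joined text of its whole block. Finally, assign the dummy column and call duplicated: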

m = df.assign(Foo=v).duplicated(["Name", "Class", "Data", "Foo"]) 
m

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11    False
12    False
dtype: bool
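
Putting the steps together, here is a self-contained sketch of the approach. The DataFrame is an assumed reconstruction of the question's table (blank Sr.No cells as empty strings), and the helper name drop_duplicate_blocks is hypothetical.

```python
import pandas as pd

def drop_duplicate_blocks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose whole block repeats an earlier block (hypothetical helper)."""
    # Block number: increments whenever Class fails to increase.
    block_id = df["Class"].diff().le(0).cumsum()
    # Every row gets the joined text of all sentences in its block.
    joined = df.groupby(block_id)["Data"].transform(" ".join)
    # A row is a duplicate only if its own values AND its block's text repeat.
    mask = df.assign(Foo=joined).duplicated(["Name", "Class", "Data", "Foo"])
    return df[~mask]

# Assumed reconstruction of the sample data.
base = ["sri is a good player", "sri is good in cricket", "sri went out"]
df = pd.DataFrame({
    "Sr.No": [1, "", "", 2, "", "", 3, "", "", 4, "", "", ""],
    "Name": ["Sri"] * 3 + ["Ram"] * 3 + ["Sri"] * 7,
    "Class": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
    "Data": base
            + ["Ram is a good player", "sri is good in cricket", "Ram went out"]
            + base
            + base + ["sri came back"],
})

result = drop_duplicate_blocks(df)
print(result.index.tolist())  # [0, 1, 2, 3, 4, 5, 9, 10, 11, 12]
```

Only the Sr.No 3 block (rows 6 to 8) is removed. The Sr.No 4 block survives because its joined text ends with the extra sentence, so its Foo value differs from that of the first block.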

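【Discussion】:

  • Works great, thanks coldspeed and @jezrael. Coldspeed, may I ask how many years you have been working in data science / pandas?
  • @pyd You're welcome. As for your question, I have been working with pandas for about five and a half months.
  • In this line, df.groupby(df.Class.diff().le(0).cumsum()).Data.transform(' '.join), is "Data" my column or a keyword?
  • @pyd A column. I have edited the answer for clarity. Also, the answer is no longer marked as accepted. Did it not work?
  • @pyd Unless I'm missing something, you can also use: df['Data'].groupby(df['Class'].diff().le(0).cumsum()).transform(' '.join)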