Looping through df and using a conditional to remove unneeded rows

【Question title】: Looping through df and using a conditional to remove unneeded rows
【Posted】: 2022-01-15 13:07:38
【Question description】:

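Apologies if this is a duplicate question; I have had no luck solving my problem with the guidance in semi-similar posts.

I have a df with the columns ID and current_stage, using Python.

I want to go through and find duplicate values in ID and, among those duplicates, check whether they have a current_stage of 1 or 2. If a duplicated ID only has stages 1 or 2, I need just one record for that ID. If any instance of the duplicated ID has a stage of 3 or 4, I want to keep all records for that ID.

Thanks to the Stack Overflow gods for any help!

Thanks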

【Question comments】:

Please post a sample of the data.
Here is a guide on how to make a sample so that the question is easier to answer: stackoverflow.com/questions/20109391/…

Tags: python pandas numpy jupyter-notebook

【Solution 1】:

I might have a workaround..

Split the data into two dataframes, drop the duplicates from one of them, and concatenate them again, like this:

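import pandas as pd

# Rows in stage 1 or 2: keep only the first row per ID
df1 = df[df['current_stage'].isin([1, 2])].drop_duplicates(subset=['ID'])
# Rows in any other stage are kept as they are
df2 = df[~df['current_stage'].isin([1, 2])]
df = pd.concat([df1, df2])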
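
To make the effect of this split-and-concatenate idea easier to see, here is a small self-contained sketch that runs it on the sample frame given in Solution 2. The names `early`, `deduped` and `result` are my own, and the final `sort_index()` is an addition that only restores the original row order.

```python
import pandas as pd

# Sample frame taken from Solution 2 in this thread
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3],
    "current_stage": [1, 1, 2, 3, 3, 4, 4, 1, 2, 2, 4],
})

# Boolean mask: True where the row is in stage 1 or 2
early = df["current_stage"].isin([1, 2])

# Among stage 1/2 rows keep the first row per ID; leave all other rows alone
deduped = df[early].drop_duplicates(subset=["ID"])
result = pd.concat([deduped, df[~early]]).sort_index()

print(result)
# Index labels kept: 0, 3, 4, 5, 6, 7, 10
```

Note that this collapses the stage 1/2 rows of an ID even when that ID also has a row in stage 3 or 4, which may or may not be what the question is asking for.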

【Comments】:

【Solution 2】:

Given the dataframe below:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3], 'current_stage': [1, 1, 2, 3, 3, 4, 4, 1, 2, 2, 4]})

    ID  current_stage
0    1              1
1    1              1
2    1              2
3    1              3
4    1              3
5    2              4
6    2              4
7    2              1
8    2              2
9    2              2
10   3              4

You can do this:

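# Keep only IDs that occur more than once
dupes = df[df.groupby('ID')['current_stage'].transform('size') > 1]
# Per stage: one row for stages 1 and 2, every row for the other stages
out = dupes.groupby('current_stage').apply(lambda x: x.iloc[0].to_frame().T if x.iloc[0]['current_stage'] in [1, 2] else x).reset_index(drop=True)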

Output:

   ID  current_stage
0   1              1
1   1              2
2   1              3
3   1              3
4   2              4
5   2              4
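
For comparison, here is a rough sketch that applies the rule per ID, as I read the question: keep every row of an ID that has at least one stage of 3 or 4, otherwise keep only the first row of that ID. The sample data below is hypothetical (not from the question), and `has_late` and `first_of_id` are names I made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: ID 1 and ID 3 only ever reach stages 1 or 2
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 4],
    "current_stage": [1, 2, 1, 1, 3, 2, 2, 4],
})

# True on every row whose ID has at least one stage of 3 or 4
has_late = df["current_stage"].isin([3, 4]).groupby(df["ID"]).transform("any")

# True on the first row of each ID
first_of_id = ~df.duplicated(subset="ID")

# Keep all rows of "late" IDs, and a single row for every other ID
result = df[has_late | first_of_id]

print(result)
# Index labels kept: 0, 3, 4, 5, 7
```

This avoids an explicit loop and leaves IDs that appear only once untouched. Whether the first row is the right one to keep for the stage 1/2 IDs is an assumption, since the question does not say.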

【Comments】: