Pandas: using the previous rank's values to filter out the current rows

[Question title]: Pandas using the previous rank values to filter out current row
[Posted]: 2021-09-14 13:16:28
[Question description]:

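As the title says, I'm trying to use the rows of the previous rank to filter out rows of the current rank.

Here is an example of my starting df: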

import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  2.95
# 4     3  4  5  3.00
# 5     3  2  5  6.00

And this is the output I want:

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
}, index=[0, 1, 2, 5])
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 5     3  2  5  6.0

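Basically, what I want is this: if the previous rank has any row whose x and y are within ±1 of the current row's, and whose z is within ±0.1, the current row should be removed.

So, given the rank 1 rows, I want to drop any rank 2 row that has x between -1 and 1, y between -1 and 1, and z between 0.9 and 1.1, or x between 2 and 4, y between 3 and 5, and z between 2.9 and 3.1.

Thanks in advance for any help!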
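To make the rule concrete, here is a minimal sketch of the per-row test described above. The helper name `is_near` and the `TOLERANCES` dict are illustrative only and do not appear in the original post; the tolerances are assumed to be inclusive.

```python
# Hypothetical helper (not from the original post): True when `row` lies
# within the stated tolerances of `prev_row` on every compared column.
TOLERANCES = {'x': 1, 'y': 1, 'z': 0.1}  # assumed inclusive bounds

def is_near(prev_row, row, tol=TOLERANCES):
    return all(abs(prev_row[c] - row[c]) <= t for c, t in tol.items())

# Rank 1 row (0, 0, 1.0) vs. rank 2 row (0, 0, 1.2): z differs by 0.2, so no match.
first_match = is_near({'x': 0, 'y': 0, 'z': 1.0}, {'x': 0, 'y': 0, 'z': 1.2})

# Rank 1 row (3, 4, 3.0) vs. rank 2 row (3, 4, 2.95): all columns within tolerance.
second_match = is_near({'x': 3, 'y': 4, 'z': 3.0}, {'x': 3, 'y': 4, 'z': 2.95})

print(first_match, second_match)
```

A current-rank row is dropped as soon as this test succeeds against at least one row of the previous rank.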

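[Question comments]:

Shouldn't the last row be kept? The condition on z is not satisfied.
You're right, I forgot to add it.
OK, I think my solution should work for you, let me know.

Tags: python python-3.x pandas

[Solution 1]:

This is a bit tricky because you need access to the previous group. You can first compute the groups with groupby, then iterate over them and perform the check with a custom function: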

def check_previous_group(rank, d, groups):
    if rank-1 not in groups.groups:
        # check whether a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset 
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1 for x/y/z
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,0.1]).all(axis=1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

Output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
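
As an alternative sketch (not part of the original answer), the same check can be written with NumPy broadcasting instead of a row-wise apply. The names `cols`, `tol`, `drop` and `by_rank` are illustrative. Like the answer above, this compares each rank against all rows of the previous rank in the original df, including rows that were themselves flagged for removal.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
})

cols = ['x', 'y', 'z']
tol = np.array([1, 1, 0.1])  # per-column tolerances for x, y, z

# Map each rank to its sub-frame so the previous rank is a plain lookup.
by_rank = {rank: d for rank, d in df.groupby('rank')}

drop = pd.Series(False, index=df.index)
for rank, d in by_rank.items():
    prev = by_rank.get(rank - 1)
    if prev is None:
        continue  # no previous rank: nothing to compare against
    # Pairwise absolute differences, shape (len(d), len(prev), 3).
    diff = np.abs(d[cols].to_numpy()[:, None, :] - prev[cols].to_numpy()[None, :, :])
    # Flag a row if every column is within tolerance for at least one previous row.
    drop.loc[d.index] = (diff <= tol).all(axis=2).any(axis=1)

result = df[~drop]
print(result)
```

This builds one array per pair of consecutive ranks, so memory grows with the product of the two group sizes. For small groups that is usually fine.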

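[Comments]:

Could you explain what you are doing in your function? I'm a bit lost haha
@mike_gundy123 I commented the code, let me know if you have any questions
OK, thanks for the explanation, that definitely helps! One last question: my real dataset has extra columns that don't matter for the comparison. What should I do about those in the function? Do I just keep leaving them out of the []?
Since my code only computes a mask, the extra columns should not affect the process
@mike_gundy123 So, did it work for you? Keep me posted ;)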
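
On the extra-columns question above: a hedged variant of the answer's function that selects only the compared columns before subtracting, so that an extra column (including a non-numeric one) never enters the arithmetic. The `label` column and the `COLS`/`TOL` constants are hypothetical and added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 2.95, 3, 6],
    'label': list('abcdef'),  # hypothetical extra column, not compared
})

COLS = ['x', 'y', 'z']   # columns that take part in the comparison
TOL = [1, 1, 0.1]        # matching tolerances, in the same order

def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # No previous group: keep every row of this group.
        return pd.Series(False, index=d.index)
    # Restrict both sides to the compared columns before subtracting.
    d_prev = groups.get_group(rank - 1)[COLS]
    return d[COLS].apply(
        lambda s: (d_prev - s).abs().le(TOL).all(axis=1).any(), axis=1
    )

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank, d in groups])
result = df[~mask]
print(result)
```

The mask is still indexed like df, so the filtered result keeps every original column, `label` included.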