Pandas: create a mask based on multiple thresholds

【Question title】: Pandas create a mask based on multiple thresholds
【Posted】: 2022-01-19 14:03:07
【Question description】:

Question:

Suppose we have a pandas DataFrame:

import pandas as pd

d = {'A': [0.1, 0.4, 0.2, 0.2],
     'B': [0.7, 0.3, 0.2, 0.9],
     'Z': [0.5, 0.3, 0.4, 0.6],
     'sth': ['abc', 'something', 'unimportant', 'x']}
df = pd.DataFrame(data=d)
df

	A	B	Z	sth
0	0.1	0.7	0.5	"abc"
1	0.4	0.3	0.3	"something"
2	0.2	0.2	0.4	"unimportant"
3	0.2	0.9	0.6	"x"

thresholds = {'A': 0.5, 'B':0.8, 'Z': 0.3}

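I want a boolean mask with one entry per row. An entry is True when the highest value in that row is below the threshold defined for the class (column) that holds it.

For the given example, the correct mask is: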

[ True, True, False, False]

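Explanation:

Row 0: first find the maximum value in the row, max([0.1, 0.7, 0.5]) = 0.7. Note that 0.7 is in column B. Compare this value with the threshold for column B (0.8). Since 0.7 < 0.8, the result is True.
Row 1: the highest value is in column A, because max([0.4, 0.3, 0.3]) = 0.4. The threshold for class A is 0.5, and 0.4 < 0.5, so the result is True.
Row 2: the highest value is in column Z, because max([0.2, 0.2, 0.4]) = 0.4. The threshold for class Z is 0.3, and 0.4 > 0.3, so the result is False.
Row 3: the highest value is in column B, because max([0.2, 0.9, 0.6]) = 0.9. The threshold for class B is 0.8, and 0.9 > 0.8, so the result is False.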
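The row-wise rule described above can also be sketched without a Python-level loop. This is a minimal sketch, assuming every key in thresholds names a numeric column of df; the names cols, top_col, top_val and mask are illustrative and not part of the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0.1, 0.4, 0.2, 0.2],
    'B': [0.7, 0.3, 0.2, 0.9],
    'Z': [0.5, 0.3, 0.4, 0.6],
    'sth': ['abc', 'something', 'unimportant', 'x'],
})
thresholds = {'A': 0.5, 'B': 0.8, 'Z': 0.3}

# Restrict to the columns that have a threshold (assumed numeric).
cols = list(thresholds)
values = df[cols]

top_col = values.idxmax(axis=1)  # label of the column holding each row's maximum
top_val = values.max(axis=1)     # the row maximum itself

# Look up the threshold of the winning column and compare.
mask = top_val < top_col.map(thresholds)

print(mask.tolist())  # [True, True, False, False]
```

If two columns tie for the row maximum, idxmax returns the first of them, so the threshold of the leftmost tied column is the one used.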

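【Question comments】:

In your sample data, row 0 has a Z value of 0.5. Isn't that above the Z threshold, which would make the expected result [ False, True, False]?
Good point. No, because I evaluate each row, not each column. I can see that the wording of my question may be a bit confusing. ;S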

Tags: python pandas dataframe

【Solution 1】:

You can use apply with a lambda function to flag the rows whose maximum value stays below its threshold.

Try this:

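def within_threshold(x, thresh):
    # x is one row; find the thresholded column that holds the row's maximum
    key = pd.to_numeric(x[thresh.keys()]).idxmax(axis=0)
    # True when that maximum is below the threshold of its own column
    return x[key] < thresh[key]

df["within_threshold"] = df.apply(lambda x: within_threshold(x, thresholds), axis=1)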
df

Full code snippet:

import pandas as pd

thresholds = {'A': 0.5, 'B': 0.8, 'Z': 0.3}

d = {'A': [0.1, 0.4, 0.2, 0.2],
     'B': [0.7, 0.3, 0.2, 0.9],
     'Z': [0.5, 0.3, 0.4, 0.2],
     'sth': ["a", "b", "c", "d"]}
df = pd.DataFrame(data=d)

def within_threshold(x, thresh):
    # x is one row; find the thresholded column that holds the row's maximum
    key = pd.to_numeric(x[thresh.keys()]).idxmax(axis=0)
    # True when that maximum is below the threshold of its own column
    return x[key] < thresh[key]

df["within_threshold"] = df.apply(lambda x: within_threshold(x, thresholds), axis=1)
df

This should give you:

    A   B   Z   sth within_threshold
0   0.1 0.7 0.5 a   True
1   0.4 0.3 0.3 b   True
2   0.2 0.2 0.4 c   False
3   0.2 0.9 0.2 d   False

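Also, in your sample data, row 0 has a Z value of 0.5, which is above the Z threshold.

Edit by OP

This answer led me to the solution, so I edited it, and it now solves the problem.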

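【Comments】:

I know I did not include a comprehensive example. I have fixed that now. In row 0 you only look at the maximum value (0.7) and then compare it with the threshold of its class (B in this case).
The line key = x[thresh.keys()].idxmax(axis=0) throws an error when I run it: TypeError: reduction operation 'argmax' not allowed for this dtype. I think we can reduce the example further by removing the if and just returning the expression. I added pd.to_numeric(... to fix the error.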
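A likely cause of the TypeError mentioned above is that df.apply(..., axis=1) hands each row over as a Series, and a row that mixes floats with the string column sth has object dtype. One alternative to pd.to_numeric, sketched here under the assumption that all thresholded columns are already numeric, is to select those columns before calling apply so that each row is float64. The variable name numeric is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0.1, 0.4, 0.2, 0.2],
    'B': [0.7, 0.3, 0.2, 0.9],
    'Z': [0.5, 0.3, 0.4, 0.6],
    'sth': ['abc', 'something', 'unimportant', 'x'],
})
thresholds = {'A': 0.5, 'B': 0.8, 'Z': 0.3}

def within_threshold(row, thresh):
    # row holds only numeric values here, so idxmax needs no conversion
    key = row.idxmax()
    return row[key] < thresh[key]

# Keep only the thresholded columns; 'sth' is left out of the row-wise pass.
numeric = df[list(thresholds)]

print(df.iloc[0].dtype)       # object: the row mixes floats and a string
print(numeric.iloc[0].dtype)  # float64

# Extra keyword arguments to apply are forwarded to the function.
df["within_threshold"] = numeric.apply(within_threshold, axis=1, thresh=thresholds)
print(df["within_threshold"].tolist())  # [True, True, False, False]
```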

【Solution 2】:

A list comprehension can do the job directly:

[df[col].max() < thresholds[col] for col in thresholds.keys()]

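However, I would not collect the result in a list. I would use a dictionary instead, with the column names as keys and the resulting booleans as values. Depending on the DataFrame you are working with, indexing the result by integer position can be ambiguous.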
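The dictionary form suggested above might look like the following minimal sketch. The name below_threshold is illustrative. Note that this compares each column's maximum with its threshold, which gives one boolean per column and is therefore a different result from the per-row mask asked for in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0.1, 0.4, 0.2, 0.2],
    'B': [0.7, 0.3, 0.2, 0.9],
    'Z': [0.5, 0.3, 0.4, 0.6],
    'sth': ['abc', 'something', 'unimportant', 'x'],
})
thresholds = {'A': 0.5, 'B': 0.8, 'Z': 0.3}

# One entry per thresholded column: is the column's maximum below its threshold?
below_threshold = {
    col: bool(df[col].max() < limit)
    for col, limit in thresholds.items()
}

print(below_threshold)  # {'A': True, 'B': False, 'Z': False}
```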

【Comments】:
