Specific complicated filtering of Pandas dataframe rows (answers)

【Question title】: Specific complicated filtering of Pandas dataframe rows
【Posted】: 2021-07-28 10:31:04
【Question】:

The data has many columns, but the columns in question look like this:

 MR     Version
GB1       Package
GB5       Package
GB9       3.5
GB5       3.3
GB1       Package
GB9       1.5
GB359     9.1
GB1       Package
GB99      5.5
...

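MR (model) names repeat, and the value Package also repeats in the Version column. I need to first select all rows where Version == Package,

then take their MR model name, for example GB5,
then find all other rows with the same MR model name, and
finally check whether the Version value of those other rows (with the same MR model name) differs from Package (!= Package). Models that have such rows I need to classify as good, and models that do not as bad.

For example, in the sample data above the MR model GB5 has both a Package row and a non-Package row, so this model is good, whereas model GB1 has only Package values in the Version column, so it is bad.

MRs that have only numeric values in the Version column, such as GB9, are of no interest for this task.

Usually these entries are adjacent to each other and there are usually two per model, so I wrote a loop (shown below) that solves the problem by taking the data frame two rows at a time. I have now found cases where the entries are not adjacent, however, so I need a better solution, and it escapes me. Any help is much appreciated, thank you all. In my code below MR is replaced by Author, but that does not matter.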

good_aut = []
bad_aut = []
for i, g in merged_df.groupby(merged_df.index // 2): # takes every two rows
    if g.iloc[0]['Version'] == 'Package':            # if row 1 is a package citation
        if g.iloc[0]['Author'] == g.iloc[1]['Author']: # check if row 1 and 2 authors match
            if g.iloc[1]['Version'] != 'Package':       # finally check if row 2 citation is not package, hence it is GAP citation
                print(g)
                good_aut.append(g.iloc[0]['Author']) # if all conditions are met we add this author to the good list, once for every occurrence
            else:
                bad_aut.append(g.iloc[0]['Author'])
        else:
            bad_aut.append(g.iloc[0]['Author'])
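To see why the pairwise loop above misses non-adjacent rows, the sample table can be rebuilt as a small DataFrame. This is only a sketch: the frame is named `df` here (the question's own frame is `merged_df`), the `pair` column is added purely for illustration, and the rows hidden behind `...` are not reproduced.

```python
import pandas as pd

# Sample rows copied from the question; the trailing "..." rows are omitted.
df = pd.DataFrame({
    "MR": ["GB1", "GB5", "GB9", "GB5", "GB1", "GB9", "GB359", "GB1", "GB99"],
    "Version": ["Package", "Package", "3.5", "3.3", "Package", "1.5", "9.1", "Package", "5.5"],
})

# Same pairing rule as the loop: rows 0-1 form pair 0, rows 2-3 form pair 1, ...
df["pair"] = df.index // 2

# The two GB5 rows sit at positions 1 and 3, so they end up in different pairs.
gb5_pairs = df.loc[df["MR"] == "GB5", "pair"].tolist()
print(gb5_pairs)  # [0, 1]
```

Because the Package row and the 3.3 row of GB5 land in different pairs, the loop never compares them with each other, which is why grouping by model name is the safer approach.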

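【Comments on the question】:

Your example is not entirely clear to me. Is GB9 good or bad? Also, you did not specify your expected output. Can you check whether my answer is what you want?
Hi, thanks for looking into this. GB9 is neither good nor bad; I only need the models that have a Package cell in the Version column, so I think option 3 is the best. Thank you very much!
OK, great! If you only want to keep good/bad, you can return numpy.nan instead of other in the function and use dropna() on the output to remove the unwanted rows.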
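A rough sketch of that last suggestion, assuming the sample data from the question and a variant of the three-way `good_or_bad` function from the solution below. The frame name `df` is an assumption, and `numpy.nan` is used because the `numpy.NaN` alias no longer exists in NumPy 2.0 and later.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (the "..." rows are omitted).
df = pd.DataFrame({
    "MR": ["GB1", "GB5", "GB9", "GB5", "GB1", "GB9", "GB359", "GB1", "GB99"],
    "Version": ["Package", "Package", "3.5", "3.3", "Package", "1.5", "9.1", "Package", "5.5"],
})

def good_or_bad(s):
    values = set(s)
    if "Package" not in values:
        return np.nan  # model has no Package row at all: not of interest
    # Package is present; "good" only if some other value exists as well
    has_other = len(values.difference(["Package"])) > 0
    return "good" if has_other else "bad"

# dropna() removes the models that were marked as not of interest
labels = df.groupby("MR")["Version"].apply(good_or_bad).dropna()
print(labels.to_dict())  # {'GB1': 'bad', 'GB5': 'good'}
```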

Tags: python pandas dataframe for-loop jupyter-notebook

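【Solution 1】:

The question is not entirely clear. Do you require Package to be present alongside other values?

If yes

You can group by MR and check whether Package is present together with other values: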

def good_or_bad(s):
    s=set(s)
    if 'Package' in s and len(s.difference(['Package']))>0:
        return 'good'
    return 'bad'
df.groupby('MR')['Version'].apply(good_or_bad)

Output:

MR
GB1       bad
GB359     bad
GB5      good
GB9       bad
GB99      bad
Name: Version, dtype: object

If no

You can group by MR and check whether any value other than Package is present:

(df.groupby('MR')['Version']
 .apply(lambda s: len(set(s).difference(['Package']))>0)
 .map({True: 'good', False: 'bad'})
)

Output:

MR
GB1       bad
GB359    good
GB5      good
GB9      good
GB99     good
Name: Version, dtype: object

I want all three possibilities

def good_or_bad(s):
    s=set(s)
    if len(s.difference(['Package']))>0:
        if 'Package' in s:
            return 'good'
        return 'other'
    return 'bad'
df.groupby('MR')['Version'].apply(good_or_bad)

Output:

MR
GB1        bad
GB359    other
GB5       good
GB9      other
GB99     other
Name: Version, dtype: object
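As an alternative to the set-based function, the same three labels can probably be obtained without a Python-level function by combining two grouped boolean checks with `numpy.select`. This is not part of the original answer; it is a sketch that assumes the sample data from the question, and the names `is_pkg`, `has_pkg`, `has_other` and `labels` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (the "..." rows are omitted).
df = pd.DataFrame({
    "MR": ["GB1", "GB5", "GB9", "GB5", "GB1", "GB9", "GB359", "GB1", "GB99"],
    "Version": ["Package", "Package", "3.5", "3.3", "Package", "1.5", "9.1", "Package", "5.5"],
})

is_pkg = df["Version"].eq("Package")
has_pkg = is_pkg.groupby(df["MR"]).any()       # model has at least one Package row
has_other = (~is_pkg).groupby(df["MR"]).any()  # model has at least one non-Package row

# Conditions are tested in order: both kinds -> good, only Package -> bad, else other
labels = pd.Series(
    np.select(
        [(has_pkg & has_other).to_numpy(), has_pkg.to_numpy()],
        ["good", "bad"],
        default="other",
    ),
    index=has_pkg.index,
)

# Lists comparable to the question's good_aut / bad_aut, one entry per model
good_aut = labels[labels == "good"].index.tolist()
bad_aut = labels[labels == "bad"].index.tolist()
print(good_aut, bad_aut)  # ['GB5'] ['GB1']
```

Unlike the loop in the question, each model appears only once in these lists, no matter how many rows it has.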

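【Comments】:

Perfect answer covering all possibilities, thank you. I need option 3, which covers all of them. I think I only need to swap the positions of other and bad.
Sorry @mozway, I just need to clarify how it works, please.
def good_or_bad(s):
    s=set(s) # what does that do ? :)
    if len(s.difference(['Package']))>0: # does that check for Package first
        if 'Package' in s: # does that check for package second time once model having package is identified
            return 'good'
        return 'other'
    return 'bad'
df.groupby('MR')['Version'].apply(good_or_bad)
I think I get it now: no swap is needed, it is correct as is. I just need to learn about sets.
Here is the python documentation on sets and a reference to the mathematical concept of set. Sets are very powerful for computing unions, intersections, exclusions and similar operations on groups of elements.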
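To illustrate what `set(s)` and `s.difference(['Package'])` do inside `good_or_bad`, here is a small standalone sketch in plain Python. The `versions` list is made-up illustration data modelled on a model with two Package rows and one 3.3 row.

```python
versions = ["Package", "3.3", "Package"]

s = set(versions)                    # duplicates collapse: {"Package", "3.3"}
others = s.difference(["Package"])   # everything except "Package": {"3.3"}

# Both checks pass, which is the "good" case in good_or_bad
print("Package" in s, len(others) > 0)  # True True

# A model with only Package rows leaves nothing after the difference
only_pkg = set(["Package", "Package", "Package"])
print(only_pkg.difference(["Package"]))  # set()
```

So `len(s.difference(['Package'])) > 0` asks whether any non-Package value exists, and `'Package' in s` separately asks whether a Package value exists; the two checks do not overlap.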