Filtering pandas groupby using value of column (string datatype)

【Question title】: Filtering pandas groupby using value of column (string datatype)
【Posted】: 2020-01-29 23:42:30
【Question】:

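I have been working with a large genomics dataset that contains multiple reads of each sample to make sure we capture the data, but for analysis we need to reduce it to one row per sample so that we don't skew the data (counting a gene as present 6 times when it is actually one instance read multiple times). Each row has an ID, so I used the pandas df.groupby() function on ID. Here is a table to illustrate what I want to do: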

# ID   |  functionality   |   v_region_score   |   constant_region 
# -----------------------------------------------------------------
# 123  |  productive      |      820           |      NaN
#      |  unknown         |      720           |      NaN
#      |  unknown         |      720           |      IgM
# 456  |  unknown         |      690           |      NaN
#      |  unknown         |      670           |      NaN
# 789  |  productive      |      780           |      IgM
#      |  productive      |      780           |      NaN

(Edit) Here is the code for the example dataframe:

import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]], 
    columns=["ID", "functionality", "v_region_score", "constant_region"])

And this would be the final output, with the correct rows selected:

df2 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan]], 
    columns=["ID", "functionality", "v_region_score", "constant_region"])

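So after grouping, for each group: if it has a "productive" value in functionality, I want to keep that row; if it is "unknown", I take the row with the highest v_region_score; and if there are multiple "productive" rows, I take the one that has a value in its constant_region.

I have tried a few ways of accessing these values: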

id, frame = next(iter(df_grouped))

if frame["functionality"].equals("productive"):
    # do something

Looking at just one group:

x = df_grouped.get_group("1:1101:10897:22442")

for index, value in x["functionality"].items():
    print(value)

# returns the correct value and type "str"

Even putting each group into a list:

new_groups = []

for id, frame in df_grouped:
    new_groups.append(frame)

# access a specific index returns a dataframe
new_groups[30]

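The error I get from all of these is "The truth value of a Series is ambiguous". I now understand why this doesn't work, but I can't use a.any(), a.all(), or the like because of how complex the conditions are.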

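Is there any way to select a specific row in each group based on the values of its columns? Sorry for such a complicated question, and thanks in advance! :)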
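For reference, the "truth value of a Series is ambiguous" error can be reproduced with a small sketch. Comparing a column to a string gives one boolean per row, so a bare `if` cannot decide what to do with it. The `frame` below is a made-up stand-in for a single group, not the asker's real data.

```python
import pandas as pd

# Hypothetical stand-in for one group from the groupby
frame = pd.DataFrame({
    "functionality": ["productive", "unknown", "unknown"],
    "v_region_score": [820, 720, 720],
})

# Element-wise comparison: one True/False per row, not a single bool
mask = frame["functionality"] == "productive"

message = ""
try:
    if mask:  # a multi-row Series has no single truth value
        pass
except ValueError as err:
    message = str(err)

# Reduce the mask explicitly, or use it to select rows
has_productive = mask.any()
productive_rows = frame[mask]
```

Using the mask to index the frame, rather than putting it in an `if`, is what makes row-level conditions expressible without a Python-level branch.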

【Comments】:

Hi, please share a sample of your original dataframe and your expected output. Use this as a guide: stackoverflow.com/questions/20109391/…

Tags: python pandas split-apply-combine

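【Solution 1】:

You can approach your problem from a different angle:

1. Sort the values according to your conditions
2. Group by ID
3. Keep the first result for each grouped ID

For example: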

df1 = df1.sort_values(['ID','functionality','v_region_score','constant_region'], ascending=[True,True,False,True], na_position='last')

df1.groupby('ID').first().reset_index()

Out[0]:
    ID functionality  v_region_score constant_region
0  123    productive             820             IgM
1  456       unknown             690             NaN
2  789    productive             780             IgM
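One thing worth noting about the output above: `groupby(...).first()` takes the first non-null value of each column separately, which is why ID 123 shows `IgM` even though its "productive" row has no constant_region. If whole rows are wanted instead (as in the question's df2), a possible sketch is to sort the same way and then call `drop_duplicates("ID")`. The names `ordered`, `mixed`, and `whole_rows` are illustrative only. The sort also relies on "productive" coming before "unknown" alphabetically, which would not hold for other labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])

# Same ordering as the answer: best candidate row first within each ID
ordered = df1.sort_values(
    ["ID", "functionality", "v_region_score", "constant_region"],
    ascending=[True, True, False, True],
    na_position="last",
)

# first() picks the first non-null value per column, so values from
# different rows of the same ID can end up combined in one output row
mixed = ordered.groupby("ID").first().reset_index()

# drop_duplicates keeps the entire first row of each ID instead
whole_rows = ordered.drop_duplicates("ID").reset_index(drop=True)
```

Which of the two is preferable depends on whether a constant_region seen in another read of the same ID should be carried over to the selected row.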

Additionally, if you want to merge in the value from constant_region when it is null, you can use fillna(method='ffill') so that an existing value is retained:

## sorted here

df1['constant_region'] = df1.groupby('ID')['constant_region'].fillna(method='ffill')

df1
Out[1]: 
    ID functionality  v_region_score constant_region
4  123    productive             820             NaN
2  123       unknown             720             IgM
1  123       unknown             720             IgM
5  456       unknown             690             NaN
6  456       unknown             670             NaN
0  789    productive             780             IgM
3  789    productive             780             IgM

## Group by here
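Newer pandas releases deprecate the `method` argument of `fillna`, so the forward fill may need to be written differently there. A minimal sketch, assuming the same example data and sort order as above, uses the group-wise `ffill()` method instead:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([
    [789, "productive", 780, "IgM"],
    [123, "unknown", 720, np.nan],
    [123, "unknown", 720, "IgM"],
    [789, "productive", 780, np.nan],
    [123, "productive", 820, np.nan],
    [456, "unknown", 690, np.nan],
    [456, "unknown", 670, np.nan]],
    columns=["ID", "functionality", "v_region_score", "constant_region"])

# Sort first, as in the answer
df1 = df1.sort_values(
    ["ID", "functionality", "v_region_score", "constant_region"],
    ascending=[True, True, False, True],
    na_position="last",
)

# Forward-fill within each ID only; a value never leaks across IDs
df1["constant_region"] = df1.groupby("ID")["constant_region"].ffill()
```

Because a forward fill only copies values downward, the first row of ID 123 keeps its NaN here, just as in the output shown above.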

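【Comments】:

Sorting first is a really nice approach
Appreciate the comment, @Kenan.
Thank you so much! I didn't even think of doing it this way, it's very elegant :)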