【Question title】: Pandas groupby: get best zscore for counts() of each group
【Posted】: 2018-11-22 11:10:35
【Question】:

I have a pandas groupby object that returns the counts for each gene type, which looks roughly like this (column headers formatted by hand for clarity):

counts = df.groupby(["ID", "Gene"]).size()

counts
ID      Gene      Count
1_1_1   SMARCB1     1
        smad       12
1_1_10  SMARCB1     2
        smad       17
1_1_100 SMARCB1     3

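I need to get the within-group zscore and then return the gene with the highest zscore.

I tried the following, but it appears to calculate the zscore across the whole dataset and does not return the correct zscores: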

zscore = lambda x: (x - x.mean()) / x.std()
counts = df.groupby(["ID", "Match"]).size().pipe(zscore)

I tried using transform and got the same result.

I also tried:

counts = match_df.groupby(["ID", "Match"]).size().apply(zscore)

which gives me the following error:

'int' object has no attribute 'mean'

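No matter what I try, it doesn't give the correct output. The zscores for the first two rows should be [-1, 1], in which case I would return the row for 1_1_1 SMARCB1, and so on. Thanks!

Update

Thanks to @ZaxR's help and a switch to numpy's mean and standard deviation, I was able to solve this as shown below. The solution also produces a summary dataframe with the raw counts and zscores for each gene: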

import numpy as np

# df: the hit table with "ID" and "Match" columns (not shown here)
# group by id and gene match and sum hits to each molecule
counts = df.groupby(["ID", "Match"]).size()

# calculate zscore by feature for molecule counts
# features that only align to one molecule are given a score of 1
zscore = lambda x: (x - np.mean(x)) / np.std(x)
# group_keys=False keeps the (ID, Match) index unchanged on newer pandas
zscores = (counts.groupby('ID', group_keys=False).apply(zscore)
           .fillna(1).to_frame('Zscore'))

# group results back together with counts and output to
# merge with positions and save to file
zscore_df = zscores.reset_index()
zscore_df.columns = ["ID", "Match", "Zscore"]
count_df = counts.reset_index()
count_df.columns = ["ID", "Match", "Counts"]
zscore_df["Counts"] = count_df["Counts"]

# select gene with best zscore meeting threshold
max_df = zscore_df[zscore_df.groupby('ID')['Zscore'].transform('max')
                   == zscore_df['Zscore']]
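
For comparison, here is a more compact sketch of the same idea. It is not the code from the update: the hit table is toy data made up for illustration, the column names "ID", "Match", "Counts" and "Zscore" follow the code above, the z-score uses population standard deviation (ddof=0, which is what np.std defaults to), and single-gene IDs get a score of 1 as in the update. It picks the winner with idxmax instead of a boolean mask.

```python
import pandas as pd

# Toy hit table: one row per hit (illustrative data, not the asker's file).
df = pd.DataFrame({
    "ID": ["a"] * 13 + ["b"] * 3,
    "Match": ["SMARCB1"] + ["smad"] * 12 + ["SMARCB1"] * 3,
})

# Hits per (ID, Match) as a flat frame with a "Counts" column.
summary = df.groupby(["ID", "Match"]).size().rename("Counts").reset_index()


def zscore(x):
    # Population z-score (ddof=0), matching np.std's default.
    return (x - x.mean()) / x.std(ddof=0)


# Within-ID z-score. An ID with a single Match gives 0/0 = NaN, which is
# replaced by 1 as in the update above.
summary["Zscore"] = (
    summary.groupby("ID")["Counts"].transform(zscore).fillna(1.0)
)

# One row per ID: the Match with the highest z-score (ties keep the first).
best = summary.loc[summary.groupby("ID")["Zscore"].idxmax()]
print(best)
```

One difference to keep in mind: idxmax returns exactly one row per ID, while the mask in the update keeps every row tied for the maximum.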

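【Comments】:

mmm, away from my computer, but try .groupby(['FeautreID','Match'], as_index=False).size().groupby(['FeatureID','Match']).apply(zscore)
Thanks, but I need to get the counts first in order to calculate the zscores.
Yes, just realized that; try my edit (after fixing any typos that may have snuck in, I'm on my phone)
Thanks for the quick edit. I tried to get it working, but it only returns NaN.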

Tags: python pandas group-by statistics

【Answer 1】:

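The reason df.groupby(["ID", "Gene"]).size().transform(zscore) does not work is that the last group is a Series with only one item, so when you try to apply the lambda function zscore to a single [integer], you get the 'int' object has no attribute 'mean' error. Note that x.mean() behaves differently from pandas' 'mean'.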
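
To see what each call actually hands to the function, here is a minimal sketch. The counts are rebuilt from the example output in the question, and the z-score uses population standard deviation (ddof=0), an assumption chosen to match the [-1, 1] the question expects. As far as I can tell, Series.apply calls the function once per element, .pipe calls it once on the whole Series, and grouping the counts again by ID calls it once per group.

```python
import pandas as pd

# Counts rebuilt from the question's example output (illustrative data).
counts = pd.Series(
    [1, 12, 2, 17, 3],
    index=pd.MultiIndex.from_tuples(
        [("1_1_1", "SMARCB1"), ("1_1_1", "smad"),
         ("1_1_10", "SMARCB1"), ("1_1_10", "smad"),
         ("1_1_100", "SMARCB1")],
        names=["ID", "Gene"],
    ),
    name="Count",
)


def zscore(x):
    # Population z-score (ddof=0); an assumption, see the note above.
    return (x - x.mean()) / x.std(ddof=0)


# .pipe passes the whole Series, so mean and std span every ID.
whole = counts.pipe(zscore)

# Grouping the counts again by the ID level gives one z-score set per ID.
# A single-row group has std 0, so its z-score is NaN (0 / 0).
per_group = counts.groupby(level="ID").transform(zscore)

# Series.apply passes one scalar at a time, which is why a function that
# calls .mean() on its argument cannot work there.
arg_types = []
counts.apply(lambda v: arg_types.append(type(v).__name__) or v)

print(whole)
print(per_group)
print(arg_types)
```

The single-row group (1_1_100) comes out as NaN here, so it still needs an explicit rule such as fillna if it should count as a hit.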

Update

I think this should do it:

# Setup code
import pandas as pd

df = pd.DataFrame({"ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
                   "Gene": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
                   "Count": [1, 12, 2, 17, 3]})
df = df.set_index(['ID', 'Gene'])

# zscore as defined in the question (Series.std() uses ddof=1)
zscore = lambda x: (x - x.mean()) / x.std()

# Add the within-group zscore for every row (stored in 'std_dev')
# Note: .transform keeps df's index; .apply(zscore) may prepend the
# group key on newer pandas and then no longer align with df
df['std_dev'] = df.groupby('ID')['Count'].transform(zscore)

# Find the max zscore for each group and
# use that as a mask for the original df
df[df.groupby('ID')['std_dev'].transform('max') == df['std_dev']]

Out:
                  Count   std_dev
ID       Gene
1_1_1    smad     12      0.707107
1_1_10   smad     17      0.707107
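
A note on the 0.707107 values: they differ from the [-1, 1] expected in the question because of the standard deviation convention. Series.std() defaults to the sample standard deviation (ddof=1), while np.std defaults to the population standard deviation (ddof=0). A small check on the counts of the first group (the variable names are only for illustration):

```python
import numpy as np
import pandas as pd

# Counts for ID 1_1_1 from the question: SMARCB1 = 1, smad = 12.
x = pd.Series([1, 12], index=["SMARCB1", "smad"])

# pandas default: sample standard deviation (ddof=1).
z_sample = (x - x.mean()) / x.std()

# NumPy default: population standard deviation (ddof=0).
z_population = (x - x.mean()) / np.std(x.to_numpy())

print(z_sample.round(6).tolist())      # [-0.707107, 0.707107]
print(z_population.round(6).tolist())  # [-1.0, 1.0]
```

Either convention ranks the genes within a group the same way, so the choice changes the reported magnitude but not which gene has the highest zscore.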

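【Comments】:

Thanks, but as I mentioned above, transform doesn't give the right answer. It appears to compute the population zscore rather than the group zscore. I'm not sure exactly what it is doing, but the answer it returns is not correct.
Sorry, I missed that when reading. I added an explanation of why the transform approach doesn't work, as an FYI.
After some testing, my answer is less than ideal because I lose the gene information. I have also been trying to figure out how to change the zscore function so that it returns 1 instead of trying to calculate a zscore when the group size is less than 2: zscore = lambda x: stats.zscore(x) if len(x) > 1 else 1, but that doesn't work either :/
Thanks for the update. The standard deviation step only returns NaN :(
Thank you so much for the help, @ZaxR. You guided me to the solution I posted above.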