Python pandas: keep row with highest column value

[Question title]: Python pandas: keep row with highest column value
[Posted]: 2018-05-11 01:37:02
[Question description]:

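Suppose I have a DataFrame of student exam scores, where each student studies different subjects. Each student can take the test for each subject multiple times, and only the highest score (out of 100) is kept. For example, say I have a DataFrame containing all test records: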

| student_name | subject | test_number | score | 
|--------------|---------|-------------|-------|
| sarah        | maths   | test1       | 78    |
| sarah        | maths   | test2       | 71    |
| sarah        | maths   | test3       | 83    |
| sarah        | physics | test1       | 91    |
| sarah        | physics | test2       | 97    |
| sarah        | history | test1       | 83    |
| sarah        | history | test2       | 87    |
| joan         | maths   | test1       | 83    |
| joan         | maths   | test2       | 88    |

(1) How do I keep only the test record (row) with the highest score for each student and subject? That is:

| student_name | subject | test_number | score | 
|--------------|---------|-------------|-------|
| sarah        | maths   | test3       | 83    |
| sarah        | physics | test2       | 97    |
| sarah        | history | test2       | 87    |
| joan         | maths   | test2       | 88    |

(2) How do I keep the average of all tests taken by the same student in the same subject? That is:

| student_name | subject | test_number | ave_score | 
|--------------|---------|-------------|-----------|
| sarah        | maths   | na          | 77.333    |
| sarah        | physics | na          | 94        |
| sarah        | history | na          | 85        |
| joan         | maths   | na          | 85.5      |

I have tried various combinations of df.sort_values() and df.drop_duplicates(subset=..., keep=...), to no avail.

Actual data

| query | target   | pct-similarity | p-val | aln_length | bit-score |
|-------|----------|----------------|-------|------------|-----------|
| EV239 | B/Fw6/623 | 99.23         | 0.966 |  832       | 356       |
| EV239 | B/Fw6/623 | 97.34         | 0.982 |  1022      | 739       |
| EV239 | MMS-alpha | 92.23         | 0.997 |  838       | 384       |
| EV239 | MMS-alpha | 93.49         | 0.993 |  1402      | 829       |
| EV380 | B/Fw6/623 | 94.32         | 0.951 |  324       | 423       |
| EV380 | B/Fw6/623 | 95.27         | 0.932 |  1245      | 938       |
| EV380 | MMS-alpha | 99.23         | 0.927 |  723       | 522       |
| EV380 | MMS-alpha | 99.15         | 0.903 |  948       | 1092      |

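After the aggregation function has been applied, only the pct-similarity column is of interest.

(1) Drop duplicate query+target rows by selecting the one with the largest aln_length. Keep the pct-similarity value belonging to the row with the largest aln_length.

(2) Aggregate duplicate query+target rows by selecting the row with the largest aln_length and computing the mean pct-similarity over that group of duplicate rows. The other numeric columns are not needed and will eventually be dropped, so I don't really care which aggregation function (max or mean) is applied to them.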
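
One possible reading of the two tasks above can be sketched as follows. This is only an illustration, not something taken from the answers: the DataFrame is typed in by hand from the table in the question, and the names `keys`, `longest`, `with_mean` and the column `mean_pct_similarity` are made up for the example. It assumes task (2) means "keep the row with the largest aln_length and attach the group's mean pct-similarity".

```python
import pandas as pd

# The "actual data" table from the question, typed in by hand.
df = pd.DataFrame({
    "query": ["EV239"] * 4 + ["EV380"] * 4,
    "target": ["B/Fw6/623", "B/Fw6/623", "MMS-alpha", "MMS-alpha"] * 2,
    "pct-similarity": [99.23, 97.34, 92.23, 93.49, 94.32, 95.27, 99.23, 99.15],
    "p-val": [0.966, 0.982, 0.997, 0.993, 0.951, 0.932, 0.927, 0.903],
    "aln_length": [832, 1022, 838, 1402, 324, 1245, 723, 948],
    "bit-score": [356, 739, 384, 829, 423, 938, 522, 1092],
})

keys = ["query", "target"]  # hypothetical helper name
grouped = df.groupby(keys)

# (1) idxmax gives, per group, the index label of the row with the largest
# aln_length; .loc then selects those complete rows.
longest = df.loc[grouped["aln_length"].idxmax()]

# (2) Same rows, plus the mean pct-similarity of the whole group.
# transform("mean") broadcasts each group's mean back onto every original row,
# and assign() aligns it with the selected rows by index label.
with_mean = longest.assign(
    mean_pct_similarity=grouped["pct-similarity"].transform("mean")
)

print(with_mean[keys + ["pct-similarity", "mean_pct_similarity"]])
```

If two rows in a group share the same largest aln_length, idxmax keeps the first one it encounters; a different tie-break would need an explicit sort beforehand.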

[Question discussion]:

Tags: python pandas

[Solution 1]:

Simply use max() on each student/subject group:

df.groupby(["student_name","subject"], as_index=False).max()


    student_name    subject         test_number     score
0   joan            maths           test2           88
1   sarah           history         test2           87
2   sarah           maths           test3           83
3   sarah           physics         test2           97
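
One caveat that may be worth checking: groupby(...).max() takes the maximum of each column separately, so test_number and score in one output row need not come from the same input row. In the question's data they happen to coincide. The sketch below uses hypothetical scores (not from the question) where they do not, and shows a row-wise alternative based on idxmax; the names `keys`, `columnwise` and `best_rows` are invented for the illustration.

```python
import pandas as pd

# Hypothetical scores, chosen so that the best score is NOT on the last test.
df = pd.DataFrame({
    "student_name": ["sarah", "sarah", "sarah"],
    "subject": ["maths", "maths", "maths"],
    "test_number": ["test1", "test2", "test3"],
    "score": [90, 71, 83],
})

keys = ["student_name", "subject"]

# Column-wise maximum: test_number and score are maximised independently,
# so the single result row pairs "test3" with 90 although test3 scored 83.
columnwise = df.groupby(keys, as_index=False).max()

# Row-wise alternative: find the index label of the best score per group,
# then select those complete rows from the original frame.
best_rows = df.loc[df.groupby(keys)["score"].idxmax()]

print(columnwise)
print(best_rows)
```

When the non-key columns have to stay consistent with the winning score, the idxmax form (or the sort_values plus drop_duplicates approach) is probably the safer choice.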

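For the average, use mean() instead. Passing numeric_only=True skips the non-numeric test_number column, which pandas 2.0 and later no longer drop silently:

df.groupby(["student_name","subject"], as_index=False).mean(numeric_only=True)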

    student_name    subject     score
0   joan            maths       85.500000
1   sarah           history     85.000000
2   sarah           maths       77.333333
3   sarah           physics     94.000000

[Discussion]:

df.groupby(["student_name","subject"],as_index=False).mean().score
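That's simpler! Thanks @Wen :)
It seems my MVWE was a bit too M; my actual data has multiple numeric columns (e.g. also student_height and student_weight, or temperature_of_test_date). Is there a way to specify the tie-breaking/aggregation using the max/mean test_score?
I don't think I understand, @AndreyIto. Maybe you could post some sample data?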

[Solution 2]:

describe may well do the job:

df.groupby(["student_name","subject"]).score.describe()
Out[15]: 
                          count       mean       std   min    25%   50%  \
student_name   subject                                                    
 joan           maths       2.0  85.500000  3.535534  83.0  84.25  85.5   
 sarah          history     2.0  85.000000  2.828427  83.0  84.00  85.0   
                maths       3.0  77.333333  6.027714  71.0  74.50  78.0   
                physics     2.0  94.000000  4.242641  91.0  92.50  94.0   
                            75%   max  
student_name   subject                 
 joan           maths     86.75  88.0  
 sarah          history   86.00  87.0  
                maths     80.50  83.0  
                physics   95.50  97.0

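There is also drop_duplicates. Sorting by score first means keep='last' retains the highest-scoring row of each group: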

df.sort_values('score').drop_duplicates(["student_name","subject"],keep='last')
Out[22]: 
     student_name    subject    test_number  score
2   sarah           maths      test3            83
6   sarah           history    test2            87
8   joan            maths      test2            88
4   sarah           physics    test2            97

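For the mean, combine it with reindex so the result has the same columns as the original (numeric_only=True skips the non-numeric test_number column, which then comes back as NaN):

df.groupby(["student_name","subject"], as_index=False).mean(numeric_only=True).reindex(columns=df.columns)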
Out[24]: 
     student_name    subject  test_number      score
0   joan            maths             NaN  85.500000
1   sarah           history           NaN  85.000000
2   sarah           maths             NaN  77.333333
3   sarah           physics           NaN  94.000000

[Discussion]:

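[Solution 3]:

We can use agg on the groupby to get both 'idxmax' and 'mean'.
Setting 'idxmax' as the index then lets an inner join pick out the highest-scoring rows and attach each group's mean.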

df.join(
    df.groupby(['student_name', 'subject'])
      .score.agg(['idxmax', 'mean']).set_index('idxmax'),
    how='inner'
)

  student_name  subject test_number  score       mean
2        sarah    maths       test3     83  77.333333
4        sarah  physics       test2     97  94.000000
6        sarah  history       test2     87  85.000000
8         joan    maths       test2     88  85.500000
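
To see why the inner join above selects the right rows, it may help to split the same technique into steps. The sketch below rebuilds the question's sample data by hand; the intermediate names `stats`, `stats_by_row` and `result` are invented for the illustration.

```python
import pandas as pd

# Sample data from the question, typed in by hand.
df = pd.DataFrame({
    "student_name": ["sarah"] * 7 + ["joan"] * 2,
    "subject": ["maths"] * 3 + ["physics"] * 2 + ["history"] * 2 + ["maths"] * 2,
    "test_number": ["test1", "test2", "test3", "test1", "test2",
                    "test1", "test2", "test1", "test2"],
    "score": [78, 71, 83, 91, 97, 83, 87, 83, 88],
})

# One row per (student, subject): the index label of the best score
# and the mean score of the group.
stats = df.groupby(["student_name", "subject"])["score"].agg(["idxmax", "mean"])

# Re-index the statistics by the original row label so they line up with df.
stats_by_row = stats.set_index("idxmax")

# The inner join keeps only the rows of df whose label appears in stats_by_row,
# i.e. the best-scoring row of each group, and adds the "mean" column.
result = df.join(stats_by_row, how="inner")

print(stats)
print(result)
```

Because the join is on index labels, this relies on df having a unique index; calling reset_index(drop=True) beforehand would be one way to guarantee that.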

[Discussion]: