【Question Title】: Create a new column that stores a ratio score in Pandas
【Posted】: 2020-12-17 14:32:31
【Question】:

I have a DataFrame like this:

    RANK  STA  RUN  BIB       NAME  FINISH  FINISH.1  FINISH.2            COURSE
0      1    3    3    1  ingenting     3.0      0.00       NaN           LØYPE 1
1      2    8    2    3  ingenting     4.0      1.97       NaN           LØYPE 3
2      3    9    3    3  ingenting     5.0      2.06       NaN           LØYPE 1
3      4    2    2    1  ingenting     6.0      3.21       NaN  STRAIGHT-GLIDING
4      5    5    1    2  ingenting     6.0      3.32       NaN           LØYPE 1
5      6    1    1    1  ingenting     6.0      3.34       NaN  STRAIGHT-GLIDING
6      7    4    4    1  ingenting     6.0      3.43       NaN           LØYPE 1
7      8   13    7    3  ingenting     6.0      3.48       NaN  STRAIGHT-GLIDING
8      9   12    6    3  ingenting     6.0      3.65       NaN  STRAIGHT-GLIDING
9     10   11    5    3  ingenting     NaN      4.19       NaN  STRAIGHT-GLIDING
10    11    6    2    2  ingenting     7.0      4.20       NaN           LØYPE 3
11    12   14    3    2  ingenting     7.0      4.30       NaN  STRAIGHT-GLIDING
12    13   10    4    3  ingenting     8.0      5.14       NaN           LØYPE 2
13    14    7    1    3  ingenting     8.0      5.75       NaN           LØYPE 3

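The DataFrame contains different athletes (BIB) on different courses (COURSE). Each BIB also has its own RUN number. My main interest is the FINISH column. I would like to do the following:

  • Find the first STRAIGHT-GLIDING FINISH time for each BIB.
  • "Store" that value as a reference time.
  • For each observation (14 in this example, rows 0 to 13), subtract that BIB's STRAIGHT-GLIDING reference time from the observation's FINISH time.

The solution should add a new column holding this value for every observation. For example, in observation 0 the FINISH time is 3.0 and his first STRAIGHT-GLIDING time is '3.21', so I want to create the value 3.0 - 3.21. How can I do this?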

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    Here is my answer. It's a bit long :)

    import pandas as pd
    
    # Create filter for 'STRAIGHT-GLIDING'
    sg_filt = df['COURSE'] == 'STRAIGHT-GLIDING'
    
    # Create 'STRAIGHT-GLIDING' only dataframe using filter
    sg_only = df.loc[sg_filt].copy()
    
    # Preview new DataFrame
    sg_only
    
      Rank  STA RUN BIB NAME    FINISH  FINISH.1    COURSE
    3   4   2   2   1   ingenting   6.0 3.21    STRAIGHT-GLIDING
    5   6   1   1   1   ingenting   6.0 3.34    STRAIGHT-GLIDING
    7   8   13  7   3   ingenting   6.0 3.48    STRAIGHT-GLIDING
    8   9   12  6   3   ingenting   6.0 3.65    STRAIGHT-GLIDING
    9   10  11  5   3   ingenting   NaN 4.19    STRAIGHT-GLIDING
    11  12  14  3   2   ingenting   7.0 4.30    STRAIGHT-GLIDING
    
    # Create DataFrame on only first times per BIB
    first_times = sg_only[sg_only.groupby(['BIB','COURSE']).cumcount() == 0][['BIB','FINISH']].copy()
    
    # Change column name on first_times dataFrame for merge
    first_times.rename(columns={'FINISH':'Reference_Time'},inplace=True)
    
    # Merge original DataFrame with first_times DataFrame to get reference time
    final_df = pd.merge(df,first_times,on='BIB',how='left')
    
    
       Rank STA RUN BIB NAME     FINISH FINISH.1    COURSE  Reference_Time
    0   1   3   3   1   ingenting   3.0  0.00    LØYPE 1            6.0
    1   2   8   2   3   ingenting   4.0  1.97    LØYPE 3            6.0
    2   3   9   3   3   ingenting   5.0  2.06    LØYPE 1            6.0
    3   4   2   2   1   ingenting   6.0  3.21    STRAIGHT-GLIDING   6.0
    4   5   5   1   2   ingenting   6.0  3.32    LØYPE 1            7.0
    5   6   1   1   1   ingenting   6.0  3.34    STRAIGHT-GLIDING   6.0
    6   7   4   4   1   ingenting   6.0  3.43    LØYPE 1            6.0
    7   8   13  7   3   ingenting   6.0  3.48    STRAIGHT-GLIDING   6.0
    8   9   12  6   3   ingenting   6.0  3.65    STRAIGHT-GLIDING   6.0
    9   10  11  5   3   ingenting   NaN  4.19    STRAIGHT-GLIDING   6.0
    10  11  6   2   2   ingenting   7.0  4.20    LØYPE 3            7.0
    11  12  14  3   2   ingenting   7.0  4.30    STRAIGHT-GLIDING   7.0
    12  13  10  4   3   ingenting   8.0  5.14    LØYPE 2            6.0
    13  14  7   1   3   ingenting   8.0  5.75    LØYPE 3            6.0
    
    # Create FINISH_TIME column 
    final_df['FINISH_TIME'] = final_df['FINISH'] - final_df['Reference_Time']
    
       Rank STA RUN BIB NAME    FINISH  FINISH.1    COURSE  Reference_Time  FINISH_TIME
    0   1   3   3   1   ingenting   3.0  0.00   LØYPE 1             6.0   -3.0
    1   2   8   2   3   ingenting   4.0  1.97   LØYPE 3             6.0   -2.0
    2   3   9   3   3   ingenting   5.0  2.06   LØYPE 1             6.0   -1.0
    3   4   2   2   1   ingenting   6.0  3.21   STRAIGHT-GLIDING    6.0    0.0
    4   5   5   1   2   ingenting   6.0  3.32   LØYPE 1             7.0   -1.0
    5   6   1   1   1   ingenting   6.0  3.34   STRAIGHT-GLIDING    6.0    0.0
    6   7   4   4   1   ingenting   6.0  3.43   LØYPE 1             6.0    0.0
    7   8   13  7   3   ingenting   6.0  3.48   STRAIGHT-GLIDING    6.0    0.0
    8   9   12  6   3   ingenting   6.0  3.65   STRAIGHT-GLIDING    6.0    0.0
    9   10  11  5   3   ingenting   NaN  4.19   STRAIGHT-GLIDING    6.0    NaN
    10  11  6   2   2   ingenting   7.0  4.20   LØYPE 3             7.0    0.0
    11  12  14  3   2   ingenting   7.0  4.30   STRAIGHT-GLIDING    7.0    0.0
    12  13  10  4   3   ingenting   8.0  5.14   LØYPE 2             6.0    2.0
    13  14  7   1   3   ingenting   8.0  5.75   LØYPE 3             6.0    2.0
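
The filter, first-per-BIB and merge steps above can probably be condensed. The following is only a sketch, not the answerer's code: it rebuilds just the BIB, FINISH and COURSE columns from the question's table and swaps the merge for a `Series.map` lookup. "First" here means first in row order, as with the `cumcount() == 0` filter.

```python
import pandas as pd

nan = float("nan")

# Subset of the question's columns, in the original row order (rows 0..13)
df = pd.DataFrame({
    "BIB":    [1, 3, 3, 1, 2, 1, 1, 3, 3, 3, 2, 2, 3, 3],
    "FINISH": [3.0, 4.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, nan, 7.0, 7.0, 8.0, 8.0],
    "COURSE": ["LØYPE 1", "LØYPE 3", "LØYPE 1", "STRAIGHT-GLIDING", "LØYPE 1",
               "STRAIGHT-GLIDING", "LØYPE 1", "STRAIGHT-GLIDING", "STRAIGHT-GLIDING",
               "STRAIGHT-GLIDING", "LØYPE 3", "STRAIGHT-GLIDING", "LØYPE 2", "LØYPE 3"],
})

# First STRAIGHT-GLIDING row per BIB (by row order), as a BIB -> FINISH Series
reference = (
    df.loc[df["COURSE"] == "STRAIGHT-GLIDING"]
      .drop_duplicates("BIB")          # keeps the first row for each BIB
      .set_index("BIB")["FINISH"]
)

# Look up each row's reference time by BIB, then subtract it from FINISH
df["Reference_Time"] = df["BIB"].map(reference)
df["FINISH_TIME"] = df["FINISH"] - df["Reference_Time"]
```

If a BIB's first STRAIGHT-GLIDING row has a missing FINISH, its reference time will be NaN as well. Calling `dropna(subset=["FINISH"])` before `drop_duplicates` would skip such rows, if that is the behaviour you want.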
    

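    【Comments】:

    • Thanks! Very close :D The only problem is that you used FINISH.1 to set the reference time; it should be the FINISH column. So for the first observation, the reference time should be 6.0.
    • Is there a simple way to fix this? I like your approach.
    【Solution 2】:

    Here is my solution (I hope I understood the question correctly):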

    import pandas as pd
    
    # Assumes "FINISH.1"/"FINISH.2" were renamed to "FINISH_1"/"FINISH_2"
    # and that df has the default integer index (0..n-1).
    previousBib = None
    for i in range(df.shape[0]):
        currentBib = df.at[i, "BIB"]
        
        # Look up the reference time again only when the BIB changes
        if currentBib != previousBib:
            instances_BibI = df.loc[df.BIB == currentBib]
            instances_BibI = instances_BibI.sort_values(by=["RUN"])             # To ensure that the first gliding finish is the one with the lowest RUN
            first_StraightGliding_Finish = instances_BibI.loc[instances_BibI.COURSE == "STRAIGHT-GLIDING"].FINISH_1.to_numpy()[0]
            
        # FINISH minus the BIB's first STRAIGHT-GLIDING FINISH_1
        df.at[i, "FINISH_2"] = df.at[i, "FINISH"] - first_StraightGliding_Finish
        
        previousBib = currentBib
    

    df is your example DataFrame.

    My example output (sorted by BIB and RUN) looks like this:

        RANK  STA   RUN BIB NAME    FINISH  FINISH_1    FINISH_2    COURSE
    5   6     1     1   1   Olle    6.0     3.34        2.66        STRAIGHT-GLIDING
    3   4     2     2   1   Olle    6.0     3.21        2.66        STRAIGHT-GLIDING
    0   1     3     3   1   Olle    3.0     0.00       -0.34        Loop1
    6   7     4     4   1   Olle    6.0     3.43       2.66         Loop1
    4   5     5     1   2   Olle    6.0     3.32       1.70         Loop1
    10  11    6     2   2   Olle    7.0     4.20       2.70         Loop3
    11  12    14    3   2   Olle    7.0     4.30       2.70         STRAIGHT-GLIDING
    13  14    7     1   3   Olle    8.0     5.75       3.81         Loop3
    1   2     8     2   3   Olle    4.0     1.97       -0.19        Loop3
    2   3     9     3   3   Olle    5.0     2.06       0.81         Loop1
    12  13    10    4   3   Olle    8.0     5.14       3.81         Loop2
    9   10    11    5   3   Olle    NaN     4.19       NaN          STRAIGHT-GLIDING
    8   9     12    6   3   Olle    6.0     3.65       1.81         STRAIGHT-GLIDING
    7   8     13    7   3   Olle    6.0     3.48       1.81         STRAIGHT-GLIDING
    

    The FINISH_2 column holds the subtracted time value you wanted to compute.

    【Comments】:
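
The loop above could likely be written without explicit iteration. This is a sketch under the same assumptions as the answer: columns renamed to `FINISH_1`/`FINISH_2`, "first" meaning the lowest RUN within a BIB, and the reference taken from `FINISH_1`. The data is rebuilt from the question's table; the `.round(2)` is my addition to hide floating-point noise.

```python
import pandas as pd

nan = float("nan")

# Columns needed from the question's table, in the original row order
df = pd.DataFrame({
    "RUN":      [3, 2, 3, 2, 1, 1, 4, 7, 6, 5, 2, 3, 4, 1],
    "BIB":      [1, 3, 3, 1, 2, 1, 1, 3, 3, 3, 2, 2, 3, 3],
    "FINISH":   [3.0, 4.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, nan, 7.0, 7.0, 8.0, 8.0],
    "FINISH_1": [0.00, 1.97, 2.06, 3.21, 3.32, 3.34, 3.43, 3.48, 3.65, 4.19, 4.20, 4.30, 5.14, 5.75],
    "COURSE":   ["LØYPE 1", "LØYPE 3", "LØYPE 1", "STRAIGHT-GLIDING", "LØYPE 1",
                 "STRAIGHT-GLIDING", "LØYPE 1", "STRAIGHT-GLIDING", "STRAIGHT-GLIDING",
                 "STRAIGHT-GLIDING", "LØYPE 3", "STRAIGHT-GLIDING", "LØYPE 2", "LØYPE 3"],
})

# Earliest (lowest RUN) STRAIGHT-GLIDING row per BIB, as a BIB -> FINISH_1 Series
reference = (
    df.loc[df["COURSE"] == "STRAIGHT-GLIDING"]
      .sort_values("RUN", kind="stable")
      .drop_duplicates("BIB")          # keeps the lowest-RUN row for each BIB
      .set_index("BIB")["FINISH_1"]
)

# FINISH minus the BIB's reference time, computed for all rows at once
df["FINISH_2"] = (df["FINISH"] - df["BIB"].map(reference)).round(2)
```

To take the reference from the FINISH column instead, as requested in the comments under Solution 1, replacing `"FINISH_1"` with `"FINISH"` in the `reference` expression should be enough.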
