【Question Title】: Calculate the average consumption from data in Pandas Dataframe
【Posted】: 2021-05-05 13:18:04
【Question】:

I have a dataframe, and I need to calculate the average consumption for each engine.

    import pandas as pd

    iterables = [['A123B'], ['2021-03-04 10:10:17', '2021-03-04 11:18:51', '2021-03-04 12:50:24',
                             '2021-03-04 13:02:02', '2021-03-04 14:37:23']]
    control_id = [1, 2, 3, 4, 5]
    index = pd.MultiIndex.from_product(iterables, names=["ENGINE_ID", "TIME"])
    steps = [354815, 355160, 355428, 357850, 358314]
    quantity = [156.32, 85.49, 100.00, 157.02, 134.00]
    full = [1, 0, 0, 1, 0]
    # Named `data` rather than `dict` to avoid shadowing the built-in
    data = {'CONTROL_ID': control_id, 'STEPS': steps, 'QUANTITY': quantity, 'FULL': full}
    df = pd.DataFrame(data, index=index)

    ENGINE_ID  TIME                 CONTROL_ID   STEPS  QUANTITY  FULL
    A123B      2021-03-04 10:10:17           1  354815    156.32     1
               2021-03-04 11:18:51           2  355160     85.49     0
               2021-03-04 12:50:24           3  355428    100.00     0
               2021-03-04 13:02:02           4  357850    157.02     1
               2021-03-04 14:37:23           5  358314    134.00     0

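The goal is to divide the difference in STEPS between two consecutive rows where the engine is full by the sum of QUANTITY added between them. In the table above, the engine is full at CONTROL_ID = 1 and again at CONTROL_ID = 4, so the step difference is (357850 - 354815) = 3035, the quantity is (85.49 + 100.00 + 157.02) = 342.51, and the average consumption is 3035 / 342.51 = 8.86. For this example, the expected result is shown in the table below. My actual dataframe contains multiple engines and steps.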

    ENGINE_ID  TIME                 CONTROL_ID   STEPS  QUANTITY  FULL  AVERAGE
    A123B      2021-03-04 10:10:17           1  354815    156.32     1     0
               2021-03-04 11:18:51           2  355160     85.49     0     0
               2021-03-04 12:50:24           3  355428    100.00     0     0
               2021-03-04 13:02:02           4  357850    157.02     1     8.86
               2021-03-04 14:37:23           5  358314    134.00     0     0
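
The arithmetic behind the 8.86 value can be checked without pandas. The following is only a minimal sketch in plain Python, using the example's values; the variable names are illustrative and not part of the original post.

```python
# Rows from the question's example: (STEPS, QUANTITY, FULL)
rows = [
    (354815, 156.32, 1),
    (355160, 85.49, 0),
    (355428, 100.00, 0),
    (357850, 157.02, 1),
    (358314, 134.00, 0),
]

averages = []
prev_full_steps = None      # STEPS at the previous FULL == 1 row
quantity_since_full = 0.0   # QUANTITY accumulated after the previous FULL == 1 row

for steps, quantity, full in rows:
    quantity_since_full += quantity
    if full == 1:
        if prev_full_steps is None:
            averages.append(0.0)  # no earlier fill-up to compare against
        else:
            averages.append((steps - prev_full_steps) / quantity_since_full)
        prev_full_steps = steps
        quantity_since_full = 0.0
    else:
        averages.append(0.0)

print([round(a, 2) for a in averages])
```

Only the second FULL == 1 row gets a non-zero value, which matches the AVERAGE column in the expected table.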

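How can I calculate the AVERAGE column and insert it for the whole dataframe? I looked for similar examples here and in the Pandas documentation, but could not find where to start.

Thanks!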

【Question Comments】:

    Tags: python pandas


    【Solution 1】:

    Let's try something like this:

    import pandas as pd
    import numpy as np
    
    iterables = [['A123B'], ['2021-03-04 10:10:17', '2021-03-04 11:18:51',
                             '2021-03-04 12:50:24', '2021-03-04 13:02:02',
                             '2021-03-04 14:37:23']]
    control_id = [1, 2, 3, 4, 5]
    index = pd.MultiIndex.from_product(iterables, names=["ENGINE_ID", "TIME"])
    steps = [354815, 355160, 355428, 357850, 358314]
    quantity = [156.32, 85.49, 100.00, 157.02, 134.00]
    full = [1, 0, 0, 1, 0]
    d = {'CONTROL_ID': control_id, 'STEPS': steps, 'QUANTITY': quantity, 'FULL': full}
    df = pd.DataFrame(d, index=index)
    
    # Boolean Index for where FULL == 1
    full_m = df.FULL.eq(1)
    # Get Values Needed For Average For Each Group Between Fulls
    sums = df.assign(
        # Difference Between This and Previous FULL == 1 Rows
        STEP_DIFF=df.loc[full_m, 'STEPS'] - df.loc[full_m, 'STEPS'].shift()
    ).groupby(
        # Create Groups Starting With Row After FULL == 1 ending with next FULL == 1
        df.FULL.shift().cumsum().fillna(0)
    )[['STEP_DIFF', 'QUANTITY']].transform('sum')
    
    # Place in the Averages or 0s
    df['AVERAGE'] = np.where(full_m, sums.STEP_DIFF / sums.QUANTITY, 0)
    
    # For Display
    print(df.to_string())
    

    Output:

                                   CONTROL_ID   STEPS  QUANTITY  FULL   AVERAGE
    ENGINE_ID TIME
    A123B     2021-03-04 10:10:17           1  354815    156.32     1  0.000000
              2021-03-04 11:18:51           2  355160     85.49     0  0.000000
              2021-03-04 12:50:24           3  355428    100.00     0  0.000000
              2021-03-04 13:02:02           4  357850    157.02     1  8.861055
              2021-03-04 14:37:23           5  358314    134.00     0  0.000000
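
The least obvious part of this solution is the grouping key, `df.FULL.shift().cumsum().fillna(0)`. As a small illustration (a sketch on the FULL column alone, indexed by CONTROL_ID for readability, which is an assumption made here and not how the original dataframe is indexed), the key can be inspected on its own:

```python
import pandas as pd

# FULL column from the question's example, indexed by CONTROL_ID for readability
full = pd.Series([1, 0, 0, 1, 0], index=[1, 2, 3, 4, 5], name="FULL")

# Shift down one row so a new group starts on the row *after* each FULL == 1,
# then count the fill-ups seen so far; the first row has no predecessor (NaN -> 0).
group_key = full.shift().cumsum().fillna(0)

print(group_key.tolist())
```

Rows 2 to 4 share one key, so summing QUANTITY within that group gives the 342.51 used as the denominator for the row with CONTROL_ID 4.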

    【Comments】:

      【Solution 2】:

      Not sure if this is the best solution, but I would use a series of shift operations, like this:

      import numpy as np
      import pandas as pd
      
      df['QUANT'] = df['QUANTITY'].shift(-1) # Shift QUANTITY by 1
      df['GROUP'] = df['FULL'].cumsum() # Get a group number which increments when a 1 occurs in the FULL column
      
      df2 = df.drop_duplicates(subset=['GROUP'], keep='first').copy() # Keep the first row of each group; copy so the column assignments below do not raise SettingWithCopyWarning
      df2['NEXT_STEPS'] = df2['STEPS'].shift(-1) # Shift the STEPS column by 1
      df2['DIFF'] = df2['NEXT_STEPS'] - df2['STEPS'] # Get the difference between the previous and next steps which is 357850 - 354815
      df = pd.merge(df.reset_index(), df2[['DIFF', 'GROUP']], on='GROUP') # Merge it with the original df
      
      
      df = pd.merge(df, df.groupby('GROUP')['QUANT'].sum().reset_index(), on='GROUP') # Get the QUANTITY sum for each group and merge with original df
      df['AVERAGE'] = (df['DIFF']/df['QUANT_y']).shift(1) # Calculate the AVERAGE
      df['AVERAGE'] = np.where(df['FULL']==1, df.AVERAGE, 0) # Replace AVERAGE column with 0 where FULL is not 1 else keep it
      df['AVERAGE'] = df['AVERAGE'].fillna(0) # Replace any nan with 0
      df = df[['ENGINE_ID', 'TIME', 'CONTROL_ID', 'STEPS', 'QUANTITY', 'FULL', 'AVERAGE']]
      

      To better understand what is going on, I suggest breaking it down and printing the intermediate results.
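
As one way to follow that suggestion, here is a reduced sketch that reproduces only the first two steps (the shifted QUANT column and the GROUP counter) on the example's QUANTITY and FULL values. The trimmed-down dataframe and the name `quant_per_group` are assumptions made for this illustration.

```python
import pandas as pd

# Only the columns needed to illustrate the first two steps
df = pd.DataFrame({
    "QUANTITY": [156.32, 85.49, 100.00, 157.02, 134.00],
    "FULL": [1, 0, 0, 1, 0],
})

df["QUANT"] = df["QUANTITY"].shift(-1)  # each row now holds the next row's QUANTITY
df["GROUP"] = df["FULL"].cumsum()       # increments on every FULL == 1 row

print(df)

# Sum of the shifted quantities per group
quant_per_group = df.groupby("GROUP")["QUANT"].sum()
print(quant_per_group)
```

Because QUANT is shifted up by one row, group 1 sums the quantities of rows 2 to 4, which is the 342.51 used for the second fill-up.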

      【Comments】:

        【Solution 3】:

        First, get the cumulative sum of the quantities, then select only the rows where the engine is full (FULL==1).

        import numpy as np
        df['cum']=df.QUANTITY.cumsum()
        dffull=df[df.FULL==1].copy()  # copy so that a column can be added to dffull later without a SettingWithCopyWarning
        

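        Compute the consumption for each interval between fill-ups with a NumPy array division of the deltas (each array minus a copy of itself shifted by one index).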

        # np.diff returns element[i+1] - element[i], i.e. the same deltas as slicing [1:] minus [0:-1]
        consumption=np.diff(dffull.STEPS.to_numpy())/np.diff(dffull.cum.to_numpy())
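
Subtracting consecutive cumulative sums at the FULL==1 rows yields the quantity added between two fill-ups, which is why the denominator above works. A minimal NumPy-only check on the example's values (the variable names are illustrative):

```python
import numpy as np

quantity = np.array([156.32, 85.49, 100.00, 157.02, 134.00])
full = np.array([1, 0, 0, 1, 0])

cum = np.cumsum(quantity)      # running total of QUANTITY
cum_at_full = cum[full == 1]   # running total at each FULL == 1 row

# Difference of running totals: QUANTITY added after one fill-up,
# up to and including the next one
between_fills = np.diff(cum_at_full)
direct_sum = quantity[1:4].sum()  # 85.49 + 100.00 + 157.02

print(between_fills, direct_sum)
```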
        

        Now assign the result. Because the consumption list is one element shorter than dffull, the first element is set to 0 here.

        dffull["consumption"]=[0]+list(consumption)
        

        This is what dffull looks like:

                                       CONTROL_ID   STEPS  ...     cum  consumption
        ENGINE_ID TIME                                     ...                     
        A123B     2021-03-04 10:10:17           1  354815  ...  156.32     0.000000
                  2021-03-04 13:02:02           4  357850  ...  498.83     8.861055
        

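        Finally, create a consumption column in df initialized to 0, then assign the computed values to the FULL==1 rows. Assigning through .loc instead of chained indexing (df["consumption"][...]) avoids the chained-assignment warning, and under pandas Copy-on-Write the chained form would not update df at all.

        df["consumption"]=0.0
        df.loc[df.FULL==1, "consumption"]=dffull.consumption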
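
The question mentions a dataframe with multiple engines, while the snippets in this thread are shown on a single one. Below is one possible sketch that applies the same cumulative-sum idea per engine. It assumes rows are already sorted by TIME within each ENGINE_ID. The helper name `add_average`, the second engine `X000X`, and its values are made up purely for illustration.

```python
import pandas as pd


def add_average(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an AVERAGE column computed per ENGINE_ID."""
    out = df.copy()
    # Running total of QUANTITY, restarted for every engine
    cum = out.groupby(level="ENGINE_ID")["QUANTITY"].cumsum()
    full_rows = out["FULL"].eq(1)
    # Deltas between consecutive FULL == 1 rows of the same engine
    step_diff = out.loc[full_rows, "STEPS"].groupby(level="ENGINE_ID").diff()
    quantity_diff = cum[full_rows].groupby(level="ENGINE_ID").diff()
    out["AVERAGE"] = 0.0
    # The first fill-up of each engine has no predecessor (NaN), so it stays 0
    out.loc[full_rows, "AVERAGE"] = (step_diff / quantity_diff).fillna(0.0)
    return out


# Engine A123B comes from the question; X000X is hypothetical illustration data
a_times = ["2021-03-04 10:10:17", "2021-03-04 11:18:51", "2021-03-04 12:50:24",
           "2021-03-04 13:02:02", "2021-03-04 14:37:23"]
x_times = ["2021-03-04 10:00:00", "2021-03-04 11:00:00", "2021-03-04 12:00:00"]
index = pd.MultiIndex.from_tuples(
    [("A123B", t) for t in a_times] + [("X000X", t) for t in x_times],
    names=["ENGINE_ID", "TIME"],
)
df = pd.DataFrame(
    {
        "STEPS": [354815, 355160, 355428, 357850, 358314, 1000, 1100, 1300],
        "QUANTITY": [156.32, 85.49, 100.00, 157.02, 134.00, 50.0, 20.0, 30.0],
        "FULL": [1, 0, 0, 1, 0, 1, 0, 1],
    },
    index=index,
)

result = add_average(df)
print(result)
```

Grouping by the ENGINE_ID index level keeps the cumulative sum and the differences from leaking across engines, so the first fill-up of each engine is never compared with the last fill-up of the previous one.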
        

        【Comments】:
