Calculate the average consumption from data in a Pandas DataFrame

【Question title】: Calculate the average consumption from data in Pandas Dataframe
【Posted】: 2021-05-05 13:18:04
【Question】:

I have a dataframe and I need to calculate the average consumption for each engine.

    import pandas as pd

    iterables = [['A123B'], ['2021-03-04 10:10:17', '2021-03-04 11:18:51', '2021-03-04 12:50:24',
                             '2021-03-04 13:02:02', '2021-03-04 14:37:23']]
    control_id = [1, 2, 3, 4, 5]
    index = pd.MultiIndex.from_product(iterables, names=["ENGINE_ID", "TIME"])
    steps = [354815, 355160, 355428, 357850, 358314]
    quantity = [156.32, 85.49, 100.00, 157.02, 134.00]
    full = [1, 0, 0, 1, 0]
    # Named "data" rather than "dict" so the built-in dict is not shadowed
    data = {'CONTROL_ID':control_id, 'STEPS':steps, 'QUANTITY':quantity, 'FULL':full}
    df = pd.DataFrame(data, index=index)

ENGINE_ID	TIME	CONTROL_ID	STEPS	QUANTITY	FULL
A123B	2021-03-04 10:10:17	1	354815	156.32	1
	2021-03-04 11:18:51	2	355160	85.49	0
	2021-03-04 12:50:24	3	355428	100.00	0
	2021-03-04 13:02:02	4	357850	157.02	1
	2021-03-04 14:37:23	5	358314	134.00	0

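The goal is to compute the difference in STEPS between the rows where the engine is full, divided by the sum of the quantities added in between. For the table above, at CONTROL_ID = 4 the difference in steps is (357850 - 354815) = 3035, the quantity is (85.49 + 100.00 + 157.02) = 342.51, and the average consumption is 3035/342.51 = 8.86. The expected result for this example is shown in the table below. My actual dataframe contains multiple engines and steps.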

ENGINE_ID	TIME	CONTROL_ID	STEPS	QUANTITY	FULL	AVERAGE
A123B	2021-03-04 10:10:17	1	354815	156.32	1	0
	2021-03-04 11:18:51	2	355160	85.49	0	0
	2021-03-04 12:50:24	3	355428	100.00	0	0
	2021-03-04 13:02:02	4	357850	157.02	1	8.86
	2021-03-04 14:37:23	5	358314	134.00	0	0
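As a quick sanity check of the arithmetic above, here is a minimal sketch in plain Python. The variable names are only illustrative; the numbers are copied from the example table.

```python
# STEPS on the two rows where FULL == 1 (CONTROL_ID 1 and 4)
steps_at_fill = [354815, 357850]
# QUANTITY values after the first fill, up to and including the second fill
quantity_between = [85.49, 100.00, 157.02]

step_diff = steps_at_fill[1] - steps_at_fill[0]
total_quantity = sum(quantity_between)
average = step_diff / total_quantity

print(step_diff, round(total_quantity, 2), round(average, 2))
```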

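How can I compute and insert the AVERAGE column for the whole dataframe? I looked for similar examples here and in the Pandas documentation, but could not find where to start.

Thanks!

【Comments】:

Tags: python pandas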

【Solution 1】:

Let's try something like this:

import pandas as pd
import numpy as np

iterables = [['A123B'], ['2021-03-04 10:10:17', '2021-03-04 11:18:51',
                         '2021-03-04 12:50:24', '2021-03-04 13:02:02',
                         '2021-03-04 14:37:23']]
control_id = [1, 2, 3, 4, 5]
index = pd.MultiIndex.from_product(iterables, names=["ENGINE_ID", "TIME"])
steps = [354815, 355160, 355428, 357850, 358314]
quantity = [156.32, 85.49, 100.00, 157.02, 134.00]
full = [1, 0, 0, 1, 0]
d = {'CONTROL_ID': control_id, 'STEPS': steps, 'QUANTITY': quantity, 'FULL': full}
df = pd.DataFrame(d, index=index)

# Boolean Index for where FULL == 1
full_m = df.FULL.eq(1)
# Get Values Needed For Average For Each Group Between Fulls
sums = df.assign(
    # Difference Between This and Previous FULL == 1 Rows
    STEP_DIFF=df.loc[full_m, 'STEPS'] - df.loc[full_m, 'STEPS'].shift()
).groupby(
    # Create Groups Starting With Row After FULL == 1 ending with next FULL == 1
    df.FULL.shift().cumsum().fillna(0)
)[['STEP_DIFF', 'QUANTITY']].transform('sum')

# Place in the Averages or 0s
df['AVERAGE'] = np.where(full_m, sums.STEP_DIFF / sums.QUANTITY, 0)

# For Display
print(df.to_string())

Output:

                               CONTROL_ID   STEPS  QUANTITY  FULL   AVERAGE
ENGINE_ID TIME
A123B     2021-03-04 10:10:17           1  354815    156.32     1  0.000000
          2021-03-04 11:18:51           2  355160     85.49     0  0.000000
          2021-03-04 12:50:24           3  355428    100.00     0  0.000000
          2021-03-04 13:02:02           4  357850    157.02     1  8.861055
          2021-03-04 14:37:23           5  358314    134.00     0  0.000000
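The question mentions several engines, while the code above treats the whole frame as a single sequence. One possible way to apply the same idea per ENGINE_ID is sketched below. This is not part of the original answer: `add_average` is a hypothetical helper name, the second engine (`B999Z`) is made-up demo data, and the sketch assumes rows are already sorted by TIME within each engine.

```python
import pandas as pd

def add_average(df):
    """Hypothetical helper: compute AVERAGE separately for each ENGINE_ID."""
    flat = df.reset_index()
    full_m = flat["FULL"].eq(1)

    # A new block starts on the row after a FULL == 1 row, counted per engine
    prev_full = flat.groupby("ENGINE_ID")["FULL"].shift().fillna(0)
    flat["BLOCK"] = prev_full.groupby(flat["ENGINE_ID"]).cumsum()

    # Total QUANTITY within each (engine, block)
    qty_sum = flat.groupby(["ENGINE_ID", "BLOCK"])["QUANTITY"].transform("sum")

    # STEPS difference between consecutive FULL == 1 rows of the same engine
    step_diff = flat.loc[full_m].groupby("ENGINE_ID")["STEPS"].diff()

    # Non-full rows and the first fill of each engine have no value, so use 0
    flat["AVERAGE"] = (step_diff.reindex(flat.index) / qty_sum).fillna(0.0)
    return flat.drop(columns="BLOCK").set_index(["ENGINE_ID", "TIME"])

# Demo: the engine from the question plus a made-up second engine
index = pd.MultiIndex.from_tuples(
    [("A123B", "2021-03-04 10:10:17"), ("A123B", "2021-03-04 11:18:51"),
     ("A123B", "2021-03-04 12:50:24"), ("A123B", "2021-03-04 13:02:02"),
     ("A123B", "2021-03-04 14:37:23"),
     ("B999Z", "2021-03-04 09:00:00"), ("B999Z", "2021-03-04 10:00:00"),
     ("B999Z", "2021-03-04 11:00:00")],
    names=["ENGINE_ID", "TIME"],
)
demo = pd.DataFrame(
    {"STEPS": [354815, 355160, 355428, 357850, 358314, 100, 200, 300],
     "QUANTITY": [156.32, 85.49, 100.00, 157.02, 134.00, 10.0, 20.0, 30.0],
     "FULL": [1, 0, 0, 1, 0, 1, 0, 1]},
    index=index,
)
result = add_average(demo)
print(result.to_string())
```

Grouping by engine before shifting and differencing keeps a fill on one engine from being paired with a fill on another.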

【Comments】:

【Solution 2】:

Not sure this is the best solution, but I would use a series of shift operations, as follows:

import numpy as np
import pandas as pd

df['QUANT'] = df['QUANTITY'].shift(-1) # Shift QUANTITY by 1
df['GROUP'] = df['FULL'].cumsum() # Get a group number which increments when a 1 occurs in the FULL column

df2 = df.drop_duplicates(subset=['GROUP'], keep='first').copy() # Keep only the first row of each group; .copy() so the columns added below do not trigger SettingWithCopyWarning
df2['NEXT_STEPS'] = df2['STEPS'].shift(-1) # Shift the STEPS column by 1
df2['DIFF'] = df2['NEXT_STEPS'] - df2['STEPS'] # Get the difference between the previous and next steps which is 357850 - 354815
df = pd.merge(df.reset_index(), df2[['DIFF', 'GROUP']], on='GROUP') # Merge it with the original df


df = pd.merge(df, df.groupby('GROUP')['QUANT'].sum().reset_index(), on='GROUP') # Get the QUANTITY sum for each group and merge with original df
df['AVERAGE'] = (df['DIFF']/df['QUANT_y']).shift(1) # Calculate the AVERAGE
df['AVERAGE'] = np.where(df['FULL']==1, df.AVERAGE, 0) # Replace AVERAGE column with 0 where FULL is not 1 else keep it
df['AVERAGE'] = df['AVERAGE'].fillna(0) # Replace any nan with 0
df = df[['ENGINE_ID', 'TIME', 'CONTROL_ID', 'STEPS', 'QUANTITY', 'FULL', 'AVERAGE']]

To get a better understanding of what is happening, I suggest you break it down and print out the intermediate results.
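For instance, the first two helper columns can be inspected on their own. The snippet below is only a sketch: it rebuilds just the QUANTITY and FULL columns from the question in a throwaway frame (named `steps_demo` here) so that it runs by itself.

```python
import pandas as pd

steps_demo = pd.DataFrame({
    "QUANTITY": [156.32, 85.49, 100.00, 157.02, 134.00],
    "FULL": [1, 0, 0, 1, 0],
})

# Each row now holds the QUANTITY of the following row
steps_demo["QUANT"] = steps_demo["QUANTITY"].shift(-1)
# Group number increases by one at every FULL == 1 row
steps_demo["GROUP"] = steps_demo["FULL"].cumsum()
print(steps_demo)

# Per-group sum of the shifted quantities (NaN is skipped by sum)
group_sums = steps_demo.groupby("GROUP")["QUANT"].sum()
print(group_sums)
```

Group 1 sums the quantities of rows 2 to 4, which is the 342.51 used as the denominator in the question.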

【Comments】:

【Solution 3】:

First, take the cumulative sum of the quantities, then select only the rows where the engine is full (FULL==1).

import numpy as np
df['cum']=df.QUANTITY.cumsum()
# .copy() so that adding a column to dffull later does not raise SettingWithCopyWarning
dffull=df[df.FULL==1].copy()

Compute the consumption between consecutive fills by dividing the deltas as numpy arrays (hence the subtraction of slices offset by one index).

consumption=(np.array(dffull.iloc[1:].STEPS)-np.array(dffull.iloc[0:-1].STEPS))/(np.array(dffull.iloc[1:].cum)-np.array(dffull.iloc[0:-1].cum))

Now assign the result. Because the consumption list is one element shorter, the first element is set to 0 here.

dffull["consumption"]=[0]+list(consumption)

This is what dffull looks like:

                               CONTROL_ID   STEPS  ...     cum  consumption
ENGINE_ID TIME                                     ...                     
A123B     2021-03-04 10:10:17           1  354815  ...  156.32     0.000000
          2021-03-04 13:02:02           4  357850  ...  498.83     8.861055

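Finally, create a consumption column in df, initialized to 0, then assign the computed values and you are done. The assignment uses .loc, because chained indexing such as df["consumption"][df.FULL==1] = ... raises a warning and, depending on the pandas version, may not update df at all.

df["consumption"]=0.0
df.loc[df.FULL==1, "consumption"]=dffull.consumption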
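The same cumulative-sum idea can also be written with pandas' own `diff()` instead of slicing numpy arrays. The following is a sketch rather than the answer's code; like the steps above it assumes a single engine with rows in time order, and it rebuilds the question's frame so that it runs on its own.

```python
import pandas as pd

iterables = [['A123B'], ['2021-03-04 10:10:17', '2021-03-04 11:18:51',
                         '2021-03-04 12:50:24', '2021-03-04 13:02:02',
                         '2021-03-04 14:37:23']]
index = pd.MultiIndex.from_product(iterables, names=["ENGINE_ID", "TIME"])
df = pd.DataFrame({'CONTROL_ID': [1, 2, 3, 4, 5],
                   'STEPS': [354815, 355160, 355428, 357850, 358314],
                   'QUANTITY': [156.32, 85.49, 100.00, 157.02, 134.00],
                   'FULL': [1, 0, 0, 1, 0]}, index=index)

mask = df['FULL'].eq(1)
cum = df['QUANTITY'].cumsum()

# On the FULL == 1 rows only: delta of STEPS divided by delta of cumulative QUANTITY
consumption = df.loc[mask, 'STEPS'].diff() / cum[mask].diff()

# Spread back over all rows; rows without a value (non-full rows, first fill) get 0
df['consumption'] = consumption.reindex(df.index).fillna(0.0)
print(df.to_string())
```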

【Comments】: