[Posted]: 2020-11-11 15:29:28
[Question]:
I have the DataFrame below:
import numpy as np
import pandas as pd
data = pd.DataFrame({
'ID': ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
'2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31',
'2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
})
df = pd.DataFrame(data, columns = ['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'].astype(str), format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'].astype(str), format='%Y-%m-%d')
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit = 'd')
df['Delay'] = df['Payment_Date'] - df['Due_Date']
df['Delay'] = df['Delay'].dt.days
df['diff'] = df.groupby('ID')['Invoice_Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
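def func(x):
    # x holds the day gaps between consecutive invoices of one ID
    x = x.values
    values = [x[0]]
    for i in range(1, len(x)):
        value = values[i-1] + x[i]
        if value < 30:
            # still inside the 30-day period: keep accumulating
            values.append(value)
        elif x[i] >= 30:
            # the gap alone is 30 days or more: restart the count at 0
            values.append(0)
        else:
            # running total reached 30: restart the count at the current gap
            values.append(x[i])
    return values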
df['days'] = df.groupby("ID")["diff"].transform(func)
df
Out[1]:
      ID  Invoice_Date  Payment_Term  Payment_Date    Due_Date  Delay  diff  days
0  27459    2020-06-26             7    2020-07-05  2020-07-03      2   0.0   0.0
1  27459    2020-06-29             8    2020-07-05  2020-07-07     -2   3.0   3.0
2  27459    2020-06-30             3    2020-07-03  2020-07-03      0   1.0   4.0
3  27459    2020-07-14             6    2020-07-21  2020-07-20      1  14.0  18.0
4  27459    2020-07-25             4    2020-07-31  2020-07-29      2  11.0  29.0
5  27459    2020-07-30             7    2020-08-15  2020-08-06      9   5.0   5.0
6  27459    2020-08-02             8    2020-08-22  2020-08-10     12   3.0   8.0
7  48002    2020-05-13             5    2020-06-16  2020-05-18     29   0.0   0.0
8  48002    2020-06-20             3    2020-06-23  2020-06-23      0  38.0   0.0
9  48002    2020-06-28             6    2020-07-05  2020-07-04      1   8.0   8.0
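I want to create a column Mean, calculated as the sum of Delay divided by the number of invoices within a 30-day period, per ID.
For example, the initial Invoice_Date for ID 27459 is 26 June 2020, so the 30-day period runs until 25 July 2020, and the mean is calculated from the Delay values of the invoices dated within that period.
The tricky part is that a single ID can have more than one mean; ID 27459, for instance, has two. I tried groupby.mean, but that only works when I need one mean across the whole ID group.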
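To illustrate the limitation, here is a minimal sketch using only the ID and Delay values from the table above (the name `delays` is made up for this example). Grouping by ID alone yields a single mean per ID, so the separate 30-day periods inside one ID are lost:

```python
import pandas as pd

# Only the two columns needed for the illustration, copied from the table above.
delays = pd.DataFrame({
    'ID': ['27459'] * 7 + ['48002'] * 3,
    'Delay': [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# One value per ID: every invoice of an ID is averaged together,
# regardless of which 30-day period it belongs to.
per_id_mean = delays.groupby('ID')['Delay'].mean()
print(per_id_mean)
```

For ID 27459 this averages all seven invoices together instead of giving one mean for the first five and another for the last two.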
The expected output should look more or less like this:
Out[2]:
      ID  Invoice_Date  Payment_Term  Payment_Date    Due_Date  Delay  diff  days  Mean
0  27459    2020-06-26             7    2020-07-05  2020-07-03      2   0.0   0.0
1  27459    2020-06-29             8    2020-07-05  2020-07-07     -2   3.0   3.0
2  27459    2020-06-30             3    2020-07-03  2020-07-03      0   1.0   4.0
3  27459    2020-07-14             6    2020-07-21  2020-07-20      1  14.0  18.0
4  27459    2020-07-25             4    2020-07-31  2020-07-29      2  11.0  29.0   0.6
5  27459    2020-07-30             7    2020-08-15  2020-08-06      9   5.0   5.0
6  27459    2020-08-02             8    2020-08-22  2020-08-10     12   3.0   8.0  10.5
7  48002    2020-05-13             5    2020-06-16  2020-05-18     29   0.0   0.0    29
8  48002    2020-06-20             3    2020-06-23  2020-06-23      0  38.0   0.0
9  48002    2020-06-28             6    2020-07-05  2020-07-04      1   8.0   8.0   0.5
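One possible way to get there, offered as a sketch rather than a definitive answer: label each invoice with the 30-day period it falls into, then average Delay per (ID, period) and keep the value only on the last row of each period. This assumes a period is anchored at its first invoice and that an invoice 30 or more days after that anchor opens a new period, which is one reading of the description above. The names `window_ids`, `demo` and `window` are made up for the example.

```python
import pandas as pd

def window_ids(dates):
    """Number the 30-day periods for the (sorted) invoice dates of one ID."""
    labels, start, current = [], None, -1
    for d in dates:
        # Open a new period at the first invoice, or once an invoice
        # falls 30 or more days after the start of the current period.
        if start is None or (d - start).days >= 30:
            start = d
            current += 1
        labels.append(current)
    return pd.Series(labels, index=dates.index)

# Same ID, Invoice_Date and Delay values as in the question.
demo = pd.DataFrame({
    'ID': ['27459'] * 7 + ['48002'] * 3,
    'Invoice_Date': pd.to_datetime([
        '2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
        '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28']),
    'Delay': [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})
demo = demo.sort_values(['ID', 'Invoice_Date'])

# Label the periods group by group; the pieces realign on the row index.
parts = [window_ids(grp['Invoice_Date'])
         for _, grp in demo.groupby('ID', sort=False)]
demo['window'] = pd.concat(parts)

# Mean Delay per (ID, period), shown only on the last invoice of each period.
keys = ['ID', 'window']
window_mean = demo.groupby(keys)['Delay'].transform('mean')
is_last = ~demo.duplicated(subset=keys, keep='last')
demo['Mean'] = window_mean.where(is_last)
print(demo)
```

Rows that are not the last invoice of their period get NaN in Mean rather than a blank. If the period should instead follow the reset rule implemented in func above, the labelling loop would need to change accordingly.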
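[Discussion]:
-
What is your question?
-
I need to compute the Mean column based on Delay.
-
Are you looking for someone to write the code for you?
-
No, just ideas.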
Tags: python pandas function dataframe