【Posted】: 2020-02-17 19:24:06
【Question】:
I have the following DataFrame:
duid start_date end_date
0 b2919f1eb 2019-08-26 2019-09-05
1 e372dedd4 2019-08-26 NaT
2 ba8147ce9 2019-09-09 2019-11-05
3 902c56036 2019-09-13 2019-10-01
4 16ec096a7 2019-09-17 2019-10-02
5 1faac1a15 2019-09-17 NaT
6 319fb59f5 2019-09-24 2020-01-20
7 2a3f1dac5 2019-10-01 NaT
8 aecbcf0c5 2019-10-01 2019-11-05
9 0ee088b63 2019-10-08 2019-10-03
10 c0c02fa4c 2019-10-31 2019-10-31
12 aac5fbc7d 2019-11-05 2019-11-05
11 c76bc248a 2019-11-05 2019-11-29
13 20dcef410 2019-11-12 NaT
14 bc7ea631d 2019-11-12 NaT
15 786af275b 2019-11-12 2019-11-12
16 005ec00c8 2019-11-15 NaT
17 482462695 2019-11-19 NaT
18 ecba54e5d 2019-11-26 NaT
19 28490c52f 2019-12-17 NaT
20 02f2f7f4b 2020-01-15 NaT
21 0ea659d1a 2020-01-29 NaT
22 0b78caca1 2020-01-29 NaT
23 368cc8744 2020-01-29 2020-01-29
This table lists the hire and leave dates of employees. So far I have managed to compute the counts per month:
df.groupby(df['start_date'].dt.strftime('%Y %B')) \
.agg(hired=('start_date', 'size'), left=('end_date', 'count')) \
.reset_index()
start_date hired left
0 2019 August 2 1
1 2019 December 1 0
2 2019 November 8 3
3 2019 October 4 3
4 2019 September 5 4
5 2020 January 4 1
I also tried to compute the cumulative sum per date, but it returns strange results:
ds = df.groupby(df['start_date'].dt.strftime('%Y %B'))
ds.size().cumsum()
start_date
2019 August 2
2019 December 3
2019 November 11
2019 October 15
2019 September 20
2020 January 24
dtype: int64
And the cumulative count of employees who left:
de = df.groupby(df['end_date'].dt.strftime('%Y %B'))
de.size().cumsum()
end_date
2019 November 5
2019 October 9
2019 September 10
2020 January 12
dtype: int64
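There is also a sorting issue: I don't know why the table is not ordered by start_date. That issue is unrelated to computing the difference between the two values, though, i.e.: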
df = df.sort_values('start_date')
How can I combine the cumulative counts for the start_date and end_date columns to get the following result?
start_date hired left rooster
0 2019 August 2 1 1
1 2019 September 5 4 2
2 2019 October 4 3 3
3 2019 November 8 3 8
4 2019 December 1 0 9
5 2020 January 4 1 12
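For reference, the rooster column in the expected output above equals the running total of hired minus the running total of left. A minimal sketch that checks this arithmetic, assuming the monthly counts are typed in by hand in chronological order (the name `monthly` is made up for illustration and is not part of the original code):

```python
import pandas as pd

# Monthly counts copied from the expected table, already in chronological order.
monthly = pd.DataFrame({
    "start_date": ["2019 August", "2019 September", "2019 October",
                   "2019 November", "2019 December", "2020 January"],
    "hired": [2, 5, 4, 8, 1, 4],
    "left": [1, 4, 3, 3, 0, 1],
})

# Headcount after each month = cumulative hires minus cumulative departures.
monthly["rooster"] = monthly["hired"].cumsum() - monthly["left"].cumsum()
print(monthly)
```

This only reproduces the target column when the rows are in chronological order, which is why the sorting issue matters for the cumulative sums.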
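【Comments】:
- The table is not sorted by date because your group keys are strings, not datetime values, so they sort alphabetically. You should consider replacing dt.strftime('%Y %B') with dt.to_period('M').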
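To illustrate the comment, here is a small sketch comparing the two group keys. It assumes a made-up Series `dates` with one start date per month taken from the question; it is not the full data set, only enough to show the ordering.

```python
import pandas as pd

# One sample start date per month (illustrative subset of the question's data).
dates = pd.Series(pd.to_datetime([
    "2019-08-26", "2019-09-09", "2019-10-01",
    "2019-11-05", "2019-12-17", "2020-01-15",
]))

# String keys: groupby sorts them alphabetically, so December lands before November.
by_text = dates.groupby(dates.dt.strftime("%Y %B")).size()
print(by_text.index.tolist())

# Period keys: groupby sorts them chronologically.
by_period = dates.groupby(dates.dt.to_period("M")).size()
print([str(p) for p in by_period.index])

# If the "2019 August" style labels are wanted, format them after grouping.
labels = by_period.index.strftime("%Y %B").tolist()
print(labels)
```

With period keys the cumulative sums follow calendar order, and the readable month labels can still be produced at the end for display.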