Difference between dates in Pandas dataframe: answers

[Question title]: Difference between dates in Pandas dataframe
[Posted]: 2018-03-29 14:58:13
[Question description]:

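This is related to this question, but now I need to find the difference between dates stored as "YYYY-MM-DD". Essentially, what we need is the difference between consecutive values in the count column, normalized by the number of days between each pair of rows.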

My dataframe is:

date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,53.0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,53.0
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,53.0
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,54.0
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,54.0
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,54.0
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,58.0
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,521.0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,524.0
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,531.0
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,533.0
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,534.0

I want to find the difference between consecutive dates after grouping by the date+site+country+kind+ID tuple.

date,site,country_code,kind,ID,rank,votes,sessions,avg_score,count,day_diff
2017-03-20,website1,US,0,84,226,0.0,15.0,3.370812,0,0
2017-03-21,website1,US,0,84,214,0.0,15.0,3.370812,0,1
2017-03-22,website1,US,0,84,226,0.0,16.0,3.370812,0,1
2017-03-23,website1,US,0,84,234,0.0,16.0,3.369048,0,1
2017-03-24,website1,US,0,84,226,0.0,16.0,3.369048,0,1
2017-03-25,website1,US,0,84,212,0.0,16.0,3.369048,0,1
2017-03-27,website1,US,0,84,228,0.0,16.0,3.369048,4,2
2017-02-15,website2,AU,1,91,144,4.0,148.0,4.727272,0,0
2017-02-16,website2,AU,1,91,144,3.0,147.0,4.727272,3,1
2017-02-20,website2,AU,1,91,100,4.0,148.0,4.727272,7,4
2017-02-21,website2,AU,1,91,118,6.0,149.0,4.727272,3,1
2017-02-22,website2,AU,1,91,114,4.0,151.0,4.727272,1,1

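One option is to convert the date column to a Pandas datetime with pd.to_datetime() and then use the diff function, but that produces "x days" values of type timedelta64. I want to use this difference to find the daily average count, so it would be great if this could be done in one step, or at least a less painful one.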
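For illustration, here is a minimal sketch of the behavior described above. It uses a small made-up frame rather than the full data, and the variable name `delta` is only an assumption for this example.

```python
import pandas as pd

# Tiny illustrative frame (not the full data set from the question)
df = pd.DataFrame({"date": ["2017-03-25", "2017-03-27", "2017-03-28"]})
df["date"] = pd.to_datetime(df["date"])

# diff() on a datetime column returns Timedelta values ("2 days", "1 days"),
# not plain numbers; the first row has no predecessor, so it is NaT
delta = df["date"].diff()
print(delta)
```

Because the result is a timedelta series, it cannot be used directly as a plain number of days without a further conversion.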

[Question comments]:

Tags: python pandas datetime dataframe pandas-groupby

[Solution 1]:

You can use the .dt.days accessor:

In [72]: df['date'] = pd.to_datetime(df['date'])

In [73]: df['day_diff'] = df.groupby(['site','country_code','kind','ID'])['date'] \
                            .diff().dt.days.fillna(0)

In [74]: df
Out[74]:
         date      site country_code  kind  ID  rank  votes  sessions  avg_score  count  day_diff
0  2017-03-20  website1           US     0  84   226    0.0      15.0   3.370812   53.0       0.0
1  2017-03-21  website1           US     0  84   214    0.0      15.0   3.370812   53.0       1.0
2  2017-03-22  website1           US     0  84   226    0.0      16.0   3.370812   53.0       1.0
3  2017-03-23  website1           US     0  84   234    0.0      16.0   3.369048   54.0       1.0
4  2017-03-24  website1           US     0  84   226    0.0      16.0   3.369048   54.0       1.0
5  2017-03-25  website1           US     0  84   212    0.0      16.0   3.369048   54.0       1.0
6  2017-03-27  website1           US     0  84   228    0.0      16.0   3.369048   58.0       2.0
7  2017-02-15  website2           AU     1  91   144    4.0     148.0   4.727272  521.0       0.0
8  2017-02-16  website2           AU     1  91   144    3.0     147.0   4.727272  524.0       1.0
9  2017-02-20  website2           AU     1  91   100    4.0     148.0   4.727272  531.0       4.0
10 2017-02-21  website2           AU     1  91   118    6.0     149.0   4.727272  533.0       1.0
11 2017-02-22  website2           AU     1  91   114    4.0     151.0   4.727272  534.0       1.0
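
The question's end goal is a per-day average of the change in count. One possible way to get there, sketched on a reduced version of the data: take the grouped difference of count and divide it by the grouped day difference. The column name `daily_avg_count` and the choice to fill each group's first row with 0 are assumptions made for this example, not part of the original answer.

```python
import pandas as pd

# Reduced sample: two rows per group, taken from the question's data
df = pd.DataFrame({
    "date": ["2017-03-25", "2017-03-27", "2017-02-16", "2017-02-20"],
    "site": ["website1", "website1", "website2", "website2"],
    "country_code": ["US", "US", "AU", "AU"],
    "kind": [0, 0, 1, 1],
    "ID": [84, 84, 91, 91],
    "count": [54.0, 58.0, 524.0, 531.0],
})
df["date"] = pd.to_datetime(df["date"])

keys = ["site", "country_code", "kind", "ID"]
grouped = df.groupby(keys)

# Both diffs are NaN on the first row of each group
day_diff = grouped["date"].diff().dt.days
count_diff = grouped["count"].diff()

# Change in count per elapsed day; first row of each group is set to 0
df["daily_avg_count"] = (count_diff / day_diff).fillna(0)
print(df[["date", "site", "count", "daily_avg_count"]])
```

This sketch assumes the rows are already sorted by date within each group and that no group contains repeated dates; a zero day difference would make the division produce inf or NaN.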

[Comments]: