【问题标题】:Groupby by sort based on date time, groupby sequence based on 'ID' and Date and then mean by sequenceGroupby by sort 基于日期时间,groupby 序列基于 'ID' 和 Date 然后按序列表示
【发布时间】:2019-11-20 22:48:21
【问题描述】:

我是熊猫功能的新手。 我有一个 DF,如下所示。这是手机的维修数据。

  ID    Status       Date     Cost
0   1     F       22-Jun-17   500
1   1     M       22-Jul-17   100
2   2     M       29-Jun-17   200
3   3     M       20-Mar-17   300
4   4     M       10-Aug-17   800
5   2     F       29-Sep-17   600
6   2     F       29-Jan-18   500
7   1     F       22-Jun-18   600
8   3     F       20-Jun-18   700
9   1     M       22-Aug-18   150
10  1     F       22-Mar-19   750
11  3     M       20-Oct-18   250
12  4     F       10-Jun-18   100

我试图从以前的状态中找出每个 id 的持续时间。

找出该 ID 的每个状态序列的平均值。

我的预期输出如下所示。

    ID  S1    S1_Dur  S2    S2_dur  S3   S3_dur  S4   S4_dur    Avg_MF  Avg_FM
0   1   F-M    30     M-F   335.00  F-M  61.00   M-F  750.00    542.50  45.50
1   2   M-F    92     F-F   122.00  NaN  nan     NaN    nan     92.00   nan
2   3   M-F    457    F-M   122.00  NaN  nan     NaN    nan     457.00  122.00
3   4   M-F    304    NaN   nan     NaN  nan     NaN    nan     304.00  nan


S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FMn = Average F-M Duration

我尝试了以下代码

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)

然后我得到了一个如下图所示的DF

   ID   Status  index   Date    Cost    Difference
0   1   F        0  2017-06-22  500     NaT
1   1   M        1  2017-07-22  100     30 days
2   1   F        7  2018-06-22  600     335 days
3   1   M        9  2018-08-22  150     61 days
4   1   F       10  2019-03-22  750     212 days
5   2   M        2  2017-06-29  200     NaT
6   2   F        5  2017-09-29  600     92 days
7   2   F        6  2018-01-29  500     122 days
8   3   M        3  2017-03-20  300     NaT
9   3   F        8  2018-06-20  700     457 days
10  3   M       11  2018-10-20  250     122 days
11  4   M        4  2017-08-10  800     NaT
12  4   F       12  2018-06-10  100     304 days

之后我就卡住了。

【问题讨论】:

    标签: pandas datetime pandas-groupby


    【解决方案1】:

    想法是通过DataFrameGroupBy.diff 创建新的差异列,并通过DataFrameGroupBy.shift 连接Status 的移位值。删除S 列中缺失值的行。然后通过DataFrame.unstackGroupBy.cumcount 重塑计数器列,通过DataFrame.pivot_table 为每对S 创建均值,最后使用DataFrame.join

    df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
    df = df.sort_values(['ID', 'Date', 'Status'])
    
    df['D'] = df.groupby('ID')['Date'].diff().dt.days
    df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
    df = df.dropna(subset=['S'])
    df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
    
    df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
    
    df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
    df2.columns = df2.columns.map('_'.join)
    
    df3 = df2.join(df1).reset_index()
    print (df3)
    
       ID    D_1  S_1    D_2  S_2   D_3  S_3    D_4  S_4  Avg_F-F  Avg_F-M  \
    0   1   30.0  F-M  335.0  M-F  61.0  F-M  212.0  M-F      NaN     45.5   
    1   2   92.0  M-F  122.0  F-F   NaN  NaN    NaN  NaN    122.0      NaN   
    2   3  457.0  M-F  122.0  F-M   NaN  NaN    NaN  NaN      NaN    122.0   
    3   4  304.0  M-F    NaN  NaN   NaN  NaN    NaN  NaN      NaN      NaN   
    
       Avg_M-F  
    0    273.5  
    1     92.0  
    2    457.0  
    3    304.0 
    

    【讨论】:

      猜你喜欢
      • 2021-08-25
      • 2020-07-19
      • 2020-07-31
      • 1970-01-01
      • 2020-08-11
      • 1970-01-01
      • 1970-01-01
      • 2020-07-30
      • 2020-05-21
      相关资源
      最近更新 更多