Answers: flattening time series data from a pandas df

【Question title】: flattening time series data from pandas df
【Posted】: 2020-11-08 10:45:03
【Question description】:

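I have a df that looks like this:

I am trying to turn it into this:

The following code gives me a list of lists that I can convert to a df and that covers the first 3 columns of the expected output, but I'm not sure how to get the number columns I need (note: I have more than 3 number columns, but I'm using 3 as a simple illustration).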

x=[['ID','Start','End','Number1','Number2','Number3']]
for i in range(len(df)):
    if not(df.iloc[i-1]['DateSpellIndicator']):
        ID= df.iloc[i]['ID']
        start = df.iloc[i]['Date']
    if not(df.iloc[i]['DateSpellIndicator']):
        newrow = [ID, start,df.iloc[i]['Date'],...]
        x.append(newrow)

【Question comments】:

Tags: python pandas loops time-series flatten

【Solution 1】:

Here is one way to do it using pandas groupby.

Input dataframe:

    ID  DATE        NUM TORF
0   1   2020-01-01  40  True
1   1   2020-02-01  50  True
2   1   2020-03-01  60  False
3   1   2020-06-01  70  True
4   2   2020-07-01  20  True
5   2   2020-08-01  30  False
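
To follow along, the input table above can be rebuilt as a DataFrame. This is only a sketch for reproduction: the answer does not say whether DATE was a string or a datetime column, so parsing it with pd.to_datetime is an assumption.

```python
import pandas as pd

# Sample input matching the table above.
# Assumption: DATE is parsed to datetime64; the answer does not state its dtype.
df = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2],
    "DATE": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                            "2020-06-01", "2020-07-01", "2020-08-01"]),
    "NUM":  [40, 50, 60, 70, 20, 30],
    "TORF": [True, True, False, True, True, False],
})
print(df)
```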

Output dataframe:

    END         ID  Number1 Number2 Number3 START
0   2020-08-01  2   20      30.0    NaN     2020-07-01
1   2020-06-01  1   70      NaN     NaN     2020-06-01
2   2020-03-01  1   40      50.0    60.0    2020-01-01

Code:

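import numpy as np
import pandas as pd

new_df = pd.DataFrame()
# Create groups based on ID
for _, group in df.groupby('ID'):
    # Within each group, cut just after every row where TORF is False.
    # Plain iloc slices are used here in place of np.split on a DataFrame.
    cuts = np.flatnonzero(~group['TORF']) + 1
    bounds = [0, *cuts, len(group)]
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        sub_df = group.iloc[lo:hi]
        # Skip the empty piece left over when a group ends on False
        if not sub_df.empty:
            dfmod = pd.DataFrame({'ID': sub_df['ID'].iloc[0],
                                  'START': sub_df['DATE'].iloc[0],
                                  'END': sub_df['DATE'].iloc[-1]}, index=[0])
            # One NumberN column per row of the spell
            for j, num in enumerate(sub_df['NUM'], start=1):
                dfmod['Number{}'.format(j)] = num
            # Put the new row in front of the rows built so far
            new_df = pd.concat([dfmod, new_df], axis=0)

# reset_index returns a new frame, so assign the result
new_df = new_df.reset_index(drop=True)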

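【Comments】:

Confirming that I've tested this on a subset of my main df and it works! Thank you very much. I'm working with a very large df (16m rows), so it takes a while to run. Happy to hear any suggestions for improving performance, but this is great. Thanks!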
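
On the performance request above, one possible direction is to avoid the Python-level loops entirely. The sketch below is not from either answer: flatten_spells is a hypothetical helper name, it assumes the Solution 1 column names (ID, DATE, NUM, TORF), rows already sorted by ID and date, and a pandas version where pivot accepts a list for index (1.1 or later). It has not been benchmarked on a 16m-row frame.

```python
import pandas as pd

def flatten_spells(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one row per spell with ID, START, END, Number1..NumberK."""
    # 1 on the row that closes a spell (TORF is False there), else 0
    closes = (~df["TORF"]).astype(int)
    # Spell label: how many spells were closed strictly before this row
    spell = (closes.cumsum() - closes).rename("spell")
    keys = [df["ID"], spell]
    # Position of each row inside its spell: 1, 2, 3, ...
    pos = df.groupby(keys).cumcount() + 1
    # First and last date of every (ID, spell) pair
    dates = df.groupby(keys)["DATE"].agg(START="first", END="last")
    # Spread NUM into one column per position within the spell
    nums = (df.assign(spell=spell, pos=pos)
              .pivot(index=["ID", "spell"], columns="pos", values="NUM")
              .add_prefix("Number"))
    return dates.join(nums).reset_index().drop(columns="spell")

# Usage on the sample input of Solution 1
df = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2],
    "DATE": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                            "2020-06-01", "2020-07-01", "2020-08-01"]),
    "NUM":  [40, 50, 60, 70, 20, 30],
    "TORF": [True, True, False, True, True, False],
})
out = flatten_spells(df)
print(out)
```

Grouping by both ID and the spell label keeps an unclosed spell at the end of one ID from merging with the first spell of the next ID. Rows come out sorted by ID and spell rather than in the reversed order that Solution 1 produces.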

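【Solution 2】:

Some steps can be cut while still getting the same output. I used cumsum to get the first date and the last date, and list to get the columns the way you want them. Note that the output column names differ from your example; I assume you can rename them however you like.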

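import pandas as pd

df['new1'] = ~df['datespell']                   # True on the row that closes a spell
df['new2'] = df['new1'].cumsum() - df['new1']   # spell number: spells closed before this row
# Named aggregation replaces the original nested-dict .agg(), which raises
# SpecificationError in pandas >= 1.0; check.columns.droplevel(0) is then not needed
check = df.groupby(['id', 'new2']).agg(start=('date', 'first'), end=('date', 'last'), cols=('number', lambda x: list(x)))
check.reset_index(inplace=True)
# Expand each list in 'cols' into its own numbered column
pd.concat([check, check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)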


    id  new2    start       end         0       1       2
0   1   0       2020-01-01  2020-03-01  40.0    50.0    60.0
1   1   1       2020-06-01  2020-06-01  70.0    NaN     NaN
2   2   1       2020-07-01  2020-08-01  20.0    30.0    NaN

This is the dataframe I used.

    id  date        number  datespell   new1    new2
0   1   2020-01-01  40      True        False   0
1   1   2020-02-01  50      True        False   0
2   1   2020-03-01  60      False       True    0
3   1   2020-06-01  70      True        False   1
4   2   2020-07-01  20      True        False   1
5   2   2020-08-01  30      False       True    1
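
The new2 column in the table above can be traced step by step. The snippet below is a small illustration of the cumsum trick on the same six datespell values; the variable name running is introduced only for this example.

```python
import pandas as pd

datespell = pd.Series([True, True, False, True, True, False])

new1 = (~datespell).astype(int)  # 1 on the row that closes a spell
running = new1.cumsum()          # spells closed up to and including this row
new2 = running - new1            # spells closed strictly before this row

print(pd.DataFrame({"new1": new1, "cumsum": running, "new2": new2}))
```

Subtracting new1 keeps the closing row in the spell it ends instead of pushing it into the next one, which is why rows 0 to 2 all get spell number 0.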

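【Comments】:

Thank you very much for your input. Could you briefly describe what is happening in .agg()? I ask because I'm getting an error here about nested aggregation.
Sorry, I meant nested renaming. I'm particularly curious what 'first' and 'last' are.
Actually, I tested a workaround for the .agg() syntax above. Here are the edits that make the rest of your code work perfectly (thanks again)! 1) .agg(start=('date', 'first'), end=('date', 'last'), cols=('number',lambda x: list(x))) 2) Remove the line of code that starts with check.columns.
I also want to confirm here, for anyone interested in performance, that this syntax runs faster than the accepted answer.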
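
On the question of what 'first' and 'last' are: they are names of built-in groupby reductions that return the first and last non-null value of each group, in the row order of the frame. A minimal sketch of the named-aggregation form on a made-up four-row frame (the data here is illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 2],
    "date": ["2020-01-01", "2020-02-01", "2020-03-01", "2020-07-01"],
})

# Each keyword is an output column name; the tuple is (input column, reduction)
summary = df.groupby("id").agg(start=("date", "first"), end=("date", "last"))
print(summary)
```

Because the result depends on row order, the frame should be sorted by date within each id before aggregating.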