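【Question title】: Splitting a pandas DataFrame
【Posted】: 2017-07-27 14:18:42
【Question】:

I want to filter my original dataframe and split it into multiple dataframes, using the condition that progressPercentage runs from 1.0 to 100, as in the following example:

Input: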

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,23.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,24.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0

Output:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,2.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,3.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,4.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:01:00,100.0

I have been using .shift() and groupby, as shown below:

 a = dfb['Operation.progressPercentage'].shift().eq(100)
 grouping = dfb.groupby([dfb.wm_id,a])

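But it does not give the expected result. How should I change the code to make it work? Any help is appreciated.

Many thanks in advance. Best regards, Carlo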
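For readers following along, here is a minimal sketch of one likely reason the attempt above falls short. It uses made-up toy data and the column names from the sample input rather than the real `dfb`. A boolean grouping key can only take two values per id, while a running count of the same flags labels every run separately.

```python
import pandas as pd

# Toy data (hypothetical): three short runs for one id, each ending at 100
toy = pd.DataFrame({
    "id_B": ["id1"] * 6,
    "progressPercentage": [99.0, 100.0, 1.0, 100.0, 2.0, 100.0],
})

# True only on rows that directly follow a 100
after_100 = toy["progressPercentage"].shift().eq(100)

# Grouping by a boolean key yields at most two groups per id
bool_groups = toy.groupby(["id_B", after_100]).ngroups

# A running count of the flags gives every run its own label
run_id = after_100.cumsum()
run_groups = toy.groupby(["id_B", run_id]).ngroups

print(bool_groups, run_groups)
```

On this toy input the boolean key yields two groups and the cumulative key yields three.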

【Question comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    If the Finish value is sometimes missing and you need to rely on the progressPercentage column only, use:

    shifted = df['progressPercentage'].shift()
    # flag rows whose previous value is 100 and differs from the current value,
    # so that consecutive 100s (e.g. rows 15 and 16) stay in the same group
    m = shifted.diff(-1).ne(0) & shifted.eq(100)
    a = m.cumsum()
    
    aa = df.groupby([df.id_B,a])
    
    for k, gp in aa: 
        print('key=' + str(k))
        print(gp)
        print('A NEW ONE...')
    

    key=('id1', 0)
      id_B                 ts_B  course  weight   Phase remainingTime  \
    0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
    1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
    2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
    3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
    4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
    5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
    6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   
    
       progressPercentage  
    0                23.0  
    1                24.0  
    2                24.0  
    3                24.0  
    4                99.0  
    5               100.0  
    6               100.0  
    A NEW ONE...
    key=('id1', 1)
       id_B                 ts_B  course  weight   Phase remainingTime  \
    7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
    8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
    9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
    10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
    11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
    12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
    13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
    14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
    15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
    16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   
    
        progressPercentage  
    7                  1.0  
    8                  2.0  
    9                  2.0  
    10                 3.0  
    11                 4.0  
    12                 5.0  
    13                98.0  
    14                99.0  
    15               100.0  
    16               100.0  
    A NEW ONE...
    key=('id2', 2)
    
    ...
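To see what the mask in this answer does, here is a small sketch on a made-up series (the values are hypothetical, not the asker's data). It also compares the mask with a simpler formulation, `m_alt`, which is my own assumption and not part of the answer.

```python
import pandas as pd

# Toy progress values (hypothetical): the first run ends with a repeated 100
pp = pd.Series([23.0, 99.0, 100.0, 100.0, 1.0, 50.0, 100.0, 2.0])

shifted = pp.shift()  # value of the previous row
# True where the previous row was 100 and the current row differs from it
m = shifted.diff(-1).ne(0) & shifted.eq(100)
run_id = m.cumsum()

# Alternative for comparison (an assumption, not from the answer):
# previous row is 100 and the current row is not
m_alt = pp.shift().eq(100) & pp.ne(100)

print(run_id.tolist())  # [0, 0, 0, 0, 1, 1, 1, 2]
```

The two masks agree on this input. They can differ on the very last row: `diff(-1)` has no following row to compare with and yields NaN there, so if the data ends with a repeated 100, the original mask would start a new group on that final row.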
    

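    【Discussion】:

    • It works. Thank you very much, jezrael. Actually, my solution was very close to yours and was based on the feedback you gave earlier. Thanks.
    • Super, glad I could help.
    • This is neater. Plus one, sir.
    • Given the output above, may I ask: how can I count how many datasets I got for each id? In our example, id1=4 and id2=1.
    • You can use print(df.id_B.value_counts())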
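A note on the last comment: `value_counts()` on `id_B` counts rows per id, which may not be what was asked. A minimal sketch of counting the groups instead, on made-up toy data and assuming the group label `a` is built as in the answer above:

```python
import pandas as pd

# Toy data (hypothetical): id1 has two runs, id2 has one
df = pd.DataFrame({
    "id_B": ["id1", "id1", "id1", "id1", "id2", "id2"],
    "progressPercentage": [99.0, 100.0, 1.0, 100.0, 1.0, 100.0],
})

shifted = df["progressPercentage"].shift()
a = (shifted.diff(-1).ne(0) & shifted.eq(100)).cumsum()

# value_counts() counts rows per id ...
rows_per_id = df["id_B"].value_counts()
# ... while the number of distinct run labels per id counts the groups
runs_per_id = a.groupby(df["id_B"]).nunique()

print(runs_per_id.to_dict())  # {'id1': 2, 'id2': 1}
```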
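    【Solution 2】:

    You can split the dataframe at the rows where progressPercentage equals 100. If such rows are consecutive, drop the earlier index. Then slice the dataframe and collect the slices in a list. Hope this helps.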

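    import numpy as np
    import pandas as pd

    # The input CSV from the question; skipinitialspace handles the space in ", ts_B"
    df = pd.read_csv('input.csv', delimiter=',', skipinitialspace=True)
    df.columns = df.columns.str.strip()  # guard against stray spaces in the header
    df1 = df[df["progressPercentage"] == 100]
    # Candidate start positions: the row after every 100, plus row 0
    x = (np.array(df1.index) + 1).tolist()
    x.insert(0, 0)
    # Drop a start that is directly followed by another one (consecutive 100s),
    # but always keep the last candidate, which zip() would otherwise skip
    x = [begin for begin, end in zip(x, x[1:]) if begin != end - 1] + x[-1:]
    # Close the final slice at the end of the data without duplicating a boundary
    x = sorted(set(x + [df.shape[0]]))
    frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]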
    

    You can print the dataframes with a for loop, i.e.

    for df in frames:
        print(df)
    

    Output of the dataframes:

      id_B                 ts_B  course  weight   Phase remainingTime  \
    0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00
    1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00
    2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00
    3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00
    4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00
    5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00
    6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00

       progressPercentage
    0                23.0
    1                24.0
    2                24.0
    3                24.0
    4                99.0
    5               100.0
    6               100.0

       id_B                 ts_B  course  weight   Phase remainingTime  \
    7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00
    8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00
    9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00
    10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00
    11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00
    12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00
    13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00
    14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00
    15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00
    16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00

        progressPercentage
    7                  1.0
    8                  2.0
    9                  2.0
    10                 3.0
    11                 4.0
    12                 5.0
    13                98.0
    14                99.0
    15               100.0
    16               100.0

       id_B                 ts_B  course  weight   Phase remainingTime  \
    17  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00
    18  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00
    19  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00
    20  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00
    21  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00
    22  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00
    23  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00
    24  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00
    25  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00

        progressPercentage
    17                 1.0
    18                 1.0
    19                 2.0
    20                 2.0
    21                 3.0
    22                98.0
    23                99.0
    24               100.0
    25               100.0

       id_B                 ts_B  course  weight   Phase remainingTime  \
    26  id1  2017-05-27 01:35:30  cotton     3.5       A      03:15:00
    27  id1  2017-05-27 01:37:30  cotton     3.5       B      03:13:00
    28  id1  2017-05-27 01:38:00  cotton     3.5       B      03:13:00
    29  id1  2017-05-27 01:38:30  cotton     3.5       C      03:13:00
    30  id1  2017-05-27 01:39:00  cotton     3.5       C      00:02:00
    31  id1  2017-05-27 01:39:30  cotton     3.5       C      00:01:00
    32  id1  2017-05-27 01:40:00  cotton     3.5  Finish      00:01:00

        progressPercentage
    26                23.0
    27                24.0
    28                24.0
    29                24.0
    30                99.0
    31               100.0
    32               100.0

       id_B                 ts_B  course  weight   Phase remainingTime  \
    33  id1  2017-05-27 02:35:30  cotton     3.5       A      01:15:00
    34  id1  2017-05-27 02:36:00  cotton     3.5       A      01:14:00
    35  id1  2017-05-27 02:36:30  cotton     3.5       A      01:13:00
    36  id1  2017-05-27 02:37:00  cotton     3.5       B      01:12:00
    37  id1  2017-05-27 02:37:30  cotton     3.5       B      01:11:00
    38  id1  2017-05-27 02:38:00  cotton     3.5       B      01:10:00
    39  id1  2017-05-27 02:38:30  cotton     3.5       C      01:09:00
    40  id1  2017-05-27 02:39:00  cotton     3.5       C      00:08:00
    41  id1  2017-05-27 02:39:00  cotton     3.5       C      00:08:00

        progressPercentage
    33                 1.0
    34                 2.0
    35                 2.0
    36                 3.0
    37                 4.0
    38                 5.0
    39                98.0
    40                99.0
    41               100.0

       id_B                 ts_B  course  weight   Phase remainingTime  \
    42  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00
    43  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00
    44  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00
    45  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00
    46  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00
    47  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00
    48  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00
    49  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00
    50  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00

        progressPercentage
    42                 1.0
    43                 1.0
    44                 2.0
    45                 2.0
    46                 3.0
    47                98.0
    48                99.0
    49               100.0
    50               100.0

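    【Discussion】:

    • Thanks, Bharath. I will check your solution against the one I devised: a = dfb['Operation.progressPercentage'].shift().eq(100).cumsum() followed by df_output = dfb.groupby([dfb.wm_id,a])
    • However, the word Finish is not always present.
    • I changed your solution to df1 = dfb[dfb['Operation.progressPercentage'] == 100]
    • I just ran it and it did not work. What do you mean by using "40:00" as the end of the dataframe? As you can see, I need to separate one experiment from another, and an experiment is finished when the percentage reaches 100%.
    • Well, that is not possible, because every experiment can be different.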
    【Solution 3】:

    The best approach I found is the following:

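    # Start a new group label on every row that follows a 100.
    # Note: a row that repeats 100 (e.g. Finish) gets a group of its own here.
    a = dfb['progressPercentage'].shift().eq(100).cumsum()
    df_output = dfb.groupby([dfb.id_B, a])
    aa = df_output  # the loop below refers to the grouped object as aa

    for k, gp in aa:
        print('key=' + str(k))
        # 'eventTime' and 'wm_id' in the original code are ts_B and id_B here
        print(gp.sort_values(['ts_B', 'id_B'], ascending=[True, False]).to_string())
        print('A NEW ONE...')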
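Since the question asks for separate dataframes rather than printed groups, here is a minimal sketch of collecting the groups into a list. The toy data is made up; `dfb` and the column names follow this answer, and `frames` is a hypothetical name.

```python
import pandas as pd

# Toy data (hypothetical); column names follow the sample input
dfb = pd.DataFrame({
    "id_B": ["id1", "id1", "id1", "id2", "id2"],
    "ts_B": ["2017-04-27 01:39:00", "2017-04-27 01:39:30", "2017-04-27 02:35:30",
             "2017-04-27 03:36:00", "2017-04-27 03:39:30"],
    "progressPercentage": [99.0, 100.0, 1.0, 1.0, 100.0],
})

a = dfb["progressPercentage"].shift().eq(100).cumsum()

# sort=False keeps the groups in order of first appearance;
# each group becomes its own DataFrame with a fresh index
frames = [gp.reset_index(drop=True)
          for _, gp in dfb.groupby(["id_B", a], sort=False)]

print(len(frames))  # 3
```

Because `id_B` is part of the grouping key, a change of id also starts a new frame even when the previous run never reached 100, as with the third row of this toy input.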
    

    【Discussion】:

    • 'eventTime' = ts_B and 'wm_id' = id_B
    • What is aa in the for loop?
    • aa is df_output