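Splitting a pandas DataFrame: answers

【Question title】: Splitting a pandas DataFrame
【Posted】: 2017-07-27 14:18:42
【Question description】:

I want to filter my original dataframe and split it into multiple dataframes, using the condition that progressPercentage runs from 1.0 up to 100, as in the following example:

Input: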

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,23.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,24.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,24.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0

Output:

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 01:35:30,cotton,3.5,A,01:15:00,23.0
id1,2017-04-27 01:37:30,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:00,cotton,3.5,B,01:13:00,24.0
id1,2017-04-27 01:38:30,cotton,3.5,C,01:13:00,24.0
id1,2017-04-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-04-27 02:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-04-27 02:36:00,cotton,3.5,A,03:14:00,2.0  
id1,2017-04-27 02:36:30,cotton,3.5,A,03:14:00,2.0 
id1,2017-04-27 02:37:00,cotton,3.5,B,03:13:00,3.0
id1,2017-04-27 02:37:30,cotton,3.5,B,03:13:00,4.0
id1,2017-04-27 02:38:00,cotton,3.5,B,03:13:00,5.0
id1,2017-04-27 02:38:30,cotton,3.5,C,03:13:00,98.0
id1,2017-04-27 02:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-04-27 02:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-04-27 02:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id2,2017-04-27 03:36:00,cotton,3.5,A,03:15:00,1.0
id2,2017-04-27 03:36:30,cotton,3.5,A,03:14:00,1.0 
id2,2017-04-27 03:37:00,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:37:30,cotton,3.5,B,03:13:00,2.0
id2,2017-04-27 03:38:00,cotton,3.5,B,03:13:00,3.0
id2,2017-04-27 03:38:30,cotton,3.5,C,03:13:00,98.0
id2,2017-04-27 03:39:00,cotton,3.5,C,00:02:00,99.0
id2,2017-04-27 03:39:30,cotton,3.5,C,00:01:00,100.0
id2,2017-04-27 03:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 01:35:30,cotton,3.5,A,03:15:00,1.0
id1,2017-05-27 01:37:30,cotton,3.5,B,03:13:00,2.0
id1,2017-05-27 01:38:00,cotton,3.5,B,03:13:00,3.0
id1,2017-05-27 01:38:30,cotton,3.5,C,03:13:00,4.0
id1,2017-05-27 01:39:00,cotton,3.5,C,00:02:00,99.0
id1,2017-05-27 01:39:30,cotton,3.5,C,00:01:00,100.0
id1,2017-05-27 01:40:00,cotton,3.5,Finish,00:01:00,100.0

id_B, ts_B,course,weight,Phase,remainingTime,progressPercentage
id1,2017-05-27 02:35:30,cotton,3.5,A,01:15:00,1.0
id1,2017-05-27 02:36:00,cotton,3.5,A,01:14:00,2.0  
id1,2017-05-27 02:36:30,cotton,3.5,A,01:13:00,2.0 
id1,2017-05-27 02:37:00,cotton,3.5,B,01:12:00,3.0
id1,2017-05-27 02:37:30,cotton,3.5,B,01:11:00,4.0
id1,2017-05-27 02:38:00,cotton,3.5,B,01:10:00,5.0
id1,2017-05-27 02:38:30,cotton,3.5,C,01:09:00,98.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:08:00,99.0
id1,2017-05-27 02:39:00,cotton,3.5,C,00:01:00,100.0

I have been using .shift() and groupby, as shown below:

 a = dfb['Operation.progressPercentage'].shift().eq(100)
 grouping = dfb.groupby([dfb.wm_id,a])

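But it does not give the expected result. How should I change the code to achieve this? Any help is appreciated.

Thanks a lot in advance. Best regards, Carlo

【Question comments】:

Tags: python pandas dataframe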

【Solution 1】:

If the Finish value is sometimes missing and you need to rely on the progressPercentage column alone, use:

shifted = df['progressPercentage'].shift()
# when two 100s are adjacent (e.g. rows 15 and 16), start the new group only after the second one
m = shifted.diff(-1).ne(0) & shifted.eq(100)
a = m.cumsum()

aa = df.groupby([df.id_B,a])

for k, gp in aa: 
    print('key=' + str(k))
    print(gp)
    print('A NEW ONE...')

key=('id1', 0)
  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00   
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00   
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00   
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00   
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00   
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00   
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00   

   progressPercentage  
0                23.0  
1                24.0  
2                24.0  
3                24.0  
4                99.0  
5               100.0  
6               100.0  
A NEW ONE...
key=('id1', 1)
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00   
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00   
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00   
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00   
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00   
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00   
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00   
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00   
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00   
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00   

    progressPercentage  
7                  1.0  
8                  2.0  
9                  2.0  
10                 3.0  
11                 4.0  
12                 5.0  
13                98.0  
14                99.0  
15               100.0  
16               100.0  
A NEW ONE...
key=('id2', 2)

...
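
To see why the mask above starts a new group only after the last 100 of a run, it can help to print the intermediate columns. The following is a minimal sketch on a made-up progress column (the values and the names `progress`, `starts`, `run_id` and `steps` are illustrative, not taken from the question):

```python
import pandas as pd

# Toy progress column: the first run ends with two consecutive 100s,
# the second run ends with a single 100.
progress = pd.Series([23.0, 99.0, 100.0, 100.0, 1.0, 50.0, 100.0, 1.0])

shifted = progress.shift()  # value of the previous row

# shifted.eq(100): the previous row was 100.
# shifted.diff(-1).ne(0): the current row differs from the previous row,
# so a second 100 directly after a 100 does not open a new group.
starts = shifted.diff(-1).ne(0) & shifted.eq(100)

# Each True bumps the counter, giving one integer label per run.
run_id = starts.cumsum()

steps = pd.DataFrame(
    {"progress": progress, "shifted": shifted, "starts": starts, "run_id": run_id}
)
print(steps)
```

Passing `run_id` to `groupby` together with `id_B`, as in the answer, then yields one group per id and run.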

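【Comments】:

It works. Thank you very much, Jezrael. Actually, my solution was very close to yours and was based on feedback you gave me earlier. Thanks.
Great, glad I could help.
This is neater. Plus one, sir.
Given the output above, how can I count how many datasets I got for each id? In our example, id1=4 and id2=1.
You can use print (df.id_B.value_counts())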
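
One thing to check: `value_counts()` on `id_B` counts rows per id, which is not the same as the number of groups per id. A minimal sketch of counting groups instead, assuming the grouping key is built as in Solution 1 (the toy data and the names `run_id`, `runs_per_id` and `rows_per_id` are illustrative):

```python
import pandas as pd

# Toy data: id1 has two runs, then id2 has one, then id1 starts a third.
df = pd.DataFrame({
    "id_B": ["id1", "id1", "id1", "id1", "id1", "id2", "id2", "id1", "id1"],
    "progressPercentage": [50.0, 100.0, 100.0, 1.0, 100.0, 1.0, 100.0, 1.0, 99.0],
})

# Same grouping key as in Solution 1
shifted = df["progressPercentage"].shift()
run_id = (shifted.diff(-1).ne(0) & shifted.eq(100)).cumsum()

# Number of distinct runs per id
runs_per_id = run_id.groupby(df["id_B"]).nunique()

# For comparison: number of rows per id
rows_per_id = df["id_B"].value_counts()

print(runs_per_id)
print(rows_per_id)
```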

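【Solution 2】:

You can split the dataframe at the rows where progressPercentage equals 100. If such rows are consecutive, drop the earlier index. Then slice the dataframe and collect the pieces in a list. Hope this helps.

import numpy as np
import pandas as pd

df = pd.read_csv('input.csv', delimiter=',')  # the input CSV provided in the question
df.columns = df.columns.str.strip()  # header names may carry stray spaces
df1 = df[df["progressPercentage"] == 100]
# Candidate split points: the row after each 100 (assumes the default 0..n-1 index)
x = (np.array(df1.index) + 1).tolist()
x.insert(0, 0)
# Drop a split point when the next one follows immediately, so consecutive 100s stay in one dataframe
x = [begin for begin, end in zip(x, x[1:] + [None]) if begin + 1 != end]
if x[-1] != df.shape[0]:
    x.append(df.shape[0])  # close the last slice when the data does not end on 100
frames = [df.iloc[begin:end] for begin, end in zip(x, x[1:])]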

You can print the dataframes with a for loop, i.e.

for frame in frames:
    print(frame)

Output of the dataframes:

  id_B                 ts_B  course  weight   Phase remainingTime  \
0  id1  2017-04-27 01:35:30  cotton     3.5       A      01:15:00
1  id1  2017-04-27 01:37:30  cotton     3.5       B      01:13:00
2  id1  2017-04-27 01:38:00  cotton     3.5       B      01:13:00
3  id1  2017-04-27 01:38:30  cotton     3.5       C      01:13:00
4  id1  2017-04-27 01:39:00  cotton     3.5       C      00:02:00
5  id1  2017-04-27 01:39:30  cotton     3.5       C      00:01:00
6  id1  2017-04-27 01:40:00  cotton     3.5  Finish      00:01:00

   progressPercentage
0                23.0
1                24.0
2                24.0
3                24.0
4                99.0
5               100.0
6               100.0
   id_B                 ts_B  course  weight   Phase remainingTime  \
7   id1  2017-04-27 02:35:30  cotton     3.5       A      03:15:00
8   id1  2017-04-27 02:36:00  cotton     3.5       A      03:14:00
9   id1  2017-04-27 02:36:30  cotton     3.5       A      03:14:00
10  id1  2017-04-27 02:37:00  cotton     3.5       B      03:13:00
11  id1  2017-04-27 02:37:30  cotton     3.5       B      03:13:00
12  id1  2017-04-27 02:38:00  cotton     3.5       B      03:13:00
13  id1  2017-04-27 02:38:30  cotton     3.5       C      03:13:00
14  id1  2017-04-27 02:39:00  cotton     3.5       C      00:02:00
15  id1  2017-04-27 02:39:30  cotton     3.5       C      00:01:00
16  id1  2017-04-27 02:40:00  cotton     3.5  Finish      00:01:00

    progressPercentage
7                  1.0
8                  2.0
9                  2.0
10                 3.0
11                 4.0
12                 5.0
13                98.0
14                99.0
15               100.0
16               100.0
   id_B                 ts_B  course  weight   Phase remainingTime  \
17  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00
18  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00
19  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00
20  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00
21  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00
22  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00
23  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00
24  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00
25  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00

    progressPercentage
17                 1.0
18                 1.0
19                 2.0
20                 2.0
21                 3.0
22                98.0
23                99.0
24               100.0
25               100.0
   id_B                 ts_B  course  weight   Phase remainingTime  \
26  id1  2017-05-27 01:35:30  cotton     3.5       A      03:15:00
27  id1  2017-05-27 01:37:30  cotton     3.5       B      03:13:00
28  id1  2017-05-27 01:38:00  cotton     3.5       B      03:13:00
29  id1  2017-05-27 01:38:30  cotton     3.5       C      03:13:00
30  id1  2017-05-27 01:39:00  cotton     3.5       C      00:02:00
31  id1  2017-05-27 01:39:30  cotton     3.5       C      00:01:00
32  id1  2017-05-27 01:40:00  cotton     3.5  Finish      00:01:00

    progressPercentage
26                23.0
27                24.0
28                24.0
29                24.0
30                99.0
31               100.0
32               100.0
   id_B                 ts_B  course  weight   Phase remainingTime  \
33  id1  2017-05-27 02:35:30  cotton     3.5       A      01:15:00
34  id1  2017-05-27 02:36:00  cotton     3.5       A      01:14:00
35  id1  2017-05-27 02:36:30  cotton     3.5       A      01:13:00
36  id1  2017-05-27 02:37:00  cotton     3.5       B      01:12:00
37  id1  2017-05-27 02:37:30  cotton     3.5       B      01:11:00
38  id1  2017-05-27 02:38:00  cotton     3.5       B      01:10:00
39  id1  2017-05-27 02:38:30  cotton     3.5       C      01:09:00
40  id1  2017-05-27 02:39:00  cotton     3.5       C      00:08:00
41  id1  2017-05-27 02:39:00  cotton     3.5       C      00:08:00

    progressPercentage
33                 1.0
34                 2.0
35                 2.0
36                 3.0
37                 4.0
38                 5.0
39                98.0
40                99.0
41               100.0
   id_B                 ts_B  course  weight   Phase remainingTime  \
42  id2  2017-04-27 03:36:00  cotton     3.5       A      03:15:00
43  id2  2017-04-27 03:36:30  cotton     3.5       A      03:14:00
44  id2  2017-04-27 03:37:00  cotton     3.5       B      03:13:00
45  id2  2017-04-27 03:37:30  cotton     3.5       B      03:13:00
46  id2  2017-04-27 03:38:00  cotton     3.5       B      03:13:00
47  id2  2017-04-27 03:38:30  cotton     3.5       C      03:13:00
48  id2  2017-04-27 03:39:00  cotton     3.5       C      00:02:00
49  id2  2017-04-27 03:39:30  cotton     3.5       C      00:01:00
50  id2  2017-04-27 03:40:00  cotton     3.5  Finish      00:01:00

    progressPercentage
42                 1.0
43                 1.0
44                 2.0
45                 2.0
46                 3.0
47                98.0
48                99.0
49               100.0
50               100.0
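
If the goal is a list of separate dataframes that also respects `id_B`, a groupby result can be unpacked into a list. This is a hedged sketch rather than part of either answer: it assumes the grouping key from Solution 1, and the toy data and the names `run_id` and `frames` are illustrative.

```python
import pandas as pd

# Toy data in the shape of the question's columns (values are made up)
df = pd.DataFrame({
    "id_B": ["id1", "id1", "id1", "id1", "id1", "id2", "id2", "id1", "id1"],
    "progressPercentage": [50.0, 100.0, 100.0, 1.0, 100.0, 1.0, 100.0, 1.0, 99.0],
})

# Grouping key as in Solution 1, renamed so it cannot clash with a column name
shifted = df["progressPercentage"].shift()
run_id = (shifted.diff(-1).ne(0) & shifted.eq(100)).cumsum().rename("run_id")

# sort=False keeps the groups in order of first appearance
frames = [group for _, group in df.groupby(["id_B", run_id], sort=False)]

for frame in frames:
    print(frame)
```

Each element of `frames` is an ordinary dataframe that keeps its original row index.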

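【Comments】:

Thanks, Bharath. I will check your solution against the one I designed: a = dfb['Operation.progressPercentage'].shift().eq(100).cumsum() df_output = dfb.groupby([dfb.wm_id,a])
However, the word Finish is not always present.
I changed your solution to df1 = dfb[dfb['Operation.progressPercentage'] == 100]
I just ran it, and it did not work. What do you mean by using "40:00" as the end of the dataframe? As you can see, I need to separate one experiment from another, and an experiment is complete when the percentage reaches 100%.
Hmm, that is not possible, because every experiment can be different.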

【Solution 3】:

The best approach I have found is the following:

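    a = dfb['progressPercentage'].shift().eq(100).cumsum()
    df_output = dfb.groupby([dfb.id_B, a])

    for k, gp in df_output:  # the post originally looped over `aa`
        print('key=' + str(k))
        # posted with the asker's real column names, 'eventTime' and 'wm_id'
        print(gp.sort_values(['ts_B', 'id_B'], ascending=[True, False]).to_string())
        print('A NEW ONE...')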
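
One caveat worth checking on your own data: a key built with `shift().eq(100).cumsum()` increments after every 100, so a repeated 100 (such as a Finish row) ends up in a group of its own. The sketch below compares that key with the guarded key from Solution 1 on a made-up column (the names `plain` and `guarded` are illustrative):

```python
import pandas as pd

# Toy progress column with two consecutive 100s followed by a new run
progress = pd.Series([99.0, 100.0, 100.0, 1.0, 2.0])
shifted = progress.shift()

# Split after every 100: the second 100 becomes a one-row group
plain = shifted.eq(100).cumsum()

# Split only after the last 100 of a run of 100s (as in Solution 1)
guarded = (shifted.diff(-1).ne(0) & shifted.eq(100)).cumsum()

print(pd.DataFrame({"progress": progress, "plain": plain, "guarded": guarded}))
```

When 100 never repeats within an experiment, the two keys are identical.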

【Comments】:

'eventTime' = ts_B and 'wm_id' = id_B
What is aa in the for loop?
aa is df_output