Get first and last Index of a Pandas DataFrame subset: answers

[Question title]: Get first and last Index of a Pandas DataFrame subset
[Posted]: 2020-07-26 18:30:43
[Question description]:

I have some data in a pandas DataFrame that looks like this.

df =
        A       B
time                               
0.1     10.0    1
0.15    12.1    2
0.19    4.0     2
0.21    5.0     2
0.22    6.0     2
0.25    7.0     1
0.3     8.1     1
0.4     9.45    2
0.5     3.0     1

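Based on the following condition, I am looking for a generic solution that finds the first and last index of each subset, that is, of each consecutive run of rows for which the condition keeps the same value.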

cond = df.B == 2

So far I have tried using groupby, but it did not give the expected result.

df_1 = cond.reset_index()
df_2 = df_1.groupby(df_1['B']).agg(['first','last']).reset_index()

This is the output I get.

      B       time          
              first    last
0    False    0.1      0.5
1    True     0.15     0.4

This is the output I want.

      B       time          
              first    last
0    False    0.1      0.1
1    True     0.15     0.22
2    False    0.25     0.3
3    True     0.4      0.4
4    False    0.5      0.5

How can I achieve this with a more or less generic approach?

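[Question discussion]:

Tags: python-3.x pandas dataframe subset

[Solution 1]:

Create a helper Series with Series.shift and Series.ne, then apply Series.cumsum to it so that every group of consecutive equal values gets its own number, and finally aggregate with a dictionary: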

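df_1 = df.reset_index()  # start from the original df so that 'time' becomes a column
df_1.B = df_1.B == 2  # turn B into the boolean condition
g = df_1.B.ne(df_1.B.shift()).cumsum()  # run id: increases by 1 whenever B changes
df_2 = df_1.groupby(g).agg({'B':'first','time': ['first','last']}).reset_index(drop=True)

print(df_2)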
       B  time      
   first first  last
0  False  0.10  0.10
1   True  0.15  0.22
2  False  0.25  0.30
3   True  0.40  0.40
4  False  0.50  0.50
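
To see why the helper Series separates the consecutive runs, it can help to look at the intermediate values. The following is a minimal sketch, assuming the question's data is rebuilt by hand; the names `changed` and `run_id` are illustrative and do not appear in the answer.

```python
import pandas as pd

# Rebuild the question's DataFrame (values copied from the question).
df = pd.DataFrame(
    {"A": [10.0, 12.1, 4.0, 5.0, 6.0, 7.0, 8.1, 9.45, 3.0],
     "B": [1, 2, 2, 2, 2, 1, 1, 2, 1]},
    index=pd.Index([0.1, 0.15, 0.19, 0.21, 0.22, 0.25, 0.3, 0.4, 0.5], name="time"),
)

cond = df["B"] == 2

# True wherever the value differs from the previous row.
# The first row is compared with NaN, so it always counts as a change.
changed = cond.ne(cond.shift())

# Cumulative sum of the change flags: rows in the same run share one number.
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 2, 2, 2, 2, 3, 3, 4, 5]
```

Grouping by these run numbers instead of by the raw True/False values is what keeps the two separate True runs (0.15 to 0.22 and 0.4) from being merged.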

To avoid the MultiIndex in the columns, use named aggregation:

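df_1 = df.reset_index()  # start from the original df so that 'time' becomes a column
df_1.B = df_1.B == 2  # turn B into the boolean condition
g = df_1.B.ne(df_1.B.shift()).cumsum()  # run id: increases by 1 whenever B changes
df_2 = df_1.groupby(g).agg(B=('B','first'),
                           first=('time','first'),
                           last=('time','last')).reset_index(drop=True)

print(df_2)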
       B  first  last
0  False   0.10  0.10
1   True   0.15  0.22
2  False   0.25  0.30
3   True   0.40  0.40
4  False   0.50  0.50
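
Since the question asks for a more or less generic approach, the same steps could be wrapped in a small function that accepts any boolean condition. This is only a sketch under assumptions: the function name `run_bounds` and the output column names `value`, `first` and `last` are hypothetical choices, not part of the answer or of pandas.

```python
import pandas as pd

def run_bounds(cond: pd.Series) -> pd.DataFrame:
    """Return the first and last index label of each consecutive run in cond.

    Hypothetical helper: the name and the output column names are
    illustrative assumptions.
    """
    # Put the condition values and their index labels side by side.
    frame = pd.DataFrame({"value": cond.to_numpy(), "label": cond.index.to_numpy()})
    # Run id: increases by 1 whenever the value differs from the previous row.
    run_id = frame["value"].ne(frame["value"].shift()).cumsum().rename("run")
    return frame.groupby(run_id).agg(
        value=("value", "first"),
        first=("label", "first"),
        last=("label", "last"),
    ).reset_index(drop=True)

# Usage with the question's data (values copied from the question).
df = pd.DataFrame(
    {"A": [10.0, 12.1, 4.0, 5.0, 6.0, 7.0, 8.1, 9.45, 3.0],
     "B": [1, 2, 2, 2, 2, 1, 1, 2, 1]},
    index=pd.Index([0.1, 0.15, 0.19, 0.21, 0.22, 0.25, 0.3, 0.4, 0.5], name="time"),
)
out = run_bounds(df["B"] == 2)
print(out)
```

Passing the condition as a Series rather than hard-coding `B == 2` means the same function can be reused with other conditions, for example `run_bounds(df["A"] > 5)`.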

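[Discussion]:

Thank you very much for your answer; I could not have solved this on my own. The helper Series idea gave me a lot of new insight into how to use Pandas.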