Find begin and end index of consecutive ones in pandas dataframe: answers

[Question title]: Find begin and end index of consecutive ones in pandas dataframe
[Posted]: 2020-02-17 09:50:22
[Question]:

I have the following dataframe:

     A    B    C
0    1    1    1
1    0    1    0
2    1    1    1
3    1    0    1
4    1    1    0
5    1    1    0 
6    0    1    1
7    0    1    0

For each column, I want to know the start and end index of every run of 3 or more consecutive 1s. Desired result:

Column    From    To
     A       2     5
     B       0     2
     B       4     7

First I zero out the runs of 1s that are shorter than 3:

filtered_df = df.copy().apply(filter, threshold=3)

where

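# Note: the name filter shadows Python's built-in of the same name
def filter(col, threshold=3):
    # Label each run of equal values, then flag rows in runs shorter than threshold
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    # Only short runs of 1s should be changed
    mask &= col.eq(1)
    # Overwrite those 1s with 0 in place
    col.update(col.loc[mask].replace(1,0))
    return col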

filtered_df now looks like:

     A    B    C
0    0    1    0
1    0    1    0
2    1    1    0
3    1    0    0
4    1    1    0
5    1    1    0 
6    0    1    0
7    0    1    0

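If the dataframe had only one column of 0s and 1s, the result could be obtained as in "How to use pandas to find consecutive same data in time series". However, I am struggling to do something similar for several columns at once.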

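[Comments]:

Maybe wrap your code in a function and then apply that function to the dataframe as a whole? You would of course need to extend the filter function so that it is applied to every column in df.columns.

Tags: python pandas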

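[Solution 1]:

Use DataFrame.pipe to apply a function to the whole DataFrame.

The first solution takes the first and last index of each run of consecutive 1s per column, appends each column's result to a list, and finally concats the list: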

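import pandas as pd

def f(df, threshold=3):
    out = []
    for col in df.columns:
        # Rows that hold a 1
        m = df[col].eq(1)
        # Run label for every row, kept only where the value is 1
        g = (df[col] != df[col].shift()).cumsum()[m]
        # Keep runs with at least threshold rows
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        # First and last original index of each remaining run
        output = filt.groupby(col)['index'].agg(['first','last'])
        output.insert(0, 'col', col)
        out.append(output)

    return pd.concat(out, ignore_index=True)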
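
To see what the `(col != col.shift()).cumsum()` expression is doing, here is a minimal sketch on column B alone. The variable names (`starts`, `run_id`, `ones`, `runs`) are only illustrative and do not come from the answer.

```python
import pandas as pd

# Column B from the question's example frame.
s = pd.Series([1, 1, 1, 0, 1, 1, 1, 1], name="B")

# True wherever the value differs from the previous row. The first row is
# always True because shift() puts NaN there.
starts = s != s.shift()

# The running total of those flags gives every run its own integer label.
run_id = starts.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 3, 3, 3, 3]

# Keep only the rows holding 1, then take the first and last index per run.
is_one = s.eq(1)
ones = s.index.to_series()[is_one]
runs = ones.groupby(run_id[is_one]).agg(["first", "last"])
print(runs)
```

Runs 1 and 3 are the two stretches of 1s (index 0 to 2 and 4 to 7); run 2 is the single 0 at index 3 and drops out once the rows are restricted to 1s.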

Alternatively, reshape with unstack first and then apply the same idea:

def f(df, threshold=3):
    # Long format: one row per (column, index) pair
    df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    # Shifting within each column keeps a run from spanning two columns
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    # Runs of 1s with at least threshold rows
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                    .agg(['first','last'])
                    .reset_index(level=1, drop=True)
                    .reset_index())


filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7

filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
3   C      2     3

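[Comments]:

Thanks! Both approaches work. Is one of them better than the other?
@Peter - Hard question. With many groups and many columns, the second should be slower. Best to test on real data.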
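
A rough way to run that test yourself, as a sketch only: the two functions are copied from the answer under the made-up names `f_loop` and `f_unstack`, and the random 2000 x 5 frame is a stand-in for real data. Timings will depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd


def f_loop(df, threshold=3):
    # Column-by-column variant (the first function in the answer).
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        g = (df[col] != df[col].shift()).cumsum()[m]
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        output = filt.groupby(col)['index'].agg(['first', 'last'])
        output.insert(0, 'col', col)
        out.append(output)
    return pd.concat(out, ignore_index=True)


def f_unstack(df, threshold=3):
    # unstack variant (the second function in the answer).
    df1 = df.unstack().rename_axis(('col', 'idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                     .agg(['first', 'last'])
                     .reset_index(level=1, drop=True)
                     .reset_index())


# Made-up test data: 2000 rows, 5 columns of random 0/1 values.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(0, 2, size=(2000, 5)),
                   columns=[f"c{i}" for i in range(5)])

# Both variants should report the same runs before timing means anything.
same = f_loop(big).values.tolist() == f_unstack(big).values.tolist()
print("same result:", same)

for func in (f_loop, f_unstack):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f"{func.__name__}: {seconds / 5:.4f} s per call")
```

Swap `big` for your own frame and vary the number of columns and rows to see where the two variants diverge.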

[Solution 2]:

You can use rolling to create a window over the dataframe. Then you can apply all your conditions and shift the window back to its start position:

length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
mask = mask.shift(1 - length)
print(mask)

Which prints:

       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3   True  False  False
4  False   True  False
5  False   True  False
6    NaN    NaN    NaN
7    NaN    NaN    NaN
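
The mask marks the rows where an all-ones window of the given length starts, not the From/To ranges the question asks for. One possible way to get from the mask to the ranges is sketched below. It is not part of the answer: it assumes a default integer index, passes `fill_value=False` to `shift` so the trailing rows come out False instead of NaN, and the names `rows` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 0, 1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 1, 1, 1],
    "C": [1, 0, 1, 1, 0, 0, 1, 0],
})

length = 3
window = df.rolling(length)
mask = (window.min() == 1) & (window.max() == 1)
# fill_value=False (an addition) keeps the mask boolean after shifting.
mask = mask.shift(1 - length, fill_value=False)

rows = []
for col in mask.columns:
    m = mask[col]
    # Label each stretch of consecutive True values (adjacent window starts).
    run_id = (m != m.shift()).cumsum()[m]
    idx = m.index.to_series()[m]
    for _, grp in idx.groupby(run_id):
        # The run begins at the first window start and ends where the last
        # window ends, i.e. length - 1 rows after the last window start.
        rows.append((col, grp.iloc[0], grp.iloc[-1] + length - 1))

result = pd.DataFrame(rows, columns=["col", "first", "last"])
print(result)
```

On the example frame this gives A from 2 to 5 and B from 0 to 2 and from 4 to 7, the same ranges as Solution 1 with threshold=3.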
