Answers: Find the number of rows after a match is found in a pandas DataFrame column

【Question title】: Find the number of rows after match found in a pandas dataframe column
【Posted】: 2021-03-20 05:40:02
【Question description】:

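I have the following DataFrame and expected output. I search the balance column for True values. Once one is found, I take that row's balPrice value and compare it against the endPrice column to find the first later row where endPrice is lower than balPrice, then count the number of rows between the balance == True row and that first lower row. If no lower value is found, the row count is set to 0.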

    balance balPrice    endPrice
0   False   5.34    5.34
1   False   5.34    5.34
2   False   5.34    5.34
3   False   5.34    5.27
4   False   5.44    5.25
5   False   5.28    5.12
6   True    5.31    5.2
7   False   5.44    5.35
8   False   5.485   5.44
9   False   5.525   5.5
10  False   5.53    5.53
11  False   5.58    5.51
12  False   5.65    5.52
13  False   5.3     5.3
14  False   5.58    5.54
15  False   5.64    5.55
16  True    5.69    5.65
17  False   5.69    5.59
18  False   5.7     5.62
19  False   5.81    5.77
20  False   5.65    5.73
21  False   5.65    5.86
22  True    6.00    5.89
23  False   5.65    5.85
24  False   5.65    5.83
25  False   5.9     5.88

This is what I have tried, and it looks complicated. I am looking for a better solution.

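import numpy as np

# Copy so that adding a column later does not trigger SettingWithCopyWarning
df_filtered = df[df.balance == True].copy()
idx = []
for i in df_filtered.index:
    # Positions (relative to row i+1) where endPrice is at or below this row's balPrice
    pos = np.where(df.endPrice[i+1:] <= df.balPrice[i])[0]
    if pos.size > 0:
        idx.append(pos[0]+1)
    else:
        idx.append(0)

df_filtered['numrows'] = idx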

Expected output:

balance balPrice    endPrice    numrows
True    5.31        5.2           7
True    5.69        5.65          1
True    6.00        5.89          1
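
For anyone who wants to reproduce this, the table above can be rebuilt as a DataFrame roughly as follows. This is only a sketch: the values are transcribed from the question, and rows_until_lower is a hypothetical helper name that wraps the same loop-based idea (using <=, as the attempted code does).

```python
import numpy as np
import pandas as pd

# Sample data transcribed from the table in the question.
bal_rows = {6, 16, 22}  # index labels where balance is True
df = pd.DataFrame({
    "balance": [i in bal_rows for i in range(26)],
    "balPrice": [5.34, 5.34, 5.34, 5.34, 5.44, 5.28, 5.31, 5.44, 5.485, 5.525,
                 5.53, 5.58, 5.65, 5.3, 5.58, 5.64, 5.69, 5.69, 5.7, 5.81,
                 5.65, 5.65, 6.00, 5.65, 5.65, 5.9],
    "endPrice": [5.34, 5.34, 5.34, 5.27, 5.25, 5.12, 5.2, 5.35, 5.44, 5.5,
                 5.53, 5.51, 5.52, 5.3, 5.54, 5.55, 5.65, 5.59, 5.62, 5.77,
                 5.73, 5.86, 5.89, 5.85, 5.83, 5.88],
})


def rows_until_lower(df):
    """Hypothetical helper: for each balance == True row, count rows until
    the first later endPrice <= that row's balPrice (0 if there is none)."""
    end_price = df["endPrice"].to_numpy()
    counts = []
    for i in np.flatnonzero(df["balance"].to_numpy()):
        # Positions, relative to row i + 1, that satisfy the condition.
        pos = np.flatnonzero(end_price[i + 1:] <= df["balPrice"].iloc[i])
        counts.append(int(pos[0]) + 1 if pos.size else 0)
    return counts


print(rows_until_lower(df))  # [7, 1, 1]
```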

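【Question comments】:

Please include the data as text so that we can copy it. Images cannot be copied.
Comparing the idx elements from the sample code with the screenshot, it is not clear how numrows is calculated.
numrows is the number of rows between a row with balance == True and the first later row where endPrice is less than or equal to the balPrice of that balance == True row. The idx list holds the first element of the np.where array that satisfies the condition.
Your code produces [7, 1, 1] for that data - rows 17 and 23 satisfy the condition.
You're right, my mistake. I cut and pasted incorrectly.

Tags: python pandas dataframe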

【Solution 1】:

Update

For larger datasets, using masking/indexing appears to be much faster than the join approach (at least in my local tests).

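# Rows where balance is True start a new search
start_idxs = df.loc[ df.balance ].index
# Carry each start row's balPrice forward as the reference price
df.loc[start_idxs, 'Price'] = df.loc[start_idxs, 'balPrice']
df.ffill(inplace=True)

# Mark rows whose endPrice is at or below the reference price, then
# carry their index labels backward over the preceding rows
end_rows = (df.balance == False) & (df.endPrice <= df.Price)
df.loc[end_rows, 'index'] = df.loc[end_rows].index
df.bfill(inplace=True)

# Offset from each start row to its matching end row
df.loc[start_idxs, 'index'] -= start_idxs
df.rename(dict(index='numrows'), axis=1, inplace=True)
df.drop(columns='Price', inplace=True)
df.loc[df.balance == False, 'numrows'] = 0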
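
The same masking idea can also be sketched without modifying df in place. This is an illustration under assumptions, not the answer's exact code: numrows_by_masking is a hypothetical name, it works with row positions rather than index labels, and it fills 0 when no matching row follows. One caveat worth checking on your own data: because the price is forward-filled, a row that comes after the next balance == True row is compared against that newer balPrice, so results can differ from the loop in the question when no match occurs before the next balance == True row.

```python
import pandas as pd


def numrows_by_masking(df):
    """Hypothetical wrapper around the masking idea; returns a new Series."""
    start = df["balance"]
    # balPrice of the most recent balance == True row, carried forward.
    price = df["balPrice"].where(start).ffill()
    # Candidate end rows: not a start row, endPrice at or below that price.
    is_end = ~start & (df["endPrice"] <= price)
    # Row positions as floats so that missing values can be represented.
    pos = pd.Series(range(len(df)), index=df.index, dtype="float64")
    # Position of the next candidate end row, carried backward.
    next_end = pos.where(is_end).bfill()
    # Distance from each start row to its end row; 0 elsewhere or if none.
    return (next_end - pos).where(start).fillna(0).astype(int)


# Sample data transcribed from the question.
df = pd.DataFrame({
    "balance": [i in {6, 16, 22} for i in range(26)],
    "balPrice": [5.34, 5.34, 5.34, 5.34, 5.44, 5.28, 5.31, 5.44, 5.485, 5.525,
                 5.53, 5.58, 5.65, 5.3, 5.58, 5.64, 5.69, 5.69, 5.7, 5.81,
                 5.65, 5.65, 6.00, 5.65, 5.65, 5.9],
    "endPrice": [5.34, 5.34, 5.34, 5.27, 5.25, 5.12, 5.2, 5.35, 5.44, 5.5,
                 5.53, 5.51, 5.52, 5.3, 5.54, 5.55, 5.65, 5.59, 5.62, 5.77,
                 5.73, 5.86, 5.89, 5.85, 5.83, 5.88],
})

result = numrows_by_masking(df)
print(result[df["balance"]].tolist())  # [7, 1, 1]
```

Returning a Series leaves the original frame untouched, so it can be attached with df.assign(numrows=result) or simply inspected.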

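The join() approach

Not sure whether you would consider it better, but one approach is to add the columns your condition needs to the filtered rows, then perform a join.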

df['numrows'] = 0

has_balance = df.loc[ df.balance ]
has_balance = has_balance.assign(
    index=has_balance.index,  
    lastPrice=has_balance.balPrice
)
 
df1 = df.join(has_balance, rsuffix='_r').ffill()
df1 = df1.query('balance == False and endPrice <= lastPrice').drop_duplicates('index')
df1['numrows'] = df1.index - df1['index']

has_balance.update(df1.set_index('index')['numrows'])
df.update(has_balance)

    balance  balPrice  endPrice  numrows
6      True      5.31       5.2        7
16     True      5.69      5.65        1
22     True         6      5.89        1

Written without the df1 variable:

df['numrows'] = 0

has_balance = df.loc[ df.balance ]
has_balance = has_balance.assign(
    index=has_balance.index, 
    lastPrice=has_balance.balPrice
)

has_balance.update((
    df.join(has_balance, rsuffix='_r')
      .ffill()
      .query('balance == False and endPrice <= lastPrice')
      .drop_duplicates('index')
      .assign(numrows=lambda df: df.index - df['index'])
      .set_index('index')
      .numrows
))

df.update(has_balance)

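【Comments】:

I was hoping for a solution that adds the resulting numrows column to the original df, so that I can finish with df[df.balance == True] to get the final result.
You can use df.update(has_balance) to modify the original - I have edited the answer.
It looks like masking/indexing is much faster than the join() approach - I have added it to the answer.
Thanks Karl, let me try this and get back to you.

【Solution 2】:

You can use groupby(df.balance.cumsum()) together with apply(numrows) to group the rows from one balance == True row up to the next: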

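def numrows(group):
    # Compare every row after the group's first row against that row's balPrice
    hits = group.endPrice.iloc[1:] <= group.balPrice.iloc[0]
    # idxmax() returns the first label even when nothing is True, so guard with 0
    return hits.idxmax() - group.index[0] if hits.any() else 0

# Group 0 holds the rows before the first balance == True row, so drop it
counts = df.groupby(df.balance.cumsum()).apply(numrows)[1:]
df['numrows'] = counts.set_axis(df[df.balance].index)
df['numrows'] = df.numrows.fillna(0).astype(int)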

The numrows column is 0 everywhere except on the balance == True rows:

df.tail()

#    balance  balPrice  endPrice  numrows
# ...
# 21   False      5.65      5.86        0
# 22    True      6.00      5.89        1
# 23   False      5.65      5.85        0
# 24   False      5.65      5.83        0
# 25   False      5.90      5.88        0

So in the end you can do:

df[df.balance]

#    balance  balPrice  endPrice  numrows
#  6    True      5.31      5.20        7
# 16    True      5.69      5.65        1
# 22    True      6.00      5.89        1
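
One thing to watch with idxmax() on a boolean Series: when no element is True it still returns the first label, so a group with no qualifying endPrice would report 1 rather than the 0 the question asks for. Below is a minimal sketch of a guarded variant. The name first_hit_offset and the small sample frame are made up for illustration and are not part of the answer.

```python
import pandas as pd


def first_hit_offset(group):
    """Hypothetical guarded variant: 0 when no later endPrice qualifies."""
    hits = group["endPrice"].iloc[1:] <= group["balPrice"].iloc[0]
    if not hits.any():
        return 0
    # idxmax() gives the label of the first True value.
    return hits.idxmax() - group.index[0]


# Small made-up frame: the second balance == True row never sees a lower endPrice.
df = pd.DataFrame({
    "balance":  [False, True, False, False, True, False],
    "balPrice": [5.0, 5.3, 5.4, 5.5, 6.0, 6.2],
    "endPrice": [5.0, 5.2, 5.35, 5.25, 5.9, 6.1],
})

counts = df.groupby(df["balance"].cumsum()).apply(first_hit_offset)
# Keep groups 1 and up; group 0 (if present) holds rows before the first True.
counts = counts.loc[1:].astype(int)

df["numrows"] = 0
df.loc[df["balance"], "numrows"] = counts.to_numpy()
print(df.loc[df["balance"], "numrows"].tolist())  # [2, 0]
```

Selecting with counts.loc[1:] rather than a positional [1:] is a deliberate choice here: it also behaves sensibly when the very first row has balance == True, in which case there is no group 0 to discard.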

【Comments】: