In pandas, an efficient way to get the indices of rows in table A based on a condition from table B

【Question title】: In pandas, the efficient way to get the indices of the rows in table A based on the condition from table B
【Posted】: 2020-12-30 13:43:52
【Question】:

I'm trying to build data that can be used for time-series modeling. Right now I have two tables:

Table A:

Index UserID   SessionDate  
0      1       '2020-01-01'  
1      1       '2020-01-03'
2      2       '2020-03-01'
3      2       '2020-03-02'
4      3       '2020-01-05'

Table B:

Index UserID   SnapshotDate  
0      1       '2020-01-01'  
1      1       '2020-01-02'
2      2       '2020-03-01'
3      2       '2020-03-02'
4      3       '2020-01-01'

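So, for each user and each snapshot date in table B, I want the indices of the rows in table A that belong to that user and whose session date is less than or equal to the snapshot date.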

I have tried using the apply function:

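def index_search(x, df):
    # x is one row of B; df is table A
    user = x['UserID']
    snap_date = x['SnapshotDate']
    # keep this user's sessions, then those on or before the snapshot date
    dd = df[df.UserID == user]
    ix = dd[dd.SessionDate <= snap_date].index.values
    return ix

# apply row by row over B; the df keyword is forwarded to index_search
idx = B.apply(index_search, df=A, axis=1)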

But it is slow (my dataset is large), so I'm wondering whether there is a more efficient way.

【Question comments】:

Tags: python pandas dataframe time-series

【Solution 1】:

You can try numpy broadcasting:

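# assumes A and B each hold exactly two columns: UserID and the date column
x1, y1 = A.to_numpy().T
x2, y2 = B.to_numpy().T
# mask[i, j] is True when row j of A matches row i of B; shape is (len(B), len(A))
mask = (x2[:, None] == x1) & (y2[:, None] >= y1)
idx = [A.index[m].tolist() for m in mask]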

print(idx)

[[0], [0], [2], [2, 3], []]
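For anyone who wants to reproduce this result, here is a self-contained sketch of the same broadcasting idea on the sample tables from the question. The DataFrame construction, the conversion to datetimes, and the names `users_a`, `dates_a`, `users_b`, and `dates_b` are my own additions for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample tables from the question (the default RangeIndex plays the role of "Index").
A = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "SessionDate": ["2020-01-01", "2020-01-03", "2020-03-01",
                    "2020-03-02", "2020-01-05"],
})
B = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "SnapshotDate": ["2020-01-01", "2020-01-02", "2020-03-01",
                     "2020-03-02", "2020-01-01"],
})

# Assumption: compare real datetimes rather than strings.
A["SessionDate"] = pd.to_datetime(A["SessionDate"])
B["SnapshotDate"] = pd.to_datetime(B["SnapshotDate"])

users_a, dates_a = A["UserID"].to_numpy(), A["SessionDate"].to_numpy()
users_b, dates_b = B["UserID"].to_numpy(), B["SnapshotDate"].to_numpy()

# Broadcasting a (len(B), 1) column against a (len(A),) row gives a
# (len(B), len(A)) boolean mask: same user AND session on or before snapshot.
mask = (users_b[:, None] == users_a) & (dates_b[:, None] >= dates_a)

# One list of matching A index labels per row of B.
idx = [A.index[m].tolist() for m in mask]
print(idx)  # [[0], [0], [2], [2, 3], []]
```

Note that the mask has one entry for every (B row, A row) pair, so its size grows with the product of the two table lengths.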

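【Comments】:

This works, but it uses a lot of memory, right? The mask is a huge matrix, so it keeps raising a MemoryError.
@TianqiWang Yes, I guess so... How big are your dataframes?
Each table has 30M rows, so the mask matrix is (30M, 30M). It was also slow in my experiments (even slower than pandas apply). I think numpy may be the right direction, but there should be some room for optimization.
@TianqiWang Let's discuss here
It says I need 20 reputation to chat there, and I don't have that much reputation yet :(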
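To put the memory problem in numbers: a (30M, 30M) mask has 9 × 10^14 entries, and at one byte per NumPy boolean that is roughly 900 TB, before counting the intermediate arrays the expression creates. The thread does not arrive at a fix, so what follows is only a possible direction, not something proposed by the answerer. It is a minimal sketch, assuming table A can be sorted by `UserID` and `SessionDate`, so that the matches for each row of B form one contiguous slice that `np.searchsorted` can locate. The function name `indices_up_to_snapshot` is hypothetical, and I have not benchmarked it at the 30M-row scale.

```python
import numpy as np
import pandas as pd

def indices_up_to_snapshot(A, B):
    """For each row of B, list the index labels of A's rows with the same
    UserID and SessionDate <= SnapshotDate (hypothetical helper)."""
    # After sorting, each user's sessions sit in one block, ordered by date.
    a = A.sort_values(["UserID", "SessionDate"])
    users = a["UserID"].to_numpy()
    dates = a["SessionDate"].to_numpy()
    labels = a.index.to_numpy()

    b_users = B["UserID"].to_numpy()
    b_dates = B["SnapshotDate"].to_numpy()

    # Block boundaries of each B row's user within sorted A (vectorized).
    start = np.searchsorted(users, b_users, side="left")
    stop = np.searchsorted(users, b_users, side="right")

    result = []
    for s, e, d in zip(start, stop, b_dates):
        # Inside the block, count the sessions on or before the snapshot date.
        cut = s + np.searchsorted(dates[s:e], d, side="right")
        result.append(labels[s:cut].tolist())
    return pd.Series(result, index=B.index)

# Sample tables from the question.
A = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "SessionDate": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-03-01",
                                   "2020-03-02", "2020-01-05"]),
})
B = pd.DataFrame({
    "UserID": [1, 1, 2, 2, 3],
    "SnapshotDate": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-03-01",
                                    "2020-03-02", "2020-01-01"]),
})

result = indices_up_to_snapshot(A, B)
print(result.tolist())  # [[0], [0], [2], [2, 3], []]
```

This never builds the pairwise mask, so the working arrays grow with the number of rows rather than with their product. It still loops in Python over the rows of B and materializes every result list, so it may remain slow on very large tables. Within each list the labels come back in session-date order, which for the sample data coincides with the original index order.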