pandas - filtering a dataframe by index of another dataframe, then combine the two dataframes

【Question title】: pandas - filtering a dataframe by index of another dataframe, then combine the two dataframes
【Posted】: 2017-10-26 07:32:44
【Question】:

I have two dataframes as follows:

df1 

Index   Fruit
1       Apple
2       Banana
3       Peach

df2 

Index   Taste
1       Tasty
1.5     Rotten
2       Tasty
2.6     Tasty
3       Rotten
3.3     Tasty
4       Tasty

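I would like to filter df2 using the indexes of the two dataframes, with a condition such as df2.index >= df1.index + 0.5, keeping for each row of df1 the first row of df2 that satisfies it.

The resulting dataframe should look like this: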

df_outcome          

Index   Fruit   Index_df2   Taste
1       Apple   1.5         Rotten
2       Banana  2.6         Tasty
3       Peach   4           Tasty

I tried df2[df2.index>=df1.index + 0.5], but it returned

ValueError: Can only compare identically-labeled Series objects

Any help?

【Comments】:

Hi, would you mind timing our two solutions on your actual data?
Sure, but @jezrael, I get ValueError: Cannot shift with no freq when I try your code.
That looks like a bug; you may need to upgrade pandas.

Tags: python pandas

【Solution 1】:

To get the rows from df2, use numpy broadcasting and argmax. Then use pd.concat to join the result with df1.

# For each df1 row, pick the first df2 row whose Index is >= df1.Index + 0.5
r = df2.iloc[(df1.Index.values + 0.5
       <= df2.Index.values[:, None]).argmax(axis=0)].reset_index(drop=True)

pd.concat([df1, r], axis=1)

   Index   Fruit  Index   Taste
0      1   Apple    1.5  Rotten
1      2  Banana    2.6   Tasty
2      3   Peach    4.0   Tasty

Details

Broadcasting gives:

x = (df1.Index.values + 0.5 <= df2.Index.values[:, None])
array([[False, False, False],
       [ True, False, False],
       [ True, False, False],
       [ True,  True, False],
       [ True,  True, False],
       [ True,  True, False],
       [ True,  True,  True]], dtype=bool)

Applying argmax along axis 0 then gives the position of the first True in each column:

x.argmax(axis=0)
array([1, 3, 6])
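
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the broadcasting approach on the sample data from the question. It assumes Index is an ordinary column, as in the snippet above. The has_match check and the rename to Index_df2 are additions of mine and not part of the original answer; the check is there because argmax also returns 0 for a column that contains no True at all.

```python
import pandas as pd

# Sample data from the question; "Index" is kept as an ordinary column here.
df1 = pd.DataFrame({"Index": [1, 2, 3], "Fruit": ["Apple", "Banana", "Peach"]})
df2 = pd.DataFrame({
    "Index": [1, 1.5, 2, 2.6, 3, 3.3, 4],
    "Taste": ["Tasty", "Rotten", "Tasty", "Tasty", "Rotten", "Tasty", "Tasty"],
})

# Rows correspond to df2, columns to df1:
# mask[i, j] is True when df2.Index[i] >= df1.Index[j] + 0.5.
mask = df2["Index"].to_numpy()[:, None] >= df1["Index"].to_numpy() + 0.5

# argmax returns the position of the first True in each column.
pos = mask.argmax(axis=0)

# Assumption added for safety: a column with no True also yields 0,
# so remember which df1 rows really have a match.
has_match = mask.any(axis=0)

matched = (df2.iloc[pos]
           .reset_index(drop=True)
           .rename(columns={"Index": "Index_df2"}))
result = pd.concat([df1, matched], axis=1).loc[has_match]
print(result)
```

Note that the boolean matrix has len(df2) x len(df1) entries, so memory use grows with the product of the two lengths.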

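【Comments】:

Timings on small data are not very meaningful; it is better to test on large data :(
@jezrael Maybe, but in this case you know where it's going: searchsorted is O(N log N), which is slower by a logarithmic factor. Also, I'm not sure how to scale this test to large data; doing pd.concat(df * 10000) won't give enough of a sample space to test the merits of the two solutions.
You need to create custom data; pd.concat(df * 10000) can't be used, of course... but timing on small data is a really bad idea, and I think it's best to remove it... because timings on large data may differ (I'm not sure about your solution; maybe the timings will show it is better on large data, I don't know).
@jezrael OK, I'll make some custom data. But I won't delete this. There is nothing wrong with showing timings on small datasets. If the OP's data is small, it makes their choice simple. Give me some time and I'll let you know when I'm done.
No, if you add large data, no problem. But small data alone is really bad :(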
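
A rough sketch of how the two approaches could be compared on custom data, as discussed in the comments above. Everything here is an assumption for illustration: the sizes n1 and n2, the uniform random keys, and the helper names by_broadcasting and by_searchsorted are made up, and the timings depend on the machine and on the shape of the real data.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)  # fixed seed, purely for reproducibility

# Hypothetical sizes; adjust them to resemble the real data.
n1, n2 = 1_000, 5_000
keys1 = np.sort(rng.uniform(0, 100, n1))
keys2 = np.sort(rng.uniform(0, 101, n2))
keys2[-1] = 200.0  # guarantee that every threshold has a match


def by_broadcasting():
    # Builds an n2 x n1 boolean matrix, then takes the first True per column.
    return (keys1 + 0.5 <= keys2[:, None]).argmax(axis=0)


def by_searchsorted():
    # Binary search for each threshold; requires keys2 to be sorted.
    return keys2.searchsorted(keys1 + 0.5)


# Both approaches should select the same positions before timing them.
assert (by_broadcasting() == by_searchsorted()).all()

for fn in (by_broadcasting, by_searchsorted):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__}: {seconds / 20:.6f} s per call")
```

Varying n1 and n2 separately should show how each approach scales, which is the point the small-data timings cannot settle.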

【Solution 2】:

Use searchsorted on the index, then select the rows with iloc, and finally concat:

df = pd.concat([df1.reset_index(), 
                df2.iloc[df2.index.searchsorted(df1.index + .5)].reset_index()], axis=1)
print (df)
   Index   Fruit  Index   Taste
0      1   Apple    1.5  Rotten
1      2  Banana    2.6   Tasty
2      3   Peach    4.0   Tasty

Details:

print (df2.index.searchsorted(df1.index + .5))
[1 3 6]

print (df2.iloc[df2.index.searchsorted(df1.index + .5)])
        Taste
Index        
1.5    Rotten
2.6     Tasty
4.0     Tasty
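
A self-contained sketch of the searchsorted approach on the sample data, assuming Index is the actual index of both frames and that df2's index is sorted (searchsorted relies on that). The found mask and the rename to Index_df2 are my additions, not part of the original answer: searchsorted returns len(df2) when no value is large enough, and passing that position to iloc would raise an IndexError.

```python
import pandas as pd

# Sample data from the question, with "Index" as the real index.
df1 = pd.DataFrame({"Fruit": ["Apple", "Banana", "Peach"]},
                   index=pd.Index([1, 2, 3], name="Index"))
df2 = pd.DataFrame(
    {"Taste": ["Tasty", "Rotten", "Tasty", "Tasty", "Rotten", "Tasty", "Tasty"]},
    index=pd.Index([1, 1.5, 2, 2.6, 3, 3.3, 4], name="Index"),
)

# With the default side="left", searchsorted returns the first position
# whose value is >= the target.
targets = df1.index.to_numpy() + 0.5
pos = df2.index.searchsorted(targets)

# Assumption added for safety: a position equal to len(df2) means
# that no row of df2 is large enough for that target.
found = pos < len(df2)

left = df1[found].reset_index()
right = (df2.iloc[pos[found]]
         .reset_index()
         .rename(columns={"Index": "Index_df2"}))
result = pd.concat([left, right], axis=1)
print(result)
```

Renaming the second Index column avoids ending up with two columns that share a name, which matches the layout asked for in the question.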

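【Comments】:

Hi, please see the comments on the main question.
I think this looks like a bug; if upgrading pandas doesn't help, use the other solution.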