
【Question title】: Pandas: slice one multiindex dataframe with multiindex of another when some levels don't match
【Posted】: 2018-04-13 07:33:42
【Question】:

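I have two MultiIndex DataFrames, one with three index levels and one with two. The first two levels match across the two DataFrames. I want to find all rows of the first DataFrame whose first two index levels match an index entry in the second DataFrame. The second DataFrame has no third level.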

The closest answer I found is "How to slice one MultiIndex DataFrame with the MultiIndex of another", but the setup there is slightly different and does not seem to carry over to this case.

Consider the following setup:

import numpy as np
import pandas as pd

# Three index levels for df_1
array_1 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
           np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])]

# Two index levels for df_2
array_2 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'three', 'one', 'two', 'two', 'one', 'two'])]

df_1 = pd.DataFrame(np.random.randn(8, 4), index=array_1).sort_index()

print(df_1)
                  0         1         2         3
bar one a  1.092651 -0.325324  1.200960 -0.790002
    two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
    two a  0.315192 -1.548431 -0.214253 -1.790330
foo one b  1.022050 -2.791862  0.172165  0.924701
    two b  0.622062 -0.193056 -0.145019  0.763185
qux one b -1.241954 -1.270390  0.147623 -0.301092
    two b  0.778022  1.450522  0.683487 -0.950528

df_2 = pd.DataFrame(np.random.randn(8,4), index=array_2).sort_index()

print(df_2)

                  0         1         2         3
bar one   -0.354889 -1.283470 -0.977933 -0.601868
    two   -0.849186 -2.455453  0.790439  1.134282
baz one   -0.143299  2.372440 -0.161744  0.919658
    three -1.008426 -0.116167 -0.268608  0.840669
foo two   -0.644028  0.447836 -0.576127 -0.891606
    two   -0.163497 -1.255801 -1.066442  0.624713
qux one   -1.545989 -0.422028 -0.489222 -0.357954
    two   -1.202655  0.736047 -1.084002  0.732150

Now I query the second DataFrame, which returns a subset of its original index:

df_2_selection = df_2[(df_2 > 1).any(axis=1)]
print(df_2_selection)

                0         1         2         3
bar two -0.849186 -2.455453  0.790439  1.134282
baz one -0.143299  2.372440 -0.161744  0.919658

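I want to find all rows in df_1 that match the index of df_2_selection. The first two levels line up, but the third does not.

When the indexes align, the problem is easy and can be solved with something like df_1.loc[df_2_selection.index] #this works if indexes are the same

I can also find the rows that match on a single level, for example df_1[df_1.index.isin(df_2_selection.index.get_level_values(0), level=0)], but that does not solve the problem.

Chaining these statements together does not give the desired behavior either: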

df_1[(df_1.index.isin(df_2_selection.index.get_level_values(0),level = 0)) & (df_1.index.isin(df_2_selection.index.get_level_values(1),level = 1))]
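To illustrate why the chained per-level test over-selects, here is a minimal sketch. It assumes a deterministic stand-in for df_1 (integer values instead of random ones) and a hypothetical `sel_index` standing in for `df_2_selection.index`. Because each level is tested independently, every combination of the selected level-0 and level-1 labels passes, not only the pairs that actually occur together.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_1: same index layout, deterministic values.
idx_1 = pd.MultiIndex.from_arrays([
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
    ["a", "a", "a", "a", "b", "b", "b", "b"],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx_1).sort_index()

# Hypothetical stand-in for df_2_selection.index: only two (level 0, level 1) pairs.
sel_index = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])

# Each level is tested on its own, so the pairing between levels is lost.
mask = (df_1.index.isin(sel_index.get_level_values(0), level=0)
        & df_1.index.isin(sel_index.get_level_values(1), level=1))

chained = df_1[mask]
# Four rows come back (bar/one, bar/two, baz/one, baz/two) instead of the two wanted.
print(chained.index.tolist())
```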

What I have in mind is something like:

df_1_select = df_1[df_1.index.isin(
    df_2_selection.index.get_level_values([0, 1]), level=[0, 1])]  # Doesn't work

print(df_1_select)

                  0         1         2         3
bar two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066

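I have tried many other approaches, but none of them does what I want. Thank you for your consideration.

Edit:

This does not work either: df_1.loc[pd_idx[df_2_selection.index.get_level_values(0), df_2_selection.index.get_level_values(1), :], :]

I only want the rows where both levels match, not the rows where either level matches.

Edit 2: this solution was posted in an answer that has since been deleted: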

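# Extend each (level 0, level 1) key with the list of all third-level labels.
# The trailing comma makes this a tuple + tuple concatenation.
id = [x + (list(df_1.index.levels[-1]),) for x in df_2_selection.index.values]

pd.concat([df_1.loc[x] for x in id])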

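It does work! However, it is very slow on large DataFrames. Any help with a different approach or a speedup would be much appreciated.

【Comments】:

Did you mean "I want to find all rows in df_1 that match the index of df_2_selection"?
Yes, that is exactly what I meant. Sorry for the confusing wording.

Tags: python pandas indexing slice multi-index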

【Solution 1】:

You can use reset_index() and merge().

With df_2_selection as:

                0         1         2         3
foo two -0.530151  0.932007 -1.255259  2.441294
qux one  2.006270  1.087412 -0.840916 -1.225508

Merge:

lvls = ["level_0","level_1"]

(df_1.reset_index()
 .merge(df_2_selection.reset_index()[lvls], on=lvls)
 .set_index(["level_0","level_1","level_2"])
 .rename_axis([None]*3)
)

Output:

                  0         1         2         3
foo two b -0.112696  0.287421 -0.380692 -0.035471
qux one b  0.658227  0.632667 -0.193224  1.073132

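Note: the rename_axis() call only removes the level names such as level_0. It is purely cosmetic and is not needed for the matching itself.

【Comments】:

This solution works great! Thanks! For some reason I assumed this functionality would be built into the MultiIndex object, but I guess it isn't.
Great! If this answer solved your problem, please click the check mark to the left of the answer to mark it as accepted.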
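For readers who want to run the merge approach end to end, here is a self-contained sketch. The integer-valued df_1 and the zero-filled df_2_selection are assumptions made for reproducibility and are not the data from the question. The drop_duplicates() step is also an addition that is not in the answer: it guards against repeated keys in the selection, which would otherwise repeat the matching rows of df_1 in the merge result.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_1: three unnamed index levels, deterministic values.
idx_1 = pd.MultiIndex.from_arrays([
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
    ["a", "a", "a", "a", "b", "b", "b", "b"],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx_1).sort_index()

# Hypothetical stand-in for df_2_selection: two unnamed index levels.
df_2_selection = pd.DataFrame(
    np.zeros((2, 4)),
    index=pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")]),
)

# reset_index() names unnamed levels level_0, level_1, ...
lvls = ["level_0", "level_1"]

# Keep only the key columns; drop_duplicates() is an added safeguard (assumption).
keys = df_2_selection.reset_index()[lvls].drop_duplicates()

result = (
    df_1.reset_index()
    .merge(keys, on=lvls)                        # inner join on the first two levels
    .set_index(["level_0", "level_1", "level_2"])
    .rename_axis([None] * 3)                     # cosmetic: drop the level names
)
print(result)
```

If the index levels already have names, those names would replace level_0, level_1 and level_2 in the lists above.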

【Solution 2】:

Try this:

pd.concat([
    df_1.xs(key, drop_level=False)
    for key in df_2_selection.index.values])

【Comments】:

This works too, thanks! However, the solution above (resetting the index) is faster on larger datasets.
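As a possible alternative that neither answer proposes, the two-level match can also be expressed as a boolean mask: drop the third level from df_1's index and test the remaining (level 0, level 1) pairs against the selection's index with MultiIndex.isin. The sketch below again uses hypothetical deterministic stand-ins for df_1 and df_2_selection. It was not benchmarked against the answers above, so no speed claim is made.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_1: same index layout, deterministic values.
idx_1 = pd.MultiIndex.from_arrays([
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
    ["a", "a", "a", "a", "b", "b", "b", "b"],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=idx_1).sort_index()

# Hypothetical stand-in for df_2_selection.
df_2_selection = pd.DataFrame(
    np.zeros((2, 4)),
    index=pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")]),
)

# Drop the third level so both indexes hold comparable (level 0, level 1) pairs,
# then test each pair as a whole against the selection's index.
mask = df_1.index.droplevel(2).isin(df_2_selection.index)

df_1_select = df_1[mask]
print(df_1_select)
```

Because the pairs are compared as whole tuples, only rows where both levels match are kept, and the original three-level index of df_1 is preserved.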