In the Weeds on Pandas Multi-Indexing Using Slicers: Answers

【Question title】: In the Weeds on Pandas Multi-Indexing Using Slicers
【Posted】: 2014-10-21 05:16:41
【Question】:

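I'm trying to take advantage of the new Pandas functionality for accessing a MultiIndex with slicers, but I'm running into trouble with what look like fairly simple slicing problems, so I thought I'd ask the group here for help.

Below is a code sample containing some examples that work and some that don't: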

import pandas as pd

# Displays: '0.14.1'
pd.__version__

df = pd.DataFrame({'A': ['A0'] * 5 + ['A1']*5 + ['A2']*5,
            'B': ['B0','B0','B1','B1','B2'] * 3,
            'DATE': ["2013-06-11",
                    "2013-07-02",
                    "2013-07-09",
                    "2013-07-30",
                    "2013-08-06",
                    "2013-06-11",
                    "2013-07-02",
                    "2013-07-09",
                    "2013-07-30",
                    "2013-08-06",
                    "2013-09-03",
                    "2013-10-01",
                    "2013-07-09",
                    "2013-08-06",
                    "2013-09-03"],
             'VALUES': [22, 35, 14,  9,  4, 40, 18, 4, 2, 5, 1, 2, 3,4, 2]})

df['DATE'] = pd.to_datetime(df['DATE'])

df1 = df.set_index(['A', 'B', 'DATE'])
# sortlevel() is the 0.14.x API; later pandas releases use sort_index() instead
df1 = df1.sortlevel()

df2 = df.set_index('DATE')

# A1 - Works - Get all values under "A0" and "A1"
df1.loc[(slice('A1')),:]

# A2 - Works - Get all values from the start to "A2"
df1.loc[(slice('A2')),:]

# A3 - Works - Get all values under "B1" or "B2"
df1.loc[(slice(None),slice('B1','B2')),:]

# A4 - Works - Get all values between 2013-07-02 and 2013-07-09
df1.loc[(slice(None),slice(None),slice('20130702','20130709')),:]

##############################################
# These do not work and I'm wondering why... #
##############################################

# B1 - Does not work - Get all values in B0 that are also under A0, A1 and A2
df1.loc[(slice('A2'),slice('B0')),:]

# B2 - Does not work - Get all values in B0, B1 and B2 (similar to what A2 is doing for the As)
df1.loc[(slice(None),slice('B2')),:]

# B3 - Does not work - Get all values from B1 to B2 and up to 2013-08-06
df1.loc[(slice(None),slice('B1','B2'),slice('2013-08-06')),:]

# B4 - Does not work - Same as A4 but the start of the date slice is not a key.
#                      Would have thought the behavior would be similar to something like df2['20130701':]
#                      In other words, date indexing allowed starting on non-key points
df1.loc[(slice(None),slice(None),slice('20130701','20130709')),:]

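While there are certainly other, simpler ways to get at this data, I'd like answers to these specific examples so that I can use that knowledge as building blocks for more complex MultiIndex slicing down the road.

Thanks in advance for your help!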
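For comparison, the single-level behaviour that B4 alludes to can be sketched as below. This is a minimal illustration and not part of the original post: the small frame `demo` and its values are hypothetical, and the index is sorted first on the assumption that recent pandas releases reject label slices with missing endpoints on an unsorted DatetimeIndex.

```python
import pandas as pd

# Hypothetical single-level frame; 2013-07-01 is deliberately NOT one of the keys.
demo = pd.DataFrame(
    {"VALUES": [22, 35, 14, 9]},
    index=pd.to_datetime(["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30"]),
)
demo = demo.sort_index()  # keep the index sorted before slicing by label

# The slice starts on a date that is absent from the index.
# Label-based slices include both endpoints when they are present.
window = demo.loc["2013-07-01":"2013-07-09"]
print(window)
```

The slice simply starts at the first date on or after 2013-07-01, which is the behaviour the question expected from the DATE level of the MultiIndex as well.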

【Comments】:

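Please show your pandas version; 0.14.1 fixed several bugs in the multi-index slicers from 0.14.0 (and master has more fixes).
You might consider using the idx = pd.IndexSlice syntax to make these easier to read. pandas.pydata.org/pandas-docs/stable/…
You also need to call .sortlevel(), or otherwise make sure the MultiIndex is sorted, or this won't work. It may not be raising an error (and may be trying to work even though the index is unsorted, which is probably a bug).
Sorry, I just added a few lines of code to show the version and to sort the DataFrame. I believe I still get the same errors even with the sort (on version 0.14.1). Just curious, does the sample code work for you, Jeff? Thanks for the suggestion, chrisb; I'll take a look at that now as well.
I'll have to look at these in more detail tomorrow. I'll let you know.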
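The pd.IndexSlice suggestion from the comments can be sketched as follows. This is an illustration rather than code from the thread: it rebuilds the question's frame, uses sort_index() on the assumption that a recent pandas is installed (where sortlevel() is no longer available), and assumes a release that includes the slicer fixes (0.15.0 or later).

```python
import pandas as pd

# Rebuild the frame from the question.
dates = (["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30", "2013-08-06"] * 2
         + ["2013-09-03", "2013-10-01", "2013-07-09", "2013-08-06", "2013-09-03"])
df = pd.DataFrame({
    "A": ["A0"] * 5 + ["A1"] * 5 + ["A2"] * 5,
    "B": ["B0", "B0", "B1", "B1", "B2"] * 3,
    "DATE": pd.to_datetime(dates),
    "VALUES": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3, 4, 2],
})

# Slicers need a sorted MultiIndex.
df1 = df.set_index(["A", "B", "DATE"]).sort_index()

idx = pd.IndexSlice

# Same selection as B1: up to "A2" on level A, up to "B0" on level B.
b1 = df1.loc[idx[:"A2", :"B0", :], :]

# Same selection as B3: "B1" through "B2", dates up to 2013-08-06.
b3 = df1.loc[idx[:, "B1":"B2", :"2013-08-06"], :]

print(b1)
print(b3)
```

Each position inside `idx[...]` corresponds to one index level, so `:` plays the role of `slice(None)` and `:"B0"` the role of `slice('B0')`.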

Tags: python pandas slice multi-index

【Solution 1】:

This PR, http://github.com/pydata/pandas/pull/8134, was just merged into master (0.15.0) and fixes the cases that did not work.

# A1 - Works - Get all values under "A0" and "A1"
df1.loc[(slice('A1')),:]

                  VALUES
A  B  DATE              
A0 B0 2013-06-11      22
      2013-07-02      35
   B1 2013-07-09      14
      2013-07-30       9
   B2 2013-08-06       4
A1 B0 2013-06-11      40
      2013-07-02      18
   B1 2013-07-09       4
      2013-07-30       2
   B2 2013-08-06       5

# A2 - Works - Get all values from the start to "A2"
df1.loc[(slice('A2')),:]

                  VALUES
A  B  DATE              
A0 B0 2013-06-11      22
      2013-07-02      35
   B1 2013-07-09      14
      2013-07-30       9
   B2 2013-08-06       4
A1 B0 2013-06-11      40
      2013-07-02      18
   B1 2013-07-09       4
      2013-07-30       2
   B2 2013-08-06       5
A2 B0 2013-09-03       1
      2013-10-01       2
   B1 2013-07-09       3
      2013-08-06       4
   B2 2013-09-03       2

# A3 - Works - Get all values under "B1" or "B2"
df1.loc[(slice(None),slice('B1','B2')),:]

                  VALUES
A  B  DATE              
A0 B1 2013-07-09      14
      2013-07-30       9
   B2 2013-08-06       4
A1 B1 2013-07-09       4
      2013-07-30       2
   B2 2013-08-06       5
A2 B1 2013-07-09       3
      2013-08-06       4
   B2 2013-09-03       2

# A4 - Works - Get all values between 2013-07-02 and 2013-07-09
df1.loc[(slice(None),slice(None),slice('20130702','20130709')),:]

                  VALUES
A  B  DATE              
A0 B0 2013-07-02      35
   B1 2013-07-09      14
A1 B0 2013-07-02      18
   B1 2013-07-09       4
A2 B1 2013-07-09       3

# B1 -  Get all values in B0 that are also under A0, A1 and A2
df1.loc[(slice('A2'),slice('B0')),:]

                  VALUES
A  B  DATE              
A0 B0 2013-06-11      22
      2013-07-02      35
A1 B0 2013-06-11      40
      2013-07-02      18
A2 B0 2013-09-03       1
      2013-10-01       2

# B2 - Get all values in B0, B1 and B2 (similar to what A2 is doing for the As)
df1.loc[(slice(None),slice('B2')),:]

                  VALUES
A  B  DATE              
A0 B0 2013-06-11      22
      2013-07-02      35
   B1 2013-07-09      14
      2013-07-30       9
   B2 2013-08-06       4
A1 B0 2013-06-11      40
      2013-07-02      18
   B1 2013-07-09       4
      2013-07-30       2
   B2 2013-08-06       5
A2 B0 2013-09-03       1
      2013-10-01       2
   B1 2013-07-09       3
      2013-08-06       4
   B2 2013-09-03       2

# B3 - Get all values from B1 to B2 and up to 2013-08-06
df1.loc[(slice(None),slice('B1','B2'),slice('2013-08-06')),:]

                  VALUES
A  B  DATE              
A0 B1 2013-07-09      14
      2013-07-30       9
   B2 2013-08-06       4
A1 B1 2013-07-09       4
      2013-07-30       2
   B2 2013-08-06       5
A2 B1 2013-07-09       3
      2013-08-06       4

# B4 - Same as A4 but the start of the date slice is not a key.
df1.loc[(slice(None),slice(None),slice('20130701','20130709')),:]

                  VALUES
A  B  DATE              
A0 B0 2013-07-02      35
   B1 2013-07-09      14
A1 B0 2013-07-02      18
   B1 2013-07-09       4
A2 B1 2013-07-09       3
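As a rough way to re-check the B cases on a current installation, the sketch below rebuilds the frame and compares each selection with the row positions visible in the output above. It assumes a recent pandas in which sort_index() replaces sortlevel(); the positional lists are read off the printed results and are not taken from the PR itself.

```python
import pandas as pd

# Rebuild the frame from the question.
dates = (["2013-06-11", "2013-07-02", "2013-07-09", "2013-07-30", "2013-08-06"] * 2
         + ["2013-09-03", "2013-10-01", "2013-07-09", "2013-08-06", "2013-09-03"])
df = pd.DataFrame({
    "A": ["A0"] * 5 + ["A1"] * 5 + ["A2"] * 5,
    "B": ["B0", "B0", "B1", "B1", "B2"] * 3,
    "DATE": pd.to_datetime(dates),
    "VALUES": [22, 35, 14, 9, 4, 40, 18, 4, 2, 5, 1, 2, 3, 4, 2],
})
df1 = df.set_index(["A", "B", "DATE"]).sort_index()

# The four selections that failed on 0.14.1.
results = {
    "B1": df1.loc[(slice("A2"), slice("B0")), :],
    "B2": df1.loc[(slice(None), slice("B2")), :],
    "B3": df1.loc[(slice(None), slice("B1", "B2"), slice("2013-08-06")), :],
    "B4": df1.loc[(slice(None), slice(None), slice("20130701", "20130709")), :],
}

# Row positions in the sorted frame that each selection should return.
expected = {
    "B1": [0, 1, 5, 6, 10, 11],
    "B2": list(range(15)),
    "B3": [2, 3, 4, 7, 8, 9, 12, 13],
    "B4": [1, 2, 6, 7, 12],
}

# True for a case means the selection matches the expected rows exactly.
checks = {name: results[name].equals(df1.iloc[rows]) for name, rows in expected.items()}
print(checks)
```

Comparing against `iloc` positions keeps the check independent of how the slicers resolve labels, so a regression in any one case shows up as a single `False` entry.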

【Comments】:

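Thanks for the fix, Jeff. This may be a silly question, but now that the fix is in the repository, is the right way to get it before the 0.15 release to download the SVN code and build it myself? So far my limited knowledge extends only to double-clicking an EXE.
I'm guessing you're on Windows. I usually publish binaries at pandas.pydata.org/pandas-build/dev. Those haven't been updated to the latest yet (there is currently a problem on the server). Please check back later.
Thanks a lot, Jeff. I'll keep checking your build site for updated binaries. Hope you have a great Labor Day weekend (or just a regular weekend, if you're not in the US).
I just posted some Windows binaries here: github.com/pydata/pandas/releases (the 0.15pre release).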