How does pandas IndexSlice work? (answers)

【Question title】: How does pandas IndexSlice work?
【Posted】: 2017-10-20 14:55:57
【Question description】:

I am following this tutorial: GitHub Link

Scroll down to the section titled "Exercise: Select the most-reviewd beers" (Ctrl+F for that exact string):

The DataFrame has a MultiIndex:

To select the most-reviewed beers:

top_beers = df['beer_id'].value_counts().head(10).index
reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']]

My question is about how IndexSlice is used here: how can the colon after top_beers be omitted and the code still run? I expected to have to write:

reviews.loc[pd.IndexSlice[:, top_beers, :], ['beer_name', 'beer_style']]

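There are three index levels: profile_name, beer_id and time. Why does pd.IndexSlice[:, top_beers] work without saying what to do with the time level?

【Comments on the question】:

That is what the : operator does. You are filtering on only one of the three levels of the hierarchical index. The other two (the ones given as :) can take any value. You can think of : as a filter that evaluates to True for every value.
@GustavoBezerra The point is that the code works even without the third :. reviews.loc[pd.IndexSlice[:, top_beers], ['beer_name', 'beer_style']] runs fine although the third : is missing.
top_beers is list-like. You are filtering the second index level, beer_id, by top_beers. The other two levels default to all values. If you want to slice by a range, use slice(a, b).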

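Tags: python pandas

【Solution 1】:

To complement the previous answer, let me explain how pd.IndexSlice works and why it is useful.

There is not much to say about its implementation. As you can read in the source, it just does the following: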

class IndexSlice(object):
    def __getitem__(self, arg):
        return arg

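From this we see that pd.IndexSlice merely forwards the argument that __getitem__ receives. Looks rather silly, doesn't it? Nevertheless, it does accomplish something.

As you certainly know, accessing an object obj with the bracket operator, obj[arg], calls obj.__getitem__(arg). For sequence-like objects, arg can be an integer or a slice object. We rarely construct slices ourselves. Instead, we use the slice notation : for that purpose, e.g. obj[0:5].

Here is the key point. The Python interpreter converts the slice notation : into slice objects before it calls the object's __getitem__(arg) method. The return value of IndexSlice.__getitem__() is therefore a slice, a scalar such as an integer (if no : was used), or a tuple of these (if several arguments were passed). In short, the only purpose of IndexSlice is to spare us from constructing slices by hand. This is particularly useful with pd.DataFrame.loc.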
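
The mechanism is plain Python and can be reproduced without pandas. The following is a minimal sketch, where Echo is a hypothetical stand-in (not the pandas class) that returns whatever the bracket operator hands to __getitem__:

```python
class Echo:
    """Hypothetical stand-in for pd.IndexSlice: returns the key unchanged."""

    def __getitem__(self, key):
        return key


echo = Echo()

single = echo[0:5]           # the interpreter builds a slice object first
pair = echo[0:3, 'a':'c']    # several keys arrive as a single tuple
everything = echo[:]         # a bare ':' becomes slice(None, None, None)
scalar = echo[7]             # no ':' means the value is passed through as-is

print(single)
print(pair)
print(everything)
print(scalar)
```

Nothing in Echo knows about slices; the conversion happens before __getitem__ is called, which is why such a trivial class is enough.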

Let's first look at the following examples:

import pandas as pd
idx = pd.IndexSlice
print(idx[0])               # 0
print(idx[0,'a'])           # (0, 'a')
print(idx[:])               # slice(None, None, None)
print(idx[0:3])             # slice(0, 3, None)
print(idx[0.1:2.3])         # slice(0.1, 2.3, None)
print(idx[0:3,'a':'c'])     # (slice(0, 3, None), slice('a', 'c', None))

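We observe that every use of the colon : is converted into a slice object. If several arguments are passed to the index operator, they arrive as an n-tuple.

To demonstrate how this is useful for a pandas DataFrame df with a multi-level index, consider the following.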

# A sample table with three-level row-index
# and single-level column index.
import numpy as np
level0 = range(0,10)
level1 = list('abcdef')
level2 = ['I', 'II', 'III', 'IV']
mi = pd.MultiIndex.from_product([level0, level1, level2])
df = pd.DataFrame(np.random.random([len(mi),2]), 
                  index=mi, columns=['col1', 'col2'])

# Return a view on 'col1', selecting all rows.
df.loc[:,'col1']            # pd.Series         

# Note: in the above example, the returned value has type
# pd.Series, because only one column is returned. One can 
# enforce the returned object to be a data-frame:
df.loc[:,['col1']]          # pd.DataFrame, or
df.loc[:,'col1'].to_frame() # 

# Select all rows with top-level values 0:3.
df.loc[0:3, 'col1']   

# If we want to slice on multiple index levels, we need to
# pass several slices at once. The following, however, is a
# SyntaxError, because the slice notation ':' is not allowed
# inside a list literal (hence it is commented out here).
# df.loc[[0:3, 'a':'c'], 'col1']

# The following is valid python code, but looks clumsy:
df.loc[(slice(0, 3, None), slice('a', 'c', None)), 'col1']

# Here is why pd.IndexSlice is useful. It helps
# to create a slice that makes use of two index-levels.
df.loc[idx[0:3, 'a':'c'], 'col1'] 

# We can expand the slice specification by a third level.
df.loc[idx[0:3, 'a':'c', 'I':'III'], 'col1'] 

# A solitary slicing operator ':' means: take them all.
# It is equivalent to slice(None).
df.loc[idx[0:3, 'a':'c', :], 'col1'] # pd.Series

# Semantically, this is equivalent to the following,
# because the last ':' in the previous example does 
# not add any information about the slice specification.
df.loc[idx[0:3, 'a':'c'], 'col1']    # pd.Series

# The following lines are also equivalent, but
# both expressions evaluate to a result with multiple columns.
df.loc[idx[0:3, 'a':'c', :], :]    # pd.DataFrame
df.loc[idx[0:3, 'a':'c'], :]       # pd.DataFrame

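In summary, pd.IndexSlice improves readability when specifying slices for row and column indexes.

What pandas does with these slices is another story. Essentially, it selects rows/columns starting from the top-most index level and narrows the selection as it moves down the levels, depending on how many levels were specified. pd.DataFrame.loc is an object with its own __getitem__() function that does all of this.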
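
The equivalences claimed in the walkthrough can be checked directly. The sketch below rebuilds the same kind of three-level sample table (random values, so only shapes and equality are compared) and assumes a reasonably recent pandas version:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Same layout as the sample table: a three-level row index, two columns.
mi = pd.MultiIndex.from_product(
    [range(10), list('abcdef'), ['I', 'II', 'III', 'IV']]
)
df = pd.DataFrame(np.random.random([len(mi), 2]),
                  index=mi, columns=['col1', 'col2'])

# Hand-built tuple of slices versus the IndexSlice spelling.
explicit = df.loc[(slice(0, 3), slice('a', 'c'), slice(None)), 'col1']
with_colon = df.loc[idx[0:3, 'a':'c', :], 'col1']

# Dropping the trailing ':' leaves the third level unrestricted.
without_colon = df.loc[idx[0:3, 'a':'c'], 'col1']

print(with_colon.equals(explicit))
print(with_colon.equals(without_colon))
print(len(without_colon))
```

Label-based slices include both endpoints, so the selection covers 4 top-level values, 3 second-level values and all 4 third-level values.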

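As you already pointed out in one of your comments, pandas seems to behave oddly in some special cases. The two examples you mentioned should actually evaluate to the same result. However, pandas treats them differently internally.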

# This will work.
reviews.loc[idx[top_reviewers,        99, :], ['beer_name', 'brewer_id']]
# This will fail with TypeError "unhashable type: 'Index'".
reviews.loc[idx[top_reviewers,        99]   , ['beer_name', 'brewer_id']]
# This fixes the problem. (pd.Index is not hashable, a tuple is.
# However, the problem matters only with the second expression.)
reviews.loc[idx[tuple(top_reviewers), 99]   , ['beer_name', 'brewer_id']]

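Admittedly, the difference is subtle.

【Discussion】:

What if the index consists of floats? How would it work then?
@arash: The same way. slice() is agnostic to the data type. It merely bundles the start, stop and step information. How a particular slice (e.g. slice(0.1, 2.3, 4.5)) is interpreted is up to the object that receives it. For df = pd.DataFrame([[1,2,3],[4,5,6]], columns=[0.1,2.3,4.5]), you can access all columns with idx[0.1:4.5], which is consistent with the behavior for other index types. It is also no surprise that pandas raises an error for idx[0.1:4.5:2.3], since it cannot give meaning to a float step.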
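
The float case from the comment above can be sketched as follows. The frame is the one given in the comment; the variable names all_cols and first_two are only illustrative:

```python
import pandas as pd

idx = pd.IndexSlice

# Column labels are floats; the slice bounds are labels, not positions.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[0.1, 2.3, 4.5])

float_slice = idx[0.1:4.5]                 # just slice(0.1, 4.5, None)
all_cols = df.loc[:, float_slice]          # both endpoints are included
first_two = df.loc[:, idx[0.1:2.3]]

print(float_slice)
print(list(all_cols.columns))
print(list(first_two.columns))
```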
@arash See also this answer

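【Solution 2】:

Pandas only requires you to specify enough levels of the MultiIndex to remove ambiguity. Since you are slicing on the 2nd level, you need the first : to say that you are not filtering on that level.

Any further levels left unspecified are returned in full, which is equivalent to : on each of those levels.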
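
A small sketch of this rule, using a made-up stand-in for the reviews table. The level names follow the question, but the data, the score column and the plain-list top_beers are assumptions for illustration only:

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical miniature of the tutorial's three-level index.
mi = pd.MultiIndex.from_product(
    [['ann', 'bob'], [10, 20, 30], ['mon', 'tue']],
    names=['profile_name', 'beer_id', 'time'],
)
reviews = pd.DataFrame({'score': list(range(len(mi)))}, index=mi)

top_beers = [10, 30]   # a plain list here; the tutorial derives an Index

# The leading ':' is needed to reach the second level ...
short = reviews.loc[idx[:, top_beers], ['score']]
# ... while the trailing ':' for the 'time' level is optional.
full = reviews.loc[idx[:, top_beers, :], ['score']]

print(short.equals(full))
print(sorted(set(short.index.get_level_values('beer_id'))))
```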

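【Discussion】:

If that is the case, why can't I remove the colon from this line in the same tutorial: reviews.loc[pd.IndexSlice[top_reviewers, 99, :], ['beer_name', 'brewer_id']]? If I remove the colon and the comma after 99, I get an error.
I'm not sure. Judging from the error message about Index not being hashable, it probably takes a different indexing code path. Could you open an issue on GitHub with a simpler example? We'll take a look.
@Cheng: The problem is that top_reviewers is of type pd.Index, which apparently does not work out of the box. To fix this, you can first convert it to a list (which can in turn be converted to a hashable object). So the following will work: reviews.loc[pd.IndexSlice[top_reviewers.tolist(), 99], ['beer_name', 'brewer_id']]
@Cheng You did, however, discover a small inconsistency in how pandas handles slices: top_reviewers is not treated in exactly the same way in pd.IndexSlice[top_reviewers, 99, :] and pd.IndexSlice[top_reviewers, 99]; the latter raises an error, while the former does not.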
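
The tolist() workaround suggested above can be sketched on made-up data. The table contents are hypothetical, and whether the unconverted pd.Index form still raises may depend on the pandas version, so the sketch only exercises the list-based spelling:

```python
import pandas as pd

idx = pd.IndexSlice

# Hypothetical miniature of the reviews table.
mi = pd.MultiIndex.from_product(
    [['ann', 'bob', 'cat'], [98, 99], ['mon', 'tue']],
    names=['profile_name', 'beer_id', 'time'],
)
reviews = pd.DataFrame({'beer_name': ['x'] * len(mi)}, index=mi)

top_reviewers = pd.Index(['ann', 'cat'])
as_list = top_reviewers.tolist()   # the conversion suggested in the thread

# With a list, the trailing ':' can be kept or dropped.
with_colon = reviews.loc[idx[as_list, 99, :], ['beer_name']]
without_colon = reviews.loc[idx[as_list, 99], ['beer_name']]

print(with_colon.equals(without_colon))
print(len(without_colon))
```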