How to sort a MultiIndex level by the number of rows in the child level: answers

【Question title】: How to sort MultiIndex level by number of rows in the child level
【Posted】: 2019-10-11 06:12:19
【Question】:

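I have historical data on the quantity and amount (the money charged in the transaction) of items sold by a company to many different customers. I want to do some time-series analysis on this data, but at the item-customer level.

Here is my raw data: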

      Year         Month   Day      Qty           Amount     Item   Customer
0     2003         9       1         30.0         220.80     N2719  3110361
1     2003         9       1          1.0          75.17     X1046  3126034
2     2003         9       1        240.0         379.20     D5853  0008933
3     2003         9       1       2112.0        2787.84     D5851  0008933
4     2003         9       1       3312.0        4371.84     D5851  0008933
...
...
<2.7M rows>

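This is transaction data sorted by year/month/day, recording which items were sold to which customers, along with the quantity and amount of each sale.

Since I want to analyze the time series by item and customer, I applied a MultiIndex to it: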

df.set_index(['Item', 'Customer', 'Year', 'Month', 'Day'], inplace=True, drop=True)
df.sort_index(inplace=True)  # originally df.sortlevel(inplace=True), which later pandas releases removed

This gives me a nicely sorted DataFrame that looks like this:

Item      Customer     Year   Month   Day   Qty      Amount
X1046     3126034      2003   9       1     1.0      75.17
                       < ...  other transactions for X1046/3126034 item/customer combination ...>
          3126035      2005   1       2     50.0     500.00
                        < ...  other transactions for X1046/3126035 item/customer combination ...>
      < ... 48 other customers for X1046 ...>

N2719     3110361      2003    9      1     30.0      220.80   
                       < ...  other transactions for N2719/3110361 item/customer combination ...>
          3110362      2004    9      10     9.0     823.00
                       < ...  other transactions for N2719/3110362 item/customer combination ...>
      < ... 198 other customers for N2719 ... >
< ... 6998 other items ... >

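As you can see, I have 7,000 different items, and each item can have tens or hundreds of customers, so I want to focus only on the items with a large customer base. Many items in the dataset were bought by a single customer at some point in the past and may since have been discontinued, and so on.

So I use the following to get the items sorted by number of customers: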

item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique().sort_values(ascending=False)

This gives me the items sorted by number of customers, as a pandas Series:

Item
N2719    200
X1046     50
<... 6998 other rows ...>
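As a minimal sketch of this counting step on a toy frame (the five rows below are made up for illustration and are not taken from the real data), the same per-item counts can be computed from the index levels alone. Using `MultiIndex.to_frame` instead of `reset_index()` is this sketch's own choice, made so that the Qty and Amount columns are not copied; it is not part of the original approach.

```python
import pandas as pd

# Toy data (hypothetical values) with the same index layout as the question.
df = pd.DataFrame({
    'Item': ['X1046', 'X1046', 'N2719', 'N2719', 'N2719'],
    'Customer': ['3126034', '3126035', '3110361', '3110362', '3110363'],
    'Year': [2003, 2005, 2003, 2004, 2004],
    'Month': [9, 1, 9, 9, 9],
    'Day': [1, 2, 1, 10, 10],
    'Qty': [1.0, 50.0, 30.0, 9.0, 9.0],
    'Amount': [75.17, 500.0, 220.8, 823.0, 823.0],
}).set_index(['Item', 'Customer', 'Year', 'Month', 'Day']).sort_index()

# Turn only the index levels into columns, then count distinct customers per item.
levels = df.index.to_frame(index=False)
item_by_customers = (
    levels.groupby('Item')['Customer']
          .nunique()
          .sort_values(ascending=False)
)
print(item_by_customers)
```

On the toy frame, N2719 has three distinct customers and X1046 has two, so N2719 comes first in the resulting Series.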

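Now I want to apply this sort order to my DataFrame, so that the data for item N2719 appears first (keeping all levels of the MultiIndex within it), followed by X1046, and so on.

I can't figure out how to do this.

Here is what I have tried so far: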

sorted_data = df.set_index(item_by_customers.index)
< ... gives me ValueError: Length mismatch: Expected axis has 2.7M elements, new values have 7000 elements ...>

I understand why this error occurs: the index I am passing has 7,000 items, while the DataFrame has 2.7M rows.

I also tried reindexing:

sorted_data = df.reindex(index=item_by_customers.index, columns=['Item'])
< ... gives me Exception: cannot handle a non-unique multi-index! ...>

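There is also sort_index(), but it essentially sorts the index levels by their own values rather than by some other criterion.

I am looking for some guidance on how to apply item_by_customers.index to the DataFrame, so that I end up with a DataFrame like this: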

Item      Customer     Year   Month   Day   Qty      Amount
N2719     3110361      2003    9      1     30.0      220.80   
                       < ...  other transactions for N2719/3110361 item/customer combination ...>
          3110362      2004    9      10     9.0     823.00
                       < ...  other transactions for N2719/3110362 item/customer combination ...>
      < ... 198 other customers for N2719 ... >

X1046     3126034      2003   9       1     1.0      75.17
                       < ...  other transactions for X1046/3126034 item/customer combination ...>
          3126035      2005   1       2     50.0     500.00
                        < ...  other transactions for X1046/3126035 item/customer combination ...>
      < ... 48 other customers for X1046 ...>

< ... 6998 other items ... >

【Question comments】:

Tags: python-3.x pandas sorting dataframe multi-index

【Solution 1】:

`transform`

df.assign(nu=df.groupby('Item').Customer.transform('nunique')) \
   .sort_values(['nu', 'Item'], ascending=[False, True])

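【Comments】:

This gives me KeyError: 'Item', probably because df already has a MultiIndex. Breaking your expression apart, if I run df.reset_index().groupby('Item').Customer.transform('nunique') on its own, it goes through without error, but when I do the sort_values part I get KeyError: 'Item' again. Can you explain why this should work?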
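One likely reading of the KeyError is that the `transform` answer assumes a flat frame in which Item and Customer are ordinary columns, whereas in the question they are index levels. The sketch below is one way to adapt it under that assumption: flatten the index, sort, then rebuild the MultiIndex. The toy data, the `flat` and `result` names, and the choice to include every index level as a tie-breaking sort key are all this sketch's own; newer pandas releases accept index level names in some of these calls, but the sketch does not rely on that.

```python
import pandas as pd

levels = ['Item', 'Customer', 'Year', 'Month', 'Day']

# Toy data (hypothetical values) indexed the same way as in the question.
df = pd.DataFrame({
    'Item': ['X1046', 'X1046', 'N2719', 'N2719', 'N2719'],
    'Customer': ['3126034', '3126035', '3110361', '3110362', '3110363'],
    'Year': [2003, 2005, 2003, 2004, 2004],
    'Month': [9, 1, 9, 9, 9],
    'Day': [1, 2, 1, 10, 10],
    'Qty': [1.0, 50.0, 30.0, 9.0, 9.0],
    'Amount': [75.17, 500.0, 220.8, 823.0, 823.0],
}).set_index(levels).sort_index()

# Move the index levels back into columns so they can be grouped and sorted on.
flat = df.reset_index()

# Per-row count of distinct customers for that row's item.
flat['nu'] = flat.groupby('Item')['Customer'].transform('nunique')

# Largest customer base first; the index levels break ties in ascending order.
result = (
    flat.sort_values(['nu'] + levels, ascending=[False] + [True] * len(levels))
        .drop(columns='nu')
        .set_index(levels)
)
print(result)
```

Listing every level as a sort key makes the row order within each item fully determined, at the cost of a slower sort on a frame with millions of rows.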

【Solution 2】:

Here is one way to achieve this:

import pandas as pd

df = pd.DataFrame({
    'Item':['X1046','X1046','N2719','N2719','N2719'],
    'Customer':['3126034','3126035','3110361','3110362','3110363'],
    'Year':[2003,2005,2003,2004,2004],
    'Month':[9,1,9,9,9],
    'Day':[1,2,1,10,10],
    'Qty':[1,50,30,9,9],
    'Amount':[75.17,500,220,823,823]
})

df.set_index(['Item', 'Customer', 'Year', 'Month', 'Day'], inplace=True, drop=True)
df.sort_index(inplace=True)

item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique().sort_values(ascending=False).rename('Unique_Customers')

df = df.join(item_by_customers, on='Item').sort_values('Unique_Customers', ascending=False)

print(df)

The output is as follows:

                               Qty  Amount  Unique_Customers
Item  Customer Year Month Day
N2719 3110361  2003 9     1     30  220.00                 3
      3110362  2004 9     10     9  823.00                 3
      3110363  2004 9     10     9  823.00                 3
X1046 3126034  2003 9     1      1   75.17                 2
      3126035  2005 1     2     50  500.00                 2

So the basic strategy is to add the unique customer count to the original DataFrame as a column, and then sort on it as needed.
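The same strategy can also be sketched without adding a helper column, by looking up each row's count from its Item label and reordering the rows by position. This is an alternative technique, not the answer's method. It uses a stable argsort because `sort_values` on a single column defaults to quicksort, which does not guarantee that rows with equal counts keep their existing order. The toy data and the `row_counts` and `order` names are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) indexed the same way as in the question.
df = pd.DataFrame({
    'Item': ['X1046', 'X1046', 'N2719', 'N2719', 'N2719'],
    'Customer': ['3126034', '3126035', '3110361', '3110362', '3110363'],
    'Year': [2003, 2005, 2003, 2004, 2004],
    'Month': [9, 1, 9, 9, 9],
    'Day': [1, 2, 1, 10, 10],
    'Qty': [1.0, 50.0, 30.0, 9.0, 9.0],
    'Amount': [75.17, 500.0, 220.8, 823.0, 823.0],
}).set_index(['Item', 'Customer', 'Year', 'Month', 'Day']).sort_index()

# Series mapping each Item to its number of distinct customers.
item_by_customers = df.reset_index().groupby('Item')['Customer'].nunique()

# One count per row, looked up from the row's Item label.
row_counts = df.index.get_level_values('Item').map(item_by_customers)

# Stable argsort on the negated counts: largest count first,
# and rows with equal counts stay in their current (sorted) order.
order = np.argsort(-np.asarray(row_counts), kind='stable')
sorted_df = df.iloc[order]
print(sorted_df)
```

Because the frame was sorted by index beforehand, each item's rows stay contiguous and in their original order. The result is no longer sorted by its index labels, though, so label-based slicing on it may need a `sort_index()` first.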

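【Comments】:

Thanks @WebDev. This works for the most part, except that df.join(item_by_customers) on its own was sufficient. In fact, including on='Item' caused a KeyError. The other thing that did not work for me (probably because I am on Pandas 0.19.2 rather than the latest version) was that sort_values gave me an error when chained as shown in your solution, so I had to do it in two steps. I am inclined to mark this as the answer, since the problems may be entirely on my side.
I tested the solution on pandas 0.24.2. So yes, the adjustments you mention may be needed on earlier versions.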