【Question title】: Sort multi index dataframe using a reference list
【Posted】: 2021-10-23 23:30:51
【Question】:

Given a multi-index df as follows:

mylevelA caseA VAR_A   mylevelA_caseA__VAR_A   bar one -0.054973 -0.092080
         caseC VAR_B   mylevelA_caseC__VAR_B   bar two -0.282347  0.882559
               VAR_A   mylevelA_caseC__VAR_A   baz one -0.691023  0.879495
         caseB VAR_B   mylevelA_caseB__VAR_B   baz two -0.321049  1.036407
         caseA VAR_C   mylevelA_caseA__VAR_C   foo one -0.411117  0.523282
         caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two -0.998682  0.232587
         caseC VAR_E   mylevelA_caseC__VAR_E   qux one  0.690079  0.985688
         caseD VAR_A   mylevelA_caseD__VAR_A   qux two -2.151700  0.554983

I want to sort level=1 according to the list

order_list=['caseC','caseB','caseD','caseA']

which would produce the following result:

                                                            col1      col2
mylevelA 
         caseC VAR_A   mylevelA_caseC__VAR_A   baz one  1.135174 -0.547376
               VAR_E   mylevelA_caseC__VAR_E   qux one  0.021435 -0.047488
               VAR_B   mylevelA_caseC__VAR_B   bar two -0.892378  2.649619
         caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two  1.945302 -1.848938
               VAR_B   mylevelA_caseB__VAR_B   baz two -2.552820  1.025900
         caseD VAR_A   mylevelA_caseD__VAR_A   qux two -0.833289 -1.478944
         caseA VAR_C   mylevelA_caseA__VAR_C   foo one  1.269452  0.956567

I have the impression this can be solved with sort_values and sort_index:

df=df.sort_values(df.columns.tolist()).sort_index(level=1, ascending=False,
                                                        sort_remaining=False)

However, as far as I can see, sort_index only offers the ascending parameter to control the order.
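
For what it is worth, sort_index in pandas 1.1 and later also accepts a key callable, which is applied to the level being sorted. The following is a minimal sketch under that assumption; the reduced three-level index, the sample values, and the helper name rank are illustrative and not part of the original post.

```python
import pandas as pd

# Small, deterministic stand-in for the frame in the question (illustrative data).
index = pd.MultiIndex.from_tuples([
    ("mylevelA", "caseA", "VAR_A"),
    ("mylevelA", "caseC", "VAR_B"),
    ("mylevelA", "caseC", "VAR_A"),
    ("mylevelA", "caseB", "VAR_B"),
    ("mylevelA", "caseD", "VAR_A"),
])
df = pd.DataFrame({"col1": [10, 20, 30, 40, 50]}, index=index)

order_list = ["caseC", "caseB", "caseD", "caseA"]

# Hypothetical helper: map each label to its position in order_list.
rank = {label: pos for pos, label in enumerate(order_list)}

# key receives the values of the sorted level and returns their sort keys.
sorted_df = df.sort_index(
    level=1,
    key=lambda labels: labels.map(rank),
    sort_remaining=False,
)
print(sorted_df)
```

Here key replaces each level-1 label by its position in order_list, so the rows are ordered by that position rather than alphabetically.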

In addition, running the expression above in the following code gives the output shown further down:

import pandas as pd
import numpy as np
import re
from itertools import chain
arrays = [["mylevelA_caseA__VAR_A", "mylevelA_caseC__VAR_B", "mylevelA_caseC__VAR_A",
           "mylevelA_caseB__VAR_B", "mylevelA_caseA__VAR_C", "mylevelA_caseB__VAR_C_D",
           "mylevelA_caseC__VAR_E", "mylevelA_caseD__VAR_A"],
          ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
          ["one", "two", "one", "two", "one", "two", "one", "two"]]


df = pd.DataFrame(np.random.randn(8, 2), index=arrays,columns=['col1','col2'])
df.index = pd.MultiIndex.from_tuples([re.split('__?',e[0], maxsplit=2)+list(e)
                                      for e in df.index])

df=df.sort_values(df.columns.tolist()).sort_index(level=1, ascending=False,
                                                        sort_remaining=False)

Output

                                                            col1      col2
mylevelA caseD VAR_A   mylevelA_caseD__VAR_A   qux two  1.240834 -0.097545
         caseC VAR_B   mylevelA_caseC__VAR_B   bar two -0.293481  1.342649
               VAR_E   mylevelA_caseC__VAR_E   qux one -0.581308 -1.370208
               VAR_A   mylevelA_caseC__VAR_A   baz one -1.179519  1.006746
         caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two  0.430511  0.447371
               VAR_B   mylevelA_caseB__VAR_B   baz two -0.355763 -1.794507
         caseA VAR_A   mylevelA_caseA__VAR_A   bar one  0.747331 -0.476303
               VAR_C   mylevelA_caseA__VAR_C   foo one -0.702220  0.237277

My question: how can the multi-index be sorted using the given order_list?

【Question discussion】:

    Tags: python pandas sorting multi-index


    【Solution 1】:

    Instead of sort_index, you can use reindex(), as follows:

    order_list=['caseC','caseB','caseD','caseA']
    
    df.reindex(level=1, labels=order_list)
    

    Result:

                                                                col1      col2
    mylevelA caseC VAR_B   mylevelA_caseC__VAR_B   bar two  1.536922 -1.285441
                   VAR_A   mylevelA_caseC__VAR_A   baz one  0.734785  0.845596
                   VAR_E   mylevelA_caseC__VAR_E   qux one -0.577822 -0.689958
             caseB VAR_B   mylevelA_caseB__VAR_B   baz two -0.740523  0.345331
                   VAR_C_D mylevelA_caseB__VAR_C_D foo two  0.534257 -0.120670
             caseD VAR_A   mylevelA_caseD__VAR_A   qux two  1.327925  0.242728
             caseA VAR_A   mylevelA_caseA__VAR_A   bar one  1.530633 -0.190661
                   VAR_C   mylevelA_caseA__VAR_C   foo one -0.290205 -0.323746
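
The reindex() call above can also be tried on a small, deterministic frame. This is only a sketch: the reduced three-level index and the col1 values are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Illustrative data: three index levels instead of the six in the question.
index = pd.MultiIndex.from_tuples([
    ("mylevelA", "caseA", "VAR_A"),
    ("mylevelA", "caseC", "VAR_B"),
    ("mylevelA", "caseC", "VAR_A"),
    ("mylevelA", "caseB", "VAR_B"),
    ("mylevelA", "caseD", "VAR_A"),
])
df = pd.DataFrame({"col1": [10, 20, 30, 40, 50]}, index=index)

order_list = ["caseC", "caseB", "caseD", "caseA"]

# Reorder the rows so that level 1 follows order_list.
# reindex returns a new DataFrame, so the result has to be assigned.
reordered = df.reindex(labels=order_list, level=1)
print(reordered)
```

Note that reindex() returns a new frame rather than modifying df in place. As far as I understand, level-1 labels that are missing from order_list do not appear in the result, so the list should name every label that has to be kept.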
    


      【Solution 2】:

      You can use a categorical type. This solution works together with sort_index. Add this to your code:

      cat_type = pd.CategoricalDtype(
          categories=["caseC", "caseB", "caseD", "caseA"], ordered=True
      )
      
      df.reset_index(inplace=True)
      
      df["level_1"] = df["level_1"].astype(cat_type)
      
      df = (
          df.set_index([i for i in df.columns if i.startswith("level_")])
          .sort_index(level=1, ascending=True, sort_remaining=False)
      )
      
      df.rename_axis(index=df.index.nlevels * [None], inplace=True)
      

      The output will be:

                                                                  col1      col2
      mylevelA caseC VAR_A   mylevelA_caseC__VAR_A   baz one  0.095391  1.723488
                     VAR_E   mylevelA_caseC__VAR_E   qux one -0.505066  0.871808
                     VAR_B   mylevelA_caseC__VAR_B   bar two -1.223648 -0.468713
               caseB VAR_C_D mylevelA_caseB__VAR_C_D foo two -0.747988  0.794639
                     VAR_B   mylevelA_caseB__VAR_B   baz two -0.749597  1.385091
               caseD VAR_A   mylevelA_caseD__VAR_A   qux two -1.071768  0.920789
               caseA VAR_A   mylevelA_caseA__VAR_A   bar one  1.670896 -2.067492
                     VAR_C   mylevelA_caseA__VAR_C   foo one  0.437768  0.417799
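
The same idea can be sketched on a small frame with named index levels, which avoids relying on the automatic level_N column names produced by reset_index. The level names top, case, and var and the sample values are assumptions made for this illustration only.

```python
import pandas as pd

# Illustrative data with named index levels (the names are hypothetical).
index = pd.MultiIndex.from_tuples(
    [
        ("mylevelA", "caseA", "VAR_A"),
        ("mylevelA", "caseC", "VAR_B"),
        ("mylevelA", "caseC", "VAR_A"),
        ("mylevelA", "caseB", "VAR_B"),
        ("mylevelA", "caseD", "VAR_A"),
    ],
    names=["top", "case", "var"],
)
df = pd.DataFrame({"col1": [10, 20, 30, 40, 50]}, index=index)

# An ordered categorical dtype encodes the desired order of the labels.
cat_type = pd.CategoricalDtype(
    categories=["caseC", "caseB", "caseD", "caseA"], ordered=True
)

# Move the index into columns, convert the "case" column, rebuild the index.
flat = df.reset_index()
flat["case"] = flat["case"].astype(cat_type)
sorted_df = (
    flat.set_index(["top", "case", "var"])
    .sort_index(level="case", sort_remaining=False)
)
print(sorted_df)
```

Because the level is an ordered categorical, sort_index follows the category order instead of the alphabetical order of the labels.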
      
