【Question title】: Rearranging columns after getting dummies
【Posted】: 2018-04-08 14:17:31
【Question】:
       A            B            C               D              E
0   165349.20   136897.80    471784.10        New York      192261.83
1   162597.70   151377.59    443898.53        California    191792.06
2   153441.51   101145.55    407934.54        Florida       191050.39
3   144372.41   118671.85    383199.62        New York      182901.99
4   142107.34   91391.77     366168.42        Florida       166187.94

After running df = pd.get_dummies(df, columns=['D']), I get:

        A            B              C           E      D_California  D_Florida        D_New York
0   165349.20    136897.80      471784.10   192261.83      0             0                1
1   162597.70    151377.59      443898.53   191792.06      1             0                0
2   153441.51    101145.55      407934.54   191050.39      0             1                0
3   144372.41    118671.85      383199.62   182901.99      0             0                1
4   142107.34    91391.77       366168.42   166187.94      0             1                0

Is there a way to make the output look like this without using df[['A','B','C','D_California','D_New York','D_Florida','E']]?

        A            B          C      D_California     D_Florida  D_New York       E
0   165349.20   136897.80   471784.10       0               0          1    192261.83
1   162597.70   151377.59   443898.53       1               0          0    191792.06
2   153441.51   101145.55   407934.54       0               1          0    191050.39
3   144372.41   118671.85   383199.62       0               0          1    182901.99
4   142107.34   91391.77    366168.42       0               1          0    166187.94

【Question comments】:

Tags: python-3.x pandas one-hot-encoding


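【Solution 1】:

A general solution that also works when the columns are not in sorted order:
find the position of the column to encode, build the dummies, and concatenate the pieces around that position.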

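j = df.columns.get_loc('D')        # integer position of the column to encode

left = df.iloc[:, :j]              # columns before 'D'
dumb = pd.get_dummies(df[['D']])   # dummy columns, named 'D_<value>'
rite = df.iloc[:, j+1:]            # columns after 'D'

pd.concat([left, dumb, rite], axis=1)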

           A          B          C  D_California  D_Florida  D_New York          E
0  165349.20  136897.80  471784.10             0          0           1  192261.83
1  162597.70  151377.59  443898.53             1          0           0  191792.06
2  153441.51  101145.55  407934.54             0          1           0  191050.39
3  144372.41  118671.85  383199.62             0          0           1  182901.99
4  142107.34   91391.77  366168.42             0          1           0  166187.94
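
The steps above can be wrapped in a small helper so that the same logic applies to any column. This is only a minimal sketch, not part of the original answer: the function name `dummies_in_place` is made up for illustration, and the sample frame reproduces just the first three rows from the question.

```python
import pandas as pd


def dummies_in_place(df, column):
    """Replace `column` with its dummy columns at the same position (hypothetical helper)."""
    j = df.columns.get_loc(column)          # integer position of the column
    left = df.iloc[:, :j]                   # everything before it
    dummies = pd.get_dummies(df[[column]])  # columns named '<column>_<value>'
    right = df.iloc[:, j + 1:]              # everything after it
    return pd.concat([left, dummies, right], axis=1)


sample = pd.DataFrame({
    "A": [165349.20, 162597.70, 153441.51],
    "B": [136897.80, 151377.59, 101145.55],
    "C": [471784.10, 443898.53, 407934.54],
    "D": ["New York", "California", "Florida"],
    "E": [192261.83, 191792.06, 191050.39],
})
result = dummies_in_place(sample, "D")
print(list(result.columns))
```

The sketch assumes the column label is unique, so that `get_loc` returns a single integer.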

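【Comments】:

    【Solution 2】:

    Use sort_index. This works here because the column labels happen to sort alphabetically into the desired order: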

    df.sort_index(axis=1)
    Out[813]: 
               A          B          C  D_California  D_Florida  D_NewYork          E
    0  165349.20  136897.80  471784.10             0          0          1  192261.83
    1  162597.70  151377.59  443898.53             1          0          0  191792.06
    2  153441.51  101145.55  407934.54             0          1          0  191050.39
    3  144372.41  118671.85  383199.62             0          0          1  182901.99
    4  142107.34   91391.77  366168.42             0          1          0  166187.94
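
Because sort_index(axis=1) simply orders the labels alphabetically, it stops giving the wanted layout as soon as the original names are not already in alphabetical order. The sketch below illustrates this with the real column names mentioned in the comments; the two-row frame is only sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70],
    "Administration": [136897.80, 151377.59],
    "Marketing Spend": [471784.10, 443898.53],
    "State": ["New York", "California"],
    "Profit(E)": [192261.83, 191792.06],
})
encoded = pd.get_dummies(df, columns=["State"])

# Alphabetical order: 'Administration' now comes first and
# 'Profit(E)' lands before 'R&D Spend', unlike the original layout.
sorted_cols = list(encoded.sort_index(axis=1).columns)
print(sorted_cols)
```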
    

    Edit: sort the column list with a dict lookup and a lambda as the sort key

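    # dict A maps each original column name to its position in df
    A = dict(zip(df.columns, range(df.shape[1])))
    # encode 'State'; the dummy columns are appended at the end
    df1 = pd.get_dummies(df, columns=['State'])
    # column names in their new, out-of-order arrangement
    yourorder = list(df1)
    # sort by the position of the source column (the part before '_')
    yourorder.sort(key=lambda val: A[val.split(sep='_')[0]])
    df1[yourorder]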
    
    Out[842]: 
       R&D Spend  Administration  Marketing Spend  State_California  State_Florida  State_NewYork  Profit(E)
    0  165349.20       136897.80        471784.10                 0              0              1  192261.83
    1  162597.70       151377.59        443898.53                 1              0              0  191792.06
    2  153441.51       101145.55        407934.54                 0              1              0  191050.39
    3  144372.41       118671.85        383199.62                 0              0              1  182901.99
    4  142107.34        91391.77        366168.42                 0              1              0  166187.94
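
Splitting on '_' assumes that no original column name contains an underscore; a name such as `Marketing_Spend` would raise a KeyError in the lookup. One possible variant, sketched here rather than taken from the answer, avoids the split by treating every column that did not exist before encoding as a dummy of the encoded column. The names `target` and `source_position` and the sample data are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Marketing_Spend": [471784.10, 443898.53, 407934.54],  # name contains '_'
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})
target = "State"

# Position of every original column.
position = {name: i for i, name in enumerate(df.columns)}
encoded = pd.get_dummies(df, columns=[target])


def source_position(col):
    # Columns that did not exist before encoding are dummies of `target`.
    return position.get(col, position[target])


# sorted() is stable, so the dummies keep the order get_dummies gave them.
order = sorted(encoded.columns, key=source_position)
result = encoded[order]
print(list(result.columns))
```

This sketch handles a single encoded column; encoding several columns at once would need a mapping from each dummy back to its own source column.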
    

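    【Comments】:

    • Suppose the column names are not in alphabetical order as they are in my example. Is there another way?
    • These are the original column names: R&D Spend, Administration, Marketing Spend, State, Profit(E). I want to arrange them as: R&D Spend, Administration, Marketing Spend, State_California, State_New York, State_Florida, Profit(E)
    • @ZaleGoldart The only thing I can think of is to split the original df and then concatenate the pieces back together.
    【Solution 3】:

    Not sure whether there is a better way, but this will work: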

    col = ['R&D Spend', 'Administration', 'Marketing Spend', 'State_California', 'State_New York', 'State_Florida', 'Profit(E)']
    
    df=df.loc[:, col]
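
Typing the list by hand becomes tedious as the number of categories grows. A possible way to build the same kind of list programmatically, offered as a sketch and not as part of the answer, is to splice the new dummy names into the original column list at the position of `State`. The three-row frame is sample data, and `new_cols` and `encoded` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "Administration": [136897.80, 151377.59, 101145.55],
    "Marketing Spend": [471784.10, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
    "Profit(E)": [192261.83, 191792.06, 191050.39],
})
encoded = pd.get_dummies(df, columns=["State"])

# Dummy columns are the ones that did not exist before encoding.
new_cols = [c for c in encoded.columns if c not in df.columns]

col = list(df.columns)
j = col.index("State")
col[j:j + 1] = new_cols  # replace 'State' with its dummy columns

result = encoded.loc[:, col]
print(list(result.columns))
```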
    

    【Comments】:
