【Question title】: Python/Pandas transpose
【Posted】: 2020-02-06 19:21:18
【Question】:

I have data in the following wide format, with several measure columns for each of many months, as shown below.

Cust_No Measure1_month1 Measure1_month2 .... Measure1_month72  Measure2_month_1 Measure2_month_2....so on 
1       10             20             .... 500              40               50 
2       20             40             .... 800              70               150             ....    

I would like to reshape it into the following two formats.

Format 1:

+-------------+----------+---------+-------+
| CustNum     | Measure  |   Value | Month |
+-------------+----------+---------+-------+
| 1           | Measure1 | 10      | 1     |
| 1           | Measure1 | 20      | 2     |
| 1           | Measure1 | 30      | 3     |
| 1           | Measure1 | 70      | 4     |
| 1           | Measure1 | 40      | 5     |
| .           | .        | .       | .     |
| .           | .        | .       | .     |
| 1           | Measure1 | 700     | 72    |
| 1           | Measure2 | 30      | 1     |
| 1           | Measure2 | 40      | 2     |
| 1           | Measure2 | 80      | 3     |
| 1           | Measure2 | 90      | 4     |
| 1           | Measure2 | 100     | 5     |
| .           | .        | .       | .     |
| .           | .        | .       | .     |
| .           | .        | .       | .     |
| 1           | Measure2 | 50      | 72    |
+-------------+----------+---------+-------+

And so on for each customer number.

Format 2:

+---------+---------+----------+----------+
| CustNum |   Month | Measure1 | Measure2 |
+---------+---------+----------+----------+
| 1       | 1       | 10       | 30       |
| 1       | 2       | 20       | 40       |
| 1       | 3       | 30       | 80       |
| 1       | 4       | 70       | 90       |
| 1       | 5       | 40       | 100      |
| .       | .       | .        | .        |
| .       | .       | .        | .        |
| 1       | 72      | 700      | 50       |
+---------+---------+----------+----------+

And so on for each customer number.

Could you help me with this?

Thanks.

【Comments】:

    Tags: python pandas transpose


    【Solution 1】:

    Setup

    import pandas as pd

    dct = {'Cust_No': {0: 1, 1: 2},
     'Measure1_month1': {0: 10, 1: 20},
     'Measure1_month2': {0: 20, 1: 40},
     'Measure1_month72': {0: 500, 1: 800},
     'Measure2_month_1': {0: 40, 1: 70},
     'Measure2_month_2': {0: 50, 1: 150}}
    
    df = pd.DataFrame(dct)
    

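    There is a fair amount of wrangling, but in short: split the column names into a MultiIndex, then stack. Your second desired format is a pivot of the first.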
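
    As a small illustration of the splitting step, the sketch below applies it to three of the column names from the setup. The variable names `cols` and `levels` are only for illustration, and `regex=False` is an assumption made here so that the underscore is replaced literally.

```python
import pandas as pd

# Three of the column names from the setup; note the inconsistent 'month_1'.
cols = pd.Index(["Measure1_month1", "Measure1_month72", "Measure2_month_1"])

# Normalise 'month_' to 'month', then split on the remaining underscore.
# With expand=True, splitting an Index yields a two-level MultiIndex.
levels = cols.str.replace("month_", "month", regex=False).str.split("_", expand=True)

print(levels.tolist())
# [('Measure1', 'month1'), ('Measure1', 'month72'), ('Measure2', 'month1')]
```

    Once the columns carry two levels, stacking both of them turns every (Measure, Month) pair into its own row.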


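    # Index by customer so that only the measure columns are left
    d = df.set_index('Cust_No')
    # Normalise 'month_1' to 'month1', then split each name into (Measure, Month) levels
    d.columns = d.columns.str.replace('month_', 'month', regex=False).str.split('_', expand=True)
    
    # Stack both column levels into rows: one row per customer, measure and month
    u = d.stack((0, 1)).rename_axis(
          ['Cust_No', 'Measure', 'Month']).to_frame('Value').reset_index()
    
    # Format 1: keep only the month number (still a string at this point)
    f1 = u.assign(Month=u.Month.str.extract(r'(\d+)')[0])
    
    # Format 2: pivot the measures back out into columns
    f2 = f1.pivot_table(
           index=['Cust_No', 'Month'], columns='Measure', values='Value', fill_value=0)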
    

    Output

    >>> f1                                                   
       Cust_No   Measure Month  Value  
    0        1  Measure1     1   10.0  
    1        1  Measure1     2   20.0  
    2        1  Measure1    72  500.0  
    3        1  Measure2     1   40.0  
    4        1  Measure2     2   50.0  
    5        2  Measure1     1   20.0  
    6        2  Measure1     2   40.0  
    7        2  Measure1    72  800.0  
    8        2  Measure2     1   70.0  
    9        2  Measure2     2  150.0  
    
    >>> f2                                               
    Measure        Measure1  Measure2  
    Cust_No Month                      
    1       1            10        40  
            2            20        50  
            72          500         0  
    2       1            20        70  
            2            40       150  
            72          800         0  
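
    For comparison, the same two formats can also be reached with `melt` instead of a MultiIndex and `stack`. This is a different technique from the one used in this answer, shown only as a minimal sketch on the same sample data. The names `melted`, `parts`, `format1` and `format2` are made up for the example, and the regular expression assumes every column looks like `<Measure>_month<N>` or `<Measure>_month_<N>`.

```python
import pandas as pd

df = pd.DataFrame({
    "Cust_No": [1, 2],
    "Measure1_month1": [10, 20],
    "Measure1_month2": [20, 40],
    "Measure1_month72": [500, 800],
    "Measure2_month_1": [40, 70],
    "Measure2_month_2": [50, 150],
})

# Wide -> long: one row per (customer, original column name).
melted = df.melt(id_vars="Cust_No", var_name="column", value_name="Value")

# Pull the measure name and the month number out of the column name;
# "_?" tolerates both "month1" and "month_1".
parts = melted["column"].str.extract(r"^(?P<Measure>.+?)_month_?(?P<Month>\d+)$")

# Format 1: long table with an integer Month so that it sorts numerically.
format1 = (
    melted.drop(columns="column")
          .assign(Measure=parts["Measure"], Month=parts["Month"].astype(int))
          .sort_values(["Cust_No", "Measure", "Month"])
          .reset_index(drop=True)
          [["Cust_No", "Measure", "Value", "Month"]]
)

# Format 2: one column per measure; combinations missing from the input become NaN.
format2 = (
    format1.pivot(index=["Cust_No", "Month"], columns="Measure", values="Value")
           .rename_axis(columns=None)
           .reset_index()
)
```

    Unlike the `pivot_table(..., fill_value=0)` call above, this sketch leaves the missing Measure2 value for month 72 as NaN, which keeps "no data" distinguishable from a real zero.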
    

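    【Comments】:

    • Hi @user3483203. Could you tell me how you would create this using pyspark? I only need f2 from the output. The data stays the same. Thanks again.
    【Solution 2】:

    Given the input dataframe df: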

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    df = pd.DataFrame(np.random.randint(20,500,(2,144)), 
                 columns = pd.MultiIndex.from_product([['Measure1','Measure2'], [f'Month{i}' for i in range(1,73)]]),
                 index=[1,2]).rename_axis('Cust_no').reset_index()
    # Flatten the two-level columns to names such as 'Measure1_Month1'
    df.columns = df.columns.map('_'.join).str.strip('_')
    df
    

    Output:

       Cust_no  Measure1_Month1  Measure1_Month2  ...  Measure2_Month70  Measure2_Month71  Measure2_Month72
    0        1              385              402  ...               153               380               129
    1        2              106               66  ...               363               361               173
    
    [2 rows x 145 columns]
    

    Format 1:

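    # Move the customer id into the index so that only measure columns remain
    df = df.set_index('Cust_no')
    # Split 'Measure1_Month1' into a two-level (Measure, Month) column index
    df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')), names=['Measure', 'Month'])
    # Stack both column levels into rows
    df_format1 = df.stack([0,1]).rename('Value').reset_index()
    # Keep only the digits of 'Month1'; the result is still a string
    df_format1['Month'] = df_format1['Month'].str.extract(r'(\d+)', expand=False)
    df_format1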
    

    Output:

        Cust_no   Measure Month  Value
    0          1  Measure1     1    385
    1          1  Measure1    10    143
    2          1  Measure1    11     77
    3          1  Measure1    12    234
    4          1  Measure1    13    245
    ..       ...       ...   ...    ...
    283        2  Measure2    70    363
    284        2  Measure2    71    361
    285        2  Measure2    72    173
    286        2  Measure2     8     65
    287        2  Measure2     9    461
    
    [288 rows x 4 columns]
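
    In the output above, Month is ordered as text (1, 10, 11, ..., 8, 9) because `str.extract` returns strings. If numeric month order matters, one option is to cast the column to integers and sort. The sketch below uses a three-row stand-in for `df_format1`, with values taken from the outputs above; the name `ordered` is hypothetical.

```python
import pandas as pd

# Stand-in for df_format1: Month holds digit strings, as str.extract returns them.
df_format1 = pd.DataFrame({
    "Cust_no": [1, 1, 1],
    "Measure": ["Measure1"] * 3,
    "Month": ["1", "10", "2"],
    "Value": [385, 143, 402],
})

# Cast Month to int so that sorting is numeric rather than lexicographic.
ordered = (
    df_format1.assign(Month=df_format1["Month"].astype(int))
              .sort_values(["Cust_no", "Measure", "Month"], ignore_index=True)
)

print(ordered["Month"].tolist())  # [1, 2, 10]
```

    Doing the cast before building the second format should give that table a numerically ordered Month as well.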
    

    Format 2:

    df_format2 = (df_format1.set_index(['Cust_no','Month','Measure'])['Value']
                            .unstack().reset_index().rename_axis(None, axis=1))
    df_format2
    

    Output:

         Cust_no Month  Measure1  Measure2
    0          1     1       385        90
    1          1    10       143       379
    2          1    11        77       479
    3          1    12       234       458
    4          1    13       245       475
    ..       ...   ...       ...       ...
    139        2    70       108       363
    140        2    71       258       361
    141        2    72       235       173
    142        2     8       453        65
    143        2     9       276       461
    
    [144 rows x 4 columns]
    
