【Question title】: Pandas - unstack with duplicates
【Posted】: 2021-12-05 11:10:01
【Question】:

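I'm trying to unstack a DataFrame that contains duplicates and have jumped through various hoops, so far without success. Any help would be appreciated:

I have a DataFrame in "long" format: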

| id | variable  | value |
|----|-----------|-------|
| 1  | outcome_1 | NaN   |
| 2  | outcome_1 | 18:33 |
| 2  | outcome_1 | 20:39 |
| 2  | outcome_3 | 01:40 |
| 3  | outcome_2 | 03:59 |
| 3  | outcome_4 | 07:46 |
| 3  | outcome_3 | 10:53 |

I would like to convert it to "wide" format without aggregating, keeping every value, so that the result looks like this:

| id_nmbr | outcome_1_0 | outcome_1_1 | outcome_2_0 | outcome_3_0 | outcome_4_0 |
|---------|-------------|-------------|-------------|-------------|-------------|
| 1       | NaN         | NaN         | NaN         | NaN         | NaN         |
| 2       | 18:33       | 20:39       | NaN         | 01:40       | NaN         |
| 3       | NaN         | NaN         | 03:59       | 10:53       | 07:46       |

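So basically: keep every value, and create a new column for each duplicate.

I have tried pivot, unstack, and pivot_table, but I think I need to chain several functions together to get there. Any ideas?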
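To make the obstacle concrete, here is a minimal sketch of why a plain `pivot` does not work on this data. The frame is rebuilt from the table above; storing the times as plain strings is an assumption. Because the pair (2, outcome_1) occurs twice, pandas cannot place both values in a single cell and raises a `ValueError` about duplicate entries.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above (times kept as strings: an assumption).
df = pd.DataFrame({
    "id": [1, 2, 2, 2, 3, 3, 3],
    "variable": ["outcome_1", "outcome_1", "outcome_1", "outcome_3",
                 "outcome_2", "outcome_4", "outcome_3"],
    "value": [np.nan, "18:33", "20:39", "01:40", "03:59", "07:46", "10:53"],
})

# (id=2, variable='outcome_1') appears twice, so the reshape is ambiguous.
try:
    df.pivot(index="id", columns="variable", values="value")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```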

【Comments】:

    Tags: python pandas dataframe duplicates pivot


    【Solution 1】:

    Use GroupBy.cumcount as a counter, reshape with Series.unstack, sort the resulting MultiIndex columns, and then flatten them with map:

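    # Number repeated (id, variable) pairs: 0 for the first occurrence, 1 for the second, ...
    g = df.groupby(['id', 'variable']).cumcount()

    # Move 'variable' (level 1) and the counter (level 2) into the columns, then sort them
    df = df.set_index(['id', 'variable', g])['value'].unstack([1, 2]).sort_index(axis=1)
    # Flatten the two-level column labels, e.g. ('outcome_1', 0) -> 'outcome_1_0'
    df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
    df = df.reset_index()
    print(df)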
       id outcome_1_0 outcome_1_1 outcome_2_0 outcome_3_0 outcome_4_0
    0   1         NaN         NaN         NaN         NaN         NaN
    1   2       18:33       20:39         NaN       01:40         NaN
    2   3         NaN         NaN       03:59       10:53       07:46
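The steps above can also be run end to end on the sample data. The following is a minimal, self-contained sketch rather than the answer's exact code: it rebuilds the frame from the question (times as strings, which is an assumption), stores the counter in a column named `counter` (a name chosen here for illustration), and unstacks by level name instead of by position.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (times kept as strings: an assumption).
df = pd.DataFrame({
    "id": [1, 2, 2, 2, 3, 3, 3],
    "variable": ["outcome_1", "outcome_1", "outcome_1", "outcome_3",
                 "outcome_2", "outcome_4", "outcome_3"],
    "value": [np.nan, "18:33", "20:39", "01:40", "03:59", "07:46", "10:53"],
})

# 'counter' is an illustrative column name: 0 for the first occurrence
# of an (id, variable) pair, 1 for the second, and so on.
with_counter = df.assign(counter=df.groupby(["id", "variable"]).cumcount())

wide = (
    with_counter.set_index(["id", "variable", "counter"])["value"]
                .unstack(["variable", "counter"])  # both levels become columns
                .sort_index(axis=1)
)
# Flatten ('outcome_1', 0) into 'outcome_1_0'.
wide.columns = [f"{variable}_{n}" for variable, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Because the counter makes every (id, variable, counter) triple unique, the unstack no longer hits the duplicate-entry problem, and no aggregation is needed.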
    

    【Comments】:

      【Solution 2】:

      The pivot_wider function from pyjanitor can help abstract the reshaping process:

      # pip install pyjanitor
      import pandas as pd
      import janitor  # registers pivot_wider as a DataFrame method

      # the cumcount numbers each repeated (id, variable) pair, which gives a unique index
      (df.assign(counter=df.groupby(['id', 'variable']).cumcount())
         .pivot_wider(index='id',
                      names_from=['variable', 'counter'],
                      values_from='value')
      )
         id outcome_1_0 outcome_1_1 outcome_3_0 outcome_2_0 outcome_4_0
      0   1         NaN         NaN         NaN         NaN         NaN
      1   2       18:33       20:39       01:40         NaN         NaN
      2   3         NaN         NaN       10:53       03:59       07:46
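For readers who do not have pyjanitor installed, roughly the same reshaping can be sketched with `DataFrame.pivot` alone. This is an alternative route, not what `pivot_wider` does internally. It assumes pandas 1.1 or later (where `columns` accepts a list), rebuilds the sample frame with times as strings, and uses `counter` as an illustrative column name. Column order may differ from the pyjanitor output shown above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (times kept as strings: an assumption).
df = pd.DataFrame({
    "id": [1, 2, 2, 2, 3, 3, 3],
    "variable": ["outcome_1", "outcome_1", "outcome_1", "outcome_3",
                 "outcome_2", "outcome_4", "outcome_3"],
    "value": [np.nan, "18:33", "20:39", "01:40", "03:59", "07:46", "10:53"],
})

wide = (
    # The counter makes each (id, variable, counter) combination unique,
    # so pivot no longer sees duplicate entries.
    df.assign(counter=df.groupby(["id", "variable"]).cumcount())
      .pivot(index="id", columns=["variable", "counter"], values="value")
)
# Join the two column levels with '_', e.g. ('outcome_1', 1) -> 'outcome_1_1'.
wide.columns = ["_".join(map(str, col)) for col in wide.columns]
wide = wide.reset_index()
print(wide)
```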
      
      
      

      【Comments】:
