How to convert pandas groups into different columns? Answers

【Question title】: How to convert pandas groups into different columns?
【Posted】: 2020-10-30 01:25:48
【Question】:

I have a data frame like the one below.

unit time s1 s2 ....
1    1    2  3
1    2    4  5
1    3    9  7
2    1    5  2
2    2    3  1

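I want to group the data by unit, keep the same number of most recent observations (by time) for every unit, namely the size of the smallest group (unit 2 has 2 observations), and put each group's s1 values in a separate column. So, like this: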

unit_1 unit_2 
   4      5 
   9      3

Thanks.

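【Comments】:

What does keeping the 'last minimum similar number of last observations' mean? Why was the value 2 for unit 1, s1 dropped? Do you want at most 2 values? Or do you want the last 2 values based on time?
I want the last 2 values based on time. Sorry for the misunderstanding. Edited.
Also, can you show in the example output what happens to s2? Do you want it as separate rows? Or columns?
I want to discard it. I want to compute the correlation between the same column across different groups. So I would run a similar process for s2 in a loop.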
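
To make the goal in the comments concrete, here is a minimal sketch of the "reshape, then correlate, in a loop" idea. The s2 values are taken from the table in the question; the helper name last_n_wide and the temporary column pos are made up for illustration and are not part of any answer.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],
    "time": [1, 2, 3, 1, 2],
    "s1":   [2, 4, 9, 5, 3],
    "s2":   [3, 5, 7, 2, 1],
})

def last_n_wide(frame, col, n):
    """Hypothetical helper: last n values of `col` per unit, one column per unit."""
    tail = frame.sort_values(["unit", "time"]).groupby("unit").tail(n).copy()
    # Position of each row inside its unit (0, 1, ...), used to line the units up.
    tail["pos"] = tail.groupby("unit").cumcount()
    wide = tail.pivot(index="pos", columns="unit", values=col)
    wide.columns = [f"unit_{u}" for u in wide.columns]
    return wide.reset_index(drop=True)

# One correlation matrix per signal column.
correlations = {col: last_n_wide(df, col, 2).corr() for col in ["s1", "s2"]}
```

With only two observations per unit the correlations are trivially +1 or -1, so this is only meant to show the shape of the loop.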

Tags: python pandas group-by pandas-groupby

【Solution 1】:

This should solve your problem:

def f(col):
    #First step is to get the last 2 for each group using .tail(2)
    dff = df[['unit','time',col]].sort_values(by=['unit','time'],axis=0).groupby(['unit']).tail(2).copy()

    #Next we need the ordered rank of the time values instead of the actual values of time, 
    #since then we can keep the time values 2,3 as 1,2 and 1,2 as 1,2.
    #Rank only the time column; ranking the whole frame would return several columns.
    dff['time'] = dff.groupby('unit')['time'].rank()

    #Last we pivot over the time and units to get the columns that you need for correlation analysis
    dff = dff.pivot(index='time',columns='unit',values=col).reset_index(drop=True).add_prefix('unit_')
    return dff

f('s1')

unit    unit_1  unit_2
   0         4       5
   1         9       3

Use this function for a faster runtime:

def f(col):
    #filter last 2 (assumes df is already sorted by time within each unit)
    filt = df[['unit',col]].groupby('unit').tail(2).copy()
    filt['count'] = filt.groupby('unit').cumcount()  #add a counter column for pivot
    
    #Use counter column as index and unit as column for pivot, then add prefix
    filt = filt.pivot(index='count',columns='unit',values=col).reset_index(drop=True).add_prefix("unit_")
    return filt
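
Both functions above hard-code 2 as the number of rows to keep. If the number should instead be the size of the smallest group, as the question describes, it can be derived from the data. A small sketch, using the example data from the question; the names n_common and last_rows are just illustrative:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],
    "time": [1, 2, 3, 1, 2],
    "s1":   [2, 4, 9, 5, 3],
    "s2":   [3, 5, 7, 2, 1],
})

# Smallest number of observations found in any unit.
n_common = int(df.groupby("unit").size().min())

# Last n_common rows of every unit, ordered by time within the unit.
last_rows = df.sort_values(["unit", "time"]).groupby("unit").tail(n_common)
```

The value n_common could then replace the literal 2 in either version of f.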

【Comments】:

【Solution 2】:

So, I came up with this solution:

import pandas as pd
import numpy as np

df = pd.DataFrame({'units': [1,1,1,2,2], 's1':[2,4,9,5,3]})

new_df = df.groupby('units').tail(2) # Taking the last 2 values
new_df
Out:
     units s1
    1   1   4
    2   1   9
    3   2   5
    4   2   3


units_list = new_df.units.unique() # How many units do we have?
units_columns = [] # For col names
form_dict = {}
# We have 2 values for each unit, so the number of elements is 2n, 
# where n is a number of unit corresponding the new_df.
n = 0

for unit in units_list:
    units_columns.append('unit_{}'.format(unit))

while n != len(new_df['s1']):
    for col in units_columns:
        # .iloc makes it explicit that the slice is positional, not label-based
        form_dict.update({col:new_df['s1'].iloc[n:n+2].values})
        n += 2
        
final_df = pd.DataFrame(form_dict)
final_df

The result is:

 unit_1 unit_2
0   4   5
1   9   3
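
For comparison, the same dictionary can probably be built without the manual counter by iterating over the groups directly. This is only a sketch of an alternative, not the answerer's code; it reuses the answer's column names (units, s1) and variable names (form_dict, final_df):

```python
import pandas as pd

df = pd.DataFrame({'units': [1, 1, 1, 2, 2], 's1': [2, 4, 9, 5, 3]})

# One entry per unit: the last two s1 values of that unit as a plain array.
form_dict = {
    f"unit_{unit}": group["s1"].tail(2).to_numpy()
    for unit, group in df.groupby("units")
}

final_df = pd.DataFrame(form_dict)
```

Like the loop above, this assumes the rows of each unit are already in time order and that every unit has at least two rows.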

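【Comments】:

【Solution 3】:

Group by unit and pass a list of positions to nth. Drop the unwanted columns. Transpose the data frame and add the prefix unit_ to the names. Then transpose and ravel the values to combine the columns.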

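# Note: this relies on pandas < 2.0, where nth() returns the group key as the index;
# from pandas 2.0 on, nth() acts as a filter and keeps the original index instead.
g = df.groupby('unit', group_keys=False).nth([-1,-2]).drop(columns=['time','s2']).T.add_prefix('unit_')#.unstack('s1')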

final = pd.DataFrame({'unit_1': g['unit_1'].values.T.ravel(),
                    'unit_2': g['unit_2'].values.T.ravel()})
final

    unit_1  unit_2
0       4       5
1       9       3
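
The same steps (take the last two rows per unit, then spread the units into columns) can also be sketched without nth, which may be easier to carry across pandas versions. This variant uses tail, cumcount and unstack; the names last, pos and final are illustrative only:

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],
    "time": [1, 2, 3, 1, 2],
    "s1":   [2, 4, 9, 5, 3],
    "s2":   [3, 5, 7, 2, 1],
})

# Last two rows of each unit, ordered by time.
last = df.sort_values(["unit", "time"]).groupby("unit").tail(2)

# Position of each row within its unit (0 or 1).
pos = last.groupby("unit").cumcount()

# Index by (position, unit), then move the unit level into the columns.
final = last.set_index([pos, "unit"])["s1"].unstack("unit").add_prefix("unit_")
```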

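【Comments】:

Can you change ".apply(lambda x: x.iloc[-2:])" to ".nlargest(2)"?
That is incorrect, because it only returns the largest values in s1, not the last ones based on time. iloc would help. That should fix it.
Your second solution works, but the first one produces the same result as df.groupby(['unit'])['unit','s1'].tail(2)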
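
The point about nlargest can be seen with a small made-up example. The values below are hypothetical (they are not from the question) and are chosen so that the largest values are not the most recent ones:

```python
import pandas as pd

# Hypothetical data: s1 decreases over time, so "largest" and "latest" differ.
demo = pd.DataFrame({
    "unit": [1, 1, 1],
    "time": [1, 2, 3],
    "s1":   [9, 4, 2],
})

# Last two values by time.
last_two = demo.sort_values(["unit", "time"]).groupby("unit")["s1"].tail(2).tolist()

# Two largest values, regardless of time.
largest_two = demo.groupby("unit")["s1"].nlargest(2).tolist()
```

Here last_two holds the values at time 2 and 3, while largest_two holds the two biggest values, which come from time 1 and 2.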