Want to perform group by in python where grouped data will come into rows

【Question title】: Want to perform group by in python where grouped data will come into rows
【Posted】: 2019-05-02 14:43:39
【Question description】:

I have data like this:

ID Value
1  ABC
1  BCD
1  AKB
2  CAB
2  AIK
3  KIB

I want to perform an operation that gives me something like this:

ID Value1 Value2 Value3
1  ABC    BCD    AKB 
2  CAB    AIK
3  KIB

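I have used SAS, where we used retain and by to get this result. In Python, I have not found a way to do it. I know I have to use group by followed by something else, but I don't know what. In PySpark, group by with collect_list returns the values in array form, but I want to do this with a Pandas DataFrame.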

【Question comments】:

Tags: python pandas python-2.7 pandas-groupby

【Solution 1】:

Use set_index with cumcount to build a MultiIndex, then reshape with unstack:

df1 = (df.set_index(['ID',df.groupby('ID').cumcount()])['Value']
        .unstack()
        .rename(columns=lambda x: 'Value{}'.format(x + 1))
        .reset_index())
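
To make the intermediate step visible, here is a minimal sketch on the sample data from the question. The names `pos` and `wide` are illustrative and do not appear in the original answer. cumcount numbers the rows within each ID starting at 0, and those numbers become the column labels once unstack moves them out of the index.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Value': ['ABC', 'BCD', 'AKB', 'CAB', 'AIK', 'KIB']})

# Position of each row within its ID group.
pos = df.groupby('ID').cumcount()
print(pos.tolist())

# (ID, position) forms a two-level index; unstack turns the
# position level into columns, leaving one row per ID.
wide = df.set_index(['ID', pos])['Value'].unstack()
print(wide.columns.tolist())
```

The rename step in the answer then only has to map the integer labels 0, 1, 2 to Value1, Value2, Value3.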

For Python 3.6+, f-strings can be used to rename the columns:

df1 = (df.set_index(['ID',df.groupby('ID').cumcount()])['Value']
        .unstack()
        .rename(columns=lambda x: f'Value{x+1}')
        .reset_index())

Another idea is to create lists per group and build a new DataFrame with the constructor:

s = df.groupby('ID')['Value'].apply(list)
df1 = (pd.DataFrame(s.values.tolist(), index=s.index)
       .rename(columns=lambda x: 'Value{}'.format(x + 1))
       .reset_index())

print (df1)
   ID Value1 Value2 Value3
0   1    ABC    BCD    AKB
1   2    CAB    AIK    NaN
2   3    KIB    NaN    NaN
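
The expected output in the question shows blank cells rather than NaN. If that display is wanted, one option is to fill the missing entries with empty strings after reshaping. This is only a sketch on the question's sample data; the `fillna('')` step is an addition here and is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Value': ['ABC', 'BCD', 'AKB', 'CAB', 'AIK', 'KIB']})

df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])['Value']
         .unstack()
         .rename(columns=lambda x: 'Value{}'.format(x + 1))
         .fillna('')   # show missing entries as empty strings
         .reset_index())
print(df1)
```

Keeping NaN is usually the better choice for further computation; the empty strings are mainly useful for display or export.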

Performance: this depends on the number of rows and on the number of unique values in the ID column:

import numpy as np
import pandas as pd

np.random.seed(45)

a = np.sort(np.random.randint(1000, size=10000))
b = np.random.choice(list('abcde'), size=10000)

df = pd.DataFrame({'ID':a, 'Value':b})
#print (df)

In [26]: %%timeit
    ...: (df.set_index(['ID',df.groupby('ID').cumcount()])['Value']
    ...:         .unstack()
    ...:         .rename(columns=lambda x: f'Value{x+1}')
    ...:         .reset_index())
    ...: 
8.96 ms ± 628 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [27]: %%timeit
    ...: s = df.groupby('ID')['Value'].apply(list)
    ...: (pd.DataFrame(s.values.tolist(), index=s.index)
    ...:        .rename(columns=lambda x: 'Value{}'.format(x + 1))
    ...:        .reset_index())
    ...: 
    ...: 
105 ms ± 7.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#jpp solution
In [28]: %%timeit
    ...: def group_gen(df):
    ...:     for key, x in df.groupby('ID'):
    ...:         x = x.set_index('ID').T
    ...:         x.index = pd.Index([key], name='ID')
    ...:         x.columns = [f'Value{i}' for i in range(1, x.shape[1]+1)]
    ...:         yield x
    ...: 
    ...: pd.concat(group_gen(df)).reset_index()
    ...: 

3.23 s ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
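
The timings above rely on IPython's %%timeit magic. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The function names `via_unstack` and `via_lists` are hypothetical labels for the first two approaches, and the absolute numbers will differ by machine and pandas version. The sketch also checks that both approaches produce the same table once missing cells are normalised.

```python
import timeit

import numpy as np
import pandas as pd

# Same benchmark data as in the answer.
np.random.seed(45)
a = np.sort(np.random.randint(1000, size=10000))
b = np.random.choice(list('abcde'), size=10000)
df = pd.DataFrame({'ID': a, 'Value': b})


def via_unstack(df):
    return (df.set_index(['ID', df.groupby('ID').cumcount()])['Value']
              .unstack()
              .rename(columns=lambda x: 'Value{}'.format(x + 1))
              .reset_index())


def via_lists(df):
    s = df.groupby('ID')['Value'].apply(list)
    return (pd.DataFrame(s.values.tolist(), index=s.index)
              .rename(columns=lambda x: 'Value{}'.format(x + 1))
              .reset_index())


# Normalise missing cells to '' so the comparison does not depend
# on how each approach represents a missing value.
same = (via_unstack(df).fillna('').values.tolist()
        == via_lists(df).fillna('').values.tolist())
print('same result:', same)

# Average wall-clock time per call over 10 runs.
for fn in (via_unstack, via_lists):
    seconds = timeit.timeit(lambda: fn(df), number=10) / 10
    print('{}: {:.2f} ms per call'.format(fn.__name__, seconds * 1000))
```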

【Comments】:

@Arun - if my answer was helpful, please don't forget to accept it. Thanks.
Accepted. :) I didn't know about this feature before.

【Solution 2】:

`groupby` + `concat`

One approach is to iterate over the groupby object and concatenate the resulting DataFrames:

def group_gen(df):
    for key, x in df.groupby('ID'):
        x = x.set_index('ID').T
        x.index = pd.Index([key], name='ID')
        x.columns = [f'Value{i}' for i in range(1, x.shape[1]+1)]
        yield x

res = pd.concat(group_gen(df)).reset_index()

print(res)

   ID Value1 Value2 Value3
0   1    ABC    BCD    AKB
1   2    CAB    AIK    NaN
2   3    KIB    NaN    NaN
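
To illustrate what each iteration of the generator yields, here is a single pass written out by hand for ID 2, using the question's sample data. The variable `key` is fixed manually for illustration. Each group becomes a one-row frame, and pd.concat then aligns these frames on their column names, which is where the NaN entries for shorter groups come from.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Value': ['ABC', 'BCD', 'AKB', 'CAB', 'AIK', 'KIB']})

# One iteration of the loop, written out for a single group.
key = 2
x = df[df['ID'] == key].set_index('ID').T   # one row, one column per original row
x.index = pd.Index([key], name='ID')        # label the row with the group key
x.columns = ['Value{}'.format(i) for i in range(1, x.shape[1] + 1)]
print(x)
```

Building one small frame per group like this is likely the main reason the approach is the slowest in the timings reported in Solution 1.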

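【Comments】:

Works. Awesome.
@Arun, sure, np. Don't forget that you can upvote or downvote posts that were helpful or unhelpful!

【Solution 3】:

Assuming your data is in a DataFrame named df, you can do this: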

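from pyspark.sql import functions as F  # avoids shadowing Python's built-in max

# Collect every Value of an ID into one array column.
result = df.groupBy('ID').agg(F.collect_list('Value').alias('Values'))

# The longest array decides how many columns are needed.
how = result.select(F.max(F.size('Values')).alias('len')).collect()

# Pull each array position out into its own column.
for i in range(how[0]['len']):
    result = result.withColumn('Values' + str(i+1), F.col('Values')[i])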

The result will then look like this:

ID    Values1    Values2    Values3
1     ABC        BCD        AKB
2     CAB        AIK
3     KIB

【Comments】:

This is for PySpark. I want it for a Pandas DataFrame.
Oh, sorry. I hadn't seen that. I have improved the PySpark result, but I don't know how to do this in Pandas.
