Select sample random groups after groupby in pandas? Answers

【Question title】: Select sample random groups after groupby in pandas?
【Posted】: 2021-07-15 13:02:29
【Question description】:

I have a very large DataFrame that looks like this example df:

df = 

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
apple   pink     1.99 
apple   pink     1.99 
apple   pink     2.99 
...     ....      ...
pear    green     .99 
pear    green     .99 
pear    green    1.29

I group it by 2 columns like this:

g = df.groupby(['col1', 'col2'])

Now I want to select 3 random groups, so my expected output looks like this:

col1    col2     col3 
apple   red      2.99 
apple   red      2.99 
apple   red      1.99 
pear    green     .99 
pear    green     .99 
pear    green    1.29
lemon   yellow    .99 
lemon   yellow    .99 
lemon   yellow   1.99

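(Let us assume the three groups above are random groups from df.) How can I do this? I tried this, but it did not help me.

【Question comments】:

Do you need only 3 groups, or only 3 items per group? Or both?

Tags: python pandas

【Solution 1】:

You can use shuffle and ngroup: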

import numpy as np

g = df.groupby(['col1', 'col2'])

# One integer label per group, in random order
a = np.arange(g.ngroups)
np.random.shuffle(a)

df[g.ngroup().isin(a[:2])]  # change 2 to the number of groups you need

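【Comments】:

When I use this I now get the error "TypeError: unsupported operand type(s) for -: 'dict' and 'int'". Do you know why?
@Hana Change a = np.arange(g.groups) to a = np.arange(g.ngroups) here.
Group sampling can be done more concisely with numpy.random.choice (no need to shuffle the full list), i.e. df[g.ngroup().isin(choice(g.ngroups, 2, replace=False)].
@jstol, I like your solution; it is just missing a closing parenthesis: df[g.ngroup().isin(choice(g.ngroups, 2, replace=False))]. And for those who (like me) did not spot it at first: with import numpy as np, choice has to be called as np.random.choice.

【Solution 2】:

Shuffle your DataFrame with sample, then do an unsorted groupby: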
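
The choice-based variant from the comments above can be sketched as a self-contained script. The sample frame, the seed, and the names rng, picked, and result are assumptions made for illustration, and np.random.default_rng is used here instead of the bare choice call shown in the comment.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data, modelled on the question).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

g = df.groupby(["col1", "col2"])

# Seeded generator so the draw is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Draw 3 distinct group numbers out of 0 .. g.ngroups - 1.
picked = rng.choice(g.ngroups, size=3, replace=False)

# g.ngroup() labels every row with its group number;
# keep only the rows whose group number was picked.
result = df[g.ngroup().isin(picked)]
print(result)
```

Because the filter is a boolean mask on df, the selected rows keep their original index and order.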

df = df.sample(frac=1)
df2 = pd.concat(
    [g for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)

If you also need only the first 3 rows of each group, use head(3) on each group:

df2 = pd.concat(
    [g.head(3) for _, g in df.groupby(['col1', 'col2'], sort=False, as_index=False)][:3],
    ignore_index=True 
)
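
A runnable sketch of the shuffle-then-group idea, for readers who want to try it on toy data. The frame, the random_state value, and the names shuffled and groups are assumptions for illustration only.

```python
import pandas as pd

# Small illustrative frame (assumed data, modelled on the question).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

# Shuffle all rows; random_state is an arbitrary seed for reproducibility.
shuffled = df.sample(frac=1, random_state=0)

# With sort=False, groups come out in order of first appearance
# in the shuffled frame rather than in sorted key order.
groups = [grp for _, grp in shuffled.groupby(["col1", "col2"], sort=False)]

# Keep the first 3 groups and stitch them back together.
df2 = pd.concat(groups[:3], ignore_index=True)
print(df2)
```

One caveat of this approach: a group with more rows is more likely to appear early in the shuffled frame, so the draw is uniform over groups only when all groups have the same size.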

【Comments】:

【Solution 3】:

If you only need this kind of sampling on a single column, this is an alternative:

df.loc[df['col1'].isin(pd.Series(df['col1'].unique()).sample(2))]

In more detail:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                      'col2': np.random.randint(5, size=9),
                      'col3': np.random.randint(5, size=9)
                     })
>>> df
  col1  col2  col3
0    a     4     3
1    a     3     0
2    a     4     0
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1
>>> sample = pd.Series(df['col1'].unique()).sample(2)
>>> sample
1    b
2    c
dtype: object
>>> df.loc[df['col1'].isin(sample)]
  col1  col2  col3
3    b     4     4
4    b     4     1
5    b     1     3
6    c     4     4
7    c     3     2
8    c     3     1

【Comments】:

【Solution 4】:

Here is one way:

from io import StringIO
import pandas as pd
import numpy as np

np.random.seed(100)

data = """
col1    col2     col3
apple   red      2.99
apple   red      2.99
apple   red      1.99
apple   pink     1.99
apple   pink     1.99
apple   pink     2.99
pear    green     .99
pear    green     .99
pear    green    1.29
"""
# Number of groups
K = 2

df = pd.read_table(StringIO(data), sep=' ', skip_blank_lines=True, skipinitialspace=True)
# Use columns as indices
df2 = df.set_index(['col1', 'col2'])
# Choose random sample of indices
idx = np.random.choice(df2.index.unique(), K, replace=False)
# Select
selection = df2.loc[idx].reset_index(drop=False)
print(selection)

Output:

    col1   col2  col3
0  apple   pink  1.99
1  apple   pink  1.99
2  apple   pink  2.99
3   pear  green  0.99
4   pear  green  0.99
5   pear  green  1.29

【Comments】:

【Solution 5】:

I turned @Arvid Baarnhielm's answer into a simple function:

def sampleCluster(df:pd.DataFrame, columnCluster:str, fraction) -> pd.DataFrame:
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]
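
A brief usage sketch for the function above. The function body is repeated unchanged so the snippet runs on its own; the toy frame and the fraction value 0.5 are assumptions for illustration.

```python
import pandas as pd


def sampleCluster(df: pd.DataFrame, columnCluster: str, fraction) -> pd.DataFrame:
    # Sample a fraction of the distinct values, then keep all rows carrying them.
    return df.loc[df[columnCluster].isin(pd.Series(df[columnCluster].unique()).sample(frac=fraction))]


# Toy frame with 4 distinct values in col1, 2 rows each (assumed data).
toy = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "col3": range(8),
})

# fraction=0.5 keeps half of the 4 distinct values, i.e. 2 whole groups.
subset = sampleCluster(toy, "col1", 0.5)
print(subset)
```

Note that the function samples on a single column; for the two-column grouping in the question, the key columns would have to be handled together, as in the other solutions.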

【Comments】:

【Solution 6】:

A simple solution in the spirit of this answer:

n_groups = 2    
df.merge(df[['col1','col2']].drop_duplicates().sample(n=n_groups))
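
The merge approach can be tried end to end with the sketch below. The sample frame, the random_state value, and the names keys and selection are assumptions for illustration.

```python
import pandas as pd

# Small illustrative frame (assumed data, modelled on the question).
df = pd.DataFrame({
    "col1": ["apple"] * 6 + ["pear"] * 3 + ["lemon"] * 3,
    "col2": ["red"] * 3 + ["pink"] * 3 + ["green"] * 3 + ["yellow"] * 3,
    "col3": [2.99, 2.99, 1.99, 1.99, 1.99, 2.99,
             0.99, 0.99, 1.29, 0.99, 0.99, 1.99],
})

n_groups = 2

# One row per distinct (col1, col2) pair, then a random sample of those pairs.
# random_state is an arbitrary seed for reproducibility.
keys = df[["col1", "col2"]].drop_duplicates().sample(n=n_groups, random_state=0)

# With no `on` argument, merge performs an inner join on the shared
# columns (col1 and col2), keeping only rows from the sampled groups.
selection = df.merge(keys)
print(selection)
```

Unlike a boolean-mask filter, merge returns a frame with a fresh integer index, so the original row labels of df are not preserved.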

【Comments】: