Pandas dataframe: randomly shuffle some column values within groups

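【Question title】: Pandas dataframe randomly shuffle some column values in groups
【Posted】: 2020-03-20 10:54:47
【Question】:

I want to shuffle some column values, but only within a given group, and only for a certain percentage of the rows in that group. For example, for each group, I want to shuffle n% of the values in column b among each other.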

df = pd.DataFrame({'grouper_col':[1,1,2,3,3,3,3,4,4], 'b':[12, 13, 16, 21, 14, 11, 12, 13, 15]})

   grouper_col   b
0            1  12
1            1  13
2            2  16
3            3  21
4            3  14
5            3  11
6            3  12
7            4  13
8            4  15

Example output:

   grouper_col   b
0            1  13
1            1  12
2            2  16
3            3  21
4            3  11
5            3  14
6            3  12
7            4  15
8            4  13

I found

df.groupby("grouper_col")["b"].transform(np.random.permutation)

but this does not let me control the percentage of values that get shuffled.

Thanks for any hints!

【Comments】:

Tags: python pandas pandas-groupby permutation shuffle

【Solution 1】:

You can use numpy to write a function like this (it expects a numpy array as input):

import numpy as np

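def shuffle_portion(arr, percentage):
    # Pick round(n * percentage / 100) distinct positions, in random order
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    # Write the values from the randomly ordered positions into the same positions in sorted order
    arr[np.sort(shuf)] = arr[shuf]
    return arr  # note: arr is modified in place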

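np.random.choice picks a set of indices of the size you need. The corresponding values in the given array are then rearranged among those positions in shuffled order. The call below should shuffle 3 of the 9 values in column 'b' (round(9 * 33 / 100) = 3):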

# pass a copy so the array backing the column is not modified in place
df['b'] = shuffle_portion(df['b'].to_numpy().copy(), 33)
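
To see what this kind of function actually changes, the following sketch applies the same idea to a copy of the array and records which positions differ afterwards. The name `shuffle_portion_copy` and the `rng` parameter are illustrative assumptions, not part of the answer above; the seed is only there to make the run repeatable.

```python
import numpy as np

def shuffle_portion_copy(arr, percentage, rng=None):
    """Return a copy of arr with about `percentage` % of its positions permuted among themselves."""
    # Hypothetical helper: `rng` is an optional numpy Generator (assumption for reproducibility)
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(arr, copy=True)  # work on a copy, leave the input untouched
    n_pick = round(out.shape[0] * percentage / 100)
    # Distinct positions, returned in random order
    picked = rng.choice(out.shape[0], size=n_pick, replace=False)
    # Values from the randomly ordered positions go into the same positions in sorted order
    out[np.sort(picked)] = out[picked]
    return out

values = np.array([12, 13, 16, 21, 14, 11, 12, 13, 15])
shuffled = shuffle_portion_copy(values, 33, rng=np.random.default_rng(0))
changed = np.flatnonzero(values != shuffled)  # positions whose value differs
```

Note that the percentage is an upper bound on visible changes: a random permutation of the picked positions can leave some of them where they were, and equal values swapping places are indistinguishable.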

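Edit: To use it with apply, you need to convert the passed dataframe to an array inside the function (explained in the comments) and build the dataframe to return: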

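def shuffle_portion(_df, percentage=50):
    _df = _df.copy()  # avoid modifying the group that apply passes in
    arr = _df['b'].to_numpy().copy()  # positional NumPy array, not a label-indexed Series
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    arr[np.sort(shuf)] = arr[shuf]
    _df['b'] = arr
    return _df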

Now you can do

df.groupby("grouper_col", as_index=False).apply(shuffle_portion)

It would be better to pass the name of the column to shuffle into the function (def shuffle_portion(_df, col='b', percentage=50): arr = _df[col].values ...).
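
One way to make the column name a parameter is sketched below. This is not the answerer's code: it swaps `apply` on whole group frames for `transform` on the selected column (the call the question started from), and the names `shuffle_portion_values` and `shuffle_within_groups` are hypothetical.

```python
import numpy as np
import pandas as pd

def shuffle_portion_values(values, percentage=50):
    """Permute about `percentage` % of the values among themselves; returns a new array."""
    arr = np.array(values, copy=True)  # positional array, independent of the Series index
    n_pick = round(arr.shape[0] * percentage / 100)
    picked = np.random.choice(arr.shape[0], size=n_pick, replace=False)
    arr[np.sort(picked)] = arr[picked]
    return arr

def shuffle_within_groups(df, group_col, col, percentage=50):
    """Return a copy of df where `col` is partially shuffled inside each `group_col` group."""
    out = df.copy()
    # transform calls the function once per group and aligns the result with the original rows
    out[col] = df.groupby(group_col)[col].transform(
        shuffle_portion_values, percentage=percentage
    )
    return out

df = pd.DataFrame({'grouper_col': [1, 1, 2, 3, 3, 3, 3, 4, 4],
                   'b': [12, 13, 16, 21, 14, 11, 12, 13, 15]})
result = shuffle_within_groups(df, 'grouper_col', 'b', percentage=50)
```

With small groups the rounding matters: at 50 %, a group of two rows selects only one position and a single-row group selects none, so those groups come back unchanged.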

【Discussion】:

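Thanks, that works. However, when using it in a groupby statement, I somehow have to set a non-integer index, otherwise I get an index error. After setting a different index, I can use: df.groupby("grouper_col", as_index=False)["b"].apply(shuffle_portion, percentage=50), which is exactly what I need. Do you know why I get this index error? (KeyError: "None of [Int64Index([0, 2], dtype='int64')] are in the [index]")
You are trying to apply the function to a groupby object rather than to df. Refer to the docs first. They say your function should take a dataframe as input and return a dataframe. I didn't know this was the intended usage when I answered, and it wasn't clear to me what you were trying to do.
How would you rephrase "for example, for each group, I want to shuffle n% of the values in column b among each other" to make it clearer? I thought "for each group" would indicate that I need some kind of groupby.
Sorry, I missed that, and apologies for the late reply. The error occurs because apply passes the grouped dataframe as its input argument. To fix it, you have to convert the values of 'b' to an array inside the function. I'll update the answer.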
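
The KeyError from this thread can be reproduced in isolation. A minimal sketch, assuming a group whose integer index labels do not start at 0 (the values and labels below are taken from group 3 of the question's dataframe): indexing a Series with an integer array looks up labels, whereas indexing the underlying NumPy array is positional.

```python
import numpy as np
import pandas as pd

# One group as groupby would pass it: the labels 3..6 are kept from the original frame
group = pd.Series([21, 14, 11, 12], index=[3, 4, 5, 6])
positions = np.array([0, 2])  # positions such as np.random.choice would return

try:
    group[positions]  # label lookup: there are no labels 0 and 2 in this index
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Converting to a NumPy array first makes the same indices positional
by_position = group.to_numpy()[positions]
```

This would also explain why setting a non-integer index made the error go away for the commenter: with such an index, older pandas versions fall back to positional lookup for integer keys.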