Groupby in python pandas: Fast Way (answers)

【Question title】: Groupby in python pandas: Fast Way
【Posted】: 2016-11-03 18:22:45
【Question】:

I want to speed up a groupby in python pandas. I have this code:

df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

The goal is to count how many contracts a client has in a given month and to add that count as a new column (Nbcontrats).

Client: client code
Month: month of data extraction
Contrat: contract number

I want to reduce the running time. The timing below uses only a subset of my real data:

%timeit df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)
1 loops, best of 3: 391 ms per loop

df.shape
Out[309]: (7464, 61)

How can I improve the execution time?

【Comments】:

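I'd suggest adding the numpy tag. I remember @Divakar coming up with solutions faster than groupby using np.einsum.
@ayhan, do you mean this solution?
@MaxU Not being a pandas expert, I'd like to ask you pandas gurus. I can imagine what groupby does here with df.groupby(['Client', 'Month']). But what does selecting/indexing the 'Contrat' column with ['Contrat'] achieve? Or is it not indexing at all? From my tests, the indexing doesn't affect the final result. Any idea what is going on there?
@Divakar In general, df.groupby(['Col1', 'Col2'])['Col3'] groups the dataframe by Col1 and Col2 and selects Col3 (no aggregation yet, just pairs of key (Col1, Col2) and values (Col3)). If you then aggregate, say by taking the mean, you get the mean of Col3 for each group. If you don't select a column and only write df.groupby(['Col1', 'Col2']), the function is applied to all columns (where possible). In this example the OP uses the function len. Since the length of a group doesn't change from column to column, Contrat is just a helper column.
@Divakar, I was about to answer your question, but @ayhan was faster... :) As ayhan said, if we use the column selection ['Contrat'], transform(len) is applied to that column only; otherwise it is applied to all columns (those available after the groupby operation).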
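To make the point of the last two comments concrete, here is a minimal sketch on made-up toy data (the values and the extra column Other are hypothetical, not from the question). It suggests that the selected column only decides which values are handed to len, while the resulting group sizes stay the same.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "Client":  [1, 1, 1, 2, 2],
    "Month":   [1, 1, 2, 1, 1],
    "Contrat": [10, 11, 12, 13, 14],
    "Other":   [0.5, 0.6, 0.7, 0.8, 0.9],
})

grouped = df.groupby(["Client", "Month"])

# Selecting a column makes transform return a single Series for that column.
by_contrat = grouped["Contrat"].transform(len)
by_other = grouped["Other"].transform(len)

# Normalise to plain ints so the comparison does not depend on the dtype.
counts_contrat = [int(x) for x in by_contrat]
counts_other = [int(x) for x in by_other]

print(counts_contrat)                   # [2, 2, 1, 2, 2]
print(counts_contrat == counts_other)   # True
```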

Tags: python pandas numpy pandas-groupby

【Solution 1】:

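Here's one approach:

Slice the relevant columns (['Client', 'Month']) of the input dataframe into a NumPy array. This is mainly a performance-minded step, since we will use NumPy functions later on, and those are optimized for NumPy arrays.
Convert the two columns of data from ['Client', 'Month'] into a single 1D array of equivalent linear indices, treating the elements of the two columns as pairs. Think of the 'Client' values as row indices and the 'Month' values as column indices, so this is like going from 2D to 1D. The catch is deciding the shape of the 2D grid for that mapping. To cover all pairs, a safe choice is a grid whose size along each dimension is one more than the maximum of the corresponding column, because indexing in Python is 0-based. That gives us the linear indices. Note that this requires both columns to hold non-negative integers.
Next, tag each linear index according to its uniqueness. I think this corresponds to the keys obtained with groupby. We also need the count of each group/unique key along the whole length of that 1D array. Finally, indexing the counts with those tags maps the corresponding count to each element.

That's the whole idea! Here's the implementation -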

import numpy as np

# Save relevant columns as a NumPy array for performing NumPy operations afterwards
arr_slice = df[['Client', 'Month']].values

# Get linear indices equivalent of those columns
lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)

# Get unique IDs corresponding to each linear index (i.e. group) and grouped counts
unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)

# Index counts with the unique tags to map across all elements with the counts
df["Nbcontrats"] = counts[unqtags]
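As a worked illustration of these steps, the sketch below runs the same three NumPy calls on five made-up (Client, Month) pairs and prints each intermediate array. The toy values are an assumption for demonstration; only the calls themselves come from the answer.

```python
import numpy as np

# Hypothetical toy keys: five rows of (Client, Month) pairs.
arr_slice = np.array([[1, 1],
                      [1, 1],
                      [1, 2],
                      [2, 1],
                      [2, 1]])

# Grid shape: one more than the maximum of each column, here (3, 3).
shape = arr_slice.max(0) + 1

# A pair (r, c) maps to the linear index r * shape[1] + c.
lidx = np.ravel_multi_index(arr_slice.T, shape)
print(lidx)      # [4 4 5 7 7]

# Unique linear indices, a tag per row pointing into unq, and group sizes.
unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
print(unq)       # [4 5 7]
print(unqtags)   # [0 0 1 2 2]
print(counts)    # [2 1 2]

# Indexing the counts with the tags broadcasts each group size to its rows.
nb = counts[unqtags]
print(nb)        # [2 2 1 2 2]
```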

Runtime test

1) Define the functions:

def original_app(df):
    df["Nbcontrats"] = df.groupby(['Client', 'Month'])['Contrat'].transform(len)

def vectorized_app(df):
    arr_slice = df[['Client', 'Month']].values
    lidx = np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1)
    unq,unqtags,counts = np.unique(lidx,return_inverse=True,return_counts=True)
    df["Nbcontrats"] = counts[unqtags]

2) Verify the results:

In [143]: # Let's create a dataframe with 100 unique IDs and of length 10000
     ...: arr = np.random.randint(0,100,(10000,3))
     ...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
     ...: df1 = df.copy()
     ...: 
     ...: # Run the function on the inputs
     ...: original_app(df)
     ...: vectorized_app(df1)
     ...: 

In [144]: np.allclose(df["Nbcontrats"],df1["Nbcontrats"])
Out[144]: True

3) Finally, time them:

In [145]: # Let's create a dataframe with 100 unique IDs and of length 10000
     ...: arr = np.random.randint(0,100,(10000,3))
     ...: df = pd.DataFrame(arr,columns=['Client','Month','Contrat'])
     ...: df1 = df.copy()
     ...: 

In [146]: %timeit original_app(df)
1 loops, best of 3: 645 ms per loop

In [147]: %timeit vectorized_app(df1)
100 loops, best of 3: 2.62 ms per loop
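For readers without IPython, the comparison can be approximated as a plain script with the standard timeit module. This is only a sketch: the fixed seed and the repeat counts are my assumptions, and the absolute numbers will differ by machine and by pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd

def original_app(df):
    df["Nbcontrats"] = df.groupby(["Client", "Month"])["Contrat"].transform(len)

def vectorized_app(df):
    arr_slice = df[["Client", "Month"]].values
    lidx = np.ravel_multi_index(arr_slice.T, arr_slice.max(0) + 1)
    unq, unqtags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    df["Nbcontrats"] = counts[unqtags]

# Same shape of test data as above; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
arr = rng.integers(0, 100, size=(10000, 3))
df = pd.DataFrame(arr, columns=["Client", "Month", "Contrat"])
df1 = df.copy()

# Best of 3 single runs for each approach, similar in spirit to %timeit.
t_orig = min(timeit.repeat(lambda: original_app(df), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: vectorized_app(df1), number=1, repeat=3))
print(f"groupby/transform: {t_orig * 1e3:.2f} ms, NumPy: {t_vec * 1e3:.2f} ms")
```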

【Comments】:

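Awesome - 246 times faster! Could you add a short explanation of the numpy solution?
@MaxU Just added some explanation. I did my best there, but I'm usually bad at it :)
That's perfect - thank you for teaching me numpy! Unfortunately I can't upvote more than once ;)
np.ravel_multi_index(arr_slice.T,arr_slice.max(0)+1) returns TypeError: must be str, not int
Is there a way to get the grouped dataframe back? As in df_grouped = df.groupby ...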
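The TypeError reported in the comments most likely arises when the key columns hold strings, because arr_slice.max(0)+1 then tries to add an int to a str. One possible workaround, not part of the original answer, is to map each key column to integer codes with pd.factorize first. The helper name vectorized_counts is hypothetical, and the sketch assumes a non-empty frame with no missing keys (factorize encodes missing values as -1, which ravel_multi_index rejects).

```python
import numpy as np
import pandas as pd

def vectorized_counts(df, keys):
    """Hypothetical helper: per-row group size for arbitrary hashable keys."""
    # pd.factorize maps each distinct value to an integer code 0..n-1.
    codes = tuple(pd.factorize(df[k])[0] for k in keys)
    dims = tuple(int(c.max()) + 1 for c in codes)
    # From here on, the steps are the same as in the answer above.
    lidx = np.ravel_multi_index(codes, dims)
    _, tags, counts = np.unique(lidx, return_inverse=True, return_counts=True)
    return counts[tags]

# Made-up data with string keys, for illustration only.
df = pd.DataFrame({
    "Client": ["a", "a", "b", "b", "b"],
    "Month": ["2016-01", "2016-01", "2016-01", "2016-02", "2016-02"],
    "Contrat": [1, 2, 3, 4, 5],
})
df["Nbcontrats"] = vectorized_counts(df, ["Client", "Month"])
print(df["Nbcontrats"].tolist())  # [2, 2, 1, 2, 2]
```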

【Solution 2】:

Use the DataFrameGroupBy.size method:

df.set_index(['Client', 'Month'], inplace=True)
df['Nbcontrats'] = df.groupby(level=(0,1)).size()
df.reset_index(inplace=True)

Most of the work is in assigning the result back to a column of the source DataFrame.
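A self-contained sketch of this idea on made-up data follows. It is a non-mutating variant (no inplace=True), which is my own choice rather than the answerer's code. The cross-check uses transform("size"), a built-in pandas alternative that I am adding for comparison and that is not mentioned in the answer.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "Client":  [1, 1, 1, 2, 2],
    "Month":   [1, 1, 2, 1, 1],
    "Contrat": [10, 11, 12, 13, 14],
})

# Move the keys into the index, then count rows per (Client, Month) level pair.
indexed = df.set_index(["Client", "Month"])
sizes = indexed.groupby(level=[0, 1]).size()   # one entry per distinct key pair

# Assignment aligns on the MultiIndex, so each row receives its group's size.
indexed["Nbcontrats"] = sizes
result = indexed.reset_index()
print(result["Nbcontrats"].tolist())  # [2, 2, 1, 2, 2]

# Cross-check against the built-in transform("size").
expected = df.groupby(["Client", "Month"])["Contrat"].transform("size")
print(result["Nbcontrats"].tolist() == expected.tolist())  # True
```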

【Comments】:

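I don't see how this improves performance; could you compare the running time against a plain groupby?
I don't have a specific benchmark, but in my case a classic groupby ran for a long time (possibly > 1 hour) and then finally crashed with an out-of-memory error. With the solution given here, using the index, it completed in about 6 seconds.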