Discretize a pandas column given the data distribution

【Question title】: Discretize a pandas column given the data distribution
【Posted】: 2017-11-10 03:40:44
【Question】:

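I have a pandas DataFrame with a column containing real-valued data ranging from 0 to 50. The values are not uniformly distributed.

I can get the distribution with: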

hist, bins = np.histogram(df["col"])

What I would like to do is replace each value with the number of the bin it falls into.

For this, the following works:

for i in range(len(df["speed_array"])):
    df["speed_array"].iloc[i] = np.searchsorted(bins, df["speed_array"].iloc[i])

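However, on a DataFrame with 4 million rows it is very slow (50 minutes). I am looking for a more efficient way to do this. Do you have a better idea?

【Comments】:

Tags: python performance pandas numpy

【Solution 1】:

Simply use np.searchsorted on the whole underlying array data -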

df["speed_array"] = np.searchsorted(bins, df["speed_array"].values)
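
As a minimal sketch of why this one-liner can stand in for the loop from the question, the snippet below runs both on a small synthetic column and checks that they agree. The synthetic data, the seed, and the helper names `bin_with_loop` and `bin_vectorized` are assumptions made for illustration; they are not part of the original post.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real column: 1,000 floats in [0, 50) (assumed data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed_array": rng.uniform(0, 50, size=1000)})

# Bin edges taken from the data distribution, as in the question.
# The default of 10 bins yields 11 edges.
hist, bins = np.histogram(df["speed_array"])

def bin_with_loop(values, edges):
    """Element-by-element lookup, mirroring the slow loop from the question."""
    return np.array([np.searchsorted(edges, v) for v in values])

def bin_vectorized(values, edges):
    """One call on the whole array, as suggested in this answer."""
    return np.searchsorted(edges, values)

values = df["speed_array"].values
loop_result = bin_with_loop(values, bins)
vec_result = bin_vectorized(values, bins)

# Both approaches produce identical bin numbers.
assert np.array_equal(loop_result, vec_result)

# Overwrite the column with the bin numbers in a single assignment.
df["speed_array"] = vec_result
```

Note that with the default `side='left'`, the minimum of the column lands in bin 0 and every other value lands in bins 1 through 10, so clip or shift the result if zero-based bin indices are needed.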

Runtime test -

In [140]: # 4 million rows with 100 bins
     ...: df = pd.DataFrame(np.random.randint(0,1000,(4000000,1)))
     ...: df.columns = ['speed_array']
     ...: bins = np.sort(np.random.choice(1000, size=100, replace=False))
     ...: 

In [141]: def searchsorted_app(df):
     ...:     df["speed_array"] = np.searchsorted(bins, df["speed_array"].values)
     ...:     

In [142]: %timeit searchsorted_app(df)
10 loops, best of 3: 15.3 ms per loop
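
For readers without IPython, the same measurement can be approximated with the standard `timeit` module. This is only a sketch under assumptions: the seed and variable names are illustrative, and the numbers will vary by machine. It times the search on a fixed array of values, so repeated runs do not re-bin a column that was already overwritten by an earlier run.

```python
import timeit

import numpy as np
import pandas as pd

# 4 million rows with 100 bin edges, matching the setup above (seed is assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({"speed_array": rng.integers(0, 1000, size=4_000_000)})
bins = np.sort(rng.choice(1000, size=100, replace=False))

# Keep the raw values separate so every timed run sees the same input.
values = df["speed_array"].values

# Best of 3 single runs, in seconds.
elapsed = min(
    timeit.repeat(lambda: np.searchsorted(bins, values), number=1, repeat=3)
)
print(f"best of 3: {elapsed * 1e3:.1f} ms")

# Compute the bin numbers once and store them back in the DataFrame.
binned = np.searchsorted(bins, values)
df["speed_array"] = binned
```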

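【Discussion】:

As simple as I had dreamed! Thank you!
@Xema It would be nice to know the speedup over the original 50 min mark :)
Well, it is almost instantaneous!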