【Question Title】: Binning a column with Python Pandas
【Posted】: 2017-12-29 15:13:56
【Question】:

I have a dataframe column with numeric values:

df['percentage'].head()
46.5
44.2
100.0
42.12

I would like to see the column as bin counts, using these bins:

bins = [0, 1, 5, 10, 25, 50, 100]

How can I get the result as bins with their value counts?

[0, 1] bin amount
[1, 5] etc
[5, 10] etc
...

【Question Discussion】:

    Tags: python pandas numpy dataframe binning


    【Solution 1】:

    Use the Numba module for a speed-up.

    On large datasets (more than 500k rows), binning the data with pd.cut can be slow.

    I wrote my own function in Numba with just-in-time compilation, which is roughly six times faster:

    import numpy as np
    from numba import njit
    
    @njit
    def cut(arr):
        bins = np.empty(arr.shape[0])
        for idx, x in enumerate(arr):
            if (x >= 0) & (x < 1):
                bins[idx] = 1
            elif (x >= 1) & (x < 5):
                bins[idx] = 2
            elif (x >= 5) & (x < 10):
                bins[idx] = 3
            elif (x >= 10) & (x < 25):
                bins[idx] = 4
            elif (x >= 25) & (x < 50):
                bins[idx] = 5
            elif (x >= 50) & (x < 100):
                bins[idx] = 6
            else:
                bins[idx] = 7
    
        return bins
    
    cut(df['percentage'].to_numpy())
    
    # array([5., 5., 7., 5.])
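To get the per-bin counts the question asks for, the integer labels returned by cut() above can be tallied with np.bincount. A minimal sketch, assuming the four-element label array shown above:

```python
import numpy as np

# Labels as produced by cut() above (bins are numbered 1..7)
a = np.array([5., 5., 7., 5.])

# bincount tallies occurrences of each integer label;
# minlength=8 keeps a slot for every bin even when it is empty (index 0 is unused)
counts = np.bincount(a.astype(np.int64), minlength=8)
print(counts[1:])  # counts for bins 1..7 -> [0 0 0 0 3 0 1]
```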
    

    Optional: you can also map the result to bin names as strings:

    a = cut(df['percentage'].to_numpy())
    
    conversion_dict = {1: 'bin1',
                       2: 'bin2',
                       3: 'bin3',
                       4: 'bin4',
                       5: 'bin5',
                       6: 'bin6',
                       7: 'bin7'}
    
    bins = list(map(conversion_dict.get, a))
    
    # ['bin5', 'bin5', 'bin7', 'bin5']
    

    Speed comparison:

    # Create a dataframe of 8 million rows for testing
    dfbig = pd.concat([df]*2000000, ignore_index=True)
    
    dfbig.shape
    
    # (8000000, 1)
    
    %%timeit
    cut(dfbig['percentage'].to_numpy())
    
    # 38 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %%timeit
    bins = [0, 1, 5, 10, 25, 50, 100]
    labels = [1,2,3,4,5,6]
    pd.cut(dfbig['percentage'], bins=bins, labels=labels)
    
    # 215 ms ± 9.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    【Discussion】:

      【Solution 2】:

      You can use pandas.cut:

      bins = [0, 1, 5, 10, 25, 50, 100]
      df['binned'] = pd.cut(df['percentage'], bins)
      print (df)
         percentage     binned
      0       46.50   (25, 50]
      1       44.20   (25, 50]
      2      100.00  (50, 100]
      3       42.12   (25, 50]
      

      bins = [0, 1, 5, 10, 25, 50, 100]
      labels = [1,2,3,4,5,6]
      df['binned'] = pd.cut(df['percentage'], bins=bins, labels=labels)
      print (df)
         percentage binned
      0       46.50      5
      1       44.20      5
      2      100.00      6
      3       42.12      5
      

      numpy.searchsorted:

      bins = [0, 1, 5, 10, 25, 50, 100]
      df['binned'] = np.searchsorted(bins, df['percentage'].values)
      print (df)
         percentage  binned
      0       46.50       5
      1       44.20       5
      2      100.00       6
      3       42.12       5
      

      ...and then take the counts with value_counts, or with groupby and a size aggregation:

      s = pd.cut(df['percentage'], bins=bins).value_counts()
      print (s)
      (25, 50]     3
      (50, 100]    1
      (10, 25]     0
      (5, 10]      0
      (1, 5]       0
      (0, 1]       0
      Name: percentage, dtype: int64
      

      s = df.groupby(pd.cut(df['percentage'], bins=bins)).size()
      print (s)
      percentage
      (0, 1]       0
      (1, 5]       0
      (5, 10]      0
      (10, 25]     0
      (25, 50]     3
      (50, 100]    1
      dtype: int64
      

      By default, cut returns a categorical Series.

      Methods like Series.value_counts() will therefore use all categories, even if some categories are not present in the data (see operations in categorical).
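Since value_counts() orders the result by frequency, a small sketch chaining sort_index() after it restores the natural bin order shown in the groupby output above (data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'percentage': [46.5, 44.2, 100.0, 42.12]})
bins = [0, 1, 5, 10, 25, 50, 100]

# value_counts() sorts by count; sort_index() reorders by the bin intervals
s = pd.cut(df['percentage'], bins=bins).value_counts().sort_index()
print(s)
```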

      【Discussion】:

      • Without bins = [0, 1, 5, 10, 25, 50, 100], can I just ask for 5 bins and have it cut into equal-count groups? For example, I have 110 records and I want to split them into 5 bins with 22 records each.
      • @qqqwww - Not sure I understand; do you mean qcut? link
      • @qqqwww To do that, the pd.cut examples on its documentation page show it: pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3) will cut the array into 3 equal parts.
      • @AyanMitra - Do you mean df.groupby(pd.cut(df['percentage'], bins=bins)).mean()?
      • Thanks, this answer helped me :)
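For the equal-count question raised above: pd.qcut bins by quantiles, so each bin holds (roughly) the same number of records. A sketch with hypothetical data matching the 110-record example:

```python
import pandas as pd
import numpy as np

# 110 hypothetical records split into 5 quantile bins -> 22 records per bin
s = pd.Series(np.arange(110))
counts = pd.qcut(s, 5).value_counts().sort_index()
print(counts.tolist())  # [22, 22, 22, 22, 22]
```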