Optimizing for loops, dataframe partition: answers

【Question title】: Optimizing for loops, dataframe partition
【Posted】: 2021-04-16 14:34:02
【Question description】:

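I'm developing an algorithm for feature selection in binary classification. It greedily scans an np.array or pd.Series to find intervals that split the target well.

The code works, but it relies on for loops and if conditions, so it is slow. I'd like to know whether there is a smarter (faster) way to do this. My code looks like this: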

import pandas as pd
df = pd.DataFrame([[51, 35, 1], [52, 3, 1], [53, 11, 1], [61, 8, 0], [75, 23, 0], [83, 45, 0], [95, 56, 1], [13, 66, 1], [1, 0, 1], [22, 68, 1]], columns=['feat1', 'feat2', 'target'])
target = df['target'] # values range from 0 to 1

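def my_generic_metric_function(y):
  # This is just a generic metric that I'm using as an example.
  if len(y) > 0:
    tgt = sum(y == 1)
    no_tgt = sum(y == 0)
    # Note: because of operator precedence this evaluates to
    # tgt * (no_tgt + tgt), not tgt / (no_tgt + tgt).
    return 1.0*tgt/1.0*(no_tgt+tgt)
  else:
    return 0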


def find_intervals(x, min_metric=10):
    ## Important: all my features receive a treatment that "fits" them in a range from 0 to 100
    ## Note that I'm not iterating through the DataFrame, I'm iterating over a range of values and finding the partitions in the dataframe.
    print(x.name)
    steps = [0]
    metric_partition = []
    for i in range(0, 101):
        ## This is the target series filtered by the interval in x value
        band = target[(x>steps[-1]) & (x<=i)]
        partition_metric = my_generic_metric_function(band)

        if partition_metric >= min_metric:
            steps.append(i)
            metric_partition.append(partition_metric)

    return {'f':x.name,'s': steps, 'm':metric_partition}

I apply this function to the whole DataFrame with .apply():

bi_df = df.drop("target", axis=1).apply(find_intervals)

The problem looks a lot like the CART algorithm, but I haven't found any implementation that would help me optimize it.

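【Question comments】:

Is this some kind of RandomForest implementation?
Although it is very similar to the greedy CART algorithm, I'm actually building a feature selection and classification algorithm aimed at linear models (mainly logistic regression, I hope).
What is the my_generic_function function for? It has some errors.
The idea is to use this metric to evaluate the quality of the cut; I'm only using this particular metric as an example. If the metric for a given interval is above the threshold (1.2 in this example), the current i value from the iterator is appended to the steps list and the generic metric value to the metric_partition list, and the next candidate interval is then evaluated. The real metric I use is a bit more complex, and computing it doesn't take much time, so I simply replaced it with this "generic" function.
When I run it as-is, I get KeyError: '[1 0 0 0 1 1 1 1] not found in axis'. It seems the apply line should be changed to: bi_df = df.drop("target", axis=1).apply(find_intervals). But then it raises ZeroDivisionError: division by zero at return tgt / (no_tgt + tgt). Can you confirm?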

Tags: python pandas for-loop optimization

【Solution 1】:

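It seems that my_generic_metric_function is monotonically increasing with respect to i in the loop. I can't prove it right now, but I ran some simulations with random numbers and always got this result.
If that is the case, instead of a linear search (for i in range(0, 101)) you can do an (almost) binary search for the first number that reaches the min_metric threshold.
It is almost a plain binary search, but as a second verification step you need to check that the previous number is below the threshold.
This reduces the loop time from O(n) to O(log n).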
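The binary search described above could look roughly like the sketch below. It is only an illustration, not the asker's code: `find_intervals_bisect`, `example_metric` and the `upper` parameter are names made up here, the metric is the question's example metric as written, and the approach is only valid under the assumption that the metric never decreases as the band grows and that `min_metric` is positive. The bisection keeps the invariant that the upper bound reaches the threshold, so the value it converges to is the first one that does, which covers the "previous number is below the threshold" check.

```python
import pandas as pd


def example_metric(y):
    # The question's example metric as written: tgt * (no_tgt + tgt).
    if len(y) > 0:
        tgt = int((y == 1).sum())
        no_tgt = int((y == 0).sum())
        return 1.0 * tgt / 1.0 * (no_tgt + tgt)
    return 0


def find_intervals_bisect(x, target, min_metric=10, upper=100):
    # Hypothetical variant of find_intervals; assumes the metric is
    # non-decreasing in the upper bound and min_metric > 0.
    steps = [0]
    metric_partition = []

    def band_metric(lower, i):
        # Metric of the target rows whose feature value lies in (lower, i].
        return example_metric(target[(x > lower) & (x <= i)])

    while True:
        lower = steps[-1]
        lo, hi = lower + 1, upper
        # If even the widest remaining band misses the threshold, stop.
        if lo > hi or band_metric(lower, hi) < min_metric:
            break
        # Invariant: band_metric(lower, hi) >= min_metric.
        while lo < hi:
            mid = (lo + hi) // 2
            if band_metric(lower, mid) >= min_metric:
                hi = mid
            else:
                lo = mid + 1
        # lo is now the first cut point that reaches the threshold.
        value = band_metric(lower, lo)
        steps.append(lo)
        metric_partition.append(value)

    return {'f': x.name, 's': steps, 'm': metric_partition}


df = pd.DataFrame(
    [[51, 35, 1], [52, 3, 1], [53, 11, 1], [61, 8, 0], [75, 23, 0],
     [83, 45, 0], [95, 56, 1], [13, 66, 1], [1, 0, 1], [22, 68, 1]],
    columns=['feat1', 'feat2', 'target'])
result = find_intervals_bisect(df['feat1'], df['target'])
```

If the real metric is not monotone, this search can skip valid cut points, so it is worth comparing its output with the linear loop on a few columns first.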

You could also try to index your feature columns.

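Other options: use numpy for everything instead of pandas, adjust and compile find_interval, or even parallelize the per-column apply (with multiprocessing or some other simple approach). These three will not reduce the algorithmic complexity, only the execution time (although for some operations numpy is orders of magnitude faster than pandas).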
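As one possible reading of the "numpy instead of pandas" suggestion, the sketch below sorts a feature once and uses cumulative sums so that the counts for any band come from two `np.searchsorted` lookups rather than a boolean mask over the whole Series. `make_band_counter` is a hypothetical helper, and the sketch assumes the target is strictly 0/1, that the metric depends only on the two counts, and that `lower <= upper`.

```python
import numpy as np


def make_band_counter(x, y):
    # Hypothetical helper: precompute sorted values and cumulative positives.
    x = np.asarray(x)
    y = np.asarray(y)
    order = np.argsort(x, kind="stable")
    xs = x[order]
    # cum_pos[k] = number of positives among the k smallest x values.
    cum_pos = np.concatenate(([0], np.cumsum(y[order] == 1)))

    def counts(lower, upper):
        # Rows with lower < x <= upper occupy positions a .. b-1 in xs.
        a = np.searchsorted(xs, lower, side="right")
        b = np.searchsorted(xs, upper, side="right")
        tgt = int(cum_pos[b] - cum_pos[a])
        no_tgt = int(b - a) - tgt
        return tgt, no_tgt

    return counts


feat1 = [51, 52, 53, 61, 75, 83, 95, 13, 1, 22]
target = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
counts = make_band_counter(feat1, target)
tgt, no_tgt = counts(51, 83)  # counts for the band 51 < x <= 83
```

The returned `tgt` and `no_tgt` can then be fed to whatever metric is in use, in place of filtering the target Series on every iteration.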

【Comments】: