Using split-apply-combine to remove some values with a customized function and combine what's left

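【Question title】: Using split-apply-combine to remove some values with a customized function and combine what's left
【Posted】: 2026-02-13 23:55:01
【Question】:

This isn't the dataset I actually need to work with, but it is a template for a huge dataset (about 1.8 million data points) that I'm using in a cancer research project. I figure that if I can get this working on the smaller one, I can adapt it to the big one. As an example, say I have the following dataset: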

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'Array':  ['S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','SS','TT'],
   'X':  [1, 2, 3, 1, 2 , 3, 4, 7.3, 5.1, 3.2, 1.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
   'Y':  [3.1, 2.2, 2.1, 1.2,  2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
   'Marker':  [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
   'Area': [3.0, 2.0, 2.88, 1.33,  2.44, 1.25, 1.53, 1.0, 0.156, 2.0, 2.4, 6.3, 6.9, 9.78, 10.2, 15.0, 16.0, 19.0]
})
print(df)

This produces output that looks like this:

   cond Array    X     Y  Marker    Area
0     A     S  1.0  3.10    2.00   3.000
1     A     S  2.0  2.20    1.20   2.000
2     A    TT  3.0  2.10    1.20   2.880
3     A    TT  1.0  1.20    2.01   1.330
4     A     S  2.0  2.40    2.55   2.440
5     A     S  3.0  1.20    2.05   1.250
6     A    TT  4.0  1.50    1.66   1.530
7     A    TT  7.3  1.33    3.20   1.000
8     A     S  5.1  1.50    3.21   0.156
9     B     S  3.2  1.60    3.04   2.000
10    B    TT  1.4  1.40    8.01   2.400
11    B    TT  5.5  1.30    9.10   6.300
12    B     S  9.9  0.90    7.06   6.900
13    B     S  3.2  0.78    8.10   9.780
14    B    TT  1.1  1.20    7.90  10.200
15    B    TT  3.3  4.00    5.12  15.000
16    B    SS  1.2  5.00    5.23  16.000
17    B    TT  5.4  6.00    5.15  19.000

OK, what I need to do now is split the rows by the two labels 'cond' and 'Array'. I did it like this:

g=df.groupby(['cond','Array'])['Marker']

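This splits the data into 4 smaller sets, the A-S, A-TT, B-S, and B-TT pairings. Now I have a custom function to apply. Here is part of the function, and I'll explain how it works: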

def num_to_delete(p,alpha,N):
    if p==0.950:
        if 1-alpha==0.90:
            if N<=60:
                m=1
            if 60<N<80:
                m=round(N/20-2)
            if 80<=N:
                m=2
        if 1-alpha==0.95:
            if N<=80:
                m=1
            if 80<N<=100:
                m=round(N/20 -3)
            if 100<N:
                m=2
    return m

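The way it works is that I feed it the 'p' and 'alpha' I choose (the real function covers many more p and alpha cases). The N passed in is the number of elements in the smaller dataset (here it is 5 for A-S, 4 for A-TT, and so on). What I want is for each smaller dataset to yield a number of points to delete (in this example the function always returns 1, but I'm writing it so that the function can be applied to the very large dataset). Since it returns 1, I want it to drop the 1 largest data point in that set and tell me the highest point that remains.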

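For example, for the A-S pairing I have 5 data points: 2.0, 1.2, 2.55, 2.05, and 3.21. Since there are 5 data points, my function tells me to delete 1 of them, so it should ignore 3.21 and tell me the highest remaining data point, which here is 2.55. I want to do this for every pairing, but in my real dataset each pairing will have a different number of elements, so the function will tell me to delete a different number for each one.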
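
As a minimal sketch of that A-S example (not the real pipeline), the snippet below puts the five Marker values from the sample data in a Series and uses `Series.nlargest` to skip the `m` largest points. The names `markers`, `m`, and `remaining_max` are made up for illustration, and the sketch assumes `m` is smaller than the number of values.

```python
import pandas as pd

# Marker values of the A-S pairing, copied from the sample data above.
markers = pd.Series([2.0, 1.2, 2.55, 2.05, 3.21])

m = 1  # number of points to delete (what num_to_delete returns for N = 5)

# nlargest(m + 1) returns the m + 1 largest values in descending order;
# the last of those is the highest value left after dropping the top m.
remaining_max = markers.nlargest(m + 1).iloc[-1]
print(remaining_max)  # 2.55
```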

My end goal is a final table that looks like this:

  cond Array  NumDeleted  p95/a05  p95/a10
0    A     S         1.0     2.55     2.55
1    A    TT         1.0     2.01     2.01
2    B     S         1.0     7.06     7.06
3    B    TT         1.0     8.01     8.01

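For the larger set, the values in the last 2 columns will differ, because in the large dataset the number of values to delete varies a lot, so the remaining values will differ too. Ultimately I need to alter a second dataset based on the values I get for p95/a05 and p95/a10.

Anyway, sorry for the long explanation, but if anyone can help, that would be amazing! I'm hoping this is fairly simple, since I've been stuck on it for over a week.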

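【Comments】:

Wouldn't NumDeleted depend on the values of p and a? So which value does NumDeleted refer to?
@adrianp Yes, that's correct. In theory I would need multiple NumDeleted columns, depending on which p and a I use.
I'm not clear on this. Your output dataframe has two columns with different p and a values, but only one NumDeleted column. Do you need a NumDeleted for each configuration?
@adrianp Sorry, I misunderstood what you wrote. Ideally I'd have one for each configuration as a way to keep track as I go (though those columns aren't as important as the p95/a05 and p95/a10 columns, since those will make up the bulk of the analysis, so if it's too much work I don't need them).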
Hi, what is m when p != 0.95?

Tags: python pandas split-apply-combine

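【Solution 1】:

Edit: a more general solution

First, it helps to create a closure that defines your configurations. This assumes you will have more configurations in the future: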

def create_num_to_delete(p, alpha):
    """Create a num_to_delete function given p and alpha."""
    def num_to_delete(N):
        if p == 0.950:
            if 1 - alpha == 0.90:
                if N <= 60:
                    m = 1
                if 60 < N < 80:
                    m = round(N/20 - 2)
                if 80 <= N:
                    m = 2
            if 1-alpha == 0.95:
                if N <= 80:
                    m = 1
                if 80 < N <= 100:
                    m = round(N/20 -3)
                if 100 < N:
                    m = 2
        return m

    return num_to_delete

You can then use this closure to define a dictionary of configurations:

configurations = {
    'p95/a05': create_num_to_delete(0.95, 0.05),
    'p95/a10': create_num_to_delete(0.95, 0.10),
}
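
One caveat with the closure above: if `p` or `alpha` does not match a coded case, `m` is never assigned and the call fails with an `UnboundLocalError`, and the `==` tests rely on exact float equality. A hedged variant, assuming the same thresholds as in the question, could compare with `math.isclose` and raise a `ValueError` for unsupported settings. The early-return layout and the error message are illustrative choices, not part of the original answer.

```python
import math

def create_num_to_delete(p, alpha):
    """Sketch: same rules as above, but fails loudly on unknown settings."""
    def num_to_delete(N):
        confidence = 1 - alpha
        if math.isclose(p, 0.95) and math.isclose(confidence, 0.90):
            if N <= 60:
                return 1
            if N < 80:  # 60 < N < 80
                return round(N / 20 - 2)
            return 2    # 80 <= N
        if math.isclose(p, 0.95) and math.isclose(confidence, 0.95):
            if N <= 80:
                return 1
            if N <= 100:  # 80 < N <= 100
                return round(N / 20 - 3)
            return 2      # 100 < N
        # Assumption: unsupported (p, alpha) pairs should be an explicit error.
        raise ValueError(f"no rule for p={p}, alpha={alpha}")

    return num_to_delete
```

The resulting functions can be dropped into the `configurations` dictionary in the same way as before.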

Next, define a function that summarizes the data. This function should rely on your configurations so that it stays dynamic.

def summarize(x):
    # The right-hand side below is a list comprehension: essentially a
    # flattened for-loop that produces a list. The part starting with "for"
    # is the ordinary loop statement, and the expression to its left is
    # evaluated once per iteration to give each element of the resulting list.
    #
    # Here, we loop through the "num_to_delete" functions defined in our
    # `configurations` dictionary and call each one with the size of our
    # group `x`.
    Ns = [num_to_delete(len(x)) for num_to_delete in configurations.values()]

    markers = x['Marker'].sort_values(ascending=False)

    highest_markers = []
    for N in Ns:
        # Nothing is left once `N` reaches (or exceeds) the group size.
        if N >= len(x):
            highest_markers.append(None)
        else:
            # Since `markers` is already sorted in descending order, the
            # highest remaining value is simply the value in the *complete
            # list* offset by the number of values to delete (this is `N`).
            #
            # Since sequences are 0-indexed, indexing by `N` is enough.
            # For example, if `N` is 1, `markers.iloc[1]` is the second
            # value, i.e. the highest one left after dropping the maximum.
            highest_markers.append(markers.iloc[N])

    # Returning a Series from an applied groupby function produces a
    # DataFrame with the series index as the columns and the series values
    # as the row values. The index in this case is just the list of
    # configuration names we have in the `configurations` dictionary.
    return pd.Series(highest_markers, index=list(configurations.keys()))

Finally, apply the function to your dataset and reset the index. This keeps cond and Array as columns:

grouped = df.groupby(['cond', 'Array'])
grouped.apply(summarize).reset_index()

The output is:

  cond Array  p95/a05  p95/a10
0    A     S     2.55     2.55
1    A    TT     2.01     2.01
2    B     S     7.06     7.06
3    B    SS      NaN      NaN
4    B    TT     8.01     8.01
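
The question also asked for a NumDeleted count per configuration. A rough sketch of one way to add those columns follows, iterating over the groups directly rather than going through `apply`. The column names such as `NumDeleted p95/a05` and the stand-in rules (which always return 1, matching what the real rules give for groups of at most 60 rows) are assumptions for illustration; only the three columns needed are reproduced from the sample data.

```python
import pandas as pd

# Subset of the sample data: only the columns this sketch needs.
df = pd.DataFrame({
    'cond': ['A'] * 9 + ['B'] * 9,
    'Array': ['S', 'S', 'TT', 'TT', 'S', 'S', 'TT', 'TT', 'S',
              'S', 'TT', 'TT', 'S', 'S', 'TT', 'TT', 'SS', 'TT'],
    'Marker': [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21,
               3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
})

# Stand-in rules (hypothetical): swap in the real num_to_delete closures.
configurations = {
    'p95/a05': lambda n: 1,
    'p95/a10': lambda n: 1,
}

rows = []
for (cond, array), markers in df.groupby(['cond', 'Array'])['Marker']:
    ordered = markers.sort_values(ascending=False)
    row = {'cond': cond, 'Array': array}
    for name, rule in configurations.items():
        m = rule(len(ordered))
        row[f'NumDeleted {name}'] = m
        # NaN when the rule would delete every value in the group.
        row[name] = ordered.iloc[m] if m < len(ordered) else float('nan')
    rows.append(row)

result = pd.DataFrame(rows)
print(result)
```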

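Hope this helps.

【Comments】:

Wow, this is great, and honestly far more advanced than anything I would have come up with on my own! For learning purposes, could you explain what the different lines in "def summarize(x):" are doing? There are some things in there I haven't seen before, so I'd like to understand more about how this function works. I'd really appreciate it! :)
@Brenton Sure. I'll come back to this once I'm off work.
@Brenton See the edited post for comments on summarize. I hope that helps. If not, feel free to message me.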