【Question title】: Vectorized function with counter on pandas dataframe column
【Posted】: 2021-03-03 18:56:04
【Question】:

Consider a pandas DataFrame in which the condition column is 1 whenever value is below 5 (or any other threshold).

import pandas as pd
d = {'value': [30,100,4,0,80,0,1,4,70,70],'condition':[0,0,1,1,0,1,1,1,0,0]}
df = pd.DataFrame(data=d)
df

Out[1]:
   value  condition
0     30          0
1    100          0
2      4          1
3      0          1
4     80          0
5      0          1
6      1          1
7      4          1
8     70          0
9     70          0
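
For reference, the condition column can be derived directly from value. This is a minimal sketch, assuming the threshold of 5 from the description; the variable name threshold is illustrative and does not appear in the original post.

```python
import pandas as pd

threshold = 5  # assumed threshold, taken from the description above

df = pd.DataFrame({'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70]})
# Boolean comparison cast to int: 1 where value is below the threshold, else 0
df['condition'] = (df['value'] < threshold).astype(int)
print(df)
```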

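What I want is for each run of consecutive values below 5 to share one id, and for all other values to get 0 (or NA, or a negative value; it doesn't matter, as long as they are all the same). I want to create a new column called new_id that holds these cumulative ids, like this: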

   value  condition  new_id
0     30          0       0
1    100          0       0
2      4          1       1
3      0          1       1
4     80          0       0
5      0          1       2
6      1          1       2
7      4          1       2
8     70          0       0
9     70          0       0

In a very inefficient for loop, I would do it like this (it works):

counter = 1  # next id to hand out
for i in range(df.shape[0]):
    cond = df.loc[df.index[i], 'condition']
    # treat the row before the first one as 0, so i-1 never wraps around to the last row
    prev = df.loc[df.index[i-1], 'condition'] if i > 0 else 0
    if cond == 1 and prev == 0:
        new_id = counter  # start of a new run: assign new id
        counter += 1
    elif cond == 1:
        new_id = counter - 1  # same run: assign current id
    else:
        new_id = 0  # condition is 0: assign 0
    df.loc[df.index[i], 'new_id'] = new_id
df
  

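But this is very inefficient, and my dataset is very large. So I tried several kinds of vectorization, but so far I have not managed to stop the id from counting up inside each "cluster" of consecutive points: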

# First try using cumsum():
df['new_id'] = 0
df['new_id_temp'] = ((df['condition'] == 1)).astype(int).cumsum()
df.loc[(df['condition'] == 1), 'new_id'] = df['new_id_temp']
df[['value', 'condition', 'new_id']]

# Another try using list comprehension but this just does +1:
[row+1 for ind, row in enumerate(df['condition']) if (row != row-1)]

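I also tried apply() with a custom if/else function, but that does not seem to let me use a counter.

There are already plenty of similar posts about this, but none of them keeps the same id across consecutive rows.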

Example posts: Maintain count in python list comprehension; Pandas cumsum on a separate column condition; Python - keeping counter inside list comprehension; python pandas conditional cumulative sum; Conditional count of cumulative sum Dataframe - Loop through columns

【Question comments】:

    Tags: python pandas list-comprehension vectorization


    【Solution 1】:

    You can use cumsum(), as in your first attempt, with a small modification:

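    # calculate delta; fill_value=0 treats the row before the first one as 0,
    # so delta has no NaN and a run of 1s starting at row 0 is still counted
    df['delta'] = df['condition'] - df['condition'].shift(1, fill_value=0)
    # get rid of -1 for the cumsum (replace it by 0)
    df['delta'] = df['delta'].replace(-1, 0)
    
    # cumulative sum conditional: multiply with condition column
    df['cumsum_x'] = df['delta'].cumsum() * df['condition']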
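
The same idea can be written without helper columns by marking the first row of each run and counting those marks. This is a minimal sketch on the question's data, not the answerer's code; the names cond and run_start are illustrative, and fill_value=0 is used so that the row before the first one counts as 0.

```python
import pandas as pd

df = pd.DataFrame({
    'value': [30, 100, 4, 0, 80, 0, 1, 4, 70, 70],
    'condition': [0, 0, 1, 1, 0, 1, 1, 1, 0, 0],
})

cond = df['condition']
# True only on the first row of each run of 1s: the current row is 1
# and the previous row (0 before the first row) is 0
run_start = cond.eq(1) & cond.shift(1, fill_value=0).eq(0)
# Count the run starts seen so far, then zero out rows where condition is 0
df['new_id'] = run_start.cumsum() * cond
print(df)
```

Because only run starts are summed, the id stays constant inside a run instead of increasing on every row.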
    

    【Comments】:

    • Works perfectly, and it is even faster than my cumsum solution. Thanks!
    【Solution 2】:

    Welcome to SO! Why not just rely on base Python?

    def counter_func(l):
        new_id = [0]   # First value is zero in any case
        counter = 0
        for i in range(1, len(l)):
            if l[i] == 0:
                new_id.append(0)
            elif l[i] == 1 and l[i-1] == 0:
                counter += 1
                new_id.append(counter)
            elif l[i] == l[i-1] == 1:
                new_id.append(counter)
            else: new_id.append(None)
        return new_id
    
    df["new_id"] = counter_func(df["condition"])
    

    The result looks like this:

       value  condition  new_id
    0     30          0       0
    1    100          0       0
    2      4          1       1
    3      0          1       1
    4     80          0       0
    5      0          1       2
    6      1          1       2
    7      4          1       2
    8     70          0       0
    9     70          0       0
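
If the first row can itself have condition 1, the "first value is zero" assumption in counter_func no longer holds. Below is a minimal sketch of one way around that, using itertools.groupby from the standard library. The name run_ids is hypothetical and not part of the answer, and unlike counter_func it maps every value other than 1 to 0 instead of None.

```python
from itertools import groupby


def run_ids(values):
    """Give each run of consecutive 1s its own id; everything else gets 0."""
    ids = []
    counter = 0
    # groupby yields one (key, group) pair per run of equal consecutive values
    for key, group in groupby(values):
        run_length = len(list(group))
        if key == 1:
            counter += 1
            ids.extend([counter] * run_length)
        else:
            ids.extend([0] * run_length)
    return ids


print(run_ids([0, 0, 1, 1, 0, 1, 1, 1, 0, 0]))
```

With a DataFrame this could be called as run_ids(df["condition"].tolist()), which also avoids label-based lookups on a Series with a non-default index.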
    

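    Edit:

    You can also use numba, which sped the function up considerably for me: from about 1 second to ~60 ms.

    To use it, you should pass a NumPy array to the function, which means passing df["condition"].values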

    from numba import njit
    import numpy as np
    @njit
    def func(arr):
        res = np.empty(arr.shape[0])
        counter = 0
        res[0] = 0 # First value is zero anyway
        for i in range(1, arr.shape[0]):
            if arr[i] == 0:
                res[i] = 0
            elif arr[i] and arr[i-1] == 0:
                counter += 1
                res[i] = counter
            elif arr[i] == arr[i-1] == 1:
                res[i] = counter
            else: res[i] = np.nan
        return res
    
    df["new_id"] = func(df["condition"].values)
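
If numba is not available, the same run labelling can be sketched with plain NumPy array operations. This is an assumption-laden alternative rather than the answerer's method: run_ids is a hypothetical name, the result is an integer array, and any value other than 1 is treated as 0 instead of becoming NaN.

```python
import numpy as np


def run_ids(arr):
    """Label each run of consecutive 1s with 1, 2, 3, ...; other entries get 0."""
    is_one = np.asarray(arr) == 1
    # Shift right by one position, treating the element before the first as not-1
    prev_is_one = np.concatenate(([False], is_one[:-1]))
    # A run starts where the current element is 1 and the previous one is not
    starts = is_one & ~prev_is_one
    # Running count of run starts, zeroed wherever the element is not 1
    return np.cumsum(starts) * is_one


print(run_ids([0, 0, 1, 1, 0, 1, 1, 1, 0, 0]))
```

It could be applied as df["new_id"] = run_ids(df["condition"].values). No timing claim is made here; relative speed would have to be measured on the actual data.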
    

    【Comments】:
