【Question title】: Derive a feature or column based on the given condition in pandas
【Posted】: 2021-03-11 12:03:36
【Question】:

I have a df like the one shown below:

ID     Age_days    N_30     N_31_90     N_91_180      N_180_365
1      201         60       15          30            1
2      800         0        15          5             10
3      800         0        0           10            6
4      100         0        0           0             370
5      600         0        6           5             10
6      800         0        0           15            6
7      500         10       10          30            9
8      200         0        0           0             0
9      500         0        0           0             0

From the df above I want to derive a column named Recency.

Explanation:

if df['N_30'] != 0, then Recency = (30/df['N_30'])
elif df['N_31_90'] != 0 then Recency = 30 + (60/df['N_31_90'])
elif df['N_91_180'] != 0 then Recency = 90 + (90/df['N_91_180'])
elif df['N_181_365'] != 0 then Recency = 180 + (185/df['N_181_365'])
else 
  if df['age_days'] <= 365, Recency = df['age_days']
  else Recency = 366

Expected output:

ID     Age_days    N_30     N_31_90     N_91_180      N_180_365    Recency
1      201         60       15          30            1            30/60 = 0.5
2      800         0        15          5             10           30 + (60/15) = 34
3      800         0        0           10            6            90 + (90/10) = 99
4      100         0        0           0             370          180 + (185/370) = 180.5
5      600         0        6           5             10           30 + (60/6) = 40
6      800         0        0           15            6            90 + (90/15) = 96
7      500         10       10          30            9            30/10 = 3
8      200         0        0           0             0            200
9      500         0        0           0             0            366

I tried the code below:

pd.set_option("use_inf_as_na", True)
df2 = df[['N_30', 'N_31_90', 'N_91_180', 'N_180_365']]
df["Recency"] = (df2.eq(0) * [30, 60, 90, 180]).sum(1) + ([30, 60, 90, 185] / df2).bfill(1).iloc[:, 0]
df["Recency"].fillna(366)
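
A note on the attempt above: the result of the final `fillna(366)` is never assigned back, the fallback does not consult `Age_days`, and the `eq(0)` sum adds an offset for every zero bucket, including ones that come after the first non-zero bucket. The following is a minimal sketch of how the same back-fill idea could be repaired, not a definitive implementation. The names `bucket_cols`, `offsets`, `widths`, `masked` and `fallback` are illustrative assumptions, and the sample data is typed in from the table in the question.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Age_days": [201, 800, 800, 100, 600, 800, 500, 200, 500],
    "N_30": [60, 0, 0, 0, 0, 0, 10, 0, 0],
    "N_31_90": [15, 15, 0, 0, 6, 0, 10, 0, 0],
    "N_91_180": [30, 5, 10, 0, 5, 15, 30, 0, 0],
    "N_180_365": [1, 10, 6, 370, 10, 6, 9, 0, 0],
})

bucket_cols = ["N_30", "N_31_90", "N_91_180", "N_180_365"]
offsets = np.array([0, 30, 90, 180])   # constant added for each bucket
widths = np.array([30, 60, 90, 185])   # numerator used for each bucket

counts = df[bucket_cols]
# Turn zero counts into NaN so the division never produces inf.
masked = counts.where(counts != 0)
# Per-bucket candidate value: offset + width / count, NaN where count was zero.
per_bucket = masked.rtruediv(widths) + offsets
# Back-fill along each row; column 0 then holds the first non-NaN candidate.
recency = per_bucket.bfill(axis=1).iloc[:, 0]
# Rows with all counts zero are still NaN: use Age_days if <= 365, else 366.
fallback = df["Age_days"].where(df["Age_days"] <= 365, 366)
df["Recency"] = recency.fillna(fallback)

print(df["Recency"].tolist())
# [0.5, 34.0, 99.0, 180.5, 40.0, 96.0, 3.0, 200.0, 366.0]
```

Masking zeros with `where` avoids relying on the `use_inf_as_na` option altogether.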

【Comments on the question】:

    Tags: python-3.x pandas dataframe


    【Solution 1】:

    Use numpy.select:

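    import numpy as np
    
    # Conditions are checked in order; for each row the first True one wins.
    conditions = [df['N_30'] != 0, df['N_31_90'] != 0, df['N_91_180'] != 0, df['N_180_365'] != 0, df['Age_days'] <= 365]
    
    # One choice per condition, in the same order.
    choices = [(30/df['N_30']), 30 + (60/df['N_31_90']), 90 + (90/df['N_91_180']), 180 + (185/df['N_180_365']), df['Age_days']]
    
    # Rows matching no condition (all counts zero and Age_days > 365) get 366.
    df['Recency'] = np.select(conditions, choices, default=366)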
    

    Output:

       ID  Age_days  N_30  N_31_90  N_91_180  N_180_365  Recency
    0   1       201    60       15        30          1      0.5
    1   2       800     0       15         5         10     34.0
    2   3       800     0        0        10          6     99.0
    3   4       100     0        0         0        370    180.5
    4   5       600     0        6         5         10     40.0
    5   6       800     0        0        15          6     96.0
    6   7       500    10       10        30          9      3.0
    7   8       200     0        0         0          0    200.0
    8   9       500     0        0         0          0    366.0
    

    I made a couple of small corrections: I used N_180_365 instead of N_181_365, because N_181_365 appears in your conditions but not in the DataFrame.
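
For anyone who wants to run the numpy.select approach end to end, here is a self-contained sketch. The DataFrame is typed in by hand from the question, and the `inf_count` check is an illustrative addition: it shows that every choice is computed for all rows, so zero counts produce `inf` in the intermediate Series, but `np.select` never picks those entries because their condition is False.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Age_days": [201, 800, 800, 100, 600, 800, 500, 200, 500],
    "N_30": [60, 0, 0, 0, 0, 0, 10, 0, 0],
    "N_31_90": [15, 15, 0, 0, 6, 0, 10, 0, 0],
    "N_91_180": [30, 5, 10, 0, 5, 15, 30, 0, 0],
    "N_180_365": [1, 10, 6, 370, 10, 6, 9, 0, 0],
})

conditions = [
    df["N_30"] != 0,
    df["N_31_90"] != 0,
    df["N_91_180"] != 0,
    df["N_180_365"] != 0,
    df["Age_days"] <= 365,
]
choices = [
    30 / df["N_30"],
    30 + 60 / df["N_31_90"],
    90 + 90 / df["N_91_180"],
    180 + 185 / df["N_180_365"],
    df["Age_days"],
]

# The first choice contains inf wherever N_30 is zero (7 of the 9 rows).
inf_count = int(np.isinf(choices[0]).sum())

df["Recency"] = np.select(conditions, choices, default=366)

print(inf_count)               # 7
print(df["Recency"].tolist())
# [0.5, 34.0, 99.0, 180.5, 40.0, 96.0, 3.0, 200.0, 366.0]
```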

    【Comments】:

    • Can you share your output?
    【Solution 2】:

    For learning purposes only.

    You can try creating a dict and mapping the elements.

    def func(x):
        if (x[x['coln']]!=0):
    #     if x!=np.nan:
            return (d[x['coln']](x[x['coln']]))
        elif x['Age_days']<=365:
            return x['Age_days'] 
        else:
            return 366
    
    d = {'N_30': lambda x: (30/x), 'N_31_90': lambda x: 30 + (60/x), 'N_91_180': lambda x: 90 + (90/x), 
    'N_180_365': lambda x: 180 + (185/x)}
    
    df['recency'] = df.assign(coln = df.filter(like='N').idxmax(axis=1).reset_index(drop=True)).apply(func,axis=1)
    

    df:

       ID  Age_days  N_30  N_31_90  N_91_180  N_180_365  recency
    0   1       201    60       15        30          1      0.5
    1   2       800     0       15         5         10     34.0
    2   3       800     0        0        10          6     99.0
    3   4       100     0        0         0        370    180.5
    4   5       600     0        6         5         10    198.5
    5   6       800     0        0        15          6     96.0
    6   7       500    10       10        30          9     93.0
    7   8       200     0        0         0          0    200.0
    8   9       500     0        0         0          0    366.0

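    Correction:

    idxmax on the raw counts picks the column with the largest value rather than the first non-zero column (see rows 4 and 6 above), so the column selection should be: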

     df.filter(like='N').replace(0,np.nan).notna().idxmax(axis=1)
    

    After this correction you get the same result as with numpy.select.
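
Putting the dict, the row function and the corrected column selection together, a runnable sketch might look like the following. The names `formulas`, `first_nonzero` and `recency_for_row` are illustrative renamings, not part of the original answer, and the DataFrame is typed in from the question.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Age_days": [201, 800, 800, 100, 600, 800, 500, 200, 500],
    "N_30": [60, 0, 0, 0, 0, 0, 10, 0, 0],
    "N_31_90": [15, 15, 0, 0, 6, 0, 10, 0, 0],
    "N_91_180": [30, 5, 10, 0, 5, 15, 30, 0, 0],
    "N_180_365": [1, 10, 6, 370, 10, 6, 9, 0, 0],
})

# Formula for each bucket column, keyed by column name.
formulas = {
    "N_30": lambda x: 30 / x,
    "N_31_90": lambda x: 30 + 60 / x,
    "N_91_180": lambda x: 90 + 90 / x,
    "N_180_365": lambda x: 180 + 185 / x,
}

# Corrected selection: name of the first non-zero bucket in each row.
# For an all-zero row idxmax falls back to the first column, which the
# zero check inside recency_for_row handles.
first_nonzero = df.filter(like="N").replace(0, np.nan).notna().idxmax(axis=1)

def recency_for_row(row):
    col = row["coln"]
    if row[col] != 0:
        return formulas[col](row[col])
    if row["Age_days"] <= 365:
        return row["Age_days"]
    return 366

df["recency"] = df.assign(coln=first_nonzero).apply(recency_for_row, axis=1)

print(df["recency"].tolist())
# [0.5, 34.0, 99.0, 180.5, 40.0, 96.0, 3.0, 200.0, 366.0]
```

Because `apply(..., axis=1)` calls a Python function once per row, this is expected to be slower than the vectorized numpy.select version on large frames.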
