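Answers: creating macro variables in a Python data frame

[Question title]: Create macro variables in a Python data frame
[Posted]: 2018-01-08 11:33:14
[Question]:

I'm trying to automate my code, because manually changing the column names in it every time means a lot of repetitive work. I'd like to understand how to create macro variables in Python the way I would in SAS. Any help is much appreciated!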

    # First, I create my cutoff points.
    # I need to put 'col1' in a macro variable so that I don't hardcode it every time!
    percentiles = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
    cutoff1 = my_data['col1'].describe(percentiles)['10%'].astype('float64')
    cutoff2 = my_data['col1'].describe(percentiles)['20%'].astype('float64')
    cutoff3 = my_data['col1'].describe(percentiles)['30%'].astype('float64')

    # Then I assign new values to my continuous variable using the thresholds determined above.
    # I also need to build the name COL1_RANK from a macro variable,
    # i.e. something like %s = 'COL1' and then %s & '_RANK'.
    def f(row):
        if row['col1'] <= cutoff1:
            COL1_RANK = 1
        elif row['col1'] <= cutoff2:
            COL1_RANK = 2
        elif row['col1'] <= cutoff3:
            COL1_RANK = 3
        else:
            COL1_RANK = 4
        return COL1_RANK

    my_data['COL1_RANK'] = my_data.apply(f, axis=1)

    my_data.head(5)

[Question comments]:

Tags: python macros

[Solution 1]:

I think you need to create a custom function that uses quantile and cut:

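    import numpy as np
    import pandas as pd

    def func(df, input_col, output_col):
        # Cutoff points at the 10th, 20th and 30th percentiles of the input column
        cutoffs = df[input_col].quantile([.1, .2, .3]).astype('float64')
        # Open-ended outer bins so that every non-missing value falls into a bin
        bins = [-np.inf, cutoffs[.1], cutoffs[.2], cutoffs[.3], np.inf]
        labels = [1, 2, 3, 4]

        df[output_col] = pd.cut(df[input_col], bins=bins, labels=labels)
        return df

    # The column names are ordinary string arguments, so nothing is hardcoded
    my_data = func(my_data, 'col1', 'COL1_RANK')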
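
To get closer to what a SAS macro variable does, the output name can be derived from the input name and the function can be called in a loop over several columns. The following is only a sketch under assumptions: the name `add_rank`, the `quantiles` parameter, the second column `col2` and the sample data are all hypothetical and do not come from the question.

```python
import numpy as np
import pandas as pd

def add_rank(df, input_col, quantiles=(0.1, 0.2, 0.3)):
    """Bin input_col at the given quantiles and store the result in <INPUT_COL>_RANK."""
    # Build the output name from the input name, similar to &col._RANK in SAS
    output_col = f"{input_col.upper()}_RANK"
    cutoffs = df[input_col].quantile(list(quantiles))
    # Open-ended outer bins; pd.cut raises ValueError if two cutoffs coincide
    bins = [-np.inf, *cutoffs, np.inf]
    labels = list(range(1, len(bins)))  # 1, 2, ..., number of bins
    df[output_col] = pd.cut(df[input_col], bins=bins, labels=labels)
    return df

# Hypothetical sample data, only for illustration
my_data = pd.DataFrame({"col1": range(1, 101), "col2": range(100, 0, -1)})

# The loop variable plays the role of the macro variable
for col in ["col1", "col2"]:
    my_data = add_rank(my_data, col)

print(my_data.head(5))
```

Because `pd.cut` uses right-closed intervals by default, a value equal to a cutoff lands in the lower bin, which matches the `<=` comparisons in the question's code.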

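[Comments]:

A quick question: if I want to supply manual cutoff points instead of quantiles, how do I do that? For example, instead of quantile([.1,.2,.3]), can I use manual cutoffs such as ([50,100,300]), @jezrael?
Yes, just change [-np.inf, cutoffs[.1], cutoffs[.2], cutoffs[.3], np.inf] to [-np.inf, 50, 100, 300, np.inf].
Hi @jezrael, I have a follow-up question on last week's thread. I would also like to capture missing data in my bins, e.g. [-np.inf, "", 100, 300, np.inf], and label it with labels=['missing_data',1,2,3]. How can I do that?
I don't think that is so easy: you first need to split the data into missing and non-missing rows, apply this solution to the non-missing rows, and assign 'missing data' to the missing ones. The best approach is to create a new question with sample data and expected output.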
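
One possible way to combine the manual cutoffs with a label for missing values is sketched below. It is not the approach confirmed in the thread: instead of filtering the rows first, it relies on `pd.cut` leaving missing inputs as NaN and then fills those with an extra category. The function name `rank_with_missing`, its parameters and the sample data are hypothetical, and the sketch keeps four numeric bins plus a separate missing label rather than the exact labels from the comment.

```python
import numpy as np
import pandas as pd

def rank_with_missing(df, input_col, output_col, cutoffs, missing_label="missing_data"):
    """Bin input_col at manual cutoffs and give missing values their own label."""
    bins = [-np.inf, *cutoffs, np.inf]
    labels = list(range(1, len(bins)))
    # pd.cut leaves NaN inputs as NaN in the categorical result
    ranks = pd.cut(df[input_col], bins=bins, labels=labels)
    # A new category must be registered before it can be used as a fill value
    ranks = ranks.cat.add_categories([missing_label]).fillna(missing_label)
    df[output_col] = ranks
    return df

# Hypothetical sample data with one missing value
my_data = pd.DataFrame({"col1": [10, 50, 75, np.nan, 250, 1000]})
my_data = rank_with_missing(my_data, "col1", "COL1_RANK", cutoffs=[50, 100, 300])

print(my_data)
```

Note that the resulting column mixes integer labels with a string label, so it is best treated as a categorical column rather than a numeric one.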