【Question title】: Pandas Memory Error when creating new columns with apply() custom function
【Posted】: 2020-02-06 17:40:22
【Question description】:

A function that computes the mean log(1+TPM) of two replicates:

import numpy as np

def average_TPM(a,b):
    # mean of log(1+TPM) over two replicates; NaN unless both logs exceed 0.1
    log_a = np.log(1+a)
    log_b = np.log(1+b)
    if log_a > 0.1 and log_b > 0.1:
        avg = np.mean([log_a,log_b])
    else:
        avg = np.nan
    return avg

Applying the function to df to create the new columns:

df.loc[:,'leaf'] = df.apply(lambda row:  average_TPM(row['leaf1'],row['leaf2']),axis=1)
df.loc[:,'flag_leaf'] = df.apply(lambda row:  average_TPM(row['flag_leaf1'],row['flag_leaf2']),axis=1)
df.loc[:,'anther'] = df.apply(lambda row:  average_TPM(row['anther1'],row['anther2']),axis=1)
df.loc[:,'premeiotic'] = df.apply(lambda row:  average_TPM(row['premeiotic1'],row['premeiotic2']),axis=1)
df.loc[:,'leptotene'] = df.apply(lambda row:  average_TPM(row['leptotene1'],row['leptotene2']),axis=1)
df.loc[:,'zygotene'] = df.apply(lambda row:  average_TPM(row['zygotene1'],row['zygotene2']),axis=1)
df.loc[:,'pachytene'] = df.apply(lambda row:  average_TPM(row['pachytene1'],row['pachytene2']),axis=1)
df.loc[:,'diplotene'] = df.apply(lambda row:  average_TPM(row['diplotene1'],row['diplotene2']),axis=1)
df.loc[:,'metaphase_I'] = df.apply(lambda row:  average_TPM(row['metaphaseI_1'],row['metaphaseI_2']),axis=1)
df.loc[:,'metaphase_II'] = df.apply(lambda row:  average_TPM(row['metaphaseII_1'],row['metaphaseII_2']),axis=1)
df.loc[:,'pollen'] = df.apply(lambda row:  average_TPM(row['pollen1'],row['pollen2']),axis=1)

【Discussion】:

    Tags: python pandas memory-management vectorization apply


    【Solution 1】:

    I'm not sure why you get a memory error, but you can vectorize the problem:

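    import numpy as np
    import pandas as pd

    # dummy data; the sample output below comes from an unseeded run, so your values will differ
    np.random.seed(2)
    df = pd.DataFrame(np.random.random(8*4).reshape(8,-1), columns=['a1','a2','b1','b2'])
    print (df)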
             a1        a2        b1        b2
    0  0.416493  0.964483  0.089547  0.218952
    1  0.655331  0.468490  0.272494  0.652915
    2  0.680433  0.461191  0.919223  0.552074
    3  0.077158  0.138839  0.385818  0.462848
    4  0.149198  0.912372  0.893708  0.081125
    5  0.255422  0.143502  0.466123  0.524544
    6  0.842095  0.486603  0.628405  0.686393
    7  0.329461  0.714052  0.176126  0.566491
    

    Define the list of columns to create, then apply np.log1p to all the data at once:

    col_create = ['a','b']  # redefine this for your own problem
    col_get = [f'{col}{i}' for col in col_create for i in range(1,3)]  # keeps the replicate columns in order
    arr_log = np.log1p(df[col_get].to_numpy())
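
Some of the question's column names do not follow the `{col}{i}` pattern (for example `metaphaseI_1` and `metaphaseI_2` feed `metaphase_I`), so the f-string above would not find them. One possible workaround, sketched here with a hypothetical `pairs` dict and made-up TPM values, is to list the replicate columns explicitly and flatten them in the same interleaved order:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: new column -> its two replicate columns.
# Only three of the question's eleven tissues are listed here.
pairs = {
    "leaf": ("leaf1", "leaf2"),
    "metaphase_I": ("metaphaseI_1", "metaphaseI_2"),
    "pollen": ("pollen1", "pollen2"),
}

col_create = list(pairs)
# Flatten to [rep1, rep2, rep1, rep2, ...] so the replicates stay interleaved.
col_get = [name for reps in pairs.values() for name in reps]

# Made-up TPM values, purely for illustration.
df = pd.DataFrame({
    "leaf1": [0.0, 5.0], "leaf2": [2.0, 5.0],
    "metaphaseI_1": [1.0, 0.05], "metaphaseI_2": [3.0, 0.05],
    "pollen1": [10.0, 0.0], "pollen2": [20.0, 0.0],
})

# One log1p call over every replicate column at once.
arr_log = np.log1p(df[col_get].to_numpy())
print(col_get)
print(arr_log.shape)
```

Because the dict preserves insertion order, even-indexed columns of `arr_log` hold the first replicates and odd-indexed columns hold the second, which is the layout the strided slices in the next step rely on.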
    

    Now you can use np.where for the vectorized comparison and add the new columns with assign:

    df = df.assign(**pd.DataFrame( np.where( (arr_log[:,::2]>0.1)&(arr_log[:,1::2]>0.1), 
                                             (arr_log[:,::2] + arr_log[:,1::2])/2., np.nan), 
                                   columns=col_create, index=df.index))
    print (df)
             a1        a2        b1        b2         a         b
    0  0.533141  0.695231  0.909976  0.441877  0.477569  0.506518
    1  0.961887  0.872382  0.064593  0.030619  0.650559       NaN
    2  0.646332  0.912140  0.615057  0.354700  0.573386  0.391475
    3  0.019646  0.926524  0.160417  0.676512       NaN  0.332748
    4  0.249448  0.474937  0.349048  0.390213  0.305659  0.314428
    5  0.046568  0.985072  0.147037  0.161261       NaN  0.143344
    6  0.812421  0.750128  0.861377  0.765981  0.577176  0.595012
    7  0.950178  0.397550  0.803165  0.156186  0.501321  0.367335
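
To check that the vectorized version agrees with the row-wise `apply` from the question, the steps can be wrapped in a function and compared on small random data. This is only a sketch: `add_replicate_means` and its `threshold` parameter are hypothetical names, and it assumes the replicate columns are called `<name>1` and `<name>2`.

```python
import numpy as np
import pandas as pd


def average_TPM(a, b):
    """Row-wise reference implementation, as in the question."""
    log_a, log_b = np.log(1 + a), np.log(1 + b)
    return np.mean([log_a, log_b]) if log_a > 0.1 and log_b > 0.1 else np.nan


def add_replicate_means(df, col_create, threshold=0.1):
    """Hypothetical helper: vectorized mean of log1p over '<name>1' and '<name>2'."""
    col_get = [f"{col}{i}" for col in col_create for i in range(1, 3)]
    arr_log = np.log1p(df[col_get].to_numpy())
    # Even-indexed columns are replicate 1, odd-indexed columns are replicate 2.
    rep1, rep2 = arr_log[:, ::2], arr_log[:, 1::2]
    keep = (rep1 > threshold) & (rep2 > threshold)
    means = np.where(keep, (rep1 + rep2) / 2.0, np.nan)
    return df.assign(**pd.DataFrame(means, columns=col_create, index=df.index))


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8, 4)), columns=["a1", "a2", "b1", "b2"])

fast = add_replicate_means(df, ["a", "b"])
slow = df.apply(lambda row: average_TPM(row["a1"], row["a2"]), axis=1)

# NaN positions have to match as well, hence equal_nan=True.
print(np.allclose(fast["a"], slow, equal_nan=True))
```

The vectorized path makes one `log1p` call and one `np.where` call over the whole array instead of building a Series for every row, which is where `apply(..., axis=1)` spends most of its time.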
    

    【Discussion】:
