【Question title】: reshaping data frame and applying calculation for each row
【Posted】: 2019-04-18 19:25:58
【Question】:

I have a data frame as follows:

import pandas as pd

df = pd.DataFrame({
    'family': ["A", "A", "B", "B"],
    'V1': [5, 5, 40, 10],
    'V2': [50, 10, 180, 20],
    'gr_0': ["all", "all", "all", "all"],
    'gr_1': ["m1", "m1", "m2", "m3"],
    'gr_2': ["m12", "m12", "m12", "m9"],
    'gr_3': ["NO", "m14", "m15", "NO"],
})

I would like to transform it into the following:

df_new = pd.DataFrame({
    'family': ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    'gr': ["all", "m1", "m12", "m14", "all", "m2", "m3", "m12", "m9", "m15"],
    'calc(sumV2/sumV1)': [6, 6, 6, 2, 4, 4.5, 2, 4.5, 2, 4.5],
})

  family   gr  calc(sumV2/sumV1)
0      A  all                6.0
1      A   m1                6.0
2      A  m12                6.0
3      A  m14                2.0
4      B  all                4.0
5      B   m2                4.5
6      B   m3                2.0
7      B  m12                4.5
8      B   m9                2.0
9      B  m15                4.5

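To get to df_new:

  1. I want one row for each unique combination of "family" and a value from the "gr_" columns.
  2. For each such row, compute sum(V2)/sum(V1) over the matching source rows, as shown in df_new.

I am quite new to Python, and writing this without hard-coding the column values seems complicated to me. Preferably, I don't want the "NO" records listed in df_new, but it is fine if they remain in the output.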

【Comments】:

    Tags: python python-3.x pandas pivot-table pandas-groupby


    【Solution 1】:

    melt + groupby:

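    # wide -> long: one row per (source row, gr_* column); labels go into 'gr'
    v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
    # drop the "NO" placeholder labels
    w = v.loc[v.gr != 'NO']
    # sum only V1 and V2 so the string 'variable' column is not aggregated
    x = w.groupby(['family', 'gr'])[['V1', 'V2']].sum()
    
    (x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')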
    

      family   gr  calc(sumV2/sumV1)
    0      A  all                6.0
    1      A   m1                6.0
    2      A  m12                6.0
    3      A  m14                2.0
    4      B  all                4.0
    5      B  m12                4.5
    6      B  m15                4.5
    7      B   m2                4.5
    8      B   m3                2.0
    9      B   m9                2.0
    

    An approach similar to this answer, but with the advantage of being fully vectorized and avoiding apply.
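The melt + groupby steps can be sketched end to end on the question's sample data. This is only an illustrative sketch: the helper name `ratio_by_group` and its parameters are hypothetical and not part of the original answer, and it assumes "NO" is the only placeholder label to drop.

```python
import pandas as pd


def ratio_by_group(df, id_col="family", num="V2", den="V1", skip="NO"):
    """Hypothetical helper: sum(num) / sum(den) per (id_col, group label)."""
    # Wide to long: one row per (original row, gr_* column).
    long = df.melt(id_vars=[id_col, den, num], value_name="gr")
    # Drop placeholder labels such as "NO".
    long = long[long["gr"] != skip]
    # Sum only the two numeric columns, then divide the sums element-wise.
    sums = long.groupby([id_col, "gr"])[[den, num]].sum()
    return (sums[num] / sums[den]).reset_index(name="calc(sumV2/sumV1)")


# Sample data copied from the question.
df = pd.DataFrame({
    "family": ["A", "A", "B", "B"],
    "V1": [5, 5, 40, 10],
    "V2": [50, 10, 180, 20],
    "gr_0": ["all", "all", "all", "all"],
    "gr_1": ["m1", "m1", "m2", "m3"],
    "gr_2": ["m12", "m12", "m12", "m9"],
    "gr_3": ["NO", "m14", "m15", "NO"],
})

result = ratio_by_group(df)
print(result)
```

The rows come back sorted by family and group label, which differs from the row order in the question's df_new but contains the same pairs and values.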


    Performance

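    import numpy as np
    import pandas as pd

    # 1,000,000 rows of random integers; the gr_* columns are cast to strings
    a = np.random.randint(1, 1000, (1_000_000, 7))
    df = pd.DataFrame(a, columns=['family', 'V1', 'V2', 'gr_0', 'gr_1', 'gr_2', 'gr_3'])
    df[['gr_0', 'gr_1', 'gr_2', 'gr_3']] = df[['gr_0', 'gr_1', 'gr_2', 'gr_3']].astype(str)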
    
    %%timeit
    v = df.melt(id_vars=['family','V1','V2'], value_name='gr')
    w = v.loc[v.gr != 'NO']
    x = w.groupby(['family', 'gr']).sum()    
    (x.V2 / x.V1).reset_index(name='calc(sumV2/sumV1)')
    
    2.71 s ± 32.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
                     .apply(lambda x: x.V2.sum()/x.V1.sum())
                     .reset_index(name='calc(sumV2/sumV1)'))
    df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)
    
    5min 24s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
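Before comparing speed, it can help to confirm that the two approaches return the same numbers. The sketch below is an assumption-laden illustration rather than the benchmark itself: it uses a much smaller random frame (2,000 rows, two gr_ columns, an arbitrary seed) so it runs quickly, and it makes no claim about timings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 2_000  # far smaller than the 1,000,000 rows in the timed benchmark

df = pd.DataFrame({
    "family": rng.integers(1, 20, n),
    "V1": rng.integers(1, 1000, n),
    "V2": rng.integers(1, 1000, n),
    "gr_0": rng.integers(1, 20, n).astype(str),
    "gr_1": rng.integers(1, 20, n).astype(str),
})

long = df.melt(id_vars=["family", "V1", "V2"], value_name="gr")
grouped = long.groupby(["family", "gr"])

# Vectorized: two grouped sums, then a single element-wise division.
sums = grouped[["V1", "V2"]].sum()
fast = sums["V2"] / sums["V1"]

# apply: a Python-level lambda is called once per group.
slow = grouped[["V1", "V2"]].apply(lambda g: g["V2"].sum() / g["V1"].sum())

same = np.allclose(fast.to_numpy(), slow.to_numpy())
print(same)
```

Both results are indexed by the sorted (family, gr) pairs, so comparing the underlying arrays position by position is valid here.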
    

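    【Comments】:

    • @AlexandreNixon Similar in approach, but I added the timings to show why apply should really be avoided. Vectorized operations make a big difference.
    【Solution 2】:

    You can do it like this: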

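    # the parentheses are required for the multi-line method chain to parse
    df_new = (df.melt(id_vars=['family','V1','V2']).groupby(['family','value'])
                    .apply(lambda x: x.V2.sum()/x.V1.sum())
                    .reset_index(name='calc(sumV2/sumV1)'))
    # drop the "NO" placeholder rows after aggregating
    df_new = df_new[df_new.value != 'NO'].reset_index(drop=True)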
    
    print(df_new)
    
         family value  calc(sumV2/sumV1)
    0      A    all           6.0
    1      A    m1            6.0
    2      A    m12           6.0
    3      A    m14           2.0
    4      B    all           4.0
    5      B    m12           4.5
    6      B    m15           4.5
    7      B    m2            4.5
    8      B    m3            2.0
    9      B    m9            2.0
    

    【Comments】:

    • Thanks @Ben.T :)