【问题标题】:Python equivalent for tidyr::complete in R that allows specifying additional valuesR 中 tidyr::complete 的 Python 等效项,允许指定附加值
【发布时间】:2019-09-15 05:25:17
【问题描述】:

我正在寻找重新创建一个 R 脚本,但我被困在如何在 Python 中重新创建这个管道。我正在分析不同工厂的累计生产,需要将它们的累计生产时间归一化以便比较。

管道看起来像这样:

Norm_hrs <- Cum_df%>%
  group_by(Name)%>%
  complete(Cum_hrs = seq(0,max(Cum_hrs),730.5))

需要这个:

Name        Cum_Hrs A   B           C
Factory 1   1       0   1.887861    3.775722
Factory 1   251     0   2104.335728 21932.57871
Factory 1   611     0   2324.586178 37498.99722
Factory 1   1208    0   4361.588197 65235.05541
Factory 2   48      0   1517.840244 6604.770432
Factory 2   163     0   3370.461172 17252.70972
Factory 2   822     0   13284.87786 71918.78308
Factory 2   1541    0   21476.93602 134569.0388
Factory 2   2285    0   32053.99192 225895.1477
Factory 2   3028    0   42299.41357 340798.6151
Factory 2   3699    0   50125.85599 462145.5438
Factory 2   4436    0   56715.74945 584474.9989

然后把它变成这样:

Name        Cum_Hrs A   B           C
Factory 1   1       0   1.887861    3.775722
Factory 1   251     0   2104.335728 21932.57871
Factory 1   611     0   2324.586178 37498.99722
Factory 1   730.5   NA  NA          NA
Factory 1   1208    0   4361.588197 65235.05541
Factory 2   48      0   1517.840244 6604.770432
Factory 2   163     0   3370.461172 17252.70972
Factory 2   730.5   NA  NA          NA
Factory 2   822     0   13284.87786 71918.78308
Factory 2   1461    NA  NA          NA
Factory 2   1541    0   21476.93602 134569.0388
Factory 2   2091.5  NA  NA          NA
Factory 2   2285    0   32053.99192 225895.1477
Factory 2   2922    NA  NA          NA
Factory 2   3028    0   42299.41357 340798.6151

这反过来又允许我在 DataFrame 中插入 NA 的值以实现标准化时间步长

【问题讨论】:

    标签: python r pandas dplyr pandas-groupby


    【解决方案1】:

    简单地将所有唯一 Name 的顺序数据帧与增量 Cum_Hrs 值连接起来:

    seq_df = pd.concat([pd.DataFrame({'Name': i, 'Cum_Hrs': np.arange(0, max(g['Cum_Hrs']), 730.5)})
                         for i,g in df.groupby(['Name'])])
    
    final_df = (pd.concat([df, seq_df], sort=True)
                  .sort_values(['Name', 'Cum_Hrs'])
                  .reset_index(drop=True)
                  .reindex(columns=df.columns)
                )
    
    print(final_df)
    #          Name  Cum_Hrs    A             B              C
    # 0   Factory 1      0.0  NaN           NaN            NaN
    # 1   Factory 1      1.0  0.0      1.887861       3.775722
    # 2   Factory 1    251.0  0.0   2104.335728   21932.578710
    # 3   Factory 1    611.0  0.0   2324.586178   37498.997220
    # 4   Factory 1    730.5  NaN           NaN            NaN
    # 5   Factory 1   1208.0  0.0   4361.588197   65235.055410
    # 6   Factory 2      0.0  NaN           NaN            NaN
    # 7   Factory 2     48.0  0.0   1517.840244    6604.770432
    # 8   Factory 2    163.0  0.0   3370.461172   17252.709720
    # 9   Factory 2    730.5  NaN           NaN            NaN
    # 10  Factory 2    822.0  0.0  13284.877860   71918.783080
    # 11  Factory 2   1461.0  NaN           NaN            NaN
    # 12  Factory 2   1541.0  0.0  21476.936020  134569.038800
    # 13  Factory 2   2191.5  NaN           NaN            NaN
    # 14  Factory 2   2285.0  0.0  32053.991920  225895.147700
    # 15  Factory 2   2922.0  NaN           NaN            NaN
    # 16  Factory 2   3028.0  0.0  42299.413570  340798.615100
    # 17  Factory 2   3652.5  NaN           NaN            NaN
    # 18  Factory 2   3699.0  0.0  50125.855990  462145.543800
    # 19  Factory 2   4383.0  NaN           NaN            NaN
    # 20  Factory 2   4436.0  0.0  56715.749450  584474.998900
    

    在 base R 中可以处理类似的过程。通常将 base R(非 tidyverse)转换为 Pandas 更容易:

    • seq ==> np.arange
    • by ==> pd.DataFrame.groupby
    • data.frame ==> pd.DataFrame
    • do.call + rbind ==> pd.concat
    • order ==> pd.sort_values
    • row.names=NULL ==> pd.reset_index()

    R

    # BUILD SEQUENCE DATA FRAME
    seq_df = do.call(rbind, by(df, df$Name, function(sub) 
                                  data.frame(Name = sub$Name[[1]], 
                                             Cum_Hrs = seq(0, max(sub$Cum_Hrs), 730.5),
                                             A = NA, B = NA, C = NA))
                    )
    
    # CONCATENATE REFERENCING EVERY COLUMN
    final_df = rbind(df, seq_df)
    
    # SORT ROWS AND RESET ROW NAMES
    final_df = with(final_df, data.frame(final_df[order(Name, Cum_Hrs),], row.names=NULL))
    
    final_df
    

    Rextester Demo

    【讨论】:

      【解决方案2】:

      这将是一种解决方法 - 虽然可能不是最直接的:

      reps = (df.groupby("Name")["Cum_Hrs"].agg("max") / 730.5).astype(int)
      
      # reps:
      Name
      Factory1    1
      Factory2    6
      Name: Cum_Hrs, dtype: int64
      

      将此扩展为您要填充的两列,然后将其与原始 df 合并:

      newdf = pd.DataFrame({"Name": np.repeat(reps.index.values, reps.values),
                            "Cum_Hrs": np.hstack([np.arange(e)+1 for e in reps.values]) * 730.5})
      df = pd.concat([df, newdf], ignore_index=True).sort(["Name", "Cum_Hrs"])[df.columns]
      
      # merged df:
              Name  Cum_Hrs    A             B              C
      0   Factory1      1.0  0.0      1.887861       3.775722
      1   Factory1    251.0  0.0   2104.335728   21932.578710
      2   Factory1    611.0  0.0   2324.586178   37498.997220
      12  Factory1    730.5  NaN           NaN            NaN
      3   Factory1   1208.0  0.0   4361.588197   65235.055410
      4   Factory2     48.0  0.0   1517.840244    6604.770432
      5   Factory2    163.0  0.0   3370.461172   17252.709720
      13  Factory2    730.5  NaN           NaN            NaN
      6   Factory2    822.0  0.0  13284.877860   71918.783080
      14  Factory2   1461.0  NaN           NaN            NaN
      7   Factory2   1541.0  0.0  21476.936020  134569.038800
      15  Factory2   2191.5  NaN           NaN            NaN
      8   Factory2   2285.0  0.0  32053.991920  225895.147700
      16  Factory2   2922.0  NaN           NaN            NaN
      9   Factory2   3028.0  0.0  42299.413570  340798.615100
      17  Factory2   3652.5  NaN           NaN            NaN
      10  Factory2   3699.0  0.0  50125.855990  462145.543800
      18  Factory2   4383.0  NaN           NaN            NaN
      11  Factory2   4436.0  0.0  56715.749450  584474.998900
      

      【讨论】:

        【解决方案3】:

        使用datar很容易在python中复制:

        >>> from datar.all import f, tribble, group_by, complete, seq, max, as_integer
        [2021-06-24 16:15:59][datar][WARNING] Builtin name "max" has been overriden by datar.
        >>> 
        >>> Cum_df = tribble(
        ...   f.Name,      f.Cum_hrs, f.A,   f.B,           f.C,
        ...   "Factory 1", 1,         0,     1.887861,      3.775722,
        ...   "Factory 1", 251,       0,     2104.335728,   21932.57871,
        ...   "Factory 1", 611,       0,     2324.586178,   37498.99722,
        ...   "Factory 1", 1208,      0,     4361.588197,   65235.05541,
        ...   "Factory 2", 48,        0,     1517.840244,   6604.770432,
        ...   "Factory 2", 163,       0,     3370.461172,   17252.70972,
        ...   "Factory 2", 822,       0,     13284.87786,   71918.78308,
        ...   "Factory 2", 1541,      0,     21476.93602,   134569.0388,
        ...   "Factory 2", 2285,      0,     32053.99192,   225895.1477,
        ...   "Factory 2", 3028,      0,     42299.41357,   340798.6151,
        ...   "Factory 2", 3699,      0,     50125.85599,   462145.5438,
        ...   "Factory 2", 4436,      0,     56715.74945,   584474.9989,
        ... )
        >>> 
        >>> Cum_df >> \
        ...   group_by(f.Name) >> \
        ...   complete(Cum_hrs=as_integer(seq(0, max(f.Cum_hrs), 730.5)))
                 Name   Cum_hrs         A             B              C
             <object> <float64> <float64>     <float64>      <float64>
        0   Factory 1       0.0       NaN           NaN            NaN
        1   Factory 1     730.0       NaN           NaN            NaN
        2   Factory 2       0.0       NaN           NaN            NaN
        3   Factory 2     730.0       NaN           NaN            NaN
        4   Factory 2    1461.0       NaN           NaN            NaN
        5   Factory 2    2191.0       NaN           NaN            NaN
        6   Factory 2    2922.0       NaN           NaN            NaN
        7   Factory 2    3652.0       NaN           NaN            NaN
        8   Factory 2    4383.0       NaN           NaN            NaN
        9   Factory 1       1.0       0.0      1.887861       3.775722
        10  Factory 1     251.0       0.0   2104.335728   21932.578710
        11  Factory 1     611.0       0.0   2324.586178   37498.997220
        12  Factory 1    1208.0       0.0   4361.588197   65235.055410
        13  Factory 2      48.0       0.0   1517.840244    6604.770432
        14  Factory 2     163.0       0.0   3370.461172   17252.709720
        15  Factory 2     822.0       0.0  13284.877860   71918.783080
        16  Factory 2    1541.0       0.0  21476.936020  134569.038800
        17  Factory 2    2285.0       0.0  32053.991920  225895.147700
        18  Factory 2    3028.0       0.0  42299.413570  340798.615100
        19  Factory 2    3699.0       0.0  50125.855990  462145.543800
        20  Factory 2    4436.0       0.0  56715.749450  584474.998900
        
        [Groups: Name (n=2)]
        

        免责声明:我是datar 包的作者。

        【讨论】:

          猜你喜欢
          • 2023-04-01
          • 2019-07-01
          • 2017-03-27
          • 1970-01-01
          • 2021-11-04
          • 2020-02-25
          • 1970-01-01
          • 2016-01-03
          • 2017-09-14
          相关资源
          最近更新 更多