【Question title】: How to expand data frame based on values? [duplicate]
【Posted】: 2017-10-15 09:29:45
【Question】:

I have the following input data frame:

df <- data.frame(x=c('a','b','c'),y=c(4,5,6),from=c(1,2,3),to=c(2,4,6))  
df
  x y  from to
1 a 4  1    2
2 b 5  2    4
3 c 6  3    6

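Now I want to repeat each row once for every value between from and to. For example, ('a', 4) spans two rows, with z = 1 and z = 2. The expected result looks like this: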

exp <- data.frame(x=c('a','a','b','b','b','c','c','c','c'),
                  y=c(4,4,5,5,5,6,6,6,6),
                  z=c(1,2,2,3,4,3,4,5,6))
exp
  x y z
1 a 4 1
2 a 4 2
3 b 5 2
4 b 5 3
5 b 5 4
6 c 6 3
7 c 6 4
8 c 6 5
9 c 6 6

What is the most idiomatic way to do this without a loop?

【Comments】:

    Tags: r dataframe


    【Solution 1】:

    A "non-tidyverse" way:

    data.frame(
      x = c('a', 'b', 'c'),
      y = c(4, 5, 6),
      from = c(1, 2, 3),
      to = c(2, 4, 6),
      stringsAsFactors = FALSE
    ) -> xdf
    
    do.call(rbind.data.frame, lapply(seq_len(nrow(xdf)), function(i) {
      data.frame(x = xdf$x[i], y=xdf$y[i], z=xdf$from[i]:xdf$to[i], stringsAsFactors=FALSE)
    }))
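
The same expansion can also be written without building one data frame per row. The following is a minimal base R sketch, not part of the original answer, that assumes `from <= to` in every row; the helper names `n`, `idx`, and `res` are illustrative only. It repeats each row index as many times as the row's range is long, then concatenates the ranges.

```r
xdf <- data.frame(
  x = c("a", "b", "c"),
  y = c(4, 5, 6),
  from = c(1, 2, 3),
  to = c(2, 4, 6),
  stringsAsFactors = FALSE
)

# Number of output rows produced by each input row (assumes from <= to).
n <- xdf$to - xdf$from + 1

# Input row index, repeated once per output row: 1 1 2 2 2 3 3 3 3.
idx <- rep(seq_len(nrow(xdf)), n)

res <- data.frame(
  x = xdf$x[idx],
  y = xdf$y[idx],
  z = unlist(Map(seq, xdf$from, xdf$to)),  # per-row ranges, concatenated
  stringsAsFactors = FALSE
)
res
```

`Map()` still iterates over the rows internally, but it only creates short integer vectors, so the per-row `data.frame()` calls and the final `rbind` are avoided.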
    

    A "tidyverse" way:

    library(tidyverse)
    
    tibble(   # data_frame() is deprecated in favour of tibble()
      x = c('a', 'b', 'c'),
      y = c(4, 5, 6),
      from = c(1, 2, 3),
      to = c(2, 4, 6)
    ) -> xdf
    
    rowwise(xdf) %>% 
      do(tibble(x = .$x, y=.$y, z=.$from:.$to))
    

    Another "tidyverse" way, which is not among the approaches benchmarked below:

    xdf %>% 
      rowwise() %>% 
      do( merge( as_tibble(.), tibble(z=.$from:.$to), by=NULL) ) %>%
      select( -from, -to )     # Omit this line if you want to keep all original columns.
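
If the goal is to keep every original column without naming them, a list-column followed by `tidyr::unnest()` is one possible alternative. This is a hedged sketch rather than the answerer's method; it assumes the dplyr and tidyr packages are installed and that `from <= to` in every row, and `res` is an illustrative name.

```r
library(dplyr)
library(tidyr)

xdf <- tibble(
  x = c("a", "b", "c"),
  y = c(4, 5, 6),
  from = c(1, 2, 3),
  to = c(2, 4, 6)
)

res <- xdf %>%
  mutate(z = Map(seq, from, to)) %>%  # list-column: one integer range per row
  unnest(z)                           # one output row per element of each range
res
```

All of `x`, `y`, `from`, and `to` survive the expansion; a trailing `select(-from, -to)` would drop the bounds if they are not wanted.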
    

    Since you asked about performance:

    library(microbenchmark)
    
    data.table::data.table(
      x = c('a','b','c'),
      y = c(4,5,6),
      from = c(1,2,3),
      to = c(2,4,6)
    ) -> xdt1
    
    data.frame(
      x = c('a', 'b', 'c'),
      y = c(4, 5, 6),
      from = c(1, 2, 3),
      to = c(2, 4, 6),
      stringsAsFactors = FALSE
    ) -> xdf1 
    

    data.table operations often modify in place, so to keep a level playing field, each data frame/table is copied before the operations are performed.

    On most modern systems, the time penalty for this is about 100 nanoseconds.
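
One caveat is worth checking before relying on the benchmark code that follows: for a data.table, plain assignment with `<-` binds a second name to the same table rather than copying it, so a later `:=` is visible through both names. The sketch below assumes the data.table package is installed and uses illustrative names (`a`, `b`, `d`); it contrasts plain assignment with `data.table::copy()`, which returns an independent table.

```r
library(data.table)

a <- data.table(x = c("a", "b"), from = c(1, 2), to = c(2, 4))

b <- a                         # no copy: a and b refer to the same table
b[, diff := (to - from) + 1]   # := adds the column by reference
alias_changed <- "diff" %in% names(a)

d <- copy(a)                   # explicit, independent copy
d[, z := 0]
copy_changed <- "z" %in% names(a)

alias_changed   # TRUE: the change made through b is visible in a
copy_changed    # FALSE: the change made to d did not touch a
```

If an independent table per benchmark iteration is wanted, `copy()` is the documented way to get one, at the cost of the copy itself being included in the timing.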

    microbenchmark(
    
      data.table = {
        xdt2 <- xdt1
        xdt2[, diff:= (to - from) + 1]
        xdt2 <- xdt2[rep(1:.N, diff)]
        xdt2[,z := seq(from,to), by=.(x,y,from,to)]
        xdt2[,c("x", "y", "z")]
      }, 
    
      base = {
        xdf2 <- xdf1
        do.call(rbind.data.frame, lapply(1:nrow(xdf2), function(i) {
          data.frame(x = xdf2$x[i], y=xdf2$y[i], z=xdf2$from[i]:xdf2$to[i], stringsAsFactors=FALSE)
        }))
      }, 
    
      tidyverse = {
        xdf2 <- xdf1
        dplyr::rowwise(xdf2) %>% 
          dplyr::do(dplyr::data_frame(x = .$x, y=.$y, z=.$from:.$to))
      }, 
    
      plyr = {
        xdf2 <- xdf1
        plyr::mdply(xdf2, function(x,y,from,to) data.frame(x,y,z=seq(from,to)))[c("x","y","z")]
      },
    
      times = 1000
    
    )
    ## Unit: microseconds
    ##        expr       min         lq       mean    median         uq        max neval
    ##  data.table   920.361  1072.9265  1257.2321  1178.832  1280.2660  10628.552  1000
    ##        base   677.069   761.3145   884.4136   825.472   915.8985   5366.515  1000
    ##   tidyverse 15926.127 17231.5015 19201.4798 17994.919 20014.4140 166901.570  1000
    ##        plyr  1938.838  2196.4205  2448.5314  2322.949  2501.5075   5735.255  1000
    

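    【Comments】:

    • For the "tidyverse" way, you keep columns by naming them explicitly (x = .$x, y=.$y) and add a new column (z=.$from:.$to). Do you know how to keep all existing columns and append the new column z without explicitly naming the columns to keep? That is, add a column the way mutate does, but repeat the rows when the new variable is a vector... Thanks for your help!
    • This almost does it: dplyr::starwars[1:2,1] %>% rowwise() %>% do( expand.grid( ., z = 1:2 )) except that the first column comes back with type list, holding a bunch of length-1 lists...
    【Solution 2】: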

    You can use data.table:

    library(data.table)    
    df <- data.table(x=c('a','b','c'),y=c(4,5,6),from=c(1,2,3),to=c(2,4,6))  
    df <- df[, diff:= (to - from) + 1]
    
    df <- df[rep(1:.N,diff)]
    df <- df[,z := seq(from,to) , by=.(x,y,from,to)]
    df
    
    > df
       x y from to diff z
    1: a 4    1  2    2 1
    2: a 4    1  2    2 2
    3: b 5    2  4    3 2
    4: b 5    2  4    3 3
    5: b 5    2  4    3 4
    6: c 6    3  6    4 3
    7: c 6    3  6    4 4
    8: c 6    3  6    4 5
    9: c 6    3  6    4 6
    

    【Comments】:

      【Solution 3】:

      I know this question has already been answered, but a one-line data.table solution is:

      library(data.table)
      setDT(df)[,.(z = from:to), by = .(x,y)]
      
      #   x y z
      #1: a 4 1
      #2: a 4 2
      #3: b 5 2
      #4: b 5 3
      #5: b 5 4
      #6: c 6 3
      #7: c 6 4
      #8: c 6 5
      #9: c 6 6
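
The one-liner groups by `.(x, y)`, so it relies on every `(x, y)` pair being unique; with duplicates, `from` and `to` would be vectors of length greater than one inside a group, which is not what `from:to` is meant to receive. A hedged variant, assuming the data.table package is installed, groups by a row index instead; the column name `rid` and the sample table `dt` are illustrative only.

```r
library(data.table)

# Two rows share the same (x, y), so grouping by .(x, y) would merge them.
dt <- data.table(x = c("a", "a"), y = c(4, 4), from = c(1, 5), to = c(2, 6))

# Group by a per-row index so each row is expanded on its own.
res <- dt[, .(x, y, z = seq(from, to)), by = .(rid = seq_len(nrow(dt)))]
res[, rid := NULL]   # drop the helper grouping column
res
```

Here the first row expands to z = 1, 2 and the second to z = 5, 6, even though both rows have the same `x` and `y`.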
      

      【Comments】:

        【Solution 4】:

        You can use the plyr package; there is a concise solution using mdply:

        library(plyr)
        df <- data.frame(x=c('a','b','c'),y=c(4,5,6),from=c(1,2,3),to=c(2,4,6)) 
        res <- mdply(df, function(x,y,from,to) data.frame(x,y,z=seq(from,to)))[c("x","y","z")]
        res
          x y z
        1 a 4 1
        2 a 4 2
        3 b 5 2
        4 b 5 3
        5 b 5 4
        6 c 6 3
        7 c 6 4
        8 c 6 5
        9 c 6 6
        

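        Since it creates a data frame for every row, it may not be super efficient... or is it?

        【Comments】:

        • Added a benchmark to my answer for the current set of solutions across the answers.
        • Thanks a lot!!!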