【问题标题】:Multiply each row of one dataframe by all rows of a second dataframe将一个数据帧的每一行乘以第二个数据帧的所有行
【发布时间】:2020-05-30 02:46:48
【问题描述】:

由于我的数据集非常大,我在操作上苦苦挣扎,我提供了一个我想要的示例。

我有两个数据框。

df1 - 包含定义为列名(10,000 行)的变量的每个参数的采样派生迭代

df2 - 包含定义为列名的每个变量的实际值(4,000 行)

我想要一个 df3,它实际上是 df2 的每一行乘以 df1,因此是 4000*10000 行

作为一个简短的示例,我提供了 df1 和 df2 的最小示例。我已经提供了我将在 df3 中查看的输出。

df1 <- structure(list(intercept = c(3.4, 3.6, 3.7), age = c(0.08, 0.05, 
0.06), male = c(0.07, 0.06, 0.07)), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(id = structure(1:2, .Label = c("a", "b"), class = "factor"), 
intercept = c(1L, 1L), age = c(40L, 45L), male = 1:0), class = "data.frame", row.names = c(NA, 
-2L))

df3 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"b"), class = "factor"), intercept = c(3.4, 3.6, 3.7, 3.4, 3.6, 
3.7), age = c(3.2, 2, 2.4, 3.6, 2.25, 2.7), male = c(0.07, 0.06, 
0.07, 0, 0, 0)), class = "data.frame", row.names = c(NA, -6L))

有人可以指出在 R 中执行此操作的有效方法吗?

【问题讨论】:

  • 谁来做基准测试? :)
  • @Sotos 是的,我在答案中添加了基准测试
  • @Sotos 和所有其他人 - 非常感谢!

标签: r


【解决方案1】:

另一个想法是使用outer,通过base R,

data.frame(id = rep(df2$id, each = nrow(df1)), 
           mapply(function(x, y)c(outer(x, y, `*`)), df1, df2[-1])
           )

给出,

  id intercept  age male
1  a       3.4 3.20 0.07
2  a       3.6 2.00 0.06
3  a       3.7 2.40 0.07
4  b       3.4 3.60 0.00
5  b       3.6 2.25 0.00
6  b       3.7 2.70 0.00

【讨论】:

    【解决方案2】:

    您可以像下面那样执行逐行 Kronecker 乘积(来自包 MGLM

    out <- data.frame(id = rep(df2$id,each=nrow(df1)),
                      t(MGLM::kr(t(df2[-1]),t(df1))))
    

    这样

    > out
      id intercept  age male
    1  a       3.4 3.20 0.07
    2  a       3.6 2.00 0.06
    3  a       3.7 2.40 0.07
    4  b       3.4 3.60 0.00
    5  b       3.6 2.25 0.00
    6  b       3.7 2.70 0.00
    

    基准测试(到目前为止,@Sotos 的方法是赢家)

    df1 <- do.call(rbind,replicate(500,structure(list(intercept = c(3.4, 3.6, 3.7), age = c(0.08, 0.05, 
                                                                0.06), male = c(0.07, 0.06, 0.07)), class = "data.frame", row.names = c(NA, 
                                                                                                                                        -3L)),simplify = F))
    
    df2 <- do.call(rbind,replicate(100,structure(list(id = structure(1:2, .Label = c("a", "b"), class = "factor"), 
                          intercept = c(1L, 1L), age = c(40L, 45L), male = 1:0), class = "data.frame", row.names = c(NA, 
                                                                                                                     -2L)),simplify = F))
    
    library(MGLM)
    library(purrr)
    
    f_ThomasIsCoding <- function() {
      data.frame(id = rep(df2$id,each=nrow(df1)),
                        t(MGLM::kr(t(df2[-1]),t(df1))))
    }
    
    f_tmfmnk_1 <- function() {
      map_dfr(.x = asplit(df2[-1], 1), ~ sweep(df1, 2, FUN = `*`, .x))
    }
    
    f_tmfmnk_2 <- function() {
      data.frame(do.call(rbind, lapply(asplit(df2[-1], 1), function(x) sweep(df1, 2, FUN = `*`, x))),
                 id = rep(df2$id, each = nrow(df1)))
    }
    
    f_RonakShah <- function() {
      new1 <- df1[rep(seq(nrow(df1)), nrow(df2)), ] 
      new2 <- df2[rep(seq(nrow(df2)), each = nrow(df1)),]
      out <- cbind(new2[1], new1 * new2[-1])
      rownames(out) <- NULL
      out
    }
    
    f_Sotos <- function() {
      data.frame(id = rep(df2$id, each = nrow(df1)), 
                 mapply(function(x, y)c(outer(x, y, `*`)), df1, df2[-1])
      )
    }
    
    bmk <- microbenchmark(times = 20,
                   unit = "relative",
                   f_ThomasIsCoding(),
                   f_tmfmnk_1(),
                   f_tmfmnk_2(),
                   f_RonakShah(),
                   f_Sotos())
    

    给了

    > bmk
    Unit: relative
                   expr       min        lq      mean    median       uq       max neval
     f_ThomasIsCoding()  1.186124  1.218201  1.197346  1.321731 1.042721  1.077854    20
           f_tmfmnk_1()  7.594520  7.572723  4.539698  7.297610 2.437621  3.446436    20
           f_tmfmnk_2()  9.670286 12.212220  6.583183 11.888061 3.370593  4.088534    20
          f_RonakShah() 28.918724 28.861437 16.707258 27.889563 8.403161 11.668252    20
              f_Sotos()  1.000000  1.000000  1.000000  1.000000 1.000000  1.000000    20
    

    【讨论】:

    • 哇哦!没想到。我确信sweep 会席卷我们所有人:P
    • @Sotos 哈哈,恭喜!
    【解决方案3】:

    您可以根据其他数据框中的行数重复两个数据框中的行并直接相乘

    df1[rep(seq(nrow(df1)), nrow(df2)),] * df2[rep(seq(nrow(df2)), each = nrow(df1)),-1]
    
    #    intercept  age male
    #1         3.4 3.20 0.07
    #2         3.6 2.00 0.06
    #3         3.7 2.40 0.07
    #1.1       3.4 3.60 0.00
    #2.1       3.6 2.25 0.00
    #3.1       3.7 2.70 0.00
    

    还要获取id

    new1 <- df1[rep(seq(nrow(df1)), nrow(df2)), ] 
    new2 <- df2[rep(seq(nrow(df2)), each = nrow(df1)),]
    out <- cbind(new2[1], new1 * new2[-1])
    rownames(out) <- NULL
    
    out
    #  id intercept  age male
    #1  a       3.4 3.20 0.07
    #2  a       3.6 2.00 0.06
    #3  a       3.7 2.40 0.07
    #4  b       3.4 3.60 0.00
    #5  b       3.6 2.25 0.00
    #6  b       3.7 2.70 0.00
    

    【讨论】:

      【解决方案4】:

      涉及purrr 的一个选项可能是:

      map_dfr(.x = asplit(df2[-1], 1), ~ sweep(df1, 2, FUN = `*`, .x))
      
        intercept  age male
      1       3.4 3.20 0.07
      2       3.6 2.00 0.06
      3       3.7 2.40 0.07
      4       3.4 3.60 0.00
      5       3.6 2.25 0.00
      6       3.7 2.70 0.00
      

      如果 id 列也很重要:

      data.frame(map_dfr(.x = asplit(df2[-1], 1), ~ sweep(df1, 2, FUN = `*`, .x)),
                 id = rep(df2$id, each = nrow(df1)))
      
        intercept  age male id
      1       3.4 3.20 0.07  a
      2       3.6 2.00 0.06  a
      3       3.7 2.40 0.07  a
      4       3.4 3.60 0.00  b
      5       3.6 2.25 0.00  b
      6       3.7 2.70 0.00  b
      

      base R:

      do.call(rbind, lapply(asplit(df2[-1], 1), function(x) sweep(df1, 2, FUN = `*`, x)))
      

      或者:

      data.frame(do.call(rbind, lapply(asplit(df2[-1], 1), function(x) sweep(df1, 2, FUN = `*`, x))),
                 id = rep(df2$id, each = nrow(df1)))
      

      【讨论】:

      • 或类似:do.call(rbind, apply(df1, 1, function(x) sweep(df2[-1], 2, x, `*`)))
      猜你喜欢
      • 1970-01-01
      • 2019-08-24
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2020-05-22
      • 2022-09-30
      • 2015-08-28
      相关资源
      最近更新 更多