Multiply two data.tables, keeping all possibilities: answers

【Question title】: multiply two data.tables, keep all possibilities
【Posted】: 2016-12-21 12:08:20
【Question description】:

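I could not find a duplicate of this question.

My problem is as follows:

I have two data.tables. One has two columns (featurea, count), the other has three columns (featureb, featurec, count; called origin and color in the example below). I want to multiply (?) them so that I get a new data.table containing every combination of rows. The catch is that the features do not match between the two tables, so a merge solution may not do the trick.

An MRE follows: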

library(data.table)

# two columns
DT1 <- data.table(featurea =c("type1","type2"), count = c(2,3))

#       featurea count
#1:    type1     2
#2:    type2     3

# three columns
DT2 <- data.table(origin =c("house","park","park"), color =c("red","blue","red"),count =c(2,1,2))

#   origin color count
#1:  house   red     2
#2:   park  blue     1
#3:   park   red     2

In this case, my expected result is a data.table like the following:

> DT3
   origin color featurea total
1:  house   red    type1     4
2:  house   red    type2     6
3:   park  blue    type1     2
4:   park  blue    type2     3
5:   park   red    type1     4
6:   park   red    type2     6

【Question comments】:

Would DT2[, .(featurea = DT1[["featurea"]], count = count * DT1[["count"]]), by = .(origin, color)] be efficient enough?
@Roland It seems so, and it looks like the best answer, so you should post it as one.

Tags: r data.table

【Solution 1】:

Please test this on larger data; I am not sure how well it is optimized:

DT2[, .(featurea = DT1[["featurea"]], 
        count = count * DT1[["count"]]), by = .(origin, color)]
#   origin color featurea count
#1:  house   red    type1     4
#2:  house   red    type2     6
#3:   park  blue    type1     2
#4:   park  blue    type2     3
#5:   park   red    type1     4
#6:   park   red    type2     6

If DT1 has fewer groups, switching the two tables around might be more efficient:

DT1[, c(DT2[, .(origin, color)], 
        .(count = count * DT2[["count"]])), by = featurea]
#   featurea origin color count
#1:    type1  house   red     4
#2:    type1   park  blue     2
#3:    type1   park   red     4
#4:    type2  house   red     6
#5:    type2   park  blue     3
#6:    type2   park   red     6
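The answer asks for a test on larger data, so here is a minimal sketch of one way to do that. The tables big1 and big2, their sizes, and their value ranges are made up for illustration and are not from the question. The sketch also assumes that each (origin, color) pair and each featurea value occurs only once, which is what both variants above rely on when they multiply count inside each group.

```r
library(data.table)

# Hypothetical larger inputs; sizes and value ranges are arbitrary assumptions.
set.seed(1)
big1 <- data.table(featurea = paste0("type", 1:50),
                   count = sample(1:5, 50, replace = TRUE))
big2 <- unique(data.table(origin = sample(letters, 5000, replace = TRUE),
                          color = sample(LETTERS, 5000, replace = TRUE)))
big2[, count := sample(1:5, .N, replace = TRUE)]

# Variant A: group by the rows of big2 (many small groups).
t_a <- system.time(
  res_a <- big2[, .(featurea = big1[["featurea"]],
                    count = count * big1[["count"]]), by = .(origin, color)]
)

# Variant B: group by the rows of big1 (few large groups).
t_b <- system.time(
  res_b <- big1[, c(big2[, .(origin, color)],
                    .(count = count * big2[["count"]])), by = featurea]
)

# Bring both results into the same column and row order before comparing.
setcolorder(res_b, names(res_a))
setorder(res_a, origin, color, featurea)
setorder(res_b, origin, color, featurea)

# Elapsed seconds for each variant; the numbers depend on the machine.
print(c(variant_a = t_a[["elapsed"]], variant_b = t_b[["elapsed"]]))
```

Both variants should return the same rows, so the timing comparison is only meaningful once the two results have been checked for equality, for example with all.equal(res_a, res_b).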

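【Comments】:

【Solution 2】:

Here is one way. First, I expanded the rows of DT2 using expandRows() from the splitstackshape package. Since I specified count = nrow(DT1) (2 in this example) and count.is.col = FALSE, each row is repeated twice. Then I handled the multiplication and created a new column called total. At the same time, I created a new column for featurea. Finally, I dropped count.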

library(data.table)
library(splitstackshape)

expandRows(DT2, count = nrow(DT1), count.is.col = FALSE)[,
    `:=` (total = count * DT1[, count], featurea = DT1[, featurea])][, count := NULL]

Edit

If you do not want to add another package, you can try the idea from David's comment.

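# each = nrow(DT1) keeps the copies of every DT2 row adjacent,
# so the recycled DT1 columns line up as in the output below
DT2[rep(1:.N, each = nrow(DT1))][,
   `:=`(total = count * DT1$count, featurea = DT1$featurea, count = NULL)][]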

#   origin color total featurea
#1:  house   red     4    type1
#2:  house   red     6    type2
#3:   park  blue     2    type1
#4:   park  blue     3    type2
#5:   park   red     4    type1
#6:   park   red     6    type2
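Both versions above depend on the row order of the expanded table matching the way R recycles DT1's columns. As an alternative sketch that does not rely on recycling, the row pairs can be built explicitly with data.table's CJ() and then used as indices. The names idx, r2, and r1 are illustrative and do not come from the answer.

```r
library(data.table)

DT1 <- data.table(featurea = c("type1", "type2"), count = c(2, 3))
DT2 <- data.table(origin = c("house", "park", "park"),
                  color = c("red", "blue", "red"), count = c(2, 1, 2))

# Every pair of row indices: r2 indexes DT2, r1 indexes DT1 (r1 varies fastest).
idx <- CJ(r2 = seq_len(nrow(DT2)), r1 = seq_len(nrow(DT1)))

# Look up each column through the explicit indices and multiply the counts.
DT3 <- data.table(
  origin   = DT2$origin[idx$r2],
  color    = DT2$color[idx$r2],
  featurea = DT1$featurea[idx$r1],
  total    = DT2$count[idx$r2] * DT1$count[idx$r1]
)
print(DT3)
```

Because every output row names its source rows directly, this form keeps working if DT2 contains repeated (origin, color) pairs, at the cost of being more verbose.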

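【Comments】:

@DavidArenburg Yes, I agree with you. If the OP provides a more detailed example, this idea will need to be revised. nrow(DT1) is a good idea.
@jazzurro What would a more comprehensive example need? My dataset is much larger than this and does not have the same column names. I upvoted anyway.
@erasmortg I did not mean that I need the whole dataset. Sorry for the confusion.
@DavidArenburg I initially defined the groups by origin. But I think the groups should be defined by origin and color. Maybe that is what the OP has in mind.
@DavidArenburg It does work, but not in this case, since I do not have any additional libraries available at the moment. If you would like to post your data.table solution as an answer, I can implement it right away.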

【Solution 3】:

A solution using dplyr

library(dplyr)
library(data.table)

DT1 <- data.table(featurea =c("type1","type2"), count = c(2,3))
DT2 <- data.table(origin =c("house","park","park"), color =c("red","blue","red"),count =c(2,1,2))

Create a dummy column to use for the inner join (I called it key):

inner_join(DT1 %>% mutate(key=1), 
          DT2 %>% mutate(key=1), by="key") %>% 
mutate(total=count.x*count.y) %>% 
select(origin, color, featurea, total) %>% 
arrange(origin, color)
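For readers who would rather stay in data.table, the same dummy-key idea can be sketched with merge(). This is not part of the answer above: the helper column k and the intermediate names A, B, and res are made up for illustration. allow.cartesian = TRUE is set because the join deliberately returns more rows than either input.

```r
library(data.table)

DT1 <- data.table(featurea = c("type1", "type2"), count = c(2, 3))
DT2 <- data.table(origin = c("house", "park", "park"),
                  color = c("red", "blue", "red"), count = c(2, 1, 2))

# copy() so that adding the dummy key does not modify DT1 and DT2 by reference.
A <- copy(DT1)[, k := 1L]
B <- copy(DT2)[, k := 1L]

# Every row of B matches every row of A on the constant key.
res <- merge(B, A, by = "k", allow.cartesian = TRUE, suffixes = c(".x", ".y"))

# Multiply the two count columns, then keep and order the requested columns.
res[, total := count.x * count.y]
res <- res[, .(origin, color, featurea, total)]
setorder(res, origin, color, featurea)
print(res)
```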

【Comments】: