R item combinations, groups of 3: answers

【Question title】: R item combinations group of 3
【Posted】: 2016-03-17 21:39:49
【Question】:

Here is a solution for finding pairs, but what about triplets?

If I have:

consumer=c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items=c("apple","banana","carrot","date","eggplant","apple","banana",
        "fig","grape","apple","banana","apple","carrot","date",
        "eggplant","apple")
shoppinglists <- data.frame(consumer,items)
table(shoppinglists)

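Is there an easy way to find the most frequent triplet combinations? For example, the triplets "carrot" + "date" + "eggplant", "apple" + "carrot" + "date", "apple" + "carrot" + "eggplant" and "apple" + "date" + "eggplant" each appear in two lists (consumers 1 and 4).

You can see there is a big tie for second place, with one occurrence each: A+B+C, A+B+D, A+B+E, B+C+D, B+C+E (consumer 1); A+B+F, A+B+G (consumer 2).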

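【Question discussion】:

I would like only triplets to be returned, with the result giving me a list of triplets and the number of times each one occurs.

Tags: r count combinations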

【Solution 1】:

Here is a data.table answer, which is easy to extend to quadruplets and so on:

library(data.table); setDT(shoppinglists)

# skip consumers who bought fewer than 3 items
shoppinglists[ , if (.N >= 3L)
  .(triplet =
      # get the combinations 3 at a time;
      #   keep them as a list (simplify = FALSE)
      #   for easy post-manipulation with sapply
      sapply(combn(items, 3L, simplify = FALSE),
             # there should be a better way than paste-collapse...
             paste, collapse = ",")),
  by = consumer
  # now count the total frequency of each triplet
  ][ , .N, by = triplet
     # and sort to see the most frequent
     ][order(-N)]
#                    triplet N
#  1:      apple,carrot,date 2
#  2:  apple,carrot,eggplant 2
#  3:    apple,date,eggplant 2
#  4:   carrot,date,eggplant 2
#  5:    apple,banana,carrot 1
#  6:      apple,banana,date 1
#  7:  apple,banana,eggplant 1
#  8:     banana,carrot,date 1
#  9: banana,carrot,eggplant 1
# 10:   banana,date,eggplant 1
# 11:       apple,banana,fig 1
# 12:     apple,banana,grape 1
# 13:        apple,fig,grape 1
# 14:       banana,fig,grape 1

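For pairs, we can use combn(items, 2L); for quadruplets, combn(items, 4L), and so on (adjusting the .N >= 3L check to match, since combn fails when a consumer has fewer items than the group size).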
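The extension to other group sizes can be sketched as a small wrapper. This is an illustration under assumptions, not part of the original answer: the function name count_itemsets and its argument k are hypothetical, and the items are sorted within each consumer so that the same set always yields the same pasted key.

```r
library(data.table)

consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")
shoppinglists <- data.table(consumer, items)

# Hypothetical helper: count how many consumers share each k-item set.
count_itemsets <- function(dt, k) {
  dt[, if (.N >= k)                      # skip consumers with fewer than k items
         .(itemset = combn(sort(items), k, FUN = paste, collapse = ",")),
     by = consumer
  ][, .N, by = itemset                   # frequency of each itemset
  ][order(-N)]                           # most frequent first
}

pairs    <- count_itemsets(shoppinglists, 2L)
triplets <- count_itemsets(shoppinglists, 3L)
quads    <- count_itemsets(shoppinglists, 4L)
```

Passing FUN = paste with collapse = "," straight to combn makes it return the pasted keys directly, so no separate sapply step is needed.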

Replace order(-N) with N == max(N) to drop everything except the most frequent triplets.
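A filter of the form N == max(N) keeps every tied winner, not just the first row. The same count-then-filter idea can be sketched in base R, without data.table; the names baskets, triplet_keys, counts and top are illustrative and do not come from the answer.

```r
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# One character vector of items per consumer
baskets <- split(items, consumer)

# All 3-item combinations per basket, each pasted into a single key
triplet_keys <- unlist(lapply(baskets, function(x) {
  x <- sort(unique(x))                     # canonical order within a basket
  if (length(x) < 3L) return(character(0)) # too few items for a triplet
  combn(x, 3L, FUN = paste, collapse = ",")
}), use.names = FALSE)

counts <- table(triplet_keys)              # occurrences of each triplet
top <- counts[counts == max(counts)]       # keep all tied most-frequent triplets
top
```

For the shopping lists in the question, top holds the four triplets shared by consumers 1 and 4.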

I wish we didn't have to paste-collapse this. I was hoping list() would work, but grouping by a list column evidently doesn't.
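One possible way around the paste-collapse step, offered as a sketch and not as the answerer's method, is to return each combination as three ordinary columns and group by all three. V1, V2 and V3 are the default names as.data.table() assigns to the columns of an unnamed matrix; sorting the items first is an added assumption so that a given triplet always lands in the same column order.

```r
library(data.table)

consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")
shoppinglists <- data.table(consumer, items)

triplets_wide <- shoppinglists[
  , if (.N >= 3L)
      # combn() returns a 3 x n matrix; transpose so each row is one triplet
      as.data.table(t(combn(sort(items), 3L))),
  by = consumer
][, .N, by = .(V1, V2, V3)   # group on the three item columns together
][order(-N)]

triplets_wide
```

Keeping the items in separate columns also makes it straightforward to filter afterwards, for example to the triplets whose V1 is "apple".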

【Discussion】:

Here is a great tutorial for data.table, for anyone who, like me, wants to understand how this answer works!
@gregorio099 Better still is the list of tutorials on GitHub.

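【Solution 2】:

You could use the arules package. It is worth exploring if you do a lot of this kind of work, since it:

Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). Also provides interfaces to C implementations of the association mining algorithms Apriori and Eclat by C. Borgelt.

Here is a solution using the eclat algorithm: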

# Set up the object you'll pass to eclat:
tbl <- table(shoppinglists)
itemList <- matrix(tbl)
dim(itemList) <- dim(tbl)
colnames(itemList) <- colnames(tbl)
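As an alternative to reshaping the table by hand, a list of per-consumer item vectors can be coerced to the arules transactions class. This is a sketch assuming the arules package is installed; trans is an illustrative name, and the items are defined here as a plain character vector.

```r
library(arules)

consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

# One character vector of items per consumer, coerced to transactions
trans <- as(split(items, consumer), "transactions")

length(trans)      # number of transactions (one per consumer)
size(trans)        # number of items in each transaction
itemLabels(trans)  # distinct item names
```

An object like trans could then be passed to eclat in place of the itemList matrix.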

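Now you can use eclat. Its support parameter specifies the minimum support an itemset needs in order to be considered frequent. In this case you want everything regardless of frequency, so you can set support to 0. You will get a warning that setting it to 0 may cause you to run out of memory.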

library(arules)
d <- eclat(itemList, parameter = list(minlen = 3, maxlen = 3, support = 0))

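You can build the data.frame you want from the data contained in d. Multiply the support (quality(d)) by the total number of transactions (info(d)$ntransactions) to get the number of transactions for each itemset: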

d2 <- data.frame(items = labels(d), quality(d) * info(d)$ntransactions)
names(d2)[2] <- "N" # to rename from "support" to "N"
d2
#                      items N
#1         {apple,fig,grape} 1
#2        {banana,fig,grape} 1
#3      {apple,banana,grape} 1
#4        {apple,banana,fig} 1
#5     {apple,date,eggplant} 2
#6    {banana,date,eggplant} 1
#7    {carrot,date,eggplant} 2
#8   {apple,carrot,eggplant} 2
#9  {banana,carrot,eggplant} 1
#10  {apple,banana,eggplant} 1
#11      {apple,carrot,date} 2
#12     {banana,carrot,date} 1
#13      {apple,banana,date} 1
#14    {apple,banana,carrot} 1
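The counts above follow from the definition of support as the fraction of transactions that contain an itemset, so support times the number of transactions recovers the count. A base-R check by hand for one triplet, with illustrative variable names:

```r
consumer <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,4,5)
items <- c("apple","banana","carrot","date","eggplant","apple","banana",
           "fig","grape","apple","banana","apple","carrot","date",
           "eggplant","apple")

baskets <- split(items, consumer)          # one item vector per consumer
triplet <- c("apple", "carrot", "date")

# TRUE for each basket that contains all three items
contains <- vapply(baskets, function(b) all(triplet %in% b), logical(1))

support <- mean(contains)                  # fraction of baskets with the triplet
n_transactions <- length(baskets)          # total number of baskets
count <- support * n_transactions          # back to an absolute count
```

Here the triplet appears in the lists of consumers 1 and 4, so the support is 2 out of 5 and the count is 2, matching the N column for {apple,carrot,date}.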

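【Discussion】:

Thanks for the answer, this is interesting too. When I try it with the shopping list example I get the same result as you, but when I use a larger list the eclat function only outputs two triplets. Do you know what might be happening?
@gregorio099 Let me know whether the edit about setting the support parameter solves the problem.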