Prevalence of combination of factors (in R): answers

【Question title】: Prevalence of combination of factors (in R)
【Posted】: 2016-12-10 00:07:34
【Question】:

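I am studying the role of combinations of tumor patterns in predicting malignancy. I have this table of thyroid nodule features, described by 6 categorical variables (YES/NO, coded as 1/0).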

ID color shape halo calcium margins solid
1    1     1    1      1       0      0
2    1     1    0      0       1      0
3    0     0    1      1       1      1
4    0     0    1      0       0      0
5    1     1    1      1       0      1

I would like to know the prevalence of each combination of three features being present together. In this example it would be:

          combination freq
color, shape, calcium   2
shape, halo,  calcium   2
color, shape, margins   1
....

I only ended up with the prevalence of each feature on its own, using

as.data.frame(table(tiradsLong$caratteristica, tiradsLong$valore))

which is not my goal.

Thanks in advance, Angelo

【Comments】:

Tags: r combinations

【Answer 1】:

Here is one solution I could come up with; I am sure it can be made more elegant:

x <- combn(2:ncol(df), 3)
as.data.frame(do.call(rbind,
              apply(x, 2, function(y)
                    list(cols = names(df)[y],
                    value = sum(rowSums(df[, y]) == 3)))))

The output is:

                   cols value
1    color, shape, halo     2
2 color, shape, calcium     2
3 color, shape, margins     1
4   color, shape, solid     1
5  color, halo, calcium     2
...
...
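
The same counting can also be sketched so that it returns an ordinary data frame with atomic columns rather than list columns. This is only an illustration: the data frame name `nodules` and the helper names `feature_cols`, `triples` and `triple_counts` are made up here, and the data are the five rows from the question.

```r
# Example data copied from the question (1 = feature present, 0 = absent).
nodules <- data.frame(
  ID      = 1:5,
  color   = c(1, 1, 0, 0, 1),
  shape   = c(1, 1, 0, 0, 1),
  halo    = c(1, 0, 1, 1, 1),
  calcium = c(1, 0, 1, 0, 1),
  margins = c(0, 1, 1, 0, 0),
  solid   = c(0, 0, 1, 0, 1)
)

# Every feature column except the identifier.
feature_cols <- setdiff(names(nodules), "ID")

# All 3-column combinations, one combination per matrix column.
triples <- combn(feature_cols, 3)

# A row contains a triple when all three of its columns equal 1,
# i.e. when the row sum over those columns is 3.
triple_counts <- data.frame(
  combination = apply(triples, 2, paste, collapse = ", "),
  freq = apply(triples, 2, function(cols) sum(rowSums(nodules[, cols]) == 3)),
  stringsAsFactors = FALSE
)

# Most frequent combinations first.
triple_counts <- triple_counts[order(-triple_counts$freq), ]
head(triple_counts)
```

Because both columns are built with `apply()` returning plain vectors, the result can be sorted, filtered or merged like any other data frame. Combinations that never occur are kept with a frequency of 0.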

More generally, for this kind of problem you may want to look at frequent itemsets and apriori (the arules package).

【Comments】:

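This is exactly what I was looking for, even though it does not work yet... Error in rowSums(tiradsWide[, y]) : 'x' must be numeric
Please check the classes of your columns. You could also look at the class of, say, tiradsWide[, c(2, 3, 4)] as an example. It works fine on my side.
Yes, the columns are of list type. apply returns a matrix (of type list). Some massaging can turn it into a proper data frame with the desired types. But the logic and the output are correct.
OK, I did as you suggested (converted the columns to numeric) and it works well, as expected. Thanks a lot.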
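
The class check and conversion discussed in these comments might look like the sketch below. The thread does not say which classes the asker's columns actually had, so the `tirads_wide` table here is a hypothetical stand-in whose 0/1 values are stored as text and as a factor.

```r
# Hypothetical stand-in for the asker's table: 0/1 flags stored as
# character strings, and as a factor for one column.
tirads_wide <- data.frame(
  ID    = 1:5,
  color = c("1", "1", "0", "0", "1"),
  shape = factor(c("1", "1", "0", "0", "1")),
  halo  = c("1", "0", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# Inspect the class of every column; rowSums() needs numeric input.
sapply(tirads_wide, class)

feature_cols <- setdiff(names(tirads_wide), "ID")

# Go through as.character() first so that factor columns are converted
# by their labels ("0"/"1") and not by their internal integer codes.
tirads_wide[feature_cols] <- lapply(
  tirads_wide[feature_cols],
  function(col) as.numeric(as.character(col))
)

# With numeric columns, rowSums() works.
rowSums(tirads_wide[, feature_cols])
```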

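【Answer 2】:

The following solution depends on the format of your data. It would be very helpful if you provided some sample data via dput or similar.

In any case, here is one of many possible solutions.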

df <- data.frame(ID = 1:50,
                 color = rbinom(50, size = 1, prob = 0.5),
                 shape = rbinom(50, size = 1, prob = 0.5),
                 halo = rbinom(50, size = 1, prob = 0.5),
                 calcium = rbinom(50, size = 1, prob = 0.5),
                 margins = rbinom(50, size = 1, prob = 0.5),
                 solid = rbinom(50, size = 1, prob = 0.5))

library(tidyverse)

df %>%
  gather("feature", "value", - ID) %>%
  filter(value == 1) %>%
  group_by(ID) %>%
  summarise(fdata = paste(sort(feature), collapse = "_")) %>%
  group_by(fdata) %>%
  summarise(count = n())

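With dplyr, you first need to convert the data to long format. Then you can filter for your signal, i.e. value 1. Grouping by ID, you encode each row's set of features by combining the feature names into a single string. sort is needed so that the same set of features always produces the same encoded string. After that, we group by the encoded string and count the number of IDs in each group.

Edit: @Gopala pointed out that you only want groups of three. You can add these lines to the snippet above: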
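
The encoding step described above can also be sketched in base R, which may make it easier to see what each row is turned into. The five rows are the example data from the question; the names `nodules`, `present`, `pattern` and `pattern_counts` are illustrative and not part of the answer's code.

```r
# Example data copied from the question (1 = feature present, 0 = absent).
nodules <- data.frame(
  ID      = 1:5,
  color   = c(1, 1, 0, 0, 1),
  shape   = c(1, 1, 0, 0, 1),
  halo    = c(1, 0, 1, 1, 1),
  calcium = c(1, 0, 1, 0, 1),
  margins = c(0, 1, 1, 0, 0),
  solid   = c(0, 0, 1, 0, 1)
)

feature_cols <- setdiff(names(nodules), "ID")

# Logical matrix: TRUE where a feature is present.
present <- nodules[feature_cols] == 1

# Encode each row as the sorted names of its present features,
# joined with "_" (the same encoding as `fdata` in the pipeline).
pattern <- apply(present, 1, function(row_flags) {
  paste(sort(feature_cols[row_flags]), collapse = "_")
})

# Count how many IDs share each encoded pattern.
pattern_counts <- as.data.frame(
  table(fdata = pattern),
  responseName = "count",
  stringsAsFactors = FALSE
)
pattern_counts
```

Note that this counts complete feature patterns per row, not three-feature subsets, so a row with four or five features contributes a single longer string.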

... %>%
    mutate(threeCombos = purrr::map(fdata, function(.x) {
      splittedStrings = unlist(strsplit(.x, "_"))
      if (length(splittedStrings) > 2) {
        res <- data.frame(t(combn(splittedStrings, m = 3)), stringsAsFactors = FALSE) %>%
          unite("threecombs", starts_with("X"), sep = ",")
      } else {
        res <- data.frame()
      }
      return(res)
    })) %>% 
    unnest() %>%
    group_by(threecombs) %>%
    summarise(freq = sum(count))

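This may be faster than iterating over all choose(n, m) column combinations. But again, it depends on what further statistical analysis you want to do with the triplets.

【Comments】: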
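
The per-row expansion into triples can be sketched in base R as follows, again on the question's five rows. The names `nodules`, `row_triples` and `observed` are illustrative. Unlike a loop over all column combinations, this approach only ever sees triples that occur in at least one row.

```r
# Example data copied from the question (1 = feature present, 0 = absent).
nodules <- data.frame(
  ID      = 1:5,
  color   = c(1, 1, 0, 0, 1),
  shape   = c(1, 1, 0, 0, 1),
  halo    = c(1, 0, 1, 1, 1),
  calcium = c(1, 0, 1, 0, 1),
  margins = c(0, 1, 1, 0, 0),
  solid   = c(0, 0, 1, 0, 1)
)

feature_cols <- setdiff(names(nodules), "ID")

# For each row, list every triple that can be formed from the features
# present in that row; rows with fewer than 3 features contribute none.
row_triples <- unlist(lapply(seq_len(nrow(nodules)), function(i) {
  present <- feature_cols[nodules[i, feature_cols] == 1]
  if (length(present) < 3) return(character(0))
  combn(present, 3, FUN = paste, collapse = ", ")
}))

# Tabulate how often each triple was generated across all rows.
observed <- as.data.frame(
  table(combination = row_triples),
  responseName = "freq",
  stringsAsFactors = FALSE
)

# Number of possible triples of columns versus triples actually observed.
choose(length(feature_cols), 3)
nrow(observed)
```

Triples that never occur are absent from `observed` rather than listed with a frequency of 0, which matters if the later analysis needs the zero counts.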

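Why do you think this gives the correct (desired) result shown above?
Do you think it is wrong because the column names (combination, freq) and the separator "_" are wrong?
On the original data, this gives 5 rows of output, whereas there are 20 possible combinations of 3 columns. Also, this produces combinations of more than 3 columns.
That is true. I wonder why this limit of 3 exists. What kind of statistical analysis does the OP want to do?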