R: Converting a Large Data Frame to a Pairwise Correlation Matrix

【Question Title】: R: Converting Large Dataframe to Pairwise Correlation Matrix
【Posted】: 2019-10-13 19:50:22
【Question】:

I have data of the following form:

df <- data.frame(group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
                  thing = c(rep(c('a','b','c','d','e'),5)),
                  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0))

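It reports a "score" for each "thing" within each of a number of "groups".

I want to create a correlation matrix showing the pairwise score correlations for all "things", based on how their scores correlate across groups: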

         thing_a thing_b thing_c thing_d thing_e
thing_a  1       .       .       .       .
thing_b  corr    1       .       .       .
thing_c  corr    corr    1       .       .
thing_d  corr    corr    corr    1       .
thing_e  corr    corr    corr    corr    1

For example, the data underlying the correlation between thing "a" and thing "b" is:

group  thing_a_score  thing_b_score
1      1              1
2      1              1
3      1              1
4      0              1
5      0              1

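In practice, there are about 1,000 unique groups and about 10,000 things, so I need an approach that is more efficient than brute-force for loops.

I don't need the resulting correlations to be in a single matrix, or even in matrix form at all (i.e., the result could be a dataset with three columns, "thing_1 thing_2 corr").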

【Comments】:

For this particular example, this works as well: cor(table(df[df$score == 1, c('group', 'thing')]))

Tags: r combinations permutation correlation pairwise

【Solution 1】:

You can first dcast your data from long to wide format, then use the cor() function to get the correlation matrix:

library(data.table)
dt <- data.table(
  group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
  thing = c(rep(c('a','b','c','d','e'),5)),
  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)
dt

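# Reshape from long to wide: one row per group, one column per thing
m <- dcast(dt, group ~ thing, value.var = "score")

# Drop the group column, then correlate the thing columns pairwise
cor(m[, -1])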
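
The question notes that a three-column "thing_1 thing_2 corr" layout would also be acceptable. The following is a minimal sketch of that conversion in base R, not part of the original answer. It rebuilds the wide table so the block runs on its own; the names cm, idx, and pairs are illustrative only.

```r
library(data.table)

dt <- data.table(
  group = rep(1:5, each = 5),
  thing = rep(c("a", "b", "c", "d", "e"), 5),
  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)
m <- dcast(dt, group ~ thing, value.var = "score")

# Correlation matrix of the thing columns. In this toy data, thing "b" scores 1
# in every group, so its standard deviation is zero and cor() returns NA
# (with a warning) for every pair involving "b".
cm <- suppressWarnings(cor(as.matrix(m[, -1])))

# Keep each unordered pair once by taking the strict lower triangle.
idx <- which(lower.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  thing_1 = colnames(cm)[idx[, "col"]],
  thing_2 = rownames(cm)[idx[, "row"]],
  corr    = cm[idx]
)
pairs
```

With 10,000 things this long table would have roughly 50 million rows, so whether it is preferable to the matrix depends on what you plan to do with it next.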

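data.table is generally high-performance, but if it doesn't work for you, write a reproducible example that generates a large amount of data, and someone may benchmark different solutions for speed and memory.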
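
A reproducible example of the kind suggested here might look like the sketch below. It is an assumption-laden illustration, not something from the original thread: the sizes n_groups and n_things are deliberately smaller than the question's 1,000 by 10,000 so it runs quickly, the scores are random 0/1 draws, and the names big, wide, cm, and timing are made up for this example. Note that a 10,000 by 10,000 matrix of doubles needs about 800 MB of memory on its own.

```r
library(data.table)

set.seed(42)      # arbitrary seed, only for reproducibility

n_groups <- 200   # assumption: scaled down from the ~1,000 in the question
n_things <- 500   # assumption: scaled down from the ~10,000 in the question

# Long-format data with the same three columns as the question's df
big <- data.frame(
  group = rep(seq_len(n_groups), each = n_things),
  thing = rep(paste0("t", seq_len(n_things)), times = n_groups),
  score = rbinom(n_groups * n_things, size = 1, prob = 0.5)
)

# Time the reshape plus the correlation step together
timing <- system.time({
  wide <- dcast(as.data.table(big), group ~ thing, value.var = "score")
  cm   <- cor(as.matrix(wide[, -1]))
})

timing
dim(cm)
```

Raising n_groups and n_things toward the real sizes gives a rough idea of how time and memory scale before comparing alternative approaches.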

【Comments】:

Works like a charm! And it easily handles a matrix of 1,000 groups by 10,000 things. Thanks!