计算列表中元素之间的平均成对协方差答案

【问题标题】：Compute mean pairwise covariance between elements in a list计算列表中元素之间的平均成对协方差
【发布时间】：2025-11-30 15:55:02
【问题描述】：

我有以下数据框：

# df1
id   cg_v
1       a
2       b
3     a b
4     b c
5   b c d
6       d

# df2
id  cg
1    a
2    b
3    a
3    b
4    b
4    c
5    b
5    c
5    d
6    d

我需要向df1 添加一列，其中包含在cg_v 中每对元素计算的平均协方差。如果cg_v 仅包含一个元素，那么我希望新列包含其方差。

我可以通过cov(crossprod(table(df2)))得到一个协方差矩阵

#          a         b          c          d
a  0.9166667 0.0000000 -0.5833333 -0.6666667
b  0.0000000 2.0000000  1.0000000  0.0000000
c -0.5833333 1.0000000  0.9166667  0.3333333
d -0.6666667 0.0000000  0.3333333  0.6666667

我从这里做什么？

最终结果应该是这样的：

# df1
id   cg_v      cg_cov
1       a   0.9166667
2       b   2.0000000
3     a b   0.0000000
4     b c   1.0000000
5   b c d   0.4444444  # This is equal to (1.0000000 + 0.3333337 + 0.0000000)/3
6       d   0.6666667

生成df1和df2的代码：

df1 <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                      cg_v = c("a", "b", "a b", "b c", "b c d", "d")),
                 .Names = c("id", "cg_v"),
                 class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                      cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d")),
                 .Names = c("id", "cg"),
                 class = "data.frame", row.names = c(NA, -10L))

【问题讨论】：

标签： r covariance

【解决方案1】：

我想我使用 data.tables 和 reshape 找到了解决这个问题的方法。你想用三个字母 b c d 做什么？我假设你想要前两个字母的协方差：

        require(reshape)
        require(data.table)
        dt1 <- data.table(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                          cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
        dt2 <- data.table(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                              cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
        cov_dt <- data.table(melt(cov(crossprod(table(df2)))))
        dt1 <- cbind(dt1, t(sapply(strsplit(as.character(df1$cg_v), " "), function(x)x[1:2])))
        #replace the na with the first colomn
        dt1[is.na(V2), V2 := V1]

        # Merge them on two columns
        setkey(dt1, "V1", "V2")
        setkey(cov_dt, "X1", "X2")
        result <- cov_dt[dt1]
> result[,.(id, cg_v, value)]
   id  cg_v     value
1:  1     a 0.9166667
2:  3   a b 0.0000000
3:  2     b 2.0000000
4:  4   b c 1.0000000
5:  5 b c d 1.0000000
6:  6     d 0.6666667

如果有超过 2 个字母也可以使用的变体（不是最有效的代码）：

require(reshape)
require(combinat)
df1 <- data.frame(id = c(1L, 2L, 3L, 4L, 5L, 6L),
                  cg_v = c("a", "b", "a b", "b c", "b c d", "d"))
df2 <- data.frame(id = c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L),
                      cg = c("a", "b", "a", "b", "b", "c", "b", "c", "d", "d"))
cov_dt <- cov(crossprod(table(df2)))
mat <- sapply(strsplit(as.character(df1$cg_v), " "), function(x) if(length(x) == 1){c(x,x)} else(x))
# Should be all minimal 2 
sapply(mat, length) > 1
mat <- sapply(mat, function(x) matrix(combn(x,2), nrow = 2))
df1$cg_cov <- sapply(mat, function(x) mean(apply(x,2, function(x) cov_dt[x[1],x[2]])))
> df1
  id  cg_v    cg_cov
1  1     a 0.9166667
2  2     b 2.0000000
3  3   a b 0.0000000
4  4   b c 1.0000000
5  5 b c d 0.4444444
6  6     d 0.6666667

【讨论】：

不，我需要 cov(b,c)、cov(c,d) 和 cov(b,d) 的平均值。即 (1.0000000 + 0.3333337 + 0.0000000) / 3 = 0.4444444。
我编辑了我的解决方案，以便在使用更多字母时它可以工作
它适用于示例，但不适用于我的数据。运行mat <- sapply(mat, function(x) matrix(combn(x,2), nrow = 2)) 时出现错误。错误是：Error in combn(x, 2) : n < m
您应该检查 mat 的长度是否总是大于 1，这与与此示例不同的输入有关。我添加了一行来检查问题所在。
这确实是问题所在。 df1 包含 cg_v 为空的行，这导致 mat 在某些情况下的长度为 0。我相信我可以通过消除df1 中的空行来解决这个问题，但也许可以在代码中实现一个if-check，以便length(x)==0 中的行从mat 中删除？