与数据框中的参考组相关答案

【问题标题】：correlation with a reference group in a dataframe与数据框中的参考组相关
【发布时间】：2018-12-29 06:44:51
【问题描述】：

考虑noMissing 数据框。

library(lubridate)
set.seed(123)
value <- rnorm(300)
value[sample(1:300,10)]<- NA
b <- rep(c("a","b", "c", "d","e", "f"), each=50)
b[sample(1:300,12)] <- NA
c <- rep(rep(as.character(1:2), each = 25) , 6) 
c[sample(1:300,10)] <- NA
datee <- seq(lubridate::ymd("2012-01-01"),lubridate::ymd("2012-01-01") + 24 , by = "days")
datee <- rep(datee, 12)
datee[sample(1:300,20)] <- NA
dataframe <- cbind.data.frame( b, c, datee, value)
noMissing <- dataframe[complete.cases(dataframe),]
head(noMissing)
  b c      datee       value
1 a 1 2012-01-01 -0.56047565
2 a 1 2012-01-02 -0.23017749
3 a 1 2012-01-03  1.55870831
4 a 1 2012-01-04  0.07050839
5 a 1 2012-01-05  0.12928774
6 a 1 2012-01-06  1.71506499

现在我想按列b 和c 对数据进行分组，然后计算每个组与b 列中a 组的相关性，该列在datee 列中与其他组具有相同的日期.

例如b, 1与引用组a之间的关联如下图所示

我的初步解决方案：

b_unique <- unique(noMissing$b)
c_unique <- unique(noMissing$c)
out <- list()
v <- 0
for (i in 1:length(b_unique)) {
    v <- v + 1
    group <- noMissing[noMissing$b==b_unique[i] & noMissing$c == c_unique[k],]
    ref <- noMissing[noMissing$b=="a" & noMissing$c == c_unique[k] ,]
    inter <-ymd("1970-01-01") + intersect(ref$datee, group$datee )
    x <- cor(group$value[group$datee %in% inter],ref[ref$datee %in% inter , "value"])
    out[[v]] <- list(b = b_unique[i], c = c_unique[k], cor = x)
  }

}

 dplyr::bind_rows(out)
    b     c        cor
  <fct> <fct>  <dbl>
  1 a     1      1.000
  2 a     2      1    
  3 b     1      0.175
  4 b     2     -0.247
  5 c     1      0.216
  6 c     2      0.101
  7 d     1      0.159
  8 d     2     -0.253
  9 e     1      0.177
 10 e     2     -0.528
 11 f     1      0.179
 12 f     2     -0.178

我正在寻找高品味的编码解决方案

【问题讨论】：

如何得到 b,1 的 0.175 ？输出表看不懂。
当日期相同时，b == b and c == 1 和 b == a and c == 1 之间的相关性为 0.175。
我不明白你的问题，但为了避免循环，你可以试试这个：dcast(data.table(noMissing), datee + c ~ b, value.var='value')
我添加了一张图片

标签： r

【解决方案1】：

您可以执行以下操作：

library(data.table)

# convert the data shape to have datewise information across all groups
df <- dcast(data.table(noMissing), datee+c ~ b, value.var='value')

# rename c as c_1 column as there are multiple column with c name
setnames(df, old = 2, new = 'c_1')

# groupby 'c_1' and for each group calculate correlation between b-a, c-a, d-a, e-a and so on
df <- df[, 
         lapply(.SD[,-c('datee'), with=F], function(x) {
          cols <- c('a','b','c','d','e','f')
          vals <- vector(mode = 'numeric', length = 6)
          for(i in seq(cols)) {vals[i] <-  (cor(get(cols[i]), get(('a')), use = complete.obs'))}
          return (vals)}) 
         ,c_1]

# finally reshape the table as you posted in solution above.
df <- melt(df, id.vars = c('c_1'))
colnames(df) <- c('c','b','cor')

   c b         cor
1: 1 a  1.00000000
2: 1 a -0.12499728
3: 1 a -0.13133257
4: 1 a  0.02573947
5: 1 a  0.07239559
6: 1 a -0.07421281

【讨论】：

第一个值不应该是 1 吗？
我认为你是对的，我编辑了答案，但我仍然不确定这是否是正确的答案。希望这能给你一些提示。