【Question Title】: Find combination of n vectors across k dataframes with highest correlation
【Posted】: 2019-08-30 01:21:59
【Question】:

Suppose there are four data frames, each containing 3 vectors, e.g.

setA <- data.frame(
  a1 = c(6,5,2,4,5,3,4,4,5,3),
  a2 = c(4,3,1,4,5,1,1,6,3,2),
  a3 = c(5,4,5,6,4,6,5,5,3,3)
)

setB <- data.frame(
  b1 = c(5,3,4,3,3,6,4,4,3,5),
  b2 = c(4,3,1,3,5,2,5,2,5,6),
  b3 = c(6,5,4,3,2,6,4,3,4,6)
)

setC <- data.frame(
  c1 = c(4,4,5,5,6,4,2,2,4,6),
  c2 = c(3,3,4,4,2,1,2,3,5,4),
  c3 = c(4,5,4,3,5,5,3,5,5,6)
)

setD <- data.frame(
  d1 = c(5,5,4,4,3,5,3,5,5,4),
  d2 = c(4,4,3,3,4,3,4,3,4,5),
  d3 = c(6,5,5,3,3,4,2,5,5,4)
)

I am trying to find the n vectors per data frame that show the highest correlation with one another. For this simple example, say we want to find the n = 1 vector in each of the k = 4 data frames that together show the strongest overall positive correlation, as measured by cor().

I am not interested in the correlations of vectors within a data frame, but in the correlations between the data frames, since I want to pick 1 variable from each set.

Intuitively, I would sum up all the correlation coefficients for each combination, i.e.:

sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b2, setC$c1, setD$d1)))
sum(cor(cbind(setA$a1, setB$b1, setC$c2, setD$d1)))
... # and so on...

...but this feels like brute-forcing a solution that could probably be solved more elegantly with some kind of clustering technique?

In any case, I am hoping to find a dynamic solution like function(n = 1, ...) (where ... takes the data frames), which returns a list of the names of the highest-correlating vectors.
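For scale: with 3 candidate columns in each of the k = 4 data frames, the enumeration sketched above spans 3^4 = 81 combinations, which expand.grid() makes easy to count (this snippet is illustrative, not part of the question):

```r
# one factor of the grid per data frame: its candidate column names
combos <- expand.grid(
  A = c("a1", "a2", "a3"),
  B = c("b1", "b2", "b3"),
  C = c("c1", "c2", "c3"),
  D = c("d1", "d2", "d3")
)
nrow(combos)  # 81 candidate combinations (3^4)
```

With 20 columns per set, the same grid would already hold 20^4 = 160,000 rows, which is why the performance of the brute-force approach matters below.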

【Question Comments】:

    Tags: r cluster-analysis correlation


    【Solution 1】:

    Based on your example, I would not use a very complicated algorithm unless your actual data is huge. Here is a simple approach that I think gets what you want. Based on your 4 data frames I create a list_df; then, inside the function, I simply generate all possible combinations of variables and compute their correlations. Finally I select the n combinations with the highest correlation.

    list_df = list(setA, setB, setC, setD)
    
    CombMaxCor = function(n = 1, list_df){
    
      # column names per data frame, and every cross-set combination of them
      column_names = lapply(list_df, colnames)
      mat_comb     = expand.grid(column_names)
      mat_total    = do.call(cbind, list_df)
      vec_cor      = rep(NA, nrow(mat_comb))
    
      # total correlation (sum of the full cor() matrix) for each combination
      for(i in 1:nrow(mat_comb)){
        vec_cor[i] = sum(cor(mat_total[, as.character(unlist(mat_comb[i, ]))]))
      }
    
      # keep the n combinations with the highest totals (ties may add extra rows)
      pos_max_temp = rev(sort(vec_cor))[1:n]
      pos_max      = vec_cor %in% pos_max_temp
      comb_max_cor = mat_comb[pos_max, ]
      return(comb_max_cor)
    }
    

    【Comments】:

    • This works nicely for my example, thank you! :) My real data has 4 data frames with 20 vectors each, so it naturally takes quite a while to process. If there are any ideas on how to improve performance, I'd be glad to hear them. Many thanks
    • @ComfortEagle I think replacing the for loop with apply might make it faster, or parallelizing the for loop would also help
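Beyond apply/parallelization, one concrete speed-up (an editorial sketch, not part of the answer itself): compute the full correlation matrix once and sum named submatrices of it, instead of calling cor() for every combination:

```r
# the four example sets from the question
setA <- data.frame(a1 = c(6,5,2,4,5,3,4,4,5,3),
                   a2 = c(4,3,1,4,5,1,1,6,3,2),
                   a3 = c(5,4,5,6,4,6,5,5,3,3))
setB <- data.frame(b1 = c(5,3,4,3,3,6,4,4,3,5),
                   b2 = c(4,3,1,3,5,2,5,2,5,6),
                   b3 = c(6,5,4,3,2,6,4,3,4,6))
setC <- data.frame(c1 = c(4,4,5,5,6,4,2,2,4,6),
                   c2 = c(3,3,4,4,2,1,2,3,5,4),
                   c3 = c(4,5,4,3,5,5,3,5,5,6))
setD <- data.frame(d1 = c(5,5,4,4,3,5,3,5,5,4),
                   d2 = c(4,4,3,3,4,3,4,3,4,5),
                   d3 = c(6,5,5,3,3,4,2,5,5,4))

list_df   <- list(setA, setB, setC, setD)
mat_total <- do.call(cbind, list_df)
cor_all   <- cor(mat_total)   # all 12 x 12 correlations, computed once

mat_comb  <- expand.grid(lapply(list_df, colnames), stringsAsFactors = FALSE)
# sum the relevant 4 x 4 submatrix for each combination, indexed by name
vec_cor   <- apply(mat_comb, 1, function(cols) sum(cor_all[cols, cols]))

mat_comb[which.max(vec_cor), ]  # combination with the highest total correlation
```

Because cor() is no longer re-estimated per combination, only 4 x 4 submatrix sums remain in the loop, which scales much better to 20 columns per set.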
    【Solution 2】:

    You can use the combn function, with a being the four sets column-bound into one data frame:

    a = cbind(setA, setB, setC, setD)  # all four sets column-bound (implied by the answer)
    
    fun = function(x){
      nm = paste0(names(x), collapse = "")
      # skip combinations taking two columns from the same set
      # (their concatenated names repeat a letter prefix, e.g. "a1a2...")
      if(!grepl("(.)\\d.*\\1", nm, perl = T))
        setNames(sum(cor(x)), nm)
    }
    unlist(combn(a, 4, fun, simplify = FALSE))[1:3] # only the first 3 printed
    
    a1b1c1d1 a1b1c1d2 a1b1c1d3 
    3.246442 4.097532 3.566949 
    
    sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d1)))
    [1] 3.246442
    sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d2)))
    [1] 4.097532
    sum(cor(cbind(setA$a1, setB$b1, setC$c1, setD$d3)))
    [1] 3.566949
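To pull out the winning combination rather than just the first three entries, which.max can be applied to the full named result vector (a self-contained sketch of the same approach, with the example data from the question):

```r
# the example sets from the question, column-bound into one data frame
setA <- data.frame(a1 = c(6,5,2,4,5,3,4,4,5,3),
                   a2 = c(4,3,1,4,5,1,1,6,3,2),
                   a3 = c(5,4,5,6,4,6,5,5,3,3))
setB <- data.frame(b1 = c(5,3,4,3,3,6,4,4,3,5),
                   b2 = c(4,3,1,3,5,2,5,2,5,6),
                   b3 = c(6,5,4,3,2,6,4,3,4,6))
setC <- data.frame(c1 = c(4,4,5,5,6,4,2,2,4,6),
                   c2 = c(3,3,4,4,2,1,2,3,5,4),
                   c3 = c(4,5,4,3,5,5,3,5,5,6))
setD <- data.frame(d1 = c(5,5,4,4,3,5,3,5,5,4),
                   d2 = c(4,4,3,3,4,3,4,3,4,5),
                   d3 = c(6,5,5,3,3,4,2,5,5,4))
a <- cbind(setA, setB, setC, setD)

fun <- function(x){
  nm <- paste0(names(x), collapse = "")
  # keep only combinations drawing one column from each set
  if(!grepl("(.)\\d.*\\1", nm, perl = TRUE))
    setNames(sum(cor(x)), nm)
}

res <- unlist(combn(a, 4, fun, simplify = FALSE))  # 81 named sums
res[which.max(res)]  # best combination and its total correlation
```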
    

    【讨论】:

      【Solution 3】:

      Here is a function we can use to pick n non-repeating columns from each data frame so as to maximize the total correlation:

      func <- function(n, ...){
      
          list.df <- list(...)
          n.df <- length(list.df)
      
      
          # 1) First get the correlations
          get.two.df.cors <- function(df1, df2) apply(df1, 2, 
              function(x) apply(df2, 2, function(y) cor(x,y))
              )
          cor.combns <-  lapply(list.df, function(x) 
              lapply(list.df, function(y) get.two.df.cors(x,y))
              )
      
      
          # 2) Define a helper to aggregate the correlations.
          # We will call it later for different combinations of columns
          # selected from each df.
      
          # cmbns: a matrix giving the columns to be selected from each
          # data frame (the i-th row corresponds to the i-th df);
          # returns the "total correlation"
      
      
          get.cmbn.sum <- function(cmbns, cor.combns){
              # a helper matrix to help aggregation
              # each row represents which two data frames we want to get the correlation sums
              df.df <- t(combn(seq(n.df), 2, c))
      
              # convert to list of selections for each df
              cmbns <- split(cmbns, seq(nrow(cmbns)))
      
              sums <- apply(df.df, 1,
                function(dfs) sum(
                   cor.combns[[dfs[1]]][[dfs[2]]][cmbns[[dfs[2]]], cmbns[[dfs[1]]]] 
                )
              )
      
              # sum of the sums give the "total correlation"
              sum(sums)
          }
      
      
      
          # 3) Now perform the aggregation
      
          # get the methods of choosing n columns from each of the k data frames
          if (n==1) {
          cmbns.each.df <- lapply(list.df, function(df) matrix(seq(ncol(df)), ncol=1))
          } else {
          cmbns.each.df <- lapply(list.df, function(df) t(combn(seq(ncol(df)), n, c)))
          }
      
          # get all unique selection methods
          unique.selections <- Reduce(function(all.dfs, new.df){
              all.dfs.lst <- rep(list(all.dfs), nrow(new.df))
              all.new.rows <- lapply(seq(nrow(new.df)), function(x) new.df[x,,drop=F])
              for(i in seq(nrow(new.df))){
                  for(j in seq(length(all.dfs.lst[[i]]))){
                      all.dfs.lst[[i]][[j]] <- rbind(all.dfs.lst[[i]][[j]], all.new.rows[[i]])
                  }
              }
      
              do.call(c, all.dfs.lst)
      
          }, c(list(list(matrix(numeric(0), nrow=0, ncol=n))), cmbns.each.df))
      
          # for each unique selection method, calculate the total correlation
          result <- sapply(unique.selections, get.cmbn.sum, cor.combns=cor.combns)
          return( unique.selections[[which.max(result)]] )
      
      }
      

      Now we have:

      # n = 1
      func(1, setA, setB, setC, setD)
      #      [,1]
      # [1,]    1
      # [2,]    2
      # [3,]    3
      # [4,]    2
      
      # n = 2
      func(2, setA, setB, setC, setD)
      #      [,1] [,2]
      # [1,]    1    2
      # [2,]    2    3
      # [3,]    2    3
      # [4,]    2    3
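The function returns column indices (row i of the matrix indexes the i-th data frame). A small helper (hypothetical, not part of the answer) maps such a matrix back to column names; sel below is copied from the n = 1 output above:

```r
# column names of the four example sets, in the order passed to func()
col_names <- list(c("a1", "a2", "a3"), c("b1", "b2", "b3"),
                  c("c1", "c2", "c3"), c("d1", "d2", "d3"))

# index matrix as returned by func(1, setA, setB, setC, setD) above
sel <- matrix(c(1, 2, 3, 2), ncol = 1)

# look row i up in the i-th set's column names
picked <- mapply(function(nms, idx) nms[idx],
                 col_names, split(sel, seq(nrow(sel))))
picked  # "a1" "b2" "c3" "d2"
```

For n = 2 each row holds two indices, so mapply(..., SIMPLIFY = FALSE) would return a list of name pairs per set instead of a flat vector.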
      

      【Comments】:
