【Question title】: Efficiently collapse a matrix
【Posted】: 2016-06-24 21:48:15
【Question description】:

I have a matrix in this format:

set.seed(1)
mat <- matrix(round(runif(25,0,1)),nrow=5,ncol=5)
colnames(mat) <- c("a1::C","a1::A","a1::B","b1::D","b1::A")

     a1::C a1::A a1::B b1::D b1::A
[1,]     0     1     0     0     1
[2,]     0     1     0     1     0
[3,]     1     1     1     1     1
[4,]     1     1     0     0     0
[5,]     0     0     1     1     0

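In other words, each column is a subject and a feature (given by the column name, with the two parts separated by ::). In each row, a value of 1 means the subject has that feature and 0 means it does not. A subject may have 0 in all of its columns for a given row.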

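I want to build a new matrix with one column per subject, where each row holds the features that subject has, sorted alphabetically and separated by commas. If a subject has no features in a row (i.e. all of its columns are 0 in that row), the value "W" should be used (no feature is named "W").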

Here is what the new matrix based on mat would look like:

cnames = unique(sapply(colnames(mat), function(x) strsplit(x,split="::")[[1]][1]))
new_mat <- matrix(c("A","A","A,B,C","A,C","B",
                    "A","D","A,D","W","D"),
                  nrow=nrow(mat),ncol=length(cnames))
colnames(new_mat) = cnames

     a1      b1   
[1,] "A"     "A"  
[2,] "A"     "D"  
[3,] "A,B,C" "A,D"
[4,] "A,C"   "W"  
[5,] "B"     "D"

Any idea what would be an efficient and elegant way to achieve this?

【Comments】:

    Tags: r matrix apply


    【Solution 1】:

    Step 1: reorder the matrix columns

    mat <- mat[, order(colnames(mat))]
    
    #      a1::A a1::B a1::C b1::A b1::D
    # [1,]     1     0     0     1     0
    # [2,]     1     0     0     0     1
    # [3,]     1     1     1     1     1
    # [4,]     1     0     1     0     0
    # [5,]     0     1     0     0     1
    

    Step 2.1: column name decomposition

    ## decompose levels, get main levels (before ::) and sub levels (post ::)
    decom <- strsplit(colnames(mat), "::")
    
    main_levels <- sapply(decom, "[", 1)
    # [1] "a1" "a1" "a1" "b1" "b1"
    
    sub_levels <- sapply(decom, "[", 2)
    # [1] "A" "B" "C" "A" "D"
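
A slightly more defensive variant of this decomposition, as a minimal sketch: it assumes the separator should be matched literally (`fixed = TRUE`) and uses `vapply()` so the result is always a character vector. The `nms` vector is just the sorted column names from the question, typed out by hand.

```r
# Sorted column names from the question, typed out so the snippet is self-contained.
nms <- c("a1::A", "a1::B", "a1::C", "b1::A", "b1::D")

# fixed = TRUE treats "::" as a literal string rather than a regular expression.
decom <- strsplit(nms, "::", fixed = TRUE)

# vapply() declares the expected result type (one string per element),
# whereas sapply() would silently return a list for irregular input.
main_levels <- vapply(decom, `[`, character(1), 1)
sub_levels  <- vapply(decom, `[`, character(1), 2)
```

For the column names in the question this gives the same `main_levels` and `sub_levels` as the `sapply()` calls above.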
    

    Step 2.2: grouping index generation

    ## generating grouping index
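    ## zero-pad the row numbers: tapply() sorts group keys as strings, and unpadded "a1#10" would sort before "a1#2"
    row_id <- formatC(seq_len(nrow(mat)), width = nchar(nrow(mat)), flag = "0")
    main_index <- paste(rep(main_levels, each = nrow(mat)), rep(row_id, times = ncol(mat)), sep = "#")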
    sub_index <- rep(sub_levels, each = nrow(mat))
    sub_index[!as.logical(mat)] <- ""  ## 0 values in mat implies ""
    
    ## if it is unclear what "main_index" and "sub_index" are, check:
    
    ## matrix(main_index, nrow(mat))
    #      [,1]   [,2]   [,3]   [,4]   [,5]  
    # [1,] "a1#1" "a1#1" "a1#1" "b1#1" "b1#1"
    # [2,] "a1#2" "a1#2" "a1#2" "b1#2" "b1#2"
    # [3,] "a1#3" "a1#3" "a1#3" "b1#3" "b1#3"
    # [4,] "a1#4" "a1#4" "a1#4" "b1#4" "b1#4"
    # [5,] "a1#5" "a1#5" "a1#5" "b1#5" "b1#5"
    
    ## matrix(sub_index, nrow(mat))
    #      [,1] [,2] [,3] [,4] [,5]
    # [1,] "A"  ""   ""   "A"  ""  
    # [2,] "A"  ""   ""   ""   "D" 
    # [3,] "A"  "B"  "C"  "A"  "D" 
    # [4,] "A"  ""   "C"  ""   ""  
    # [5,] ""   "B"  ""   ""   "D" 
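
To see why the `rep(..., each = )` / `rep(..., times = )` pair lines up with the matrix cells, it may help to compare it with `row()` and `col()`, which return the row and column number of every cell in column-major order. This is only an illustrative sketch; the zero matrix stands in for `mat` because only its shape matters here.

```r
# Only the shape of the matrix matters for the index, so a zero matrix is enough.
mat <- matrix(0, nrow = 5, ncol = 5)
main_levels <- c("a1", "a1", "a1", "b1", "b1")

# Index built with rep(), as in the answer (unpadded row numbers).
idx_rep <- paste(rep(main_levels, each = nrow(mat)),
                 rep(1:nrow(mat), times = ncol(mat)), sep = "#")

# Same index built from the column and row number of every cell.
idx_rc <- paste(main_levels[col(mat)], row(mat), sep = "#")

same <- identical(idx_rep, idx_rc)
```

Both constructions walk the cells column by column, which is also the order in which `as.logical(mat)` flattens the matrix.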
    

    Step 2.3: conditional collapsed paste

    ## collapsed paste of "sub_index" conditional on "main_index"
    x <- unname(tapply(sub_index, main_index, paste0, collapse = ""))
    x[x == ""] <- "W"
    # [1] "A"   "A"   "ABC" "AC"  "B"   "A"   "D"   "AD"  "W"   "D" 
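
One thing to watch with a pasted character key: `tapply()` orders its groups by the sorted key strings, so a key such as `a1#10` would come before `a1#2` unless the row numbers are padded. A possible alternative, sketched here under the assumption that grouping by two separate factors is acceptable, is to pass the row number and the main level as a list. `tapply()` then returns a rows-by-subjects matrix directly and carries its own dimnames. The matrix is typed out from the question so the snippet runs on its own.

```r
# The example matrix from the question, entered column by column.
mat <- matrix(c(0,0,1,1,0, 1,1,1,1,0, 0,0,1,0,1, 0,1,1,0,1, 1,0,1,0,0),
              nrow = 5,
              dimnames = list(NULL, c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")))
mat <- mat[, order(colnames(mat)), drop = FALSE]

decom <- strsplit(colnames(mat), "::", fixed = TRUE)
main_levels <- vapply(decom, `[`, character(1), 1)
sub_levels  <- vapply(decom, `[`, character(1), 2)

# One feature label per cell, blanked where the cell is 0.
sub_index <- sub_levels[col(mat)]
sub_index[mat == 0] <- ""

# Group by (row number, main level); integer row numbers sort numerically.
res <- tapply(sub_index,
              list(as.vector(row(mat)), main_levels[col(mat)]),
              paste0, collapse = "")
res[res == ""] <- "W"
```

Here `res` is a 5 x 2 character matrix with columns `a1` and `b1`, holding the same strings as `x` above before the commas are inserted.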
    

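    Step 3: post-processing

    I am not very happy with this step, but I did not find an alternative. Note that strsplit(x, "") splits each string into single characters, so it relies on every feature name being exactly one character long.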

    x <- sapply(strsplit(x, ""), paste0, collapse = ",")
    #  [1] "A"   "A"   "A,B,C"  "A,C"   "B"   "A"   "D"   "A,D"  "W"  "D"
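
A possible way around the character splitting is to drop the empty entries and insert the commas inside the grouping function itself. The helper below, `collapse_nonempty`, is a hypothetical name introduced for this sketch and is not part of the original answer. It also works when feature names are longer than one character.

```r
# Hypothetical helper: join the non-empty labels with `sep`,
# or return `empty` when nothing is left.
collapse_nonempty <- function(v, sep = ",", empty = "W") {
  v <- v[nzchar(v)]
  if (length(v) == 0) empty else paste(v, collapse = sep)
}

# Row 4 of the example: subject a1 has A and C, subject b1 has nothing.
sub_index  <- c("A", "", "C", "", "")
main_index <- c("a1#4", "a1#4", "a1#4", "b1#4", "b1#4")

x <- tapply(sub_index, main_index, collapse_nonempty)
```

Used this way, the separate `x[x == ""] <- "W"` assignment and the `strsplit()` pass are both folded into one function call per group.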
    

    Step 4: build the matrix

    x <- matrix(x, nrow = nrow(mat))
    colnames(x) <- unique(main_levels)
    
    #      a1      b1   
    # [1,] "A"     "A"  
    # [2,] "A"     "D"  
    # [3,] "A,B,C" "A,D"
    # [4,] "A,C"   "W"  
    # [5,] "B"     "D" 
    

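    Efficiency considerations

    The method itself is fairly efficient because it is vectorized, and it needs no manual input of grouping information. For example, the same code works when you have hundreds of main groups (before ::) and hundreds of sub groups (after ::).

    The only remaining concern is avoiding unnecessary memory copies. To that end, we should pass intermediate results straight into the next call where possible, rather than explicitly assigning the reordered matrix and the index vectors as demonstrated above. The following works (tested):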

     decom <- strsplit(sort(colnames(mat)), "::")
     main_levels <- sapply(decom, "[", 1)
    
     sub_index <- rep(sapply(decom, "[", 2), each = nrow(mat))
     sub_index[!as.logical(mat[, order(colnames(mat))])] <- ""
    
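     x <- unname(tapply(sub_index,
                        paste(rep(main_levels, each = nrow(mat)),
                              ## zero-padded row numbers, so string sorting keeps the row order
                              rep(formatC(seq_len(nrow(mat)), width = nchar(nrow(mat)), flag = "0"),
                                  times = ncol(mat)),
                              sep = "#"),
                        paste0, collapse = ""))
     x[x == ""] <- "W"  ## subjects with no feature in a row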
    
     x <- matrix(sapply(strsplit(x, ""), paste0, collapse = ","),
                 nrow = nrow(mat))
    
     colnames(x) <- unique(main_levels)
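
For comparison, here is a plain base R sketch that wraps the whole task in one function. It is not the vectorized approach above: it loops over subjects and rows, trading speed for fewer assumptions (feature names may be longer than one character, and there is no string-sorted group key). The function name `collapse_features` and its arguments are made up for this example.

```r
# Hypothetical wrapper: one output column per subject, features sorted and comma-separated.
collapse_features <- function(mat, sep = "::", empty = "W") {
  parts    <- strsplit(colnames(mat), sep, fixed = TRUE)
  subject  <- vapply(parts, `[`, character(1), 1)
  feature  <- vapply(parts, `[`, character(1), 2)
  subjects <- unique(subject)

  cols_out <- lapply(subjects, function(s) {
    cols  <- which(subject == s)
    cols  <- cols[order(feature[cols])]        # alphabetical feature order
    block <- mat[, cols, drop = FALSE] != 0    # TRUE where the feature is present
    vapply(seq_len(nrow(mat)), function(i) {
      f <- feature[cols][block[i, ]]
      if (length(f)) paste(f, collapse = ",") else empty
    }, character(1))
  })

  matrix(unlist(cols_out), nrow = nrow(mat), dimnames = list(NULL, subjects))
}

# The example matrix from the question, entered column by column.
mat <- matrix(c(0,0,1,1,0, 1,1,1,1,0, 0,0,1,0,1, 0,1,1,0,1, 1,0,1,0,0),
              nrow = 5,
              dimnames = list(NULL, c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")))
res <- collapse_features(mat)
```

On the example this reproduces `new_mat` from the question. For very large matrices the `tapply()` version is likely the better starting point.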
    

    【Discussion】:

      【Solution 2】:

      Here is a starting point. Depending on how many variables you have, though, this could get cumbersome.

      library(data.table)
      dt = data.table(id = seq_len(nrow(mat)), mat)
      longDt <- melt(dt, id.vars = "id", measure = patterns("^a1::", "^b1::"))
      
      longDt[, .(a1 = list(sort(c("C", "A", "B")[as.logical(value1)])), 
                 b1 = list(sort(c("D", "A")[as.logical(value2)]))), .(id)]
         id    a1  b1
      1:  1     A   A
      2:  2     A   D
      3:  3 A,B,C A,D
      4:  4   A,C    
      5:  5     B   D
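
If hard-coding the subjects and feature letters becomes cumbersome, one possible generalization is to melt every column, split the column name into subject and feature, aggregate, and cast back to wide format. This is a sketch only: the names `key`, `has`, `subject`, `feature` and `features` are chosen for illustration, and the "W" filler is added here because the code above leaves empty cells blank.

```r
library(data.table)

# The example matrix from the question, entered column by column.
mat <- matrix(c(0,0,1,1,0, 1,1,1,1,0, 0,0,1,0,1, 0,1,1,0,1, 1,0,1,0,0),
              nrow = 5,
              dimnames = list(NULL, c("a1::C", "a1::A", "a1::B", "b1::D", "b1::A")))

dt <- as.data.table(mat)
dt[, id := .I]

# Long format: one row per (id, original column).
long <- melt(dt, id.vars = "id", variable.name = "key", value.name = "has",
             variable.factor = FALSE)

# Split "a1::C" into subject "a1" and feature "C".
long[, c("subject", "feature") := tstrsplit(key, "::", fixed = TRUE)]

# Collapse the features present for each (id, subject); use "W" when there are none.
agg <- long[, .(features = if (any(has == 1))
                             paste(sort(feature[has == 1]), collapse = ",")
                           else "W"),
            by = .(id, subject)]

wide <- dcast(agg, id ~ subject, value.var = "features")
```

The result `wide` has an `id` column plus one character column per subject, so nothing needs editing when subjects or features are added.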
      

      【Discussion】:
