【Question Title】: Subset a sparse matrix with data.table in R
【Posted】: 2016-09-24 00:57:22
【Question】:

I tried to use the data.table package to solve the following problem: Is there a faster way to subset a sparse Matrix than '['?

But I got this error:

    Error in Z[, cols] : invalid or not-yet-implemented 'Matrix' subsetting 
10 stop("invalid or not-yet-implemented 'Matrix' subsetting") 
9 Z[, cols] 
8 Z[, cols] 
7 FUN(X[[i]], ...) 
6 lapply(X = ans[index], FUN = FUN, ...) 
5 tapply(.SD, INDEX = "gene_name", FUN = simple_fun, Z = Z, simplify = FALSE) 
4 eval(expr, envir, enclos) 
3 eval(jsub, SDenv, parent.frame()) 
2 `[.data.table`(lkupdt, , tapply(.SD, INDEX = "gene_name", FUN = simple_fun, 
Z = Z, simplify = FALSE), .SDcols = c("snps")) 
1 lkupdt[, tapply(.SD, INDEX = "gene_name", FUN = simple_fun, Z = Z, 
simplify = FALSE), .SDcols = c("snps")] 

Here is my attempt:

library(data.table)
library(Matrix)

set.seed(1)

n_subjects <- 1e3
n_snps <- 1e5
sparsity <- 0.05


n <- floor(n_subjects * n_snps * sparsity)

# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals

# create the data frame on how to split
# real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}

# split our matrix based look up table
system.time(
  res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
lkupdt <- data.table(lkup)
lkupdt[, tapply(.SD, INDEX = 'gene_name' , FUN = simple_fun, Z = Z, simplify = FALSE), .SDcols = c('snps')]

The question is about the last line of code, which tries to replicate the result saved to "res" above. Am I doing something wrong with data.table, or is this simply not possible? Thanks for your help!

【Comments】:

    Tags: r matrix data.table


    【Solution 1】:

    No, I don't think you can use data.table to speed up access to a Matrix object. However, if you are willing to use a data.table instead of a Matrix...
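    As an aside, the error in the question happens because tapply hands .SD (a one-column data.table) to Z[, cols], and Matrix's `[` method cannot index with a data.table. If you just want the original call to work, grouping with `by = gene_name` passes snps to the function as a plain integer vector. A minimal sketch on a small simulated matrix (the sizes here are illustrative, not the question's):

```r
library(data.table)
library(Matrix)

set.seed(1)

# small stand-ins for the Z and lkupdt objects from the question
Z <- Matrix(0, nrow = 10, ncol = 20, sparse = TRUE)
Z[sample(10 * 20, size = 40)] <- rnorm(40)

lkupdt <- data.table(gene_name = rep(c("g1", "g2"), times = c(3, 4)),
                     snps      = sample(20, size = 7))

simple_fun <- function(Z, cols) sum(Z[, cols])

# `by =` passes `snps` as a plain integer vector per group,
# which Matrix's `[` method accepts
res <- lkupdt[, .(V1 = simple_fun(Z, snps)), by = gene_name]
print(res)
```

    This still pays the cost of Matrix subsetting per group; it only fixes the indexing error, not the speed.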

    ZDT = setDT(summary(Z))
    system.time(
      resDT <- ZDT[lkupdt, on = c(j = "snps")][, sum(x), by=gene_name]
    )
    
    # verify correctness
    all.equal(
      unname(unlist(res))[order(as.numeric(substring(names(res), 2, nchar(names(res)))))],
      resDT$V1
    )
    

    which gives a result like

         gene_name         V1
      1:        g1   3.720619
      2:        g2  35.727923
      3:        g3  -3.949385
      4:        g4 -18.253456
      5:        g5   5.970879
     ---                     
    496:      g496 -20.979669
    497:      g497  63.880925
    498:      g498  16.498587
    499:      g499 -17.417110
    500:      g500  45.169608
    

    Of course, you may need to keep the data in a sparse matrix for other reasons, but this is much faster on my computer, and both the input and the output are simpler.
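    In case it helps to see why the join works: summary() on a sparse Matrix returns its triplet (COO) representation, one row per stored entry with columns i (row index), j (column index), and x (value), so joining on j = "snps" and summing x per gene reproduces sum(Z[, cols]). A tiny illustration:

```r
library(Matrix)

Z <- Matrix(0, nrow = 3, ncol = 3, sparse = TRUE)
Z[1, 2] <- 5
Z[3, 1] <- -2

# triplet view of the stored nonzeros: columns i, j, x
trip <- summary(Z)
print(trip)
```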

    【Discussion】:

      【Solution 2】:

      I think sum() is too simple to estimate timings with; you will get a more suitable answer if you show a more realistic function. (I got close without data.table().)

      For example, this kind of function appears to be as fast as or faster than the data.table() approach (of course, this approach cannot be used with a complex function):

      sum.func <- function(Z, lkup) {
        Zsum <- colSums(Z)[lkup$snps]
        Z2 <- cbind(Zsum, lkup$gene_name)
        res <- c(tapply(Z2[,1], Z2[,2], sum))
        names(res) <- levels(lkup$gene_name)
        return(c(res))
      }
      
      system.time(
        test.res <- sum.func(Z, lkup)
      )
      
      all.equal(unlist(res), test.res)
      

      This is more general, but noticeably slower, than the data.table() approach:

      general.fun <- function(Z, lkup) {
        Z2 <- Z[, lkup$snps]
        num.gn <- as.numeric(lkup$gene_name)
        res <- sapply(1:max(num.gn), function(x) sum(Z2[, which(num.gn == x)]))
        names(res) <- levels(lkup$gene_name)
        return(res)
      }
      
      system.time(
        test.res2 <- general.fun(Z, lkup)
      )
      
      all.equal(unlist(res), test.res2)
      

      【Discussion】:
