【Posted】: 2016-09-24 00:57:22
【Question】:
I am trying to use the data.table package to solve the following problem: Is there a faster way to subset a sparse Matrix than '['?
But I get this error:
Error in Z[, cols] : invalid or not-yet-implemented 'Matrix' subsetting
10 stop("invalid or not-yet-implemented 'Matrix' subsetting")
9 Z[, cols]
8 Z[, cols]
7 FUN(X[[i]], ...)
6 lapply(X = ans[index], FUN = FUN, ...)
5 tapply(.SD, INDEX = "gene_name", FUN = simple_fun, Z = Z, simplify = FALSE)
4 eval(expr, envir, enclos)
3 eval(jsub, SDenv, parent.frame())
2 `[.data.table`(lkupdt, , tapply(.SD, INDEX = "gene_name", FUN = simple_fun,
Z = Z, simplify = FALSE), .SDcols = c("snps"))
1 lkupdt[, tapply(.SD, INDEX = "gene_name", FUN = simple_fun, Z = Z,
simplify = FALSE), .SDcols = c("snps")]
Here is my attempt:
library(data.table)
library(Matrix)
set.seed(1)
n_subjects <- 1e3
n_snps <- 1e5
sparcity <- 0.05
n <- floor(n_subjects*n_snps*sparcity)
# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals
# create the data frame on how to split
# real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
snps = sample(n_snps, size = sum(sizes)))
# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}
# split our matrix based on the lookup table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)
lkupdt <- data.table(lkup)
lkupdt[, tapply(.SD, INDEX = 'gene_name' , FUN = simple_fun, Z = Z, simplify = FALSE), .SDcols = c('snps')]
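The question is about the last line of code, which tries to replicate the call above whose result is saved to "res". Am I doing something wrong with data.table, or is this simply not possible? Thanks for your help!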
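A possible direction, sketched under stated assumptions rather than as a definitive answer: inside `[.data.table`, .SD is itself a data.table (a list of columns), so tapply(.SD, ...) most likely hands simple_fun a list instead of an integer vector, and Z[, cols] then fails. Letting data.table do the grouping with by = gene_name passes snps to the function as a plain integer vector per group. The example below uses a much smaller matrix built with Matrix::rsparsematrix (a stand-in for the construction in the question, chosen only to keep the run short), and the output column name res is arbitrary.

```r
library(data.table)
library(Matrix)

set.seed(1)

# Assumption: a small random sparse matrix stands in for the question's Z
Z <- rsparsematrix(nrow = 100, ncol = 1000, density = 0.05)

# Lookup table: 10 genes, 5 distinct SNP column indices each
lkup <- data.frame(
  gene_name = rep(paste0("g", 1:10), times = 5),
  snps      = sample(1000, size = 50)
)

# Same function as in the question
simple_fun <- function(Z, cols) sum(Z[, cols])

# Base R reference result, as in the question
res <- tapply(lkup[, "snps"], lkup[, "gene_name"],
              FUN = simple_fun, Z = Z, simplify = FALSE)

# data.table version: `by` does the grouping, so `snps` is an
# integer vector within each group; list() keeps one result per row
lkupdt <- data.table(lkup)
res_dt <- lkupdt[, .(res = list(simple_fun(Z, snps))), by = gene_name]

# Compare by name, since tapply sorts groups and `by` keeps first-appearance order
all.equal(unlist(res_dt$res),
          unname(unlist(res[as.character(res_dt$gene_name)])))
```

res_dt has one row per gene_name with the per-group result stored in a list column, which also leaves room for the real function to return a matrix per group.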
【Comments】:
Tags: r matrix data.table