R：有效地对许多对象进行子集化答案

【问题标题】：R: subset many objects efficientlyR：有效地对许多对象进行子集化
【发布时间】：2015-08-12 10:12:59
【问题描述】：

我经常使用逻辑向量来对其他向量、矩阵和数据框进行子集化（在基因组学领域，这很常见）。在这样的向量上会这样：

condition <- myNucleotideVector == "G"

然后我处理匹配该条件的子集。所以我经常会得到如下笨拙的代码：

myNucleotideVector <- myNucleotideVector[condition]
object2 <- object2[condition]
dataframe1 <- dataframe1[conditon,]

或

result <- myNucleotideVector[condition] - object2[condition] + dataframe1[conditon,2]

重复[condition] 很多次。我可以放置所有相同大小的向量和一个数据框并将其子集化，但我并不总是想这样做（特别是考虑到向量的长度，可能是数百万）。

我正在寻找一种有效的解决方案。对于上面引用的第一种情况，我想到了一个函数，它将提供的任何对象作为参数并将其子集。那看起来像

subsetObjects <- function(..., condition, env = globalenv()) {          
    call <- as.character(match.call())
    names <- call[2:(length(call)-1)] #this isn't ideal as it doesn't consider the case where one would place 'condition' argument before the objects to subset
    for (name in names) {    
        value <- get(name, envir = env)
        assign(name, subset(value, subset = condition),envir = env)
    }
}

正如您在评论中看到的，它并不完美。也许有人可以提出更有效的建议。

对于第二种情况，我正在寻找类似于with() 的东西，其中每个向量、矩阵或数据框都将根据条件自动成为子集。那看起来像

result <- withCondition(condition, expression)

如果不存在这样的函数，我可以自己编写，但我不知道该怎么做。

谢谢

让

【问题讨论】：

为什么不直接使用lapply(list_of_matrices, function(x) x[condition, ]) 之类的东西？？甚至更简单：lapply(list_of_matrices, subset, vector == "C")，虽然我不确定我是否能 100% 理解你想要什么，但从头到尾一个清晰的例子会有所帮助
这样做首先需要创建一个对象列表，然后从列表中检索对象，而且我不太习惯使用列表。但我注意到了。n
我想把result <- myNucleotideVector[condition] - object2[condition] + dataframe1[conditon,2]变成result <- withCondition(condition, myNucleotideVector - object2 + dataframe1[,2])，但是我需要一个合适的withCondition函数()。

标签： r subset

【解决方案1】：

这里有一个想法，可能有点不寻常：您可以创建一个“getter”函数，它只接受一个参数：您要实例化的变量的名称，而不是直接使用代码中的底层对象代码中的那一点。你可以把它当作一个字符串，或者更好的是，使用substitute() 来允许使用不带引号的符号（实际上，我最终使用了as.character(substitute(var))，所以两者都可以）。在函数内部，您可以查找“全局条件”来决定如何对变量进行子集化，如果它应该被子集化。为了获得最大的灵活性，查找表还可以将每个变量映射到该变量的特定条件。这是我的设想：

## lookup table and getter
cond.to.vec <- list();
COND.NAME.GLOBAL <- '!global';
var.to.cond <- list();
cond.register <- function(name,vec=NULL) {
    prev.vec <- cond.to.vec[[name]];
    cond.to.vec[[name]] <<- vec;
    invisible(prev.vec);
};
cond.is.registered <- function(name) !is.null(cond.to.vec[[name]]);
cond.map <- function(var.name,cond.name=NULL) {
    ## remove previous mapping
    prev.mapping <- var.to.cond[[var.name]];
    var.to.cond[[var.name]] <<- NULL;
    ## omit cond.name arg to just remove
    if (is.null(cond.name)) return(invisible(prev.mapping));
    ## ensure cond.name has been registered
    if (!cond.is.registered(cond.name)) stop(paste0(cond.name,' not registered'));
    ## now add new cond.name mapping for var.name
    var.to.cond[[var.name]] <<- cond.name;
    invisible(prev.mapping);
};
cond.set <- function(var,cond.vec=NULL,sym=T) {
    var.name <- if (sym) as.character(substitute(var)) else var;
    cond.register(var.name,cond.vec);
    cond.map(var.name,if (is.null(cond.vec)) NULL else var.name);
};
cond.set.global <- function(vec=NULL) cond.register(COND.NAME.GLOBAL,vec);
cond.look.up <- function(var.name) {
    ## 1: specific condition
    cond.name <- var.to.cond[[var.name]];
    if (!is.null(cond.name)) return(cond.to.vec[[cond.name]]);
    ## 2: global condition
    vec <- cond.to.vec[[COND.NAME.GLOBAL]];
    if (!is.null(vec)) return(vec);
    ## 3: no condition
    T;
};

ss <- function(var,sym=T) {
    ## whitelist subsettables
    if (!typeof(var)%in%sapply(list(as.raw(0),T,0L,0,0i,'',list(),expression()),typeof))
        return(var);
    var.name <- if (sym) as.character(substitute(var)) else var;
    vec <- cond.look.up(var.name);
    if (length(dim(var)) == 2L) var[vec,] else var[vec];
};

## test data
set.seed(1);
N <- 10;
myNucleotideVector <- sample(c('A','C','T','G'),N,replace=T);
myNucleotideVectorNum <- sample(100:200,N,replace=T);
object2 <- seq_len(N);
dataframe1 <- data.frame(base=sample(c('A','C','T','G'),N,replace=T),x=sample(1:100,N));

## global condition
cond.set.global(myNucleotideVector == 'G');

## main code, uses global condition
result <- ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x;

## register separate condition for object2
cond.set(object2,object2%%3 == 0);
result2 <- ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x;

## unset/unregister all conditions to work with the entire data set
cond.set.global();
cond.set(object2);
result3 <- ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x;

result;
## [1] 153 208 240
result2;
## [1] 154 208 238
result3;
##  [1] 168 175 266 153 252 208 240 203 196 206

现在我们可以用更多的函数来增强上面的代码，以提供一种侵入性较小的方法来应用子集条件：

ss.all.sub <- function(pt) {
    if (typeof(pt) == 'symbol') ## wrap all symbols in ss()
        as.call(list(as.symbol('ss'),pt))
    else if (typeof(pt) == 'language' && length(pt) >= 2L) ## handle function args
        as.call(c(pt[[1]], ## pass function symbol untouched
            if (as.character(pt[[1]]) == '$') ## special case for $ operator
                list(ss.all.sub(pt[[2]]),pt[[3]]) ## pass RHS untouched
            else
                lapply(pt[-1],ss.all.sub) ## recurse on all args
        ))
    else pt; ## pass literals and nullary calls untouched
};

ss.all <- function(expr) eval(ss.all.sub(substitute(expr)));

ss.with <- function(cond.arg,expr) {
    if (is.list(cond.arg)) {
        prevs <- vector('list',length(cond.arg));
        for (i in seq_along(cond.arg)) {
            name <- names(cond.arg)[i];
            prevs[i] <- list(
                if (isTRUE(name != '' && name != COND.NAME.GLOBAL))
                    cond.set(name,cond.arg[[i]],sym=F)
                else
                    cond.set.global(cond.arg[[i]])
            );
        };
    } else prev <- cond.set.global(cond.arg);
    res <- eval(ss.all.sub(substitute(expr)));
    if (is.list(cond.arg)) {
        for (i in seq_along(cond.arg)) {
            name <- names(cond.arg)[i];
            if (isTRUE(name != '' && name != COND.NAME.GLOBAL))
                cond.set(name,prevs[[i]],sym=F)
            else
                cond.set.global(prevs[[i]]);
        };
    } else cond.set.global(prev);
    res;
};

## demo parse tree substitution
ss.all.sub(substitute(myNucleotideVectorNum - object2 + dataframe1$x));
## ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x

## demo using ss.with() to apply an inline condition
ss.with(myNucleotideVector == 'G',myNucleotideVectorNum - object2 + dataframe1$x);
## [1] 153 208 240
ss.with(
    list(myNucleotideVector == 'G',object2=object2%%3 == 0),
    myNucleotideVectorNum - object2 + dataframe1$x
);
## [1] 154 208 238
ss.with(T,myNucleotideVectorNum - object2 + dataframe1$x);
##  [1] 168 175 266 153 252 208 240 203 196 206

【讨论】：

感谢您的建议。但它看起来不会大大简化我的代码，因为我必须用 ss() 替换每个 [condition] 并设置全局条件。
@jeanlain 查看编辑；我改进了我的答案，现在我认为它完全符合您的要求。