Sum columns in a data.frame based on conditions from another data frame in R

【Question title】: Sum columns in a data.frame based on conditions from another data frame in R
【Posted】: 2014-04-01 07:31:28
【Question description】:

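I have two data frames, a and b.

For each row in b, I want to find all start,end pairs in a that lie within that row's start,end, sum the difference end - start over this subset, and store the sum as a new column in b. I am currently using a for loop, but is there a more efficient way to do this in R, for example with apply?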

# data.frame a  
a <- data.frame(chrom=1L, start=as.integer(c(2,4,7,11)), end=as.integer(c(3,6,9,15)))
# chrom start end  
#     1     2   3  
#     1     4   6  
#     1     7   9        
#     1    11  15  

# data.frame b  
b <- data.frame(chr=1L, start=as.integer(c(2,11)), end=as.integer(c(10,20)))
#   chr start end  
#     1     2  10  
#     1    11  20  

# code
result=c()
for (i in seq_len(nrow(b))) {
    # find start,end in a that lie within row i of b (b's chromosome column is `chr`)
    a_subset = a[which(a$chrom == b[i, ]$chr &
                 a$start >= b[i, ]$start &
                 a$end <= b[i, ]$end), ]

    result = append(result, sum(a_subset$end - a_subset$start))  
}
c = cbind(b, result)

# data.frame c
#   chr start end result
#     1     2  10      5
#     1    11  20      4

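【Question comments】:

Search for posts on "GenomicRanges" (a Bioconductor package), which is designed to handle bioinformatics problems involving overlapping ranges efficiently.

Tags: r apply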

【Solution 1】:

Easy with sqldf, tedious in base R:

R>require(sqldf)
R>b$id <- 1:nrow(b)
R>sqldf("select id, b.chr, sum(a.end - a.start) as diff 
    from a, b where a.start >= b.start and b.end >= a.end group by id")
  id chr diff
1  1   1    5
2  2   1    4
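
For comparison, here is a minimal base R sketch of the same computation using vapply, as one possible way to replace the loop in the question. The helper name sum_within is hypothetical (it appears in neither the question nor this answer), and unlike the query above it also compares chromosomes, as the question's loop does.

```r
# Example data from the question (note that b names its chromosome column `chr`)
a <- data.frame(chrom = 1L, start = c(2L, 4L, 7L, 11L), end = c(3L, 6L, 9L, 15L))
b <- data.frame(chr = 1L, start = c(2L, 11L), end = c(10L, 20L))

# Hypothetical helper: for row i of b, sum (end - start) over the
# intervals of a that lie entirely inside b's interval on the same chromosome
sum_within <- function(i) {
  inside <- a$chrom == b$chr[i] &
    a$start >= b$start[i] &
    a$end <= b$end[i]
  sum(a$end[inside] - a$start[inside])
}

# One value per row of b; rows with no contained interval get 0
b$result <- vapply(seq_len(nrow(b)), sum_within, numeric(1))

# b$result is now c(5, 4)
print(b)
```

This still scans all of a once per row of b, so it is tidier than the loop rather than asymptotically faster. For large inputs, a dedicated interval package such as the GenomicRanges package mentioned in the comments is likely the better fit.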

【Discussion】: