【Question title】: R - output overlapping intervals
【Posted】: 2014-02-07 08:24:39
【Question】:

fileA contains intervals (start, end) and a value assigned to each interval (value).

start     end      value
0         123      1      #value 1 at positions 0 to 122 included.
123       78000    0      #value 0 at positions 123 to 77999 included.
78000     78004    56     #value 56 at positions 78000, 78001, 78002 and 78003.
78004     78005    12     #value 12 at position 78004.
78005     78006    1      #value 1 at position 78005.
78006     78008    21     #value 21 at positions 78006 and 78007.
78008     78056    8      #value 8 at positions 78008 to 78055 included.
78056     81000    0      #value 0 at positions 78056 to 80999 included.

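fileB contains a list of intervals I am interested in. I want to retrieve the overlapping intervals from fileA. The starts and ends do not necessarily match. Here is an example of fileB: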

start     end      label
77998     78005    romeo
78007     78012    juliet

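The goal is to (1) retrieve the intervals from fileA that overlap with an interval in fileB, and (2) append the corresponding label from fileB. The expected result is (# marks a discarded row; it is only there to help visualization and will not appear in the final output):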

start     end      value    label
#
123       78000    0        romeo
78000     78004    56       romeo
78004     78005    12       romeo
#
78006     78008    21       juliet
78008     78056    8        juliet
#

Here is my attempt at the code:

#read from tab-delimited text files which do not contain column names
A<-read.table("fileA.txt",sep="\t",colClasses=c("numeric","numeric","numeric"))
B<-read.table("fileB.txt",sep="\t",colClasses=c("numeric","numeric","character"))

#add column names
colnames(A)<-c("start","end","value")
colnames(B)<-c("start","end","label")

#output intervals in `fileA` that overlap with an interval in `fileB`
A_overlaps<-A[((A$start <= B$start & A$end >= B$start)
              |(A$start >= B$start & A$start <= B$end)
              |(A$end >= B$start & A$end <= B$end)),]

At this point I am already getting unexpected results:

> A_overlaps
  start   end value
  #missing
3 78000 78004    56
5 78005 78006     1   #this line should not be here
6 78006 78008    21
  #missing

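I have not yet written the part that outputs the labels, since I might as well fix this first, but I cannot figure out what I am doing wrong...

[Edit] I also tried the following, but it just outputs the whole of fileA: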

A_overlaps <- A[(min(A$start,A$end) < max(B$start,B$end)
               & max(A$start,A$end) > min(B$start,B$end)),]
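
A note on why both attempts behave this way. `A$start <= B$start` compares two vectors element by element, and R recycles the shorter one, so each row of A is compared with only one row of B (odd rows with romeo, even rows with juliet). In the second attempt, `min()` and `max()` each collapse all of their arguments to a single number, so the condition is one logical value and `A[TRUE, ]` returns every row. A minimal sketch, assuming the data from the question is typed in by hand instead of being read from the files:

```r
# Data from the question, built inline instead of read from fileA.txt / fileB.txt
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# B$start has length 2 and A$start has length 8, so B$start is recycled
# to 77998, 78007, 77998, 78007, ... before the comparison is made.
recycled <- rep(B$start, length.out = nrow(A))
same <- identical(A$start <= B$start, A$start <= recycled)

# min() and max() each return ONE number, so this is a single logical value.
cond <- min(A$start, A$end) < max(B$start, B$end) &
        max(A$start, A$end) > min(B$start, B$end)

print(same)          # the recycled comparison is what R actually computed
print(length(cond))  # one value, not one per row of A
```

Because of the recycling, row 2 (`123 78000`) is only ever tested against juliet and row 7 (`78008 78056`) only against romeo, which is why both go missing; row 5 slips in because the comparisons use `<=` and `>=` on intervals whose end is exclusive.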

【Comments】:

  • There is the intervals package

Tags: r intervals overlap


【Solution 1】:

This produces the desired output, but it may be a bit hard to read:

# TRUE if x lies strictly inside the interval (a, b), endpoints excluded
is.between <- function(x, a, b) {
  (x - a)  *  (b - x) > 0
}
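
The product `(x - a) * (b - x)` is positive only when both factors have the same sign, which for `a < b` means `a < x < b`; at either endpoint it is exactly zero. That strictness is what keeps the row `78005 78006` out of the romeo interval. A quick check, restating the function so the snippet runs on its own (the variable names are only for illustration):

```r
is.between <- function(x, a, b) {
  (x - a) * (b - x) > 0
}

inside   <- is.between(78004, 77998, 78005)  # strictly inside the interval
at_end   <- is.between(78005, 77998, 78005)  # exactly on the upper boundary
at_start <- is.between(77998, 77998, 78005)  # exactly on the lower boundary
outside  <- is.between(123,   77998, 78005)  # left of the interval

print(c(inside = inside, at_end = at_end, at_start = at_start, outside = outside))
```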

# for every row of A, find the rows of B that it overlaps
matching <- apply(A, MARGIN=1, FUN=function(x){
  # indices of the rows of B that fulfill the following condition:
  which(apply(B, MARGIN=1, FUN=function(y){
    # the start of the A interval lies inside the B interval, or its end does
    is.between(as.numeric(x[1]), as.numeric(y[1]), as.numeric(y[2])) |
      is.between(as.numeric(x[2]), as.numeric(y[1]), as.numeric(y[2]))
  }))
})

# print the results
> matching
[[1]]
integer(0)

[[2]]
[1] 1

[[3]]
[1] 1

[[4]]
[1] 1

[[5]]
integer(0)

[[6]]
[1] 2

[[7]]
[1] 2

[[8]]
integer(0)

# keep only the rows of A that matched at least one row of B
A_overlaps <- A[unlist(lapply(matching, FUN=function(x)length(x)>0)),]
# add the label of the matching row of B
A_overlaps$label <- B$label[unlist(matching)]

> A_overlaps
  start   end value  label
2   123 78000     0  romeo
3 78000 78004    56  romeo
4 78004 78005    12  romeo
6 78006 78008    21 juliet
7 78008 78056     8 juliet
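
Two limitations of the endpoint test are worth keeping in mind: an interval of A that completely contains an interval of B has neither endpoint inside it and so is not reported, and the label step assumes each row of A matches at most one row of B. A possible vectorized variant is sketched here; it is not the answerer's method. It assumes both files use half-open intervals `[start, end)`, as the comments in fileA suggest, and the names `overlap` and `hits` are introduced only for this example. Two such intervals overlap exactly when each one starts before the other ends.

```r
# Data from the question, built inline instead of read from the files
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# overlap[i, j] is TRUE when row i of A and row j of B share a position:
# A starts before B ends, and A ends after B starts.
overlap <- outer(A$start, B$end, "<") & outer(A$end, B$start, ">")

# One (row of A, row of B) pair per overlap, ordered by row of A
hits <- which(overlap, arr.ind = TRUE)
hits <- hits[order(hits[, "row"], hits[, "col"]), , drop = FALSE]

# Rows of A that overlap something, with the label of the B row they hit
A_overlaps <- A[hits[, "row"], ]
A_overlaps$label <- B$label[hits[, "col"]]

print(A_overlaps)
```

The `outer()` calls build an `nrow(A)` by `nrow(B)` matrix, so this is convenient for small inputs like the one in the question; for large files the intervals package mentioned in the comments, or another dedicated overlap tool, is likely a better fit.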

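【Comments】:

  • Wow, I am not sure I understand all of it, but it does work. Thank you!
  • I added some explanations to the apply function
  • Thank you very much! This is the first time I have come across the apply() function; it looks useful and I will look into it. :)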