【Posted】: 2018-09-25 10:32:37
【Question】:
I am working with a large data frame that can be represented by the following example:
chromosome position position2 name Occup
Chr1 1 1 - 0.023
Chr1 2 2 - 0.023
Chr1 3 3 - 0.023
Chr1 4 4 - 0.023
Chr1 5 5 - 0.023
Chr1 6 6 - 0.069
Chr1 7 7 - 0.069
Chr1 8 8 - 0.069
Chr1 9 9 - 0.069
Chr1 10 10 - 0.116
Chr1 11 11 - 0.116
Chr1 12 12 - 0.116
Chr1 13 13 - 0.023
Chr1 14 14 - 0.023
Chr1 15 15 - 0.023
Chr1 16 16 - 0.023
Chr1 17 17 - 0.023
You can recreate it like this:
dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"),
position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),
position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),
name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"),
Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))
I want to collapse it into a data frame like this:
chromosome position position2 name Occup
Chr1 1 5 - 0.023
Chr1 6 9 - 0.069
Chr1 10 12 - 0.116
Chr1 13 17 - 0.023
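The problem with a basic collapse is that all rows sharing the same Occup value end up in a single group, even when they are not adjacent. That is not what I want. I want consecutive rows to be grouped together only until the Occup value changes in the next row.
If I do this: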
library(plyr)
test<-ddply(dtf, .(Occup), summarise,
position_start=min(position),
position_end= max(position2))
I get:
Occup position_start position_end
0.023 1 17
0.069 6 9
0.116 10 12
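So it is close to what I want, but not quite: the two separate runs of 0.023 (positions 1-5 and 13-17) are merged into one row.
Columns 1 and 3 do not need to be taken into account, because in this case they are arbitrary and contain the same information for every row.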
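To make the grouping described above concrete, here is a minimal sketch in base R. It is an illustration rather than a definitive solution: it assumes the rows are already sorted by position, that equal Occup values are exactly equal as stored numbers, and that only one chromosome is present. The names `runs`, `run_id` and `collapsed` are made up for this example.

```r
# Rebuild the example data from the question in a more compact form.
dtf <- data.frame(
  chromosome = rep("Chr1", 17),
  position   = 1:17,
  position2  = 1:17,
  name       = rep("-", 17),
  Occup      = rep(c(0.023, 0.069, 0.116, 0.023), times = c(5, 4, 3, 5))
)

# rle() finds runs of consecutive identical values. Numbering the runs
# gives a group id that increases every time Occup changes between rows,
# so the two separate stretches of 0.023 get different ids.
runs   <- rle(dtf$Occup)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# First row of each run, used to pick up the columns that are constant
# within a run.
first <- !duplicated(run_id)

# Collapse each run to one row: smallest start, largest end.
collapsed <- data.frame(
  chromosome = dtf$chromosome[first],
  position   = as.vector(tapply(dtf$position,  run_id, min)),
  position2  = as.vector(tapply(dtf$position2, run_id, max)),
  name       = dtf$name[first],
  Occup      = dtf$Occup[first]
)

print(collapsed)
```

With the example data this yields four rows, matching the desired table. If the real data spans several chromosomes, the run id would presumably need to be built from chromosome and Occup together (for instance by running `rle()` on `paste(dtf$chromosome, dtf$Occup)`), so that a run does not continue across a chromosome boundary.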
【Comments】:
Tags: r dataframe bioinformatics plyr collapse