Collapsing a basic dataframe based on adjacent rows: answers

【Question title】: Collapsing a basic dataframe based on adjacent rows
【Posted】: 2018-09-25 10:32:37
【Question】:

I am working on a large dataframe that can be represented by the following example:

chromosome  position    position2   name    Occup       
Chr1    1   1   -   0.023
Chr1    2   2   -   0.023
Chr1    3   3   -   0.023
Chr1    4   4   -   0.023
Chr1    5   5   -   0.023
Chr1    6   6   -   0.069
Chr1    7   7   -   0.069
Chr1    8   8   -   0.069
Chr1    9   9   -   0.069
Chr1    10  10  -   0.116
Chr1    11  11  -   0.116
Chr1    12  12  -   0.116
Chr1    13  13  -   0.023
Chr1    14  14  -   0.023
Chr1    15  15  -   0.023
Chr1    16  16  -   0.023
Chr1    17  17  -   0.023

You can read it in like this:

dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"), 
                position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), 
                position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),        
                name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"), 
                Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))

I would like to collapse it into a dataframe like this:

chromosome  position    position2   name    Occup       
Chr1    1   5   -   0.023
Chr1    6   9   -   0.069
Chr1    10  12  -   0.116
Chr1    13  17  -   0.023

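The problem with a basic collapse is that all rows with the same Occup value are put into a single group. That is not what I want. I want rows to be gathered into one group only until the value changes in the next row.

If I do this: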

library(plyr)
test<-ddply(dtf, .(Occup), summarise,
      position_start=min(position),
      position_end= max(position2))

I get:

Occup   position_start  position_end    
0.023   1   17
0.069   6   9
0.116   10  12

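So it is close to what I want, but not quite: the two separate runs of 0.023 (positions 1 to 5 and 13 to 17) are merged into a single row spanning 1 to 17.

There is no need to take columns 1 or 3 into account, since in this case those columns are arbitrary and contain the same information for every row.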

【Comments】:

Tags: r dataframe bioinformatics plyr collapse

【Solution 1】:

This should work:

library(dplyr)

dtf_grouped <- dtf %>%
    arrange(position) %>% # to ensure data is sequential
    mutate(
        occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
        occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
        group_id = cumsum(occup_shift)
        ) %>%
    group_by(group_id) %>%
    summarize(
        Occup = min(Occup),
        position_start = position[1],
        position_end = position2[n()]
    ) %>%
    select(-group_id)

head(dtf_grouped)

# A tibble: 4 x 3
   Occup position_start position_end
   <dbl>          <dbl>        <dbl>
1 0.0230              1            5
2 0.0690              6            9
3 0.116              10           12
4 0.0230             13           17

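【Comments】:

Awesome, you are a god! It works like a charm; I need some time to figure out all the details and learn it myself, but thanks.
Small suggestion: ifelse(is.na(occup_shift), FALSE, occup_shift) can be written as coalesce(occup_shift, FALSE).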
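
For reference, here is a minimal sketch of how Solution 1 might look with the coalesce() suggestion folded in. It assumes the example data from the question (rebuilt here with rep() for brevity) and compares Occup with != directly instead of subtracting; first() and last() stand in for the position[1] and position2[n()] indexing. This is one possible rewrite, not the answerer's own code.

```r
library(dplyr)

# Example data from the question: Occup runs of 5, 4, 3 and 5 rows.
dtf <- data.frame(
  chromosome = "Chr1",
  position   = 1:17,
  position2  = 1:17,
  name       = "-",
  Occup      = rep(c(0.023, 0.069, 0.116, 0.023), times = c(5, 4, 3, 5))
)

dtf_grouped <- dtf %>%
  arrange(position) %>%  # make sure rows are in positional order
  mutate(
    # TRUE where Occup differs from the previous row. The first row has no
    # predecessor, so lag() gives NA there; coalesce() replaces it with FALSE.
    occup_shift = coalesce(Occup != lag(Occup), FALSE),
    # Running count of changes: constant within a run, +1 at each change.
    group_id    = cumsum(occup_shift)
  ) %>%
  group_by(group_id) %>%
  summarize(
    Occup          = first(Occup),
    position_start = first(position),
    position_end   = last(position2)
  ) %>%
  select(-group_id)

dtf_grouped
```

On the example data this should give the same four rows as the output shown in the answer (1 to 5, 6 to 9, 10 to 12, 13 to 17).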

【Solution 2】:

We can group by runs of consecutive identical values in Occup, then take the min and max of the positions:

library(dplyr)

res <- dtf %>% 
  group_by(chromosome,
           # create group for consecutive numbers
           myGroup = cumsum(c(1, diff(Occup) != 0))) %>% 
  summarise(position = min(position),
            position2 = max(position2),
            Occup = min(Occup)) %>% 
  ungroup() %>% 
  select(-myGroup)


res

# # A tibble: 4 x 4
#   chromosome position position2  Occup
#   <fct>         <dbl>     <dbl>  <dbl>
# 1 Chr1             1.        5. 0.0230
# 2 Chr1             6.        9. 0.0690
# 3 Chr1            10.       12. 0.116 
# 4 Chr1            13.       17. 0.0230
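
To see what the grouping expression in this answer does, the sketch below evaluates cumsum(c(1, diff(Occup) != 0)) on the example Occup vector by itself, in base R only. The rle() call is an extra cross-check added here, not part of the answer; it reports the same runs as lengths and values.

```r
# Occup column from the question's example data.
Occup <- rep(c(0.023, 0.069, 0.116, 0.023), times = c(5, 4, 3, 5))

# diff() compares each value with its predecessor; a non-zero difference
# marks the first row of a new run. The leading 1 starts the first run.
run_id <- cumsum(c(1, diff(Occup) != 0))
run_id
# 1 1 1 1 1 2 2 2 2 3 3 3 4 4 4 4 4

# Cross-check with run-length encoding: one entry per run.
runs <- rle(Occup)
runs$lengths
# 5 4 3 5
```

Because the second run of 0.023 gets its own id (4) rather than sharing id 1, grouping by run_id keeps positions 13 to 17 separate from positions 1 to 5.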

【Comments】: