【Question title】: Collapsing a basic data frame based on adjacent rows
【Posted】: 2018-09-25 10:32:37
【Question】:

I am working with a large data frame that can be represented by the following example:

chromosome  position    position2   name    Occup       
Chr1    1   1   -   0.023
Chr1    2   2   -   0.023
Chr1    3   3   -   0.023
Chr1    4   4   -   0.023
Chr1    5   5   -   0.023
Chr1    6   6   -   0.069
Chr1    7   7   -   0.069
Chr1    8   8   -   0.069
Chr1    9   9   -   0.069
Chr1    10  10  -   0.116
Chr1    11  11  -   0.116
Chr1    12  12  -   0.116
Chr1    13  13  -   0.023
Chr1    14  14  -   0.023
Chr1    15  15  -   0.023
Chr1    16  16  -   0.023
Chr1    17  17  -   0.023

You can read it in like this:

dtf = data.frame(chromosome=c("Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1","Chr1"), 
                position=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17), 
                position2=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17),        
                name=c("-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-","-"), 
                Occup=c(0.023,0.023,0.023,0.023,0.023,0.069,0.069,0.069,0.069,0.116,0.116,0.116,0.023,0.023,0.023,0.023,0.023))

I would like to collapse it into a data frame like this:

chromosome  position    position2   name    Occup       
Chr1    1   5   -   0.023
Chr1    6   9   -   0.069
Chr1    10  12  -   0.116
Chr1    13  17  -   0.023

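The problem with a basic collapse is that all rows sharing the same Occup value end up in a single group. That is not what I want. I want rows to be grouped together only until the value changes in the next row.

If I do this: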
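
One way to picture "grouped until the next row changes" is base R's `rle()`, which reports runs of identical adjacent values. The following is only an illustrative sketch; the vector `occup` and the name `run_id` are assumptions introduced here and are not part of the original post.

```r
# Occup values copied from the example data above
occup <- c(rep(0.023, 5), rep(0.069, 4), rep(0.116, 3), rep(0.023, 5))

# rle() collapses the vector into runs of identical adjacent values
runs <- rle(occup)

# Expand the runs back into one run id per row (hypothetical helper column)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

print(runs$lengths)
print(runs$values)
```

Because the two separate stretches of 0.023 are not adjacent, they form two distinct runs, which is the grouping the desired output calls for.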

library(plyr)
test<-ddply(dtf, .(Occup), summarise,
      position_start=min(position),
      position_end= max(position2))

I get

Occup   position_start  position_end    
0.023   1   17
0.069   6   9
0.116   10  12

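So it is close to what I want, but not quite.

There is no need to take column 1 or 3 into account, because in this case those columns are arbitrary and contain the same information for all rows.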

【Question comments】:

    Tags: r dataframe bioinformatics plyr collapse


    【Solution 1】:

    This should work:

    library(dplyr)
    
    dtf_grouped <- dtf %>%
        arrange(position) %>% # to ensure data is sequential
        mutate(
            occup_shift = Occup - lag(Occup, 1) != 0, # flag row change
            occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift), # replace NA's
            group_id = cumsum(occup_shift)
            ) %>%
        group_by(group_id) %>%
        summarize(
            Occup = min(Occup),
            position_start = position[1],
            position_end = position2[n()]
        ) %>%
        select(-group_id)
    
    head(dtf_grouped)
    
    # A tibble: 4 x 3
       Occup position_start position_end
       <dbl>          <dbl>        <dbl>
    1 0.0230              1            5
    2 0.0690              6            9
    3 0.116              10           12
    4 0.0230             13           17
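
To see how the grouping key in this answer is built, it can help to stop before `group_by()` and look at the helper columns. This is a minimal sketch that rebuilds only the `position` and `Occup` columns from the question; the data frame name `steps` is an assumption made for illustration.

```r
library(dplyr)

# Occup values copied from the question's example data
occup <- c(rep(0.023, 5), rep(0.069, 4), rep(0.116, 3), rep(0.023, 5))

steps <- data.frame(position = seq_along(occup), Occup = occup) %>%
  arrange(position) %>%
  mutate(
    # TRUE where the value differs from the previous row; NA on the first row
    occup_shift = Occup - lag(Occup, 1) != 0,
    # the first row has no predecessor, so treat it as "no change"
    occup_shift = ifelse(is.na(occup_shift), FALSE, occup_shift),
    # each TRUE increments the counter, so every run gets its own id
    group_id = cumsum(occup_shift)
  )

print(steps)
```

In this sketch `group_id` starts at 0 and increases by one at rows 6, 10 and 13, which is where `Occup` changes.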
    

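    【Comments】:

    • Awesome, you are a god! It works like a charm; I will need some time to work through all the details and learn it myself, but thank you.
    • Small suggestion: ifelse(is.na(occup_shift), FALSE, occup_shift) can be written as coalesce(occup_shift, FALSE)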
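
The suggestion in the last comment can be checked on a small logical vector. This is only a sketch; the vector below is made-up illustration data, and `with_ifelse` and `with_coalesce` are hypothetical names.

```r
library(dplyr)

# A change flag with NA in the first position, as lag() would produce
occup_shift <- c(NA, FALSE, TRUE, FALSE)

# Original form from the answer
with_ifelse <- ifelse(is.na(occup_shift), FALSE, occup_shift)

# Form suggested in the comment: replace NA with FALSE
with_coalesce <- coalesce(occup_shift, FALSE)

print(identical(with_ifelse, with_coalesce))
```

Both forms replace the missing first element with `FALSE` and leave the other elements unchanged.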
    【Solution 2】:

    We can group by runs of consecutive equal values (Occup), then take the min and max:

    library(dplyr)
    
    res <- dtf %>% 
      group_by(chromosome,
               # create group for consecutive numbers
               myGroup = cumsum(c(1, diff(Occup) != 0))) %>% 
      summarise(position = min(position),
                position2 = max(position2),
                Occup = min(Occup)) %>% 
      ungroup() %>% 
      select(-myGroup)
    
    
    res
    
    # # A tibble: 4 x 4
    #   chromosome position position2  Occup
    #   <fct>         <dbl>     <dbl>  <dbl>
    # 1 Chr1             1.        5. 0.0230
    # 2 Chr1             6.        9. 0.0690
    # 3 Chr1            10.       12. 0.116 
    # 4 Chr1            13.       17. 0.0230
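
The grouping expression `cumsum(c(1, diff(Occup) != 0))` used in this answer can be unpacked on a plain vector. The sketch below uses base R only; `occup`, `changed` and `my_group` are names assumed here for illustration.

```r
# Occup values copied from the question's example data
occup <- c(rep(0.023, 5), rep(0.069, 4), rep(0.116, 3), rep(0.023, 5))

# diff() returns n - 1 differences; a non-zero difference marks a change
changed <- diff(occup) != 0

# Prepend 1 for the first row, then count changes cumulatively
my_group <- cumsum(c(1, changed))

print(my_group)

# Start and end position of each run, using row numbers as positions
print(tapply(seq_along(occup), my_group, min))
print(tapply(seq_along(occup), my_group, max))
```

Prepending 1 keeps the result the same length as the input and makes the group ids start at 1 rather than 0.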
    

    【Comments】:
