【Question title】: Split columns into multiple columns and keep the next
【Posted】: 2018-06-25 11:03:50
【Question】:

I have a data frame in the following format:

i               j               score
chr12-100000000 chr12.100000000 0.333000
chr12-100000000 chr12.100050000 0.169200
chr12-100000000 chr12.100100000 0.054980

I want to convert it into separate columns:

chr_firstside   position_firstside  chr_secondside  position_secondside score
chr12           100000000           chr12           100000000           0.333000
chr12           100000000           chr12           100050000           0.169200
chr12           100000000           chr12           100100000           0.054980

I want the result to be tab-delimited, and I want to do this in R. I tried the following, but it does not work:

library(data.table)
setDT(converted)[ , tstrsplit(i '[-]', type.convert=TRUE)]

【Question discussion】:

    Tags: r regex dataframe multiple-columns


    【Solution 1】:

    Using tidyr:

    library(tidyr)
    
    df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"), 
                     j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"), 
                     score = c(0.333, 0.1692, 0.05498),
                     stringsAsFactors = FALSE)
    
    df %>% 
        separate(i, c('chr_i', 'position_i'), convert = TRUE) %>% 
        separate(j, c('chr_j', 'position_j'), convert = TRUE)
    #>   chr_i position_i chr_j position_j   score
    #> 1 chr12  100000000 chr12  100000000 0.33300
    #> 2 chr12  100000000 chr12  100050000 0.16920
    #> 3 chr12  100000000 chr12  100100000 0.05498
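
The question also asks for tab-delimited output, which the snippet above does not cover. A minimal sketch in base R, assuming the separated result is held in a data frame (named `result` here purely for illustration, and rebuilt by hand so the block is self-contained) and written to a temporary file rather than a real path:

```r
# Illustrative data: the separated result from above, rebuilt by hand
result <- data.frame(
  chr_i      = rep("chr12", 3),
  position_i = rep(100000000L, 3),
  chr_j      = rep("chr12", 3),
  position_j = c(100000000L, 100050000L, 100100000L),
  score      = c(0.333, 0.1692, 0.05498),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".tsv")  # hypothetical output path

# sep = "\t" gives tab-delimited output; quoting and row names are
# switched off so the file holds only the header and the values
write.table(result, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the delimiter and the shape
header     <- readLines(out_file, n = 1)
round_trip <- read.delim(out_file, stringsAsFactors = FALSE)
```

Replacing `out_file` with a real file name would write the table to disk in the working directory.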
    

    A long format may be more practical, though:

    df_long <- df %>% 
        gather(var, val, i:j) %>% 
        separate(val, c('chr', 'position'), convert = TRUE) 
    
    df_long
    #>     score var   chr  position
    #> 1 0.33300   i chr12 100000000
    #> 2 0.16920   i chr12 100000000
    #> 3 0.05498   i chr12 100000000
    #> 4 0.33300   j chr12 100000000
    #> 5 0.16920   j chr12 100050000
    #> 6 0.05498   j chr12 100100000
    

    ...and if you want to go back to wide format, that is possible too:

    df_wide <- df_long %>% 
        gather(var2, val, chr:position) %>% 
        unite(var, var2, var) %>%
        spread(var, val, convert = TRUE)
    
    df_wide
    #> # A tibble: 3 x 5
    #>    score chr_i chr_j position_i position_j
    #>    <dbl> <chr> <chr>      <int>      <int>
    #> 1 0.0550 chr12 chr12  100000000  100100000
    #> 2 0.169  chr12 chr12  100000000  100050000
    #> 3 0.333  chr12 chr12  100000000  100000000
    

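    【Discussion】:

      【Solution 2】:

      A base R option with read.table is to Map over the first two columns, passing read.table the corresponding sep for each so that every column is split into several. The list output is then combined with cbind, renamed with the intended column names ('nm1'), and finally bound to the 'score' column with another cbind.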

      nm1 <- paste0(c('chr_', 'position_'), rep(c('firstside', 'secondside'), each = 2))
      cbind(setNames(do.call(cbind, Map(read.table, text=df[1:2],  
                     sep = list("-", "."))), nm1), df['score'])
      #  chr_firstside position_firstside chr_secondside position_secondside   score
      #1         chr12          100000000          chr12           100000000 0.33300
      #2         chr12          100000000          chr12           100050000 0.16920
      #3         chr12          100000000          chr12           100100000 0.05498
      

      【Discussion】:

        【Solution 3】:

        Using sub:

        df$chr_firstside <- sub("^([^-]+).*", "\\1", df$i)
        df$position_firstside <- sub(".*?([^-]+)$", "\\1", df$i)
        df$chr_secondside <- sub("^([^.]+).*", "\\1", df$j)
        df$position_secondside <- sub(".*?([^.]+)$", "\\1", df$j)
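
To see what the two patterns do, here is a small sketch applied to a single value taken from column i. The variable names (`x`, `chr`, `pos`, `pos_int`) are illustrative only. Note that sub returns character, so the position needs an explicit conversion if a numeric column is wanted:

```r
x <- "chr12-100000000"  # one value from column i

# "^([^-]+).*" captures everything before the first "-" and drops the rest
chr <- sub("^([^-]+).*", "\\1", x)

# ".*?([^-]+)$" captures the trailing run of characters that are not "-"
pos <- sub(".*?([^-]+)$", "\\1", x)

# sub() returns a character vector, so convert the position explicitly
pos_int <- as.integer(pos)
```

The same conversion could be applied to the new position columns of df after the four sub calls.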
        

        If you no longer need the i and j columns, you can also remove them from the data frame:

        df <- df[ , -which(names(df) %in% c("i","j"))]
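
One caveat with negative indexing through which: if none of the names match, which returns an empty vector and the selection keeps zero columns instead of all of them. A hedged alternative sketch that selects the columns to keep by name (the one-row data frame and the names `keep`, `df_kept`, and `no_match` are made up for illustration):

```r
# Illustrative one-row data frame with the same column names
df <- data.frame(i = "chr12-100000000", j = "chr12.100000000",
                 score = 0.333, stringsAsFactors = FALSE)

# Keep every column whose name is not "i" or "j"
keep    <- setdiff(names(df), c("i", "j"))
df_kept <- df[keep]  # single-bracket list-style indexing keeps a data frame

# The pitfall: i and j are already gone, so which() returns integer(0)
# and the negative index selects no columns at all
no_match <- df_kept[, -which(names(df_kept) %in% c("i", "j")), drop = FALSE]
```

Here `df_kept` still holds the score column, while `no_match` has no columns left.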
        


        【Discussion】:

        • This works well! Is there a reason the score ends up in the first column?
        【Solution 4】:

        Base R with strsplit:

        split_temp <- sapply(lapply(converted[, 1:2], strsplit, "[\\.-]"), unlist)
        row_pos <- 1:nrow(split_temp) %% 2 == 0L
        converted2 <- data.frame(chr_firstside       = split_temp[!row_pos, "i"],
                                 position_firstside  = split_temp[row_pos, "i"],
                                 chr_secondside      = split_temp[!row_pos, "j"],
                                 position_secondside = split_temp[row_pos, "j"],
                                 score               = converted$score)
        print(converted2)
          chr_firstside position_firstside chr_secondside position_secondside   score
        1         chr12          100000000          chr12           100000000 0.33300
        2         chr12          100000000          chr12           100050000 0.16920
        3         chr12          100000000          chr12           100100000 0.05498
        

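        【Discussion】:

          【Solution 5】:

          I would recommend cSplit from my "splitstackshape" package, which lets you supply a vector of split characters, one for each column being split.

          Demo (using the sample data from @alistaire's answer):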

          library(splitstackshape)
          cSplit(df, c("i", "j"), c("-", "."))
          #      score   i_1       i_2   j_1       j_2
          # 1: 0.33300 chr12 100000000 chr12 100000000
          # 2: 0.16920 chr12 100000000 chr12 100050000
          # 3: 0.05498 chr12 100000000 chr12 100100000
          

          Use setcolorder to change the column order:

          setcolorder(cSplit(df, c("i", "j"), c("-", ".")), c(2:5, 1))[]
          #      i_1       i_2   j_1       j_2   score
          # 1: chr12 100000000 chr12 100000000 0.33300
          # 2: chr12 100000000 chr12 100050000 0.16920
          # 3: chr12 100000000 chr12 100100000 0.05498
          

          【Discussion】:
