Split columns into multiple columns and keep the next

【Question title】: Split columns in multiple columns and keep the next
【Posted】: 2018-06-25 11:03:50
【Question】:

I have a data frame in the following format:

i               j               score
chr12-100000000 chr12.100000000 0.333000
chr12-100000000 chr12.100050000 0.169200
chr12-100000000 chr12.100100000 0.054980

I would like to convert it into separate columns:

chr_firstside   position_firstside  chr_secondside  position_secondside score
chr12           100000000           chr12           100000000           0.333000
chr12           100000000           chr12           100050000           0.169200
chr12           100000000           chr12           100100000           0.054980

I want the result to be tab-separated, and I want to do this in R. I tried the following, but it does not work:

library(data.table)
setDT(converted)[ , tstrsplit(i '[-]', type.convert=TRUE)]

【Question comments】:

Tags: r regex dataframe multiple-columns

【Solution 1】:

Using tidyr:

library(tidyr)

df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"), 
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"), 
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

df %>% 
    separate(i, c('chr_i', 'position_i'), convert = TRUE) %>% 
    separate(j, c('chr_j', 'position_j'), convert = TRUE)
#>   chr_i position_i chr_j position_j   score
#> 1 chr12  100000000 chr12  100000000 0.33300
#> 2 chr12  100000000 chr12  100050000 0.16920
#> 3 chr12  100000000 chr12  100100000 0.05498
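
Since the question asks for specific column names and tab-separated output, here is a minimal sketch of one way to get there with the same separate() calls. It assumes the sample df above. The explicit sep arguments and the temporary file out_file are illustrative choices and not part of the original answer.

```r
library(tidyr)

# Sample data, same as in the answer above
df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

# 'sep' is treated as a regular expression, so the literal dot must be escaped
res <- separate(df, i, c("chr_firstside", "position_firstside"),
                sep = "-", convert = TRUE)
res <- separate(res, j, c("chr_secondside", "position_secondside"),
                sep = "\\.", convert = TRUE)

# Write a tab-separated file (out_file is a hypothetical temporary path)
out_file <- tempfile(fileext = ".tsv")
write.table(res, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Naming each separator explicitly keeps the split from depending on separate()'s default pattern, which splits on any run of non-alphanumeric characters.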

However, a long format may be more practical:

df_long <- df %>% 
    gather(var, val, i:j) %>% 
    separate(val, c('chr', 'position'), convert = TRUE) 

df_long
#>     score var   chr  position
#> 1 0.33300   i chr12 100000000
#> 2 0.16920   i chr12 100000000
#> 3 0.05498   i chr12 100000000
#> 4 0.33300   j chr12 100000000
#> 5 0.16920   j chr12 100050000
#> 6 0.05498   j chr12 100100000

...and if you want to go back to wide format, that is possible too:

df_wide <- df_long %>% 
    gather(var2, val, chr:position) %>% 
    unite(var, var2, var) %>%
    spread(var, val, convert = TRUE)

df_wide
#> # A tibble: 3 x 5
#>    score chr_i chr_j position_i position_j
#>    <dbl> <chr> <chr>      <int>      <int>
#> 1 0.0550 chr12 chr12  100000000  100100000
#> 2 0.169  chr12 chr12  100000000  100050000
#> 3 0.333  chr12 chr12  100000000  100000000
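
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a rough sketch of the same long/wide round trip with those functions, assuming a recent tidyr is installed. The id column is a hypothetical helper added here so that rows stay distinct even if two scores happen to be equal; it is not part of the original answer.

```r
library(tidyr)

df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

# Hypothetical row identifier, so pivot_wider() can match rows back up
df$id <- seq_len(nrow(df))

# Long format: one row per (original row, side)
df_long <- pivot_longer(df, cols = c(i, j), names_to = "var", values_to = "val")
df_long <- separate(df_long, val, c("chr", "position"), convert = TRUE)

# Back to wide format: columns chr_i, chr_j, position_i, position_j
df_wide <- pivot_wider(df_long, names_from = var, values_from = c(chr, position))
```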

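【Comments】:

【Solution 2】:

A base R option with read.table: use Map over the first two columns, giving read.table the matching sep for each so that each column is split into several columns; cbind the resulting list, rename the columns with the desired names ('nm1'), then cbind the 'score' column.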

nm1 <- paste0(c('chr_', 'position_'), rep(c('firstside', 'secondside'), each = 2))
cbind(setNames(do.call(cbind, Map(read.table, text=df[1:2],  
               sep = list("-", "."))), nm1), df['score'])
#  chr_firstside position_firstside chr_secondside position_secondside   score
#1         chr12          100000000          chr12           100000000 0.33300
#2         chr12          100000000          chr12           100050000 0.16920
#3         chr12          100000000          chr12           100100000 0.05498

【Comments】:

【Solution 3】:

Using sub:

df$chr_firstside <- sub("^([^-]+).*", "\\1", df$i)
df$position_firstside <- sub(".*?([^-]+)$", "\\1", df$i)
df$chr_secondside <- sub("^([^.]+).*", "\\1", df$j)
df$position_secondside <- sub(".*?([^.]+)$", "\\1", df$j)

If you no longer need the i and j columns, you can also remove them from the data frame:

df <- df[ , -which(names(df) %in% c("i","j"))]
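
Note that sub() returns character vectors, so the position columns come out as text. Below is a small sketch, assuming the same sample data as in Solution 1, that converts them to integers and puts the columns in the order asked for in the question. The logical-index form used to drop i and j is a substitution made here, not the answer's code: the -which(...) form selects zero columns when none of the names match.

```r
df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

# Same patterns as in the answer above
df$chr_firstside       <- sub("^([^-]+).*", "\\1", df$i)
df$position_firstside  <- sub(".*?([^-]+)$", "\\1", df$i)
df$chr_secondside      <- sub("^([^.]+).*", "\\1", df$j)
df$position_secondside <- sub(".*?([^.]+)$", "\\1", df$j)

# sub() yields character; convert the positions to integer
df$position_firstside  <- as.integer(df$position_firstside)
df$position_secondside <- as.integer(df$position_secondside)

# Drop i and j with a logical index, which is safe even if neither exists
df <- df[, !(names(df) %in% c("i", "j"))]

# Put the columns in the order requested in the question
df <- df[, c("chr_firstside", "position_firstside",
             "chr_secondside", "position_secondside", "score")]
```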

【Comments】:

This works great! Is there a reason the score ends up in the first column?

【Solution 4】:

Playing with base R strsplit:

split_temp <- sapply(lapply(converted[, 1:2], strsplit, "[\\.-]"), unlist)
row_pos <- 1:nrow(split_temp) %% 2 == 0L
converted2 <- data.frame(chr_firstside       = split_temp[!row_pos, "i"],
                         position_firstside  = split_temp[row_pos, "i"],
                         chr_secondside      = split_temp[!row_pos, "j"],
                         position_secondside = split_temp[row_pos, "j"],
                         score               = converted$score)
print(converted2)
  chr_firstside position_firstside chr_secondside position_secondside   score
1         chr12          100000000          chr12           100000000 0.33300
2         chr12          100000000          chr12           100050000 0.16920
3         chr12          100000000          chr12           100100000 0.05498
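
The even/odd row indexing above relies on every value splitting into exactly two pieces. As an alternative sketch under the same assumption, the pieces can be bound into a two-column matrix per input column with do.call(rbind, ...). The sample converted data frame is recreated here for illustration, and fixed = TRUE is an added choice so that "." is matched literally.

```r
converted <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                        j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                        score = c(0.333, 0.1692, 0.05498),
                        stringsAsFactors = FALSE)

# Each strsplit() result is a list of length-2 vectors; rbind stacks them
# into a character matrix with one row per input row
parts_i <- do.call(rbind, strsplit(converted$i, "-", fixed = TRUE))
parts_j <- do.call(rbind, strsplit(converted$j, ".", fixed = TRUE))

converted2 <- data.frame(chr_firstside       = parts_i[, 1],
                         position_firstside  = as.integer(parts_i[, 2]),
                         chr_secondside      = parts_j[, 1],
                         position_secondside = as.integer(parts_j[, 2]),
                         score               = converted$score,
                         stringsAsFactors    = FALSE)
```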

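【Comments】:

【Solution 5】:

I would recommend cSplit from my "splitstackshape" package, which lets you supply a vector of split characters, one for each column being split.

Demo (using the sample data from @alistaire's answer):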

library(splitstackshape)
cSplit(df, c("i", "j"), c("-", "."))
#      score   i_1       i_2   j_1       j_2
# 1: 0.33300 chr12 100000000 chr12 100000000
# 2: 0.16920 chr12 100000000 chr12 100050000
# 3: 0.05498 chr12 100000000 chr12 100100000

Use setcolorder to change the column order:

setcolorder(cSplit(df, c("i", "j"), c("-", ".")), c(2:5, 1))[]
#      i_1       i_2   j_1       j_2   score
# 1: chr12 100000000 chr12 100000000 0.33300
# 2: chr12 100000000 chr12 100050000 0.16920
# 3: chr12 100000000 chr12 100100000 0.05498
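
To get the column names asked for in the question, the cSplit result can be renamed in place. This is a sketch only, assuming the i_1, i_2, j_1, j_2 names shown in the output above and that data.table is attached explicitly; the temporary file out_file is a hypothetical path used for illustration.

```r
library(splitstackshape)
library(data.table)

df <- data.frame(i = c("chr12-100000000", "chr12-100000000", "chr12-100000000"),
                 j = c("chr12.100000000", "chr12.100050000", "chr12.100100000"),
                 score = c(0.333, 0.1692, 0.05498),
                 stringsAsFactors = FALSE)

res <- cSplit(df, c("i", "j"), c("-", "."))

# Rename by reference; the old names are assumed from the output shown above
setnames(res,
         old = c("i_1", "i_2", "j_1", "j_2"),
         new = c("chr_firstside", "position_firstside",
                 "chr_secondside", "position_secondside"))

# Move score to the last position
setcolorder(res, c("chr_firstside", "position_firstside",
                   "chr_secondside", "position_secondside", "score"))

# Tab-separated output (out_file is a hypothetical temporary path)
out_file <- tempfile(fileext = ".tsv")
fwrite(res, out_file, sep = "\t")
```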

【Comments】: