【Question title】: Split one column into two columns in R with looping
【Posted】: 2015-02-03 10:17:33
【Question】:

I actually have the same problem as in "strsplit one column with exact information into two column".

That question has been solved, but my data looks like this:

      SNP Geno AlleleA AlleleB AlleleC AlleleD AlleleE
1 marker1   G1      AA      AA      AA      AA      AA
2 marker2   G1      TT      TT      TT      TT      TT
3 marker3   G1      TT      TT      TT      TT      TT
4 marker1   G2      CC      CC      CC      CC      CC
5 marker2   G2      AA      AA      AA      AA      AA
6 marker3   G2      TT      TT      TT      TT      TT
7 marker1   G3      GG      GG      GG      GG      GG
8 marker2   G3      AA      AA      AA      AA      AA
9 marker3   G3      TT      TT      TT      TT      TT

Output of dput:

structure(list(SNP = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L), .Label = c("marker1", "marker2", "marker3"), class = "factor"), 
    Geno = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("G1", 
    "G2", "G3"), class = "factor"), AlleleA = structure(c(1L, 
    4L, 4L, 2L, 1L, 4L, 3L, 1L, 4L), .Label = c("AA", "CC", "GG", 
    "TT"), class = "factor"), AlleleB = structure(c(1L, 4L, 4L, 
    2L, 1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", 
    "CC", "GG", "TT")), AlleleC = structure(c(1L, 4L, 4L, 2L, 
    1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", 
    "GG", "TT")), AlleleD = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 
    3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", 
    "TT")), AlleleE = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 3L, 
    1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", "TT"
    ))), .Names = c("SNP", "Geno", "AlleleA", "AlleleB", "AlleleC", 
"AlleleD", "AlleleE"), row.names = c(NA, -9L), class = "data.frame")

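In that question, the asker had only one column to split into two. My problem is that I have 5000 columns (AlleleA, AlleleB, ... etc.) that I want to split, each column into two.

I tried using a loop like this, but it doesn't work: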

for(i in colnames(dat)){
  dat1 <- data.frame(do.call(rbind, strsplit(as.vector(sprintf("dat$%s",i)), split = "")))
}

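I look forward to your guidance. Thank you.

【Comments】:

  • How do you want to split the columns? (Each column into exactly two columns? How is the split defined?) tidyr has a separate function that splits one column into several, and you could apply it to every column you want to split using, for example, dplyr's mutate_each function.
  • @beginneR I have edited my question.
  • @beginneR It works using splitstackshape :) Thanks, Ananda Mahto.

Tags: r split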


【Solution 1】:

You can use cSplit from my "splitstackshape" package with the argument stripWhite = FALSE.

For example, to split all the "Allele*" columns, we would do:

library(splitstackshape)
cSplit(mydf, grep("Allele", names(mydf)), "", stripWhite = FALSE)
#        SNP Geno AlleleA_1 AlleleA_2 AlleleB_1 AlleleB_2 AlleleC_1
# 1: marker1   G1         A         A         A         A         A
# 2: marker2   G1         T         T         T         T         T
# 3: marker3   G1         T         T         T         T         T
# 4: marker1   G2         C         C         C         C         C
# 5: marker2   G2         A         A         A         A         A
# 6: marker3   G2         T         T         T         T         T
# 7: marker1   G3         G         G         G         G         G
# 8: marker2   G3         A         A         A         A         A
# 9: marker3   G3         T         T         T         T         T
#    AlleleC_2 AlleleD_1 AlleleD_2 AlleleE_1 AlleleE_2
# 1:         A         A         A         A         A
# 2:         T         T         T         T         T
# 3:         T         T         T         T         T
# 4:         C         C         C         C         C
# 5:         A         A         A         A         A
# 6:         T         T         T         T         T
# 7:         G         G         G         G         G
# 8:         A         A         A         A         A
# 9:         T         T         T         T         T
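
If installing a package is not an option, the same split can be sketched in base R. This is only an illustration, not part of the original answer: the toy data frame and the helper name split_two are assumptions, and it presumes every value is exactly two characters long. It also hints at why the loop in the question fails: sprintf("dat$%s", i) produces the literal text "dat$AlleleA" rather than the column, and dat1 is overwritten on every iteration.

```r
# Toy data in the same layout as the question (assumed, only two Allele columns)
dat <- data.frame(
  SNP = c("marker1", "marker2", "marker3"),
  Geno = "G1",
  AlleleA = c("AA", "TT", "TT"),
  AlleleB = c("AA", "TT", "TT"),
  stringsAsFactors = TRUE
)

# Names of the columns to split
allele_cols <- grep("^Allele", names(dat), value = TRUE)

# Hypothetical helper: split one two-character column into two columns
split_two <- function(nm) {
  x <- as.character(dat[[nm]])  # factors must be converted first
  out <- data.frame(substr(x, 1, 1), substr(x, 2, 2),
                    stringsAsFactors = FALSE)
  names(out) <- paste0(nm, "_", 1:2)
  out
}

# Keep the untouched columns and bind the split pieces next to them
res <- cbind(dat[setdiff(names(dat), allele_cols)],
             do.call(cbind, lapply(allele_cols, split_two)))
res
```

Indexing with dat[[nm]] is what lets the column name vary inside the loop or lapply call; the $ operator cannot take a name stored in a variable.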

【Comments】:

    【Solution 2】:

    As @beginneR said, you can use tidyr::separate. Below is an example taken from: http://blog.rstudio.org/2014/07/22/introducing-tidyr/

    head(tidier, 8)
    
    #>   id       trt     key    time
    #> 1  1 treatment work.T1 0.08514
    #> 2  2   control work.T1 0.22544
    #> 3  3 treatment work.T1 0.27453
    #> 4  4   control work.T1 0.27231
    #> 5  1 treatment home.T1 0.61583
    #> 6  2   control home.T1 0.42967
    #> 7  3 treatment home.T1 0.65166
    #> 8  4   control home.T1 0.56774
    
    tidy <- tidier %>%
      separate(key, into = c("location", "time"), sep = "\\.") 
    tidy %>% head(8)
    #>   id       trt location time    time
    #> 1  1 treatment     work   T1 0.08514
    #> 2  2   control     work   T1 0.22544
    #> 3  3 treatment     work   T1 0.27453
    #> 4  4   control     work   T1 0.27231
    #> 5  1 treatment     home   T1 0.61583
    #> 6  2   control     home   T1 0.42967
    #> 7  3 treatment     home   T1 0.65166
    #> 8  4   control     home   T1 0.56774
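
The example above splits a single column. As a rough sketch of how separate might be applied to many columns, one can loop over the column names and split each at character position 1. This is not from the original answer: the toy data and the "_1"/"_2" naming are assumptions, and it presumes a tidyr version in which a numeric sep is treated as a split position.

```r
library(tidyr)

# Toy data in the same layout as the question (assumed, only two Allele columns)
dat <- data.frame(
  SNP = c("marker1", "marker2"),
  Geno = c("G1", "G1"),
  AlleleA = c("AA", "TT"),
  AlleleB = c("CC", "GG"),
  stringsAsFactors = FALSE
)

allele_cols <- grep("^Allele", names(dat), value = TRUE)

res <- dat
for (nm in allele_cols) {
  # !!nm injects the column name held in nm;
  # sep = 1 splits each value after its first character
  res <- separate(res, col = !!nm, into = paste0(nm, "_", 1:2), sep = 1)
}
res
```

With thousands of columns this calls separate once per column, so it is likely slower than a dedicated multi-column splitter.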
    

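    【Comments】:

    • I think the question is about having to do this kind of split across multiple columns.
    • You're right, I didn't read the question carefully, nor @beginneR's comment.
    • Actually, I'm not so sure this can be done with a combination of mutate_each and separate, at least not as flexibly as in Ananda's answer, because separate requires you to specify the names of the columns that each column is split into.
    【Solution 3】:

    Another option is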

    library(qdap)
    res <- colsplit2df(dat, splitcols=2:ncol(dat),sep='')
    colnames(res)[-1] <- make.names(rep(colnames(dat)[-1],each=2), unique=TRUE)
    res[1:3,1:5]
    #      SNP Geno Geno.1 AlleleA AlleleA.1
    #1 marker1    G      1       A         A
    #2 marker2    G      1       T         T
    #3 marker3    G      1       T         T
    

    Or only for the Allele columns

    colsplit2df(dat, splitcols=grep('Allele', names(dat)),sep='')
    

    Edit (Tyler Rinker)

    I suggest first using setNames to edit the column names of the data.frame, like so:

    setNames(dat, gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", names(dat))) %>%
        colsplit2df(splitcols=3:ncol(dat), sep='')
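
To see what the gsub call in the edit above does to the column names on its own, here is a small base-R check. The vector nms is an assumed stand-in for names(dat). The pattern only matches names made of an uppercase letter, one or more lowercase letters, and another uppercase letter, so SNP and Geno are left alone. The "&" in the replacement presumably lets colsplit2df derive the two new column names from the single renamed column, though that part is not verified here.

```r
# Stand-in for names(dat) from the question (assumed subset)
nms <- c("SNP", "Geno", "AlleleA", "AlleleB")

# Same pattern and replacement as in the edit above
new_nms <- gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", nms)
new_nms
# "SNP" "Geno" "AlleleA.1&AlleleA.2" "AlleleB.1&AlleleB.2"
```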
    

    【Comments】:
