【Question title】: How to split text into multiple columns?
【Posted】: 2020-01-15 17:06:42
【Question】:

I have a column containing text in the following format:

ID-XXXXX Process for Description [1/5]

I would like to split it into three columns, where:

A = ID-XXXXX

B = Process for Description

C = 1/5

Any ideas on how to split this correctly?

【Comments on the question】:

    Tags: r tidyr


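    【Solution 1】:

    Here is an attempt to help. Note that one part is a bit tricky: the regular expression that strips the ID prefix assumes XXXXX is always 5 characters long.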

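    d <- "ID-XXXXX Process for Description [1/5]"

    # A: drop everything from the first space onward
    a <- sub(" .+", "", d)

    # C: drop everything up to the opening bracket, then the closing bracket
    # (named c3 so it does not shadow the base function c())
    c3 <- sub(".+ \\[", "", d)
    c3 <- sub("\\]", "", c3)

    # B: drop the bracketed suffix, then the ID prefix (assumes a 5-character ID)
    b <- sub(" \\[.*\\]", "", d)
    b <- sub("ID-.{5} ", "", b)

    f <- c(a, b, c3)
    f
    # [1] "ID-XXXXX" "Process for Description" "1/5"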
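The separate sub() calls can also be collapsed into one pattern with three capture groups. The following is a minimal base R sketch, not part of the original answer; the names `pattern`, `m`, and `parts` are made up for illustration, and it assumes the ID contains no spaces and the string ends with the bracketed part.

```r
d <- "ID-XXXXX Process for Description [1/5]"

# Three capture groups: the ID (no spaces), the description, the bracket contents
pattern <- "^([^ ]+) (.+) \\[(.*)\\]$"

# regexec() records the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(d, regexec(pattern, d))[[1]]

# The first element is the full match, so drop it to keep only the groups
parts <- m[-1]
parts
# parts holds "ID-XXXXX", "Process for Description" and "1/5"
```

Unlike the version above, this does not rely on the ID being exactly 5 characters long.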
    


      【Solution 2】:

      With stringr, there are several options:

      library(dplyr)
      library(stringr)

      dat <- data.frame(my_string = "ID-XXXXX Process for Description [1/5]")
      
      dat %>% 
        mutate(A = str_extract(string = my_string, pattern = "ID-.{5}"),
               B = str_replace(string = my_string, pattern = "ID-.{5}\\s(.+)\\s\\[.*\\]", replacement = "\\1"),
               # str_match() returns a matrix; column 2 holds the first capture group
               C = str_match(string = my_string, pattern = "\\[(.*)\\]")[, 2])
      

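      A: extracts the pattern ID- followed by exactly 5 characters
      B: captures the group between ID-XXXXX and [X/X], and replaces the whole match with the captured group
      C: matches the pattern (.*) captured between the square brackets (the second column returned by str_match holds the captured group)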
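To make the point about str_match() concrete, here is a small sketch on a two-element vector (the vector `x` is invented for illustration). It shows that the result is a character matrix, with the full match in column 1 and the capture group in column 2, so selecting the column with `[, 2]` keeps one value per input string.

```r
library(stringr)

# Hypothetical two-row input, in the same format as the question
x <- c("ID-00001 Process_A [1/5]", "ID-00002 Process_B [2/5]")

m <- str_match(x, "\\[(.*)\\]")

full_match <- m[, 1]  # full matches, brackets included: "[1/5]" "[2/5]"
captured   <- m[, 2]  # capture group only: "1/5" "2/5"

# Plain m[2] indexes the matrix as a vector (column-major),
# so it returns the second row's full match, not a capture group
second_elem <- m[2]
```

With more than one row, column indexing is therefore the safer way to pull out the capture group.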

      Result:

                                     my_string        A                       B   C
      1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description 1/5
      

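      Edit
      I just remembered that the extract() function in tidyr does exactly this.
      Using capture groups (in parentheses) in the regex argument, you can put them directly into new columns.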

      library(tidyr)

      dat <- data.frame(my_string = paste0("ID-0000", 1:5, " Process_", LETTERS[1:5], " [", 1:5, "/5]"))
      
      extract(data = dat,
              col = my_string, 
              into = c("A", "B", "C"), 
              regex = "(ID-.{5})\\s(.+)\\s\\[(.*)\\]", 
              remove = FALSE)
      
                       my_string        A         B   C
      1 ID-00001 Process_A [1/5] ID-00001 Process_A 1/5
      2 ID-00002 Process_B [2/5] ID-00002 Process_B 2/5
      3 ID-00003 Process_C [3/5] ID-00003 Process_C 3/5
      4 ID-00004 Process_D [4/5] ID-00004 Process_D 4/5
      5 ID-00005 Process_E [5/5] ID-00005 Process_E 5/5
      

      If you don't want to keep the original string, use remove = TRUE


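        【Solution 3】:

        You can also do this systematically with tidyr::extract. The example is spelled out in detail for demonstration:

        • Extract everything up to the first space into the first capture group
        • Extract everything up to [ into the second capture group
        • Extract everything up to ] into the third capture group

        This way there is no limit on the number of characters in each capture group.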

        vec <- c("ID-XXXXX Process for Description [1/5]", "ID-XXXXXYZ Process for Description something [1/5]", "ID-XXXXXFFF Process for Description something else [1/905]", "ID-XXXXXYYYYP Process for Description [900001/5]")
        df <- data.frame(col = vec)
        df
        #>                                                          col
        #> 1                     ID-XXXXX Process for Description [1/5]
        #> 2         ID-XXXXXYZ Process for Description something [1/5]
        #> 3 ID-XXXXXFFF Process for Description something else [1/905]
        #> 4           ID-XXXXXYYYYP Process for Description [900001/5]
        library(tidyverse)
        df %>%
          # \\s before \\[ keeps the trailing space out of the second capture group
          extract(col, into = c('A', 'B', 'C'), regex = '^([^\\s]*)\\s([^\\[]*)\\s\\[([^\\]]*)\\]$')
        #>               A                                       B        C
        #> 1      ID-XXXXX                Process for Description       1/5
        #> 2    ID-XXXXXYZ      Process for Description something       1/5
        #> 3   ID-XXXXXFFF Process for Description something else     1/905
        #> 4 ID-XXXXXYYYYP                Process for Description  900001/5
        

        Created on 2021-05-30 by the reprex package (v2.0.0)
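If tidyr is not available, a similar result can be sketched in base R with utils::strcapture() (R 3.4.0 or later), which also maps capture groups to data frame columns. This is an alternative for illustration, not the method used in the answer above; the names `proto` and `out` are invented, and the pattern assumes a single space before the opening bracket.

```r
vec <- c("ID-XXXXX Process for Description [1/5]",
         "ID-XXXXXYZ Process for Description something [1/5]",
         "ID-XXXXXFFF Process for Description something else [1/905]",
         "ID-XXXXXYYYYP Process for Description [900001/5]")

# Prototype data frame: one column per capture group, giving names and types
proto <- data.frame(A = character(), B = character(), C = character(),
                    stringsAsFactors = FALSE)

# Same three groups as above: up to the first space, up to " [", up to "]"
out <- strcapture("^([^ ]*) ([^[]*) \\[([^]]*)\\]$", vec, proto)
out
```

As with the tidyr version, none of the groups has a fixed length, so IDs and descriptions of any size are handled.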


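          【Solution 4】:

          We can use str_extract. Note that the patterns here are literal strings, so they only match this exact input; as the output shows, A comes out as ID-XXXX (four X's) and C keeps its square brackets.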

          df %>% 
            mutate(A = str_extract(col1, "ID-XXXX"),
                   B = str_extract(col1, "Process for Description"),
                   C = str_extract(col1, "\\[1\\/5\\]"))
          
          

          Output:

          # A tibble: 1 x 4
            col1                                   A       B                       C    
            <chr>                                  <chr>   <chr>                   <chr>
          1 ID-XXXXX Process for Description [1/5] ID-XXXX Process for Description [1/5]
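The same str_extract idea can be made general by replacing the literal strings with patterns. The sketch below is one possible variant, not the answer's own code; the input vector `x` and the result name `res` are invented, and it assumes the ID contains no whitespace and the description is followed by a space and an opening bracket.

```r
library(stringr)

# Hypothetical input in the format from the question
x <- c("ID-XXXXX Process for Description [1/5]",
       "ID-XXXXXYZ Process for Description something [1/905]")

res <- data.frame(
  # A: "ID-" followed by everything up to the first whitespace
  A = str_extract(x, "^ID-\\S+"),
  # B: text after the first whitespace and before " [" (lookarounds keep
  #    the delimiters out of the result)
  B = str_extract(x, "(?<=\\s).+(?=\\s\\[)"),
  # C: text between the square brackets, without the brackets
  C = str_extract(x, "(?<=\\[)[^\\]]+(?=\\])"),
  stringsAsFactors = FALSE
)
res
```

Here C no longer includes the brackets and A is not tied to a fixed number of characters.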
          
