【Posted】: 2020-01-15 17:06:42
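【Problem description】:
I have a column containing text in the following format:
ID-XXXXX Process for Description [1/5]
I would like to split it into three columns, where:
A = ID-XXXXX
B = Process for Description
C = 1/5
Any ideas on how to split this correctly?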
【Question discussion】:
Here is an attempt to help. Note that the first part is a little tricky: I used a regular expression that assumes XXXXX is always exactly 5 characters long.
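d <- "ID-XXXXX Process for Description [1/5]"
# A: drop everything from the first space onward
a <- sub("[ ].+", "", d)
# C: strip everything up to and including " [", then the closing "]"
c <- sub(".+[ ][[]", "", d); c <- sub("[]]", "", c)
# B: strip the trailing " [...]" part, then the leading "ID-XXXXX "
b <- sub("[ ][[].*[]]", "", d); b <- gsub("ID-.{5}[ ]", "", b)
f <- c(a, b, c); f
# [1] "ID-XXXXX" "Process for Description" "1/5"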
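The three substitutions above can also be collapsed into a single pattern with capture groups. The following is only a base-R sketch, assuming the bracketed part always sits at the end of the string; the helper name split_id is made up for illustration and is not part of the answer above.

```r
# Hypothetical helper: split one string into A, B, C with a single regex.
# Assumes the format "<id> <description> [<counter>]".
split_id <- function(x) {
  pattern <- "^(\\S+)\\s+(.*?)\\s*\\[([^]]*)\\]$"
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  # m is empty when the string does not match the assumed format
  if (length(m) == 0) {
    return(c(A = NA_character_, B = NA_character_, C = NA_character_))
  }
  # m[1] is the full match; m[2:4] are the three capture groups
  c(A = m[2], B = m[3], C = m[4])
}

split_id("ID-XXXXX Process for Description [1/5]")
```

Unlike the version with ID-.{5}, this sketch does not depend on the ID having exactly 5 characters; it takes everything up to the first whitespace as A.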
【Discussion】:
With stringr, there are a few options:
library(dplyr)
library(stringr)
dat <- data.frame(my_string = "ID-XXXXX Process for Description [1/5]", stringsAsFactors = FALSE)
dat %>%
  mutate(A = str_extract(string = my_string, pattern = "ID-.{5}"),
         B = str_replace(string = my_string, pattern = "ID-.{5}\\s(.+)\\s\\[.*\\]", replacement = "\\1"),
         # str_match returns a matrix; [, 2] selects the capture-group column for every row
         C = str_match(string = my_string, pattern = "\\[(.*)\\]")[, 2])
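A: extracts the pattern ID- followed by exactly 5 characters
B: captures the group between ID-XXXXX and [X/X], and replaces the whole match with the captured group
C: matches the captured pattern (.*) between the square brackets (the second column of the str_match result holds the captured group)
Result: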
                               my_string        A                       B   C
1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description 1/5
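Edit:
I just remembered that the extract() function in tidyr does exactly this.
By putting capture groups (in parentheses) in the regex argument, you can send each group straight into a new column.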
library(tidyr)
dat <- data.frame(my_string = paste0("ID-0000", 1:5, " Process_", LETTERS[1:5], " [", 1:5, "/5]"))
extract(data = dat,
        col = my_string,
        into = c("A", "B", "C"),
        regex = "(ID-.{5})\\s(.+)\\s\\[(.*)\\]",
        remove = FALSE)
                 my_string        A         B   C
1 ID-00001 Process_A [1/5] ID-00001 Process_A 1/5
2 ID-00002 Process_B [2/5] ID-00002 Process_B 2/5
3 ID-00003 Process_C [3/5] ID-00003 Process_C 3/5
4 ID-00004 Process_D [4/5] ID-00004 Process_D 4/5
5 ID-00005 Process_E [5/5] ID-00005 Process_E 5/5
If you do not want to keep the original string, use remove = TRUE (the default).
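For comparison, the same capture-group idea can be sketched without tidyr by using strcapture() from the utils package that ships with R. This is an illustrative alternative, not the method of the answer above; the variable names x, parts and with_original are assumptions chosen for the example.

```r
# Same sample data as above, kept as a plain character vector
x <- paste0("ID-0000", 1:5, " Process_", LETTERS[1:5], " [", 1:5, "/5]")

# strcapture() returns a data frame with one column per capture group;
# proto supplies the column names and types
parts <- strcapture(
  pattern = "(ID-.{5})\\s(.+)\\s\\[(.*)\\]",
  x = x,
  proto = data.frame(A = character(), B = character(), C = character(),
                     stringsAsFactors = FALSE),
  perl = TRUE
)

# Keeping the original column is an explicit step here
# (roughly the analogue of remove = FALSE)
with_original <- data.frame(my_string = x, parts, stringsAsFactors = FALSE)
```

Using parts on its own corresponds to dropping the original string, while with_original keeps it.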
【Discussion】:
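You can also do this systematically with tidyr::extract. An example, spelled out in detail for demonstration:
The regex puts everything up to the first whitespace into the first capture group, everything up to [ into the second capture group, and everything up to ] into the third capture group. This way there is no limit on the number of characters in any capture group.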
vec <- c("ID-XXXXX Process for Description [1/5]",
         "ID-XXXXXYZ Process for Description something [1/5]",
         "ID-XXXXXFFF Process for Description something else [1/905]",
         "ID-XXXXXYYYYP Process for Description [900001/5]")
df <- data.frame(col = vec)
df
#>                                                          col
#> 1                     ID-XXXXX Process for Description [1/5]
#> 2         ID-XXXXXYZ Process for Description something [1/5]
#> 3 ID-XXXXXFFF Process for Description something else [1/905]
#> 4           ID-XXXXXYYYYP Process for Description [900001/5]
library(tidyverse)
df %>%
  extract(col, into = c('A', 'B', 'C'), regex = '^([^\\s]*)\\s([^\\[]*)\\[([^\\]]*)\\]$')
#>               A                                       B        C
#> 1      ID-XXXXX                Process for Description       1/5
#> 2    ID-XXXXXYZ      Process for Description something       1/5
#> 3   ID-XXXXXFFF Process for Description something else     1/905
#> 4 ID-XXXXXYYYYP                Process for Description  900001/5
Created on 2021-05-30 by the reprex package (v2.0.0)
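One detail worth checking with this pattern: because the second group is [^\\[]*, it also captures the space that comes right before [. The sketch below uses base R's regmatches() with perl = TRUE instead of tidyr, purely to stay dependency-free, and the variable names are illustrative. It makes that trailing space visible and then trims it.

```r
x <- "ID-XXXXXYZ Process for Description something [1/5]"
pattern <- "^([^\\s]*)\\s([^\\[]*)\\[([^\\]]*)\\]$"

# Drop the first element (the full match) and keep the three capture groups
groups <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]
names(groups) <- c("A", "B", "C")

# The length of B before trimming includes the trailing space
b_len_before <- nchar(groups[["B"]])

# trimws() removes leading and trailing whitespace
groups[["B"]] <- trimws(groups[["B"]])
b_len_after <- nchar(groups[["B"]])
```

If the trailing space matters downstream (for joins or comparisons, say), applying trimws() to column B after extract() would be one way to handle it.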
【Discussion】:
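We can use str_extract:
library(dplyr)
library(stringr)
df <- tibble(col1 = "ID-XXXXX Process for Description [1/5]")
df %>%
  mutate(A = str_extract(col1, "ID-XXXXX"),
         B = str_extract(col1, "Process for Description"),
         # lookarounds keep the brackets out of the extracted value
         C = str_extract(col1, "(?<=\\[)[^\\]]+(?=\\])"))
Output:
# A tibble: 1 x 4
  col1                                   A        B                       C
  <chr>                                  <chr>    <chr>                   <chr>
1 ID-XXXXX Process for Description [1/5] ID-XXXXX Process for Description 1/5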
【Discussion】: