【Question title】: String split by value from another column
【Posted】: 2021-03-11 09:17:37
【Question description】:

Hello, I have this data frame (DF1):

structure(list(Value = list("Peter", "John", c("Patric", "Harry")),Text = c("Hello Peter How are you","Is it John? Yes It is John, Harry","Hello Patric, how are you. Well, Harry thank you."))  , class = "data.frame", row.names = c(NA, -3L)) 

             Value                                              Text
1            Peter                           Hello Peter How are you
2             John                 Is it John? Yes It is John, Harry
3 c(Patric, Harry) Hello Patric, how are you. Well, Harry thank you.

I want to split the sentences in Text at the names given in Value, to get this:

             Value                                              Text   Split
1            Peter                           Hello Peter How are you  c("Hello", "Peter How are you")
2             John                 Is it John? Yes It is John, Harry  c("Is it", "John? Yes It is John, Harry")
3 c(Patric, Harry) Hello Patric, how are you. Well, Harry thank you   c("Hello", "Patric, how are you. Well,", "Harry thank you")

I tried

DF1 %>% mutate(Split = strsplit(as.character(Text),as.character(Value)))

but it doesn't work.

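【Comments】:

I think there is something odd about your data. c(Patric, Harry) looks like the result of a data-preparation error. I would expect Value to appear in structure as follows: Value = list("Peter", "John", c("Patric", "Harry"))
Thanks, I've just fixed it.
I've left two possible solutions. Take a look at them.
Thanks, I'll test them and let you know.
@Edo Thanks, it works well. Value never contains arbitrary characters, only letters and digits.

Tags: r string dplyr strsplit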

【Solution 1】:

Data

Assuming this is the actual structure:

df <- structure(list(Value = list("Peter", "John", c("Patric", "Harry")),
                     Text = c("Hello Peter How are you","Is it John? Yes It is John, Harry","Hello Patric, how are you. Well, Harry thank you.")),
                class = "data.frame", row.names = c(NA, -3L))

First solution: double loop

You can solve your problem with a double for loop. This is probably the more readable and easier-to-debug solution.

library(stringr)

Split <- list()

for(i in seq_len(nrow(df))){
 
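 text  <- df$Text[i]      # pieces found so far; starts as the whole sentence
 value <- df$Value[[i]]   # names to split on for this row
 
 for(j in seq_along(value)){
  
  # split only the last piece, just before the first occurrence of the j-th name
  text2 <- str_split(text[length(text)], paste0("(?<=.)(?=", value[[j]], ")"), n = 2)[[1]]
  # replace the last piece with the pieces it was split into
  text <- c(text[-length(text)], text2)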
  
 }
 
 Split[[i]] <- text
 
}

df$Split <- Split

If you print df, each entry of Split looks like a single string, but it is actually a character vector of pieces.

df$Split
#> [[1]]
#> [1] "Hello "            "Peter How are you"
#> 
#> [[2]]
#> [1] "Is it "                      "John? Yes It is John, Harry"
#> 
#> [[3]]
#> [1] "Hello "                      "Patric, how are you. Well, " "Harry thank you."           
#>
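The pattern built inside the loop can be tried on a single string. The sketch below is only an illustration of how I read that regex, using the second row of the example data plus one made-up sentence ("John is here") that is not in the original question: `(?=John)` is a zero-width lookahead, so the name itself is kept in the right-hand piece, `(?<=.)` requires at least one character before the split point, and `n = 2` stops after the first split.

```r
library(stringr)

# Zero-width split point: at least one character behind, "John" directly ahead.
pattern <- "(?<=.)(?=John)"

# Second row of the example data: only the first "John" produces a split,
# because n = 2 caps the result at two pieces.
first_only <- str_split("Is it John? Yes It is John, Harry", pattern, n = 2)[[1]]

# Made-up sentence that starts with the name: the lookbehind finds no
# character before "John", so the string is returned unsplit.
leading <- str_split("John is here", pattern, n = 2)[[1]]

print(first_only)
print(leading)
```

Because nothing is consumed by the match, pasting the pieces back together reproduces the original sentence, which is also why the pieces keep their trailing spaces.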

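Second solution: tidyverse and a recursive function

Since you originally tried to use dplyr functions, you can also write it this way, with a recursive function. This solution does not use for loops.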

library(stringr)
library(purrr)
library(dplyr)

str_split_recursive <- function(string, pattern){
 
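 # split the last piece just before the first occurrence of the first pattern
 string <- str_split(string[length(string)], paste0("(?<=.)(?=", pattern[1], ")"), n = 2)[[1]]
 # drop the pattern that was just used
 pattern <- pattern[-1]
 # if patterns remain, keep the leading pieces and recurse on the last one
 if(length(pattern) > 0) string <- c(string[-length(string)], str_split_recursive(string, pattern))
 string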
 
}

df <- df %>% 
 mutate(Split = map2(Text, Value, str_split_recursive))
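Both solutions paste each value straight into a regular expression. According to the comments on the question, Value contains only letters and digits, so that is safe for this data. If a value could contain regex metacharacters, it would presumably need escaping first. A minimal sketch, assuming a helper named `escape_regex` (hypothetical, not part of the original answer) and a made-up sentence:

```r
library(stringr)

# Hypothetical helper, not from the original answer: put a backslash in
# front of every regex metacharacter so the value is matched literally.
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Made-up example: the value itself contains parentheses.
value <- "(John)"
text  <- "Ask (John) or John"

# Same lookaround pattern as in the answer, built from the escaped value.
pattern <- paste0("(?<=.)(?=", escape_regex(value), ")")
pieces  <- str_split(text, pattern, n = 2)[[1]]

print(pieces)
```

Swapping `value[[j]]` or `pattern[1]` for the escaped version in either solution should be enough; for purely alphanumeric values the helper returns its input unchanged.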

【Discussion】: