Replace second occurrence of a string in one column based on value in other column in R: answers

【Question title】: Replace second occurrence of a string in one column based on value in other column in R
【Posted】: 2018-05-14 12:42:36
【Question description】:

Here is a sample data frame:

a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse")
df <- data.frame(a,b)

I want to remove from col b the second occurrence of the value in col a.

This is my desired output:

a      b
cat    my cat is a tabby and is a friendly cat
dog    walk the dog
mouse  the mouse is scared of the other

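I have tried different combinations of gsub and some stringr functions, but I have not even come close to removing the second (and only the second) occurrence of the col a string in col b. I think I am asking something similar to this question, but I am not familiar with Perl and could not translate it to R.

Thanks!

【Comments】:

Tags: r regex string find-occurrences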

【Solution 1】:

Building the right regular expression takes a bit of work.

P1 = paste(a, collapse="|")
PAT = paste0("((", P1, ").*?)(\\2)")

sub(PAT, "\\1", b, perl=TRUE)
[1] "my cat is a tabby  and is a friendly cat"
[2] "walk the dog"                            
[3] "the mouse is scared of the other "
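The output above still carries the space that surrounded the removed word (a double space in row 1, a trailing space in row 3). A minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace in the column; the names `pat`, `out`, and `cleaned` are illustrative and not part of the original answer:

```r
a <- c("cat", "dog", "mouse")
b <- c("my cat is a tabby cat and is a friendly cat",
       "walk the dog",
       "the mouse is scared of the other mouse")

# Same pattern as above: group 2 matches any word from a, and \\2 requires
# that same word to appear again later in the string.
pat <- paste0("((", paste(a, collapse = "|"), ").*?)(\\2)")

# Drop the second occurrence, keeping everything captured before it.
out <- sub(pat, "\\1", b, perl = TRUE)

# Collapse repeated whitespace and strip leading/trailing spaces.
cleaned <- trimws(gsub("\\s+", " ", out))
print(cleaned)
```

Note that this also normalises any whitespace that was already irregular in the input, which may or may not be wanted.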

【Comments】:

【Solution 2】:

I actually found another solution which, although longer, may be clearer for other regex beginners:

library(stringr)
# Replace first instance of col a in col b with "INTERIM" 
df$b <- str_replace(b, a, "INTERIM")

# Now that the original first instance of col a is re-labeled to "INTERIM", I can again replace the first instance of col a in col b, this time with an empty string
df$b <- str_replace(df$b, a, "")

# And I can re-replace the re-labeled "INTERIM" to the original string in col a
df$b <- str_replace(df$b, "INTERIM", a)

# Trim "double" whitespace
df$b <- str_replace(gsub("\\s+", " ", str_trim(df$b)), "B", "b")


df
a            b
cat          my cat is a tabby and is a friendly cat
dog          walk the dog
mouse        the mouse is scared of the other
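The same placeholder idea can be sketched in base R, in case stringr is not available. This is only an illustration under a few assumptions: the helper name `remove_second_via_placeholder` and its `placeholder` argument are made up here, the placeholder text must not already occur in the data, and `fixed = TRUE` is used so the word in col a is treated as a literal string rather than a regex (a deliberate difference from the stringr version above).

```r
# Hypothetical helper: remove the second occurrence of `word` in `text`
# by temporarily re-labelling the first occurrence.
remove_second_via_placeholder <- function(word, text, placeholder = "INTERIM") {
  # 1. Re-label the first occurrence.
  text <- sub(word, placeholder, text, fixed = TRUE)
  # 2. The original second occurrence is now the first one left; delete it.
  text <- sub(word, "", text, fixed = TRUE)
  # 3. Restore the re-labelled first occurrence.
  text <- sub(placeholder, word, text, fixed = TRUE)
  # 4. Collapse the whitespace left behind by the deletion.
  trimws(gsub("\\s+", " ", text))
}

df <- data.frame(
  a = c("cat", "dog", "mouse"),
  b = c("my cat is a tabby cat and is a friendly cat",
        "walk the dog",
        "the mouse is scared of the other mouse"),
  stringsAsFactors = FALSE
)

# sub() is not vectorised over the pattern, so pair the rows up with mapply().
df$b <- mapply(remove_second_via_placeholder, df$a, df$b, USE.NAMES = FALSE)
print(df)
```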

【Comments】:

【Solution 3】:

You can do this...

library(stringr)
df$b <- str_replace(df$b, 
                    paste0("(.*?",df$a,".*?) ",df$a), 
                    "\\1")

df
      a                                       b
1   cat my cat is a tabby and is a friendly cat
2   dog                            walk the dog
3 mouse        the mouse is scared of the other

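The regex looks for the first stretch of text that contains df$a somewhere in it, followed by a space and another df$a. The capture group, marked by (...), is the text up to the space before the second occurrence. The whole match (including the second occurrence) is replaced by the capture group \\1, which has the effect of removing the second df$a and the space in front of it. Anything after the second df$a is unaffected.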
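To see what the capture group actually holds, here is a small base R sketch for a single row. It uses `regexec()`/`regmatches()` and `sub()` with `perl = TRUE` rather than stringr, which is an assumption made for illustration; the variable names are not from the answer.

```r
word <- "cat"
text <- "my cat is a tabby cat and is a friendly cat"

# Same shape of pattern as in the answer, built for one row.
pattern <- paste0("(.*?", word, ".*?) ", word)

# m[1] is the whole match, m[2] is capture group 1.
m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
print(m)

# Replacing the whole match with group 1 removes " cat" (second occurrence only).
result <- sub(pattern, "\\1", text, perl = TRUE)
print(result)
```

The whole match ends at the second "cat", while group 1 stops just before the space that precedes it, which is why no stray space is left behind.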

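【Comments】:

@carozimm Note that my solution and G5W's solution do different things. Mine compares each df$b with the df$a in the same row, whereas the other answer compares df$b with all of the words in the df$a column (so it would remove "dog" from "that cat is not a dog", for example). My solution also avoids leaving an extra space where the removed word used to be. I hope this is the behaviour you want!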
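A rough sketch of the row-wise versus column-wise difference, using a made-up sentence that is not from the thread. Both patterns are rebuilt here in base R; `text`, `row_word`, and the other names are illustrative assumptions.

```r
a <- c("cat", "dog", "mouse")

# Made-up row: col a is "cat", but col b repeats "dog" instead.
row_word <- "cat"
text <- "the dog saw the dog and the cat"

# Column-wise pattern (Solution 1 style): any word from a, repeated.
pat_all <- paste0("((", paste(a, collapse = "|"), ").*?)(\\2)")
column_wise <- sub(pat_all, "\\1", text, perl = TRUE)

# Row-wise pattern (Solution 3 style): only this row's word.
pat_row <- paste0("(.*?", row_word, ".*?) ", row_word)
row_wise <- sub(pat_row, "\\1", text, perl = TRUE)

print(column_wise)  # second "dog" is removed, leaving a double space
print(row_wise)     # unchanged, because "cat" occurs only once
```

Which behaviour is right depends on whether words from other rows should count.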

【Solution 4】:

Base R, split-apply-combine solution:

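# Split-apply-combine: split by a, de-duplicate the words of b, recombine.
# Note: unique() drops every repeated word in b, not only the second
# occurrence of the value in a.
data.frame(do.call("rbind", lapply(split(df, df$a), function(x) {
  b <- paste(unique(unlist(strsplit(x$b, "\\s+"))), collapse = " ")
  return(data.frame(a = x$a, b = b))
})), stringsAsFactors = FALSE, row.names = NULL)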

Data:

df <- data.frame(a = c("cat", "dog", "mouse"),
                 b = c("my cat is a tabby cat and is a friendly cat", "walk the dog", "the mouse is scared of the other mouse"), 
                 stringsAsFactors = FALSE)
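Because `unique()` in the split-apply-combine code removes every repeated word (for example the second "is" and "a" in the first row), its result differs from the output requested in the question. A token-based sketch that drops only the second occurrence of the col a value might look like the following. The helper `drop_second_word` is hypothetical, and it assumes words are separated by whitespace with no punctuation attached.

```r
# Hypothetical helper: split text into words and drop the second word
# that is exactly equal to `word`, if there is one.
drop_second_word <- function(word, text) {
  words <- strsplit(text, "\\s+")[[1]]
  hits <- which(words == word)
  if (length(hits) >= 2) {
    words <- words[-hits[2]]
  }
  paste(words, collapse = " ")
}

df <- data.frame(
  a = c("cat", "dog", "mouse"),
  b = c("my cat is a tabby cat and is a friendly cat",
        "walk the dog",
        "the mouse is scared of the other mouse"),
  stringsAsFactors = FALSE
)

# Apply row by row and collect the results back into the column.
df$b <- mapply(drop_second_word, df$a, df$b, USE.NAMES = FALSE)
print(df)
```

Matching whole tokens also means a word such as "cat" is not confused with "catalog", at the cost of missing forms like "cat," with trailing punctuation.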

【Comments】: