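Filter out strings by a specific pattern: answers

【Question title】: Filter out string by specific pattern
【Posted】: 2018-02-16 21:51:56
【Question】:

I have a data frame containing two-word and three-word strings. I want to filter out the strings whose words already appear, in the same pattern, inside another string.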

df <- data.frame(word = c("thin film", "film resistor", "thin film resistor", 
                          "protection material", "protection material removed",
                          "protection layer", "interconnect metal"))
> df
  word
1 thin film
2 film resistor
3 thin film resistor
4 protection material
5 protection material removed
6 protection layer
7 interconnect metal

I want to filter out the strings whose pattern is repeated inside another, longer string.

So this is the result I want:

  words
1 thin film resistor
2 protection material removed
3 protection layer
4 interconnect metal

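【Comments】:

Can you spell out the exact logic for what gets removed? It looks like "if there are strings that share at least 2 words, keep the longest string in that set". Is that right? Or can you explain it better?
It is not clear why protection layer and interconnect metal are in the desired output. I think they are unique.
Marius: Yes, your explanation is exactly what I was trying to describe: "if there are strings that share at least 2 words, keep the longest string in that set". My data frame contains partially duplicated strings, and the duplicates are of no use to me. I only want to keep the longest strings in my data frame.

Tags: r regex filter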

【Solution 1】:

Assuming the word column is of class character:

There must be a better way to do this, but this works:

  data.frame(words=names(which(colSums(sapply(df[,1],grepl,df[,1]))==1)))                       
              words
 1          thin film resistor
 2 protection material removed
 3            protection layer
 4          interconnect metal

Hope this helps.

You can also do this:

 df$word[colSums(sapply(df[,1],grepl,df[,1]))==1]
 [1] "thin film resistor"          "protection material removed" "protection layer"           
 [4] "interconnect metal"

Or:

 df$word[colSums(outer(df$word, df$word, stringr::str_detect)) == 1]
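
To see why the == 1 test works, it may help to look at the containment matrix itself. The sketch below is an illustration, not part of the original answer: the names words, contains and keep are introduced here, and fixed = TRUE is an assumption that the strings should be matched literally and not as regular expressions.

```r
# Illustrative sketch; `words`, `contains` and `keep` are names introduced here.
words <- c("thin film", "film resistor", "thin film resistor",
           "protection material", "protection material removed",
           "protection layer", "interconnect metal")

# contains[i, j] is TRUE when words[i] contains words[j] as a literal substring.
# fixed = TRUE is an assumption: it stops characters such as "." or "("
# from being interpreted as regex syntax.
contains <- sapply(words, function(p) grepl(p, words, fixed = TRUE))

# Every string contains itself, so a column sum of 1 means that no other
# string contains it.
keep <- unname(colSums(contains) == 1)
words[keep]
```

Note that this is plain substring matching, so word boundaries are not respected: "thin film" would also count as contained in a string such as "thin films".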

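【Comments】:

Nice idea. I am not sure this can be optimized much further. You could use outer, but it needs a vectorized grepl (such as str_detect from the stringr package), i.e. df$word[colSums(outer(df$word, df$word, stringr::str_detect)) == 1].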
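
The point about outer needing a vectorized function can also be sketched in base R, without stringr. This is only one possible way to do it: contains_fixed and m are hypothetical names introduced here, and literal matching via fixed = TRUE is an assumption.

```r
# Illustrative sketch; `contains_fixed` and `m` are names introduced here.
words <- c("thin film", "film resistor", "thin film resistor",
           "protection material", "protection material removed",
           "protection layer", "interconnect metal")

# outer() calls its function once with two long vectors, so the function must
# be vectorized over both arguments. grepl() only uses a single pattern, so
# Vectorize() is used here to wrap it in an element-by-element loop.
contains_fixed <- Vectorize(
  function(string, pattern) grepl(pattern, string, fixed = TRUE),
  USE.NAMES = FALSE
)

# m[i, j] is TRUE when words[i] contains words[j].
m <- outer(words, words, contains_fixed)
words[colSums(m) == 1]
```

Vectorize still loops internally, so this is a convenience and should not be expected to run faster than the sapply version.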

【Solution 2】:

Please set stringsAsFactors = FALSE when creating the data.frame.
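
As a minimal sketch of what that means in practice (df_chr and df_fct are names introduced here for illustration): strsplit only accepts character input, so the column has to be character either from the start or after a conversion. Since R 4.0.0 stringsAsFactors defaults to FALSE, so the explicit argument mainly matters on older versions.

```r
# Illustrative sketch; `df_chr` and `df_fct` are names introduced here.
# Option 1: create the column as character in the first place.
df_chr <- data.frame(word = c("thin film", "film resistor"),
                     stringsAsFactors = FALSE)
is.character(df_chr$word)

# Option 2: the data frame already exists with a factor column,
# so convert it before calling strsplit().
df_fct <- data.frame(word = c("thin film", "film resistor"),
                     stringsAsFactors = TRUE)
df_fct$word <- as.character(df_fct$word)
strsplit(df_fct$word, split = " ")
```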

Try this:

library(stringr)  # for str_c()

# Split every string into its words.
lst <- strsplit(df$word, split = " ")

output <- sapply(seq_along(lst), function(t, dict) {
  # For every other string u, return u if it contains all words of string t,
  # and NULL otherwise.
  temp <- lapply(dict[-t], function(u, v) {
    if (all(v %in% u)) str_c(u, collapse = " ") else NULL
  }, dict[[t]])
  # unlist() drops the NULL entries; the result is NULL when nothing matched.
  temp <- unlist(temp)
  if (is.null(temp)) {
    # No other string contains string t, so keep it as it is.
    str_c(dict[[t]], collapse = " ")
  } else {
    # Otherwise replace it with the longest string that contains it.
    temp[which.max(nchar(temp))]
  }
}, lst)

unique(output)

#[1] "thin film resistor"          "protection material removed" "protection layer"            "interconnect metal"

It is not the most optimized solution, but it should do the job.
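
One difference between the two solutions may be worth illustrating: Solution 1 looks for substrings, while Solution 2 compares sets of words, so word order and adjacency do not matter to it. The sketch below uses a made-up pair of strings that is not from the question, and the names by_substring, word_sets, is_covered and by_word_set are introduced here. It drops covered strings instead of replacing them, so it is a simplified variant and not the answer's exact code.

```r
# Made-up input, not from the question: same words, different order.
words <- c("thin film", "film thin resistor")

# Substring approach: "thin film" is not a literal substring of
# "film thin resistor", so both strings survive.
by_substring <- words[
  colSums(sapply(words, function(p) grepl(p, words, fixed = TRUE))) == 1
]

# Word-set approach: drop a string when some other string contains all of
# its words and has strictly more distinct words. The "strictly more" rule
# is an assumption that keeps two strings with identical word sets from
# eliminating each other.
word_sets <- lapply(strsplit(words, " ", fixed = TRUE), unique)
is_covered <- vapply(seq_along(word_sets), function(i) {
  any(vapply(seq_along(word_sets), function(j) {
    j != i &&
      all(word_sets[[i]] %in% word_sets[[j]]) &&
      length(word_sets[[j]]) > length(word_sets[[i]])
  }, logical(1)))
}, logical(1))
by_word_set <- words[!is_covered]
```

Which behaviour is wanted depends on the data: if reordered or non-adjacent words should count as duplicates, the word-set comparison is the closer fit; if only exact phrases should, the substring test is.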

【Comments】: