【Question Title】: Keep the matched patterns only in a list of sentences in R
【Posted】: 2017-09-30 16:20:34
【Question】:

I have a list of sentences and a list of words, and I want to update each sentence so that it keeps only the words that appear in the word list.

For example, I have the following words:

"USA","UK","Germany","Australia","Italy","in","to"

and the following sentences:

"I lived in Germany 2 years", "I moved from Italy to USA", "people in USA, UK and Australia speak English"

I want to remove from each sentence all words that do not appear in the word list, so the expected output is the following sentences: "in Germany", "Italy to USA", "in USA UK Australia".

How can I do this with an apply function?

mywords=data.frame(words=c("USA","UK","Germany","Australia","Italy","in","to"),
                   stringsAsFactors = F)
mysentences=data.frame(sentences=c("I lived in Germany 2 years",
                                   "I moved from Italy to USA",
                                   "people in USA, UK and Australia speak English"),
                   stringsAsFactors = F)

【Comments】:

  • I misread it the first time; there is a very similar question with an accepted answer here - stackoverflow.com/questions/28891130/…
  • @neilfws - it can be adapted quite easily - e.g. sapply(strsplit(sentence, "[[:space:]|[:punct:]]"), intersect, vect)
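Expanding the one-liner from the comment above into a self-contained base-R sketch (using the question's vect and sentence; note that intersect() also deduplicates, so a matched word occurring twice survives only once):

```r
vect <- c("USA","UK","Germany","Australia","Italy","in","to")
sentence <- c("I lived in Germany 2 years",
              "I moved from Italy to USA",
              "people in USA, UK and Australia speak English")

# Split each sentence on whitespace/punctuation, keep only the words
# found in vect, then paste the survivors back together.
res <- sapply(strsplit(sentence, "[[:space:]|[:punct:]]"),
              function(w) paste(intersect(w, vect), collapse = " "))
res
#> [1] "in Germany"          "Italy to USA"        "in USA UK Australia"
```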

Tags: r


【Solution 1】:

If you convert this text to a tidy data format, you can use a join to find the matching words. Then you can use purrr::map_chr() to get back to the strings you need.

library(tidyverse)
library(tidytext)

mywords <- data_frame(word = c("USA","UK","Germany","Australia","Italy","in","to"))

mysentences <- data_frame(sentences = c("I lived in Germany 2 years",
                                        "I moved from Italy to USA",
                                        "people in USA, UK and Australia speak English"))

mysentences %>% 
    mutate(id = row_number()) %>% 
    unnest_tokens(word, sentences, to_lower = FALSE) %>% 
    inner_join(mywords) %>% 
    nest(-id) %>%
    mutate(sentences = map(data, unlist),
           sentences = map_chr(sentences, paste, collapse = " ")) %>%
    select(-data)

#> Joining, by = "word"
#> # A tibble: 3 × 2
#>      id           sentences
#>   <int>               <chr>
#> 1     1          in Germany
#> 2     2        Italy to USA
#> 3     3 in USA UK Australia

【Discussion】:

    【Solution 2】:

    You can also use stringr. Sorry for posting twice; my mistake.

    vect <- c("USA","UK","Germany","Australia","Italy","in","to")
    sentence <- c("I lived in Germany 2 years", "I moved from Italy to USA", "people in USA, UK and Australia speak English")
    
    library(stringr)
    li <- str_extract_all(sentence,paste0(vect,collapse="|"))
    d <- list()
    for(i in 1:length(li)){
      d[i] <- paste(li[[i]],collapse=" ")
    }
    
    unlist(d)
    

    Output:

     > unlist(d)
    [1] "in Germany"         
    [2] "Italy to USA"       
    [3] "in USA UK Australia"
    

    【Discussion】:

      【Solution 3】:

      This works well for a shorter word list:

      library(stringr)
      mywords_regex <- paste0(mywords$word, collapse = "|")
      sapply(str_extract_all(mysentences$sentences, mywords_regex), paste, collapse = " ")
      
      [1] "in Germany"          "Italy to USA"        "in USA UK Australia"
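One caveat worth noting (a base-R sketch with a made-up sentence, not from the original answer): collapsing the words into USA|UK|…|in|to without word boundaries lets short entries such as "in" or "to" match inside longer words. Wrapping the pattern in \b fixes that, and regmatches()/gregexpr() give a package-free equivalent of str_extract_all():

```r
mywords <- c("USA","UK","Germany","Australia","Italy","in","to")
sentences <- c("I am moving to Berlin",       # "in" hides inside "moving" and "Berlin"
               "I lived in Germany 2 years")

# \b word boundaries stop "in"/"to" from matching inside longer words
rx <- paste0("\\b(", paste0(mywords, collapse = "|"), ")\\b")
out <- sapply(regmatches(sentences, gregexpr(rx, sentences)),
              paste, collapse = " ")
out
#> [1] "to"         "in Germany"
```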
      

      【Discussion】:

        【Solution 4】:

        Here are two approaches. The first collapses the word list into a regular expression and then uses str_detect to match the sentences against it:


        library(tidyverse)
        library(glue)
        
        mywords=data_frame(words=c("USA","UK","Germany","Australia","Italy","in","to"))
        mysentences=data_frame(sentences=c("This is a sentence with no words of word list",
                                           "I lived in Germany 2 years",
                                           "I moved from Italy to USA",
                                           "people in USA, UK and Australia speak English"))
        mysentences %>% 
          filter(sentences %>% 
                   str_detect(mywords$words %>% collapse(sep = "|") %>% regex(ignore_case = T)))
        #> # A tibble: 3 × 1
        #>                                       sentences
        #>                                           <chr>
        #> 1                    I lived in Germany 2 years
        #> 2                     I moved from Italy to USA
        #> 3 people in USA, UK and Australia speak English
        

        The second approach uses fuzzyjoin's regex_semi_join (which uses str_detect under the hood and does the above work for you):

        library(fuzzyjoin)
        mysentences %>%
          regex_semi_join(mywords, by= c(sentences = "words"))
        #> # A tibble: 3 × 1
        #>                                       sentences
        #>                                           <chr>
        #> 1                    I lived in Germany 2 years
        #> 2                     I moved from Italy to USA
        #> 3 people in USA, UK and Australia speak English
        

        【Discussion】:

          【Solution 5】:

          Thanks everyone,

          I solved this with the following code, inspired by the answer that uses the intersect function:

          vect <- data.frame( c("USA","UK","Germany","Australia","Italy","in","to"),stringsAsFactors = F)
          sentence <- data.frame(c("I lived in Germany 2 years", "I moved from Italy to USA",
                                   "people in USA     UK and    Australia speak English"),stringsAsFactors = F)
          
          sentence[,1]=gsub("[^[:alnum:] ]", "", sentence[,1]) #remove special characters
          sentence[,1]=sapply(sentence[,1], FUN =  function(x){ paste(intersect(strsplit(x, "\\s")[[1]], vect[,1]), collapse=" ")})
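A side note on this approach (a small sketch with a made-up sentence, not from the question's data): intersect() returns unique matches, so a list word occurring twice in a sentence survives only once. If duplicates matter, subsetting with %in% preserves every occurrence:

```r
vect <- c("USA","UK","Germany","Australia","Italy","in","to")
s <- "I moved to USA and then to UK"          # "to" occurs twice

w <- strsplit(gsub("[^[:alnum:] ]", "", s), "\\s+")[[1]]

a <- paste(intersect(w, vect), collapse = " ")  # drops the second "to"
b <- paste(w[w %in% vect], collapse = " ")      # keeps every occurrence
a
#> [1] "to USA UK"
b
#> [1] "to USA to UK"
```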
          

          【Discussion】:
