【Question title】: Count and identify common words in two string vectors [R]
【Posted】: 2025-12-02 01:40:01
【Question description】:

How can I write an R function that takes two string vectors and returns the number of common words, comparing element 1 of strvec1 with element 1 of strvec2, element 2 of strvec1 with element 2 of strvec2, and so on?

Suppose I have this data:

#string vector 1
strvec1 <- c("Griffin Rahea Petersen Deana Franks Morgan","Story Keisha","Douglas Landon Lark","Kinsman Megan Thrall Michael Michels Breann","Gutierrez Mccoy Tyler West* Grayson Swank Shirley Didas Moriah")

#string vector 2
strvec2 <- c("Griffin Morgan Rose Manuel","Van De Grift Sarah Sell William","Mark Landon Lark","Beerman Carlee Megan Thrall Michels","Mcmillan Tyler Jonathan West* Grayson Didas Lloyd Connor")

Ideally, I would have one function that returns the number of common words and another that returns the common words themselves:

#Non working sample of how functions would ideally work
desiredfunction_numwords(strvec1,strvec2)
[1] 2 0 2 3 4

desiredfunction_matchwords(strvec1,strvec2)
[1] "Griffin Morgan" "" "Landon Lark" "Megan Thrall Michels" "Tyler West* Grayson Didas"


【Question discussion】:

    Tags: r string substring string-matching longest-substring


    【Solution 1】:

    You can split each string into words and work with the resulting word vectors pair by pair.

    In base R:

    numwords <- function(str1, str2) {
      mapply(function(x, y) length(intersect(x, y)), 
             strsplit(str1, ' '), strsplit(str2, ' '))
    }
    
    matchwords <- function(str1, str2) {
      mapply(function(x, y) paste0(intersect(x, y),collapse = " "), 
             strsplit(str1, ' '), strsplit(str2, ' '))
    }
    
    numwords(strvec1, strvec2)
    #[1] 2 0 2 3 4
    
    matchwords(strvec1, strvec2)
    #[1] "Griffin Morgan"          ""                "Landon Lark"                  
    #[4] "Megan Thrall Michels"          "Tyler West* Grayson Didas"
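
The two functions above repeat the same split-and-intersect step, so they can be combined. The following is a minimal sketch, not part of the original answer: `common_words` is a hypothetical helper name, and splitting on runs of whitespace (rather than a single space) is an added assumption so that repeated spaces do not produce empty "words".

```r
strvec1 <- c("Griffin Rahea Petersen Deana Franks Morgan", "Story Keisha",
             "Douglas Landon Lark", "Kinsman Megan Thrall Michael Michels Breann",
             "Gutierrez Mccoy Tyler West* Grayson Swank Shirley Didas Moriah")
strvec2 <- c("Griffin Morgan Rose Manuel", "Van De Grift Sarah Sell William",
             "Mark Landon Lark", "Beerman Carlee Megan Thrall Michels",
             "Mcmillan Tyler Jonathan West* Grayson Didas Lloyd Connor")

# Hypothetical helper: compare two character vectors element by element and
# return both the number of shared words and the shared words themselves.
common_words <- function(str1, str2) {
  stopifnot(length(str1) == length(str2))
  # Split on runs of whitespace; trimws() avoids a leading empty token.
  words1 <- strsplit(trimws(str1), "\\s+")
  words2 <- strsplit(trimws(str2), "\\s+")
  # intersect() is an exact, literal comparison, so "West*" needs no escaping.
  shared <- Map(intersect, words1, words2)
  data.frame(
    n     = lengths(shared),
    words = vapply(shared, paste, character(1), collapse = " "),
    stringsAsFactors = FALSE
  )
}

res <- common_words(strvec1, strvec2)
res$n
# 2 0 2 3 4
```

Because `intersect()` keeps the order of its first argument, the words in `res$words` appear in the order they have in `strvec1`.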
    

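    【Discussion】:

      【Solution 2】:

      You can use strvec1 as a regular-expression pattern: split it into individual words with strsplit, then paste the words together with the alternation operator |: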

      pattern <- paste0(unlist(strsplit(strvec1, " ")), collapse = "|")
      

      You can use this pattern with str_count and str_extract_all:

      library(stringr) 
      # counts:
      str_count(strvec2, pattern)
      [1] 2 0 2 3 4
      
      # matches:
      str_extract_all(strvec2, pattern)
      [[1]]
      [1] "Griffin" "Morgan" 
      
      [[2]]
      character(0)
      
      [[3]]
      [1] "Landon" "Lark"  
      
      [[4]]
      [1] "Megan"   "Thrall"  "Michels"
      
      [[5]]
      [1] "Tyler"   "West"    "Grayson" "Didas"   # "West", not "West*": the unescaped * acts as a regex quantifier
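
The single pooled pattern has three properties worth knowing about: it combines the words of every element of strvec1, so the comparison is not strictly element by element; regex metacharacters such as the `*` in "West*" are interpreted rather than matched literally; and a word can match inside a longer word. The following is a rough sketch of one way to address all three, written in base R so that it does not depend on a particular stringr version. `escape_regex` and `match_pair` are hypothetical helper names, not part of the original answer.

```r
strvec1 <- c("Griffin Rahea Petersen Deana Franks Morgan", "Story Keisha",
             "Douglas Landon Lark", "Kinsman Megan Thrall Michael Michels Breann",
             "Gutierrez Mccoy Tyler West* Grayson Swank Shirley Didas Moriah")
strvec2 <- c("Griffin Morgan Rose Manuel", "Van De Grift Sarah Sell William",
             "Mark Landon Lark", "Beerman Carlee Megan Thrall Michels",
             "Mcmillan Tyler Jonathan West* Grayson Didas Lloyd Connor")

# Hypothetical helper: backslash-escape regex metacharacters so that a word
# such as "West*" is matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: for ONE pair of strings, return the words of `a`
# that occur as whole, whitespace-delimited words in `b`.
match_pair <- function(a, b) {
  words <- unique(strsplit(a, " ", fixed = TRUE)[[1]])
  words <- words[nzchar(words)]
  if (length(words) == 0L) return(character(0))
  # (?<!\S) and (?!\S) require whitespace or a string edge on both sides.
  pattern <- paste0("(?<!\\S)(?:",
                    paste(escape_regex(words), collapse = "|"),
                    ")(?!\\S)")
  regmatches(b, gregexpr(pattern, b, perl = TRUE))[[1]]
}

# Build one pattern per element instead of one pooled pattern.
matches <- mapply(match_pair, strvec1, strvec2,
                  SIMPLIFY = FALSE, USE.NAMES = FALSE)

lengths(matches)
# 2 0 2 3 4
```

Unlike the `intersect()` approach, the matches come back in the order they appear in strvec2, and a word repeated in strvec2 is returned once per occurrence.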
      

      【Discussion】: