【Question title】: Find words in a text divided into sentences in R
【Posted】: 2021-05-08 13:38:44
【Question】:

Hello, I have a text and I want to retrieve only the sentences that contain certain words. Here is an example.

my_text<- tolower(c("Pro is a molecule that can be found in the air. This molecule spreads glitter and allows bees to fly over the rainbow. For flying, bees need another molecule that is Sub. Sub is activated and so Sub is a substrate. After eating that molecule bees become very speed and they can fly highly. Pro activate Sub. This means that Sub is catalyzed by Pro."))


my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru", 
         "Sab", "Seb", "Sib", "Sob", "Sub"))

sent <- unlist(strsplit(my_text, "\\."))

sent <- sent[grep(pattern = my_words, sent, ignore.case = T)] 

With this code I get the following warning message:

Warning message:
In grep(pattern = my_words, sent, ignore.case = T) :
  argument 'pattern' has length > 1 and only the first element will be used

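How can I avoid this? I want to check every word in my vector. I looked at the stringr package but could not find a solution.

The code can be changed in any way; I am only showing what I have done so far!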

【Comments】:

    Tags: r string word


    【Solution 1】:

    You can build a regular expression pattern from my_words and use it in grep.

    my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru", 
                          "Sab", "Seb", "Sib", "Sob", "Sub"))
    sent <- unlist(strsplit(my_text, "\\."))
    grep(paste0('\\b', my_words, '\\b', collapse = '|'), sent, ignore.case = TRUE, value = TRUE)
    
    #[1] "pro is a molecule that can be found in the air"     
    #[2] " for flying, bees need another molecule that is sub"
    #[3] " sub is activated and so sub is a substrate"        
    #[4] " pro activate sub"                                  
    #[5] " this means that sub is catalyzed by pro"    
    

    I have included word boundaries (\\b) so that only complete words match. For example, 'pre' will not match inside 'spreads'.
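The effect of the word boundaries can be checked on a single sentence from the question. This is only a small sketch; the variable names `sentence`, `plain_match`, and `bounded_match` are made up for illustration.

```r
sentence <- "this molecule spreads glitter"

# Without word boundaries, "pre" is found inside "spreads".
plain_match <- grepl("pre", sentence)
plain_match
#[1] TRUE

# With \\b on both sides, "pre" must stand alone as a word, so nothing matches.
bounded_match <- grepl("\\bpre\\b", sentence)
bounded_match
#[1] FALSE
```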

    【Comments】:

    • I am using the same code in another exercise, but with a large character vector (about 10000 words), and I get this: Error in grep(paste0("\\b", my_words, "\\b", collapse = "|"), sent, ignore.case = TRUE, : invalid regular expression '\baac1\b|\baac3\b|\baad10\b|\baad14\b|\baad15\b|\baad16\b|\baad3\b|\baad4\b|\baad6\b|\baah1\b|\baap1\b|\baar2\b|\baat1\b|\baat2\b|\basp5\b|\babd1\b|\babf1\b|\bbaf1\b|\bobf1\b|\breb2\b|\bsbf1\b|\babf2\b|\bhim1\b|\babm1\b|\babp1\b|\bbp140\b| Why?
    • This is how I obtain my_words: my_words % mutate(names = strsplit(as.character(names), "; ")) %>% unnest(names) my_words
    • I think there is a limit on the size of a regular expression. If the pattern is too large, you need to break it into 2 or 3 parts and apply each part separately.
    • Is there another way to do this without splitting the vector?
    • There may be, but I am not aware of one.
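The suggestion in the comments (split a very long word list into parts and apply each part separately) could be sketched roughly as follows. `match_any_word`, `match_any_token`, and `chunk_size` are hypothetical names introduced here, not part of any package, and a chunk size that keeps the pattern below the regex engine's limit is an assumption to be tuned. The second helper is an alternative that avoids one large regex altogether by splitting each sentence into tokens and using `%in%`; it treats any run of non-alphanumeric, non-underscore characters as a separator, which is close to, but not guaranteed identical to, what `\\b` does. The example text is a shortened version of the one in the question.

```r
# Hypothetical helper: test sentences against a long word list in chunks,
# so that no single regular expression becomes too large.
match_any_word <- function(sentences, words, chunk_size = 500) {
  # Group the words into chunks of at most chunk_size elements.
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  hit <- rep(FALSE, length(sentences))
  for (chunk in chunks) {
    pattern <- paste0("\\b", chunk, "\\b", collapse = "|")
    # A sentence is kept if any chunk matches it.
    hit <- hit | grepl(pattern, sentences, ignore.case = TRUE)
  }
  hit
}

# Hypothetical alternative without a large regex: tokenize, then use %in%.
match_any_token <- function(sentences, words) {
  tokens <- strsplit(tolower(sentences), "[^[:alnum:]_]+")
  vapply(tokens, function(tok) any(tok %in% tolower(words)), logical(1))
}

my_text <- tolower("Pro is a molecule that can be found in the air. This molecule spreads glitter. Pro activate Sub.")
my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru",
                      "Sab", "Seb", "Sib", "Sob", "Sub"))
sent <- trimws(unlist(strsplit(my_text, "\\.")))

by_chunks <- sent[match_any_word(sent, my_words, chunk_size = 3)]
by_tokens <- sent[match_any_token(sent, my_words)]
by_chunks
#[1] "pro is a molecule that can be found in the air"
#[2] "pro activate sub"
```

Note that the chunked version still pastes the words into a regex as they are, so words containing regex metacharacters would need escaping first; the token version compares plain strings and does not have that problem.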
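    【Solution 2】:

    You can define the words you are looking for as an alternation pattern and wrap them in \\b, so that they match only when they occur as whole words (and not as part of another word, e.g. pro --> professional), then feed that pattern into the subsetting approach you used in your post. I would also suggest using trimws to, well, trim the whitespace: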

    sent <- trimws(unlist(strsplit(my_text, "\\.")))
    pattern <- paste0("\\b", my_words, "\\b", collapse = "|")
    sent[grepl(pattern, sent)]
    

    You mentioned the stringr package. A solution based on str_detect is:

    library(stringr)
    sent[str_detect(sent, pattern)]
    

    【Comments】:
