【Title】: How to flag missing left-hand collocates with NA
【Posted】: 2021-04-10 11:14:22
【Question】:

I want to compute the collocates of the lemma GO, including all its forms, e.g. go, goes, gone, etc.:

go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")

The forms of the lemma are stored in this vector:

lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")

That vector is turned into an alternation pattern:

pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
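As a quick check, the sketch below prints the assembled pattern and confirms that the word boundaries make it match whole forms. The test string is taken from the `go` vector above; `m` is just an illustrative variable name, and base R's `regexpr` with `perl = TRUE` is used here only so that the check needs no extra package.

```r
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

print(pattern_GO)
# [1] "\\b(go|goes|going|gone|went|gon na)\\b"

# "go" comes before "going" in the alternation, but the trailing \\b rejects
# the partial match "go" inside "going", so the engine backtracks to "going".
m <- regmatches("she 's going berserk",
                regexpr(pattern_GO, "she 's going berserk", perl = TRUE))
print(m)
# [1] "going"
```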

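However, when I use this pattern with str_extract_all to extract the immediate left-hand collocates of GO, the extraction silently skips every occurrence of GO that is the first word in the string, including strings where GO occurs again later on: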

library(stringr)
str_extract_all(go, paste0("'?\\b[a-z']+\\b(?=\\s?", pattern_GO, ")"))
[[1]]
character(0)

[[2]]
[1] "we"

[[3]]
[1] "he"

[[4]]
[1] "it"

[[5]]
[1] "'m" "na"

[[6]]
[1] "'s"

The expected result is this:

[[1]]
[1] NA

[[2]]
[1] "we"

[[3]]
[1] "he"

[[4]]
[1]  NA  "it"

[[5]]
[1] "'m" "na"

[[6]]
[1] "'s"

How can the extraction be amended so that it also returns NA where there is no left-hand collocate?

【Comments】:

    Tags: r regex


    【Solution 1】:

    You can add an alternative that matches at the start of the string to your consuming pattern:

    str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
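To see what the added `^` alternative does, here is a minimal sketch on just the two strings that begin with a GO form. When no word precedes GO, the first alternative fails, `^` matches the empty string at position 0, and the lookahead still succeeds, so that occurrence shows up as `""` instead of being dropped. The names `left_pat` and `res` are illustrative only.

```r
library(stringr)

pattern_GO <- "\\b(go|goes|going|gone|went|gon na)\\b"

# Consuming part: either a word (optionally with a leading apostrophe) or the
# start of the string; the lookahead requires a form of GO right after it.
left_pat <- paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")")

res <- str_extract_all(c("go after it", "go get it go"), left_pat)
print(res)
# [[1]]
# [1] ""
#
# [[2]]
# [1] ""   "it"
```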
    

    See the regex demo.

    R demo:

    go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
    lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
    pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
    library(stringr)
    str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
    

    Output:

    [[1]]
    [1] ""
    
    [[2]]
    [1] "we"
    
    [[3]]
    [1] "he"
    
    [[4]]
    [1] ""   "it"
    
    [[5]]
    [1] "'m" "na"
    
    [[6]]
    [1] "'s"
    
    

    If you like, you can turn all empty items into NA using

    res <- str_extract_all(go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))
    res <- lapply(res, function(x) ifelse(x=="", NA, x))
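One detail worth knowing: with `ifelse`, an element whose matches are all empty comes back as a logical `NA` rather than `NA_character_`. That is usually harmless, but if every list element should stay a character vector, a type-stable variant can be sketched as follows. `blank_to_na` is a hypothetical helper name, not part of the original answer, and the extraction result is hard-coded here so the snippet runs without stringr.

```r
# Result of the extraction above, hard-coded for a self-contained example.
res <- list("", "we", "he", c("", "it"), c("'m", "na"), "'s")

# Hypothetical helper: swap empty strings for NA_character_, keep the rest.
blank_to_na <- function(x) replace(x, x == "", NA_character_)

res_na <- lapply(res, blank_to_na)

# All elements remain character vectors, including the all-empty first one.
all_char <- vapply(res_na, is.character, logical(1))
print(all_char)
print(res_na[[4]])
# [1] NA   "it"
```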
    

    【Comments】:
