Extract pattern from string in R without distinguishing between upper and lower case letters: answers

【Question title】: Extract pattern from string in R without distinguishing between upper and lower case letters
【Posted】: 2016-06-14 03:16:38
【Question description】:

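This is a toy example. I want to search a and extract the colours listed in b. I want to extract a colour even if it does not start with an upper-case letter. However, the output should show how the colour is written in a.

So the answer I would like to get is "Red" NA "blue".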

library(stringr)

a <- "She has Red hair and blue eyes"
b <- c("Red", "Yellow", "Blue")
str_extract(a, b)
# "Red" NA    NA

I used str_extract from "stringr", but I am happy to use other functions or packages (e.g., grep).

【Question discussion】:

The simplest approach is to convert all strings to the same case; see ?tolower or ?toupper.
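
A minimal base R sketch of what this comment suggests. It shows that lowercasing both sides does find the matches, but the result comes back in lowercase rather than as written in a, which is the part the question still needs solved. The variable names found and lowered are illustrative only.

```r
a <- "She has Red hair and blue eyes"
b <- c("Red", "Yellow", "Blue")

# Lowercase both the text and the patterns, then test each pattern literally.
found <- vapply(tolower(b), grepl, logical(1), x = tolower(a), fixed = TRUE,
                USE.NAMES = FALSE)

# The matched text is the lowercased pattern, not the spelling used in `a`.
lowered <- ifelse(found, tolower(b), NA_character_)
print(lowered)
# "red" NA "blue"
```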

Tags: r string extract

【Solution 1】:

We can do this in base R:

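unlist(sapply(tolower(b), function(x) {
  # Find the lowercased pattern in a lowercased copy of a,
  # then pull the matched text out of the original a
  x1 <- regmatches(a, gregexpr(x, tolower(a)))
  # An empty match prints as "character(0)"; turn it into NA
  replace(x1, x1 == "character(0)", NA)
}), use.names = FALSE)
# "Red"     NA "blue"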

Or, inspired by @leerssej's answer:

library(stringr)
str_extract(a, fixed(b, ignore_case=TRUE))
#[1] "Red"  NA     "blue"
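
As an alternative sketch in base R (not part of the original answer), regexpr accepts ignore.case = TRUE directly, so the lowercasing step can be skipped. The helper name extract_ci is made up for illustration, and each element of b is treated as a regular expression here, which is harmless for plain words like these.

```r
a <- "She has Red hair and blue eyes"
b <- c("Red", "Yellow", "Blue")

# Hypothetical helper: find the first case-insensitive match of `pattern`
# in `text` and return it exactly as it appears there (NA if no match).
extract_ci <- function(pattern, text) {
  m <- regexpr(pattern, text, ignore.case = TRUE)
  if (m == -1) NA_character_ else regmatches(text, m)
}

res <- vapply(b, extract_ci, character(1), text = a, USE.NAMES = FALSE)
print(res)
# "Red" NA "blue"
```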

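【Discussion】:

Unless I'm mistaken, your solution has the same problem as my first attempt... the OP wants the result to keep the capitalisation used in a.
@DominicComtois Fixed here as well!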

【Solution 2】:

stringi has a case-insensitive option:

library(stringi)
stri_extract_all_fixed(a, b, opts_fixed = list(case_insensitive = TRUE))
#[[1]]
#[1] "Red"
#[[2]]
#[1] NA
#[[3]]
#[1] "blue"


# or using simplify = TRUE to get a non-list output
stri_extract_all_fixed(a, b, opts_fixed = list(case_insensitive = TRUE), 
    simplify = TRUE)
#     [,1]  
#[1,] "Red" 
#[2,] NA    
#[3,] "blue"

【Discussion】:

【Solution 3】:

stringr has an ignore.case() function:

library(stringr)
str_extract(a, ignore.case(b))
# "Red"  NA     "blue"

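【Discussion】:

Thanks. Doing this produces an error message: Please use (fixed|coll|regexp)(x, ignore_case = TRUE) instead of ignore.case(x). So maybe I should do this instead: str_extract(fixed(a, ignore_case = TRUE), fixed(b, ignore_case = TRUE))?
@milan That is not an error. It is only a message, not even a warning. The code gives the correct result. But you can use, for example, str_extract(a, fixed(b, ignore_case = TRUE)).
As I see it, the suggestion in my previous comment has already been included in @akrun's edit.
Thanks, everyone! This works well, but what is the purpose of the message? 'ignore.case()' is implemented and gives the correct output, so why is the slightly longer code recommended?
According to the stringr library's code, the ignore.case function is deprecated. Unfortunately, I could not find the reasoning behind this decision written down anywhere I looked. The good news is that the code simply substitutes the updated form for you behind the scenes, so I just let it do the extra work for us :-D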
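
To illustrate the distinction drawn in this thread between a message, a warning and an error, here is a small base R sketch. It assumes the deprecation notice is signalled with message(), as the comment above describes. old_helper is a made-up stand-in for a deprecated function, not stringr's actual code.

```r
# Hypothetical deprecated function: it emits a message, then returns
# its result as usual.
old_helper <- function(x) {
  message("Please use new_helper(x) instead of old_helper(x)")
  toupper(x)
}

# The call still returns its value; the message is only a side effect.
res <- old_helper("blue")

# Messages can be silenced without affecting the result.
quiet <- suppressMessages(old_helper("blue"))

# tryCatch reveals which kind of condition was signalled.
kind <- tryCatch(
  { old_helper("blue"); "none" },
  message = function(m) "message",
  warning = function(w) "warning",
  error   = function(e) "error"
)
print(kind)
# "message"
```

Because a message does not interrupt evaluation, code relying on such a function keeps working until the function is actually removed.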

【Solution 4】:

As an improvement on akrun's answer, you can change the case for matching but still return the elements as they were originally written in a:

library(stringr)
a <- "She has Red hair and blue eyes"
b <- c("Red", "Yellow", "Blue")

positions <- str_locate(toupper(a), toupper(b))
apply(positions, 1, function(x) substr(a,x[1],x[2]))

## [1] "Red"  NA  "blue"

Or, to drop the NAs:

positions <- str_locate(toupper(a), toupper(b))
words <- apply(positions, 1, function(x) substr(a,x[1],x[2]))
words[!is.na(words)]

## [1] "Red"  "blue"
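
The same locate-then-cut idea can be sketched in base R without stringr. This version passes fixed = TRUE so each element of b is matched as a literal string rather than a regular expression. The helper name locate_ci is hypothetical, and the sketch assumes that changing case does not change the length of the string, so positions in the upper-cased copy line up with the original.

```r
a <- "She has Red hair and blue eyes"
b <- c("Red", "Yellow", "Blue")

# Hypothetical helper: locate the pattern in an upper-cased copy of `text`,
# then cut the same character span out of the original `text`.
locate_ci <- function(pattern, text) {
  m <- regexpr(toupper(pattern), toupper(text), fixed = TRUE)
  if (m == -1) return(NA_character_)
  start <- as.integer(m)
  len <- attr(m, "match.length")
  substr(text, start, start + len - 1)
}

words <- vapply(b, locate_ci, character(1), text = a, USE.NAMES = FALSE)
words_found <- words[!is.na(words)]
print(words_found)
# "Red" "blue"
```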

【Discussion】: