Answers: Regular expression-based list matching in R

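[Question title]: Regular expression-based list matching in R
[Posted]: 2013-02-28 13:14:23
[Question description]:

I have two lists (more precisely, atomic character vectors) that I want to compare using regular expressions in order to produce a subset of one of them. I could use a 'for' loop for this, but is there simpler code? The following example illustrates my case: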

# list of unique cities
city <- c('Berlin', 'Perth', 'Oslo')

# list of city-months, like 'New York-Dec'
temp <- c('Berlin-Jan', 'Delhi-Jan', 'Lima-Feb', 'Perth-Feb', 'Oslo-Jan')

# need sub-set of 'temp' for only 'Jan' month for only the items in 'city' list:
#   'Berlin-Jan', 'Oslo-Jan'

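Additional note: in the actual case I need code for, the values equivalent to the 'month' part are more complex. They are random alphanumeric values in which only the first two characters carry the information I care about (they must be '01').

Actual case added: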

# equivalent of 'city' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}
patient <- c('TCGA-43-4897', 'TCGA-65-4897', 'TCGA-78-8904', 'TCGA-90-8984')

# equivalent of 'temp' in the first example
# values match pattern TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-[\d]{2}[0-9A-Z]+
sample <- c('TCGA-21-5732-01A333', 'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76', 'TCGA-78-8904-11A70')

# sub-set wanted (must have '01' after the 'patient' ID part)
#   'TCGA-43-4897-01A159', 'TCGA-65-4897-01T76'

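[Question discussion]:

Please show us your "actual" case.
I have added an actual case. I now realize that this matters.
You don't want 'TCGA-78-8904-11A70'??
It must have '01' rather than '11'; i.e., 'TCGA-78-8904-01A70' meets the criterion and 'TCGA-78-8904-11A70' does not.
OK, good. Please check the edit.

Tags: regex r list subset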

[Solution 1]:

Something like this?

temp <- temp[grepl("Jan", temp)]
temp[sapply(strsplit(temp, "-"), "[[", 1) %in% city]
# [1] "Berlin-Jan" "Oslo-Jan"

Even better, borrowing @agstudy's idea:

> temp[temp %in% paste0(city, "-Jan")]
# [1] "Berlin-Jan" "Oslo-Jan"

Edit: how about this?

> sample[gsub("(.*-01).*$", "\\1", sample) %in% paste0(patient, "-01")]
# [1] "TCGA-43-4897-01A159" "TCGA-65-4897-01T76"

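[Discussion]:

Thanks for the suggestion, but the item values in my actual case are more complex. I have now clarified this in my question.
Arun, thanks for the edited code. For the actual-case example, it would fail if a patient item value contained '-01'. But this slight modification of your snippet seems to work well: sample[gsub("(TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-01).*$", "\\1", sample) %in% paste0(patient, "-01")]
@user594694, I am using a greedy search, .*-01, which matches up to the last -01. So I don't think this will be a problem unless you have a patient like TCGA-78-8904-017190-01 or something similar. That is, if your patient is TCGA-01-0189-017190, it won't be a problem. Try it.
You're right. Thanks again. Your code works on my full case (about 400 patient items and about 1200 sample items).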
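The difference between the greedy pattern and the anchored pattern discussed above can be illustrated with a small sketch. The extra element "TCGA-78-8904-017190-01" is taken from the hypothetical case mentioned in the comment, not from the asker's real data, and the variable names greedy_hits and anchored_hits are introduced here only for illustration.

```r
patient <- c("TCGA-43-4897", "TCGA-65-4897", "TCGA-78-8904", "TCGA-90-8984")
sample <- c("TCGA-21-5732-01A333", "TCGA-43-4897-01A159",
            "TCGA-65-4897-01T76", "TCGA-78-8904-11A70")

# Hypothetical extra sample in which "-01" occurs a second time later on.
tricky <- c(sample, "TCGA-78-8904-017190-01")
wanted <- paste0(patient, "-01")

# Greedy pattern: ".*-01" extends to the LAST "-01", so the key for the
# extra sample is the whole string and no longer equals "<patient>-01".
greedy_key <- gsub("(.*-01).*$", "\\1", tricky)
greedy_hits <- tricky[greedy_key %in% wanted]

# Anchored pattern: "-01" must directly follow the patient ID part.
anchored_key <- sub("^(TCGA-[0-9A-Z]{2}-[0-9A-Z]{4}-01).*$", "\\1", tricky)
anchored_hits <- tricky[anchored_key %in% wanted]

print(greedy_hits)    # the extra sample is missed
print(anchored_hits)  # the extra sample is kept
```

On data without a repeated "-01", such as the asker's example, both patterns select the same elements.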

[Solution 2]:

Here is a solution using two partial string matches...

temp[agrep("Jan",temp)[which(agrep("Jan",temp) %in% sapply(city, agrep, x=temp))]]
# [1] "Berlin-Jan" "Oslo-Jan"

As a function, just for fun...

fun <- function(x, y, pattern) y[agrep(pattern, y)[which(agrep(pattern, y) %in% sapply(x, agrep, x = y))]]
# x is the vector of values to filter by (here: city)
# y is the vector to be filtered (here: temp)
# pattern is the quoted pattern you're filtering on

fun(city, temp, "Jan")
# [1] "Berlin-Jan" "Oslo-Jan"

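[Discussion]:

Thanks for the nice function. It works on the example I gave, but fails on the actual case. I have now added a sample of the actual case to my question. Note: the arguments in the fun() call were accidentally swapped.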
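Since agrep() performs approximate (fuzzy) matching, one possible alternative is a variant that matches exactly. The sketch below is not the answerer's code: filter_exact() and its parameter names are hypothetical, it assumes that key and suffix are separated by "-", and it relies on startsWith(), which requires R 3.3.0 or later.

```r
city <- c("Berlin", "Perth", "Oslo")
temp <- c("Berlin-Jan", "Delhi-Jan", "Lima-Feb", "Perth-Feb", "Oslo-Jan")

# filter_exact() is a hypothetical helper introduced for illustration.
# keys:    values to filter by (e.g. city)
# values:  vector to be filtered (e.g. temp)
# literal: fixed string that every kept value must contain (e.g. "Jan")
filter_exact <- function(keys, values, literal) {
  # fixed = TRUE treats `literal` as plain text, not as a regular expression
  has_literal <- grepl(literal, values, fixed = TRUE)
  # TRUE where a value starts with at least one "<key>-" prefix
  prefixes <- paste0(keys, "-")
  has_key <- vapply(values, function(v) any(startsWith(v, prefixes)),
                    logical(1), USE.NAMES = FALSE)
  values[has_literal & has_key]
}

jan_cities <- filter_exact(city, temp, "Jan")
print(jan_cities)
# [1] "Berlin-Jan" "Oslo-Jan"
```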

[Solution 3]:

You can use gsub:

x <- gsub(paste(paste(city,collapse='-Jan|'),'-Jan',sep=''),1,temp)
> temp[x==1]
[1] "Berlin-Jan" "Oslo-Jan"

The pattern here is:

 "Berlin-Jan|Perth-Jan|Oslo-Jan"
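The same kind of alternation can also be passed to grepl() directly, which avoids the intermediate replaced vector. This is a minimal sketch, assuming the city names contain no regex metacharacters; the names pattern and jan_hits are introduced here for illustration only.

```r
city <- c("Berlin", "Perth", "Oslo")
temp <- c("Berlin-Jan", "Delhi-Jan", "Lima-Feb", "Perth-Feb", "Oslo-Jan")

# Builds "^(Berlin|Perth|Oslo)-Jan$". The anchors ^ and $ require the
# whole element to match. Assumes no regex metacharacters in `city`.
pattern <- paste0("^(", paste(city, collapse = "|"), ")-Jan$")

jan_hits <- temp[grepl(pattern, temp)]
print(jan_hits)
# [1] "Berlin-Jan" "Oslo-Jan"
```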

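[Discussion]:

Clever use of paste. But it could be shorter (I have adjusted my solution by borrowing your idea). (I'm out of votes; I'll upvote again as soon as I'm allowed :))

[Solution 4]:

Here is one more solution, following on from the others and your new requirement: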

sample[na.omit(pmatch(paste0(patient, '-01'), sample))]

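[Discussion]:

@user594694, blindJesse, note: this will not work if the exact partial-match pattern occurs more than once in sample, e.g. sample <- c("TCGA-21-5732-01A333", "TCGA-21-5732-01B859"). pmatch(paste0(patient, "-01"), sample) returns NA, NA in that case.
Good point... I suppose it depends somewhat on the expected data set, but I guess in general this solution won't work.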
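The ambiguity described in the comment above can be reproduced with a small sketch, together with one possible prefix test that keeps duplicated matches. The patient and sample values below are hypothetical, chosen only to trigger the ambiguous case, and startsWith() requires R 3.3.0 or later.

```r
# Hypothetical data: two samples share the same "<patient>-01" prefix.
patient <- c("TCGA-21-5732", "TCGA-43-4897")
sample <- c("TCGA-21-5732-01A333", "TCGA-21-5732-01B859", "TCGA-43-4897-11A70")
prefixes <- paste0(patient, "-01")

# pmatch() returns NA both for an ambiguous prefix (first patient)
# and for a prefix with no match at all (second patient).
pmatch_idx <- pmatch(prefixes, sample)
print(pmatch_idx)

# Testing every sample against every prefix keeps all matching samples.
keep <- vapply(sample, function(s) any(startsWith(s, prefixes)),
               logical(1), USE.NAMES = FALSE)
kept <- sample[keep]
print(kept)
# [1] "TCGA-21-5732-01A333" "TCGA-21-5732-01B859"
```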