Identifying strings based on where substrings appear in the string: answers

【Question title】: Identifying strings based on where substrings appear in the string
【Posted】: 2015-10-05 16:27:12
【Question description】:

Suppose I have a set of strings, say:

#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"

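Now I want to check whether certain patterns occur in the first, middle, or last third of the sequence. So I would like to be able to formulate rules like this: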

Match the string if, and only if, 
marker X occurs in the first/middle/last third of the string

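For example, I might want to match strings that have an A in the first third. Given the sequences above, that would match #1 and #2. I might also want to match strings that have an A in the last third. That would match #2 and #3.

How can I write generic code or a regex pattern that takes rules of this kind as input and then matches the appropriate subsequences?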

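【Question comments】:

This doesn't sound like a problem to solve with a regex. Defining the rules as functions that operate on the input string is more flexible.
@nhahtdh: It will probably need both functions and regex (since whatever I want to match has to be defined by a regex, even if it is a simple one).
Are the strings always the same fixed length?
@rloth: No, the length is dynamic.
I don't think a regex can dynamically split a string into three parts, because a regex cannot count. However, you can build the regex quantifiers dynamically from a variable that holds the string length (divided by 3), which is known at run time. Finding what you want is then trivial.

Tags: regex r string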

【Solution 1】:

Here is a fully vectorized attempt (you can play with the settings and tell me whether you want anything added or changed)

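StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE){
  xsub <- gsub("-", "", x, fixed = TRUE)                      # drop the "-" separators
  sizes <- nchar(xsub) / frac                                 # segment length (may be fractional)
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)  # substr() truncates positions to integers
  hits <- grep(pat, subs, fixed = fixed)                      # indices of strings whose segment contains pat
  if(isTRUE(values)) x[hits] else hits                        # return the strings themselves or their indices
}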

Testing on your vector

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3

Or if you want the actual matched strings

StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C"     "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"

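Note that when the string length (counted after the hyphens are removed) is not exactly divisible by 3 (e.g. nchar(x) == 10), the last third is by default the largest group (e.g. it has size 4 when nchar(x) == 10)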
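
To make the note above concrete, here is a minimal sketch of the segment boundaries for a 10-character string. It reuses the same start/stop arithmetic as StriDetect; the sample string and the variable names are assumptions made for illustration only.

```r
# Hypothetical 10-character input (hyphens already removed).
x <- "ABCDEFGHIJ"
frac <- 3L
size <- nchar(x) / frac                        # 3.33..., not a whole number
seg <- 1:3

# substr()/substring() truncate fractional positions, mimicked here explicitly.
starts <- as.integer(size * (seg - 1L) + 1L)   # 1, 4, 7
stops  <- as.integer(size * seg)               # 3, 6, 10

parts <- substring(x, starts, stops)
parts
# [1] "ABC"  "DEF"  "GHIJ"
nchar(parts)
# [1] 3 3 4
```

The truncation pushes the leftover characters into the final segment, which is why the last third ends up with 4 characters here.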

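【Comments】:

【Solution 2】:

Here is a solution that generates a regex to meet the requirement. Note that a regex can count, but not relative to the total string length. So this generates a custom regex for each input string based on its length. I used stringi::stri_detect_regex rather than grep because the latter is not vectorized over the pattern argument. I also assume that the pattern argument is itself a valid regex, and that any characters that need escaping (e.g. [ or .) are already escaped.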

library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}
(patterns <- get_regex_fn_fractions(strings, "A", 1))
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"

#Test regexs:
stri_detect_regex(strings, patterns)
#[1]  TRUE  TRUE FALSE
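
The same generator can express the other rule from the question, "A in the last third". The sketch below repeats the function so that it runs on its own. It also swaps stri_detect_regex for base R's grepl() wrapped in mapply(), which is an assumption made here to avoid the stringi dependency, not part of the answer.

```r
# Same generator as in the answer, repeated so this block is self-contained.
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}

strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
patterns <- get_regex_fn_fractions(strings, "A", 3)   # rule: "A" in the last third
patterns
# [1] "^.{6}.*A.*.{0}$" "^.{9}.*A.*.{0}$" "^.{7}.*A.*.{0}$"

# Assumption: base R instead of stringi. grepl() is not vectorized over
# patterns, so mapply() pairs each pattern with its own string.
last_third <- unname(mapply(grepl, patterns, strings, MoreArgs = list(perl = TRUE)))
last_third
# [1] FALSE  TRUE  TRUE
```

This matches #2 and #3, as the question expects. Note that the lengths here include the hyphens, because nchar() is applied to the raw strings.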

【Comments】:

【Solution 3】:

Here is one option:

f <- function(txts, needle, operator, threshold) {
  require(stringi)
  txts <- gsub("-", "", txts, fixed = TRUE)             # delete '-'s
  matches <- stri_locate_all_fixed(txts, needle)        # find matches
  ends <- lapply(matches, function(x) x[, "end"])       # end position of each match (= start for a one-character needle)
  ends <- mapply("/", ends, nchar(txts) + 1,            # divide by string length + 1;
                 SIMPLIFY = FALSE)                      # SIMPLIFY = FALSE keeps one vector per string
  hits <- mapply(operator, ends, threshold, SIMPLIFY = FALSE)  # compare every relative position with the threshold
  which(sapply(hits, any))                              # indices of strings with at least one position that passes
}
txts <- c("A-A-B-B-C-C", "A-A-A-A-A-A", "B-B-B-C-A-A")
idx <- f(txts, needle = "A", operator = "<=", threshold = .333)
txts[idx]
# [1] "A-A-B-B-C-C" "A-A-A-A-A-A"

【Comments】:

Could you explain what the operator argument does? Can I also use => and = here?
It doesn't seem to work when I try it. Could you explain how to use the function? Some examples would be very helpful.
I mean >= and ==. See ?Compare. E.g. txts <- c("A-A-B-D-B-C-C", "A-D-A", "B-B-D-B-C-A-A"); f(txts, needle = "D", operator = "==", threshold = .50); f(txts, needle = "C", operator = ">=", threshold = 6/7); f(txts, needle = "B", operator = "<=", threshold = 1/7).
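
To illustrate what the operator argument does in the first example call above, here is a minimal base-R sketch. The helper rel_pos is hypothetical (it is not part of the answer) and uses gregexpr instead of stringi; it only reproduces the "end position divided by length + 1" step so that the comparison can be seen in isolation.

```r
# Hypothetical helper: relative end positions of a fixed needle, base R only.
rel_pos <- function(txts, needle) {
  txts <- gsub("-", "", txts, fixed = TRUE)        # delete '-'s
  hits <- gregexpr(needle, txts, fixed = TRUE)     # start positions, -1 if none
  Map(function(h, n) {
    if (h[1] == -1L) return(numeric(0))            # no match in this string
    (as.integer(h) + nchar(needle) - 1L) / (n + 1) # end position / (length + 1)
  }, hits, nchar(txts))
}

txts <- c("A-A-B-D-B-C-C", "A-D-A", "B-B-D-B-C-A-A")
pos <- rel_pos(txts, "D")                          # 4/8, 2/4 and 3/8

# The operator is passed as a string and looked up as a function,
# so "==" with threshold 0.5 tests pos[[i]] == 0.5 for each string.
hit <- vapply(pos, function(p) any(match.fun("==")(p, 0.5)), logical(1))
which(hit)
# [1] 1 2
```

Exact equality works here because 4/8 and 2/4 are exactly 0.5 in floating point. With a threshold such as 1/3, "<=" or ">=" is likely the safer choice.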