Regex to extract the part of a string that matches certain words in R: answers

【Question title】: regex to extract partial string matching certain words in R
【Posted】: 2014-06-15 10:51:24
【Question】:

My data contains text messages like the ones shown below. I want to extract the age of the block from each of them.

x:
my block is 8 years old and I am happy with it. I had been travelling since 2 years and that’s fun too…..
He invested in my 1 year block and is happy with the returns
He re-invested in my 1.5 year old block 
i had come to U.K for 4 years and when I reach Germany my block will be of 5 years

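I extracted the numbers that are followed by the word "year" or "years", but I realized that I should instead pick the number that is closest to the word "block".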

library(stringr)

> str_extract_all(x, "[0-9.]{1,3}.year|[0-9.]{1,3}.years")
[[1]]
[1] "8 years" "2 years"

[[2]]
[1] "1 year"

[[3]]
[1] "1.5 year"

[[4]]
[1] "4 years" "5 years"

I would like the output to be a list containing:

8 years
1 year
1.5 year
5 years

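I am thinking of extracting the part of each sentence that contains words such as "block" or "old", but I am not sure how to implement this. Any ideas or suggestions for improving this process would be helpful.

Thanks

【Comments】:

@David- I only want to extract the age of the block. I have edited my post to include the name of the library.

Tags: regex string r substring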

【Solution 1】:

Here is a solution that keeps using stringr:

library(stringr)
m1 <- str_match(x, "block.*?([0-9.]{1,3}.year[s]?)")
m2 <- str_match(x, "([0-9.]{1,3}.year[s]?).*?block")
sapply(seq_along(x), function(i) {
   if (is.na(m1[i, 1])) m2[i, 2]
   else if (is.na(m2[i, 1])) m1[i, 2]
   else if (str_length(m1[i, 1]) < str_length(m2[i, 1])) m1[i, 2]
   else m2[i, 2]
})
## [1] "8 years"  "1 year"   "1.5 year" "5 years"

Or, equivalently:

m1 <- str_match(x, "block.*?([0-9.]{1,3}.year[s]?)")
m2 <- str_match(x, "([0-9.]{1,3}.year[s]?).*?block")
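# lengths of the two full matches per message; NA where a pattern did not match
len <- cbind(str_length(m1[, 1]), str_length(m2[, 1]))
cbind(m1[, 2], m2[, 2])[cbind(seq_len(nrow(m1)), apply(len, 1, which.min))]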

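Both solutions assume that "block" occurs only once in each string.

【Comments】:

I like the str_match idea. Thanks. :)

【Solution 2】:

One idea is to get the positions of the word "block" and the positions of the ages, and then find the nearest age for each block. I am using gregexpr to get the positions.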

## position of blocks
d_block <- unlist(gregexpr('block',txt))
## position of ages
## Note here that I am using ? to simplify your regex
d_age <- unlist(gregexpr("[0-9.]{1,3}.years?",txt))
## for each block , get the nearest age position 
nearest <- sapply(d_block,function(x)d_age[which.min(abs(x-d_age))])
## get ages values
all_ages <- unlist(regmatches(txt,gregexpr("[0-9.]{1,3}.years?",txt)))
## filter to keep only ages near to block
all_ages[d_age %in% nearest]

"8 years"  "1 year"   "1.5 year" "5 years"

【讨论】：

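I am not getting the same output. My output is 8 years, 4 years and 5 years. I am running your code on the same 4 records. > all_ages[d_age %in% nearest] [1] "8 years" "4 years" "5 years"
@user1946217 Really?? What do you get when you test it on the text from the question?
I get the following output: > all_ages[d_age %in% nearest] [1] "8 years" "4 years" "5 years"
@ It's really awesome that you get this!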
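The output reported in these comments is consistent with txt being the character vector of four messages rather than one single string: unlist(gregexpr(...)) then pools character positions from different messages, so a "block" in one message can be paired with an age in another. A minimal base R sketch that applies the same nearest-position idea to each message separately is below. The helper name nearest_age is hypothetical, x is assumed to hold the four messages from the question (with the trailing punctuation of the first one simplified to ASCII), and distance is measured between match start positions, as in the answer.

```r
# The four messages from the question (trailing punctuation simplified to ASCII)
x <- c(
  "my block is 8 years old and I am happy with it. I had been travelling since 2 years and that's fun too",
  "He invested in my 1 year block and is happy with the returns",
  "He re-invested in my 1.5 year old block",
  "i had come to U.K for 4 years and when I reach Germany my block will be of 5 years"
)

age_pattern <- "[0-9.]{1,3}.years?"

# nearest_age() is a hypothetical helper, not part of any package.
# It expects a single string and returns the age(s) closest to each "block".
nearest_age <- function(msg) {
  block_pos <- gregexpr("block", msg)[[1]]
  age_hits  <- gregexpr(age_pattern, msg)
  age_pos   <- age_hits[[1]]
  # gregexpr() reports -1 when there is no match
  if (block_pos[1] == -1 || age_pos[1] == -1) return(NA_character_)
  ages <- regmatches(msg, age_hits)[[1]]
  # for every "block", the index of the age whose start position is closest
  idx <- sapply(block_pos, function(b) which.min(abs(b - age_pos)))
  ages[unique(idx)]
}

# one result per message, so positions are never compared across messages
result <- lapply(x, nearest_age)
print(unlist(result))
```

Because each message is handled on its own, a message with several occurrences of "block" returns one age per distinct nearest match, and a message without "block" or without an age returns NA.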

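【Solution 3】:

This approach finds the word "year" or "years" that is the shortest distance (counted in words) from "block", then removes all the other "year" or "years" words from each message before running the str_extract_all line.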

words <- strsplit(x, " ")  # split every message into words once
# word index of the "year" closest to "block"; NULL when a message has at most one "year"
goodyear <- lapply(words, function(w) {
  yrs <- grep("year", w)
  if (length(yrs) > 1) yrs[which.min(abs(grep("block", w) - yrs))]
})
for (i in seq_along(x)) {
  if (!is.null(goodyear[[i]])) {
    drop <- setdiff(grep("year", words[[i]]), goodyear[[i]])  # the other "year" words
    print(str_extract_all(paste(words[[i]][-drop], collapse = " "), "[0-9.]{1,3}.years?"))
  } else print(str_extract_all(x[[i]], "[0-9.]{1,3}.years?"))
}

## [[1]]
## [1] "8 years"
##
## [[1]]
## [1] "1 year"
## 
## [[1]]
## [1] "1.5 year"
## 
## [[1]]
## [1] "5 years"

【Comments】: