Extract a substring between two words from a string: answers

【Question title】: Extract a substring between two words from a string
【Posted】: 2013-12-12 00:57:02
【Question】:

I have the following string:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

I want to extract the string between the two <body> tags, including the tags themselves. The result I am looking for is:

substring = "<body>Iwant\to+extr@ctth!sstr|ng<body>"

Note that the substring between the two <body> tags can contain letters, digits, punctuation, and special characters.

Is there a simple way to do this? Thanks!

【Comments】:

Maybe this: <body>[\S\s]*<body>

Tags: regex string r substr

【Solution 1】:

Here is the regular-expression way:

regmatches(string, regexpr('<body>.+<body>', string))

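【Comments】:

Why is perl = TRUE needed here?
@Codoremifa You don't need it, thanks. Initially I thought the OP wanted to exclude the tags, so I suggested lookahead assertions, which require the perl = TRUE flag.
One advantage of perl = TRUE is that it's faster.
@Arun No kidding. Thanks, I didn't know that.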
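
The comments above mention lookaround assertions as a way to drop the tags from the result. The following is a minimal sketch of that idea, not code from the original answer; the variable names `with_tags` and `inner` are illustrative. It assumes the PCRE engine (perl = TRUE), since lookbehind and lookahead are not available in R's default regex engine.

```r
# Sample string from the question; "\t" inside an R string literal is a tab.
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# Default engine: the match includes both <body> tags.
with_tags <- regmatches(string, regexpr("<body>.+<body>", string))

# PCRE engine: the lookbehind and lookahead assert that the tags are present
# without consuming them, so only the text between the tags is returned.
inner <- regmatches(string,
                    regexpr("(?<=<body>).+(?=<body>)", string, perl = TRUE))

print(with_tags)
print(inner)
```

If the tags should be kept, as the question asks, the first call is enough and perl = TRUE is optional.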

【Solution 2】:

regex = '<body>.+?<body>'

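You want the non-greedy quantifier (.+?), so that the match stops at the first closing <body> instead of stretching across as many <body> tags as possible.

If you are using a plain regex without helper functions, you will need a capture group to extract the part you want, i.e.: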

regex = '(<body>.+?<body>)'
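
To make the greedy versus non-greedy difference concrete, here is a small sketch in R. The input `s` with three <body> markers is a made-up example, not the string from the question, and perl = TRUE is an assumption made here so that the lazy quantifier follows PCRE semantics (the perl argument of regexec requires R 3.3.0 or later).

```r
# Hypothetical input with three <body> markers, to make the difference visible.
s <- "xx<body>first<body>second<body>yy"

# Greedy: .+ runs on to the LAST <body> in the string.
greedy <- regmatches(s, regexpr("<body>.+<body>", s))

# Non-greedy: .+? stops at the FIRST <body> that follows.
lazy <- regmatches(s, regexpr("<body>.+?<body>", s, perl = TRUE))

# Capture group: regexec() returns the whole match followed by each group,
# so element 2 is the text captured between the tags.
groups <- regmatches(s, regexec("<body>(.+?)<body>", s, perl = TRUE))[[1]]

print(greedy)  # "<body>first<body>second<body>"
print(lazy)    # "<body>first<body>"
print(groups)  # "<body>first<body>" "first"
```

With only two tags, as in the question, the greedy and non-greedy patterns return the same match; the distinction matters once a third tag can appear.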

【Comments】:

【Solution 3】:

strsplit() should help you:

> string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
> x = strsplit(string, '<body>', fixed = FALSE, perl = FALSE, useBytes = FALSE)
> x
[[1]]
[1] "asflkjsdhlkjsdhglk"         "Iwant\to+extr@ctth!sstr|ng" "sdgdfsghsghsgh"

> x[[1]][2]
[1] "Iwant\to+extr@ctth!sstr|ng"

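Of course, this gives you all three parts of the string, and the tags are not included.

【Comments】:

Thank you very much. But the body tags are excluded in your solution. I would like them to be returned as well.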
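
One way to address the comment above while staying with strsplit() is to paste the tags back onto the middle piece. This is only a sketch: it assumes the string contains exactly two <body> tags, and the names `tag`, `parts`, and `substring_with_tags` are illustrative. fixed = TRUE is used so the tag is treated as a literal string rather than a regex.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
tag <- "<body>"

# Split on the literal tag: before, between, and after the two tags.
parts <- strsplit(string, tag, fixed = TRUE)[[1]]

# Assumption: exactly two tags, so the wanted text is the second piece.
# Re-attach the tag on both sides to get the result the question asks for.
substring_with_tags <- paste0(tag, parts[2], tag)

print(substring_with_tags)
```

If the number of tags can vary, the regex-based answers are likely a better fit than indexing into the split result.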

【Solution 4】:

I believe both Matthew's and Steve's answers are acceptable. Here is another solution:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

regmatches(string, regexpr('<body>.+<body>', string))

output = sub(".*(<body>.+<body>).*", "\\1", string)

print(output)
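
One caveat with the sub() approach: when the pattern does not match, sub() returns the input string unchanged, which can be mistaken for a successful extraction. The sketch below wraps the same pattern in a hypothetical helper, `extract_between` (not part of the original answer), that returns NA when no pair of tags is found.

```r
# Hypothetical helper: same sub() pattern as above, guarded by grepl().
extract_between <- function(x, tag = "<body>") {
  pattern <- paste0(".*(", tag, ".+", tag, ").*")
  # sub() leaves x untouched when nothing matches, so check for a match first.
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

print(extract_between(string))
print(extract_between("no tags here"))  # NA
```

The helper assumes the tag contains no regex metacharacters, which holds for <body>.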

【Comments】: