R string and subset: answers

【Question title】: R string and subset
【Posted】: 2015-07-09 07:39:34
【Question】:

I have a very long HTML string with

Length: 1
Class and mode: character

......uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....

Is it possible to extract part of this string based on the text in it? I want to subtract everything from class="vip" title="Click this link to access up to (DVD, 2011, so that I get this:

The Big Bang Theory: The Complete Fourth Season

Thanks for your help.

【Comments on the question】:

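I think the asker is struggling with English and actually means "extract" (== "retain") rather than "subtract" (== "remove").
Is the pattern always "Click to access ... stuff you want ... (extra stuff)"?
@BondedDust I want to delete everything before class="vip" title="Click this link to access and everything after class="vip" title="Click this link to access, and leave only The Big Bang Theory: The Complete Fourth Season. Sorry for my bad English.
@rawr Yes, "Click to access ... stuff you want ... (extra stuff)" is the pattern.
Don't grep HTML... use rvest to parse it.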

Tags: r string character substr substring

【Solution 1】:

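Use the grouping operator (). The pattern below matches the whole string in three groups, and the replacement keeps only the second one, so everything up to and including "link to access " and everything from " (DVD," onward is discarded. The expression .+ means <one or more of any character>. See the ?regex help page for an explanation of "^" and "$" and for more details on using \\N in the replacement: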

htxt <- 'uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....'

# Replace the whole string with the contents of the second group
gsub(pattern = "^(.+link to access )(.+)( \\(DVD,.+$)", replacement = "\\2", htxt)
[1] "The Big Bang Theory: The Complete Fourth Season"
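To see what each of the three groups actually captures, the same pattern can be passed to regexec() and regmatches() from base R. This is only an illustrative sketch: the variable names pattern, groups and title are introduced here and are not part of the original answer.

```r
htxt <- 'uygdasd class="vip" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set).....'

pattern <- "^(.+link to access )(.+)( \\(DVD,.+$)"

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
groups <- regmatches(htxt, regexec(pattern, htxt))[[1]]

groups[2]  # group 1: everything up to and including "link to access "
groups[3]  # group 2: the title to keep
groups[4]  # group 3: " (DVD," and everything after it

# sub() is sufficient here: the pattern is anchored with ^ and $,
# so it can match at most once per string.
title <- sub(pattern, "\\2", htxt)
title
```

Note that if a string does not match the pattern at all, sub() and gsub() return it unchanged, so it may be worth checking with grepl(pattern, htxt) first.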

Of course, there is a famous, highly upvoted answer on this very topic: RegEx match open tags except XHTML self-contained tags
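For comparison, here is a rough sketch of the parser-based route suggested in the comments, using the rvest package. The question only shows the class and title attributes, so the surrounding <a> element and its href in this fragment are assumptions made up for illustration; the real page may be structured differently.

```r
library(rvest)

# Hypothetical fragment: the <div>, the <a> tag and the href are assumed,
# only the class and title attributes come from the question.
html <- '<div><a class="vip" href="#" title="Click this link to access The Big Bang Theory: The Complete Fourth Season (DVD, 2011, 3-Disc Set)">link</a></div>'

page <- read_html(html)

# Select every element with class "vip" and read its title attribute.
titles <- html_attr(html_elements(page, ".vip"), "title")

# Strip the fixed prefix and the trailing parenthesised details.
clean <- sub("^Click this link to access ", "", titles)
clean <- sub(" \\(DVD,.*$", "", clean)
clean
```

Once the parser has isolated the attribute value, the regular expressions only have to deal with a short, predictable string rather than the whole page.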

【Comments】: