R: Regex in strsplit (finding ", " followed by capital letter)

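【Question title】: R: Regex in strsplit (finding ", " followed by capital letter)
【Posted】: 2015-11-17 14:38:32
【Question】:

Suppose I have a vector containing some strings that I want to split based on a regular expression.

More precisely, I want to split the strings on a comma, followed by a space, followed by a capital letter. As far as I understand, the regex would look like this: /(, [A-Z])/g (which works fine when I try it here).

When I try to do this in R, the regex doesn't seem to work, for example: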

x <- c("Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)",
  "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)")

strsplit(x, "/(, [A-Z])/g")
[[1]]
[1] "Non MMF investment funds, Insurance corporations, Assets (Net Acquisition of), Loans, Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations, Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds, Assets (Net Acquisition of), Loans, Short-term original maturity (up to 1 year)"

It doesn't find any split. What am I doing wrong here?

Any help is much appreciated!

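【Comments】:

You shouldn't use /.../g. This isn't JS. I guess you don't want to drop the letter, do you? Try this.
Regex delimiters aren't used in R, and strsplit consumes the characters it matches.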
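
To make the two comments above concrete, here is a minimal sketch. The sample string s is made up for illustration and is not from the question. It shows that the JavaScript-style pattern is taken literally, slashes included, and that a plain ", [A-Z]" pattern removes the capital letter together with the separator.

```r
# Made-up sample string, used only for illustration.
s <- "Loans, Long-term, short-term, Assets"

# The slashes and the trailing "g" are treated as literal characters,
# so nothing matches and the string comes back unsplit.
js_style <- strsplit(s, "/(, [A-Z])/g")[[1]]

# Without the delimiters the pattern does match, but the capital letter
# is part of the match, so it is dropped along with ", ".
consumed <- strsplit(s, ", [A-Z]")[[1]]

print(js_style)
print(consumed)
```

In the second result the pieces that followed a split have lost their first letter, which is why a plain character class is not enough here.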

Tags: regex r strsplit

【Solution 1】:

Here is a solution:

strsplit(x, ", (?=[A-Z])", perl=TRUE)

See the IDEONE demo

Output:

[[1]]
[1] "Non MMF investment funds"                                       
[2] "Insurance corporations"                                         
[3] "Assets (Net Acquisition of)"                                    
[4] "Loans"                                                          
[5] "Long-term original maturity (over 1 year or no stated maturity)"

[[2]]
[1] "Non financial corporations"                                                                                
[2] "Financial corporations other than MFIs, insurance corporations, pension funds and non-MMF investment funds"
[3] "Assets (Net Acquisition of)"                                                                               
[4] "Loans"                                                                                                     
[5] "Short-term original maturity (up to 1 year)"

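The regex ", (?=[A-Z])" contains a lookahead, (?=[A-Z]), which checks that an uppercase letter follows but does not consume it. In R, you need perl=TRUE (perl=T for short) to use a regex that contains lookarounds.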
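
As a rough illustration of "checks but does not consume", the sketch below extracts what the pattern actually matches. The sample string s is invented for this example. Only the comma and the space are in the match, so the capital letter stays in the following piece.

```r
# Made-up sample string, used only for illustration.
s <- "Loans, Long-term, short-term, Assets"

# What the pattern matches: only ", ", never the capital letter.
m <- regmatches(s, gregexpr(", (?=[A-Z])", s, perl = TRUE))[[1]]

# Splitting on that match keeps each capital letter in its piece;
# ", short-term" is not a split point because "s" is lowercase.
pieces <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]

print(m)
print(pieces)
```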

If the space is optional, or there may be two spaces between the comma and the capital letter, use

strsplit(x, ",\\s*(?=[A-Z])", perl=TRUE)
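
A small sketch of how the \\s* variant behaves. The variants vector is hypothetical test input, not data from the question. It covers zero, one, and two spaces after the comma, plus a lowercase case that should stay unsplit.

```r
# Hypothetical inputs: no space, one space, two spaces, lowercase follower.
variants <- c("Loans,Assets", "Loans, Assets", "Loans,  Assets", "Loans, assets")

# \\s* absorbs any amount of whitespace after the comma; the lookahead
# still requires an uppercase letter right after it.
res <- strsplit(variants, ",\\s*(?=[A-Z])", perl = TRUE)

print(res)
```

The first three inputs split into the same two pieces, and the last one is returned whole because no uppercase letter follows the comma.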

There is also a variant that supports Unicode letters (with \\p{Lu}):

strsplit(x, ", (?=\\p{Lu})", perl=TRUE)
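
The difference between [A-Z] and \\p{Lu} only shows up with non-ASCII capitals. The sketch below uses an invented string written with \u escapes so that R marks it as UTF-8. It assumes the PCRE build behind perl = TRUE has Unicode property support, which is normally the case.

```r
# Invented sample; \u escapes keep the source ASCII-only and mark the
# string as UTF-8 ("Zürich, Österreich, Écosse, Wien").
s <- "Z\u00fcrich, \u00d6sterreich, \u00c9cosse, Wien"

# [A-Z] only recognises ASCII capitals, so the splits before the
# accented capitals are missed.
ascii_only <- strsplit(s, ", (?=[A-Z])", perl = TRUE)[[1]]

# \\p{Lu} matches any Unicode uppercase letter.
unicode <- strsplit(s, ", (?=\\p{Lu})", perl = TRUE)[[1]]

print(length(ascii_only))
print(length(unicode))
```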

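【Comments】:

@Thomas: I don't think this is a competition. At least I don't see it that way. We all win here. Unless someone starts downvoting without explanation.
I meant that you type faster than I do, but I'm too lazy to write it up.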