Can't get rid of non-capturing regex groups: answers

【Question title】: Can't get rid of non-capturing regex groups
【Posted】: 2015-02-27 01:23:21
【Question description】:

I have the following string:

In order to take this course, you must:<br>
<br>
&radic; &nbsp; &nbsp;Have access to a computer.<br>
<br>
&radic; &nbsp; &nbsp;Have continuous broadband Internet access.<br>
<br>
&radic; &nbsp; &nbsp;Have the ability/permission to install plug-ins (e.g. Adobe Reader or Flash) and software.<br>
<br>
&radic; &nbsp; &nbsp;Have the ability to download and save files and documents to a computer.<br>
<br>
&radic; &nbsp; &nbsp;Have the ability to open Microsoft file and documents (.doc, .ppt, .xls, etc.).<br>
<br>
&radic; &nbsp; &nbsp;Be competent in the English language.<br>
<br>
&radic; &nbsp; &nbsp;Have access to a relational database management system.&nbsp; A good open-source option is MySQL (<a href="http://dev.mysql.com" target="_blank">dev.mysql.com</a>).<br>
<br>
&radic; &nbsp; &nbsp;Have completed the Discrete Structures course.<br>
<br>
&radic;&nbsp;&nbsp;&nbsp; Have read the Student Handbook.

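I'm trying to select the text in the middle (excluding the heading, the encoded spaces, and the <br>s). For example, the first match should be: Have access to a computer.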

I tried the following two patterns, but couldn't get either to work.

This one selects the whole line: ^(?:&radic;([(&nbsp;)|\s]*))(.*)(?:(\<br\\?\>)*)$. I tried calling Regex.Matches(requirements.InnerHtml, RequirementsExtractorRegex, RegexOptions.Multiline)[0].Captures[0].Value, and the value is: &radic; &nbsp; &nbsp;Have access to a computer.<br>.

And this one doesn't select anything: ^(?<=&radic;([(&nbsp;)|\s]*))(.*)(?=(\<br\\?\>)*)$

What am I doing wrong?

【Comments】:

You mean, what are you doing wrong besides parsing HTML with a regex? Surely you've seen "RegEx match open tags except XHTML self-contained tags"?

Tags: c# html .net regex

【Solution 1】:

A slight modification of the regex produces (almost, see below) the desired result:

^(?:&radic;(?:&nbsp;|\s)*)(.*)(?:<br/?>)

Reference the target match in group #1:

Regex.Matches(requirements.InnerHtml, RequirementsExtractorRegex, RegexOptions.Multiline)[0].Groups[1].Value
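To illustrate why Groups[1] works where Captures[0] did not: on a Match object, Captures[0] is the capture of the whole match (the same text as Value), while Groups[1] holds only what the first capturing parentheses matched. The following is a minimal sketch, not the asker's actual code. The class name, the FirstMatch helper and the shortened sample string are assumptions made for illustration. The pattern is the one from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupVersusCapture
{
    // Pattern from the answer above; only (.*) is a capturing group (#1).
    public const string Pattern = @"^(?:&radic;(?:&nbsp;|\s)*)(.*)(?:<br/?>)";

    // Hypothetical sample: the first item of the question's input plus the empty <br> line.
    public const string Sample =
        "&radic; &nbsp; &nbsp;Have access to a computer.<br>\n<br>\n";

    // Returns the first match of Pattern in the input, using multiline anchors.
    public static Match FirstMatch(string input)
    {
        return Regex.Match(input, Pattern, RegexOptions.Multiline);
    }

    public static void Main()
    {
        Match m = FirstMatch(Sample);
        Console.WriteLine(m.Value);              // the whole matched line, prefix and <br> included
        Console.WriteLine(m.Captures[0].Value);  // same text as m.Value
        Console.WriteLine(m.Groups[1].Value);    // only the text captured by group #1
    }
}
```

Non-capturing groups still take part in the overall match. They only avoid creating a numbered group, so the text they consume still shows up in Value and Captures[0].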

Tested on regexstorm with the multiline option turned on.

Caveat

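Because the br element is not optional, the regex matches every target occurrence except the last one. Quantifying that part (for example with *) brings the last occurrence into the matches, but then capture group #1 also contains the br element that terminates each line, because the greedy catch-all (.*) takes it first. Adding an end-of-line anchor prevents any match at all (although, as I understand the spec, it shouldn't; possibly an artifact of the test environment?).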
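One possible way around this caveat, sketched under assumptions that go beyond the answer: make the capture lazy, make the br element optional, and allow an optional \r before the end-of-line anchor. In .NET, $ with RegexOptions.Multiline matches only before \n (or at the end of the string), so if the input uses CRLF line endings, a bare $ right after <br> never matches. That may be what was observed above, though the thread does not confirm it. The class name, the Extract method, the sample string and the pattern itself are illustrative and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RequirementsExtractor
{
    // Assumed pattern (not from the thread):
    //   ^&radic;          a line that starts with the encoded check mark
    //   (?:&nbsp;|\s)*    encoded or plain spaces after it
    //   (.*?)             lazy capture: stops as soon as the rest of the pattern can match
    //   (?:<br\s*/?>)?    optional trailing <br>, <br/> or <br />
    //   \r?$              optional carriage return, then end of line
    private static readonly Regex ItemRegex = new Regex(
        @"^&radic;(?:&nbsp;|\s)*(.*?)(?:<br\s*/?>)?\r?$",
        RegexOptions.Multiline);

    // Returns the text of each requirement line, without prefix or trailing <br>.
    public static List<string> Extract(string html)
    {
        var items = new List<string>();
        foreach (Match m in ItemRegex.Matches(html))
        {
            items.Add(m.Groups[1].Value);
        }
        return items;
    }

    public static void Main()
    {
        // Hypothetical input with CRLF line endings; the last item has no <br>.
        string html =
            "&radic; &nbsp; &nbsp;Have access to a computer.<br>\r\n" +
            "<br>\r\n" +
            "&radic;&nbsp;&nbsp;&nbsp; Have read the Student Handbook.";

        foreach (string item in Extract(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

The lazy quantifier is what keeps the <br> out of group #1: the capture grows one character at a time and stops at the first position where the optional <br> and the end of line can follow.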

【Comments】:

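The last statement doesn't match. I take it (?:&nbsp;|\s)* means &nbsp; or whitespace, zero or more times, in any order, doesn't it? Then what do you use to look for an optional word repeated zero or more times?
Both of your observations are correct, and your syntax is right. IMHO the problem is not the second non-capturing group but the third: as it stands, it prevents the last line from matching; when quantified with *, the preceding greedy capture group wins in every match (i.e. it includes the <br>). I don't have a solution for this (other than artificially appending <br>\n to the original string).
I tried replacing the dot and it still doesn't work; see this one.
The ampersands of the HTML entities haven't made it into the regex pattern line, and you have to include at least . in the character class. If you do that and use . as a stand-in for the ampersand, you get ^(?:.radic;(?:.nbsp;|\s)*)([A-Za-z0-9.\s]*)(?:<br\/?>), which matches 4 times.
I ended up using my original query without the last group, and manually stripping it from each resulting match.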
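The two-step approach described in the last comment could look roughly like this. It is only one reading of that comment: the asker's final code is not shown in the thread, so both patterns, the class name and the method name below are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStepExtractor
{
    // Step 1 (assumed pattern): capture everything after the prefix up to the end of the line.
    private static readonly Regex LineRegex = new Regex(
        @"^&radic;(?:&nbsp;|\s)*(.*)",
        RegexOptions.Multiline);

    // Step 2 (assumed pattern): an optional trailing <br> plus any trailing whitespace such as \r.
    private static readonly Regex TrailingBreak = new Regex(@"(?:<br\s*/?>)?\s*$");

    // Returns each requirement with the trailing <br> stripped in a separate pass.
    public static List<string> Extract(string html)
    {
        var items = new List<string>();
        foreach (Match m in LineRegex.Matches(html))
        {
            string raw = m.Groups[1].Value;           // may still end in "<br>" and "\r"
            items.Add(TrailingBreak.Replace(raw, ""));
        }
        return items;
    }

    public static void Main()
    {
        // Hypothetical input; the last item has no <br>.
        string html =
            "&radic; &nbsp; &nbsp;Be competent in the English language.<br>\r\n" +
            "<br>\r\n" +
            "&radic;&nbsp;&nbsp;&nbsp; Have read the Student Handbook.";

        foreach (string item in Extract(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

Splitting the work this way avoids the interaction between the greedy capture and the trailing group, at the cost of a second pass over each match.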