Regex to find missing space after HTML tags

【Question title】: Regex to find missing space after html tags
【Posted】: 2012-06-08 07:21:34
【Question】:

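From a set of text with more than 10,000 lines, I need to find every string in which the space after one of a set of HTML tags is missing. The set of HTML tags is limited and is listed below.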

 , , , , <ul> </ul>, <li> </li>, <ol> </ol>

After running the regex, strings like the following should appear in the results.

Hi <b>all</b>good morning.

In this case, the space after the bold tag is missing.

【Comments】:

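I can't see how you could possibly expect that output from that input unless you hard-code the output. -1 for trying to use regex for an operation that is well known to require a stack.
You've tagged both C# and JavaScript - which language are you using?
@bdares: This is a perfect fit for regex. At least if I understand the question correctly.
@bdares I don't need the output; I only need to find all instances where such strings occur.
This is a good question, just poorly worded, so people misunderstand what is being asked. He says there is only the set of tags above, and he needs a regex to make sure they are followed by a space, so that he doesn't end up with something like Hi all> but gets Hi all instead. Since this involves HTML, I'm guessing it's a web application, hence the tags for both C# (the language he is using) and JavaScript, as he would be happy to use a JavaScript script to do this.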

Tags: c# javascript regex

【Solution 1】:

Assuming C#:

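using System.Collections.Specialized;   // StringCollection
using System.Text.RegularExpressions;   // Regex, Match

// Collect every line in which one of the listed tags is not followed by a space.
StringCollection resultList = new StringCollection();
Regex regexObj = new Regex("^.*<(?:/?b|/?em|/?su[pb]|/?[ou]l|/?li|span style=\"text-decoration: underline;\" data-mce-style=\"text-decoration: underline;\"|/span)>(?! ).*$", RegexOptions.Multiline);
Match matchResult = regexObj.Match(subjectString);   // subjectString holds the text to scan
while (matchResult.Success) {
    resultList.Add(matchResult.Value);   // the whole offending line
    matchResult = matchResult.NextMatch();
}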

This returns every line of the file in which at least one of the listed tags is not followed by a space.

Input:

This </b> is <b> OK
This <b> is </b>not OK
Neither <b>is </b> this.

Output:

This <b> is </b>not OK
Neither <b>is </b> this.
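
Since the question is also tagged javascript, the same check could be sketched there roughly as follows. This is a minimal sketch, not part of the original answer: the function name findLinesMissingSpace and the constant name are made up for illustration, and the pattern is copied from the C# version with only the slashes escaped.

```javascript
// Same pattern as the C# answer; the "m" flag plays the role of RegexOptions.Multiline,
// and "g" makes String.prototype.match return every matching line.
const MISSING_SPACE_LINE = /^.*<(?:\/?b|\/?em|\/?su[pb]|\/?[ou]l|\/?li|span style="text-decoration: underline;" data-mce-style="text-decoration: underline;"|\/span)>(?! ).*$/gm;

// Hypothetical helper: returns the lines in which a listed tag is not followed by a space.
function findLinesMissingSpace(text) {
  // match() with the g flag returns an array of matches, or null when there are none.
  return text.match(MISSING_SPACE_LINE) || [];
}

const input = [
  "This </b> is <b> OK",
  "This <b> is </b>not OK",
  "Neither <b>is </b> this.",
].join("\n");

console.log(findLinesMissingSpace(input));
// [ 'This <b> is </b>not OK', 'Neither <b>is </b> this.' ]
```

Note that a tag sitting at the very end of a line is also reported, because the lookahead (?! ) only checks that the next character is not a space.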

Explanation:

^      # Start of line
.*     # Match any number of characters except newlines
<      # Match a <
(?:    # Either match a...
 /?b   #  b or /b
|      # or 
 /?em  #  em or /em
|...   # etc. etc.
)      # End of alternation
>      # Match a >
(?! )  # Assert that no space follows
.*     # Match any number of characters until...
$      # End of line
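
The leading ^.* and trailing .*$ only serve to widen the match to the whole line. If the position of each offending tag is wanted instead, they can be dropped so that only the tag plus the lookahead is matched. A minimal JavaScript sketch under that assumption (the name findTagsMissingSpace and the shape of the returned objects are hypothetical, not from the answer):

```javascript
// Tag alternation and lookahead only, without the line anchors.
const TAG_WITHOUT_SPACE = /<(?:\/?b|\/?em|\/?su[pb]|\/?[ou]l|\/?li|span style="text-decoration: underline;" data-mce-style="text-decoration: underline;"|\/span)>(?! )/g;

// Hypothetical helper: reports each listed tag that is not followed by a space,
// with 1-based line and column numbers.
function findTagsMissingSpace(text) {
  const hits = [];
  text.split("\n").forEach((line, i) => {
    // matchAll requires the g flag and yields one match object per offending tag.
    for (const m of line.matchAll(TAG_WITHOUT_SPACE)) {
      hits.push({ line: i + 1, column: m.index + 1, tag: m[0] });
    }
  });
  return hits;
}

console.log(findTagsMissingSpace("This <b> is </b>not OK\nNeither <b>is </b> this."));
// [ { line: 1, column: 13, tag: '</b>' }, { line: 2, column: 9, tag: '<b>' } ]
```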

【Comments】:

+1, the explanation is great too. I've used regex before, but this breakdown really helps!