Split string into words and rejoin with additional data

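【Question title】: Split string into words and rejoin with additional data
【Posted】: 2019-09-26 00:24:39
【Question】:

I have a method that uses Regex to find a pattern within a text string. It works, but it is not adequate, because it requires the text to appear in the exact order rather than treating the phrase as a set of words.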

    public static string HighlightExceptV1(this string text, string wordsToExclude)
    {
        // Original version
        // wordsToExclude usually consists of a 1, 2 or 3 word term.
        // The text must be in a specific order to work.

        var pattern = $@"(\s*\b{wordsToExclude}\b\s*)";

        // Do something to string...
    }

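This version improves on the previous one in that it allows the words to be matched in any order, but it causes spacing issues in the final output, because the spaces are removed and replaced with pipes.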

    public static string HighlightExceptV2(this string text, string wordsToExclude)
    {
        // This version allows the words to be matched in any order, but it has
        // flaws, in that the natural spacing is removed in some cases.
        var words = wordsToExclude.Replace(' ', '|');

        var pattern = $@"(\s*\b{words}\b\s*)";

        // Example phrase: big blue widget
        // Example output: $@"(\s*\bbig|blue|widget\b\s*)"

        // Do something to string...
    }

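Ideally, the spacing around each word needs to be preserved. The pseudo example below shows what I'm trying to do.

1. Split the original phrase into words.
2. Wrap each word in a regex pattern that preserves the spacing when matched.
3. Rejoin the word patterns to produce the pattern that will be used for matching.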

public static string HighlightExceptV3(this string text, string wordsToExclude)
{
    // The outputted pattern must be dynamic due to unknown
    // words in phrase.

    // Example phrase: big blue widgets

    var words = wordsToExclude.Replace(' ', '|');
    // Example: big|blue|widget

    // The code below isn't complete - merely an example
    // of the required output.

    var wordPattern = $@"\s*\b{word}\b\s*";
    // Example: $@"\s*\bwidget\b\s*"

    var phrasePattern = $"({rejoinedArray})";
    // @"(\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)";

    // Do something to string...
}

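Note: There may well be a better way of dealing with the word-boundary spacing, but I'm not a regex expert.

I'm looking for some help/advice on taking the split array, wrapping each item, and then rejoining it in the neatest way.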
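
The three steps described in the question (split, wrap, rejoin) can be sketched with LINQ. This is only a minimal illustration: `BuildPhrasePattern` is a hypothetical helper name that does not appear in the original post, and the `Regex.Escape` call is an added assumption so that special characters in a word do not alter the pattern.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original post): builds the V3 pattern
// by splitting the phrase, wrapping each word, and rejoining with '|'.
static string BuildPhrasePattern(string wordsToExclude)
{
    var wrapped = wordsToExclude
        // 1. Split the phrase into words, ignoring repeated spaces.
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        // 2. Wrap each (escaped) word in the spacing/boundary pattern.
        .Select(word => $@"\s*\b{Regex.Escape(word)}\b\s*");

    // 3. Rejoin the wrapped words as alternatives inside one capturing group.
    return $"({string.Join("|", wrapped)})";
}

var phrasePattern = BuildPhrasePattern("big blue widget");
Console.WriteLine(phrasePattern);
// (\s*\bbig\b\s*|\s*\bblue\b\s*|\s*\bwidget\b\s*)
```

Because every alternative repeats the same prefix and suffix, the pattern can also be written once around a group, which is what the comments and answers below suggest.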

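【Comments】:

Maybe - if you have words separated by a single space that consist only of word characters - just use var phrasePattern = $@"\s*\b(?:{wordsToExclude.Replace(" ", "|")})\b\s*";
What is the original problem, please? Say you are given a phrase, e.g. "The quick brown fox (not wolf or cat) runs, jumps over a lazy (!) dog.", and words to exclude, e.g. {"wolf", "over", "the"}. What is the desired outcome then?
As I said, I'm not a regex expert, but the code below appears to be just a variation of the V2 example - correct me if I'm wrong.
@JohnOhara The piece of code I suggested in the top comment is a fix for your V2 example.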
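@DmitryBychenko - here is an example for big blue widgets, "big widgets are great but better if blue" - the output will ultimately have HTML appended using the regex "blue widgets great but better big"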

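Tags: c# asp.net regex string

【Solution 1】:

You need to enclose all the alternatives in a non-capturing group, (?:...|...). In addition, to guard against further issues, I suggest replacing the word boundaries with their unambiguous lookaround equivalents, (?<!\w)...(?!\w).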
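
To illustrate why the group matters, here is a small sketch. The sample text "bluebird" is an assumption chosen for illustration and is not taken from the original answer. Without a group, `\b` binds only to the first and last alternatives, so the middle alternative can match inside a longer word.

```csharp
using System;
using System.Text.RegularExpressions;

// Without a group the pattern parses as: \bbig  |  blue  |  widget\b
var ungrouped = @"\bbig|blue|widget\b";
// With a non-capturing group both boundaries apply to every alternative.
var grouped = @"\b(?:big|blue|widget)\b";

var sample = "bluebird"; // assumed sample text, for illustration only

var ungroupedHit = Regex.IsMatch(sample, ungrouped);
var groupedHit = Regex.IsMatch(sample, grouped);

Console.WriteLine(ungroupedHit); // True: the bare "blue" alternative matches inside "bluebird"
Console.WriteLine(groupedHit);   // False: "blue" is not a whole word here
```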

Here is a working C# snippet:

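using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "there are big widgets in this phrase blue widgets too";
var words = "big blue widgets";
// The outer capturing group makes Regex.Split keep the matched words in its output.
var pattern = $@"(\s*(?<!\w)(?:{string.Join("|", words.Split(' ').Select(Regex.Escape))})(?!\w)\s*)";
// Even indices hold the text between matches; only those chunks are wrapped in <b>.
var result = string.Concat(Regex.Split(text, pattern, RegexOptions.IgnoreCase).Select((str, index) =>
    index % 2 == 0 && !string.IsNullOrWhiteSpace(str) ? $"<b>{str}</b>" : str));
Console.WriteLine(result);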

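Notes

words.Split(' ').Select(Regex.Escape) - splits the words text on spaces and regex-escapes each item
string.Join("|",...) - rebuilds the string, inserting | between the items
(?<!\w) - a negative lookbehind that matches a position not immediately preceded by a word char; (?!\w) - a negative lookahead that matches a position not immediately followed by a word char.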
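
The difference between `\b` and the lookarounds shows up when a term starts or ends with a non-word character. The sketch below uses "c#" as an assumed example term (it is not from the original answer): `\b` requires a word character on one side of the position, so it cannot match between `#` and a space, while `(?!\w)` only requires that no word character follows.

```csharp
using System;
using System.Text.RegularExpressions;

var text = "I write c# daily";        // assumed sample text
var term = Regex.Escape("c#");        // escape the term before embedding it

var withBoundaries = $@"\b{term}\b";
var withLookarounds = $@"(?<!\w){term}(?!\w)";

var boundaryHit = Regex.IsMatch(text, withBoundaries);
var lookaroundHit = Regex.IsMatch(text, withLookarounds);

Console.WriteLine(boundaryHit);   // False: there is no word boundary between '#' and ' '
Console.WriteLine(lookaroundHit); // True: no word char before 'c' or after '#'
```

For terms made up only of word characters the two forms behave the same, so the lookaround version is a safe default when the terms come from user input.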

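【Comments】:

【Solution 2】:

I suggest implementing an FSM (Finite State Machine) with 2 states (inside and outside a selection) and Regex.Replace: for each word we either keep it as it is (word) or replace it with <b>word, word</b> or <b>word</b>.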

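private static string MyModify(string text, string wordsToExclude) {
  // Case-insensitive set of the words that must stay unhighlighted
  HashSet<string> exclude = new HashSet<string>(
    wordsToExclude.Split(' '), StringComparer.OrdinalIgnoreCase);

  // FSM state: true while a <b> selection is open
  bool inSelection = false;

  string result = Regex.Replace(text, @"[\w']+", match => {
      var next = match.NextMatch();

      if (inSelection) {
        // Close the selection when the next word is an excluded one
        if (next.Success && exclude.Contains(next.Value)) {
          inSelection = false;

          return match.Value + "</b>";
        }
        else
          return match.Value;
      }
      else {
        // Excluded words are left as they are
        if (exclude.Contains(match.Value))
          return match.Value;
        // A single word between two excluded words: open and close at once
        else if (next.Success && exclude.Contains(next.Value))
          return "<b>" + match.Value + "</b>";
        else {
          // Open a selection that continues over the following words
          inSelection = true;
          return "<b>" + match.Value;
        }
      }
    });

  // Close a selection that is still open at the end of the text
  if (inSelection)
    result += "</b>";

  return result;
}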

Demo:

string wordsToExclude = "big widgets blue if";

string[] tests = new string[] {
  "widgets for big blue",
  "big widgets are great but better if blue",
  "blue",
  "great but expensive",
  "big and small, blue and green",
};

string report = string.Join(Environment.NewLine, tests
  .Select(test => $"{test,-40} -> {MyModify(test, wordsToExclude)}"));

Console.Write(report);

Outcome:

widgets for big blue                     -> widgets <b>for</b> big blue
big widgets are great but better if blue -> big widgets <b>are great but better</b> if blue
blue                                     -> blue
great but expensive                      -> <b>great but expensive</b>
big and small, blue and green            -> big <b>and small</b>, blue <b>and green</b>

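【Comments】:

Thanks, Dmitry, for the alternative solution.
Dmitry, this is a great solution (and I've learned something new today); it took me some time to understand it. I've done a lot of testing but noticed it fails in one scenario - care to take a look? ideone.com/RIQW2x
Dmitry, you've highlighted the error. "widgets for big blue" - the part contains one of our reserved words - "big". It should output "widgets for big blue"
@John Ohara: Nice catch! <b>SingleWord</b> was not covered. I've edited the answer
Dmitry, the code has passed all the previous failures - so it looks good now. Thanks again.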