Extract BBCode using regex: answers

【Question title】: Extract BBCode using regex
【Posted】: 2021-10-13 01:17:31
【Question】:

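I'm trying to use a regex to extract a subset of BBCode ([U], [B], [I]) from a string. I've found plenty of questions about simply parsing or replacing BBCode in a string, but I want to extract all the parts: both the plain-text parts and the parts enclosed in tags.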

I came up with the following regex: (.*?)(\[[UBI]\](.*?)\[\/[UBI]\])(.*?)

It almost works, except that it misses any "plain text" at the end of the string. For example:

test1 [B]bold text[/B] test2 [U]underlined[/U] test3

This results in two matches:

Match 1:
  group1: test1
  group2: [B]bold text[/B]
  group3: bold text

Match 2:
  group1: test2
  group2: [U]underlined[/U]
  group3: underlined

How can I make it also match the trailing test3, either as a new match or as group4 (which was my intention)?

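【Comments】:

This seems to work for me. Maybe it depends on the settings of the environment you're running it in (which you didn't mention)...
@MBaas Hmm, where did you try it? I tried both regexr.com and Dart code, and no matter which modifiers I use, I can't get the last part (test3) included in a match.
I tried it in RegexBuddy (sorry, it requires a download; it's a Windows app).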

Tags: regex bbcode

【Solution 1】:

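The problem is the .*? at the end of the regex pattern. It never consumes any text, because a lazy quantifier is always skipped first: the engine tries the subsequent pattern before letting the lazy part expand. Here nothing follows the final .*?, so returning a valid match without consuming anything for it is acceptable, and group 4 is always empty.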
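To make the effect visible, here is a minimal sketch that runs the pattern from the question and collects group 4 of every match. The helper name trailingGroups is made up for this illustration and is not part of the original post.

```dart
// Illustrative helper (hypothetical name): returns group 4 of every match
// of the pattern from the question.
List<String?> trailingGroups(String text) {
  final rx = RegExp(r'(.*?)(\[[UBI]\](.*?)\[\/[UBI]\])(.*?)');
  return [for (final m in rx.allMatches(text)) m.group(4)];
}

void main() {
  const text = 'test1 [B]bold text[/B] test2 [U]underlined[/U] test3';
  // Each entry is an empty string: the trailing lazy group is allowed to
  // match nothing, so the engine stops right after the closing tag.
  print(trailingGroups(text));
}
```

Because each match ends at its closing tag, the text after the last tag is never part of any match.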

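One possible solution is to split the string with a regex while keeping the captured substrings in the output. Unfortunately, Dart doesn't support this directly, so I adapted this solution to fit your case: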

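extension RegExpExtension on RegExp {
  /// Splits [input] on this pattern, keeping the matched parts.
  ///
  /// Each inner list holds the plain text before a match, then the whole
  /// match (if [includematch] is true), then capture groups 1..[grpnum].
  /// The final inner list holds only the text after the last match.
  List<List<String?>> allMatchesWithSep(String input, int grpnum, bool includematch, [int start = 0]) {
    var result = List<List<String?>>.empty(growable: true);
    for (var match in allMatches(input, start)) {
      var res = List<String?>.empty(growable: true);
      // Plain text between the previous match (or the start) and this match.
      res.add(input.substring(start, match.start));
      if (includematch) {
        res.add(match.group(0));
      }
      for (int i = 0; i < grpnum; i++) {
        res.add(match.group(i + 1));
      }
      start = match.end;
      result.add(res);
    }
    // Trailing plain text after the last match.
    result.add([input.substring(start)]);
    return result;
  }
}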

extension StringExtension on String {
  List<List<String?>> splitWithDelim(RegExp pattern, int grpnum, bool includematch) =>
      pattern.allMatchesWithSep(this, grpnum, includematch);
}

void main() {
  String text = "test1 [B]bold text[/B] test2 [U]underlined[/U] test3";
  RegExp rx = RegExp(r"\[[UBI]\]([\w\W]*?)\[\/[UBI]\]");
  print(text.splitWithDelim(rx, 1, true));
}

Output:

[[test1 , [B]bold text[/B], bold text], [ test2 , [U]underlined[/U], underlined], [ test3]]

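Note that the pattern now contains only one capturing group; that count is the grpnum value (the number of groups). Since you need the whole match in the result, includematch is set to true.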
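If a flat list of pieces is easier to consume than nested lists, the same walk over allMatches can be sketched as follows. This is a variation, not the answer's code: the Segment class and toSegments function are hypothetical names, and the \1 backreference (which requires the closing tag to use the same letter as the opening tag) is an addition the original pattern does not have.

```dart
// Hypothetical type for this sketch: tag is 'B', 'U' or 'I' for tagged
// text and null for plain text.
class Segment {
  final String? tag;
  final String text;
  Segment(this.tag, this.text);

  @override
  String toString() => tag == null ? 'plain("$text")' : '$tag("$text")';
}

List<Segment> toSegments(String input) {
  // Group 1 is the tag letter, group 2 the enclosed text; \1 forces the
  // closing tag to repeat the opening letter.
  final rx = RegExp(r'\[([UBI])\]([\w\W]*?)\[\/\1\]');
  final segments = <Segment>[];
  var last = 0; // end of the previous match
  for (final m in rx.allMatches(input)) {
    if (m.start > last) {
      segments.add(Segment(null, input.substring(last, m.start)));
    }
    segments.add(Segment(m.group(1), m.group(2)!));
    last = m.end;
  }
  // Trailing plain text after the last tag, if any.
  if (last < input.length) {
    segments.add(Segment(null, input.substring(last)));
  }
  return segments;
}

void main() {
  print(toSegments('test1 [B]bold text[/B] test2 [U]underlined[/U] test3'));
}
```

Unlike allMatchesWithSep, this sketch skips empty plain-text pieces, so adjacent tags do not produce empty entries.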

[\w\W] matches any character, including line breaks, which . does not match by default.
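A small sketch of the difference, using a tag whose content spans two lines. The helper name matchesAcrossLines is made up for this example. The last pattern uses the dotAll flag of Dart's RegExp constructor, which is another way to let . match line breaks and is not something the answer above relies on.

```dart
// Illustrative helper (hypothetical name): checks whether a pattern can
// match a tag whose content contains a line break.
bool matchesAcrossLines(RegExp rx) => rx.hasMatch('[B]bold\ntext[/B]');

void main() {
  final dot = RegExp(r'\[[UBI]\](.*?)\[\/[UBI]\]');
  final anyChar = RegExp(r'\[[UBI]\]([\w\W]*?)\[\/[UBI]\]');
  final dotAll = RegExp(r'\[[UBI]\](.*?)\[\/[UBI]\]', dotAll: true);

  print(matchesAcrossLines(dot)); // false: . stops at the line break
  print(matchesAcrossLines(anyChar)); // true
  print(matchesAcrossLines(dotAll)); // true
}
```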

【Comments】: