Remove parts of Regex.Match string

【Question title】: Remove parts of Regex.Match string
【Posted】: 2025-11-25 13:10:02
【Question】:

I have HTML tables in a string. Most of this HTML comes from FrontPage, so it is mostly badly formed. Here is a quick sample of what it looks like.

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>

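From what I understand, FrontPage automatically adds a <p> to every new cell.

I want to remove the <p> tags that are inside the tables but keep the ones outside the tables. I have tried two methods so far:

First method

The first method was to use a single RegEx to capture every <p> tag inside the tables, and then use Regex.Replace() to remove them. However, I never managed to get the right RegEx for this. (I know parsing HTML with RegEx is bad. I thought the data was simple enough to apply a RegEx to it.)

I can get everything inside each table quite easily with this regex: <table.*?>(.*?)</table>

Then I wanted to grab only the <p> tags, so I wrote this: (?<=<table.*?>)(<p>)(?=</table>). This does not match anything. (Apparently .NET allows quantifiers in its lookbehinds. At least that was my impression when using http://regexhero.net/tester/)

How can I modify this RegEx so that it captures only what I need?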
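A likely reason the pattern above matches nothing in the sample: both lookarounds are pinned directly against the <p>. The lookbehind needs a <table...> match that ends immediately before it, and the lookahead needs </table> immediately after it, and neither is the case here. The following is a minimal sketch of a single-regex alternative, assuming tables are never nested and that only the bare <p> form has to go. It accepts a <p> only if a </table> is reached before any new <table. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SingleRegexSketch
{
    // <p> followed, further on, by </table> with no new <table in between,
    // which (assuming tables are not nested) means the <p> is inside a table.
    static readonly Regex ParagraphInTable = new Regex(
        @"<p>(?=(?:(?!<table\b).)*?</table>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Hypothetical helper: removes every <p> that sits inside a table.
    public static string RemoveParagraphsInTables(string html)
    {
        return ParagraphInTable.Replace(html, "");
    }

    static void Main()
    {
        string html =
            "<p>before</p><table><tr><td>\n<p>cell</td></tr></table><p>after</p>";
        Console.WriteLine(RemoveParagraphsInTables(html));
    }
}
```

The lookahead rescans the rest of the table for every <p>, so on large pages the two-step approaches in the answers are probably cheaper.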

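Second method

The second method was to capture only the table contents into a string and then use String.Replace() to remove the <p> tags. I am using the following code to capture the matches: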

MatchCollection tablematch = Regex.Matches(htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

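htmlSource is a string containing the entire HTML page, and this variable is what will be sent back to the client after processing. I only want to remove from htmlSource what I need to remove.

How can I use the MatchCollection to remove the <p> tags and then put the updated tables back into htmlSource?

Thanks

【Comments】:

It is generally considered bad practice to try to parse HTML with regex, but HTML generated by FrontPage? That is a whole new level...
@JamesThorpe I guess an HTML parser would not be able to read invalid HTML like this, so maybe there is no other option.
@Alex A parser is more likely to cope with it than a regex is... Also, I do not see anything particularly invalid in what the OP posted?
@JamesThorpe I agree that a parser is the best option in most cases, but an ordinary parser would just throw an exception in this case.
You can use a MatchCollection to find all the inner <p> tags, but you probably cannot replace them that way.

Tags: c# regex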

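【Solution 1】:

This answer is based on the second suggested method. Change the regex so that it matches each whole table, from the opening <table to the closing table>:

<table.*?table>

Then call Regex.Replace with a MatchEvaluator to make the replacement inside each match:

// Singleline lets . match newlines, so one match can span a multi-line table.
Regex myRegex = new Regex(@"<table.*?table>", RegexOptions.Singleline);
// The evaluator runs once per table and removes only the literal, lowercase "<p>".
string replaced = myRegex.Replace(htmlSource, m=> m.Value.Replace("<p>",""));
Console.WriteLine(replaced);

Output for the input from the question: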

<b>Table 1</b>
    <table class='class1'>
    <tr>
    <td>
        Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
    </table>
<p><b>Table 2</b></p>
    <table class='class2'>
    <tr>
        <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
    </table>
<p> Some text is here</p>
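The question also asks how to do this with the MatchCollection it already has. A minimal sketch of that route, assuming the question's own pattern and using made-up class and method names: copy htmlSource into a StringBuilder and patch each captured table body in place, working from the last match to the first.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class MatchCollectionSketch
{
    // Hypothetical helper: rewrites the table bodies found by Regex.Matches.
    public static string RemoveParagraphsInTables(string htmlSource)
    {
        MatchCollection tables = Regex.Matches(
            htmlSource, @"<table.*?>(.*?)</table>", RegexOptions.Singleline);

        var result = new StringBuilder(htmlSource);

        // Go from the last match to the first: an edit changes the length of
        // the text after it, so earlier match indexes stay valid this way.
        for (int i = tables.Count - 1; i >= 0; i--)
        {
            Group inner = tables[i].Groups[1];   // text between <table...> and </table>
            string cleaned = inner.Value.Replace("<p>", "");
            result.Remove(inner.Index, inner.Length);
            result.Insert(inner.Index, cleaned);
        }
        return result.ToString();
    }

    static void Main()
    {
        string html = "<p>t</p><table class='c'><tr><td>\n<p>x</td></tr></table><p>y</p>";
        Console.WriteLine(RemoveParagraphsInTables(html));
    }
}
```

The MatchEvaluator version above does the same bookkeeping internally, which is why it is shorter.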

【Comments】:

【Solution 2】:

I think this can be done with a delegate (callback).

string html = @"
<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      <p>Procedure Name</td>
    <td>
        <p>Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        <p>Procedure Name</td>
        <td>
        <p>Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
";

Regex RxTable = new Regex( @"(?s)(<table[^>]*>)(.+?)(</table\s*>)" );
Regex RxP = new Regex( @"<p>" );

string htmlNew = RxTable.Replace( 
    html,
    delegate(Match match)
    {
       return match.Groups[1].Value + RxP.Replace(match.Groups[2].Value, "") + match.Groups[3].Value;
    }
);
Console.WriteLine( htmlNew );

Output:

<b>Table 1</b>
  <table class='class1'>
  <tr>
    <td>
      Procedure Name</td>
    <td>
        Procedure</td>
    </tr>
  </table>
<p><b>Table 2</b></p>
  <table class='class2'>
    <tr>
      <td>
        Procedure Name</td>
        <td>
        Procedure</td>
    </tr>
  </table>
<p> Some text is here</p>
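Because the inner replacement here is its own regex, it is easy to widen. The sketch below is a variation under assumptions the answer does not state: that <p> tags may carry attributes or be written in upper case, and that closing </p> tags inside tables should go as well. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

static class ParagraphTagSketch
{
    // Group 1: opening table tag, group 2: table body, group 3: closing tag.
    static readonly Regex RxTable = new Regex(
        @"(<table\b[^>]*>)(.+?)(</table\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Opening or closing p tag with optional attributes.
    // The \b after p keeps tags such as <pre> or <param> untouched.
    static readonly Regex RxP = new Regex(@"</?p\b[^>]*>", RegexOptions.IgnoreCase);

    // Hypothetical helper: strips p tags from table bodies only.
    public static string StripParagraphTags(string html)
    {
        return RxTable.Replace(html, m =>
            m.Groups[1].Value + RxP.Replace(m.Groups[2].Value, "") + m.Groups[3].Value);
    }

    static void Main()
    {
        string html =
            "<table><tr><td><P class=\"x\">a</p><pre>b</pre></td></tr></table><p>c</p>";
        Console.WriteLine(StripParagraphTags(html));
    }
}
```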

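【Comments】:

【Solution 3】:

In general, .NET regular expressions let you work with nested structures (through balancing groups). It is very ugly and you should avoid it, but if you have no other choice you can use it.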

static void Main()
{
    string s = 
@"A()
{
    for()
    {
    }
    do
    {
    }
}
B()
{
    for()
    {
    }   
}
C()
{
    for()
    {
        for()
        {
        }
    }   
}";

    var r = new Regex(@"  
                      {                       
                          (                 
                              [^{}]           # everything except braces { }   
                              |
                              (?<open>  { )   # if { then push
                              |
                              (?<-open> } )   # if } then pop
                          )+
                          (?(open)(?!))       # true if stack is empty
                      }                                                                  

                    ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    int counter = 0;

    foreach (Match m in r.Matches(s))
        Console.WriteLine("Outer block #{0}\r\n{1}", ++counter, m.Value);

    Console.Read();
}

The regex here "knows" where a block starts and where it ends, so you can use that information to remove a <p> tag if it has no matching closing tag.
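Applied to the question, the same balancing-group idea could be used to find each outermost table even when tables are nested, which the question does not say is the case. The following is only a sketch under that assumption, with illustrative names; the <p> removal itself is the plain string replace from Solution 1.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedTableSketch
{
    static readonly Regex OuterTable = new Regex(@"
        <table\b[^>]*>                    # opening tag of the outermost table
        (?>
            (?<open> <table\b[^>]*> )     # nested opening tag: push
          | (?<-open> </table\s*> )       # closing tag of a nested table: pop
          | (?! </?table\b ) .            # any character that does not start a table tag
        )*
        (?(open)(?!))                     # fail if a nested table is still open
        </table\s*>                       # closing tag of the outermost table
        ",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline |
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: removes <p> inside outermost tables, nested ones included.
    public static string RemoveParagraphsInTables(string html)
    {
        return OuterTable.Replace(html, m => m.Value.Replace("<p>", ""));
    }

    static void Main()
    {
        string html =
            "<table><tr><td><p>a<table><tr><td><p>b</td></tr></table>" +
            "<p>d</td></tr></table><p>c</p>";
        Console.WriteLine(RemoveParagraphsInTables(html));
    }
}
```

With a lazy pattern such as <table.*?table> the match would stop at the first closing tag, so the <p> in front of "d" would be left behind; the balanced version keeps going until the outer table closes.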

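【Comments】:

My main problem is not handling <p> tags that lack a matching closing tag, because I simply want to remove them even when they do have one. My problem is that I cannot match, or remove, only the tags that are inside the tables, whether or not they have a matching closing tag.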