Finding content within HTML, and replacing it

【Question title】: Finding content within HTML, and replacing it
【Posted】: 2018-07-05 03:09:38
【Question】:

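I am currently exporting content from one CMS and importing it into another.

The export is in place: I export all content from the old CMS to XML files, preserving the document structure and so on. The import is in place too, mapping to the new PageTypes, mapping text fields, and so on. I have also exported all media from the old CMS and imported it into the new one.

My only remaining concern is handling internal links and media item links in each page's RichText field.

Each page consists of a Header, some general information, and a RichTextField that holds the page content as HTML. This field can contain links to other pages within the same site (internal links) as well as links to media items.

My question is: how can I find these and map them to my new structure?

All internal links look like this: <a href="/mycms/~/link.aspx?_id=D9423CEFED254610A5DC6B096A297E17&amp;_z=z">...</a> (some links may carry additional attributes, such as style="..", class="..", etc.). The ID is a reference to the ID in the old CMS and is always 32 characters long.

Media items (images) can look like this: <img src="/mycms/~/media/B1FB91AC357347BD84913D56B8791D03.ashx" alt="" width="690" height="202" />. Here, too, the ID is always 32 characters long.

During the import I generated a JSON file that maps every mediaId from the old CMS to its new ID in the new CMS. It looks like this: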

{
    "{0CFBBD0A-9156-4AD9-8A8A-7D30B2D7213B}":1095,
    "{BE9BEAAA-F04D-42DA-B52A-44B4B31A389E}":1096,
    etc.
}

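Note that the old CMS IDs in this file are formatted differently from the IDs used in links and media paths. Strip the curly braces and dashes and they match.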
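
The normalization described above (drop the braces and dashes) could be sketched as follows. This is only an illustration under assumptions: the class name `IdMap` and its methods are hypothetical, the file is assumed to be valid JSON with integer values as in the sample, and `System.Text.Json` is assumed to be available.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of either CMS.
public static class IdMap
{
    // "{0CFBBD0A-9156-4AD9-8A8A-7D30B2D7213B}" -> "0CFBBD0A91564AD98A8A7D30B2D7213B".
    // Guid.Parse accepts the braced form; format "N" is 32 hex digits without dashes.
    public static string Normalize(string oldId) =>
        Guid.Parse(oldId).ToString("N").ToUpperInvariant();

    // Parses the mapping JSON and re-keys it by the normalized 32-character ID.
    public static Dictionary<string, int> Load(string json)
    {
        var raw = JsonSerializer.Deserialize<Dictionary<string, int>>(json)
                  ?? new Dictionary<string, int>();

        // Case-insensitive keys, in case the HTML uses lowercase hex.
        var map = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var pair in raw)
            map[Normalize(pair.Key)] = pair.Value;
        return map;
    }
}
```

With a dictionary keyed this way, an ID taken from an href or src value can be looked up directly, without reformatting it first.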

What is the best way to approach this? I am guessing RegEx is the way to go, but what would that look like?

Thanks :)

【Comments】:

Don't use regex to parse html.

Tags: c# asp.net regex sitecore

【Answer 1】:

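Your best bet is to use something like HtmlAgilityPack. Plain regex is usually too crude to parse HTML reliably. It is not impossible, but it is harder than using HtmlAgilityPack.

The post Eric linked in his comment is one of the most infamous in StackOverflow history, and several replies there explain in detail why parsing HTML with regex is not recommended. The TL;DR from my own experience: HTML pages are usually full of small "errors". For example, you will often find <img> tags that are not closed properly (as in <img />). Matching and replacing deterministically is also fairly hard.

So try to use the right tool for the job, and in this case the right tool is HtmlAgilityPack.

As for using HtmlAgilityPack, they have good documentation. In your case you probably want to look at the Replace Child functionality. To reproduce the example from their documentation, here is the test HTML used: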

<body>
    <h1>This is <b>bold</b> heading</h1>
    <p>This is <u>underlined</u> paragraph</p>
</body>

To operate on this and replace the <h1> node, you can do the following:

using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html); // where html = @"content previously mentioned"

var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");
// With the test HTML above, ChildNodes[0] is the whitespace text node
// that follows <body>, so ChildNodes[1] is the <h1> element.
HtmlNode oldChild = htmlBody.ChildNodes[1];
HtmlNode newChild = HtmlNode.CreateNode("<h2> This is h2 new child heading</h2>");

htmlBody.ReplaceChild(newChild, oldChild);
// now htmlBody has <h2> node instead of old <h1>

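In your case you will probably want SelectNodes rather than SelectSingleNode, with an XPath expression that targets the elements you want to replace. Once you have those elements in a list, iterate over them and replace content based on your conditions.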
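
As a rough sketch of that approach (not the only way to do it), the following selects the matching <a> and <img> nodes and rewrites their href/src attributes. The class name, the `/new-link/` and `/new-media/` URL shapes, and a dictionary keyed by the uppercase 32-character old ID are all assumptions made for illustration; the new CMS will have its own URL format.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper built on HtmlAgilityPack.
public static class RichTextRewriter
{
    // Old IDs are 32 hex characters, as described in the question.
    private static readonly Regex OldId = new Regex("[0-9A-Fa-f]{32}");

    // Rewrites one attribute on every node matched by the XPath expression.
    // buildUrl is a placeholder for however the new CMS builds its URLs.
    private static void Rewrite(HtmlDocument doc, string xpath, string attribute,
        IReadOnlyDictionary<string, int> map, Func<int, string> buildUrl)
    {
        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            var match = OldId.Match(node.GetAttributeValue(attribute, ""));
            if (match.Success &&
                map.TryGetValue(match.Value.ToUpperInvariant(), out var newId))
            {
                node.SetAttributeValue(attribute, buildUrl(newId));
            }
        }
    }

    public static string Process(string html, IReadOnlyDictionary<string, int> map)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumed target URL formats; replace with the real ones.
        Rewrite(doc, "//a[contains(@href, '/link.aspx?_id=')]", "href",
                map, id => "/new-link/" + id);
        Rewrite(doc, "//img[contains(@src, '/~/media/')]", "src",
                map, id => "/new-media/" + id);

        return doc.DocumentNode.OuterHtml;
    }
}
```

Links whose ID is missing from the dictionary are left untouched, which makes it easy to log and review them after the import.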

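One thing to keep in mind: because your IDs are quite long (32 characters), you can probably match them with a plain string search. So if you are targeting the IDs rather than particular HTML elements, you do not even need HtmlAgilityPack or regex; a simple String.Replace("OLDUID", "NEWUID") will do.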
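
A minimal sketch of that plain-string approach, assuming a dictionary keyed by the old 32-character ID exactly as it appears in the HTML (the class and method names are hypothetical):

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: swaps old IDs for new ones with String.Replace.
public static class PlainIdReplace
{
    public static string ReplaceIds(string html, IReadOnlyDictionary<string, int> map)
    {
        // String.Replace is case-sensitive, so the keys must use the same
        // casing as the IDs in the HTML.
        foreach (var pair in map)
        {
            html = html.Replace(
                pair.Key,
                pair.Value.ToString(CultureInfo.InvariantCulture));
        }
        return html;
    }
}
```

This only swaps the ID and leaves the rest of the old URL (the /mycms/~/ prefix, the .ashx suffix) in place, and it scans the whole string once per dictionary entry, so it fits best when the URL shape stays the same and the map is small.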

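【Comments】:

I strongly agree with kape123 about using HtmlAgilityPack rather than brute-force regex here. I just finished a migration project for a client in a similar situation, importing an old CMS into Sitecore, and I had to do a lot of HTML cleanup and manipulation during the import, including setting up internal links and media links correctly. I don't think I could have done it without HtmlAgilityPack.
Thanks for the suggestion, @kape123. Any ideas on how I would use AgilityPack to find the links and images that match my pattern, along with the 32-character ID?
@brother I have added more information to my answer.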

【Answer 2】:

If you are mixing non-HTML with HTML, it is better to use regex.
Here is one way to do the replacement.

Links:

(?i)(<a)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(['"])/mycms/~/link\.aspx\?_id=)([a-f0-9]{32})(&amp;_z=z\3(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2 + key{$4} + $5
where key{$4} is the new link ID value from the dictionary.

https://regex101.com/r/xRf1xN/1

 # https://regex101.com/r/ieEBj8/1

 (?i)                              # Case insensitive modifier
 ( < a )                           # (1), The a tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the ID num
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?

           \s href \s* = \s*                 # href attribute
           ( ['"] )                          # (3), Quote
           /mycms/~/link\.aspx\?_id=         # Prefix link static text
      )                                 # (2 end)

      ( [a-f0-9]{32} )                  # (4), hex link ID

      (                                 # (5 start), All past the ID num
           &amp;_z=z                         # Postfix link static text
           \3                                # End quote

                                             # The remainder of the tag parts
           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (5 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >
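
In C#, the "$1$2 + key{$4} + $5" replacement can be expressed with a MatchEvaluator, since the new ID has to be looked up per match. The sketch below is one possible way to wire it up; the class name, the dictionary keyed by uppercase old ID, and the two-second match timeout are assumptions, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the link pattern shown above.
public static class LinkIdReplacer
{
    // Same pattern as above, as C# verbatim strings (double quotes are doubled).
    private const string Pattern =
        @"(?i)(<a)(?=" +
        @"((?:[^>""']|""[^""]*""|'[^']*')*?\shref\s*=\s*(['""])/mycms/~/link\.aspx\?_id=)" +
        @"([a-f0-9]{32})" +
        @"(&amp;_z=z\3(?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+>)" +
        @")\s+(?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+>";

    // The timeout is a precaution against heavy backtracking on malformed HTML.
    private static readonly Regex LinkRegex =
        new Regex(Pattern, RegexOptions.None, TimeSpan.FromSeconds(2));

    public static string Replace(string html, IReadOnlyDictionary<string, int> map)
    {
        return LinkRegex.Replace(html, m =>
        {
            // Group 4 is the old 32-character ID; keep the tag as-is if unmapped.
            if (!map.TryGetValue(m.Groups[4].Value.ToUpperInvariant(), out var newId))
                return m.Value;

            // $1$2 + key{$4} + $5
            return m.Groups[1].Value
                 + m.Groups[2].Value
                 + newId.ToString(CultureInfo.InvariantCulture)
                 + m.Groups[5].Value;
        });
    }
}
```

Only the ID inside the href changes here; the surrounding /mycms/~/link.aspx?_id=...&amp;_z=z text and any other attributes are carried over from groups 2 and 5.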

Media:

(?i)(<img)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\ssrc\s*=\s*(['"])/mycms/~/media/)([a-f0-9]{32})(\.ashx\3(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2 + key{$4} + $5
where key{$4} is the new media ID value from the dictionary.

https://regex101.com/r/pwyjoK/1

 # https://regex101.com/r/ieEBj8/1

 (?i)                              # Case insensitive modifier
 ( < img )                         # (1), The img tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the ID num
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?

           \s src \s* = \s*                  # src attribute
           ( ['"] )                          # (3), Quote
           /mycms/~/media/                   # Prefix media static text
      )                                 # (2 end)

      ( [a-f0-9]{32} )                  # (4), hex media ID

      (                                 # (5 start), All past the ID num
           \.ashx                            # Postfix media static text
           \3                                # End quote

                                             # The remainder of the tag parts
           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (5 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

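If I wanted to a) extract the ID from the link/src tag and b) replace the entire href=".." or src=".." value (rather than just the ID part), what would that look like in RegEx?

To do that, just rearrange the capture groups.

Links: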

(?i)(<a)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s)(href\s*=\s*(['"])/mycms/~/link\.aspx\?_id=([a-f0-9]{32})&amp;_z=z\4)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2href='NEWID:key{$5}'$6
where key{$5} is the new link ID value from the dictionary.

https://regex101.com/r/FxpJVl/1

 (?i)                              # Case insensitive modifier
 ( < a )                           # (1), The a tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the href attribute
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s 
      )                                 # (2 end)
      (                                 # (3 start), href attribute
           href \s* = \s* 
           ( ['"] )                          # (4), Quote
           /mycms/~/link\.aspx\?_id=         # Prefix link static text


           ( [a-f0-9]{32} )                  # (5), hex link ID


           &amp;_z=z                         # Postfix link static text
           \4                                # End quote
      )                                 # (3 end)
      (                                 # (6 start), remainder of the tag parts

           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (6 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >

Media:

(?i)(<img)(?=((?:[^>"']|"[^"]*"|'[^']*')*?\s)(src\s*=\s*(['"])/mycms/~/media/([a-f0-9]{32})\.ashx\4)((?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

Replace with $1$2src='NEWID:key{$5}'$6
where key{$5} is the new media ID value from the dictionary.

https://regex101.com/r/EqKYjM/1

 (?i)                              # Case insensitive modifier
 ( < img )                         # (1), The img tag

 (?=                               # Assertion (a pseudo atomic group)

      (                                 # (2 start), Up to the src attribute
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           \s 
      )                                 # (2 end)
      (                                 # (3 start), src attribute
           src \s* = \s* 
           ( ['"] )                          # (4), Quote
           /mycms/~/media/                   # Prefix media static text

           ( [a-f0-9]{32} )                  # (5), hex media ID

           \.ashx                            # Postfix media static text
           \4                                # End quote
      )                                 # (3 end)
      (                                 # (6 start), remainder of the tag parts

           (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
           > 
      )                                 # (6 end)

 )
                                   # All the parts have already been found via assertion
                                   # Just match a normal tag closure to advance the position
 \s+                               
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 >
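
For the rearranged media pattern, a C# sketch might cover both parts of the follow-up question: extracting the old IDs (group 5) and replacing the whole src attribute (group 3). The class name, the `/new-media/` URL shape, and the match timeout are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the rearranged media pattern shown above.
public static class MediaSrcReplacer
{
    // Same pattern as above, as C# verbatim strings (double quotes are doubled).
    private const string Pattern =
        @"(?i)(<img)(?=" +
        @"((?:[^>""']|""[^""]*""|'[^']*')*?\s)" +
        @"(src\s*=\s*(['""])/mycms/~/media/([a-f0-9]{32})\.ashx\4)" +
        @"((?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+>)" +
        @")\s+(?:""[\S\s]*?""|'[\S\s]*?'|[^>]*?)+>";

    private static readonly Regex MediaRegex =
        new Regex(Pattern, RegexOptions.None, TimeSpan.FromSeconds(2));

    // a) Extract every old media ID found in the HTML.
    public static List<string> ExtractIds(string html) =>
        MediaRegex.Matches(html).Cast<Match>()
                  .Select(m => m.Groups[5].Value)
                  .ToList();

    // b) Replace the whole src attribute: $1$2 + new src + $6.
    public static string Replace(string html, IReadOnlyDictionary<string, int> map)
    {
        return MediaRegex.Replace(html, m =>
        {
            if (!map.TryGetValue(m.Groups[5].Value.ToUpperInvariant(), out var newId))
                return m.Value; // unmapped ID: leave the tag unchanged

            // Assumed URL shape for the new CMS.
            var src = "src='/new-media/"
                    + newId.ToString(CultureInfo.InvariantCulture) + "'";
            return m.Groups[1].Value + m.Groups[2].Value + src + m.Groups[6].Value;
        });
    }
}
```

The link variant works the same way with the rearranged link pattern and href in place of src.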

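【Comments】:

Looks good, thanks! If I wanted to a) extract the ID from the link/src tag and b) replace the entire href=".." or src=".." value (rather than just the ID part), what would that look like in RegEx?
@brother - added the approach (at the end).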