Regex URL Replace, ignore Images and existing Links: answers

【Question title】: Regex URL Replace, ignore Images and existing Links
【Posted】: 2012-02-21 08:19:19
【Question】:

I have a regular expression that works very well for turning the URLs in a string into clickable links.

string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&amp;~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

Now, how can I make it ignore links that are already clickable, as well as images?

That is, it should ignore strings like the following:

<a href="http://www.someaddress.com">Some Text</a>

<img src="http://www.someaddress.com/someimage.jpg" />

Example:

The website www.google.com, once again <a href="http://www.google.com">www.google.com</a>, the logo <img src="http://www.google.com/images/logo.gif" />

Result:

The website <a href="http://www.google.com">www.google.com</a>, once again <a href="http://www.google.com">www.google.com</a>, the logo <img src="http://www.google.com/images/logo.gif" />

Full HTML parser code:

string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&amp;~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase);

text = r.Replace(text, "<a href=\"$1\" title=\"Click to open in a new window or tab\" target=\"&#95;blank\" rel=\"nofollow\">$1</a>").Replace("href=\"www", "href=\"http://www");

return text;

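【Comments on the question】:

Nice, but hard to read and hard to maintain; this is easy to do with an HtmlParser...
Are you trying to parse HTML with a regular expression?
I have already answered this question here
Possible duplicate of Regex string issue in making plain text urls clickable
Yes, I am trying to parse HTML. I have just updated the question and pasted all of the code.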

Tags: c# regex

【Solution 1】:

First, since no one else has yet, I'll post this as the obligatory link: RegEx match open tags except XHTML self-contained tags

How about using a negative lookahead/lookbehind for " like this:

string regex = @"(?<!"")((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&amp;~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])(?!"")";
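To see the lookbehind in isolation, here is a minimal sketch with a deliberately simplified URL pattern. The `LookbehindDemo` class, the `Linkify` helper, and the shortened pattern are assumptions made for illustration; they are not the pattern from the question. The sketch only accepts URLs that start with a scheme, because a `www.` that sits inside `href="http://www...` is preceded by `/` rather than `"`, so the lookbehind would not reject it. That is probably why the variant in the comments below drops the `www.` alternative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Simplified, hypothetical pattern: a scheme followed by characters
    // that are not whitespace, quotes, or angle brackets.
    // (?<!"") rejects any match that starts right after a double quote,
    // which is where attribute values such as href="..." and src="..." begin.
    private static readonly Regex Url = new Regex(
        @"(?<!"")\bhttps?://[^\s""<>]+",
        RegexOptions.IgnoreCase);

    public static string Linkify(string text)
    {
        // $0 is the whole match, used as both the href and the link text.
        return Url.Replace(text, "<a href=\"$0\">$0</a>");
    }
}
```

Note that this only protects quoted attribute values. A URL that appears as the visible text of an existing link, as in `<a href="http://x.com">http://x.com</a>`, follows a `>` and is still matched, so the lookbehind on its own does not cover that case.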

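【Comments】:

We really should stop offering workarounds once the obligatory reference has been posted...
Agreed, but I also don't want to post useless comments as answers.
This worked for me: (?<!\w?="")(((http|https|ftp|news|file)+://)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&amp;~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])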

【Solution 2】:

See: Detect email in text using regex. Just swap in the regular expression for links; it will never replace a link inside a tag, only links in the content.

http://htmlagilitypack.codeplex.com/

Something like:

string textToBeLinkified = "... your text here ...";
const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&amp;~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(textToBeLinkified);

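// SelectNodes returns null when nothing matches; HtmlNodeCollection has no parameterless constructor, so pass a (null) parent node.
var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]") ?? new HtmlNodeCollection(null);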
foreach (var node in nodes)
{
    node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
}
string linkifiedText = doc.DocumentNode.OuterHtml;
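
The code above relies on HtmlAgilityPack to visit only the text nodes that are not inside an `<a>` element. To show that filtering idea on its own, without the dependency, here is a rough sketch that matches whole `<a>` elements and other tags first and passes them through unchanged, so that only bare URLs in the remaining text get wrapped. The `SkipTagsDemo` class, the `Linkify` helper, and the simplified URL pattern are assumptions made for illustration. It is not a substitute for a real parser: comments, scripts, and attribute values containing `>` will break it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkipTagsDemo
{
    // Alternatives are tried left to right at each position:
    //   1. a whole <a ...>...</a> element (passed through unchanged),
    //   2. any other single tag such as <img ... /> (passed through unchanged),
    //   3. a bare URL in the remaining text (captured as "url").
    // The last character class keeps trailing punctuation out of the URL.
    private static readonly Regex Token = new Regex(
        @"<a\b[^>]*>.*?</a>|<[^>]+>|\b(?<url>(?:https?://|www\.)[^\s""<>]*[^\s""<>.,;:!?)])",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Linkify(string html)
    {
        return Token.Replace(html, m =>
        {
            Group url = m.Groups["url"];
            if (!url.Success)
            {
                return m.Value; // existing link or other tag: keep as-is
            }
            // Bare "www." URLs need a scheme to work as an href.
            string href = url.Value.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
                ? "http://" + url.Value
                : url.Value;
            return "<a href=\"" + href + "\">" + url.Value + "</a>";
        });
    }
}
```

Called on the example string from the question, `SkipTagsDemo.Linkify` should wrap only the first `www.google.com` and leave the existing `<a>` element and the `<img>` tag untouched.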

【Comments】: