Extract all images from an HTML string using Regex

【Question title】: Extract all images from html string using Regex
【Posted】: 2021-08-19 10:18:05
【Question】:

I'm trying to use a regular expression to extract all image sources from an HTML string. For several reasons, I can't use HTML Agility Pack.

I need to extract "gfx/image.jpg" from strings that look like this:

<table cellpadding="0" cellspacing="0"  border="0" style="height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;">
<table cellpadding="0" cellspacing="0" border="0" background="gfx/image.jpg" style=" width:700px; height:250px; "><tr><td valign="middle">

【Comments】:

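Regex is not the right tool for parsing HTML. What if the file name contains special characters such as < or &? Or what if the file name appears inside a comment?
Isn't HTML Agility Pack the best solution? Is there a general solution for getting the image links?
I don't know what HTML Agility Pack is, but the only reliable way to parse an HTML file is to use a parser. See: Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms, Using regular expressions to parse HTML: why not?, you can't parse [x]html with regex
@phuclv HTML Agility Pack is an HTML parser.
Html Agility Pack is fairly simple to use, which is why it is usually recommended. If you only need the image links, you can also use the WebBrowser class (not the control) to navigate to the URL (remote or local). Once the Document has loaded, all the images are already parsed in the [WebBrowser].Document.Images collection. You can then download the images later or fetch the already-downloaded files from the browser cache.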
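
The concern raised in the first comment can be illustrated with a small sketch. It applies the quoted-path pattern from Solution 1 below to a string where one image reference sits inside an HTML comment. The class and method names (CommentPitfallDemo, CountJpgMatches) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentPitfallDemo
{
    // Counts every quoted string ending in .jpg, wherever it appears.
    // A regex has no notion of HTML structure, so text inside <!-- ... -->
    // is matched just like a live attribute value.
    public static int CountJpgMatches(string html)
    {
        return Regex.Matches(html, @"(['""])([^'""]+\.jpg)\1").Count;
    }
}
```

Called on `<!-- <img src="old.jpg"> --><img src="new.jpg">`, the method counts both paths, although only the second belongs to the rendered page. Whether that matters depends on how controlled the input HTML is.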

Tags: c# regex

【Solution 1】:

You can use this regex: (['"])([^'"]+\.jpg)\1 and then read Groups[2]. This code works well:

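using System;
using System.Text.RegularExpressions;

var str = @"<table cellpadding=""0"" cellspacing=""0""  border=""0"" style=""height:350px; margin:0; background: url('gfx/image.jpg') no-repeat;"">
<table cellpadding=""0"" cellspacing=""0"" border=""0"" background=""gfx/image.jpg"" style="" width:700px; height:250px; ""><tr><td valign=""middle"">";
// Group 1 captures the opening quote, \1 requires the same quote to close,
// and group 2 is the quoted text ending in .jpg.
var regex = new Regex(@"(['""])([^'""]+\.jpg)\1");
var match = regex.Match(str);
while (match.Success)
{
    // Groups[2] holds the path without the surrounding quotes.
    Console.WriteLine(match.Groups[2].Value);
    match = match.NextMatch();
}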
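
The same pattern can also be wrapped in a reusable method. The following is a minimal sketch, not part of the original answer: the names ImagePathExtractor and ExtractJpgPaths are hypothetical, and named groups (q and path) replace the numbered ones only to make the roles of the two groups easier to see.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImagePathExtractor
{
    // (?<q>['"]) captures the opening quote, \k<q> requires the same quote
    // character to close, and (?<path>...) is the quoted text ending in .jpg.
    private static readonly Regex JpgPath =
        new Regex(@"(?<q>['""])(?<path>[^'""]+\.jpg)\k<q>");

    // Returns every quoted .jpg path in the order it appears in the input.
    public static List<string> ExtractJpgPaths(string html)
    {
        var paths = new List<string>();
        foreach (Match m in JpgPath.Matches(html))
        {
            paths.Add(m.Groups["path"].Value);
        }
        return paths;
    }
}
```

Because the character class excludes both quote characters, a match can never run across an attribute boundary, which is why the single-quoted url('...') value and the double-quoted background="..." value are both found.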

【Discussion】:

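If you need all image types, the regex can be changed to: (['"])([^'"]+\.(jpg|png|bmp|gif))\1
If you only need to extract images, regex is the lightweight way. To keep tabs or newlines out of a match, you can change the regex to (['"])([^'"\s]+\.(jpg|png|bmp|gif))\1. This regex already handles both '' and "" quoting: note the ['"] and the \1.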
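
A minimal sketch of the extended pattern from the discussion above follows. The names ImageSourceFinder and FindImagePaths are hypothetical. Two details are assumptions added for illustration and are not in the original comments: RegexOptions.IgnoreCase (so that .PNG also matches) and Distinct() (so that a path repeated in the markup is returned once). The extension list is written as a non-capturing group, so Groups[2] is still the path.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImageSourceFinder
{
    // Same quoting logic as the .jpg pattern, plus:
    //  - \s in the character class, so a match cannot contain whitespace
    //  - a list of image extensions instead of .jpg only
    private static readonly Regex ImagePath = new Regex(
        @"(['""])([^'""\s]+\.(?:jpg|png|bmp|gif))\1",
        RegexOptions.IgnoreCase); // assumption: extensions may be upper case

    public static List<string> FindImagePaths(string html)
    {
        return ImagePath.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[2].Value)
            .Distinct() // assumption: duplicates are not wanted
            .ToList();
    }
}
```

Note that the pattern only finds quoted paths. An unquoted CSS value such as url(gfx/image.jpg) would not be matched.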