【Question title】: Matching images inside a link in regex
【Posted】: 2013-06-27 14:13:32
【Question description】:

What is wrong with the regex pattern that I created:

$link_image_pattern = '/\<a\shref="([^"]*)"\>\<img\s.+\><\/a\>/';
preg_match_all($link_image_pattern, $str, $link_images);

What I am trying to do is match all links that contain an image. But when I try to output $link_images, everything ends up inside the first index:

<pre>
  <?php print_r($link_images); ?>
</pre>

The markup looks something like this:

Array ( [0] => Array ( [0] => "

<p>&nbsp;</p>

<p><strong><a href="url">Title</a></strong></p>

<p>Desc</p>

<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>

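But when I output the contents of the matches, it simply returns the first string that matches the pattern plus all the other markup in the page, like this: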

<a href="{$image_url}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url}" width="568" height="347"></a></p>

    <p>&nbsp;</p>

    <p><strong><a href="url">Title</a></strong></p>

    <p>Desc</p>

    <p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>")

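【Comments】:

Index 0 will contain the whole string that matched the expression.
Use the DOMDocument library to read the HTML and get its data.
Possible duplicate of Matching SRC attribute of IMG tag using preg_match
Refer to the question above, and to the answers that use an HTML parser, NOT regex.
Regex is not a good way to parse HTML; see the answer to Parse anchor tags which have img tag as child element

Tags: php regex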

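【Solution 1】:

Regex may not be the best solution for parsing HTML, but in some cases it is the only option, for example when your text editor has no "insert HTML parsing script here" option in its search-and-replace form. If you are actually using PHP, you would be better off with a parsing script like this: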

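$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$doc->loadHTML($str);               // $str holds the HTML, as in the question
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a//img') as $img) {
    // $img is each <img> element nested inside an <a>; do something with it here
}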
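
As a minimal sketch of the DOM approach, assuming the HTML is available as a string and that only the href of each image link is wanted, the XPath expression `//a[.//img]` selects the anchors themselves rather than the images. The helper name `imageLinkHrefs` and the sample markup are illustrative and do not come from the question.

```php
<?php
// Hypothetical helper: returns the href of every <a> that contains an <img>.
function imageLinkHrefs(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings quietly; real-world HTML is rarely well formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $hrefs = [];
    // a[.//img] keeps only anchors with an <img> somewhere inside them.
    foreach ($xpath->query('//a[.//img]') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }
    return $hrefs;
}

// Sample markup (assumed), modelled on the markup in the question.
$html = '<p><a href="url">Title</a></p>'
      . '<p><a href="first.png"><img src="first.png" alt="image"></a></p>'
      . '<p><a href="second.png"><img src="second.png" alt="image"></a></p>';

print_r(imageLinkHrefs($html));
```

The text-only link is skipped because its anchor has no img descendant, so no pattern tuning is needed for attribute order or quoting.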

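Description

This expression will generally keep the regex haters off your back. It ensures that the anchor tag directly wraps an img tag, while guarding against the odd (and very unlikely) edge case of an attribute value that contains something that looks like an image tag.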

<a\b(?=\s|>)     # match the open anchor tag
(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>=])*    # match the contents of the tag, skipping over the quoted values
>    # match the close of the anchor tag
<img\b(?=\s|>)    # match the open img tag
(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>=])*     # match the contents of the img tag, skipping over the quoted value
>   # match the close of the img tag
<\/a>   # match the close anchor tag

PHP Code Example:

Sample Text

Note that the last line has an ugly attribute that would break most other regular expressions.

<p>&nbsp;</p>
<p><strong><a href="url">Title</a></strong></p>
<p>Desc</p>
<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>

<p><a href="{$image_url2}" Onmouseover="function(' ><img src=picture.png></a> ');" >I do not have an image</a></p>

Code

<?php
$sourcestring="your source string";
preg_match_all('/<a\b(?=\s|>)
(?:=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*|[^>=])*
>
<img\b(?=\s|>)
(?:=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*|[^>=])*
>
<\/a>/imsx',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

[0] => <a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a>

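【Discussion】:

【Solution 2】:

The problem is probably in the .+\> part: .+ is greedy, so it matches everything up to the last >

Try the same approach you used to stop at the ": [^\>]+ . This works in my editor: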

<a.+><img[^>]+></a>

Depending on your tool, you may only need to add backslashes \ before <, > and /
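
The difference between the two quantifiers can be sketched as follows. This is an illustration under assumptions, not the asker's real data: the sample string is made up, sits on a single line, and contains no > inside attribute values (the edge case that Solution 1 guards against).

```php
<?php
// Assumed sample: two image links separated by other markup, on one line.
$str = '<p><a href="one.png"><img src="one.png" alt="image"></a></p>'
     . '<p>Desc</p>'
     . '<p><a href="two.png"><img src="two.png" alt="image"></a></p>';

// Greedy: .+ runs to the end of the line, then backtracks only to the LAST ></a>.
$greedy = '/<a\shref="([^"]*)"><img\s.+><\/a>/';
// Negated class: [^>]+ cannot cross the > that closes the img tag.
$bounded = '/<a\shref="([^"]*)"><img\s[^>]+><\/a>/';

preg_match_all($greedy, $str, $greedyMatches);
preg_match_all($bounded, $str, $boundedMatches);

echo count($greedyMatches[0]), "\n";   // 1: one match swallowing both links
echo count($boundedMatches[0]), "\n";  // 2: one match per image link
print_r($boundedMatches[1]);           // the captured href values
```

Index 0 of each result holds the full matched strings and index 1 holds the first capture group, which is why the greedy version shows the rest of the page under the first index.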

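【Discussion】:

Regex is not the way to parse HTML. Did you notice how many edits you have made in the last few minutes/seconds? Not to mention that this question is a duplicate.
@Prix 1. To be honest, my last edit was 21 minutes ago and your link was 17 minutes ago, so you showed up 4 minutes later. 2. Try reading the question before downvoting: he is trying to "match", not "parse". 3. I can make as many edits as I like within 5 minutes, and you would do well to note that.
It is still not a job for regex. He could use strpos, and he could still use DOMDocument. And no, I do not mean your edit from 21 minutes ago; I mean all the edits you made in the meantime, more than 4 of them before my comment, which shows that regex is not easy to handle for parsing HTML. If he is comparing links, it is easier to use DOMDocument and match the extracted strings, or even to use strpos or a similar option.
I am glad you spent so much time following my answer, but the only reason I edited it again and again is that I was not sure he would be able to use it correctly, for example whether it was clear that he needs to escape some characters (my text editor does not require it).
@Prix Did you downvote because it is wrong, or because it is "not the right way"?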
