PHP regex to extract <a> from HTML: answers

【Question title】: PHP Regex to extract <a> from html
【Posted】: 2015-08-10 08:59:15
【Question】:

I am trying to extract tags that match a specific pattern from HTML. At the moment I use:

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?[^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {...}

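This correctly captures all the elements, but I want group 3, (.*) (the link's text), to contain the text "Find Tickets", and none of my attempts to work that text into the pattern have succeeded. The link element in the HTML may also contain more than just "Find Tickets"; that is, it should match .*Find Tickets.*

Can anyone help me here? I have been getting nowhere.

Update: an example of the specific element I am trying to grab: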

<a href="https://www.facebook.com/l.php?u=https%3A%2F%2Fthelittleboxoffice.com%2Fheritagehotel%2Fevent%2Fview%2F22847&amp;h=RAQFYdp-K&amp;s=1" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;https:\\/\\/thelittleboxoffice.com\\/heritagehotel\\/event\\/view\\/22847&quot;);" onclick="LinkshimAsyncLink.swap(this, &quot;https:\\/\\/www.facebook.com\\/l.php?u=https\\u00253A\\u00252F\\u00252Fthelittleboxoffice.com\\u00252Fheritagehotel\\u00252Fevent\\u00252Fview\\u00252F22847&amp;h=RAQFYdp-K&amp;s=1&quot;);"><div id="u_0_p">Find Tickets</div></a><

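Thanks, Josh

【Comments】:

Do you just want the link text, the href value, both, or the whole a tag?
Could you update your post to show the string you are trying to match and what you hope to achieve with the match?
I am mainly interested in the href; the link text itself is optional (i.e. I will not use it for anything).

Tags: php regex python-2.7 html-parsing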

【Solution 1】:

Use the right tool for the job, which is not a regular expression.

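$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data; @ hides warnings about malformed markup

$xpath = new DOMXPath($doc); // XPath interface over the parsed document
// select every <a> whose text content contains "Find Tickets"
$links = $xpath->query('//a[contains(., "Find Tickets")]');

$results = [];
foreach ($links as $link) {
   $results[] = $link->getAttribute('href');
}

print_r($results);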


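【Comments】:

Thanks, but I actually started with that approach and it did not work, because the element I am after is embedded in an HTML comment, so a regex is the better tool for this job.
I don't see how a regex is the better approach; there are ways to manipulate XPath to achieve this.
Still, I chose to use a regex, and the question is about regex rather than other approaches. If you can write an XPath that finds the enclosing comment whose text contains Find Tickets, then go for it :-)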
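
For what it is worth, the XPath route asked about in the comments above could look roughly like the sketch below. It rests on assumptions: the sample $html and its example.com URLs are made up for illustration, and the markup inside the comment is assumed to be parseable as HTML on its own. The idea is to select the comment node with XPath, then parse the comment's text as a second document and run the answer's query against it.

```php
<?php
// Hypothetical sample: the wanted link sits inside an HTML comment.
$html = '<html><body><p>Events</p>'
      . '<!-- <a href="http://example.com/other">Other</a>'
      . '<a href="http://example.com/tickets"><span>Find Tickets</span></a> -->'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$results = [];
// Step 1: find comment nodes whose text mentions "Find Tickets".
foreach ($xpath->query('//comment()[contains(., "Find Tickets")]') as $comment) {
    // Step 2: parse the comment body as its own HTML document.
    $inner = new DOMDocument;
    $inner->loadHTML($comment->nodeValue);
    $innerXpath = new DOMXPath($inner);

    // Step 3: run the same link query as in the answer against that document.
    foreach ($innerXpath->query('//a[contains(., "Find Tickets")]') as $link) {
        $results[] = $link->getAttribute('href');
    }
}

print_r($results);
```

Whether this beats a regex depends on how well-formed the commented-out markup is; badly broken fragments may still be easier to handle with a pattern.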

【Solution 2】:

I am still having some trouble understanding exactly what you are after. Here is my best attempt, though.

<?php

$string = '<a href="https://www.facebook.com/l.php?u=https%3A%2F%2Fthelittleboxoffice.com%2Fheritagehotel%2Fevent%2Fview%2F22847&amp;h=RAQFYdp-K&amp;s=1" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;https:\\/\\/thelittleboxoffice.com\\/heritagehotel\\/event\\/view\\/22847&quot;);" onclick="LinkshimAsyncLink.swap(this, &quot;https:\\/\\/www.facebook.com\\/l.php?u=https\\u00253A\\u00252F\\u00252Fthelittleboxoffice.com\\u00252Fheritagehotel\\u00252Fevent\\u00252Fview\\u00252F22847&amp;h=RAQFYdp-K&amp;s=1&quot;);"><div id="u_0_p">Find Tickets</div></a><';



if (preg_match('~(<a href(.*?)Find Tickets(.*?)</a>)~i', $string, $matches)) {
    print "<PRE><FONT COLOR=ORANGE>"; print_r($matches); print "</FONT></PRE>";
}

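All I am really doing here is looking for a string that starts with <a href, followed by anything up to Find Tickets, and then possibly more content until it reaches the closing </a>.

This is a very generic regex, but you can build on it if you are looking for something more specific.

Edit:

OK, judging from your comment, I think I have a better idea of what you are looking for. Here is an updated regex that extracts the URL of the link whose link text matches Find Tickets.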

<?php

$string = '
<a href="http://www.google.com" style="color: blue;">Google</a>

<a href="https://www.facebook.com/l.php?u=https%3A%2F%2Fthelittleboxoffice.com%2Fheritagehotel%2Fevent%2Fview%2F22847&amp;h=RAQFYdp-K&amp;s=1" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;https:\\/\\/thelittleboxoffice.com\\/heritagehotel\\/event\\/view\\/22847&quot;);" onclick="LinkshimAsyncLink.swap(this, &quot;https:\\/\\/www.facebook.com\\/l.php?u=https\\u00253A\\u00252F\\u00252Fthelittleboxoffice.com\\u00252Fheritagehotel\\u00252Fevent\\u00252Fview\\u00252F22847&amp;h=RAQFYdp-K&amp;s=1&quot;);"><div id="u_0_p">Find Tickets</div></a>

<a href="http://www.yahoo.com">Yahoo</a>';

if (preg_match('~<a href="(.*?)"(?:.*?)(?:(?=Find Tickets))(?:.*?)</a>~i', $string, $matches)) {
    print "<PRE><FONT COLOR=ORANGE>"; print_r($matches); print "</FONT></PRE>";
}

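Here is what this expression does:

(.*?) - This is the part that actually captures the URL and stores it in $matches[1].
(?:.*?) - This allows anything after the URL until it reaches the next part (the lookahead). Since we do not actually need this information, the ?: tells the regex engine not to remember what it found.
(?:(?=Find Tickets)) - This is a positive lookahead, meaning that for a match to occur, the text Find Tickets must come next. As with the previous item, ?: tells it not to remember the match (a lookahead never captures anything, so the wrapper is redundant but harmless). If the text might also match something like "Don't Find Tickets", you could probably include the surrounding HTML angle brackets to lock it down further: (?=>Find Tickets<).
(?:.*?) - This last part is the same as the earlier one and simply matches anything up to the closing </a> tag.

With the $string above, this gives you: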

Array
(
    [0] => <a href="https://www.facebook.com/l.php?u=https%3A%2F%2Fthelittleboxoffice.com%2Fheritagehotel%2Fevent%2Fview%2F22847&amp;h=RAQFYdp-K&amp;s=1" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;https:\/\/thelittleboxoffice.com\/heritagehotel\/event\/view\/22847&quot;);" onclick="LinkshimAsyncLink.swap(this, &quot;https:\/\/www.facebook.com\/l.php?u=https\u00253A\u00252F\u00252Fthelittleboxoffice.com\u00252Fheritagehotel\u00252Fevent\u00252Fview\u00252F22847&amp;h=RAQFYdp-K&amp;s=1&quot;);"><div id="u_0_p">Find Tickets</div></a>
    [1] => https://www.facebook.com/l.php?u=https%3A%2F%2Fthelittleboxoffice.com%2Fheritagehotel%2Fevent%2Fview%2F22847&amp;h=RAQFYdp-K&amp;s=1
)

$matches[1] contains the URL.
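
To illustrate the (?=>Find Tickets<) suggestion from the explanation above, here is a small comparison of the loose and the tightened lookahead. The sample links and their example.com URLs are invented for the demonstration, and, as with the $string above, each link is assumed to sit on its own line (without the s flag, . does not cross line breaks).

```php
<?php
// Hypothetical sample: one link per line, only the second has the exact text.
$string = '
<a href="http://example.com/no">Do Not Find Tickets</a>
<a href="http://example.com/yes" target="_blank"><div id="u_0_p">Find Tickets</div></a>';

// Loose: "Find Tickets" may appear anywhere in the link text.
$loose = '~<a href="(.*?)"(?:.*?)(?=Find Tickets)(?:.*?)</a>~i';
// Tight: the phrase must sit directly between ">" and "<".
$tight = '~<a href="(.*?)"(?:.*?)(?=>Find Tickets<)(?:.*?)</a>~i';

preg_match($loose, $string, $looseMatch);
preg_match($tight, $string, $tightMatch);

echo $looseMatch[1], "\n"; // http://example.com/no
echo $tightMatch[1], "\n"; // http://example.com/yes
```

The tightened form only helps when the phrase is the entire content of its innermost element; extra whitespace or text around it would need a more forgiving pattern.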

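Hope this gets you where you want to go!

【Comments】:

Thanks, I tried that, but it seems to match more than once in the document. Find Tickets will only ever appear once, but I do need the href value in a group, because as it stands the group ends up including text after the closing ". My original regex is very good at extracting the elements, but I need it to extract only the elements that also contain "Find Tickets" in the link text.
I think I understand what you are looking for now. I have updated my post to reflect that. Hopefully this is what you want.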
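
One possible way to meet the requirement in the last comments (href in its own group, only links whose text contains Find Tickets, no spill-over into neighbouring links) is a "tempered" pattern that refuses to step across a closing </a>. This is a sketch, not the answer author's method; the sample $html and its URLs are made up, and it assumes double-quoted href values and no raw > inside attribute values.

```php
<?php
// Hypothetical sample: three links on ONE line, only the middle one is wanted.
$html = '<a href="http://www.google.com" style="color: blue;">Google</a>'
      . '<a href="http://example.com/l.php?u=1&amp;s=1" target="_blank">'
      . '<div id="u_0_p">Find Tickets</div></a>'
      . '<a href="http://www.yahoo.com">Yahoo</a>';

// (?:(?!</a>).)* matches a run of characters that never contains "</a>",
// so the link text cannot extend into the next link.
$pattern = '~<a\s[^>]*?href="([^"]*)"[^>]*>'  // opening tag, group 1 = href value
         . '(?:(?!</a>).)*?Find Tickets'       // link text up to the phrase
         . '(?:(?!</a>).)*</a>~is';            // remainder of the link text

$hrefs = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Decode entities such as &amp; so the URL can be used directly.
        $hrefs[] = html_entity_decode($m[1]);
    }
}

print_r($hrefs);
```

Because [^"]* stops at the first closing quote, group 1 holds only the href value, and the s flag lets the link text span several lines.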