Regexp for extracting all links and anchor texts from HTML: answers

【Question title】: Regexp for extracting all links and anchor texts from HTML
【Posted】: 2011-06-05 05:18:53
【Question】:

I would like one or more regular expressions that can:

1) Take the HTML of a large page.

2) Find the URLs contained in all links, for example:

<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>

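and so on. It should extract the URL contained in the 'href' attribute, no matter what comes before or after the href.

3) Extract the anchor text of every link. For the examples above, it should return 'http://example1.com' with the anchor text 'Test 1', then 'http://example2.com' with 'Test 2', and so on.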

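【Comments on the question】:

Any reason you don't want to use a DOM parser for this? And any reason you couldn't find the duplicates?
Possible duplicate of php regular expression to match specific url pattern
Possible duplicate of Regular expression for grabbing the href attribute of an A element
I love how this question gets asked a million times a day
Possible duplicate of scrape the data from html page php

Tags: php regex string html-parsing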

【Solution 1】:

Try something like this:

//not tested
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

【Discussion】:

This won't match the second and third links in the OP's sample markup.
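
That objection is easy to check. The following is a minimal sketch that runs the pattern from Solution 1 against the three sample links from the question; the variable names `$html`, `$count` and `$m` are only illustrative. Because the pattern requires `href` to come immediately after `<a `, only the first link is found.

```php
<?php
// Sample markup copied from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// Pattern from Solution 1, unchanged.
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

// PREG_SET_ORDER groups the captures per link: $m[1] = URL, $m[2] = anchor text.
$count = preg_match_all($regex_pattern, $html, $matches, PREG_SET_ORDER);

echo $count, PHP_EOL; // prints 1
foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], PHP_EOL; // prints http://example1.com => Test 1
}
```

The second and third links are skipped because other attributes sit between `<a` and `href`.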

【Solution 2】:

/<a[^>]+href\s*=\s*["']([^"']+)["'][^>]*>(.*?)<\/a>/mis

【Discussion】:

This breaks when the attribute value is enclosed in double quotes and contains a single quote. It also breaks when the quotes are omitted, which is allowed for href values such as next_page.htm. See w3.org/TR/html401/intro/sgmltut.html#h-3.2.2
This one is very robust (test it at martinwardener.com/regex): \b(((src|href|action|url) *(=|:) *(?<mh>"|'|))(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mh>|url *$ *(?<mc>"|'|)(?<url>[\w ~$!*'/.?=#&@:%+,();\-\[\]]+)\k<mc>$)
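
To make the first comment concrete, here is a rough sketch that runs the Solution 2 pattern on the question's sample markup and then on the two cases the comment describes. The two extra test links (`it's.html` and the `next_page.htm` anchor) and all variable names are made up for illustration.

```php
<?php
// Pattern from Solution 2, written as a single-quoted PHP string.
$pattern = '/<a[^>]+href\s*=\s*["\']([^"\']+)["\'][^>]*>(.*?)<\/a>/mis';

// Sample markup copied from the question: all three links match.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';
$count = preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo $count, PHP_EOL; // prints 3

// Double-quoted value containing a single quote: the pattern still matches,
// but the captured URL is cut off at the apostrophe.
preg_match($pattern, '<a href="it\'s.html">Apostrophe</a>', $quoted);
echo $quoted[1], PHP_EOL; // prints it

// Unquoted value: the pattern insists on a quote after "=", so nothing matches.
$unquoted = preg_match($pattern, '<a href=next_page.htm>Next</a>');
echo $unquoted, PHP_EOL; // prints 0
```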

【Solution 3】:

You need to take a look at lookahead and lookbehind.

<?php

$string = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// The lookahead captures the href value: $matches[1] holds the URLs, $matches[2] the anchor texts.
if (preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $string, $matches)) {
    /*** at least one link was matched ***/
    echo 'Found a match';
    print_r($matches);
} else {
    /*** if no match is found ***/
    echo 'No match found';
}
?>

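【Discussion】:

Of course, the correct approach is to use a DOM parser, but it can also be done with a regex.
See my comment under GameBit's solution. It applies to your regex as well.
No, it doesn't break if there is a single quote inside the attribute, just try it. In fact, if you use this regex #]*>([^|]*>([^ |]*>([^#i or something similar, and then discard the empty result sets, it doesn't break at all even if you use single quotes or no quotes. The only way to break it is, in the anchor text, to use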

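【Solution 4】:

<?php

// $html is assumed to already hold the page markup.
$dom = new DOMDocument();
$dom->loadHTML($html);
// A DOMNodeList of every <a> element in the document.
$urls = $dom->getElementsByTagName('a');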

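【Discussion】:

A lot of people just throw out "just use a DOM parser!" but never give a simple example of what it can do. php.net/manual/en/book.dom.php It does a lot more than my example shows. Worth learning.
This answer is incomplete; here is one that works: stackoverflow.com/questions/4423272/…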
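
For reference, a fuller version of the DOM approach might look like the sketch below. It is not the code from the linked answer; `extract_links()` is a hypothetical helper name, and the input is the sample markup from the question. It assumes PHP's DOM extension is enabled.

```php
<?php
// Sample markup copied from the question.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>';

// extract_links() is a hypothetical helper, not part of the DOM extension.
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Real pages are rarely valid HTML, so buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        // Skip anchors without an href (for example named anchors).
        if ($a->hasAttribute('href')) {
            $links[] = [
                'url'  => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }
    return $links;
}

$links = extract_links($html);
foreach ($links as $link) {
    echo $link['url'], ' => ', $link['text'], PHP_EOL;
}
```

Unlike the regex answers, this does not care about attribute order or quoting style, since the parser has already normalised the attributes by the time `getAttribute('href')` is called.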

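【Solution 5】:

As far as extracting links from HTML with a regex goes, this one is very powerful:

And here is one that extracts all the "plain" text (that is, the content outside of tags) from an HTML document:

(<(?<tag>script|style)[\s\S]*?</\k<tag>>)||<[\s\S]*?>|(?<text>[^<>]*)

Test them here: http://www.martinwardener.com/regex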

【Discussion】:

【Solution 6】:

<?php
// $html is assumed to hold the page markup. Handles double-quoted and unquoted href values.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
    }
}
?>

This extracts both the links and the anchor texts.
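
A possible way to use it is sketched below. The wrapper name `extract_links_regex()` is hypothetical, and the fourth test link with an unquoted `href` is added here only to show that the pattern copes with that case; the first three links are the question's sample markup.

```php
<?php
// extract_links_regex() is a hypothetical wrapper around the pattern above.
function extract_links_regex(string $html): array
{
    $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
    $links = [];
    // s: dot matches newlines, i: case-insensitive, U: quantifier greediness inverted.
    if (preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $links[] = ['url' => $match[2], 'text' => $match[3]];
        }
    }
    return $links;
}

// Question's sample markup plus one link with an unquoted href value.
$html = '<a href="http://example1.com">Test 1</a>
<a class="foo" id="bar" href="http://example2.com">Test 2</a>
<a onclick="foo();" id="bar" href="http://example3.com">Test 3</a>
<a href=next_page.htm>Next</a>';

$links = extract_links_regex($html);
foreach ($links as $link) {
    echo $link['url'], ' => ', $link['text'], PHP_EOL;
}
```

Note that a single-quoted `href` value would come back with its quotes still attached, because the pattern only treats the double quote as a delimiter.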

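【Discussion】:

I use this because a 4 MB file takes only 54 ms, versus 10-30 seconds with a real parser :)
Really great work: a single regex gets the whole job done. Learned a new approach today.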