How to extract URLs and text from HTML markup with regex: answers

【Question Title】: How to extract urls and text from html markup with regex
【Posted】: 2014-02-08 10:55:35
【Question】:

<!-- This Div repeated in HTML with different properties value -->

<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">

<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">

<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">

    <!-- This Div also repeated multiple in HTML -->

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>

</DIV>

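We have very dirty HTML markup that was generated by some program or application. We want to extract the URLs as well as the link text from this markup.

Each href holds one of two kinds of URL. URL pattern 1: 'http://link.domain.com/?id=123'; URL pattern 2: 'http://domain.com/song-title.mp3'.

In the first case there is a specific pattern to match, but in the second there is no pattern beyond the URL ending with the '.mp3' extension.

Is there a function that can extract the URLs matching these patterns, together with the text, from this markup?

Note: Without DOM, is there any way to match from an href through to its text with a regex? preg_match?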

【Comments】:

There is no magic function that does all the work for you. You have to write code to do what you want. Use a DOM parser such as DOMDocument for this.

Tags: php html regex curl

【Solution 1】:

Make use of the DOMDocument class and proceed like this.

$dom = new DOMDocument;
$dom->loadHTML($html); // $html must hold your HTML source as a string
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('href'); // the link URL
    echo $tag->nodeValue;            // the text content between the <a> tags
}

【Comments】:

Just tried this and it works great. Though you might want to change this line to: echo $tag->getAttribute('href');

【Solution 2】:

Expanding on @Shankar Damodaran's answer:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}

Then do the same for the MP3 links:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'.mp3') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}
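
The two loops above can probably be merged into a single pass that also collects the link text, which the question asked for. The following is only a sketch under a few assumptions: the sample markup is modeled on the question's snippet, the third link is invented to show a non-matching case, and `extractLinks` is a hypothetical helper name rather than anything from the answers.

```php
<?php
// Sample markup modeled on the question's snippet; the "about.html" link is
// an invented non-matching case.
$html = <<<'HTML'
<div>
<a href="http://link.domain.com/?id=123" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>
<a href="http://domain.com/song-title.mp3" target="_parent">Song Title</a>
<a href="http://domain.com/about.html">About</a>
</div>
HTML;

// Hypothetical helper: returns url/text pairs for links whose href contains
// "?id=" or ends in ".mp3".
function extractLinks(string $html): array
{
    $dom = new DOMDocument;
    $dom->loadHTML($html);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        $isIdLink = strpos($href, '?id=') !== false;
        $isMp3 = preg_match('/\.mp3$/i', $href) === 1;
        if ($isIdLink || $isMp3) {
            // nodeValue includes the markup's indentation, so collapse
            // whitespace runs to single spaces and trim the ends.
            $text = trim(preg_replace('/\s+/', ' ', $tag->nodeValue));
            $links[] = ['url' => $href, 'text' => $text];
        }
    }
    return $links;
}

foreach (extractLinks($html) as $link) {
    echo $link['url'] . ' => ' . $link['text'] . "\n";
}
```

Anchoring the `.mp3` check to the end of the href is a design choice: a plain substring test would also accept a URL such as `file.mp3.html`.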

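【Comments】:

Thanks, but it shows warnings such as "Warning: DOMDocument::loadHTML(): Unexpected end tag : td Notice: DOMDocument::loadHTML(): Namespace prefix fb Warning: DOMDocument::loadHTML(): Tag fb:comment"
You need to load the contents of the $html file correctly.
Try saving the page you are reading the URLs from as an .html file, then open it with file_get_contents('source.htm') to debug it first. Strip out the unnecessary parts to make debugging simpler.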
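
Warnings like the ones quoted above are raised by libxml when it meets markup it considers invalid; the document is usually still parsed. One common way to keep them out of the output is `libxml_use_internal_errors(true)`, which makes libxml store the messages instead of emitting PHP warnings. This is a sketch only, and the malformed sample string is invented to resemble the reported messages.

```php
<?php
// Invented sample with a stray </td> and a namespaced tag, similar to what
// the warnings in the comment complain about.
$html = '<div><a href="http://domain.com/song-title.mp3">Song</a></td>'
      . '<fb:comment></fb:comment></div>';

// Store libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Grab the stored messages (useful when debugging), then reset the state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $tag) {
    $hrefs[] = $tag->getAttribute('href');
}

echo count($errors) . " parser message(s) collected\n";
echo implode("\n", $hrefs) . "\n";
```

Which messages libxml reports depends on the installed libxml2 version, so the count may differ between systems.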
Without DOM, is there any way to match from the href through to the text with a regex? preg_match?
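
For the regex-only route asked about here, something along these lines with `preg_match_all` may work on simple cases. It is a fragile sketch, not a replacement for the DOM approach: it assumes double-quoted href values, no nested `<a>` tags, and no `>` inside attribute values, and the sample markup and the names `$pattern` and `$links` are illustrative only.

```php
<?php
// Sample markup modeled on the question's snippet; the "about.html" link is
// an invented non-matching case.
$html = <<<'HTML'
<div>
<a href="http://link.domain.com/?id=123" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>
<a href="http://domain.com/song-title.mp3" target="_parent">Song Title</a>
<a href="http://domain.com/about.html">About</a>
</div>
HTML;

// Group 1: an href ending in "?id=<digits>" or ".mp3".
// Group 2: everything up to the next </a>.
// Flags: i = case-insensitive, s = let "." match newlines.
$pattern = '~<a\s[^>]*href="([^"]*(?:\?id=\d+|\.mp3))"[^>]*>(.*?)</a>~is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // Drop the inner tags, then collapse whitespace left by the indentation.
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($m[2])));
    $links[] = ['url' => $m[1], 'text' => $text];
}

foreach ($links as $link) {
    echo $link['url'] . ' => ' . $link['text'] . "\n";
}
```

Single-quoted or unquoted href values, or an unclosed `<a>`, would slip past this pattern, which is the main reason the answers above recommend DOMDocument for generated markup.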