How can I extract the links from a page of HTML? Answers

【Question title】: How can I extract the links from a page of HTML?
【Posted】: 2011-06-04 17:27:10
【Question】:

I am trying to download a file with PHP.

$file = file_get_contents($url);

How should I go about downloading the contents of the links found in the $url file...

【Comments】:

Download each link by calling file_get_contents with the link as the argument.
Possible duplicate of Best Methods to parse HTML

Tags: php

【Solution 1】:

So you want to find all the URLs in a given file? Regex to the rescue... The sample code below should do what you need:

$file = file_get_contents($url);
// file_get_contents returns false on failure
if ($file === false) return;

// extract the hyperlinks from the file via regex (matches http and https)
preg_match_all("/https?:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $file, $urlmatches);

// preg_match_all stores the full-pattern matches at index 0
$urlmatches = $urlmatches[0];

// if there are any URLs to be found
if (count($urlmatches) > 0) {
    // count number of URLs
    $numberofmatches = count($urlmatches);
    echo "Found $numberofmatches URLs in $url\n";

    // write all found URLs line by line
    foreach ($urlmatches as $urlmatch) {
        echo "URL: $urlmatch...\n";
    }
}

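Edit: If I understand your question correctly, you now want to download the contents of the URLs that were found. You can do this by calling file_get_contents for each URL inside the foreach loop, but you will probably want to filter the list first (for example, to avoid downloading images).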
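The filtering step mentioned above could look roughly like the sketch below. The function names filterDownloadableUrls and downloadAll, and the list of skipped extensions, are hypothetical choices for illustration and not part of the original answer; the sketch judges a URL only by the file extension in its path, which is a heuristic and will not catch every image.

```php
<?php
// Hypothetical helper: keep only URLs whose path does not end in one of the
// skipped extensions. Duplicates are dropped as well.
function filterDownloadableUrls(
    array $urls,
    array $skipExtensions = ['jpg', 'jpeg', 'png', 'gif', 'css', 'js']
): array {
    $kept = [];
    foreach (array_unique($urls) as $candidate) {
        // parse_url returns null when the URL has no path component
        $path = parse_url($candidate, PHP_URL_PATH);
        $extension = is_string($path)
            ? strtolower(pathinfo($path, PATHINFO_EXTENSION))
            : '';
        if (!in_array($extension, $skipExtensions, true)) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

// Hypothetical helper: download every remaining URL and return the bodies
// keyed by URL. URLs that fail to download are skipped silently.
function downloadAll(array $urls): array
{
    $contents = [];
    foreach (filterDownloadableUrls($urls) as $target) {
        $body = @file_get_contents($target);
        if ($body !== false) {
            $contents[$target] = $body;
        }
    }
    return $contents;
}
```

Calling downloadAll with the array of matched URLs from the regex step would then fetch only the pages that pass the filter.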

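【Comments】:

【Solution 2】:

This requires parsing the HTML, which is quite a challenge in PHP. To save yourself a lot of trouble, download an HTML parsing library such as PHPQuery (http://code.google.com/p/phpquery/). Then select all the links with pq('a'), iterate over them to read their href attribute values, and for each link convert it from relative to absolute and run file_get_contents on the resulting URL. Hopefully these hints help you get started.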
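As a rough sketch of the same steps (select the a elements, read each href, convert relative to absolute), the example below uses PHP's bundled DOMDocument class instead of PHPQuery, so it is a substitute technique rather than the library this answer recommends. It assumes the DOM extension is enabled. The function names resolveUrl and extractLinks are hypothetical, and the URL resolution is deliberately simplified: it does not normalize "../" segments or handle query-only hrefs.

```php
<?php
// Hypothetical helper: resolve a possibly relative href against a base URL.
function resolveUrl(string $href, string $baseUrl): string
{
    // Already absolute: starts with a scheme such as "http:" or "mailto:"
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $href)) {
        return $href;
    }
    $base = parse_url($baseUrl);
    $scheme = $base['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($base['host'] ?? '')
        . (isset($base['port']) ? ':' . $base['port'] : '');
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;   // protocol-relative
    }
    if (strpos($href, '/') === 0) {
        return $origin . $href;         // root-relative
    }
    // Plain relative path: append to the directory of the base path
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Hypothetical helper: return the unique absolute URLs of all <a href> links.
function extractLinks(string $html, string $baseUrl): array
{
    if ($html === '') {
        return [];
    }
    $dom = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // no target, or only a fragment within the same page
        }
        $links[] = resolveUrl($href, $baseUrl);
    }
    return array_values(array_unique($links));
}
```

Each URL returned by extractLinks can then be passed to file_get_contents, as described above.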

【Comments】:

【Solution 3】:

You need to parse the resulting HTML string, either manually or with a third-party library.

HTML Scraping in Php

【Comments】: