How to crawl and download all PDF files from an HTML link?

【Question title】: How to crawl and download all PDF files from an HTML link?
【Posted】: 2012-02-01 22:03:30
【Question description】:

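Here is my code for scraping all the PDF links, but it doesn't work. How can I download the files from those links and save them to a folder on my computer?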

<?php
set_time_limit(0);
include 'simple_html_dom.php';

$url = 'http://example.com';
$html = file_get_html($url) or die ('invalid url');

// extract PDF links
foreach($html->find('a[href=[^"]*\.pdf]') as $element)
echo $element->href.'<br>';
?>

【Question comments】:

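It looks like you have a typo: in the foreach loop, $htnl should be $html. If that is not in your original code, what exactly is the error you are getting?
@ggreiner There is no typo in my original code, sorry; I only mistyped it here. My web page shows a blank result.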

Tags: php dom pdf web-crawler

【Solution 1】:

foreach($htnl->find('a[href=[^"]*\.pdf]') as element)
           ^---typo. should be an 'm'        ^---typo. need a $ here

Apart from the typos above, in what way does your code "not work"?

【Comments】:

Oops, sorry, there is no typo in my original code. It doesn't work: I get a blank result on my web page.

【Solution 2】:

Have you looked into phpQuery? http://code.google.com/p/phpquery/

【Solution 3】:

A simpler solution here is:

foreach ($html->find('a[href$=pdf]') as $element)

https://simplehtmldom.sourceforge.io/manual.htm

[attribute$=value] matches elements that have the specified attribute and whose value ends with the given value.
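
None of the answers covers the second half of the question, saving the files to disk. The following is a minimal sketch of one way to do it, not a definitive implementation. So that it has no external dependency, it uses PHP's built-in DOM extension rather than simple_html_dom, and it applies the same "ends with" idea as the selector above by hand. The function names findPdfLinks, resolveUrl and downloadPdfs are hypothetical helpers introduced for illustration. It assumes the base URL has a scheme and host, that allow_url_fopen is enabled, and it does not normalize "../" segments in relative links.

```php
<?php
declare(strict_types=1);

/**
 * Collect the absolute URL of every <a> whose href path ends in ".pdf".
 * Uses PHP's built-in DOM extension, not simple_html_dom.
 */
function findPdfLinks(string $html, string $baseUrl): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        $path = (string) parse_url($href, PHP_URL_PATH);
        // Same idea as [href$=pdf], but checked on the path only and
        // case-insensitively, so "b.PDF?v=2" still counts.
        if (strtolower(substr($path, -4)) === '.pdf') {
            $links[] = resolveUrl($href, $baseUrl);
        }
    }
    return array_values(array_unique($links));
}

/** Simplified relative-URL resolution (assumption: no "../" handling). */
function resolveUrl(string $href, string $baseUrl): string
{
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href; // already absolute
    }
    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    if (substr($href, 0, 2) === '//') {
        return $base['scheme'] . ':' . $href; // protocol-relative link
    }
    if (substr($href, 0, 1) === '/') {
        return $origin . $href; // root-relative link
    }
    // Otherwise resolve against the directory of the base path.
    $basePath = $base['path'] ?? '/';
    $dir = substr($basePath, 0, (int) strrpos($basePath, '/') + 1);
    return $origin . $dir . $href;
}

/** Download each URL into $targetDir; returns the paths that were saved. */
function downloadPdfs(array $urls, string $targetDir): array
{
    if (!is_dir($targetDir) && !mkdir($targetDir, 0777, true)) {
        throw new RuntimeException("Cannot create directory: $targetDir");
    }
    $saved = [];
    foreach ($urls as $url) {
        // The last path segment becomes the local file name, e.g. "a.pdf".
        $name = basename((string) parse_url($url, PHP_URL_PATH));
        $data = @file_get_contents($url); // needs allow_url_fopen for http(s)
        if ($data === false) {
            continue; // skip links that fail to download
        }
        $target = rtrim($targetDir, '/') . '/' . $name;
        if (file_put_contents($target, $data) !== false) {
            $saved[] = $target;
        }
    }
    return $saved;
}
```

For the page in the question, a call would look like `downloadPdfs(findPdfLinks(file_get_contents($url), $url), __DIR__ . '/pdfs')`. Two files with the same name would overwrite each other in this sketch, so a real crawler would probably want to make the local names unique.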
