How to implement a web scraper in PHP? [closed] - Answers

【Question title】: How to implement a web scraper in PHP? [closed]
【Posted】: 2010-09-06 19:37:18
【Question description】:

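Which built-in PHP functions are useful for web scraping? What are some good resources (web or print) for getting up to speed on web scraping with PHP?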

【Question comments】:

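I'd like to recommend a class I came across recently: Simple HTML DOM Parser.
PHP is a particularly bad language for this. It lacks an event-driven framework, which is almost a necessity for this task. Can you crawl one site with it? Yes. Will you crawl a lot of sites? No.
@EvanCarroll Would cURL and DOMDocument be suitable for scraping product prices and images from multiple websites (to be output on my own site)? For example, this Stackoverflow link. If not, what would you suggest?
Try it; if it works, it's good enough for you. Node is a better choice for building web scrapers. There is also Phantom.JS, if you need something modern that actually has a DOM and runs JavaScript on it.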

Tags: php screen-scraping

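【Solution 1】:

The curl library allows you to download web pages. For the scraping itself, you should look into regular expressions.

【Comments】:

-1 for recommending regular expressions! Use an HTML parser.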
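
As a rough illustration of what "use an HTML parser" can look like in PHP, here is a minimal sketch built on the bundled DOM extension (DOMDocument and DOMXPath). The helper name extractLinks() and the sample markup are made up for this example; it is one possible approach, not the commenter's own code.

```php
<?php
// Sketch: extract link targets with PHP's DOM extension instead of regex.
// extractLinks() is a hypothetical helper name, not part of any library.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select every <a> element that carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Usage against an inline sample string (no network access needed).
$sample = '<html><body><p><a href="/one">First</a> and <a href="/two"> Second </a></p></body></html>';
print_r(extractLinks($sample));
```

Because the parser builds a tree, nested tags, attribute order and quoting style do not affect the query, which is where regex-based extraction tends to break.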

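【Solution 2】:

file_get_contents() can take a remote URL and give you the page source. You can then use regular expressions (the Perl-compatible preg_* functions) to grab what you need.

Out of curiosity, what are you trying to scrape?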
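
The file_get_contents() plus PCRE approach described in this answer could be sketched roughly as follows. The helper names fetchPage() and extractTitle(), the User-Agent string and the timeout are assumptions for illustration, and fetching a remote URL this way only works when allow_url_fopen is enabled.

```php
<?php
// Sketch: fetch a page with file_get_contents() and pull a value out with PCRE.
// fetchPage() and extractTitle() are hypothetical helper names.

function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    // A stream context lets file_get_contents() send headers and enforce a timeout.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,
            'header'  => "User-Agent: example-scraper/0.1\r\n",
        ],
    ]);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

function extractTitle(string $html): ?string
{
    // i = case-insensitive, s = dot also matches newlines; .*? keeps the match short.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

// Usage against an inline string; swap in fetchPage($url) for a live page.
$html = "<html><head><title>\n  Tom &amp; Jerry </title></head><body></body></html>";
var_dump(extractTitle($html));
```

A regex like this is workable for a single, well-delimited value such as the title; for anything nested, an HTML parser is usually the safer choice.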

【Comments】:

【Solution 3】:

I would use either libcurl or Perl's LWP (libwww for Perl). Is there a libwww for PHP?

【Comments】:

If you are going to use LWP, use WWW::Mechanize, which wraps it with handy helper functions.
Mechanize is also available for Ruby, if you are open to something other than PHP.

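【Solution 4】:

Scraping generally encompasses 3 steps:

1. First, you GET or POST your request to a specified URL.
2. Next, you receive the HTML that is returned as the response.
3. Finally, you parse out of that HTML the text you'd like to scrape.

To accomplish steps 1 and 2, below is a simple PHP class which uses cURL to fetch web pages using either GET or POST. After you get the HTML back, you just use regular expressions to accomplish step 3 by parsing out the text you'd like to scrape.

For regular expressions, my favorite tutorial site is the following: Regular Expressions Tutorial

My favorite program for working with regular expressions is Regex Buddy. I would advise you to try the demo of that product even if you have no intention of buying it. It is an invaluable tool and will even generate code for the regexes you build, in your language of choice (including PHP).

Usage: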



$curl = new Curl();
$html = $curl->get("http://www.example.com/"); // placeholder URL

// now, do your regex work against $html

PHP class:



<?php

class Curl
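{
    // cURL handle for the current request; created in get() / postForm().
    // Declared explicitly because dynamic properties are deprecated as of PHP 8.2.
    public $curl = null;

    // Path of the file used to store cookies between requests.
    public $cookieJar = "";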

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup()
    {


        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] =  "Cache-Control: max-age=0";
        $header[] =  "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.


        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); 
        curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl,CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl,CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl,CURLOPT_RETURNTRANSFER, true);  
    }


    function get($url)
    { 
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    function getAll($reg,$str)
    {
        preg_match_all($reg,$str,$matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer='')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

?>

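【Comments】:

Hmm, parsing HTML with regular expressions is... well, I'll just let this guy explain: stackoverflow.com/questions/1732348/…
curl_setopt($this->curl,CURLOPT_COOKIEJAR, $this->cookieJar); curl_setopt($this->curl,CURLOPT_COOKIEFILE, $this->cookieJar);
Also, if you need to GET or POST several forms on the same site, $this->curl = curl_init($url); will be a problem, because it opens a new session every time. This init is used in both the get function and the postForm function.
Excellent code. There is a bug with cookiejar and cookiefile, though: replace $cookieJar with $this->$cookieJar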
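
One way to address the point about curl_init() opening a new session on every call is to create the handle once and reuse it. The following is a minimal sketch under that assumption; CurlSession, its method names, and the URLs and field names in the usage comment are hypothetical and not part of the answer's class.

```php
<?php
// Sketch: one cURL handle reused for every request, so cookies and
// connections persist. CurlSession is a hypothetical class name.
class CurlSession
{
    private $handle;

    public function __construct(string $cookieFile = '')
    {
        $this->handle = curl_init();
        curl_setopt_array($this->handle, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_AUTOREFERER    => true,
            // An empty string turns on the cookie engine without reading a file.
            CURLOPT_COOKIEFILE     => $cookieFile,
        ]);
        if ($cookieFile !== '') {
            // Also write cookies back to the file when the handle is closed.
            curl_setopt($this->handle, CURLOPT_COOKIEJAR, $cookieFile);
        }
    }

    public function get(string $url)
    {
        curl_setopt($this->handle, CURLOPT_HTTPGET, true); // switch back after a POST
        curl_setopt($this->handle, CURLOPT_URL, $url);
        return curl_exec($this->handle);
    }

    public function post(string $url, array $fields)
    {
        curl_setopt($this->handle, CURLOPT_POST, true);
        curl_setopt($this->handle, CURLOPT_POSTFIELDS, http_build_query($fields));
        curl_setopt($this->handle, CURLOPT_URL, $url);
        return curl_exec($this->handle);
    }

    public function lastUrl(): string
    {
        return (string) curl_getinfo($this->handle, CURLINFO_EFFECTIVE_URL);
    }
}

// Usage (the URLs and field names are placeholders):
// $session = new CurlSession();
// $session->post('http://www.example.com/login', ['user' => 'me', 'pass' => 'secret']);
// $html = $session->get('http://www.example.com/account');
```

Since the same handle carries the cookie engine across calls, a login POST followed by a GET behaves like one browsing session.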

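【Solution 5】:

If you need something that is easy to maintain rather than fast to execute, a scriptable browser such as SimpleTest's may help.

【Comments】:

【Solution 6】:

The scraper class from my framework: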

<?php

/*
    Example:

    $site = $this->load->cls('scraper', 'http://www.anysite.com');
    $excss = $site->getExternalCSS();
    $incss = $site->getInternalCSS();
    $ids = $site->getIds();
    $classes = $site->getClasses();
    $spans = $site->getSpans(); 

    print '<pre>';
    print_r($excss);
    print_r($incss);
    print_r($ids);
    print_r($classes);
    print_r($spans);        

*/

class scraper
{
    private $url = '';

    public function __construct($url)
    {
        $this->url = file_get_contents("$url");
    }

    public function getInternalCSS()
    {
        $tmp = preg_match_all('/(style=")(.*?)(")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getExternalCSS()
    {
        // [^"]* keeps each match inside a single href attribute value.
        $tmp = preg_match_all('/(href=")(\w[^"]*\.css)"/i', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getIds()
    {
        $tmp = preg_match_all('/(id="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getClasses()
    {
        $tmp = preg_match_all('/(class="(\w*)")/is', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

    public function getSpans(){
        // Lazy (.*?) so that several <span> elements on one line are matched separately.
        $tmp = preg_match_all('/(<span>)(.*?)(<\/span>)/', $this->url, $patterns);
        $result = array();
        array_push($result, $patterns[2]);
        array_push($result, count($patterns[2]));
        return $result;
    }

}
?>
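
For comparison, the id and class collection done by the class above could also be sketched with the DOM extension, which copes with class attributes that hold several names or names containing hyphens (cases that a pattern like class="(\w*)" skips). collectAttributes() is a hypothetical helper written for this illustration and is not part of the framework.

```php
<?php
// Sketch: collect id and class names with the DOM extension.
// collectAttributes() is a hypothetical helper, not part of the class above.
function collectAttributes(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $ids = [];
    $classes = [];
    foreach ($xpath->query('//*[@id]') as $node) {
        $ids[] = $node->getAttribute('id');
    }
    foreach ($xpath->query('//*[@class]') as $node) {
        // A class attribute may hold several whitespace-separated names.
        $names = preg_split('/\s+/', trim($node->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($names as $name) {
            $classes[$name] = true; // array keys de-duplicate the names
        }
    }
    return ['ids' => $ids, 'classes' => array_keys($classes)];
}

// Usage against an inline sample string.
$sample = '<div id="main" class="box wide"><span class="box note-1">hi</span></div>';
print_r(collectAttributes($sample));
```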

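【Comments】:

【Solution 7】:

Here is another one: a simple PHP Scraper without Regex.

【Comments】:

【Solution 8】:

ScraperWiki is a pretty interesting project. It helps you build scrapers online in Python, Ruby or PHP - I was able to get a simple attempt up and running in a few minutes.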

【Comments】:

【Solution 9】:

I recommend Goutte, a simple PHP Web Scraper.

Example usage:

Create a Goutte client instance (which extends Symfony\Component\BrowserKit\Client):

use Goutte\Client;

$client = new Client();

Make requests with the request() method:

$crawler = $client->request('GET', 'http://www.symfony-project.org/');

The request() method returns a Crawler object (Symfony\Component\DomCrawler\Crawler).

Click on links:

$link = $crawler->selectLink('Plugins')->link();
$crawler = $client->click($link);

Submit forms:

$form = $crawler->selectButton('sign in')->form();
$crawler = $client->submit($form, array('signin[username]' => 'fabien', 'signin[password]' => 'xxxxxx'));

Extract data:

$nodes = $crawler->filter('.error_list');

if ($nodes->count())
{
  die(sprintf("Authentication error: %s\n", $nodes->text()));
}

printf("Nb tasks: %d\n", $crawler->filter('#nb_tasks')->text());

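【Comments】:

【Solution 10】:

Scraping can be quite complex, depending on what you want to do. Have a read of the tutorial series The Basics Of Writing A Scraper In PHP and see if you can get to grips with it.

You can use similar methods to automate form sign-ups, logins, even fake clicking on ads! The main limitation of using cURL, though, is that it does not run JavaScript, so if you are trying to scrape a site that uses AJAX for pagination, for example, it can get a little tricky... but again, there are ways around that!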
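
One common way around the JavaScript limitation, when a site loads its pages from a JSON endpoint, is to request that endpoint directly and decode the response. The sketch below assumes a response shape with "items" and "next_page" keys and a hypothetical /api/products?page=N style endpoint; both are invented for illustration, so the real names have to be read from the site's network traffic.

```php
<?php
// Sketch: page through a JSON endpoint instead of rendering JavaScript.
// The response shape ("items", "next_page") is an assumption for illustration.
function parsePage(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new RuntimeException('Response is not valid JSON: ' . json_last_error_msg());
    }
    return [
        'items'    => $data['items'] ?? [],
        'nextPage' => $data['next_page'] ?? null,
    ];
}

// $fetch is any callable that returns the response body for a page number,
// for example a closure wrapping cURL against a hypothetical /api/products?page=N.
function scrapeAllPages(callable $fetch, int $maxPages = 50): array
{
    $items = [];
    $page = 1;
    // Stop when the endpoint reports no further page or the safety cap is hit.
    for ($i = 0; $i < $maxPages && $page !== null; $i++) {
        $result = parsePage($fetch($page));
        $items = array_merge($items, $result['items']);
        $page = $result['nextPage'];
    }
    return $items;
}

// Usage with canned responses standing in for the network:
$fake = [
    1 => '{"items": ["a", "b"], "next_page": 2}',
    2 => '{"items": ["c"], "next_page": null}',
];
print_r(scrapeAllPages(function ($page) use ($fake) {
    return $fake[$page];
}));
```

Passing the fetcher in as a callable keeps the paging logic testable without network access; the $maxPages cap guards against an endpoint that never stops reporting a next page.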

【Comments】: