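Over the past few days I used a PHP crawler framework to scrape some data and found it quite convenient. First, the framework's documentation: phpspider framework documentation
The usage is explained clearly in the documentation, and the demos include working examples, so I am just posting my own code here as a note.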
<?php
include "./autoloader.php";
use phpspider\core\phpspider;

/* Do NOT delete this comment */
/* 不要删除这段注释 */

$configs = array(
    'name' => '中国保温网',
    'domains' => array(
        'www.cnbaowen.net',
        'cnbaowen.net'
    ),
    // Entry page where the crawl starts
    'scan_urls' => array(
        'http://www.cnbaowen.net/news/list-3720-1.html'
    ),
    // Write the extracted fields to a database table
    'export' => array(
        'type' => 'db',
        'table' => 'articles_mc',
    ),
    'db_config' => array(
        'host' => '127.0.0.1',
        'port' => 3306,
        'user' => 'root',
        'pass' => '123456',
        'name' => 'spider',
    ),
    // URLs of article (content) pages
    'content_url_regexes' => array(
        "http://www.cnbaowen.net/news/show-\d+.html"
    ),
    // URLs of list pages
    'list_url_regexes' => array(
        "http://www.cnbaowen.net/news/list-3720-\d+.html"
    ),
    'fields' => array(
        array(
            // Extract the article title from the content page
            'name' => "title",
            'selector' => "//h1[@id='title']",
            'required' => true
        ),
        array(
            // Extract the article body from the content page
            'name' => "content",
            'selector' => "//div[@id='content']",
            'required' => true
        ),
        array(
            // Not scraped; the value is assigned in on_extract_field
            'name' => "type"
        ),
        array(
            // Not scraped; the value is assigned in on_extract_field
            'name' => "site_id"
        ),
    ),
);

$spider = new phpspider($configs);

// Queue list pages 2 through 23
$spider->on_list_page = function($page, $content, $spider){
    for ($i = 2; $i < 24; $i++) {
        $url = "http://www.cnbaowen.net/news/list-3720-{$i}.html";
        $spider->add_url($url);
    }
};

$spider->on_extract_field = function($fieldname, $data, $page){
    if($fieldname == "type"){
        // Fixed article type
        return 2;
    }elseif($fieldname == "content"){
        // Remove the right-floated div
        $s = preg_replace("/<div style=\"float:right[\s\S]*?div>/","",$data);
        // Point every link at '#'
        $s = preg_replace('/<a .*?href="(.*?)".*?>/is',"<a href='#'>",$s);
        // Remove images
        $data = preg_replace('/<img.*?>/is',"",$s);
        return $data;
    }elseif($fieldname == "site_id"){
        // Fixed site ID
        return 1;
    }else{
        return $data;
    }
};

$spider->start();
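As a quick sanity check of the two URL patterns in the config, the sketch below classifies a few URLs without running the spider. `classify_url()` is a hypothetical helper, and wrapping each pattern in `#^...$#i` is my own assumption for this test; it is not necessarily how phpspider applies the patterns internally.

```php
<?php
// Patterns copied from the config.
$content_regex = "http://www.cnbaowen.net/news/show-\d+.html";
$list_regex    = "http://www.cnbaowen.net/news/list-3720-\d+.html";

// Hypothetical helper: returns 'list', 'content' or 'other'.
// Assumption: the pattern is anchored and delimited with '#'.
function classify_url(string $url, string $list_regex, string $content_regex): string
{
    if (preg_match("#^{$list_regex}$#i", $url)) {
        return 'list';
    }
    if (preg_match("#^{$content_regex}$#i", $url)) {
        return 'content';
    }
    return 'other';
}

$samples = array(
    'http://www.cnbaowen.net/news/list-3720-5.html',
    'http://www.cnbaowen.net/news/show-12345.html',
    'http://www.cnbaowen.net/about.html',
);

foreach ($samples as $url) {
    echo classify_url($url, $list_regex, $content_regex), "  ", $url, PHP_EOL;
}
```

Running a handful of real URLs through a check like this before starting a crawl makes it easier to spot a pattern that matches nothing.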
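Note: one thing needs explaining here. When scraping the pages I only need the title and the content, but when saving to the database I need two more fields. That is why the field definitions include the extra `type` and `site_id` fields. They have no selector; their actual values are assigned in the `on_extract_field` callback.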
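The per-field logic can be tried outside the spider. The following is a minimal sketch that mirrors the branches of the `on_extract_field` callback; `clean_field()` is a hypothetical name of my own, and the sample HTML is made up for illustration.

```php
<?php
// Standalone sketch of the per-field logic; phpspider is not required.
// clean_field() is a hypothetical helper, not part of phpspider.
function clean_field(string $fieldname, $data)
{
    switch ($fieldname) {
        case 'type':
            // Not scraped: fixed article type
            return 2;
        case 'site_id':
            // Not scraped: fixed site ID
            return 1;
        case 'content':
            // Remove the right-floated div
            $s = preg_replace('/<div style="float:right[\s\S]*?div>/', '', $data);
            // Point every link at '#'
            $s = preg_replace('/<a .*?href="(.*?)".*?>/is', "<a href='#'>", $s);
            // Remove images
            return preg_replace('/<img.*?>/is', '', $s);
        default:
            // All other fields pass through unchanged
            return $data;
    }
}

// Made-up sample markup
$html = '<div style="float:right">ad</div>'
      . '<p><a class="x" href="http://example.com/a">link</a><img src="a.png"></p>';

echo clean_field('content', $html), PHP_EOL;
// prints: <p><a href='#'>link</a></p>
```

Keeping this logic in a plain function makes the regular expressions easy to test against saved pages before a full crawl.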
The accompanying SQL statement:
CREATE TABLE `articles_mc` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(200) DEFAULT NULL,
  `content` text,
  `type` int(5) DEFAULT '0' COMMENT 'Article type: 1 = industry news, 2 = technical material',
  `site_id` int(5) DEFAULT NULL COMMENT 'Site ID',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=4887 DEFAULT CHARSET=utf8mb4;