Goal: crawl the names and addresses of newspapers nationwide
Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm
Purpose: practice scraping data with Scrapy
Having learned the basics of Scrapy, let's write a minimal spider.
Target screenshot: (image not preserved)
1. Create the Scrapy project
$ cd ~/code/crawler/scrapyProject
$ scrapy startproject newSpapers
2. Generate the spider
$ cd newSpapers/
$ scrapy genspider nationalNewspaper news.xinhuanet.com
3. Define the item fields
$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()
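A Scrapy `Item` behaves like a dict whose keys are restricted to the declared fields. To see what that buys us without a Scrapy install, here is a minimal dict-based stand-in (the class name and check are invented for illustration; the real behavior comes from `scrapy.Item`):

```python
# Hypothetical stand-in for NewspapersItem above (illustration only;
# the real class is built from scrapy.Item / scrapy.Field).
class NewspapersItemSketch(dict):
    FIELDS = ("name", "addr")  # mirrors the two scrapy.Field() declarations

    def __setitem__(self, key, value):
        # Like scrapy.Item, reject fields that were never declared.
        if key not in self.FIELDS:
            raise KeyError("undeclared field: %r" % key)
        dict.__setitem__(self, key, value)

item = NewspapersItemSketch()
# Fields hold lists because Scrapy's .extract() returns a list of matches.
item["name"] = ["人民日报"]
item["addr"] = ["http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm"]
print(item["name"][0])
```

Assigning to an undeclared key (say `item["title"]`) raises `KeyError`, which is exactly the typo protection the field declarations provide.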
4. Write the spider
$ cat spiders/nationalNewspaper.py
# -*- coding: utf-8 -*-
import scrapy
from newSpapers.items import NewspapersItem


class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

    def parse(self, response):
        # Row 2 of the table holds the national papers, row 4 the local ones.
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')  # extracted but not used below
        # Each <a> wraps the paper name in <strong> and carries its URL in @href.
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        items = []
        for each in tags_a_country:
            item = NewspapersItem()
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            items.append(item)
        return items
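The two XPath steps in `parse()` boil down to: locate the `<a>` tags, take the text of the nested `<strong>` as the paper's name, and the `href` attribute as its address. The same selection can be sketched with the standard library's `xml.etree.ElementTree` on a stripped-down fragment of the target table (the fragment is invented for illustration; the real page has more nesting):

```python
import xml.etree.ElementTree as ET

# Invented fragment mirroring what the spider's XPath expects:
# .../p/a/strong holds the paper name, the <a>'s href holds the address.
fragment = """
<td><p>
  <a href="http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm"><strong>人民日报</strong></a>
  <a href="http://www.gmw.cn/01gmrb/2007-09/20/default.htm"><strong>光明日报</strong></a>
</p></td>
"""

root = ET.fromstring(fragment)
items = []
for a in root.findall("./p/a"):           # plays the role of tags_a_country
    items.append({
        "name": a.findtext("./strong"),   # './strong/text()' in Scrapy
        "addr": a.get("href"),            # './@href' in Scrapy
    })
print(items[0])
```

Scrapy's selectors return a *list* from `.extract()`, which is why the pipeline later indexes `item['name'][0]`.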
5. Configure who handles the scraped items
$ cat settings.py
……
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
}
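The `100` is the pipeline's priority: Scrapy accepts integers in the 0–1000 range and runs enabled pipelines in ascending order, so lower numbers run earlier. A small sketch of that ordering with two invented stand-in classes (plain Python, no Scrapy needed):

```python
# Invented stand-in classes to illustrate ITEM_PIPELINES ordering.
class CleanupPipeline:   # e.g. normalize fields first
    pass

class SavePipeline:      # then persist to disk
    pass

ITEM_PIPELINES = {
    CleanupPipeline: 100,
    SavePipeline: 300,
}

# Scrapy sorts pipelines by the priority value, ascending:
order = [cls.__name__
         for cls, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)  # CleanupPipeline runs before SavePipeline
```

With a single pipeline, as here, any value works; the convention of leaving gaps (100, 300, …) makes it easy to slot new pipelines in between later.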
6. Write the item pipeline
$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class NewspapersPipeline(object):
    def process_item(self, item, spider):
        now = time.strftime('%Y-%m-%d', time.localtime())  # date stamp (not used below)
        filename = 'newspaper.txt'
        print '================='
        print item
        print '================='
        with open(filename, 'a') as fp:
            fp.write(item['name'][0].encode("utf8") + '\t' + item['addr'][0].encode("utf8") + '\n')
        return item
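The pipeline above is Python 2 (`print` statements, manual `.encode`). A Python 3 sketch of the same logic, with the class name and output path invented for the demo (Python 3 text-mode files take `encoding=` directly, so the `.encode("utf8")` calls go away):

```python
import os
import tempfile

class NewspapersPipelineP3:
    """Hypothetical Python 3 rewrite of NewspapersPipeline above."""

    def __init__(self, filename='newspaper.txt'):
        self.filename = filename

    def process_item(self, item, spider):
        # .extract() returned lists, so take the first match of each field.
        with open(self.filename, 'a', encoding='utf8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item

# Usage with a plain dict standing in for the Scrapy item:
path = os.path.join(tempfile.gettempdir(), 'newspaper_demo.txt')
open(path, 'w').close()  # start the demo file fresh
pipe = NewspapersPipelineP3(path)
pipe.process_item({'name': ['人民日报'],
                   'addr': ['http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm']},
                  spider=None)
with open(path, encoding='utf8') as fp:
    print(fp.read())
```

`process_item` must return the item (or raise `DropItem`) so that later pipelines in the chain still receive it.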
7. Inspect the results
After running the spider (scrapy crawl nationalNewspaper), the output file contains:
$ cat spiders/newspaper.txt
人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报	http://www.economicdaily.com.cn/no1/
解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/