Goal: crawl the names and addresses of newspapers nationwide
Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm
Purpose: practice scraping data with Scrapy
Having learned the basics of Scrapy, let's write a minimal spider.
Target screenshot: (image not preserved)
1. Create the Scrapy project
$ cd ~/code/crawler/scrapyProject
$ scrapy startproject newSpapers
2. Generate the spider
$ cd newSpapers/
$ scrapy genspider nationalNewspaper news.xinhuanet.com
3. Define the item fields
$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()
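A Scrapy `Item` behaves like a dict whose keys are restricted to the declared fields. To see what that buys us without a Scrapy install, here is a minimal dict-based stand-in (the class name and check are invented for illustration; the real behavior comes from `scrapy.Item`):

```python
# Hypothetical stand-in for NewspapersItem above (illustration only;
# the real class is built from scrapy.Item / scrapy.Field).
class NewspapersItemSketch(dict):
    FIELDS = ("name", "addr")  # mirrors the two scrapy.Field() declarations

    def __setitem__(self, key, value):
        # Like scrapy.Item, reject fields that were never declared.
        if key not in self.FIELDS:
            raise KeyError("undeclared field: %r" % key)
        dict.__setitem__(self, key, value)

item = NewspapersItemSketch()
# Fields hold lists because Scrapy's .extract() returns a list of matches.
item["name"] = ["人民日报"]
item["addr"] = ["http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm"]
print(item["name"][0])
```

Assigning to an undeclared key (say `item["title"]`) raises `KeyError`, which is exactly the typo protection the field declarations provide.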
4. Write the spider
$ cat spiders/nationalNewspaper.py
# -*- coding: utf-8 -*-
import scrapy
from newSpapers.items import NewspapersItem


class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

    def parse(self, response):
        # Row 2 of the table holds the national papers, row 4 the local ones.
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')  # extracted but not used below
        # Each <a> wraps the paper name in <strong> and carries its URL in @href.
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        items = []
        for each in tags_a_country:
            item = NewspapersItem()
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            items.append(item)
        return items
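The two XPath steps in `parse()` boil down to: locate the `<a>` tags, take the text of the nested `<strong>` as the paper's name, and the `href` attribute as its address. The same selection can be sketched with the standard library's `xml.etree.ElementTree` on a stripped-down fragment of the target table (the fragment is invented for illustration; the real page has more nesting):

```python
import xml.etree.ElementTree as ET

# Invented fragment mirroring what the spider's XPath expects:
# .../p/a/strong holds the paper name, the <a>'s href holds the address.
fragment = """
<td><p>
  <a href="http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm"><strong>人民日报</strong></a>
  <a href="http://www.gmw.cn/01gmrb/2007-09/20/default.htm"><strong>光明日报</strong></a>
</p></td>
"""

root = ET.fromstring(fragment)
items = []
for a in root.findall("./p/a"):           # plays the role of tags_a_country
    items.append({
        "name": a.findtext("./strong"),   # './strong/text()' in Scrapy
        "addr": a.get("href"),            # './@href' in Scrapy
    })
print(items[0])
```

Scrapy's selectors return a *list* from `.extract()`, which is why the pipeline later indexes `item['name'][0]`.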
5. Configure who handles the scraped items
$ cat settings.py
……
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
}
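The `100` is the pipeline's priority: Scrapy accepts integers in the 0–1000 range and runs enabled pipelines in ascending order, so lower numbers run earlier. A small sketch of that ordering with two invented stand-in classes (plain Python, no Scrapy needed):

```python
# Invented stand-in classes to illustrate ITEM_PIPELINES ordering.
class CleanupPipeline:   # e.g. normalize fields first
    pass

class SavePipeline:      # then persist to disk
    pass

ITEM_PIPELINES = {
    CleanupPipeline: 100,
    SavePipeline: 300,
}

# Scrapy sorts pipelines by the priority value, ascending:
order = [cls.__name__
         for cls, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)  # CleanupPipeline runs before SavePipeline
```

With a single pipeline, as here, any value works; the convention of leaving gaps (100, 300, …) makes it easy to slot new pipelines in between later.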
6. Write the item pipeline
$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class NewspapersPipeline(object):
    def process_item(self, item, spider):
        now = time.strftime('%Y-%m-%d', time.localtime())  # date stamp (not used below)
        filename = 'newspaper.txt'
        print '================='
        print item
        print '================='
        with open(filename, 'a') as fp:
            fp.write(item['name'][0].encode("utf8") + '\t' + item['addr'][0].encode("utf8") + '\n')
        return item
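The pipeline above is Python 2 (`print` statements, manual `.encode`). A Python 3 sketch of the same logic, with the class name and output path invented for the demo (Python 3 text-mode files take `encoding=` directly, so the `.encode("utf8")` calls go away):

```python
import os
import tempfile

class NewspapersPipelineP3:
    """Hypothetical Python 3 rewrite of NewspapersPipeline above."""

    def __init__(self, filename='newspaper.txt'):
        self.filename = filename

    def process_item(self, item, spider):
        # .extract() returned lists, so take the first match of each field.
        with open(self.filename, 'a', encoding='utf8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item

# Usage with a plain dict standing in for the Scrapy item:
path = os.path.join(tempfile.gettempdir(), 'newspaper_demo.txt')
open(path, 'w').close()  # start the demo file fresh
pipe = NewspapersPipelineP3(path)
pipe.process_item({'name': ['人民日报'],
                   'addr': ['http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm']},
                  spider=None)
with open(path, encoding='utf8') as fp:
    print(fp.read())
```

`process_item` must return the item (or raise `DropItem`) so that later pipelines in the chain still receive it.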
7. Inspect the results
After running the spider (scrapy crawl nationalNewspaper), the output file contains:
$ cat spiders/newspaper.txt
人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报	http://www.economicdaily.com.cn/no1/
解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/