【Title】: CrawlSpider rule not working
【Posted】: 2017-03-17 14:46:50
【Question】:

I am trying to build a spider with Python's Scrapy framework to crawl course data from the New York Institute of Technology. Below is my spider (nyitspider.py). Can someone tell me where I am going wrong?

from scrapy.spiders import CrawlSpider, Rule, BaseSpider, Spider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from nyit_sample.items import NyitSampleItem


class nyitspider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    rules = (
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses', ),
        )),
        Rule(LxmlLinkExtractor(
            allow=('.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9]    [0-9]/', ),
        ), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Course()
        item["institute"] = 'New York Institute of Technology'
        item['site'] = 'www.nyit.edu'
        item['title'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[2]/a').extract()[0]
        item['id'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[1]/a').extract()[0]
        item['credits'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[1]/td[3]').extract()[0]
        item['description'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[2]/td/text()[1]').extract()[0]

        yield item

【Comments】:

  • What can we make of this? 2017-03-17 07:20:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026 2017-03-17 07:20:59 [scrapy.core.engine] DEBUG: Crawled (200) nyit.edu/academics/courses> (referer: None) ['cached']
  • First, you can remove the tbody tag from all of your XPath expressions. It is inserted by the browser; the page response does not contain it. Also try changing the regex in your second rule to r'\/academics\/courses\/(.*)' (and you could drop the first rule as well).
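The commenter's tbody point can be illustrated with a small self-contained sketch. This uses the standard library's ElementTree on a tiny hand-written snippet (an assumption standing in for the real page; Scrapy itself uses lxml-based selectors): when the server's HTML contains no `<tbody>` element, any path that includes `tbody` matches nothing.

```python
import xml.etree.ElementTree as ET

# Raw HTML as many servers send it: a table with rows but no <tbody>.
# (Hypothetical snippet; the browser's DOM inspector would *show* a tbody here.)
html = ('<html><body>'
        '<table id="course_catalog_table"><tr><td>Course title</td></tr></table>'
        '</body></html>')
root = ET.fromstring(html)

# A path that assumes the browser-inserted <tbody> finds nothing...
print(root.findall('.//table/tbody/tr'))  # []
# ...while the same path without tbody finds the row.
print(root.findall('.//table/tr'))        # one <tr> element
```

The same reasoning applies to the spider's `response.xpath(...)` calls: dropping `/tbody` from each expression makes them match the raw response.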

Tags: python xpath web-scraping web-crawler scrapy-spider


【Solution 1】:

You have to instantiate the item correctly in the parse_item method, and the method should return something. Here is a suggestion, but you will have to improve on it:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule, BaseSpider, Spider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

from nyit_sample.items import NyitSampleItem


class nyitspider(CrawlSpider):
    name = 'nyitspider'
    allowed_domains = ['nyit.edu']
    start_urls = ['http://www.nyit.edu/academics/courses/']

    rules = (
        Rule(LxmlLinkExtractor(
             allow=('.*/academics/courses', ),
        ), callback='parse_item'),   
        Rule(LxmlLinkExtractor(
             allow=('.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9][0-9]/', ),
        ), callback='parse_item'),

    )

    def parse_item(self, response):
        item = NyitSampleItem()
        item['institute'] = 'New York Institute of Technology'
        item['site'] = 'www.nyit.edu'
        item['title'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[2]/a)').extract()[0]
        item['id'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[1]/a)').extract()[0]
        item['credits'] = response.xpath('string(//*[@id="course_catalog_table"]/tbody/tr[1]/td[3])').extract()[0]
        item['description'] = response.xpath('//*[@id="course_catalog_table"]/tbody/tr[2]/td/text()[1]').extract()[0]
        return item
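One more thing worth double-checking: the allow pattern, as posted in the question, contains literal space characters between the last two digit classes (likely a copy-paste artifact). Scrapy's link extractors apply allow patterns with re.search, so those spaces would stop every course URL from matching. A quick sketch against a hypothetical course URL (the three-letters/two-letters/three-digits shape is an assumption read off the pattern itself):

```python
import re

# Pattern exactly as it appears in the question, stray spaces included:
broken = r'.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9]    [0-9]/'
# Same pattern with the spaces removed:
fixed = r'.*/academics/courses/[a-z][a-z][a-z]-[a-z][a-z]-[0-9][0-9][0-9]/'

# Hypothetical URL in the shape the pattern is trying to describe:
url = 'http://www.nyit.edu/academics/courses/abc-de-123/'

print(re.search(broken, url))  # None -- the literal spaces never match
print(re.search(fixed, url))   # a match object
```

If the spaces are only an artifact of posting the question, this is a non-issue; if they are in the real spider file, the second rule can never fire.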

【Discussion】:
