【Question title】: Scrapy crawler is scraping only one item, not all
【Posted】: 2020-06-27 02:14:20
【Question】:

I am trying to scrape the items that start with A from this page using the following code.

import scrapy
from scrapy.selector import Selector
from ..items import RozeepkItem
class JobcatsSpider(scrapy.Spider):
    name = 'jobcats'
    allowed_domains = ['www.rozee.pk']
    start_urls = ['https://www.rozee.pk/jobs-by-industry']

    

    def parse(self, response):
        items = RozeepkItem()
        for job_cat in Selector(response).xpath("//div[@class = 'boxb job-dtl sitemap']"):
            category_title =  job_cat.xpath(".//div[@id = 'A-block']/div[@class = 'row']/ul/li/a/@title").get()
            url = job_cat.xpath(".//div[@id = 'A-block']/div[@class = 'row']/ul/li/a/@href").get()

            items['job_category'] = category_title
            items['url_str'] = url

            yield items

Here is items.py:

import scrapy
class RozeepkItem(scrapy.Item):
    job_category = scrapy.Field()
    url_str = scrapy.Field()

It produces the following output:

2020-06-27 06:55:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.rozee.pk/jobs-by-industry>
{'job_category': 'Accounting Jobs in Pakistan',
 'url_str': '//www.rozee.pk/search/accounting-jobs-in-pakistan'}
2020-06-27 06:55:00 [scrapy.core.engine] INFO: Closing spider (finished)
2020-06-27 06:55:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 16977,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.227556,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 6, 27, 1, 55, 0, 379725),
 'item_scraped_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2020, 6, 27, 1, 54, 58, 152169)}
2020-06-27 06:55:00 [scrapy.core.engine] INFO: Spider closed (finished)

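As you can see, I get only one item and its corresponding link. However, if I try the same XPath in the browser, it matches all of the entries, as shown in the screenshot.

Can someone help me find where I am going wrong? Thanks.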

【Comments】:

    Tags: python xpath scrapy


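    【Solution 1】:

    From the documentation:

    .get() always returns a single result; if there are several matches, the content of the first match is returned; if there are no matches, None is returned. .getall() returns a list with all results.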

    So, use .getall() instead of .get() in your code.

    【Comments】: