【Title】: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
【Posted】: 2020-05-21 18:57:48
【Question】:

I want to scrape restaurant data from https://wolt.com/ru/kaz/almaty, visiting each restaurant page through URLs such as https://wolt.com/ru/kaz/almaty/restaurant/la-pizza-2. Here is my code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from satu.items import QuoteItem
from scrapy.linkextractors import LinkExtractor


class QuotesSpiderSpider(CrawlSpider):
    name = 'wolt'
    allowed_domains = ['wolt.com']
    start_urls = ['https://wolt.com/ru/kaz/almaty/restaurant/']

    handle_httpstatus_list = [404, 302]

    rules = (
        Rule(LinkExtractor(allow=('/restaurant/')), callback='parse_item'))

    def parse_item(self, response):

        try:
            title = response.xpath(
                ".//div[@class='VenueHeroBanner__container___1_lK2']/h1[@class='VenueHeroBanner__title___2EzpN']//text()").get()
        except:
            title = ['']

        try:
            time = response.xpath(
                ".//div[@class='VenueSide__infoLine___jrSHX']/div[@class='VenueSide__hours___122Zm']//text()").get()
        except:
            time = ['']

        item = QuoteItem()
        item["title"] = title
        item["time"] = time

        yield item

However, it doesn't scrape any data, and I can't figure out where the problem is. The output looks like this:

2020-05-22 00:13:40 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: satu)
2020-05-22 00:13:40 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.6.0 (v3.6.0:41df79263a11, Dec 2
3 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.1, Platform Windows-10-10.0.18362-SP0
2020-05-22 00:13:40 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-22 00:13:40 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'satu',
 'CONCURRENT_REQUESTS': 32,
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 3,
 'HTTPCACHE_IGNORE_HTTP_CODES': [301, 302],
 'NEWSPIDER_MODULE': 'satu.spiders',
 'REDIRECT_ENABLED': False,
 'RETRY_HTTP_CODES': [500, 503, 504, 400, 403, 404, 408, 429],
 'RETRY_TIMES': 1,
 'SPIDER_MODULES': ['satu.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/55.0.2919.83 Safari/537.36'}
2020-05-22 00:13:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-22 00:13:40 [scrapy.core.engine] INFO: Spider opened
2020-05-22 00:13:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-22 00:13:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-22 00:13:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wolt.com/ru/kaz/almaty/restaurant/> (failed 1 times): 404 Not Found
2020-05-22 00:13:44 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://wolt.com/ru/kaz/almaty/restaurant/> (failed 2 times): 404 Not Found
2020-05-22 00:13:44 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://wolt.com/ru/kaz/almaty/restaurant/> (failed 2 times): 404 Not Found
2020-05-22 00:13:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://wolt.com/ru/kaz/almaty/restaurant/> (referer: None)
2020-05-22 00:13:44 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-22 00:13:44 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 622,
 ...
 'start_time': datetime.datetime(2020, 5, 21, 18, 13, 40, 862817)}
2020-05-22 00:13:44 [scrapy.core.engine] INFO: Spider closed (finished)

【Comments】:

    Tags: python web-scraping scrapy scrape


    【Solution 1】:

    The problem is that your spider starts from https://wolt.com/ru/kaz/almaty/restaurant/, which is a 404 (Not Found) page. You should change start_urls to a URL that actually has data, such as https://wolt.com/ru/kaz/almaty/restaurant/la-pizza-2. Also, since you don't override start_requests, Scrapy falls back to the default parse callback, which your spider never defines. Finally, there is a mistake in your time XPath: it is missing a /div.
    Try this:

    class QuotesSpiderSpider(CrawlSpider):
        name = 'wolt'
        allowed_domains = ['wolt.com']
        start_urls = ['https://wolt.com/ru/kaz/almaty/restaurant/la-pizza-2']
    
        def parse(self, response):
            title = response.xpath(".//div[@class='VenueHeroBanner__container___1_lK2']/h1[@class='VenueHeroBanner__title___2EzpN']/text()").get()
            time = response.xpath(".//div[@class='VenueSide__infoLine___jrSHX']/div[@class='VenueSide__hours___122Zm']/div/text()").get()
            yield QuoteItem(title=title, time=time)
    

    【Comments】:

    • Thanks for your contribution. However, changing start_urls to wolt.com/ru/kaz/almaty/restaurant/la-pizza-2 did not solve the problem. I still get the same output, and no data is scraped.
    • @SultanAkhmetbek Yes, on closer inspection I found a few more problems. I've edited my answer — please give it a try!
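    A further pitfall in the original spider, worth noting alongside the 404: `rules = (Rule(...))` is not a one-element tuple. In Python, parentheses alone are just grouping; the trailing comma is what makes a tuple, so the original `rules` attribute was a bare Rule object rather than the tuple that CrawlSpider expects to iterate over. A minimal sketch with plain strings (no Scrapy needed) shows the difference:

```python
# Parentheses alone do not create a tuple; the trailing comma does.
rules_wrong = ("single rule")   # just the string itself, parentheses discarded
rules_right = ("single rule",)  # a genuine one-element tuple

print(type(rules_wrong).__name__)  # -> str
print(type(rules_right).__name__)  # -> tuple
```

    So if you keep the CrawlSpider/Rule approach instead of listing each restaurant URL directly, write the attribute as `rules = (Rule(LinkExtractor(allow=r'/restaurant/'), callback='parse_item'),)` — with the trailing comma.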