【Posted at】: 2018-01-23 16:45:25
【Problem description】:
I am trying to scrape different pieces of information from multiple pages of a website. Up to page 16 everything works as expected: the pages are crawled and scraped, and the information is stored in my database. After page 16, however, the spider stops scraping items but keeps crawling pages. I checked the website and it has more than 470 pages of information. The HTML tags are the same, so I don't understand why the scraping stops.
Python:
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join

from ..items import ActiveItem  # assuming the item is defined in the project's items.py


def url_lister():
    url_list = []
    page_count = 1
    while page_count < 480:
        url = 'https://www.active.com/running?page=%s' % page_count
        url_list.append(url)
        page_count += 1
    return url_list


class ListeCourse_level1(scrapy.Spider):
    name = 'ListeCAP_ACTIVE'
    allowed_domains = ['www.active.com']
    start_urls = url_lister()

    def parse(self, response):
        selector = Selector(response)
        for uneCourse in response.xpath('//*[@id="lpf-tabs2-a"]/article/div/div/div/a[@itemprop="url"]'):
            loader = ItemLoader(ActiveItem(), selector=uneCourse)
            loader.add_xpath('nom_evenement', './/div[2]/div/h5[@itemprop="name"]/text()')
            loader.default_input_processor = MapCompose(string)
            loader.default_output_processor = Join()
            yield loader.load_item()
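As an aside, the `url_lister` helper above can be written as a single comprehension. A minimal equivalent sketch (the `last_page` parameter is an illustrative addition; it defaults to the same 1–479 page range as the original loop):

```python
def url_lister(last_page=480):
    """Build the listing URLs for pages 1 .. last_page - 1,
    mirroring the while-loop version above."""
    return ['https://www.active.com/running?page=%d' % page
            for page in range(1, last_page)]

urls = url_lister()
```

This keeps the spider unchanged (`start_urls = url_lister()`) while making the page range a single tunable parameter.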
Shell:
> 2018-01-23 17:22:29 [scrapy.core.scraper] DEBUG: Scraped from <200
> https://www.active.com/running?page=15>
> {
> 'nom_evenement': 'Enniscrone 10k run & 5k run/walk',
> }
> 2018-01-23 17:22:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=16> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:34 [scrapy.extensions.logstats] INFO: Crawled 17 pages (at 17 pages/min), scraped 155 items (at 155 items/min)
> 2018-01-23 17:22:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=17> (referer: None)
>
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=18> (referer: None)
> --------------------------------------------------
> SCRAPING DES ELEMENTS EVENTS
> --------------------------------------------------
> 2018-01-23 17:22:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.active.com/running?page=19> (referer: None)
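The log shows pages 17+ coming back as HTTP 200 but producing no items, which suggests the anchors the XPath targets may simply be absent on those pages. One way to check is `scrapy shell 'https://www.active.com/running?page=17'` followed by the spider's `response.xpath(...)`. The same check can be sketched offline with the standard library on a saved, well-formed page snippet (the helper name and the sample markup below are illustrative, not taken from the site):

```python
import xml.etree.ElementTree as ET

def count_event_links(markup):
    """Count <a itemprop="url"> anchors in a well-formed page snippet.

    A count of 0 on a saved page 17+ would indicate that the listing
    markup (not the spider) is what changes after page 16.
    """
    root = ET.fromstring(markup)
    return len(root.findall('.//a[@itemprop="url"]'))

# Illustrative snippets standing in for saved pages.
page_with_events = ('<div><article>'
                    '<a itemprop="url" href="/e/1"/>'
                    '<a itemprop="url" href="/e/2"/>'
                    '</article></div>')
empty_page = '<div><article><p>No results</p></article></div>'
```

Note that ElementTree requires well-formed XML and supports only a limited XPath subset; for real pages, Scrapy's own selectors in `scrapy shell` are the more reliable tool.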
【Comments】:
-
I ran into a similar problem. Did you wait for Scrapy to finish crawling?
-
Yes, I wait every time.
-
I have a spider running right now that has already crawled 15,783 pages but scraped only 625 of them. It scrapes the links from the first few URLs I supplied in start_urls (a long list of values), but then it stops scraping. I can't find any documentation that explains this behavior.
-
It sounds like the same problem.
标签: python scrapy web-crawler