[Posted]: 2020-04-24 12:16:31
[Problem description]:
I have URLs spanning multiple pages. I try to paginate to extract data from these URLs, but it only works once (only one next_page is followed). What is going wrong?
import json
import scrapy
import re
import pkgutil
from scrapy.loader import ItemLoader
from rzc_spider.items import AnnonceItem


class AnnonceSpider(scrapy.Spider):
    name = 'rzc_results'

    def __init__(self, *args, **kwargs):
        data_file = pkgutil.get_data("rzc_spider", "json/input/test_tt.json")
        self.data = json.loads(data_file)

    def start_requests(self):
        for item in self.data:
            request = scrapy.Request(item['rzc_url'], callback=self.parse)
            request.meta['item'] = item
            yield request

    def parse(self, response):
        item = response.meta['item']
        item['results'] = []
        item["car_number"] = response.css("h2.sub::text").extract_first()
        for caritem in response.css("div.ad > div[itemtype='https://schema.org/Vehicle']"):
            data = AnnonceItem()
            # model
            data["model"] = caritem.css("em.title::text").extract_first()
            item['results'].append(data)
        yield item

        next_page = response.css('a.link::attr(href)').extract_first()
        if next_page is not None:
            url_pagination = 'https://www.websiteexample.com' + next_page
            meta = {'item': response.meta['item']}
            yield scrapy.Request(url=url_pagination, callback=self.parse, meta=meta)

    # ban proxies reaction
    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None
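One thing worth checking in the pagination branch above (the URLs here are illustrative, matching the placeholder domain in the question): hardcoding the site prefix with string concatenation only works when the extracted href is site-relative. If the site emits an absolute href on later pages, concatenation produces a malformed URL, while the standard library's urljoin handles both cases:

```python
from urllib.parse import urljoin

# The page the spider is currently parsing (hypothetical example URL).
base = 'https://www.websiteexample.com/model?page=2'

# Case 1: a site-relative href, as often seen on the first page.
rel = '/model?page=3'
# Concatenation and urljoin agree here.
print('https://www.websiteexample.com' + rel)
print(urljoin(base, rel))

# Case 2: an absolute href, which some sites emit on later pages.
abs_href = 'https://www.websiteexample.com/model?page=3'
# Concatenation now yields a broken URL with two schemes glued together;
# urljoin still returns the correct absolute URL.
print('https://www.websiteexample.com' + abs_href)
print(urljoin(base, abs_href))
```

If this is the cause, replacing the concatenation with `response.urljoin(next_page)` inside the spider would be the equivalent fix, since Scrapy responses expose the same join logic relative to the current page.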
The JSON file with the URLs (a sample in this case):
[{
"rzc_url": "https://www.websiteexample.com/model"
}]
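As a side note, the loading in `__init__` works on a file like this because `pkgutil.get_data` returns bytes and `json.loads` accepts bytes directly on Python 3.6+. A minimal sketch of that pattern, using an inline byte string in place of the packaged file:

```python
import json

# pkgutil.get_data returns bytes; json.loads accepts bytes on Python 3.6+.
raw = b'[{"rzc_url": "https://www.websiteexample.com/model"}]'
data = json.loads(raw)
print(data[0]['rzc_url'])  # https://www.websiteexample.com/model
```

On older Python versions the bytes would have to be decoded first (`json.loads(raw.decode('utf-8'))`).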
[Comments]:
-
Does the spider fail to find the next URL on the second page, or does it find the URL but get a wrong response back?
-
It finds the second page and scrapes it, but it stops there, even though there are 5 pages.
-
So, do you have any idea what the problem might be?