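【Posted】: 2020-07-02 20:08:55
【Question】:
I am using Scrapy to scrape the Amazon website, purely as a learning exercise. When you shop by category you get a list of products, and clicking a product opens its details page. I have finished the basic part: scraping details such as the product name, price, and link from the product list. What I want now is for those scraped links to be followed right away, so that each product's details page is also scraped within the same program.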
import scrapy

# Assumes the standard Scrapy project layout, with AmazonscrapyItem defined in items.py.
from ..items import AmazonscrapyItem


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = [
        'https://www.amazon.co.uk/s?me=A1NZU6VUR85CVU&marketplaceID=A1F83G8C2ARO7P'
    ]

    def parse(self, response):
        # 'body' matches once per page, so each field below holds a list
        # covering every product on that page.
        all_div_quotes = response.css('body')
        for quotes in all_div_quotes:
            items = AmazonscrapyItem()
            product = quotes.css('.a-color-base.a-text-normal').css('::text').extract()
            price = quotes.css('.a-offscreen').css('::text').extract()
            # Note: this selects the image URLs, stored under the 'brand' field.
            brand = quotes.css('.s-image::attr(src)').extract()
            asin = quotes.css(
                '.sg-col-20-of-24.s-result-item.sg-col-0-of-12.sg-col-28-of-32.sg-col-16-of-20.sg-col.sg-col-32-of-36.sg-col-12-of-16.sg-col-24-of-28::attr(data-asin)').extract()
            productlink = quotes.css('.a-link-normal.a-text-normal').css('::attr(href)').extract()
            items['product'] = product
            items['price'] = price
            items['brand'] = brand
            items['asin'] = asin
            items['productlink'] = productlink
            yield items

        # Follow the pagination link only when one exists; on the last page
        # extract_first() returns None.
        next_page_link = response.css('.a-last a::attr(href)').extract_first()
        if next_page_link:
            next_page_link = response.urljoin(next_page_link)
            yield scrapy.Request(url=next_page_link, callback=self.parse)
【Discussion】:
Tags: python web-scraping beautifulsoup scrapy html-parsing