【Question Title】: How can I iterate over a list of URLs to scrape data in Scrapy?
【Posted】: 2020-09-22 15:45:08
【Question Description】:
import scrapy
class oneplus_spider(scrapy.Spider):
    name='one_plus'
    page_number=0
    start_urls=[
        'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
    ]
     
    def parse(self,response):
        all_links=[]
        total_links=[]
        domain='https://www.amazon.com'
        href=[]
        link_set=set()
        
        href=response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
        for x in href:
            link_set.add(domain+x)
        for x in link_set:
            next_page=x
            yield response.follow(next_page, callback=self.parse_page1)


    def parse_page1(self, response):
        title=response.css('span.a-size-large product-title-word-break::text').extract()
        print(title)

After running the code I get an error (failed 2 times): 503 Service Unavailable. I have tried many approaches, but all of them failed. Please help me. Thanks in advance!

【Question Discussion】:

    Tags: url scrapy scrape


    【Solution 1】:

    First, check the URL with "curl". For example:

    curl -I "https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3"
    

    Then you can see the 503 response:

    HTTP/2 503
    

    In other words, the request as sent is being rejected.

    You have to construct a request that the site will accept.

    Chrome DevTools will help you find one.

    I think a browser-like User-Agent header is required:

    curl 'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3' \
      -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36' \
       --compressed
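    The same check can be run from Scrapy itself instead of curl, by overriding the USER_AGENT setting on the command line (`-s` sets any Scrapy setting for that invocation):

```shell
# Open a Scrapy shell against the search page with a browser-like user agent;
# inside the shell, inspect response.status to see whether the request got through.
scrapy shell \
  -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36' \
  'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
```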
    

    So... this might work:

    import scrapy

    class oneplus_spider(scrapy.Spider):
        name = 'one_plus'
        # a browser-like User-Agent keeps Amazon from answering with 503
        user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
        start_urls = [
            'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
        ]

        def parse(self, response):
            domain = 'https://www.amazon.com'
            link_set = set()
            # collect the product links on the results page, deduplicated
            href = response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
            for x in href:
                link_set.add(domain + x)
            for next_page in link_set:
                yield response.follow(next_page, callback=self.parse_page1)

        def parse_page1(self, response):
            title = response.css('span.a-size-large product-title-word-break::text').extract()
            print(title)
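    An alternative to the `user_agent` class attribute is a per-spider `custom_settings` override, which can also slow the crawl down to reduce throttling. A sketch of that fragment (the setting names are Scrapy's documented ones; the delay values are illustrative, not tuned):

```python
# Settings fragment: drop this class attribute into the spider above.
# USER_AGENT, DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are
# documented Scrapy settings; the values here are illustrative.
custom_settings = {
    'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'DOWNLOAD_DELAY': 1.0,                # wait between requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # keep parallelism low
}
```

    As a side note, `response.follow` resolves relative URLs against the current page, so building `domain + x` by hand is not strictly necessary.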
    

    【Discussion】:

    • I added the user_agent, and now it follows the product-page links, but it returns no text from those pages. This is the output: 2020-09-23 15:25:34 [scrapy.core.engine] DEBUG: Crawled (200) amazon.com/Samsung-Unlocked-Smartphone-Technology-Long-Lasting/…> (referer: amazon.com/…) [] [] []
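    The empty `[]` lists in that log line point at the selector, not the request: `span.a-size-large product-title-word-break` contains a space, so CSS reads it as a descendant selector, i.e. "a `<product-title-word-break>` element somewhere inside a `span.a-size-large`", and no such element exists. The page has a single `<span>` carrying both classes, which only the compound selector `span.a-size-large.product-title-word-break` matches. A stdlib-only sketch of the difference (the sample HTML is illustrative, not taken from Amazon):

```python
from html.parser import HTMLParser

# Illustrative markup: one <span> carrying BOTH classes, as on the product page.
HTML = '<span class="a-size-large product-title-word-break">Samsung Galaxy</span>'

class ClassFinder(HTMLParser):
    """Collect text of elements with the given tag name and all given classes."""
    def __init__(self, tag, classes):
        super().__init__()
        self.tag = tag
        self.classes = set(classes)
        self.hits = []
        self._in_hit = False

    def handle_starttag(self, tag, attrs):
        element_classes = set(dict(attrs).get("class", "").split())
        self._in_hit = (tag == self.tag and self.classes <= element_classes)

    def handle_data(self, data):
        if self._in_hit:
            self.hits.append(data)

# Descendant-selector reading: looks for an ELEMENT named
# 'product-title-word-break' -- no such tag exists, so nothing matches.
f1 = ClassFinder("product-title-word-break", [])
f1.feed(HTML)

# Compound-selector reading: one <span> with both classes -- matches.
f2 = ClassFinder("span", ["a-size-large", "product-title-word-break"])
f2.feed(HTML)

print(f1.hits, f2.hits)  # [] ['Samsung Galaxy']
```

    In the spider this means changing `parse_page1` to use `response.css('span.a-size-large.product-title-word-break::text').extract()` (note the dot joining the two classes).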