【Question Title】: How can I iterate over a list of URLs to scrape data in Scrapy?
【Posted】: 2020-09-22 15:45:08
【Question Description】:
import scrapy
class oneplus_spider(scrapy.Spider):
    name='one_plus'
    page_number=0
    start_urls=[
        'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
    ]
     
    def parse(self,response):
        all_links=[]
        total_links=[]
        domain='https://www.amazon.com'
        href=[]
        link_set=set()
        
        href=response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
        for x in href:
            link_set.add(domain+x)
        for x in link_set:
            next_page=x
            yield response.follow(next_page, callback=self.parse_page1)


    def parse_page1(self, response):
        title=response.css('span.a-size-large product-title-word-break::text').extract()
        print(title)

After running the code I get an error (failed 2 times): 503 Service Unavailable. I have tried many approaches, but all of them failed. Please help me. Thanks in advance!

【Question Discussion】:

    Tags: url scrapy scrape


    【Solution 1】:

    First, check the URL with "curl". For example:

    curl -I "https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3"
    

    Then you can see the 503 response:

    HTTP/2 503
    

    In other words, the request as sent is being rejected.

    You have to construct a request that the site will accept.

    Chrome DevTools will help you find one.

    I think a browser-like User-Agent header is required:

    curl 'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3' \
      -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36' \
       --compressed
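    The same check can be run from Scrapy itself instead of curl, by overriding the USER_AGENT setting on the command line (`-s` sets any Scrapy setting for that invocation):

```shell
# Open a Scrapy shell against the search page with a browser-like user agent;
# inside the shell, inspect response.status to see whether the request got through.
scrapy shell \
  -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36' \
  'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
```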
    

    So... this might work:

    import scrapy

    class oneplus_spider(scrapy.Spider):
        name = 'one_plus'
        # a browser-like User-Agent keeps Amazon from answering with 503
        user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
        start_urls = [
            'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
        ]

        def parse(self, response):
            domain = 'https://www.amazon.com'
            link_set = set()
            # collect the product links on the results page, deduplicated
            href = response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
            for x in href:
                link_set.add(domain + x)
            for next_page in link_set:
                yield response.follow(next_page, callback=self.parse_page1)

        def parse_page1(self, response):
            title = response.css('span.a-size-large product-title-word-break::text').extract()
            print(title)
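    An alternative to the `user_agent` class attribute is a per-spider `custom_settings` override, which can also slow the crawl down to reduce throttling. A sketch of that fragment (the setting names are Scrapy's documented ones; the delay values are illustrative, not tuned):

```python
# Settings fragment: drop this class attribute into the spider above.
# USER_AGENT, DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are
# documented Scrapy settings; the values here are illustrative.
custom_settings = {
    'USER_AGENT': ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'DOWNLOAD_DELAY': 1.0,                # wait between requests
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # keep parallelism low
}
```

    As a side note, `response.follow` resolves relative URLs against the current page, so building `domain + x` by hand is not strictly necessary.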
    

    【Discussion】:

    • I added the user_agent, and now it follows the product-page links, but it returns no text from those pages. This is the output: 2020-09-23 15:25:34 [scrapy.core.engine] DEBUG: Crawled (200) amazon.com/Samsung-Unlocked-Smartphone-Technology-Long-Lasting/…> (referer: amazon.com/…) [] [] []
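    The empty `[]` lists in that log line point at the selector, not the request: `span.a-size-large product-title-word-break` contains a space, so CSS reads it as a descendant selector, i.e. "a `<product-title-word-break>` element somewhere inside a `span.a-size-large`", and no such element exists. The page has a single `<span>` carrying both classes, which only the compound selector `span.a-size-large.product-title-word-break` matches. A stdlib-only sketch of the difference (the sample HTML is illustrative, not taken from Amazon):

```python
from html.parser import HTMLParser

# Illustrative markup: one <span> carrying BOTH classes, as on the product page.
HTML = '<span class="a-size-large product-title-word-break">Samsung Galaxy</span>'

class ClassFinder(HTMLParser):
    """Collect text of elements with the given tag name and all given classes."""
    def __init__(self, tag, classes):
        super().__init__()
        self.tag = tag
        self.classes = set(classes)
        self.hits = []
        self._in_hit = False

    def handle_starttag(self, tag, attrs):
        element_classes = set(dict(attrs).get("class", "").split())
        self._in_hit = (tag == self.tag and self.classes <= element_classes)

    def handle_data(self, data):
        if self._in_hit:
            self.hits.append(data)

# Descendant-selector reading: looks for an ELEMENT named
# 'product-title-word-break' -- no such tag exists, so nothing matches.
f1 = ClassFinder("product-title-word-break", [])
f1.feed(HTML)

# Compound-selector reading: one <span> with both classes -- matches.
f2 = ClassFinder("span", ["a-size-large", "product-title-word-break"])
f2.feed(HTML)

print(f1.hits, f2.hits)  # [] ['Samsung Galaxy']
```

    In the spider this means changing `parse_page1` to use `response.css('span.a-size-large.product-title-word-break::text').extract()` (note the dot joining the two classes).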