【Question Title】: Scrapy follow all the links and get status
【Posted】: 2018-05-06 14:24:11
【Question】:

I want to follow all the links of a website and get the HTTP status of each link, e.g. 404 or 200. I tried this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class someSpider(CrawlSpider):
    name = 'linkscrawl'
    allowed_domains = ['mysite.com']
    start_urls = ['http://mysite.com/']

    rules = (
        Rule(LinkExtractor(), callback="parse_obj", follow=True),
    )

    def parse_obj(self, response):
        # prints only the URL, not the HTTP status
        item = response.url
        print(item)

I can see the links on the console, but without status codes, for example:

mysite.com/navbar.html
mysite.com/home
mysite.com/aboutus.html
mysite.com/services1.html
mysite.com/services3.html
mysite.com/services5.html

But how can I save the status of all the links to a text file?

【Comments】:

    Tags: python scrapy


    【Solution 1】:

    I solved it as shown below; hopefully this helps anyone who needs it.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LinkscrawlItem(scrapy.Item):
        # fields for the crawled data
        link = scrapy.Field()
        attr = scrapy.Field()

    class someSpider(CrawlSpider):
        name = 'linkscrawl'

        allowed_domains = ['mysite.com']
        start_urls = ['http://www.mysite.com/']

        rules = (
            Rule(LinkExtractor(), callback="parse_obj", follow=True),
        )

        def parse_obj(self, response):
            # record the URL together with its HTTP status code
            item = LinkscrawlItem()
            item["link"] = str(response.url) + ":" + str(response.status)

            # append "<url>:<status>" to a plain text file
            filename = 'links.txt'
            with open(filename, 'a') as f:
                f.write(str(response.url) + ":" + str(response.status) + '\n')
            self.log('Saved file %s' % filename)
            yield item
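
    One caveat: by default Scrapy's HttpErrorMiddleware drops responses outside the 2xx range before they reach the callback, so 404 links would never be written to the file. A minimal sketch of how to let 404 responses through as well, using the spider-level handle_httpstatus_list attribute (spider name and domain here are placeholders for illustration):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class StatusSpider(CrawlSpider):
        name = 'statuscheck'  # hypothetical name for this sketch
        allowed_domains = ['mysite.com']
        start_urls = ['http://www.mysite.com/']

        # let 404 responses reach the callback instead of being
        # filtered out by HttpErrorMiddleware (only 2xx pass by default)
        handle_httpstatus_list = [404]

        rules = (
            Rule(LinkExtractor(), callback="parse_obj", follow=True),
        )

        def parse_obj(self, response):
            # response.status will now be 200 or 404
            with open('links.txt', 'a') as f:
                f.write('%s:%s\n' % (response.url, response.status))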
    

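    A side note on the design: since parse_obj now yields an item, the manual open()/write() handling could be dropped entirely and Scrapy's built-in feed exports used to write the output instead, e.g.:

    scrapy crawl linkscrawl -o links.csv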
    【Comments】:
