【Question Title】: Scrapy not crawling all links recursively
【Posted】: 2018-12-19 00:20:48
【Question Description】:

I need all the internal links from every page of a website for analysis. I have searched many similar questions. I found the code below via Mithu, and it gives a possible answer. However, it does not return all the links beyond the second level of page depth. The generated csv file has only 676 records, but the website has 1000.

Working code:

import csv  # newline='' is passed to open() below to avoid blank rows in the generated csv file
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from eylinks.items import LinkscrawlItem
outfile = open("data.csv", "w", newline='')
writer = csv.writer(outfile)
class ToscrapeSpider(scrapy.Spider):

    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    rules = ([Rule(LinkExtractor(allow=r".*"), callback='parse', follow=True)])


    def parse(self, response):
        # Extract every internal (toscrape.com) link on the page and visit it once
        extractor = LinkExtractor(allow_domains='toscrape.com')
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.collect_data)

    def collect_data(self, response):
        # Record name, price, URL and status for each product on the visited page
        global writer
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first()
            value = item.css('.price_color::text').extract_first()
            lnk = response.url
            stats = response.status
            print(lnk)
            yield {'Name': product, 'Price': value,"URL":lnk,"Status":stats}  
            writer.writerow([product,value,lnk,stats]) 

【Question Discussion】:

    Tags: python-3.x scrapy scrapy-spider


    【Solution 1】:

    To extract the links, try this:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.http import Request
    import csv 
    
    outfile = open("data.csv", "w", newline='')
    writer = csv.writer(outfile)
    class BooksScrapySpider(scrapy.Spider):
        name = 'books'
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']
    
        def parse(self, response):
            books = response.xpath('//h3/a/@href').extract()
            for book in books:
                url = response.urljoin(book)
                yield Request(url, callback=self.parse_book)
    
            next_page_url = response.xpath(
                "//a[text()='next']/@href").extract_first()
            # Follow pagination only while a 'next' link exists
            if next_page_url:
                absolute_next_page = response.urljoin(next_page_url)
                yield Request(absolute_next_page)
    
        def parse_book(self, response):
    
            title = response.css("h1::text").extract_first()
            price = response.xpath(
                "//*[@class='price_color']/text()").extract_first()
            url = response.request.url
    
            yield {'title': title,
                   'price': price,
                   'url': url,
                   'status': response.status}
            writer.writerow([title,price,url,response.status])
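
    A side note on the CSV handling, as a hedged sketch (assuming Scrapy 2.1 or later and its standard FEEDS setting): the module-level csv writer can be replaced by Scrapy's built-in feed export, which also closes the output file cleanly when the crawl finishes. The spider name below is illustrative; the parse callbacks would stay the same as above.

    import scrapy

    class BooksFeedSpider(scrapy.Spider):
        # Illustrative name; parse/parse_book from the answer above would be reused
        name = 'books_feed'
        start_urls = ['http://books.toscrape.com/']
        # Every yielded item is appended to data.csv by the feed exporter,
        # so no module-level csv.writer is needed
        custom_settings = {'FEEDS': {'data.csv': {'format': 'csv'}}}

        def parse(self, response):
            # Same idea as the answer: yield plain dicts and let Scrapy persist them
            yield {'url': response.url, 'status': response.status}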
    

    【Discussion】:

    • Your code works exceptionally well. Thank you for the guidance. However, my final goal is only to get the site's URLs, titles, and statuses in order to track all the broken links, so the next-page link approach will not work for me in the end :(. I will try to edit this code to suit my needs. From what I can tell after two full days of research, I have to use from scrapy.linkextractors import LinkExtractor to achieve this. I would be grateful for any help with that (see the sketch after these comments).
    • Do you also know why the link extractor excludes some of the links?
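
    A minimal sketch of the LinkExtractor route asked about above, assuming a CrawlSpider subclass (the rules attribute is only honoured by CrawlSpider, not by plain scrapy.Spider) and illustrative spider/callback names. It follows every internal link recursively and yields only the page title, URL, and HTTP status, which is enough for broken-link tracking; output could go through Scrapy's feed export, e.g. scrapy crawl linkaudit -o links.csv.

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LinkAuditSpider(CrawlSpider):
        name = 'linkaudit'  # illustrative name
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']

        # Let 404/500 responses reach the callback so dead links are recorded
        # instead of being silently dropped
        handle_httpstatus_list = [404, 500]

        # follow=True keeps extracting links from every crawled page,
        # which gives the full recursive crawl the question asks for
        rules = (
            Rule(LinkExtractor(allow_domains='books.toscrape.com'),
                 callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # Record only what is needed for broken-link tracking
            yield {
                'title': response.css('title::text').extract_first(),
                'url': response.url,
                'status': response.status,
            }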