【Question Title】: IMDB Movie Scraping gives blank csv using scrapy
【Posted】: 2019-01-17 14:03:14
【Question】:

I am getting a blank csv, although the code shows no errors. The spider is unable to crawl the webpage.

This is the code I wrote, following a YouTube tutorial:

import scrapy

from Example.items import MovieItem

class ThirdSpider(scrapy.Spider):
    name = "imdbtestspider"
    allowed_domains = ["imdb.com"]
    start_url = ('http://www.imdb.com/chart/top',)

    def parse(self, response):
        links = response.xpath('//tbody[@class="lister-list"]/tr/td[@class="titleColumn"]/a/@href').extract()
        i = 1
        for link in links:
            abs_url = response.urljoin(link)
            url_next = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr['+str(i)+']/td[3]/strong/text()'
            rating = response.xpath(url_next).extact()
            if (i <= len(link)):
                i = i + 1
                yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'rating': rating})

    def parse_indetail(self, response):
        item = MovieItem()
        item['title'] = response.xpath('//div[@class="title_wrapper"])/h1/text()').extract[0][:-1]
        item['directors'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="director"]/a/span/text()').extract()[0]
        item['writers'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="creator"]/a/span/text()').extract()
        item['stars'] = response.xpath('//div[@class="credit_summary_items"]/span[@itemprop="actors"]/a/span/text()').extract()
        item['popularity'] = response.xpath('//div[@class="titleReviewBarSubItem"]/div/span/text()').extract()[2][21:-8]

        return item

This is the output I get when I run the crawl:

scrapy crawl imdbtestspider -o example.csv -t csv

2019-01-17 18:44:34 [scrapy.core.engine] INFO: Spider opened
2019-01-17 18:44:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

【Comments】:

  • This question has working code that does what you are doing. If you need more specific help, please provide all of your code (I assume `from Example.items` is your custom code?).
  • Also, `start_url` is a list, so it needs to go in square brackets, i.e. `start_url = [www.abc.com, ]`
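The naming nit in the comment above is the likely root cause of the blank csv: Scrapy only reads the plural attribute `start_urls`; a class attribute named `start_url` is silently ignored, so no request is ever scheduled and the log shows 0 pages crawled. A stdlib-only sketch of the effect (the class names and the `scheduled_urls` helper are hypothetical; the helper only mimics what Scrapy's default `start_requests()` iterates over):

```python
class MisnamedSpider:
    start_url = ('http://www.imdb.com/chart/top',)  # typo: Scrapy never reads this

class FixedSpider:
    start_urls = ['http://www.imdb.com/chart/top']  # the attribute Scrapy reads

def scheduled_urls(spider_cls):
    # Scrapy's default start_requests() iterates over spider.start_urls;
    # a missing attribute behaves like an empty list, i.e. nothing to crawl.
    return list(getattr(spider_cls, 'start_urls', []))
```

With the misnamed attribute the scheduler finds nothing, which matches the "Crawled 0 pages" log line in the question.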

标签: python web-scraping scrapy export-to-csv


【Solution 1】:

Here is another approach you can try. I used css selectors instead of xpath to make the script less verbose.

import scrapy

class ImbdsdpyderSpider(scrapy.Spider):
    name = 'imbdspider'
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        for link in response.css(".titleColumn a[href^='/title/']::attr(href)").extract():
            yield scrapy.Request(response.urljoin(link),callback=self.get_info)

    def get_info(self, response):
        item = {}
        title = response.css(".title_wrapper h1::text").extract_first()
        item['title'] = ' '.join(title.split()) if title else None
        item['directors'] = response.css(".credit_summary_item h4:contains('Director') ~ a::text").extract()
        item['writers'] = response.css(".credit_summary_item h4:contains('Writer') ~ a::text").extract()
        item['stars'] = response.css(".credit_summary_item h4:contains('Stars') ~ a::text").extract()
        popularity = response.css(".titleReviewBarSubItem:contains('Popularity') .subText::text").extract_first()
        item['popularity'] = ' '.join(popularity.split()).strip("(") if popularity else None
        item['rating'] = response.css(".ratingValue span::text").extract_first()
        yield item

【Comments】:

    【Solution 2】:

    I have tested your xpaths for you; I don't know whether they were mistyped or just plain wrong.

    For example:

    xpath = '//*[@id="main"]/div/span/div/div/div[2]/table/tbody/tr['+str(i)+']/td[3]/strong/text()'

    # There is no table element by the time you reach div[2]

    '//div[@class="title_wrapper"])/h1/text()'    # the stray ")" after "]" is bad syntax


    Besides the syntax errors, your xpaths simply produce no results on the page.
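    A quick way to catch a malformed path like the one with the stray `)` is to feed it to any XPath parser before running the whole spider (Scrapy itself uses parsel/lxml, and `scrapy shell "http://www.imdb.com/chart/top"` lets you try selectors interactively against the live page). A stdlib-only sketch of the same check, against a tiny stand-in document:

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the movie page header markup.
doc = ET.fromstring(
    '<html><div class="title_wrapper"><h1>Movie (1994)</h1></div></html>'
)

# Well-formed path: matches the <h1>.
good = doc.findall('.//div[@class="title_wrapper"]/h1')

# Same path with the stray ")" from the question: the parser rejects it
# (the exact exception type varies, so we catch broadly).
try:
    doc.findall('.//div[@class="title_wrapper"])/h1')
    bad_path_rejected = False
except Exception:
    bad_path_rejected = True
```

    The well-formed path finds the heading; the one with the extra `)` never gets as far as matching anything.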

    【Comments】:

      【Solution 3】:

      As for why you are getting 0 pages crawled, without recreating your case I have to assume that your page-iteration method is not building the page URLs correctly.

      I could not follow the purpose of creating the array of all the "follow links" and then using len to send them off to parse_indetail(), but there are a couple of points to note.

      1. When you use "meta" to pass an item from one function to the next, although you have the right idea, you are missing the instantiation in the function you pass it to (you should also use the standard naming convention, for simplicity)

      It should be something like this...

      def parse(self, response):
          # If you are going to capture an item at the first request, you must
          # instantiate your items class
          item = MovieItem()
          ....
          # You seem to want to pass the rating on to the next function, so make
          # sure you have it listed in your items.py file, then set it
          item['rating'] = response.xpath(PATH).extract()  # why did you add the url_next?
          ....
          # The standard convention for passing meta with a callback is like this;
          # this way the whole itemized set gets passed at once
          yield scrapy.Request(abs_url, callback=self.parse_indetail, meta={'item': item})

      def parse_indetail(self, response):
          # Then you must initialize the meta again in the function you pass it to
          item = response.meta['item']
          # Then you can continue your scraping
      
      2. Don't over-complicate your page-iteration logic. You seem to understand how it works, but this aspect needed fine-tuning. I recreated your use case and optimized it.
      #items.py file
      import scrapy
      
      
      class TestimbdItem(scrapy.Item):
          title = scrapy.Field()
          directors = scrapy.Field()
          writers = scrapy.Field()
          stars = scrapy.Field()
          popularity = scrapy.Field()
          rating = scrapy.Field()
      
      # The spider file
      import scrapy
      from testimbd.items import TestimbdItem
      
      class ImbdsdpyderSpider(scrapy.Spider):
          name = 'imbdsdpyder'
          allowed_domains = ['imdb.com']
          start_urls = ['http://www.imdb.com/chart/top']
      
          def parse(self, response):
              for href in response.css("td.titleColumn a::attr(href)").extract():
                  yield scrapy.Request(response.urljoin(href),
                                       callback=self.parse_movie)
      
          def parse_movie(self, response):
              item = TestimbdItem()
              item['title'] = [ x.replace('\xa0', '')  for x in response.css(".title_wrapper h1::text").extract()][0]
              item['directors'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Director")]/following-sibling::a/text()').extract()
              item['writers'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Writers")]/following-sibling::a/text()').extract()
              item['stars'] = response.xpath('//div[@class="credit_summary_item"]/h4[contains(., "Stars")]/following-sibling::a/text()').extract()
              item['popularity'] = response.css(".titleReviewBarSubItem span.subText::text")[2].re('([0-9]+)')
              item['rating'] = response.css(".ratingValue span::text").extract_first()
      
              yield item
      

      Notice two things: In the parse() function, all I do is loop through the links with a for loop; each pass of the loop takes an href and hands the urljoined href to the parser function. Given your use case, this is more than enough. In a situation where you have a next page, you just create a variable for the "next page" and call back to parse; it keeps doing that until it can't find a "next page".
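      The "next page" idea described above boils down to joining the relative href of a next-page link against the URL of the page just parsed, then yielding a new Request with parse as the callback. The join step can be sketched with the stdlib (inside a spider you would simply use `response.urljoin(...)`; `next_page_url` is a hypothetical helper, and the css selector in the comment is an assumption, not IMDb's real markup):

```python
from urllib.parse import urljoin

def next_page_url(current_url, next_href):
    """Build the absolute URL for a 'next page' link, or None to stop."""
    if next_href is None:
        # No next link on the page: the spider stops paginating here.
        return None
    return urljoin(current_url, next_href)

# Inside a Scrapy callback the equivalent would be roughly:
#   next_href = response.css("a.next-page::attr(href)").extract_first()
#   if next_href:
#       yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
```

      Because the join is relative to the response URL, hrefs like `?page=2` or `/chart/top?page=2` both resolve correctly without any manual string concatenation.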

      Second, only use xpath when items in the HTML share the same tags but differ in content. This is more personal opinion, but I tell people that xpath selectors are like a scalpel and css selectors are like a butcher's knife. You can get damn accurate with a scalpel, but it takes more time; in many cases you can get the same result far more easily with a css selector.

      【Comments】:
