[Question Title]: Scrapy, parse item data from a page, then follow a link to get additional item data
[Posted]: 2015-01-19 19:21:06
[Question Description]:

After scraping data from the first page, I am unable to scrape the additional fields from the followed pages. For example:

Here is my code:

from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from IMDB_Frompage.items import ImdbFrompageItem
from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

URL = "http://www.imdb.com/search/title?count=100&ref_=nv_ch_mm_1&start=1&title_type=feature,tv_series,tv_movie"

class MySpider(CrawlSpider):
    name = "imdb"
    allowed_domains = ["imdb.com"]
    start_urls = [URL]
    DOWNLOAD_DELAY = 0.5

    rules = (Rule(SgmlLinkExtractor(allow=('100&ref'), restrict_xpaths=('//span[@class="pagination"]/a[contains(text(),"Next")]')), callback='parse_page', follow=True),)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = ImdbFrompageItem()
        links = hxs.select("//td[@class='title']")
        items=[]
        for link in links:
            item = ImdbFrompageItem()
            item['link'] = link.select("a/@href").extract()[0]
            item['new_link'] ='http://www.imdb.com'+item['link']
            new_links = ''.join(item['new_link'])
            request = Request(new_links, callback=self.parsepage2)
            request.meta['item'] = item
            yield request
            yield item

    def parsepage2(self, response):
        item = response.meta['item']
        hxs = HtmlXPathSelector(response)
        blocks = hxs.select("//td[@id='overview-top']")
        for block in blocks:
            item = ImdbFrompageItem()
            item["title"] = block.select("h1[@class='header']/span[@itemprop='name']/text()").extract()
            item["year"] = block.select("h1[@class='header']/span[@class='nobr']").extract()
            item["description"] = block.select("p[@itemprop='description']/text()").extract()
            yield item

Part of the output is:

{"link": , "new_link": }
{"link": , "new_link": }
{"link": , "new_link": }
{"link": , "new_link": }
....
{"link": , "new_link": }
{"title": , "description":}
{"title": , "description":}
next page
{"link": , "new_link": }
{"link": , "new_link": }
{"link": , "new_link": }
{"title": , "description":}

My results do not include the ({"title": , "description": }) data for every link.

But I want something like this:

{"link": , "new_link": }
{"title": , "description":}
{"link": , "new_link": }
{"title": , "description":}
{"link": , "new_link": }
{"title": , "description":}
{"link": , "new_link": }
....
{"link": , "new_link": }
{"title": , "description":}
next page
{"link": , "new_link": }
{"title": , "description":}
{"link": , "new_link": }
{"title": , "description":}
{"link": , "new_link": }
{"title": , "description":}

Any suggestions as to what I am doing wrong?

[Comments]:

  • One guess: look closely at exactly how yield behaves. The first time the for loop calls the generator object created from your function, it runs the code in your function from the beginning until it hits a yield, then returns the first value of the loop. Every subsequent call runs the loop in your function once more and returns the next value, until there are no values left to return. So the function first yields all the Requests for the links in the loop, and only afterwards starts on the followed links. More details here: stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python
  • The problems are: 1. The data scraped from a followed link ({"title": , "description": }) does not stay together with that link's ({"link": , "new_link": }) data (I solved this after @JimmyZhang's reply). 2. Why doesn't the parser follow all the links on the first page, instead of seemingly picking links to scrape at random? For example, with 100 links on a page the results are: link1 data; link12 data; link36 data ... then it follows to the next page.
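The yield behaviour described in the first comment can be sketched in plain Python, without Scrapy at all (hypothetical names; the strings stand in for the Request objects and the later item):

```python
# A generator runs lazily: each step resumes the function body until the
# next yield, so everything yielded inside the loop comes out before any
# code that follows the loop.
def links_then_details():
    for n in range(3):
        yield f"link{n}"   # yielded immediately, like the Request objects
    yield "details"        # only reached after the whole loop finishes

print(list(links_then_details()))
```

This is why all the link items from one page appear before any of the detail items: the spider yields every Request first, and the detail callbacks only run once those responses come back.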

[Tags]: python callback web-scraping scrapy


[Solution 1]:

Scrapy does not guarantee that requests are parsed in order; responses come back unordered.

The execution order may look like this:

  1. call parse_page();
  2. call parse_page();
  3. call parse_page();
  4. call parsepage2();
  5. ....
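If strict response ordering really mattered, it could be approximated by limiting concurrency in the project's settings.py (a sketch; this trades away crawl speed and is usually unnecessary once the related fields travel together in one item):

```python
# settings.py (sketch): allow only one in-flight request at a time, so
# responses arrive roughly in the order the requests were yielded.
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 0.5  # polite delay between requests, as in the spider above
```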

Perhaps you can change your code to get what you want:

def parse_page(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select("//td[@class='title']")
    for link in links:
        href = link.select("a/@href").extract()[0]
        new_link = 'http://www.imdb.com' + href
        # Carry the link fields along in meta instead of yielding a
        # separate item here; only parsepage2 yields the finished item.
        request = Request(new_link, callback=self.parsepage2)
        request.meta['link'] = href
        request.meta['new_link'] = new_link
        yield request


def parsepage2(self, response):
    hxs = HtmlXPathSelector(response)
    blocks = hxs.select("//td[@id='overview-top']")
    for block in blocks:
        item = ImdbFrompageItem()
        item["link"] = response.meta["link"]
        item["new_link"] = response.meta["new_link"]
        item["title"] = block.select("h1[@class='header']/span[@itemprop='name']/text()").extract()
        item["year"] = block.select("h1[@class='header']/span[@class='nobr']").extract()
        item["description"] = block.select("p[@itemprop='description']/text()").extract()

        yield item

So you will get results like this:

{"link": , "new_link": ,"title": , "description":}

I am not sure my code runs as-is; it is meant as a starting point to show the idea of what you want.

[Discussion]:

  • Thanks @JimmyZhang, this was really helpful. I edited my code and it works now; my results are: {"link": , "new_link": ,"title": , "description":}. But I still don't understand why the parser doesn't follow all the links on the first page and seems to pick links to scrape at random. For example, with 100 links on a page, the results are: link1 data; link12 data; link36 data ... then it moves on to the next page. Maybe you know what I'm doing wrong?
  • I believe all the links on a page are parsed, one by one; can you give more details? If this helped you, please accept the answer.