【Question Title】: How to pass parameters to scrapy spiders in program?
【Posted】: 2016-04-18 08:43:11
【Question Description】:

I am new to Python and Scrapy. I am using the approach from the blog post Running multiple scrapy spiders programmatically to run my spiders inside a Flask application. The code is as follows:

# imports for the Scrapy 0.24-era API used below
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

# list of spider classes to crawl
TO_CRAWL = [DmozSpider, EPGDspider, GDSpider]

# crawlers that are running 
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """
    Activates on spider closed signal
    """
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()

And here is my spider's code:

import scrapy
from scrapy.http import Request
from scrapy.selector import Selector
# EPGD is this project's Item class; its import is omitted here, as in the original

class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=" + term + "&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        # build one EPGD item per result row
        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url + map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url + map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        # collect the pagination links
        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')
        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        # the "#" entry marks the current page; request the link right after it
        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i + 1 < len(url_list[0]):
                    print url_list[0][i + 1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i + 1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

As you can see, there is a parameter term = 'man' in my code, and it is part of my start_urls. I don't want to hard-code this parameter, so I'm wondering how I can set the start url or the parameter term dynamically from my program. When running a spider from the command line, there is a way to pass parameters, like this:

class MySpider(BaseSpider):

    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

And start it like: scrapy crawl my_spider -a start_url="http://some_url"

Can anyone tell me how to handle this?

【Question Comments】:

  • Yes, scrapy crawl my_spider -a start_url="http://google.com" works fine.
  • But I don't want to invoke my spider from the command line; I want to invoke it from within my program.

Tags: python scrapy


【Solution 1】:

First of all, the recommended way to run multiple spiders from one script is to use scrapy.crawler.CrawlerProcess, where you pass spider classes, not spider instances.

To pass arguments to your spiders with CrawlerProcess, you simply add the arguments to the .crawl() call, after the spider subclass, e.g.:

    process.crawl(DmozSpider, term='someterm', someotherterm='anotherterm')

Arguments passed this way are then made available as spider attributes (just like with -a term=someterm on the command line).
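
For reference, a minimal self-contained sketch of this pattern (Scrapy 1.0+; the settings value below is illustrative, not from the thread) could look like:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'my-crawler (+http://example.com)',  # illustrative setting
})

# 'term' is forwarded to the spider and becomes available as self.term
process.crawl(EPGDspider, term='man')
process.start()  # blocks until all queued crawls finish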

Finally, instead of building start_urls in __init__, you can achieve the same with start_requests and build the initial request with self.term, like this:

def start_requests(self):
    yield Request("http://epgd.biosino.org/"
                  "EPGD/search/textsearch.jsp?"
                  "textquery={}"
                  "&submit=Feeling+Lucky".format(self.term))

【Discussion】:

  • First of all, thank you for your detailed answer!! I have tried CrawlerProcess, but the problem is that I can't use it in a Flask app: when I do, I get an error saying signal only works in main thread. I already asked about that here: link, but there is no effective solution yet. Do you have another way?
  • If you want to use scrapy.crawler.Crawler, it needs to be instantiated with (spidercls, settings), not just settings. E.g. crawler = Crawler(DmozSpider, settings) and then crawler.crawl(term="someterm"); see the sketch after these comments.
  • The problem is that I run these spiders inside a Flask app, so should I try scrapy.crawler.Crawler instead of CrawlerProcess?
  • I don't know how to run scrapy spiders inside a Flask application. I'll ask around.
  • I found that I was using scrapy 0.24.0 rather than scrapy 1.0; in scrapy 0.24.0 Crawler takes only one argument, settings, which is a bit different from the latest version.
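
To make the Crawler-level route mentioned in these comments concrete, here is a minimal sketch using scrapy.crawler.CrawlerRunner (Scrapy 1.0+), which manages Crawler objects and leaves reactor control to the caller. This is an illustration under those assumptions, not code from the thread:

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

# keyword arguments are forwarded to the spider, so self.term == 'man'
d = runner.crawl(EPGDspider, term='man')
d.addBoth(lambda _: reactor.stop())  # stop the reactor once the crawl ends
reactor.run()  # blocks until the crawl finishes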