[Title]: Is there any way to change a Scrapy spider's name from a script?
[Posted]: 2017-05-26 12:50:53
[Question]:

I built a scrapy-redis crawler and decided to make it distributed. Beyond that, I want it to be task-based: one task, one name. So I plan to change the spider's name to the task name and use that name to distinguish tasks. While wiring this into a web management interface, I ran into the problem of how to change the spider's name at runtime.

Here is my code; it is still rough:

#-*- encoding: utf-8 -*-
import redis
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_redis.spiders import RedisSpider
import pymongo
client = pymongo.MongoClient('mongodb://localhost:27017')
db_name = 'news'
db = client[db_name]

class NewsSpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'news'
    redis_key = 'news:start_urls'
    start_urls = ["http://www.bbc.com/news"]

    def parse(self, response):
        pass
    # I added these two methods, setname and getname
    def setname(self, name):
        self.name = name

    def getname(self):
        return self.name

def start():
    news_spider = NewsSpider()
    news_spider.setname('test_spider_name')
    print news_spider.getname()
    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    r.lpush('news:start_urls', 'http://news.sohu.com/')
    process = CrawlerProcess(get_project_settings())
    process.crawl('test_spider_name')
    process.start()  # the script will block here until the crawling is finished

if __name__ == '__main__':
    start()

And the error:

test_spider_name
2017-05-26 20:14:05 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-05-26 20:14:05 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'geospider.spiders', 'SPIDER_MODULES': ['geospider.spiders'], 'COOKIES_ENABLED': False, 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter'}
Traceback (most recent call last):
  File "/home/kui/work/python/project/bigcrawler/geospider/control/command.py", line 29, in <module>
    start()
  File "/home/kui/work/python/project/bigcrawler/geospider/control/command.py", line 23, in start
    process.crawl('test_spider_name')
  File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/crawler.py", line 162, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/crawler.py", line 190, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/crawler.py", line 194, in _create_crawler
    spidercls = self.spider_loader.load(spidercls)
  File "/home/kui/work/python/env/lib/python2.7/site-packages/scrapy/spiderloader.py", line 55, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: test_spider_name'

I know this is a clumsy approach. I have searched online for a long time, but without success. Please help me, or offer some ideas on how to achieve this.

Thanks in advance.

[Comments]:

  • Thank you, but it did not work.

Tags: python, python-2.7, scrapy, web-crawler


[Solution 1]:

This may help:

class NewsSpider(RedisSpider):
    """Spider that reads urls from redis queue (myspider:start_urls)."""
    name = 'news_redis'
    redis_key = 'news:start_urls'
    start_urls = ["http://www.bbc.com/news"]

    def parse(self, response):
        pass

def start():
    # Rebind name and redis_key on the class itself, *before* the class
    # object is handed to CrawlerProcess (the unused instance is not needed)
    NewsSpider.name = 'test_spider_name_redis'
    NewsSpider.redis_key = NewsSpider.name + ':start_urls'

    r = redis.Redis(host='127.0.0.1', port=6379, db=0)
    r.lpush(NewsSpider.name + ':start_urls', 'http://news.sohu.com/')
    process = CrawlerProcess(get_project_settings())
    process.crawl(NewsSpider)
    process.start()  # the script will block here until the crawling is finished

if __name__ == '__main__':
    start()
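The key point is that Scrapy treats `name` as a class attribute: `CrawlerProcess.crawl('some_name')` asks the spider loader to find a class in `SPIDER_MODULES` whose class-level `name` matches, so assigning `self.name` on an instance (as the question's `setname()` does) changes nothing the loader can see. A minimal plain-Python sketch of both the failure mode and a per-task alternative (no Scrapy required; `FakeSpider` and `make_task_spider` are hypothetical names used only for illustration):

```python
class FakeSpider:
    """Stand-in for a Scrapy spider: `name` is a class attribute."""
    name = 'news'

s = FakeSpider()
s.name = 'test_spider_name'        # instance attribute: shadows, never rebinds
assert FakeSpider.name == 'news'   # the class (and thus the loader) is unchanged

# Rebinding on the class, as Solution 1 does, is what actually changes
# what CrawlerProcess sees when you pass it the class object:
FakeSpider.name = 'test_spider_name'
assert FakeSpider.name == 'test_spider_name'

# To run one spider per task without mutating a shared class, you can
# build a fresh subclass per task with type():
def make_task_spider(task_name):
    return type('TaskSpider_' + task_name, (FakeSpider,),
                {'name': task_name,
                 'redis_key': task_name + ':start_urls'})

TaskSpider = make_task_spider('task42')
assert TaskSpider.name == 'task42'
assert TaskSpider.redis_key == 'task42:start_urls'
assert FakeSpider.name == 'test_spider_name'  # base class left untouched
```

Passing such a dynamically created class to `process.crawl(TaskSpider)` avoids the string lookup entirely, so the `KeyError: 'Spider not found'` cannot occur.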

[Discussion]:
