【Title】: scrapy: pass arguments to crawler programmatically
【Posted】: 2017-12-28 21:12:13
【Description】:

I am building a crawler. I have a Python module that fetches URLs from a database and should configure Scrapy to launch a spider for each URL. Because I start Scrapy from my own script, I don't know how to pass arguments to the spider the way the -a command-line switch does, so that each invocation receives a different URL.

Here is the code that invokes Scrapy:

    import os

    import _mysql
    from scrapy.crawler import CrawlerProcess
    from scrapy.settings import Settings

    from webscraper.spiders.image_spider import ImageSpider  # adjust to your module path

    def scrape_next_url():
        # Claim the next unprocessed URL from the queue (credentials defined elsewhere).
        conn = _mysql.connect(host, username, password, database_name)
        conn.query("select min(sortorder) from url_queue where processed = false for update")
        query_result = conn.store_result()
        url_index = query_result.fetch_row()[0][0]

        conn.query("select url from url_queue where sortorder = " + str(url_index))
        query_result = conn.store_result()
        url_at_index = query_result.fetch_row()[0][0]

        conn.query("update url_queue set processed = true where sortorder = " + str(url_index))
        conn.commit()
        conn.close()

        # Load the project settings and start a crawl for the claimed URL.
        settings = Settings()
        os.environ['SCRAPY_SETTINGS_MODULE'] = 'webscraper.settings'
        settings.setmodule(os.environ['SCRAPY_SETTINGS_MODULE'], priority='project')

        process = CrawlerProcess(settings)
        ImageSpider.start_urls.append(url_at_index)
        process.crawl(ImageSpider)
        process.start()
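As an aside, the queries above interpolate url_index into the SQL by string concatenation. Parameterized queries are safer, and the claim-and-mark sequence fits naturally into one helper. A minimal sketch, using sqlite3 as a stand-in for the MySQL connection (the table name and columns come from the question; the claim_next_url helper is hypothetical, and with MySQLdb the placeholder style would be %s instead of ?):

```python
import sqlite3

def claim_next_url(conn):
    """Claim the next unprocessed URL from the queue, or return None if empty."""
    cur = conn.cursor()
    cur.execute("SELECT MIN(sortorder) FROM url_queue WHERE processed = 0")
    url_index = cur.fetchone()[0]
    if url_index is None:
        return None  # queue is empty
    # Placeholders keep url_index out of the SQL text entirely.
    cur.execute("SELECT url FROM url_queue WHERE sortorder = ?", (url_index,))
    url = cur.fetchone()[0]
    cur.execute("UPDATE url_queue SET processed = 1 WHERE sortorder = ?", (url_index,))
    conn.commit()
    return url
```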

Help!

Note: I came across this question (Scrapy: Pass arguments to cmdline.execute()), but I would prefer to do it programmatically if possible.

Edit:

I have followed your advice and now have the following spider code:

    def __init__(self, url=None, *pargs, **kwargs):
        super(ImageSpider, self).__init__(*pargs, **kwargs)
        self.start_urls.append(url.strip())

And my caller:

    process = CrawlerProcess(settings)
    process.crawl(ImageSpider, url=url_at_index)

I know the argument is being passed to __init__, because the url.strip() call would fail if it were not. But the result is that the spider runs without crawling anything:

(webcrawler) faisca:webscraper dlsa$ python scraper_launcher.py 
2017-07-25 00:42:16 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: webscraper)
2017-07-25 00:42:16 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'webscraper', 'NEWSPIDER_MODULE': 'webscraper.spiders', 'SPIDER_MODULES': ['webscraper.spiders']}
2017-07-25 00:42:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.memusage.MemoryUsage']
2017-07-25 00:42:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-07-25 00:42:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-07-25 00:42:16 [scrapy.middleware] INFO: Enabled item pipelines:
['webscraper.pipelines.WebscraperPipeline']
2017-07-25 00:42:16 [scrapy.core.engine] INFO: Spider opened
2017-07-25 00:42:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-25 00:42:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

【Discussion】:

    Tags: python scrapy web-crawler


    【Solution 1】:

    Pass the arguments like this:

    process.crawl(MySpider(), limit=query_to_run, cursor=cursor, conn=conn)
    

    Then, in your spider:

    from scrapy.spiders import CrawlSpider
    
    class MySpider(CrawlSpider):
        # some code here
        def __init__(self, limit=None, cursor=None, conn=None, *args, **kwargs):
                super(MySpider, self).__init__(*args, **kwargs)
    

    【Comments】:

    • It should be process.crawl(MySpider, limit=query_to_run, cursor=cursor, conn=conn). You are passing the spider class, not a spider instance.
    • @paultrmbrth I have this code already working in production; maybe it was an older Scrapy version or something...