Answers: Get Scrapy crawler output/results in a script file function

【Question title】: Get Scrapy crawler output/results in script file function
【Posted】: 2017-03-07 09:12:43
【Question】:

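I am using a script file to run a spider within a Scrapy project, and the spider logs the crawler output/results. However, I want to use the spider output/results inside a function in that same script file, without saving them to any file or database. This is the script code, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script: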

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner(get_project_settings())


d = runner.crawl('my_spider')
d.addBoth(lambda _: reactor.stop())
reactor.run()

def spider_output(output):
    # do something with that output
    pass

How can I get the spider output inside the "spider_output" function? Is it possible to get the output/results there?

【Comments】:

Tags: python scrapy web-crawler twisted scrapy-spider

【Solution 1】:

Here is a solution that collects all the output/results in a list:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # MySpider is your spider class, imported from your project
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())

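【Comments】:

This does not seem to work for me. Do you have pipelines working?
It works for me. Note that MySpider is your spider class. Very helpful for beginners.
FYI, as of Scrapy 0.14, item_passed has been renamed to item_scraped. Source: docs.scrapy.org/en/latest/news.html. Old item_passed docs: docs.scrapy.org/en/0.9/topics/signals.html#item-passed. New item_scraped docs: docs.scrapy.org/en/latest/topics/signals.html#item-scraped

【Solution 2】:

This returns all of the spider's results in a list.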

from scrapyscript import Job, Processor
from scrapy.utils.project import get_project_settings


def get_spider_output(spider, **kwargs):
    job = Job(spider, **kwargs)
    processor = Processor(settings=get_project_settings())
    return processor.run([job])

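【Comments】:

【Solution 3】:

This is an old question, but for future reference: if you are using Python 3.6+, I recommend scrapyscript, which lets you run your spiders and get the results in a very simple way: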

from scrapyscript import Job, Processor
from scrapy.spiders import Spider
from scrapy import Request
import json

# Define a Scrapy Spider, which can accept *args or **kwargs
# https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
class PythonSpider(Spider):
    name = 'myspider'

    def start_requests(self):
        yield Request(self.url)

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        return {'url': response.request.url, 'title': title}

# Create jobs for each instance. *args and **kwargs supplied here will
# be passed to the spider constructor at runtime
githubJob = Job(PythonSpider, url='http://www.github.com')
pythonJob = Job(PythonSpider, url='http://www.python.org')

# Create a Processor, optionally passing in a Scrapy Settings object.
processor = Processor(settings=None)

# Start the reactor, and block until all spiders complete.
data = processor.run([githubJob, pythonJob])

# Print the consolidated results
print(json.dumps(data, indent=4))

[
    {
        "title": [
            "Welcome to Python.org"
        ],
        "url": "https://www.python.org/"
    },
    {
        "title": [
            "The world's leading software development platform \u00b7 GitHub",
            "1clr-code-hosting"
        ],
        "url": "https://github.com/"
    }
]

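【Comments】:

【Solution 4】:

My suggestion is to use the Python subprocess module to run the spider from your script, rather than the method given in the Scrapy docs for running a spider from a Python script. The reason is that with the subprocess module you can capture the output/logs, and even the statements you print from inside the spider.

In Python 3, use the run function to execute the spider. For example: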

import subprocess

# 'command' is defined as shown further below
process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
if process.returncode == 0:
    result = process.stdout.decode('utf-8')
else:
    # inspect the failure via process.stderr
    error = process.stderr.decode('utf-8')

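Setting stdout/stderr to subprocess.PIPE is what allows the output to be captured, so it is important to set these arguments. Here command should be either a sequence or a string (if it is a string, call run with one additional argument: shell=True). For example: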

command = ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
# or
command = 'scrapy crawl website -a customArg=blahblah' # with shell=True
#or
import shlex
command = shlex.split('scrapy crawl website -a customArg=blahblah') # without shell=True

Also, process.stdout will contain the output of the script, but it is of type bytes. You need to convert it to str using decode('utf-8').
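The capture-and-decode steps above can be wrapped in a small helper. This is only a sketch: run_and_capture is a hypothetical name, and the demo command uses the current Python interpreter as a stand-in for a real 'scrapy crawl ...' command so that it runs without a Scrapy project.

```python
import subprocess
import sys


def run_and_capture(command):
    """Run a command given as a sequence; return (succeeded, decoded_text).

    On success the text is the decoded stdout, otherwise the decoded stderr.
    """
    process = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if process.returncode == 0:
        return True, process.stdout.decode('utf-8')
    return False, process.stderr.decode('utf-8')


# Stand-in for something like ['scrapy', 'crawl', 'website', '-a', 'customArg=blahblah']
demo_command = [sys.executable, '-c', 'print("scraped item")']
ok, text = run_and_capture(demo_command)
print(ok, text.strip())
```

With a real spider you would pass the scrapy command list instead of demo_command and then parse the returned text however your spider prints its results.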

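【Comments】:

【Solution 5】:

AFAIK there is no way to do this, since crawl():

Returns a deferred that is fired when the crawling is finished.

Apart from outputting the results to the logger, the crawler does not store them anywhere.

However, returning the output would conflict with the whole asynchronous nature and structure of Scrapy, so saving to a file and then reading it is the preferred approach here.
You can simply write a pipeline that saves your items to a file and then read that file in spider_output. You will get the complete results, because reactor.run() blocks your script until the crawl has finished and the output file is complete.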
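A minimal sketch of that file-based approach might look like the following. The class name JsonLinesFilePipeline, the read_items helper, and the file path are all hypothetical. The open_spider, process_item and close_spider hooks are the usual item pipeline methods, but here they are called by hand so the example runs without Scrapy. In a real project the pipeline would be enabled through ITEM_PIPELINES, and the path would have to come from settings rather than a constructor argument.

```python
import json
import os
import tempfile


class JsonLinesFilePipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.file.close()


def read_items(path):
    """Read back the items after the crawl has finished."""
    with open(path, encoding='utf-8') as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Simulate what Scrapy would do during a crawl (no real spider involved).
path = os.path.join(tempfile.mkdtemp(), 'items.jsonl')
pipeline = JsonLinesFilePipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({'url': 'http://example.com', 'title': 'Example'}, spider=None)
pipeline.close_spider(spider=None)

items = read_items(path)
print(items)
```

In the question's script, read_items would be called after reactor.run() returns, and its result passed to spider_output.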

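【Comments】:

Yes, you are right that the crawler does not store the results, but using signals we can get them.
@SheikhJames Oh, right, I completely forgot about signals. Very clever!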