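【Posted】:2019-11-10 23:39:15
【Question】:
I want to use a spider's output in a Python script. To do that, I wrote the following code based on another thread.
The problem I'm facing is that the function spider_results() only returns a list containing the last item repeated over and over, instead of a list of all the items found. When I run the same spider manually with the scrapy crawl command, I get the desired output. Below are the script's output, the manual JSON output, and the spider itself.
What is wrong with my code?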
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from circus.spiders.circus import MySpider
from scrapy.signalmanager import dispatcher


def spider_results():
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished
    return results


if __name__ == '__main__':
    print(spider_results())
Script output:
[{'away_odds': 1.44,
'away_team': 'Los Angeles Dodgers',
'event_time': datetime.datetime(2019, 6, 8, 2, 15),
'home_odds': 2.85,
'home_team': 'San Francisco Giants',
'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
'league': 'MLB'}, {'away_odds': 1.44,
'away_team': 'Los Angeles Dodgers',
'event_time': datetime.datetime(2019, 6, 8, 2, 15),
'home_odds': 2.85,
'home_team': 'San Francisco Giants',
'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
'league': 'MLB'}, {'away_odds': 1.44,
'away_team': 'Los Angeles Dodgers',
'event_time': datetime.datetime(2019, 6, 8, 2, 15),
'home_odds': 2.85,
'home_team': 'San Francisco Giants',
'last_update': datetime.datetime(2019, 6, 6, 20, 58, 41, 655497),
'league': 'MLB'}]
JSON output from scrapy crawl:
[
{"home_team": "Los Angeles Angels", "away_team": "Seattle Mariners", "event_time": "2019-06-08 02:07:00", "home_odds": 1.58, "away_odds": 2.4, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Diego Padres", "away_team": "Washington Nationals", "event_time": "2019-06-08 02:10:00", "home_odds": 1.87, "away_odds": 1.97, "last_update": "2019-06-06 20:48:16", "league": "MLB"},
{"home_team": "San Francisco Giants", "away_team": "Los Angeles Dodgers", "event_time": "2019-06-08 02:15:00", "home_odds": 2.85, "away_odds": 1.44, "last_update": "2019-06-06 20:48:16", "league": "MLB"}
]
My spider:
from scrapy.spiders import Spider
from ..items import MatchItem
import json
import datetime
import dateutil.parser


class MySpider(Spider):
    name = 'first_spider'
    start_urls = ["https://websiteXYZ.com"]

    def parse(self, response):
        item = MatchItem()
        timestamp = datetime.datetime.utcnow()
        response_json = json.loads(response.body)
        for event in response_json["el"]:
            for team in event["epl"]:
                if team["so"] == 1: item["home_team"] = team["pn"]
                if team["so"] == 2: item["away_team"] = team["pn"]
            for market in event["ml"]:
                if market["mn"] == "Match result":
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]
                if market["mn"] == 'Moneyline':
                    item["event_time"] = dateutil.parser.parse(market["dd"]).replace(tzinfo=None)
                    for outcome in market["msl"]:
                        if outcome["mst"] == "1": item["home_odds"] = outcome["msp"]
                        #if outcome["mst"] == "X": item["draw_odds"] = outcome["msp"]
                        if outcome["mst"] == "2": item["away_odds"] = outcome["msp"]
            item["last_update"] = timestamp
            item["league"] = event["scn"]
            yield item
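One plausible explanation for the difference between the two outputs, sketched here without Scrapy: parse() creates a single MatchItem before the loop and yields that same object for every event. A JSON feed is presumably written as each item passes through, so it captures every intermediate state, whereas the results list only stores references to one object, which ends up holding the values of the last event. In the sketch below, plain dicts stand in for MatchItem, and the names parse_shared and parse_fresh are hypothetical helpers, not part of the original code.

```python
import json


def parse_shared(events):
    """Mimics the spider above: one item object is created before the loop."""
    item = {}
    for event in events:
        item["home_team"] = event["home"]
        item["away_team"] = event["away"]
        yield item  # the very same dict object on every iteration


def parse_fresh(events):
    """Creates a new item per event, so every yielded object is independent."""
    for event in events:
        item = {}
        item["home_team"] = event["home"]
        item["away_team"] = event["away"]
        yield item


events = [
    {"home": "Los Angeles Angels", "away": "Seattle Mariners"},
    {"home": "San Diego Padres", "away": "Washington Nationals"},
    {"home": "San Francisco Giants", "away": "Los Angeles Dodgers"},
]

# Storing references, as results.append(item) does in the signal handler:
# all three entries are the same object, holding the last event's values.
shared = list(parse_shared(events))

# Serializing each item at the moment it is yielded keeps every state.
serialized = [json.dumps(item) for item in parse_shared(events)]

# Two possible remedies: build a fresh item per event, or copy on collection.
fresh = list(parse_fresh(events))
copied = [dict(item) for item in parse_shared(events)]
```

Under this reading, moving item = MatchItem() inside the for event loop would be the spider-side fix, while appending a copy of the item in the handler would be the collector-side fix.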
Edit:
Based on the answer below, I tried the following two scripts:
controller.py
import json
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor, defer
from betsson_controlled.spiders.betsson import Betsson_Spider
from scrapy.utils.project import get_project_settings


class MyCrawlerRunner(CrawlerRunner):
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []
        # create crawler (same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)
        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)
        # create Twisted Deferred launching the crawl
        dfd = self._crawl(crawler, *args, **kwargs)
        # add callback - when the crawl is done, call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    return json.dumps([dict(item) for item in output])


settings = get_project_settings()
runner = MyCrawlerRunner(settings)
spider = Betsson_Spider()
deferred = runner.crawl(spider)
deferred.addCallback(return_spider_output)
reactor.run()
print(deferred)
When I run controller.py, I get:
<Deferred at 0x7fb046e652b0 current result: '[{"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}, {"home_team": "St. Louis Cardinals", "away_team": "Pittsburgh Pirates", "home_odds": 1.71, "away_odds": 2.19, "league": "MLB"}]'>
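Note that the items in this output carry no event_time or last_update fields. If they did, the json.dumps call in return_spider_output would need help, because datetime objects are not JSON serializable by default. A minimal sketch of one way to handle that, using only the standard library; the helper name to_json and the sample data are assumptions for illustration, not part of the original script:

```python
import datetime
import json


def to_json(items):
    """Serialize item-like dicts, rendering datetimes as ISO 8601 strings."""
    def encode(value):
        # json.dumps calls this only for values it cannot serialize itself.
        if isinstance(value, (datetime.datetime, datetime.date)):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(items, default=encode)


# Hypothetical sample shaped like the items shown in the script output.
sample = [{
    "home_team": "San Francisco Giants",
    "event_time": datetime.datetime(2019, 6, 8, 2, 15),
}]
encoded = to_json(sample)
```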
【Comments】:
-
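This is a shot in the dark, but they refactored how crawlers work in the newly released Scrapy. Check the changes listed in the docs here and see whether they help. Your results show that your deferred is working, but somehow the spider is either not finishing or not closing. docs.scrapy.org/en/1.7/news.html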
-
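Thanks for thinking of me. I'll look into it. I'm not sure I'll keep using Scrapy for this project if implementing such a simple feature is this complicated.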
-
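I know my answer is the right one; we're just missing something. I have this code running in production behind an API endpoint. But I know how it feels to be trying to figure out something like this. Reimplementing all of Scrapy's items and features yourself, with requests running concurrently, would probably be just as hard as solving this problem. At least we know the deferred works as a callback, so you should be able to work it out from here.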
-
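Try running your code inside a crawl function, as I did in the last piece of code with the deferred callback decorator, and see whether that changes anything. I think you may have to stop the reactor for the code to finish executing. reactor.run() should block until the script is done, but it never finishes. Once it does, all your items should be in the deferred variable...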
-
I've updated the answer, so give it another try... Try CrawlerProcess instead of the runner; it seems closer to what you need, whereas I needed the runner.