【Posted at】:2018-02-09 09:02:57
【Problem description】:
I am trying to implement this pipeline in my spider. After installing the necessary dependencies I can run the spider without any errors, but for some reason it does not write to my database.
I am fairly sure something goes wrong when connecting to the database: even when I enter a wrong password, I still do not get any error.
Also, once the spider has scraped all the data, it takes several minutes before it starts dumping the stats:
2017-08-31 13:17:12 [scrapy] INFO: Closing spider (finished)
2017-08-31 13:17:12 [scrapy] INFO: Stored csv feed (27 items) in: test.csv
2017-08-31 13:24:46 [scrapy] INFO: Dumping Scrapy stats:
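One thing worth noting about the silent wrong-password behaviour: twisted's adbapi.ConnectionPool does not open any connection in __init__; connections are opened lazily when the first query runs, and a failure at that point only shows up in that query's errback. A minimal sketch of a hypothetical check (the _test_connection name and the prints are my own assumptions, not part of the original pipeline) that could be called right after creating the pool to surface credential problems immediately:

def _test_connection(self):
    # Hypothetical helper, not in the original code: fire a trivial query so a
    # bad host/user/password fails right away instead of silently on the first
    # insert. __init__ succeeding says nothing about the credentials.
    def on_ok(result):
        print "database connection OK"
    def on_fail(failure):
        print "database connection FAILED"
        failure.printTraceback()
    d = self.dbpool.runQuery("SELECT 1")
    d.addCallbacks(on_ok, on_fail)
    return d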
Pipeline:
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy import log

SETTINGS = {}
SETTINGS['DB_HOST'] = 'mysql.domain.com'
SETTINGS['DB_USER'] = 'username'
SETTINGS['DB_PASSWD'] = 'password'
SETTINGS['DB_PORT'] = 3306
SETTINGS['DB_DB'] = 'database_name'

class MySQLPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def __init__(self, stats):
        print "init"
        # Instantiate DB
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=SETTINGS['DB_HOST'],
            user=SETTINGS['DB_USER'],
            passwd=SETTINGS['DB_PASSWD'],
            port=SETTINGS['DB_PORT'],
            db=SETTINGS['DB_DB'],
            charset='utf8',
            use_unicode=True,
            cursorclass=MySQLdb.cursors.DictCursor
        )
        self.stats = stats
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        print "close"
        """ Cleanup function, called after crawling has finished to close open
        objects.
        Close ConnectionPool. """
        self.dbpool.close()

    def process_item(self, item, spider):
        print "process"
        query = self.dbpool.runInteraction(self._insert_record, item)
        query.addErrback(self._handle_error)
        return item

    def _insert_record(self, tx, item):
        print "insert"
        result = tx.execute(
            " INSERT INTO matches(type,home,away,home_score,away_score) VALUES (soccer,"+item["home"]+","+item["away"]+","+item["score"].explode("-")[0]+","+item["score"].explode("-")[1]+")"
        )
        if result > 0:
            self.stats.inc_value('database/items_added')

    def _handle_error(self, e):
        print "error"
        log.err(e)
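Two things stand out in _insert_record: Python strings have no explode() method (that is PHP; the Python equivalent is split()), and the concatenated SQL puts soccer and the team names into the statement without quotes, so the interaction raises as soon as it runs and the failure only surfaces in the errback. A sketch of a parameterised version, assuming the same matches(type, home, away, home_score, away_score) layout as the original query:

def _insert_record(self, tx, item):
    print "insert"
    # Sketch only: split the score string (assumed to look like "2-1") and let
    # MySQLdb do the quoting/escaping via query parameters instead of string
    # concatenation.
    home_score, away_score = item["score"].split("-")
    tx.execute(
        "INSERT INTO matches (type, home, away, home_score, away_score) "
        "VALUES (%s, %s, %s, %s, %s)",
        ("soccer", item["home"], item["away"],
         home_score.strip(), away_score.strip())
    )
    if tx.rowcount > 0:
        self.stats.inc_value('database/items_added')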
Spider:
import scrapy
import dateparser
from crawling.items import KNVBItem

class KNVBspider(scrapy.Spider):
    name = "knvb"
    start_urls = [
        'http://www.knvb.nl/competities/eredivisie/uitslagen',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawling.pipelines.MySQLPipeline': 301,
        }
    }

    def parse(self, response):
        # www.knvb.nl/competities/eredivisie/uitslagen
        for row in response.xpath('//div[@class="table"]'):
            for div in row.xpath('./div[@class="row"]'):
                match = KNVBItem()
                match['home'] = div.xpath('./div[@class="value home"]/div[@class="team"]/text()').extract_first()
                match['away'] = div.xpath('./div[@class="value away"]/div[@class="team"]/text()').extract_first()
                match['score'] = div.xpath('./div[@class="value center"]/text()').extract_first()
                match['date'] = dateparser.parse(div.xpath('./preceding-sibling::div[@class="header"]/span/span/text()').extract_first(), languages=['nl']).strftime("%d-%m-%Y")
                yield match
If there is a better pipeline for what I am trying to achieve, that would also be welcome. Thanks!
Update: through the link provided in the accepted answer I eventually ended up with this working function (which solved my problem):
def process_item(self, item, spider):
    print "process"
    query = self.dbpool.runInteraction(self._insert_record, item)
    query.addErrback(self._handle_error)
    query.addBoth(lambda _: item)
    return query
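(Presumably this works because Scrapy waits for a Deferred returned from process_item before passing the item on, so the insert has to finish, and addBoth(lambda _: item) ensures the item itself, rather than the query result or a Failure, is what comes out of the pipeline step.)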
【Comments】:
- Your indentation seems to be off. _insert_record and _handle_error are not part of the pipeline class. Have you checked whether process_item in the pipeline actually gets called (print something from it)?
- Updated the code with some prints and fixed the indentation. process_item gets called, as do all the other functions except _insert_record and _handle_error. I am really surprised it gets this far, since the connection in the init function should fail because of the wrong password, but for some reason I get no error.
- @Casper Do you have access to the MySQL server logs? If so, try checking them; you might find something there.
Tags: mysql python-2.7 scrapy