【Posted】: 2023-04-02 17:37:01
【Problem description】:
How do I get my Scrapy pipeline to populate MongoDB with my items? Here is what my code currently looks like; it reflects what I picked up from the Scrapy documentation.
I should also mention that I have tried returning items instead of yielding them, and I have tried using item loaders. Every approach seems to give the same result.
On that note, I'd like to mention that if I run the command
mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json
my database does get populated (as long as I yield rather than return the items)... I would really like to get this pipeline working...
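For completeness, that JSON file comes from running my spider with the -o feed export flag, something like:
scrapy crawl congress -o ~/path/to/scrapyoutput.json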
OK, here is my code.
Here is my spider:
import scrapy
from scrapy.selector import Selector
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import HtmlResponse
from capstone.items import CapstoneItem

class CongressSpider(CrawlSpider):
    name = "congress"
    allowed_domains = ["www.congress.gov"]
    start_urls = [
        'https://www.congress.gov/members',
    ]
    # Creating a rule for my crawler. I only want it to continue to the next page; don't follow any other links.
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)), callback="parse_page", follow=True),)

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            yield {
                'member': ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                'state': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                'District': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                'party': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                'Served': ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
            }
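For reference, the item loader attempt I mentioned above looked roughly like this (a sketch from memory that swaps in for parse_page; same XPaths, standard ItemLoader add_xpath/load_item calls, and it gave the same result):

    def parse_page(self, response):
        # ItemLoader variant of the dict version above. Note items.py
        # declares 'served' in lowercase, so the loader uses that name.
        for search in response.selector.xpath(".//li[@class='compact']"):
            loader = ItemLoader(item=CapstoneItem(), selector=search)
            loader.add_xpath('member', "normalize-space(span/a/text())")
            loader.add_xpath('state', "normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())")
            loader.add_xpath('District', "normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())")
            loader.add_xpath('party', "normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())")
            loader.add_xpath('served', "normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())")
            yield loader.load_item()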
Here are my settings:
BOT_NAME = 'capstone'
SPIDER_MODULES = ['capstone.spiders']
NEWSPIDER_MODULE = 'capstone.spiders'
ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'congress'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 10
Here is my pipeline.py:
import pymongo
from pymongo import MongoClient
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class MongoDBPipeline(object):
    collection_name = 'members'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI')
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
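One sanity check worth noting (a sketch, not part of my original code) is to log from process_item to confirm the pipeline is being loaded at all:

    import logging

    def process_item(self, item, spider):
        # If this line never appears in the crawl log, the pipeline was
        # never registered, so the Mongo insert can't happen.
        logging.info("MongoDBPipeline got item: %s", dict(item))
        self.db[self.collection_name].insert(dict(item))
        return item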
And here is items.py:
import scrapy

class CapstoneItem(scrapy.Item):
    member = scrapy.Field()
    state = scrapy.Field()
    District = scrapy.Field()
    party = scrapy.Field()
    served = scrapy.Field()
Last but not least, my output looks like this:
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8007,
'downloader/request_count': 24,
'downloader/request_method_count/GET': 24,
'downloader/response_bytes': 757157,
'downloader/response_count': 24,
'downloader/response_status_count/200': 24,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
'item_scraped_count': 2139,
'log_count/DEBUG': 2164,
'log_count/INFO': 11,
'request_depth_max': 22,
'response_received_count': 24,
'scheduler/dequeued': 23,
'scheduler/dequeued/memory': 23,
'scheduler/enqueued': 23,
'scheduler/enqueued/memory': 23,
'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)
So as far as I can tell, I'm not getting any errors and my items are being scraped. If I run with -o myfile.json, I can import myfile into my mongodb, but the pipeline isn't doing anything!
mongo
MongoDB shell version: 3.2.12
connecting to: test
Server has startup warnings:
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten] ** We suggest setting it to 'never'
2017-02-24T18:51:24.276-0800 I CONTROL [initandlisten]
> show dbs
congress 0.078GB
local 0.078GB
> use congress
switched to db congress
> show collections
members
system.indexes
> db.members.count()
0
>
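Checking from plain pymongo shows the same empty collection (a quick sketch using the same URI and names as my settings):

    from pymongo import MongoClient

    # Same connection details as MONGO_URI / MONGO_DATABASE in settings.py
    client = MongoClient('mongodb://localhost:27017')
    print(client['congress']['members'].count())  # prints 0, matching the shell session above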
I suspect my problem has something to do with my settings file. I'm new to Scrapy and MongoDB, and I get the feeling I'm not correctly telling Scrapy where my MongoDB lives.
Here are some other sources I found and tried to use as examples, but everything I tried ended with the same result (crawl finished, mongo empty):
https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/
https://github.com/sebdah/scrapy-mongodb
Unfortunately I have more sources but not enough reputation to post more links.
Anyway, any ideas would be appreciated. Thanks.
【Comments】:
-
Your MongoDBPipeline has a typo: def open_sipder(self, spider): should be open_spider.
-
Oops... that didn't fix the problem, but thanks!
-
Also in MongoDBPipeline(object), the two arguments in from_crawler(cls, crawler)'s return cls(...) statement should be separated by a comma. Whether or not that turns out to be the last step, I'd suggest stackoverflow.com/questions/299704/… and stackoverflow.com/questions/1623039/python-debugging-tips for some tips on basic testing/debugging while writing Python scripts.
-
Thanks! I eventually found that bug once the script actually ran. I appreciate the debugging material.
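For anyone following along, the fix described in that comment, spelled out (comma added between the two keyword arguments):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )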
Tags: mongodb python-3.x scrapy pymongo