[Title]: How do I get scrapy pipeline to fill my mongodb with my items?
[Posted]: 2023-04-02 17:37:01
[Question]:

How do I get the scrapy pipeline to fill my mongodb with my items? Here's what my code currently looks like; it reflects what I took from the scrapy documentation. I'll also mention that I've tried returning items instead of yielding them, and that I've tried item loaders as well. Every approach seems to give the same result. On that note, if I run the command mongoimport --db mydb --collection mycoll --drop --jsonArray --file ~/path/to/scrapyoutput.json my database does get populated (as long as I yield rather than return items)... I would really like to get this pipeline working...

Okay, here's my code.

Here's my spider:

    import scrapy

    from scrapy.selector import Selector
    from scrapy.loader import ItemLoader
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import HtmlResponse
    from capstone.items import CapstoneItem

    class CongressSpider(CrawlSpider):
        name = "congress"
        allowed_domains = ["www.congress.gov"]
        start_urls = [
            'https://www.congress.gov/members',
        ]
        # creating a rule for my crawler. I only want it to continue to the next page, don't follow any other links.
        rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=("//a[@class='next']",)), callback="parse_page", follow=True),)

        def parse_page(self, response):
            for search in response.selector.xpath(".//li[@class='compact']"):
                yield {
                    'member' : ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip(),
                    'state' : ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip(),
                    'District' : ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip(),
                    'party' : ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip(),
                    'Served' : ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip(),
                }
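For reference, the item-based attempt I mentioned looked roughly like this (a sketch, not my exact code; note that items.py below declares served in lowercase, so the 'Served' key from the dict version has to change, since scrapy.Item raises a KeyError for fields that aren't declared):

    from capstone.items import CapstoneItem

    def parse_page(self, response):
        for search in response.selector.xpath(".//li[@class='compact']"):
            item = CapstoneItem()
            item['member'] = ' '.join(search.xpath("normalize-space(span/a/text())").extract()).strip()
            item['state'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item']/span/text())").extract()).strip()
            item['District'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][2]/span/text())").extract()).strip()
            item['party'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][3]/span/text())").extract()).strip()
            # 'served' must match the field declared in CapstoneItem; 'Served' would raise KeyError
            item['served'] = ' '.join(search.xpath("normalize-space(div[@class='quick-search-member']//span[@class='result-item'][4]/span//li/text())").extract()).strip()
            yield item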

Settings:

    BOT_NAME = 'capstone'

    SPIDER_MODULES = ['capstone.spiders']
    NEWSPIDER_MODULE = 'capstone.spiders'

    ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}
    MONGO_URI = 'mongodb://localhost:27017'
    MONGO_DATABASE = 'congress'
    ROBOTSTXT_OBEY = True
    DOWNLOAD_DELAY = 10

Here's my pipelines.py:

    import pymongo
    from pymongo import MongoClient
    from scrapy.conf import settings
    from scrapy.exceptions import DropItem
    from scrapy import log

    class MongoDBPipeline(object):
        collection_name= 'members'
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI')
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
            )
        def open_spider(self,spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
        def close_spider(self, spider):
            self.client.close()
        def process_item(self, item, spider):
            self.db[self.collection_name].insert(dict(item))
            return item

And here's items.py:

    import scrapy

    class CapstoneItem(scrapy.Item):
        member = scrapy.Field()
        state = scrapy.Field()
        District = scrapy.Field()
        party = scrapy.Field()
        served = scrapy.Field()

Last but not least, my output looks like this:

    2017-02-26 20:44:41 [scrapy.core.engine] INFO: Closing spider (finished)
    2017-02-26 20:44:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 8007,
    'downloader/request_count': 24,
    'downloader/request_method_count/GET': 24,
    'downloader/response_bytes': 757157,
    'downloader/response_count': 24,
    'downloader/response_status_count/200': 24,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 2, 27, 4, 44, 41, 767181),
    'item_scraped_count': 2139,
    'log_count/DEBUG': 2164,
    'log_count/INFO': 11,
    'request_depth_max': 22,
    'response_received_count': 24,
    'scheduler/dequeued': 23,
    'scheduler/dequeued/memory': 23,
    'scheduler/enqueued': 23,
    'scheduler/enqueued/memory': 23,
    'start_time': datetime.datetime(2017, 2, 27, 4, 39, 58, 834315)}
    2017-02-26 20:44:41 [scrapy.core.engine] INFO: Spider closed (finished)

So as far as I can tell, I'm not getting any errors and my items are being scraped. If I run it with -o myfile.json I can import myfile into my mongodb, but the pipeline isn't doing anything!

     mongo
     MongoDB shell version: 3.2.12
     connecting to: test
     Server has startup warnings: 
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten]
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] **    WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] 
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
     2017-02-24T18:51:24.276-0800 I CONTROL  [initandlisten] 
     > show dbs
     congress  0.078GB
     local     0.078GB
     > use congress
     switched to db congress
     > show collections
     members
     system.indexes
     > db.members.count()
     0
     > 

I suspect my problem has something to do with my settings file. I'm new to scrapy and mongodb, and I have a feeling I'm not correctly telling scrapy where my mongodb lives. Here are some other sources I found and tried to use as examples, but everything I tried led to the same result (crawl finished, mongo empty): https://realpython.com/blog/python/web-scraping-and-crawling-with-scrapy-and-mongodb/ https://github.com/sebdah/scrapy-mongodb Unfortunately I have more sources but not enough reputation to post more links. Anyway, any ideas would be greatly appreciated, thanks.

[Question discussion]:

  • You have a typo in your MongoDBPipeline: def open_sipder(self, spider): should be open_spider
  • Oops... that didn't fix it, but thanks!
  • Also, in MongoDBPipeline(object): the two arguments of the from_crawler(cls, crawler): return cls() statement should be separated by a comma. Whether or not that turns out to be the final fix, I'd suggest stackoverflow.com/questions/299704/… and stackoverflow.com/questions/1623039/python-debugging-tips for some basic testing/debugging tips when writing python scripts (a quick settings sanity check is also sketched right after these comments).
  • Thanks! I did eventually find that bug once the script actually ran. I appreciate the debugging material.
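Along the lines of that debugging advice, one quick sanity check (a sketch, assuming the project from the question): from inside the project directory, the scrapy settings command prints a setting's resolved value, so a key that Scrapy never recognized shows up as the untouched default:

    $ scrapy settings --get ITEM_PIPELINES
    {}

An empty dict here, even though settings.py assigns a pipeline, means Scrapy never saw the setting, which is exactly what a misspelled key produces, since Scrapy silently ignores settings it doesn't know about.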

Tags: mongodb python-3.x scrapy pymongo


[Solution 1]:

I commented out the line of code I had written myself:

ITEM_PIPLINES = {'capstone.pipelines.MongoDBPipeline': 300,}

and uncommented the version of that setting that was already sitting, commented out, much further down in the generated settings file:

ITEM_PIPELINES = {
    'capstone.pipelines.MongoDBPipeline': 300,
}

At first the only difference I could see was the line breaks, but look closely at the spelling: the template line reads ITEM_PIPELINES, while my own line said ITEM_PIPLINES (missing an E), and Scrapy silently ignores settings it doesn't recognize, so the pipeline was never enabled. After getting it to work I started getting python errors about the typos in my pipeline file. I could tell the pipeline had never been connected because, before my items were scraped, the output said:

[scrapy.middleware] INFO: Enabled item pipelines:[]

After changing my settings, I got this:

[scrapy.middleware] INFO: Enabled item pipelines:['capstone.pipelines.MongoDBPipeline']
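Once the pipeline shows up in that log line, the inserts can be confirmed from the mongo shell with the same check as in the question (assuming the congress database and members collection from the settings):

    > use congress
    switched to db congress
    > db.members.count()

A non-zero count here means process_item is actually reaching MongoDB.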

[Discussion]:

[Solution 2]:

There's a typo where you set the database name:

    mongo_db=crawer.settings.get('MONGO_DATABASE', 'items')
    

should be

    mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
    

Hope that works!
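Putting the comment-thread fixes and both answers together, here's a sketch of the full pipelines.py (with pymongo's insert_one in place of the deprecated insert, assuming pymongo 3.x):

    import pymongo

    class MongoDBPipeline(object):
        collection_name = 'members'

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),  # note the comma the comments pointed out
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),  # 'crawler', not 'crawer'
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # insert_one replaces the deprecated collection.insert in pymongo 3.x
            self.db[self.collection_name].insert_one(dict(item))
            return item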

[Discussion]:
