MongoDB InvalidDocument: Cannot encode object (answers)

【Question title】: MongoDB InvalidDocument: Cannot encode object
【Posted】: 2015-11-04 14:34:09
【Question description】:

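I am using scrapy to scrape blogs and then store the data in MongoDB. At first I got the InvalidDocument exception. It seemed obvious to me that the data was not encoded correctly. So, in my MongoPipeline, before persisting the object, I check that the document is "utf-8 strict", and only then do I try to persist the object to MongoDB. But I still get InvalidDocument exceptions, which is getting annoying.

Here is my code, the MongoPipeline object that persists items to MongoDB: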

# -*- coding: utf-8 -*-

# Define your item pipelines here
#

import pymongo
import sys, traceback
from scrapy.exceptions import DropItem
from crawler.items import BlogItem, CommentItem


class MongoPipeline(object):
    collection_name = 'master'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'posts')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):

        if type(item) is BlogItem:
            try:
                if 'url' in item:
                    item['url'] = item['url'].encode('utf-8', 'strict')
                if 'domain' in item:
                    item['domain'] = item['domain'].encode('utf-8', 'strict')
                if 'title' in item:
                    item['title'] = item['title'].encode('utf-8', 'strict')
                if 'date' in item:
                    item['date'] = item['date'].encode('utf-8', 'strict')
                if 'content' in item:
                    item['content'] = item['content'].encode('utf-8', 'strict')
                if 'author' in item:
                    item['author'] = item['author'].encode('utf-8', 'strict')

            except:  # catch *all* exceptions
                e = sys.exc_info()[0]
                spider.logger.critical("ERROR ENCODING %s", e)
                traceback.print_exc(file=sys.stdout)
                raise DropItem("Error encoding BLOG %s" % item['url'])

            if 'comments' in item:
                comments = item['comments']
                item['comments'] = []

                try:
                    for comment in comments:
                        if 'date' in comment:
                            comment['date'] = comment['date'].encode('utf-8', 'strict')
                        if 'author' in comment:
                            comment['author'] = comment['author'].encode('utf-8', 'strict')
                        if 'content' in comment:
                            comment['content'] = comment['content'].encode('utf-8', 'strict')

                        item['comments'].append(comment)

                except:  # catch *all* exceptions
                    e = sys.exc_info()[0]
                    spider.logger.critical("ERROR ENCODING COMMENT %s", e)
                    traceback.print_exc(file=sys.stdout)

        self.db[self.collection_name].insert(dict(item))

        return item

I still get the following exception:

au coeur de l\u2019explosion de la bulle Internet n\u2019est probablement pas \xe9tranger au succ\xe8s qui a suivi. Mais franchement, c\u2019est un peu court comme argument !Ce que je sais dire, compte tenu de ce qui pr\xe9c\xe8de, c\u2019est quelles sont les conditions pour r\xe9ussir si l\u2019on est vraiment contraint de rester en France. Ce sont des sujets que je d\xe9velopperai dans un autre article.',
     'date': u'2012-06-27T23:21:25+00:00',
     'domain': 'reussir-sa-boite.fr',
     'title': u'Peut-on encore entreprendre en France ?\t\t\t ',
     'url': 'http://www.reussir-sa-boite.fr/peut-on-encore-entreprendre-en-france/'}
    Traceback (most recent call last):
      File "h:\program files\anaconda\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "H:\PDS\BNP\crawler\crawler\pipelines.py", line 76, in process_item
        self.db[self.collection_name].insert(dict(item))
      File "h:\program files\anaconda\lib\site-packages\pymongo\collection.py", line 409, in insert
        gen(), check_keys, self.uuid_subtype, client)
    InvalidDocument: Cannot encode object: {'author': 'Arnaud Lemasson',
     'content': 'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me co\xc3\xbbterait bien trop cher. Bref, 100% d\xe2\x80\x99accord avec vous. Le probl\xc3\xa8me, je ne vois pas comment cela pourrait changer avec le gouvernement actuel\xe2\x80\xa6 A moins que si, j\xe2\x80\x99ai pu lire il me semble qu\xe2\x80\x99ils avaient en t\xc3\xaate de r\xc3\xa9duire l\xe2\x80\x99IS pour les petites entreprises et de l\xe2\x80\x99augmenter pour les grandes\xe2\x80\xa6 A voir',
     'date': '2012-06-27T23:21:25+00:00'}
    2015-11-04 15:29:15 [scrapy] INFO: Closing spider (finished)
    2015-11-04 15:29:15 [scrapy] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 259,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 252396,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 11, 4, 14, 29, 15, 701000),
     'log_count/DEBUG': 2,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2015, 11, 4, 14, 29, 13, 191000)}

Another interesting thing: following @eLRuLL's comment, I tried the following:

>>> s = "Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me"
>>> s
'Tellement vrai\xe2\x80\xa6 Il faut vraiment \xc3\xaatre motiv\xc3\xa9 aujourd\xe2\x80\x99hui pour monter sa bo\xc3\xaete. On est pr\xc3\xa9lev\xc3\xa9 de partout, je ne pense m\xc3\xaame pas \xc3\xa0 embaucher, cela me'
>>> se = s.encode("utf8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> se = s.encode("utf-8", "strict")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)
>>> s.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 14: ordinal not in range(128)

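So my question is: if this text cannot be encoded, why does the try/except in my MongoPipeline not catch the exception? Only objects that raise no exception should be appended to item['comments'].

【Question discussion】:

Have you tried converting the item to a dict first and then updating each field?
@eLRuLL As you suggested, I tried converting the item to a dict and then updating all fields with their utf8 strict encoded values, but this raises the same InvalidDocument exception.

Tags: python mongodb encoding scrapy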

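【Solution 1】:

I finally figured it out. The problem was not the encoding. It was the structure of the document.

I had used the standard MongoPipeline example, which does not handle nested scrapy items.

What I have is: BlogItem: "url" ... comments = [CommentItem]

So my BlogItem holds a list of CommentItems. The problem arises when persisting the object to the database: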

self.db[self.collection_name].insert(dict(item))

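Here I convert the BlogItem to a dict, but I do not convert the list of CommentItems. And because the traceback displays a CommentItem much like a dict, it did not occur to me that the offending object was not a dict!

So in the end, the fix was to change the line that appends the comment to the comment list: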

item['comments'].append(dict(comment))

Now MongoDB treats it as a valid document.
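The fix above handles one level of nesting. As a minimal sketch of the same idea applied recursively, the example below uses `FakeItem`, a hypothetical stand-in for a scrapy item (assumed here to be dict-like without being a `dict`), and `to_plain`, a helper name invented for this illustration. It is not part of scrapy or pymongo.

```python
from collections.abc import Mapping, MutableMapping


class FakeItem(MutableMapping):
    """Hypothetical stand-in for a scrapy item: dict-like, but not a dict."""

    def __init__(self, **fields):
        self._values = dict(fields)

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


def to_plain(value):
    """Recursively turn mapping-like objects into dicts and sequences into lists."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(val) for val in value]
    return value


blog = FakeItem(url="http://example.com/post", comments=[FakeItem(author="A")])

shallow = dict(blog)   # only the outer item becomes a dict
deep = to_plain(blog)  # nested items become dicts as well
```

With this sketch, `shallow["comments"][0]` is still a `FakeItem`, while `deep` contains only plain dicts and lists, which is the shape the answer says MongoDB accepts.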

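Finally, as for the last part, where I asked why I got the exception in the Python console but not in the script:

The reason is that I was working in the Python console, which only supports ASCII; hence the error.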
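A more specific reading of the console session, assuming Python 2 semantics: calling `.encode()` on a byte string first decodes it implicitly with the ASCII codec, which is why an encode call reports a `UnicodeDecodeError`. The sketch below is written for Python 3, where that decode step has to be spelled out. It illustrates the mechanism and does not reproduce the original session.

```python
# UTF-8 encoded bytes, the same prefix as the string typed into the console.
raw = b"Tellement vrai\xe2\x80\xa6"

# Decoding as ASCII fails on the first non-ASCII byte (0xe2 at index 14).
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the correct codec works, and strict re-encoding round-trips.
text = raw.decode("utf-8")
round_trip = text.encode("utf-8", "strict")
```

In the traceback from the question, the blog fields are shown as `u'...'` unicode objects, so encoding them would not go through this implicit ASCII decode.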

【Discussion】:

My yield list brought me here :b

【Solution 2】:

I got this error when running the query

db.collection.find({'attr': {'$gte': 20}})

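Some records in the collection had non-numeric values for attr.

【Discussion】:

【Solution 3】:

First, when you call "somestring".encode(...), it does not change "somestring"; it returns a new encoded string, so you should use something like: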

 item['author'] = item['author'].encode('utf-8', 'strict')

The same goes for the other fields.
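A small sketch of that point, written for Python 3 (the variable names are made up for the example): the original string is left untouched and the encoded result is a separate object that must be assigned somewhere.

```python
original = "motiv\u00e9"  # the text "motivé"

# encode() returns a new bytes object; it does not modify `original`.
encoded = original.encode("utf-8", "strict")
```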

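【Discussion】:

The goal was to verify whether encoding is possible, that is, whether the variable can be encoded as UTF-8. If it throws an exception, I do not include the object. Also, since MongoDB by default encodes its objects before persisting them, I thought storing the encoded objects was pointless. Nonetheless, I did as you suggested, but I still get the same error. I am updating the question.
By the way, when I try: s = 'Tellement vrai\xe2\x80\xa6 Il...'; s2=s.encode('utf-8', 'strict') I get a UnicodeDecodeError
That means comment['content'] was not encoded, or an obvious encoding error that should have been raised was not.
spider.logger.critical("ERROR ENCODING %s", e) should be spider.logger.critical("ERROR ENCODING %s" % e), or better, use import logging; logging.critical("error")
I just verified that the encoding lines are executed. They are. So maybe MongoDB is unhappy with UTF-8 strict? That seems unlikely to me.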

【Solution 4】:

I got the same error when using a numpy array in a Mongo query:

'myField' : { '$in': myList },

The fix was to convert the numpy ndarray to a list:

'myField' : { '$in': list(myList) },
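As a hedged variation on this fix: `list()` over a numpy array keeps numpy scalar types as elements, whereas the array's `tolist()` method returns native Python values, which may be the safer input for a query. The sketch below uses the standard library's `array.array` as a stand-in (it also has `tolist()`) so that it runs without numpy. `as_query_list` is a hypothetical helper name, not a pymongo or numpy function.

```python
from array import array


def as_query_list(values):
    """Return a plain Python list, preferring the object's own tolist()."""
    if hasattr(values, "tolist"):
        return values.tolist()
    return list(values)


# Stand-in for a numpy array; a real ndarray would be passed the same way.
my_values = array("i", [1, 2, 3])

query = {"myField": {"$in": as_query_list(my_values)}}
```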

【Discussion】: