【Question title】: Why is data missing when processing large MongoDB collections in PyMongo? What can I do about it?
【Posted】: 2016-06-03 18:49:26
【Question】:

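I am having trouble processing a very large MongoDB collection (19 million documents).

When I simply iterate over the collection, as below, PyMongo appears to give up after 10,593,454 documents. The same seems to happen even if I use skip(): the second half of the collection appears to be programmatically inaccessible.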

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]
classification_collection = db["my_classifications"]

# collstats reports how many documents the server believes the collection holds
print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

# walk the whole collection through a single cursor with the idle timeout disabled
for ii, classification in enumerate(classification_collection.find(no_cursor_timeout=True)):
    print("%s: created at %s" % (ii, classification["created_at"]))

print("Done.")

The script initially reports:

Collection contains 19036976 documents.

Eventually the script finishes without any error, and I do get the "Done." message. But the last line printed is

10593454: created at 2013-12-12 02:17:35

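All of the records I have logged in the past 2 years (the most recent ones) appear to be inaccessible. Does anyone know what is going on here? What can I do about it?

【Comments】:

How long does it run before it "finishes"? Could it be timing out?
@Takarii As you can see, I disabled the timeout with no_cursor_timeout=True. That said, it did hang for a long time at record 10,593,454, so it does feel like some kind of timeout, even though the program eventually carried on. But why would it time out after the same specific record every time?
Not quite sure. Can you query the database directly and see whether you can actually retrieve the records?
Hmm, MongoHub doesn't seem to allow .skip() values greater than 9999, so I'm not sure how I would do that.
I meant accessing the database directly, so that you can query for records where id > xxx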
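
The check suggested in the last comment could be sketched roughly as follows. This is only an illustration: `probe_after` and `boundary_id` are hypothetical names, and it assumes you have captured the `_id` of the last document the script managed to print (the original script prints only `created_at`, so it would need a small change to record that). It uses PyMongo's `find_one` with a `sort` argument to ask for the first document beyond the boundary and for the newest document overall.

```python
def probe_after(collection, boundary_id):
    """Look past a known _id to see whether later documents are reachable.

    `collection` is assumed to be a pymongo Collection; `boundary_id` is a
    hypothetical value: the _id of the last document the loop reached.
    """
    # First document whose _id sorts after the boundary (None if there is none).
    first_after = collection.find_one(
        {"_id": {"$gt": boundary_id}}, sort=[("_id", 1)]
    )
    # Document with the largest _id in the whole collection.
    newest = collection.find_one({}, sort=[("_id", -1)])
    return {"first_after": first_after, "newest": newest}


# Usage against a real server (not executed here):
#   import pymongo
#   coll = pymongo.MongoClient()["mydb"]["my_classifications"]
#   print(probe_after(coll, boundary_id))
```

If `first_after` comes back as a document, the later records are stored and can be queried, which would point at the long-lived cursor rather than at the data itself.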

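Tags: python mongodb collections pymongo large-data

【Solution 1】:

OK, thanks to this helpful article I found another way to page through the documents, one that does not seem to suffer from the "missing data"/"timeout" problem. Essentially, you use find() and limit() and rely on the collection's natural _id ordering to retrieve the documents in pages. Here is my revised code: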

#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client["mydb"]
classification_collection = db["my_classifications"]

print("Collection contains %s documents." % db.command("collstats", "my_classifications")["count"])

page_size = 100000
completed_rows = 0
last_id = None  # _id of the last document seen; None until the first page has been read

# keep getting pages until one comes back empty
while True:
    # the first page has no lower bound; later pages resume just after last_id
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    # sort on _id explicitly so that "$gt last_id" cannot skip unseen documents
    page = classification_collection.find(query, {"created_at": 1}, no_cursor_timeout=True).sort("_id", pymongo.ASCENDING).limit(page_size)
    rows_in_page = 0
    for classification in page:
        rows_in_page += 1
        completed_rows += 1
        if completed_rows % page_size == 0:
            print("%s (id = %s): created at %s" % (completed_rows, classification["_id"], classification["created_at"]))
        last_id = classification["_id"]
    # cursor.count() no longer exists in PyMongo 4, so detect the end by an empty page
    if rows_in_page == 0:
        break

print("\nDone.\n")

I hope that writing up this solution helps others who run into this problem.

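Note: this updated listing also takes on board the suggestions from @Takarii and @adam-comerford in the comments: I now retrieve only the fields I need (_id is provided by default), and I also print the IDs for reference.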
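
For reuse, the same paging idea can be wrapped in a generator. The following is a minimal sketch rather than the code from the answer: `iter_by_id` is a hypothetical helper name, `collection` is assumed to be a PyMongo collection, and the projection (if any) must not exclude `_id`, since the `_id` of the last document on each page is what the next query resumes from.

```python
def iter_by_id(collection, page_size=1000, projection=None):
    """Yield every document in ascending _id order, one page per query.

    Hypothetical helper: `collection` is assumed to behave like a pymongo
    Collection, i.e. find() returns a cursor supporting sort() and limit().
    """
    last_id = None  # no lower bound for the first page
    while True:
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        # 1 is the value of pymongo.ASCENDING; each page is a short-lived
        # cursor, so no single cursor has to survive the whole scan.
        cursor = collection.find(query, projection).sort("_id", 1).limit(page_size)
        page = list(cursor)
        if not page:
            return  # an empty page means the collection is exhausted
        for doc in page:
            yield doc
        last_id = page[-1]["_id"]


# Usage against a real server (not executed here):
#   import pymongo
#   coll = pymongo.MongoClient()["mydb"]["my_classifications"]
#   for doc in iter_by_id(coll, page_size=100000, projection={"created_at": 1}):
#       print(doc["_id"], doc["created_at"])
```

Because each page is materialised with `list()`, memory use grows with `page_size`; a smaller page size trades more round trips for a lower peak.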

【Comments】: