In MongoDB, why is it faster to query an indexed subdocument array than indexed first-level documents?

【Question title】: In mongodb, why is it faster to query indexed subdocument array than indexed first level documents?
【Posted】: 2013-07-27 12:59:13
【Question description】:

This is what my database looks like:

> show dbs
admin   0.203125GB
local   0.078125GB
profiler    63.9228515625GB
> use profiler
switched to db profiler
> show collections
documents
mentions

A document in mentions looks like this:

> db.mentions.findOne()
{
    "_id" : ObjectId("51ec29ef1b63042f6a9c6fd2"),
    "corpusID" : "GIGAWORD",
    "docID" : "WPB_ENG_20100226.0044",
    "url" : "http://en.wikipedia.org/wiki/Taboo",
    "mention" : "taboos",
    "offset" : 4526
}

A document in documents looks like this:

> db.documents.findOne()
{
    "_id" : ObjectId("51ec2d981b63042f6ae4ca0b"),
    "sentence_offsets" : [
        ..................
    ],
    "docID" : "WPB_ENG_20101020.0002",
    "text" : ".........",
    "verb_offsets" : [
    .............
    ],
    "mentions" : [
        {
            "url" : "http://en.wikipedia.org/wiki/Washington,_D.C.",
            "mention" : "Washington",
            "ner" : "ORG",
            "offset" : 122
        },
        ...................
    ],
    "corpusID" : "GIGAWORD",
    "chunk_offsets" : [
        .................
    ]
}

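There are 100M documents in mentions and 1.3M in documents. Every mention that appears in mentions should also appear exactly once in the mentions array of some document in documents. The reason I store the mention information inside documents is to avoid going into mentions to retrieve context. However, when I query only mentions, I expected that having a separate collection, mentions, would be faster.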

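However, after experimenting with indexes on both mentions.url/mentions.mention and documents.mentions.url/documents.mentions.mention, and querying the same url/mention in both collections, I found that the response from the documents collection comes back about twice as fast as the response from the mentions collection.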

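I am not sure how indexes work internally, but I assumed both indexes have the same size, because the number of mentions is the same in both collections. So shouldn't they have the same response time?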

I am trying something like

> db.mentions.find({url: "http://en.wikipedia.org/wiki/Washington,_D.C."}).explain()

so there should be no difference in network overhead.

Here is the output of

> db.mentions.find({mention: "Illinois"}).explain()

{
    "cursor" : "BtreeCursor mention_1",
    "isMultiKey" : false,
    "n" : 4342,
    "nscannedObjects" : 4342,
    "nscanned" : 4342,
    "nscannedObjectsAllPlans" : 4342,
    "nscannedAllPlans" : 4342,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 14,
    "nChunkSkips" : 0,
    "millis" : 18627,
    "indexBounds" : {
        "mention" : [
            [
                "Illinois",
                "Illinois"
            ]
        ]
    },
    "server" : "----:----"
}

and of

> db.documents.find({"mentions.mention": "Illinois"}).explain()

{
    "cursor" : "BtreeCursor mentions.mention_1",
    "isMultiKey" : true,
    "n" : 3102,
    "nscannedObjects" : 3102,
    "nscanned" : 3102,
    "nscannedObjectsAllPlans" : 3102,
    "nscannedAllPlans" : 3102,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 8,
    "nChunkSkips" : 0,
    "millis" : 7862,
    "indexBounds" : {
        "mentions.mention" : [
            [
                "Illinois",
                "Illinois"
            ]
        ]
    },
    "server" : "----:----"
}

And the stats (yes, I restored the collections and have not indexed documents.url yet):

> db.documents.stats()
{
    "ns" : "profiler.documents",
    "count" : 1302957,
    "size" : 23063622656,
    "avgObjSize" : 17700.985263519826,
    "storageSize" : 25188048768,
    "numExtents" : 31,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 3432652720,
    "indexSizes" : {
        "_id_" : 42286272,
        "mentions.mention_1" : 3390366448
    },
    "ok" : 1
}
> db.mentions.stats()
{
    "ns" : "profiler.mentions",
    "count" : 97458884,
    "size" : 15299979084,
    "avgObjSize" : 156.98906509128506,
    "storageSize" : 17891127216,
    "numExtents" : 29,
    "nindexes" : 3,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 0,
    "totalIndexSize" : 15578411408,
    "indexSizes" : {
        "_id_" : 3162125232,
        "mention_1" : 4742881248,
        "url_1" : 7673404928
    },
    "ok" : 1
}

I would really appreciate it if someone could tell me why this happens. :]

【Discussion】:

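Could you share the following output: 1) the explain output of both of your queries 2) db.collection.stats() for both of your collections?
Dylan, sorry for the late reply. I have updated the question with the output of explain() and stats() on the collections. I also added some further investigation under Asya's answer. Any ideas?
The execution plans show me two reasons why one is faster than the other. Note that they may not be the only reasons. First, your subdocument index is 33% more selective: the other query scans and retrieves 33% more documents. Performance then degrades further because the query runs longer. Notice that your first query yields 14 times while the other yields 8 times. This means your first query yielded to more write operations, or it may even indicate that the working set of your index is not in memory and it is page-faulting, which causes more yields.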
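To make the comparison in the comment above concrete, here is a minimal sketch in plain JavaScript (written for Node.js, not for the mongo shell) that works out the ratios from the two explain() outputs in the question. The helper name comparePlans is hypothetical, and the input figures are copied by hand from the question rather than read from a live server.

```javascript
// Figures copied from the two explain() outputs in the question.
var mentionsPlan = { n: 4342, nscanned: 4342, nYields: 14, millis: 18627 };
var documentsPlan = { n: 3102, nscanned: 3102, nYields: 8, millis: 7862 };

// Hypothetical helper (not part of MongoDB): derive comparison ratios
// from two legacy-format explain() results.
function comparePlans(a, b) {
  return {
    scannedRatio: a.nscanned / b.nscanned,   // index entries walked, a relative to b
    millisRatio: a.millis / b.millis,        // elapsed time, a relative to b
    yieldsRatio: a.nYields / b.nYields,      // lock yields, a relative to b
    msPerScannedA: a.millis / a.nscanned,    // average cost per scanned entry in a
    msPerScannedB: b.millis / b.nscanned     // average cost per scanned entry in b
  };
}

var comparison = comparePlans(mentionsPlan, documentsPlan);
console.log(comparison);
```

With these inputs the mentions query scans roughly 1.4 times as many entries but takes roughly 2.4 times as long, so the average cost per scanned entry also differs (about 4.3 ms versus about 2.5 ms). That fits the comment's caveat that the scanned count may not be the only reason for the gap.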

Tags: mongodb indexing bigdata database

【Solution 1】:

There are 100M documents in mentions and 1.3M in documents.

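There are the same number of index entries in both indexes, because you told us that you store the mentions both in documents and in mentions.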

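So the index access time should be the same. You can measure this by running a covered index query against both, meaning that you only ask for values that are stored in the index: db.x.find({url:"xxx"}, {_id:0, "url":1}) means find the matching documents and return only the value of url from them. If the two are not equal across the two collections, then there may be something unusual about your setup, or one of the indexes does not fit in RAM, or there is some other measurement-related issue.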
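The measurement described above could be wrapped in a small helper so that exactly the same query shape is run against both collections. This is only a sketch: explainIndexedFieldOnly is a hypothetical name, and it assumes the legacy explain() format shown in the question (fields such as cursor, indexOnly and millis), which newer MongoDB versions no longer return in that shape.

```javascript
// Hypothetical helper: run an equality query that asks only for the indexed
// field back, then keep the explain() fields that matter for the comparison.
function explainIndexedFieldOnly(collection, field, value) {
  var query = {};
  query[field] = value;

  // Same projection shape as in the answer: drop _id, keep the indexed field.
  var projection = { _id: 0 };
  projection[field] = 1;

  var plan = collection.find(query, projection).explain();
  return {
    cursor: plan.cursor,
    indexOnly: plan.indexOnly,   // true only if the index alone answered the query
    n: plan.n,
    millis: plan.millis
  };
}

// Assumed usage in the mongo shell (not run here):
//   explainIndexedFieldOnly(db.mentions, "mention", "Illinois");
//   explainIndexedFieldOnly(db.documents, "mentions.mention", "Illinois");
```

One caveat worth checking before trusting the numbers: as far as I know, a multikey index cannot cover a query on the array field itself, so indexOnly may stay false for the documents collection even with this projection. In that case the two timings would not be a pure index-to-index comparison.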

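If the two are the same, but fetching the documents is consistently faster from the documents collection, I would check and work out why: the full explain output can show where the time is being spent, and whether, for example, more of one collection than the other happens to be resident in RAM.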

【Discussion】:

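Thanks, Asya. There should be the same number of mentions in the mentions collection and in the documents collection. However, according to the collection.stats() output that I just attached to the question, the mention_1 index is more than 1/3 larger than the mentions.mention_1 index. Also, when I run watch -n 1 free -m, querying the mentions coll seems to consume twice as much RAM as querying the documents coll.
Although a document in the documents coll is definitely larger than one in the mentions coll, the number of matching documents in the doc coll is much smaller than in the mentions coll. So maybe mongo loads the results into RAM when querying, and the total size of the results from the doc coll happens to be half of that from the mentions coll?
It turns out that db.documents.find({"mentions.mention":"China"}, {_id:1}).explain() gives millis: 98502 and db.mentions.find({"mention":"China"}, {_id:1}).explain() gives millis: 134932. Confused.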
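The size comparison raised in the comments above can be worked out directly from the stats() output in the question. The following is a minimal sketch in plain JavaScript (for Node.js); the helper name summarizeStats and the field name mentionIndexBytes are hypothetical, and the numbers are copied by hand from the question.

```javascript
// Figures copied from db.documents.stats() and db.mentions.stats() in the question.
// mentionIndexBytes is the size of "mentions.mention_1" and "mention_1" respectively.
var documentsStats = { count: 1302957, mentionIndexBytes: 3390366448 };
var mentionsStats = { count: 97458884, mentionIndexBytes: 4742881248 };

// Hypothetical helper: derive a few ratios for comparing the two collections.
function summarizeStats(docs, ments) {
  return {
    mentionsPerDocument: ments.count / docs.count,
    indexSizeRatio: ments.mentionIndexBytes / docs.mentionIndexBytes,
    bytesPerMentionIndexEntry: ments.mentionIndexBytes / ments.count
  };
}

var summary = summarizeStats(documentsStats, mentionsStats);
console.log(summary);
```

With these inputs there are about 75 mentions per document, and mention_1 is about 1.4 times the size of mentions.mention_1, which is close to the ratio of matches for "Illinois" in the explain() outputs (4342 versus 3102). One possible reading, not confirmed anywhere in this thread, is that a multikey index keeps at most one key per distinct value per document, so a document that mentions the same string several times contributes a single index entry. If that is the case here, the two indexes would not hold the same number of entries.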