Optimizing MongoDB aggregate query on large Index objects: answers

【Question title】: Optimizing MongoDB aggregate query on large Index objects
【Posted】: 2020-08-12 02:05:45
【Question】:

I have 20 million objects in my MongoDB collection. It currently runs on an M30 MongoDB instance with 7.5 GB of RAM and 40 GB of disk.

The data is stored in the collection like this:

{
 _id:xxxxx,
 id : 1 (int),
 from : xxxxxxxx (int),
 to : xxxxxx (int),
 status : xx (int)
 .
 .
 .
 .
},
{

 _id:xxxxx,
 id : 2 (int),
 from : xxxxxxxx (int),
 to : xxxxxx (int),
 status : xx (int)
 .
 .
 .
 .
}
.
.
.
. and so on..

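id has a unique index, and from is also indexed in this collection.

I am running a query that filters on a given 'from' value, groups by 'to', returns the max id for each group, and sorts by that max id.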

$collection->aggregate([
            ['$project' => ['id'=>1,'to'=>1,'from'=>1]],
            [ '$match'=> [
                        '$and'=> 
                                [ 
                                    [ 'from'=> xxxxxxxxxx],
                                    [ 'status'=> xx ],
                                ] 
                        ] 
            ],
            ['$group' => [
                        '_id' => 
                                '$to',
                                'max_revision'=>['$max' => '$id'],
                        ]
            ],
            ['$sort' => ['max_revision' => -1]],
            ['$limit' => 20],

]);

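The query above runs fine (about 2 seconds) when the index on from selects a small set of documents, for example when 50-100k documents in the collection share the same 'from' value. But when, say, 2M objects share the same 'from' value, it takes more than 10 seconds to execute and return a result.

A simple example, case 1: run with from set to 12345, the query finishes in 2 seconds, because 12345 occurs 50k times in the collection.

Case 2: run with from set to 98765, the same query takes more than 10 seconds, because 98765 occurs 2M times in the collection.

Edit: explain output for the query is below: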

{
  "command": {
    "aggregate": "mycollection",
    "pipeline": [
      {
        "$project": {
          "id": 1,
          "to": 1,
          "from": 1
        }
      },
      {
        "$match": {
          "$and": [
            {
              "from": {
                "$numberLong": "12345"
              }
            },
            {
              "status": 22
            }
          ]
        }
      },
      {
        "$group": {
          "_id": "$to",
          "max_revision": {
            "$max": "$id"
          }
        }
      },
      {
        "$sort": {
          "max_revision": -1
        }
      },
      {
        "$limit": 20
      }
    ],
    "allowDiskUse": false,
    "cursor": {},
    "$db": "mongo_jc",
    "lsid": {
      "id": {
        "$binary": "8LktsSkpTjOzF3GIC+m1DA==",
        "$type": "03"
      }
    },
    "$clusterTime": {
      "clusterTime": {
        "$timestamp": {
          "t": 1597230985,
          "i": 1
        }
      },
      "signature": {
        "hash": {
          "$binary": "PHh4eHh4eD4=",
          "$type": "00"
        },
        "keyId": {
          "$numberLong": "6859724943999893507"
        }
      }
    }
  },
  "planSummary": [
    {
      "IXSCAN": {
        "from": 1
      }
    }
  ],
  "keysExamined": 1246529,
  "docsExamined": 1246529,
  "hasSortStage": 1,
  "cursorExhausted": 1,
  "numYields": 9747,
  "nreturned": 0,
  "queryHash": "29DAFB9E",
  "planCacheKey": "F5EBA6AE",
  "reslen": 231,
  "locks": {
    "ReplicationStateTransition": {
      "acquireCount": {
        "w": 9847
      }
    },
    "Global": {
      "acquireCount": {
        "r": 9847
      }
    },
    "Database": {
      "acquireCount": {
        "r": 9847
      }
    },
    "Collection": {
      "acquireCount": {
        "r": 9847
      }
    },
    "Mutex": {
      "acquireCount": {
        "r": 100
      }
    }
  },
  "storage": {
    "data": {
      "bytesRead": {
        "$numberLong": "6011370213"
      },
      "timeReadingMicros": 4350129
    },
    "timeWaitingMicros": {
      "cache": 2203
    }
  },
  "protocol": "op_msg",
  "millis": 8548
}

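【Comments】:

Some relevant information: Query Optimization (see the topic on selectivity) and Pipeline Optimization.
Add the explained query plan to the question.
@D.SM Added the gist of the explain query.

Tags: mongodb aggregation-framework query-optimization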

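【Solution 1】:

For this specific case, the mongod query executor can use the index for the initial match, but not for the sort.

If you reorder and slightly modify the stages, it can use an index on {from:1, status:1, id:1} for both the match and the sort: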

$collection->aggregate([
            [ '$match'=> [
                        '$and'=> 
                                [ 
                                    [ 'from'=> xxxxxxxxxx],
                                    [ 'status'=> xx ],
                                ] 
                        ] 
            ],
            ['$sort' => ['id' => -1]],
            ['$project' => ['id'=>1,'to'=>1,'from'=>1]],
            ['$group' => [
                        '_id' => '$to',
                        'max_revision'=>['$first' => '$id'],
                      ]
            ],
            ['$limit' => 20],

]);

That way it should be able to combine the $match and $sort stages into a single index scan.
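
The reordered pipeline only benefits if the compound index actually exists. A minimal sketch of creating it with the MongoDB PHP library follows; the connection URI is an assumption, and the database and collection names are taken from the explain output in the question, so adjust them to your deployment:

```php
<?php
// Sketch only: the URI below is a placeholder, not from the original post.
require 'vendor/autoload.php';

$client = new MongoDB\Client('mongodb://localhost:27017');

// Database and collection names as they appear in the question's explain output.
$collection = $client->selectCollection('mongo_jc', 'mycollection');

// Equality fields first (from, status), then the sort field (id),
// so one index scan can serve both the $match and the $sort.
$indexName = $collection->createIndex(['from' => 1, 'status' => 1, 'id' => 1]);

// createIndex() returns the name of the index as a string.
echo $indexName, PHP_EOL;
```

Because from and status are matched by equality, the server can walk this index in reverse to satisfy the descending sort on id. Building an index on a 20-million-document collection takes time and memory, so it is worth checking the explain output afterwards to confirm the plan changed.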

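【Discussion】:

Makes sense, but this pushes the pipeline over 100 MB of RAM usage, so the sort ends up running on disk. It still takes > 10 seconds to execute. Profiler gist: gist.github.com/TomarAditya/30444b158613b2b967fafb11860b8664
Bah, I made a typo: when sorting before $group, you need to use the field's name as it is before grouping. Edited now. Also, you need to create a compound index on {from:1, status:1, id:1}.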