MongoDB aggregation group performance

【Question title】: MongoDB aggregation group performance
【Posted】: 2018-04-06 04:38:00
【Question description】:

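I am new to MongoDB and am evaluating it for one of our applications. We are implementing CQRS: the query side in Node.js and the command side in C#.

One of my collections may contain millions of documents. Each document has a scenarioId field, and each scenario can have around 2 million records.

Our use case is to compare the data of two scenarios and do some math on each field of the scenarios. For example, each scenario has an avgMiles property; I want to compute the difference in this property between the two scenarios, and users should be able to filter on that difference. Because my design keeps both scenarios' data in a single collection, I am trying to group by scenario ID and then project further.

A sample document looks like this: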

{ 
    "_id" : ObjectId("5ac05dc58ff6cd3054d5654c"), 
    "origin" : {
        "code" : "0000", 
    }, 
    "destination" : {
        "code" : "0001", 
    }, 
    "currentOutput" : {
        "avgMiles" : 0.15093020854848138, 
    },
    "scenarioId" : NumberInt(0), 
    "serviceType" : "ECON"
}

When grouping, I group on the origin.code, destination.code, and serviceType properties.

My aggregation pipeline looks like this:

  db.servicestats.aggregate([{$match:{$or:[{scenarioId:0}, {scenarioId:1}]}},
    {$sort:{'origin.code':1,'destination.code':1,serviceType:1}},
    {$group:{
      _id:{originCode:'$origin.code',destinationCode:'$destination.code',serviceType:'$serviceType'},
          baseScenarioId:{$sum:{$switch: {
                branches: [
                  {
                    case: { $eq: [ '$scenarioId', 1] },
                    then: '$scenarioId'
                  }],
                default: 0
                  }
        }},
        compareScenarioId:{$sum:{$switch: {
                branches: [
                  {
                    case: { $eq: [ '$scenarioId', 0] },
                    then: '$scenarioId'
                  }],
                default: 0
                  }
        }},
            baseavgMiles:{$max:{$switch: {
                branches: [
                  {
                    case: { $eq: [ '$scenarioId', 1] },
                    then: '$currentOutput.avgMiles'
                  }],
                default: null
                  }
        }},
        compareavgMiles:{$sum:{$switch: {
                branches: [
                  {
                    case: { $eq: [ '$scenarioId', 0] },
                    then: '$currentOutput.avgMiles'
                  }],
                default: null
                  }
        }}
    }
    },
    {$project:{scenarioId:
      { base:'$baseScenarioId',
        compare:'$compareScenarioId'
      },
    avgMiles:{base:'$baseavgMiles', compare:'$compareavgMiles',diff:{$subtract :['$baseavgMiles','$compareavgMiles']}}
      } 
    },
    {$match:{'avgMiles.diff':{$eq:0.5}}},
    {$limit:100}
    ],{allowDiskUse: true} )

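My $group stage will process about 4 million documents. Can you suggest how to improve the performance of this query?

I have an index on the fields used in the grouping key, and I added a $sort stage to help the grouping perform better.

Any suggestions are welcome.

Since the group-by approach did not work in my case, I implemented a left outer join with $lookup; the query is shown below.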

    db.servicestats.aggregate([
{$match:{$and :[ {'scenarioId':0}
  //,{'origin.code':'0000'},{'destination.code':'0001'}
  ]}},
//{$limit:1000000},
{$lookup: { from:'servicestats',
  let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
  pipeline:[
  {$match: {
                  $expr: { $and:
                       [
                         { $eq: [ "$scenarioId", 1 ] },
                         { $eq: [ "$origin.code",  "$$ocode" ] },
                         { $eq: [ "$destination.code",  "$$dcode" ] },
                         { $eq: [ "$serviceType",  "$$stype" ] },
                       ]
                    }

              }
  },
  {$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
  { $replaceRoot: { newRoot: "$comp" } }
  ],
  as : "compoutputs"
}},
{
          $replaceRoot: {
             newRoot: {
                $mergeObjects:[
                   {
                      $arrayElemAt: [
                         "$$ROOT.compoutputs",
                         0
                      ]
                   },
                   {
                      origin: "$$ROOT.origin",
                      destination: "$$ROOT.destination",
                      serviceType: "$$ROOT.serviceType",
                      baseavgmiles: "$$ROOT.currentOutput.avgMiles",
                      output: '$$ROOT'
                   }
                ]
             }
          }
       },
       {$limit:100}
])

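The query above performs well and returns within 70 ms.

However, in my scenario I need a full outer join, which as far as I understand MongoDB does not currently support, so I implemented it with a $facet pipeline as shown below: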

    db.servicestats.aggregate([
{$limit:1000},
{$facet: {output1:[
  {$match:{$and :[ {'scenarioId':0}
  ]}},
{$lookup: { from:'servicestats',
  let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
  pipeline:[
  {$match: {
                  $expr: { $and:
                       [
                         { $eq: [ "$scenarioId", 1 ] },
                         { $eq: [ "$origin.code",  "$$ocode" ] },
                         { $eq: [ "$destination.code",  "$$dcode" ] },
                         { $eq: [ "$serviceType",  "$$stype" ] },
                       ]
                    }

            }
  },
  {$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
  { $replaceRoot: { newRoot: "$comp" } }
  ],
  as : "compoutputs"
}},
//{
//          $replaceRoot: {
//             newRoot: {
//                $mergeObjects:[
//                   {
//                      $arrayElemAt: [
//                         "$$ROOT.compoutputs",
//                         0
//                      ]
//                   },
//                   {
//                      origin: "$$ROOT.origin",
//                      destination: "$$ROOT.destination",
//                      serviceType: "$$ROOT.serviceType",
//                      baseavgmiles: "$$ROOT.currentOutput.avgMiles",
//                      output: '$$ROOT'
//                   }
//                ]
//             }
//          }
//       }
  ],
  output2:[
    {$match:{$and :[ {'scenarioId':1}
  ]}},
{$lookup: { from:'servicestats',
  let: {ocode:'$origin.code',dcode:'$destination.code',stype:'$serviceType'},
  pipeline:[
  {$match: {
                  $expr: { $and:
                       [
                         { $eq: [ "$scenarioId", 0 ] },
                         { $eq: [ "$origin.code",  "$$ocode" ] },
                         { $eq: [ "$destination.code",  "$$dcode" ] },
                         { $eq: [ "$serviceType",  "$$stype" ] },
                       ]
                    }

            }
  },
  {$project: {_id:0, comp :{compavgmiles :'$currentOutput.avgMiles'}}},
  { $replaceRoot: { newRoot: "$comp" } }
  ],
  as : "compoutputs"
}},
//{
//          $replaceRoot: {
//             newRoot: {
//                $mergeObjects:[
//                   {
//                      $arrayElemAt: [
//                         "$$ROOT.compoutputs",
//                         0
//                      ]
//                   },
//                   {
//                      origin: "$$ROOT.origin",
//                      destination: "$$ROOT.destination",
//                      serviceType: "$$ROOT.serviceType",
//                      baseavgmiles: "$$ROOT.currentOutput.avgMiles",
//                      output: '$$ROOT'
//                   }
//                ]
//             }
//          }
//       },
       {$match :{'compoutputs':{$eq:[]}}}

  ]
  }
}




       ///{$limit:100}
])

However, the $facet version performs very poorly. Any further ideas for improving it are very welcome.
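
To make the goal concrete, here is a minimal plain-JavaScript sketch of the full outer join that the two $facet branches are trying to produce. It runs on a small in-memory array rather than against MongoDB. The function name fullOuterJoin and the sample data are made up for illustration, and the sketch assumes at most one document per key per scenario.

```javascript
// Plain-JavaScript sketch (no MongoDB involved) of a full outer join between
// two scenarios on the key (origin.code, destination.code, serviceType).
// Assumption: at most one document per key per scenario; if there are
// several, the last one seen wins.
function fullOuterJoin(docs, baseId, compareId) {
  const rows = new Map();
  for (const d of docs) {
    if (d.scenarioId !== baseId && d.scenarioId !== compareId) continue;
    const key = JSON.stringify([d.origin.code, d.destination.code, d.serviceType]);
    if (!rows.has(key)) {
      rows.set(key, {
        origin: d.origin.code,
        destination: d.destination.code,
        serviceType: d.serviceType,
        baseAvgMiles: null,     // stays null when the base scenario has no match
        compareAvgMiles: null,  // stays null when the compare scenario has no match
      });
    }
    const row = rows.get(key);
    if (d.scenarioId === baseId) row.baseAvgMiles = d.currentOutput.avgMiles;
    else row.compareAvgMiles = d.currentOutput.avgMiles;
  }
  return [...rows.values()];
}

// Made-up sample data: one key present in both scenarios, one only in each.
const sample = [
  { scenarioId: 0, origin: { code: '0000' }, destination: { code: '0001' }, serviceType: 'ECON', currentOutput: { avgMiles: 1.5 } },
  { scenarioId: 1, origin: { code: '0000' }, destination: { code: '0001' }, serviceType: 'ECON', currentOutput: { avgMiles: 2.0 } },
  { scenarioId: 0, origin: { code: '0000' }, destination: { code: '0002' }, serviceType: 'ECON', currentOutput: { avgMiles: 3.0 } },
  { scenarioId: 1, origin: { code: '0003' }, destination: { code: '0001' }, serviceType: 'ECON', currentOutput: { avgMiles: 4.0 } },
];

const joined = fullOuterJoin(sample, 1, 0);
console.log(joined);
```

Rows where one side is null correspond to the documents that a left outer join from either scenario alone would miss.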

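【Question comments】:

What happens if you don't set allowDiskUse? Does the aggregation still work? How much memory do you have?
MongoDB throws an error saying the $group stage exceeded the 100 MB limit. We are running this test with 64 GB of RAM.
Did you find a solution to this problem?

Tags: c# node.js mongodb mongoose mongodb-query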

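【Solution 1】:

In general, three things can make a query slow:

The query has no index, cannot use an index effectively, or the schema design is not optimal (for example, deeply nested arrays or subdocuments), which means MongoDB has to do extra work to get at the relevant data.
The query is waiting on something slow (for example, fetching data from disk or writing data to disk).
The hardware is under-provisioned.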

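For your query specifically, here are some general suggestions about performance:

Using allowDiskUse in an aggregation pipeline means the query may spill to disk in some of its stages. Disk is typically the slowest part of a machine, so avoiding it will speed up the query.
Note that each aggregation pipeline stage is limited to 100 MB of RAM. This is independent of how much memory the machine has.
The $group stage generally cannot use an index, because indexes are tied to where documents are located on disk. Once the pipeline reaches a stage that no longer deals with the documents' physical location (such as $group), indexes can no longer be used.
By default, the WiredTiger cache is about 50% of RAM, so a 64 GB machine will have a WiredTiger cache of roughly 32 GB. If the query is slow, MongoDB may be going to disk to fetch the relevant documents. Monitoring iostat during the query and checking disk utilization % will give a hint as to whether enough RAM is provisioned.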
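
As a rough check of the cache figure above: the MongoDB documentation (for 3.4 and later) gives the default WiredTiger internal cache size as the larger of 50% of (RAM minus 1 GB) and 256 MB. The helper below is a made-up name used only to do that arithmetic; it does not read any server configuration.

```javascript
// Default WiredTiger internal cache size in GB, following the documented rule
// for MongoDB 3.4+: the larger of 50% of (RAM - 1 GB) and 256 MB.
// Hypothetical helper for illustration only.
function defaultWiredTigerCacheGB(ramGB) {
  return Math.max(0.5 * (ramGB - 1), 0.25);
}

console.log(defaultWiredTigerCacheGB(64)); // 31.5
```

So on the 64 GB machine from the question the default cache is about 31.5 GB, unless it has been overridden in the server configuration.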

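Some possible solutions:

Provision more RAM so that MongoDB does not have to go to disk as often.
Redesign the schema to avoid deeply nested fields or multiple arrays in a document.
Tailor the document schema to make the data easy to query, rather than to how you think the data should be stored (for example, avoid the heavy normalization inherent in relational database design).
If you find you are hitting the performance limits of a single machine, consider sharding to scale the query horizontally. Be aware, though, that sharding requires careful design and consideration.

【Comments】: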

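【Solution 2】:

You said above that you want to group by scenarioId, but you don't actually do so. Doing that is probably what you should do to avoid all the switch statements. Something like this might get you going: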

db.servicestats.aggregate([{
    $match: {
        scenarioId: { $in: [ 0, 1 ] }
    }
}, {
    $sort: { // not sure if that stage even helps - try to run with and without
        'origin.code': 1,
        'destination.code': 1,
        serviceType: 1
    }
}, {
    $group: { // first group by scenarioId AND the other fields
        _id: {
            scenarioId: '$scenarioId',
            originCode: '$origin.code',
            destinationCode: '$destination.code',
            serviceType: '$serviceType'
        },
        avgMiles: { $max: '$currentOutput.avgMiles' } // no switches needed
    },
}, {
    $group: { // group by the other fields only so without scenarioId
        _id: {
            originCode: '$_id.originCode',
            destinationCode: '$_id.destinationCode',
            serviceType: '$_id.serviceType'
        },
        baseScenarioAvgMiles: {
            $max: {
                $cond: {
                    if: { $eq: [ '$_id.scenarioId', 1 ] },
                    then: '$avgMiles',
                    else: 0
                }
            }
        },
        compareScenarioAvgMiles: {
            $max: {
                $cond: {
                    if: { $eq: [ '$_id.scenarioId', 0 ] },
                    then: '$avgMiles',
                    else: 0
                }
            }
        }
    },
}, {
    $addFields: { // compute the difference
        diff: {
            $subtract :[ '$baseScenarioAvgMiles', '$compareScenarioAvgMiles']
        }
    }
}, {
    $match: {
        diff: { $eq: 0.5 } // $addFields above creates a top-level "diff" field
    }
}, {
    $limit:100
}], { allowDiskUse: true })

Beyond that, I suggest you use the power of db.collection.explain().aggregate(...) to find the right indexes and tune your query.
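
A minimal sketch of that workflow for the mongo shell follows. The index key order (scenarioId first, then the grouping fields) is an assumption to be validated against your data, not a recommendation from the answer, and explainLeadingStages is a hypothetical helper name.

```javascript
// Assumption: a compound index with scenarioId first, so that the initial
// $match can be served by the index. The key order is a guess to validate.
const indexKeys = {
  scenarioId: 1,
  'origin.code': 1,
  'destination.code': 1,
  serviceType: 1,
};

// The leading $match stage of the pipeline above.
const leadingStages = [
  { $match: { scenarioId: { $in: [0, 1] } } },
];

// Hypothetical helper: creates the index, then returns the explain output
// for the leading stages. `db` is the database handle of the mongo shell.
function explainLeadingStages(db) {
  db.servicestats.createIndex(indexKeys);
  return db.servicestats.explain().aggregate(leadingStages);
}
```

In the shell this would be called as explainLeadingStages(db). In the returned plan, an IXSCAN stage indicates the index is being used, while COLLSCAN indicates a full collection scan.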

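【Comments】:

Thanks for the suggestion. But my query still takes 100 seconds to return. The expected performance is under 1 second. I don't know whether I will be able to reach that.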