【问题标题】：What is the correct way to do a HAVING in a MongoDB GROUP BY?在 MongoDB GROUP BY 中进行 HAVING 的正确方法是什么？
【发布时间】：2011-07-29 20:18:18
【问题描述】：

这个查询在 SQL 中会是什么（查找重复项）：

SELECT userId, name FROM col GROUP BY userId, name HAVING COUNT(*)>1

我在 MongoDB 中执行了这个简单的查询：

res = db.col.group({key:{userId:true,name:true}, 
                     reduce: function(obj,prev) {prev.count++;}, 
                     initial: {count:0}})

我添加了一个简单的 Javascript 循环来检查结果集，并执行了一个过滤器来查找所有计数 > 1 的字段，如下所示：

for (i in res) {if (res[i].count>1) printjson(res[i])};

除了在客户端使用 javascript 代码之外，还有更好的方法吗？如果这是最好/最简单的方法，请说它是，这个问题会对某人有所帮助:)

【问题讨论】：

查看类似问题stackoverflow.com/questions/4224773/…
类似，但又不一样。我使用的是组函数，而不是 MongoDB 的 map-reduce 功能。
但是从那个答案链接的这个网站有帮助，举个简单的例子：csanz.posterous.com/look-for-duplicates-using-mongodb-mapreduce

标签： group-by mongodb having having-clause

【解决方案1】：

使用 Mongo 聚合框架的新答案

这个问题被问及回答后，10gen 发布了带有聚合框架的 Mongodb 2.2 版本。执行此查询的最佳新方法是：

db.col.aggregate( [
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

10gen 有一个方便的SQL to Mongo Aggregation conversion chart 值得收藏。

【讨论】：

【解决方案2】：

已经给出的答案很可能是诚实的，并且由于隐式优化在幕后工作，使用投影使它变得更好。我做了一个小改动，我正在解释它背后的积极因素。

原来的命令

db.getCollection('so').explain(1).aggregate( [
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

解释计划的部分内容

{
    "stages" : [ 
        {
            "$cursor" : {
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "5fa42c8b8778717d277f67c4_test.so",
                    "indexFilterSet" : false,
                    "parsedQuery" : {},
                    "queryHash" : "F301762B",
                    "planCacheKey" : "F301762B",
                    "winningPlan" : {
                        "stage" : "PROJECTION_SIMPLE",
                        "transformBy" : {
                            "name" : 1,
                            "userId" : 1,
                            "_id" : 0
                        },
                        "inputStage" : {
                            "stage" : "COLLSCAN",
                            "direction" : "forward"
                        }
                    },
                    "rejectedPlans" : []
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 6000,
                    "executionTimeMillis" : 8,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 6000,

样本集很小，只有 6000 个文档
此查询将处理 WiredTiger 内部缓存中的数据，因此如果集合的大小很大，则所有内容都将保存在内部缓存中以确保执行。 WT Cache 非常重要，如果这个命令在缓存中占用了如此大的空间，那么缓存大小将不得不更大以容纳其他操作

现在是一个小的、hack 和添加的索引。

 db.getCollection('so').createIndex({userId : 1, name : 1})

新命令

db.getCollection('so').explain(1).aggregate( [
    {$match : {name :{ "$ne" : null }, userId : { "$ne" : null } }},
   { $group: { _id: { userId: "$userId", name: "$name" },
               count: { $sum: 1 } } },
   { $match: { count: { $gt: 1 } } },
   { $project: { _id: 0, 
                 userId: "$_id.userId", 
                 name: "$_id.name", 
                 count: 1}}
] )

解释计划

{
  "stages": [{
        "$cursor": {
          "queryPlanner": {
            "plannerVersion": 1,
            "namespace": "5fa42c8b8778717d277f67c4_test.so",
            "indexFilterSet": false,
            "parsedQuery": {
              "$and": [{
                  "name": {
                    "$not": {
                      "$eq": null
                    }
                  }
                },
                {
                  "userId": {
                    "$not": {
                      "$eq": null
                    }
                  }
                }
              ]
            },
            "queryHash": "4EF9C4D5",
            "planCacheKey": "3898FC0A",
            "winningPlan": {
              "stage": "PROJECTION_COVERED",
              "transformBy": {
                "name": 1,
                "userId": 1,
                "_id": 0
              },
              "inputStage": {
                "stage": "IXSCAN",
                "keyPattern": {
                  "userId": 1.0,
                  "name": 1.0
                },
                "indexName": "userId_1_name_1",
                "isMultiKey": false,
                "multiKeyPaths": {
                  "userId": [],
                  "name": []
                },
                "isUnique": false,
                "isSparse": false,
                "isPartial": false,
                "indexVersion": 2,
                "direction": "forward",
                "indexBounds": {
                  "userId": [
                    "[MinKey, undefined)",
                    "(null, MaxKey]"
                  ],
                  "name": [
                    "[MinKey, undefined)",
                    "(null, MaxKey]"
                  ]
                }
              }
            },
            "rejectedPlans": [{
              "stage": "PROJECTION_SIMPLE",
              "transformBy": {
                "name": 1,
                "userId": 1,
                "_id": 0
              },
              "inputStage": {
                "stage": "FETCH",
                "filter": {
                  "userId": {
                    "$not": {
                      "$eq": null
                    }
                  }
                },
                "inputStage": {
                  "stage": "IXSCAN",
                  "keyPattern": {
                    "name": 1.0
                  },
                  "indexName": "name_1",
                  "isMultiKey": false,
                  "multiKeyPaths": {
                    "name": []
                  },
                  "isUnique": false,
                  "isSparse": false,
                  "isPartial": false,
                  "indexVersion": 2,
                  "direction": "forward",
                  "indexBounds": {
                    "name": [
                      "[MinKey, undefined)",
                      "(null, MaxKey]"
                    ]
                  }
                }
              }
            }]
          },
          "executionStats": {
            "executionSuccess": true,
            "nReturned": 6000,
            "executionTimeMillis": 9,
            "totalKeysExamined": 6000,
            "totalDocsExamined": 0,
            "executionStages": {
              "stage": "PROJECTION_COVERED",
              "nReturned": 6000,

检查 Projection_Covered 部分，该命令是一个覆盖查询，基本上只是依赖于索引中的数据
此命令不需要将数据保留在 WT 内部缓存中，因为它根本不会去那里，检查检查的文档，它是 0，假设数据在索引中，它正在使用它来执行，这个对于 WT Cache 已经受到其他操作压力的系统来说，这是一个很大的利好
如果有任何机会需要搜索特定名称而不是整个集合，那么这将变得有用：D
这里的缺点是添加了索引，如果此索引也用于其他操作，那么老实说没有缺点，但如果这是额外添加，那么缓存中的索引将占用更多空间 + 写入会受到影响略微添加索引

*在 6000 条记录的性能方面，显示的时间多 1 毫秒，但对于更大的数据集，这可能会有所不同。需要注意的是，我插入的示例文档只有3个字段，除了这里使用的两个，默认的_id，如果这个集合有更大的文档大小，那么原始命令的执行会增加它在缓存也会增加。

【讨论】：