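Find all duplicate documents in a MongoDB collection by a key field (answers)

【Question title】: Find all duplicate documents in a MongoDB collection by a key field
【Posted】: 2012-03-18 12:20:00
【Question description】:

Suppose I have a collection containing a set of documents, like these: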

{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}

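I want to find all duplicate entries in this collection by the "name" field. For example, "foo" appears twice and "bar" appears 3 times.

【Discussion】:

To remove duplicates, you can use this solution

Tags: mongodb mapreduce duplicates aggregation-framework

【Solution 1】:

Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to count how many documents contain each value of a field: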

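var map = function () {
    // Emit a 1 for every document that has a name.
    if (this.name) {
        emit(this.name, 1);
    }
};

var reduce = function (key, values) {
    // Sum the emitted 1s to get the number of documents per name.
    return Array.sum(values);
};

// With inline output the counts come back in res.results, not in a collection.
var res = db.collection.mapReduce(map, reduce, { out: { inline: 1 } });
res.results
    .filter(function (r) { return r.value > 1; })
    .sort(function (a, b) { return b.value - a.value; });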
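To make the emit/reduce flow concrete, here is a minimal sketch in plain JavaScript that imitates it in memory, with no MongoDB involved. The `docs` array mirrors the sample documents from the question; `emitted`, `counts` and `duplicates` are illustrative names and not part of any MongoDB API.

```javascript
// Sample documents from the question (only the fields that matter here).
const docs = [
  { id: 1, name: "foo" },
  { id: 2, name: "bar" },
  { id: 3, name: "baz" },
  { id: 4, name: "foo" },
  { id: 5, name: "bar" },
  { id: 6, name: "bar" },
];

// "Map" phase: collect every emitted value under its key.
const emitted = new Map();
for (const doc of docs) {
  if (doc.name) {
    if (!emitted.has(doc.name)) emitted.set(doc.name, []);
    emitted.get(doc.name).push(1);
  }
}

// "Reduce" phase: sum the emitted values for each key.
const counts = [...emitted].map(([key, values]) => ({
  _id: key,
  value: values.reduce((sum, v) => sum + v, 0),
}));

// Keep only keys seen more than once, most frequent first.
const duplicates = counts
  .filter((entry) => entry.value > 1)
  .sort((a, b) => b.value - a.value);

console.log(duplicates);
// [ { _id: 'bar', value: 3 }, { _id: 'foo', value: 2 } ]
```

The real mapReduce runs these phases on the server and may call reduce several times per key, which is why the reduce function must return the same kind of value it receives.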

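【Discussion】:

【Solution 2】:

For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful, because it can return the _ids of the duplicate records.

For PHP, the accepted answer (using mapReduce) is not very efficient. Instead, we can use the group method: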

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // collection

$keys       = array("name" => 1); // select the name field and group by it

// set initial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

The output will look like this:

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

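The equivalent SQL query is: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that we still need to filter out the elements with a count of 1 from the array, since only names counted more than once are duplicates. Again, see the MongoDB cookbook recipe for finding duplicates using group for the canonical solution using group.

【Discussion】:

The link to the MongoDB cookbook is outdated and returns a 404.

【Solution 3】:

The accepted answer is very slow on large collections and does not return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids: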
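The remaining filtering step can be sketched as follows. This is plain JavaScript rather than PHP, and it operates on an object shaped like the print_r output above; representing the empty name as `null` is an assumption, since the printed output does not show whether it was null or an empty string.

```javascript
// Result shaped like the print_r output above (values copied from it).
// Assumption: the empty name is represented here as null.
const result = {
  retval: [
    { name: null, count: 1 },
    { name: "MongoDB", count: 2 },
  ],
  count: 3,
  keys: 2,
  ok: 1,
};

// A name is a duplicate only when it was counted more than once.
const duplicates = result.retval.filter((group) => group.count > 1);

console.log(duplicates);
// [ { name: 'MongoDB', count: 2 } ]
```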

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

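In the first stage of the aggregation pipeline, the $group operator groups documents by the name field and stores each _id value of the grouped records in uniqueIds. The $sum operator adds up the values of the expression passed to it, in this case the constant 1, and so counts the grouped records into the count field.

In the second stage of the pipeline, we use $match to keep only the groups with a count of at least 2, i.e. the duplicates.

Then we sort so that the most frequent duplicates come first, and limit the results to the top 10.

This query outputs up to $limit records that have duplicate names, along with their _ids. For example: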

{
  "_id" : {
    "name" : "Toothpick"
  },
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv"
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}
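The four stages described above can be imitated in plain JavaScript to show what each one contributes. This is only an in-memory sketch, not how the server executes the pipeline; the sample documents and their `_id` strings are made up for illustration.

```javascript
// Hypothetical sample data; the _id strings are invented for this sketch.
const docs = [
  { _id: "a1", name: "Toothpick" },
  { _id: "a2", name: "Broom" },
  { _id: "a3", name: "Toothpick" },
  { _id: "a4", name: "Mop" },
  { _id: "a5", name: "Broom" },
  { _id: "a6", name: "Toothpick" },
];

// $group: one bucket per name, collecting distinct _ids and a running count.
const groups = new Map();
for (const doc of docs) {
  if (!groups.has(doc.name)) {
    groups.set(doc.name, { _id: { name: doc.name }, uniqueIds: new Set(), count: 0 });
  }
  const group = groups.get(doc.name);
  group.uniqueIds.add(doc._id); // $addToSet: "$_id"
  group.count += 1;             // $sum: 1
}

const limit = 10;
const result = [...groups.values()]
  .map((g) => ({ ...g, uniqueIds: [...g.uniqueIds] }))
  .filter((g) => g.count >= 2)       // $match: { count: { $gte: 2 } }
  .sort((a, b) => b.count - a.count) // $sort: { count: -1 }
  .slice(0, limit);                  // $limit: 10

console.log(JSON.stringify(result, null, 2));
```

With this sample data, "Toothpick" (3 documents) comes first, "Broom" (2 documents) second, and "Mop" is filtered out because it occurs only once.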

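【Discussion】:

To remove duplicates, you can use this solution
Now how do I call this from C#?
Does this solution use the existing index on the key? My concern is running it against a very large collection, where the grouped documents may not fit in memory.
@Iravanchi It does. It has been a while, but I remember my database was 5 TB in size.
Got it to work by using db.getCollection().aggregate instead of db.collection.aggregate

【Solution 4】:

The aggregation pipeline framework can be used to easily identify documents with duplicate key values: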

// Desired unique index: 
// db.collection.ensureIndex({ firstField: 1, secondField: 1 }, { unique: true})

db.collection.aggregate([
  { $group: { 
    _id: { firstField: "$firstField", secondField: "$secondField" }, 
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  }}, 
  { $match: { 
    count: { $gt: 1 } 
  }}
])
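Because the pipeline is ordinary data, the same shape can be generated for any set of key fields. The sketch below uses a hypothetical helper, `duplicatePipeline`, which is not part of MongoDB or any driver, and it assumes top-level field names without dots.

```javascript
// Hypothetical helper: builds the duplicate-finding pipeline for a list of
// top-level field names (assumption: no dotted or nested paths).
function duplicatePipeline(fields) {
  const groupKey = {};
  for (const field of fields) {
    groupKey[field] = "$" + field; // e.g. { firstField: "$firstField" }
  }
  return [
    { $group: { _id: groupKey, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } },
  ];
}

const pipeline = duplicatePipeline(["firstField", "secondField"]);
console.log(JSON.stringify(pipeline, null, 2));

// In the mongo shell, where `db` exists, the pipeline could then be run as:
// db.collection.aggregate(pipeline)
```

Running this before creating the unique index shows which key combinations would make the index build fail.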

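Reference: useful information on the official mongo lab blog:

https://blog.mlab.com/2014/03/finding-duplicate-keys-with-the-mongodb-aggregation-framework

【Discussion】:

【Solution 5】:

The top-voted answer here includes this line:

uniqueIds: { $addToSet: "$_id" },

This also returns a new field called uniqueIds containing a list of ids. But what if you only want the field and its count? Then it would look like this: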

db.collection.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

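To explain this: if you come from SQL databases such as MySQL or PostgreSQL, you are used to aggregate functions (e.g. COUNT(), SUM(), MIN(), MAX()) used together with the GROUP BY statement, for example to find how many times each column value appears in a table.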

SELECT COUNT(*), my_type FROM table GROUP BY my_type;
+----------+-----------------+
| COUNT(*) | my_type         |
+----------+-----------------+
|        3 | Contact         |
|        1 | Practice        |
|        1 | Prospect        |
|        1 | Task            |
+----------+-----------------+

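As you can see, the output shows how many times each my_type value occurs. To find duplicates in MongoDB, we tackle the problem in a similar way. MongoDB provides aggregation operations, which group values from multiple documents together and can perform a variety of operations on the grouped data to return a single result. The concept is similar to aggregate functions in SQL.

Suppose there is a collection called contacts. The initial setup looks like this: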

db.contacts.aggregate([ ... ]);

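The aggregate function takes an array of aggregation operators. In our case we need the $group operator, since our goal is to group the data by a field's count, that is, the number of occurrences of each field value.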

db.contacts.aggregate([  
    {$group: { 
        _id: {name: "$name"} 
        } 
    }
]);

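The syntax is a little odd. The _id field is required when using the $group operator. In this case we group on the $name field. The key name inside _id can be anything, but we use name because it is intuitive here.

Running the aggregation with only the $group operator gives a list of all distinct name values (whether they occur once or many times in the collection):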

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"} 
    } 
  }
]);

{ "_id" : { "name" : "John" } }
{ "_id" : { "name" : "Joan" } }
{ "_id" : { "name" : "Stephen" } }
{ "_id" : { "name" : "Rod" } }
{ "_id" : { "name" : "Albert" } }
{ "_id" : { "name" : "Amanda" } }

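Note how the aggregation above works. It takes the documents that have a name field and returns a new set of documents, one per distinct name value.

But what we want to know is how many times each field value occurs. The $group operator accepts a count field that uses the $sum operator to add the expression 1 to the total for each document in the group. So $group and $sum together return the number of documents that share each value of the given field (e.g. name).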

db.contacts.aggregate([  
  {$group: { 
    _id: {name: "$name"},
    count: {$sum: 1}
    } 
  }
]);

{ "_id" : { "name" : "John" },  "count" : 1  }
{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }
{ "_id" : { "name" : "Amanda" },  "count" : 1 }

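Since the goal is to find the duplicates so they can be eliminated, one extra step is needed. To get only the groups whose count is greater than 1, we can use the $match operator to filter the results. Inside $match, we tell it to look at the count field and, using the $gt ("greater than") operator with the number 1, keep only counts greater than 1.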

db.contacts.aggregate([ 
  {$group: { _id: {name: "$name"}, 
             count: {$sum: 1} } }, 
  {$match: { count: {"$gt": 1} } } 
]);

{ "_id" : { "name" : "Joan" },  "count" : 3  }
{ "_id" : { "name" : "Stephen" },  "count" : 2 }
{ "_id" : { "name" : "Rod" },  "count" : 3 }
{ "_id" : { "name" : "Albert" },  "count" : 2 }

As a side note, if you use MongoDB through an ORM such as Mongoid for Ruby, you may get the following error:

The 'cursor' option is required, except for aggregate with the explain argument

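This most likely means your ORM is out of date and is performing operations that MongoDB no longer supports. So either update your ORM or find a fix. For Mongoid, this was the fix that worked for me: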

module Moped
  class Collection
    # Mongo 3.6 requires a `cursor` option be passed as part of aggregate queries.  This overrides
    # `Moped::Collection#aggregate` to include a cursor, which is not provided by Moped otherwise.
    #
    # Per the [MongoDB documentation](https://docs.mongodb.com/manual/reference/command/aggregate/):
    #
    #   Changed in version 3.6: MongoDB 3.6 removes the use of `aggregate` command *without* the `cursor` option unless
    #   the command includes the `explain` option. Unless you include the `explain` option, you must specify the
    #   `cursor` option.
    #
    #   To indicate a cursor with the default batch size, specify `cursor: {}`.
    #
    #   To indicate a cursor with a non-default batch size, use `cursor: { batchSize: <num> }`.
    #
    def aggregate(*pipeline)
      # Ordering of keys apparently matters to Mongo -- `aggregate` has to come before `cursor` here.
      extract_result(session.command(aggregate: name, pipeline: pipeline.flatten, cursor: {}))
    end

    private

    def extract_result(response)
      response.key?("cursor") ? response["cursor"]["firstBatch"] : response["result"]
    end
  end
end
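For readers who do not use Ruby, the command document that the patch sends, and the way it unwraps the reply, can be sketched in plain JavaScript. `aggregateCommand` and `extractResult` are hypothetical helper names, and the two reply objects are made up to resemble the two cases the patch handles.

```javascript
// Hypothetical helper: builds the raw `aggregate` command document,
// including the `cursor` option the error message asks for.
function aggregateCommand(collectionName, pipeline) {
  return { aggregate: collectionName, pipeline: pipeline, cursor: {} };
}

// Mirrors extract_result in the Ruby patch: replies that carry a cursor keep
// their documents in cursor.firstBatch, older replies in a plain result array.
function extractResult(response) {
  return "cursor" in response ? response.cursor.firstBatch : response.result;
}

const command = aggregateCommand("contacts", [
  { $group: { _id: { name: "$name" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
]);
console.log(Object.keys(command)); // [ 'aggregate', 'pipeline', 'cursor' ]

// Made-up replies, shaped like the two cases the patch handles.
const cursorReply = { cursor: { firstBatch: [{ _id: { name: "Joan" }, count: 3 }], id: 0 }, ok: 1 };
const legacyReply = { result: [{ _id: { name: "Rod" }, count: 3 }], ok: 1 };
console.log(extractResult(cursorReply), extractResult(legacyReply));
```

Note that firstBatch holds only the first batch of results, so an approach like this returns the complete answer only when the duplicates fit into that batch.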

【Discussion】: