Find duplicate records in MongoDB: answers

【Question title】: Find duplicate records in MongoDB
【Posted】: 2015-01-15 01:54:43
【Question description】:

How do I find duplicate fields in a mongo collection?

I'd like to check whether any of the "name" fields are duplicates.

{
    "name" : "ksqn291",
    "__v" : 0,
    "_id" : ObjectId("540f346c3e7fc1054ffa7086"),
    "channel" : "Sales"
}

Thanks a lot!

【Question comments】:

This question does not deserve the duplicate flag. It asks how to find duplicate records, not how to prevent them.

Tags: mongodb aggregation-framework database

【Solution 1】:

Use an aggregation on name and keep the names with count > 1:

db.collection.aggregate([
    {"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
    {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
    {"$project": {"name" : "$_id", "_id" : 0} }
]);

To sort the results from the most duplicates to the fewest:

db.collection.aggregate([
    {"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
    {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }, 
    {"$sort": {"count" : -1} },
    {"$project": {"name" : "$_id", "_id" : 0} }     
]);

To use this with a column other than "name", change "$name" to "$column_name".
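
As a rough sketch of that substitution, the stages can be generated for any field. `duplicatePipeline` is a hypothetical helper name, not part of MongoDB or its drivers; it only builds the array of stages, which you would then pass to aggregate() yourself.

```javascript
// Hypothetical helper (an assumption, not a MongoDB API):
// builds the sorted duplicate-finding pipeline for an arbitrary field.
function duplicatePipeline(field) {
  return [
    // Count how many documents share each value of the field.
    { $group: { _id: "$" + field, count: { $sum: 1 } } },
    // Drop the null group and keep values seen more than once.
    { $match: { _id: { $ne: null }, count: { $gt: 1 } } },
    // Most-duplicated values first.
    { $sort: { count: -1 } },
    // Rename _id back to the field name.
    { $project: { [field]: "$_id", _id: 0 } }
  ];
}

// In the mongo shell, for the "channel" field of the sample document:
// db.collection.aggregate(duplicatePipeline("channel"))
```

The helper returns plain data, so you can print the generated stages and check them before running them against a collection.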

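【Discussion】:

"$match": {"_id" :{ "$ne" : null } - is unnecessary here, since the second part of the statement is enough to filter the results. Checking only for groups with count > 1 would do.
Thanks @BatScream. { "$ne" : null } is there in case 'name' is null or does not exist. The aggregation would count null as well.
Welcome. But why check the _id field? It is always guaranteed to be non-null after the group operation.
The _id of a document coming out of the $group stage can be null.
What will the output of this be? When I run it I get all the documents, but I only need the ids/names that are duplicated.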

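【Solution 2】:

You can find the list of duplicate names with the following aggregate pipeline:

Group all the records that have the same name.
Match the groups that contain more than 1 record.
Then group again to project all the duplicate names as an array.

Code: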

db.collection.aggregate([
{$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
{$match:{"count":{$gt:1}}},
{$project:{"name":1,"_id":0}},
{$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
{$project:{"_id":0,"duplicateNames":1}}
])

Output:

{ "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }

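【Discussion】:

The fact that you explain what each line does makes this the best answer.
How can I get duplicates based on two fields? Basic example: suppose I store social details such as [{username: 'abc', type: 'facebook'}, {username: 'abc', type: 'instagram'}]. In this case I don't want duplicates based on username alone, but on "username and type" together. Thanks :)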
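
One possible way to handle the two-field case from the comment above is to group on a compound _id. This is a sketch only: `duplicatesByFields` is a hypothetical helper name, and the field names come from the commenter's example.

```javascript
// Hypothetical helper (an assumption, not a MongoDB API):
// groups on several fields at once, so only identical combinations count as duplicates.
function duplicatesByFields(fields) {
  const key = {};
  for (const f of fields) {
    key[f] = "$" + f; // e.g. { username: "$username", type: "$type" }
  }
  return [
    // One group per distinct combination of the listed fields.
    { $group: { _id: key, count: { $sum: 1 } } },
    // Keep combinations that occur more than once.
    { $match: { count: { $gt: 1 } } }
  ];
}

// In the mongo shell:
// db.collection.aggregate(duplicatesByFields(["username", "type"]))
```

With the two sample documents from the comment, nothing would be reported: 'abc' with 'facebook' and 'abc' with 'instagram' are different combinations, so each group has a count of 1.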

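【Solution 3】:

If you have a large database and the name attribute is present in only some of the documents, the answer given by anhic can be very inefficient.

To improve efficiency, you can add a $match at the start of the aggregation.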

db.collection.aggregate([
    // Filter first so documents with a missing or null name never reach $group.
    {"$match": {"name" :{ "$ne" : null } } },
    {"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
    {"$match": {"count" : {"$gt": 1} } },
    {"$project": {"name" : "$_id", "_id" : 0} }
])

【Discussion】:

【Solution 4】:

db.getCollection('orders').aggregate([  
    {$group: { 
            _id: {name: "$name"},
            uniqueIds: {$addToSet: "$_id"},
            count: {$sum: 1}
        } 
    },
    {$match: { 
        count: {"$gt": 1}
        }
    }
])

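The first stage, $group, groups the documents by the field.

It also collects the unique IDs and counts the documents in each group. If the count is greater than 1, the field value is duplicated in the collection, and the $match stage keeps those groups.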
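
To make the two stages concrete, here is a plain JavaScript emulation of what they compute over an in-memory array. It is for illustration only and is not MongoDB code: `findDuplicateGroups` is a hypothetical name, and the sketch assumes every document has a name field.

```javascript
// Plain JavaScript emulation of the $group + $match stages above (illustration only).
// Assumes every document has a "name" field.
function findDuplicateGroups(docs) {
  const groups = new Map();
  for (const doc of docs) {
    // $group: one bucket per distinct name.
    if (!groups.has(doc.name)) {
      groups.set(doc.name, { _id: { name: doc.name }, uniqueIds: new Set(), count: 0 });
    }
    const g = groups.get(doc.name);
    g.uniqueIds.add(doc._id); // $addToSet: "$_id"
    g.count += 1;             // $sum: 1
  }
  // $match: keep only buckets with count > 1.
  return [...groups.values()]
    .filter(g => g.count > 1)
    .map(g => ({ _id: g._id, uniqueIds: [...g.uniqueIds], count: g.count }));
}

const sample = [
  { _id: 1, name: "a" },
  { _id: 2, name: "b" },
  { _id: 3, name: "a" }
];
console.log(JSON.stringify(findDuplicateGroups(sample)));
// prints [{"_id":{"name":"a"},"uniqueIds":[1,3],"count":2}]
```

Note that MongoDB does not guarantee the order of elements produced by $addToSet, whereas this emulation keeps insertion order.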

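【Discussion】:

Couldn't get this one to work for me either. Downvoting!
This post is old, but it may help someone. Take a look; I'll check locally whether it works. I also came across a blog post about this. Please have a look: compose.com/articles/finding-duplicate-documents-in-mongodb
I was able to get it working - edit updated to a confirmed working version.

【Solution 5】:

In case someone is looking for a duplicates query with an extra "$and" where clause (such as "and where someOtherField is true"):

The trick is to start with another $match, because after the grouping you no longer have all the data available.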

db.collection.aggregate([
    // Do a first match before the grouping
    { $match: { "someOtherField": true }},
    { $group: {
        _id: { name: "$name" },
        count: { $sum: 1 }
    }},
    { $match: { count: { $gte: 2 } }}
])

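I searched for a long time to find this notation; I hope it helps someone with the same problem.

【Discussion】:

【Solution 6】:

If you need to see all the duplicated rows: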

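db.collection.aggregate([
     // Group by name, counting documents and keeping each full document in "data".
     {"$group" : { "_id": "$name", "count": { "$sum": 1 },"data": { "$push": "$$ROOT" }}},
     // Emit one result per grouped document.
     {"$unwind": "$data"},
     // Keep only non-null names that occur more than once.
     {"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } }
]);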

【Discussion】:

Error: line 4: unexpected token {

【Solution 7】:

This is how we can achieve this in MongoDB Compass

【Discussion】:

【Solution 8】:

Another option is to use the $sortByCount stage.

db.collection.aggregate([
  { $sortByCount: '$name' }
])

It is a combination of $group and $sort.
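
Spelled out, the stage corresponds to the two-stage sequence below. On its own it lists every name with its count, including names that occur only once, so a $match is still needed to keep just the duplicates. The variable names are illustrative assumptions.

```javascript
// { $sortByCount: "$name" } behaves like this $group + $sort sequence:
const sortByCountExpanded = [
  { $group: { _id: "$name", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
];

// To report duplicates only, filter on the count that $sortByCount produces:
const duplicatesOnly = [
  { $sortByCount: "$name" },
  { $match: { count: { $gt: 1 } } }
];

// In the mongo shell:
// db.collection.aggregate(duplicatesOnly)
```

Each result document has the shape { _id: <name>, count: <n> }, ordered from the highest count to the lowest.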

【Discussion】: