Check if every element in array matches condition

【Question Title】: Check if every element in array matches condition
【Posted】: 2014-06-28 23:43:45
【Question】:

I have a collection of documents:

date: Date,
users: [
  { user: 1, group: 1 },
  { user: 5, group: 2 }
]

date: Date,
users: [
  { user: 1, group: 1 },
  { user: 3, group: 2 }
]

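I'd like to query this collection to find all documents where every user ID in the users array is in another array, [1, 5, 7]. In this example, only the first document matches.

The best solution I've been able to find is: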

$where: function() { 
  var ids = [1, 5, 7];
  return this.users.every(function(u) { 
    return ids.indexOf(u.user) !== -1;
  });
}

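Unfortunately, this seems to hurt performance, as stated in the $where docs:

$where evaluates JavaScript and cannot take advantage of indexes.

How can I improve this query?

【Question Comments】:

Have you tried doing this with the $in operator?
@Artem I can see how that would work if I only wanted to test whether a single element matches, but here I want every element to match.
You have to reverse the condition, twice in fact. See my answer.

Tags: mongodb mapreduce mongodb-query aggregation-framework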

【Solution 1】:

The query you want is this:

db.collection.find({"users":{"$not":{"$elemMatch":{"user":{$nin:[1,5,7]}}}}})

This says: find all documents that do not have any element outside the list 1, 5, 7.
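To see why the double negation gives "every element" semantics, here is a minimal plain-JavaScript sketch that mimics the two predicates on in-memory objects. It does not talk to MongoDB; `docs`, `everyInList`, and `noneOutsideList` are names made up for illustration.

```javascript
// In-memory copies of the two sample documents from the question.
const docs = [
  { users: [{ user: 1, group: 1 }, { user: 5, group: 2 }] },
  { users: [{ user: 1, group: 1 }, { user: 3, group: 2 }] }
];
const ids = [1, 5, 7];

// "Every element is in the list" (what the $where version checks).
function everyInList(doc) {
  return doc.users.every(u => ids.includes(u.user));
}

// "No element is outside the list" (the $not / $elemMatch / $nin shape).
function noneOutsideList(doc) {
  return !doc.users.some(u => !ids.includes(u.user));
}

const viaEvery = docs.filter(everyInList);
const viaDoubleNegation = docs.filter(noneOutsideList);
console.log(viaEvery.length, viaDoubleNegation.length); // prints: 1 1
```

Both predicates select only the first sample document, which is the equivalence the query relies on.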

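【Comments】:

P.S. This answer takes 10 ms on the sample data set generated from the other "answer".
Fantastic, this seems to give me the same result as the query in my question, and it returns about 10x faster.
The key is $elemMatch, which makes the distinction that you want a specific element to satisfy a particular condition, as opposed to the document as a whole satisfying it. Because arrays allow "users.user" to have multiple values in a single document, it can be ambiguous whether you mean any element or a specific element. As you have it, any element can satisfy the "$not one of these", and it becomes equivalent to an $in. $elemMatch says a single element has to be not one of these, meaning that there has to be an element that is not 1, 5 or 7. The $not then excludes those documents.
Good answer. But it's worth noting that this will also include documents where users is missing or empty.
Good point, @JohnnyHK. I was assuming that the users array always exists and contains some users. To exclude those, this query can be "$and"ed with {"users.user":{$exists:true}}.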
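The caveat in these comments follows from the double negation being vacuously true when there is nothing to test. A rough sketch: the combined filter is written as a plain object (it is not executed against a server here), followed by an in-memory analogue of the same idea. The function names are illustrative only.

```javascript
// Filter that adds the $exists guard suggested above to the double negation.
const filter = {
  $and: [
    { users: { $not: { $elemMatch: { user: { $nin: [1, 5, 7] } } } } },
    { "users.user": { $exists: true } }
  ]
};

// Plain-JavaScript analogue on in-memory objects.
const ids = [1, 5, 7];
function matchesUnguarded(doc) {
  const users = doc.users || [];
  // No element is outside the list; also true for [] and for missing users.
  return !users.some(u => !ids.includes(u.user));
}
function matchesGuarded(doc) {
  const users = doc.users || [];
  return users.length > 0 && matchesUnguarded(doc);
}

console.log(matchesUnguarded({ users: [] }), matchesGuarded({ users: [] })); // prints: true false
```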

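【Solution 2】:

I don't know about better, but there are a few different ways to approach this, depending on the version of MongoDB you have available.

I'm not sure whether this is your intention, but the query as shown does match the first sample document, because as your logic is implemented you are requiring the elements of that document's array to be contained in the sample array.

So if you actually wanted the document to contain all of those elements, then the $all operator would be the obvious choice: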

db.collection.find({ "users.user": { "$all": [ 1, 5, 7 ] } })

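But working on the presumption that your logic is actually what you intend, then at least, as per the suggestion, you can "filter" those results by combining with the $in operator, so that fewer documents are subject to your $where condition in evaluated JavaScript: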

db.collection.find({
    "users.user": { "$in": [ 1, 5, 7 ] },
    "$where": function() { 
        var ids = [1, 5, 7];
        return this.users.every(function(u) { 
            return ids.indexOf(u.user) !== -1;
        });
    }
})

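You do get to use an index, though the number of entries actually scanned is multiplied by the number of array elements in the matched documents; it is still better than running without the additional filter.

Or you might even consider the logical abstraction of the $and operator combined with $or, and possibly the $size operator, depending on your actual array conditions: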

db.collection.find({
    "$or": [
        { "users.user": { "$all": [ 1, 5, 7 ] } },
        { "users.user": { "$all": [ 1, 5 ] } },
        { "users.user": { "$all": [ 1, 7 ] } },
        { "users": { "$size": 1 }, "users.user": 1 },
        { "users": { "$size": 1 }, "users.user": 5 },
        { "users": { "$size": 1 }, "users.user": 7 }
    ]
})

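So this generates all the possible permutations of your matching condition, but again performance will likely vary depending on your installed version.

NOTE: This is actually a complete fail in this case, as it does something entirely different and in fact results in a logical $in.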
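As far as I can tell, the reason this variant fails is that $all only requires the listed values to be present; it says nothing about other values in the array. A small in-memory illustration, where `containsAll` is a hypothetical stand-in for what $all checks on a field of scalar values:

```javascript
// Hypothetical stand-in for { "users.user": { "$all": wanted } }:
// true when every wanted value appears somewhere in the array.
function containsAll(doc, wanted) {
  const present = doc.users.map(u => u.user);
  return wanted.every(id => present.includes(id));
}

// The condition the question actually asks for.
const allowed = [1, 5, 7];
function isAllowedOnly(doc) {
  return doc.users.every(u => allowed.includes(u.user));
}

// A document that contains 1 and 5 but also an outsider, 3.
const mixedDoc = { users: [{ user: 1 }, { user: 5 }, { user: 3 }] };

console.log(containsAll(mixedDoc, [1, 5]), isAllowedOnly(mixedDoc)); // prints: true false
```

The $all-style check accepts the document even though user 3 is outside the list, so the $or of such branches is too permissive.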

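The alternatives are with the aggregation framework. Your mileage may vary on which is most efficient, depending on the number of documents in your collection. One approach with MongoDB 2.6 and upwards: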

db.problem.aggregate([
    // Match documents that "could" meet the conditions
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Keep your original document and a copy of the array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
    }},

    // Unwind the array copy
    { "$unwind": "$users" },

    // Just keeping the "user" element value
    { "$group": {
        "_id": "$_id",
        "users": { "$push": "$users.user" }
    }},

    // Compare to see if all elements are a member of the desired match
    { "$project": {
        "match": { "$setEquals": [
            { "$setIntersection": [ "$users", [ 1, 5, 7 ] ] },
            "$users"
        ]}
    }},

    // Filter out any documents that did not match
    { "$match": { "match": true } },

    // Return the original document form
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])

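So that approach uses some newly introduced set operators to compare the contents, though of course you need to restructure the array in order to make the comparison.

As pointed out, there is a direct operator for this, $setIsSubset, which does the equivalent of the combined operators above in a single operator: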

db.collection.aggregate([
    { "$match": { 
        "users.user": { "$in": [ 1,5,7 ] } 
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
    }},
    { "$unwind": "$users" },
    { "$group": {
        "_id": "$_id",
        "users": { "$push": "$users.user" }
    }},
    { "$project": {
        "match": { "$setIsSubset": [ "$users", [ 1, 5, 7 ] ] }
    }},
    { "$match": { "match": true } },
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])
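To make the equivalence between the two set-based pipelines concrete, here is a minimal sketch using plain-JavaScript stand-ins for the set expressions. The three functions are illustrative re-implementations under the usual set semantics (duplicates ignored), not MongoDB's own code.

```javascript
// Illustrative stand-ins for $setIntersection, $setEquals and $setIsSubset.
function setIntersection(a, b) {
  const inB = new Set(b);
  return [...new Set(a)].filter(x => inB.has(x));
}
function setEquals(a, b) {
  const sa = new Set(a);
  const sb = new Set(b);
  return sa.size === sb.size && [...sa].every(x => sb.has(x));
}
function setIsSubset(a, b) {
  const sb = new Set(b);
  return a.every(x => sb.has(x));
}

const ids = [1, 5, 7];
const samples = [[1, 5], [1, 3], [7, 7, 1], [2]];
for (const users of samples) {
  // Two-operator form: (users ∩ ids) equals users.
  const twoOps = setEquals(setIntersection(users, ids), users);
  // Single-operator form: users is a subset of ids.
  const oneOp = setIsSubset(users, ids);
  console.log(users, twoOps, oneOp);
}
```

For each sample the two forms agree, which is why the single operator can replace the pair.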

Or with a different approach, while still taking advantage of the $size operator from MongoDB 2.6:

db.collection.aggregate([
    // Match documents that "could" meet the conditions
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Keep your original document and a copy of the array
    // and a note of its current size
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
        "size": { "$size": "$users" }
    }},

    // Unwind the array copy
    { "$unwind": "$users" },

    // Filter array contents that do not match
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Count the array elements that did match
    { "$group": {
        "_id": "$_id",
        "size": { "$first": "$size" },
        "count": { "$sum": 1 }
    }},

    // Compare the original size to the matched count
    { "$project": { 
        "match": { "$eq": [ "$size", "$count" ] } 
    }},

    // Filter out documents that were not the same
    { "$match": { "match": true } },

    // Return the original document form
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])

This can of course still be done in versions prior to 2.6, though it is a little more long-winded:

db.collection.aggregate([
    // Match documents that "could" meet the conditions
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Keep your original document and a copy of the array
    { "$project": {
        "_id": {
            "_id": "$_id",
            "date": "$date",
            "users": "$users"
        },
        "users": 1,
    }},

    // Unwind the array copy
    { "$unwind": "$users" },

    // Group it back to get its original size
    { "$group": { 
        "_id": "$_id",
        "users": { "$push": "$users" },
        "size": { "$sum": 1 }
    }},

    // Unwind the array copy again
    { "$unwind": "$users" },

    // Filter array contents that do not match
    { "$match": { 
        "users.user": { "$in": [ 1, 5, 7 ] } 
    }},

    // Count the array elements that did match
    { "$group": {
        "_id": "$_id",
        "size": { "$first": "$size" },
        "count": { "$sum": 1 }
    }},

    // Compare the original size to the matched count
    { "$project": { 
        "match": { "$eq": [ "$size", "$count" ] } 
    }},

    // Filter out documents that were not the same
    { "$match": { "match": true } },

    // Return the original document form
    { "$project": {
        "_id": "$_id._id",
        "date": "$_id.date",
        "users": "$_id.users"
    }}
])
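Both count-based pipelines above boil down to the same comparison: the original array size against the number of elements that survive the $in filter. A rough in-memory sketch of that idea, with `sizeEqualsMatchedCount` as a made-up name:

```javascript
const ids = [1, 5, 7];

// Mirrors the pipeline: remember the array size, drop non-matching
// elements, count what is left, and compare the two numbers.
function sizeEqualsMatchedCount(doc) {
  const size = doc.users.length; // original array size
  const count = doc.users.filter(u => ids.includes(u.user)).length; // elements that match
  return size === count; // every element matched when the numbers agree
}

const first = { users: [{ user: 1, group: 1 }, { user: 5, group: 2 }] };
const second = { users: [{ user: 1, group: 1 }, { user: 3, group: 2 }] };
console.log(sizeEqualsMatchedCount(first), sizeEqualsMatchedCount(second)); // prints: true false
```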

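That generally rounds out the different approaches; try them out and see what works best for you. In all likelihood the simple combination of $in with your existing form is going to be the best one. But in all cases, make sure you have an index that can be selected: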

db.collection.ensureIndex({ "users.user": 1 })

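This will give you the best performance as long as you access it in some way, as all the examples here do.

Verdict

I was intrigued by this, so I ultimately contrived a test case to see what had the best performance. First, some test data generation: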

var batch = [];
for ( var n = 1; n <= 10000; n++ ) {
    var elements = Math.floor(Math.random()*10)+1;

    var obj = { date: new Date(), users: [] };
    for ( var x = 0; x < elements; x++ ) {
        var user = Math.floor(Math.random()*10)+1,
            group = Math.floor(Math.random()*10)+1;

        obj.users.push({ user: user, group: group });
    }

    batch.push( obj );

    if ( n % 500 == 0 ) {
        db.problem.insert( batch );
        batch = [];
    }

}

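With 10000 documents in the collection, with random arrays from 1..10 in length holding random values of 1..10, I came to a match count of 430 documents (reduced from the $in match) with the following results (avg):

JavaScript with $in clause: 420 ms
Aggregate with $size: 395 ms
Aggregate with group array count: 650 ms
Aggregate with two set operators: 275 ms
Aggregate with $setIsSubset: 250 ms

Note that across the samples, all but the last two had a peak variance of approximately 100 ms faster, and the last two both exhibited a 220 ms response. The largest variations were in the JavaScript query, which also exhibited results 100 ms slower.

But the point here is that this is relative to hardware; my laptop running it under a VM is not particularly great, but it gives an idea.

So the aggregation approach, and specifically the MongoDB 2.6.1 version with set operators, clearly wins on performance, with an additional slight gain coming from $setIsSubset as a single operator.

This is particularly interesting given that (as indicated by the 2.4 compatible method) the largest cost in this process is the $unwind statement (over 100 ms avg). With the $in selection having a mean of around 32 ms, the rest of the pipeline stages execute in less than 100 ms on average. So that gives a relative idea of aggregation versus JavaScript performance.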

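【Comments】:

Thanks for pointing me in the direction of aggregation. Looking at the docs, it seems setIsSubset would be appropriate as well. I'm going to see how these perform against what I already have.
@Wex Right you are, as that would be equivalent to the two set operations used in the example. Honestly, I missed that by being too focused on pre-2.6 examples, but it's worth adding its own example as well. Not having run something like this against sizable data, I'm not too sure how the performance varies. But I still suspect that either of the first two forms, without the aggregation method, would be the most performant options.
@Wex Actually, I'm quite intrigued by what your results might be with real-world data. I went back to this with a test case, and the results were quite intriguing.
@AsyaKamsky Well, you are right that despite the negation of an index it would be the better solution. But there was no need to be as rude as you were in your response.

【Solution 3】:

I spent a substantial part of my day trying to implement Asya's solution above with object comparisons rather than strict equality, so I figured I'd share it here.

Let's say you expanded your question from user IDs to full users. You want to find all documents where every item in the users array is present in another users array: [{user: 1, group: 3}, {user: 2, group: 5},...]

This won't work: db.collection.find({"users":{"$not":{"$elemMatch":{"$nin":[{user: 1, group: 3},{user: 2, group: 5},...]}}}}) because $nin only works for strict equality. So we need a different way of expressing "not in array" for arrays of objects. And using $where would slow the query down too much.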

Solution:

db.collection.find({
 "users": {
   "$not": {
     "$elemMatch": {
       // if all of the OR-blocks are true, element is not in array
       "$and": [{
         // each OR-block == true if element != that user
         "$or": [
           { "user": { "$ne": 1 } },
           { "group": { "$ne": 3 } }
         ]
       }, {
         "$or": [
           { "user": { "$ne": 2 } },
           { "group": { "$ne": 5 } }
         ]
       }, {
         // more users...
       }]
     }
   }
 }
})

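To round out the logic: $elemMatch matches all documents that have a user not in the array. So $not will match all documents in which every user is in the array.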
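Writing the OR-blocks by hand gets tedious as the list grows, so one option is to generate the filter. The following is only a sketch: `buildEveryUserAllowedFilter` is a hypothetical helper, and it assumes each allowed entry has exactly the `user` and `group` fields used above.

```javascript
// Hypothetical helper: builds the $not / $elemMatch / $and / $or filter
// from a list of allowed { user, group } pairs.
function buildEveryUserAllowedFilter(allowedUsers) {
  return {
    users: {
      $not: {
        $elemMatch: {
          // An element is "not in the list" when it differs from every allowed pair.
          $and: allowedUsers.map(({ user, group }) => ({
            // It differs from this pair if either field differs.
            $or: [{ user: { $ne: user } }, { group: { $ne: group } }]
          }))
        }
      }
    }
  };
}

const filter = buildEveryUserAllowedFilter([
  { user: 1, group: 3 },
  { user: 2, group: 5 }
]);
console.log(JSON.stringify(filter, null, 2));
```

The resulting object could then be passed to db.collection.find(filter). An empty list would produce an empty $and array, so it is worth guarding against that case before querying.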

【Comments】: