Perform Aggregation/Set Intersection on MongoDB: Answers

【Question title】: Perform Aggregation/Set intersection on MongoDB
【Posted】: 2017-10-08 17:57:17
【Question】:

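I have a query that, after performing some aggregation on a sample dataset, produces the following intermediate data.

The fileid field contains the id of a file, and the _user array lists the users who made changes to that file.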

{
   "_id" : {  "fileid" : 12  },
   "_user" : [ "a","b","c","d" ]
}
{
   "_id" : {  "fileid" : 13  },
   "_user" : [ "f","e","a","b" ]
}
{
   "_id" : {  "fileid" : 14  },
   "_user" : [ "g","h","m","n" ]
}
{
   "_id" : {  "fileid" : 15  },
   "_user" : [ "o","r","s","v" ]
}
{
   "_id" : {  "fileid" : 16  },
   "_user" : [ "x","y","z","a" ]
}
{
   "_id" : {  "fileid" : 17  },
   "_user" : [ "g","r","s","n" ]
}

I need to find every pair of users who have both made changes to at least two of the same files. So the output should be:

{
   "_id" : {  "fileid" : [12,13]  },
   "_user" : [ "a","b" ]
}
{
   "_id" : {  "fileid" : [14,17]  },
   "_user" : [ "g","n" ]
}
{
   "_id" : {  "fileid" : [15,17]  },
   "_user" : [ "r","s" ]
}

Any input would be much appreciated.

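【Comments】:

Isn't repid:[15,17], _user: ["r","s"] also a match? Not nitpicking; just want to make sure the sample output and the description agree.
Yes, I forgot to mention it; it should be there too. { "_id" : { "repoid" : [15,17] }, "_user" : [ "r","s" ] }
I believe the answer is below.
The query looks perfect, but after the first stage I get this aggregation error. I guess too many array pairs are created in the intermediate stage. Any ideas on how to overcome this error? assert: command failed: { "ok" : 0, "errmsg" : "BSONObj size: 45845276 (0x2BB8B1C) is invalid. Size must be between 0 and 16793600(16MB) First element: id: 0", "code" : 10334, "codeName" : "Location10334" } :
Try passing { allowDiskUse: true } as arg #2 to the aggregate() call. If that doesn't work, you may just have to create the pairs on the client side by fetching the deduplicated user list and running a for() loop in client code.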

Tags: arrays json mongodb set-intersection nosql-aggregation

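【Solution 1】:

This is a somewhat involved solution. The idea is to first use the DB to build the set of possible user pairs, then turn around and have the DB find those pairs in the _user field. Be aware that the pair list grows quadratically (n users yield n(n-1)/2 pairs), so thousands of users will create a very large list. We use $addFields in case the input records carry more fields than we see in the example; if they don't, replace it with $project for efficiency, to reduce the amount of material flowing through the pipeline.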

//
// Stage 1:  Get unique set of username pairs.
//
c=db.foo.aggregate([
{$unwind: "$_user"}

// Create single deduped list of users:
,{$group: {_id:null, u: {$addToSet: "$_user"} }}

// Nice little double map here creates the pairs, effectively doing this:
//    for index in range(0, len(list)):
//      first = list[index]
//      for p2 in range(index+1, len(list)):
//        pairs.append([first,list[p2]])
// 
,{$addFields: {u: 
  {$map: {
    input: {$range:[0,{$size:"$u"}]},
    as: "z",
    in: {
        $map: {
            input: {$range:[{$add:[1,"$$z"]},{$size:"$u"}]},
            as: "z2",
            in: [
            {$arrayElemAt:["$u","$$z"]},
            {$arrayElemAt:["$u","$$z2"]}
            ]
        }
    }
    }}
}}

// Turn the array of array of pairs in to a nice single array of pairs:
,{$addFields: {u: {$reduce:{
        input: "$u",
        initialValue:[],
        in:{$concatArrays: [ "$$value", "$$this"]}
        }}
    }}
          ]);


// Stage 2:  Find pairs and tally up the fileids

doc = c.next(); // Get single output from Stage 1 above.                       

u = doc['u'];

c2=db.foo.aggregate([
{$addFields: {_x: {$map: {
                // $literal keeps a user name that starts with "$" from being parsed as a field path:
                input: {$literal: u},
                as: "z",
                in: {
                    n: "$$z",
                    q: {$setIsSubset: [ "$$z", "$_user" ]}
                }
            }
        }
    }}
,{$unwind: "$_x"}
,{$match: {"_x.q": true}}
//  Nice use of grouping by an ARRAY here:
,{$group: {_id: "$_x.n", v: {$push: "$_id.fileid"}, n: {$sum:1} }}
,{$match: {"n": {"$gt":1}}}
                     ]);

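// Print each result; documents are shaped like { _id: [user1, user2], v: [fileids], n: count }.
c2.forEach(printjson);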
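
As a comment on the question suggests, the pairs can also be built in client code when the all-pairs array runs into the 16MB BSON document limit. The following is a minimal sketch in plain JavaScript, not the answer author's method: it generates only the pairs of users who actually share a file, rather than every possible pair, and tallies the file ids per pair. The function name sharedFilePairs and its minFiles parameter are hypothetical, and the sketch assumes the intermediate documents fit in client memory.

```javascript
// Sample documents copied from the question (the intermediate aggregation output).
const docs = [
  { _id: { fileid: 12 }, _user: ["a", "b", "c", "d"] },
  { _id: { fileid: 13 }, _user: ["f", "e", "a", "b"] },
  { _id: { fileid: 14 }, _user: ["g", "h", "m", "n"] },
  { _id: { fileid: 15 }, _user: ["o", "r", "s", "v"] },
  { _id: { fileid: 16 }, _user: ["x", "y", "z", "a"] },
  { _id: { fileid: 17 }, _user: ["g", "r", "s", "n"] },
];

// Hypothetical helper (an assumption, not part of the answer above):
// for every file, emit each pair of users who touched it, then keep the
// pairs that appear on at least `minFiles` files.
function sharedFilePairs(documents, minFiles = 2) {
  // key: JSON string of the sorted pair, value: array of file ids
  const filesByPair = new Map();

  for (const doc of documents) {
    // Dedupe and sort so ["b","a"] and ["a","b"] map to the same key.
    const users = [...new Set(doc._user)].sort();
    for (let i = 0; i < users.length; i++) {
      for (let j = i + 1; j < users.length; j++) {
        const key = JSON.stringify([users[i], users[j]]);
        if (!filesByPair.has(key)) filesByPair.set(key, []);
        filesByPair.get(key).push(doc._id.fileid);
      }
    }
  }

  const result = [];
  for (const [key, fileids] of filesByPair) {
    if (fileids.length >= minFiles) {
      result.push({ _id: { fileid: fileids }, _user: JSON.parse(key) });
    }
  }
  return result;
}

const pairs = sharedFilePairs(docs);
console.log(JSON.stringify(pairs));
```

With the sample data this yields the three pairs from the question: ["a","b"] on files [12,13], ["g","n"] on [14,17], and ["r","s"] on [15,17]. The work is proportional to the sum of squared per-file user counts, so it stays small when each file has few users even if the total user population is large.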

【Discussion】: