mongodb: Return subdocument and keep track of parent

【Question title】: mongodb: Return subdocument and keep track of parent
【Posted】: 2017-08-08 20:09:17
【Question description】:

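I have a collection of tweets, and I am trying to output the retweets (and, similarly, the quoted tweets) at root level into a new collection, so that I can later merge them with the original collection using dump and restore. The retweeted status is a subdocument in the tweet document, and several tweets may retweet the same tweet. How can I bring the retweet to root level and add an array called "retweeted_by" that contains the IDs of all the tweets that retweeted it?

Keep in mind that I use the tweet ID as the primary index (_id) to avoid creating duplicates when combining the collections (mongorestore).

My collection has the following form: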

{
    "_id" : "123456",
    "other_fields1" : "values1",
    "retweeted_status" : {
        "retweet_id" : "159753",
        "other_fields2" : "values2"
    }
}

The ideal output would look like this:

{
    "_id" : "159753",
    "other_fields2" : "values2",    
    "retweeted_by" : [ "123456", "974631", "121212"]
}

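Edit for clarification:

The fields in the subdocument (other_fields2) are actually multiple fields (~28), and not all of them are present in every tweet.

【Comments】: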

db.collection.aggregate([{$group: {_id: "$retweeted_status.retweet_id", retweeted_by: {$push: "$_id"}}}])
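@felix Thanks, but this only outputs the id of the retweeted_status, not the whole retweeted_status subdocument (called "other_fields2" in my example). I think that after grouping I need to use $replaceRoot with the subdocument as newRoot and somehow add the retweeted_by array.
Add other_fields2: {$first: "$retweeted_status.other_fields2"}. See the MongoDB documentation on $group.
@felix I tried that, but the problem is that other_fields2 is actually multiple fields (between 24 and 28), and they differ from one retweet to another, i.e. one can have 24 fields and another can have 4 additional ones.
@felix I think I found a solution to my problem. I am new to asking questions here, so what is the best approach: should I post the solution or delete the question?

Tags: mongodb aggregation-framework

【Solution 1】:

OK, so I finally found a solution to my problem. I am not sure whether it is the best approach: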

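db.tweets.aggregate([
    // Keep only tweets that carry a retweeted_status subdocument.
    { $match: { retweeted_status: { $exists: true } } },
    // Store the parent tweet's _id in the subdocument and give the subdocument its own _id.
    { $addFields: { 'retweeted_status.retweeted_by': '$_id', 'retweeted_status._id': '$retweeted_status.id_str' } },
    // Promote the subdocument to the root.
    { $replaceRoot: { newRoot: '$retweeted_status' } },
    // One group per retweeted tweet: keep the first copy and collect the distinct parent ids.
    { $group: { _id: '$_id', doc: { $first: '$$ROOT' }, retweeted_by: { $addToSet: '$retweeted_by' } } },
    // Overwrite the single parent id inside doc with the collected array.
    { $addFields: { 'doc.retweeted_by': '$retweeted_by' } },
    // Promote doc to the root.
    { $replaceRoot: { newRoot: '$doc' } },
    // Drop the fields that duplicate _id.
    { $project: { id: 0, id_str: 0 } },
    // Write the result to the retweets collection.
    { $out: 'retweets' }
], { allowDiskUse: true })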

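Initially, each document (tweet) has the following form:

{parent, {subdocument}}

I first match on the existence of retweeted_status (the subdocument). Then, before grouping by the retweeted_status id, I add a field holding the parent tweet's id: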

{parent, {subdocument , parent_id}}

Then I replace the root with the modified subdocument:

{subdocument, parent_id}

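Then I group by the _id of the new root, take the first document of each group, and add a new accumulator set (retweeted_by). I use $addToSet rather than $push because the Twitter API sometimes sends duplicates.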
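
To illustrate why a set accumulator matters at this step, here is a minimal plain-JavaScript sketch (not MongoDB code) that mimics the difference between $push and $addToSet on a small made-up group. The sample IDs and the helper names pushAll and addToSet are hypothetical and exist only for this illustration.

```javascript
// Hypothetical sample: the retweet by "123456" was delivered twice.
const rows = [
  { _id: "159753", retweeted_by: "123456" },
  { _id: "159753", retweeted_by: "974631" },
  { _id: "159753", retweeted_by: "123456" }, // duplicate delivery
];

// Mimics {$push: '$retweeted_by'}: keeps every value, duplicates included.
function pushAll(docs) {
  return docs.map((d) => d.retweeted_by);
}

// Mimics {$addToSet: '$retweeted_by'}: keeps each distinct value once.
// Note: in MongoDB the element order of an $addToSet result is unspecified.
function addToSet(docs) {
  return [...new Set(docs.map((d) => d.retweeted_by))];
}

const pushed = pushAll(rows);
const unique = addToSet(rows);
console.log(pushed.length, unique.length);
```

With $push the duplicated parent id would appear twice in retweeted_by; with $addToSet it appears once.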

At this point, the root document contains _id, the retweet document embedded in the field "doc", and an array containing the parents:

{doc{subdocument, parent_id}, [parent_ids]}

Next, I add the parents array as a field inside doc (overwriting the retweeted_by field added earlier):

{doc{subdocument, [parent_ids]}, [parent_ids]}

Then I replace the root with the new document, and finally exclude the fields that contain the same number as _id:

{subdocument, [parent_ids]}
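
The stages described above can be traced without a database. The following is a rough plain-JavaScript sketch of the same reshaping, assuming retweeted_status carries id and id_str as in the pipeline. The sample tweets, the text and lang fields, and the function name liftRetweets are all hypothetical. It approximates the pipeline's behavior rather than reproducing MongoDB semantics exactly.

```javascript
// Hypothetical input: three tweets retweeting the same status, plus one ordinary tweet.
const tweets = [
  { _id: "123456", retweeted_status: { id: 159753, id_str: "159753", text: "a" } },
  { _id: "974631", retweeted_status: { id: 159753, id_str: "159753", text: "a", lang: "en" } },
  { _id: "121212", retweeted_status: { id: 159753, id_str: "159753", text: "a" } },
  { _id: "555555", text: "not a retweet" },
];

function liftRetweets(docs) {
  const groups = new Map();
  for (const tweet of docs) {
    const sub = tweet.retweeted_status;
    if (sub === undefined) continue; // roughly $match: retweeted_status exists
    const key = sub.id_str; // the subdocument's id_str becomes the new _id
    if (!groups.has(key)) {
      // like $first: keep the first subdocument seen for this key
      groups.set(key, { doc: sub, parents: new Set() });
    }
    groups.get(key).parents.add(tweet._id); // like $addToSet of the parent ids
  }
  // Flatten each group and drop id / id_str, which duplicate _id.
  return [...groups.entries()].map(([key, { doc, parents }]) => {
    const { id, id_str, ...rest } = doc;
    return { _id: key, ...rest, retweeted_by: [...parents] };
  });
}

const retweets = liftRetweets(tweets);
console.log(JSON.stringify(retweets, null, 2));
```

One consequence of taking the first document per group shows up in this sketch: a field that exists only in a later copy of the subdocument (lang in the sample) is not merged into the result.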

【Comments】: