Answers: MongoDB aggregation/unwind/group/project query combination

【Question title】: MongoDB aggregation/unwind/group/project query combination
【Posted】: 2013-05-23 13:42:25
【Question】:

I have records in the following format:

{
    "_id" : "2013-05-23",
    "authors_who_sold_books" : [
        {
            "id" : "Charles Dickens",
            "num_sold" : 1,
            "customers" : [
                {
                   "time_bought" : 1368627290,
                   "customer_id" : 9715923
                }
            ]
        },
        {
            "id" : "JRR Tolkien",
            "num_sold" : 2,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                },
                {
                    "date_bought" : 1368537290,
                    "customer_id" : 9163893
                }
            ]
        }
    ]
}

There is one record per date, and many of them will contain the same author. I'm after a query that returns the following:

{
    "_id" : "Charles Dickens",
    "num_sold" : 235,
    "customers" : [
        {
            "date_bought" : 1368627290,
            "customer_id" : 9715923
        },
        {
            "date_bought" : 1368622358,
            "customer_id" : 9876234
        },
        etc...
    ]
}

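I've tried various combinations of aggregate, group, unwind and project but still can't quite get there, and would really appreciate any suggestions.

For extra points, I'm actually doing this with the Ruby gem, so code specific to that would be great. However, I can convert the normal MongoDB query language.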

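【Question comments】:

Have you tried MapReduce?
The biggest issue I see is the way the documents themselves are stored. Is changing the structure of the data an option? The reason I ask is that if you have the _id field set to the date, and each date document has an array of customers, why would you need to store the date again in the customers array? Also, documents are limited to 16 MB in size, so if you had millions of sales in a day you might exceed that limit. I think it would be easier (again, speculating) if each sale were its own record; then you could use the aggregation framework to create what you're looking for.
No, I haven't tried MapReduce yet...
@Jesta Thanks for the feedback. The data comes from a daily data dump. In my real-world situation (not books and authors!), the ids will actually be something like "2013-05-17_emails", "2013-05-17_banner-ads", etc., so they carry more information. Also, the "date_bought" field is actually a timestamp, so I'll change my question to better reflect that. I can also guarantee that the number of records will never be in the millions! Thanks.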
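
The restructuring suggested in the comment above (one record per sale) could look roughly like the sketch below. It is only an illustration, not something either poster wrote: the helper `toSaleRecords` and the flat field names `day` and `author` are assumptions, and the sample uses `date_bought` throughout for simplicity.

```javascript
// Hypothetical helper: flattens one daily document into one record per sale.
// The field names "day" and "author" are illustrative assumptions.
function toSaleRecords(dayDoc) {
  const sales = [];
  for (const author of dayDoc.authors_who_sold_books) {
    for (const customer of author.customers) {
      // Copy the customer fields and attach the day and the author name
      sales.push({ day: dayDoc._id, author: author.id, ...customer });
    }
  }
  return sales;
}

const day = {
  _id: '2013-05-23',
  authors_who_sold_books: [
    { id: 'Charles Dickens', num_sold: 1,
      customers: [{ date_bought: 1368627290, customer_id: 9715923 }] },
    { id: 'JRR Tolkien', num_sold: 2,
      customers: [{ date_bought: 1368540890, customer_id: 9872345 },
                  { date_bought: 1368537290, customer_id: 9163893 }] },
  ],
};

const sales = toSaleRecords(day);
console.log(sales);

// On a collection of such flat records, a single $group stage would be
// enough to build the per-author summary (pipeline shown, not executed here).
const pipeline = [
  { $group: {
      _id: '$author',
      num_sold: { $sum: 1 },
      customers: { $push: { date_bought: '$date_bought', customer_id: '$customer_id' } },
  } },
];
```

With this layout no `$unwind` is needed, since each sale is already its own document.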

Tags: ruby mongodb aggregation-framework

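【Solution 1】:

I took your example data, modified it slightly to make a second document, and added both to a test collection. The documents I used are as follows: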

{
    "_id" : "2013-05-23",
    "authors_who_sold_books" : [
        {
            "id" : "Charles Dickens",
            "num_sold" : 1,
            "customers" : [
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                }
            ]
        },
        {
            "id" : "JRR Tolkien",
            "num_sold" : 2,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                },
                {
                    "date_bought" : 1368537290,
                    "customer_id" : 9163893
                }
            ]
        }
    ]
}
{
    "_id" : "2013-05-21",
    "authors_who_sold_books" : [
        {
            "id" : "Charles Dickens",
            "num_sold" : 3,
            "customers" : [
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                }
            ]
        },
        {
            "id" : "JRR Tolkien",
            "num_sold" : 1,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                }
            ]
        }
    ]
}

Now, to get your expected results, I used the aggregation framework and ran the following query:

db.collection.aggregate([
    {
        // First we unwind all the authors that sold books
        $unwind: '$authors_who_sold_books',
    },
    {
        // Next, we unwind each of the customers that purchased a book
        $unwind: '$authors_who_sold_books.customers'
    },
    {
        // Now we group them by "Author Name" (hoping they are unique!)
        $group: {
            _id: '$authors_who_sold_books.id',
            // Increment the number sold by each author
            num_sold: {
                $sum: 1
            },
            // Add the customer data to the array
            customers: {
                $push: '$authors_who_sold_books.customers'
            }
        }
    }
]);

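I tried to comment the code above so that it makes more sense. Basically, it unwinds the data twice so that there is one document for every sale by an author: first it unwinds by authors_who_sold_books, then by authors_who_sold_books.customers.

The next step just groups those documents by author, pushes every customer into the customers array, and increments num_sold by 1 for each unwound document. Note that num_sold is therefore a count of customer entries, not a sum of the stored num_sold values.

The result looks like this: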

{
    "result" : [
        {
            "_id" : "JRR Tolkien",
            "num_sold" : 3,
            "customers" : [
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                },
                {
                    "date_bought" : 1368537290,
                    "customer_id" : 9163893
                },
                {
                    "date_bought" : 1368540890,
                    "customer_id" : 9872345
                }
            ]
        },
        {
            "_id" : "Charles Dickens",
            "num_sold" : 4,
            "customers" : [
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                },
                {
                    "time_bought" : 1368627290,
                    "customer_id" : 9715923
                }
            ]
        }
    ],
    "ok" : 1
}
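
To make the two `$unwind` stages and the `$group` stage more concrete, here is a plain JavaScript sketch that mimics them on in-memory data, with no MongoDB server involved. The helpers `unwind`, `setPath` and `groupByAuthor` are hypothetical and simplified (for example, documents whose path is missing or not an array are just skipped), and the sample data is trimmed to customer ids only.

```javascript
// Returns a copy of obj with the value at the given key path replaced.
function setPath(obj, keys, value) {
  if (keys.length === 0) return value;
  const [head, ...rest] = keys;
  return { ...obj, [head]: setPath(obj[head], rest, value) };
}

// Simplified stand-in for $unwind: one output document per array element.
function unwind(docs, path) {
  const keys = path.split('.');
  const out = [];
  for (const doc of docs) {
    let arr = doc;
    for (const k of keys) arr = arr == null ? undefined : arr[k];
    if (!Array.isArray(arr)) continue; // simplification: skip such documents
    for (const item of arr) out.push(setPath(doc, keys, item));
  }
  return out;
}

// Stand-in for the $group stage: count documents and collect customers per author.
function groupByAuthor(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const author = doc.authors_who_sold_books;
    if (!groups.has(author.id)) {
      groups.set(author.id, { _id: author.id, num_sold: 0, customers: [] });
    }
    const group = groups.get(author.id);
    group.num_sold += 1;                    // like { $sum: 1 }
    group.customers.push(author.customers); // like { $push: ... }
  }
  return [...groups.values()];
}

const days = [
  { _id: '2013-05-23', authors_who_sold_books: [
    { id: 'Charles Dickens', num_sold: 1, customers: [{ customer_id: 9715923 }] },
    { id: 'JRR Tolkien', num_sold: 2,
      customers: [{ customer_id: 9872345 }, { customer_id: 9163893 }] },
  ] },
  { _id: '2013-05-21', authors_who_sold_books: [
    { id: 'Charles Dickens', num_sold: 3,
      customers: [{ customer_id: 9715923 }, { customer_id: 9715923 }, { customer_id: 9715923 }] },
    { id: 'JRR Tolkien', num_sold: 1, customers: [{ customer_id: 9872345 }] },
  ] },
];

const perAuthor = unwind(days, 'authors_who_sold_books');              // 4 documents
const perSale = unwind(perAuthor, 'authors_who_sold_books.customers'); // 7 documents
const result = groupByAuthor(perSale);
console.log(JSON.stringify(result, null, 2));
```

After the second unwind, each document's `authors_who_sold_books.customers` holds a single customer object rather than an array, which is why the group step can push it directly.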

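Hopefully this helps you work out your real solution :)

【Comments】:

Thank you very much for taking the time to work out a solution and explain it so clearly!
I've just implemented this in my actual solution (it was put on hold for a while) and it worked perfectly straight away. I really appreciate it, thanks. Once I get 1 more rep point I'll be able to upvote!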