PyMongo Aggregation for Query Filters? Answers

[Question title]: PyMongo Aggregation for Query Filters?
[Posted]: 2018-08-17 15:16:24
[Question description]:

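I have found similar questions, but none that fully answers what I am looking for. If this is a duplicate, please feel free to point me to the right place.

I have a collection that is the "source of truth" for some very large documents. I want to use the query engine to do some pre-filtering before the main analysis.

Query 1:

Only retrieve documents where document.financials.entrycount $gte 4. Basically, each document has a financials subdocument, and I want to use it as a filter: I only want documents with at least 4 entries returned.

Query 2:

Be able to do math and compare the result against a number for retrieval.

For example:

(totalAssets + totalCash) / (totalDebt + totalLiabilities) < .5

where these numbers live in the subdocument.

Finally, be able to combine these.

Below is a sample document, which is expected to contain only quarterly financial data.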

{
  "symbol": "AAWW",
  "quarterly_financials": {
    "2017-09-30": {
      "cashChange": -106467000,
      "cashFlow": 82299000,
      "costOfRevenue": 439135000,
      "currentAssets": 449776000,
      "currentCash": 176280000,
      "currentDebt": 196509000,
      "grossProfit": 96613000,
      "netIncome": -24162000,
      "operatingExpense": 43690000,
      "operatingGainsLosses": 378000,
      "operatingIncome": 52923000,
      "operatingRevenue": 535748000,
      "researchAndDevelopment": None,
      "shareholderEquity": 1575169000,
      "totalAssets": 4687302000,
      "totalCash": 175926000,
      "totalDebt": 2105344000,
      "totalLiabilities": None,
      "totalRevenue": 535748000
    },
    "2017-12-31": {
      "cashChange": 115584000,
      "cashFlow": 136613000,
      "costOfRevenue": 474565000,
      "currentAssets": 587586000,
      "currentCash": 291864000,
      "currentDebt": 218013000,
      "grossProfit": 153387000,
      "netIncome": 209448000,
      "operatingExpense": 46628000,
      "operatingGainsLosses": -95000,
      "operatingIncome": 106759000,
      "operatingRevenue": 627952000,
      "researchAndDevelopment": None,
      "shareholderEquity": 1789856000,
      "totalAssets": 4955462000,
      "totalCash": 294413000,
      "totalDebt": 2226999000,
      "totalLiabilities": None,
      "totalRevenue": 627952000
    },
    "2018-03-31": {
      "cashChange": -161460000,
      "cashFlow": 69125000,
      "costOfRevenue": 498924000,
      "currentAssets": 433193000,
      "currentCash": 130404000,
      "currentDebt": 223308000,
      "grossProfit": 91090000,
      "netIncome": 9612000,
      "operatingExpense": 50521000,
      "operatingGainsLosses": None,
      "operatingIncome": 40569000,
      "operatingRevenue": 590014000,
      "researchAndDevelopment": None,
      "shareholderEquity": 1792299000,
      "totalAssets": 5016832000,
      "totalCash": 136421000,
      "totalDebt": 2270870000,
      "totalLiabilities": None,
      "totalRevenue": 590014000
    },
    "2018-06-30": {
      "cashChange": 97525000,
      "cashFlow": 106786000,
      "costOfRevenue": 548491000,
      "currentAssets": 565191000,
      "currentCash": 227929000,
      "currentDebt": 245322000,
      "grossProfit": 117654000,
      "netIncome": -21150000,
      "operatingExpense": 47334000,
      "operatingGainsLosses": None,
      "operatingIncome": 70320000,
      "operatingRevenue": 664531000,
      "researchAndDevelopment": None,
      "shareholderEquity": 1776073000,
      "totalAssets": 5348343000,
      "totalCash": 234280000,
      "totalDebt": 2501488000,
      "totalLiabilities": None,
      "totalRevenue": 666145000
    }
  }
}

[Question comments]:

Tags: python mongodb mongodb-query aggregation-framework pymongo

[Solution 1]:

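Frankly, the main problem you have here is the document structure. The bottom line is that "sub-documents" named by "key" are generally not a good thing for any form of database, MongoDB included.

While "lookup by key" may be more efficient when client code handles a single document, for a natural "list" like this MongoDB is better suited to an "array" or "set" like structure.

Aggregation Expressions

A workable alternative is to use aggregation operators such as $objectToArray to "coerce" this form into a "natural list" for processing, so that you can filter by the number of entries in that list: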

collection.aggregate([
  { "$match": {
    "$expr": {
      "$gte": [
        { "$size": { "$objectToArray": "$quarterly_financials" } },
        4
      ]
    }
  }}
])

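Note that this uses $expr, which is available from MongoDB 3.6 onwards. If you do not have a supporting version but still have $objectToArray from the later MongoDB 3.4 releases (the documentation says 3.6, but it is actually present in those later releases), then you can use something like $redact in place of the $match stage or a regular find().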
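As a rough sketch of that $redact fallback, the same entry-count test can be wrapped in a $cond that keeps or prunes each document. The helper names build_redact_pipeline and find_with_min_entries are made up for illustration, and collection is assumed to be a PyMongo Collection; treat this as one possible shape rather than the definitive form.

```python
def build_redact_pipeline(min_entries=4):
    """Build a $redact pipeline that keeps documents with enough keyed entries."""
    # Number of keys in the quarterly_financials sub-document.
    entry_count = {"$size": {"$objectToArray": "$quarterly_financials"}}
    return [
        {"$redact": {
            "$cond": {
                "if": {"$gte": [entry_count, min_entries]},
                "then": "$$KEEP",    # keep the document as it is
                "else": "$$PRUNE",   # drop the document from the results
            }
        }}
    ]


def find_with_min_entries(collection, min_entries=4):
    # Assumption: `collection` is a pymongo.collection.Collection.
    return list(collection.aggregate(build_redact_pipeline(min_entries)))
```

Building the pipeline in a separate function keeps it easy to inspect or print before it is sent to the server.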

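The same applies to the additional calculated expression. The bottom line is that you still need the "array conversion" in order to actually process and traverse those "list elements". So if only the sub-entries that meet the condition should add up to the required four, you would change this to apply a $filter condition on the array elements: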

collection.aggregate([
  { "$match": {
      "$expr": {
        "$gte": [
          { "$size": { 
            "$filter": {
              "input": { "$objectToArray": "$quarterly_financials" },
              "cond": {
                "$lt": [
                  { "$divide": [
                    { "$add": [ "$$this.v.totalAssets", "$$this.v.totalCash" ] },
                    { "$add": [ 
                      "$$this.v.totalDebt",
                      { "$ifNull": [ "$$this.v.totalLiabilities", 0 ] }
                    ]}
                  ]},
                  .5
                ]
              }
            }
          }},
          4
        ]
      }
  }}
])

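So after the "coercion to an array", each list item is checked by $filter to see whether the math expression, built from operators such as $divide and $add, satisfies the logical condition $lt. The $size operator is then applied to the "length" of the filtered array that remains.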
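To make the logic easier to follow, here is a client-side Python approximation of what that pipeline checks for a single document. It is only an illustration of the arithmetic, not a replacement for the server-side filter, and the function names are invented here. One assumption to flag loudly: a zero denominator is simply treated as "no match" in this sketch, which is a choice made for the example and not a statement about how the pipeline behaves.

```python
def ratio_below(entry, threshold=0.5):
    """Mirror the $filter condition for one quarter's figures."""
    numerator = entry["totalAssets"] + entry["totalCash"]
    # Same role as $ifNull: a None or missing totalLiabilities counts as 0.
    liabilities = entry.get("totalLiabilities") or 0
    denominator = entry["totalDebt"] + liabilities
    if denominator == 0:
        # Assumption for this sketch only: treat as "does not match".
        return False
    return numerator / denominator < threshold


def count_matching_quarters(document, threshold=0.5):
    """Count the quarters whose ratio is below the threshold."""
    quarters = document["quarterly_financials"].values()
    return sum(1 for entry in quarters if ratio_below(entry, threshold))


def passes_filter(document, minimum=4, threshold=0.5):
    """True when at least `minimum` quarters satisfy the ratio condition."""
    return count_matching_quarters(document, threshold) >= minimum
```

For the "2017-09-30" entry in the question, the ratio works out to roughly 2.31, so that quarter would not count towards the four.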

Also note that $objectToArray essentially turns each sub-document into a list whose entries have this form:

   {
        "k" : "2018-06-30",
        "v" : {
            "cashChange" : 97525000,
            "cashFlow" : 106786000,
            "costOfRevenue" : 548491000,
            "currentAssets" : 565191000,
            "currentCash" : 227929000,
            "currentDebt" : 245322000,
            "grossProfit" : 117654000,
            "netIncome" : -21150000,
            "operatingExpense" : 47334000,
            "operatingGainsLosses" : null,
            "operatingIncome" : 70320000,
            "operatingRevenue" : 664531000,
            "researchAndDevelopment" : null,
            "shareholderEquity" : 1776073000,
            "totalAssets" : 5348343000,
            "totalCash" : 234280000,
            "totalDebt" : 2501488000,
            "totalLiabilities" : null,
            "totalRevenue" : 666145000
        }
    }

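This means everything you are looking for sits under the "v" (or "value") property of each document in the newly converted "list". The "k" property is of course the "key" you used as the name in the "sub-document" form.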
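If it helps to picture the conversion, the "k"/"v" shape can be approximated in plain Python. The function name object_to_array is just a label for this sketch; the real conversion happens on the server.

```python
def object_to_array(subdocument):
    """Approximate the k/v list that $objectToArray produces for a sub-document."""
    return [{"k": key, "v": value} for key, value in subdocument.items()]


quarters = object_to_array({
    "2018-03-31": {"totalDebt": 2270870000, "totalLiabilities": None},
    "2018-06-30": {"totalDebt": 2501488000, "totalLiabilities": None},
})
# Each figure is now reached through "v", which is why the pipeline
# uses paths such as "$$this.v.totalDebt".
latest_debt = quarters[-1]["v"]["totalDebt"]
```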

Also, $ifNull is required to handle null (or Python's None) values of the properties, or indeed "missing" properties where appropriate.

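JavaScript Evaluation

It is not really recommended, but if your MongoDB does not support newer operators such as $objectToArray, another alternative is JavaScript evaluation on the server, with $where or mapReduce handling this kind of logic.

The same principle applies: you must first "coerce" the data into an array form. (Excuse the "shell form" shorthand in the example here. In every other language the function is simply passed as a string.):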

collection.find(function() {
  var quarter = this.quarterly_financials;
  return Object.keys(quarter).filter( k => 
    ( 
      ( quarter[k].totalAssets + quarter[k].totalCash ) /
      ( quarter[k].totalDebt + ( quarter[k].totalLiabilities || 0 ) )
    ) < .5
  ).length >= 4
})

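While this is "less noisy", it is not really optimal, since the "cost" of evaluating JavaScript expressions on the server is far higher than that of native aggregation expressions. It is also possible that some environments and server configurations will not allow you to use such JavaScript expressions at all.

Also note that if you really want to "aggregate" in later stages of the analysis, you would need to break that logic out into a mapReduce, because a $where query expression cannot be used inside an aggregation pipeline.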
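From PyMongo the shell-style function is passed as a string under $where, roughly as sketched here. The names WHERE_JS, build_where_filter and find_with_where are invented for the example, collection is assumed to be a PyMongo Collection, and the server must permit JavaScript evaluation for this to work at all.

```python
# JavaScript source sent to the server as a plain string.
WHERE_JS = """
function() {
  var quarter = this.quarterly_financials;
  return Object.keys(quarter).filter(function(k) {
    return (
      (quarter[k].totalAssets + quarter[k].totalCash) /
      (quarter[k].totalDebt + (quarter[k].totalLiabilities || 0))
    ) < 0.5;
  }).length >= 4;
}
"""


def build_where_filter():
    """Return a find() filter that evaluates WHERE_JS against each document."""
    return {"$where": WHERE_JS}


def find_with_where(collection):
    # Assumption: `collection` is a pymongo.collection.Collection and the
    # server configuration allows JavaScript expressions.
    return list(collection.find(build_where_filter()))
```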

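Alternate Structure

Finally, since everything relies on "making a list" out of "named keys", the better approach is generally to structure the data that way in the first place (excuse the extended JSON format):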

{
  "symbol": "AAWW",
  "quarterly_financials": [
    { 
      "tranDate": { "$date": "2017-09-30T00:00:00Z"},
      "cashChange": -106467000,
      "cashFlow": 82299000,
      "costOfRevenue": 439135000,
      "currentAssets": 449776000,
      "currentCash": 176280000,
      "currentDebt": 196509000,
      "grossProfit": 96613000,
      "netIncome": -24162000,
      "operatingExpense": 43690000,
      "operatingGainsLosses": 378000,
      "operatingIncome": 52923000,
      "operatingRevenue": 535748000,
      "researchAndDevelopment": null,
      "shareholderEquity": 1575169000,
      "totalAssets": 4687302000,
      "totalCash": 175926000,
      "totalDebt": 2105344000,
      "totalLiabilities": null,
      "totalRevenue": 535748000
    },
    {
      "tranDate": { "$date": "2017-12-31T00:00:00Z" },
      "cashChange": 115584000,
      "cashFlow": 136613000,
      "costOfRevenue": 474565000,
      "currentAssets": 587586000,
      "currentCash": 291864000,
      "currentDebt": 218013000,
      "grossProfit": 153387000,
      "netIncome": 209448000,
      "operatingExpense": 46628000,
      "operatingGainsLosses": -95000,
      "operatingIncome": 106759000,
      "operatingRevenue": 627952000,
      "researchAndDevelopment": null,
      "shareholderEquity": 1789856000,
      "totalAssets": 4955462000,
      "totalCash": 294413000,
      "totalDebt": 2226999000,
      "totalLiabilities": null,
      "totalRevenue": 627952000
    },
    { 
      "tranDate": { "$date": "2018-03-31T00:00:00Z" },
      "cashChange": -161460000,
      "cashFlow": 69125000,
      "costOfRevenue": 498924000,
      "currentAssets": 433193000,
      "currentCash": 130404000,
      "currentDebt": 223308000,
      "grossProfit": 91090000,
      "netIncome": 9612000,
      "operatingExpense": 50521000,
      "operatingGainsLosses": null,
      "operatingIncome": 40569000,
      "operatingRevenue": 590014000,
      "researchAndDevelopment": null,
      "shareholderEquity": 1792299000,
      "totalAssets": 5016832000,
      "totalCash": 136421000,
      "totalDebt": 2270870000,
      "totalLiabilities": null,
      "totalRevenue": 590014000
    },
    { 
      "tranDate": { "$date": "2018-06-30T00:00:00Z" },
      "cashChange": 97525000,
      "cashFlow": 106786000,
      "costOfRevenue": 548491000,
      "currentAssets": 565191000,
      "currentCash": 227929000,
      "currentDebt": 245322000,
      "grossProfit": 117654000,
      "netIncome": -21150000,
      "operatingExpense": 47334000,
      "operatingGainsLosses": null,
      "operatingIncome": 70320000,
      "operatingRevenue": 664531000,
      "researchAndDevelopment": null,
      "shareholderEquity": 1776073000,
      "totalAssets": 5348343000,
      "totalCash": 234280000,
      "totalDebt": 2501488000,
      "totalLiabilities": null,
      "totalRevenue": 666145000
    }
  ]
}

Since this is already a "list", you simply skip the $objectToArray part (or its JavaScript equivalent):

collection.aggregate([
  { "$match": {
    "$expr": {
      "$gte": [
        { "$size": { 
          "$filter": {
            "input": "$quarterly_financials",
            "cond": {
              "$lt": [
                { "$divide": [
                  { "$add": [ "$$this.totalAssets", "$$this.totalCash" ] },
                  { "$add": [ 
                    "$$this.totalDebt",
                    { "$ifNull": [ "$$this.totalLiabilities", 0 ] }
                  ]}
                ]},
                .5
              ]
            }
          }
        }},
        4
      ]
    }
  }}
])

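There are many more advantages to using this structure instead, most of which would generally avoid evaluation expressions altogether and allow natural query expressions. In fact, if your condition did not actually require four entries to pass such a "filter" condition, a regular query would do, provided you had "pre-calculated" the math expression in each list entry at the time of storage.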
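As an illustration of what such natural query expressions might look like on the array structure, here are two plain find() filters. The first relies on dot notation with an array index; the second assumes a pre-calculated field on each list entry, named assetDebtRatio here purely for the example (no such field exists in the documents shown above).

```python
def at_least_n_entries(n):
    """Match documents whose quarterly_financials array has n or more elements."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # An element at index n-1 exists only when the array holds n or more items.
    return {"quarterly_financials.%d" % (n - 1): {"$exists": True}}


def any_quarter_below(threshold=0.5, field="assetDebtRatio"):
    """Match documents where some quarter's stored ratio is below the threshold.

    Assumption: `field` is a hypothetical pre-calculated value saved with
    each list entry at write time.
    """
    return {"quarterly_financials": {"$elemMatch": {field: {"$lt": threshold}}}}


# Both conditions combined into one regular query document.
combined = {**at_least_n_entries(4), **any_quarter_below(0.5)}
```

Something like collection.find(combined) would then need no evaluation expression at all.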

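So "structure" is always what you should really be considering for the best query performance, since any form of "evaluation" results in a collection scan and is very costly.

So if things are lists then "use a list", and where calculations are "static", perform them in advance and store the results rather than calculating at runtime.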
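A minimal sketch of such a conversion, done client-side before writing the document back, might look like the following. The field name assetDebtRatio and the function name restructure are assumptions made for this example, as is the choice to store None when the denominator is zero.

```python
from datetime import datetime


def restructure(document):
    """Turn the keyed quarterly_financials sub-document into a list of entries."""
    quarters = []
    for key, figures in sorted(document["quarterly_financials"].items()):
        entry = dict(figures)
        # The former key becomes a real date value inside the entry.
        entry["tranDate"] = datetime.strptime(key, "%Y-%m-%d")
        liabilities = entry.get("totalLiabilities") or 0
        denominator = entry["totalDebt"] + liabilities
        # Hypothetical pre-calculated field; None when it cannot be computed.
        entry["assetDebtRatio"] = (
            (entry["totalAssets"] + entry["totalCash"]) / denominator
            if denominator else None
        )
        quarters.append(entry)
    return {**document, "quarterly_financials": quarters}
```

The returned dict leaves the input untouched, so it can be compared against the original before anything is written to the collection.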

[Comments]: