Understanding fold() and its impact on Gremlin query cost in Azure Cosmos DB

【Question Title】: Understanding fold() and its impact on gremlin query cost in Azure Cosmos DB
【Posted】: 2019-09-27 03:17:06
【Question Description】:

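I am trying to understand query costs in Azure Cosmos DB.

I cannot figure out what the difference is between the following examples, or why using fold() lowers the cost: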

g.V().hasLabel('item').project('itemId', 'id').by('itemId').by('id')

produces the following output:

[
  {
    "itemId": 14,
    "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"
  },
  {
    "itemId": 5,
    "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"
  },
  {
    "itemId": 6,
    "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"
  },
...
  {
    "itemId": 5088,
    "id": "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc"
  }
]

The cost is 15642 RU x 0.00008 $/RU = $1.25

g.V().hasLabel('item').project('itemId', 'id').by('itemId').by('id').fold()

produces the following output:

[
  [
    {
      "itemId": 14,
      "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"
    },
    {
      "itemId": 5,
      "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"
    },
    {
      "itemId": 6,
      "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"
    },
...
    {
      "itemId": 5088,
      "id": "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc"
    }
  ]
]

The cost is 787 RU x 0.00008 $/RU = $0.06

g.V().hasLabel('item').values('id', 'itemId')

gives the following output:

[
  "186de1fb-eaaf-4cc2-b32b-de8d7be289bb",
  14,
  "361753f5-7d18-4a43-bb1d-cea21c489f2e",
  5,
  "1c0840ee-07eb-4a1e-86f3-abba28998cd1",
  6,
...
  "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc",
  5088
]

The cost is 10639 RU x 0.00008 $/RU = $0.85

g.V().hasLabel('item').values('id', 'itemId').fold()

gives the following output:

[
  [
    "186de1fb-eaaf-4cc2-b32b-de8d7be289bb",
    14,
    "361753f5-7d18-4a43-bb1d-cea21c489f2e",
    5,
    "1c0840ee-07eb-4a1e-86f3-abba28998cd1",
    6,
...
    "2ed1871d-c0e1-4b38-b5e0-78087a5a75fc",
    5088
  ]
]

The cost is 724.27 RU x 0.00008 $/RU = $0.057

As you can see, the impact on cost is huge, and this is for only about 3200 nodes with few properties.
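The cost figures above can be reproduced with a short calculation. This is a minimal sketch that takes the RU charges and the 0.00008 $/RU price exactly as quoted in the question; the price is the poster's figure, not an official rate, and the names `PRICE_PER_RU`, `charges` and `cost_usd` are made up for illustration.

```python
# Price per RU as quoted in the question (the poster's figure, not an official rate).
PRICE_PER_RU = 0.00008

# RU charges reported in the question for each query variant.
charges = {
    "project": 15642,
    "project + fold": 787,
    "values": 10639,
    "values + fold": 724.27,
}


def cost_usd(ru, price=PRICE_PER_RU):
    """Convert a request charge in RUs to dollars."""
    return ru * price


for name, ru in charges.items():
    print(f"{name:15s} {ru:>9} RU  ${cost_usd(ru):.3f}")

# Ratio between the unfolded and the folded variant of each query.
project_ratio = charges["project"] / charges["project + fold"]
values_ratio = charges["values"] / charges["values + fold"]
print(f"project: {project_ratio:.1f}x, values: {values_ratio:.1f}x")
```

With these numbers the unfolded queries cost roughly 20 times (project) and 15 times (values) as much as their folded counterparts.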

I would like to understand why adding fold() changes the cost so much.

【Comments】:

Tags: azure-cosmosdb gremlin azure-cosmosdb-gremlinapi

【Solution 1】:

I tried to reproduce your example, but unfortunately got the opposite result (with 500 vertices in Cosmos):

g.V().hasLabel('test').values('id')

or

g.V().hasLabel('test').project('id').by('id')

cost 86.08 and 91.44 RUs respectively, while the same queries followed by a fold() step cost 585.06 and 590.43 RUs.
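To make the comparison concrete, the extra charge attributable to fold() in this reproduction can be computed directly from the four figures above. This is only arithmetic on the reported numbers; the dictionary names are hypothetical and nothing here is measured independently.

```python
# RU charges reported in this answer (500 vertices labelled 'test').
without_fold = {"values": 86.08, "project": 91.44}
with_fold = {"values": 585.06, "project": 590.43}

# Extra RUs charged once fold() is appended to each query.
overhead = {
    name: round(with_fold[name] - without_fold[name], 2)
    for name in without_fold
}
print(overhead)
```

Both differences come out close to 499 RUs, so in this run fold() added about the same charge to either query regardless of whether values() or project() preceded it.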

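According to the TinkerPop documentation, the result I got looks right:

There are situations when the traversal stream needs a "barrier" to aggregate all the objects and emit a computation that is a function of the aggregate. The fold()-step (map) is one particular instance of this.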

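Given that Cosmos charges RUs both for the number of objects accessed and for the computation performed on those objects (fold() in this particular case), a higher cost with fold() is expected.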
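The barrier behaviour described in the quote can be modelled in a few lines of plain Python. This is a sketch of the TinkerPop semantics only, using generators as a stand-in for the traversal stream; the functions `project_ids` and `fold` are illustrative, and the sketch says nothing about how Cosmos DB executes the query or charges RUs for it.

```python
def project_ids(vertices):
    """Lazily emit one projected map per vertex, like project(...).by(...)."""
    for v in vertices:
        yield {"itemId": v["itemId"], "id": v["id"]}


def fold(traversers):
    """Barrier: consume the whole stream, then emit a single list."""
    yield list(traversers)


# Three of the vertices shown in the question's output.
vertices = [
    {"itemId": 14, "id": "186de1fb-eaaf-4cc2-b32b-de8d7be289bb"},
    {"itemId": 5, "id": "361753f5-7d18-4a43-bb1d-cea21c489f2e"},
    {"itemId": 6, "id": "1c0840ee-07eb-4a1e-86f3-abba28998cd1"},
]

unfolded = list(project_ids(vertices))      # one result per vertex
folded = list(fold(project_ids(vertices)))  # one result holding all of them

print(len(unfolded), len(folded))  # prints: 3 1
```

The folded traversal returns a single result that wraps the same maps, which matches the nested-array output shown in the question.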

You can try running the executionProfile() step on your traversal, which can help you investigate your case. When I tried:

g.V().hasLabel('test').values('id').executionProfile()

I got 2 additional steps for fold() (the identical parts of the output are omitted for brevity), and this ProjectAggregation is where the result set is mapped from 500 to 1:

 ...
      {
        "name": "ProjectAggregation",
        "time": 165,
        "annotations": {
          "percentTime": 8.2
        },
        "counts": {
          "resultCount": 1
        }
      },
      {
        "name": "QueryDerivedTableOperator",
        "time": 1,
        "annotations": {
          "percentTime": 0.05
        },
        "counts": {
          "resultCount": 1
        }
      }
...
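A profile like this is easier to compare across queries once it is reduced to one row per operator. The following sketch works only on the two operators shown in the excerpt and relies on the field names visible there (`name`, `time`, `annotations.percentTime`, `counts.resultCount`); the rest of the profile is elided in the answer and is not reconstructed here, and the helper `summarize` is hypothetical.

```python
import json

# The two operators from the excerpt above; everything else is elided in the answer.
excerpt = json.loads("""
[
  {"name": "ProjectAggregation", "time": 165,
   "annotations": {"percentTime": 8.2}, "counts": {"resultCount": 1}},
  {"name": "QueryDerivedTableOperator", "time": 1,
   "annotations": {"percentTime": 0.05}, "counts": {"resultCount": 1}}
]
""")


def summarize(operators):
    """Return (name, time, percentTime, resultCount) rows, slowest first."""
    rows = [
        (
            op["name"],
            op["time"],
            op["annotations"]["percentTime"],
            op["counts"]["resultCount"],
        )
        for op in operators
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, time, pct, count in summarize(excerpt):
    print(f"{name:28s} time={time:>4} ({pct}%)  results={count}")
```

Running the same summary on the full profiles of the folded and unfolded queries would show which operators appear only with fold() and how much time each one takes.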

【Comments】: