【Question title】: ArangoDB facet calculation / aggregation slow?
【Posted】: 2014-11-26 20:10:15
【Question】:

I am wondering why the following facet calculation is so slow:

FOR q IN LRQ  
    COLLECT profile = q.LongRunningQuery.Profile INTO profiles 
RETURN { "Profile" : profile, "Count" : LENGTH(profiles)} 

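This takes about 30 seconds, although there are only 5,000 documents in the database and only 30 distinct facets in the result.

The field LongRunningQuery.Profile is indexed with both a hash index and a skiplist index. (I also tried different combinations of the two.)

Does anyone have a hint as to what might be wrong? Could the query benefit from an index? (The 5,000 records are about 1 GB in size, so I assume the hash index is not used and a full table scan happens instead?)

Interestingly, the following alternative takes only 2 seconds: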

FOR q IN SKIPLIST(LRQ, { "LongRunningQuery.Profile": [ [ '>',  ''  ] ] })[*].LongRunningQuery.Profile
    COLLECT profile = q INTO profiles
RETURN { "Profile" : profile, "Count" : LENGTH(profiles) } 

But it still takes 2 seconds, which is a lot for so few records. It looks like the skiplist index is used here, but it may not be the ideal index type.


Update, 27 November 2014:

arangosh [_system]> stmt._query
    FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN {
     "Profile" : profile, "Count" : LENGTH(profiles)}

arangosh [_system]> db.LRQ.ensureHashIndex("LongRunningQuery.Profile");
{
  "id" : "LRQ/296017913256",
  "type" : "hash",
  "unique" : false,
  "fields" : [
    "LongRunningQuery.Profile"
  ],
  "isNewlyCreated" : false,
  "error" : false,
  "code" : 200
}

The query takes about 32 seconds and returns 31 short results.

Execution plan:

    {
        "plan": {
            "nodes": [
                {
                    "type": "SingletonNode",
                    "dependencies": [],
                    "id": 1,
                    "estimatedCost": 1,
                    "estimatedNrItems": 1
                },
                {
                    "type": "EnumerateCollectionNode",
                    "dependencies": [
                        1
                    ],
                    "id": 2,
                    "estimatedCost": 5311,
                    "estimatedNrItems": 5310,
                    "database": "_system",
                    "collection": "LRQ",
                    "outVariable": {
                        "id": 0,
                        "name": "q"
                    }
                },
                {
                    "type": "CalculationNode",
                    "dependencies": [
                        2
                    ],
                    "id": 3,
                    "estimatedCost": 10621,
                    "estimatedNrItems": 5310,
                    "expression": {
                        "type": "attribute access",
                        "name": "Profile",
                        "subNodes": [
                            {
                                "type": "attribute access",
                                "name": "LongRunningQuery",
                                "subNodes": [
                                    {
                                        "type": "reference",
                                        "name": "q",
                                        "id": 0
                                    }
                                ]
                            }
                        ]
                    },
                    "outVariable": {
                        "id": 3,
                        "name": "3"
                    },
                    "canThrow": false
                },
                {
                    "type": "SortNode",
                    "dependencies": [
                        3
                    ],
                    "id": 4,
                    "estimatedCost": 56166.713176593075,
                    "estimatedNrItems": 5310,
                    "elements": [
                        {
                            "inVariable": {
                                "id": 3,
                                "name": "3"
                            },
                            "ascending": true
                        }
                    ],
                    "stable": true
                },
                {
                    "type": "AggregateNode",
                    "dependencies": [
                        4
                    ],
                    "id": 5,
                    "estimatedCost": 61476.713176593075,
                    "estimatedNrItems": 5310,
                    "aggregates": [
                        {
                            "outVariable": {
                                "id": 1,
                                "name": "profile"
                            },
                            "inVariable": {
                                "id": 3,
                                "name": "3"
                            }
                        }
                    ],
                    "outVariable": {
                        "id": 2,
                        "name": "profiles"
                    }
                },
                {
                    "type": "CalculationNode",
                    "dependencies": [
                        5
                    ],
                    "id": 6,
                    "estimatedCost": 66786.71317659307,
                    "estimatedNrItems": 5310,
                    "expression": {
                        "type": "array",
                        "subNodes": [
                            {
                                "type": "array element",
                                "name": "Profile",
                                "subNodes": [
                                    {
                                        "type": "reference",
                                        "name": "profile",
                                        "id": 1
                                    }
                                ]
                            },
                            {
                                "type": "array element",
                                "name": "Count",
                                "subNodes": [
                                    {
                                        "type": "function call",
                                        "name": "LENGTH",
                                        "subNodes": [
                                            {
                                                "type": "list",
                                                "subNodes": [
                                                    {
                                                        "type": "reference",
                                                        "name": "profiles",
                                                        "id": 2
                                                    }
                                                ]
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    },
                    "outVariable": {
                        "id": 4,
                        "name": "4"
                    },
                    "canThrow": false
                },
                {
                    "type": "ReturnNode",
                    "dependencies": [
                        6
                    ],
                    "id": 7,
                    "estimatedCost": 72096.71317659307,
                    "estimatedNrItems": 5310,
                    "inVariable": {
                        "id": 4,
                        "name": "4"
                    }
                }
            ],
            "rules": [],
            "collections": [
                {
                    "name": "LRQ",
                    "type": "read"
                }
            ],
            "variables": [
                {
                    "id": 0,
                    "name": "q"
                },
                {
                    "id": 1,
                    "name": "profile"
                },
                {
                    "id": 4,
                    "name": "4"
                },
                {
                    "id": 2,
                    "name": "profiles"
                },
                {
                    "id": 3,
                    "name": "3"
                }
            ],
            "estimatedCost": 72096.71317659307,
            "estimatedNrItems": 5310
        },
        "warnings": []
    }

Update, 5 December 2014:

Here are further measurements. This is the output:

Executing AQL_EXECUTE('FOR q IN LRQ FILTER q.LongRunningQuery.Profile == "Admin" LIMIT 1 RETURN q.LongRunningQuery.Profile', {}, { profile : true }).profile --> { "initializing" : 0, "parsing" : 0, "optimizing ast" : 15.364980936050415, "instanciating plan" : 0, "optimizing plan" : 0, "executing" : 0 }

Executing AQL_EXECUTE('FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN { "Profile" : profile, "Count" : LENGTH(profiles)}', {}, { profile : true }).profile --> { "initializing" : 0, "parsing" : 0, "optimizing ast" : 0, "instanciating plan" : 0, "optimizing plan" : 0, "executing" : 77.88313102722168 }

Update, 19 December 2014:

As of 2.3.2, the execution plan of the query

arangosh [_system]> stmt2 = db._createStatement('FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN { "Profile" : profile, "Count" : LENGTH(profiles)} ')

looks like this:

arangosh [_system]> stmt2.explain()
{
  "plan" : {
    "nodes" : [
      {
        "type" : "SingletonNode",
        "dependencies" : [ ],
        "id" : 1,
        "estimatedCost" : 1,
        "estimatedNrItems" : 1
      },
      {
        "type" : "IndexRangeNode",
        "dependencies" : [
          1
        ],
        "id" : 8,
        "estimatedCost" : 5311,
        "estimatedNrItems" : 5310,
        "database" : "_system",
        "collection" : "LRQ",
        "outVariable" : {
          "id" : 0,
          "name" : "q"
        },
        "ranges" : [
          [ ]
        ],
        "index" : {
          "type" : "skiplist",
          "id" : "530975525379",
          "unique" : false,
          "fields" : [
            "LongRunningQuery.Profile"
          ]
        },
        "reverse" : false
      },
      {
        "type" : "CalculationNode",
        "dependencies" : [
          8
        ],
        "id" : 3,
        "estimatedCost" : 10621,
        "estimatedNrItems" : 5310,
        "expression" : {
          "type" : "attribute access",
          "name" : "Profile",
          "subNodes" : [
            {
              "type" : "attribute access",
              "name" : "LongRunningQuery",
              "subNodes" : [
                {
                  "type" : "reference",
                  "name" : "q",
                  "id" : 0
                }
              ]
            }
          ]
        },
        "outVariable" : {
          "id" : 3,
          "name" : "3"
        },
        "canThrow" : false
      },
      {
        "type" : "AggregateNode",
        "dependencies" : [
          3
        ],
        "id" : 5,
        "estimatedCost" : 15931,
        "estimatedNrItems" : 5310,
        "aggregates" : [
          {
            "outVariable" : {
              "id" : 1,
              "name" : "profile"
            },
            "inVariable" : {
              "id" : 3,
              "name" : "3"
            }
          }
        ],
        "outVariable" : {
          "id" : 2,
          "name" : "profiles"
        }
      },
      {
        "type" : "CalculationNode",
        "dependencies" : [
          5
        ],
        "id" : 6,
        "estimatedCost" : 21241,
        "estimatedNrItems" : 5310,
        "expression" : {
          "type" : "array",
          "subNodes" : [
            {
              "type" : "array element",
              "name" : "Profile",
              "subNodes" : [
                {
                  "type" : "reference",
                  "name" : "profile",
                  "id" : 1
                }
              ]
            },
            {
              "type" : "array element",
              "name" : "Count",
              "subNodes" : [
                {
                  "type" : "function call",
                  "name" : "LENGTH",
                  "subNodes" : [
                    {
                      "type" : "list",
                      "subNodes" : [
                        {
                          "type" : "reference",
                          "name" : "profiles",
                          "id" : 2
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        "outVariable" : {
          "id" : 4,
          "name" : "4"
        },
        "canThrow" : false
      },
      {
        "type" : "ReturnNode",
        "dependencies" : [
          6
        ],
        "id" : 7,
        "estimatedCost" : 26551,
        "estimatedNrItems" : 5310,
        "inVariable" : {
          "id" : 4,
          "name" : "4"
        }
      }
    ],
    "rules" : [
      "use-index-for-sort"
    ],
    "collections" : [
      {
        "name" : "LRQ",
        "type" : "read"
      }
    ],
    "variables" : [
      {
        "id" : 0,
        "name" : "q"
      },
      {
        "id" : 1,
        "name" : "profile"
      },
      {
        "id" : 4,
        "name" : "4"
      },
      {
        "id" : 2,
        "name" : "profiles"
      },
      {
        "id" : 3,
        "name" : "3"
      }
    ],
    "estimatedCost" : 26551,
    "estimatedNrItems" : 5310
  },
  "warnings" : [ ],
  "stats" : {
    "rulesExecuted" : 25,
    "rulesSkipped" : 0,
    "plansCreated" : 1
  }
}

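【Comments】:

  • Can you tell me which version of ArangoDB you are using? 2.2 or 2.3?
  • I suspect this has to do with some extensive copying in the COLLECT operation. A bug has been reported here: github.com/triAGENS/ArangoDB/issues/1107
  • I am already using 2.3.
  • But I remember this being slow in 2.2 as well.
  • The strings in "profile" are few and short: there are about 40 distinct strings in total, each at most 30 characters long. Does that fit the memory leak in COLLECT?

Tags: arangodb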


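【Solution 1】:

Hm, looking at the explain output, there is a SortNode although your query does not specify a sort? The COLLECT may be preventing the optimizer from using your index (otherwise you would have an IndexRangeNode instead of an EnumerateCollectionNode).

If you pass { profile : true } as the query's options parameter (the 4th parameter of db._query()), it will output the time spent in each phase. Could you re-run your query with it and show us the reply?

【Comments】:

  • I'd be happy to. It seems I am doing something wrong here, but what? My call is: c = db._query('FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN { "Profile" : profile, "Count" : LENGTH(profiles)}', null, null, { profile : true }); c.elements() ... but in this case I do not get any profiling information. In other queries I do get profiling information. Any hints?
  • Hm, OK, a different approach is needed. Please run arangod with --console and execute a query like this: AQL_EXECUTE("FOR u IN _users RETURN u", {}, { profile : true }).profile
  • Got it, thanks. Here is the output: Executing AQL_EXECUTE('FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN { "Profile" : profile, "Count" : LENGTH(profiles)}', {}, { profile : true } ).profile --> { "initializing" : 0, "parsing" : 0, "optimizing ast" : 0, "instanciating plan" : 0, "optimizing plan" : 0, "executing" : 77.88313102722168 }
  • Another measurement looks different: Executing AQL_EXECUTE('FOR q IN LRQ FILTER q.LongRunningQuery.Profile == "Admin" LIMIT 1 RETURN q.LongRunningQuery.Profile', {}, { profile : true }).profile --> { "initializing" : 0, "parsing" : 0, "optimizing ast" : 15.364980936050415, "instanciating plan" : 0, "optimizing plan" : 0, "executing" : 0 }
  • In the meantime I am using version 2.3.1!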
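【Solution 2】:

The COLLECT statement requires sorted input. A SORT statement is therefore added to the execution plan automatically, even if the original query string does not contain an explicit SORT statement.

That is why a SortNode appears in the plan. If there is a skiplist index on the sort attribute (LongRunningQuery.Profile in this case), the SortNode will be optimized away. Adding a skiplist index on the attribute will therefore speed the query up, because the (expensive) sort step can be skipped.

If you set up such an index and run the query, it should be faster than with only the hash index. In fact, the original query should have ignored the hash index entirely.

If you have set up the skiplist index and explain the query, you should also see that the SortNode is gone.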
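
To illustrate why grouping wants sorted input, here is a minimal plain-JavaScript sketch. It is not ArangoDB's implementation; the sample array and the helper names (`groupSorted`, `keyOf`) are hypothetical and only mimic the attribute names from the question. With sorted input, a single pass can close a group whenever the key changes. Without an ordered index, the sort has to run first, which is roughly the job of the SortNode.

```javascript
// Hypothetical in-memory stand-in for the LRQ collection.
const docs = [
  { LongRunningQuery: { Profile: "Admin" } },
  { LongRunningQuery: { Profile: "User" } },
  { LongRunningQuery: { Profile: "Admin" } },
  { LongRunningQuery: { Profile: "Guest" } },
  { LongRunningQuery: { Profile: "User" } },
  { LongRunningQuery: { Profile: "Admin" } },
];

const keyOf = (d) => d.LongRunningQuery.Profile;

// Single pass over input that is ALREADY sorted by key:
// a new group starts whenever the key differs from the previous one.
function groupSorted(sortedDocs, keyOf) {
  const groups = [];
  let current = null;
  for (const doc of sortedDocs) {
    const key = keyOf(doc);
    if (current === null || current.key !== key) {
      current = { key, members: [] };
      groups.push(current);
    }
    current.members.push(doc);
  }
  return groups;
}

// Explicit sort step, needed only because the input has no guaranteed order.
const sorted = [...docs].sort((a, b) =>
  keyOf(a) < keyOf(b) ? -1 : keyOf(a) > keyOf(b) ? 1 : 0
);

const facets = groupSorted(sorted, keyOf).map((g) => ({
  Profile: g.key,
  Count: g.members.length,
}));
console.log(facets);
```

Feeding the unsorted `docs` array straight into `groupSorted` would split equal keys into several fragments, which is why the sort (or an index that already delivers the values in order) has to come first.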

As of ArangoDB 2.4 (currently in development), there is a more efficient syntax for computing just the facet counts:

FOR q IN LRQ  
  COLLECT profile = q.LongRunningQuery.Profile WITH COUNT INTO numProfiles
  RETURN { "Profile" : profile, "Count" : numProfiles } 

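This should speed up the query further.

【Comments】:

  • Thanks, good to hear about the further improvements in 2.4, and thanks for the detailed explanation. My problem is that I have already activated such a skiplist index, but either it is not used or the performance is still poor (one minute for 5,000 records). This is the index: arangosh [_system]> db.LRQ.index("530975525379") { "id" : "LRQ/530975525379", "type" : "skiplist", "unique" : false, "fields" : [ "LongRunningQuery.Profile" ], "error" : false, "code" : 200 }
  • The following statement works, but should it really still take 2 seconds (5,000 records, 30 results)? FOR q IN SKIPLIST(LRQ, { "LongRunningQuery.Profile": [ [ '>', '' ] ] })[*].LongRunningQuery.Profile COLLECT profile = q RETURN { "Profile" : profile }
  • But I noticed that the execution plan now looks different from the one in the previous version. I have therefore updated the details in the question above.
  • Could these 5,000 records (> 1 GB in total) cause the slow runtime, even with the index?
  • The last execution plan you posted shows that the index is used and that there is no extra sort step. So the problem with the query must now be the building of the group value variable (i.e. profiles). In the original version of the query, the entire documents are copied into this variable, which is expensive because the documents add up to > 1 GB, as you mentioned. The WITH COUNT INTO ... extension in 2.4 should fix this. Until then, the SKIPLIST()-y version of the query probably does not copy as much and is therefore faster for now.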
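
The difference described in the last comment can be sketched in plain JavaScript. This is only an analogy under stated assumptions, not ArangoDB code: `makeDocs`, `collectInto` and `collectWithCount` are hypothetical helpers, the `payload` field stands in for the large document bodies, and a `Map` is used for brevity instead of sorted input. The first function copies every member document into its group before counting, while the second keeps only one counter per group.

```javascript
// Hypothetical documents: a short facet key plus a large payload.
function makeDocs(n, payloadSize) {
  const profiles = ["Admin", "Guest", "User"];
  const docs = [];
  for (let i = 0; i < n; i++) {
    docs.push({
      LongRunningQuery: { Profile: profiles[i % profiles.length] },
      payload: "x".repeat(payloadSize),
    });
  }
  return docs;
}

// Analogue of COLLECT ... INTO profiles: each group stores a full copy
// of every member document, although only the group size is used later.
function collectInto(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc.LongRunningQuery.Profile;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(JSON.parse(JSON.stringify(doc))); // deep copy
  }
  return [...groups].map(([Profile, members]) => ({
    Profile,
    Count: members.length,
  }));
}

// Analogue of COLLECT ... WITH COUNT INTO numProfiles: one counter per
// group, no document is copied.
function collectWithCount(docs) {
  const counts = new Map();
  for (const doc of docs) {
    const key = doc.LongRunningQuery.Profile;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts].map(([Profile, Count]) => ({ Profile, Count }));
}

const sample = makeDocs(9, 1000);
console.log(collectInto(sample));
console.log(collectWithCount(sample));
```

Both functions return the same facet counts. The work done by the first grows with the total size of the documents, the work done by the second only with their number.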