Which is better in cts:search in MarkLogic: collection() or the root element?

【Question title】: Which is better, collection() or root element, in cts:search in MarkLogic?
【Posted】: 2018-11-08 16:57:31
【Question description】:

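On one of my projects, a MarkLogic consultant advised me to use collection() in cts:search, while on another project an ML consultant advised me to use the root element in cts:search. Both projects have the same number of documents. Which one performs better?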

Suppose we have one document (I picked a small one to explain the scenario). It is in a collection named "demo":

<root>
<child1>ABC</child1>
<child2>DEF</child2>
<child3>GHI</child3>
<child4>JKL</child4>
</root>

Which of these is better or more efficient:

cts:search(/root, cts:and-query((....some cts:queries..)))

cts:search(collection("demo"), cts:and-query((....some cts:queries..)))

Please help me understand which one is better.

【Discussion】:

Tags: xquery marklogic

【Solution 1】:

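As far as search execution goes, both are single-term lookups, so performance should be the same.

The real difference is in how you want to manage your content. A document can be in multiple collections, so you can slice the same content in several ways, but it can have only one root element. Collections also let you abstract away from the details of document structure: a single collection can hold documents with many different root elements.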
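To make the content-management point concrete, here is a minimal sketch. The URI `/docs/example.xml` and the second collection name `archive` are hypothetical, and `cts:word-query("ABC")` merely stands in for whatever queries the question elides. First, the same document is placed in more than one collection:

```xquery
xquery version "1.0-ml";

(: Hypothetical URI and collection names, for illustration only. :)
xdmp:document-add-collections("/docs/example.xml", ("demo", "archive"))
```

Then, in a separate request, either collection can scope a search regardless of the root element of the documents it contains:

```xquery
xquery version "1.0-ml";

(
  (: Scope by collection in the searchable expression. :)
  cts:search(fn:collection("demo"), cts:word-query("ABC")),

  (: Or express the collection as one more constraint in the query itself. :)
  cts:search(
    fn:doc(),
    cts:and-query((cts:collection-query("archive"), cts:word-query("ABC")))
  )
)
```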

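【Discussion】:

Also, thinking in terms of an XML "file", a collection is "invisible" metadata: if you save the document as an XML file (or read it from one), the collection it belongs to is not part of that file and has to be associated separately. It is similar to the URI, which is part of the database representation but not of the XML content itself. This has pros and cons.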
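A small sketch of that separation, assuming a hypothetical document URI: the document content and its collections are retrieved through different calls, and the collection names do not appear anywhere in the XML that `fn:doc` returns.

```xquery
xquery version "1.0-ml";

let $uri := "/docs/example.xml"  (: hypothetical URI :)
return (
  (: The stored XML content only, with no collection information in it. :)
  fn:doc($uri),

  (: The collection URIs, held as metadata alongside the document. :)
  xdmp:document-get-collections($uri)
)
```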

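【Solution 2】:

According to the MarkLogic documentation, "MarkLogic's implementation of collections is designed to optimize query performance against large volumes of documents." So you would only be able to see a difference on a very large database.

I tried to verify this in practice, so I created two XQuery queries, one using the collection and one using the element, as you suggested, and put xdmp:query-trace(fn:true()) at the top of each. I ran the queries one at a time and analyzed my MarkLogic log file.

For the element XQuery: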

2018-11-12 15:16:58.448 Info: App-Services: at 5:12: xdmp:eval("declare namespace sem = &quot;http://marklogic.com/semantics&quo...", (), <options xmlns="xdmp:eval"><database>5310618057872024096</database>...</options>)
2018-11-12 15:16:58.448 Info: App-Services: at 5:12: Analyzing path for search: fn:collection()/sem:triples
2018-11-12 15:16:58.448 Info: App-Services: at 5:12: Step 1 is searchable: fn:collection()
2018-11-12 15:16:58.448 Info: App-Services: at 5:12: Step 2 is searchable: sem:triples
2018-11-12 15:16:58.448 Info: App-Services: at 5:12: Path is fully searchable.
2018-11-12 15:16:58.448 Info: App-Services: at 5:12: Gathering constraints.
2018-11-12 15:16:58.448 Info: App-Services: at 5:12: Step 2 contributed 1 constraint: sem:triples
2018-11-12 15:16:58.449 Info: App-Services: at 5:12: Search query contributed 1 constraint: cts:element-value-query(xs:QName("sem:object"), "taxonomy", ("lang=en"), 1)
2018-11-12 15:16:58.449 Info: App-Services: at 5:12: Executing search.
2018-11-12 15:16:58.464 Info: App-Services: at 5:12: Selected 65964 fragments to filter

For the collection XQuery:

2018-11-12 15:20:07.871 Info: App-Services: at 5:11: xdmp:eval("declare namespace sem = &quot;http://marklogic.com/semantics&quo...", (), <options xmlns="xdmp:eval"><database>5310618057872024096</database>...</options>)
2018-11-12 15:20:07.871 Info: App-Services: at 5:11: Analyzing path for search: fn:collection("/triples")
2018-11-12 15:20:07.871 Info: App-Services: at 5:11: Step 1 is searchable: fn:collection("/triples")
2018-11-12 15:20:07.871 Info: App-Services: at 5:11: Path is fully searchable.
2018-11-12 15:20:07.871 Info: App-Services: at 5:11: Gathering constraints.
2018-11-12 15:20:07.871 Info: App-Services: at 5:11: Step 1 contributed 1 constraint: fn:collection("/triples")
2018-11-12 15:20:07.875 Info: App-Services: at 5:11: Search query contributed 1 constraint: cts:element-value-query(xs:QName("sem:object"), "taxonomy", ("lang=en"), 1)
2018-11-12 15:20:07.875 Info: App-Services: at 5:11: Executing search.
2018-11-12 15:20:07.891 Info: App-Services: at 5:11: Selected 65964 fragments to filter

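The difference is clear. With the collection query, MarkLogic does almost everything in a single step ("Step 1"), whereas with the element query it goes through a two-step process.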
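The queries themselves are not shown in this answer. The following is only a sketch of what they might have looked like, reconstructed from the paths and the constraint printed in the trace lines above; the exact original code is an assumption.

```xquery
xquery version "1.0-ml";
declare namespace sem = "http://marklogic.com/semantics";

(
  (: Write the query analysis for this request to the server log. :)
  xdmp:query-trace(fn:true()),

  (: Element variant: the path is analyzed as fn:collection()/sem:triples. :)
  cts:search(
    /sem:triples,
    cts:element-value-query(xs:QName("sem:object"), "taxonomy", "lang=en")
  )
)
```

For the collection variant, the first argument of `cts:search` would be `fn:collection("/triples")` instead of `/sem:triples`, with the rest unchanged.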

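【Discussion】:

Those traces are not showing you two different steps; they are showing you a set of constraints being added to a single query. If you look at the plan (wrap xdmp:plan around the cts:search), you will see the exact query the indexes are resolving. When constraints are intersected (and-ed), the one with the fewest hits in each stand determines the amount of work, but resolution is so fast that you will not notice it except with very large stands and leaf result sets.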
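As a sketch of that suggestion, using the "demo" collection from the question and a placeholder `cts:word-query("ABC")` in place of the elided queries (the word query is an assumption, not part of the original):

```xquery
xquery version "1.0-ml";

(: Returns the query plan as XML instead of running the search. :)
xdmp:plan(
  cts:search(fn:collection("demo"), cts:word-query("ABC"))
)
```

Running the same call with `/root` as the first argument of `cts:search` and comparing the two plans should show each variant contributing one additional term to a single and-ed query.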