Wildcard cts:element-value-query returns wrong matches: answers

【Question title】: Wildcard cts:element-value-query returning wrong matches
【Posted】: 2018-09-24 09:19:28
【Question description】:

A wildcard cts:element-value-query is not behaving as expected.

Query used to insert the document:

xdmp:document-insert('/sample/2.xml', <data>the living Theater</data>)

cts query:

cts:search(
    doc(),
    cts:element-value-query(xs:QName('data'), 'theater* *', ('wildcarded', 'case-insensitive', 'unstemmed', 'punctuation-sensitive', 'whitespace-sensitive')),
    'unfiltered'
)

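The cts query above returns the /sample/2.xml document. As I understand it, this query should not return that document; it should only return documents whose value starts with the text theater.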

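The problem seems to be tied to the following text pattern.

Text in the document: @@@ word @@@text

Search term: @@@t* *

@ - can be any character.

I can also reproduce the problem with the following data.

Text in the document: mark the marklogic

Search text: markl* *

The wildcard-related indexes are set to true.

I have pasted the database configuration below in case it helps to find the problem.

Database configuration: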

<package-database xmlns="http://marklogic.com/manage/package/databases">
    <config>
        <name>publishers</name>
        <package-database-properties>
            <enabled>true</enabled>
            <retired-forest-count>0</retired-forest-count>
            <language>en</language>
            <stemmed-searches>advanced</stemmed-searches>
            <word-searches>true</word-searches>
            <word-positions>true</word-positions>
            <fast-phrase-searches>true</fast-phrase-searches>
            <fast-reverse-searches>false</fast-reverse-searches>
            <triple-index>true</triple-index>
            <triple-positions>true</triple-positions>
            <fast-case-sensitive-searches>true</fast-case-sensitive-searches>
            <fast-diacritic-sensitive-searches>true</fast-diacritic-sensitive-searches>
            <fast-element-word-searches>true</fast-element-word-searches>
            <element-word-positions>true</element-word-positions>
            <fast-element-phrase-searches>true</fast-element-phrase-searches>
            <element-value-positions>true</element-value-positions>
            <attribute-value-positions>true</attribute-value-positions>
            <field-value-searches>true</field-value-searches>
            <field-value-positions>true</field-value-positions>
            <three-character-searches>true</three-character-searches>
            <three-character-word-positions>true</three-character-word-positions>
            <fast-element-character-searches>true</fast-element-character-searches>
            <trailing-wildcard-searches>true</trailing-wildcard-searches>
            <trailing-wildcard-word-positions>true</trailing-wildcard-word-positions>
            <fast-element-trailing-wildcard-searches>true</fast-element-trailing-wildcard-searches>
            <word-lexicons>
                <word-lexicon>http://marklogic.com/collation/codepoint</word-lexicon>
            </word-lexicons>
            <two-character-searches>false</two-character-searches>
            <one-character-searches>false</one-character-searches>
            <uri-lexicon>true</uri-lexicon>
            <collection-lexicon>true</collection-lexicon>
            <reindexer-enable>true</reindexer-enable>
            <reindexer-throttle>5</reindexer-throttle>
            <reindexer-timestamp>0</reindexer-timestamp>
            <directory-creation>manual</directory-creation>
            <maintain-last-modified>false</maintain-last-modified>
            <maintain-directory-last-modified>false</maintain-directory-last-modified>
            <inherit-permissions>false</inherit-permissions>
            <inherit-collections>false</inherit-collections>
            <inherit-quality>false</inherit-quality>
            <in-memory-limit>174080</in-memory-limit>
            <in-memory-list-size>341</in-memory-list-size>
            <in-memory-tree-size>85</in-memory-tree-size>
            <in-memory-range-index-size>11</in-memory-range-index-size>
            <in-memory-reverse-index-size>11</in-memory-reverse-index-size>
            <in-memory-triple-index-size>44</in-memory-triple-index-size>
            <large-size-threshold>1024</large-size-threshold>
            <locking>fast</locking>
            <journaling>fast</journaling>
            <journal-size>682</journal-size>
            <journal-count>2</journal-count>
            <preallocate-journals>false</preallocate-journals>
            <preload-mapped-data>false</preload-mapped-data>
            <preload-replica-mapped-data>false</preload-replica-mapped-data>
            <range-index-optimize>facet-time</range-index-optimize>
            <positions-list-max-size>256</positions-list-max-size>
            <format-compatibility>automatic</format-compatibility>
            <index-detection>automatic</index-detection>
            <expunge-locks>none</expunge-locks>
            <tf-normalization>scaled-log</tf-normalization>
            <merge-priority>lower</merge-priority>
            <merge-max-size>32768</merge-max-size>
            <merge-min-size>1024</merge-min-size>
            <merge-min-ratio>2</merge-min-ratio>
            <merge-timestamp>0</merge-timestamp>
            <retain-until-backup>false</retain-until-backup>
            <assignment-policy-name>bucket</assignment-policy-name>
        </package-database-properties>
    </config>
</package-database>

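【Question discussion】:

Do you get the correct results when you run the search with the 'filtered' option?
@MadsHansen Yes, with filtering I get the correct results, but I cannot use the filtered option because it is slow.
Try enabling element word positions. You need that to resolve multi-token values accurately without filtering.
@grtjn Element word positions is already set to true. Still the same problem. I also tried adding a word lexicon with the http://marklogic.com/collation/codepoint collation, with no luck.
Please help; I run into this problem frequently and cannot figure out what I am doing wrong. It looks as if the value query is behaving like a word query.

Tags: marklogic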

【Solution 1】:

Try creating an element range index on the data element, then run the following search:

let $terms :=  cts:element-value-match(xs:QName("data"),"theater* *")
return
  cts:search(
    doc(),
    cts:element-value-query(
      xs:QName('data'), 
      $terms, 
      ('wildcarded', 'case-insensitive', 'unstemmed', 'punctuation-sensitive', 'whitespace-sensitive')
    ),
    'unfiltered'
  )

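This will not return your "/sample/2.xml" document.

【Discussion】:

This is a good workaround, but I would like to know why cts:element-value-query does not work as expected, and why it fails in the case above. Hitting the database twice will also affect performance, especially when cts:element-value-match matches a large number of values.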

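【Solution 2】:

Unfiltered searches come with some caveats:

They determine the results directly from the indexes, without filtering to validate that each result is accurate. This makes unfiltered results most comparable to the results of a traditional search-engine style of search.

They include false-positive results. False positives can arise in many situations, including phrase searches containing 3 or more words, certain wildcard searches, and punctuation-sensitive, diacritic-sensitive, and/or case-sensitive searches.

MarkLogic provides a way to determine whether a result is a false positive: you can use cts:contains for this. This XQuery shows that your result is indeed a false positive: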

xquery version "1.0-ml";

declare boundary-space preserve;
declare namespace qm="http://marklogic.com/xdmp/query-meters";

let $trueCounter := 0
let $falseCounter := 0
let $query := cts:element-value-query(xs:QName('data'), 'theater* *')
let $x := 
  for $result in cts:search(fn:doc(), $query, "unfiltered")
  return
  (
  if ( cts:contains($result, $query) )
  then ( xdmp:set($trueCounter, $trueCounter + 1) )
  else ( xdmp:set($falseCounter, $falseCounter + 1) )
  )
return
<results>
  <resultTotal>{$trueCounter}</resultTotal>
  <false-positiveTotal>{$falseCounter}</false-positiveTotal>
  <elapsed-time>{xdmp:query-meters()/qm:elapsed-time/text()}
  </elapsed-time>
</results>

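A MarkLogic search runs in two steps:

1. Candidate ID resolution. MarkLogic searches the indexes for matching documents. These are only candidates, which means they may be false positives. This step is useful for narrowing down the set of documents so that not too many fragments have to be loaded.
2. The candidate IDs are used to load the fragments from disk. Each fragment is then tested again against the original query. This step filters out the false positives.

By using an unfiltered query, you skip the second step, so false positives are not filtered out. You can read more about this here.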
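
To make the two steps concrete, here is a minimal sketch that runs the question's query three ways: index resolution only, index resolution followed by a manual cts:contains check, and a regular filtered search. It assumes the sample document from the question has been inserted; the <comparison> wrapper and its child element names are made up for illustration.

```xquery
xquery version "1.0-ml";

(: Same query and options as in the question. :)
let $query := cts:element-value-query(
  xs:QName("data"), "theater* *",
  ("wildcarded", "case-insensitive", "unstemmed",
   "punctuation-sensitive", "whitespace-sensitive"))

(: Step 1 only: candidates resolved from the indexes. :)
let $candidates := cts:search(fn:doc(), $query, "unfiltered")

(: Step 2 done by hand: re-test each candidate against the query. :)
let $confirmed := $candidates[cts:contains(., $query)]

(: Both steps performed by the server. :)
let $filtered := cts:search(fn:doc(), $query, "filtered")

return
  <comparison>
    <unfiltered>{ for $d in $candidates return xdmp:node-uri($d) }</unfiltered>
    <confirmed>{ for $d in $confirmed return xdmp:node-uri($d) }</confirmed>
    <filtered>{ for $d in $filtered return xdmp:node-uri($d) }</filtered>
  </comparison>
```

If the explanation above applies to your data, a URI that shows up only under <unfiltered> is a false positive from index resolution.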

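Edit: This section further describes the applications for which unfiltered searches are suitable:

Your content and search terms are such that you know the unfiltered searches are also accurate (for example, the searches are all performed at document or fragment roots, they are single-term queries, and they are not wildcard, punctuation-sensitive, diacritic-sensitive, and/or case-sensitive searches).

You do not mind some false-positive results because the results are an estimate (that is, they need to be fast but are not required to be exact).

Your searches return a large number of results, and you want an efficient way to jump to a specific part of those results.

As the first item says, you cannot use wildcard queries if you do not want false positives. I think you should stick with filtered searches.

Hope this helps!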
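
Since the concern raised in the comments is that filtering is slow, one common way to bound its cost is to request only one page of results per call, so that the search does not need to return every match. The following is a sketch only: $start and $page-size are hypothetical paging parameters, and how much this helps depends on your data and on how many false positives precede the requested page.

```xquery
xquery version "1.0-ml";

(: Same query and options as in the question. :)
let $query := cts:element-value-query(
  xs:QName("data"), "theater* *",
  ("wildcarded", "case-insensitive", "unstemmed",
   "punctuation-sensitive", "whitespace-sensitive"))

(: Hypothetical page window; adjust to your application. :)
let $start := 1
let $page-size := 10

return
  (: Filtered search, restricted to a single page of results. :)
  cts:search(fn:doc(), $query, "filtered")[$start to $start + $page-size - 1]
```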

【Discussion】:

Thanks for the detailed explanation.