【Question title】: Solr Spell Check returning false positives (correctlySpelled)
【Posted】: 2016-07-17 03:48:59
【Question】:

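I'm currently running Solr 5.x on a local server, with a Drupal instance generating all the indexes. After a lot of configuration, I'm fairly happy with my Solr setup.

However, one problem I've just noticed is that correctly spelled words are still treated as misspelled, and suggestions are still offered for them.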

"correctlySpelled":false

As you can see in the JSON output, the two words license and vehicle are spelled correctly but are still classified as incorrect:

"spellcheck":{
   "suggestions":[
      "license",
      {
         "numFound":3,
         "startOffset":0,
         "endOffset":7,
         "suggestion":[
            "licensed",
            "licensee",
            "licenser"
         ]
      },
      "vehicle",
      {
         "numFound":3,
         "startOffset":8,
         "endOffset":15,
         "suggestion":[
            "chicle",
            "pedicle",
            "vehiculate"
         ]
      }
   ],
   "correctlySpelled":false,
   "collations":[
      "collation",
      "licensed chicle",
      "collation",
      "licensed pedicle",
      "collation",
      "licensed vehiculate",
      "collation",
      "licenser chicle",
      "collation",
      "licenser pedicle"
   ]
}
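Because the request uses json.nl=flat, the `suggestions` array alternates between a term and its suggestion object, which is easy to misread. The following is a minimal sketch, not part of the original post, that pairs each term with its suggestions. The helper name `pair_suggestions` is hypothetical, and the sample data is a trimmed copy of the output above.

```python
def pair_suggestions(spellcheck):
    """Turn the flat [term, info, term, info, ...] list into a dict."""
    flat = spellcheck["suggestions"]
    # With json.nl=flat, names and values alternate in a single list.
    return {flat[i]: flat[i + 1]["suggestion"] for i in range(0, len(flat), 2)}


# Trimmed copy of the "spellcheck" object shown in the question.
sample = {
    "suggestions": [
        "license",
        {"numFound": 3, "startOffset": 0, "endOffset": 7,
         "suggestion": ["licensed", "licensee", "licenser"]},
        "vehicle",
        {"numFound": 3, "startOffset": 8, "endOffset": 15,
         "suggestion": ["chicle", "pedicle", "vehiculate"]},
    ],
    "correctlySpelled": False,
}

by_term = pair_suggestions(sample)
for term, words in by_term.items():
    print(term, "->", ", ".join(words))
print("correctlySpelled:", sample["correctlySpelled"])
```

Reading the response this way makes it clear that both query terms received suggestions, which is what drives the unwanted collations.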

Does anyone know why these false positives are produced?

URL-encoded query:

http://192.168.33.10:8983/solr/drupal/spell?q=license+vehicle&spellcheck=true&spellcheck.accuracy=0.7&spellcheck.collate=true&defType=edismax&json.nl=flat&omitHeader=true&qf=ts_title^1&fl=*,score&start=0&fq=index_id:"new_index"&fq=hash:"96z3wm"&rows=10&wt=json&stopwords=true&lowercaseOperators=true

Query:

q = license+vehicle
spellcheck = true
spellcheck.accuracy = 0.7
spellcheck.collate = true
defType = edismax
json.nl = flat
omitHeader = true
qf = ts_title^1
fl = *,score
start = 0
fq = index_id:"new_index"
fq = hash:"96z3wm"
rows = 10
wt = json
stopwords = true
lowercaseOperators = true
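For anyone who wants to reproduce the request from a script, here is a rough sketch that builds the same query string with Python's standard library. It only assembles the URL and does not contact Solr. The host and core come from the URL in the question; note that the `+` in `q` is just the URL encoding of a space.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Handler URL taken from the question.
BASE = "http://192.168.33.10:8983/solr/drupal/spell"

# A list of pairs (rather than a dict) so that fq can appear twice.
params = [
    ("q", "license vehicle"),
    ("spellcheck", "true"),
    ("spellcheck.accuracy", "0.7"),
    ("spellcheck.collate", "true"),
    ("defType", "edismax"),
    ("json.nl", "flat"),
    ("omitHeader", "true"),
    ("qf", "ts_title^1"),
    ("fl", "*,score"),
    ("start", "0"),
    ("fq", 'index_id:"new_index"'),
    ("fq", 'hash:"96z3wm"'),
    ("rows", "10"),
    ("wt", "json"),
    ("stopwords", "true"),
    ("lowercaseOperators", "true"),
]

url = BASE + "?" + urlencode(params)
print(url)

# Round-trip check: decode the query string back into parameters.
decoded = parse_qs(urlsplit(url).query)
print(decoded["fq"])
```

Building the parameters in one place also makes it easier to toggle a single spellcheck option between test runs.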

Relevant part of schema.xml:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="multiterm">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->

    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Relevant part of solrconfig.xml:

  <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="df">spell</str> <!--The default field for spell checking. -->
      <str name="spellcheck.dictionary">file</str> <!--default or file or jarowinkler as mentioned above. -->
      <str name="spellcheck">on</str>
      <str name="spellcheck.extendedResults">true</str>
      <str name="spellcheck.count">3</str>
      <str name="spellcheck.maxResultsForSuggest">5</str>
      <str name="spellcheck.collate">false</str>
      <str name="spellcheck.collateExtendedResults">false</str>
      <str name="spellcheck.maxCollationTries">10</str>
      <str name="spellcheck.maxCollations">5</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textSpell</str>

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="spellcheckIndexDir">spellchecker</str>
      <str name="buildOnOptimize">true</str>
    </lst>

    <lst name="spellchecker">
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="name">file</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">spellcheckerFile</str>
    </lst>

  </searchComponent>

【Comments】:

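  • Have you tried using a non-stemmed field as the source for spelling corrections? The terms you think are in the index may not actually be there. Also, from the old wiki: if "spellcheck.maxResultsForSuggest" is not specified, the default behavior is to generate suggestions and report "correctlySpelled" as "false" if at least one term is not in the index (document frequency == 0), regardless of the number of results returned.
  • @MatsLindh - spellcheck.maxResultsForSuggest is referenced in the solr.SearchHandler. As mentioned above, results come back fine, so these terms are being indexed. It's just that the spellchecker's value ignores the correct spelling.
  • So is the number of hits for the token greater than maxResultsForSuggest?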
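The rule quoted from the old wiki in the comments above can be written down as a toy model. This is only a paraphrase of that wording plus the documented meaning of maxResultsForSuggest, not Solr's actual implementation. The function name and the document frequencies below are hypothetical values chosen for illustration.

```python
def reported_correctly_spelled(doc_freqs, hits, max_results_for_suggest=None):
    """Toy model of the rule quoted in the comments, NOT Solr's real code.

    doc_freqs: term -> document frequency in the spellcheck source.
    hits: number of documents the query returned.
    """
    # Assumption: when the limit is set and the query returns more hits
    # than the limit, no suggestions are generated and the query is not
    # flagged as misspelled.
    if max_results_for_suggest is not None and hits > max_results_for_suggest:
        return True
    # Otherwise a single term with document frequency 0 is enough to
    # report correctlySpelled=false, however many results came back.
    return all(freq > 0 for freq in doc_freqs.values())


# Hypothetical frequencies: "license" is missing from the spellcheck source.
freqs = {"license": 0, "vehicle": 12}
print(reported_correctly_spelled(freqs, hits=3))
print(reported_correctly_spelled(freqs, hits=10, max_results_for_suggest=5))
```

Under this model, a stemmed field or a dictionary file that lacks the exact surface form of a word would be enough to flag a correctly spelled query, which is the scenario the first comment raises.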

Tags: json xml solr solr5


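【Solution 1】:

This is something I have run into with Solr as well, and it happens unpredictably. The way I avoid it is a spelling pre-check using the edismax parameter "mm", set to 100. Try setting mm=100 in your edismax query and see whether that works. Then build a flow in which you first strictly spell-check only the words, and then pass them to the search query handler. When you specify mm=100, don't wrap your phrase in double quotes of any kind; just pass it as is. Let me know if this helps :)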
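A rough sketch of the two-step flow described above might look like the following. It only builds the two request URLs and does not call Solr. The function names are hypothetical, the `/select` handler path is an assumption (only `/spell` appears in the question), and the parameter set is a guess at the minimum needed, since the answer does not list one.

```python
from urllib.parse import urlencode

# Core URL taken from the question.
SOLR_CORE = "http://192.168.33.10:8983/solr/drupal"


def spell_precheck_url(phrase, mm="100"):
    """Step 1: spell-check the bare words with edismax and mm set."""
    # The answer advises passing the phrase without any double quotes.
    bare = phrase.replace('"', "")
    params = {
        "q": bare,
        "defType": "edismax",
        "mm": mm,
        "spellcheck": "true",
        "wt": "json",
    }
    return SOLR_CORE + "/spell?" + urlencode(params)


def search_url(phrase):
    """Step 2: send the (possibly corrected) phrase to the search handler."""
    # "/select" is an assumed handler name, not taken from the question.
    params = {"q": phrase, "defType": "edismax", "wt": "json"}
    return SOLR_CORE + "/select?" + urlencode(params)


precheck = spell_precheck_url('"license vehicle"')
print(precheck)
print(search_url("license vehicle"))
```

How the spell-check response decides what gets passed to the second step is left open here, because the answer does not specify it.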

【Comments】:

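  • I've set the mm value to 100, but I get no results, just the same suggestions. Could you elaborate? It would be much appreciated.
  • What other parameters are you sending along with mm?
  • Just mm=100 with the same query. I'd like to understand whether I need to reformat my request to support edismax. Also, what else do I need to send?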