Solr 8.8 - trouble matching partial words with eDisMax and EdgeNGramFilter

【Question title】: Solr 8.8 - trouble matching partial words with eDisMax and EdgeNGramFilter
【Posted】: 2021-04-06 11:02:26
【Question】:

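I'm new to Solr and am trying to provide partial-word matching with Solr 8.8.1, but partial words return no results. I've combed through blogs with no luck resolving this.

For example, a document's text contains the word longer. Index analysis yields lon, long, longe, longer. If I query for longer using alltext_en:longer, I get a match. But if I query for (say) longe using alltext_en:longe, I get no match. explainOther returns 0.0 = No matching clauses.

I seem to be missing something obvious, since this isn't a complicated phrase query.

Apologies in advance if I've left out any needed detail. Tell me what else you need to know and I'll update the question.

Here are the relevant field specifications from my managed-schema: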

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>

  <field name="alltext_en" type="text_en" multiValued="true" indexed="true" stored="true"/>
  <copyField source="*_txt_en" dest="alltext_en"/>

Here is the relevant part of solrconfig.xml:

  <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>

       <!-- Query settings -->
       <str name="defType">edismax</str>
       <str name="q">*:*</str>
       <str name="q.alt">*:*</str>
       <str name="rows">50</str>
       <str name="fl">*,score,[explain]</str>
       <str name="ps">10</str>

       <!-- Highlighting defaults -->
       <str name="hl">on</str>
       <str name="hl.fl">_text_</str>
       <str name="hl.preserveMulti">true</str>
       <str name="hl.encoder">html</str>
       <str name="hl.simple.pre">&lt;span class="artica-snippet"&gt;</str>
       <str name="hl.simple.post">&lt;/span&gt;</str>

       <!-- Spell checking defaults -->
       <str name="spellcheck">on</str>
       <str name="spellcheck.extendedResults">false</str>
       <str name="spellcheck.count">5</str>
       <str name="spellcheck.alternativeTermCount">2</str>
       <str name="spellcheck.maxResultsForSuggest">5</str>
       <str name="spellcheck.collate">true</str>
       <str name="spellcheck.collateExtendedResults">true</str>
       <str name="spellcheck.maxCollationTries">5</str>
       <str name="spellcheck.maxCollations">3</str>
     </lst>

     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
  </requestHandler>

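【Comments】:

The stemming filter will modify tokens in ways you can't predict, and since that only happens to the token you're trying to match against the ngrammed tokens at query time, the token might not be what you expect. If you're generating ngrams, you should usually remove stemming filters. I'd also remove the possessive filter. (Also, a small note: try to avoid using * when formatting text, since it's hard to know whether you actually used it in the query and the formatting is just wrong. Use backticks instead to indicate that the text is a code keyword/query.)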

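Tags: solr

【Solution 1】:

The stemming filter will modify tokens in ways you can't predict, and since that only happens to the token you're trying to match against the ngrammed tokens at query time, the token might not be what you expect. If you're generating ngrams, you should usually remove stemming filters. I'd also remove the possessive filter. (Also, a small note: try to avoid using * when formatting text, since it's hard to know whether you actually used it in the query and the formatting is just wrong. Use backticks instead to indicate that the text is a code keyword/query.) – MatsLindh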

That answered it: I removed the stemmer from the index step and everything works. Awesome, thanks @MatsLindh!
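
For readers who want to see what that change might look like, here is a minimal sketch of the text_en field type with the stemmer dropped from the index analyzer. It is reconstructed from the schema in the question and the reply above, not copied from the asker's final configuration, so treat the exact filter list as an assumption.

```xml
<!-- Sketch only: the text_en type from the question, with the index-time stemmer removed -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- PorterStemFilterFactory removed: edge n-grams are now built from the unstemmed token -->
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
  </analyzer>
  <analyzer type="query">
    <!-- Left as in the question; the reply only mentions changing the index step -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this changes the index-time analysis chain, existing documents have to be reindexed before the new tokens show up in the index. The comment also suggests dropping the possessive filter and, more generally, stemming whenever ngrams are generated. This sketch keeps both on the query side only because the reply describes changing the index step alone.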
