Filters' effect on search results in Solr

[Question title]: Filters' effect on search results in Solr
[Posted]: 2011-09-25 00:27:51
[Question]:

When I query for "elegant" in Solr, I also get results for "elegance".

I use these filters for index analysis:

WhitespaceTokenizerFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
SynonymFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory
ReversedWildcardFilterFactory

For query analysis:

WhitespaceTokenizerFactory
SynonymFilterFactory
StopFilterFactory
WordDelimiterFilterFactory
LowerCaseFilterFactory
EnglishPorterFilterFactory
RemoveDuplicatesTokenFilterFactory

I would like to know which filter is affecting my search results.

[Question comments]:

Tags: indexing solr query-analyzer

[Solution 1]:

EnglishPorterFilterFactory

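That is the short answer ;)

More information:

English Porter refers to the English Porter stemming algorithm. According to the stemmer (a heuristic word-root generator), "elegant" and "elegance" share the same stem.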

You can verify this online, e.g. Here. Basically you will see that "eleg ant" and "eleg ance" both reduce to the same stem > eleg.
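To make the collision concrete, here is a minimal sketch of only step 4 of the classic Porter algorithm (strip a suffix when the remaining stem has measure greater than 1). It is not the Snowball `EnglishStemmer` that Solr actually calls: every other step and the special "ion" rule are left out, and the names `PorterStep4Sketch`, `measure` and `step4` are made up for illustration.

```java
import java.util.List;

public class PorterStep4Sketch {
    // Step-4 suffixes of the classic Porter algorithm, longest first ("ion" omitted).
    private static final List<String> SUFFIXES = List.of(
            "ement", "ance", "ence", "able", "ible", "ment",
            "ant", "ent", "ism", "ate", "iti", "ous", "ive", "ize",
            "al", "er", "ic", "ou");

    // 'y' counts as a consonant at the start of a word or after a vowel.
    static boolean isConsonant(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return false;
        if (c == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // Porter's "measure": the number of vowel-run -> consonant-run transitions.
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean vowel = !isConsonant(stem, i);
            if (prevVowel && !vowel) m++;
            prevVowel = vowel;
        }
        return m;
    }

    // Take the longest matching suffix; strip it only if the stem's measure is > 1.
    static String step4(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix)) {
                String stem = word.substring(0, word.length() - suffix.length());
                return measure(stem) > 1 ? stem : word;
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(step4("elegant"));  // prints "eleg"
        System.out.println(step4("elegance")); // prints "eleg"
    }
}
```

Both words come out as `eleg`, so after stemming they are the same term at index time and at query time.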

From the Solr source:

       public void inform(ResourceLoader loader) {
            String wordFiles = args.get(PROTECTED_TOKENS);
            if (wordFiles != null) {
                try {

This is exactly where the protwords file comes into play:

                    File protectedWordFiles = new File(wordFiles);
                    if (protectedWordFiles.exists()) {
                        List<String> wlist = loader.getLines(wordFiles);
                        //This cast is safe in Lucene
                        protectedWords = new CharArraySet(wlist, false);//No need to go through StopFilter as before, since it just uses a List internally
                    } else {
                        List<String> files = StrUtils
                                .splitFileNames(wordFiles);
                        for (String file : files) {
                            List<String> wlist = loader.getLines(file
                                    .trim());
                            if (protectedWords == null)
                                protectedWords = new CharArraySet(wlist,
                                        false);
                            else
                                protectedWords.addAll(wlist);
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        }

And this is the part that affects stemming. Here you can see the call into the Snowball library:

        public EnglishPorterFilter create(TokenStream input) {
            return new EnglishPorterFilter(input, protectedWords);
        }

    }

    /**
     * English Porter2 filter that doesn't use reflection to
     * adapt lucene to the snowball stemmer code.
     */
    @Deprecated
    class EnglishPorterFilter extends SnowballPorterFilter {
        public EnglishPorterFilter(TokenStream source,
                CharArraySet protWords) {
            super (source, new org.tartarus.snowball.ext.EnglishStemmer(),
                    protWords);
        }
    }
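As a rough sketch of what the `protWords` argument above is for: a token found in the protected set is passed through untouched, and everything else goes to the stemmer. This is an illustration only, not Solr code. `ProtectedWordsSketch`, `filter` and `toyStem` are hypothetical names, and the two hard-coded suffixes merely stand in for the real Snowball stemmer.

```java
import java.util.Set;
import java.util.function.UnaryOperator;

public class ProtectedWordsSketch {
    private final Set<String> protectedWords;
    private final UnaryOperator<String> stemmer;

    public ProtectedWordsSketch(Set<String> protectedWords, UnaryOperator<String> stemmer) {
        this.protectedWords = protectedWords;
        this.stemmer = stemmer;
    }

    // Protected tokens bypass the stemmer; all other tokens are stemmed.
    public String filter(String token) {
        return protectedWords.contains(token) ? token : stemmer.apply(token);
    }

    // Toy stand-in for a real stemmer (assumption): strips two suffixes only.
    static String toyStem(String token) {
        for (String suffix : new String[] {"ance", "ant"}) {
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    public static void main(String[] args) {
        ProtectedWordsSketch empty =
                new ProtectedWordsSketch(Set.of(), ProtectedWordsSketch::toyStem);
        ProtectedWordsSketch guarded =
                new ProtectedWordsSketch(Set.of("elegance"), ProtectedWordsSketch::toyStem);

        System.out.println(empty.filter("elegance"));   // prints "eleg"
        System.out.println(guarded.filter("elegance")); // prints "elegance"
        System.out.println(guarded.filter("elegant"));  // prints "eleg"
    }
}
```

With an empty protected set nothing is exempt, so every token is still stemmed.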

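[Comments]:

@fyr: Yes, I used the Solr admin page to see the effect :), but I am using EnglishPorterFilter with a protwords.txt that has nothing in it. So how does it do that?
What is protwords.txt used for?
No, it only uses the protwords for stemming results you want to fix. It is a heuristic, so it will make mistakes. The English Porter algorithm uses the Snowball library.
I use it as: so what is protwords.txt here?
Have a look at my edit. Prot words are words that are not stemmed: "protected words".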