Lucene 6.1 Custom Tokenizer and Analyzer

[Question title]: Lucene 6.1 Custom Tokenizer and Analyzer
[Posted]: 2016-07-30 21:20:07
[Question description]:

I am looking for help with the Lucene 6.1 API.

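I am trying to extend Lucene's Tokenizer and Analyzer, but I don't understand all of the guides. In every tutorial, the user's Tokenizer overrides incrementToken and takes a Reader in its constructor, and the user's Analyzer overrides the createComponents method. But in Lucene 6 createComponents has only one String parameter, so how can I pass a Reader to my Analyzer?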

My code:

public class ChemTokenizer extends Tokenizer{
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
    protected String stringToTokenize;
    protected int position = 0;
    protected List<int[]> chemicals = new ArrayList<>();

    @Override
    public boolean incrementToken() throws IOException {
        // Clear anything that is already saved in this.charTermAttribute
        this.charTermAttribute.setEmpty();

        // Get the position of the next symbol
        int nextIndex = -1;
        Pattern p = Pattern.compile("[^A-zА-я]");
        Matcher m = p.matcher(stringToTokenize.substring(position));
        nextIndex = m.start();
        // Did we lose chemicals?
        for (int[] pair: chemicals) {
            if (pair[0] < nextIndex && pair[1] > nextIndex) {
                //We are in the chemical name
                if (position == pair[0]) {
                    nextIndex = pair[1];
                }
                else {
                    nextIndex = pair[0];
                }
            }
        }
        // Next separator was found
        if (nextIndex != -1) {
            String nextToken = stringToTokenize.substring(position, nextIndex);
            charTermAttribute.append(nextToken);
            position = nextIndex + 1;
            return true;
        }
        // Last part of text
        else if (position < stringToTokenize.length()) {
            String nextToken = stringToTokenize.substring(position);
            charTermAttribute.append(nextToken);
            position = stringToTokenize.length();
            return true;
        }
        else {
            return false;
        }
    }
    public ChemTokenizer(Reader reader,List<String> additionalKeywords) {
        int numChars;
        char[] buffer = new char[1024];
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    reader.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        stringToTokenize = stringBuilder.toString();
        //Checking for keywords
        //Doesnt work properly if text has chemical synonyms
        for (String keyword: additionalKeywords) {
            int[] tmp = new int[2];
            //Start of keyword
            tmp[0] = stringToTokenize.indexOf(keyword);
            tmp[1] = tmp[0] + keyword.length() - 1;
            chemicals.add(tmp);
        }
    }

    /* Reset the stored position for this object when reset() is called.
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
        chemicals = new ArrayList<>();

    }
}

And the code of the Analyzer:

public class ChemAnalyzer extends Analyzer{

    List<String> additionalKeywords;
    public ChemAnalyzer(List<String> ad) {
        additionalKeywords = ad;
    }
    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords);
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }

}

The problem is that this code does not work with Lucene 6.

[Comments]:

What do you mean by "it does not work with Lucene 6"? A compile error? A bug? Unwanted behavior?
In Lucene 6, createComponents is declared differently.

Tags: java lucene tokenize

[Solution 1]:

This is what I found through a GitHub search. I guess you have to create the new tokenizer without a Reader.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new WhitespaceTokenizer());
}
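
To make "a tokenizer without a Reader" more concrete: as far as I know, a Lucene 6 Tokenizer receives its Reader from the framework through setReader and reads it from the inherited `input` field once reset() has run, so the Reader-draining loop from the question's constructor can move out of the constructor. Below is a minimal stdlib-only sketch of that logic, so it compiles without Lucene on the classpath. The names `ReaderText`, `readAll` and `tokens` are hypothetical, the "separator = any non-letter" rule is an assumption, and the Lucene wiring is shown only in comments.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene. In a Lucene 6 Tokenizer subclass
// the calls would look roughly like this (assumption, not compiled here):
//
//   @Override public void reset() throws IOException {
//       super.reset();                      // makes the inherited `input` Reader usable
//       text = ReaderText.readAll(input);   // no Reader in the constructor
//       position = 0;
//   }
final class ReaderText {
    // Assumption: any run of non-letter characters separates tokens.
    // \p{L} matches letters of any script, including Latin and Cyrillic.
    private static final Pattern SEPARATOR = Pattern.compile("[^\\p{L}]+");

    private ReaderText() {}

    /** Drains the reader into a String, like the loop in the question's constructor. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int numChars;
        while ((numChars = reader.read(buffer, 0, buffer.length)) != -1) {
            sb.append(buffer, 0, numChars);
        }
        return sb.toString();
    }

    /** Splits the text at separators and skips empty tokens. */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(text);
        int position = 0;
        // find() has to succeed before start() and end() may be called.
        while (m.find()) {
            if (m.start() > position) {
                result.add(text.substring(position, m.start()));
            }
            position = m.end();
        }
        // Last part of the text after the final separator.
        if (position < text.length()) {
            result.add(text.substring(position));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(new StringReader("benzene and toluene, mixed"));
        System.out.println(tokens(text));
    }
}
```

With this split, the tokenizer's constructor would take only the keyword list, and the Analyzer's createComponents(String fieldName) could build it as `new ChemTokenizer(additionalKeywords)` without touching a Reader. The keyword handling from the question is left out of the sketch on purpose.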

[Comments]: