【Question title】: Lucene 6.1 Custom Tokenizer and Analyzer
【Posted】: 2016-07-30 21:20:07
【Question】:

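I'm looking for help with the Lucene 6.1 API.

I'm trying to extend Lucene's Tokenizer and Analyzer, but I don't understand all of the guidelines. In every tutorial, the custom Tokenizer overrides incrementToken and takes a Reader in its constructor, and the custom Analyzer overrides the createComponents method. But in Lucene 6.1, createComponents has only one String parameter, so how can I pass a Reader to my Analyzer?

My code: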

public class ChemTokenizer extends Tokenizer{
    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
    protected String stringToTokenize;
    protected int position = 0;
    protected List<int[]> chemicals = new ArrayList<>();

    @Override
    public boolean incrementToken() throws IOException {
        // Clear anything that is already saved in this.charTermAttribute
        this.charTermAttribute.setEmpty();

        // Get the position of the next symbol
        int nextIndex = -1;
        Pattern p = Pattern.compile("[^A-zА-я]");
        Matcher m = p.matcher(stringToTokenize.substring(position));
        nextIndex = m.start();
        // Did we lose chemicals?
        for (int[] pair: chemicals) {
            if (pair[0] < nextIndex && pair[1] > nextIndex) {
                //We are in the chemical name
                if (position == pair[0]) {
                    nextIndex = pair[1];
                }
                else {
                    nextIndex = pair[0];
                }
            }
        }
        // Next separator was found
        if (nextIndex != -1) {
            String nextToken = stringToTokenize.substring(position, nextIndex);
            charTermAttribute.append(nextToken);
            position = nextIndex + 1;
            return true;
        }
        // Last part of text
        else if (position < stringToTokenize.length()) {
            String nextToken = stringToTokenize.substring(position);
            charTermAttribute.append(nextToken);
            position = stringToTokenize.length();
            return true;
        }
        else {
            return false;
        }
    }
    public ChemTokenizer(Reader reader,List<String> additionalKeywords) {
        int numChars;
        char[] buffer = new char[1024];
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    reader.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        stringToTokenize = stringBuilder.toString();
        //Checking for keywords
        //Doesn't work properly if the text has chemical synonyms
        for (String keyword: additionalKeywords) {
            int[] tmp = new int[2];
            //Start of keyword
            tmp[0] = stringToTokenize.indexOf(keyword);
            tmp[1] = tmp[0] + keyword.length() - 1;
            chemicals.add(tmp);
        }
    }

    /* Reset the stored position for this object when reset() is called.
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
        chemicals = new ArrayList<>();

    }
}

And the code for the Analyzer:

public class ChemAnalyzer extends Analyzer{

    List<String> additionalKeywords;
    public ChemAnalyzer(List<String> ad) {
        additionalKeywords = ad;
    }
    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new ChemTokenizer(reader,additionalKeywords);
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }

}

The problem is that this code doesn't work with Lucene 6.

【Question comments】:

  • What do you mean by "it doesn't work with Lucene 6"? A compile error? A bug? Undesired behavior?
  • In Lucene 6, createComponents has a different signature.

Tags: java lucene tokenize


【Answer 1】:

Here is what I found in a GitHub search. I guess you have to create the new tokenizer without a Reader.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new WhitespaceTokenizer());
}
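
To make that concrete for a custom tokenizer, here is a minimal sketch, assuming Lucene 6.x with lucene-core and lucene-analyzers-common on the classpath. The names ChemAnalyzerSketch, LetterRunTokenizer and analyze are hypothetical, and the letter-run splitting is a simplified stand-in for the chemical keyword logic in the question. The idea is that the tokenizer takes no Reader in its constructor: the Analyzer hands it one through setReader, and the tokenizer reads from the inherited protected field input once reset() has been called.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter; // package location as of Lucene 6.x
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class ChemAnalyzerSketch {

    // Hypothetical, simplified tokenizer: emits maximal runs of letters.
    // Declared final because Lucene asserts that TokenStream subclasses
    // (or their incrementToken methods) are final.
    static final class LetterRunTokenizer extends Tokenizer {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
        private String text = "";
        private int position = 0;

        // No Reader parameter here: the Analyzer calls setReader(...) later.
        LetterRunTokenizer() {
            super();
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            // After super.reset(), the inherited field `input` is the current Reader.
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = input.read(buffer, 0, buffer.length)) != -1) {
                sb.append(buffer, 0, n);
            }
            text = sb.toString();
            position = 0;
        }

        @Override
        public boolean incrementToken() {
            clearAttributes();
            // Skip separators (anything that is not a letter).
            while (position < text.length() && !Character.isLetter(text.charAt(position))) {
                position++;
            }
            if (position >= text.length()) {
                return false;
            }
            int start = position;
            while (position < text.length() && Character.isLetter(text.charAt(position))) {
                position++;
            }
            termAtt.append(text, start, position);
            offsetAtt.setOffset(correctOffset(start), correctOffset(position));
            return true;
        }

        @Override
        public void end() throws IOException {
            super.end();
            int finalOffset = correctOffset(text.length());
            offsetAtt.setOffset(finalOffset, finalOffset);
        }
    }

    static final class ChemAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new LetterRunTokenizer();
            TokenStream result = new LowerCaseFilter(source);
            return new TokenStreamComponents(source, result);
        }
    }

    // Runs the analyzer over one string and collects the resulting terms.
    public static List<String> analyze(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Analyzer analyzer = new ChemAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(analyze("Sodium Chloride, water"));
    }
}
```

Reading the text in reset() rather than in the constructor matters because the Analyzer may reuse the same tokenizer instance and give it a fresh Reader for each new field value. Extra state such as the keyword list from the question could still be passed through the tokenizer's constructor, as long as the Reader is not.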

【Comments】:
