【Question Title】: How to correctly implement a delegating tokenizer in Lucene 4.x?
【Posted】: 2014-11-18 08:49:36
【Question】:

The naive approach suggested by the documentation (under the section on creating delegates) does not work as expected, because it leads to a TokenStream contract violation on the delegate Tokenizer:

private static class TokenizerWrapper extends Tokenizer {
  public TokenizerWrapper(Reader _input) {
    super(_input);
    delegate = new WhitespaceTokenizer(input);
  }

  @Override
  public void reset() throws IOException {
    logger.info("TokenizerWrapper.reset()");
    super.reset();
    delegate.setReader(input);
    delegate.reset();
  }

  @Override
  public final boolean incrementToken() throws IOException {
    logger.info("TokenizerWrapper.incrementToken()");
    return delegate.incrementToken();
  }

  private final WhitespaceTokenizer delegate;
}

gives me the following log:

14:30:12.885 [main] INFO  test.GapTest - TokenizerWrapper.reset()
14:30:12.886 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.897 [main] INFO  test.GapTest - TokenizerWrapper.reset()
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
    at test.GapTest$TestTokenizer.reset(GapTest.java:152)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:599)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:67)

Overriding the close() method like this:

  @Override
  public void close() throws IOException {
    logger.info("TokenizerWrapper.close()");
    super.close();
    logger.info("TokenizerWrapper.delegate.close()");
    delegate.close();
    // delegate.setReader(input);
  }

doesn't help either; it just fails with a different error:

15:36:49.561 [main] INFO  test.GapTest - setting field "text" to "some text"
15:36:49.569 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.605 [main] INFO  test.GapTest - createComponents()
15:36:49.633 [main] INFO  test.GapTest - TokenizerWrapper(_input)
15:36:49.638 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.639 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
15:36:49.648 [main] INFO  test.GapTest - setting field "text" to "some text 1"
15:36:49.648 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
Exception in thread "main" java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'address'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:617)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:72)

That is,

  1. it successfully processed the first document ("some text" in the "text" field),
  2. then started processing the second document ("some text 1"),
  3. [apparently] processed its first token successfully (the word "some"; I checked in the debugger),
  4. and then broke due to inconsistent internal state: in DefaultIndexingChain.PerField.invert(IndexableField field, boolean first), invertState.posIncrAttribute.getPositionIncrement() returned 0, whereas its "normal" behavior is to return 1.

Of course, I could handle this particular error with further wrapping and workarounds, but I am probably going in the wrong direction with what looks like a simple task. Please advise.
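For what it's worth, the first exception is easy to model: in Lucene 4.x, Tokenizer.setReader() only accepts a new reader while the stream is closed, and the naive reset() hands the delegate a new reader without ever closing it. The toy plain-Java class below (a simulation for illustration only, not Lucene's actual class) reproduces that rule:

```java
// Toy model (simulation only, not Lucene's real class) of the Tokenizer
// lifecycle rule enforced at Tokenizer.java:90 in the stack trace above:
// setReader() is only legal while the stream is closed.
class LifecycleModel {
    private boolean closed = true;      // a fresh stream counts as "closed"
    private String pendingInput;

    void setReader(String input) {
        if (!closed) {
            throw new IllegalStateException(
                    "TokenStream contract violation: close() call missing");
        }
        pendingInput = input;
    }

    void reset() { closed = false; }    // consumption starts
    void close() { closed = true; }     // must precede the next setReader()
}
```

The wrapper's reset() calls delegate.setReader(input) for every document, but nothing ever calls delegate.close() in between, so the check fires on the second document, exactly as in the log above.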

【Comments】:

    Tags: java lucene delegates tokenize


    【Solution 1】:

    I have created an abstract class in my project which solves exactly this problem. The key parts are, of course, the incrementToken, reset, close and end methods. Feel free to use these bits or the whole thing.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.Iterator;
    
    import com.google.common.collect.Iterators;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    
    import static vyre.util.search.LuceneVersion.VERSION_IN_USE;
    
    /**
     * Allows easy manipulation of a {@link ClassicTokenizer} by delegating calls to it while hiding all implementation details.
     *
     * @author Mindaugas Žakšauskas
     */
    public abstract class ClassicTokenizerDelegate extends Tokenizer {
    
        private final ClassicTokenizer classicTokenizer;
    
        private final CharTermAttribute termAtt;
    
        private final TypeAttribute typeAtt;
    
        /**
         * Internal buffer of tokens if any of standard tokens was split into many.
         */
        private Iterator<String> pendingTokens = Iterators.emptyIterator();
    
        protected ClassicTokenizerDelegate(Reader input) {
            super(input);
            this.classicTokenizer = new ClassicTokenizer(VERSION_IN_USE, input);
            termAtt = addAttribute(CharTermAttribute.class);
            typeAtt = addAttribute(TypeAttribute.class);
        }
    
        /**
         * Is called during tokenization for each token produced by {@link ClassicTokenizer}. Subclasses can call {@link #setTerm(String)} to override
         * current token or {@link #setTerms(Iterator)} if current token needs to be split into more than one token.
         *
         * @return true if a next token exists, false otherwise.
         * @see #getTerm()
         * @see #getType()
         * @see #setTerm(String)
         * @see #setTerms(Iterator)
         */
        protected abstract boolean onNextToken();
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve current term.
         *
         * @return current term.
         * @see #getType()
         * @see #setTerm(String)
         * @see #setTerms(Iterator)
         * @see #onNextToken()
         */
        protected String getTerm() {
            return new String(termAtt.buffer(), 0, termAtt.length());
        }
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve type of current term.
         *
         * @return type of current term.
         * @see #getTerm()
         * @see #setTerm(String)
         * @see #setTerms(Iterator)
         * @see #onNextToken()
         */
        protected String getType() {
            return typeAtt.type();
        }
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to override current term.
         *
         * @param term the term to override with.
         * @see #getTerm()
         * @see #getType()
         * @see #setTerms(Iterator) setTerms(Iterator) - if you want to override current term with more than one term
         * @see #onNextToken()
         */
        protected void setTerm(String term) {
            termAtt.copyBuffer(term.toCharArray(), 0, term.length());
        }
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to override current term with more than one term.
         *
         * @param terms the terms to override with.
         * @see #getTerm()
         * @see #getType()
         * @see #setTerm(String)
         * @see #onNextToken()
         */
        protected void setTerms(Iterator<String> terms) {
            setTerm(terms.next());
            pendingTokens = terms;
        }
    
        @Override
        public final boolean incrementToken() throws IOException {
            if (pendingTokens.hasNext()) {
                setTerm(pendingTokens.next());
                return true;
            }
    
            clearAttributes();
            if (!classicTokenizer.incrementToken()) {
                return false;
            }
    
            typeAtt.setType(classicTokenizer.getAttribute(TypeAttribute.class).type());        // copy type attribute from classic tokenizer attribute
    
            CharTermAttribute stTermAtt = classicTokenizer.getAttribute(CharTermAttribute.class);
            setTerm(new String(stTermAtt.buffer(), 0, stTermAtt.length()));
    
            return onNextToken();
        }
    
        @Override
        public void close() throws IOException {
            super.close();
            if (input != null) {
                input.close();
            }
            classicTokenizer.close();
        }
    
        @Override
        public void end() throws IOException {
            super.end();
            classicTokenizer.end();
        }
    
        @Override
        public void reset() throws IOException {
            super.reset();
            this.classicTokenizer.setReader(input);        // important! input has to be carried over to delegate because of poor design of Lucene
            classicTokenizer.reset();
        }
    }
    

    【Comments】:

    • Thanks for providing the code snippet! It might be useful for my project, thanks for sharing!
    【Solution 2】:

    I think it is worth stating explicitly:

    TokenizerWrapper and delegate do not share the attribute set. So even though indexing of the first document appears to work, it does not: nothing actually reaches the indexer. For the delegation to be meaningful, the delegate's attributes need to be mirrored (fully or partially) in TokenizerWrapper, just like @mindas does in setTerm().

    Or maybe I am wrong and there is some "magic machinery" that reuses delegate's attributes as TokenizerWrapper's attributes.
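    The attribute-sharing point can be demonstrated without Lucene at all. In the plain-Java sketch below (class names like FakeTokenizer are made up for illustration), the wrapper and the delegate each own a separate term buffer, just as two unrelated AttributeSources do: forwarding incrementToken() alone advances the delegate but leaves the wrapper's own term empty, while copying the term after each advance, as Solution 1 does in setTerm(), fixes it:

```java
import java.util.Arrays;
import java.util.Iterator;

// Simulates a tokenizer whose current term is exposed through the
// instance's OWN attribute, like a Lucene AttributeSource.
class FakeTokenizer {
    private final Iterator<String> tokens;
    final StringBuilder termAtt = new StringBuilder(); // per-instance attribute

    FakeTokenizer(String... tokens) { this.tokens = Arrays.asList(tokens).iterator(); }

    boolean incrementToken() {
        if (!tokens.hasNext()) return false;
        termAtt.setLength(0);
        termAtt.append(tokens.next());
        return true;
    }

    String term() { return termAtt.toString(); }
}

// The naive wrapper from the question: forwards incrementToken(), but a
// consumer reads the WRAPPER's (never-populated) term attribute.
class NaiveWrapper extends FakeTokenizer {
    final FakeTokenizer delegate;

    NaiveWrapper(FakeTokenizer delegate) { this.delegate = delegate; }

    @Override boolean incrementToken() {
        return delegate.incrementToken();   // advances the delegate only
    }
}

// Mirrors the delegate's term into the wrapper's own attribute after
// each advance, analogous to setTerm() in Solution 1.
class MirroringWrapper extends NaiveWrapper {
    MirroringWrapper(FakeTokenizer delegate) { super(delegate); }

    @Override boolean incrementToken() {
        if (!delegate.incrementToken()) return false;
        termAtt.setLength(0);
        termAtt.append(delegate.term());    // copy the attribute over
        return true;
    }
}
```

    With the naive wrapper every incrementToken() returns true while term() stays empty, which matches "nothing enters the indexer" above; the mirroring wrapper yields the delegate's tokens as expected.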

    【Comments】:
