【Question Title】: How to correctly implement a delegating tokenizer in Lucene 4.x?
【Posted】: 2014-11-18 08:49:36
【Question】:

The naive approach suggested by the documentation (under the section on creating delegates) does not work as expected, because it leads to a TokenStream contract violation on the delegate Tokenizer:

private static class TokenizerWrapper extends Tokenizer {
  public TokenizerWrapper(Reader _input) {
    super(_input);
    delegate = new WhitespaceTokenizer(input);
  }

  @Override
  public void reset() throws IOException {
    logger.info("TokenizerWrapper.reset()");
    super.reset();
    delegate.setReader(input);
    delegate.reset();
  }

  @Override
  public final boolean incrementToken() throws IOException {
    logger.info("TokenizerWrapper.incrementToken()");
    return delegate.incrementToken();
  }

  private final WhitespaceTokenizer delegate;
}

gives me the following log:

14:30:12.885 [main] INFO  test.GapTest - TokenizerWrapper.reset()
14:30:12.886 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.889 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
14:30:12.897 [main] INFO  test.GapTest - TokenizerWrapper.reset()
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
    at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
    at test.GapTest$TestTokenizer.reset(GapTest.java:152)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.analysis.TokenFilter.reset(TokenFilter.java:70)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:599)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:67)

Overriding the close() method like this:

  @Override
  public void close() throws IOException {
    logger.info("TokenizerWrapper.close()");
    super.close();
    logger.info("TokenizerWrapper.delegate.close()");
    delegate.close();
    // delegate.setReader(input);
  }

doesn't help either; it just fails with a different error:

15:36:49.561 [main] INFO  test.GapTest - setting field "text" to "some text"
15:36:49.569 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.605 [main] INFO  test.GapTest - createComponents()
15:36:49.633 [main] INFO  test.GapTest - TokenizerWrapper(_input)
15:36:49.638 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.639 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.640 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.641 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
15:36:49.648 [main] INFO  test.GapTest - setting field "text" to "some text 1"
15:36:49.648 [main] INFO  test.GapTest - Adding created document to the index
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.reset()
15:36:49.648 [main] INFO  test.GapTest - TokenizerWrapper.incrementToken()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.close()
15:36:49.649 [main] INFO  test.GapTest - TokenizerWrapper.delegate.close()
Exception in thread "main" java.lang.IllegalArgumentException: first position increment must be > 0 (got 0) for field 'address'
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:617)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:342)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:301)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:241)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:454)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1231)
    at test.GapTest.main(GapTest.java:72)

That is,

  1. it successfully processed the first document ("some text" in the "text" field),
  2. then started processing the second document ("some text 1"),
  3. [apparently] processed its first token successfully (the word "some"; I checked in the debugger),
  4. and then broke due to inconsistent internal state: in DefaultIndexingChain.PerField.invert(IndexableField field, boolean first), invertState.posIncrAttribute.getPositionIncrement() returned 0, whereas its "normal" behavior is to return 1.

Of course, I could handle this particular error with further wrapping and workarounds, but I am probably going in the wrong direction with what looks like a simple task. Please advise.
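For what it's worth, the first exception is easy to model: in Lucene 4.x, Tokenizer.setReader() only accepts a new reader while the stream is closed, and the naive reset() hands the delegate a new reader without ever closing it. The toy plain-Java class below (a simulation for illustration only, not Lucene's actual class) reproduces that rule:

```java
// Toy model (simulation only, not Lucene's real class) of the Tokenizer
// lifecycle rule enforced at Tokenizer.java:90 in the stack trace above:
// setReader() is only legal while the stream is closed.
class LifecycleModel {
    private boolean closed = true;      // a fresh stream counts as "closed"
    private String pendingInput;

    void setReader(String input) {
        if (!closed) {
            throw new IllegalStateException(
                    "TokenStream contract violation: close() call missing");
        }
        pendingInput = input;
    }

    void reset() { closed = false; }    // consumption starts
    void close() { closed = true; }     // must precede the next setReader()
}
```

The wrapper's reset() calls delegate.setReader(input) for every document, but nothing ever calls delegate.close() in between, so the check fires on the second document, exactly as in the log above.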

【Comments】:

    Tags: java lucene delegates tokenize


    【Solution 1】:

    I have created an abstract class in my project which solves exactly this problem. The key parts are, of course, the incrementToken, reset, close and end methods. Feel free to use these bits or the whole thing.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.Iterator;
    
    import com.google.common.collect.Iterators;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    
    import static vyre.util.search.LuceneVersion.VERSION_IN_USE;
    
    /**
     * Allows easy manipulation of a {@link ClassicTokenizer} by delegating calls to it while hiding all implementation details.
     *
     * @author Mindaugas Žakšauskas
     */
    public abstract class ClassicTokenizerDelegate extends Tokenizer {
    
        private final ClassicTokenizer classicTokenizer;
    
        private final CharTermAttribute termAtt;
    
        private final TypeAttribute typeAtt;
    
        /**
         * Internal buffer of tokens if any of standard tokens was split into many.
         */
        private Iterator<String> pendingTokens = Iterators.emptyIterator();
    
        protected ClassicTokenizerDelegate(Reader input) {
            super(input);
            this.classicTokenizer = new ClassicTokenizer(VERSION_IN_USE, input);
            termAtt = addAttribute(CharTermAttribute.class);
            typeAtt = addAttribute(TypeAttribute.class);
        }
    
        /**
         * Is called during tokenization for each token produced by {@link ClassicTokenizer}. Subclasses can call {@link #setTerm(String)} to override
         * current token or {@link #setTerms(Iterator)} if current token needs to be split into more than one token.
         *
         * @return true if a next token exists, false otherwise.
         * @see #getTerm()
         * @see #getType()
         * @see #setTerm(String)
         * @see #setTerms(Iterator)
         */
        protected abstract boolean onNextToken();
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve current term.
         *
         * @return current term.
         * @see #getType()
         * @see #setTerm(String)
         * @see #setTerms(Iterator)
         * @see #onNextToken()
         */
        protected String getTerm() {
            return new String(termAtt.buffer(), 0, termAtt.length());
        }
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to retrieve type of current term.
         *
         * @return type of current term.
         * @see #getTerm()
         * @see #setTerm(String)
         * @see #setTerms(Iterator)
         * @see #onNextToken()
         */
        protected String getType() {
            return typeAtt.type();
        }
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to override current term.
         *
         * @param term the term to override with.
         * @see #getTerm()
         * @see #getType()
         * @see #setTerms(Iterator) setTerms(Iterator) - if you want to override current term with more than one term
         * @see #onNextToken()
         */
        protected void setTerm(String term) {
            termAtt.copyBuffer(term.toCharArray(), 0, term.length());
        }
    
        /**
         * Subclasses can call this method during execution of {@link #onNextToken()} to override current term with more than one term.
         *
         * @param terms the terms to override with.
         * @see #getTerm()
         * @see #getType()
         * @see #setTerm(String)
         * @see #onNextToken()
         */
        protected void setTerms(Iterator<String> terms) {
            setTerm(terms.next());
            pendingTokens = terms;
        }
    
        @Override
        public final boolean incrementToken() throws IOException {
            if (pendingTokens.hasNext()) {
                setTerm(pendingTokens.next());
                return true;
            }
    
            clearAttributes();
            if (!classicTokenizer.incrementToken()) {
                return false;
            }
    
            typeAtt.setType(classicTokenizer.getAttribute(TypeAttribute.class).type());        // copy type attribute from classic tokenizer attribute
    
            CharTermAttribute stTermAtt = classicTokenizer.getAttribute(CharTermAttribute.class);
            setTerm(new String(stTermAtt.buffer(), 0, stTermAtt.length()));
    
            return onNextToken();
        }
    
        @Override
        public void close() throws IOException {
            super.close();
            if (input != null) {
                input.close();
            }
            classicTokenizer.close();
        }
    
        @Override
        public void end() throws IOException {
            super.end();
            classicTokenizer.end();
        }
    
        @Override
        public void reset() throws IOException {
            super.reset();
            this.classicTokenizer.setReader(input);        // important! input has to be carried over to delegate because of poor design of Lucene
            classicTokenizer.reset();
        }
    }
    

    【Comments】:

    • Thanks for providing the code snippet! It might be useful for my project, thanks for sharing!
    【Solution 2】:

    I think it is worth stating explicitly:

    TokenizerWrapper and delegate do not share the attribute set. So even though indexing of the first document appears to work, it does not: nothing actually reaches the indexer. For the delegation to be meaningful, the delegate's attributes need to be mirrored (fully or partially) in TokenizerWrapper, just like @mindas does in setTerm().

    Or maybe I am wrong and there is some "magic machinery" that reuses delegate's attributes as TokenizerWrapper's attributes.
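    The attribute-sharing point can be demonstrated without Lucene at all. In the plain-Java sketch below (class names like FakeTokenizer are made up for illustration), the wrapper and the delegate each own a separate term buffer, just as two unrelated AttributeSources do: forwarding incrementToken() alone advances the delegate but leaves the wrapper's own term empty, while copying the term after each advance, as Solution 1 does in setTerm(), fixes it:

```java
import java.util.Arrays;
import java.util.Iterator;

// Simulates a tokenizer whose current term is exposed through the
// instance's OWN attribute, like a Lucene AttributeSource.
class FakeTokenizer {
    private final Iterator<String> tokens;
    final StringBuilder termAtt = new StringBuilder(); // per-instance attribute

    FakeTokenizer(String... tokens) { this.tokens = Arrays.asList(tokens).iterator(); }

    boolean incrementToken() {
        if (!tokens.hasNext()) return false;
        termAtt.setLength(0);
        termAtt.append(tokens.next());
        return true;
    }

    String term() { return termAtt.toString(); }
}

// The naive wrapper from the question: forwards incrementToken(), but a
// consumer reads the WRAPPER's (never-populated) term attribute.
class NaiveWrapper extends FakeTokenizer {
    final FakeTokenizer delegate;

    NaiveWrapper(FakeTokenizer delegate) { this.delegate = delegate; }

    @Override boolean incrementToken() {
        return delegate.incrementToken();   // advances the delegate only
    }
}

// Mirrors the delegate's term into the wrapper's own attribute after
// each advance, analogous to setTerm() in Solution 1.
class MirroringWrapper extends NaiveWrapper {
    MirroringWrapper(FakeTokenizer delegate) { super(delegate); }

    @Override boolean incrementToken() {
        if (!delegate.incrementToken()) return false;
        termAtt.setLength(0);
        termAtt.append(delegate.term());    // copy the attribute over
        return true;
    }
}
```

    With the naive wrapper every incrementToken() returns true while term() stays empty, which matches "nothing enters the indexer" above; the mirroring wrapper yields the delegate's tokens as expected.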

    【Comments】:
