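As I see it, the order in your case should be: first obtain a TokenStream from the original string, then wrap that TokenStream in the n-gram filter with the gram sizes you need. The constructor of NGramTokenFilter confirms this, since it takes a TokenStream as its input.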
// org.apache.lucene.analysis.ngram.NGramTokenFilter
public final class NGramTokenFilter extends TokenFilter {
    // line 51
    public NGramTokenFilter(TokenStream input, int minGram, int maxGram) {
        ...
    }
}
Here is how I tried to do it, based on the description you provided.
import java.io.{ Reader, StringReader }
import org.apache.lucene.util.Version.LUCENE_34
import org.apache.lucene.analysis.ngram.NGramTokenFilter
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.analysis.{ TokenStream, Analyzer }
import org.apache.lucene.analysis.en.EnglishAnalyzer

object NgramTest extends App {

  class NGramAnalyzer extends Analyzer {
    def tokenStream(fieldName: String, reader: Reader): TokenStream = {
      val originalStream = (new EnglishAnalyzer(LUCENE_34)).reusableTokenStream(fieldName, reader)
      // n-gram with size 2 ~ 3
      new NGramTokenFilter(originalStream, 2, 3)
    }
  }

  def simpleTokenStreamList(tokenStream: TokenStream) = {
    val term = tokenStream.addAttribute(classOf[CharTermAttribute])
    Stream.continually(
      (tokenStream.incrementToken, term.toString)
    ).takeWhile(_._1).map(_._2).toList
  }

  val nGramAnalyzer = new NGramAnalyzer
  val ngramStream = nGramAnalyzer.tokenStream("sample", new StringReader("A letter from mother"))
  val result = simpleTokenStreamList(ngramStream)
  // List(le, et, tt, te, er, let, ett, tte, ter, fr, ro, om, fro, rom, mo, ot, th, he, er, mot, oth, the, her)
  println(result)
}
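To make the ordering in that output easier to follow, here is a minimal plain-Scala sketch of the same idea without Lucene. It is not Lucene's implementation; `NGramSketch`, `ngrams` and `ngramsOfTokens` are names I made up for illustration, and the ordering (all grams of the smallest size first, then the next size, token by token) is simply inferred from the output shown above for Lucene 3.4. Other Lucene versions may order the grams differently.

```scala
object NGramSketch {
  // All n-grams of a single token: for each size from minGram to maxGram,
  // slide a window of that size over the token from left to right.
  // A token shorter than the window size yields nothing for that size.
  def ngrams(token: String, minGram: Int, maxGram: Int): List[String] =
    (for {
      n     <- minGram to maxGram
      start <- 0 to token.length - n
    } yield token.substring(start, start + n)).toList

  // Apply the same step to every token of an already tokenized input,
  // which mirrors "tokenize first, then n-gram each token".
  def ngramsOfTokens(tokens: Seq[String], minGram: Int, maxGram: Int): List[String] =
    tokens.toList.flatMap(ngrams(_, minGram, maxGram))
}
```

For example, `NGramSketch.ngrams("letter", 2, 3)` gives `List(le, et, tt, te, er, let, ett, tte, ter)`, which is the first part of the result above. The stop word "A" never reaches the n-gram step because the EnglishAnalyzer has already dropped it from the stream.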
Also, Lucene in Action, 2nd edition, section 8.2.2 "Ngram filters", explains n-gram filters in detail. I suggest reading it; you may find your answer there.
Anyway, I hope this helps.