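【Posted】: 2016-07-30 21:20:07
【Problem description】:
I'm looking for help with the Lucene 6.1 API.
I'm trying to extend Lucene's Tokenizer and Analyzer, but I don't fully understand the guides. In every tutorial, the custom Tokenizer overrides incrementToken and takes a Reader in its constructor, and the custom Analyzer overrides createComponents with a Reader parameter. But in Lucene 6, createComponents takes only a single String parameter, so how do I pass the Reader to my Analyzer?
My code: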
public class ChemTokenizer extends Tokenizer {

    protected CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
    protected String stringToTokenize;
    protected int position = 0;
    protected List<int[]> chemicals = new ArrayList<>();

    @Override
    public boolean incrementToken() throws IOException {
        // Clear anything that is already saved in this.charTermAttribute
        this.charTermAttribute.setEmpty();

        // Get the position of the next symbol
        int nextIndex = -1;
        Pattern p = Pattern.compile("[^A-zА-я]");
        Matcher m = p.matcher(stringToTokenize.substring(position));
        nextIndex = m.start();

        // Did we lose chemicals?
        for (int[] pair : chemicals) {
            if (pair[0] < nextIndex && pair[1] > nextIndex) {
                // We are in the chemical name
                if (position == pair[0]) {
                    nextIndex = pair[1];
                }
                else {
                    nextIndex = pair[0];
                }
            }
        }

        // Next separator was found
        if (nextIndex != -1) {
            String nextToken = stringToTokenize.substring(position, nextIndex);
            charTermAttribute.append(nextToken);
            position = nextIndex + 1;
            return true;
        }
        // Last part of text
        else if (position < stringToTokenize.length()) {
            String nextToken = stringToTokenize.substring(position);
            charTermAttribute.append(nextToken);
            position = stringToTokenize.length();
            return true;
        }
        else {
            return false;
        }
    }

    public ChemTokenizer(Reader reader, List<String> additionalKeywords) {
        int numChars;
        char[] buffer = new char[1024];
        StringBuilder stringBuilder = new StringBuilder();
        try {
            while ((numChars =
                    reader.read(buffer, 0, buffer.length)) != -1) {
                stringBuilder.append(buffer, 0, numChars);
            }
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        stringToTokenize = stringBuilder.toString();

        // Checking for keywords
        // Doesn't work properly if text has chemical synonyms
        for (String keyword : additionalKeywords) {
            int[] tmp = new int[2];
            // Start of keyword
            tmp[0] = stringToTokenize.indexOf(keyword);
            tmp[1] = tmp[0] + keyword.length() - 1;
            chemicals.add(tmp);
        }
    }

    /* Reset the stored position for this object when reset() is called. */
    @Override
    public void reset() throws IOException {
        super.reset();
        position = 0;
        chemicals = new ArrayList<>();
    }
}
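A side note on the separator search in incrementToken, independent of the Lucene version: Matcher.start() throws IllegalStateException unless a match has been found first (for example via find()), and matching against substring(position) yields indices relative to that substring rather than to the whole text. The class [A-z] also covers the punctuation characters that sit between Z and a in ASCII. The following is only a minimal sketch of one way to search for the next separator using the standard library alone; the class name SeparatorSearch and the use of \p{L} ("any Unicode letter") in place of the original character class are assumptions made for illustration, not part of the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or of the question's code.
final class SeparatorSearch {

    // Compiled once. \p{L} matches any Unicode letter, so the negated
    // class matches every character that is not a letter (assumption:
    // this is an acceptable stand-in for the original character class).
    private static final Pattern NON_LETTER = Pattern.compile("[^\\p{L}]");

    /** Returns the index in text of the first non-letter at or after from, or -1 if there is none. */
    static int nextSeparator(String text, int from) {
        Matcher m = NON_LETTER.matcher(text);
        // find(int) searches from the given index of the full text, so the
        // result is an absolute index. start() is only valid after find()
        // has returned true.
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String text = "NaCl and water";
        System.out.println(nextSeparator(text, 0));    // 4
        System.out.println(nextSeparator(text, 5));    // 8
        System.out.println(nextSeparator("water", 0)); // -1
    }
}
```

Because the returned index refers to the full string, it can be compared directly with the start and end positions stored in chemicals.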
And the Analyzer code:
public class ChemAnalyzer extends Analyzer {

    List<String> additionalKeywords;

    public ChemAnalyzer(List<String> ad) {
        additionalKeywords = ad;
    }

    @Override
    protected TokenStreamComponents createComponents(String s, Reader reader) {
        Tokenizer tokenizer = new ChemTokenizer(reader, additionalKeywords);
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }
}
The problem is that this code does not work with Lucene 6.
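Separately from the Lucene 6 API change, the keyword lookup in the ChemTokenizer constructor records only the first occurrence of each keyword (indexOf returns one index) and stores a span starting at -1 when a keyword is absent. The sketch below shows one possible way to collect every occurrence with the standard library alone. The class name KeywordSpans and the end-exclusive span convention are assumptions for illustration; the original code uses an inclusive end index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene or of the question's code.
final class KeywordSpans {

    /** Returns {start, endExclusive} pairs for every occurrence of every keyword in text. */
    static List<int[]> find(String text, List<String> keywords) {
        List<int[]> spans = new ArrayList<>();
        for (String keyword : keywords) {
            if (keyword.isEmpty()) {
                continue; // an empty keyword would match everywhere and never advance
            }
            int from = 0;
            int start;
            // indexOf(String, int) returns -1 once there are no further
            // occurrences, which ends the loop; absent keywords add nothing.
            while ((start = text.indexOf(keyword, from)) != -1) {
                spans.add(new int[] {start, start + keyword.length()});
                from = start + keyword.length();
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<int[]> spans = find("acetic acid and acetic acid",
                Arrays.asList("acetic acid", "ethanol"));
        for (int[] span : spans) {
            System.out.println(span[0] + ".." + span[1]);
        }
    }
}
```

Note also that reset() in the tokenizer above replaces chemicals with an empty list, so any spans computed in the constructor are gone once reset() has been called.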
【Discussion】:
-
What do you mean, it does not work with Lucene 6? A compile error? A bug? Undesired behavior?
-
In Lucene 6, createComponents is declared differently.
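To illustrate the difference: in Lucene 6, Analyzer.createComponents takes only the field name, and a Tokenizer is constructed without a Reader. The Analyzer hands the Reader to the tokenizer later through setReader, and the tokenizer can read it from its protected input field once super.reset() has run. The following is a minimal sketch under those assumptions, not a drop-in replacement for ChemTokenizer: the class names BufferedTextTokenizer and SketchAnalyzer are hypothetical, the letter-run tokenization is a simplification, offsets and end() are left out, and the LowerCaseFilter import assumes the 6.x package layout (lucene-analyzers-common).

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical tokenizer: note that the constructor takes no Reader.
final class BufferedTextTokenizer extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private String text = "";
    private int position = 0;

    @Override
    public void reset() throws IOException {
        // After super.reset(), the Reader passed to setReader() is
        // available through the inherited protected field `input`.
        super.reset();
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = input.read(buffer, 0, buffer.length)) != -1) {
            sb.append(buffer, 0, n);
        }
        text = sb.toString();
        position = 0;
    }

    @Override
    public boolean incrementToken() {
        clearAttributes();
        // Skip characters that are not letters.
        while (position < text.length() && !Character.isLetter(text.charAt(position))) {
            position++;
        }
        if (position >= text.length()) {
            return false;
        }
        // Emit the next run of letters as one token.
        int start = position;
        while (position < text.length() && Character.isLetter(text.charAt(position))) {
            position++;
        }
        termAtt.append(text, start, position);
        return true;
    }
}

// Hypothetical analyzer using the single-argument createComponents.
final class SketchAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new BufferedTextTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
```

Extra state such as the additionalKeywords list can still be passed to the tokenizer's constructor. Only the Reader has to wait until reset(), so anything derived from the text (such as keyword positions) would be computed there rather than in the constructor.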