这不是一个全面的答案 - 但我同意 WordDelimiterGraphFilter 可能对此类数据有所帮助。但是,仍然可能存在需要额外处理的测试用例。
这是我的自定义分析器,使用 WordDelimiterGraphFilter:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilterFactory;
import java.util.Map;
import java.util.HashMap;
public class IdentifierAnalyzer extends Analyzer {
private WordDelimiterGraphFilterFactory getWordDelimiter() {
Map<String, String> settings = new HashMap<>();
settings.put("generateWordParts", "1"); // e.g. "PowerShot" => "Power" "Shot"
settings.put("generateNumberParts", "1"); // e.g. "500-42" => "500" "42"
settings.put("catenateAll", "1"); // e.g. "wi-fi" => "wifi" and "500-42" => "50042"
settings.put("preserveOriginal", "1"); // e.g. "500-42" => "500" "42" "500-42"
settings.put("splitOnCaseChange", "1"); // e.g. "fooBar" => "foo" "Bar"
return new WordDelimiterGraphFilterFactory(settings);
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
Tokenizer tokenizer = new KeywordTokenizer();
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = getWordDelimiter().create(tokenStream);
return new TokenStreamComponents(tokenizer, tokenStream);
}
@Override
protected TokenStream normalize(String fieldName, TokenStream in) {
TokenStream tokenStream = new LowerCaseFilter(in);
return tokenStream;
}
}
它使用WordDelimiterGraphFilterFactory 帮助器以及参数映射来控制应用哪些设置。
您可以在WordDelimiterGraphFilterFactoryJavaDoc 中查看可用设置的完整列表。您可能想尝试设置/取消设置不同的设置。
这里是以下 3 个输入值的测试索引构建器:
978-3-86680-192-9
TS 123
123.abc
public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
Analyzer analyzer = new IdentifierAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
Document doc;
List<String> identifiers = Arrays.asList("978-3-86680-192-9", "TS 123", "123.abc");
try (IndexWriter writer = new IndexWriter(dir, iwc)) {
for (String identifier : identifiers) {
doc = new Document();
doc.add(new TextField("identifiers", identifier, Field.Store.YES));
writer.addDocument(doc);
}
}
}
这将创建以下令牌:
为了查询上述索引数据,我使用了这个:
public static void doSearch() throws IOException, ParseException {
Analyzer analyzer = new IdentifierAnalyzer();
QueryParser parser = new QueryParser("identifiers", analyzer);
List<String> searches = Arrays.asList("9783", "9783*", "978 3", "978-3", "TS1*", "TS 1*");
for (String search : searches) {
Query query = parser.parse(search);
printHits(query, search);
}
}
private static void printHits(Query query, String search) throws IOException {
System.out.println("search term: " + search);
System.out.println("parsed query: " + query.toString());
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
System.out.println("hits: " + hits.length);
for (ScoreDoc hit : hits) {
System.out.println("");
System.out.println(" doc id: " + hit.doc + "; score: " + hit.score);
Document doc = searcher.doc(hit.doc);
System.out.println(" identifier: " + doc.get("identifiers"));
}
System.out.println("-----------------------------------------");
}
这使用了以下搜索词 - 我将所有这些词都传递到经典查询解析器中(当然,您可以通过 API 使用更复杂的查询类型):
9783
9783*
978 3
978-3
TS1*
TS 1*
唯一没有找到任何匹配文档的查询是第一个:
search term: 9783
parsed query: identifiers:9783
hits: 0
这不足为奇,因为这是一个部分标记,没有通配符。第二个查询(添加了通配符)按预期找到了一个文档。
我测试的最终查询 TS 1* 找到了三个匹配项 - 但我们想要的匹配得分最高:
search term: TS 1*
parsed query: identifiers:ts identifiers:1*
hits: 3
doc id: 1; score: 1.590861
identifier: TS 123
doc id: 0; score: 1.0
identifier: 978-3-86680-192-9
doc id: 2; score: 1.0
identifier: 123.abc