【Question title】: Querying part-of-speech tags with Lucene 7 OpenNLP
【Posted】: 2019-02-20 12:50:13
【Question】:

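For fun and learning, I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal is that, once the text is indexed, I can search for a sequence of POS tags and find all sentences that match the sequence. I already have the indexing part working, but I am stuck on the query part. I am aware that Solr might have some functionality for this, and I have already checked its code (which was not very self-explanatory after all). But my goal is to understand and implement this in Lucene 7, not in Solr, because I want to be independent of any search engine on top.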

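Idea

Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applying the Lucene OpenNLP tokenizer results in: [The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
Next, applying Lucene OpenNLP POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]

Input sentence 2: Give it to me, baby!
Applying the Lucene OpenNLP tokenizer results in: [Give][it][to][me][,][baby][!]
Next, applying Lucene OpenNLP POS tagging results in: [VB][PRP][TO][PRP][,][UH][.]

Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards, etc.)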
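To make the matching rule concrete, here is a minimal sketch in plain Java (no Lucene involved) of what "exact match" means for a tag sequence: the query tags must occur as a contiguous run inside a sentence's tag list. The class name TagSequenceMatch and the method containsSequence are illustrative names, not part of Lucene or OpenNLP.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class TagSequenceMatch {
    // Returns true if 'query' occurs as a contiguous run inside 'tags'.
    static boolean containsSequence(List<String> tags, List<String> query) {
        return Collections.indexOfSubList(tags, query) >= 0;
    }

    public static void main(String[] args) {
        List<String> sentence1 = Arrays.asList("DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", ".");
        List<String> sentence2 = Arrays.asList("VB", "PRP", "TO", "PRP", ",", "UH", ".");
        List<String> query = Arrays.asList("JJ", "NN", "VBD");

        System.out.println(containsSequence(sentence1, query)); // prints true
        System.out.println(containsSequence(sentence2, query)); // prints false
    }
}
```

This is only the specification of the desired result; the point of the question is to get the same answer out of a Lucene index instead of scanning every sentence.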

Indexing

First, I created my own class com.example.OpenNLPAnalyzer:

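import java.io.IOException;

import opennlp.tools.postag.POSModel;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerModel;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.opennlp.OpenNLPPOSFilter;
import org.apache.lucene.analysis.opennlp.OpenNLPTokenizer;
import org.apache.lucene.analysis.opennlp.tools.NLPPOSTaggerOp;
import org.apache.lucene.analysis.opennlp.tools.NLPSentenceDetectorOp;
import org.apache.lucene.analysis.opennlp.tools.NLPTokenizerOp;
import org.apache.lucene.analysis.opennlp.tools.OpenNLPOpsFactory;
import org.apache.lucene.analysis.payloads.TypeAsPayloadTokenFilter;
import org.apache.lucene.analysis.util.ClasspathResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.util.AttributeFactory;

public class OpenNLPAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    try {
        // Models are loaded from the classpath
        ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());

        // Tokenizer model
        TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
        NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);

        // Sentence detection model
        SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
        NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);

        Tokenizer source = new OpenNLPTokenizer(
                AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);

        // POS tagger model
        POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
        NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);

        // Perhaps we should also use a lower-case filter here?

        TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);

        // Very important: the POS tags are not indexed; we need to store them as payloads, otherwise we cannot search on them
        TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);

        return new TokenStreamComponents(source, payloadFilter);
    }
    catch (IOException e) {
        // Keep the original exception as the cause
        throw new RuntimeException(e);
    }
  }
}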

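Note that we are using a TypeAsPayloadTokenFilter wrapped around the OpenNLPPOSFilter. This means that our POS tags will be indexed as payloads, and our query (however it looks) will have to search on payloads as well.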

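Querying

This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7; it seems that querying on payloads has changed several times across older versions. Documentation is extremely scarce. It is not even clear what the proper field name to query is now: is it "word", or "type", or something else? For example, I tried this code, which does not return any search results: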

    // Step 1: Indexing
    final String body = "The quick brown fox jumped over the lazy dogs.";
    Directory index = new RAMDirectory();
    OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
    IndexWriter writer = new IndexWriter(index, indexWriterConfig);
    Document document = new Document();
    document.add(new TextField("body", body, Field.Store.YES));
    writer.addDocument(document);
    writer.close();


    // Step 2: Querying
    final int topN = 10;
    DirectoryReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);

    final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
    final String queryText = "JJ";
    Term term = new Term(fieldName, queryText);
    SpanQuery match = new SpanTermQuery(term);
    BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
    SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));

    System.out.println(query.toString());

    TopDocs topDocs = searcher.search(query, topN);
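
One possible reading of why the snippet above finds nothing, sketched as a toy model in plain Java rather than with Lucene itself: OpenNLPPOSFilter puts the POS tag into each token's type attribute, and TypeAsPayloadTokenFilter copies that type into the payload. Under that reading the indexed terms in the "body" field are still the words, and "JJ" exists only as payload bytes, so a term query for "JJ" has nothing to match, and a payload check would compare against the tag bytes rather than the literal string "type". The Posting class and the payloadCheck method below are hypothetical stand-ins, not Lucene API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PayloadToyModel {
    // One indexed position: the term is the word, the payload carries the POS tag bytes.
    static final class Posting {
        final String term;
        final byte[] payload;

        Posting(String term, String tag) {
            this.term = term;
            this.payload = tag.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Toy version of "match the term, then check the payload at that position".
    static List<Integer> payloadCheck(List<Posting> field, String term, String payload) {
        byte[] wanted = payload.getBytes(StandardCharsets.UTF_8);
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < field.size(); i++) {
            Posting p = field.get(i);
            if (p.term.equals(term) && Arrays.equals(p.payload, wanted)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<Posting> body = Arrays.asList(
                new Posting("The", "DT"), new Posting("quick", "JJ"),
                new Posting("brown", "JJ"), new Posting("fox", "NN"));

        System.out.println(payloadCheck(body, "JJ", "type"));  // prints []  (the term "JJ" was never indexed)
        System.out.println(payloadCheck(body, "quick", "JJ")); // prints [1] (word as term, tag as payload)
    }
}
```

In this model a tag can only be checked after a word has been matched, which may be why a pure tag-sequence search is awkward to express through payloads alone.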

Any help is highly appreciated.

【Question comments】:

    Tags: lucene nlp opennlp part-of-speech


    【Solution 1】:

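    Why don't you use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just run a normal query? So in your analyzer:

    // ... (models, tokenizer and POS tagger set up as in the question)
    TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
    // Emit each token's type (the POS tag) as a synonym of the word
    TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
    return new TokenStreamComponents(source, typeAsSynonymFilter);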
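
    Why a normal phrase query then works can be sketched with a toy positional index in plain Java. The assumption, based on the filter's documented behaviour, is that TypeAsSynonymFilter emits the token type (here the POS tag) as an extra token at the same position as the word, so each position holds both the word and its tag. The class SynonymToyIndex and its methods are illustrative only and not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SynonymToyIndex {
    // Each position holds the word and its POS tag, as if the tag were a synonym.
    static List<Set<String>> index(String[] words, String[] tags) {
        List<Set<String>> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            positions.add(new HashSet<>(Arrays.asList(words[i], tags[i])));
        }
        return positions;
    }

    // Exact phrase match: the phrase terms must appear at consecutive positions.
    static boolean matchesPhrase(List<Set<String>> positions, String... phrase) {
        for (int start = 0; start + phrase.length <= positions.size(); start++) {
            boolean all = true;
            for (int k = 0; k < phrase.length && all; k++) {
                all = positions.get(start + k).contains(phrase[k]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] words = {"The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dogs", "."};
        String[] tags = {"DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "JJ", "NNS", "."};
        List<Set<String>> doc = index(words, tags);

        System.out.println(matchesPhrase(doc, "JJ", "NN", "VBD")); // prints true  (tags only)
        System.out.println(matchesPhrase(doc, "brown", "fox"));    // prints true  (words only)
        System.out.println(matchesPhrase(doc, "fox", "brown"));    // prints false (wrong order)
    }
}
```

    A side effect in this model is that mixed phrases such as "brown NN" also match, and a word that happens to equal a tag cannot be told apart from it. The filter also accepts an optional prefix for the emitted type, which might be used to keep words and tags separate.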
    

    and the indexing side:

    // Field name used for both indexing and searching ("body", as shown in the output)
    static final String FIELD = "body";

    static Directory index() throws Exception {
      Directory index = new RAMDirectory();
      OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
      IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
      IndexWriter writer = new IndexWriter(index, indexWriterConfig);
      writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
      writer.addDocument(doc("Give it to me, baby!"));
      writer.close();
    
      return index;
    }
    
    // Wraps one sentence in a document with a single stored text field
    static Document doc(String body){
      Document document = new Document();
      document.add(new TextField(FIELD, body, Field.Store.YES));
      return document;
    }
    

    and the searching side:

    static void search(Directory index, String searchPhrase) throws Exception {
      final int topN = 10;
      DirectoryReader reader = DirectoryReader.open(index);
      IndexSearcher searcher = new IndexSearcher(reader);
    
      QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
      Query query = parser.parse(searchPhrase);
      System.out.println(query);
    
      TopDocs topDocs = searcher.search(query, topN);
      System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
      for(ScoreDoc scoreDoc: topDocs.scoreDocs){
        Document doc = searcher.doc(scoreDoc.doc);
        System.out.printf("\t%s\n", doc.get(FIELD));
      }
    }
    

    and then use them like this:

    public static void main(String[] args) throws Exception {
      Directory index = index();
      search(index, "\"JJ NN VBD\"");    // search the sequence of POS tags
      search(index, "\"brown fox\"");    // search a phrase
      search(index, "\"fox brown\"");    // search a phrase (no hits)
      search(index, "baby");             // search a word
      search(index, "\"TO PRP\"");       // search the sequence of POS tags
    }
    

    The result looks like this:

    body:"JJ NN VBD"
    "JJ NN VBD" => 1 hits
        The quick brown fox jumped over the lazy dogs.
    body:"brown fox"
    "brown fox" => 1 hits
        The quick brown fox jumped over the lazy dogs.
    body:"fox brown"
    "fox brown" => 0 hits
    body:baby
    baby => 1 hits
        Give it to me, baby!
    body:"TO PRP"
    "TO PRP" => 1 hits
        Give it to me, baby!
    

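    【Comments】:

    • Thanks for the idea, I will give it a try and report back. Nevertheless, I am still interested in a solution that uses payloads. It seems to be much more complicated, and I would appreciate any help. For example, what happened in Lucene 7 to PayloadTermQuery, which still existed in Lucene 6? This is the best article I have found so far, but it is outdated: toptal.com/database/…. This one also seems outdated: stackoverflow.com/questions/6493249/lucene-payload-scoring
    • I tried it and it works great. Thanks a lot for the idea! Now, if someone else still knows how to achieve this with payloads, that would be even better.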