【Question title】: How to match exact text in Lucene search?
【Posted】: 2016-09-26 12:16:07
【Question】:

I am trying to match the text Config migration from ASA5505 8.2 to ASA5516 in the TITLE column.

My program looks like this.

Directory directory = FSDirectory.open(indexDir);

MultiFieldQueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_35,new String[] {"TITLE"}, new StandardAnalyzer(Version.LUCENE_35));        
IndexReader reader = IndexReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);       
queryParser.setPhraseSlop(0);
queryParser.setLowercaseExpandedTerms(true);
Query query = queryParser.parse("TITLE:Config migration from ASA5505 8.2 to ASA5516");
System.out.println(query);
TopDocs topDocs = searcher.search(query,100);
System.out.println(topDocs.totalHits);
ScoreDoc[] hits = topDocs.scoreDocs;
System.out.println(hits.length + " Record(s) Found");
for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println("\"Title :\" " +d.get("TITLE") );
}

But it returns

"Title :" Config migration from ASA5505 8.2 to ASA5516
"Title :" Firewall  migration from ASA5585 to  ASA5555
"Title :" Firewall  migration from ASA5585 to  ASA5555

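The last two results are not expected. What do I need to change so that only the exact text Config migration from ASA5505 8.2 to ASA5516 matches?

My indexing code looks like this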

public class Lucene {
public static final String INDEX_DIR = "./Lucene";
private static final String JDBC_DRIVER = "oracle.jdbc.OracleDriver";
private static final String CONNECTION_URL = "jdbc:oracle:thin:xxxxxxx";

private static final String USER_NAME = "localhost";
private static final String PASSWORD = "localhost";
private static final String QUERY = "select * from TITLE_TABLE";

public static void main(String[] args) throws Exception {
    File indexDir = new File(INDEX_DIR);
    Lucene indexer = new Lucene();
    try {
        Date start = new Date();
        Class.forName(JDBC_DRIVER).newInstance();
        Connection conn = DriverManager.getConnection(CONNECTION_URL, USER_NAME, PASSWORD);
        SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_35);
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_35, analyzer);
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), indexWriterConfig);
        System.out.println("Indexing to directory '" + indexDir + "'...");
        int indexedDocumentCount = indexer.indexDocs(indexWriter, conn);
        indexWriter.close();
        System.out.println(indexedDocumentCount + " records have been indexed successfully");
        System.out.println("Total Time:" + (new Date().getTime() - start.getTime()) / (1000));
    } catch (Exception e) {
        e.printStackTrace();
    }
}

int indexDocs(IndexWriter writer, Connection conn) throws Exception {
    String sql = QUERY;
    Statement stmt = conn.createStatement();
    stmt.setFetchSize(100000);
    ResultSet rs = stmt.executeQuery(sql);
    int i = 0;
    while (rs.next()) {
        System.out.println("Adding Doc No:" + i);
        Document d = new Document();
        System.out.println(rs.getString("TITLE"));
        d.add(new Field("TITLE", rs.getString("TITLE"), Field.Store.YES, Field.Index.ANALYZED));
        d.add(new Field("NAME", rs.getString("NAME"), Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(d);
        i++;
    }
    return i;
}
}

【Comments】:

    Tags: java lucene


    【Answer 1】:

    Here is what I wrote for you; it works well:

    Use: queryParser.parse("\"Config migration from ASA5505 8.2 to ASA5516\"");

    1. Create the index

      public static void main(String[] args) 
      {
      
          IndexWriter writer = getIndexWriter();
          Document doc = new Document();
          Document doc1 = new Document();
          Document doc2 = new Document();
          doc.add(new Field("TITLE", "Config migration from ASA5505 8.2 to ASA5516",Field.Store.YES,Field.Index.ANALYZED));
          doc1.add(new Field("TITLE", "Firewall  migration from ASA5585 to ASA5555",Field.Store.YES,Field.Index.ANALYZED));
          doc2.add(new Field("TITLE", "Firewall  migration from ASA5585 to ASA5555",Field.Store.YES,Field.Index.ANALYZED));
          try 
          {
              writer.addDocument(doc);
              writer.addDocument(doc1);
              writer.addDocument(doc2);
              writer.close();
          } catch (IOException e) {
              // TODO Auto-generated catch block
              e.printStackTrace();
          }
      }
      
      public static IndexWriter getIndexWriter()
      {
          IndexWriter indexWriter=null;
      
          try 
          {
          File file=new File("D://index//");
          if(!file.exists())
              file.mkdir();
          IndexWriterConfig conf=new IndexWriterConfig(Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34));
          Directory directory=FSDirectory.open(file);
          indexWriter=new IndexWriter(directory, conf);
          } catch (IOException e) {
              // TODO Auto-generated catch block
              e.printStackTrace();
          }
          return indexWriter;
      }
      

      }

    2. Search for the string

        public static void main(String[] args) 
        {
    
        IndexReader reader=getIndexReader();
    
        IndexSearcher searcher = new IndexSearcher(reader);
    
        QueryParser parser = new QueryParser(Version.LUCENE_34, "TITLE" ,new StandardAnalyzer(Version.LUCENE_34));
    
        Query query;
        try 
        {
        query = parser.parse("\"Config migration from ASA5505 8.2 to ASA5516\"");
    
        TopDocs hits = searcher.search(query,3);
    
        ScoreDoc[] document = hits.scoreDocs;
        int i=0;
        for(i=0;i<document.length;i++)
        {
            // Fetch by the hit's document ID, not by the loop index
            Document doc = searcher.doc(document[i].doc);
    
            System.out.println("TITLE=" + doc.get("TITLE"));
        }
            searcher.close();
    
        } 
        catch (Exception e) 
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } 
                }
    
    public static IndexReader getIndexReader()
    {
        IndexReader reader=null;
    
        Directory dir;
        try 
        {
            dir = FSDirectory.open(new File("D://index//"));
            reader=IndexReader.open(dir);
        } catch (IOException e) 
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    
        return reader;
    }   
    

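    【Comments】:

    • Did you get results? Can you add a print statement before TopDocs hits = searcher.search(query,3);?
    • Yes, it returns the exactly matching document. If I print the query, it looks like this: TITLE:"config migration from asa5505 8.2 ? asa5516"
    • But what about case sensitivity? I want to do an exact match.
    • Actually the parser takes care of that, so we don't have to worry; we have already passed the case-sensitive string. Does your index contain multiple TITLE values that are exact matches apart from case?
    • Yes. The indexed data has identical TITLE values that differ only in case, so I must fetch only the TITLE that matches exactly.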
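The case-sensitivity question is left open in this thread. The printed query above shows that StandardAnalyzer lowercases the terms, so titles that differ only in case produce the same terms and a phrase query alone cannot tell them apart. One possible workaround, sketched below in plain Java with no Lucene calls, is to compare each hit's stored TITLE value against the wanted string after the search. The class and method names are hypothetical, and the list stands in for values read with d.get("TITLE"). Another option often used with Lucene 3.x, not shown here, is to index an extra copy of the field with Field.Index.NOT_ANALYZED and query it with a TermQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExactCaseFilterSketch {

    // Keep only the stored titles that equal the wanted text exactly,
    // including case. String.equals is case-sensitive.
    public static List<String> keepExact(List<String> storedTitles, String wanted) {
        List<String> exact = new ArrayList<>();
        for (String title : storedTitles) {
            if (wanted.equals(title)) {
                exact.add(title);
            }
        }
        return exact;
    }

    public static void main(String[] args) {
        // Assumption: these stand in for the stored TITLE values of the search hits.
        List<String> hits = Arrays.asList(
                "Config migration from ASA5505 8.2 to ASA5516",
                "config migration from asa5505 8.2 to asa5516");
        // prints [Config migration from ASA5505 8.2 to ASA5516]
        System.out.println(keepExact(hits, "Config migration from ASA5505 8.2 to ASA5516"));
    }
}
```

Post-filtering keeps the analyzed index unchanged, at the cost of retrieving a few extra hits that are then discarded.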
    【Answer 2】:

    Try PhraseQuery as follows:

    BooleanQuery mainQuery= new BooleanQuery(); 
    String searchTerm="config migration from asa5505 8.2 to asa5516";
    String strArray[]= searchTerm.split(" ");
    for(int index=0;index<strArray.length;index++)
    {
        PhraseQuery query1 = new PhraseQuery();
         query1.add(new Term("TITLE",strArray[index]));
         mainQuery.add(query1,BooleanClause.Occur.MUST);
    }
    

    Then execute mainQuery.

    See this Stack Overflow thread, which may help you use PhraseQuery for exact searching.
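Note that the loop above wraps each word in its own single-term PhraseQuery and requires all of them with MUST, which behaves like an AND of terms rather than one phrase. The difference can be illustrated with a small plain-Java model that uses no Lucene classes. The class name, method names, and sample tokens are hypothetical, and real Lucene matching works on indexed term positions rather than lists.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseVersusAndSketch {

    // AND semantics: every term occurs somewhere in the document, in any order.
    public static boolean matchesAllTerms(List<String> docTokens, List<String> terms) {
        return docTokens.containsAll(terms);
    }

    // Phrase semantics with slop 0: the terms occur at consecutive
    // positions and in the same order as in the query.
    public static boolean matchesPhrase(List<String> docTokens, List<String> terms) {
        return !terms.isEmpty() && Collections.indexOfSubList(docTokens, terms) >= 0;
    }

    public static void main(String[] args) {
        // Assumption: a made-up title whose words are in a different order.
        List<String> doc = Arrays.asList("migration", "config", "from", "asa5505");
        List<String> query = Arrays.asList("config", "migration");

        System.out.println(matchesAllTerms(doc, query)); // prints true
        System.out.println(matchesPhrase(doc, query));   // prints false
    }
}
```

With AND semantics a title containing the same words in another order still matches, which is why a single phrase is the closer fit for exact-text search.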

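    【Comments】:

    • That is not how building a PhraseQuery works. You need to add your terms to the query individually (query.add(new Term("Title", "config")); query.add(new Term("Title", "migration")); ...). Since it is built manually, you do not rely on an analyzer.
    • How do I do an exact match?
    • femtoRgon, thanks for your comment; I have edited my answer.
    【Answer 3】:

    PVR is right that a phrase query is probably the correct solution here, but they missed how the PhraseQuery class is used. You are already using a QueryParser, though, so you can simply use the query parser syntax and enclose the search text in quotes: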

    Query query = queryParser.parse("TITLE:\"Config migration from ASA5505 8.2 to ASA5516\"");
    

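    Based on your update, you are using different analyzers at index time and at query time. SimpleAnalyzer and StandardAnalyzer do not do the same thing. Unless you have a very good reason not to, you should analyze the same way when indexing and when querying.

    So change the analyzer in your indexing code to StandardAnalyzer (or, the other way around, use SimpleAnalyzer when querying), and you should see better results.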
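To see why the mismatch matters, here is a rough plain-Java model of the two analyzers. It is a simplification and not Lucene's actual tokenizers: simpleLike splits at every non-letter and lowercases, roughly what SimpleAnalyzer does, while standardLike splits on whitespace, lowercases, and drops stop words as a stand-in for StandardAnalyzer. The class name, method names, and the abbreviated stop list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerMismatchSketch {

    // Assumption: abbreviated stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to", "of"));

    // Rough model of SimpleAnalyzer: break at every non-letter, then lowercase.
    public static List<String> simpleLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Rough model of StandardAnalyzer: whitespace split, lowercase, drop stop words.
    public static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "Config migration from ASA5505 8.2 to ASA5516";
        // prints [config, migration, from, asa, to, asa]
        System.out.println(simpleLike(title));
        // prints [config, migration, from, asa5505, 8.2, asa5516]
        System.out.println(standardLike(title));
    }
}
```

In this model the index side never contains asa5505, 8.2, or asa5516, so a phrase built from the query-side tokens has nothing to match until both sides use the same analysis.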

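    【Comments】:

    • Still no results. The search text is converted to lowercase, like TITLE:"config migration from asa5505 8.2 ? asa5516". Where does the ? come from?
    • The ? represents a removed stop word, which is expected behavior for StandardAnalyzer. This works in my tests. I would look at how the field is being indexed, which analyzer is used there, and so on.
    • @SantoshHegde - I have edited my answer now that I see the indexing code you added to the question. Hopefully that clears things up for you.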