How to index PDF files with Lucene: answers

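【Question title】: How to index pdf file with lucene
【Posted】: 2014-05-20 14:03:50
【Question description】:

I have to build a full-text search with Lucene in my project, so I need to index a BLOB column in a MySQL database (it holds PDF, DOC, XSL, XML, and image files). I have no problems with DOC, XSL, and XML, but searches return no results for the PDF files.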

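    import java.io.File;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class Indexfile {
        public static void main(String[] args) throws Exception {
            // RemoteControlServiceConnection is the asker's own JDBC helper class
            RemoteControlServiceConnection a = new RemoteControlServiceConnection(
                    "jdbc:mysql://localhost:3306/Test", "root", "root");
            Connection conn = a.getConnexionMySQL();
            final File INDEX_DIR = new File("index");
            // true = create a new index, overwriting any existing one
            IndexWriter writer = new IndexWriter(INDEX_DIR,
                    new StandardAnalyzer(),
                    true);

            String query = "SELECT id, name, document FROM Table_document";
            Statement statement = conn.createStatement();
            ResultSet result = statement.executeQuery(query);

            while (result.next()) {
                Document document = new Document();
                document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NO));
                document.add(new Field("name", result.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
                // The BLOB is read as a plain string here, whatever the file type
                document.add(new Field("document", result.getString("document"), Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(document);
            }

            writer.close();
        }
    }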

I search the index with:

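    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocCollector;

    public class searchlucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            String qu = "montbel*"; // put your keyword here
            // String IndexStoreDir = "index-directory";
            try {
                // Search the "document" field, which holds the file content
                Query q = new QueryParser("document", analyzer).parse(qu);
                int hitspp = 100; // hits per page
                IndexSearcher searcher = new IndexSearcher(IndexReader.open("index"));
                TopDocCollector collector = new TopDocCollector(hitspp);
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println("Found " + hits.length + " hits.");
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = searcher.doc(docId);
                    System.out.println((i + 1) + ". " + d.get("name"));
                }
                searcher.close();
            } catch (Exception ex1) {
                // Report the failure instead of silently swallowing it
                ex1.printStackTrace();
            }
        }
    }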

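【Question comments】:

I'm really surprised that DOC works, since it is a binary format. To index PDF files, what I would do is take the PDF data, convert it to text with a tool such as PDFBox, and then index that text content. Alternatively, you may want to "upgrade" to Apache Solr, which I believe has built-in support for indexing specific file types.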
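
A minimal sketch of the first step this comment describes: fetch the BLOB as raw bytes rather than with `getString`, and check whether it is a PDF before choosing a text extractor. `BlobSniffer`, `isPdf`, and `readBlob` are hypothetical names introduced for illustration only. The sketch assumes that a well-formed PDF begins with the ASCII signature `%PDF-` and that the column is named `document`, as in the question.

```java
import java.nio.charset.StandardCharsets;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Lucene, PDFBox, or Tika.
class BlobSniffer {
    // A well-formed PDF file starts with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the bytes start with the PDF signature. */
    static boolean isPdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Reads the BLOB column as raw bytes instead of calling getString(). */
    static byte[] readBlob(ResultSet row, String column) throws SQLException {
        byte[] data = row.getBytes(column);
        return data == null ? new byte[0] : data;
    }
}
```

In the indexing loop, `BlobSniffer.isPdf(BlobSniffer.readBlob(result, "document"))` could decide whether the bytes are handed to a PDF text extractor such as PDFBox before the resulting text is added as a Lucene field.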

Tags: java mysql pdf lucene

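【Solution 1】:

To parse any type of file, use the Tika project to extract the text, then index that text with Lucene. Tika already bundles many parsing libraries (PDFBox and others).

【Comments】: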

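【Solution 2】:

First you need to convert the PDF file's content to text, then add that text to the index.

For example:

You can use PDFBox to convert the PDF content to text: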

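    // PDFBox 1.x packages; in PDFBox 2.x the stripper is org.apache.pdfbox.text.PDFTextStripper
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    String contents = "";
    PDDocument doc = null;
    try {
        doc = PDDocument.load(file); // file holds the PDF (a java.io.File or an InputStream)
        PDFTextStripper stripper = new PDFTextStripper();

        stripper.setLineSeparator("\n");
        stripper.setStartPage(1);
        stripper.setEndPage(5); // extract (and therefore index) only the first 5 pages
        contents = stripper.getText(doc);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (doc != null) {
            try {
                doc.close(); // release the resources held by the loaded PDF
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }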

Then add the extracted text to the Lucene Document, for example:

    luceneDoc.add(new Field(CONTENT_FIELD, contents, Field.Store.NO, Field.Index.TOKENIZED));

【Comments】:

【Solution 3】:

First, read your PDF with iText, like this:

    // iText 5 packages
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.parser.PdfTextExtractor;

    String content = "";
    try {
        PdfReader reader = new PdfReader("file path");
        int n = reader.getNumberOfPages(); // total page count; pages are numbered from 1
        content = PdfTextExtractor.getTextFromPage(reader, 2); // extract the text of one particular page
        reader.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

Then add the PDF content to the Lucene document:
    doc.add(new Field("pdfContent", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
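
The snippet above extracts a single page. To index a whole file, the per-page text has to be collected for pages 1 through `getNumberOfPages()`. The following is only a sketch of that loop: `PageJoiner` and `joinPages` are hypothetical names, and the extractor argument stands in for a call such as `PdfTextExtractor.getTextFromPage(reader, page)`, which throws a checked `IOException` and would therefore need to be wrapped before being passed as a lambda.

```java
import java.util.function.IntFunction;

// Hypothetical helper for illustration; not part of iText or Lucene.
class PageJoiner {
    /** Joins the text of pages 1..pageCount; page numbers are 1-based, as in iText. */
    static String joinPages(int pageCount, IntFunction<String> extractor) {
        StringBuilder all = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            String text = extractor.apply(page);
            if (text != null) {
                // Separate pages with a newline so words do not run together
                all.append(text).append('\n');
            }
        }
        return all.toString();
    }
}
```

The joined string could then be passed as the field value in place of the single-page `content`.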

【Comments】: