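[Posted]: 2014-05-20 14:03:50
[Question]:
I have to build a full-text search in my project using Lucene, so I need to index a BLOB column in a MySQL database (it contains pdf, doc, xsl, xml, and image files). I have no problems with doc, xsl, and xml, but I get no results for pdf files.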
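import java.io.File;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexfile {
    public static void main(String[] args) throws Exception {
        // RemoteControlServiceConnection is the project's own JDBC helper class.
        RemoteControlServiceConnection a = new RemoteControlServiceConnection(
                "jdbc:mysql://localhost:3306/Test", "root", "root");
        Connection conn = a.getConnexionMySQL();
        final File INDEX_DIR = new File("index");
        // true = create a new index, overwriting any existing one.
        IndexWriter writer = new IndexWriter(INDEX_DIR,
                new StandardAnalyzer(),
                true);
        String query = "SELECT id, name, document FROM Table_document";
        Statement statement = conn.createStatement();
        ResultSet result = statement.executeQuery(query);
        while (result.next()) {
            Document document = new Document();
            document.add(new Field("id", result.getString("id"), Field.Store.YES, Field.Index.NO));
            document.add(new Field("name", result.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
            // The BLOB is read as a plain string here; no text extraction is done for binary formats.
            document.add(new Field("document", result.getString("document"), Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(document);
        }
        writer.close();
    }
}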
I search with:
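import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocCollector;

public class searchlucene {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        String qu = "montbel*"; // put your keyword here
        // String IndexStoreDir = "index-directory";
        try {
            Query q = new QueryParser("document", analyzer).parse(qu);
            int hitspp = 100; // hits per page
            IndexSearcher searcher = new IndexSearcher(IndexReader.open("index"));
            TopDocCollector collector = new TopDocCollector(hitspp);
            searcher.search(q, collector);
            ScoreDoc[] hits = collector.topDocs().scoreDocs;
            System.out.println("Found " + hits.length + " hits.");
            for (int i = 0; i < hits.length; ++i) {
                int docId = hits[i].doc;
                Document d = searcher.doc(docId);
                System.out.println((i + 1) + ". " + d.get("name"));
            }
            searcher.close();
        } catch (Exception ex1) {
            // Report the failure instead of silently swallowing it.
            ex1.printStackTrace();
        }
    }
}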
[Comments]:
-
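I'm really surprised that doc works, since it is a binary format. To index PDF files, what I would do is take the PDF data, convert it to text using, for example, PDFBox, and then index that text content. However, you might want to "upgrade" to Apache SOLR, which I believe has built-in support for indexing specific file types.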
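
The "convert to text first, then index" step suggested above could be sketched roughly as follows. This is only an illustration, not the asker's code: the class `BlobText` and its methods are hypothetical names, the PDF check relies on well-formed PDF files starting with the `%PDF-` marker, and decoding non-PDF data as UTF-8 is an assumption. The actual PDF-to-text conversion is passed in as a function so that the sketch itself needs nothing beyond the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical helper that turns raw BLOB bytes into text suitable for indexing. */
final class BlobText {
    // Well-formed PDF files begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private BlobText() {}

    /** Returns true when the data starts with the PDF header marker. */
    static boolean isPdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Routes PDF bytes to the supplied extractor; any other data is decoded
     * as UTF-8 text (an assumption that only suits text-based formats such as XML).
     */
    static String toIndexableText(byte[] data, Function<byte[], String> pdfExtractor) {
        if (data == null) {
            return "";
        }
        if (isPdf(data)) {
            return pdfExtractor.apply(data);
        }
        return new String(data, StandardCharsets.UTF_8);
    }
}
```

In the indexing loop this would presumably mean reading the column with `result.getBytes("document")` rather than `result.getString("document")`, and indexing the value returned by `BlobText.toIndexableText(bytes, extractor)`. The extractor could wrap PDFBox; in PDFBox 2.x that is typically `PDDocument.load(bytes)` followed by `new PDFTextStripper().getText(doc)`, but the exact classes differ between PDFBox versions, so check the release you use.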