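【Posted】: 2017-05-02 20:51:14
【Question】:
Think of Elasticsearch as the back end and my Java code as the front end. My PDF files will be stored in Elasticsearch. I now need the Java front-end code to extract the content of a PDF file and then just send it to the Elasticsearch back end for indexing. I connect to Elasticsearch from Java using settings read from an XML configuration: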
private void readElasticSearchConfig() {
    String configparam = factoryType.serverXML.getAdapterConfigParams();
    if (configparam != null && configparam.length() > 0) {
        xmlepath = StringUtility.configParamsLookup("|", configparam, "NEWS_STORY_FOLDER");
        newssource = StringUtility.configParamsLookup("|", configparam, "news_source");
        indexserver = StringUtility.configParamsLookup("|", configparam, "indexserver");
        isInsertElasticSearchIndex = true;
        out.println("Read xmlpath = " + xmlepath + "->newssource :" + newssource + "->indexserver :" + indexserver);
    }
}
Example value in the XML: NEWS_STORY_FOLDER=D:/NEWS_ARCHIVE/Bursa/newsStory/|news_source=N|indexserver=http://127.0.0.1:9200/news/TRKD/
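The `StringUtility.configParamsLookup` helper is not shown in the question. As a rough illustration of how a lookup over the pipe-delimited string above might work, here is a minimal sketch. The class name `ConfigParams` and its methods are hypothetical and are not the asker's actual utility.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical stand-in for StringUtility.configParamsLookup (the real helper is not shown).
class ConfigParams {

    // Returns the value stored under key in "k1=v1|k2=v2", or null if the key is absent.
    static String lookup(String delimiter, String params, String key) {
        return parse(delimiter, params).get(key);
    }

    // Parses a delimited list of key=value pairs into an insertion-ordered map.
    static Map<String, String> parse(String delimiter, String params) {
        Map<String, String> result = new LinkedHashMap<>();
        if (params == null || params.isEmpty()) {
            return result;
        }
        // Pattern.quote is needed because "|" is a regex metacharacter.
        for (String pair : params.split(Pattern.quote(delimiter))) {
            int eq = pair.indexOf('=');   // split on the first '=' only
            if (eq > 0) {
                result.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String config = "NEWS_STORY_FOLDER=D:/NEWS_ARCHIVE/Bursa/newsStory/|news_source=N"
                + "|indexserver=http://127.0.0.1:9200/news/TRKD/";
        System.out.println(lookup("|", config, "indexserver"));
    }
}
```

Splitting each pair on the first `=` only keeps values intact even if they contain further `=` characters, for example a URL with a query string.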
After that, all the data is loaded into a bean. Below is the Java front-end code:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public void generateJsonObject(NewsContentObj newsContentObj, String sNewsID) {
    String fileName = "D:\\workspace\\AdapterReuters_TRKD_News\\bin\\test\\Order Summary.pdf";
    // try-with-resources closes the stream even if parsing fails
    try (InputStream inputstream = new FileInputStream(new File(fileName))) {
        Gson gson = new GsonBuilder().disableHtmlEscaping().create();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        // parse the document with Tika's PDF parser (needs PDFBox on the runtime classpath)
        PDFParser pdfparser = new PDFParser();
        pdfparser.parse(inputstream, handler, metadata, pcontext);
        // store the extracted text in the bean, then serialize the bean to JSON
        newsContentObj.setContent(handler.toString());
        out.println("contents test :" + newsContentObj.getContent());
        String Json = gson.toJson(newsContentObj);
        // out.println("String Builder :" +sContent.toString());
        out.println("JSON :" + Json);
        sendIndexer(sNewsID, Json);
    } catch (Exception ex) {
        out.println("News Id :" + sNewsID + " -> Exception :" + ex);
        ex.printStackTrace();
    }
}
private void sendIndexer(String nid, String json) {
    try {
        String url = indexserver + nid;
        StringEntity reqEntity = new StringEntity(json, "application/json", "UTF8");
        HttpPost post = new HttpPost(url);
        post.setEntity(reqEntity);
        // try-with-resources releases the client and the response connection
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse res = httpclient.execute(post)) {
            // Issue to solve: if no sleep is applied, JQC responds too quickly and
            // queries ES before it has finished indexing the new data, returning blank data.
            // The sleep below is only a temporary fix; most likely this needs to migrate
            // to the ES API to confirm that the index push actually succeeded.
            //Thread.sleep(5000);
            // Debug purpose
            // out.println("Send Indexer status: " + res.getStatusLine());
        }
    } catch (UnsupportedEncodingException uee) {
        out.println("Send Indexer encoding exception: This should not happen unless hardcoded item being changed!");
    } catch (ClientProtocolException cpe) {
        out.println("Send Indexer CPE exception: " + cpe);
    } catch (IOException ioe) {
        out.println("Send Indexer IO exception: " + ioe);
    }
}
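The commented-out `Thread.sleep(5000)` in `sendIndexer` works around Elasticsearch not yet having made the new document searchable. One possible alternative, assuming Elasticsearch 5.0 or later, is the `refresh=wait_for` query parameter on the index request, which makes the call return only once the document is visible to search. The sketch below only builds the request URL and interprets the HTTP status; `IndexRequestHelper` and its method names are hypothetical, and the document ID is assumed to be URL-safe.

```java
// Hypothetical helper; not part of the asker's code or of any Elasticsearch client library.
class IndexRequestHelper {

    // Builds "<indexServer>/<docId>", optionally asking ES to wait for the next refresh.
    static String buildIndexUrl(String indexServer, String docId, boolean waitForRefresh) {
        String base = indexServer.endsWith("/") ? indexServer : indexServer + "/";
        String url = base + docId;   // assumes docId needs no URL encoding
        return waitForRefresh ? url + "?refresh=wait_for" : url;
    }

    // The index API answers 201 for a newly created document and 200 for an update.
    static boolean isIndexed(int httpStatus) {
        return httpStatus == 200 || httpStatus == 201;
    }

    public static void main(String[] args) {
        System.out.println(buildIndexUrl("http://127.0.0.1:9200/news/TRKD/", "12345", true));
    }
}
```

In `sendIndexer`, the resulting URL would take the place of `indexserver + nid`, and `res.getStatusLine().getStatusCode()` could be passed to `isIndexed` instead of sleeping for a fixed time.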
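First question:
How do I read the input PDF file from Elasticsearch using Java code? Do I need to add anything to the XML file?
Second question:
Once connected, how do I extract the content of the PDF file? I tried the example shown in generateJsonObject, but it fails with: Exception in thread "Thread-25" java.lang.NoClassDefFoundError: org/apache/pdfbox/pdmodel/PDDocument. What should I do?
Thanks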
【Comments】:
- Add PDFBox and FontBox to your project.
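The `NoClassDefFoundError` suggests that Tika's `PDFParser` was compiled against PDFBox but the PDFBox jar is missing from the runtime classpath. A quick way to confirm what is actually available before parsing is a lookup with `Class.forName`. The sketch below is illustrative only; `ClasspathCheck` is a hypothetical name, and the FontBox class it probes for is just one representative class from that library.

```java
// Hypothetical diagnostic; reports whether the classes PDF parsing relies on can be loaded.
class ClasspathCheck {

    // Returns true if the named class can be located, without running its static initializers.
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.pdfbox.pdmodel.PDDocument",   // class named in the reported error
            "org.apache.fontbox.ttf.TrueTypeFont"     // representative FontBox class
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isPresent(name) ? "found" : "MISSING"));
        }
    }
}
```

If either class is reported as missing, the corresponding jar has to be added to the application's runtime classpath, not only to the compile-time build path.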
Tags: java elasticsearch