Indexing PDF documents using Cloudera Search

【Question title】: Indexing PDF documents using Cloudera Search
【Posted】: 2017-05-19 08:25:08
【Question】:

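I have been trying to index PDF documents with Cloudera Search (which is based on Apache Solr). I first managed to index Twitter tweets, and then moved on to PDF files. I created the corresponding collection with solrctl and the default schema. The morphline file I am using is shown below (I have masked the IP addresses in zkHost):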

solrLocator : {
  # Name of solr collection
  #collection : collection1
  collection : pdfs

  # ZooKeeper ensemble
  #zkHost : "127.0.0.1:2181/solr"
  zkHost : "xxx.xxx.xxx.xxx:2181,xxx.xxx.xxx.xxx:2181/solr"

  # The maximum number of documents to send to Solr per network batch (throughput knob)
  # batchSize : 100
}
morphlines : [
  {
    id : morphlinepdfs
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      { detectMimeType { includeDefaultMimeTypes : true } }

      {
        solrCell {
          solrLocator : ${solrLocator}
          captureAttr : true
          lowernames : true
          capture : [id, title, author, content, content_type, subject, description, keywords, category, resourcename, url, last_modified, links]
          parsers : [ { parser : org.apache.tika.parser.pdf.PDFParser } ]
        }
      }

      { generateUUID { field : id } }

      { sanitizeUnknownSolrFields { solrLocator : ${solrLocator} } }

      { loadSolr: { solrLocator : ${solrLocator} } }
    ]
  }
]

The PDF metadata fields are present in the schema.xml file, for example:

<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>

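However, in the output of a Solr /select query I only get the content and content_type fields. How can I get all of the metadata in the Solr query results? Do I need to modify schema.xml or the morphline file? Also, can I index fields from within the PDF content itself?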

The command I use to index the PDF files is:

hadoop --config /etc/hadoop/conf.cloudera.yarn \
  jar /usr/lib/solr/contrib/mr/search-mr-1.0.0-cdh5.8.2-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j /usr/share/doc/search-1.0.0+cdh5.8.2+0/examples/solr-nrt/log4j.properties \
  --morphline-file /usr/share/doc/search-1.0.0+cdh5.8.2+0/examples/solr-nrt/test-morphlines/solrPDF.conf \
  --output-dir hdfs://xxxxxx:8020/user/root/outdir \
  --verbose \
  --go-live \
  --zk-host xxxxx:2181/solr \
  --collection pdfs \
  hdfs://xxxxxx:8020/user/root/indir

Thanks in advance.


Tags: indexing solr cloudera morphline

【Solution 1】:

I found the problem. The PDF files I was using simply did not contain any metadata. I tried other PDF files and got the expected results. Hope this helps others.
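
For anyone hitting the same symptom, one way to tell "the documents have no metadata" apart from "the schema or morphline is dropping fields" is to look at which fields each indexed document actually carries. Below is a minimal Python sketch of that check, not something from the original setup. It assumes the /select handler is queried with wt=json, so the documents arrive under response.docs. The base URL, the helper names, and the sample documents are all hypothetical; the field list is copied from the capture list in the morphline in the question.

```python
import json
from urllib.parse import urlencode

# Metadata fields copied from the morphline's capture list
# (id, content and content_type are left out on purpose).
METADATA_FIELDS = [
    "title", "author", "subject", "description", "keywords",
    "category", "resourcename", "url", "last_modified", "links",
]


def build_select_url(base_url, collection, rows=10):
    """Build a /select URL that returns all stored fields as JSON.

    base_url is an assumption, e.g. "http://localhost:8983/solr".
    """
    params = {"q": "*:*", "fl": "*", "rows": rows, "wt": "json"}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


def missing_metadata(docs, fields=METADATA_FIELDS):
    """Map each document id to the metadata fields absent from that document."""
    report = {}
    for doc in docs:
        doc_id = doc.get("id", "<no id>")
        report[doc_id] = [name for name in fields if name not in doc]
    return report


if __name__ == "__main__":
    # Hypothetical response body, shaped like Solr's wt=json output.
    body = json.loads("""
    {"response": {"numFound": 2, "docs": [
        {"id": "doc-1", "content": ["..."], "content_type": ["application/pdf"]},
        {"id": "doc-2", "content": ["..."], "content_type": ["application/pdf"],
         "title": ["Some title"], "author": "Some author"}
    ]}}
    """)
    print(build_select_url("http://localhost:8983/solr", "pdfs"))
    for doc_id, absent in missing_metadata(body["response"]["docs"]).items():
        print(doc_id, "is missing:", ", ".join(absent) or "nothing")
```

If every document is missing the same fields, the schema or morphline is the more likely suspect. If only some documents are missing them, as happened here, the source PDFs themselves probably lack that metadata.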
