Apache Solr: index PDF files with XML documents

【Question title】: Apache Solr: index PDF files with XML documents
【Posted】: 2020-05-02 20:01:25
【Question】:

How can I index PDF files in Apache Solr (version 8) using an XML document? Example:

<add>
<doc>
<field name="id">filePath</field>
<field name="title">the title</field>
<field name="description">description of the pdf file</field>
<field name="Creator">jhone doe</field>
<field name="Language">English</field>
<field name="Publisher">Publisher_name</field>
<field name="tags">some_tag</field>
<field name="is_published">true</field>
<field name="year">2002</field>
<field name="file">path_to_the_file/file_name.pdf</field>
</doc>
</add>

Update

How do I set literal.id to the file path?

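【Comments】:

The XML document containing the PDF file paths is generated by another program.
The most common approach is to submit the PDF files to the extracting request handler; you can then use literal.id and similar parameters to include custom data. Another option is to use the Data Import Handler: parse the XML with its XML support, then process the PDF content with TikaEntityProcessor.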
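
As a rough illustration of the first option in the comment above, the sketch below builds a request URL for the extracting request handler and passes the custom metadata as literal.* parameters. The host, the core name `mycore`, the field values, and the helper `build_extract_url` are assumptions made for illustration; `/update/extract` is the path this handler is conventionally registered under, but the actual path depends on solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, metadata, commit=True):
    """Return a URL for Solr's extracting request handler.

    Each metadata entry is sent as a literal.<field> parameter, so the
    value is indexed as-is next to the text extracted from the file.
    """
    params = {f"literal.{field}": value for field, value in metadata.items()}
    if commit:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


# Hypothetical values: host, core name and metadata are placeholders.
url = build_extract_url(
    "http://localhost:8983/solr",
    "mycore",
    {"id": "path_to_the_file/file_name.pdf", "title": "the title"},
)
print(url)
```

The PDF itself would then be sent in the body of a POST request to this URL, one request per file. Using the file path as literal.id is what gives each document an id equal to its path.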

Tags: solr

【Solution 1】:

OK, this is how I did it.

I am using the Solr DIH (Data Import Handler). In solrconfig.xml:

<requestHandler name="/dataimport_fromXML" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data.import.xml</str>
        <str name="update.chain">dedupe</str>
    </lst>
</requestHandler>

And the data.import.xml file:

<dataConfig>
    <dataSource type="BinFileDataSource" name="data"/>
    <dataSource type="FileDataSource" name="main"/>
    <document>
        <!-- url: the URL of the XML file that holds the metadata -->
        <entity name="rec" processor="XPathEntityProcessor" url="${solr.install.dir:}solr/solr_core_name/filestore/docs_metaData/metaData.xml" forEach="/docs/doc" dataSource="main" transformer="RegexTransformer,DateFormatTransformer">
            <field column="resourcename" xpath="//resourcename" name="resourceName" />
            <field column="title" xpath="//title" name="title" />
            <field column="subject" xpath="//subject" name="subject"/>
            <field column="description" xpath="//description" name="description"/>
            <field column="comments" xpath="//comments" name="comments"/>
            <field column="author" xpath="//author" name="author"/>
            <field column="keywords" xpath="//keywords" name="keywords"/>
            <!-- baseDir: path to the folder that contains the files (pdf | doc | docx | ...) -->
            <entity name="files" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="${solr.install.dir:}solr/solr_core_name/filestore/docs_folder" fileName="${rec.resourcename}" onError="skip" recursive="false">
                <field column="fileAbsolutePath" name="filePath" />
                <field column="resourceName" name="resourceName" />
                <field column="fileSize" name="size" />
                <field column="fileLastModified" name="lastModified" />
                <!-- for each file, extract metadata if it is not in the XML metadata file -->
                <entity name="file" processor="TikaEntityProcessor" dataSource="data" format="text" url="${files.fileAbsolutePath}" onError="skip" recursive="false">
                    <field column="title" name="title" meta="true"/>
                    <field column="subject" name="subject" meta="true"/>
                    <field column="description" name="description" meta="true"/>
                    <field column="comments" name="comments" meta="true"/>
                    <field column="Author" name="author" meta="true"/>
                    <field column="Keywords" name="keywords" meta="true"/>
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

After that, all you need to do is create the XML file (metaData.xml):

<docs>
    <doc>
        <resourcename>fileName.pdf</resourcename>
        <title></title>
        <subject></subject>
        <description></description>
        <comments></comments>
        <author></author>
        <keywords></keywords>
    </doc>
</docs>
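
Since the question says the XML is produced by another program, here is a minimal sketch of how such a program might generate metaData.xml in the layout shown above. The helper `build_metadata_xml` and the sample record are hypothetical; the element names are taken from the xpath expressions in data.import.xml.

```python
import xml.etree.ElementTree as ET

# Element names follow the xpath expressions used in data.import.xml.
FIELDS = ("resourcename", "title", "subject", "description",
          "comments", "author", "keywords")


def build_metadata_xml(records):
    """Build a <docs> tree with one <doc> per record (a dict of field values)."""
    root = ET.Element("docs")
    for record in records:
        doc = ET.SubElement(root, "doc")
        for name in FIELDS:
            # Missing values become empty elements, as in the template above.
            ET.SubElement(doc, name).text = record.get(name, "")
    return ET.ElementTree(root)


# Hypothetical record; resourcename must match a file name in docs_folder.
tree = build_metadata_xml([
    {"resourcename": "fileName.pdf", "title": "the title", "author": "jhone doe"},
])
ET.indent(tree)  # pretty-print; requires Python 3.9+
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
print(xml_text)
```

Calling `tree.write("metaData.xml", encoding="utf-8", xml_declaration=True)` instead of printing would save the result to the file that the `url` attribute of the `rec` entity points at.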

Put all the files in one folder:

"${solr.install.dir:}solr/solr_core_name/filestore/docs_folder"

${solr.install.dir:} is the main Solr folder (the installation directory).

Regarding the update to the question

How to set literal.id to filePath:

In data.import.xml, map fileAbsolutePath to id:

<field column="fileAbsolutePath" name="id" />

One last thing

In this example the id is generated automatically. I used:

<updateRequestProcessorChain name="dedupe">

which creates a unique id from a hash of the content to avoid duplicates.
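
To make the idea behind content-based ids concrete, the sketch below derives an id from a hash of the field values, so that indexing the same content twice yields the same id. This only illustrates the principle: the answer does not show how its dedupe chain is configured, Solr computes such signatures server-side inside the update chain, and the function `content_signature`, the choice of MD5, and the sample values are assumptions.

```python
import hashlib


def content_signature(*field_values):
    """Return a hex digest derived only from the given field values."""
    digest = hashlib.md5()
    for value in field_values:
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


first = content_signature("the title", "text extracted from the PDF")
again = content_signature("the title", "text extracted from the PDF")
other = content_signature("the title", "different text")

print(first == again)  # True: identical content gives an identical id
print(first == other)  # False: different content gives a different id
```

Note that an id generated this way replaces whatever id the import supplies, so it is an alternative to mapping fileAbsolutePath to id rather than something to combine with it.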
