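[Question title]: StormCrawler: Apache Tika for parsing PDF properties
[Posted]: 2018-10-21 07:11:42
[Question]:

I added Tika as a dependency to my StormCrawler implementation, and that lets the crawl fetch PDF documents. However, the title, authors and other properties do not get parsed. I tried different combinations for 'index.md.mapping:' and added the corresponding properties to ES_IndexInit, but the content field for the PDF documents in Kibana (index) is always empty. Everything works fine for HTML pages. Could you point me in the right direction if I am missing something, or is there an example I can look at?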


es-crawler.flux:

name: "crawler"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 4
  - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE

  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE

  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]

  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parse"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"

  - from: "parse"
    to: "redirection_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "redirection_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  - from: "parser_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

es-injector.flux:

name: "injector"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false

  - resource: false
    file: "crawler-conf.yaml"
    override: true

  - resource: false
    file: "es-conf.yaml"
    override: true

  - resource: false
    file: "injection-conf.yaml"
    override: true

components:
  - id: "scheme"
    className: "com.digitalpebble.stormcrawler.util.StringTabScheme"
    constructorArgs:
      - DISCOVERED

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "seeds.txt"
      - ref: "scheme"

bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]

pom.xml:

<modelVersion>4.0.0</modelVersion>
<groupId>xyz.com</groupId>
<artifactId>search</artifactId>
<version>search1.0</version>
<packaging>jar</packaging>

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>1.3.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>exec</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <executable>java</executable>
                <includeProjectDependencies>true</includeProjectDependencies>
                <includePluginDependencies>false</includePluginDependencies>
                <classpathScope>compile</classpathScope>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>1.3.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <createDependencyReducedPom>false</createDependencyReducedPom>
                        <transformers>
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                            <transformer
                                implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                              <mainClass>org.apache.storm.flux.Flux</mainClass>
                              <manifestEntries>
                                <Change></Change>
                                <Build-Date></Build-Date>
                              </manifestEntries>
                            </transformer>
                        </transformers>
                        <!-- The filters below are necessary if you want to include the Tika
                            module -->
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

<dependencies>
    <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>storm-core</artifactId>
        <version>1.1.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.storm</groupId>
        <artifactId>flux-core</artifactId>
        <version>1.0.2</version>
    </dependency>
    <dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-core</artifactId>
        <version>1.7</version>
    </dependency>
    <dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-elasticsearch</artifactId>
        <version>1.7</version>
    </dependency>
    <dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-tika</artifactId>
        <version>1.7</version>
    </dependency>
</dependencies>

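[Comments]:

  • Could you share a URL for which fields are missing? Did you reload the index definition in Kibana? (If you add a field, it is not refreshed automatically.) Have you tried debugging, e.g. with Eclipse while running the topology in local mode? How do you connect the Tika bolts to the rest of the topology? Do you get the text content of the PDFs?
  • Thanks for the reply, Julien. I am trying with local URLs, but the PDF I am working on is: adobe.com/digitalimag/pdfs/about_metadata.pdf. I did refresh the definition in Kibana. I haven't tried debugging yet [I will try with the master branch]. PS: I have added my files to the original question.
  • I tried testing the PDF URL with the test case "public class ParserBoltTest", but it always fails with a null pointer exception. Apart from putting the right URL in parse("..."), do these test cases need any other prerequisites to be configured?

Tags: web-crawler apache-tika stormcrawler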


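[Answer 1]:

Your pom and Flux files look OK. You could do the injection as part of the main Flux topology to keep things simple.

What is in crawler-conf.yaml? Are you prefixing the field names with "parse."?

Here is the metadata extracted from the URL you posted above: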

parse.dcterms:modified: 2004-09-29T20:21:18Z
parse.pdf:PDFVersion: 1.4
parse.access_permission:can_print: true
parse.pdf:docinfo:subject: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.pdf:docinfo:modified: 2004-09-29T20:21:18Z
parse.access_permission:extract_for_accessibility: true
parse.created: Fri Sep 24 15:56:30 BST 2004
parse.pdf:docinfo:created: 2004-09-24T14:56:30Z
parse.xmpTPg:NPages: 7
parse.access_permission:fill_in_form: true
parse.producer: Adobe PDF Library 6.0
parse.pdf:docinfo:title: About Metadata
parse.pdf:docinfo:producer: Adobe PDF Library 6.0
parse.dc:format: application/pdf; version=1.4
parse.access_permission:assemble_document: true
parse.access_permission:modify_annotations: true
parse.dc:title: About Metadata
parse.access_permission:can_print_degraded: true
parse.xmpMM:DocumentID: adobe:docid:indd:de7d50b0-0fc1-11d9-b0d4-cd42e793ca90
parse.xmpMM:DerivedFrom:DocumentID: adobe:docid:indd:a04d199f-0f11-11d9-b74d-bb0abf4f1ab0
parse.title: About Metadata
parse.Creation-Date: 2004-09-24T14:56:30Z
parse.modified: 2004-09-29T20:21:18Z
parse.resourceName: /digitalimag/pdfs/about_metadata.pdf
parse.dc:description: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.Last-Save-Date: 2004-09-29T20:21:18Z
parse.creator: Adobe Systems Incorporated
parse.pdf:encrypted: false
parse.trapped: False
parse.pdf:docinfo:creator: Adobe Systems Incorporated
parse.date: 2004-09-29T20:21:18Z
parse.meta:save-date: 2004-09-29T20:21:18Z
parse.Author: Adobe Systems Incorporated
parse.X-Parsed-By: org.apache.tika.parser.DefaultParser
parse.X-Parsed-By: org.apache.tika.parser.pdf.PDFParser
parse.pdf:docinfo:creator_tool: Adobe InDesign CS (3.0.1)
parse.dcterms:created: 2004-09-24T14:56:30Z
parse.access_permission:can_modify: true
parse.subject: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.meta:author: Adobe Systems Incorporated
parse.access_permission:extract_content: true
parse.xmp:CreatorTool: Adobe InDesign CS (3.0.1)
parse.dc:creator: Adobe Systems Incorporated
parse.cp:subject: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 
parse.pdf:docinfo:trapped: False
parse.meta:creation-date: 2004-09-24T14:56:30Z
parse.xmpMM:DerivedFrom:InstanceID: de7d50af-0fc1-11d9-b0d4-cd42e793ca90
parse.Last-Modified: 2004-09-29T20:21:18Z
parse.Content-Type: application/pdf
parse.description: By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient. A wide variety of industries use metadata, but for the purposes of digital imaging, there are currently only a few technical structures or schema that are being employed. A schema is a set of properties and their defined meanings, such as the type of value (date, size, URL, or any useful designation). 

Your conf should contain something like

  indexer.md.mapping:
  - parse.title=title
  - parse.Author=author

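As you can guess from the code of the test cases, you need to add the file to external/tika/src/test/resources/ and refer to its file name in the test code, as with about_metadata.pdf in the example below:

@Test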
public void testMetadata() throws IOException {

    bolt.prepare(new HashMap(), TestUtil.getMockedTopologyContext(),
            new OutputCollector(output));

    parse("https://www.adobe.com/digitalimag/pdfs/about_metadata.pdf",
            "about_metadata.pdf");

    List<List<Object>> outTuples = output.getEmitted();

    // single document
    Assert.assertEquals(1, outTuples.size());
    // metadata
    Metadata md = (Metadata) outTuples.get(0).get(2);
    Assert.assertTrue(
            md.getFirstValue("parse.pdf:docinfo:subject").contains(
                    "By simple definition, metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient."));

}

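Update

Looking at it more closely, the problem is in your Flux file. The redirection bolt sends tuples to Tika on a bespoke stream named "tika", so a subscription without that streamId never receives them. The definition should therefore be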

from: "redirection_bolt"
to: "parser_bolt"
grouping:
  type: LOCAL_OR_SHUFFLE
  streamId: "tika"
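
Put together with the rest of the question's es-crawler.flux, the parser-related streams might look like the sketch below. The bolt ids are the ones from the question. The direct "parse" to "index" stream is left out on the assumption that the redirection bolt already forwards documents handled by JSoup to the indexer, and the "parser_bolt" to "status" stream is an assumption made by analogy with the other bolts, not something confirmed in this thread.

```yaml
# Sketch of the parser-related streams only; merge into the existing
# "streams:" list of es-crawler.flux rather than replacing it.
streams:
  # JSoup hands every document over to the redirection bolt.
  - from: "parse"
    to: "redirection_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE

  # Documents JSoup could handle continue to the indexer on the default stream.
  - from: "redirection_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  # Everything else is emitted on the "tika" stream, which must be named explicitly.
  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"

  # Documents parsed by Tika are then indexed.
  - from: "parser_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE

  # Assumption: the Tika bolt reports status updates like the other bolts do.
  - from: "parser_bolt"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
```

If PDFs still do not reach Tika with this wiring, it may be worth checking whether JSoupParserBolt is treating non-HTML documents as errors; as far as I recall, the Tika module's README mentions a jsoup.treat.non.html.as.error setting for this, but verify that against the version you are using.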

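[Comments]:

  • Thanks Julien. I was able to work around this by using Tika for both HTML and PDF and removing the JSoup parser from es-crawler.flux. I am still not sure what I was doing wrong when using JSoup and Tika together.
  • @V_P You don't need to connect JSoup to the indexer, the redirection bolt does that ``` from: "parse" to: "index" grouping: type: LOCAL_OR_SHUFFLE ```
  • I haven't had a chance to check this because I ran into another issue with SC. I can use Tika for now, but I will try this out; I am fairly sure this update should fix it.
  • SC has a setup for a localhost ES that works perfectly with --local and --remote. When I change it to write to the Dev Elastic Search server (cluster) and update ES_IndexInit and es_conf to have the URL and cluster name properties, it never writes to that server. I am not sure what I am missing. I don't see any errors in the logs.
  • SC as downloaded has a setup for a localhost ES that works perfectly with --local and --remote. When I changed it to write to the Dev Elastic Search server (cluster), I updated ES_IndexInit and es_conf.yaml (es.indexer.addresses and cluster.name) with the correct property values. ES_IndexInit creates the indices on that server as expected, but the crawler never writes to that server. I don't see any errors in the logs, and I see the URLs showing up as DISCOVERED in the logs. It doesn't even write to the status index. I am not sure what I am missing.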