【Posted】: 2019-03-08 21:13:30
【Question】:
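I have a fairly common task: there are a few thousand websites, and I have to parse as many of them as possible (in an appropriate manner, of course).
First, I set up a configuration similar to stormcrawlerfight, using the JSoup parser. Throughput was very good and very stable, at about 8k fetches per minute.
Then I wanted to add the ability to parse PDF/doc/etc., so I added the Tika parser to handle non-HTML documents. But I saw these kinds of metrics:
Some minutes are good, and in others the rate drops to a few hundred fetches per minute. When I remove the Tika stream entries from the topology, everything goes back to normal. So the general question is: how do I find the cause of this behaviour, the bottleneck? Maybe I missed some setting?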
es-injector.flux:
name: "injector"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-custom-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.spout.FileSpout"
    parallelism: 1
    constructorArgs:
      - "."
      - "feeds.txt"
      - true

bolts:
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "status"
    grouping:
      type: CUSTOM
      customClass:
        className: "com.digitalpebble.stormcrawler.util.URLStreamGrouping"
        constructorArgs:
          - "byHost"
      streamId: "status"
es-crawler.flux:
name: "crawler"

includes:
  - resource: true
    file: "/crawler-default.yaml"
    override: false
  - resource: false
    file: "crawler-custom-conf.yaml"
    override: true
  - resource: false
    file: "es-conf.yaml"
    override: true

spouts:
  - id: "spout"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
    parallelism: 10

bolts:
  - id: "partitioner"
    className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
    parallelism: 1
  - id: "sitemap"
    className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
    parallelism: 1
  - id: "parse"
    className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
    parallelism: 5
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1
  - id: "status"
    className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
    parallelism: 4
  - id: "status_metrics"
    className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
    parallelism: 1
  - id: "redirection_bolt"
    className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
    parallelism: 1
  - id: "parser_bolt"
    className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
    parallelism: 1

streams:
  - from: "spout"
    to: "partitioner"
    grouping:
      type: SHUFFLE
  - from: "spout"
    to: "status_metrics"
    grouping:
      type: SHUFFLE
  - from: "partitioner"
    to: "fetcher"
    grouping:
      type: FIELDS
      args: ["key"]
  - from: "fetcher"
    to: "sitemap"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "sitemap"
    to: "parse"
    grouping:
      type: LOCAL_OR_SHUFFLE
  # This is not needed as long as redirect_bolt is sending html content to index?
  # - from: "parse"
  #   to: "index"
  #   grouping:
  #     type: LOCAL_OR_SHUFFLE
  - from: "fetcher"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "sitemap"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "index"
    to: "status"
    grouping:
      type: FIELDS
      args: ["url"]
      streamId: "status"
  - from: "parse"
    to: "redirection_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "redirection_bolt"
    to: "parser_bolt"
    grouping:
      type: LOCAL_OR_SHUFFLE
      streamId: "tika"
  - from: "redirection_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
  - from: "parser_bolt"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
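Update: I found OutOfMemory errors in worker.log. Even though I set worker.heap.size to 4 GB, the worker process grows to 10-15 GB.
Update 2: After limiting memory usage, I no longer see OutOfMemory errors, but performance is very low.
Without Tika, I see 15k fetches per minute. With Tika, after an initial high bar in the chart, only a few hundred per minute.
I see this in the worker log: https://paste.ubuntu.com/p/WKBTBf8HMV/
CPU usage is very high, but there is nothing in the logs.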
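For the memory limit mentioned in the updates, here is a minimal sketch of settings that could go into crawler-custom-conf.yaml, assuming the usual StormCrawler layout with a top-level `config:` key. All values are illustrative assumptions, not the settings used in the question. Note that `-Xmx` caps the Java heap only, so the process can still use more memory than that, and that truncating content may cause some binary documents to fail to parse.

```yaml
config:
  # JVM options passed to each worker; -Xmx caps the Java heap only (illustrative value)
  topology.worker.childopts: "-Xmx4g -Djava.net.preferIPv4Stack=true"
  # Maximum number of bytes kept per fetched document (illustrative value)
  http.content.limit: 1048576
  # Upper bound on un-acked tuples per spout task, which limits how much
  # content is in flight inside the topology (illustrative value)
  topology.max.spout.pending: 100
```

Lower values trade throughput for a smaller memory footprint, so they would need to be tuned against the fetch rate seen in the metrics.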
【Comments】:
-
Log lines like "Incorrect mimetype - passing on: alliedmotion.com/wp-content/uploads/documents/…" are nothing to worry about. They just mean that the JSoup parser received a non-HTML document and is passing it on to the Tika parser.
-
"Could not find unacked tuple for ..." -> not a big issue, see github.com/DigitalPebble/storm-crawler/issues/689
-
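The Storm UI shows no obvious bottleneck, and judging from the logs the Fetcher does not have much work to do. Maybe look at the metrics for the spouts and check how long the queries take? It could be that, with the extra CPU load from Tika parsing, your machine's CPU is maxed out and ES is struggling to return results in time.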
-
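@Julien, the problem is that there are millions of URLs waiting. When I disable Tika and restart the crawler, I get full CPU load but 15k requests per minute. With Tika, I sometimes get low CPU load and sometimes high CPU load, but the speed is 1/10 of the non-Tika speed. I assume Tika only parses the pdf/doc documents. Can Tika alone have such a big impact on speed?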
-
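It looks that way, but I am surprised it is not reflected in the capacity metrics. Maybe set the log level to DEBUG and see whether anything interesting turns up.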
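To make sure Tika really only handles PDF and Office documents, as assumed in the comments above, its input can be restricted by MIME type. This is a sketch only: it assumes a StormCrawler release whose Tika module supports `parser.mimetype.whitelist` (check the Tika module README for your version), and the regular expressions are illustrative.

```yaml
config:
  # Let JSoupParserBolt pass non-HTML documents on instead of treating them as errors
  jsoup.treat.non.html.as.error: false
  # Assumed setting: the Tika ParserBolt only parses documents whose MIME type
  # matches one of these regular expressions (illustrative patterns)
  parser.mimetype.whitelist:
    - application/.*pdf.*
    - application/.+word.*
```

If throughput recovers with a narrow whitelist, that would suggest the slowdown comes from the kinds of documents reaching Tika rather than from the extra bolts themselves.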