【Question title】: Nutch fetch command not fetching data
【Posted】: 2016-01-18 04:30:19
【Description】:

I have a cluster set up with the following software stack:

nutch-branch-2.3.1, gora-hbase 0.6.1, hadoop 2.5.2, hbase-0.98.8-hadoop2

So the sequence of commands is: inject, generate, fetch, parse, updatedb. The first two, inject and generate, work fine, but the fetch command, even though it completes successfully, does not fetch any data, and because the fetch step produces nothing, the subsequent steps fail as well.
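For reference, the sequence I run looks roughly like the following (the seed directory and -topN value are illustrative placeholders, not my exact values; <batchId> stands for the id that GeneratorJob prints):

# One crawl cycle on nutch 2.3.1 / gora-hbase, roughly as I run it
bin/nutch inject /user/nutch/seeds -crawlId 1

# GeneratorJob prints "generated batch id: ..." at the end of its log
bin/nutch generate -topN 50000 -crawlId 1

# The remaining steps take that batch id (<batchId> is a placeholder)
bin/nutch fetch <batchId> -crawlId 1 -threads 50
bin/nutch parse <batchId> -crawlId 1
bin/nutch updatedb <batchId> -crawlId 1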

The counter logs for each job are below:

Inject job:

2016-01-08 14:12:45,649 INFO  [main] mapreduce.Job: Counters: 31
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=114853
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=836443
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=2
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=179217
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=59739
        Total vcore-seconds taken by all map tasks=59739
        Total megabyte-seconds taken by all map tasks=183518208
    Map-Reduce Framework
        Map input records=29973
        Map output records=29973
        Input split bytes=94
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=318
        CPU time spent (ms)=24980
        Physical memory (bytes) snapshot=427704320
        Virtual memory (bytes) snapshot=5077356544
        Total committed heap usage (bytes)=328728576
    injector
        urls_injected=29973
    File Input Format Counters 
        Bytes Read=836349
    File Output Format Counters 
        Bytes Written=0

Generate job:

2016-01-08 14:14:38,257 INFO  [main] mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=137140
        FILE: Number of bytes written=623942
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=937
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=1
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=2
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=43788
        Total time spent by all reduces in occupied slots (ms)=305690
        Total time spent by all map tasks (ms)=14596
        Total time spent by all reduce tasks (ms)=61138
        Total vcore-seconds taken by all map tasks=14596
        Total vcore-seconds taken by all reduce tasks=61138
        Total megabyte-seconds taken by all map tasks=44838912
        Total megabyte-seconds taken by all reduce tasks=313026560
    Map-Reduce Framework
        Map input records=14345
        Map output records=14342
        Map output bytes=1261921
        Map output materialized bytes=137124
        Input split bytes=937
        Combine input records=0
        Combine output records=0
        Reduce input groups=14342
        Reduce shuffle bytes=137124
        Reduce input records=14342
        Reduce output records=14342
        Spilled Records=28684
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=1299
        CPU time spent (ms)=39600
        Physical memory (bytes) snapshot=2060779520
        Virtual memory (bytes) snapshot=15215738880
        Total committed heap usage (bytes)=1864892416
    Generator
        GENERATE_MARK=14342
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
2016-01-08 14:14:38,429 INFO  [main] crawl.GeneratorJob: GeneratorJob: finished at 2016-01-08 14:14:38, time elapsed: 00:01:47
2016-01-08 14:14:38,431 INFO  [main] crawl.GeneratorJob: GeneratorJob: generated batch id: 1452242570-1295749106 containing 14342 URLs

Fetch job:

../nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1452242566-14060 -crawlId 1 -threads 50


2016-01-08 14:14:43,142 INFO  [main] fetcher.FetcherJob: FetcherJob: starting at 2016-01-08 14:14:43
2016-01-08 14:14:43,145 INFO  [main] fetcher.FetcherJob: FetcherJob: batchId: 1452242566-14060
2016-01-08 14:15:53,837 INFO  [main] mapreduce.Job: Job job_1452239500353_0024 completed successfully
2016-01-08 14:15:54,286 INFO  [main] mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=44
        FILE: Number of bytes written=349279
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1087
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=1
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters 
        Launched map tasks=1
        Launched reduce tasks=2
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=30528
        Total time spent by all reduces in occupied slots (ms)=136535
        Total time spent by all map tasks (ms)=10176
        Total time spent by all reduce tasks (ms)=27307
        Total vcore-seconds taken by all map tasks=10176
        Total vcore-seconds taken by all reduce tasks=27307
        Total megabyte-seconds taken by all map tasks=31260672
        Total megabyte-seconds taken by all reduce tasks=139811840
    Map-Reduce Framework
        Map input records=0
        Map output records=0
        Map output bytes=0
        Map output materialized bytes=28
        Input split bytes=1087
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=28
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =2
        Failed Shuffles=0
        Merged Map outputs=2
        GC time elapsed (ms)=426
        CPU time spent (ms)=11140
        Physical memory (bytes) snapshot=1884893184
        Virtual memory (bytes) snapshot=15245959168
        Total committed heap usage (bytes)=1751646208
    FetcherStatus
        HitByTimeLimit-QueueFeeder=0
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=0
    File Output Format Counters 
        Bytes Written=0
2016-01-08 14:15:54,314 INFO  [main] fetcher.FetcherJob: FetcherJob: finished at 2016-01-08 14:15:54, time elapsed: 00:01:11

Please advise.

【Question comments】:

    Tags: hadoop hbase nutch


    【Answer 1】:

    It has been a while since I used Nutch, but from memory there is an interval before a page can be fetched again. For example, if you crawl http://helloworld.com today and then issue the fetch command again the same day, it may complete without fetching anything, because http://helloworld.com is not due to be re-fetched until x days later (I forget the default interval).

    I think you can get around this by clearing the crawl_db and retrying - or there may now be a command or setting that reduces that interval to 0.
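    If it helps, a rough and untested sketch of what I mean with the gora-hbase backend; the table name 1_webpage for crawlId "1", the seed path, and the property name are assumptions on my part and may differ in your setup:

    # "Clearing the crawl_db" with gora-hbase means wiping the web table;
    # for crawlId "1" it is usually named 1_webpage (check with `list` in the hbase shell first)
    echo "truncate '1_webpage'" | hbase shell

    # Then re-inject with a short re-fetch interval (in seconds) so that
    # already-fetched URLs become due again quickly
    bin/nutch inject -D db.fetch.interval.default=120 /user/nutch/seeds -crawlId 1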

    【Comments】:

    • pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch may be useful. It is a bit old, but it explains how the re-fetch interval affects the fetch process.
    • Thanks for the quick reply @andrew.butkus. I tried setting db.fetch.interval.default to 120 seconds, but it did not work.
    • Do the URLs you are fetching have a robots.txt? The target site may be refusing the crawl as part of Nutch's web-politeness behaviour.
    • Thanks for the reply, Andrew. Yes, that could be a reason, but I don't think it applies in my case, because I added a set of ~29K URLs to crawl and it did not fetch even one of them.
    • Keep in mind the page above is quite old, and the parameters may have changed in your Nutch version. Also, if you only add the interval setting now, I suspect your crawl_db is already bound to the default interval, so the change will have no effect unless you clear the crawl_db and start again. Out of curiosity, have you set a crawl depth?
    【Answer 2】:

    Finally, after several hours of digging, I found that the problem is caused by a bug in Nutch: "The batch id passed to GeneratorJob via the option/parameter -batchId <id> is ignored, and a generated batch id is used to mark the current batch." You can see this in the logs above: GeneratorJob reports the generated batch id 1452242570-1295749106, while fetch was run with batchId 1452242566-14060, so the fetcher finds no records for that batch (Map input records=0). The issue is tracked at https://issues.apache.org/jira/browse/NUTCH-2143
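    One way to work around it (a sketch, assuming FetcherJob in 2.3.1 still accepts -all in place of a batch id) is to fetch using the batch id that GeneratorJob actually printed, or to fetch all generated entries:

    # Use the batch id GeneratorJob printed in its log
    # ("generated batch id: 1452242570-1295749106" above),
    # not the one passed on the generate command line
    bin/nutch fetch 1452242570-1295749106 -crawlId 1 -threads 50

    # Or, assuming -all is still supported, fetch every generated batch
    bin/nutch fetch -all -crawlId 1 -threads 50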

    Special thanks to andrew-butkus :)

    【Comments】:
