【Title】: What is the best way to freshen a Nutch index?
【Posted】: 2009-03-12 19:59:41
【Question】:

I haven't looked at Nutch in a year or so, and it appears to have changed quite a bit. The documentation on re-crawling is unclear. What is the best way to update an existing Nutch index?

【Comments】:

    Tags: indexing nutch full-text-indexing


    【Solution 1】:

    This script is loosely based on the one in the Nutch FAQ, which didn't work for me at first:

    #!/bin/sh
    #
    # Automate crawling my site
    #
    crawldir=./crawl
    urldir=./urls
    NUTCH_HOME=${NUTCH_HOME:=.}
    
    nutch=$NUTCH_HOME/bin/nutch
    
    # Make sure the crawl directories exist
    mkdir -p $crawldir/crawldb $crawldir/segments $crawldir/linkdb
    
    # Inject the initial urls
    $nutch inject $crawldir/crawldb $urldir
    
    depth=1
    while true ; do
      echo "beginning crawl at depth $depth"
      echo "-generate"
      $nutch generate $crawldir/crawldb $crawldir/segments
      if [ $? -ne 0 ] ; then
        echo "finishing at depth $depth - no more urls"
        break
      fi
    
      segment=$(/bin/ls -rtd $crawldir/segments/* | tail -1)
    
      echo "$nutch fetch $segment"
      $nutch fetch $segment
      if [ $? -ne 0 ] ; then
        echo "fetch failed at depth $depth, deleting segment"
        rm -rf $segment
        continue
      fi
    
      echo "$nutch updatedb $crawldir/crawldb $segment"
      $nutch updatedb $crawldir/crawldb $segment
      depth=$((depth + 1))
    done
    
    echo "$nutch mergesegs $crawldir/MERGEDsegs $crawldir/segments/*"
    $nutch mergesegs $crawldir/MERGEDsegs $crawldir/segments/*
    if [ $? -eq 0 ] ; then
      rm -rf $crawldir/segments/*
      mv $crawldir/MERGEDsegs/* $crawldir/segments
      rmdir $crawldir/MERGEDsegs
    else
      echo "Something went wrong"
      exit 1
    fi
    
    echo "$nutch invertlinks $crawldir/linkdb -dir $crawldir/segments"
    $nutch invertlinks $crawldir/linkdb -dir $crawldir/segments
    
    echo "$nutch index $crawldir/NEWindexes $crawldir/crawldb $crawldir/linkdb $crawldir/segments/*"
    $nutch index $crawldir/NEWindexes $crawldir/crawldb $crawldir/linkdb \
    $crawldir/segments/*
    
    echo "$nutch dedup $crawldir/NEWindexes"
    $nutch dedup $crawldir/NEWindexes
    
    echo "$nutch merge $crawldir/MERGEDindexes $crawldir/NEWindexes"
    $nutch merge $crawldir/MERGEDindexes $crawldir/NEWindexes
    
    mv $crawldir/index $crawldir/OLDindexes
    mv $crawldir/MERGEDindexes $crawldir/index
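
    Note that how much the script above actually freshens is governed by Nutch's re-fetch interval: the `generate` step only selects URLs whose interval has expired, so unchanged pages within the window are skipped. In Nutch 1.x the property is `db.fetch.interval.default` (value in seconds, 30 days by default); older 0.x releases used `db.default.fetch.interval` in days. A sketch of overriding it in `conf/nutch-site.xml`, using 7 days as an example value:

    ```xml
    <property>
      <name>db.fetch.interval.default</name>
      <!-- Re-fetch pages after 7 days (604800 s) instead of the 30-day default -->
      <value>604800</value>
      <description>Default interval, in seconds, between re-fetches of a page.</description>
    </property>
    ```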
    

    【Discussion】:

      【Solution 2】:

      We use Nutch together with Solr. Our Nutch index is about 80 MB and covers 5,000 sites. So far, the best way we've found to re-crawl is to delete the index and recreate it from scratch.
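
      A delete-and-recreate recrawl like this can be scripted in a few lines. This is a sketch, not a tested setup: it assumes Nutch's one-shot `crawl` command, the `solrindex` job from Nutch 1.0, and a Solr instance at `http://localhost:8983/solr` (adjust the URL, paths, depth, and topN to your installation):

      ```shell
      #!/bin/sh
      # Throw away the old crawl data and rebuild from scratch.
      crawldir=./crawl
      urldir=./urls
      solr=http://localhost:8983/solr   # assumed Solr URL
      NUTCH_HOME=${NUTCH_HOME:=.}

      rm -rf "$crawldir"

      # One-shot crawl: inject, then generate/fetch/updatedb rounds, then invertlinks.
      "$NUTCH_HOME"/bin/nutch crawl "$urldir" -dir "$crawldir" -depth 3 -topN 1000

      # Push the fresh crawl into Solr (Nutch 1.0+).
      "$NUTCH_HOME"/bin/nutch solrindex "$solr" "$crawldir/crawldb" \
        "$crawldir/linkdb" "$crawldir"/segments/*
      ```

      Rebuilding from scratch sidesteps segment merging and dedup entirely, at the cost of re-fetching every page each run, which is workable for an index of this size.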

      【Discussion】:
