【Question title】: Installing Apache Nutch on Windows
【Posted】: 2018-06-20 20:35:43
【Question】:

I am trying to integrate Apache Solr with Apache Nutch 1.14 on Windows 7 (64-bit), but I get an error when I try to run Nutch.

What I have done so far:

  • Set the JAVA_HOME environment variable to C:\Program Files\Java\jdk1.8.0_25 or C:\Progra~1\Java\jdk1.8.0_25
  • Downloaded the Hadoop WinUtils files from https://github.com/steveloughran/winutils/tree/master/hadoop-3.0.0/bin, put them in c:\winutils\bin, set the HADOOP_HOME environment variable to c:\winutil, and added the c:\winutil\bin folder to PATH.

(I also tried Hadoop WinUtils 2.7.1, without success.)
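The setup steps above can be sanity-checked with a small script. The following is only a sketch, not part of Nutch or Hadoop: `check_hadoop_env` is a hypothetical helper, and the assumption that both `winutils.exe` and `hadoop.dll` must sit in `%HADOOP_HOME%\bin` reflects the usual layout of the WinUtils downloads rather than anything stated in this thread.

```python
import ntpath
from typing import Callable, List, Mapping


def check_hadoop_env(env: Mapping[str, str],
                     exists: Callable[[str], bool]) -> List[str]:
    """Return a list of problems found in a Hadoop-on-Windows setup.

    `env` is an environment mapping (e.g. os.environ) and `exists` is a
    file-existence check (e.g. os.path.exists). Both are injected so the
    function can be exercised without touching the real system.
    """
    home = env.get("HADOOP_HOME")
    if not home:
        return ["HADOOP_HOME is not set"]

    problems = []
    bin_dir = ntpath.join(home, "bin")

    # Assumption: the native helpers are expected directly under bin\.
    for name in ("winutils.exe", "hadoop.dll"):
        path = ntpath.join(bin_dir, name)
        if not exists(path):
            problems.append(f"missing {path}")

    # Compare PATH entries case-insensitively, ignoring trailing slashes.
    def norm(p: str) -> str:
        return ntpath.normcase(ntpath.normpath(p))

    path_entries = [norm(p) for p in env.get("PATH", "").split(";") if p]
    if norm(bin_dir) not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    import os
    for problem in check_hadoop_env(os.environ, os.path.exists):
        print(problem)
```

With the values described above (files in c:\winutils\bin, but HADOOP_HOME pointing at c:\winutil), a check like this would report both files as missing, so the directory names are worth comparing character by character.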

The error I get:

$ bin/crawl -i -D http://localhost:8983/solr/ -s urls/ TestCrawl 2
  Injecting seed URLs
  /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Injector: starting at 2018-06-20 07:14:47
  Injector: crawlDb: TestCrawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:115)
    at org.apache.hadoop.mapred.LocalDistributedCacheManager.setup(LocalDistributedCacheManager.java:125)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:163)
    at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:731)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:240)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:417)
    at org.apache.nutch.crawl.Injector.run(Injector.java:563)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:528)
  Error running:
    /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Failed with exit value 1.

After downloading the hadoop-core-1.1.2.jar file from http://www.java2s.com/Code/Jar/h/Downloadhadoopcore121jar.htm and pasting it into the NUTCH_HOME/lib folder, I get the following error:

$ bin/crawl -i -D http://localhost:8983/solr/ -s urls/ TestCrawl 2
  Injecting seed URLs
  /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Injector: starting at 2018-06-20 23:19:49
  Injector: crawlDb: TestCrawl/crawldb
  Injector: urlDir: urls
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.Job.getInstance(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/hadoop/mapreduce/Job;
    at org.apache.nutch.crawl.Injector.inject(Injector.java:401)
    at org.apache.nutch.crawl.Injector.run(Injector.java:563)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:528)
  Error running:
    /home/apache-nutch-1.14/bin/nutch inject TestCrawl/crawldb urls/
  Failed with exit value 1.

If I do not set the HADOOP_HOME variable, I get the following exception:

Injector: java.io.IOException: (null) entry in command string: null chmod 0644 C:\cygwin64\home\apache-nutch-1.14\TestCrawl\crawldb\.locked
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:869)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:852)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:854)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:1154)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:59)
    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:81)
    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:178)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:398)
    at org.apache.nutch.crawl.Injector.run(Injector.java:563)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.crawl.Injector.main(Injector.java:528)

  Error running:
    /home/apache-nutch-1.14/bin/nutch inject TestCrawl//crawldb urls/
  Failed with exit value 127.

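Any help would be greatly appreciated!

【Comments】:

  • I'd be surprised if Nutch supported Hadoop 3.x.
  • I also tried the Hadoop WinUtils 2.7.1 version: github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin, with no success.
  • Which Hadoop version are you actually running? It will include a Hadoop core jar file, so there is no need to download one yourself.
  • I followed the tutorial at wiki.apache.org/nutch/NutchTutorial, and it does not mention installing Hadoop. Do you think I need to? If so, how can I install it properly on Windows? Thanks!
  • Well, you said you have a HADOOP_HOME variable, which implies you have downloaded the Hadoop binaries, not just winutils.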
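One way to answer the "which Hadoop version" question is to look at the Hadoop jars that sit in NUTCH_HOME/lib. The sketch below groups jar file names by version and flags a mix of major versions, which is a common cause of a NoSuchMethodError like the one shown in the question. The helper names and the jar names in the usage example are hypothetical and only illustrate the file-name pattern; they are not taken from an actual Nutch distribution.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Matches names such as "hadoop-common-2.7.4.jar" and captures
# the artifact ("hadoop-common") and the version ("2.7.4").
JAR_RE = re.compile(r"^(hadoop-[a-z-]+?)-(\d+(?:\.\d+)*)\.jar$")


def hadoop_versions(jar_names: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each Hadoop version found to the set of artifacts using it."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_RE.match(name)
        if match:
            versions[match.group(2)].add(match.group(1))
    return dict(versions)


def mixed_major_versions(jar_names: Iterable[str]) -> bool:
    """True if Hadoop jars from more than one major version are present."""
    majors = {version.split(".")[0] for version in hadoop_versions(jar_names)}
    return len(majors) > 1


if __name__ == "__main__":
    # Illustrative file names only.
    names = [
        "hadoop-common-2.7.4.jar",
        "hadoop-mapreduce-client-core-2.7.4.jar",
        "hadoop-core-1.1.2.jar",
        "commons-lang-2.6.jar",
    ]
    print(hadoop_versions(names))
    print(mixed_major_versions(names))
```

To run it against a real installation, pass `os.listdir(os.path.join(nutch_home, "lib"))` as `jar_names`, where `nutch_home` is the path to the unpacked Nutch directory.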

Tags: java hadoop solr nutch


【Answer 1】:

When you run the crawl, just execute the following command:

bin/crawl -s urls/ TestCrawl/ 2

After that, you can run the indexing step, passing the Solr URL to the index command as a -D property:

bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/YOURCORE TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/* -filter -normalize -deleteGone

Alternatively, you can specify it in conf/nutch-site.xml:

<property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr/YOURCORE/</value>
    <description>Defines the Solr URL into which data should be indexed using the indexer-solr plugin.</description>
</property> 
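To confirm that the property is actually present in the configuration file, it can be read back with a few lines of Python. This is a minimal sketch using only the standard library; `get_property` is a hypothetical helper, and the sample document simply wraps the fragment above in a `<configuration>` root element, which is an assumption about the surrounding file.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Sample content: the <property> block from above inside an assumed
# <configuration> root element.
SAMPLE = """<configuration>
  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr/YOURCORE/</value>
  </property>
</configuration>"""


def get_property(xml_text: str, name: str) -> Optional[str]:
    """Return the <value> of the <property> whose <name> equals `name`."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


if __name__ == "__main__":
    print(get_property(SAMPLE, "solr.server.url"))
```

For a real check, replace `SAMPLE` with the contents of conf/nutch-site.xml. A result of `None` would mean the property is not defined there.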

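【Comments】:

  • I get the same error when trying to run the first command: bin/crawl -s urls/ TestCrawl/ 2
  • I can't comment on your original post, but you have to delete the .locked file.
  • But I can't: it is generated automatically when the first command runs.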