Integration between Nutch 1.11 (1.x) and Solr 5.3.1 (5.x): answers

[Question title]: Integration between Nutch 1.11 (1.x) and Solr 5.3.1 (5.x)
[Posted]: 2016-03-19 15:41:42
[Question description]:

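I have just started using Nutch 1.11 and Solr 5.3.1.

I want to crawl data with Nutch, then index it and make it searchable with Solr.

I know how to crawl the web with Nutch's bin/crawl command, and I have successfully fetched a lot of data from a website on my local machine.

I also started a new local Solr server by running the following command from the Solr root folder,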

bin/solr start

and created the example files core from the example folder with:

bin/solr create -c files -d example/files/conf

I can open the admin URL below and manage the files core,

http://localhost:8983/solr/#/files

so I believe Solr is running correctly. I then tried to post the Nutch data to Solr with Nutch's bin/nutch index command:

bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-params solr.server.url=127.0.0.1:8983/solr/files \
-dir crawl/segments

I was hoping that Solr 5's new Auto Schema feature would let me sit back, but I got the following error (copied from the log file):

WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
INFO  segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1.
INFO  segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2.
INFO  segment.SegmentChecker - Segment dir is complete: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3.
INFO  indexer.IndexingJob - Indexer: starting at 2015-12-14 15:21:39
INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
INFO  indexer.IndexingJob - Indexer: URL filtering: false
INFO  indexer.IndexingJob - Indexer: URL normalizing: false
INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO  indexer.IndexingJob - Active IndexWriters :
SolrIndexWriter
    solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
    solr.server.url : URL of the Solr instance (mandatory)
    solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
    solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.commit.size : buffer size when sending to Solr (default 1000)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s1
INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s2
INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: file:/user/nutch/apache-nutch-1.11/crawl/segments/s3
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/staging/user117437667/.staging/job_local117437667_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
WARN  conf.Configuration - file:/tmp/hadoop-user/mapred/local/localRunner/user/job_local117437667_0001/job_local117437667_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
INFO  solr.SolrMappingReader - source: content dest: content
INFO  solr.SolrMappingReader - source: title dest: title
INFO  solr.SolrMappingReader - source: host dest: host
INFO  solr.SolrMappingReader - source: segment dest: segment
INFO  solr.SolrMappingReader - source: boost dest: boost
INFO  solr.SolrMappingReader - source: digest dest: digest
INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
INFO  solr.SolrIndexWriter - Indexing 250 documents
INFO  solr.SolrIndexWriter - Deleting 0 documents
INFO  solr.SolrIndexWriter - Indexing 250 documents
WARN  mapred.LocalJobRunner - job_local117437667_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre>    Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>

</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:512)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

I remember that this

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html.

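is related to the Solr URL, but I double-checked the URL I use, 127.0.0.1:8983/solr/files, and I think it is correct.

Does anyone know what the problem is? I searched the web and this site and found nothing useful.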
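One detail in the log may be worth checking: the 404 page reports /solr/update, with no core name in the path, which hints that the request did not reach the files core. The URL passed via -params also has no http:// scheme. The sketch below is a hypothetical helper (solr_update_url is not part of Nutch or Solr) that only builds the update endpoint one would expect for a given base URL and core, so it can be compared with the path in the error.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or Solr): build the update endpoint
# expected for a given Solr base URL and core name.
solr_update_url() {
  base=$1
  core=$2
  # Add a scheme when the value is a bare host:port/path.
  case "$base" in
    http://*|https://*) ;;
    *) base="http://$base" ;;
  esac
  # Drop one trailing slash so the pieces join cleanly.
  base=${base%/}
  printf '%s/%s/update\n' "$base" "$core"
}

# Print the endpoint for the host and core used in the question.
solr_update_url 127.0.0.1:8983/solr files
```

If the printed path differs from the one in the 404 page, the URL that Nutch actually used is probably not the one that was intended.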

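Note: I also tried disabling Solr 5's Auto Schema feature in example/files/conf/solrconfig.xml and replacing example/files/conf/managed-schema.xml with Nutch's conf/schema.xml, and still hit the same error.

Update: after trying the DEPRECATED command bin/nutch solrindex (thanks to Thangaperumal), the previous error went away, but I ran into another one: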

bin/nutch solrindex http://127.0.0.1:8983/solr/files crawl/crawldb -linkdb crawl/linkdb crawl/segments/s1

Error message:

INFO  solr.SolrIndexWriter - Indexing 250 documents
INFO  solr.SolrIndexWriter - Deleting 0 documents
INFO  solr.SolrIndexWriter - Indexing 250 documents
INFO  solr.SolrIndexWriter - Deleting 0 documents
INFO  solr.SolrIndexWriter - Indexing 250 documents
WARN  mapred.LocalJobRunner - job_local1306504137_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Unable to invoke function processAdd in script: update-script.js: Can't unambiguously select between fixed arity signatures [(java.lang.String, java.io.Reader), (java.lang.String, java.lang.String)] of the method org.apache.solr.analysis.TokenizerChain.tokenStream for argument types [java.lang.String, null]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:134)
    at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:85)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:493)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:422)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:356)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)

[Question comments]:

Does Nutch need Hadoop to be running? I am not sure.
I ran into this problem too, using Solr 5.4.1.

Tags: solr nutch solr5

[Solution 1]:

Have you tried specifying the Solr URL with:

-D solr.server.url=http://localhost:8983/solr/files

instead of the -params approach? At least that is the correct syntax for the crawl script, and both call the same underlying Java class to do the work.

bin/nutch index crawl/crawldb \
-linkdb crawl/linkdb \
-D solr.server.url=http://127.0.0.1:8983/solr/files \
-dir crawl/segments

[Comments]:

I tried your approach and hit java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=http://127.0.0.1:8983/solr/files. I was following the bin/nutch index help message for the invocation.
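A possible reading of this exception: index 15 of that string is the `=` sign, which suggests the whole solr.server.url=... token was parsed as a path rather than as a property. Hadoop-style tools generally read -D options only before the first positional argument, so, assuming that applies to bin/nutch index as well, moving -D directly after the subcommand may help. The sketch below is a dry run that only prints the reordered command; nutch_index_cmd is a hypothetical helper, and the URL and crawl directory are the values from the question.

```shell
#!/bin/sh
# Dry-run sketch: print (do not run) a bin/nutch index call with the -D
# property placed before the positional arguments.
# SOLR_URL and CRAWL_DIR are assumed values taken from the question.
SOLR_URL="http://127.0.0.1:8983/solr/files"
CRAWL_DIR="crawl"

# Hypothetical helper: $1 is the Solr URL, $2 is the crawl directory.
nutch_index_cmd() {
  printf 'bin/nutch index -D solr.server.url=%s %s/crawldb -linkdb %s/linkdb -dir %s/segments\n' \
    "$1" "$2" "$2" "$2"
}

nutch_index_cmd "$SOLR_URL" "$CRAWL_DIR"
```

The printed line can be copied into a shell in the Nutch root folder; whether it resolves the error has not been confirmed in this thread.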

[Solution 2]:

Instead, try this command to integrate Solr and Nutch:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/

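[Comments]:

Doesn't the bin/nutch help message mark the solrindex command as ##DEPRECATED##? I also tried it with some edits to the URL, and there is progress! Let me update the question.
I guess your command would also work if you passed -params solr.server.url=127.0.0.1:8983/solr
As far as I remember, the core has to be specified in one of the Nutch conf files (something like schema.xml), so you don't need to specify the core in solr.server.url.
Well, the interesting fact is that if I use the URL without the core name, it hits the same error as before; when I add the core name, a new error appears. Do you know this second kind of error? I have added it to the question.
This error occurs because the field lists of your Nutch data and Solr do not match (misspelled fields, data type mismatches, or missing fields). You have to cross-check the fields in update-script.js.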
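As a starting point for that cross-check, it can help to find which files in the core's config directory mention update-script.js at all, since the error is raised from that script. The sketch below is a hypothetical helper (find_script_refs is not a Solr or Nutch command); the config path in the example call is the one used in the question.

```shell
#!/bin/sh
# Hypothetical helper: list the files in a Solr config directory that
# mention a given update script, so the update chain using it can be found.
find_script_refs() {
  conf_dir=$1
  script=${2:-update-script.js}
  # -r: recurse, -l: print matching file names only, -F: fixed string
  grep -rlF -- "$script" "$conf_dir" 2>/dev/null
}

# Example call, assuming the Solr root is the current directory:
#   find_script_refs example/files/conf
```

Any file it lists is a place where the script is wired in; the fields that the script touches can then be compared with the ones Nutch sends (content, title, host, segment, boost, digest, tstamp in the log above).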