Solr fetchIndex command fails inexplicably on sharded nodes

【Question title】: Solr fetchIndex command fails inexplicably on sharded nodes
【Posted】: 2020-12-15 19:05:24
【Question】:

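I've run into a strange problem when invoking the fetchIndex command over REST. I'm trying to use fetchIndex to propagate data from one SolrCloud instance to another. My reading of the documentation suggests that this should be possible: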

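fetchindex

Forces the specified slave to fetch a copy of the index from its master. http://slave_host:port/solr/core_name/replication?command=fetchindex

If you like, you can pass an extra attribute such as masterUrl or compression (or any other parameter specified in the tag) to do a one-time replication from a master. This removes the need to hard-code the master in the slave.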
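As a rough illustration of the documented request shape, here is a minimal Python sketch that builds a one-off fetchindex URL. The helper name `fetchindex_url`, the host names, and port 8983 are placeholders chosen for this example and are not part of Solr; only the `command=fetchindex` and `masterUrl` parameters come from the documentation quoted above.

```python
from urllib.parse import urlencode


def fetchindex_url(slave_core_url, master_core_url=None, **extra):
    """Build a one-off fetchindex request URL for a core's replication handler.

    slave_core_url and master_core_url are expected to point at a core,
    e.g. http://slave_host:8983/solr/core_name (hypothetical values).
    """
    params = {"command": "fetchindex"}
    if master_core_url is not None:
        # masterUrl is itself a URL, so urlencode percent-encodes it as a query value.
        params["masterUrl"] = master_core_url
    # Any further one-off parameters are passed through unchanged.
    params.update(extra)
    return f"{slave_core_url.rstrip('/')}/replication?{urlencode(params)}"


if __name__ == "__main__":
    print(fetchindex_url(
        "http://slave_host:8983/solr/core_name",
        "http://master_host:8983/solr/core_name",
    ))
```

Note that the path segment after /solr/ in the documentation is core_name, so both URLs here address a core.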

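The problem I'm running into is that replication starts and then throws a number of unexpected exceptions. For example, from the "slave" node: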

2020-12-15 00:17:17.442 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Starting replication process
2020-12-15 00:17:17.445 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Number of files in latest index in master: 17
2020-12-15 00:17:17.449 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.
2020-12-15 00:17:17.449 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@C:\scratch\solr-7.7.3\example\cloud\node1\solr\techproducts_shard1_replica_n1\data\index.20201215001717446 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5577fa1; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2020-12-15 00:17:17.455 ERROR (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Error fetching file, doing one retry...:org.apache.solr.common.SolrException: Unable to download _0.si completely. Downloaded 551!=533
        at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1700)
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1580)
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1550)
        at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1030)
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:569)
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:346)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:425)
        at org.apache.solr.handler.ReplicationHandler.lambda$fetchIndex$0(ReplicationHandler.java:346)
        at java.lang.Thread.run(Thread.java:748)

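These exceptions cause the replication to abort. There are a few questions on SO that reference errors like this (solr ReplicationHandler - SnapPull failed to download files), but they don't appear to be relevant to this situation.

The problem is easy to reproduce using just a basic Solr install with no special data. I'm using Solr 7.7.3.

Steps to reproduce: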

1. Unpack Solr on the "master" machine.
2. Run ./bin/solr -e cloud to deploy the example Solr cloud. Accept all defaults except:
   - Name the collection "techproducts" instead of "gettingstarted".
   - Choose the "sample_techproducts_configs" configset.
3. Load the example techproducts data into Solr: bin/post -c techproducts ./example/exampledocs/*
4. Repeat steps 1 and 2 on another machine or VM. Do not load the techproducts data; we want to use fetchIndex to replicate it.
5. On the second machine, load Postman or your REST client of choice and invoke the fetchIndex command: GET http://<second machine>:8983/solr/techproducts/replication?command=fetchindex&masterUrl=http://<first machine>:8983/solr/techproducts

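This should produce error output like that shown above in the logs of the "slave" machine. I'm bound to Solr 7.7.3 by the task at hand, but I have tried different JVMs as well as both Windows and Linux hosts. All combinations produce the same result.

I feel like I must be missing something, but I'm not sure what. Any comments or suggestions would be very helpful.

I'm also curious how to invoke this behavior properly and programmatically through SolrJ, but that is probably best left for another question once this issue is resolved.

Edit: By reducing the number of shards/replicas in the example cloud to one, I've been able to replicate successfully using this process. I'm now investigating what needs to be done to perform these index copies on a per-shard basis, but I don't have an answer yet.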

【Question comments】:

Tags: solr lucene replication solrj solrcloud

【Answer 1】:

As it turns out, I had conflated collections and cores early in this process and didn't notice. In the REST URL provided,

GET http://:8983/solr/techproducts/replication?command=fetchindex&masterUrl=http://:8983/solr/techproducts

I passed the collection name instead of the core name. A correct example:

GET http://:8983/solr/techproducts_shard1_replica_n1/replication?command=fetchindex&masterUrl=http://:8983/solr/techproducts_shard1_replica_n1

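Of course, to replicate an entire cloud instance properly, this REST request needs to be repeated for each core. Oddly, Solr does not produce an explicit error message when the replication endpoint is invoked with a collection instead of a core; it still attempts the replication. Naturally, when multiple shards are involved, this leaves the target node trying to hit a "moving target": a request against the collection may land on any core, so the files will not match what is expected, which results in the error messages above.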
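The "repeat for each core" step can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it assumes cores are named `<collection>_shardN_replica_<suffix>` (inferred from techproducts_shard1_replica_n1 in the log above), that each target core should pull from a source core of the same shard, and that you already have the core names on each machine (for instance from the CoreAdmin STATUS action). The helper names, host names, and the replica suffixes in the usage example are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumed core naming pattern, inferred from the core name in the log above;
# other installations may name their cores differently.
CORE_RE = re.compile(r"^(?P<collection>.+)_(?P<shard>shard\d+)_replica_\w+$")


def shard_of(core_name, collection):
    """Return the shard part of a core name, or None if it does not belong to collection."""
    match = CORE_RE.match(core_name)
    if match and match.group("collection") == collection:
        return match.group("shard")
    return None


def plan_fetchindex(target_base, target_cores, source_base, source_cores, collection):
    """Build one fetchindex URL per target core, pointing at a source core of the same shard."""
    # Pick one source core per shard (the first in sorted order).
    source_by_shard = {}
    for core in sorted(source_cores):
        shard = shard_of(core, collection)
        if shard is not None:
            source_by_shard.setdefault(shard, core)

    urls = []
    for core in sorted(target_cores):
        shard = shard_of(core, collection)
        if shard in source_by_shard:
            # Both ends of the request name a core, never the collection.
            query = urlencode({
                "command": "fetchindex",
                "masterUrl": f"{source_base}/{source_by_shard[shard]}",
            })
            urls.append(f"{target_base}/{core}/replication?{query}")
    return urls


if __name__ == "__main__":
    # Hypothetical hosts and replica suffixes, for illustration only.
    for url in plan_fetchindex(
        "http://second:8983/solr",
        ["techproducts_shard1_replica_n1", "techproducts_shard2_replica_n4"],
        "http://first:8983/solr",
        ["techproducts_shard1_replica_n2", "techproducts_shard2_replica_n6"],
        "techproducts",
    ):
        print("GET", url)
```

The sketch only builds the request URLs; sending them is left to whichever REST client you already use.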

【Comments】: