Nutch not indexing a specific tag in Solr: answers

【Question title】: Nutch not indexing a specific tag in Solr
【Posted】: 2016-05-02 13:53:26
【Question】:

I am using the extractor plugin: https://github.com/BayanGroup/nutch-custom-search. I followed the steps described on GitHub. Here is my configuration: 1) extractors.xml title" />

2) nutch-site.xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|metatags|msexcel|msword|mspowerpoint|pdf)|extractor|scoring-opic|index-(basic|anchor|more|metadata)|query-(basic|site|url|lang)|urlnormalizer-(pass|regex|basic)</value>
</property>
3) I added a field to schema.xml in both Solr and Nutch: <field name="aakashtitle" type="string" stored="true" indexed="true" multiValued="true"/>
4) I added the plugin to parse-plugins.xml.
I am not getting any errors, but my data is not being indexed in Solr.
Please help, and thanks!

【Comments on the question】:

1) extractors.xml

Tags: solr nutch

【Solution 1】:

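I took a quick look at the GitHub repository. Since the code works like a normal ParseFilter, you should be able to use the parsechecker command to check whether the data is being extracted correctly: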

$ bin/nutch parsechecker <URL>

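This should output the usual data Nutch extracts (content type, signature, URL) and the ParseData (status, title, outlinks, etc.), along with any additional information extracted by the plugin.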

You can also use the indexchecker command:

$ bin/nutch indexchecker <URL>

This outputs the actual fields that would be indexed by the active indexing plugin (Solr/ES).

【Comments】:

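Thanks! It is working now. But I would like to build our own plugin for Nutch to extract specific tags. Any ideas?
You can implement your own plugin as an HTMLParseFilter. If you do, take a look at github.com/apache/nutch/blob/master/src/plugin/headings/src/…, which is a basic extraction plugin that is easy to understand. You can also use issues.apache.org/jira/browse/NUTCH-1870. It is a WIP and you may need to adjust the patch to work on trunk/1.11, but it is a good starting point that lets you specify the data to extract using XPath.
Thank you very much! I will now try to write my own plugin to extract specific tags.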
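
To make the XPath idea from the comment above concrete, here is a minimal, self-contained sketch that uses only the JDK's `javax.xml.xpath` API. It is not the NUTCH-1870 patch and it does not use any Nutch classes: `TagExtractor` and `extract` are hypothetical names introduced for illustration, and the sample markup is made up. It also assumes well-formed input, which real crawled HTML often is not. In an actual HTMLParseFilter, the DOM would presumably be supplied by Nutch rather than parsed from a string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: evaluates an XPath expression
// against well-formed markup and returns the trimmed text of each match.
final class TagExtractor {

    static List<String> extract(String xhtml, String xpathExpr) throws Exception {
        // Parse the markup into a DOM tree (assumes well-formed input).
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Evaluate the expression and collect every matching node.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page, used only to exercise the helper.
        String page = "<html><head><title>Sample page</title></head>"
                + "<body><h1>First heading</h1><p>Body text</p>"
                + "<h1>Second heading</h1></body></html>";
        System.out.println(extract(page, "//h1"));
        System.out.println(extract(page, "/html/head/title"));
    }
}
```

Each extracted value could then be stored under a metadata key of your choosing and checked with the parsechecker and indexchecker commands from the answer.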