Nutch: how to feed additional fields to ElasticSearch? Answers

【Question title】: Nutch: how to feed additional fields to ElasticSearch?
【Posted】: 2017-06-16 22:13:15
【Question】:

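I am using Nutch 1.13 and ES 2.4.5 to crawl a specific website and build a replacement for Google Site Search. I am new to this, so I have not deviated much from the default installation and configuration. As a result, I believe my ES index contains a standard set of fields: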

_index, _type, _id, url, title, content

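and a few others. Only url, title, and content are useful to me, since all I need is full-text search over my site. However, I would like to include more fields in ES, for example content-length or mime-type. I believe Nutch should already have them somewhere internally when it crawls. How can I feed them to the ES index?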

【Comments】:

Tags: elasticsearch web-crawler nutch

【Solution 1】:

You have to write an IndexingFilter plugin that adds these fields to the document before it is indexed.

Your IndexingFilter will look something like this:

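import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class AddField implements IndexingFilter {

    private Configuration conf;

    // Called once per document before it is handed to the index writer.
    @Override
    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Length of the extracted text in characters
        // (this is not the HTTP Content-Length header).
        String content = parse.getText();
        doc.add("pageLength", content == null ? 0 : content.length());
        // add more fields here
        // ...

        return doc;
    }

    // Boilerplate: IndexingFilter is Configurable
    @Override
    public Configuration getConf() {
        return conf;
    }

    // Boilerplate: IndexingFilter is Configurable
    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}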

You can find out how to write a similar plugin here.
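The question specifically asks about content-length and mime-type, so here is a minimal, self-contained sketch of the field-building logic such a filter might contain. It deliberately avoids Nutch types so it can run on its own: the `ExtraFieldsSketch` class, the `extraFields` method, and the field names `contentLength` and `mimeType` are all hypothetical, and a plain `Map` stands in for both the HTTP response headers and the `NutchDocument`. In a real filter the header values would presumably be read from the parse metadata (for example via `parse.getData().getContentMeta()`), which is worth verifying against your Nutch version.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the body of an IndexingFilter.filter() method.
final class ExtraFieldsSketch {

    // Builds the extra index fields from response headers and the parsed text.
    static Map<String, Object> extraFields(Map<String, String> headers, String parsedText) {
        Map<String, Object> fields = new LinkedHashMap<>();

        // Length of the extracted text, as in the answer's example.
        fields.put("pageLength", parsedText == null ? 0 : parsedText.length());

        // Content-Length: index as a number, skip if absent or malformed.
        String length = headers.get("Content-Length");
        if (length != null) {
            try {
                fields.put("contentLength", Long.parseLong(length.trim()));
            } catch (NumberFormatException e) {
                // Malformed header value: leave the field out.
            }
        }

        // Content-Type: keep only the media type, dropping parameters
        // such as "; charset=UTF-8".
        String type = headers.get("Content-Type");
        if (type != null) {
            int semi = type.indexOf(';');
            String mime = (semi >= 0 ? type.substring(0, semi) : type)
                    .trim().toLowerCase(Locale.ROOT);
            if (!mime.isEmpty()) {
                fields.put("mimeType", mime);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Length", "5120");
        headers.put("Content-Type", "text/html; charset=UTF-8");
        // prints {pageLength=12, contentLength=5120, mimeType=text/html}
        System.out.println(extraFields(headers, "Hello, Nutch"));
    }
}
```

Inside an actual `filter` method, each `fields.put(name, value)` would become `doc.add(name, value)`. Before writing a custom plugin it may also be worth checking Nutch's bundled index-more plugin, which, as far as I know, already indexes content length and content type when it is enabled in `plugin.includes`.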

【Comments】: