Apache Nutch 2.1 - How to get the complete source code

【Question title】: Apache Nutch 2.1 - How to get the complete source code
【Posted】: 2024-01-24 08:13:01
【Question】:

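I am trying to write my own Nutch plugin for crawling web pages. The problem is that I need to determine whether a web page contains certain special tags. The official documentation says that Document.getElementsByTagName("foo") can be used for this, but it does not work for me. Do you have any ideas?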

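My second question: once I have identified such a tag, I want to get some other tags from the same web page. Is there a way to store the complete source code of a web page as it was at the moment it was crawled?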

Thanks, Jan.

【Comments on the question】:

OK, my mistake... the second question is solved: -*.com/questions/5123757/…-*.com/questions/10007178/…

Tags: apache tags nutch web-crawler

【Solution 1】:

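If you want to extract content based on HTML tags, take a look at the xpath-filter plugin: http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ You write an XPath query and configure it in the plugin to extract the information you need.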
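To give a feel for what such an XPath query looks like, here is a minimal sketch that uses only the JDK's own XML and XPath classes. It is not the xpath-filter plugin's code or configuration; the class `XPathSketch`, the method `extract`, and the sample markup are hypothetical names made up for illustration. It also assumes well-formed XHTML, whereas real pages usually need a lenient HTML parser first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch or the xpath-filter plugin.
class XPathSketch {

    // Evaluates an XPath expression against well-formed XHTML and returns the
    // string value of the first matching node ("" when nothing matches).
    static String extract(String xhtml, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, dom).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Sample page, invented for this sketch.
        String page = "<html><body>"
                + "<div id=\"menu\">Home</div>"
                + "<div id=\"content\"><p>Hello <b>Nutch</b></p></div>"
                + "</body></html>";
        // Select the div whose id attribute is "content" and print its text.
        System.out.println(extract(page, "//div[@id='content']"));
    }
}
```

The expression `//div[@id='content']` selects the same element as the CSS selector `div#content`, so the same query idea carries over whichever extraction route you pick.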

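Another option is to write a plugin (as you are doing now) and use an HTML/XML parser to get the information. This is what I did when I needed to extract some content from a specific div: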

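  // Imports assume Nutch 1.x package names, matching the filter signature below.
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingException;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.parse.Parse;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Element;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException {

        // This assumes the raw HTML was stored under the key "fullcontent" earlier
        // (for example by a custom parse filter); that step is not shown here.
        Metadata metadata = parse.getData().getParseMeta();
        String fullContent = metadata.get("fullcontent");
        if (fullContent == null) {
            return doc; // nothing stored for this page, leave the document unchanged
        }

        // Parse the stored HTML and select the div with id "content".
        Document document = Jsoup.parse(fullContent);
        Element contentwrapper = document.select("div#content").first();

        // Add field; first() returns null when no element matches.
        if (contentwrapper != null) {
            doc.add("contentwrapper", contentwrapper.text());
        }

        return doc;
  }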
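
For the first part of the question (checking whether a page contains a given tag), walking the DOM by hand is another possibility. The sketch below uses only the JDK's `org.w3c.dom` classes. It assumes your plugin receives an `org.w3c.dom.DocumentFragment`, as Nutch parse filters do, and that interface does not declare `getElementsByTagName`, which may be one reason the documented call did not work. The comparison ignores case because some HTML parsers report element names in upper case. `TagFinder`, `containsTag`, and `fragmentOf` are hypothetical names, and `fragmentOf` only stands in for the fragment Nutch would hand to the plugin.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Nutch.
class TagFinder {

    // Depth-first search for an element whose name equals tagName, ignoring case.
    static boolean containsTag(Node node, String tagName) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            return true;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (containsTag(child, tagName)) {
                return true;
            }
        }
        return false;
    }

    // Builds a DocumentFragment from well-formed markup, standing in for the
    // fragment a parse filter would receive from the crawler.
    static DocumentFragment fragmentOf(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            DocumentFragment fragment = dom.createDocumentFragment();
            fragment.appendChild(dom.getDocumentElement());
            return fragment;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup, invented for this sketch.
        DocumentFragment page = fragmentOf("<HTML><BODY><FOO>x</FOO></BODY></HTML>");
        System.out.println(containsTag(page, "foo"));
        System.out.println(containsTag(page, "bar"));
    }
}
```

Inside a real plugin you would pass the fragment you are given straight to `containsTag` and only run the more expensive extraction when it returns true.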

【Comments】: