How to read from Nutch segments without the readseg command

【Question title】: How to read from Nutch segments without readseg command
【Posted】: 2016-11-23 23:21:03
【Question】:

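I am using Nutch to crawl some websites, and I am currently crawling this site.

The crawl produced these five segments, which hold all the documents found (about 10,000 documents). Now I want to process the content of the documents without using the readseg command, that is, without dumping the segments to plain text.

For this, only the content subdirectory of each segment is useful to me (the tags and content of the documents).

I noticed that the content directory contains two more containers: data and index. But I have not found any explanation of them, or of how to read them to process the content inside. I also found some pointers on this problem, but I have not yet understood the idea behind the algorithm.

How is the content stored in Nutch segments, and how can it be read? In case someone wants to give a short example (not required), I have provided the crawled site and the segments.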

【Comments】:

Tags: java web-crawler nutch

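【Solution 1】:

What do you need to do with the content? You could, for example, write a custom IndexWriter. It is called during the indexing step and gives you access to the content. Or have a look at the "dump" command (org.apache.nutch.tools.FileDumper) and modify its code.

By the way, Tom White's "Hadoop: The Definitive Guide" has a nice section on the Nutch data structures.

If you want to do further processing on the pages, such as NLP or classification, Behemoth can be used to convert Nutch segments into a "neutral" data structure on HDFS, which can then be processed with various tools.

【Comments】: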

【Solution 2】:

Based on @JulienNioche's reply, here is my implementation.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FilenameUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.tika.Tika;
import org.apache.tika.mime.MediaType; // assumed: Tika's MediaType

// file is the root directory of the segments.
private static void indexSegments(File file)
        throws IOException, IllegalAccessException, InstantiationException {
    // Do not try to index files that cannot be read.
    if (file.canRead() && file.isDirectory()) {
        // List with all the segments.
        File[] segmentDirs = file.listFiles();
        if (segmentDirs == null) {
            System.err.println("No segment directories found in '" +
                                file.getAbsolutePath() + "'");
            return;
        }
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // A single detector is enough for all documents.
        Tika tika = new Tika();
        // Index all the segments.
        for (File segment : segmentDirs) {
            /* Only the content of the documents managed in
             * the segment is useful for the system. */
            String segmentData = segment.getAbsolutePath() + "/" +
                    Content.DIR_NAME + "/part-00000/data";
            if (!new File(segmentData).exists()) {
                System.out.println("Skipping segment: '" + segment.getName() +
                                   "': no data file present.");
                continue;
            }
            // try-with-resources closes the reader even if reading fails.
            try (SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(segmentData), conf)) {
                Writable key = (Writable) reader.getKeyClass().newInstance();
                // Index all the documents managed in the current segment.
                while (reader.next(key)) {
                    Content content = new Content();
                    reader.getCurrentValue(content);
                    // The key of each record is the URL of the document.
                    String url = key.toString();
                    String baseName = FilenameUtils.getBaseName(url);
                    // Skip the document if it is not an XML file.
                    String mimeType = tika.detect(content.getContent());
                    if (!MediaType.APPLICATION_XML.toString().equals(mimeType)) {
                        System.out.println("Skipping document: '" + baseName +
                                           "': not an XML file.");
                        continue;
                    }
                    /* Content of the document, decoded as UTF-8. */
                    String docContent =
                            new String(content.getContent(), StandardCharsets.UTF_8);
                    // TODO: Do what you want with the content.
                }
            }
        }
    }
}
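
The snippet above decodes every document as UTF-8, which silently produces replacement characters if a crawled page uses another encoding. A minimal sketch of a stricter decoding step follows, using only the JDK. The class name ContentDecoder and the ISO-8859-1 fallback are assumptions made for illustration; they are not part of Nutch or of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict UTF-8 decoding with a single-byte fallback. */
class ContentDecoder {

    /** Returns the text as UTF-8 if the bytes are valid UTF-8, else as ISO-8859-1. */
    static String decode(byte[] raw) {
        try {
            // REPORT makes the decoder throw instead of inserting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumed fallback: ISO-8859-1 maps every byte to a character,
            // so this branch never fails (but may still guess wrong).
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = {(byte) 'c', (byte) 'a', (byte) 'f', (byte) 0xE9};
        // Both inputs decode to the same four-character string.
        System.out.println(decode(utf8).equals(decode(latin1)));
    }
}
```

In the loop above, the call would be something like ContentDecoder.decode(content.getContent()) in place of the plain UTF-8 constructor. If the fetched page declared a charset, using that declaration would probably be more reliable than this guess.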

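【Comments】:

【Solution 3】:

I know this is an old question, but I stumbled on it while looking for an answer to the same question. I pieced together a few answers to come up with this simple Java loop that reads the segment contents. The key class is org.apache.hadoop.io.MapFile.Reader, which reads the index and data files. Disclaimer: I am new to Nutch and Hadoop, but this worked for me.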

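import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

// conf is assumed to be a Configuration field of the enclosing class,
// e.g. one created with NutchConfiguration.create().
private void readContent(Path[] segmentPaths) throws Exception {
    // Segment subdirectories to read.
    String[] fileTypes = {"content", "crawl_fetch", "parse_data", "parse_text"};
    String partR = "part-r-00000";

    for (Path path : segmentPaths) {
        for (String type : fileTypes) {
            // The MapFile is the directory that holds the data and index files.
            Path file = new Path(path, type + "/" + partR);
            // try-with-resources closes the reader even if reading fails.
            try (MapFile.Reader reader = new MapFile.Reader(file, conf)) {
                WritableComparable<?> key = (WritableComparable<?>)
                        ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable)
                        ReflectionUtils.newInstance(reader.getValueClass(), conf);
                // Print every key/value pair stored in the MapFile.
                while (reader.next(key, value)) {
                    System.out.printf("%s\t%s%n", key, value);
                }
            }
        }
    }
}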
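
As background on the index and data files mentioned above: a Hadoop MapFile is a directory in which data holds the key/value records sorted by key and index holds only a sample of those keys together with their positions in data. The following is a toy, in-memory sketch of that idea using only the JDK; it is not the Hadoop implementation. The class name MapFileSketch and the interval of 4 are assumptions chosen for illustration (Hadoop's default index interval is 128).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a MapFile: a sorted "data" list plus a sparse "index". */
class MapFileSketch {
    // Hypothetical interval: every 4th key is copied into the index.
    static final int INDEX_INTERVAL = 4;

    final List<String[]> data = new ArrayList<>();    // sorted (key, value) records
    final List<String> indexKeys = new ArrayList<>(); // sampled keys
    final List<Integer> indexPos = new ArrayList<>(); // position of each sampled key in data

    /** Appends a record; keys must arrive in sorted order. */
    void append(String key, String value) {
        if (!data.isEmpty() && data.get(data.size() - 1)[0].compareTo(key) > 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (data.size() % INDEX_INTERVAL == 0) {
            indexKeys.add(key);
            indexPos.add(data.size());
        }
        data.add(new String[] {key, value});
    }

    /** Looks up a key: binary-search the index, then scan forward in data. */
    String get(String key) {
        int lo = 0, hi = indexKeys.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                start = indexPos.get(mid); // last sampled key <= target so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // target sorts before the first record
        }
        for (int i = start; i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) return data.get(i)[1];
            if (cmp > 0) break; // passed the slot where the key would be
        }
        return null;
    }

    public static void main(String[] args) {
        MapFileSketch file = new MapFileSketch();
        for (int i = 0; i < 10; i++) {
            file.append(String.format("http://example.org/page%02d", i), "content " + i);
        }
        System.out.println(file.get("http://example.org/page06")); // content 6
        System.out.println(file.get("http://example.org/missing")); // null
    }
}
```

This is why a sequential dump, like the loop above, only needs the data file, while a lookup by URL benefits from the index: the reader jumps close to the record and scans at most a few entries.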

【Comments】: