【Question Title】: Nutch 2.3 not storing crawl data correctly in Cassandra
【Posted】: 2015-03-02 15:41:09
【Question】:

I am crawling with mostly default options using Nutch 2.3 with a Cassandra backend. As the seed list I use a file containing 71 URLs, and I start the crawl with the following command:

bin/crawl ~/dev/urls/ crawlid1 5

The keys are stored in Cassandra and the f, p and sc column families are created. However, if I try to read a WebPage object, the content and text fields are null, even though the output indicates that the fetcher and parser jobs supposedly ran.

Furthermore, no new links are added to the link db, even though db.update.additions.allowed defaults to true.
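For reference, that property can also be pinned explicitly in nutch-site.xml to rule out an override elsewhere; a minimal sketch (the value shown is the documented default):

```xml
<property>
    <name>db.update.additions.allowed</name>
    <value>true</value>
    <description>If true, updatedb adds newly discovered links to the db.</description>
</property>
```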

After the crawl completes, I try to read out the crawl data with the code below. It shows that only some fields are being populated. Looking at the code in FetcherJob and ParserJob, I see no reason why the content and text fields should be null. I'm probably missing some basic setting, but googling my problem hasn't turned up anything. I also set breakpoints in ParserMapper and FetcherMapper, and they do seem to be executed.

Does anyone know how to get fetched/parsed content stored in Cassandra with Nutch 2?

import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.gora.query.Query;
import org.apache.gora.query.Result;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.gora.util.GoraException;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Reads the rows from a {@link DataStore} as a {@link WebPage}.
 * 
 * @author Jeroen Vlek, jv@datamantics.com Created: Feb 25, 2015
 *
 */
public class NutchWebPageReader implements Closeable {
    private static final Logger LOGGER = LoggerFactory.getLogger(NutchWebPageReader.class);

    DataStore<String, WebPage> dataStore;

    /**
     * Initializes the datastore field with the {@link Configuration} as defined
     * in gora.properties in the classpath.
     */
    public NutchWebPageReader() {
        try {
            dataStore = DataStoreFactory.getDataStore(String.class, WebPage.class, new Configuration());
        } catch (GoraException e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        Map<String, WebPage> pages = null;
        try (NutchWebPageReader pageReader = new NutchWebPageReader()) {
            pages = pageReader.getAllPages();
        } catch (IOException e) {
            LOGGER.error("Could not close page reader.", e);
        }
        LOGGER.info("Found {} results.", pages.size());

        for (Entry<String, WebPage> entry : pages.entrySet()) {
            String key = entry.getKey();
            WebPage page = entry.getValue();
            String content = "null";
            if (page.getContent() != null) {
                // Assign the decoded string, otherwise the placeholder is logged.
                content = new String(page.getContent().array(), UTF_8);
            }
            LOGGER.info("{} with content {}", key, content);
        }
    }

    /**
     * @return all rows in the datastore, keyed by URL
     */
    public Map<String, WebPage> getAllPages() {
        Query<String, WebPage> query = dataStore.newQuery();
        Result<String, WebPage> result = query.execute();
        Map<String, WebPage> resultMap = new HashMap<>();
        try {
            while (result.next()) {
                resultMap.put(result.getKey(), dataStore.get(result.getKey()));
            }
        } catch (Exception e) {
            LOGGER.error("Something went wrong while processing the query result.", e);
        }

        return resultMap;
    }

    /*
     * (non-Javadoc)
     * 
     * @see java.io.Closeable#close()
     */
    @Override
    public void close() throws IOException {
        dataStore.close();
    }

}
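The content field of a WebPage is exposed as a ByteBuffer. A minimal stdlib-only sketch of the decoding step in main (ContentDecodeDemo and the sample bytes are made up for illustration): the result of new String(...) must actually be assigned, and since ByteBuffer.array() returns the entire backing array, decoding from position/remaining is the safer variant:

```java
import static java.nio.charset.StandardCharsets.UTF_8;

import java.nio.ByteBuffer;

public class ContentDecodeDemo {
    /** Decodes a content buffer, honoring the buffer's position and limit. */
    static String decode(ByteBuffer content) {
        if (content == null) {
            return "null";
        }
        // array() exposes the full backing array, so offset into it explicitly.
        return new String(content.array(), content.arrayOffset() + content.position(),
                content.remaining(), UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for WebPage.getContent() (hypothetical sample bytes).
        ByteBuffer content = ByteBuffer.wrap("<html>hello</html>".getBytes(UTF_8));
        System.out.println(decode(content)); // prints <html>hello</html>
    }
}
```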

This is my nutch-site.xml:

<property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
    <description>Default class for storing data</description>
</property>
<property>
    <name>http.agent.name</name>
    <value>Nibbler</value>
</property>
<property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
</property>
<property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>If true, fetcher will parse content. NOTE: previous
        releases would
        default to true. Since 2.0 this is set to false as a safer default.</description>
</property>
<property>
    <name>http.content.limit</name>
    <value>999999999</value>
</property>

EDIT

I was using Cassandra 2.0.12, but I just tried 2.0.2 and that didn't solve the problem either. So the versions I'm using are:

  • Nutch: 2.3 (git clone, checked out at tag "release-2.3")
  • Gora: 0.5 (as bundled with Nutch)
  • Cassandra: 2.0.2

Changing result.get() to dataStore.get(result.getKey()) causes some fields to actually be populated, but content and text are still null.

Some output:

[jvlek@orochimaru nutch]$ runtime/local/bin/nutch inject ~/dev/urls/
InjectorJob: starting at 2015-03-02 18:34:29
InjectorJob: Injecting urlDir: /home/jvlek/dev/urls
InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 69
Injector: finished at 2015-03-02 18:34:32, elapsed: 00:00:02
[jvlek@orochimaru nutch]$ runtime/local/bin/nutch readdb -url http://www.wired.com/
key:    http://www.wired.com/
baseUrl:        null
status: 0 (null)
fetchTime:      1425317669727
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker dist :   0
reprUrl:        null
metadata _csh_ :        ??

[jvlek@orochimaru nutch]$ runtime/local/bin/nutch generate -batchId 1
GeneratorJob: starting at 2015-03-02 18:34:50
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2015-03-02 18:34:54, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1 containing 66 URLs
[jvlek@orochimaru nutch]$ runtime/local/bin/nutch readdb -url http://www.wired.com/
key:    http://www.wired.com/
baseUrl:        null
status: 0 (null)
fetchTime:      1425317669727
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  1.0
marker _injmrk_ :       y
marker _gnmrk_ :        1
marker dist :   0
reprUrl:        null
batchId:        1
metadata _csh_ :        ??

【Discussion】:

    Tags: web-crawler nutch gora


    【Solution 1】:

    What version of Gora are you using?

    Could you please delete the database and then execute:

    nutch inject ~/dev/urls/
    nutch generate -batchId 1
    nutch fetch 1
    

    Then:

    nutch readdb -url <some known url> -content
    

    Does it show the correct information? If the answer is yes, then do:

    nutch parse 1
    nutch updatedb
    nutch readdb -url <some known url> -content
    

    【Discussion】:

    • The fetch succeeds, but no content is stored. The first and the second readdb both give the same output. No text or raw HTML is shown. I am using Gora 0.6.
    • Does it say "status: 2 (status_fetched)" and "protocolStatus: SUCCESS, args=[]"?
    • Aha, no, it doesn't. It says "status: 0 (null)", "protocolStatus: (null)" and "parseStatus: (null)".
    • Please do a readdb on one url after injecting and generating and add the output to your question :)
    • Correction: I built the Nutch runtime with Gora 0.5. NutchWebPageReader uses 0.6. I'll try building another crawler with 0.6.