Mahout precomputed item-item similarity - slow recommendation

【Question title】: Mahout precomputed Item-item similarity - slow recommendation
【Posted】: 2013-09-03 08:53:16
【Question description】:

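I am running into a performance problem with precomputed item similarities in Mahout.

I have 4 million users with roughly the same number of items, and about 100 million user-item preferences. I want to do content-based recommendation using the cosine similarity of the documents' TF-IDF vectors. Since computing this on the fly is slow, I precomputed the pairwise similarities for the top 50 most similar documents as follows: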

I used seq2sparse to generate the TF-IDF vectors.
I used mahout rowId to generate the Mahout matrix.
I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to generate the top 50 most similar documents.

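I precomputed all of this with Hadoop. For 4 million items, the output is only 2.5 GB.

I then loaded the contents of the files produced by the reducers into Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ..., using the docIndex to decode the document IDs. The IDs were already integers, but rowId had re-encoded them starting from 1, so I had to map them back to the originals.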
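The ID decoding step can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not Mahout code: `SimilarityTriple`, `RowIndexDecoder`, and the in-memory `docIndex` map are hypothetical names, and the sketch assumes the docIndex and the similarity rows have already been read into plain maps. In the real pipeline, each decoded triple would presumably become one `GenericItemSimilarity.ItemItemSimilarity`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical holder for one decoded (item, item, similarity) entry.
final class SimilarityTriple {
    final long itemID1;
    final long itemID2;
    final double value;

    SimilarityTriple(long itemID1, long itemID2, double value) {
        this.itemID1 = itemID1;
        this.itemID2 = itemID2;
        this.value = value;
    }
}

final class RowIndexDecoder {
    // rows: row index -> (column index -> similarity), as read from the job output.
    // docIndex: row index -> original document ID (assumed already loaded).
    static List<SimilarityTriple> decode(Map<Integer, Map<Integer, Double>> rows,
                                         Map<Integer, Long> docIndex) {
        List<SimilarityTriple> result = new ArrayList<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            Long id1 = docIndex.get(row.getKey());
            if (id1 == null) {
                continue; // skip rows whose index has no known document ID
            }
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                Long id2 = docIndex.get(cell.getKey());
                if (id2 == null) {
                    continue; // skip columns whose index has no known document ID
                }
                result.add(new SimilarityTriple(id1, id2, cell.getValue()));
            }
        }
        return result;
    }
}
```

Doing the lookup once while loading keeps the recommender itself working only with original document IDs.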

For recommendation I use the following code:

ItemSimilarity similarity = new GenericItemSimilarity(corrMatrix);

CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(),  model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(),  model.getNumItems());

Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);

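I am experimenting with a limited data model (1.6 million items), but I loaded all of the item-item pairwise similarities into memory. I managed to fit everything into main memory using 40 GB.

When I want to make a recommendation for a single user: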

Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);

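The recommendation process took 554.938583083 seconds, and on top of that it produced no recommendations at all. Now I am really worried about recommendation performance. I played with the parameters of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy, but saw no improvement.

Isn't the whole point of precomputing everything to speed up the recommendation process? Could someone tell me what I am doing wrong? Also, why does loading the pairwise similarities into main memory blow up so much? 2.5 GB of files took 40 GB of main memory once loaded into the Collection<GenericItemSimilarity.ItemItemSimilarity> matrix. I know the files are serialized as IntWritable/VectorWritable key-value pairs, and that the key has to be repeated for every vector value in the ItemItemSimilarity matrix, but this seems like too much, don't you think?

Thank you in advance.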

【Comments on the question】:

Tags: mahout recommendation-engine mahout-recommender

【Solution 1】:

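I stand corrected about the time needed to compute recommendations from the Collection of precomputed values. It turns out I had put long startTime = System.nanoTime(); at the top of my code rather than immediately before List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);. As a result, the measurement included the time needed to load the dataset and the precomputed item similarities into main memory.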
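The measurement mistake is easy to reproduce in isolation. The sketch below is stdlib-only and uses hypothetical stand-ins (`loadEverything` and `recommend` are not Mahout calls); it only illustrates starting the clock immediately before the call being measured, so that load time and recommendation time are reported separately.

```java
import java.util.ArrayList;
import java.util.List;

final class TimingSketch {
    // Hypothetical stand-in for loading the data model and the similarities.
    static List<Long> loadEverything(int n) {
        List<Long> items = new ArrayList<>(n);
        for (long i = 0; i < n; i++) {
            items.add(i);
        }
        return items;
    }

    // Hypothetical stand-in for recommender.recommend(userID, howMany).
    static List<Long> recommend(List<Long> items, int howMany) {
        return new ArrayList<>(items.subList(0, Math.min(howMany, items.size())));
    }

    // Returns {loadNanos, recommendNanos}, measured separately.
    static long[] measure(int n, int howMany) {
        long loadStart = System.nanoTime();
        List<Long> items = loadEverything(n);
        long loadNanos = System.nanoTime() - loadStart;

        // Start the clock here, right before the call being measured.
        long recStart = System.nanoTime();
        List<Long> recs = recommend(items, howMany);
        long recNanos = System.nanoTime() - recStart;

        if (recs.size() > howMany) {
            throw new IllegalStateException("more results than requested");
        }
        return new long[] {loadNanos, recNanos};
    }
}
```

Calling `TimingSketch.measure(1_000_000, 10)` yields two separate durations, which makes it obvious when the loading phase dominates.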

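However, I stand by what I said about memory consumption. I improved it by using a custom ItemSimilarity and loading the precomputed similarities into a structure equivalent to HashMap<Long, HashMap<Long, Double>>. I used the trove library to reduce the space requirements.

Here is the detailed code. The custom ItemSimilarity: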

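import java.util.Collection;

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

// Trove imports assume the Trove 3.x package layout.
import gnu.trove.map.hash.TLongDoubleHashMap;
import gnu.trove.map.hash.TLongObjectHashMap;

public class TextItemSimilarity implements ItemSimilarity {

    // itemID -> (similar itemID -> similarity)
    private final TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix;

    public TextItemSimilarity(TLongObjectHashMap<TLongDoubleHashMap> correlationMatrix) {
        this.correlationMatrix = correlationMatrix;
    }

    @Override
    public void refresh(Collection<Refreshable> alreadyRefreshed) {
        // Nothing to refresh: the similarities are precomputed.
    }

    @Override
    public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
        TLongDoubleHashMap similarToItemId1 = correlationMatrix.get(itemID1);
        if (similarToItemId1 != null && !similarToItemId1.isEmpty() && similarToItemId1.contains(itemID2)) {
            return similarToItemId1.get(itemID2);
        }
        return 0; // no precomputed similarity for this pair
    }

    @Override
    public double[] itemSimilarities(long itemID1, long[] itemID2s) throws TasteException {
        double[] result = new double[itemID2s.length];
        for (int i = 0; i < itemID2s.length; i++) {
            result[i] = itemSimilarity(itemID1, itemID2s[i]);
        }
        return result;
    }

    @Override
    public long[] allSimilarItemIDs(long itemID) throws TasteException {
        TLongDoubleHashMap similar = correlationMatrix.get(itemID);
        // Guard against items that have no precomputed neighbours.
        return similar == null ? new long[0] : similar.keys();
    }
}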

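For my dataset, total memory consumption was 30 GB when using Collection<GenericItemSimilarity.ItemItemSimilarity>, and 17 GB when using TLongObjectHashMap<TLongDoubleHashMap> with the custom TextItemSimilarity. The time per recommendation was 0.05 seconds with Collection<GenericItemSimilarity.ItemItemSimilarity> and 0.07 seconds with TLongObjectHashMap<TLongDoubleHashMap>. I also believe that the choice of CandidateItemsStrategy and MostSimilarItemsCandidateItemsStrategy plays an important role in performance.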
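A likely contributor to the difference, though not something measured here, is boxing: a map keyed by `Long` with `Double` values stores a separate heap object for each boxed key and value, whereas a primitive map keeps keys and values in flat arrays. The sketch below is a simplified, stdlib-only illustration of that idea. `LongDoubleMap` is a hypothetical class written for this example, not trove's implementation; it omits removal and reserves `Long.MIN_VALUE` as its empty-slot marker, which is an assumption of the sketch.

```java
import java.util.Arrays;

// Hypothetical open-addressing long -> double map backed by two primitive arrays.
final class LongDoubleMap {
    private static final long EMPTY = Long.MIN_VALUE; // reserved: not usable as a key

    private long[] keys;
    private double[] values;
    private int size;

    LongDoubleMap(int expectedSize) {
        if (expectedSize < 0 || expectedSize > (1 << 29)) {
            throw new IllegalArgumentException("unsupported expected size");
        }
        int capacity = 16;
        while (capacity < 2L * expectedSize) { // keep the load factor at or below 0.5
            capacity <<= 1;
        }
        keys = new long[capacity];
        values = new double[capacity];
        Arrays.fill(keys, EMPTY);
    }

    // Finds the slot holding the key, or the empty slot where it would go.
    private int slot(long key) {
        int mask = keys.length - 1;
        int i = Long.hashCode(key * 0x9E3779B97F4A7C15L) & mask; // spread the bits
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        return i;
    }

    void put(long key, double value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        if ((size + 1) * 2 > keys.length) {
            resize();
        }
        int i = slot(key);
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    // Returns 0.0 for a missing key, mirroring the 0 fallback in itemSimilarity.
    double get(long key) {
        int i = slot(key);
        return keys[i] == key ? values[i] : 0.0;
    }

    int size() {
        return size;
    }

    private void resize() {
        long[] oldKeys = keys;
        double[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new double[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }
}
```

Each stored pair costs two array slots (16 bytes at load factor 1, more with spare capacity) and no per-entry objects. The extra hashing and probing on every lookup is one possible reason such a structure could be slightly slower than a purpose-built one.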

So my take is: if you want to save space, use the trove HashMap; if you want better performance, use Collection<GenericItemSimilarity.ItemItemSimilarity>.

【Comments】: