生成 N*N 矩阵的高效算法答案

【问题标题】：Efficient algorithm to generate an N*N matrix生成 N*N 矩阵的高效算法
【发布时间】：2013-03-24 13:46:40
【问题描述】：

我正在做一个简单的协同过滤（CF）。这是一个项目到项目的CF。例如，我有一个包含 N 个项目的巨型 dict，其中 key 是产品名称，value 是购买它们的客户列表：

d={
item1:[customer1,customer3,customer7], 
item2:[customer3, customer5],
...
itemN:[customerX...customerY]
}

我还有一个小函数可以计算每个项目之间客户的相似度，例如item1 vs item2：

def littlefunction(...):

    #convert them to a set
    item1=set(d['item1'])
    item2=set(d['item2'])

    commonCustomer=item1.intersect(item2)
    totalCustomer=item1.union(item2)

    similarity=float(len(commonCustomer))/len(totalCustomer)

要获得指定的每个项目的最相似项目，我必须扫描，计算 N 次相似度然后排序。所以对于 N 个项目，复杂度是O(N*N)。

现在每个项目的运行时间为 2 分钟（N 大约 = 3 百万）。要生成一个完整的 N*N 相似度表，需要 100,000 小时。有比这种蛮力方法更好的算法吗？每个项目只需要前几个结果。

【问题讨论】：

用 C++ 重写并使用几乎最优的数据结构和算法，每件商品的时间下降到大约 0.5 秒（我假设平均每件商品有 6.5 个客户）。仍然超过 400 小时，但由于这些东西并行性很好，你可以在一个 16 CPU 的机器上在一个周末完成它，并且仍然有时间计算几百万位数的 pi。
MK，你有 key=product, value=customer-list 的 dict d。 ❶您是否已经有一个 key=customer, value=product-list 的字典，正如 JM 的回答中所建议的那样？ ❷一共有多少客户？

标签： python algorithm collaborative-filtering

【解决方案1】：

创建一个倒排索引：

customer1: [item1, item3, item8, ...]
customer2: [item7, item8, item74, ...]

那么你可以：

查找商品以获取购买该商品的客户列表
查找每个客户以获取客户购买的商品列表

每个项目的时间应该从 2 分钟到少于 2 秒。

第二个索引需要更多内存，但您没有复制数据。如果内存有问题，您可以将其存储在一个简单的数据库中，并且仍然比您当前使用的 N^2 算法快得多。

更多细节

您想创建一个 N*N 矩阵来显示任意两项之间的相似性。使用我的技术，您可以执行以下操作：

Create an N*N matrix, and initialize it to 0.
for each item
  Get the list of customers who bought the item (from your item-to-customer index).
  Create an empty dictionary of related items
  for each customer in that list
    for each item that the customer bought
      update the dictionary (add new item, or increase count)
    end for
  end for
  You now have a dictionary that contains the related items,
  and how many customers bought each one. You can update the matrix row
  for the current item from that dictionary.
end for

【讨论】：

第二个索引如何帮助计算相似度？
@n.m.是的，我也想问。其实我提到的d={...}就是从这个...
@MKYung: 想一想，你可以通过获取当前商品的客户列表，获取列表中每个客户的商品列表，然后只计算这些项目的相似性（对于所有其他项目，它为零）。实际收益将取决于平均每件商品有多少客户，反之亦然。
太棒了！经过一段时间的消化，我终于完成了！现在速度更快了！！

【解决方案2】：

这与@Jim Mischel 的回答基本相同，构建显式反向索引，但不构建显式 NxN 矩阵，并且是一个实现而不是整个系统的草图。

对于大量的 NUM_CUSTOMERS（比如 1,000,000），它的性能不是很好（大约 20 秒进行 3 次 similiarty_for 检查），但数据是随机的。据推测，实际客户购买数据的相关性更高，随机性更小。

首先，构建一些用于测试的合成数据：

import random
import operator

NUM_CUSTOMERS = 10000
NUM_ITEMS = 1000
MIN_PER_CUST = 2
MAX_PER_CUST = 10
NUM_INTEREST = 10

customers = ["customer_{}".format(num) for num in xrange(1, NUM_CUSTOMERS+1)]
items = ["item_{}".format(num) for num in xrange(1, NUM_ITEMS+1)]

customer_items = {
    customer: random.sample(items, random.randint(MIN_PER_CUST, MAX_PER_CUST))
    for customer in customers}

item_customers = {}
for customer, this_items in customer_items.iteritems():
    for item in this_items:
        item_customers.setdefault(item, [])
        item_customers[item].append(customer)

然后定义几个函数：

def similarity(item1, item2):
    item1_set = set(item_customers[item1])
    item2_set = set(item_customers[item2])

    num_common = len(item1_set.intersection(item2_set))
    num_total = len(item1_set.union(item2_set))

    return float(num_common) / num_total

def similarity_for(item):
    to_check = {
        itm for customer in item_customers[item]
            for itm in customer_items[customer] }
    to_check.discard(item)
    rankings = {
        item2: similarity(item, item2)
        for item2 in to_check }
    return sorted(rankings.iteritems(), key=operator.itemgetter(1), reverse=True)

然后，运行它：

for index, item in enumerate(sorted(random.sample(items, 3))):
    rankings = similarity_for(item)
    print "check {}; {} (of {}):".format(index, item, len(rankings))
    for item_other, ranking in rankings[:NUM_INTEREST]:
        print "  {}: {}".format(item_other, ranking)

获得如下所示的结果：

% python /tmp/similarity.py
check 0; item_121 (of 309):
  item_520: 0.0283018867925
  item_361: 0.027027027027
  item_536: 0.0265486725664
  item_637: 0.0238095238095
  item_515: 0.020202020202
  item_750: 0.019801980198
  item_960: 0.0192307692308
  item_25: 0.0190476190476
  item_548: 0.018691588785
  item_841: 0.018691588785
check 1; item_714 (of 298):
  item_162: 0.0285714285714
  item_491: 0.0272727272727
  item_617: 0.0265486725664
  item_949: 0.0260869565217
  item_788: 0.0192307692308
  item_446: 0.0190476190476
  item_558: 0.018691588785
  item_606: 0.0181818181818
  item_177: 0.0181818181818
  item_577: 0.018018018018
check 2; item_991 (of 352):
  item_758: 0.0298507462687
  item_204: 0.025
  item_85: 0.0247933884298
  item_769: 0.024
  item_501: 0.0232558139535
  item_860: 0.0227272727273
  item_408: 0.0225563909774
  item_480: 0.0223880597015
  item_73: 0.0220588235294
  item_593: 0.021897810219

【讨论】：