sklearn BallTree giving unexpected results

【Question title】: sklearn BallTree giving unexpected results
【Posted】: 2016-02-13 04:23:45
【Question】:

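What am I doing wrong?

I am trying to use sklearn's BallTree to find similar collections, and then generate suggestions for items that a given collection might be missing.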

import random
from sklearn.neighbors import BallTree
import numpy

collections = []  # 10k sample collections of between
                  # 7 and 15 (of a possible 300...) items

for sample in range(0, 10000):  # build sample data
   items = random.sample(range(1, 300), random.randint(7, 15))
   collections.append(items)    

darray = numpy.zeros((len(collections), max(map(len, collections))))  # 10k x 15 matrix

for c_cnt, items in enumerate(collections):  # populate matrix
   for cnt, i in enumerate(sorted(items)):
      darray[c_cnt][cnt] = i

query = BallTree(darray).query(darray[0], k=15)

nearest_neighbors = query[1][0]

# test the results against the first item!

all_sets = [set(darray[0]) & set(darray[item]) for item in nearest_neighbors]
for item in all_sets:
    print item  # intersection of the neighbor

I get the following results:

set([0.0, 130.0, 167.0, 290.0, 162.0, 144.0, 17.0, 214.0]) # Nearest neighbor is itself! Awesome!
set([0.0])  # WTF? The second closest item shares only 1 item?
set([0.0, 290.0])
set([0.0, 17.0])
set([0.0, 130.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0, 162.0])
set([0.0, 144.0, 162.0])  # uhh okay, i would expect this to be higher up
set([0.0, 144.0, 17.0])

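I have noticed that the higher-ranked suggestions tend to have the same number of non-zero values as the array I am comparing against. Is there some preparation I can do on my data to fix this?

【Comments】:

Tags: python scikit-learn

【Solution 1】:

By default, BallTree computes the Euclidean distance between vectors, so it is not well suited to the kind of comparison you have in mind.

As a simple example, suppose you have the following two collections: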

collections[0] = [1, 3]
collections[1] = [1, 2, 3]

When you convert them into vectors in darray as you did above, they become:

darray[0] = [1, 3, 0]
darray[1] = [1, 2, 3]

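The Euclidean distance between these vectors does not reflect the number of entries the collections share, which is why the results are not what you expected.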
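To make this concrete, here is a minimal sketch that computes the Euclidean distances directly. The vectors a and b are the two padded collections above; the third vector c (the set {2, 4}) is a hypothetical addition, chosen only because it shares no items with a.

```python
import numpy as np

# Sorted item IDs padded with zeros, as in the question's darray.
a = np.array([1, 3, 0])  # the set {1, 3}
b = np.array([1, 2, 3])  # the set {1, 2, 3}: shares two items with a
c = np.array([2, 4, 0])  # hypothetical set {2, 4}: shares no items with a

dist_ab = np.linalg.norm(a - b)  # sqrt(0 + 1 + 9)
dist_ac = np.linalg.norm(a - c)  # sqrt(1 + 1 + 0)

print(dist_ab, dist_ac)
```

The disjoint set c comes out closer to a than the overlapping set b, because the distance compares item IDs position by position rather than set membership.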

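The distance metric you are looking for is probably not Euclidean distance but Jaccard distance, which measures the similarity between sets. BallTree implements it for boolean representations of sets; that is, for the data above, the vectors would become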

darray[0] = [True, False, True]
darray[1] = [True, True, True]

The first entry indicates whether 1 is in the collection, the second whether 2 is in the collection, and so on. This is a version of "one-hot encoding".
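As an illustration of this encoding, the sketch below builds the boolean vectors by hand. The helper name to_membership and its n_items parameter are made up for this example, and it assumes items are numbered from 1.

```python
import numpy as np

def to_membership(collection, n_items):
    """Return a boolean vector whose entry i-1 is True if item i is in the collection."""
    vec = np.zeros(n_items, dtype=bool)
    vec[np.asarray(collection) - 1] = True  # assumes items are numbered 1..n_items
    return vec

v0 = to_membership([1, 3], 3)
v1 = to_membership([1, 2, 3], 3)

print(v0.tolist())  # [True, False, True]
print(v1.tolist())  # [True, True, True]
```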

For the sample data you provided, you can compute the result like this:

import numpy as np
from sklearn.neighbors import BallTree
from sklearn.feature_extraction import DictVectorizer

# for replicability
np.random.seed(0)

# Compute the collections using a more efficient method
collections = [np.random.choice(300, replace=False,
                                size=np.random.randint(7, 15))
               for _ in range(10000)]

# Use DictVectorizer to compute binary representation of collections
dicts = [dict(zip(c, np.ones_like(c))) for c in collections]
darray = DictVectorizer(sparse=False, dtype=bool).fit_transform(dicts)

# Compute 15 nearest neighbors for the first collection
# (query expects a 2D array, so pass the first row as darray[:1])
dist, ind = BallTree(darray, metric='jaccard').query(darray[:1], k=15)
for i in ind[0]:
    print(set(collections[0]) & set(collections[i]))

I get the following results:

{225, 226, 261, 166, 296, 52, 150, 246, 215, 221, 223}
{52, 261, 221, 215}
{225, 226, 166, 150}
{223, 150, 215}
{225, 261, 166, 221}
{226, 261, 223}
{261, 150, 221}
{223, 52, 166, 215}
{296, 226, 166, 223}
{296, 221, 150}
{223, 52, 215}
{52, 261, 246}
{296, 225, 52}
{296, 225, 221}
{225, 150, 223}

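Note that Jaccard similarity is not simply the size of the intersection, but the size of the intersection normalized by the size of the union. The size of the intersection alone does not have the properties of a distance metric, so it cannot be computed directly with BallTree.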
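The normalization can be seen in a small sketch that works on plain Python sets. The function jaccard_distance is a hand-written helper for illustration, not a scikit-learn API, and the example sets beyond {1, 3} and {1, 2, 3} are invented.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|, with two empty sets treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# The two collections from the simple example: intersection 2, union 3.
d_small = jaccard_distance([1, 3], [1, 2, 3])

# Same intersection size (2), but a much larger union (6).
d_large = jaccard_distance([1, 3], [1, 2, 3, 4, 5, 6])

print(d_small, d_large)
```

Both pairs share exactly two items, yet the second pair is further apart because the shared items make up a smaller fraction of the union.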

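Edit: I should add that this approach becomes untenable if there are a large number of possible entries in the collections, because the boolean-encoded matrix becomes too large. The best approach to neighbor searches in very high dimensions with Jaccard distance is probably locality-sensitive hashing, though I don't know of an easy-to-use Python implementation suited to this problem.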
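For reference, MinHash is the hashing family usually paired with Jaccard similarity in locality-sensitive hashing. The sketch below covers only the signature step, in pure Python, and is not a full LSH index (there is no banding or bucket lookup). The function names, the 256 hash functions, the linear hash form, and the sample sets are all assumptions made for illustration.

```python
import random

PRIME = 2_147_483_647  # a prime larger than any item ID used here

def minhash_signature(items, hash_params, prime=PRIME):
    """For each hash h(x) = (a*x + b) % prime, keep the minimum over the set."""
    return [min((a * x + b) % prime for x in items) for a, b in hash_params]

def estimate_jaccard(sig1, sig2):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(s1 == s2 for s1, s2 in zip(sig1, sig2))
    return matches / len(sig1)

rng = random.Random(0)
hash_params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
               for _ in range(256)]

# Two sets of 20 items each that share 10 items (true similarity 10 / 30).
pool = rng.sample(range(1, 300), 30)
s1, s2 = set(pool[:20]), set(pool[10:])
true_sim = len(s1 & s2) / len(s1 | s2)

sig1 = minhash_signature(s1, hash_params)
sig2 = minhash_signature(s2, hash_params)
est = estimate_jaccard(sig1, sig2)

print(true_sim, est)
```

Each signature has a fixed length regardless of how many distinct items exist, which is what avoids the very wide boolean matrix. The estimate is approximate and its accuracy depends on the number of hash functions.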

【Comments】: