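【Posted】: 2016-02-13 04:23:45
【Question】:
What am I doing wrong?
I am trying to use sklearn's BallTree to find similar collections, and then generate suggestions for items that may be missing from a given collection.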
import random
from sklearn.neighbors import BallTree
import numpy

collections = []  # 10k sample collections of between
                  # 7 and 15 (of a possible 300...) items
for sample in range(0, 10000):  # build sample data
    items = random.sample(range(1, 300), random.randint(7, 15))
    collections.append(items)

darray = numpy.zeros((len(collections), max(map(len, collections))))  # 10k x 15 matrix
for c_cnt, items in enumerate(collections):  # populate matrix
    for cnt, i in enumerate(sorted(items)):
        darray[c_cnt][cnt] = i

query = BallTree(darray).query(darray[:1], k=15)  # query expects a 2-D array
nearest_neighbors = query[1][0]

# test the results against the first item!
all_sets = [set(darray[0]) & set(darray[item]) for item in nearest_neighbors]
for item in all_sets:
    print(item)  # intersection of the neighbor
I get the following results:
set([0.0, 130.0, 167.0, 290.0, 162.0, 144.0, 17.0, 214.0]) # Nearest neighbor is itself! Awesome!
set([0.0]) # WTF? The second closest item shares only 1 item?
set([0.0, 290.0])
set([0.0, 17.0])
set([0.0, 130.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0])
set([0.0, 162.0])
set([0.0, 144.0, 162.0]) # uhh okay, i would expect this to be higher up
set([0.0, 144.0, 17.0])
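I have observed that the higher-ranked suggestions tend to have the same number of non-zero values as the array I am comparing against. Is there some preparation I can do with my data to fix this?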
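A note on why the lengths matter, plus one possible preparation. In the padded matrix, column k holds the k-th smallest item ID of a collection, and shorter collections end in zeros. The default Euclidean distance therefore compares rank positions and padding rather than set membership, so collections of the same length (zeros in the same columns) look close even when they share nothing. The sketch below is an assumption on my part, not a confirmed fix: it encodes each collection as a 0/1 indicator row over the item vocabulary and queries a BallTree with the `jaccard` metric. The names `N_ITEMS`, `to_indicator`, and `similar_collections` are hypothetical helpers introduced for illustration.

```python
import random

import numpy as np
from sklearn.neighbors import BallTree

N_ITEMS = 300  # assumed vocabulary size; item IDs 1..299 as in the sample data


def to_indicator(collections, n_items=N_ITEMS):
    """Encode each collection as a 0/1 row: column j is 1 if item j is present."""
    matrix = np.zeros((len(collections), n_items))
    for row, items in enumerate(collections):
        matrix[row, list(items)] = 1.0
    return matrix


def similar_collections(collections, query_index=0, k=15):
    """Return (index, shared_items) for the k nearest collections by Jaccard distance."""
    indicator = to_indicator(collections)
    tree = BallTree(indicator, metric="jaccard")
    # query expects a 2-D array, so slice out a single row
    _, neighbors = tree.query(indicator[query_index:query_index + 1], k=k)
    target = set(collections[query_index])
    return [(int(i), target & set(collections[i])) for i in neighbors[0]]


random.seed(0)  # fixed seed only so the sample data is reproducible
collections = [
    random.sample(range(1, 300), random.randint(7, 15)) for _ in range(10000)
]
results = similar_collections(collections)
for index, shared in results:
    print(index, sorted(shared))  # neighbor index and the items it shares
```

With this encoding, the distance between two rows depends only on which items they share, not on how many items each has or where they fall after sorting. Items present in the neighbors but absent from the query collection would then be natural candidates for suggestions.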
【Comments】:
Tags: python scikit-learn