【Posted】: 2019-03-17 07:34:30
【Problem description】:
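I wrote my own implementation of the Shared Nearest Neighbor (SNN) clustering algorithm, following the original paper. Essentially, I get the nearest neighbors of each data point, precompute a distance matrix using the Jaccard distance, and pass that distance matrix to DBSCAN.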
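To make the distance concrete, here is a minimal sketch of the Jaccard distance between the neighbor lists of two points, computed as 1 - |A ∩ B| / |A ∪ B|. The name `jaccard_distance` and the two neighbor lists are made up for illustration; unlike the full code further down, this toy version does not filter out negative padding indices.

```python
import numpy as np

def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two index arrays."""
    a, b = np.unique(a), np.unique(b)
    inter = np.intersect1d(a, b).size
    union = np.union1d(a, b).size
    return 1.0 - inter / union

# Made-up neighbor lists for two points (illustrative only).
nbrs_p = [1, 2, 3, 4]
nbrs_q = [3, 4, 5, 6]

# The lists share {3, 4} and their union has 6 elements, so d = 1 - 2/6.
d = jaccard_distance(nbrs_p, nbrs_q)
```

Two points with identical neighbor lists get distance 0, and two points with no shared neighbors get distance 1.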
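To speed up the algorithm, I compute the Jaccard distance between two data points only if they are nearest neighbors of each other and share more than a certain number of neighbors. I also exploit the symmetry of the distance matrix by computing only half of it.
However, my algorithm is slow and takes much longer than common clustering algorithms such as K-Means or DBSCAN. Could someone look at my code and suggest how to improve it and make the algorithm faster?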
import numpy as np
from scipy.sparse import lil_matrix
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def jaccard(a, b):
    """
    Computes the Jaccard distance between two arrays.

    Parameters
    ----------
    a: an array.
    b: an array.
    """
    A = np.array(a, dtype='int')
    B = np.array(b, dtype='int')
    # Drop negative entries (treated as padding / missing neighbors).
    A = A[np.where(A > -1)[0]]
    B = B[np.where(B > -1)[0]]
    union = np.union1d(A, B)
    intersection = np.intersect1d(A, B)
    return 1.0 - len(intersection) * 1.0 / len(union)

def iterator_dist(indices, k_min=5):
    """
    An iterator that computes the Jaccard distance for any pair of stars.

    Parameters
    ----------
    indices: the indices of nearest neighbors in the chemistry-velocity
        space.
    k_min: a pair is yielded only if it shares more than k_min neighbors.
    """
    for n in range(len(indices)):
        # Only visit neighbors m > n, so each pair is handled once.
        for m in indices[n][indices[n] > n]:
            if len(np.intersect1d(indices[n], indices[m])) > k_min:
                dist = jaccard(indices[n], indices[m])
                yield (n, m, dist)
# load data here
data =
# hyperparameters
n_neighbors =
eps =
min_samples =
k_min =
# K Nearest Neighbors
nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(data)
distances, indices = nbrs.kneighbors()
# distance matrix
S = lil_matrix((len(distances), len(distances)))
for (n, m, dist) in iterator_dist(indices, k_min):
    # Fill both halves of the symmetric matrix.
    S[n, m] = dist
    S[m, n] = dist
db = DBSCAN(eps=eps, min_samples=min_samples, metric='precomputed',
            n_jobs=-1).fit(S)
labels = db.labels_
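One direction that might help, sketched here under assumptions rather than as a tested replacement: the double Python loop can be expressed as a sparse matrix product. If `A[i, j] = 1` whenever `j` is among the neighbors of `i`, then `(A @ A.T)[i, j]` is the number of neighbors that `i` and `j` share. The sketch assumes every row of `indices` holds exactly `k` distinct neighbors with no `-1` padding, so the union size is `2k - shared`. The function name `snn_jaccard_sparse` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, triu

def snn_jaccard_sparse(indices, k_min=5):
    """Sparse symmetric Jaccard-distance matrix from a kNN index array.

    Assumes `indices` has shape (n_points, k), each row holding k distinct
    neighbor indices and no negative padding values.
    """
    indices = np.asarray(indices)
    n, k = indices.shape
    rows = np.repeat(np.arange(n), k)
    # A[i, j] = 1 if j is one of the k nearest neighbors of i.
    A = csr_matrix((np.ones(n * k, dtype=np.int32), (rows, indices.ravel())),
                   shape=(n, n))
    # Same pairs as the loop version: j is a neighbor of i and j > i.
    mask = triu(A, k=1)
    # (A @ A.T)[i, j] counts the neighbors shared by i and j.
    shared = (A @ A.T).multiply(mask).tocoo()
    keep = shared.data > k_min
    r, c, s = shared.row[keep], shared.col[keep], shared.data[keep]
    # Both sets have k elements, so |union| = 2k - |intersection|.
    dist = 1.0 - s / (2.0 * k - s)
    S = csr_matrix((dist, (r, c)), shape=(n, n))
    # S is strictly upper triangular, so adding its transpose mirrors it.
    return S + S.T
```

The result could then be passed to `DBSCAN(metric='precomputed')` in place of the `lil_matrix` built in the loop. It should be checked against the loop version on a small sample first.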
【Discussion】:
- Requests to improve working code are a better fit for Code Review.
Tags: python algorithm performance cluster-analysis