How to implement DBSCAN clustering in TensorFlow? Answers

【Question title】: How to implement DBSCAN clustering in TensorFlow?
【Posted】: 2018-09-30 18:08:18
【Question】:

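I am looking for a way to cluster a set of features with the DBSCAN algorithm in TensorFlow, but I cannot find anything on the subject.

TensorFlow provides K-Means clustering (tf.contrib.learn.KMeansClustering), but I need the DBSCAN algorithm.

Can anyone suggest an existing wrapper written in Python or Java?

Any pointers on how to implement it from scratch?

P.S. I know there is a DBSCAN implementation in sklearn and similar libraries, but I specifically need TensorFlow.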

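【Comments】:

Tags: python tensorflow cluster-analysis dbscan

【Answer 1】:

I know I'm a year late, but for future reference: here is a DBSCAN-like algorithm I implemented. Its results may differ slightly from those of, for example, the sklearn implementation, especially for observations that could belong to more than one cluster. I know it is probably not optimal, and I know TF is not the best choice for implementing this kind of algorithm. But maybe someone will find the code useful.

Relevant code: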

# written against the TF 1.x graph API (tf.placeholder, tf.Session)
import tensorflow as tf
import numpy as np

def run(vals, epsilon=4, min_points=4):

    def merge_core_points_into_clusters(elems):
        row = elems
        mat = core_points_connection_matrix
        nonempty_intersection_inds = tf.where(tf.reduce_any(tf.logical_and(row, mat), axis=1))
        cumul = tf.logical_or(row, mat)
        subcumul = tf.gather_nd(cumul, nonempty_intersection_inds)
        return tf.reduce_any(subcumul, axis=0)

    def label_clusters(elems):
        return tf.reduce_min(tf.where(elems))

    def get_subsets_for_labels(elems):
        val = elems[0]
        labels = elems[1]
        conn = relation_matrix

        inds = tf.where(tf.equal(labels, val))
        masks = tf.gather_nd(conn, inds)
        return tf.reduce_any(masks, axis=0)

    def scatter_labels(elems):
        label = tf.expand_dims(elems[0], 0)
        mask = elems[1]
        return label*tf.cast(mask, dtype=tf.int64)

    data_np = np.array(vals)

    eps = epsilon
    min_pts = min_points

    in_set = tf.placeholder(tf.float64)

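    # pairwise Euclidean distance matrix, using
    # ||a - b||^2 = ||a||^2 - 2*a.b + ||b||^2
    r = tf.reduce_sum(in_set*in_set, 1)
    # turn r into column vector
    r = tf.reshape(r, [-1, 1])
    sq_dist = r - 2*tf.matmul(in_set, tf.transpose(in_set)) + tf.transpose(r)
    # clamp tiny negative values caused by rounding so that sqrt never returns NaN
    dist_mat = tf.sqrt(tf.maximum(sq_dist, 0.0))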

    # for each point, mark which points lie within eps of it (including the point itself)
    relation_matrix = dist_mat <= eps

    # number of points within eps-ball for each point
    num_neighbors = tf.reduce_sum(tf.cast(relation_matrix, tf.int64), axis=1)

    # for each point, mark whether it is a core point
    core_points_mask = num_neighbors >= min_pts

    # indices of core points
    core_points_indices = tf.where(core_points_mask)

    core_points_connection_matrix = tf.cast(core_points_mask, dtype=tf.int64) * tf.cast(relation_matrix, dtype=tf.int64)
    core_points_connection_matrix = tf.cast(core_points_connection_matrix, dtype=tf.bool)
    core_points_connection_matrix = tf.logical_and(core_points_connection_matrix, core_points_mask)

    merged = tf.map_fn(
        merge_core_points_into_clusters,
        core_points_connection_matrix,
        dtype=tf.bool
    )

    nonempty_clusters_records = tf.gather_nd(merged, core_points_indices)

    marked_core_points = tf.map_fn(label_clusters, nonempty_clusters_records, dtype=tf.int64)

    _, labels_core_points = tf.unique(marked_core_points, out_idx=tf.int64)

    labels_core_points = labels_core_points+1

    unique_labels, _ = tf.unique(labels_core_points)

    labels_all = tf.scatter_nd(
        tf.cast(core_points_indices, tf.int64),
        labels_core_points,
        shape=tf.cast(tf.shape(core_points_mask), tf.int64)
    )

    # for each label, build a mask of the points that should receive it
    ul_shape = tf.shape(unique_labels)
    labels_tiled = tf.maximum(tf.zeros([ul_shape[0], 1], dtype=tf.int64), labels_all)

    labels_subsets = tf.map_fn(
        get_subsets_for_labels,
        (unique_labels, labels_tiled),
        dtype=tf.bool
    )

    final_labels = tf.map_fn(
        scatter_labels,
        elems=(tf.expand_dims(unique_labels, 1), labels_subsets),
        dtype=tf.int64
    )

    final_labels = tf.reduce_max(final_labels, axis=0)

    # run the graph once with the data fed in (TF 1.x session API)
    with tf.Session() as sess:
        results = sess.run(final_labels, feed_dict={in_set: data_np})

    # return the labels as a column vector, one row per input point
    return results.reshape((-1, 1))
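
The answer warns that its labels can differ from sklearn's, especially for points reachable from more than one cluster. To make such comparisons concrete, here is a minimal sketch of textbook DBSCAN in plain Python. It is not the author's code and not sklearn's implementation; the names `region_query` and `dbscan` are made up for this illustration. It assumes, as the TF code above appears to do, that `min_points` counts the point itself and that noise is labeled 0. Border points here join the first cluster that reaches them, which is one place where results may legitimately diverge.

```python
import math

NOISE = 0  # assumption: unlabeled points get label 0, as the TF code above appears to do


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    p = points[i]
    return [
        j for j, q in enumerate(points)
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) <= eps
    ]


def dbscan(points, eps=4.0, min_points=4):
    """Textbook DBSCAN: returns one integer label per point, 0 meaning noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the first cluster reaching it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# two tight squares of four points each, plus one far-away outlier
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_points=4))  # [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

Each `region_query` call scans every point, so this sketch is quadratic in the number of points and only suitable for cross-checking labels on small inputs.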

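【Comments】:

Please add the relevant part of the linked content here.
Yes, I did some benchmarking, but unfortunately only on CPU. I don't remember the details, but it was much slower.
Update: I just ran a rough test on 100 and 1000 data points (2 features each), and there seems to be a difference of up to about 3 orders of magnitude in favor of the sklearn implementation. Details (obtained with %%timeit in Jupyter): 100 points: 118 ms (TF) vs 925 µs (sklearn); 1000 points: 22.4 s (TF) vs 7.9 ms (sklearn). It would be interesting to see how much a decent GPU could improve the performance of the TF implementation.
Has it been upgraded to TF v2.x?
@LuisFelipe Unfortunately not, and I'm sorry to admit that I haven't maintained the repository for a long time. But feel free to fork it or even contribute to the repo.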
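
For readers who want to see how the timings quoted in the comments translate into a slowdown factor, the arithmetic can be sketched as follows. The numbers are taken from the comment thread as reported (118 ms vs 925 µs at 100 points, 22.4 s vs 7.9 ms at 1000 points); the dictionary layout and the `slowdown` helper are just illustrative.

```python
import math

# Timings as reported in the comment thread, converted to seconds.
timings = {
    100: {"tf": 118e-3, "sklearn": 925e-6},
    1000: {"tf": 22.4, "sklearn": 7.9e-3},
}


def slowdown(n):
    """How many times slower the TF version was than sklearn at n points."""
    t = timings[n]
    return t["tf"] / t["sklearn"]


for n in timings:
    ratio = slowdown(n)
    print(f"{n} points: TF ~{ratio:.0f}x slower (about 10^{math.log10(ratio):.1f})")
```

On these figures the gap is roughly 128x at 100 points and roughly 2835x at 1000 points, so it widens as the input grows.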