【Question Title】: How would I implement k-means with TensorFlow?
【Posted】: 2016-02-10 20:06:27
【Question】:

The introductory tutorials that use the built-in gradient descent optimizer make a lot of sense. However, k-means isn't just something I can plug into gradient descent. It seems like I'd have to write my own optimizer, but I'm not quite sure how to do that given the TensorFlow primitives.

What approach should I take?

【Question Discussion】:

    Tags: k-means tensorflow


    【Solution 1】:

    Nowadays you can directly use (or take inspiration from) the KMeansClustering Estimator. You can take a look at its implementation on GitHub.
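
    For reference, here is a minimal sketch of how that estimator could be used (assuming TF 1.x, where it lives under tf.contrib.factorization; the toy data and the input_fn below are illustrative, not part of the original answer):

    import numpy as np
    import tensorflow as tf

    points = np.random.rand(1000, 2).astype(np.float32)  # toy 2-D data

    def input_fn():
        # feed the whole dataset once per train() call
        return tf.train.limit_epochs(
            tf.convert_to_tensor(points, dtype=tf.float32), num_epochs=1)

    kmeans = tf.contrib.factorization.KMeansClustering(
        num_clusters=4, use_mini_batch=False)
    for _ in range(10):  # a few Lloyd iterations
        kmeans.train(input_fn)

    centers = kmeans.cluster_centers()
    assignments = list(kmeans.predict_cluster_index(input_fn))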

    【Discussion】:

      【Solution 2】:

      Most of the answers I have seen so far focus only on the 2-D version (when you need to cluster points in 2 dimensions). Here is my implementation of the clustering in an arbitrary number of dimensions.


      The basic idea of the k-means algorithm in n dimensions (a minimal NumPy sketch of this loop follows the list):

      • randomly generate k starting points
      • repeat until you run out of patience or the cluster assignments stop changing:
        • assign each point to its closest starting point
        • recompute the position of each starting point as the mean of its cluster
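
      Here is that loop as a plain NumPy sketch (assuming Euclidean distance; kmeans_np and its arguments are illustrative names, not part of the TF code below):

      import numpy as np
      
      def kmeans_np(X, k, max_iter=300, seed=0):
          rng = np.random.RandomState(seed)
          # k random data points as the starting centroids
          centroids = X[rng.choice(len(X), k, replace=False)].copy()
          assignments = -np.ones(len(X), dtype=int)
          for _ in range(max_iter):
              # distances from every point to every centroid, shape (N, k)
              dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
              new_assignments = dists.argmin(axis=1)
              if np.array_equal(new_assignments, assignments):
                  break  # the assignments stopped changing
              assignments = new_assignments
              for j in range(k):
                  if np.any(assignments == j):  # skip empty clusters (would give NaN)
                      centroids[j] = X[assignments == j].mean(axis=0)
          return centroids, assignments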

      To be able to verify the result somehow, I will try to cluster MNIST images.

      import numpy as np
      import tensorflow as tf
      from collections import Counter
      from tensorflow.examples.tutorials.mnist import input_data
      
      mnist = input_data.read_data_sets("MNIST_data/")
      X, y, k = mnist.test.images, mnist.test.labels, 10
      

      So here X is the data I want to cluster, with shape (10000, 784), y is the actual digit labels, and k is the number of clusters (the same as the number of digits). Now the actual algorithm:

      # randomly pick k of the data points as the starting position; a smarter
      # initialization (e.g. k-means++) would do better
      start_pos = tf.Variable(X[np.random.randint(X.shape[0], size=k),:], dtype=tf.float32)
      centroids = tf.Variable(start_pos.initialized_value(), name='S', dtype=tf.float32)
      
      # populate points
      points           = tf.Variable(X, name='X', dtype=tf.float32)
      ones_like        = tf.ones((points.get_shape()[0], 1))
      prev_assignments = tf.Variable(tf.zeros((points.get_shape()[0], ), dtype=tf.int64))
      
      # find the distance between all points: http://stackoverflow.com/a/43839605/1090562
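      # (this relies on the identity ||p - c||^2 = ||p||^2 + ||c||^2 - 2*<p, c>,
      #  so all pairwise distances come out of a single matmul)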
      p1 = tf.matmul(
          tf.expand_dims(tf.reduce_sum(tf.square(points), 1), 1),
          tf.ones(shape=(1, k))
      )
      p2 = tf.transpose(tf.matmul(
          tf.reshape(tf.reduce_sum(tf.square(centroids), 1), shape=[-1, 1]),
          ones_like,
          transpose_b=True
      ))
      distance = tf.sqrt(tf.add(p1, p2) - 2 * tf.matmul(points, centroids, transpose_b=True))
      
      # assign each point to a closest centroid
      point_to_centroid_assignment = tf.argmin(distance, axis=1)
      
      # recalculate the centers
      total = tf.unsorted_segment_sum(points, point_to_centroid_assignment, k)
      count = tf.unsorted_segment_sum(ones_like, point_to_centroid_assignment, k)
      means = total / count
      
      # continue if there is any difference between the current and previous assignment
      is_continue = tf.reduce_any(tf.not_equal(point_to_centroid_assignment, prev_assignments))
      
      with tf.control_dependencies([is_continue]):
          loop = tf.group(centroids.assign(means), prev_assignments.assign(point_to_centroid_assignment))
      
      sess = tf.Session()
      sess.run(tf.global_variables_initializer())
      
      # run many iterations; ideally the loop stops early because has_changed becomes False
      has_changed, cnt = True, 0
      while has_changed and cnt < 300:
          cnt += 1
          has_changed, _ = sess.run([is_continue, loop])
      
      # see how the data is assigned
      res = sess.run(point_to_centroid_assignment)
      

      Now it is time to check how good our clusters are. To do this, we group together all the real digit labels that appear in each cluster and then look at the most popular label in that cluster. In the case of perfect clustering we will have just one value in each group. In the case of random clusters, each value will be roughly equally represented in each group.

      nums_in_clusters = [[] for i in range(10)]
      for cluster, real_num in zip(list(res), list(y)):
          nums_in_clusters[cluster].append(real_num)
      
      for i in range(10):
          print(Counter(nums_in_clusters[i]).most_common(3))
      

      This gives me something like this:

      [(0, 738), (6, 18), (2, 11)]
      [(1, 641), (3, 53), (2, 51)]
      [(1, 488), (2, 115), (7, 56)]
      [(4, 550), (9, 533), (7, 280)]
      [(7, 634), (9, 400), (4, 302)]
      [(6, 649), (4, 27), (0, 14)]
      [(5, 269), (6, 244), (0, 161)]
      [(8, 646), (5, 164), (3, 125)]
      [(2, 698), (3, 34), (7, 14)]
      [(3, 712), (5, 290), (8, 110)]
      

      This is pretty good, because the majority of the counts is in the first group. You can see that the clustering confuses 7 and 9, and 4 and 5, but 0 is clustered nearly perfectly.

      A few ways to improve this:

      • run the algorithm a couple of times and select the best clustering (based on the distance to the clusters)
      • handle the case when no point is assigned to a cluster; in my example you will get NaN in the means variable because count is 0 (see the sketch of a guard right after this list)
      • smarter initialization of the starting points.
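
      As a hedged sketch of that second point (the tf.where guard and the safe_count / has_points names are illustrative, not part of the original answer):

      # guard against empty clusters: never divide by a zero count
      safe_count = tf.maximum(count, tf.ones_like(count))
      raw_means  = total / safe_count
      # for clusters that received no points, keep the previous centroid
      has_points = tf.squeeze(count, axis=1) > 0   # shape (k,)
      means      = tf.where(has_points, raw_means, centroids)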

      【Discussion】:

        【Solution 3】:

        (Note: You can now get a more polished version of this code as a gist on github.)

        You can definitely do it, but you need to define your own optimization criteria (for k-means, it's usually a max iteration count and when the assignments stabilize). Here's an example of how you could do it (there are probably more optimal ways to implement it, and definitely better ways to select the initial points). It's basically what you would do in numpy if you were trying really hard to stay away from doing things iteratively in python:

        import tensorflow as tf
        import numpy as np
        import time
        
        N=10000
        K=4
        MAX_ITERS = 1000
        
        start = time.time()
        
        points = tf.Variable(tf.random_uniform([N,2]))
        cluster_assignments = tf.Variable(tf.zeros([N], dtype=tf.int64))
        
        # Silly initialization:  Use the first two points as the starting                
        # centroids.  In the real world, do this better.                                 
        centroids = tf.Variable(tf.slice(points.initialized_value(), [0,0], [K,2]))
        
        # Replicate to N copies of each centroid and K copies of each                    
        # point, then subtract and compute the sum of squared distances.                 
        rep_centroids = tf.reshape(tf.tile(centroids, [N, 1]), [N, K, 2])
        rep_points = tf.reshape(tf.tile(points, [1, K]), [N, K, 2])
        sum_squares = tf.reduce_sum(tf.square(rep_points - rep_centroids),
                                    reduction_indices=2)
        
        # Use argmin to select the lowest-distance point                                 
        best_centroids = tf.argmin(sum_squares, 1)
        did_assignments_change = tf.reduce_any(tf.not_equal(best_centroids,
                                                            cluster_assignments))
        
        def bucket_mean(data, bucket_ids, num_buckets):
            total = tf.unsorted_segment_sum(data, bucket_ids, num_buckets)
            count = tf.unsorted_segment_sum(tf.ones_like(data), bucket_ids, num_buckets)
            return total / count
        
        means = bucket_mean(points, best_centroids, K)
        
        # Do not write to the assigned clusters variable until after                     
        # computing whether the assignments have changed - hence with_dependencies
        with tf.control_dependencies([did_assignments_change]):
            do_updates = tf.group(
                centroids.assign(means),
                cluster_assignments.assign(best_centroids))
        
        sess = tf.Session()
        sess.run(tf.global_variables_initializer())
        
        changed = True
        iters = 0
        
        while changed and iters < MAX_ITERS:
            iters += 1
            [changed, _] = sess.run([did_assignments_change, do_updates])
        
        [centers, assignments] = sess.run([centroids, cluster_assignments])
        end = time.time()
        print("Found in %.2f seconds" % (end - start), iters, "iterations")
        print("Centroids:")
        print(centers)
        print("Cluster assignments:", assignments)
        

        (Note that a real implementation would need to be more careful about the initial cluster selection, avoiding problem cases such as all points going to one cluster, etc. This was just a quick demo. I've updated my earlier answer to make it a bit more clear and 'example-worthy'.)

        【Discussion】:

        • I should explain it a bit better. It takes the N points and makes K copies of them. It takes the K current centroids and makes N copies of them. It then subtracts these two big tensors to get the N*K distances from each point to each centroid. It computes the sum of squared distances over those, and uses 'argmin' to find the best one for each point. Then it uses dynamic_partition to group the points into K different tensors based on their cluster assignment, finds the means in each of those clusters, and sets the centroids based on that.
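
        (The code above actually computes the per-cluster means with unsorted_segment_sum inside bucket_mean; as a rough sketch, the dynamic_partition variant described in this comment could look something like the following, with new_centroids being an illustrative name:)

        # split the points into K groups according to their cluster assignment
        partitions = tf.dynamic_partition(points, tf.to_int32(best_centroids), K)
        # the new centroid of each group is the mean of its points
        new_centroids = tf.concat(
            [tf.reduce_mean(part, 0, keep_dims=True) for part in partitions], 0)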