Get nearest centroid using Thrust library? (K-Means)

【Question title】: Get nearest centroid using Thrust library? (K-Means)
【Posted】: 2014-07-21 03:47:12
【Question】:

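I have already computed the distances and stored them in a Thrust vector. For instance, I have 2 centroids and 5 data points. For each centroid in turn, I compute its distance to all 5 data points and write those values into a single 1D distance array, so the values for the first centroid come first, followed by the values for the second centroid, like this: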

for (int i = 0; i < centroids.size(); ++i)
{
    computeDistance(Data, distances, centroids[i], nDataPoints, nDimensions);
}

This produces a 1D vector, for example:

DistancesValues = {10, 15, 20, 12, 10, 5, 17, 22, 8, 7}

DatapointsIndex = {1, 2,  3,   4,  5,  1,  2,  3, 4, 5}

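where the first 5 values belong to centroid 1 and the other 5 values belong to centroid 2.

Is there a Thrust function that can count, for each centroid, how many data points have their minimum distance to that centroid, and store those counts in another array?

Comparing the values at each data point index, the result should be: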

Counts = {2, 3}

where:

CountOfCentroid 1 = 2
CountOfCentroid 2 = 3
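
To make the expected result concrete, here is a minimal host-side sketch in plain C++ (no Thrust) that counts the nearest centroid for each data point directly from the centroid-major distance array described above. The function name `countNearest` and its parameters are hypothetical and only serve as a reference to check a GPU version against. It assumes that ties go to the lower-numbered centroid, which the question does not specify.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the post): counts, for each centroid, how many
// data points are closest to it. `distances` is laid out centroid-major, i.e.
// distances[c * numPoints + p] is the distance from centroid c to point p.
std::vector<int> countNearest(const std::vector<int>& distances,
                              std::size_t numCentroids, std::size_t numPoints) {
    std::vector<int> counts(numCentroids, 0);
    for (std::size_t p = 0; p < numPoints; ++p) {
        std::size_t best = 0;
        for (std::size_t c = 1; c < numCentroids; ++c) {
            // Strict '<' keeps the lower-numbered centroid on ties (an assumption).
            if (distances[c * numPoints + p] < distances[best * numPoints + p]) {
                best = c;
            }
        }
        ++counts[best];
    }
    return counts;
}
```

Called with the ten distance values above, 2 centroids and 5 points, this yields the counts 2 and 3 shown in the question.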

【Comments】:

Tags: c++ cuda k-means thrust

【Answer 1】:

Here is one possible approach:

Create an additional vector of centroid indices:

DistancesValues = {10, 15, 20, 12, 10, 5, 17, 22,  8, 7}
DatapointsIndex = {1,   2,  3,  4,  5, 1,  2,  3,  4, 5}
CentroidIndex   = {1,   1,  1,  1,  1, 2,  2,  2,  2, 2}

Now do a sort_by_key, using DatapointsIndex as the keys and the other two vectors zipped together as the values. This rearranges all 3 vectors so that equal indices in DatapointsIndex are grouped together:
```
DatapointsIndex = {1, 1, 2, 2, 3, 3, 4, 4, 5, 5} 
```
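(the other 2 vectors are rearranged correspondingly).
Now do a reduce_by_key. If we choose the thrust::minimum operator, we get a reduction that selects the minimum value in each group (instead of summing the values in the group). reduce_by_key performs this reduction on each run of consecutive equal keys. So we again use DatapointsIndex as the key vector and the other two vectors zipped together as the value vector. We don't care about most of the output of reduce_by_key, only about the result vector that comes from the CentroidIndex vector. By counting the centroid indices in this result vector, we get the desired output.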
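
The minimum of a zipped (distance, centroid) value works because tuples compare lexicographically: the distance is compared first, and the centroid index only breaks ties. The following host-side sketch in plain C++ mimics that reduction on keys that are already grouped. It is an illustration only; `DistCentroid` and `minByKey` are hypothetical names and not part of Thrust.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side illustration only; the names below are hypothetical, not Thrust API.
// Each value is a (distance, centroidIndex) pair. std::pair, like thrust::tuple,
// compares lexicographically, so std::min picks the smaller distance first and
// falls back to the smaller centroid index when distances tie.
using DistCentroid = std::pair<int, int>;

// For every run of equal consecutive keys, keep the minimum value of the run.
std::vector<DistCentroid> minByKey(const std::vector<int>& keys,
                                   const std::vector<DistCentroid>& values) {
    std::vector<DistCentroid> out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || keys[i] != keys[i - 1]) {
            out.push_back(values[i]);  // a new run of keys starts here
        } else {
            out.back() = std::min(out.back(), values[i]);
        }
    }
    return out;
}
```

With the grouped keys {1, 1, 2, 2, 3, 3, 4, 4, 5, 5} and the matching (distance, centroid) pairs, the second components of the result are the nearest centroid per data point, which can then be counted.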

Here is a complete example:

$ cat t428.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/count.h>
#include <thrust/functional.h>
#include <thrust/tuple.h>
#include <thrust/copy.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <stdio.h>
#define NUM_POINTS 5
#define NUM_CENTROID 2
#define DSIZE (NUM_POINTS*NUM_CENTROID)

int main(){

  int DistancesValues[DSIZE] = {10, 15, 20, 12, 10, 5, 17, 22, 8, 7};
  int DatapointsIndex[DSIZE] = {1, 2,  3,   4,  5,  1,  2,  3, 4, 5};
  int CentroidIndex[DSIZE]   = {1, 1, 1, 1, 1, 2, 2, 2, 2, 2};

  // Copy the host arrays to the device.
  thrust::device_vector<int> DV(DistancesValues, DistancesValues + DSIZE);
  thrust::device_vector<int> DI(DatapointsIndex, DatapointsIndex + DSIZE);
  thrust::device_vector<int> CI(CentroidIndex, CentroidIndex + DSIZE);
  // Per-point results: Ra = minimum distance, Rb = index of the nearest centroid.
  thrust::device_vector<int> Ra(NUM_POINTS);
  thrust::device_vector<int> Rb(NUM_POINTS);

  // Group entries by data point; distances and centroid indices move with the keys.
  thrust::sort_by_key(DI.begin(), DI.end(),
                      thrust::make_zip_iterator(thrust::make_tuple(DV.begin(), CI.begin())));
  // Per data point, keep the smallest (distance, centroid) tuple; the reduced keys are discarded.
  thrust::reduce_by_key(DI.begin(), DI.end(),
                        thrust::make_zip_iterator(thrust::make_tuple(DV.begin(), CI.begin())),
                        thrust::make_discard_iterator(),
                        thrust::make_zip_iterator(thrust::make_tuple(Ra.begin(), Rb.begin())),
                        thrust::equal_to<int>(),
                        thrust::minimum<thrust::tuple<int, int> >());
  // thrust::count returns a signed difference type, so cast before printing with %d.
  printf("CountOfCentroid 1 = %d\n", (int)thrust::count(Rb.begin(), Rb.end(), 1));
  printf("CountOfCentroid 2 = %d\n", (int)thrust::count(Rb.begin(), Rb.end(), 2));

  return 0;
}

$ nvcc -arch=sm_20 -o t428 t428.cu
$ ./t428
CountOfCentroid 1 = 2
CountOfCentroid 2 = 3
$

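As Eric points out in his answer here (your question is nearly a duplicate of that one), the sort_by_key is probably unnecessary. The reordering of this data follows a regular pattern, so we don't need the full cost of a sort and can instead reorder the data through clever use of iterators. Under these circumstances, it may be possible to do (approximately) the entire operation with a single call to reduce_by_key.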
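
The regular pattern can be seen in the index arithmetic alone. The sketch below is plain host-side C++, not Eric's code and not Thrust; `nearestCentroidNoSort` is a hypothetical name. It visits the centroid-major array in point-major order by computing, for each output position, which data point it belongs to and where the corresponding distance sits in the original layout. In Thrust, the same mapping could presumably be built from a counting_iterator passed through a transform_iterator and a permutation_iterator and then fed to reduce_by_key, but that is an assumption about how the single-call version might look.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from either answer): walks the centroid-major
// distance array in point-major order using index arithmetic alone.
// Returns the 1-based index of the nearest centroid for each data point.
std::vector<int> nearestCentroidNoSort(const std::vector<int>& distances,
                                       std::size_t numCentroids, std::size_t numPoints) {
    std::vector<int> nearest(numPoints, 0);
    std::vector<int> best(numPoints, 0);
    for (std::size_t j = 0; j < numCentroids * numPoints; ++j) {
        std::size_t point = j / numCentroids;            // key: which data point
        std::size_t centroid = j % numCentroids;         // position within the group
        std::size_t src = centroid * numPoints + point;  // index in the original layout
        // The first entry of each group initializes the running minimum.
        if (centroid == 0 || distances[src] < best[point]) {
            best[point] = distances[src];
            nearest[point] = static_cast<int>(centroid) + 1;
        }
    }
    return nearest;
}
```

For the example data this gives the nearest centroids 2, 1, 1, 2, 2 for data points 1 through 5, which again counts to 2 for centroid 1 and 3 for centroid 2.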

【Comments】:

Awesome, Robert! Thanks a lot! I will also check Eric's answer to see whether I can do the whole operation with reduce_by_key.