密度聚类(Density peaks Clustering)Python实现

原文：http://blog.csdn.net/kryolith/article/details/39832573

Rodriguez A, Laio A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496.

这是朋友推荐给我的一篇文章，大体的思想是：通过计算全局（当然也可以优化到局部）所有Sample相互之间的距离，并对某个Sample与其他所有点之间的距离进行排序，通过一个threshold对距离进行分割获取有效后，根据该Sample有效距离内的其他Sample的数量来推算该点的密度，再计算低密度点到高密度点的最小距离（最高密度点取最大距离），通过密度和最小距离这两个参数构建直角坐标系，在这个直角坐标系中最右上角的点即聚类中心点。（这一块的理论可以参考http://blog.csdn.net/itplus/article/details/38926837的文章，他讲得很详细，我这里只是大概描述一下这个算法的流程）

其实原来打算用在自己的实验中的，结果发现效果并不是我所需要的，所以这里我就公开一下python实现的代码，供大家参考，如果有不足请一定指出，共同学习进步。

先上效果图一张（上图是Sample分布图，下图则是密度-距离图）：

密度聚类(Density peaks Clustering)Python实现

由于数据点我选得比较少，所以下图看起来非中心点的分布看起来就不那么可爱了。而这些点的分布完全看距离分割阈值threshold的取值。

废话不多说，先上主算法代码：

[python]view
plain copy

import numpy as np  

import matplotlib.pyplot as plt  

from sklearn.datasets import make_blobs  

def distanceNorm(Norm,D_value):  

    # initialization  

    # Norm for distance  

    if Norm == '1':  

        counter = np.absolute(D_value);  

        counter = np.sum(counter);  

    elif Norm == '2':  

        counter = np.power(D_value,2);  

        counter = np.sum(counter);  

        counter = np.sqrt(counter);  

    elif Norm == 'Infinity':  

        counter = np.absolute(D_value);  

        counter = np.max(counter);  

    else:  

        raise Exception('We will program this later......');  

    return counter;  

def chi(x):  

    if x < 0:  

        return 1;  

    else:  

        return 0;  

def fit(features,labels,t,distanceMethod = '2'):  

    # initialization  

    distance = np.zeros((len(labels),len(labels)));  

    distance_sort = list();  

    density = np.zeros(len(labels));  

    distance_higherDensity = np.zeros(len(labels));  

    # compute distance  

    for index_i in xrange(len(labels)):  

        for index_j in xrange(index_i+1,len(labels)):  

            D_value = features[index_i] - features[index_j];  

            distance[index_i,index_j] = distanceNorm(distanceMethod,D_value);  

            distance_sort.append(distance[index_i,index_j]);  

    distance += distance.T;  

    # compute optimal cutoff  

    distance_sort = np.array(distance_sort);  

    cutoff = int(np.round(distance_sort[len(distance_sort) * t]));  

    # computer density  

    for index_i in xrange(len(labels)):  

        distance_cutoff_i = distance[index_i] - cutoff;  

        for index_j in xrange(1,len(labels)):  

            density[index_i] += chi(distance_cutoff_i[index_j]);  

    # search for the max density  

    Max = np.max(density);  

    MaxIndexList = list();  

    for index_i in xrange(len(labels)):  

        if density[index_i] == Max:  

            MaxIndexList.extend([index_i]);  

    # computer distance_higherDensity  

    Min = 0;  

    for index_i in xrange(len(labels)):  

        if index_i in MaxIndexList:  

            distance_higherDensity[index_i] = np.max(distance[index_i]);  

            continue;  

        else:  

            Min = np.max(distance[index_i]);  

        for index_j in xrange(1,len(labels)):  

            if density[index_i] < density[index_j] and distance[index_i,index_j] < Min:  

                Min = distance[index_i,index_j];  

            else:  

                continue;  

        distance_higherDensity[index_i] = Min;  

    return density,distance_higherDensity,cutoff;

distanceNorm()这个函数用来选择距离公式，这也是原文被吐槽的一个点。文章中压根就没提关于距离的计算方法，并称对高维数据有着不错的效果，所以这里我就把常用的三种范数的距离公式并列，以供选择使用。chi()函数就是源自于文章的希腊字母的读音【Χχ（chi）】，类似sgn()的函数。fit()中的参数t是文章说提到的动态分割阈值的参数，根据作者的经验参数是1%~2%，具体的证明过程这里就不给出了，如果有兴趣可以看原文，如果懒可以去看我前面说的那篇中文的blog。

下面是我当时写的其他一些相关的方法：

[python]view
plain copy

def searchCenter(density,distance_higherDensity):  

    area = np.multiply(density,distance_higherDensity);  

    distance_higherDensity_max = np.max(distance_higherDensity);  

    area_max = np.max(area);  

    for index_i in xrange(len(distance_higherDensity)):  

        if distance_higherDensity[index_i] == distance_higherDensity_max and area[index_i] == area_max:  

            return index_i;

searchCenter()顾名思义就是寻找中心点的算法，因为我当时只是用来做单类聚类，所以需求只需要最右上角的Sample，这里就写得很简单了。这也是文章中另外一个被吐槽的点儿，我们怎么用计算的方法去判断一个Sample是不是样本中心点，我这里用的是密度*距离，但事实上有的时候这个分布会很怪异，所以具体情况请根据自己的实验数据来看。

附上资源下载地址：http://download.csdn.net/detail/kryolith/8007759

Rodriguez A, Laio A. Clustering by fast search and find of density peaks[J]. Science, 2014, 344(6191): 1492-1496.

先上效果图一张（上图是Sample分布图，下图则是密度-距离图）：

密度聚类(Density peaks Clustering)Python实现

由于数据点我选得比较少，所以下图看起来非中心点的分布看起来就不那么可爱了。而这些点的分布完全看距离分割阈值threshold的取值。

废话不多说，先上主算法代码：

[python]view
plain copy

import numpy as np  

import matplotlib.pyplot as plt  

from sklearn.datasets import make_blobs  

def distanceNorm(Norm,D_value):  

    # initialization  

    # Norm for distance  

    if Norm == '1':  

        counter = np.absolute(D_value);  

        counter = np.sum(counter);  

    elif Norm == '2':  

        counter = np.power(D_value,2);  

        counter = np.sum(counter);  

        counter = np.sqrt(counter);  

    elif Norm == 'Infinity':  

        counter = np.absolute(D_value);  

        counter = np.max(counter);  

    else:  

        raise Exception('We will program this later......');  

    return counter;  

def chi(x):  

    if x < 0:  

        return 1;  

    else:  

        return 0;  

def fit(features,labels,t,distanceMethod = '2'):  

    # initialization  

    distance = np.zeros((len(labels),len(labels)));  

    distance_sort = list();  

    density = np.zeros(len(labels));  

    distance_higherDensity = np.zeros(len(labels));  

    # compute distance  

    for index_i in xrange(len(labels)):  

        for index_j in xrange(index_i+1,len(labels)):  

            D_value = features[index_i] - features[index_j];  

            distance[index_i,index_j] = distanceNorm(distanceMethod,D_value);  

            distance_sort.append(distance[index_i,index_j]);  

    distance += distance.T;  

    # compute optimal cutoff  

    distance_sort = np.array(distance_sort);  

    cutoff = int(np.round(distance_sort[len(distance_sort) * t]));  

    # computer density  

    for index_i in xrange(len(labels)):  

        distance_cutoff_i = distance[index_i] - cutoff;  

        for index_j in xrange(1,len(labels)):  

            density[index_i] += chi(distance_cutoff_i[index_j]);  

    # search for the max density  

    Max = np.max(density);  

    MaxIndexList = list();  

    for index_i in xrange(len(labels)):  

        if density[index_i] == Max:  

            MaxIndexList.extend([index_i]);  

    # computer distance_higherDensity  

    Min = 0;  

    for index_i in xrange(len(labels)):  

        if index_i in MaxIndexList:  

            distance_higherDensity[index_i] = np.max(distance[index_i]);  

            continue;  

        else:  

            Min = np.max(distance[index_i]);  

        for index_j in xrange(1,len(labels)):  

            if density[index_i] < density[index_j] and distance[index_i,index_j] < Min:  

                Min = distance[index_i,index_j];  

            else:  

                continue;  

        distance_higherDensity[index_i] = Min;  

    return density,distance_higherDensity,cutoff;

下面是我当时写的其他一些相关的方法：

[python]view
plain copy

def searchCenter(density,distance_higherDensity):  

    area = np.multiply(density,distance_higherDensity);  

    distance_higherDensity_max = np.max(distance_higherDensity);  

    area_max = np.max(area);  

    for index_i in xrange(len(distance_higherDensity)):  

        if distance_higherDensity[index_i] == distance_higherDensity_max and area[index_i] == area_max:  

            return index_i;

附上资源下载地址：http://download.csdn.net/detail/kryolith/8007759