How to know the count of rows in each cluster with dbscan? Answers

[Question title]: How to know the count of rows in each cluster with dbscan?
[Posted]: 2020-10-15 07:58:06
[Question description]:

The CSV data looks like this:

device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_1
A0001,2020-08-05 05:10:05+00:00,23.140366,114.18685,0.0,,0,202008
A0001,2020-08-05 05:10:33+00:00,22.994716,114.2998,0.0,,0,202008
A0001,2020-08-05 05:20:55+00:00,22.994716,114.2998,0.0,,3.8,202008
A0001,2020-08-05 05:24:02+00:00,22.994916,114.299683,0.0,,2.1,202008
A0001,2020-08-05 05:24:30+00:00,22.99545,114.2998,0.0,,6.5,202008
A0001,2020-08-05 05:29:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:34:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:39:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:53+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:45:40+00:00,22.995433,114.299766,0.0,,5.8,202008

I use the latitude and longitude data in the CSV to generate a DBSCAN cluster image, with each cluster drawn in a different color.

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd

def draw_with_dbscan(para_csv_path_name,para_csv_name,para_save_path):
    df = pd.read_csv(para_csv_path_name, encoding='utf-8', parse_dates=[1], low_memory=False)
    X = df[['latitude', 'longitude']]
    X = X.drop_duplicates()
    kms_per_rad = 6371.0088  # mean radius of the earth, in km
    # eps: the maximum distance between two samples for one to be considered
    # as in the neighborhood of the other. It is not a maximum bound on the
    # distances of points within a cluster. It is the most important DBSCAN
    # parameter to choose for your data set and distance function (default=0.5).
    # Here 1.5 km is converted to radians for the haversine metric.
    epsilon = 1.5 / kms_per_rad
    dbsc = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(X)))
    fac_cluster_labels = dbsc.labels_
    # get the number of clusters
    num_clusters = len(set(dbsc.labels_))
    # turn the clusters into a pandas series,where each element is a cluster of points
    dbsc_clusters = pd.Series([X[fac_cluster_labels == n] for n in range(num_clusters)])
    # get centroid of each cluster
    fac_centroids = dbsc_clusters.map(get_centroid)
    # unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
    cent_lats, cent_lons = zip(*fac_centroids)
    # from these lats/lons create a new df of one representative point for each cluster
    centroids_pd = pd.DataFrame({'longitude': cent_lons, 'latitude': cent_lats})
    # Plot the facility clusters and cluster centroids
    fig, ax = plt.subplots(figsize=[20, 10])
    facility_scatter = ax.scatter(X['longitude'], X['latitude'], c=fac_cluster_labels,
                                  edgecolor='None', alpha=0.7, s=120)
    centroid_scatter = ax.scatter(centroids_pd['longitude'], centroids_pd['latitude'], marker='x', linewidths=2,
                                  c='k', s=50)
    ax.set_title('Facility Clusters & Facility Centroid', fontsize=30)
    ax.set_xlabel('Longitude', fontsize=24)
    ax.set_ylabel('Latitude', fontsize=24)
    ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'], loc='upper right',
              fontsize=20)
    # plt.show()
    plt.savefig(para_save_path + para_csv_name.split('.')[0] + '.png')
    plt.close()


def get_centroid(cluster):
    """calculate the centroid of a cluster of geographic coordinate points
    Args:
      cluster coordinates, nx2 array-like (array, list of lists, etc)
      n is the number of points(latitude, longitude)in the cluster.
    Return:
      geometry centroid of the cluster

    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid



if __name__ == '__main__':
    csvlName=r'E:/mydata/test.csv'
    item='test.csv'
    abnormal_dbscan_device_img_dir=r'E:/result/'
    draw_with_dbscan(csvlName, item, abnormal_dbscan_device_img_dir)

The generated image looks like this:

But how can I find out, with DBSCAN, how many rows of latitude/longitude data are in each cluster?

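[Comments]:

Could you provide a minimal reproducible example that relates to your title, along with the desired output?
I have modified my code.
It's still not clear what your problem is, but you can try values, counts = np.unique(fac_cluster_labels,return_counts=True); {k:v for k,v in zip(values,counts)}
For example, a CSV has 100 rows of latitude/longitude data divided into 4 clusters. The purple cluster has 10 rows, the blue cluster has 20 rows, the yellow cluster has 30 rows, and the green cluster has 40 rows. I want to know how many rows of data each cluster has.
Have you tried the code I suggested above?

Tags: python scikit-learn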

[Solution 1]:

You could try:

values = np.unique(fac_cluster_labels,return_counts=True)
{k:v for k,v in zip(*values)}
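
To see what the snippet above returns, here is a minimal sketch on a made-up label array. The array is only a stand-in for `dbsc.labels_` (it is an assumption, not output from the question's data); with `min_samples=1`, as in the question, DBSCAN would not actually produce the noise label `-1`, but it is included to show that it gets counted like any other label.

```python
import numpy as np

# Hypothetical labels standing in for dbsc.labels_; -1 is DBSCAN's noise label
fac_cluster_labels = np.array([0, 0, 1, 2, 2, 2, -1])

# np.unique returns the sorted distinct labels and how often each one occurs
values, counts = np.unique(fac_cluster_labels, return_counts=True)

# Map each cluster label to its number of points (plain ints for readability)
cluster_sizes = {int(k): int(v) for k, v in zip(values, counts)}
print(cluster_sizes)  # {-1: 1, 0: 2, 1: 1, 2: 3}
```

The dictionary keys are the cluster labels and the values are the number of points assigned to each one.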

[Comments]:

Yes, that's what I want.
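
One caveat worth noting: the question's code calls `X.drop_duplicates()` before fitting, so counting `dbsc.labels_` gives the number of distinct coordinate pairs per cluster, not the number of CSV rows. If the original row count is what is needed, one possible approach is to attach the labels to the de-duplicated points and merge them back onto the full frame. The sketch below uses a few coordinates from the sample rows and a hypothetical `labels` array in place of a real DBSCAN fit; the names `labelled`, `merged`, and `rows_per_cluster` are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the CSV's coordinate columns (values from the sample rows)
df = pd.DataFrame({
    'latitude':  [23.140366, 22.994716, 22.994716, 22.995433, 22.995433],
    'longitude': [114.18685, 114.2998, 114.2998, 114.299766, 114.299766],
})

# Same de-duplication step as in the question: 3 distinct points remain
X = df[['latitude', 'longitude']].drop_duplicates()

# Hypothetical labels, one per row of X, standing in for dbsc.labels_
labels = np.array([0, 1, 1])

# Attach each label to its unique point, then map labels back to every CSV row
labelled = X.assign(cluster=labels)
merged = df.merge(labelled, on=['latitude', 'longitude'], how='left')

# Count the original rows per cluster
rows_per_cluster = merged['cluster'].value_counts().sort_index()
print(rows_per_cluster.to_dict())  # {0: 1, 1: 4}
```

Here cluster 1 contains two distinct points but four CSV rows, which is the difference between the two ways of counting.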