【Question title】: How to know the count of rows in each cluster with DBSCAN?
【Posted】: 2020-10-15 07:58:06
【Question description】:

The CSV data looks like this:

device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_1
A0001,2020-08-05 05:10:05+00:00,23.140366,114.18685,0.0,,0,202008
A0001,2020-08-05 05:10:33+00:00,22.994716,114.2998,0.0,,0,202008
A0001,2020-08-05 05:20:55+00:00,22.994716,114.2998,0.0,,3.8,202008
A0001,2020-08-05 05:24:02+00:00,22.994916,114.299683,0.0,,2.1,202008
A0001,2020-08-05 05:24:30+00:00,22.99545,114.2998,0.0,,6.5,202008
A0001,2020-08-05 05:29:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:34:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:39:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:53+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:45:40+00:00,22.995433,114.299766,0.0,,5.8,202008

I use the latitude and longitude columns of the CSV to produce a DBSCAN cluster plot in which each cluster has a different color.

import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd

def draw_with_dbscan(para_csv_path_name,para_csv_name,para_save_path):
    df = pd.read_csv(para_csv_path_name, encoding='utf-8', parse_dates=[1], low_memory=False)
    X = df[['latitude', 'longitude']]
    X = X.drop_duplicates()
    kms_per_rad = 6371.0088  # mean radius of the earth in km
    # eps: the maximum distance between two samples for one to be considered as in the
    # neighborhood of the other. This is not a maximum bound on the distances of points
    # within a cluster. It is the most important DBSCAN parameter to choose appropriately
    # for your data set and distance function (default=0.5).
    # The haversine metric works in radians, so 1.5 km is divided by the earth radius.
    epsilon = 1.5 / kms_per_rad
    dbsc = (DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(X)))
    fac_cluster_labels = dbsc.labels_
    # get the number of clusters
    num_clusters = len(set(dbsc.labels_))
    # turn the clusters into a pandas series,where each element is a cluster of points
    dbsc_clusters = pd.Series([X[fac_cluster_labels == n] for n in range(num_clusters)])
    # get centroid of each cluster
    fac_centroids = dbsc_clusters.map(get_centroid)
    # unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
    cent_lats, cent_lons = zip(*fac_centroids)
    # from these lats/lons create a new df of one representative point for each cluster
    centroids_pd = pd.DataFrame({'longitude': cent_lons, 'latitude': cent_lats})
    # Plot the facility clusters and cluster centroids
    fig, ax = plt.subplots(figsize=[20, 10])
    facility_scatter = ax.scatter(X['longitude'], X['latitude'], c=fac_cluster_labels,
                                  edgecolor='None', alpha=0.7, s=120)
    centroid_scatter = ax.scatter(centroids_pd['longitude'], centroids_pd['latitude'], marker='x', linewidths=2,
                                  c='k', s=50)
    ax.set_title('Facility Clusters & Facility Centroid', fontsize=30)
    ax.set_xlabel('Longitude', fontsize=24)
    ax.set_ylabel('Latitude', fontsize=24)
    ax.legend([facility_scatter, centroid_scatter], ['Facilities', 'Facility Cluster Centroid'], loc='upper right',
              fontsize=20)
    # plt.show()
    plt.savefig(para_save_path + para_csv_name.split('.')[0] + '.png')
    plt.close()


def get_centroid(cluster):
    """calculate the centroid of a cluster of geographic coordinate points
    Args:
      cluster coordinates, nx2 array-like (array, list of lists, etc)
      n is the number of points(latitude, longitude)in the cluster.
    Return:
      geometry centroid of the cluster

    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid



if __name__ == '__main__':
    csvlName=r'E:/mydata/test.csv'
    item='test.csv'
    abnormal_dbscan_device_img_dir=r'E:/result/'
    draw_with_dbscan(csvlName, item, abnormal_dbscan_device_img_dir)

The generated image looks like this:

But how can I find out, with DBSCAN, how many rows of latitude/longitude data fall into each cluster?

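【Comments】:

  • Could you provide a minimal reproducible example that matches your title, along with the desired output?
  • I have modified my code.
  • It is still not clear what your problem is, but you can try values, counts = np.unique(fac_cluster_labels,return_counts=True); {k:v for k,v in zip(values,counts)}
  • For example, a CSV has 100 rows of latitude/longitude data split into 4 clusters. The purple cluster has 10 rows, the blue cluster has 20 rows, the yellow cluster has 30 rows, and the green cluster has 40 rows. I want to know how many rows of data each cluster has.
  • Have you tried the code I suggested above?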

Tags: python scikit-learn


【Solution 1】:

You could try:

values = np.unique(fac_cluster_labels,return_counts=True)
{k:v for k,v in zip(*values)}
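
To see this end to end, here is a minimal, self-contained sketch. It assumes a small made-up set of four points (not the asker's full data) and reuses the DBSCAN parameters from the question. The helper name `count_rows_per_cluster` is hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

KMS_PER_RAD = 6371.0088  # mean earth radius in km, as in the question


def count_rows_per_cluster(points, eps_km=1.5):
    """Cluster (latitude, longitude) rows and return {cluster label: row count}."""
    model = DBSCAN(eps=eps_km / KMS_PER_RAD, min_samples=1,
                   algorithm='ball_tree', metric='haversine')
    labels = model.fit(np.radians(points)).labels_
    # np.unique returns the sorted distinct labels and how often each occurs
    values, counts = np.unique(labels, return_counts=True)
    return {int(k): int(v) for k, v in zip(values, counts)}


# Made-up sample: three points within about 100 m of each other,
# plus one point roughly 20 km away
sample = pd.DataFrame({
    'latitude': [22.994716, 22.994916, 22.99545, 23.140366],
    'longitude': [114.2998, 114.299683, 114.2998, 114.18685],
})
cluster_sizes = count_rows_per_cluster(sample[['latitude', 'longitude']])
print(cluster_sizes)
```

With `min_samples=1` every point is a core point, so no point gets the noise label `-1`. With a larger `min_samples`, the dictionary may also contain a `-1` key holding the number of noise points.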

【Discussion】:

  • Yes, that is what I want.
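
One caveat about the question's code: it calls `drop_duplicates()` before fitting, so `fac_cluster_labels` has one label per unique coordinate pair, and counting those labels gives unique points per cluster rather than CSV rows. If the original row count is what is needed, one possible approach (a sketch, not necessarily what the asker ended up using) is to merge the labels back onto the full table. The helper name `rows_per_cluster` is hypothetical, and the sample coordinates are the eleven rows shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

KMS_PER_RAD = 6371.0088  # mean earth radius in km, as in the question


def rows_per_cluster(df, eps_km=1.5):
    """Return a Series mapping cluster label -> number of original rows."""
    coords = ['latitude', 'longitude']
    unique_pts = df[coords].drop_duplicates().reset_index(drop=True)
    labels = DBSCAN(eps=eps_km / KMS_PER_RAD, min_samples=1,
                    algorithm='ball_tree', metric='haversine'
                    ).fit(np.radians(unique_pts)).labels_
    unique_pts = unique_pts.assign(cluster=labels)
    # Attach each unique point's label to every original row with the same coordinates
    labelled = df.merge(unique_pts, on=coords, how='left')
    return labelled['cluster'].value_counts().sort_index()


# Coordinates of the eleven sample rows from the question
df = pd.DataFrame({
    'latitude': [23.140366, 22.994716, 22.994716, 22.994916, 22.99545]
                + [22.995433] * 6,
    'longitude': [114.18685, 114.2998, 114.2998, 114.299683, 114.2998]
                 + [114.299766] * 6,
})
row_counts = rows_per_cluster(df)
print(row_counts)
```

On this sample the isolated first point forms one cluster and the remaining ten rows (four unique coordinate pairs) form the other.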