【Posted】:2020-10-15 07:58:06
【Question】:
The CSV data looks like this:
device_id,upload_time,latitude,longitude,mileage,other_vals,speed,upload_time_1
A0001,2020-08-05 05:10:05+00:00,23.140366,114.18685,0.0,,0,202008
A0001,2020-08-05 05:10:33+00:00,22.994716,114.2998,0.0,,0,202008
A0001,2020-08-05 05:20:55+00:00,22.994716,114.2998,0.0,,3.8,202008
A0001,2020-08-05 05:24:02+00:00,22.994916,114.299683,0.0,,2.1,202008
A0001,2020-08-05 05:24:30+00:00,22.99545,114.2998,0.0,,6.5,202008
A0001,2020-08-05 05:29:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:34:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:39:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:30+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:44:53+00:00,22.995433,114.299766,0.0,,3.4,202008
A0001,2020-08-05 05:45:40+00:00,22.995433,114.299766,0.0,,5.8,202008
I use the latitude and longitude data in the CSV to generate a DBSCAN cluster plot, with each cluster drawn in a different color.
import numpy as np
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd


def draw_with_dbscan(para_csv_path_name, para_csv_name, para_save_path):
    df = pd.read_csv(para_csv_path_name, encoding='utf-8', parse_dates=[1], low_memory=False)
    X = df[['latitude', 'longitude']]
    X = X.drop_duplicates()
    kms_per_rad = 6371.0088  # mean radius of the earth, in km
    # eps is the maximum distance between two samples for one to be considered
    # in the neighborhood of the other; it is not a bound on distances within a
    # cluster. Here 1.5 km is converted to radians for the haversine metric.
    epsilon = 1.5 / kms_per_rad
    dbsc = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
                  metric='haversine').fit(np.radians(X))
    fac_cluster_labels = dbsc.labels_
    # get the number of clusters (min_samples=1, so there is no noise label -1)
    num_clusters = len(set(dbsc.labels_))
    # turn the clusters into a pandas Series, where each element is a cluster of points
    dbsc_clusters = pd.Series([X[fac_cluster_labels == n] for n in range(num_clusters)])
    # get the centroid of each cluster
    fac_centroids = dbsc_clusters.map(get_centroid)
    # unzip the list of centroid (lat, lon) pairs into separate lat and lon lists
    cent_lats, cent_lons = zip(*fac_centroids)
    # from these lats/lons create a new df of one representative point for each cluster
    centroids_pd = pd.DataFrame({'longitude': cent_lons, 'latitude': cent_lats})
    # plot the facility clusters and cluster centroids
    fig, ax = plt.subplots(figsize=[20, 10])
    facility_scatter = ax.scatter(X['longitude'], X['latitude'], c=fac_cluster_labels,
                                  edgecolor='None', alpha=0.7, s=120)
    centroid_scatter = ax.scatter(centroids_pd['longitude'], centroids_pd['latitude'],
                                  marker='x', linewidths=2, c='k', s=50)
    ax.set_title('Facility Clusters & Facility Centroid', fontsize=30)
    ax.set_xlabel('Longitude', fontsize=24)
    ax.set_ylabel('Latitude', fontsize=24)
    ax.legend([facility_scatter, centroid_scatter],
              ['Facilities', 'Facility Cluster Centroid'], loc='upper right', fontsize=20)
    # plt.show()
    plt.savefig(para_save_path + para_csv_name.split('.')[0] + '.png')
    plt.close()


def get_centroid(cluster):
    """Calculate the centroid of a cluster of geographic coordinate points.

    Args:
        cluster: coordinates, n x 2 array-like (array, list of lists, etc.),
            where n is the number of (latitude, longitude) points in the cluster.
    Return:
        geometric centroid of the cluster
    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid


if __name__ == '__main__':
    csvlName = r'E:/mydata/test.csv'
    item = 'test.csv'
    abnormal_dbscan_device_img_dir = r'E:/result/'
    draw_with_dbscan(csvlName, item, abnormal_dbscan_device_img_dir)
But how can I find out, from DBSCAN, how many rows of latitude/longitude data fall in each cluster?
【Comments】:
-
Could you consider providing a minimal reproducible example that relates to your title, along with the desired output?
-
I have revised my code.
-
It's still not clear what your question is, but you could try
values, counts = np.unique(fac_cluster_labels, return_counts=True); {k: v for k, v in zip(values, counts)}
-
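For example, a CSV has 100 rows of latitude/longitude data, split into 4 clusters. The purple cluster has 10 rows, the blue cluster 20 rows, the yellow cluster 30 rows, and the green cluster 40 rows. I want to know how many rows of data each cluster has.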
-
Have you tried the code I suggested above?
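One caveat with the np.unique suggestion: the question's code calls drop_duplicates() before clustering, so counting labels_ gives the number of unique coordinates per cluster, not the number of CSV rows. Below is a minimal sketch of getting both counts. It assumes the same DBSCAN settings as the question; the function name count_rows_per_cluster and the eps_km parameter are hypothetical names introduced here for illustration, and it assumes the latitude/longitude columns contain no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN


def count_rows_per_cluster(df, eps_km=1.5):
    """Return (unique-point counts, original-row counts) keyed by cluster label.

    Hypothetical helper, not part of the original post.
    """
    kms_per_rad = 6371.0088  # mean radius of the earth, in km
    # Cluster only the unique coordinates, as the question's code does.
    X = df[['latitude', 'longitude']].drop_duplicates()
    labels = DBSCAN(eps=eps_km / kms_per_rad, min_samples=1,
                    algorithm='ball_tree', metric='haversine'
                    ).fit(np.radians(X)).labels_

    # Unique coordinates per cluster (the np.unique approach from the comments).
    values, counts = np.unique(labels, return_counts=True)
    unique_counts = dict(zip(values.tolist(), counts.tolist()))

    # Attach each unique point's label back onto every original row,
    # then count rows per label.
    labelled = X.assign(cluster=labels)
    merged = df.merge(labelled, on=['latitude', 'longitude'], how='left')
    row_counts = merged['cluster'].value_counts().sort_index().to_dict()
    return unique_counts, row_counts
```

Calling it as unique_counts, row_counts = count_rows_per_cluster(pd.read_csv(path)) returns two dicts. If duplicates are not dropped before fitting, the two counts coincide and the np.unique one-liner alone is enough.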
Tags: python scikit-learn