【发布时间】:2018-05-14 16:11:59
【问题描述】:
想要计算每个集群内的 Instagram/酒店比率。但结果表明我实际上将非常遥远的点聚集在一起。 DBSCAN 不太可能出现这种情况。怎么了?
程序:使用 DBSCAN 对 instagram 帖子进行聚类,然后使用 1NN 对酒店进行分类。
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier
if __name__ == '__main__':
inst = pd.read_csv('inst.csv', encoding='utf-8')
ht = pd.read_csv('ht.csv', encoding='utf-8')
inst = inst[(inst.lat >= 48.30) & (inst.lng >= -139.06) & (inst.lat <= 60.00) & (inst.lng <= -114.03)]
ht = ht[(ht.lat >= 48.30) & (ht.lng >= -139.06) & (ht.lat <= 60.00) & (ht.lng <= -114.03)]
# kmean = KMeans(n_clusters=50,n_jobs=-1)
# kmean.fit(inst[['lat', 'lng']])
#
# ht_labels = kmean.predict(ht[['lat', 'lng']])
# inst_labels = kmean.predict(inst[['lat', 'lng']])
#
# plt.scatter(ht.lng, ht.lat, c=ht_labels, alpha=0.5)
# plt.savefig('./fig/hotel_clusters.png')
# plt.clf()
#
# plt.scatter(inst.lng, inst.lat, c=inst_labels, alpha=0.5)
# plt.savefig('./fig/instagram_posts_clusters.png')
# plt.clf()
dbs = DBSCAN(eps=0.05,min_samples=10,metric='haversine', n_jobs=-1)
ht_labels = dbs.fit_predict(ht[['lat', 'lng']])
inst_labels = dbs.fit_predict(inst[['lat', 'lng']])
plt.scatter(ht.lng, ht.lat, c=ht_labels, alpha=0.5)
plt.savefig('./fig/hotel_clusters1.png')
plt.clf()
plt.scatter(inst.lng, inst.lat, c=inst_labels, alpha=0.5)
plt.savefig('./fig/instagram_posts_clusters1.png')
plt.clf()
knn = KNeighborsClassifier(n_neighbors=1, n_jobs=-1)
knn.fit(inst[['lat', 'lng']], inst_labels)
ht_labels = knn.predict(ht[['lat', 'lng']])
plt.scatter(ht.lng, ht.lat, c=ht_labels, alpha=0.5)
plt.savefig('./fig/hotel_clusters3.png')
plt.clf()
plt.scatter(inst.lng, inst.lat, c=inst_labels, alpha=0.5)
plt.savefig('./fig/instagram_posts_clusters3.png')
plt.clf()
ht = ht[['lat', 'lng']]
ht['lb'] = ht_labels
inst = inst[['lat', 'lng']]
inst['lb'] = inst_labels
ht1 = ht.groupby(['lb']).count().reset_index().set_index('lb')
inst1 = inst.groupby(['lb']).count().reset_index().set_index('lb')
print(ht1)
ratio = inst1/ht1
print(ratio)
clu = 2
plt.scatter(ht[ht.lb == clu].lng, ht[ht.lb == clu].lat, c='black')
print(len(ht[ht.lb == clu]))
plt.scatter(ht.lng, ht.lat, c=ht_labels, alpha=0.05)
plt.savefig('./fig/hotel_clusters4.png')
plt.clf()
我找不到错误,你能帮忙吗?
【问题讨论】:
-
“非常遥远的点”有多远?从给出的信息很难判断这是否是一个 epsilon 问题。
-
喜欢的点比 epsilon 多?
-
更重要的是,中间有很多点不在同一个簇中
-
你能发一张图片吗?
-
已上传,请查看
标签: python machine-learning scikit-learn cluster-analysis dbscan